## Data Loaders

This notebook demonstrates the use of the **edge loader** in `pyTigerGraph`. A data loader pulls data from the TigerGraph database. The following data loaders are currently provided:
* EdgeLoader, which returns batches of edges.
* VertexLoader, which returns batches of vertices.
* GraphLoader, which returns randomly sampled (probably disconnected) subgraphs in pandas `dataframe`, `PyG` or `DGL` format.
* NeighborLoader, which returns subgraphs using neighbor sampling in `dataframe`, `PyG` or `DGL` format.
* EdgeNeighborLoader, which returns subgraphs using neighbor sampling from edges in `dataframe`, `PyG` or `DGL` format.

Every data loader above can either get all the batches as an HTTP response (default) or stream each batch through Kafka. The HTTP mechanism is fast and works well for testing with small graphs, but it is subject to a data size limit of 2GB. For large graphs, the HTTP channel will likely fail because of the size limit and network connectivity issues. Streaming via Kafka is offered for data robustness and scalability. Kafka also excels at multi-consumer use cases, so it is efficient for model search or hyperparameter tuning when there are multiple consumers of the same data.
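As a rough illustration of the trade-off described above, the sketch below picks a channel from an estimated payload size and the number of consumers. `suggest_channel`, `HTTP_LIMIT_BYTES`, and the per-row byte estimate are hypothetical and not part of `pyTigerGraph`; only the 2GB figure comes from the text, and reading it as binary gigabytes is an assumption.

```python
# Hypothetical helper, not a pyTigerGraph API.
# The 2GB HTTP limit is taken from the text; binary GB is an assumption.
HTTP_LIMIT_BYTES = 2 * 1024**3


def suggest_channel(num_rows: int, bytes_per_row: int, num_consumers: int = 1) -> str:
    """Return "http" or "kafka" for a planned load, based on a size estimate."""
    estimated_bytes = num_rows * bytes_per_row
    # Large payloads or several consumers of the same data favor Kafka.
    if estimated_bytes >= HTTP_LIMIT_BYTES or num_consumers > 1:
        return "kafka"
    return "http"


# A small graph with a single consumer fits the default HTTP channel.
small = suggest_channel(num_rows=10_000, bytes_per_row=64)
# A very large edge set exceeds the size limit.
large = suggest_channel(num_rows=500_000_000, bytes_per_row=64)
# Several consumers, for example during hyperparameter tuning.
shared = suggest_channel(num_rows=10_000, bytes_per_row=64, num_consumers=4)
```

The thresholds here are only a starting point; network reliability, which the text also mentions, is not modeled.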

The data loaders support both homogeneous and heterogeneous graphs. By default, they load from all vertex and edge types and treat the graph as a homogeneous graph. They also allow users to specify which vertex and edge types to load from and which attributes to load from each type. In that case, users get heterogeneous graph outputs.
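The difference between a single attribute list and a per-type specification can be pictured with a small model. This is a sketch of the documented behavior, not the library's code: `resolve_attributes` and `SCHEMA` are hypothetical, the edge type `v1v2` is made up for the example, and the handling of `None` (all types, no extra attributes) is an assumption.

```python
from typing import Dict, List, Optional, Union

# Hypothetical schema: edge type -> attributes defined on that type.
SCHEMA = {
    "v0v0": ["is_train", "is_val"],
    "v2v0": ["is_train", "is_val"],
    "v1v2": ["is_train"],  # made-up type that lacks `is_val`
}

AttrSpec = Optional[Union[List[str], Dict[str, List[str]]]]


def resolve_attributes(schema: Dict[str, List[str]], attributes: AttrSpec) -> Dict[str, List[str]]:
    """Map an `attributes` argument to {edge type: attributes to load}."""
    if attributes is None:
        # Assumption: all edge types are loaded with no extra attributes.
        return {etype: [] for etype in schema}
    if isinstance(attributes, dict):
        # Dict form: only the listed edge types are selected.
        unknown = set(attributes) - set(schema)
        if unknown:
            raise ValueError(f"unknown edge types: {sorted(unknown)}")
        return {etype: list(attrs) for etype, attrs in attributes.items()}
    # List form: every attribute must exist on every edge type.
    for etype, available in schema.items():
        missing = [a for a in attributes if a not in available]
        if missing:
            raise ValueError(f"edge type {etype} lacks attributes {missing}")
    return {etype: list(attributes) for etype in schema}


homogeneous = resolve_attributes(SCHEMA, ["is_train"])
heterogeneous = resolve_attributes(SCHEMA, {"v0v0": ["is_train", "is_val"]})
```

Passing `["is_val"]` as a list would raise in this model, because the made-up type `v1v2` does not define it.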

### Connection to Database

The `TigerGraphConnection` class represents a connection to the TigerGraph database. Under the hood, it stores the necessary information to communicate with the database. It is able to perform quite a few database tasks. Please see its [documentation](https://docs.tigergraph.com/pytigergraph/current/intro/) for details.

In [1]:
import json

from pyTigerGraph import TigerGraphConnection
from pyTigerGraph.datasets import Datasets
from pyTigerGraph.visualization import drawSchema

# Read in DB configs
with open('../../config.json', "r") as config_file:
    config = json.load(config_file)

conn = TigerGraphConnection(
    host=config["host"],
    username=config["username"],
    password=config["password"]
)

# Download the Cora dataset and ingest it into the database
dataset = Datasets("Cora")
conn.ingestDataset(dataset, getToken=config["getToken"])

# Visualize the graph schema
drawSchema(conn.getSchema(force=True))

A folder with name Cora already exists in ./tmp. Skip downloading.
---- Checking database ----
A graph with name Cora already exists in the database. Skip ingestion.
Graph name is set to Cora for this connection.



### Edge Loader

`EdgeLoader` pulls batches of edges from the database. Specifically, it divides the edges into `num_batches` batches and returns each batch separately. The boolean attribute passed to `filter_by` indicates which edges are included. If you need random batches, set `shuffle` to True.
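The interplay of `filter_by`, `shuffle`, and `num_batches` can be sketched in plain Python. This is a simplified in-memory model, assuming the filter is applied before shuffling and splitting; the real loader does this work inside the database, and its batch sizes need not be as even as here. `split_edges` and its `seed` parameter are hypothetical.

```python
import random
from typing import Dict, List, Optional

Edge = Dict[str, int]


def split_edges(edges: List[Edge], num_batches: int = 1, shuffle: bool = False,
                filter_by: Optional[str] = None, seed: Optional[int] = None) -> List[List[Edge]]:
    """Toy model of edge batching: filter, optionally shuffle, then split."""
    if filter_by is not None:
        # Keep only edges whose boolean attribute is truthy.
        edges = [e for e in edges if e[filter_by]]
    else:
        edges = list(edges)
    if shuffle:
        random.Random(seed).shuffle(edges)
    # Round-robin assignment: batch sizes differ by at most one.
    return [edges[i::num_batches] for i in range(num_batches)]


edges = [{"source": i, "target": i + 1, "is_train": i % 2} for i in range(10)]
batches = split_edges(edges, num_batches=3, shuffle=True, seed=0)
train_only = split_edges(edges, num_batches=2, filter_by="is_train")
```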

**Note**: The first time you initialize the loader on a graph in TigerGraph,
the initialization might take a minute because it installs the corresponding
query in the database and optimizes it. The query only needs to be installed
once, so initializing the loader on the same TigerGraph graph again will be
quick.

There are two ways to use the data loader. See
[here](https://github.com/TigerGraph-DevLabs/mlworkbench-docs/blob/main/tutorials/basics/2_dataloaders.ipynb) for examples.
* First, it can be used as an iterable, which means you can loop through it to get every batch of data. If you load all edges at once (`num_batches=1`), there will be only one batch (of all the edges) in the iterator.
* Second, you can access the `data` property of the class directly. If there is only one batch of data to load, it will give you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it will return the loader again.
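The two access patterns above can be mimicked with a toy class. `ToyLoader` is a hypothetical stand-in that only reproduces the described behavior of iteration and of the `data` property; it is not the `pyTigerGraph` implementation.

```python
from typing import Iterator, List, Union


class ToyLoader:
    """Stand-in for a data loader; not the pyTigerGraph class."""

    def __init__(self, batches: List[list]):
        self._batches = batches

    def __iter__(self) -> Iterator[list]:
        # Usage 1: loop over the loader to receive one batch at a time.
        return iter(self._batches)

    @property
    def data(self) -> Union[list, "ToyLoader"]:
        # Usage 2: a single batch is returned directly; with several
        # batches the loader itself is returned so it can be iterated.
        if len(self._batches) == 1:
            return self._batches[0]
        return self


single = ToyLoader([[1, 2, 3]])
multi = ToyLoader([[1, 2], [3]])

all_edges = single.data            # the one batch itself
for batch in multi.data:           # `data` is the loader, so this iterates batches
    pass
```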

Args:
* attributes (list or dict, optional):
        Edge attributes to be included. If it is a list, then the attributes
        in the list will be selected from all edge types. An error is raised if
        any listed attribute doesn't exist in every edge type. If it is a dict, keys of the
        dict are edge types to be selected, and values are lists of attributes to be
        selected for each edge type. Defaults to None.
* batch_size (int, optional):  
        Number of edges in each batch.  
        Defaults to None.  
* num_batches (int, optional):  
        Number of batches to split the edges.  
        Defaults to 1.  
* shuffle (bool, optional):  
        Whether to shuffle the edges before loading data.  
        Defaults to False.  
* filter_by (str, optional):
        A boolean attribute used to indicate which edges are included. Defaults to None.
* output_format (str, optional):
        Format of the output data of the loader. Only
        "dataframe" is supported. Defaults to "dataframe".
* loader_id (str, optional):
        An identifier of the loader which can be any string. It is
        also used as the Kafka topic name. If `None`, a random string will be generated
        for it. Defaults to None.
* buffer_size (int, optional):
        Number of data batches to prefetch and store in memory. Defaults to 4.
* timeout (int, optional):
        Timeout value for GSQL queries, in ms. Defaults to 300000.
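The `buffer_size` argument describes prefetching. How the library implements it is not shown in this notebook, so the following is only a generic sketch of a bounded prefetch buffer built on the standard library. `prefetch` and `_DONE` are hypothetical names.

```python
import queue
import threading
from typing import Iterable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches: Iterable, buffer_size: int = 4) -> Iterator:
    """Yield batches while a background thread keeps up to `buffer_size` ready."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        for batch in batches:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


# Batches arrive in their original order; at most 4 wait in memory at once.
loaded = list(prefetch(range(10), buffer_size=4))
```

A larger buffer hides more fetch latency from the consumer at the cost of holding more batches in memory.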

# Testcase 1: Using edgeLoader with callback_fn to get batches of edges (homogeneous graph)
## Results: ran successfully, data loaded completely

In [3]:
def process_batch(batch):
    # Identity callback: return each batch unchanged.
    return batch


edge_loader2 = conn.gds.edgeLoader(
    num_batches=10,
    attributes=["time", "is_train"],
    shuffle=True,
    filter_by=None,
    callback_fn=process_batch,
)
for i, batch in enumerate(edge_loader2):
    print("----Batch {}----".format(i))
    print(batch.shape)
    print(batch.head())

----Batch 0----
(1185, 4)
      source     target  time  is_train
0  100663306  116391959     0         0
1  100663306  117440546     0         0
2  100663306  119537750     0         0
3  100663306  127926289     0         0
4  100663306  131072005     0         0
----Batch 1----
(1106, 4)
      source     target  time  is_train
0  100663303  109051953     0         0
1  100663305  108003377     0         0
2  100663305  113246292     0         0
3  100663306  125829166     0         0
4  100663306  126877702     0         0
----Batch 2----
(979, 4)
      source     target  time  is_train
0  100663299  116392022     0         0
1  100663305  133169167     0         0
2  100663329  121634823     0         0
3  100663337  123732012     0         0
4  100663342  133169175     0         0
----Batch 3----
(1002, 4)
      source     target  time  is_train
0  100663297  103809038     0         0
1  100663306  113246279     0         0
2  100663309  123731975     0         0
3  100663310  121
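The role of `callback_fn` in the cell above can be pictured as a function applied to every batch before it reaches the loop body. The sketch below uses plain lists of dicts instead of DataFrames so that it runs without a database. `iterate_with_callback` and `keep_train` are hypothetical, and where exactly the library applies the callback is an assumption.

```python
from typing import Callable, Iterable, Iterator, Optional


def iterate_with_callback(batches: Iterable, callback_fn: Optional[Callable] = None) -> Iterator:
    """Yield each batch, passed through `callback_fn` when one is given."""
    for batch in batches:
        yield batch if callback_fn is None else callback_fn(batch)


def keep_train(batch):
    # Example callback: keep only rows flagged as training edges.
    return [row for row in batch if row["is_train"]]


raw = [
    [{"source": 1, "target": 2, "is_train": 1},
     {"source": 1, "target": 3, "is_train": 0}],
    [{"source": 2, "target": 3, "is_train": 0}],
]
processed = list(iterate_with_callback(raw, keep_train))
unchanged = list(iterate_with_callback(raw))
```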

# Testcase 2: Using edgeLoader with callback_fn to get batches of edges (heterogeneous graph)
## Case details: first use edgeLoader without callback_fn, then use edgeLoader with the same parameters plus a callback_fn that keeps part of the data.
## Results: ran successfully, data loaded completely

Since `Cora` is a homogeneous graph, we will connect to a different graph to demonstrate the use case of heterogeneous graphs.

In [4]:
conn.graphname="hetero"

# COMMENT OUT THE LINE BELOW if you are NOT using a graph that requires token authentication
conn.getToken(conn.createSecret())

('6thp1g6mjra5a5fp8r34d6b0orthfgjj', 1674814923, '2023-01-27 10:22:03')

In [5]:
loader3 = conn.gds.edgeLoader(
    attributes={"v0v0": ["is_train", "is_val"],
                "v2v0": ["is_train", "is_val"]},
    batch_size=200,
    shuffle=False,
    filter_by=None,
)
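for i, batch in enumerate(loader3):
    print("----Batch {}----".format(i))
    # With a dict of attributes, each batch is keyed by edge type
    # (v0v0, v2v0), so `j` below is an edge type name.
    for j in batch:
        print("Vertex type:", j)
        print(batch[j].head(1))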

----Batch 0----
Vertex type: v0v0
      source     target is_train is_val
0  139460609  137363457        0      0
Vertex type: v2v0
      source     target is_train is_val
0  203423744  144703488        0      0
----Batch 1----
Vertex type: v0v0
      source     target is_train is_val
0  136314880  139460608        0      0
Vertex type: v2v0
      source     target is_train is_val
0  175112194  147849219        0      0
----Batch 2----
Vertex type: v0v0
      source     target is_train is_val
0  134217729  136314883        0      0
Vertex type: v2v0
      source     target is_train is_val
0  201326592  143654912        0      0
----Batch 3----
Vertex type: v0v0
      source     target is_train is_val
0  134217728  136314884        0      0
Vertex type: v2v0
      source     target is_train is_val
0  175112192  139460609        0      0
----Batch 4----
Vertex type: v0v0
      source     target is_train is_val
0  134217730  150994945        0      0
Vertex type: v2v0
      source     tar

In [6]:
def process_batch(batch):
    # Keep only the v0v0 edges and drop the other edge types from each batch.
    return {"v0v0": batch["v0v0"]}
loader4 = conn.gds.edgeLoader(
    attributes={"v0v0": ["is_train", "is_val"],
                "v2v0": ["is_train", "is_val"]},
    batch_size=200,
    shuffle=False,
    filter_by=None,
    callback_fn=process_batch
)
for i, batch in enumerate(loader4):
    print("----Batch {}----".format(i))
    for j in batch:
        print("Vertex type:", j)
        print(batch[j].head(1))

----Batch 0----
Vertex type: v0v0
      source     target is_train is_val
0  136314881  136314881        0      0
----Batch 1----
Vertex type: v0v0
      source     target is_train is_val
0  136314880  139460608        0      0
----Batch 2----
Vertex type: v0v0
      source     target is_train is_val
0  148897794  139460608        0      0
----Batch 3----
Vertex type: v0v0
      source     target is_train is_val
0  137363456  136314884        0      0
----Batch 4----
Vertex type: v0v0
      source     target is_train is_val
0  157286401  154140672        0      0
----Batch 5----
Vertex type: v0v0
      source     target is_train is_val
0  136314882  135266305        0      0
----Batch 6----
Vertex type: v0v0
      source     target is_train is_val
0  136314881  137363457        0      0
----Batch 7----
Vertex type: v0v0
      source     target is_train is_val
0  134217730  142606338        0      0
----Batch 8----
Vertex type: v0v0
      source     target is_train is_val
0  135266304  
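The callback used above hard-codes one edge type. A slightly more general variant, sketched here under the assumption that each heterogeneous batch behaves like a dict keyed by edge type, builds such a callback for any set of types. `select_edge_types` is a hypothetical helper, and the lists stand in for the per-type DataFrames.

```python
from typing import Callable, Dict


def select_edge_types(*edge_types: str) -> Callable[[Dict], Dict]:
    """Build a callback that keeps only the given edge types of a batch."""
    def callback(batch: Dict) -> Dict:
        # Types that are absent from a batch are skipped rather than raising.
        return {etype: batch[etype] for etype in edge_types if etype in batch}
    return callback


# Stand-in batch: edge type -> rows (the real loader yields DataFrames).
batch = {"v0v0": [(1, 2)], "v2v0": [(3, 4)]}
keep_v0v0 = select_edge_types("v0v0")
filtered = keep_v0v0(batch)
```

Such a function could then be passed as `callback_fn=select_edge_types("v0v0")` in place of `process_batch`.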

# Testcase 3: Using edgeLoader with callback_fn to load data (via Kafka)
## Results: ran successfully, data loaded completely

**Note**: The Kafka streaming function is only available in the Enterprise Edition. You need to activate the Enterprise Edition to use it.

In [7]:
conn.graphname="Cora"
# COMMENT OUT THE LINE BELOW if you are NOT using a graph that requires token authentication
conn.getToken(conn.createSecret())

('7baq2t9p5rnula7oqvjtj95hrjkf81rd', 1674814924, '2023-01-27 10:22:04')

#### Configure Kafka
Set up Kafka here. Once configured, the settings are shared with all newly created data loaders, so there is no need to set up Kafka for each loader. Please see the official [documentation](https://docs.tigergraph.com/pytigergraph/current/gds/gds#_configurekafka) for detailed settings.

In [8]:
# Replace the placeholder with the address of your Kafka broker
conn.gds.configureKafka(kafka_address="your_Kafka_address")

#### Get batches of edges

In [9]:
def process_batch(batch):
    # Identity callback: return each batch unchanged.
    return batch


edge_loader5 = conn.gds.edgeLoader(
    num_batches=10,
    attributes=["time", "is_train"],
    shuffle=True,
    filter_by=None,
    callback_fn=process_batch,
)
for i, batch in enumerate(edge_loader5):
    print("----Batch {}----".format(i))
    print(batch.shape)
    print(batch.head(1))



----Batch 0----
(1018, 4)
      source     target  time  is_train
0  100663301  125829150     0         0
----Batch 1----
(1080, 4)
      source     target  time  is_train
0  105906176  127926335     0         0
----Batch 2----
(1062, 4)
      source     target  time  is_train
0  101711874  106954776     0         0
----Batch 3----
(1028, 4)
      source     target  time  is_train
0  103809024  122683444     0         0
----Batch 4----
(1053, 4)
      source     target  time  is_train
0  100663299  114294832     0         0
----Batch 5----
(1041, 4)
      source     target  time  is_train
0  100663296  106954829     0         0
----Batch 6----
(1098, 4)
      source     target  time  is_train
0  101711874  119537750     0         0
----Batch 7----
(1072, 4)
      source     target  time  is_train
0  100663297  120586255     0         0
----Batch 8----
(1036, 4)
      source     target  time  is_train
0  105906178  110100514     0         0
----Batch 9----
(1068, 4)
      source     tar