### Connect to TigerGraph

The `TigerGraphConnection` class represents a connection to the TigerGraph database. Under the hood, it stores the necessary information to communicate with the database. It is able to perform quite a few database tasks. Please see its [documentation](https://docs.tigergraph.com/pytigergraph/current/intro/) for details.

To connect your database, modify the `config.json` file accompanying this notebook. Set the value of `getToken` based on whether token auth is enabled for your database. Token auth is always enabled for tgcloud databases. 

### Ingest Data

In [8]:
from pyTigerGraph import TigerGraphConnection
import json

# Read in DB configs
with open('../../config.json', "r") as config_file:
    config = json.load(config_file)
    
conn = TigerGraphConnection(
    host=config["host"],
    username=config["username"],
    password=config["password"]
)

In [9]:
from pyTigerGraph.datasets import Datasets

dataset = Datasets("Cora")

conn.ingestDataset(dataset, getToken=config["getToken"])

from pyTigerGraph.visualization import drawSchema

drawSchema(conn.getSchema(force=True))

A folder with name Cora already exists in ./tmp. Skip downloading.
---- Checking database ----
A graph with name Cora already exists in the database. Skip ingestion.
Graph name is set to Cora for this connection.


CytoscapeWidget(cytoscape_layout={'name': 'circle', 'animate': True, 'padding': 1}, cytoscape_style=[{'selecto…

# Testcase1: using nodepieceLoader with callback_fn to loaddata.(for homogeneous graph)  
## Results: run successfully, data loaded completely

In [11]:
import numpy as np
def process_batch(batch):
    return batch

np_loader_test01 = conn.gds.nodepieceLoader(filter_by =None,
                                     batch_size = 128,
                                     compute_anchors = True,
                                     clear_cache = True,
                                     anchor_percentage = 1,
                                     v_feats = ["y","x"], 
                                     target_vertex_types=None, 
                                     max_anchors=5,
                                     max_relational_context=5,
                                     e_types=conn.getEdgeTypes(),
                                     timeout=204_800_000,
                                     callback_fn = lambda x: process_batch(x)
                                           )
for i, batch in enumerate(np_loader_test01):
    print("----Batch {}----".format(i))
    for batch_key in batch:
        print("batch type:", batch_key)
        print("batch type dim:", batch[batch_key])

Number of Anchors: 0


Exception in thread Thread-15:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pyTigerGraph/gds/dataloaders.py", line 587, in _read_data
    data = BaseLoader._parse_data(
  File "/opt/conda/lib/python3.9/site-packages/pyTigerGraph/gds/dataloaders.py", line 965, in _parse_data
    return callback_fn(data)
  File "/opt/conda/lib/python3.9/site-packages/pyTigerGraph/gds/dataloaders.py", line 3278, in nodepiece_process
    ancs = data["closest_anchors"].apply(lambda x: processAnchors(x))
  File "/opt/conda/lib/python3.9/site-packages/pandas/core/series.py", line 4774, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py", line 1100, in apply
    return self.app

KeyboardInterrupt: 

### IMDB dataset
We train the model on the IMDB dataset from [PyG datasets](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.IMDB) with TigerGraph as the data store. The dataset contains 3 types of vertices: 4278 movies, 5257 actors, and 2081 directors; and 4 types of edges: 12828 actor to movie edges, 12828 movie to actor edges, 4278 director to movie edges, and 4278 movie to director edges. Each vertex is described by a 0/1-valued word vector indicating the absence/presence of the corresponding keywords from the plot (for movie) or from movies they participated (for actors and directors). Each movie is classified into one of three classes, action, comedy, and drama according to their genre. The goal is to predict the class of each movie in the graph.

In [4]:
from pyTigerGraph.datasets import Datasets

dataset = Datasets("imdb")

conn.ingestDataset(dataset, getToken=config["getToken"])

A folder with name imdb already exists in ./tmp. Skip downloading.
---- Checking database ----
A graph with name imdb already exists in the database. Skip ingestion.
Graph name is set to imdb for this connection.


In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd

### Visualize Schema

In [6]:
from pyTigerGraph.visualization import drawSchema

drawSchema(conn.getSchema(force=True))

CytoscapeWidget(cytoscape_layout={'name': 'circle', 'animate': True, 'padding': 1}, cytoscape_style=[{'selecto…

# Testcase2: using nodepieceLoader with callback_fn to loaddata.(for heterogeneous graph)    
## Results: run successfully, data loaded completely

In [8]:
import numpy as np

np_loader_test02 = conn.gds.nodepieceLoader(filter_by = "train_mask",
                                     batch_size = 128,
                                     compute_anchors = True,
                                     clear_cache = True,
                                     anchor_percentage = 0.1,
                                     v_feats = {"Movie": ["y", "x"], "Actor": [], "Director": []}, 
                                     target_vertex_types=["Movie"], 
                                     max_anchors=5,
                                     max_relational_context=5,
                                     e_types=conn.getEdgeTypes(),
                                     timeout=204_800_000,
                                     )
for i, batch in enumerate(np_loader_test02):
    print("----Batch {}----".format(i))
    for batch_key in batch:
        print("batch type:", batch_key)
        print("batch type dim:", batch[batch_key])

Number of Anchors: 1161
----Batch 0----
batch type: Movie
batch type dim:           vid   relational_context  y  \
0     1048608  [12, 12, 12, 13, 0]  0   
1     1048644  [12, 12, 12, 13, 0]  2   
2     1048680  [12, 12, 12, 13, 0]  1   
3     1048688  [12, 12, 12, 13, 0]  1   
4     1048696  [12, 12, 12, 13, 0]  2   
..        ...                  ... ..   
100  19923068  [12, 12, 12, 13, 0]  2   
101  25165828  [12, 12, 12, 13, 0]  1   
102  27263060  [12, 12, 12, 13, 0]  2   
103  27263072  [12, 12, 12, 13, 0]  1   
104  27263116  [12, 12, 12, 13, 0]  1   

                                                     x  \
0    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...   
1    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...   
2    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...   
3    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...   
4    0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ...   
..                                                 ...   
100  0 0 0 0 0 0 0 0 0 0 0 

In [9]:
import numpy as np
def process_batch(batch):
    x = {"relational_context": torch.tensor(batch["Movie"]["relational_context"], dtype=torch.long), 
         "anchors": torch.tensor(batch["Movie"]["anchors"], dtype=torch.long), 
         "distance": torch.tensor(batch["Movie"]["anchor_distances"], dtype=torch.long),
         "feats": torch.tensor(np.stack(batch["Movie"]["x"].apply(lambda x: np.fromstring(x, sep=" ")).values), dtype=torch.float),
         "y": torch.tensor(batch["Movie"]["y"].astype(int))}
    return x

np_loader_test03 = conn.gds.nodepieceLoader(filter_by = "train_mask",
                                     batch_size = 128,
                                     compute_anchors = True,
                                     clear_cache = True,
                                     anchor_percentage = 0.1,
                                     v_feats = {"Movie": ["y", "x"], "Actor": [], "Director": []}, 
                                     target_vertex_types=["Movie"], 
                                     max_anchors=5,
                                     max_relational_context=5,
                                     e_types=conn.getEdgeTypes(),
                                     timeout=204_800_000,
                                     callback_fn = lambda x: process_batch(x))
for i, batch in enumerate(np_loader_test03):
    print("----Batch {}----".format(i))
    for batch_key in batch:
        print("batch type:", batch_key)
        print("batch type dim:", batch[batch_key].size())
        print("sample lastone in batch:{}\n".format(batch[batch_key][-1]))

Number of Anchors: 1161
----Batch 0----
batch type: relational_context
batch type dim: torch.Size([105, 5])
sample lastone in batch:tensor([12, 12, 12, 13,  0])

batch type: anchors
batch type dim: torch.Size([105, 5])
sample lastone in batch:tensor([742, 695, 695, 695, 742])

batch type: distance
batch type dim: torch.Size([105, 5])
sample lastone in batch:tensor([7, 7, 9, 9, 9])

batch type: feats
batch type dim: torch.Size([105, 3066])
sample lastone in batch:tensor([0., 0., 0.,  ..., 0., 0., 0.])

batch type: y
batch type dim: torch.Size([105])
sample lastone in batch:1

----Batch 1----
batch type: relational_context
batch type dim: torch.Size([110, 5])
sample lastone in batch:tensor([12, 12, 12, 13,  0])

batch type: anchors
batch type dim: torch.Size([110, 5])
sample lastone in batch:tensor([ 337,   34,  337, 1132,  337])

batch type: distance
batch type dim: torch.Size([110, 5])
sample lastone in batch:tensor([5, 7, 7, 7, 7])

batch type: feats
batch type dim: torch.Size([110, 3

# Testcase3: using callback_fn in nodepieceLoader to create both train and valid data, then train a model.  
## Results:run successfully

## NodePiece Algorithm <a name="nodepiece_algorithm"></a>

The [NodePiece algorithm](https://arxiv.org/abs/2106.12144) was introduced as a way to both conserve the memory cost of vertex embeddings, as well as be able to generalize to unseen vertices during the testing process. This makes NodePiece a much more scalable approach for large, real-world graphs compared to other transductive techniques such as FastRP or Node2Vec. For more information about the algorithm, check out the author's [Medium post](https://towardsdatascience.com/nodepiece-tokenizing-knowledge-graphs-6dd2b91847aa).

We implement the NodePiece dataloader, which will allow us to iterate through batches of vertices. We take advantage of the callback functionality to process the batch into PyTorch tensors for less data manipulation in the training loop.

## Train on Vertex Samples <a name="train_vertex"></a>
We train the model on batches of vertices. We utilize both the trainable embeddings provided by NodePiece, as well as the `x` feature vector stored as an attribute on all Movie vertices.

### Construct model and optimizer

In [10]:
class BaseNodePiece(nn.Module):
    def __init__(self, 
                 vocab_size:int,
                 sequence_length:int,
                 embedding_dim:int=768):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.sequence_length = sequence_length
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        torch.nn.init.xavier_uniform_(self.embedding.weight)

    def forward(self, x):
        anc_emb = self.embedding(x["anchors"])
        rel_emb = self.embedding(x["relational_context"])
        anc_emb += self.embedding(x["distance"])
        out = torch.concat([anc_emb, rel_emb], dim=1)
        return out

In [11]:
class MLP(nn.Module):
    def __init__(self,
                 embedding_model:BaseNodePiece,
                 out_dim:int=2,
                 num_hidden_layers:int=2,
                 hidden_dim:int=128):
        super().__init__()
        self.out_dim = out_dim
        self.num_layers = num_hidden_layers + 2
        self.hidden_layers = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(num_hidden_layers)])
        self.out_layer = nn.Linear(hidden_dim, out_dim)
        self.in_layer = nn.Linear((embedding_model.embedding_dim*embedding_model.sequence_length)+3066, hidden_dim)
        self.emb_model = embedding_model

    def forward(self, x):
        feats = x["feats"]
        x = self.emb_model(x)
        x = torch.flatten(x, start_dim=1)
        x = torch.cat((x, feats), dim=1)
        x = self.in_layer(x)
        for layer in self.hidden_layers:
            x = F.dropout(F.relu(layer(x)), p=0.6)
        x = self.out_layer(x)
        x = F.log_softmax(x, dim=1)
        return x

In [12]:
def process_batch(batch):
    x = {"relational_context": torch.tensor(batch["Movie"]["relational_context"], dtype=torch.long), 
         "anchors": torch.tensor(batch["Movie"]["anchors"], dtype=torch.long), 
         "distance": torch.tensor(batch["Movie"]["anchor_distances"], dtype=torch.long),
         "feats": torch.tensor(np.stack(batch["Movie"]["x"].apply(lambda x: np.fromstring(x, sep=" ")).values), dtype=torch.float),
         "y": torch.tensor(batch["Movie"]["y"].astype(int))}
    return x

In [13]:
np_loader = conn.gds.nodepieceLoader(filter_by = "train_mask",
                                     batch_size = 128,
                                     compute_anchors = True,
                                     clear_cache = True,
                                     anchor_percentage = 0.1,
                                     v_feats = {"Movie": ["y", "x"], "Actor": [], "Director": []}, 
                                     target_vertex_types=["Movie"], 
                                     max_anchors=5,
                                     max_relational_context=5,
                                     e_types=conn.getEdgeTypes(),
                                     timeout=204_800_000,
                                     callback_fn = lambda x: process_batch(x))

Number of Anchors: 1161


In [14]:
emb_model = BaseNodePiece(vocab_size=np_loader.num_tokens, # add in special tokens
                 sequence_length=np_loader._payload["max_rel_context"] + np_loader._payload["max_anchors"],
                 embedding_dim=128)

model = MLP(emb_model, out_dim=3, num_hidden_layers=2, hidden_dim=128)

loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=5e-5)

In [15]:
np_loader.saveTokens("./npAncs.pkl")

In [16]:
valid_loader = conn.gds.nodepieceLoader(anchor_cache_attr="anchors", 
                                        filter_by = "val_mask",
                                        batch_size = 8192,
                                        v_feats = {"Movie": ["y", "x"], "Actor": [], "Director": []}, 
                                        target_vertex_types=["Movie"], 
                                        compute_anchors=False,
                                        max_anchors=5,
                                        max_relational_context=5,
                                        use_cache = True,
                                        e_types=conn.getEdgeTypes(),
                                        timeout=204_800_000,
                                        tokenMap="./npAncs.pkl",
                                        callback_fn = lambda x: process_batch(x))

### Train the model

In [17]:
import time
import numpy as np
from pyTigerGraph.gds.metrics import Accuracy


for i in range(10):
    acc = Accuracy()
    epoch_loss = 0
    start = time.time()
    for batch in np_loader:
        labels = batch["y"]
        out = model(batch)
        loss_val = loss(out, labels)
        acc.update(out.argmax(dim=1), labels)
        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
        epoch_loss += loss_val.item()
    end = time.time()
    val_acc = Accuracy()
    val_epoch_loss = 0
    for val_batch in valid_loader:
        with torch.no_grad():
            labels = val_batch["y"]
            out = model(val_batch)
            loss_val = loss(out, labels)
            val_acc.update(out.argmax(dim=1), labels)
            val_epoch_loss += loss_val.item()
    print("EPOCH {}: {}".format(i, epoch_loss/np_loader.num_batches), 
          "Train Accuracy:", acc.value, 
          "Time:", end-start, 
          "Valid Loss: {}".format(val_epoch_loss/valid_loader.num_batches), 
          "Valid Accuracy:", val_acc.value)

EPOCH 0: 1.095803141593933 Train Accuracy: 0.3975 Time: 0.9867422580718994 Valid Loss: 1.06856107711792 Valid Accuracy: 0.3775
EPOCH 1: 1.077799916267395 Train Accuracy: 0.375 Time: 1.099294900894165 Valid Loss: 1.0795835256576538 Valid Accuracy: 0.4125
EPOCH 2: 1.021700456738472 Train Accuracy: 0.4925 Time: 1.1532020568847656 Valid Loss: 1.0630719661712646 Valid Accuracy: 0.4575
EPOCH 3: 0.7498434334993362 Train Accuracy: 0.7275 Time: 1.29939866065979 Valid Loss: 1.2397489547729492 Valid Accuracy: 0.475
EPOCH 4: 0.3719747066497803 Train Accuracy: 0.9 Time: 1.25838303565979 Valid Loss: 2.004598379135132 Valid Accuracy: 0.45
EPOCH 5: 0.11212521605193615 Train Accuracy: 0.9725 Time: 1.348665475845337 Valid Loss: 3.170536756515503 Valid Accuracy: 0.4625
EPOCH 6: 0.06731609907001257 Train Accuracy: 0.98 Time: 1.4826509952545166 Valid Loss: 3.757309675216675 Valid Accuracy: 0.4625
EPOCH 7: 0.030457580054644495 Train Accuracy: 0.9925 Time: 1.5900390148162842 Valid Loss: 4.077134609222412 Val

# Testcase4: using nodepieceLoader with callback_fn to loaddata(via Kafka).  
## Results: run successfully, data loaded completely

In [18]:
import numpy as np
def process_batch(batch):
    x = {"relational_context": torch.tensor(batch["Movie"]["relational_context"], dtype=torch.long), 
         "anchors": torch.tensor(batch["Movie"]["anchors"], dtype=torch.long), 
         "distance": torch.tensor(batch["Movie"]["anchor_distances"], dtype=torch.long),
         "feats": torch.tensor(np.stack(batch["Movie"]["x"].apply(lambda x: np.fromstring(x, sep=" ")).values), dtype=torch.float),
         "y": torch.tensor(batch["Movie"]["y"].astype(int))}
    return x

conn.gds.configureKafka(kafka_address ="kaf.ml.tgcloud.io:19092")
np_loader_test05 = conn.gds.nodepieceLoader(filter_by = "train_mask",
                                     batch_size = 128,
                                     compute_anchors = True,
                                     clear_cache = True,
                                     anchor_percentage = 0.1,
                                     v_feats = {"Movie": ["y", "x"], "Actor": [], "Director": []}, 
                                     target_vertex_types=["Movie"], 
                                     max_anchors=5,
                                     max_relational_context=5,
                                     e_types=conn.getEdgeTypes(),
                                     timeout=204_800_000,
                                     callback_fn = lambda x: process_batch(x))
for i, batch in enumerate(np_loader_test05):
    print("----Batch {}----".format(i))
    for batch_key in batch:
        print("batch type:", batch_key)
        print("batch type dim:", batch[batch_key].size())
        print("sample lastone in batch:{}\n".format(batch[batch_key][-1]))



Number of Anchors: 1161
----Batch 0----
batch type: relational_context
batch type dim: torch.Size([105, 5])
sample lastone in batch:tensor([12, 12, 12, 13,  0])

batch type: anchors
batch type dim: torch.Size([105, 5])
sample lastone in batch:tensor([466, 144, 144, 144, 466])

batch type: distance
batch type dim: torch.Size([105, 5])
sample lastone in batch:tensor([7, 7, 9, 9, 9])

batch type: feats
batch type dim: torch.Size([105, 3066])
sample lastone in batch:tensor([0., 0., 0.,  ..., 0., 0., 0.])

batch type: y
batch type dim: torch.Size([105])
sample lastone in batch:1

----Batch 1----
batch type: relational_context
batch type dim: torch.Size([110, 5])
sample lastone in batch:tensor([12, 12, 12, 13,  0])

batch type: anchors
batch type dim: torch.Size([110, 5])
sample lastone in batch:tensor([ 492,  815, 1101,  181,  150])

batch type: distance
batch type dim: torch.Size([110, 5])
sample lastone in batch:tensor([7, 7, 7, 7, 7])

batch type: feats
batch type dim: torch.Size([110, 3