# GraphSAGE
This notebook demonstrates how to train [GraphSAGE models](https://arxiv.org/abs/1706.02216) with TigerGraph, using [DGL](https://www.dgl.ai/)'s implementation of GraphSAGE. We train the model on the Cora dataset from [PyG datasets](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.Planetoid), with TigerGraph as the data store. The dataset contains 2708 machine learning papers and 10556 citation links between them. Each paper is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary of 1433 unique words, and is classified into one of seven classes based on its topic. The goal is to predict the class of each vertex in the graph.

## Table of Contents
* [Data Processing](#data_processing)  
* [Train on whole graph](#train_whole)  
* [Train on neighborhood subgraphs](#train_subgraph)  

## Data Processing <a name="data_processing"></a>

Here we assume the dataset has already been ingested into the TigerGraph database. If not, please refer to the example on data ingestion first. Because the dataset already comes with a split of vertices into train/validation/test sets, we don't need to create one, but we still include the code below for general use cases.

### Connect to TigerGraph

In [1]:
from tgml.data import TigerGraph

tgraph = TigerGraph(
    host="http://35.230.92.92",
    graph="Cora",
    username="tigergraph",
    password="tigergraphml",
)

In [2]:
tgraph.info()

Using graph 'Cora'
---- Graph Cora
Vertex Types: 
  - VERTEX Paper(PRIMARY_ID id INT, x LIST<INT>, y INT, train_mask BOOL, val_mask BOOL, test_mask BOOL, tmp_id INT, tmp_id2 INT, tmp_id3 INT) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
Edge Types: 
  - DIRECTED EDGE Cite(FROM Paper, TO Paper)

Graphs: 
  - Graph Cora(Paper:v, Cite:e)
Jobs: 
Queries: 
  - export_edge(string output_path) (installed v2)
  - export_edge_batch(string output_path, int batch_id, int num_batches) (installed v2)
  - export_vertex_(string output_path) (installed v2)
  - export_vertex_batch_x_y(string output_path, int batch_id, int num_batches) (installed v2)
  - export_vertex_train_mask_val_mask_test_mask(string output_path) (installed v2)
  - export_vertex_x_y(string output_path) (installed v2)
  - export_vertex_x_y_train_mask_val_mask_test_mask(string output_path) (installed v2)
  - get_vertex_number(string v_type, string filter_by) (installed v2)
  - shuffle_vertices(string tmp_id) (ins

In [3]:
tgraph.number_of_vertices()

2708

In [4]:
tgraph.number_of_edges()

10556

### Train/validation/test split

In [5]:
# The code in this cell is commented out because there is no need to split the vertices into training/validation/test sets, as the split is already done in the original dataset. See notebook 1_data_processing for examples on the split function.

# from tgml.utils import split_vertices
# split_vertices(tgraph, train_mask=0.8, val_mask=0.1, test_mask=0.1)
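
For graphs that do not ship with a split, the idea behind a random 80/10/10 split can be sketched in plain Python. This is only an illustration of the concept and not how `split_vertices` is implemented; the helper name `random_split_masks` and its parameters are hypothetical.

```python
import random


def random_split_masks(num_vertices, train_frac=0.8, val_frac=0.1, seed=0):
    """Assign every vertex to exactly one of train/validation/test at random.

    Hypothetical helper for illustration; not part of tgml.
    """
    rng = random.Random(seed)
    order = list(range(num_vertices))
    rng.shuffle(order)

    n_train = int(train_frac * num_vertices)
    n_val = int(val_frac * num_vertices)

    train_mask = [False] * num_vertices
    val_mask = [False] * num_vertices
    test_mask = [False] * num_vertices
    for rank, v in enumerate(order):
        if rank < n_train:
            train_mask[v] = True
        elif rank < n_train + n_val:
            val_mask[v] = True
        else:
            # Whatever is left after train and validation goes to test
            test_mask[v] = True
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = random_split_masks(2708)
print(sum(train_mask), sum(val_mask), sum(test_mask))
```

Each vertex ends up with exactly one mask set to true, which is the property the `train_mask`, `val_mask` and `test_mask` attributes on the `Paper` vertices rely on.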

In [6]:
print(
    "Number of vertices in training set:",
    tgraph.number_of_vertices(filter_by="train_mask"),
)
print(
    "Number of vertices in validation set:",
    tgraph.number_of_vertices(filter_by="val_mask"),
)
print(
    "Number of vertices in test set:", tgraph.number_of_vertices(filter_by="test_mask")
)

Number of vertices in training set: 2183
Number of vertices in validation set: 273
Number of vertices in test set: 252


## Train on whole graph <a name="train_whole"></a>
We first train the model on the whole graph. This works when the graph is small, but it becomes inefficient, or even impossible, when the graph is too large to fit in memory. Hyperparameters for the model and training environment are defined below.

In [7]:
# Hyperparameters
hp = {"hidden_dim": 64, 
      "num_layers": 2, 
      "dropout": 0.6, 
      "lr": 0.01, 
      "l2_penalty": 5e-4}

### Construct graph loader

The `GraphLoader` fetches the whole graph from the database all at once. (See the tutorial on dataloaders for details.)

In [8]:
from tgml.dataloaders import GraphLoader

graph_loader = GraphLoader(
    graph=tgraph,
    v_in_feats="x",
    v_out_labels="y:int",
    v_extra_feats="train_mask:bool,val_mask:bool,test_mask:bool",
    output_format="DGL",
)

In [9]:
# Get the whole graph from the loader in DGL format
data = graph_loader.data

data

Using backend: pytorch


Graph(num_nodes=2708, num_edges=10556,
      ndata_schemes={'feat': Scheme(shape=(1433,), dtype=torch.float32), 'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool)}
      edata_schemes={})

### Construct model and optimizer

We build a GraphSAGE model with 2 convolutional layers, and use the Adam optimizer with a learning rate of 0.01.

In [10]:
import dgl.function as fn
import dgl.nn.pytorch as dglnn
import torch
import torch.nn as nn
import torch.nn.functional as F

In [11]:
class GraphSage(nn.Module):
    def __init__(self):
        super(GraphSage, self).__init__()
        self.layer1 = dglnn.conv.SAGEConv(1433, hp['hidden_dim'], aggregator_type="mean")
        self.layer2 = dglnn.conv.SAGEConv(hp['hidden_dim'], 7, aggregator_type="mean")

    def forward(self, g, features):
        x = F.relu(self.layer1(g, features))
        x = self.layer2(g, x)
        return x

model = GraphSage()
print(model)

GraphSage(
  (layer1): SAGEConv(
    (feat_drop): Dropout(p=0.0, inplace=False)
    (fc_self): Linear(in_features=1433, out_features=64, bias=False)
    (fc_neigh): Linear(in_features=1433, out_features=64, bias=False)
  )
  (layer2): SAGEConv(
    (feat_drop): Dropout(p=0.0, inplace=False)
    (fc_self): Linear(in_features=64, out_features=7, bias=False)
    (fc_neigh): Linear(in_features=64, out_features=7, bias=False)
  )
)
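
The printed model shows two linear maps per layer, `fc_self` and `fc_neigh`. With the mean aggregator, a layer combines each vertex's own features with the average of its in-neighbors' features. The sketch below spells that out in plain Python for a tiny graph. It leaves out bias, dropout and the ReLU, and it assumes a vertex without in-neighbors receives a zero neighbor vector; the function name `sage_mean_layer` and the toy data are made up for illustration.

```python
def sage_mean_layer(features, in_neighbors, w_self, w_neigh):
    """One mean-aggregator GraphSAGE layer (no bias, dropout, or activation).

    features:        per-vertex feature vectors (lists of floats)
    in_neighbors:    in_neighbors[v] lists the vertices with an edge into v
    w_self, w_neigh: weight matrices as lists of rows, shape (out_dim, in_dim)
    """

    def matvec(w, x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

    in_dim = len(features[0])
    out = []
    for v, h_v in enumerate(features):
        nbrs = in_neighbors[v]
        if nbrs:
            # Element-wise mean of the neighbors' feature vectors
            mean = [sum(features[u][k] for u in nbrs) / len(nbrs) for k in range(in_dim)]
        else:
            # Assumption: no in-neighbors means a zero neighbor vector
            mean = [0.0] * in_dim
        out.append([a + b for a, b in zip(matvec(w_self, h_v), matvec(w_neigh, mean))])
    return out


# Tiny example: 3 vertices, 2 input features, 1 output feature.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
in_neighbors = [[1, 2], [0], []]
w_self = [[1.0, 1.0]]
w_neigh = [[0.5, -0.5]]
result = sage_mean_layer(features, in_neighbors, w_self, w_neigh)
print(result)  # [[0.75], [1.5], [2.0]]
```

In the notebook's model, the first layer maps the 1433 word features to `hidden_dim` values and the second maps those to 7 class scores, so two stacked layers let each prediction depend on a vertex's 2-hop neighborhood.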


In [12]:
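device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the model to the same device the graph is moved to in the training loop
model = model.to(device)

optimizer = torch.optim.Adam(
    model.parameters(), lr=hp["lr"], weight_decay=hp["l2_penalty"]
)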

In [14]:
from datetime import datetime

from tgml.metrics import Accuracy
from torch.utils.tensorboard import SummaryWriter

In [15]:
log_dir = "logs/cora/graphsage/wholegraph/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tb_log = SummaryWriter(log_dir)
logs = {}
data = data.to(device)
for epoch in range(20):
    # Train
    model.train()
    acc = Accuracy()
    # Forward pass
    out = model(data, data.ndata["feat"])
    # Calculate loss
    loss = F.cross_entropy(out[data.ndata["train_mask"]], data.ndata["label"][data.ndata["train_mask"]])
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Evaluate
    val_acc = Accuracy()
    with torch.no_grad():
        pred = out.argmax(dim=1)
        acc.update(pred[data.ndata["train_mask"]], data.ndata["label"][data.ndata["train_mask"]])
        valid_loss = F.cross_entropy(out[data.ndata["val_mask"]], data.ndata["label"][data.ndata["val_mask"]])
        val_acc.update(pred[data.ndata["val_mask"]], data.ndata["label"][data.ndata["val_mask"]])
    # Logging
    logs["loss"] = loss.item()
    logs["val_loss"] = valid_loss.item()
    logs["acc"] = acc.value
    logs["val_acc"] = val_acc.value
    print(
        "Epoch: {:02d}, Train Loss: {:.4f}, Valid Loss: {:.4f}, Train Accuracy: {:.4f}, Valid Accuracy: {:.4f}".format(
            epoch, logs["loss"], logs["val_loss"], logs["acc"], logs["val_acc"]
        )
    )
    tb_log.add_scalars(
        "Loss", {"Train": logs["loss"], "Validation": logs["val_loss"]}, epoch
    )
    tb_log.add_scalars(
        "Accuracy", {"Train": logs["acc"], "Validation": logs["val_acc"]}, epoch
    )
    tb_log.flush()

Epoch: 00, Train Loss: 1.9294, Valid Loss: 1.8947, Train Accuracy: 0.2071, Valid Accuracy: 0.2308
Epoch: 01, Train Loss: 1.2665, Valid Loss: 1.3486, Train Accuracy: 0.4622, Valid Accuracy: 0.4286
Epoch: 02, Train Loss: 0.7014, Valid Loss: 0.8547, Train Accuracy: 0.8475, Valid Accuracy: 0.7912
Epoch: 03, Train Loss: 0.4334, Valid Loss: 0.6164, Train Accuracy: 0.9116, Valid Accuracy: 0.8498
Epoch: 04, Train Loss: 0.2869, Valid Loss: 0.4978, Train Accuracy: 0.9226, Valid Accuracy: 0.8425
Epoch: 05, Train Loss: 0.1998, Valid Loss: 0.4245, Train Accuracy: 0.9441, Valid Accuracy: 0.8681
Epoch: 06, Train Loss: 0.1556, Valid Loss: 0.3970, Train Accuracy: 0.9501, Valid Accuracy: 0.8791
Epoch: 07, Train Loss: 0.1255, Valid Loss: 0.3935, Train Accuracy: 0.9620, Valid Accuracy: 0.8755
Epoch: 08, Train Loss: 0.0993, Valid Loss: 0.3972, Train Accuracy: 0.9734, Valid Accuracy: 0.8828
Epoch: 09, Train Loss: 0.0766, Valid Loss: 0.4012, Train Accuracy: 0.9794, Valid Accuracy: 0.8864
Epoch: 10, Train Los
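
The loop above runs the model on every vertex but computes the training loss only on the vertices selected by `train_mask`, and the validation loss only on those selected by `val_mask`. A minimal plain-Python sketch of such a masked cross-entropy is shown here to make the masking explicit. The function `masked_cross_entropy` and the toy logits are illustrative and are not taken from the notebook.

```python
import math


def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over the vertices where mask is True.

    Assumes at least one entry of mask is True.
    """
    total, count = 0.0, 0
    for row, label, keep in zip(logits, labels, mask):
        if not keep:
            continue  # masked-out vertices contribute nothing to the loss
        m = max(row)
        # Numerically stable log(sum(exp(row)))
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count


logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
labels = [0, 1, 0]
train_mask = [True, True, False]
loss = masked_cross_entropy(logits, labels, train_mask)
print(round(loss, 4))
```

The third vertex is badly misclassified, but because its mask entry is false it does not affect the loss.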

In [16]:
model.eval()
acc = Accuracy()
with torch.no_grad():
    pred = model(data, data.ndata["feat"]).argmax(dim=1)
    acc.update(pred[data.ndata["test_mask"]], data.ndata["label"][data.ndata["test_mask"]])
print("Accuracy: {:.4f}".format(acc.value))

Accuracy: 0.8810


## Train on neighborhood subgraphs <a name="train_subgraph"></a>
Alternatively, we can train the model on neighborhood subgraphs. Each subgraph contains the 2-hop neighborhood of a batch of seed vertices. This approach lets us train on graphs much larger than the Cora dataset, because the whole graph is never loaded into memory at once.

We use the same model hyperparameters as before, but load subgraphs with the `NeighborLoader`. Once we finish iterating over all the subgraphs generated by the loader, every vertex in the graph is guaranteed to have been covered (except for those filtered out by a user-provided mask).
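
To make the idea of a sampled multi-hop neighborhood concrete, here is a plain-Python sketch that starts from a set of seed vertices and keeps at most `num_neighbors` neighbors per vertex for `num_hops` hops. It illustrates the concept only and makes no claim about how `NeighborLoader` samples internally; `sample_neighborhood` and the toy adjacency list are hypothetical.

```python
import random


def sample_neighborhood(adjacency, seeds, num_neighbors, num_hops, seed=0):
    """Return the vertices reached from `seeds` when at most `num_neighbors`
    neighbors are sampled per vertex for `num_hops` hops."""
    rng = random.Random(seed)
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for v in frontier:
            nbrs = adjacency.get(v, [])
            # Keep all neighbors if there are few, otherwise sample without replacement
            picked = nbrs if len(nbrs) <= num_neighbors else rng.sample(nbrs, num_neighbors)
            for u in picked:
                if u not in visited:
                    visited.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return visited


# Path graph 0-1-2-3-4: the 2-hop neighborhood of vertex 0 is {0, 1, 2}.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
subgraph_vertices = sample_neighborhood(adjacency, seeds=[0], num_neighbors=10, num_hops=2)
print(sorted(subgraph_vertices))  # [0, 1, 2]
```

Capping the number of neighbors per hop bounds the size of each subgraph, which is what keeps memory use predictable on large graphs.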

In [17]:
# Hyperparameters
hp = {"batch_size": 64, 
      "num_neighbors": 10, 
      "num_hops": 2, 
      "hidden_dim": 64, 
      "num_layers": 2, 
      "dropout": 0.6, 
      "lr": 0.01, 
      "l2_penalty": 5e-4}

### Construct neighborhood subgraph loader

In [18]:
from tgml.dataloaders import NeighborLoader

Here we construct 3 subgraph loaders. The `train_loader` only uses vertices in the training set as seeds, the `valid_loader` only uses vertices in the validation set, and the `test_loader` only uses vertices in the test set.
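
As a rough guide to how many batches each loader yields per epoch, the arithmetic below uses the set sizes printed earlier in this notebook. It assumes that `batch_size` counts seed vertices per batch and that the last batch may be smaller. That is an assumption about the loader, not something this notebook states.

```python
import math


def num_batches(num_seeds, batch_size):
    """Batches needed to visit every seed once, with a smaller final batch."""
    return math.ceil(num_seeds / batch_size)


batch_size = 64
counts = {
    name: num_batches(num_seeds, batch_size)
    for name, num_seeds in [("train", 2183), ("validation", 273), ("test", 252)]
}
print(counts)  # {'train': 35, 'validation': 5, 'test': 4}
```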

In [19]:
train_loader = NeighborLoader(
    graph=tgraph,
    tmp_id="tmp_id",
    v_in_feats="x",
    v_out_labels="y:int",
    v_extra_feats="train_mask:bool,val_mask:bool,test_mask:bool",
    output_format="DGL",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=True,
    filter_by="train_mask",
    add_self_loop=True
)

In [20]:
valid_loader = NeighborLoader(
    graph=tgraph,
    tmp_id="tmp_id2",
    v_in_feats="x",
    v_out_labels="y:int",
    v_extra_feats="train_mask:bool,val_mask:bool,test_mask:bool",
    output_format="DGL",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=False,
    filter_by="val_mask",
    add_self_loop=True
)

In [21]:
test_loader = NeighborLoader(
    graph=tgraph,
    tmp_id="tmp_id3",
    v_in_feats="x",
    v_out_labels="y:int",
    v_extra_feats="train_mask:bool,val_mask:bool,test_mask:bool",
    output_format="DGL",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=False,
    filter_by="test_mask",
    add_self_loop=True
)

### Construct model and optimizer
We build a GraphSAGE model with 2 convolutional layers, and use the Adam optimizer with a learning rate of 0.01.

In [23]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = GraphSage().to(device)

optimizer = torch.optim.Adam(
    model.parameters(), lr=hp["lr"], weight_decay=hp["l2_penalty"]
)

### Train the model

In [24]:
from datetime import datetime

from tgml.metrics import Accumulator, Accuracy
from torch.utils.tensorboard import SummaryWriter

In [25]:
log_dir = "logs/cora/graphsage/subgraph/" + datetime.now().strftime("%Y%m%d-%H%M%S")
train_log = SummaryWriter(log_dir+"/train")
valid_log = SummaryWriter(log_dir+"/valid")
global_steps = 0
logs = {}
for epoch in range(10):
    # Train
    model.train()
    epoch_train_loss = Accumulator()
    epoch_train_acc = Accuracy()
    for bid, batch in enumerate(train_loader):
        batchsize = batch.num_nodes()
        # DGLGraph.to() returns a new graph, so keep the result
        batch = batch.to(device)
        # Forward pass
        out = model(batch, batch.ndata["feat"])
        # Calculate loss
        loss = F.cross_entropy(out[batch.ndata["train_mask"]], batch.ndata["label"][batch.ndata["train_mask"]])
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_train_loss.update(loss.item() * batchsize, batchsize)
        # Predict on training data
        with torch.no_grad():
            pred = out.argmax(dim=1)
            epoch_train_acc.update(pred[batch.ndata["train_mask"]], batch.ndata["label"][batch.ndata["train_mask"]])
        # Log training status after each batch
        logs["loss"] = epoch_train_loss.mean
        logs["acc"] = epoch_train_acc.value
        print(
            "Epoch {}, Train Batch {}, Loss {:.4f}, Accuracy {:.4f}".format(
                epoch, bid, logs["loss"], logs["acc"]
            )
        )
        train_log.add_scalar("Loss", logs["loss"], global_steps)
        train_log.add_scalar("Accuracy", logs["acc"], global_steps)
        train_log.flush()
        global_steps += 1
    # Evaluate
    model.eval()
    epoch_val_loss = Accumulator()
    epoch_val_acc = Accuracy()
    for batch in valid_loader:
        batchsize = batch.num_nodes()
        # DGLGraph.to() returns a new graph, so keep the result
        batch = batch.to(device)
        with torch.no_grad():
            # Forward pass
            out = model(batch, batch.ndata["feat"])
            # Calculate loss
            valid_loss = F.cross_entropy(out[batch.ndata["val_mask"]], batch.ndata["label"][batch.ndata["val_mask"]])
            epoch_val_loss.update(valid_loss.item() * batchsize, batchsize)
            # Prediction
            pred = out.argmax(dim=1)
            epoch_val_acc.update(pred[batch.ndata["val_mask"]], batch.ndata["label"][batch.ndata["val_mask"]])
    # Log validation result after each epoch
    logs["val_loss"] = epoch_val_loss.mean
    logs["val_acc"] = epoch_val_acc.value
    print(
        "Epoch {}, Valid Loss {:.4f}, Valid Accuracy {:.4f}".format(
            epoch, logs["val_loss"], logs["val_acc"]
        )
    )
    valid_log.add_scalar("Loss", logs["val_loss"], global_steps)
    valid_log.add_scalar("Accuracy", logs["val_acc"], global_steps)
    valid_log.flush()

Epoch 0, Train Batch 0, Loss 2.1366, Accuracy 0.1272
Epoch 0, Train Batch 1, Loss 1.7572, Accuracy 0.3767
Epoch 0, Train Batch 2, Loss 1.5762, Accuracy 0.4620
Epoch 0, Train Batch 3, Loss 1.3986, Accuracy 0.5150
Epoch 0, Train Batch 4, Loss 1.3005, Accuracy 0.5572
Epoch 0, Train Batch 5, Loss 1.1661, Accuracy 0.6097
Epoch 0, Train Batch 6, Loss 1.0671, Accuracy 0.6455
Epoch 0, Train Batch 7, Loss 0.9866, Accuracy 0.6729
Epoch 0, Train Batch 8, Loss 0.9094, Accuracy 0.7003
Epoch 0, Train Batch 9, Loss 0.8649, Accuracy 0.7158
Epoch 0, Train Batch 10, Loss 0.8127, Accuracy 0.7334
Epoch 0, Train Batch 11, Loss 0.7643, Accuracy 0.7504
Epoch 0, Train Batch 12, Loss 0.7239, Accuracy 0.7633
Epoch 0, Train Batch 13, Loss 0.6920, Accuracy 0.7748
Epoch 0, Train Batch 14, Loss 0.6677, Accuracy 0.7823
Epoch 0, Train Batch 15, Loss 0.6400, Accuracy 0.7915
Epoch 0, Train Batch 16, Loss 0.6115, Accuracy 0.8003
Epoch 0, Train Batch 17, Loss 0.5896, Accuracy 0.8068
Epoch 0, Train Batch 18, Loss 0.5689, 
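
The loss printed after each training batch is a running average over the epoch so far, not the loss of that batch alone. The loop calls `update(loss.item() * batchsize, batchsize)` and then reads `.mean`, which suggests a size-weighted running mean. The class below is a hypothetical stand-in that shows that behavior under this assumption; it is not the `Accumulator` from `tgml.metrics`.

```python
class RunningMean:
    """Weighted running mean; hypothetical stand-in, not tgml's Accumulator."""

    def __init__(self):
        self.total = 0.0
        self.weight = 0.0

    def update(self, value_sum, weight):
        # value_sum is already multiplied by its weight, as in the training loop
        self.total += value_sum
        self.weight += weight

    @property
    def mean(self):
        return self.total / self.weight if self.weight else 0.0


epoch_loss = RunningMean()
for batch_loss, batch_size in [(2.0, 100), (1.0, 300)]:
    epoch_loss.update(batch_loss * batch_size, batch_size)
print(epoch_loss.mean)  # 1.25
```

The larger batch pulls the average toward its own loss, so the result is 1.25 rather than the unweighted 1.5.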

### Test the model

In [26]:
model.eval()
acc = Accuracy()
for batch in test_loader:
    # DGLGraph.to() returns a new graph, so keep the result
    batch = batch.to(device)
    with torch.no_grad():
        pred = model(batch, batch.ndata["feat"]).argmax(dim=1)
        acc.update(pred[batch.ndata["test_mask"]], batch.ndata["label"][batch.ndata["test_mask"]])
print("Accuracy: {:.4f}".format(acc.value))

Accuracy: 0.8342
