## Stochastic Training of GNN for Node Classification on Large Graphs 

This tutorial covers:
- Creating your DGL graph from your own data in other formats such as CSV.
- Training a GNN model with a single machine, a single GPU, on a graph of any size.

In [3]:
!wget https://snap.stanford.edu/ogb/data/nodeproppred/products.zip

--2022-08-30 05:17:38--  https://snap.stanford.edu/ogb/data/nodeproppred/products.zip
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1480993786 (1.4G) [application/zip]
Saving to: ‘products.zip’


2022-08-30 05:21:37 (5.92 MB/s) - ‘products.zip’ saved [1480993786/1480993786]



In [5]:
!unzip -o products.zip

Archive:  products.zip
   creating: products/
   creating: products/split/
   creating: products/split/sales_ranking/
  inflating: products/split/sales_ranking/test.csv.gz  
  inflating: products/split/sales_ranking/train.csv.gz  
  inflating: products/split/sales_ranking/valid.csv.gz  
   creating: products/processed/
   creating: products/raw/
  inflating: products/raw/node-label.csv.gz  
 extracting: products/raw/num-node-list.csv.gz  
 extracting: products/raw/num-edge-list.csv.gz  
  inflating: products/raw/node-feat.csv.gz  
  inflating: products/raw/edge.csv.gz  
   creating: products/mapping/
  inflating: products/mapping/README.md  
 extracting: products/mapping/labelidx2productcategory.csv.gz  
  inflating: products/mapping/nodeidx2asin.csv.gz  
  inflating: products/RELEASE_v1.txt  


In [1]:
import torch 

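The archive contains the following gzip-compressed CSV files, which the next cells load:

- `products/raw/edge.csv.gz` (source-destination pairs)
- `products/raw/node-feat.csv.gz` (node features)
- `products/raw/node-label.csv.gz` (node labels)
- `products/raw/num-edge-list.csv.gz` (number of edges)
- `products/raw/num-node-list.csv.gz` (number of nodes)
- `products/split/sales_ranking/train.csv.gz`, `valid.csv.gz`, `test.csv.gz` (node IDs of each split)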
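
To make the edge-list format concrete, here is a minimal sketch that reads a headerless, gzip-compressed CSV of source-destination pairs using only the standard library. The toy data and the helper name `read_edge_pairs` are illustrative assumptions; the tutorial itself uses `pandas.read_csv` for this step.

```python
import csv
import gzip
import io

# Toy stand-in for edge.csv.gz: three headerless "source,destination" rows.
raw = gzip.compress(b"0,1\n1,2\n2,0\n")

def read_edge_pairs(binary_stream):
    """Return a list of (source, destination) integer pairs from a gzipped CSV stream."""
    with gzip.open(binary_stream, mode="rt", newline="") as text:
        return [(int(src), int(dst)) for src, dst in csv.reader(text)]

edges_toy = read_edge_pairs(io.BytesIO(raw))
print(edges_toy)
```

Each row becomes one directed edge, which is the same layout that `edges[:, 0]` and `edges[:, 1]` expose after loading with pandas.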

In [4]:
import pandas as pd 
edges = pd.read_csv('products/raw/edge.csv.gz', header = None).values
node_features = pd.read_csv('products/raw/node-feat.csv.gz', header = None).values 
node_labels = pd.read_csv('products/raw/node-label.csv.gz', header=None).values[:,0]

In [5]:
train_nids = pd.read_csv('products/split/sales_ranking/train.csv.gz', header=None).values[:,0]
valid_nids = pd.read_csv('products/split/sales_ranking/valid.csv.gz', header=None).values[:,0]
test_nids = pd.read_csv('products/split/sales_ranking/test.csv.gz', header=None).values[:,0]

In [6]:
import pickle

import dgl 
import torch 


graph = dgl.graph((edges[:, 0], edges[:, 1]))
node_features = torch.FloatTensor(node_features)
node_labels = torch.LongTensor(node_labels)

with open('data.pkl', 'wb') as f:
    pickle.dump((graph, node_features, node_labels, train_nids, valid_nids, test_nids), f)

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)
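
Two details of `dgl.graph((src, dst))` are worth keeping in mind: the edges are directed, and when no node count is passed the number of nodes is taken to be the largest node ID plus one. The pure-Python sketch below mimics both points on a toy edge list. The helper names are hypothetical, and whether your own edge file stores each undirected edge only once is an assumption you should check for your data; if it does, DGL utilities such as `dgl.to_bidirected` can add the reverse direction.

```python
def infer_num_nodes(edge_pairs):
    """Number of nodes implied by the largest node ID in the edge list."""
    return max(max(src, dst) for src, dst in edge_pairs) + 1

def add_reverse_edges(edge_pairs):
    """Return the edge list with every (u, v) also present as (v, u), without duplicates."""
    seen = set()
    result = []
    for src, dst in edge_pairs:
        for edge in ((src, dst), (dst, src)):
            if edge not in seen:
                seen.add(edge)
                result.append(edge)
    return result

# Node 3 never appears in an edge but still gets an ID slot; a node 5 would be missed.
toy_edges = [(0, 1), (1, 2), (4, 2)]
num_nodes_toy = infer_num_nodes(toy_edges)
symmetric_edges = add_reverse_edges(toy_edges)
print(num_nodes_toy, symmetric_edges)
```

If the graph has isolated nodes with IDs above the largest ID in the edge list, passing the node count explicitly (the `num_nodes` argument of `dgl.graph`) avoids a mismatch with the feature matrix.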


In [7]:
print('Graph')
print(graph)
print('Shape of node features:', node_features.shape)
print('Shape of node labels:', node_labels.shape)

num_features = node_features.shape[1]
num_classes = (node_labels.max() + 1).item()
print('Number of classes:', num_classes)

Graph
Graph(num_nodes=2449029, num_edges=61859140,
      ndata_schemes={}
      edata_schemes={})
Shape of node features: torch.Size([2449029, 100])
Shape of node labels: torch.Size([2449029])
Number of classes: 47


### Define a Data Loader with Neighbor Sampling 

#### Neighbor sampling overview 

The formulation of message passing usually has the following form:

$$ a^{(l)}_v = \rho^{(l)} (\{ h^{(l-1)}_u:u \in \mathcal{N}(v) \}) $$
$$ h^{(l)}_v = \phi^{(l)} (h^{(l-1)}_v, a^{(l)}_v) $$

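where $\rho^{(l)}$ and $\phi^{(l)}$ are parameterized functions, and $\mathcal{N}(v)$ is the set of predecessors (in-neighbors) of $v$ on graph $\mathcal{G}$:

$$ \mathcal{N}(v) = \{ s(e):e\in \mathbb{E}, t(e) = v \} $$

Here $s(e)$ and $t(e)$ denote the source and target node of edge $e$, and $\mathbb{E}$ is the edge set. Neighbor sampling replaces $\mathcal{N}(v)$ with a random subset of bounded size at every layer, so the cost of computing a minibatch no longer depends on the size of the whole graph.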
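
To make the message-passing equations concrete, the sketch below runs one layer on a three-node toy graph with scalar node states. As an illustrative assumption, $\rho$ is the mean over predecessors and $\phi$ adds the aggregate to the node's own state; real layers use learned, parameterized functions instead.

```python
def predecessors(edge_list, v):
    """N(v): sources of all edges whose target is v."""
    return [s for s, t in edge_list if t == v]

def message_passing_layer(edge_list, h):
    """One layer with rho = mean over N(v) and phi = h_v + a_v (toy choices)."""
    new_h = {}
    for v in h:
        preds = predecessors(edge_list, v)
        a_v = sum(h[u] for u in preds) / len(preds) if preds else 0.0
        new_h[v] = h[v] + a_v
    return new_h

toy_edges = [(0, 2), (1, 2), (2, 0)]   # directed edges (source, target)
h0 = {0: 1.0, 1: 3.0, 2: 5.0}          # layer l-1 states
h1 = message_passing_layer(toy_edges, h0)
print(h1)                               # {0: 6.0, 1: 3.0, 2: 7.0}
```

Node 1 has no predecessors, so its state is unchanged, while node 2 receives the mean of the states of nodes 0 and 1.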

In [8]:
sampler = dgl.dataloading.MultiLayerNeighborSampler([4, 4, 4])  # sample at most 4 in-neighbors per node at each of the 3 layers
train_dataloader = dgl.dataloading.NodeDataLoader(
    graph, train_nids, sampler, 
    batch_size = 1024, 
    shuffle = True, 
    drop_last = False, 
    num_workers = 0
)
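
The following pure-Python sketch illustrates the idea behind multi-layer neighbor sampling: starting from the seed (output) nodes, it walks backwards one layer at a time and keeps at most `fanout` in-neighbors per node. It is a simplified illustration under stated assumptions, not DGL's implementation; `sample_blocks` and `in_neighbors` are hypothetical names.

```python
import random

def sample_blocks(in_neighbors, seeds, fanouts, rng):
    """Sample from the output layer back to the input layer.

    Returns (input_nodes, blocks); blocks are ordered input side first and
    each block is a list of sampled (src, dst) edges.
    """
    blocks = []
    frontier = list(seeds)
    for fanout in reversed(fanouts):
        sampled_edges = []
        src_nodes = list(frontier)      # destination nodes are kept as sources too
        known = set(frontier)
        for dst in frontier:
            candidates = in_neighbors.get(dst, [])
            picked = candidates if len(candidates) <= fanout else rng.sample(candidates, fanout)
            for src in picked:
                sampled_edges.append((src, dst))
                if src not in known:
                    known.add(src)
                    src_nodes.append(src)
        blocks.insert(0, sampled_edges)
        frontier = src_nodes            # the next (earlier) layer must cover these nodes
    return frontier, blocks

rng = random.Random(0)
# Toy graph: every node v has in-neighbors v+1, v+2, v+3 (mod 10).
in_neighbors = {v: [(v + k) % 10 for k in (1, 2, 3)] for v in range(10)}
input_nodes, blocks = sample_blocks(in_neighbors, seeds=[0], fanouts=[2, 2], rng=rng)
print(len(input_nodes), [len(b) for b in blocks])
```

The list of blocks is ordered the same way the model consumes it: the first block feeds the first layer and the last block produces the outputs for the seed nodes.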

In [9]:
example_minibatch = next(iter(train_dataloader))
print(example_minibatch) # input, output, bipartites

[tensor([50604,  2334, 72881,  ..., 45141, 19313, 28940]), tensor([ 50604,   2334,  72881,  ..., 115477, 139654, 172127]), [Block(num_src_nodes=35054, num_dst_nodes=15761, num_edges=51350), Block(num_src_nodes=15761, num_dst_nodes=4585, num_edges=15978), Block(num_src_nodes=4585, num_dst_nodes=1024, num_edges=3701)]]


In [10]:
input_nodes, output_nodes, bipartites = example_minibatch 
print(f"To compute {len(output_nodes)} nodes' output we need {len(input_nodes)} nodes' input features")

To compute 1024 nodes' output we need 35054 nodes' input features
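
The 35,054 input nodes reported above can be compared with a simple worst-case bound. Assuming each destination node contributes at most `fanout` sampled edges and is itself included among the source nodes of its block, the sketch below computes upper bounds per block; the function name is hypothetical.

```python
def max_block_sizes(batch_size, fanouts):
    """Upper bounds on (src_nodes, dst_nodes, edges) per block, input side first."""
    bounds = []
    dst = batch_size
    for fanout in reversed(fanouts):
        edges = dst * fanout      # at most `fanout` sampled in-edges per destination
        src = dst + edges         # destinations plus one new source per edge
        bounds.insert(0, (src, dst, edges))
        dst = src                 # sources become destinations of the earlier block
    return bounds

bounds = max_block_sizes(1024, [4, 4, 4])
print(bounds)
# [(128000, 25600, 102400), (25600, 5120, 20480), (5120, 1024, 4096)]
```

The observed blocks (35,054 / 15,761 / 4,585 source nodes) stay well below these bounds, presumably because sampled neighborhoods overlap and some nodes have fewer than four in-neighbors.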


### Defining the Model

In [11]:
import torch.nn as nn 
import torch.optim as optim 
import torch.nn.functional as F 
import dgl.nn as dglnn

In [12]:
class SAGE(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes, n_layers):
        super(SAGE, self).__init__()
        self.n_layers = n_layers 
        self.n_hidden = n_hidden 
        self.n_classes = n_classes 

        self.layers = nn.ModuleList()
        self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))

        for i in range(1, n_layers - 1):
            self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
        self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))

    def forward(self, bipartites, x):
        for l, (layer, bipartite) in enumerate(zip(self.layers, bipartites)):
            x = layer(bipartite, x)
            if l != self.n_layers -1 :
                x = F.relu(x)
        return x 
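
As a quick check of how the constructor above chains layer sizes, this small sketch lists the (input, output) dimensions of each `SAGEConv` for a given configuration. The helper `sage_layer_dims` is hypothetical and assumes `n_layers >= 2`, since the class as written always creates at least an input layer and an output layer.

```python
def sage_layer_dims(in_feats, n_hidden, n_classes, n_layers):
    """(in, out) feature sizes per layer, mirroring the SAGE constructor for n_layers >= 2."""
    dims = [in_feats] + [n_hidden] * (n_layers - 1) + [n_classes]
    return list(zip(dims[:-1], dims[1:]))

layer_dims = sage_layer_dims(in_feats=100, n_hidden=128, n_classes=47, n_layers=3)
print(layer_dims)   # [(100, 128), (128, 128), (128, 47)]
```

The number of layers should match the number of fanouts given to the sampler, because `forward` pairs each layer with one block.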

### Defining the Training Loop

In [13]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = SAGE(num_features, 128, num_classes, 3).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [14]:
valid_dataloader = dgl.dataloading.NodeDataLoader(
    graph, valid_nids, sampler, 
    batch_size = 1024, 
    shuffle = False, 
    drop_last = False, 
    num_workers = 2
)

In [48]:
import tqdm 
import numpy as np 
import sklearn.metrics as metrics 

best_acc = 0 
best_model_path = 'model.pt'
for epoch in range(10):
    model.train()
    with tqdm.tqdm(train_dataloader) as tq:
        for i, (input_nodes, output_nodes, bipartites) in enumerate(tq):
            bipartites = [b.to(device) for b in bipartites]
            inputs = node_features[input_nodes].to(device)
            labels = node_labels[output_nodes].to(device)
            predictions = model(bipartites, inputs)

            loss = F.cross_entropy(predictions, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            acc = metrics.accuracy_score(labels.cpu().numpy(), predictions.argmax(1).detach().cpu().numpy())

            tq.set_postfix({'loss': '%.03f' % loss.item(), 'acc': '%.03f' % acc}, refresh=False)

    model.eval()

    predictions = []
    labels = []

    with tqdm.tqdm(valid_dataloader) as tq, torch.no_grad():
        for (input_nodes, output_nodes, bipartites) in tq:
            bipartites = [b.to(device) for b in bipartites]
            inputs = node_features[input_nodes].to(device)  # avoid overwriting the node-ID tensor
            labels.append(node_labels[output_nodes].numpy())
            predictions.append(model(bipartites, inputs).argmax(1).cpu().numpy())
        
        predictions = np.concatenate(predictions)
        labels = np.concatenate(labels)
        acc = metrics.accuracy_score(labels, predictions)
        print(f'Epoch {epoch+1} Validation Accuracy {acc*100:.2f}%')
        if best_acc < acc:
            best_acc = acc 
            torch.save(model.state_dict(), best_model_path)

100%|██████████| 193/193 [00:05<00:00, 32.49it/s, loss=0.096, acc=1.000]
100%|██████████| 39/39 [00:01<00:00, 30.01it/s]


Epoch 1 Validation Accuracy 88.45%


100%|██████████| 193/193 [00:05<00:00, 35.14it/s, loss=0.197, acc=0.857]
100%|██████████| 39/39 [00:01<00:00, 29.44it/s]


Epoch 2 Validation Accuracy 88.43%


100%|██████████| 193/193 [00:05<00:00, 32.87it/s, loss=0.316, acc=0.857]
100%|██████████| 39/39 [00:01<00:00, 29.42it/s]


Epoch 3 Validation Accuracy 88.22%


100%|██████████| 193/193 [00:05<00:00, 32.88it/s, loss=0.251, acc=0.857]
100%|██████████| 39/39 [00:01<00:00, 29.57it/s]


Epoch 4 Validation Accuracy 88.44%


100%|██████████| 193/193 [00:05<00:00, 33.35it/s, loss=0.197, acc=1.000]
100%|██████████| 39/39 [00:01<00:00, 29.01it/s]


Epoch 5 Validation Accuracy 88.45%


100%|██████████| 193/193 [00:05<00:00, 33.17it/s, loss=0.050, acc=1.000]
100%|██████████| 39/39 [00:01<00:00, 29.35it/s]


Epoch 6 Validation Accuracy 88.76%


100%|██████████| 193/193 [00:05<00:00, 33.03it/s, loss=0.603, acc=0.857]
100%|██████████| 39/39 [00:01<00:00, 29.91it/s]


Epoch 7 Validation Accuracy 88.34%


100%|██████████| 193/193 [00:05<00:00, 33.43it/s, loss=0.557, acc=0.857]
100%|██████████| 39/39 [00:01<00:00, 29.09it/s]


Epoch 8 Validation Accuracy 88.48%


100%|██████████| 193/193 [00:05<00:00, 33.05it/s, loss=0.587, acc=0.857]
100%|██████████| 39/39 [00:01<00:00, 29.25it/s]


Epoch 9 Validation Accuracy 88.29%


100%|██████████| 193/193 [00:05<00:00, 33.30it/s, loss=0.395, acc=0.714]
100%|██████████| 39/39 [00:01<00:00, 29.80it/s]

Epoch 10 Validation Accuracy 88.51%
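
The `acc` value in each training progress bar is the accuracy of the most recent minibatch only. The final values shown above (1.000, 0.857, 0.714) are consistent with 7/7, 6/7 and 5/7, which suggests that the last batch holds only a handful of nodes because `drop_last = False`, so that number is a noisy indicator. A minimal sketch of an epoch-level running accuracy, using a hypothetical `RunningAccuracy` helper on plain Python lists, could look like this:

```python
class RunningAccuracy:
    """Accumulates correct/total counts so every sample is weighted equally."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, labels, predictions):
        self.correct += sum(int(y == p) for y, p in zip(labels, predictions))
        self.total += len(labels)

    @property
    def value(self):
        return self.correct / self.total if self.total else 0.0

tracker = RunningAccuracy()
tracker.update([1, 0, 2, 2], [1, 0, 2, 1])   # large batch: 3 of 4 correct
tracker.update([3], [3])                     # tiny last batch: 1 of 1 correct
print(tracker.value)                         # 0.8, whereas the last batch alone reports 1.0
```

In the training loop, `update` would be called once per minibatch with the labels and the `argmax` predictions converted to lists or arrays.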





In [57]:
def inference(model, graph, input_features, batch_size):
    # Layer-wise inference: compute one layer's output for ALL nodes
    # before moving on to the next layer.
    nodes = torch.arange(graph.number_of_nodes())

    # One-layer sampler that keeps every in-neighbor (no sampling at inference time).
    sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
    dataloader = dgl.dataloading.NodeDataLoader(
        graph, nodes, sampler, 
        batch_size = batch_size, 
        shuffle = False, 
        drop_last = False, 
        num_workers = 0
    )

    with torch.no_grad():
        for l, layer in enumerate(model.layers):
            # Buffer (on CPU) holding this layer's output for every node.
            output_features = torch.zeros(
                graph.number_of_nodes(),
                model.n_hidden if l != model.n_layers - 1 else model.n_classes)

            for input_nodes, output_nodes, bipartites in tqdm.tqdm(dataloader):
                bipartite = bipartites[0].to(device)
                x = input_features[input_nodes].to(device)

                x = layer(bipartite, x)
                if l != model.n_layers - 1:
                    x = F.relu(x)

                output_features[output_nodes] = x.cpu()

            # This layer's output becomes the next layer's input.
            input_features = output_features
    return output_features
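
The `inference` function works layer by layer rather than node by node: it computes one layer's representation for every node, in batches, and only then moves on to the next layer, so no neighborhood has to be expanded across several hops at once. The toy sketch below shows that control flow with scalar features and a stand-in layer (self state plus the mean of in-neighbor states). Both `toy_layer` and `layerwise_inference` are hypothetical names, and the layer is an illustrative assumption rather than `SAGEConv`.

```python
def toy_layer(h_self, h_neighbors):
    """Stand-in for a GNN layer: self state plus mean of in-neighbor states."""
    mean = sum(h_neighbors) / len(h_neighbors) if h_neighbors else 0.0
    return h_self + mean

def layerwise_inference(in_neighbors, features, n_layers, batch_size):
    """Compute every node's output one layer at a time, in node batches."""
    nodes = sorted(features)
    current = dict(features)
    for _ in range(n_layers):
        next_features = {}
        for start in range(0, len(nodes), batch_size):
            for v in nodes[start:start + batch_size]:
                neighbor_states = [current[u] for u in in_neighbors.get(v, [])]
                next_features[v] = toy_layer(current[v], neighbor_states)
        current = next_features   # this layer's output feeds the next layer
    return current

in_neighbors = {0: [1], 1: [0, 2], 2: []}
features = {0: 1.0, 1: 2.0, 2: 4.0}
result = layerwise_inference(in_neighbors, features, n_layers=2, batch_size=2)
print(result)   # {0: 7.5, 1: 8.0, 2: 4.0}
```

The batch size only controls how much work is done at once; it does not change the result, because each layer reads exclusively from the previous layer's completed output.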

In [None]:
model.load_state_dict(torch.load(best_model_path))
all_predictions = inference(model, graph, node_features, 8192)

test_predictions = all_predictions[test_nids].argmax(1)
test_labels = node_labels[test_nids]
test_accuracy = metrics.accuracy_score(test_labels.numpy(), test_predictions.numpy())  # (y_true, y_pred)
print('Test accuracy:', test_accuracy)

### Conclusion 

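In this tutorial, you have learned how to train a multi-layer GraphSAGE model with neighbor sampling on a large dataset that cannot fit into GPU memory.

Because each minibatch only touches a sampled neighborhood, the method works on a single machine with a single GPU and scales to graphs much larger than GPU memory, provided the graph structure and node features fit in host memory.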