# Stochastic Training of GNN for Node Classification on Large Heterogeneous Graphs

This tutorial shows how to train a multi-layer R-GCN for node classification on the `ogbn-mag` dataset provided by OGB.

The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two types of entities—an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study.

This tutorial covers:

* Creating a DGL graph using the DGL data loader provided by OGB.
* Training a GNN model on a single machine with a single GPU, on a graph of any size.

## Load Dataset

We load the dataset with the `DglNodePropPredDataset` class from the OGB Python package, which downloads and preprocesses the data and returns it as a DGL graph.

In [1]:
!pip install ogb -qq

In [2]:
from ogb.nodeproppred import DglNodePropPredDataset

dataset = DglNodePropPredDataset(name='ogbn-mag')

Using backend: pytorch


The dataset contains the following:

* DGL graph object (source-destination pairs)
* The node label tensor

We can also use the utility function in the dataset to get the train, validation, and test splits.

In [3]:
import dgl

graph, label = dataset[0] # graph: DGL heterograph object, label: dict of label tensors keyed by node type


split_idx = dataset.get_idx_split()
train_nids, valid_nids, test_nids = split_idx["train"], split_idx["valid"], split_idx["test"]

We can see the size of the graph, features, and labels as follows.

In [4]:
print(graph)

print('Node labels')
node_labels = label['paper'].flatten()
print('Shape of target node labels:', node_labels.shape)
num_classes = (node_labels.max() + 1).item()
print('Number of classes:', num_classes)

print('Node features')
node_features = graph.nodes['paper'].data['feat']
num_features = node_features.shape[1]
print('Shape of features of paper node type: {}'.format(num_features))

Graph(num_nodes={'author': 1134649, 'field_of_study': 59965, 'institution': 8740, 'paper': 736389},
      num_edges={('author', 'affiliated_with', 'institution'): 1043998, ('author', 'writes', 'paper'): 7145660, ('paper', 'cites', 'paper'): 5416271, ('paper', 'has_topic', 'field_of_study'): 7505078},
      metagraph=[('author', 'institution', 'affiliated_with'), ('author', 'paper', 'writes'), ('paper', 'paper', 'cites'), ('paper', 'field_of_study', 'has_topic')])
Node labels
Shape of target node labels: torch.Size([736389])
Number of classes: 349
Node features
Shape of features of paper node type: 128


In [5]:
# Extract the edges of the relations to keep. Note that the `cites` relation
# is not carried over into the new graph.
src_writes, dst_writes = graph.all_edges(etype="writes")
src_topic, dst_topic = graph.all_edges(etype="has_topic")
src_aff, dst_aff = graph.all_edges(etype="affiliated_with")

# Rebuild the graph, adding a reverse relation ("-rev") for each kept relation.
graph = dgl.heterograph({
    ("author", "writes", "paper"): (src_writes, dst_writes),
    ("paper", "has_topic", "field_of_study"): (src_topic, dst_topic),
    ("author", "affiliated_with", "institution"): (src_aff, dst_aff),
    ("paper", "writes-rev", "author"): (dst_writes, src_writes),
    ("field_of_study", "has_topic-rev", "paper"): (dst_topic, src_topic),
    ("institution", "affiliated_with-rev", "author"): (dst_aff, src_aff),
})
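The cell above rebuilds the graph with one reverse relation per kept relation. A plausible motivation, stated here as an assumption rather than something the tutorial spells out, is that messages flow along edge direction, so a node type that only appears as a source would never receive messages. The following pure-Python sketch illustrates the idea on a toy edge dictionary; `add_reverse_relations` is a hypothetical helper and not part of DGL.

```python
def add_reverse_relations(edges):
    """Return a copy of `edges` with a '<name>-rev' relation added for each relation.

    `edges` maps (src_type, relation, dst_type) to a (src_ids, dst_ids) pair of lists.
    This is a hypothetical helper for illustration only.
    """
    result = dict(edges)
    for (src_type, rel, dst_type), (src_ids, dst_ids) in edges.items():
        # Swap the endpoint types and the endpoint ID lists.
        result[(dst_type, rel + "-rev", src_type)] = (list(dst_ids), list(src_ids))
    return result


toy_edges = {
    ("author", "writes", "paper"): ([0, 0, 1], [0, 1, 1]),
    ("paper", "has_topic", "field_of_study"): ([0, 1], [0, 0]),
}
toy_with_rev = add_reverse_relations(toy_edges)

for etype, (src, dst) in toy_with_rev.items():
    print(etype, list(zip(src, dst)))

# Node types that can receive messages, before and after adding reverse relations.
receivers_before = sorted({dst_type for _, _, dst_type in toy_edges})
receivers_after = sorted({dst_type for _, _, dst_type in toy_with_rev})
print(receivers_before)
print(receivers_after)
```

In the toy graph, authors only become message receivers once the `writes-rev` relation exists.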

## Defining Neighbor Sampler and Data Loader in DGL

Training on a large graph proceeds in minibatches: for each batch of target nodes, only the neighbors whose features are needed to compute the targets' outputs (the computation dependencies) are loaded. DGL provides tools to generate these computation dependencies while iterating over the dataset in minibatches. For node classification, you can use `dgl.dataloading.NodeDataLoader` to iterate over the dataset and `dgl.dataloading.MultiLayerNeighborSampler` to generate the computation dependencies of the nodes for a multi-layer GNN with neighbor sampling.

The syntax of `dgl.dataloading.NodeDataLoader` is similar to that of a PyTorch `DataLoader`, except that it also needs a graph to generate computation dependencies from, a set of node IDs to iterate over, and the neighbor sampler you defined.

Let's consider training a 2-layer R-GCN with neighbor sampling, where each node gathers messages from up to 15 sampled neighbors per edge type on each layer. The code defining the data loader and neighbor sampler looks like the following.

In [6]:
import dgl

sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 15])
train_dataloader = dgl.dataloading.NodeDataLoader(
    graph, train_nids, sampler,
    batch_size=1024,
    shuffle=True,
    drop_last=False,
    num_workers=0
)
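To make the role of the fanout list `[15, 15]` more concrete, here is a minimal pure-Python sketch of multi-layer neighbor sampling. It is an illustration under simplifying assumptions and not DGL's implementation: it ignores node and edge types, and `sample_input_nodes` and the toy graph are made up for this example.

```python
import random


def sample_input_nodes(in_neighbors, seeds, fanouts, rng):
    """Return the set of nodes whose input features are needed to compute `seeds`.

    in_neighbors: dict mapping a node to the list of its in-neighbors.
    fanouts: maximum number of neighbors to sample per node, one entry per layer.
    """
    frontier = set(seeds)
    for fanout in reversed(fanouts):  # walk from the output layer back to the input
        needed = set(frontier)  # each node also needs its own previous-layer value
        for node in frontier:
            neighbors = in_neighbors.get(node, [])
            k = min(fanout, len(neighbors))
            needed.update(rng.sample(neighbors, k))
        frontier = needed
    return frontier


# Toy graph: node i receives messages from nodes i+1, i+2 and i+3 (mod 20).
in_neighbors = {i: [(i + 1) % 20, (i + 2) % 20, (i + 3) % 20] for i in range(20)}
rng = random.Random(0)

sampled = sample_input_nodes(in_neighbors, seeds=[0], fanouts=[2, 2], rng=rng)
full = sample_input_nodes(in_neighbors, seeds=[0], fanouts=[3, 3], rng=rng)
print("With fanout 2 per layer:", len(sampled), "input nodes")
print("With all 3 neighbors per layer:", len(full), "input nodes")
```

With a fanout of 2 and two layers, a single seed needs at most 1 × 3 × 3 = 9 input nodes, no matter how large the graph is. This bound is what keeps the minibatch size under control.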

We can iterate over the data loader we created and see what it gives us.

In [7]:
example_minibatch = next(iter(train_dataloader))
print(example_minibatch)

({'author': tensor([   711, 459442,  31044,  ..., 581587, 396341,  48262]), 'field_of_study': tensor([10471, 10977, 13664,  ...,  5120,  7796, 40196]), 'institution': tensor([1884, 4668, 6897,  ..., 3792, 3663, 5924]), 'paper': tensor([ 70408, 110589,   2339,  ...,  80321, 500612, 602814])}, {'author': tensor([], dtype=torch.int64), 'field_of_study': tensor([], dtype=torch.int64), 'institution': tensor([], dtype=torch.int64), 'paper': tensor([ 70408, 110589,   2339,  ..., 562045, 597263, 410335])}, [Block(num_src_nodes={'author': 4778, 'field_of_study': 3556, 'institution': 1312, 'paper': 74682},
      num_dst_nodes={'author': 4582, 'field_of_study': 3556, 'institution': 0, 'paper': 1024},
      num_edges={('author', 'affiliated_with', 'institution'): 0, ('author', 'writes', 'paper'): 4639, ('field_of_study', 'has_topic-rev', 'paper'): 10546, ('institution', 'affiliated_with-rev', 'author'): 6600, ('paper', 'has_topic', 'field_of_study'): 51627, ('paper', 'writes-rev', 'author'): 39915

`NodeDataLoader` gives us three items per iteration.

* The input node list: the nodes whose input features are needed to compute the outputs.
* The output node list: the nodes whose GNN representations are to be computed.
* The list of computation dependencies (blocks), one for each layer.
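The relationship between these three items can be sketched in plain Python. The following is a simplified illustration, not DGL's block construction: it ignores node types and sampling (all neighbors are kept), and `build_blocks` is a hypothetical helper. It shows that the destination nodes of one block are the source nodes of the next, that the first block's source nodes are the input nodes, and that the last block's destination nodes are the output nodes.

```python
def build_blocks(in_neighbors, output_nodes, num_layers):
    """Build one (src_nodes, dst_nodes, edges) triple per layer, input layer first."""
    blocks = []
    dst_nodes = list(output_nodes)
    for _ in range(num_layers):
        # Destination nodes come first in the source list, followed by new neighbors.
        src_nodes = list(dst_nodes)
        edges = []
        for dst in dst_nodes:
            for src in in_neighbors.get(dst, []):
                if src not in src_nodes:
                    src_nodes.append(src)
                edges.append((src, dst))
        blocks.insert(0, (src_nodes, dst_nodes, edges))
        dst_nodes = src_nodes  # the block closer to the input must produce these nodes
    return blocks


in_neighbors = {0: [1, 2], 1: [3], 2: [3, 4], 3: [], 4: [0]}
blocks = build_blocks(in_neighbors, output_nodes=[0], num_layers=2)

input_nodes = blocks[0][0]    # source nodes of the first block
output_nodes = blocks[-1][1]  # destination nodes of the last block
for src_nodes, dst_nodes, edges in blocks:
    print("src:", src_nodes, "dst:", dst_nodes, "edges:", edges)
```

The same chaining is visible in the block printout of the real minibatch, where the destination node counts of the first block match the source node counts of the second.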

In [8]:
input_nodes, output_nodes, bipartites = example_minibatch
print("To compute {} target nodes' output we need {} nodes' input features".format(len(output_nodes['paper']), len(input_nodes['paper'])))

print("")
print("Output nodes")
print(output_nodes)

print("")
print("Input nodes")
print(input_nodes)

To compute 1024 target nodes' output we need 74682 nodes' input features

Output nodes
{'author': tensor([], dtype=torch.int64), 'field_of_study': tensor([], dtype=torch.int64), 'institution': tensor([], dtype=torch.int64), 'paper': tensor([ 70408, 110589,   2339,  ..., 562045, 597263, 410335])}

Input nodes
{'author': tensor([   711, 459442,  31044,  ..., 581587, 396341,  48262]), 'field_of_study': tensor([10471, 10977, 13664,  ...,  5120,  7796, 40196]), 'institution': tensor([1884, 4668, 6897,  ..., 3792, 3663, 5924]), 'paper': tensor([ 70408, 110589,   2339,  ...,  80321, 500612, 602814])}


In [9]:
for block in bipartites:
    print(block)
    print()

Block(num_src_nodes={'author': 4778, 'field_of_study': 3556, 'institution': 1312, 'paper': 74682},
      num_dst_nodes={'author': 4582, 'field_of_study': 3556, 'institution': 0, 'paper': 1024},
      num_edges={('author', 'affiliated_with', 'institution'): 0, ('author', 'writes', 'paper'): 4639, ('field_of_study', 'has_topic-rev', 'paper'): 10546, ('institution', 'affiliated_with-rev', 'author'): 6600, ('paper', 'has_topic', 'field_of_study'): 51627, ('paper', 'writes-rev', 'author'): 39915},
      metagraph=[('author', 'institution', 'affiliated_with'), ('author', 'paper', 'writes'), ('institution', 'author', 'affiliated_with-rev'), ('paper', 'field_of_study', 'has_topic'), ('paper', 'author', 'writes-rev'), ('field_of_study', 'paper', 'has_topic-rev')])

Block(num_src_nodes={'author': 4582, 'field_of_study': 3556, 'institution': 0, 'paper': 1024},
      num_dst_nodes={'author': 0, 'field_of_study': 0, 'institution': 0, 'paper': 1024},
      num_edges={('author', 'affiliated_with', 'i

## Defining Model

The model can be written as follows:

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import dgl.nn as dglnn

class RGCN(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes, n_layers, rel_names):
        super().__init__()
        
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.n_classes = n_classes
        self.layers = nn.ModuleList()
        
        self.layers.append(dglnn.HeteroGraphConv({
            rel: dglnn.GraphConv(in_feats, n_hidden)
            for rel in rel_names}, aggregate='sum'))
        
        for i in range(1, n_layers - 1):
            self.layers.append(dglnn.HeteroGraphConv({
                rel: dglnn.GraphConv(n_hidden, n_hidden)
                for rel in rel_names}, aggregate='sum'))
            
        self.layers.append(dglnn.HeteroGraphConv({
            rel: dglnn.GraphConv(n_hidden, n_classes)
            for rel in rel_names}, aggregate='sum'))

    def forward(self, bipartites, x):
        # inputs are features of nodes
        for l, (layer, bipartite) in enumerate(zip(self.layers, bipartites)):
            x = layer(bipartite, x)
            if l != self.n_layers - 1:
                x = {k: F.relu(v) for k, v in x.items()}
        return x
    

class NodeEmbed(nn.Module):
    def __init__(self, num_nodes, embed_size):
        super().__init__()
        self.embed_size = embed_size
        self.node_embeds = nn.ModuleDict()
        for ntype in num_nodes:
            node_embed = torch.nn.Embedding(num_nodes[ntype], self.embed_size)
            nn.init.uniform_(node_embed.weight, -1.0, 1.0)
            self.node_embeds[str(ntype)] = node_embed
    
    def forward(self, node_ids):
        embeds = {}
        for ntype in node_ids:
            embeds[ntype] = self.node_embeds[ntype](node_ids[ntype])
        return embeds

In `RGCN.forward`, we iterate over pairs of an NN layer and the bipartite graph that the data loader generated for that layer. `NodeEmbed` holds a learnable embedding table for each node type that has no input features.
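Each layer of `RGCN` applies one module per relation and then combines the results that land on the same destination node type with `aggregate='sum'`. The sketch below mimics that two-step pattern in pure Python under strong simplifications: features are scalars, the per-relation functions are stand-ins for learned `GraphConv` weights, and the degree normalization that `GraphConv` applies is omitted. `hetero_layer` is a hypothetical name.

```python
def hetero_layer(edges, features, relation_fns):
    """Apply one function per relation and sum the results per destination node type.

    edges: {(src_type, rel, dst_type): [(src_id, dst_id), ...]}
    features: {node_type: {node_id: float}}
    relation_fns: {rel: function applied to the summed neighbor messages}
    """
    outputs = {}
    for (src_type, rel, dst_type), pairs in edges.items():
        # Within one relation: sum the source features arriving at each destination.
        summed = {}
        for src_id, dst_id in pairs:
            summed[dst_id] = summed.get(dst_id, 0.0) + features[src_type][src_id]
        # Across relations ('sum' aggregation): accumulate per destination node type.
        per_type = outputs.setdefault(dst_type, {})
        for dst_id, value in summed.items():
            per_type[dst_id] = per_type.get(dst_id, 0.0) + relation_fns[rel](value)
    return outputs


edges = {
    ("author", "writes", "paper"): [(0, 0), (1, 0)],
    ("field_of_study", "has_topic-rev", "paper"): [(0, 0)],
}
features = {"author": {0: 1.0, 1: 2.0}, "field_of_study": {0: 10.0}}
relation_fns = {"writes": lambda v: 2.0 * v, "has_topic-rev": lambda v: 0.5 * v}

out = hetero_layer(edges, features, relation_fns)
print(out)
```

Paper 0 receives 2.0 × (1.0 + 2.0) from its authors and 0.5 × 10.0 from its field of study, and the two contributions are summed.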

## Defining Training Loop

The following initializes the model and defines the optimizer.

In [11]:
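# Only 'paper' nodes have input features; the other node types get learnable embeddings.
num_nodes = {ntype: graph.number_of_nodes(ntype) for ntype in graph.ntypes if ntype != 'paper'}
num_layers = 2
hidden_dim = 128
# The embedding size must match num_features, because the first R-GCN layer
# uses in_feats as the input size for every relation (both are 128 here).
# The embedding tables stay on the CPU; only the rows needed by a minibatch are copied to the GPU.
embed = NodeEmbed(num_nodes, num_features)
model = RGCN(num_features, hidden_dim, num_classes, num_layers, graph.etypes).cuda()
# Optimize the R-GCN weights and the node embeddings jointly.
all_params = list(model.parameters()) + list(embed.parameters())
opt = torch.optim.Adam(all_params)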

In [12]:
embed

NodeEmbed(
  (node_embeds): ModuleDict(
    (author): Embedding(1134649, 128)
    (field_of_study): Embedding(59965, 128)
    (institution): Embedding(8740, 128)
  )
)

In [13]:
model

RGCN(
  (layers): ModuleList(
    (0): HeteroGraphConv(
      (mods): ModuleDict(
        (affiliated_with): GraphConv(in=128, out=128, normalization=both, activation=None)
        (affiliated_with-rev): GraphConv(in=128, out=128, normalization=both, activation=None)
        (has_topic): GraphConv(in=128, out=128, normalization=both, activation=None)
        (has_topic-rev): GraphConv(in=128, out=128, normalization=both, activation=None)
        (writes): GraphConv(in=128, out=128, normalization=both, activation=None)
        (writes-rev): GraphConv(in=128, out=128, normalization=both, activation=None)
      )
    )
    (1): HeteroGraphConv(
      (mods): ModuleDict(
        (affiliated_with): GraphConv(in=128, out=349, normalization=both, activation=None)
        (affiliated_with-rev): GraphConv(in=128, out=349, normalization=both, activation=None)
        (has_topic): GraphConv(in=128, out=349, normalization=both, activation=None)
        (has_topic-rev): GraphConv(in=128, out=349, n

When computing the validation score for model selection, you can usually use neighbor sampling as well. To do that, define another data loader over the validation nodes.

In [14]:
valid_dataloader = dgl.dataloading.NodeDataLoader(
    graph, valid_nids, sampler,
    batch_size=1024,
    shuffle=False,
    drop_last=False,
    num_workers=0
)

The following is a training loop that performs validation every epoch.  It also saves the model with the best validation accuracy into a file.

In [69]:
import tqdm
import numpy as np
import sklearn.metrics

best_accuracy = 0
best_model_path = 'model.pt'
for epoch in range(10):
    model.train()
    
    with tqdm.tqdm(train_dataloader) as tq:
        for step, (input_nodes, output_nodes, bipartites) in enumerate(tq):
            bipartites = [b.to(torch.device('cuda')) for b in bipartites]
            
            # Get node ids for node types that don't have input features
            nodes_to_embed = {ntype: node_ids for ntype, node_ids in input_nodes.items() if ntype != "paper"}
            
            # Get node embeddings for node types that don't have input features and copy to gpu
            embeddings = {ntype: node_embedding.cuda() for ntype, node_embedding in embed(nodes_to_embed).items()}
            
            # Get input features for node type 'paper' which has input features
            inputs = {'paper': node_features[input_nodes['paper']].cuda()}
            
            # Merge the learned embeddings with the 'paper' input features
            inputs.update(embeddings)
            
            labels = node_labels[output_nodes['paper']].cuda()
            predictions = model(bipartites, inputs)['paper']

            loss = F.cross_entropy(predictions, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

            accuracy = sklearn.metrics.accuracy_score(labels.cpu().numpy(), predictions.argmax(1).detach().cpu().numpy())
            
            tq.set_postfix({'loss': '%.03f' % loss.item(), 'acc': '%.03f' % accuracy}, refresh=False)
        
    model.eval()
    
    predictions = []
    labels = []
    with tqdm.tqdm(valid_dataloader) as tq, torch.no_grad():
        for input_nodes, output_nodes, bipartites in tq:
            bipartites = [b.to(torch.device('cuda')) for b in bipartites]
            
            nodes_to_embed = {ntype: node_ids for ntype, node_ids in input_nodes.items() if ntype != "paper"}
            embeddings = {ntype: node_embedding.cuda() for ntype, node_embedding in embed(nodes_to_embed).items()}
            inputs = {'paper': node_features[input_nodes['paper']].cuda()}
            inputs.update(embeddings)
            
            labels.append(node_labels[output_nodes['paper']].numpy())
            predictions.append(model(bipartites, inputs)['paper'].argmax(1).cpu().numpy())
        predictions = np.concatenate(predictions)
        labels = np.concatenate(labels)
        accuracy = sklearn.metrics.accuracy_score(labels, predictions)
        print('Epoch {} Validation Accuracy {}'.format(epoch, accuracy))
        if best_accuracy < accuracy:
            best_accuracy = accuracy
            # Note: only the R-GCN weights are saved; the `embed` parameters are not part of this checkpoint.
            torch.save(model.state_dict(), best_model_path)

100%|██████████| 615/615 [00:56<00:00, 10.84it/s, loss=2.011, acc=0.461]
100%|██████████| 64/64 [00:04<00:00, 13.11it/s]

Epoch 0 Validation Accuracy 0.3604864439957459

100%|██████████| 615/615 [00:55<00:00, 11.09it/s, loss=1.968, acc=0.460]
100%|██████████| 64/64 [00:04<00:00, 13.31it/s]

Epoch 1 Validation Accuracy 0.36361534548929547

100%|██████████| 615/615 [00:53<00:00, 11.54it/s, loss=1.952, acc=0.446]
100%|██████████| 64/64 [00:04<00:00, 13.38it/s]

Epoch 2 Validation Accuracy 0.359993218144546

100%|██████████| 615/615 [00:53<00:00, 11.44it/s, loss=2.099, acc=0.414]
100%|██████████| 64/64 [00:04<00:00, 13.15it/s]

Epoch 3 Validation Accuracy 0.3551071995560967

100%|██████████| 615/615 [00:53<00:00, 11.42it/s, loss=2.056, acc=0.429]
100%|██████████| 64/64 [00:04<00:00, 13.28it/s]

Epoch 4 Validation Accuracy 0.36632808767089503

100%|██████████| 615/615 [00:53<00:00, 11.42it/s, loss=1.941, acc=0.440]
100%|██████████| 64/64 [00:04<00:00, 12.91it/s]

Epoch 5 Validation Accuracy 0.36259806717119564

100%|██████████| 615/615 [00:54<00:00, 11.37it/s, loss=1.988, acc=0.466]
100%|██████████| 64/64 [00:04<00:00, 13.03it/s]

Epoch 6 Validation Accuracy 0.3612879360039458

100%|██████████| 615/615 [00:54<00:00, 11.22it/s, loss=1.996, acc=0.410]
100%|██████████| 64/64 [00:04<00:00, 13.49it/s]

Epoch 7 Validation Accuracy 0.3614112424667458

100%|██████████| 615/615 [00:52<00:00, 11.65it/s, loss=1.945, acc=0.454]
100%|██████████| 64/64 [00:04<00:00, 13.41it/s]

Epoch 8 Validation Accuracy 0.3627984401732456

100%|██████████| 615/615 [00:52<00:00, 11.62it/s, loss=2.026, acc=0.453]
100%|██████████| 64/64 [00:04<00:00, 13.67it/s]

Epoch 9 Validation Accuracy 0.3630758797145455
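
The validation loop collects the labels and predictions of all minibatches and computes a single accuracy at the end, rather than averaging the per-batch accuracies. With `drop_last=False` the last batch is usually smaller than the others, so the two numbers generally differ. The toy example below, with made-up labels and a hypothetical `accuracy` helper, illustrates the difference.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Two validation minibatches as (labels, predictions) pairs.
batches = [
    ([0, 1, 2, 3], [0, 1, 2, 0]),  # full batch: 3 of 4 correct
    ([1], [0]),                    # last, smaller batch: 0 of 1 correct
]

# Accuracy over the concatenated labels and predictions, as in the loop above.
flat_labels = [y for labels, _ in batches for y in labels]
flat_predictions = [p for _, preds in batches for p in preds]
overall = accuracy(flat_labels, flat_predictions)

# Averaging per-batch accuracies gives the small last batch too much weight.
mean_of_batches = sum(accuracy(l, p) for l, p in batches) / len(batches)

print("Accuracy over all nodes:", overall)
print("Mean of per-batch accuracies:", mean_of_batches)
```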





## Offline Inference without Neighbor Sampling

Usually for offline inference it is desirable to aggregate over the entire neighborhood to eliminate the randomness introduced by neighbor sampling. However, using the same methodology as in training is inefficient, because it involves a lot of redundant computation. Moreover, simply doing neighbor sampling with all neighbors will often exhaust GPU memory, because the number of nodes whose input features are required may be too large.

Instead, you need to compute the representations layer by layer: you first compute the output of the first GNN layer for all nodes, then compute the output of the second GNN layer for all nodes using the first layer's output as input, and so on. This gives a different algorithm from the one used in training. During training, an outer loop iterates over minibatches of nodes and an inner loop iterates over the layers. In contrast, during inference, an outer loop iterates over the layers and an inner loop iterates over minibatches of nodes.
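
The two loop orders can be compared on a toy example. The sketch below is illustrative only: it uses a made-up scalar "layer" (own value plus the sum of in-neighbor values) instead of a real GNN layer, ignores node types, and all function names are hypothetical. It checks that layer-by-layer inference gives the same result as expanding the full neighborhood of each node separately.

```python
def layer_fn(own, neighbor_values):
    """Toy GNN layer: own value plus the sum of in-neighbor values."""
    return own + sum(neighbor_values)


def layerwise_inference(in_neighbors, features, num_layers, batch_size):
    """Outer loop over layers, inner loop over minibatches of nodes."""
    nodes = sorted(features)
    h = dict(features)
    for _ in range(num_layers):
        h_next = {}  # output buffer for every node at this layer
        for start in range(0, len(nodes), batch_size):
            for v in nodes[start:start + batch_size]:
                h_next[v] = layer_fn(h[v], [h[u] for u in in_neighbors[v]])
        h = h_next  # this layer's output is the next layer's input
    return h


def nodewise_inference(in_neighbors, features, num_layers, node):
    """Expand the full multi-hop neighborhood of one node, recomputing shared work."""
    if num_layers == 0:
        return features[node]
    own = nodewise_inference(in_neighbors, features, num_layers - 1, node)
    neigh = [nodewise_inference(in_neighbors, features, num_layers - 1, u)
             for u in in_neighbors[node]]
    return layer_fn(own, neigh)


in_neighbors = {0: [1, 2], 1: [2], 2: [0], 3: [0, 1]}
features = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

layerwise = layerwise_inference(in_neighbors, features, num_layers=2, batch_size=2)
nodewise = {v: nodewise_inference(in_neighbors, features, 2, v) for v in features}
print(layerwise)
print(nodewise)
```

The node-wise version recomputes the first-layer value of a node every time it appears in some other node's neighborhood, which is the redundant computation that the layer-wise scheme avoids.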

If you do not care about randomness too much (e.g., during model selection in validation), you can still use the `dgl.dataloading.MultiLayerNeighborSampler` and `dgl.dataloading.NodeDataLoader` to do offline inference, since it is usually faster for evaluating a small number of nodes.

In [74]:
def inference(model, graph, input_features, batch_size):
    nodes = {ntype: torch.arange(graph.number_of_nodes(ntype)) for ntype in graph.ntypes}
    
    sampler = dgl.dataloading.MultiLayerNeighborSampler([None])  # one layer at a time, taking all neighbors
    dataloader = dgl.dataloading.NodeDataLoader(
        graph, nodes, sampler,
        batch_size=batch_size,
        shuffle=False,
        drop_last=False,
        num_workers=0)
    
    with torch.no_grad():
        for l, layer in enumerate(model.layers):
            # Allocate a buffer of output representations for every node
            # Note that the buffer is on CPU memory.
            output_features = {ntype: torch.zeros(
                graph.number_of_nodes(ntype), model.n_hidden if l != model.n_layers - 1 else model.n_classes)
                for ntype in graph.ntypes}

            for input_nodes, output_nodes, bipartites in tqdm.tqdm(dataloader):
                bipartite = bipartites[0].to(torch.device('cuda'))

                # send features for nodes in batch to gpu 
                x = {ntype: input_features[ntype][input_nodes[ntype]].cuda() for ntype in input_nodes}

                # the following code is identical to the loop body in model.forward()
                x = layer(bipartite, x)
                if l != model.n_layers - 1:
                    x = {k: F.relu(v) for k, v in x.items()}
                
                for ntype in x:
                    output_features[ntype][output_nodes[ntype]] = x[ntype].cpu()
            input_features = output_features
    return output_features

The following code loads the best model from the file saved previously and performs offline inference.  It computes the accuracy on the test set afterwards.

In [75]:
model.load_state_dict(torch.load(best_model_path))

# Look up the learned embeddings of every node that has no input features.
# Gradients are not needed for inference.
nodes_to_embed = {ntype: torch.arange(num_nodes_ntype) for ntype, num_nodes_ntype in num_nodes.items()}
with torch.no_grad():
    embeddings = embed(nodes_to_embed)
inputs = {'paper': node_features}
inputs.update(embeddings)

all_predictions = inference(model, graph, inputs, 8192)

100%|██████████| 237/237 [00:22<00:00, 10.53it/s]
100%|██████████| 237/237 [00:23<00:00, 10.24it/s]
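
The two progress bars correspond to the two layers of the model. The 237 iterations per layer are consistent with splitting the nodes of all four types into batches of 8,192, which can be checked with the node counts printed earlier (the variable names below are chosen for this check only):

```python
import math

# Node counts per type, taken from the graph printout earlier in the tutorial.
num_nodes_per_type = {
    "author": 1134649,
    "field_of_study": 59965,
    "institution": 8740,
    "paper": 736389,
}
batch_size = 8192

total_nodes = sum(num_nodes_per_type.values())
# drop_last=False, so a final partial batch still counts as one iteration.
num_batches = math.ceil(total_nodes / batch_size)
print(total_nodes, num_batches)
```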


In [72]:
test_predictions = all_predictions['paper'][test_nids['paper']].argmax(1)
test_labels = node_labels[test_nids['paper']]
test_accuracy = sklearn.metrics.accuracy_score(test_labels.numpy(), test_predictions.numpy())
print('Test accuracy:', test_accuracy)

Test accuracy: 0.3197501132597344


## Conclusion

In this tutorial, you have learned how to train a multi-layer R-GCN with neighbor sampling on a large heterogeneous dataset. The method used here works on a single machine with a single GPU.

## What's next?

The next tutorial will be about scaling the training procedure out to multiple GPUs on a single machine.