# Stochastic Training of GNNs for Node Classification on Large Heterogeneous Graphs

This tutorial shows how to train a multi-layer R-GCN for node classification on the `ogbn-mag` dataset provided by OGB.

The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two types of entities—an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study.
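The schema described above can be written down as a mapping from node types to counts plus a list of (source type, relation, destination type) triples. The following is a minimal sketch using only the numbers quoted in the paragraph; the variable names are illustrative, and the relation identifiers follow the names that appear in the code later in this tutorial.

```python
# Node counts quoted in the dataset description above.
num_nodes = {
    "paper": 736_389,
    "author": 1_134_649,
    "institution": 8_740,
    "field_of_study": 59_965,
}

# Directed relations as (source type, relation, destination type) triples.
canonical_etypes = [
    ("author", "affiliated_with", "institution"),
    ("author", "writes", "paper"),
    ("paper", "cites", "paper"),
    ("paper", "has_topic", "field_of_study"),
]

# Every relation endpoint must be a known node type.
for src_type, _, dst_type in canonical_etypes:
    assert src_type in num_nodes and dst_type in num_nodes

total_nodes = sum(num_nodes.values())
print("node types:", len(num_nodes), "relations:", len(canonical_etypes))
print("total nodes:", total_nodes)
```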

This tutorial covers:

* Creating a DGL graph with the OGB data loader for DGL.
* Training a GNN model on a single machine with multiple GPUs, using neighbor sampling so that only minibatches need to fit in GPU memory.

In [None]:
import numpy as np
import dgl
import torch
import dgl.nn as dglnn
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel
import torch.nn.functional as F
import torch.multiprocessing as mp
import sklearn.metrics
import tqdm

import utils

## Load Dataset

We load the dataset with the `DglNodePropPredDataset` class from the OGB Python package, then peek into its contents: the graph structure, the paper labels, and the paper features.

In [2]:
from ogb.nodeproppred import DglNodePropPredDataset

dataset = DglNodePropPredDataset(name='ogbn-mag')

graph, label = dataset[0]  # graph: DGL heterograph; label: dict mapping node type to a label tensor of shape (num_nodes, 1)

split_idx = dataset.get_idx_split()
train_nids, valid_nids, test_nids = split_idx["train"], split_idx["valid"], split_idx["test"]

In [3]:
print(graph)

print('Node labels')
node_labels = label['paper'].flatten()
print('Shape of target node labels:', node_labels.shape)
num_classes = (node_labels.max() + 1).item()
print('Number of classes:', num_classes)

print('Node features')
node_features = graph.nodes['paper'].data['feat']
num_features = node_features.shape[1]
print('Shape of features of paper node type: {}'.format(num_features))

Graph(num_nodes={'author': 1134649, 'field_of_study': 59965, 'institution': 8740, 'paper': 736389},
      num_edges={('author', 'affiliated_with', 'institution'): 1043998, ('author', 'writes', 'paper'): 7145660, ('paper', 'cites', 'paper'): 5416271, ('paper', 'has_topic', 'field_of_study'): 7505078},
      metagraph=[('author', 'institution', 'affiliated_with'), ('author', 'paper', 'writes'), ('paper', 'paper', 'cites'), ('paper', 'field_of_study', 'has_topic')])
Node labels
Shape of target node labels: torch.Size([736389])
Number of classes: 349
Node features
Shape of features of paper node type: 128


In [4]:
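# Extract the source and destination node IDs of three of the original relations.
# Note that the `cites` relation is not carried over into the rebuilt graph.
src_writes, dst_writes = graph.all_edges(etype="writes")
src_topic, dst_topic = graph.all_edges(etype="has_topic")
src_aff, dst_aff = graph.all_edges(etype="affiliated_with")

# Rebuild the graph with a reverse relation for every edge type so that
# messages can flow in both directions.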
graph = dgl.heterograph({
    ("author", "writes", "paper"): (src_writes, dst_writes),
    ("paper", "has_topic", "field_of_study"): (src_topic, dst_topic),
    ("author", "affiliated_with", "institution"): (src_aff, dst_aff),
    ("paper", "writes-rev", "author"): (dst_writes, src_writes),
    ("field_of_study", "has_topic-rev", "paper"): (dst_topic, src_topic),
    ("institution", "affiliated_with-rev", "author"): (dst_aff, src_aff),
})
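The cell above pairs every relation with a reversed copy, so that, for example, authors can receive messages from the papers they wrote as well as the other way round. A minimal sketch of the same idea on plain Python lists; the helper name `add_reverse_relations` and the toy data are hypothetical and not part of DGL:

```python
def add_reverse_relations(edges):
    """Return a copy of `edges` with a '<rel>-rev' relation added for each entry.

    `edges` maps (src_type, rel, dst_type) to a (src_ids, dst_ids) pair.
    """
    result = dict(edges)
    for (src_type, rel, dst_type), (src_ids, dst_ids) in edges.items():
        # Swap both the endpoint types and the ID lists.
        result[(dst_type, rel + "-rev", src_type)] = (list(dst_ids), list(src_ids))
    return result


# Toy data: author 0 wrote papers 0 and 1, author 1 wrote paper 1.
toy_edges = {("author", "writes", "paper"): ([0, 0, 1], [0, 1, 1])}
both_ways = add_reverse_relations(toy_edges)
for etype, (src, dst) in both_ways.items():
    print(etype, src, dst)
```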

## Defining Neighbor Sampler and Data Loader in DGL

DGL provides tools that generate the computation dependencies of each minibatch (the sampled neighbors whose features are needed to compute the outputs) while iterating over the dataset. For node classification, you can use `dgl.dataloading.NodeDataLoader` to iterate over the dataset and `dgl.dataloading.MultiLayerNeighborSampler` to generate the computation dependencies of the nodes for a multi-layer GNN with neighbor sampling.

The syntax of `dgl.dataloading.NodeDataLoader` is mostly similar to a PyTorch `DataLoader`, with the addition that it needs a graph to generate computation dependency from, a set of node IDs to iterate on, and the neighbor sampler you defined.

Let's consider training a 2-layer R-GCN with neighbor sampling, where each node gathers messages from up to 15 sampled neighbors per relation on each layer. The code defining the data loader and neighbor sampler looks like the following.
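To make the idea of fanout-limited sampling concrete, here is a simplified sketch in plain Python. It is not DGL's implementation: it ignores node and edge types, and the function names `sample_block` and `sample_blocks` are hypothetical. It works backwards from the output nodes, sampling a bounded number of in-neighbors per node for each layer.

```python
import random


def sample_block(in_neighbors, seeds, fanout, rng):
    """Sample at most `fanout` in-neighbors for every seed node.

    Returns (input_nodes, edges): the nodes whose features are needed,
    and the sampled (src, dst) edges.
    """
    edges = []
    input_nodes = list(seeds)  # seeds come first in the input node list
    seen = set(seeds)
    for dst in seeds:
        neighbors = in_neighbors.get(dst, [])
        picked = rng.sample(neighbors, min(fanout, len(neighbors)))
        for src in picked:
            edges.append((src, dst))
            if src not in seen:
                seen.add(src)
                input_nodes.append(src)
    return input_nodes, edges


def sample_blocks(in_neighbors, output_nodes, fanouts, rng):
    """Build one block per layer, working backwards from the output nodes."""
    blocks = []
    seeds = list(output_nodes)
    for fanout in reversed(fanouts):
        input_nodes, edges = sample_block(in_neighbors, seeds, fanout, rng)
        blocks.insert(0, (input_nodes, edges))  # first block feeds the first layer
        seeds = input_nodes
    return seeds, blocks


# Toy graph: node -> list of in-neighbors.
toy_in_neighbors = {0: [1, 2, 3, 4], 1: [0, 5], 2: [6], 3: [], 4: [0], 5: [], 6: []}
input_nodes, blocks = sample_blocks(toy_in_neighbors, [0], [2, 2], random.Random(0))
print("input nodes:", input_nodes)
for layer, (_, edges) in enumerate(blocks):
    print("layer", layer, "edges:", edges)
```

The last block always ends at the output nodes, while the first block starts from every node whose input features must be loaded for the minibatch.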

In [5]:
def create_dataloader(rank, world_size, graph, nids, fanout):
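    # Give each process an equal, contiguous slice of the node IDs of every node type.
    # IDs left over when the count is not divisible by world_size are not used.
    part_nids = {}
    for ntype, ids in nids.items():
        partition_size = len(ids) // world_size
        partition_offset = partition_size * rank
        ids = ids[partition_offset:partition_offset+partition_size]
        part_nids[ntype] = ids
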
    sampler = dgl.dataloading.MultiLayerNeighborSampler(fanout)
    dataloader = dgl.dataloading.NodeDataLoader(
        graph, part_nids, sampler,
        batch_size=1024,
        shuffle=True,
        drop_last=False,
        num_workers=0
    )
    
    return dataloader
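The partitioning arithmetic in `create_dataloader` can be checked in isolation. The sketch below applies the same integer-division slicing to a toy list; the helper name `partition_ids` is hypothetical. It also shows that IDs beyond `partition_size * world_size` are not assigned to any rank.

```python
def partition_ids(ids, rank, world_size):
    """Return the contiguous slice of `ids` assigned to process `rank`."""
    partition_size = len(ids) // world_size
    offset = partition_size * rank
    return ids[offset:offset + partition_size]


ids = list(range(10))
parts = [partition_ids(ids, rank, 4) for rank in range(4)]
print(parts)

covered = [i for part in parts for i in part]
unassigned = sorted(set(ids) - set(covered))
print("not assigned to any rank:", unassigned)
```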


## Defining Model

The model can be written as follows:

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import dgl.nn as dglnn

class RGCN(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes, n_layers, rel_names):
        super().__init__()
        
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.n_classes = n_classes
        self.layers = nn.ModuleList()
        
        self.layers.append(dglnn.HeteroGraphConv({
            rel: dglnn.GraphConv(in_feats, n_hidden)
            for rel in rel_names}, aggregate='sum'))
        
        for i in range(1, n_layers - 1):
            self.layers.append(dglnn.HeteroGraphConv({
                rel: dglnn.GraphConv(n_hidden, n_hidden)
                for rel in rel_names}, aggregate='sum'))
            
        self.layers.append(dglnn.HeteroGraphConv({
            rel: dglnn.GraphConv(n_hidden, n_classes)
            for rel in rel_names}, aggregate='sum'))

    def forward(self, bipartites, x):
        # x maps each node type to its input features; bipartites holds one sampled block per layer
        for l, (layer, bipartite) in enumerate(zip(self.layers, bipartites)):
            x = layer(bipartite, x)
            if l != self.n_layers - 1:
                x = {k: F.relu(v) for k, v in x.items()}
        return x
    

class NodeEmbed(nn.Module):
    def __init__(self, num_nodes, embed_size,):
        super(NodeEmbed, self).__init__()
        self.embed_size = embed_size
        self.node_embeds = nn.ModuleDict()
        for ntype in num_nodes:
            node_embed = torch.nn.Embedding(num_nodes[ntype], self.embed_size)
            nn.init.uniform_(node_embed.weight, -1.0, 1.0)
            self.node_embeds[str(ntype)] = node_embed
    
    def forward(self, node_ids):
        embeds = {}
        for ntype in node_ids:
            embeds[ntype] = self.node_embeds[ntype](node_ids[ntype])
        return embeds
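`HeteroGraphConv` with `aggregate='sum'` runs one module per relation and then sums the results that land on the same destination node type. The sketch below illustrates only that final combination step on plain Python lists; the helper name `hetero_sum` and the toy numbers are hypothetical, and the per-relation outputs are assumed to have been computed already.

```python
def hetero_sum(per_relation_outputs):
    """Sum per-relation results that target the same destination node type.

    `per_relation_outputs` maps (src_type, rel, dst_type) to a list of
    per-node feature vectors for the destination nodes.
    """
    combined = {}
    for (_, _, dst_type), feats in per_relation_outputs.items():
        if dst_type not in combined:
            combined[dst_type] = [list(vec) for vec in feats]
        else:
            combined[dst_type] = [
                [a + b for a, b in zip(old, new)]
                for old, new in zip(combined[dst_type], feats)
            ]
    return combined


# Two relations produce features for the same two 'paper' nodes;
# a third relation produces features for one 'author' node.
outputs = {
    ("author", "writes", "paper"): [[1.0, 2.0], [0.0, 1.0]],
    ("field_of_study", "has_topic-rev", "paper"): [[0.5, 0.5], [2.0, 0.0]],
    ("paper", "writes-rev", "author"): [[3.0, 3.0]],
}
combined = hetero_sum(outputs)
print(combined)
```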

In [7]:
def init_model(rank, in_feats, n_hidden, n_classes, n_layers, rel_names):
    model = RGCN(in_feats, n_hidden, n_classes, n_layers, rel_names).to(rank)
    return DistributedDataParallel(model, device_ids=[rank], output_device=rank, find_unused_parameters=True)
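Wrapping the model in `DistributedDataParallel` means each process computes gradients on its own minibatch, and the gradients are averaged across processes during the backward pass so that every replica applies the same update. A toy sketch of that averaging on plain lists (the helper name `average_gradients` is hypothetical; the real synchronization is done by the communication backend):

```python
def average_gradients(per_rank_grads):
    """Average gradient vectors element-wise across ranks."""
    world_size = len(per_rank_grads)
    return [sum(values) / world_size for values in zip(*per_rank_grads)]


# One gradient vector per process (4 processes, 2 parameters each).
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = average_gradients(grads)
print(synced)
```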

## Defining Training Loop

The following function sets up the process group, builds the data loaders, model, and optimizer, and then runs training with validation after every epoch.

In [8]:
@utils.fix_openmp
def train(rank, world_size, data):
    # data is the tuple of graph, features, labels, and splits assembled in the launch cell below
    torch.distributed.init_process_group(
        backend='nccl',
        init_method='tcp://127.0.0.1:12345',
        world_size=world_size,
        rank=rank)
    torch.cuda.set_device(rank)
    
    graph, node_features, node_labels, train_nids, valid_nids, test_nids, num_features, num_classes = data
    
    fanout = [15, 15]
    
    train_dataloader = create_dataloader(rank, world_size, graph, train_nids, fanout)
    # We only use one worker for validation
    valid_dataloader = create_dataloader(0, 1, graph, valid_nids, fanout)
    
    num_nodes = {ntype: graph.number_of_nodes(ntype) for ntype in graph.ntypes if ntype != 'paper'}
    num_layers = 2
    hidden_dim = 128
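    # Embeddings stand in for the features of node types that have none.
    # Note: embed stays on the CPU and its parameters are not passed to the
    # optimizer below, so the embeddings keep their random initial values.
    embed = NodeEmbed(num_nodes, hidden_dim)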
    
    model = init_model(rank, num_features, hidden_dim, num_classes, num_layers, graph.etypes)
    opt = torch.optim.Adam(model.parameters())
    torch.distributed.barrier()
    
    best_accuracy = 0
    best_model_path = 'model.pt'
    for epoch in range(10):
        model.train()

        for step, (input_nodes, output_nodes, bipartites) in enumerate(train_dataloader):
            bipartites = [b.to(rank) for b in bipartites]
            
            nodes_to_embed = {ntype: node_ids for ntype, node_ids in input_nodes.items() if ntype != "paper"}
            embeddings = {ntype: node_embedding.cuda() for ntype, node_embedding in embed(nodes_to_embed).items()}
            
            inputs = {'paper': node_features[input_nodes['paper']].cuda()}
            inputs.update(embeddings)
            
            labels = node_labels[output_nodes['paper']].cuda()
            predictions = model(bipartites, inputs)['paper']

            loss = F.cross_entropy(predictions, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

            accuracy = sklearn.metrics.accuracy_score(labels.cpu().numpy(), predictions.argmax(1).detach().cpu().numpy())

            if rank == 0 and step % 10 == 0:
                print('Epoch {:05d} Step {:05d} Loss {:.04f}'.format(epoch, step, loss.item()))

        torch.distributed.barrier()
        
        if rank == 0:
            model.eval()
            predictions = []
            labels = []
            with torch.no_grad():
                for input_nodes, output_nodes, bipartites in valid_dataloader:
                    bipartites = [b.to(rank) for b in bipartites]
                    
                    nodes_to_embed = {ntype: node_ids for ntype, node_ids in input_nodes.items() if ntype != "paper"}
                    embeddings = {ntype: node_embedding.cuda() for ntype, node_embedding in embed(nodes_to_embed).items()}
                    inputs = {'paper': node_features[input_nodes['paper']].cuda()}
                    inputs.update(embeddings)
            
                    labels.append(node_labels[output_nodes['paper']].numpy())
                    predictions.append(model(bipartites, inputs)['paper'].argmax(1).cpu().numpy())
                predictions = np.concatenate(predictions)
                labels = np.concatenate(labels)
                accuracy = sklearn.metrics.accuracy_score(labels, predictions)
                print('Epoch {} Validation Accuracy {}'.format(epoch, accuracy))
                if best_accuracy < accuracy:
                    best_accuracy = accuracy
                    torch.save(model.module.state_dict(), best_model_path)
                    
        torch.distributed.barrier()
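The accuracy reported by the loop is the fraction of nodes whose highest-scoring class matches the label. A dependency-free sketch of that computation, with hypothetical helper names and toy logits:

```python
def argmax(values):
    """Index of the largest entry in a list of scores."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predicted = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.2]]
toy_labels = [1, 0, 1]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print("accuracy:", toy_accuracy)
```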

In [None]:
if __name__ == '__main__':
    procs = []
    data = graph, node_features, node_labels, train_nids, valid_nids, test_nids, num_features, num_classes
    for proc_id in range(4):    # one training process per GPU (4 GPUs)
        p = mp.Process(target=train, args=(proc_id, 4, data))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

Epoch 00000 Step 00000 Loss 6.4136
Epoch 00000 Step 00010 Loss 4.9499
Epoch 00000 Step 00020 Loss 4.5907
Epoch 00000 Step 00030 Loss 4.3993
Epoch 00000 Step 00040 Loss 4.0387
Epoch 00000 Step 00050 Loss 3.9396
Epoch 00000 Step 00060 Loss 3.7282
Epoch 00000 Step 00070 Loss 3.5739
Epoch 00000 Step 00080 Loss 3.5383
Epoch 00000 Step 00090 Loss 3.3513
Epoch 00000 Step 00100 Loss 3.3080
Epoch 00000 Step 00110 Loss 3.2446
Epoch 00000 Step 00120 Loss 3.1437
Epoch 00000 Step 00130 Loss 3.1746
Epoch 00000 Step 00140 Loss 3.0759
Epoch 00000 Step 00150 Loss 3.0929
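A useful reference point when reading the log above: a classifier that assigns equal probability to all 349 classes has a cross-entropy of ln 349, roughly 5.86. The loss at step 0 is slightly above that value, and the later losses fall well below it. The arithmetic can be checked as follows:

```python
import math

num_classes = 349  # number of classes printed in the dataset statistics
uniform_loss = -math.log(1.0 / num_classes)
print(f"cross-entropy of a uniform guess: {uniform_loss:.4f}")
```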


## Conclusion

In this tutorial, you have learned how to train a multi-layer R-GCN with neighbor sampling on a large heterogeneous dataset that cannot fit into a single GPU. Because only sampled minibatches are moved to the GPU, the method scales to graphs much larger than GPU memory, and the example above runs on a single machine with four GPUs.