# Speed Up Training Using GPUs

In this tutorial, you will learn:

* How to copy graph and feature data to GPU.
* How to train a GNN model on GPU.

In [1]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools

Using backend: pytorch


## Copy graph and feature data to GPU

We first load Zachary's karate club graph and its node labels, as in the previous sessions.

In [2]:
from tutorial_utils import load_zachery

# ----------- 0. load graph -------------- #
g = load_zachery()
print(g)

Graph(num_nodes=34, num_edges=156,
      ndata_schemes={'club': Scheme(shape=(), dtype=torch.int64), 'club_onehot': Scheme(shape=(2,), dtype=torch.int64)}
      edata_schemes={})


Right now the graph and all its feature data are stored on the CPU. Use the `to` API to copy them to another device.

In [3]:
print('Current device:', g.device)
g = g.to('cuda:0')
print('New device:', g.device)

Current device: cpu
New device: cuda:0


Verify that the features were also copied to the GPU.

In [4]:
print(g.ndata['club'].device)
print(g.ndata['club_onehot'].device)

cuda:0
cuda:0
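The hard-coded `'cuda:0'` above assumes a GPU is present. A minimal sketch of a device-agnostic variant is shown below. It uses only PyTorch and a stand-in tensor rather than the tutorial's graph; the names `device` and `feat` are illustrative and not part of the tutorial.

```python
import torch

# Pick the first GPU when CUDA is usable, otherwise stay on the CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Stand-in for a node feature tensor (34 nodes, 5 features each).
feat = torch.zeros(34, 5)
print('Before:', feat.device)

# Tensor.to returns a tensor on the target device (the same tensor if it
# already lives there); it does not move the original tensor in place.
feat = feat.to(device)
print('After:', feat.device)
```

Under this assumption, the same `device` value could be passed wherever `'cuda:0'` appears in this tutorial, for example `g.to(device)`.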


## Create a GNN model on GPU

This step is the same as creating a CNN or RNN model on GPU. In PyTorch, use the `to` API to move the model's parameters to the target device.

In [5]:
# ----------- 1. node features -------------- #
node_embed = nn.Embedding(g.number_of_nodes(), 5)  # Every node has an embedding of size 5.
# Copy node embeddings to GPU
node_embed = node_embed.to('cuda:0')
inputs = node_embed.weight                         # Use the embedding weight as the node features.
nn.init.xavier_uniform_(inputs)

Parameter containing:
tensor([[ 0.0062, -0.1141, -0.2312,  0.3353, -0.3236],
        [-0.3308, -0.0468, -0.1497, -0.0883,  0.0622],
        [ 0.3628, -0.0341, -0.1598,  0.3915, -0.2788],
        [-0.1002, -0.2527,  0.0084, -0.3870,  0.0156],
        [-0.0527, -0.2129,  0.2782, -0.2657, -0.2701],
        [ 0.2799,  0.3184,  0.1105,  0.3310, -0.3773],
        [-0.3297,  0.1229, -0.0549, -0.3048,  0.1531],
        [ 0.0727,  0.1640, -0.2764,  0.0586,  0.0815],
        [-0.1895,  0.0511, -0.1282,  0.2278,  0.2251],
        [ 0.2987,  0.0811,  0.0514,  0.0829,  0.1878],
        [-0.2568,  0.1782,  0.2640, -0.1311,  0.3104],
        [-0.2857,  0.3534,  0.1447, -0.3519,  0.2148],
        [-0.2627,  0.2351,  0.0018, -0.2252, -0.2876],
        [ 0.2623, -0.3375,  0.0329, -0.3104, -0.0386],
        [ 0.3618,  0.2089,  0.3130,  0.2907, -0.0047],
        [-0.2638, -0.0793,  0.3894,  0.1217, -0.1996],
        [ 0.3133, -0.0858,  0.2046,  0.1231, -0.3094],
        [-0.3384, -0.1559,  0.3215, -0.2914

The community label is stored in the `'club'` node feature (0 for instructor, 1 for club president). Only nodes 0 and 33 are labeled.

In [6]:
labels = g.ndata['club']
labeled_nodes = [0, 33]
print('Labels', labels[labeled_nodes])

Labels tensor([0, 1], device='cuda:0')
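Because only two nodes are labeled, a training loss in this setting is averaged over just those rows of the model output. The following pure-Python sketch works through that arithmetic (log-softmax followed by negative log-likelihood) on made-up logits for a 4-node toy graph; every value here is illustrative and none comes from the tutorial's data.

```python
import math

# Made-up model outputs: one row of 2 class scores per node.
toy_logits = [[2.0, 0.0], [0.5, 0.5], [1.0, 3.0], [0.0, 2.0]]
toy_labels = [0, 1, 1, 1]
toy_labeled_nodes = [0, 3]  # only these nodes contribute to the loss

def log_softmax(row):
    """Numerically stable log-softmax of one row of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

# Negative log-likelihood, averaged over the labeled nodes only.
toy_loss = -sum(
    log_softmax(toy_logits[v])[toy_labels[v]] for v in toy_labeled_nodes
) / len(toy_labeled_nodes)
print(toy_loss)
```

The unlabeled rows (nodes 1 and 2 here) never enter the sum, so they receive gradient only indirectly, through the neighbor aggregation in the model.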


### Define a GraphSAGE model

Our model consists of two layers, each of which computes new node representations by aggregating neighbor information. The equations are:

$$
h_{\mathcal{N}(v)}^k\leftarrow \text{AGGREGATE}_k\{h_u^{k-1},\forall u\in\mathcal{N}(v)\}
$$

$$
h_v^k\leftarrow \sigma\left(W^k\cdot \text{CONCAT}(h_v^{k-1}, h_{\mathcal{N}(v)}^k) \right)
$$
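To make the two equations concrete, here is a small pure-Python sketch of a single layer with a mean aggregator and ReLU as $\sigma$. The toy graph, the feature values, and the weight matrix `W` are arbitrary assumptions chosen for illustration; this is not how DGL implements the layer internally.

```python
# Toy undirected graph as an adjacency list: node -> neighbors.
neighbors = {0: [1, 2], 1: [0], 2: [0]}

# h^{k-1}: one 2-dimensional feature vector per node.
h_prev = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 2.0]}

# W^k maps the 4-dim concatenation to a 2-dim output (arbitrary values).
W = [[1.0, 0.0, 0.5, 0.0],
     [0.0, -1.0, 0.0, 0.5]]

def mean_aggregate(v):
    """AGGREGATE_k as a mean: average the neighbor features elementwise."""
    feats = [h_prev[u] for u in neighbors[v]]
    return [sum(col) / len(feats) for col in zip(*feats)]

def sage_layer(v):
    """sigma(W . CONCAT(h_v, h_N(v))) with sigma = ReLU."""
    concat = h_prev[v] + mean_aggregate(v)  # list concatenation
    out = [sum(w * x for w, x in zip(row, concat)) for row in W]
    return [max(0.0, x) for x in out]

h_new = {v: sage_layer(v) for v in neighbors}
print(h_new)
```

For node 0, the neighbor mean is `[2.0, 2.0]`, the concatenation is `[1.0, 0.0, 2.0, 2.0]`, and the layer output is `[2.0, 1.0]`.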

DGL provides implementations of many popular neighbor aggregation modules, each of which can be invoked with a single line of code. See the full list of supported [graph convolution modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv).

In [7]:
from dgl.nn import SAGEConv

# ----------- 2. create model -------------- #
# build a two-layer GraphSAGE model
class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, num_classes, 'mean')
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
    
# Create the model with given dimensions 
# input layer dimension: 5, node embeddings
# hidden layer dimension: 16
# output layer dimension: 2, the two classes, 0 and 1
net = GraphSAGE(5, 16, 2)

Copy the network to GPU.

In [8]:
net = net.to('cuda:0')
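One way to confirm that a module really lives on the intended device is to inspect its parameters. The sketch below uses a small `nn.Linear` as a stand-in for the network and falls back to the CPU when no GPU is available; the names `layer`, `moved`, and `param_devices` are illustrative.

```python
import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Small stand-in module; any nn.Module exposes parameters() the same way.
layer = nn.Linear(5, 2)
moved = layer.to(device)

# For modules, `to` moves the parameters in place and returns the module itself.
print(moved is layer)

# Collect the device type of every parameter; all should match the target.
param_devices = {p.device.type for p in layer.parameters()}
print(param_devices)
```

This differs from tensors, where `to` returns a new tensor and the result must be reassigned.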

In [9]:
# ----------- 3. set up loss and optimizer -------------- #
# in this case, the loss is computed inside the training loop
optimizer = torch.optim.Adam(itertools.chain(net.parameters(), node_embed.parameters()), lr=0.01)

# ----------- 4. training -------------------------------- #
all_logits = []
for e in range(100):
    # forward
    logits = net(g, inputs)
    
    # compute loss
    logp = F.log_softmax(logits, 1)
    loss = F.nll_loss(logp[labeled_nodes], labels[labeled_nodes])
    
    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    all_logits.append(logits.detach())
    
    if e % 5 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))

In epoch 0, loss: 0.5166622996330261
In epoch 5, loss: 0.21810603141784668
In epoch 10, loss: 0.07244576513767242
In epoch 15, loss: 0.021864021196961403
In epoch 20, loss: 0.006832215469330549
In epoch 25, loss: 0.0025985222309827805
In epoch 30, loss: 0.0012380237458273768
In epoch 35, loss: 0.0007198769017122686
In epoch 40, loss: 0.00048893439816311
In epoch 45, loss: 0.0003713267797138542
In epoch 50, loss: 0.00030447341850958765
In epoch 55, loss: 0.00026276218704879284
In epoch 60, loss: 0.0002343975065741688
In epoch 65, loss: 0.000213600171264261
In epoch 70, loss: 0.00019715270900633186
In epoch 75, loss: 0.00018344626005273312
In epoch 80, loss: 0.00017146786558441818
In epoch 85, loss: 0.00016068120021373034
In epoch 90, loss: 0.00015054999676067382
In epoch 95, loss: 0.00014101463602855802


In [10]:
# ----------- 5. check results ------------------------ #
pred = torch.argmax(logits, dim=1)
print('Accuracy', (pred == labels).sum().item() / len(pred))

Accuracy 0.35294117647058826


**What if the graph and its feature data cannot fit into the memory of a single GPU?**

* Instead of running a GNN on the full graph, run it on sampled subgraphs until it converges.
* Dispatch different samples to different GPUs for further acceleration.
* Partition the graph across multiple machines and train in a distributed fashion.

Our later sessions will cover each of these methods.
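As a rough illustration of the first idea only, the sketch below draws small node-induced subgraphs from a toy edge list using the standard library. Uniform node sampling is an assumption made for simplicity; it is not DGL's sampling API, and the helper `sample_induced_subgraph` is hypothetical.

```python
import random

# Toy directed edge list over 6 nodes (not the karate club graph).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
num_nodes = 6

def sample_induced_subgraph(edges, num_nodes, k, rng):
    """Pick k nodes uniformly at random; keep edges with both ends picked."""
    picked = set(rng.sample(range(num_nodes), k))
    sub_edges = [(u, v) for u, v in edges if u in picked and v in picked]
    return picked, sub_edges

rng = random.Random(0)  # fixed seed so runs are repeatable
for step in range(3):
    nodes, sub_edges = sample_induced_subgraph(edges, num_nodes, 4, rng)
    # A real training loop would run one forward/backward pass here,
    # copying only this subgraph and its features to the GPU.
    print(step, sorted(nodes), sub_edges)
```

Each step touches at most 4 nodes, so the memory needed per step depends on the sample size rather than on the size of the full graph.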