# Semi-supervised node classification using Graph Neural Networks

In this tutorial, you will

* Load a popular citation network in DGL
* Build a GCN model, a popular Graph Neural Network architecture proposed by [Kipf et al.](https://arxiv.org/abs/1609.02907)
* Train the model and interpret the result

In [1]:
import dgl
import torch
import torch.nn as nn 
import torch.nn.functional as F
import itertools
import numpy as np
import time

Using backend: pytorch


## Problem formulation

- Given the graph structure, node features, and node labels on a subset of nodes
- Predict the labels on the rest of the nodes
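
As a rough illustration of this setup, the sketch below uses NumPy (rather than PyTorch, to keep it dependency-light) on a made-up six-node example. The labels, mask, and scores are hypothetical and are not taken from Cora. A boolean mask marks the labeled nodes: the loss is computed on those nodes only, while predictions are produced for every node.

```python
import numpy as np

# Hypothetical toy data: 6 nodes, 3 classes. Not taken from Cora.
labels = np.array([0, 2, 1, 0, 1, 2])
train_mask = np.array([True, True, False, False, True, False])

# Stand-in for the raw class scores (logits) a model would produce for every node.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))


def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over the nodes selected by `mask` only."""
    selected = logits[mask]
    # Numerically stable log-softmax.
    shifted = selected - selected.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(mask.sum()), labels[mask]].mean()


# The training signal comes from the labeled subset only.
train_loss = masked_cross_entropy(logits, labels, train_mask)

# Predictions are still made for every node, including the unlabeled ones.
predictions = logits.argmax(axis=1)
unlabeled_predictions = predictions[~train_mask]
print(train_loss, unlabeled_predictions)
```

The same masking pattern appears later in the tutorial, where `train_mask`, `val_mask`, and `test_mask` select the nodes used for training, validation, and testing.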

## Classification in citation networks

- Load the graph data from the Cora dataset.
- The dataset consists of 2708 scientific publications, each classified into one of seven classes.
- The citation network consists of 5429 links. DGL stores links as directed edges in both directions, so the graph loaded below reports 10556 edges.
- Each paper has a 0/1-valued word vector as its feature, indicating the absence or presence of the corresponding word from the dictionary.
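
To make the feature format concrete, here is a minimal sketch of how a 0/1-valued word vector could be built. The five-word dictionary, the sample text, and the helper `to_binary_word_vector` are hypothetical and only illustrate the idea; this is not the preprocessing actually used to create Cora.

```python
# Hypothetical five-word dictionary; the Cora features have 1433 entries.
dictionary = ["graph", "neural", "network", "kernel", "bayesian"]


def to_binary_word_vector(text, dictionary):
    """Return a 0/1 list: 1 if the dictionary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in dictionary]


abstract = "A neural network model for graph data"
feature = to_binary_word_vector(abstract, dictionary)
print(feature)  # [1, 1, 1, 0, 0]
```

Each position in the vector corresponds to one dictionary word, so every paper gets a feature vector of the same length regardless of how long its text is.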

<img src='https://miro.medium.com/max/1400/1*oygeCjtUsS87duvFoDT8tA.png' align='center' width="400px" height="300px" />


In [8]:
from dgl.data import CoraGraphDataset
# ----------- 0. load graph -------------- #
data = CoraGraphDataset()
g = data[0]
print(g)

Loading from cache failed, re-processing.
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.
Graph(num_nodes=2708, num_edges=10556,
      ndata_schemes={'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'label': Scheme(shape=(), dtype=torch.int64), 'feat': Scheme(shape=(1433,), dtype=torch.float32)}
      edata_schemes={})


- Print data attributes

In [9]:
features = g.ndata['feat']
labels = g.ndata['label']
train_mask = g.ndata['train_mask']
val_mask = g.ndata['val_mask']
test_mask = g.ndata['test_mask']
in_feats = features.shape[1]
# num_classes and g.num_edges() replace the deprecated data.num_labels and data.graph
n_classes = data.num_classes
n_edges = g.num_edges()
print("""----Data statistics------
  #Edges %d
  #Classes %d
  #Train samples %d
  #Val samples %d
  #Test samples %d""" %
      (n_edges, n_classes,
       train_mask.int().sum().item(),
       val_mask.int().sum().item(),
       test_mask.int().sum().item()))


----Data statistics------
  #Edges 10556
  #Classes 7
  #Train samples 140
  #Val samples 500
  #Test samples 1000


## Define a GCN model

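Our model consists of two layers. Each layer computes new node representations by aggregating neighbor information as follows:

$$
 h_i^{(l+1)} = \sigma\left(b^{(l)} + \sum_{j\in\mathcal{N}(i)}\frac{1}{c_{ij}}h_j^{(l)}W^{(l)}\right)
$$

Here $h_i^{(l)}$ is the representation of node $i$ at layer $l$, $\mathcal{N}(i)$ is the set of neighbors of node $i$, $c_{ij} = \sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}$ is a normalization constant, $W^{(l)}$ and $b^{(l)}$ are the learnable weight and bias of the layer, and $\sigma$ is an activation function.
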
<img src='https://tkipf.github.io/graph-convolutional-networks/images/gcn_web.png' align='center' width="400px" height="300px" />
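
The propagation rule can be traced by hand on a tiny graph. The NumPy sketch below is an illustration, not DGL's implementation: the three-node path graph, the feature values, and the helper `gcn_layer` are hypothetical, the normalization is taken to be $c_{ij} = \sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}$, and $\sigma$ is assumed to be ReLU. It also assumes every node has at least one neighbor, since a degree of zero would cause a division by zero.

```python
import numpy as np

# Hypothetical undirected path graph 0 - 1 - 2, stored as a dense adjacency matrix.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

H = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])   # h_j^{(l)}: one 2-dimensional feature row per node
W = np.eye(2)              # W^{(l)}: identity weights keep the numbers readable
b = np.zeros(2)            # b^{(l)}


def gcn_layer(A, H, W, b):
    """One propagation step: sigma(b + sum_j (1 / c_ij) * h_j W), with sigma = ReLU."""
    deg = A.sum(axis=1)                # |N(i)| for every node
    c = np.sqrt(np.outer(deg, deg))    # c_ij = sqrt(|N(i)|) * sqrt(|N(j)|)
    A_norm = A / c                     # entries without an edge stay 0
    return np.maximum(A_norm @ H @ W + b, 0.0)


H_next = gcn_layer(A, H, W, b)
print(H_next)
```

Node 1 has two neighbors, so its new representation is the sum of the features of nodes 0 and 2, each scaled by $1/\sqrt{2}$. Nodes 0 and 2 each receive the features of node 1 with the same scaling.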

DGL provides implementations of many popular neighbor aggregation modules, each of which can be invoked with a single line of code. See the full list of supported [graph convolution modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv).

In [10]:
from dgl.nn import GraphConv

# ----------- 2. create model -------------- #
# build a two-layer GCN model
class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
    
# Create the model with given dimensions 
# input layer dimension: 1433, node features
# hidden layer dimension: 16
# output layer dimension: n_classes
model = GCN(in_feats, 16, n_classes)

In [11]:
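def evaluate(g, model, features, labels, mask):
    # Return the classification accuracy on the nodes selected by `mask`.
    model.eval()  # switch off training-only behavior such as dropout
    with torch.no_grad():  # gradients are not needed for evaluation
        logits = model(g, features)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1)  # predicted class = highest logit
        correct = torch.sum(indices == labels)
        return correct.item() * 1.0 / len(labels)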
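
The accuracy computation can be checked on a small hand-made example. The sketch below mirrors the masking and argmax steps in NumPy. The scores, labels, and mask are hypothetical values chosen for illustration, not model output.

```python
import numpy as np

# Hypothetical scores for 4 nodes and 3 classes; not model output.
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 0.1, 1.5],
                   [0.9, 1.1, 0.0],
                   [0.3, 0.2, 0.1]])
labels = np.array([0, 2, 0, 1])
mask = np.array([True, True, True, False])   # evaluate on the first three nodes only

predicted = logits[mask].argmax(axis=1)      # class with the highest score per node
accuracy = (predicted == labels[mask]).mean()
print(predicted, accuracy)
```

Two of the three selected nodes are classified correctly, so the accuracy is 2/3. The fourth node is excluded by the mask and does not affect the result.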


In [12]:
# ----------- 3. set up loss and optimizer -------------- #
# in this case, the loss is computed inside the training loop
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fcn = torch.nn.CrossEntropyLoss()
# ----------- 4. training -------------------------------- #
n_epochs = 200
for epoch in range(n_epochs):
    model.train()

    # forward pass; the loss uses the training nodes only
    logits = model(g, features)
    loss = loss_fcn(logits[train_mask], labels[train_mask])

    # backward pass and parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # accuracy on the validation nodes, reported every 20 epochs
    acc = evaluate(g, model, features, labels, val_mask)
    if epoch % 20 == 0:
        print("Epoch {:05d} | Loss {:.4f} | Accuracy {:.4f} | ".format(epoch, loss.item(), acc))
print()


Epoch 00000 | Loss 1.9455 | Accuracy 0.1880 | 
Epoch 00020 | Loss 1.5699 | Accuracy 0.5900 | 
Epoch 00040 | Loss 0.9627 | Accuracy 0.7080 | 
Epoch 00060 | Loss 0.4592 | Accuracy 0.7460 | 
Epoch 00080 | Loss 0.2122 | Accuracy 0.7600 | 
Epoch 00100 | Loss 0.1109 | Accuracy 0.7600 | 
Epoch 00120 | Loss 0.0661 | Accuracy 0.7660 | 
Epoch 00140 | Loss 0.0436 | Accuracy 0.7660 | 
Epoch 00160 | Loss 0.0311 | Accuracy 0.7700 | 
Epoch 00180 | Loss 0.0234 | Accuracy 0.7720 | 



In [13]:
# ----------- 5. check results ------------------------ #
acc = evaluate(g,model, features, labels, test_mask)
print("Test accuracy {:.2%}".format(acc))

Test accuracy 76.40%
