# SD212: Graph mining

## Lab 7: Graph neural networks

In this lab, you will learn to classify nodes using graph neural networks.

We use [DGL](https://www.dgl.ai) (Deep Graph Library), which relies on PyTorch.

In [None]:
# pip install dgl

## Import

In [1]:
import numpy as np
from scipy import sparse

In [2]:
from sknetwork.data import load_netset
from sknetwork.classification import DiffusionClassifier
from sknetwork.embedding import Spectral
from sknetwork.utils import directed2undirected

In [3]:
import dgl
from dgl.nn import SAGEConv
from dgl import function as fn

In [4]:
import torch
from torch import nn
import torch.nn.functional as F

In [5]:
# ignore warnings from DGL
import warnings
warnings.filterwarnings('ignore')

## Load data

We will work on the following datasets (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Cora (directed graph + bipartite graph)
* WikiVitals (directed graph + bipartite graph)

Both datasets are graphs with node features (given by the bipartite graph) and ground-truth labels.

In [6]:
cora = load_netset('cora')
wikivitals = load_netset('wikivitals')

Parsing files...
Done.
Parsing files...
Done.


In [7]:
dataset = cora

In [58]:
adjacency = dataset.adjacency
biadjacency = dataset.biadjacency
labels = dataset.labels

In [9]:
# we use undirected graphs
adjacency = directed2undirected(adjacency)

In [10]:
adjacency

<2708x2708 sparse matrix of type '<class 'numpy.intc'>'
	with 10556 stored elements in Compressed Sparse Row format>

In [11]:
biadjacency

<2708x1433 sparse matrix of type '<class 'numpy.bool_'>'
	with 49216 stored elements in Compressed Sparse Row format>

In [12]:
# for Wikivitals, use spectral embedding of the bipartite graph as features

if dataset.meta.name.startswith('Wikivitals'):
    spectral = Spectral(50)
    features = spectral.fit_transform(biadjacency)
else:
    features = biadjacency.toarray()

In [13]:
def split_train_test_val(n_samples, test_ratio=0.1, val_ratio=0.1, seed=None):
    """Split the samples into train / test / validation sets.
    
    Returns
    -------
    train: np.ndarray
        Boolean mask
    test: np.ndarray
        Boolean mask
    validation: np.ndarray
        Boolean mask
    """
    if seed is not None:
        np.random.seed(seed)

    # test
    index = np.random.choice(n_samples, int(np.ceil(n_samples * test_ratio)), replace=False)
    test = np.zeros(n_samples, dtype=bool)
    test[index] = 1
    
    # validation
    index = np.random.choice(np.argwhere(~test).ravel(), int(np.ceil(n_samples * val_ratio)), replace=False)
    val = np.zeros(n_samples, dtype=bool)
    val[index] = 1
    
    # train
    train = np.ones(n_samples, dtype=bool)
    train[test] = 0
    train[val] = 0
    return train, test, val

In [14]:
train, test, val = split_train_test_val(len(labels))

## Graph and tensors

In DGL, the graph is represented as an object, and the features and labels as tensors.

In [15]:
# graph as an object
graph = dgl.from_scipy(adjacency)

In [16]:
type(graph)

dgl.heterograph.DGLGraph

In [67]:
# features and labels as tensors
features = torch.Tensor(features)
labels = torch.Tensor(labels).long()

In [18]:
# masks as tensors
train = torch.Tensor(train).bool()
test = torch.Tensor(test).bool()
val = torch.Tensor(val).bool()

## Graph neural network

We start with a simple graph neural network without a hidden layer. The output layer is of type [GraphSAGE](https://docs.dgl.ai/generated/dgl.nn.pytorch.conv.SAGEConv.html).

In [19]:
class GNN(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(GNN, self).__init__()
        self.conv = SAGEConv(dim_input, dim_output, 'mean')
        
    def forward(self, graph, features):
        output = self.conv(graph, features)
        return output
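
To make the layer concrete, here is a minimal NumPy sketch of what a GraphSAGE layer with the `'mean'` aggregator computes on a toy graph: each node combines a linear map of its own features with a linear map of the average of its neighbors' features. The toy graph, the weight matrices `w_self` and `w_neigh`, and the helper `sage_mean` are illustrative assumptions, not DGL's API; the bias and other implementation details are omitted.

```python
import numpy as np

def sage_mean(adjacency, features, w_self, w_neigh):
    """One GraphSAGE-style layer with mean aggregation (no bias, no activation)."""
    degrees = adjacency.sum(axis=1, keepdims=True)
    # average of neighbor features; nodes without neighbors get a zero vector
    neighbor_mean = np.divide(adjacency @ features, degrees,
                              out=np.zeros_like(features, dtype=float),
                              where=degrees > 0)
    return features @ w_self + neighbor_mean @ w_neigh

# toy undirected path graph 0 - 1 - 2, with 2 features per node
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
features = np.array([[1., 0.],
                     [0., 1.],
                     [1., 1.]])

# hypothetical weights mapping 2 features to 3 outputs
rng = np.random.default_rng(0)
w_self = rng.normal(size=(2, 3))
w_neigh = rng.normal(size=(2, 3))

output = sage_mean(adjacency, features, w_self, w_neigh)

# on an empty graph, only the node's own features contribute
output_empty = sage_mean(np.zeros((3, 3)), features, w_self, w_neigh)
print(np.allclose(output_empty, features @ w_self))
```

Under these assumptions, removing all edges reduces the layer to a linear map of each node's own features, so the model no longer uses the graph structure at all.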

## To do

* Train the model on Cora and get accuracy.
* Compare with the same model trained on an empty graph.
* Add a hidden layer with ReLU activation function (e.g., dimension = 20) and retrain the model.
* Compare with a classifier based on heat diffusion.

In [20]:
def init_model(model, features, labels):
    '''Init the GNN with appropriate dimensions.'''
    dim_input = features.shape[1]
    dim_output = len(labels.unique())
    return model(dim_input, dim_output)   

In [21]:
def eval_model(gnn, graph, features, labels, test=test):
    '''Evaluate the model in terms of accuracy.'''
    gnn.eval()
    with torch.no_grad():
        output = gnn(graph, features)
        labels_pred = torch.max(output, dim=1)[1]
        score = np.mean(np.array(labels[test]) == np.array(labels_pred[test]))
    return score

In [22]:
def train_model(gnn, graph, features, labels, train=train, val=val, n_epochs=100, lr=0.01, verbose=True):
    '''Train the GNN.'''
    optimizer = torch.optim.Adam(gnn.parameters(), lr=lr)
    
    scores = []
    
    for t in range(n_epochs):
        
        # forward (back to training mode, since eval_model switches to evaluation mode)
        gnn.train()
        output = gnn(graph, features)
        logp = nn.functional.log_softmax(output, 1)
        loss = nn.functional.nll_loss(logp[train], labels[train])

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # evaluation
        score = eval_model(gnn, graph, features, labels, val)
        scores.append(score)
        
        if verbose and t % 10 == 0:
            print("Epoch {:02d} | Loss {:.3f} | Accuracy {:.3f}".format(t, loss.item(), score))

In [23]:
gnn = init_model(GNN, features, labels)

In [24]:
train_model(gnn, graph, features, labels)

Epoch 00 | Loss 1.953 | Accuracy 0.365
Epoch 10 | Loss 0.586 | Accuracy 0.815
Epoch 20 | Loss 0.246 | Accuracy 0.830
Epoch 30 | Loss 0.135 | Accuracy 0.823
Epoch 40 | Loss 0.087 | Accuracy 0.827
Epoch 50 | Loss 0.064 | Accuracy 0.823
Epoch 60 | Loss 0.050 | Accuracy 0.815
Epoch 70 | Loss 0.041 | Accuracy 0.815
Epoch 80 | Loss 0.035 | Accuracy 0.815
Epoch 90 | Loss 0.030 | Accuracy 0.819


In [25]:
eval_model(gnn, graph, features, labels)

0.8450184501845018

## Empty graph

In [26]:
# graph with the same nodes but no edges
graph2 = dgl.graph(([], []), num_nodes=graph.num_nodes())
gnn = init_model(GNN, features, labels)
train_model(gnn, graph2, features, labels)
eval_model(gnn, graph2, features, labels)

Epoch 00 | Loss 1.969 | Accuracy 0.343
Epoch 10 | Loss 1.101 | Accuracy 0.653
Epoch 20 | Loss 0.676 | Accuracy 0.727
Epoch 30 | Loss 0.466 | Accuracy 0.738
Epoch 40 | Loss 0.348 | Accuracy 0.720
Epoch 50 | Loss 0.275 | Accuracy 0.708
Epoch 60 | Loss 0.227 | Accuracy 0.708
Epoch 70 | Loss 0.192 | Accuracy 0.712
Epoch 80 | Loss 0.167 | Accuracy 0.716
Epoch 90 | Loss 0.147 | Accuracy 0.716


0.7343173431734318

## ReLU

In [31]:
class GNN2(nn.Module):
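    def __init__(self, dim_input, dim_output, dim_hidden=20):
        super(GNN2, self).__init__()
        # hidden dimension: 20 by default, the value suggested in the To do list (assumed here)
        self.conv1 = SAGEConv(dim_input, dim_hidden, 'mean')
        self.conv2 = SAGEConv(dim_hidden, dim_output, 'mean')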
        
    def forward(self, graph, features):
        h = self.conv1(graph, features)
        h = F.relu(h)
        output = self.conv2(graph, h)
        return output

In [32]:
gnn = init_model(GNN2, features, labels)
train_model(gnn, graph, features, labels)
eval_model(gnn, graph, features, labels)

Epoch 00 | Loss 2.124 | Accuracy 0.465
Epoch 10 | Loss 0.446 | Accuracy 0.797
Epoch 20 | Loss 0.099 | Accuracy 0.845
Epoch 30 | Loss 0.022 | Accuracy 0.863
Epoch 40 | Loss 0.006 | Accuracy 0.867
Epoch 50 | Loss 0.003 | Accuracy 0.863
Epoch 60 | Loss 0.002 | Accuracy 0.867
Epoch 70 | Loss 0.002 | Accuracy 0.863
Epoch 80 | Loss 0.001 | Accuracy 0.867
Epoch 90 | Loss 0.001 | Accuracy 0.867


0.8745387453874539

In [38]:
(labels==0).shape

torch.Size([2708])

In [40]:
np.argwhere(labels==0)[0][:2]

tensor([ 6, 27])

## Heat diffusion

In [66]:
labels.shape

(2708,)

In [115]:
# sample 100 training nodes per class (7 classes in Cora)
train_index_list = []
for i in range(7):
    index = np.flatnonzero(labels.numpy() == i)
    train_index_list += np.random.choice(index, 100, replace=False).tolist()

In [116]:
len(train_index_list)

700

In [117]:
classifier = DiffusionClassifier()
train_labels = {i: labels.numpy()[i] for i in train_index_list}
labels_pred = classifier.fit_predict(adjacency, train_labels)
print('Accuracy: ', np.average(labels_pred==labels.numpy()))

Accuracy:  0.46824224519940916


In [118]:
np.unique(labels_pred)

array([-1,  0,  1,  2,  3,  4,  5,  6], dtype=int64)

In [119]:
np.sum(labels_pred==-1)

1217

In [120]:
labels_pred, labels

(array([ 2,  2,  1, ...,  6, -1, -1], dtype=int64),
 tensor([2, 2, 1,  ..., 6, 6, 6]))

In [121]:
labels.numpy()

array([2, 2, 1, ..., 6, 6, 6], dtype=int64)

In [123]:
np.sum(labels_pred==labels.numpy())

1268

## Build your own GNN

You will now build your own GNN. We start with the graph convolution layer.

In [24]:
class GraphConvLayer(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(GraphConvLayer, self).__init__()
        self.layer = nn.Linear(dim_input, dim_output)
        
    def forward(self, graph, signal):
        with graph.local_scope():
            # message passing
            graph.ndata['node'] = signal
            graph.update_all(fn.copy_u('node', 'message'), fn.mean('message', 'average'))
            h = graph.ndata['average']
            return self.layer(h)
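
The message passing step of this layer (copy the signal of each neighbor, then average) can be checked by hand. For an undirected graph where every node has at least one neighbor, it amounts to multiplying the signal by the row-normalized adjacency matrix. The NumPy sketch below compares both computations on a toy graph; the graph, the signal, and the variable names are illustrative assumptions.

```python
import numpy as np

# toy undirected graph: a star with center 0 and leaves 1, 2, 3
adjacency = np.array([[0, 1, 1, 1],
                      [1, 0, 0, 0],
                      [1, 0, 0, 0],
                      [1, 0, 0, 0]], dtype=float)
signal = np.array([[4., 0.],
                   [1., 2.],
                   [3., 4.],
                   [5., 6.]])

# explicit message passing: each node averages the signals of its neighbors
average_loop = np.zeros_like(signal)
for node in range(len(adjacency)):
    neighbors = np.flatnonzero(adjacency[node])
    if len(neighbors) > 0:
        average_loop[node] = signal[neighbors].mean(axis=0)

# same computation as a product with the row-normalized adjacency matrix
degrees = adjacency.sum(axis=1)
transition = adjacency / degrees[:, None]
average_matrix = transition @ signal

print(np.allclose(average_loop, average_matrix))
```

The linear layer is then applied to this averaged signal, so the node's own signal does not enter the output unless it is added explicitly.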

## To do

* Build a GNN with two layers (hidden layer + output) based on this graph convolution layer.
* Train this GNN and compare the results with the previous one.
* Add the input signal to the output of the graph convolution layer and observe the results.
* Retrain the same GNN without message passing in the first layer.

In [None]:
class GNN3(nn.Module):
    def __init__(self, dim_input, dim_output, dim_hidden=20):
        super(GNN3, self).__init__()
        # two graph convolution layers: hidden layer + output layer
        self.layer1 = GraphConvLayer(dim_input, dim_hidden)
        self.layer2 = GraphConvLayer(dim_hidden, dim_output)
        
    def forward(self, graph, features):
        h = self.layer1(graph, features)
        h = F.relu(h)
        output = self.layer2(graph, h)
        return output

## Heat diffusion as a GNN

Node classification by heat diffusion can be seen as a GNN without training, using a one-hot encoding of labels. Features are ignored.

## To do

* Build a special GNN whose output corresponds to one step of heat diffusion in the graph.
* Use this GNN to classify nodes by heat diffusion, with temperature centering.

In [None]:
from sknetwork.utils import get_membership

In [None]:
labels_one_hot = get_membership(labels.numpy()).toarray()
labels_one_hot = torch.Tensor(labels_one_hot)

In [None]:
class Diffusion(nn.Module):
    def __init__(self):
        super(Diffusion, self).__init__()
        
    def forward(self, graph, signal, mask):
        '''Mask is a boolean tensor giving the training set.'''
        with graph.local_scope():
            # to be modified
            h = signal
            return h

In [None]:
diffusion = Diffusion()

In [None]:
n_iter = 20

# copy the one-hot labels so that masking does not modify them in place
temperatures = labels_one_hot.clone()
temperatures[~train] = 0
for t in range(n_iter):
    temperatures = diffusion(graph, temperatures, train)
    
# temperature centering
temperatures -= temperatures.mean(axis=0)
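
As a point of comparison, here is a minimal NumPy sketch of heat diffusion with temperature centering on a toy graph, written outside DGL. It assumes that one diffusion step replaces the temperature of each node by the average temperature of its neighbors, while the nodes of the training set keep their one-hot temperatures. The toy graph, the labels, and the helper `diffusion_step` are illustrative assumptions, not the expected solution of the exercise.

```python
import numpy as np

def diffusion_step(adjacency, temperatures, seeds, mask):
    """One step of heat diffusion; training nodes (mask) are reset to their seed temperatures."""
    degrees = adjacency.sum(axis=1, keepdims=True)
    # average temperature of the neighbors (isolated nodes get temperature 0)
    averaged = (adjacency @ temperatures) / np.maximum(degrees, 1)
    averaged[mask] = seeds[mask]
    return averaged

# toy undirected path graph 0 - 1 - 2 - 3 - 4 - 5, with one labeled node at each end
n_nodes = 6
adjacency = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1

mask = np.zeros(n_nodes, dtype=bool)
mask[[0, 5]] = True
seeds = np.zeros((n_nodes, 2))
seeds[0, 0] = 1  # node 0 has label 0
seeds[5, 1] = 1  # node 5 has label 1

temperatures = seeds.copy()
for t in range(20):
    temperatures = diffusion_step(adjacency, temperatures, seeds, mask)

# temperature centering, then prediction by the hottest label
temperatures -= temperatures.mean(axis=0)
labels_pred = temperatures.argmax(axis=1)
print(labels_pred)
```

In this sketch, each unlabeled node ends up with the label of the closest labeled end of the path. Centering subtracts the mean temperature of each label, which matters mostly when the classes of the training set are unbalanced.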