# SD212: Graph mining

## Lab 7: Graph neural networks

In this lab, you will learn to classify nodes using graph neural networks.

We use [DGL](https://www.dgl.ai) (Deep Graph Library), which relies on PyTorch.

In [307]:
# pip install dgl

## Import

In [308]:
import numpy as np
from scipy import sparse

In [309]:
from sknetwork.data import load_netset
from sknetwork.classification import DiffusionClassifier
from sknetwork.embedding import Spectral
from sknetwork.utils import directed2undirected

In [310]:
import dgl
from dgl.nn import SAGEConv
from dgl import function as fn

In [311]:
import torch
from torch import nn
import torch.nn.functional as F

In [312]:
# ignore warnings from DGL
import warnings
warnings.filterwarnings('ignore')

## Load data

We will work on the following datasets (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Cora (directed graph + bipartite graph)
* WikiVitals (directed graph + bipartite graph)

Both datasets are graphs with node features (given by the bipartite graph) and ground-truth labels.

In [313]:
cora = load_netset('cora')
wikivitals = load_netset('wikivitals')

Parsing files...
Done.
Parsing files...
Done.


In [314]:
dataset = cora

In [315]:
adjacency = dataset.adjacency
biadjacency = dataset.biadjacency
labels = dataset.labels

In [316]:
# we use undirected graphs
adjacency = directed2undirected(adjacency)

In [317]:
# for Wikivitals, use spectral embedding of the bipartite graph as features

if dataset.meta.name.startswith('Wikivitals'):
    spectral = Spectral(50)
    features = spectral.fit_transform(biadjacency)
else:
    features = biadjacency.toarray()

In [318]:
def split_train_test_val(n_samples, test_ratio=0.1, val_ratio=0.1, seed=None):
    """Split the samples into train / test / validation sets.
    
    Returns
    -------
    train: np.ndarray
        Boolean mask
    test: np.ndarray
        Boolean mask
    validation: np.ndarray
        Boolean mask
    """
    if seed is not None:
        np.random.seed(seed)

    # test
    index = np.random.choice(n_samples, int(np.ceil(n_samples * test_ratio)), replace=False)
    test = np.zeros(n_samples, dtype=bool)
    test[index] = 1
    
    # validation
    index = np.random.choice(np.argwhere(~test).ravel(), int(np.ceil(n_samples * val_ratio)), replace=False)
    val = np.zeros(n_samples, dtype=bool)
    val[index] = 1
    
    # train
    train = np.ones(n_samples, dtype=bool)
    train[test] = 0
    train[val] = 0
    return train, test, val

In [319]:
train, test, val = split_train_test_val(len(labels))
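As a sanity check on this kind of split, the three masks should be disjoint and cover all nodes. The sketch below is a hypothetical variant (`split_masks` is not part of the lab) that draws a single permutation, using 2708 samples, the number of nodes in Cora.

```python
import numpy as np

def split_masks(n_samples, test_ratio=0.1, val_ratio=0.1, seed=None):
    """Hypothetical variant of the split above, based on a single permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(n_samples * test_ratio))
    n_val = int(np.ceil(n_samples * val_ratio))
    test_mask = np.zeros(n_samples, dtype=bool)
    val_mask = np.zeros(n_samples, dtype=bool)
    # the first indices of the permutation go to test, the next ones to validation
    test_mask[order[:n_test]] = True
    val_mask[order[n_test:n_test + n_val]] = True
    # everything else is used for training
    train_mask = ~(test_mask | val_mask)
    return train_mask, test_mask, val_mask

train_mask, test_mask, val_mask = split_masks(2708, seed=42)

# each node must belong to exactly one of the three sets
counts = train_mask.astype(int) + test_mask.astype(int) + val_mask.astype(int)
print(train_mask.sum(), test_mask.sum(), val_mask.sum(), (counts == 1).all())
```

Drawing one permutation makes the disjointness of the sets explicit, and a local random generator avoids changing the global NumPy seed.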

## Graph and tensors

In DGL, the graph is represented as an object, the features and labels as tensors.

In [320]:
# graph as an object
graph = dgl.from_scipy(adjacency)

In [321]:
type(graph)

dgl.heterograph.DGLGraph

In [322]:
# features and labels as tensors
features = torch.Tensor(features)
labels = torch.Tensor(labels).long()

In [323]:
labels

tensor([2, 2, 1,  ..., 6, 6, 6])

In [324]:
# masks as tensors
train = torch.Tensor(train).bool()
test = torch.Tensor(test).bool()
val = torch.Tensor(val).bool()

## Graph neural network

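We start with a simple graph neural network built around a layer of type [GraphSAGE](https://docs.dgl.ai/generated/dgl.nn.pytorch.conv.SAGEConv.html) with mean aggregation. The version below maps the features to dimension 20 with this layer, then to the classes with a linear layer.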
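To make the GraphSAGE layer with mean aggregation more concrete, here is a minimal NumPy sketch of the computation it is understood to perform: a linear map of the node's own features plus a linear map of the average of its neighbors' features. The function `sage_mean_layer` and its weight arguments are illustrative names, not part of DGL, and the actual `SAGEConv` implementation may differ in details.

```python
import numpy as np
from scipy import sparse

def sage_mean_layer(adjacency, features, weight_self, weight_neigh, bias):
    """One GraphSAGE-style layer with mean aggregation (illustrative sketch)."""
    # row-normalize the adjacency matrix so that each row averages over neighbors
    degrees = np.asarray(adjacency.sum(axis=1)).ravel()
    inv_degrees = np.divide(1.0, degrees, out=np.zeros(len(degrees)), where=degrees > 0)
    averaging = sparse.diags(inv_degrees) @ adjacency
    neighbor_mean = averaging @ features
    # combine the node's own features with the average over its neighbors
    return features @ weight_self + neighbor_mean @ weight_neigh + bias

# toy example: path graph 0 - 1 - 2 with 2 features per node
toy_adjacency = sparse.csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]))
toy_features = np.array([[1., 0.], [0., 1.], [1., 1.]])
output = sage_mean_layer(toy_adjacency, toy_features, np.eye(2), np.eye(2), np.zeros(2))
print(output)
```

With identity weights, each row of `output` is the node's feature vector plus the mean feature vector of its neighbors.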

In [325]:
class GNN(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(GNN, self).__init__()
        # GraphSAGE layer with mean aggregation, mapping features to dimension 20
        self.conv = SAGEConv(dim_input, 20, 'mean')
        # linear layer mapping the 20 dimensions to the classes
        self.hidden = nn.Linear(20, dim_output)
        self.relu = nn.ReLU()
        
    def forward(self, graph, features):
        output_layer1 = self.conv(graph, features)
        output_hidden = self.hidden(output_layer1)
        # ReLU is applied to the class scores
        output = self.relu(output_hidden)
        return output

## To do

* Train the model on Cora and get accuracy.
* Compare with the same model trained on an empty graph.
* Add a hidden layer with ReLU activation function (e.g., dimension = 20) and retrain the model.
* Compare with a classifier based on heat diffusion.

In [326]:
def init_model(model, features, labels):
    '''Init the GNN with appropriate dimensions.'''
    dim_input = features.shape[1]
    dim_output = len(labels.unique())
    return model(dim_input, dim_output)   

In [327]:
def eval_model(gnn, graph, features, labels, test=test):
    '''Evaluate the model in terms of accuracy.'''
    gnn.eval()
    with torch.no_grad():
        output = gnn(graph, features)
        labels_pred = torch.max(output, dim=1)[1]
        score = np.mean(np.array(labels[test]) == np.array(labels_pred[test]))
    return score

In [328]:
def train_model(gnn, graph, features, labels, train=train, val=val, n_epochs=100, lr=0.01, verbose=True):
    '''Train the GNN.'''
    optimizer = torch.optim.Adam(gnn.parameters(), lr=lr)
    
    gnn.train()
    scores = []
    
    for t in range(n_epochs):   
        
        # forward (training mode is restored here since eval_model switches to evaluation mode)
        gnn.train()
        output = gnn(graph, features)
        logp = nn.functional.log_softmax(output, 1)
        loss = nn.functional.nll_loss(logp[train], labels[train])

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # evaluation
        score = eval_model(gnn, graph, features, labels, val)
        scores.append(score)
        
        if verbose and t % 10 == 0:
            print("Epoch {:02d} | Loss {:.3f} | Accuracy {:.3f}".format(t, loss.item(), score))
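The loss used in the training loop is the cross-entropy restricted to the training nodes: a log-softmax over the class scores, followed by the negative log-likelihood of the true labels. A minimal NumPy sketch of this computation, with illustrative names (`masked_cross_entropy`, `toy_scores`) that do not appear in the lab:

```python
import numpy as np

def masked_cross_entropy(scores, labels, mask):
    """Cross-entropy averaged over the nodes selected by the boolean mask."""
    # log-softmax over classes, shifted by the row maximum for numerical stability
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # negative log-likelihood of the true labels on the selected nodes
    index = np.flatnonzero(mask)
    return -log_prob[index, labels[index]].mean()

# toy example: 4 nodes, 3 classes, uniform scores, 2 training nodes
toy_scores = np.zeros((4, 3))
toy_labels = np.array([0, 2, 1, 1])
toy_mask = np.array([True, True, False, False])
loss = masked_cross_entropy(toy_scores, toy_labels, toy_mask)
print(loss, np.log(3))
```

With uniform scores, the loss equals the logarithm of the number of classes, which gives a reference value for an untrained model.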

In [329]:
gnn = init_model(GNN, features, labels)

In [330]:
train_model(gnn, graph, features, labels)

Epoch 00 | Loss 1.963 | Accuracy 0.317
Epoch 10 | Loss 0.480 | Accuracy 0.819
Epoch 20 | Loss 0.104 | Accuracy 0.834
Epoch 30 | Loss 0.021 | Accuracy 0.841
Epoch 40 | Loss 0.007 | Accuracy 0.834
Epoch 50 | Loss 0.004 | Accuracy 0.838
Epoch 60 | Loss 0.002 | Accuracy 0.841
Epoch 70 | Loss 0.002 | Accuracy 0.834
Epoch 80 | Loss 0.001 | Accuracy 0.830
Epoch 90 | Loss 0.001 | Accuracy 0.830


In [331]:
eval_model(gnn, graph, features, labels)

0.8597785977859779

In [332]:
# Despite the names, this cell keeps all the edges of the graph
# (an empty graph of the same size would be dgl.from_scipy(sparse.csr_matrix(adjacency.shape)))
graph_empty = dgl.from_scipy(adjacency)

# features: spectral embedding of the adjacency matrix (dimension 50)
spectral_empty = Spectral(50)
features_empty = spectral_empty.fit_transform(adjacency)
features_empty = torch.Tensor(features_empty)

gnn_empty = init_model(GNN, features_empty, labels)

train_model(gnn_empty, graph_empty, features_empty, labels)

eval_model(gnn_empty, graph_empty, features_empty, labels)




Epoch 00 | Loss 1.984 | Accuracy 0.140
Epoch 10 | Loss 1.514 | Accuracy 0.561
Epoch 20 | Loss 1.073 | Accuracy 0.668
Epoch 30 | Loss 0.817 | Accuracy 0.745
Epoch 40 | Loss 0.690 | Accuracy 0.793
Epoch 50 | Loss 0.627 | Accuracy 0.812
Epoch 60 | Loss 0.593 | Accuracy 0.834
Epoch 70 | Loss 0.572 | Accuracy 0.830
Epoch 80 | Loss 0.559 | Accuracy 0.830
Epoch 90 | Loss 0.548 | Accuracy 0.830


0.8154981549815498

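> With an empty graph, there is no message passing and the model relies on the node features only. The cell above does not test this case: the graph keeps all its edges and the features are a spectral embedding of the adjacency matrix. It reaches a test accuracy of 0.815, against 0.860 for the first model.
> Applying a ReLU activation function gave a better accuracy than the previous model (0.87 vs. 0.83).
> Adding a hidden layer with ReLU gave a worse accuracy (0.81 vs. 0.83).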
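For reference, an empty graph of the same size has an adjacency matrix with the same shape and no stored entry. The sketch below uses a toy size and illustrative names (`adjacency_no_edge`, `toy_features`), and assumes that nodes without neighbors receive a zero message. In the notebook, such a matrix could presumably be built as `sparse.csr_matrix(adjacency.shape)`, passed to `dgl.from_scipy`, and used with the original `features`.

```python
import numpy as np
from scipy import sparse

n_nodes, n_features = 5, 3
rng = np.random.default_rng(0)
toy_features = rng.random((n_nodes, n_features))

# adjacency matrix of the empty graph: same number of nodes, no edge
adjacency_no_edge = sparse.csr_matrix((n_nodes, n_nodes))

# mean over neighbors; nodes without neighbors get a zero vector
degrees = np.asarray(adjacency_no_edge.sum(axis=1)).ravel()
inv_degrees = np.divide(1.0, degrees, out=np.zeros(n_nodes), where=degrees > 0)
neighbor_mean = sparse.diags(inv_degrees) @ adjacency_no_edge @ toy_features

print(adjacency_no_edge.nnz)   # number of edges
print(neighbor_mean)           # aggregated messages
```

Since all aggregated messages vanish, a layer that also uses the node's own features reduces to a classifier on the features alone, which is the point of the comparison.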

In [333]:
heat_classifier = DiffusionClassifier()

# We keep the labels of the train set and set the others to -1 (unlabeled)
train_mask = train.numpy()
labels_train = - np.ones(len(labels), dtype=int)
labels_train[train_mask] = labels.numpy()[train_mask]

labels_pred = heat_classifier.fit_predict(adjacency, labels=labels_train)

# accuracy on the test set
np.mean(labels_pred[test.numpy()] == labels.numpy()[test.numpy()])






0.8560885608856088

## Build your own GNN

You will now build your own GNN. We start with the graph convolution layer.

In [334]:
class GraphConvLayer(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(GraphConvLayer, self).__init__()
        self.layer = nn.Linear(dim_input, dim_output)
        
    def forward(self, graph, signal):
        with graph.local_scope():
            # message passing
            graph.ndata['node'] = signal
            graph.update_all(fn.copy_u('node', 'message'), fn.mean('message', 'average'))
            h = graph.ndata['average']
            return self.layer(h)

## To do

* Build a GNN with two layers (hidden layer + output) based on this graph convolution layer.
* Train this GNN and compare the results with the previous one.
* Add the input signal to the output of the graph convolution layer and observe the results.
* Retrain the same GNN without message passing in the first layer.

In [335]:
class GNNHeat(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(GNNHeat, self).__init__()
        self.layer = GraphConvLayer(dim_input, 20)
        self.hidden = nn.Linear(20, dim_output)

        self.relu = nn.ReLU()
        
    def forward(self, graph, features):
        output_layer1 = self.layer(graph, features)
        output_hidden = self.hidden(output_layer1)
        output = self.relu(output_hidden)
        return output

In [336]:
gnn_heat = init_model(GNNHeat, features, labels)

train_model(gnn_heat, graph, features, labels)

eval_model(gnn_heat, graph, features, labels)

Epoch 00 | Loss 1.929 | Accuracy 0.351
Epoch 10 | Loss 1.067 | Accuracy 0.642
Epoch 20 | Loss 0.652 | Accuracy 0.694
Epoch 30 | Loss 0.340 | Accuracy 0.834
Epoch 40 | Loss 0.161 | Accuracy 0.838
Epoch 50 | Loss 0.098 | Accuracy 0.830
Epoch 60 | Loss 0.067 | Accuracy 0.834
Epoch 70 | Loss 0.051 | Accuracy 0.830
Epoch 80 | Loss 0.042 | Accuracy 0.827
Epoch 90 | Loss 0.036 | Accuracy 0.812


0.8228782287822878

## Heat diffusion as a GNN

Node classification by heat diffusion can be seen as a GNN without training, using a one-hot encoding of labels. Features are ignored.
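To fix ideas, here is a minimal NumPy sketch of this classification scheme, under the assumption that labeled nodes keep their one-hot temperature at each step while the other nodes take the average temperature of their neighbors. The function `heat_diffusion_predict` is illustrative and is not claimed to match the implementation of `DiffusionClassifier`.

```python
import numpy as np
from scipy import sparse

def heat_diffusion_predict(adjacency, labels, n_iter=20):
    """Classify nodes by heat diffusion; negative labels mark unlabeled nodes."""
    n_nodes = adjacency.shape[0]
    known = labels >= 0
    n_labels = labels[known].max() + 1
    # one-hot temperatures for labeled nodes, 0 elsewhere
    boundary = np.zeros((n_nodes, n_labels))
    boundary[np.flatnonzero(known), labels[known]] = 1
    # averaging operator: row-normalized adjacency matrix
    degrees = np.asarray(adjacency.sum(axis=1)).ravel()
    inv_degrees = np.divide(1.0, degrees, out=np.zeros(n_nodes), where=degrees > 0)
    averaging = sparse.diags(inv_degrees) @ adjacency
    temperatures = boundary.copy()
    for _ in range(n_iter):
        # each node takes the average temperature of its neighbors...
        temperatures = averaging @ temperatures
        # ...except labeled nodes, which keep their temperature
        temperatures[known] = boundary[known]
    # temperature centering, then prediction by highest temperature
    temperatures -= temperatures.mean(axis=0)
    return temperatures.argmax(axis=1)

# toy example: path graph 0 - 1 - 2 - 3, with labels known at both ends
toy_adjacency = sparse.csr_matrix(np.array([[0, 1, 0, 0],
                                            [1, 0, 1, 0],
                                            [0, 1, 0, 1],
                                            [0, 0, 1, 0]]))
toy_labels = np.array([0, -1, -1, 1])
predicted = heat_diffusion_predict(toy_adjacency, toy_labels)
print(predicted)
```

Each unlabeled node ends up with the label of the closest labeled end of the path, without any trainable parameter.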

## To do

* Build a special GNN whose output corresponds to one step of heat diffusion in the graph.
* Use this GNN to classify nodes by heat diffusion, with temperature centering.

In [337]:
from sknetwork.utils import get_membership

In [338]:
labels_one_hot = get_membership(labels).toarray()
labels_one_hot = torch.Tensor(labels_one_hot)

In [339]:
class Diffusion(nn.Module):
    def __init__(self):
        super(Diffusion, self).__init__()
        
    def forward(self, graph, signal, mask):
        '''Mask is a boolean tensor giving the training set.'''
        with graph.local_scope():
            # message passing: each node takes the average temperature of its neighbors
            graph.ndata['node'] = signal
            graph.update_all(fn.copy_u('node', 'message'), fn.mean('message', 'average'))
            h = graph.ndata['average']

            # diffusion: nodes of the training set keep their temperature
            h = torch.where(mask[:, None], signal, h)
            return h
            

In [340]:
diffusion = Diffusion()

In [341]:
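n_iter = 20

# initial temperatures: one-hot labels on the training set, 0 elsewhere
temperatures = labels_one_hot.clone()
temperatures[~train] = 0
for t in range(n_iter):
    temperatures = diffusion(graph, temperatures, train)
    
# temperature centering
temperatures -= temperatures.mean(dim=0)

# predicted label = highest centered temperature
labels_pred = temperatures.argmax(dim=1)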
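Temperature centering matters when one label dominates: without it, the dominant label can have the highest temperature everywhere. A small NumPy illustration with made-up temperatures (`toy_temperatures` is not taken from the lab):

```python
import numpy as np

# made-up temperatures for 4 nodes and 2 labels; label 0 dominates everywhere
toy_temperatures = np.array([[0.9, 0.1],
                             [0.8, 0.2],
                             [0.7, 0.3],
                             [0.6, 0.4]])

# prediction without centering: every node gets label 0
raw_pred = toy_temperatures.argmax(axis=1)

# prediction with centering: subtract the mean temperature of each label
centered = toy_temperatures - toy_temperatures.mean(axis=0)
centered_pred = centered.argmax(axis=1)

print(raw_pred, centered_pred)
```

After centering, a node is assigned to the label for which it is hotter than average, so the last two nodes switch to label 1.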