# MDI343
## Lab on real-world graph analysis -- graph neural networks

The objective of this lab is to get a feel for real-world graphs. For information on the `scikit-network` library, see [its documentation](https://scikit-network.readthedocs.io/).

## Import

In [1]:
import numpy as np

In [2]:
import dgl
import dgl.function as fn
import torch as th
import torch.nn as nn
import torch.nn.functional as F
from dgl import DGLGraph

Using backend: pytorch


In [3]:
from itertools import groupby

In [4]:
import sknetwork as skn

In [5]:
# Utility function returning the complementary cumulative distribution
# (for each distinct value, the fraction of values greater than or equal to it)
def ccdf(values):
    x = []
    y = []
    values = sorted(values)

    # First count the occurrences of each distinct value
    counts = [(key, len(list(group))) for key, group in groupby(values)]

    # Then accumulate from 1 downwards
    total = 1.0
    for (val, count) in counts:
        x.append(val)
        y.append(total)
        total -= count/len(values)
    return x, y

# Utility function returning the distinct values and their counts
def dist(values):
    values = sorted(values)

    # Count the occurrences of each distinct value
    counts = [(key, len(list(group))) for key, group in groupby(values)]

    return [x[0] for x in counts], [x[1] for x in counts]
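
As a sanity check of what `ccdf` is meant to return, here is a minimal self-contained sketch that computes the same quantity (the fraction of values greater than or equal to each distinct value) with `collections.Counter`. The function name `ccdf_counter` and the toy `degrees` list are hypothetical and only serve the illustration.

```python
from collections import Counter

def ccdf_counter(values):
    """Return (x, y) where y[i] is the fraction of values >= x[i]."""
    counts = Counter(values)
    n = len(values)
    xs = sorted(counts)
    ys = []
    remaining = n  # number of values >= the current distinct value
    for x in xs:
        ys.append(remaining / n)
        remaining -= counts[x]
    return xs, ys

degrees = [1, 1, 2, 3]  # hypothetical toy degree sequence
x, y = ccdf_counter(degrees)
print(x, y)  # [1, 2, 3] [1.0, 0.5, 0.25]
```

Keeping an integer count and dividing only at the end avoids the small rounding drift that repeated floating-point subtraction can introduce on long sequences.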

## Load data

We will work on 2 graphs induced by the [Vital articles of Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/4), a selection of about 10,000 articles of the English Wikipedia:
* the directed graph of hyperlinks between these articles,
* the bipartite graph between articles and (stemmed) words used in their summary.

In [6]:
data = skn.data.load_netset('wikivitals')
data.keys()

dict_keys(['adjacency', 'biadjacency', 'names', 'names_row', 'names_col', 'labels', 'labels_hierarchy', 'names_labels', 'names_labels_hierarchy', 'meta'])

In [7]:
# graph of links
adjacency = dgl.from_scipy(data.adjacency)

In [8]:
# graph of words
biadjacency = dgl.bipartite_from_scipy(data.biadjacency, "articles", "words", "occurrence")

In [9]:
# article names
names = data.names

In [10]:
# article categories
categories = data.names_labels
categories

array(['People', 'History', 'Geography', 'Arts',
       'Philosophy and religion', 'Everyday life',
       'Society and social sciences', 'Biology and health sciences',
       'Physical sciences', 'Technology', 'Mathematics'], dtype='<U27')

In [11]:
# words
words = data.names_col
words

array(['moos', 'tonnag', 'separatist', ..., 'luteum', 'radiat', 'helena'],
      dtype='<U22')

In [12]:
node_index = {name:i for i, name in enumerate(names)}

In [13]:
num_words, num_articles = biadjacency.num_dst_nodes(), biadjacency.num_src_nodes()

In [14]:
labels = data.labels

## To do

For the 2 graphs:
* Separate the data into training and validation sets
* Fill in the code to implement a GCN with the Deep Graph Library (DGL)

In [15]:
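# Message function: each node sends its feature 'h' along its outgoing edges
# (`fn.copy_u` is the current name of the deprecated `fn.copy_src`)
gcn_msg = fn.copy_u(u='h', out='m')
# Reduce function: each node sums its incoming messages into a new 'h'
gcn_reduce = fn.sum(msg='m', out='h')

# Sizes for Cora; change them for Wikipedia
num_features = 1433
num_classes = 7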

In [16]:
class GCNLayer(nn.Module):
    def __init__(self, features_in, features_out):
        super(GCNLayer, self).__init__()
        self.linear = nn.Linear(features_in, features_out)
        
    def forward(self, g, feature):
        with g.local_scope():
            g.ndata["h"] = feature
            g.update_all(gcn_msg, gcn_reduce)
            h = g.ndata["h"]
            return self.linear(h)
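
To make the computation in `GCNLayer` concrete, the following NumPy sketch reproduces it on a toy graph: a plain sum over in-neighbours (no degree normalisation), followed by a linear map. The graph, the feature values and the weights `W` and `b` are all made up for the illustration; the sketch assumes the convention `A[u, v] = 1` for an edge from `u` to `v`.

```python
import numpy as np

# Hypothetical toy directed graph with 3 nodes: edges 0->1, 0->2, 1->2.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)  # A[u, v] = 1 if there is an edge u -> v

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])  # one 2-dimensional feature row per node

# Sum aggregation: node v receives the sum of h_u over its in-neighbours u.
aggregated = A.T @ H

# Linear map applied after aggregation. W has shape (in, out), i.e. it plays
# the role of the transposed weight matrix of nn.Linear. Values are arbitrary.
W = np.array([[1.0], [-1.0]])
b = np.array([0.5])
out = aggregated @ W + b

print(aggregated)
print(out)
```

Node 0 has no in-neighbour here, so its aggregated feature is the zero vector and its own feature is lost.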

In [17]:
class GCNet(nn.Module):
    def __init__(self):
        super(GCNet, self).__init__()
        self.layer1 = GCNLayer(num_features, 16)
        self.layer2 = GCNLayer(16, num_classes)
        
    def forward(self, g, features):
        x = F.relu(self.layer1(g, features))
        x = self.layer2(g, x)
        return x

In [18]:
net = GCNet()
print(net)

GCNet(
  (layer1): GCNLayer(
    (linear): Linear(in_features=1433, out_features=16, bias=True)
  )
  (layer2): GCNLayer(
    (linear): Linear(in_features=16, out_features=7, bias=True)
  )
)


Let us first use a more standard dataset, Cora, to get the hang of how things work.

In [31]:
from dgl.data import citation_graph as citegrh
def load_cora_data():
    data = citegrh.load_cora()
    features = th.FloatTensor(data.features)
    labels = th.LongTensor(data.labels)
    train_mask = th.BoolTensor(data.train_mask)
    test_mask = th.BoolTensor(data.test_mask)
    g = dgl.from_networkx(data.graph)
    return g, features, labels, train_mask, test_mask

In [29]:
def evaluate(model, g, features, labels, mask):
    model.eval()
    with th.no_grad():
        logits = model(g, features)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = th.max(logits, dim=1)
        correct = th.sum(indices == labels)
        return correct.item() * 1.0 / len(labels)

In [32]:
g, features, labels, train_mask, test_mask = load_cora_data()

# Add a self-loop to every node so that its own representation is kept during aggregation
g.add_edges(g.nodes(), g.nodes())
optimizer = th.optim.Adam(net.parameters(), lr=1e-2)

for epoch in range(50):

    net.train()
    logits = net(g, features)
    logp = F.log_softmax(logits, 1)
    loss = F.nll_loss(logp[train_mask], labels[train_mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    acc = evaluate(net, g, features, labels, test_mask)
    print("Epoch {:05d} | Loss {:.4f} | Accuracy on test {:.4f}".format(
            epoch, loss.item(), acc))

Loading from cache failed, re-processing.
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.
Epoch 00000 | Loss 0.0104 | Accuracy on test 0.7500
Epoch 00001 | Loss 0.0077 | Accuracy on test 0.7370
Epoch 00002 | Loss 0.0060 | Accuracy on test 0.7350
Epoch 00003 | Loss 0.0046 | Accuracy on test 0.7420
Epoch 00004 | Loss 0.0034 | Accuracy on test 0.7430
Epoch 00005 | Loss 0.0026 | Accuracy on test 0.7390
Epoch 00006 | Loss 0.0020 | Accuracy on test 0.7370
Epoch 00007 | Loss 0.0016 | Accuracy on test 0.7350
Epoch 00008 | Loss 0.0012 | Accuracy on test 0.7350
Epoch 00009 | Loss 0.0010 | Accuracy on test 0.7320
Epoch 00010 | Loss 0.0008 | Accuracy on test 0.7330
Epoch 00011 | Loss 0.0006 | Accuracy on test 0.7330
Epoch 00012 | Loss 0.0005 | Accuracy on test 0.7300
Epoch 00013 | Loss 0.0004 | Accuracy on test 0.7300
Epoch 00
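
The training cell adds a self-loop to every node with `g.add_edges(g.nodes(), g.nodes())`. The NumPy sketch below illustrates why, under the assumption that aggregation is a plain sum over in-neighbours: without self-loops a node's own feature does not contribute to its new representation, while adding the identity to the adjacency matrix puts it back. The toy graph and feature values are hypothetical.

```python
import numpy as np

# Hypothetical toy directed graph: edges 0->1, 0->2, 1->2 (A[u, v] = 1 for u -> v).
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
H = np.array([[1.0], [10.0], [100.0]])  # one scalar feature per node

# Sum over in-neighbours only: each node's own feature is dropped.
without_loops = A.T @ H

# Adding the identity corresponds to adding one self-loop per node.
A_loops = A + np.eye(3)
with_loops = A_loops.T @ H

print(without_loops.ravel())  # [ 0.  1. 11.]
print(with_loops.ravel())     # [  1.  11. 111.]
```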

## To do

 * Separate our dataset into training and test sets (you just need to define a `train_mask` and a `test_mask`, which are Boolean vectors)
 * Adapt the Wikipedia dataset to run our GCN model
 * Use the GCN to predict the category of the articles in the test set
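
One possible way to build the two Boolean masks is a random split, sketched below with NumPy. The 80/20 ratio, the seed and `num_nodes = 10` are assumptions made for the illustration, not values prescribed by the lab; in practice `num_nodes` would be the number of articles.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so that the split is reproducible
num_nodes = 10                  # hypothetical; use the number of articles in practice
train_fraction = 0.8            # assumed ratio

# Shuffle the node indices and keep the first part for training.
perm = rng.permutation(num_nodes)
num_train = int(train_fraction * num_nodes)

train_mask = np.zeros(num_nodes, dtype=bool)
train_mask[perm[:num_train]] = True
test_mask = ~train_mask  # every node is in exactly one of the two sets

print(train_mask.sum(), test_mask.sum())
```

The resulting arrays can then be converted to PyTorch Boolean tensors, for instance with `th.from_numpy(train_mask)`.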

For example, as features, you can use a 0/1 vector indicating the absence or presence of each word in a given article.
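
A minimal sketch of such presence/absence features, using a small made-up count matrix in place of the real article-word data (rows are articles, columns are words):

```python
import numpy as np

# Hypothetical word-count matrix: 3 articles (rows) and 3 words (columns).
counts = np.array([[2, 0, 1],
                   [0, 0, 3],
                   [0, 5, 0]])

# 1.0 where the word occurs at least once in the article, 0.0 otherwise.
features = (counts > 0).astype(np.float32)

print(features)
```

With the real data, the number of columns of the feature matrix gives `num_features`, and `num_classes` becomes 11, the number of categories listed earlier. Converting the full sparse `data.biadjacency` matrix to a dense array (for instance with `.toarray()`) may need a lot of memory, so treat this as a sketch rather than a recommended pipeline.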

In [81]:
def load_wikivitals_data():
    # TODO: build g, features, labels, train_mask and test_mask from the wikivitals data
    return g, features, labels, train_mask, test_mask