# Semi-supervised Community Detection using Graph Neural Networks

Most introductory programming classes start with a "Hello World" example. Just as MNIST plays that role for deep learning, Zachary's Karate Club plays it for graph learning. The karate club is a social network of 34 members, with links between pairs of members who interact outside the club. The club later splits into two communities, one led by the instructor (node 0) and the other by the club president (node 33). The figure below visualizes the network, with color indicating the community.

<img src='../asset/karat_club.png' align='center' width="400px" height="300px" />

In this tutorial, you will learn how to:

* Formulate the community detection problem as a semi-supervised node classification task.
* Build a GraphSAGE model, a popular Graph Neural Network architecture proposed by [Hamilton et al.](https://arxiv.org/abs/1706.02216).
* Train the model and understand the result.

In [1]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools

Using backend: pytorch


## Community detection as node classification

The study of community structure in graphs has a long history. Many proposed methods are *unsupervised* (or *self-supervised* by recent definition): the model predicts community labels from connectivity alone. More recently, [Kipf et al.](https://arxiv.org/abs/1609.02907) proposed formulating community detection as a semi-supervised node classification task. Given labels for only a small fraction of the nodes, a GNN can accurately predict the community labels of the rest.

In this tutorial, we apply Kipf's setting to Zachary's Karate Club network and predict community membership using the labels of only a few nodes.

We first load the graph and node labels, as covered in the [last session](./1_load_data.ipynb). A helper function for loading the data is provided.

In [2]:
from tutorial_utils import load_zachery

# ----------- 0. load graph -------------- #
g = load_zachery()
print(g)

Graph(num_nodes=34, num_edges=156,
      ndata_schemes={'club': Scheme(shape=(), dtype=torch.int64), 'club_onehot': Scheme(shape=(2,), dtype=torch.int64)}
      edata_schemes={})


In the original Zachary's Karate Club graph, nodes are featureless. (The `'Age'` attribute is an artificial one, added mainly for tutorial purposes.) For featureless graphs, a common practice is to give every node a learnable embedding vector that is updated during training.

We can use PyTorch's `Embedding` module to achieve this.

In [3]:
# ----------- 1. node features -------------- #
node_embed = nn.Embedding(g.number_of_nodes(), 5)  # Every node has an embedding of size 5.
inputs = node_embed.weight                         # Use the embedding weight as the node features.
nn.init.xavier_uniform_(inputs)
print(inputs)

Parameter containing:
tensor([[ 0.3817, -0.1982,  0.0354, -0.0420, -0.3786],
        [ 0.0577, -0.2174,  0.3806, -0.1385, -0.0466],
        [-0.0625,  0.2613,  0.2505, -0.1168,  0.0517],
        [ 0.1652, -0.3654, -0.1174, -0.1066, -0.0265],
        [-0.0626,  0.0081, -0.0805, -0.2065,  0.1847],
        [ 0.2763,  0.1923,  0.1724, -0.1276,  0.3726],
        [-0.2887,  0.3318,  0.2087, -0.0511,  0.2060],
        [-0.3117, -0.3187,  0.3622, -0.1624, -0.0315],
        [-0.1482, -0.0845, -0.1926, -0.0573,  0.3398],
        [ 0.1377, -0.1056,  0.2965,  0.3668, -0.1756],
        [ 0.1430, -0.1686, -0.0689,  0.3754,  0.3886],
        [-0.3051, -0.0276,  0.1080,  0.0697,  0.2749],
        [-0.2268,  0.3638,  0.3825,  0.1868, -0.2312],
        [ 0.2403, -0.2899,  0.3139, -0.0270,  0.1848],
        [-0.1717,  0.3562,  0.1251,  0.2082, -0.3107],
        [ 0.3846, -0.0761, -0.1512,  0.3800,  0.2337],
        [-0.0489,  0.0889,  0.2310, -0.1197,  0.0476],
        [-0.3504,  0.2113, -0.0286, -0.2377
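To see why an embedding table works as a feature matrix for featureless nodes, the following minimal NumPy sketch shows that looking up a row of the table is the same as multiplying a one-hot node indicator by the weight matrix. The toy sizes and the names `weight`, `node_ids`, `looked_up` and `via_matmul` are illustrative assumptions, not part of the tutorial code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 4, 5                        # toy sizes, chosen only for illustration
weight = rng.normal(size=(num_nodes, dim))   # stands in for node_embed.weight

node_ids = np.array([2, 0])
one_hot = np.eye(num_nodes)[node_ids]        # one row per node, shape (2, num_nodes)

looked_up = weight[node_ids]                 # rows returned by an embedding lookup
via_matmul = one_hot @ weight                # the same rows via a matrix product
```

In other words, a learnable embedding behaves like a linear layer applied to one-hot node identities, which is why the tutorial can pass the whole weight matrix to the model as `inputs`.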

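The community label is stored in the `'club'` node feature (0 for the instructor, 1 for the club president). Only a handful of nodes are labeled for training: nodes 0 and 33 plus three randomly sampled nodes. The remaining nodes are held out for testing.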

In [55]:
import random
import numpy as np
random.seed(0)
labels = g.ndata['club']
print('#nodes:', len(labels))
train_nodes = np.unique([0, 33] + random.sample(range(len(labels)), 3))
test_nodes = np.delete(np.arange(len(labels)), train_nodes)
print('#labeled nodes:', len(train_nodes))
print('Labels', labels[train_nodes])

#nodes: 34
#labeled nodes: 5
Labels tensor([0, 0, 1, 1, 1])
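As a sanity check on the split, the sketch below rebuilds the labeled set the same way and confirms that the training and test nodes are disjoint and together cover the graph. It uses `np.setdiff1d` in place of `np.delete`, which is an equivalent alternative here rather than the tutorial's own call, and `num_nodes` is hard-coded to 34 as an assumption.

```python
import random
import numpy as np

random.seed(0)
num_nodes = 34  # assumed size of the karate club graph

# Nodes 0 and 33 are always labeled; three more are drawn at random.
# np.unique sorts and removes duplicates, so fewer than 5 labeled nodes
# are possible if the random sample happens to include 0 or 33.
train_nodes = np.unique([0, 33] + random.sample(range(num_nodes), 3))

# Every node that is not labeled becomes a test node.
test_nodes = np.setdiff1d(np.arange(num_nodes), train_nodes)
```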


## Define a GraphSAGE model

Our model consists of two layers, each of which computes new node representations by aggregating neighbor information. For layer $k$ and a node $v$ with neighborhood $\mathcal{N}(v)$, the equations are:

$$
h_{\mathcal{N}(v)}^k\leftarrow \text{AGGREGATE}_k\{h_u^{k-1},\forall u\in\mathcal{N}(v)\}
$$

$$
h_v^k\leftarrow \sigma\left(W^k\cdot \text{CONCAT}(h_v^{k-1}, h_{\mathcal{N}(v)}^k) \right)
$$
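The two equations can be traced by hand on a tiny graph. The NumPy sketch below implements one layer with a mean aggregator and ReLU as $\sigma$. It is an illustration under stated assumptions, not DGL's implementation: the function name `sage_mean_layer`, the dense adjacency matrix, and the toy weight matrix are all made up for this example.

```python
import numpy as np

def sage_mean_layer(adj, h, weight):
    """One GraphSAGE layer with a mean aggregator and ReLU activation.

    adj:    (N, N) 0/1 adjacency matrix, adj[v, u] = 1 if u is a neighbor of v
    h:      (N, d_in) node representations from the previous layer
    weight: (d_out, 2 * d_in) matrix W^k applied to CONCAT(h_v, h_N(v))
    """
    deg = adj.sum(axis=1, keepdims=True)
    h_neigh = adj @ h / np.maximum(deg, 1)          # AGGREGATE: mean over neighbors
    concat = np.concatenate([h, h_neigh], axis=1)   # CONCAT(h_v, h_N(v))
    return np.maximum(concat @ weight.T, 0.0)       # sigma = ReLU

# Toy path graph 0 - 1 - 2 with 2-dimensional features.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
h = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 4.0]])

# W = [I | I], so each output row is relu(h_v + mean of neighbor features).
weight = np.hstack([np.eye(2), np.eye(2)])
out = sage_mean_layer(adj, h, weight)
```

With this choice of `weight`, node 1 (whose neighbors are nodes 0 and 2) ends up with its own features `[0, 2]` plus the neighbor mean `[2, 2]`. Stacking two such layers lets each node see information from its two-hop neighborhood.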

DGL provides implementations of many popular neighbor aggregation modules, each of which can be invoked with one line of code. See the full list of supported [graph convolution modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv).

In [56]:
from dgl.nn import SAGEConv

# ----------- 2. create model -------------- #
# build a two-layer GraphSAGE model
class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, num_classes, 'mean')
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
    
# Create the model with given dimensions 
# input layer dimension: 5, node embeddings
# hidden layer dimension: 16
# output layer dimension: 2, the two classes, 0 and 1
net = GraphSAGE(5, 16, 2)

In [57]:
# ----------- 3. set up loss and optimizer -------------- #
# in this case, the loss is computed inside the training loop
optimizer = torch.optim.Adam(itertools.chain(net.parameters(), node_embed.parameters()), lr=0.01)

# ----------- 4. training -------------------------------- #
all_logits = []
for e in range(100):
    # forward
    logits = net(g, inputs)
    
    # compute loss
    logp = F.log_softmax(logits, 1)
    loss = F.nll_loss(logp[train_nodes], labels[train_nodes])
    
    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    all_logits.append(logits.detach())
    
    if e % 5 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))

In epoch 0, loss: 1.2463643550872803
In epoch 5, loss: 0.20613953471183777
In epoch 10, loss: 0.014657475054264069
In epoch 15, loss: 0.0025392910465598106
In epoch 20, loss: 0.0009291935712099075
In epoch 25, loss: 0.0005137986736372113
In epoch 30, loss: 0.0003468797367531806
In epoch 35, loss: 0.0002645134227350354
In epoch 40, loss: 0.0002182624739361927
In epoch 45, loss: 0.0001900716160889715
In epoch 50, loss: 0.00017162640870083123
In epoch 55, loss: 0.00015870961942709982
In epoch 60, loss: 0.000149152911035344
In epoch 65, loss: 0.0001415979495504871
In epoch 70, loss: 0.00013540136569645256
In epoch 75, loss: 0.00013011037663090974
In epoch 80, loss: 0.00012539127783384174
In epoch 85, loss: 0.00012110114039387554
In epoch 90, loss: 0.00011719231406459585
In epoch 95, loss: 0.00011347407416906208
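What makes the training semi-supervised is that the model produces logits for every node, but the loss is evaluated only at the labeled ones. The NumPy sketch below mirrors the log-softmax followed by negative log-likelihood on a made-up example. The helper names `log_softmax` and `masked_nll` and all the numbers are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def masked_nll(logits, labels, labeled_idx):
    """Mean negative log-likelihood over the labeled nodes only."""
    logp = log_softmax(logits)
    return -logp[labeled_idx, labels[labeled_idx]].mean()

# Made-up logits for 4 nodes and 2 classes; only nodes 0 and 3 are labeled.
logits = np.array([[2.0, 0.0],
                   [0.0, 0.0],
                   [5.0, -5.0],
                   [0.0, 2.0]])
labels = np.array([0, 1, 1, 1])
labeled_idx = np.array([0, 3])
loss = masked_nll(logits, labels, labeled_idx)
```

Changing the logits of nodes 1 and 2 leaves `loss` unchanged, so the unlabeled nodes influence training only through message passing in the graph layers, not through the loss term itself.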


In [58]:
# ----------- 5. check results ------------------------ #
pred = torch.argmax(logits, dim=1)
# Accuracy on held-out nodes only: divide by the number of test nodes, not all nodes.
print('Accuracy: {:.4f}'.format((pred == labels)[test_nodes].sum().item() / len(test_nodes)))

Accuracy: 0.9310


## Exercise

Experiment with the GNN model by swapping in other [graph convolution modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv).