# Link Prediction with GNNs

GNNs are powerful tools for many machine learning tasks on graphs. This tutorial teaches the basic workflow of using GNNs for link prediction. We again use Zachary's Karate Club graph, this time to predict whether two members interact.

In this tutorial, you will learn how to:
* Prepare training and testing sets for a link prediction task.
* Build a GNN-based link prediction model.
* Train the model and verify the result.

In [1]:
from tutorial_utils import setup_tf
setup_tf()

In [2]:
import dgl
import tensorflow as tf
import itertools
import numpy as np
import scipy.sparse as sp

## Load the graph and features

Following the last [session](./2_gnn-CN.ipynb), we first load Zachary's Karate Club graph and create node embeddings.

In [3]:
from tutorial_utils import load_zachery

# ----------- 0. load graph -------------- #
g = load_zachery()
print(g)

# ----------- 1. node features -------------- #
node_embed = tf.keras.layers.Embedding(g.number_of_nodes(), 5,
                                       embeddings_initializer='glorot_uniform')  # Every node has an embedding of size 5.
node_embed(1)  # Call the layer once to initialize the embedding weights.
inputs = node_embed.embeddings  # Use the embedding weights as the node features.
print(inputs)

Graph(num_nodes=34, num_edges=156,
      ndata_schemes={'club': Scheme(shape=(), dtype=tf.int64), 'club_onehot': Scheme(shape=(2,), dtype=tf.float32)}
      edata_schemes={})
<tf.Variable 'embedding/embeddings:0' shape=(34, 5) dtype=float32, numpy=
array([[-0.38376212,  0.02762738,  0.2652063 ,  0.32293776,  0.04524353],
       [ 0.30032715,  0.02468556, -0.1900916 , -0.04702508, -0.30461857],
       [-0.1929569 ,  0.24494252, -0.38531214, -0.08113599,  0.06808767],
       [ 0.02020505,  0.26825735, -0.3504967 ,  0.2496821 , -0.34984836],
       [ 0.0258891 , -0.15108845, -0.35368958,  0.37582836, -0.29545236],
       [ 0.2677717 ,  0.08830869,  0.28496173,  0.02015099,  0.05002049],
       [-0.00256091,  0.10553828, -0.10098866, -0.25102186,  0.20928678],
       [ 0.23899326, -0.27900234,  0.23708245, -0.20309108, -0.11720824],
       [-0.07901543,  0.31122229,  0.01442784,  0.03468132, -0.16346353],
       [ 0.00062189,  0.32725772,  0.22976199, -0.09203568,  0.0605621 ],
       [ 0.

## Prepare training and testing sets

In general, a link prediction data set contains two types of edges: *positive* and *negative* edges. Positive edges are usually drawn from the existing edges in the graph. In this example, we randomly pick 50 of the 156 edges for testing and leave the remaining 106 for training.

In [4]:
# Split edge set for training and testing
u, v = g.edges()
u, v = u.numpy(), v.numpy()
eids = np.arange(g.number_of_edges())
eids = np.random.permutation(eids)
test_pos_u, test_pos_v = u[eids[:50]], v[eids[:50]]
train_pos_u, train_pos_v = u[eids[50:]], v[eids[50:]]

Negative edges are node pairs with no edge between them. Because there are usually many of them, they are typically sampled rather than used in full. Choosing a good negative sampling algorithm is a widely studied topic and is outside the scope of this tutorial. Since our example graph is quite small (only 34 nodes), we enumerate all the missing edges and randomly pick 50 for testing and 150 for training.

In [5]:
# Find all negative edges and split them for training and testing
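adj = sp.coo_matrix((np.ones(len(u)), (u, v)))
# Complement of the adjacency matrix, excluding self-loops on the diagonal
adj_neg = 1 - adj.todense() - np.eye(34)
neg_u, neg_v = np.where(adj_neg != 0)
# Sample without replacement so that no pair is drawn twice
neg_eids = np.random.choice(len(neg_u), 200, replace=False)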
test_neg_u, test_neg_v = neg_u[neg_eids[:50]], neg_v[neg_eids[:50]]
train_neg_u, train_neg_v = neg_u[neg_eids[50:]], neg_v[neg_eids[50:]]
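
As a quick sanity check of the enumeration above, the following minimal sketch applies the same dense-complement trick to a hypothetical 4-node toy graph. The edge list and all `toy_*` names are illustrative assumptions, not part of the tutorial data. It confirms that no sampled negative pair is an existing edge or a self-loop, and it samples without replacement so the pairs are distinct.

```python
import numpy as np

# Hypothetical toy graph: 4 nodes on a path, edges stored in both directions.
toy_u = np.array([0, 1, 1, 2, 2, 3])
toy_v = np.array([1, 0, 2, 1, 3, 2])
num_nodes = 4

# Dense adjacency matrix built from the edge list.
toy_adj = np.zeros((num_nodes, num_nodes))
toy_adj[toy_u, toy_v] = 1

# Complement of the adjacency matrix, with the diagonal (self-loops) removed.
toy_adj_neg = 1 - toy_adj - np.eye(num_nodes)
toy_neg_u, toy_neg_v = np.where(toy_adj_neg != 0)

# Sample distinct negative pairs (without replacement).
rng = np.random.default_rng(0)
picked = rng.choice(len(toy_neg_u), size=4, replace=False)
sample_u, sample_v = toy_neg_u[picked], toy_neg_v[picked]

existing = set(zip(toy_u.tolist(), toy_v.tolist()))
sampled = list(zip(sample_u.tolist(), sample_v.tolist()))
print("negative pairs available:", len(toy_neg_u))
print("sampled negative pairs:", sampled)
```

The dense complement is only practical for small graphs like this one. On larger graphs the full matrix would not fit in memory, which is one reason dedicated negative samplers exist.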

Put positive and negative edges together and form training and testing sets.

In [6]:
# Create training set.
# Note: in this tutorial positive edges are labeled 0 and negative edges 1.
train_u = tf.concat([train_pos_u, train_neg_u], axis=0)
train_v = tf.concat([train_pos_v, train_neg_v], axis=0)
train_label = tf.concat([tf.zeros(len(train_pos_u)), tf.ones(len(train_neg_u))], axis=0)

# Create testing set.
test_u = tf.concat([test_pos_u, test_neg_u], axis=0)
test_v = tf.concat([test_pos_v, test_neg_v], axis=0)
test_label = tf.concat([tf.zeros(len(test_pos_u)), tf.ones(len(test_neg_u))], axis=0)

## Define a GraphSAGE model

Our model consists of two layers, each of which computes new node representations by aggregating neighbor information. The equations are:

$$
h_{\mathcal{N}(v)}^k\leftarrow \text{AGGREGATE}_k\{h_u^{k-1},\forall u\in\mathcal{N}(v)\}
$$

$$
h_v^k\leftarrow \text{ReLU}\left(W^k\cdot \text{CONCAT}(h_v^{k-1}, h_{\mathcal{N}(v)}^k) \right)
$$
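
To make the two equations concrete, here is a minimal NumPy sketch of a single layer with mean aggregation. The function name `sage_mean_layer`, the toy path graph, and the random weight matrix are assumptions made for illustration. Library implementations such as DGL's `SAGEConv` may differ in details such as bias terms or how the weight matrix is split.

```python
import numpy as np

def sage_mean_layer(adj, h, weight):
    """One GraphSAGE-style layer with mean aggregation, ReLU, and no bias.

    adj:    (N, N) 0/1 adjacency matrix, adj[v, u] = 1 if u is a neighbor of v
    h:      (N, d_in) node representations from the previous layer
    weight: (2 * d_in, d_out) matrix applied to CONCAT(h_v, h_N(v))
    """
    deg = adj.sum(axis=1, keepdims=True)
    # AGGREGATE: mean of neighbor representations (zero vector if no neighbors).
    h_neigh = (adj @ h) / np.maximum(deg, 1)
    # CONCAT the node's own representation with the aggregated one.
    h_cat = np.concatenate([h, h_neigh], axis=1)
    # Linear map followed by ReLU.
    return np.maximum(h_cat @ weight, 0)

# Hypothetical path graph 0 - 1 - 2 with 2-dimensional input features.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
h0 = np.array([[1., 0.],
               [0., 1.],
               [1., 1.]])
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # illustrative random weights

h1 = sage_mean_layer(adj, h0, W)
print(h1.shape)
```

Node 1 has neighbors 0 and 2, so its aggregated vector is the mean of `[1, 0]` and `[1, 1]`, namely `[1, 0.5]`, which is then concatenated with its own features `[0, 1]` before the linear map.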

DGL provides implementations of many popular neighbor aggregation modules, each of which can be invoked with a single line of code. See the full list of supported [graph convolution modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv).

In [7]:
from dgl.nn import SAGEConv

# ----------- 2. create model -------------- #
# build a two-layer GraphSAGE model
class GraphSAGE(tf.keras.layers.Layer):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, h_feats, 'mean')
    
    def call(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = tf.nn.relu(h)
        h = self.conv2(g, h)
        return h
    
# Create the model with given dimensions 
# input layer dimension: 5, node embeddings
# hidden layer dimension: 16
net = GraphSAGE(5, 16)

## A loss function for link prediction

We then optimize the model using the following loss function.

$$
\hat{y}_{u\sim v} = \sigma(h_u^T h_v)
$$

$$
\mathcal{L} = -\sum_{u\sim v\in \mathcal{D}}\left( y_{u\sim v}\log(\hat{y}_{u\sim v}) + (1-y_{u\sim v})\log(1-\hat{y}_{u\sim v}) \right)
$$

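Essentially, the model predicts a score for each edge by taking the dot product of the representations of its two end-points and passing it through a sigmoid. It then computes a binary cross entropy loss against the target $y$, which is 0 or 1 depending on whether the edge is a positive one or not. Note that the training set constructed above labels positive edges 0 and negative edges 1.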
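
The scoring and loss equations can be worked through by hand on a tiny example. The sketch below is a NumPy illustration under stated assumptions: the helper names `edge_scores` and `binary_cross_entropy`, the three hand-picked node vectors, and the targets are all hypothetical. It also averages over edges rather than summing, which only rescales the loss.

```python
import numpy as np

def edge_scores(h, u, v):
    """Sigmoid of the dot product between the two end-point representations."""
    logits = np.sum(h[u] * h[v], axis=1)
    return 1.0 / (1.0 + np.exp(-logits))

def binary_cross_entropy(y, y_hat, eps=1e-7):
    """Mean binary cross entropy; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical 3-node example with 2-dimensional representations.
h = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [-1.0, 0.0]])
u = np.array([0, 0])          # source nodes of the two candidate edges
v = np.array([1, 2])          # destination nodes
y = np.array([1.0, 0.0])      # illustrative targets, one per candidate edge

y_hat = edge_scores(h, u, v)
loss = binary_cross_entropy(y, y_hat)
print(y_hat, loss)
```

Here the pair (0, 1) has dot product 1 and the pair (0, 2) has dot product -1, so the predicted scores are $\sigma(1)\approx 0.731$ and $\sigma(-1)\approx 0.269$, and both terms of the loss equal $-\log\sigma(1)\approx 0.313$.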

In [8]:
# ----------- 3. set up loss and optimizer -------------- #
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_fcn = tf.keras.losses.BinaryCrossentropy(
    from_logits=False)

# ----------- 4. training -------------------------------- #
all_logits = []
for e in range(100):
    
    with tf.GradientTape() as tape:
        tape.watch(inputs) # optimize embedding layer also
        # forward (`logits` holds the node representations produced by the GNN)
        logits = net(g, inputs)
        pred = tf.sigmoid(tf.reduce_sum(tf.gather(logits, train_u) *
                                        tf.gather(logits, train_v), axis=1))

        # compute loss
        loss = loss_fcn(train_label, pred)

    # backward (outside the tape context, so the update itself is not recorded)
    params = net.trainable_weights + node_embed.trainable_weights
    grads = tape.gradient(loss, params)
    optimizer.apply_gradients(zip(grads, params))
    all_logits.append(logits.numpy())

    if e % 5 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))

In epoch 0, loss: 0.6808468699455261
In epoch 5, loss: 0.5773298740386963
In epoch 10, loss: 0.44662487506866455
In epoch 15, loss: 0.39926254749298096
In epoch 20, loss: 0.3401777446269989
In epoch 25, loss: 0.32041066884994507
In epoch 30, loss: 0.2868993878364563
In epoch 35, loss: 0.2574978470802307
In epoch 40, loss: 0.23335880041122437
In epoch 45, loss: 0.20876789093017578
In epoch 50, loss: 0.1869775354862213
In epoch 55, loss: 0.16242516040802002
In epoch 60, loss: 0.1356896162033081
In epoch 65, loss: 0.11280933022499084
In epoch 70, loss: 0.08605014532804489
In epoch 75, loss: 0.054293327033519745
In epoch 80, loss: 0.02977687492966652
In epoch 85, loss: 0.013111919164657593
In epoch 90, loss: 0.005503080785274506
In epoch 95, loss: 0.0021488661877810955


In [9]:
# ----------- 5. check results ------------------------ #
pred = tf.sigmoid(tf.reduce_sum(tf.gather(logits, test_u) *
                                tf.gather(logits, test_v), axis=1)).numpy()
print('Accuracy', ((pred >= 0.5) == test_label.numpy()).sum().item() / len(pred))

Accuracy 0.8
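
The accuracy above is obtained by thresholding each predicted score at 0.5 and comparing the result with the label. A minimal NumPy sketch of that computation on made-up numbers follows. The helper name `binary_accuracy` and the toy arrays are hypothetical and are not results from the model.

```python
import numpy as np

def binary_accuracy(pred, label, threshold=0.5):
    """Fraction of predictions whose thresholded value matches the 0/1 label."""
    return float(np.mean((pred >= threshold) == (label == 1)))

# Made-up scores and labels for four candidate edges.
toy_pred = np.array([0.9, 0.2, 0.6, 0.4])
toy_label = np.array([1.0, 0.0, 0.0, 0.0])

acc = binary_accuracy(toy_pred, toy_label)
print("Accuracy", acc)
```

In this toy case three of the four thresholded predictions match their labels, giving an accuracy of 0.75.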
