This tutorial is adapted from [WikiNet — An Experiment in Recurrent Graph Neural Networks](https://medium.com/stanford-cs224w/wikinet-an-experiment-in-recurrent-graph-neural-networks-3f149676fbf3) by Alexander Hurtado.

# WikiNet

WikiNet tackles the target prediction problem on the Wikispeedia dataset: given the sequence of articles a player has clicked, the task is to predict the final target article the player is searching for. The code below covers the model definition, training, and evaluation for the experiments.

While the dependencies install, you can try the game here: [wikispeedia](https://dlab.epfl.ch/wikispeedia/play/)

More details about the dataset can be found [here](https://snap.stanford.edu/data/wikispeedia.html).

We begin by installing the necessary libraries and downloading the dataset.

In [2]:
!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu111.html
!pip install torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu111.html
!pip install torch-geometric
!pip install class-resolver

!wget --no-cache https://github.com/alexanderjhurtado/cs224w_wikinet/raw/main/colab_starter_pack/graph_with_features.gml.zip
!wget --no-cache https://github.com/alexanderjhurtado/cs224w_wikinet/raw/main/colab_starter_pack/paths_and_labels.tsv
!unzip -o /content/graph_with_features.gml.zip

Here, we import all libraries that will be used by the code.

In [18]:
import json
import pandas as pd
import time
import networkx as nx
from torch_geometric.utils import from_networkx

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCN, GAT, GraphSAGE
from torch.utils.data import Dataset, DataLoader

In [8]:
# Getting the dataset (can be skipped if the install cell above already downloaded it).
# The /raw/ URLs are used because the /blob/ URLs return GitHub's HTML page instead of the file.
!wget --no-cache -O graph_with_features.gml.zip https://github.com/alexanderjhurtado/cs224w_wikinet/raw/main/colab_starter_pack/graph_with_features.gml.zip
!wget --no-cache -O paths_and_labels.tsv https://github.com/alexanderjhurtado/cs224w_wikinet/raw/main/colab_starter_pack/paths_and_labels.tsv
!unzip -o graph_with_features.gml.zip

In [9]:
nx_graph = nx.read_gml('graph_with_features.gml')
G = from_networkx(nx_graph, group_node_attrs=['out_degree', 'in_degree', 'category_multi_hot', 'article_embed'])
path_data = pd.read_csv('paths_and_labels.tsv', sep='\t', header=None)

In [40]:
print(G)

Data(edge_index=[2, 119882], article=[4604], weight=[119882], x=[4604, 445])


In [41]:
print(G.num_node_features)

445


In [42]:
print(G.num_nodes)

4604


In [43]:
print(G.num_edges)

119882


In [44]:
print(G.num_edge_features)

0


In [45]:
G.x.shape

torch.Size([4604, 445])

In [46]:
print(G.edge_index.shape)

torch.Size([2, 119882])


The following function is called after each training epoch and again at test time. It computes the model's accuracy and loss on the validation and test datasets.

In [47]:
def get_evaluation_metrics(model, device, dataloader, dataset_size):
    model.eval()
    avg_loss = 0
    num_correct = 0
    with torch.no_grad():
        for i, data in enumerate(dataloader):
            # get data
            inputs, labels = data['indices'].to(device), data['label'].to(device)
            outputs = model(inputs)
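            # get loss (nll_loss averages over the batch by default)
            loss = F.nll_loss(outputs, labels)
            avg_loss += loss.item()
            # get accuracy
            pred = outputs.argmax(dim=1)
            num_correct += (pred == labels).sum().item()
    acc = num_correct / dataset_size
    # NOTE: this divides a sum of per-batch mean losses by the number of samples,
    # so the reported "loss" is roughly the per-sample loss divided by the batch size.
    avg_loss /= dataset_size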
    return acc, avg_loss
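
The bookkeeping in `get_evaluation_metrics` adds up per-batch mean losses and then divides by the number of samples, so the value it returns is on a different scale from an ordinary per-sample loss. The following sketch works through that arithmetic on made-up numbers. `summed_batch_means_over_n` is a hypothetical helper, and the 2048 identical losses are an illustrative assumption, not data from the notebook.

```python
import math


def summed_batch_means_over_n(per_sample_losses, batch_size):
    """Mimic the notebook's bookkeeping: add each batch's mean loss, then divide by N."""
    total = 0.0
    for start in range(0, len(per_sample_losses), batch_size):
        batch = per_sample_losses[start:start + batch_size]
        total += sum(batch) / len(batch)   # mean loss of this batch
    return total / len(per_sample_losses)


# Hypothetical data: 2048 samples, each with loss ln(4604), which is the loss of
# a uniform guess over the 4604 articles in the graph.
uniform_loss = math.log(4604)
losses = [uniform_loss] * 2048

reported = summed_batch_means_over_n(losses, batch_size=1024)
per_sample = sum(losses) / len(losses)

print("reported value:  ", reported)
print("per-sample loss: ", per_sample)
print("reported * 1024: ", reported * 1024)
```

With full batches of 1024, multiplying the reported value by 1024 recovers the per-sample loss. This is why the training and validation logs in this notebook show values near 0.007 rather than values near ln(4604) ≈ 8.43; the correspondence is only approximate on real data because the last batch is usually smaller.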

This defines the dataset class we use to represent the path data.

In [48]:
class CustomPathDataset(Dataset):
    def __init__(self, path_data):
        self.x = path_data[0].apply(json.loads)
        self.labels = path_data[1]
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        x = torch.LongTensor(self.x[idx])
        label = self.labels[idx]
        sample = {"indices": x, "label": label}
        return sample

Here, we set up the `train / val / test` split as `90 / 5 / 5` and define the batch size. The learning rate and the optimizer (Adam) are set later, when each model is created.

In [49]:
# get the dataset + splits
dataset = CustomPathDataset(path_data)
train_size = int(0.9 * len(dataset))
test_size = int(0.05 * len(dataset))
val_size = len(dataset) - train_size - test_size
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, val_size, test_size])

# set up for training + validation
batch_size = 1024
trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
validloader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
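
The split above computes the validation size as the remainder, which guarantees that the three sizes add up to the dataset length even when the fractions do not divide evenly. A small sketch of that arithmetic, using a hypothetical helper `split_sizes` and an illustrative dataset length of 1001 (not the real one):

```python
def split_sizes(n, train_frac=0.9, test_frac=0.05):
    """Return (train, val, test) sizes; val absorbs the rounding remainder."""
    train = int(train_frac * n)   # rounds down
    test = int(test_frac * n)     # rounds down
    val = n - train - test        # whatever is left over
    return train, val, test


sizes = split_sizes(1001)
print(sizes, sum(sizes))
```

`random_split` requires integer lengths to add up to `len(dataset)`, so computing one size by subtraction avoids an off-by-one mismatch.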

In [50]:
next(iter(trainloader))['indices'].shape

torch.Size([1024, 32])

In [51]:
next(iter(trainloader))['indices']

tensor([[  -1,   -1,   -1,  ..., 1281, 1433, 3021],
        [  -1,   -1,   -1,  ...,  978, 1281, 3011],
        [  -1,   -1,   -1,  ...,  393, 1025,  357],
        ...,
        [  -1,   -1,   -1,  ..., 1036, 1349, 1354],
        [  -1,   -1,   -1,  ..., 3823, 2022,  982],
        [  -1,   -1,   -1,  ..., 1281, 3810,  128]])

In [52]:
next(iter(trainloader))

{'indices': tensor([[  -1,   -1,   -1,  ..., 3335, 1246, 3780],
         [  -1,   -1,   -1,  ..., 4297, 4017, 1119],
         [  -1,   -1,   -1,  ..., 2949, 2098, 2114],
         ...,
         [  -1,   -1,   -1,  ..., 4091, 3458, 2205],
         [  -1,   -1,   -1,  ...,  915,  907, 3196],
         [  -1,   -1,   -1,  ..., 1528,  590, 1238]]),
 'label': tensor([ 899, 3797, 2554,  ..., 1091, 4362, 1428])}

In [53]:
next(iter(trainloader))['indices'][0]

tensor([  -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,
          -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,
          -1,   -1,   -1,   -1,   -1,   -1, 1694, 4297])
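
Each path in the batches above is a row of 32 article indices, left-padded with `-1` so that the most recent clicks sit at the end. The TSV file already stores paths in this form; the sketch below only illustrates how such a row could be built. `left_pad` is a hypothetical helper, and truncating longer paths to their last 32 clicks is an assumption rather than something the notebook states.

```python
def left_pad(path, length=32, pad_value=-1):
    """Left-pad (or left-truncate) a list of article indices to a fixed length."""
    recent = path[-length:]   # keep at most the last `length` clicks
    return [pad_value] * (length - len(recent)) + recent


# The two real indices from the sample row shown above.
row = left_pad([1694, 4297])
print(len(row), row[:3], row[-2:])
```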

## Baseline Model

This is the class definition for the baseline model, an LSTM. Run this cell to be able to train the baseline model.

In [95]:
class Baseline(torch.nn.Module):
    def __init__(self, graph, device, sequence_path_length=32, lstm_hidden_size=32):
        super().__init__()
        # print("graph.x.shape", graph.x.shape)
        self.graphX = graph.x.to(device)

        self.graphEdgeIndex = graph.edge_index.to(device)

        self.lstm_input_size = self.graphX.shape[1]

        self.lstm = nn.LSTM(input_size=self.lstm_input_size,
                            hidden_size=lstm_hidden_size,
                            batch_first=True)

        self.batch_norm_lstm = nn.BatchNorm1d(sequence_path_length)

        self.pred_head = nn.Linear(lstm_hidden_size, self.graphX.shape[0])

    def forward(self, indices):
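        node_emb = self.graphX
        # Append a zero row so that the padding index -1 selects an all-zero feature vector
        node_emb_with_padding = torch.cat([node_emb, node_emb.new_zeros((1, self.lstm_input_size))])
        paths = node_emb_with_padding[indices]
        paths = self.batch_norm_lstm(paths)
        _, (h_n, _) = self.lstm(paths)
        # h_n has shape (1, batch, hidden); drop only the layer dimension
        predictions = self.pred_head(h_n.squeeze(0))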
        return F.log_softmax(predictions, dim=1)
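
The `forward` method relies on a detail of tensor indexing: it appends one all-zero row to the feature matrix, so the padding index `-1` selects that row, just as a negative index selects the last element of a Python list. A minimal sketch with a made-up 3-node, 2-feature matrix (the values are illustrative, not taken from the dataset):

```python
import torch

# Hypothetical feature matrix: 3 nodes, 2 features each.
node_emb = torch.tensor([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])

# Append one zero row; it becomes the last row, which index -1 selects.
node_emb_with_padding = torch.cat([node_emb, node_emb.new_zeros((1, 2))])

# One batch of two left-padded paths of length 4.
indices = torch.tensor([[-1, -1, 0, 2],
                        [-1, 1, 1, 0]])

paths = node_emb_with_padding[indices]   # shape: (batch, path length, features)
print(paths.shape)
print(paths[0])
```

The padded positions come out as zero vectors, while real indices pick up the corresponding node features.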

In [96]:
# set up the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Baseline(G, device).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

This is the training script. We train the model for 5 epochs and print the training loss, validation loss, validation accuracy, and time spent for each epoch.

Moreover, we train by running one batch through the model at a time and using the Negative Log Likelihood loss function. We also save the model weights for the best validation accuracy we see after an epoch. These weights will be used in the evaluation step.

In [97]:
MODEL_WEIGHT_PATH = "model_weights.pth"

best_acc = 0
training_losses = []
validation_losses = []
validation_accs = []
model.train()
for epoch in range(5):  # loop over the dataset multiple times
    print('Epoch:', epoch+1)
    model.train()
    epoch_loss = 0
    start_time = time.time()
    for i, data in enumerate(trainloader):
        # get the inputs; data is a dict with keys 'indices' and 'label'
        inputs, labels = data['indices'].to(device), data['label'].to(device)
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = model(inputs)
        loss = F.nll_loss(outputs, labels)
        epoch_loss += loss.item()
        loss.backward()
        optimizer.step()
    # validate epoch and print results
    training_losses.append(epoch_loss / train_size)
    print('Training Loss:', training_losses[-1])
    acc, valid_loss = get_evaluation_metrics(model, device, validloader, val_size)
    validation_losses.append(valid_loss)
    validation_accs.append(acc)
    if acc > best_acc:
        torch.save(model.state_dict(), MODEL_WEIGHT_PATH)
        best_acc = acc
    print("Validation accuracy:", acc)
    print("Validation loss:", valid_loss)
    print('Time elapsed:', time.time() - start_time)
    print()

Epoch: 1
Training Loss: 0.007172050284736
Validation accuracy: 0.11267056530214425
Validation loss: 0.007454270368431047
Time elapsed: 37.720823764801025

Epoch: 2
Training Loss: 0.005631540172583682
Validation accuracy: 0.14892787524366471
Validation loss: 0.006440677605641981
Time elapsed: 35.28937840461731

Epoch: 3
Training Loss: 0.004937904951513219
Validation accuracy: 0.16803118908382067
Validation loss: 0.005960288057085599
Time elapsed: 36.43272161483765

Epoch: 4
Training Loss: 0.004559457469979362
Validation accuracy: 0.18323586744639375
Validation loss: 0.0056781476933588995
Time elapsed: 35.318153619766235

Epoch: 5
Training Loss: 0.0043047392508307625
Validation accuracy: 0.1968810916179337
Validation loss: 0.005473834775809424
Time elapsed: 37.035537004470825
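
The model returns `F.log_softmax(...)` and the training loop feeds that into `F.nll_loss`. This pairing computes the same quantity as `F.cross_entropy` applied to the raw scores, which the sketch below checks on random, made-up logits (the batch of 4 samples and the 10 classes are arbitrary choices for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # hypothetical raw scores: 4 samples, 10 classes
labels = torch.tensor([1, 0, 7, 3])   # hypothetical target classes

# Path used in the notebook: log-probabilities followed by negative log likelihood.
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Equivalent single call on the raw scores.
loss_ce = F.cross_entropy(logits, labels)

print(torch.allclose(loss_nll, loss_ce))
```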



This code runs evaluation on the test dataset. In particular, it uses the weights from the best validation accuracy to obtain the test accuracy.

This cell will print out the "loss" and accuracy on the testing dataset.

In [98]:
# load the weights saved at the best validation accuracy
model.load_state_dict(torch.load(MODEL_WEIGHT_PATH, map_location=device))
model.eval()
acc, test_loss = get_evaluation_metrics(model, device, testloader, test_size)
print("Test accuracy:", acc)
print("Test loss:", test_loss)

Test accuracy: 0.20124804992199688
Test loss: 0.0055526839031630115
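
The pattern described above (keep the weights from the epoch with the best validation accuracy, then reload them before testing) can be sketched in isolation. Everything here is a stand-in: the tiny `nn.Linear` model, the hard-coded validation accuracies, and the temporary file path are assumptions for illustration only.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # stand-in for the real model
weight_path = os.path.join(tempfile.mkdtemp(), "model_weights.pth")

fake_val_accs = [0.10, 0.30, 0.20]   # made-up accuracies, one per "epoch"
best_acc, best_epoch = 0.0, None
for epoch, acc in enumerate(fake_val_accs):
    with torch.no_grad():
        model.weight.fill_(float(epoch))   # pretend training changed the weights
    if acc > best_acc:                     # save only when validation improves
        torch.save(model.state_dict(), weight_path)
        best_acc, best_epoch = acc, epoch

# Reload the best checkpoint before evaluating on the test set.
model.load_state_dict(torch.load(weight_path))
print(best_epoch, model.weight[0, 0].item())
```

Without the reload, the test metrics would come from the last epoch's weights, which match the best checkpoint only when validation accuracy improved in every epoch.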


## Graph Neural Network

This is the class definition for the Graph Neural Network-based model. To use GCN or GAT instead, replace `self.gnn = GraphSAGE(...)` with `self.gnn = GCN(...)` or `self.gnn = GAT(...)`, respectively. The arguments are the same for all three models.

In [99]:
class Model(torch.nn.Module):
    def __init__(self, graph, device, sequence_path_length=32, gnn_hidden_size=128, node_embed_size=64, lstm_hidden_size=32):
        super().__init__()
        self.graphX = graph.x.to(device)
        self.graphEdgeIndex = graph.edge_index.to(device)

        self.gnn = GraphSAGE(in_channels=self.graphX.shape[1],
                       hidden_channels=gnn_hidden_size,
                       num_layers=3,
                       out_channels=node_embed_size,
                       dropout=0.1)

        self.batch_norm_lstm = nn.BatchNorm1d(sequence_path_length)
        self.batch_norm_linear = nn.BatchNorm1d(lstm_hidden_size)
        self.lstm_input_size = node_embed_size
        self.lstm = nn.LSTM(input_size=self.lstm_input_size,
                            hidden_size=lstm_hidden_size,
                            batch_first=True)
        self.pred_head = nn.Linear(lstm_hidden_size, self.graphX.shape[0])

    def forward(self, indices):
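        # Recompute node embeddings for the whole graph on every forward pass
        node_emb = self.gnn(self.graphX, self.graphEdgeIndex)
        # Append a zero row so that the padding index -1 selects an all-zero embedding
        node_emb_with_padding = torch.cat([node_emb, node_emb.new_zeros((1, self.lstm_input_size))])
        paths = node_emb_with_padding[indices]
        paths = self.batch_norm_lstm(paths)
        _, (h_n, _) = self.lstm(paths)
        # h_n has shape (1, batch, hidden); drop only the layer dimension
        h_n = self.batch_norm_linear(h_n.squeeze(0))
        predictions = self.pred_head(h_n)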
        return F.log_softmax(predictions, dim=1)

In [100]:
# set up the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model(G, device).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

We train by running one batch through the model at a time and using the Negative Log Likelihood loss function. We also save the model weights for the best validation accuracy we see after an epoch. These weights will be used in the evaluation step.

In [102]:
best_acc = 0
training_losses = []
validation_losses = []
validation_accs = []
model.train()
for epoch in range(5):  # loop over the dataset multiple times
    print('Epoch:', epoch+1)
    model.train()
    epoch_loss = 0
    start_time = time.time()
    for i, data in enumerate(trainloader):
        # get the inputs; data is a dict with keys 'indices' and 'label'
        inputs, labels = data['indices'].to(device), data['label'].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = F.nll_loss(outputs, labels)
        epoch_loss += loss.item()
        loss.backward()
        optimizer.step()
    # validate epoch and print results
    training_losses.append(epoch_loss / train_size)
    print('Training Loss:', training_losses[-1])
    acc, valid_loss = get_evaluation_metrics(model, device, validloader, val_size)
    validation_losses.append(valid_loss)
    validation_accs.append(acc)
    if acc > best_acc:
        torch.save(model.state_dict(), MODEL_WEIGHT_PATH)
        best_acc = acc
    print("Validation accuracy:", acc)
    print("Validation loss:", valid_loss)
    print('Time elapsed:', time.time() - start_time)
    print()

Epoch: 1
Training Loss: 0.006116383719465971
Validation accuracy: 0.15789473684210525
Validation loss: 0.006488299788090221
Time elapsed: 39.034395694732666

Epoch: 2
Training Loss: 0.004774074928028133
Validation accuracy: 0.19922027290448344
Validation loss: 0.005681933761804889
Time elapsed: 37.760902643203735

Epoch: 3
Training Loss: 0.004126869289601351
Validation accuracy: 0.22690058479532163
Validation loss: 0.005213679114745142
Time elapsed: 37.95190382003784

Epoch: 4
Training Loss: 0.0037679500016150765
Validation accuracy: 0.253411306042885
Validation loss: 0.004903466706155104
Time elapsed: 38.64852023124695

Epoch: 5
Training Loss: 0.003509110148529424
Validation accuracy: 0.2764132553606238
Validation loss: 0.004712960687529506
Time elapsed: 37.983099937438965



This code runs evaluation on the test dataset. In particular, it uses the weights from the best validation accuracy to obtain the test accuracy.

This cell will print out the "loss" and accuracy on the testing dataset.

In [104]:
# load the weights saved at the best validation accuracy
model.load_state_dict(torch.load(MODEL_WEIGHT_PATH, map_location=device))
model.eval()
acc, test_loss = get_evaluation_metrics(model, device, testloader, test_size)
print("Test accuracy:", acc)
print("Test loss:", test_loss)

Test accuracy: 0.2578003120124805
Test loss: 0.004841907719926045


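The GNN-based model (GraphSAGE followed by an LSTM) performed better than the LSTM baseline: after 5 epochs it reached about 25.8% test accuracy, compared with about 20.1% for the baseline.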