# 0. Overview of this Notebook

In this notebook, we train several models on the ogbn-arxiv dataset for node property prediction (node classification). We start with GCN and Label Propagation, and work our way up to the GCN-LPA model. Along the way, we highlight connections between the mathematical formulation of the algorithms and our PyTorch implementation.

Much of the code is borrowed and reworked from code on the OGB leaderboard, including the OGB authors' example code for GCN and Horace He's Label Propagation code.

In [82]:
import argparse

import torch
import torch.nn.functional as F

import torch_geometric.transforms as T
import torch_sparse
from torch_geometric.nn import GCNConv, SAGEConv

from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

In [None]:
## TODO: Replace this with argparse code that is compatible with Jupyter Notebook

class args:
    hidden_channels = 256
    num_layers = 3
    dropout = 0.5
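
One possible way to resolve the TODO above, sketched under the assumption that the notebook only needs these three hyperparameters: build an ordinary argparse parser, but hand `parse_args` an explicit list so that it does not read `sys.argv` (which, inside a Jupyter kernel, holds kernel-launch arguments that argparse would reject). The helper name `build_args` is ours and purely illustrative; the flag names simply mirror the class attributes above.

```python
import argparse

def build_args(argv=None):
    """Parse hyperparameters; argv defaults to an empty list (all defaults)."""
    parser = argparse.ArgumentParser(description="ogbn-arxiv (GCN)")
    parser.add_argument("--hidden_channels", type=int, default=256)
    parser.add_argument("--num_layers", type=int, default=3)
    parser.add_argument("--dropout", type=float, default=0.5)
    # Passing an explicit list keeps argparse away from sys.argv.
    return parser.parse_args([] if argv is None else argv)

args = build_args()
print(args.hidden_channels, args.num_layers, args.dropout)
```

To override a value from within the notebook, pass the flags as strings, for example `build_args(["--dropout", "0.3"])`.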

# 1. Setup PyTorch and Load the Dataset

In [103]:
## Setup PyTorch
device = torch.device('cpu')

## Load the dataset
dataset = PygNodePropPredDataset(name='ogbn-arxiv', transform=T.ToSparseTensor())
data = dataset[0]
# ogbn-arxiv is a directed citation graph; to_symmetric() adds the reverse of
# every edge so the graph is treated as undirected (as in the OGB example code)
data.adj_t = data.adj_t.to_symmetric()
data = data.to(device)

split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]

## Connection to Mathematical Notation

The following maps the mathematical notation to the dataset variables in our code.

The number of nodes $n \in \mathbb{N}$ is
$$ n = \text{data.num_nodes} $$

The number of features for each node $F \in \mathbb{N}$ is
$$ F = \text{data.num_features} $$

The number of possible labels $c \in \mathbb{N}$ is
$$ c = \text{dataset.num_classes} $$

The number of labeled nodes $m \in \mathbb{N}$, $m < n$ is
$$ m = \text{train_idx.shape[0]} $$

The feature matrix $X \in \mathbb{R}^{n \times F}$ is the torch.Tensor object
$$ X = \text{data.x} $$

The label vector $Y \in \{0, \dots, c-1\}^n$ is the torch.Tensor object (stored as a column of shape $[n, 1]$)
$$ Y = \text{data.y} $$

The adjacency matrix $A \in \mathbb{R}^{n \times n}$ is the torch_sparse.SparseTensor object
$$ A = \text{data.adj_t} $$

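Note that $A$ here is not the raw citation graph. T.ToSparseTensor() stores the transposed adjacency matrix as data.adj_t, and the call to to_symmetric() then makes it symmetric, so the graph is treated as undirected. The self-loops and degree normalization that GCN uses are not stored in data.adj_t; GCNConv applies them internally by default.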
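
To make the effect of `data.adj_t.to_symmetric()` in the loading cell more concrete, here is a small dense illustration of what symmetrizing an adjacency matrix means. It is a toy sketch on made-up data, not torch_sparse's actual implementation, and it only looks at which edges exist (no edge weights).

```python
import torch

# Toy directed graph on 3 nodes with edges 0->1 and 1->2 (hypothetical data).
A_directed = torch.tensor([[0, 1, 0],
                           [0, 0, 1],
                           [0, 0, 0]])

# Keep an edge if it exists in either direction, so that A_sym equals its transpose.
A_sym = ((A_directed + A_directed.T) > 0).long()

print(A_sym)
```

After this step every citation link can pass messages both ways, which is the undirected setting that the symmetric normalization in GCN assumes.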

## Helpful Reference Code

Below is some reference code. These variables are otherwise unused (we use the data variable directly, to stay parallel with the OGB example code); they are here only to make the connection to the mathematical notation concrete.

In [104]:
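## Conversion to mathematical notation
# Define X, Y, A first, since n, f, and c are derived from them
X = data.x
Y = data.y
A = data.adj_t
n = X.shape[0]
f = X.shape[1]  # F in the math above; lowercase because F is torch.nn.functional
c = torch.unique(Y).numel()
m = train_idx.shape[0]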

## Helpful tests for reference
assert isinstance(X, torch.Tensor)
assert isinstance(Y, torch.Tensor)
assert isinstance(A, torch_sparse.SparseTensor)
assert isinstance(n, int)
assert isinstance(f, int)
assert isinstance(c, int)
assert isinstance(m, int)
assert X.shape == torch.Size([n, f])
assert Y.shape == torch.Size([n, 1])
assert A.sizes() == [n, n] # NOTE: SparseTensor doesn't have a shape attribute!

# 2. GCN (Graph Convolutional Network)
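
For reference, the layer-wise propagation rule of a GCN (Kipf and Welling), written in the notation of Section 1, is commonly stated as

$$ H^{(l+1)} = \sigma\left( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)} \right), \qquad H^{(0)} = X $$

where $\hat{A} = A + I$ is the adjacency matrix with self-loops added, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$, $W^{(l)}$ is the learned weight matrix of layer $l$, and $\sigma$ is a nonlinearity. In the implementation below, we assume each GCNConv layer carries out the normalized aggregation $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)}$ (its default behavior), while batch normalization, ReLU, and dropout are applied separately in forward.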



In [105]:
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers, 
                dropout):
        super(GCN, self).__init__()

        self.convs = torch.nn.ModuleList()
        self.convs.append(GCNConv(in_channels, hidden_channels, cached=True))
        self.bns = torch.nn.ModuleList()
        self.bns.append(torch.nn.BatchNorm1d(hidden_channels))
        for _ in range(num_layers - 2):
            self.convs.append(
                GCNConv(hidden_channels, hidden_channels, cached=True))
            self.bns.append(torch.nn.BatchNorm1d(hidden_channels))
        self.convs.append(GCNConv(hidden_channels, out_channels, cached=True))

        self.dropout = dropout

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

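    def forward(self, x, adj_t):
        # Hidden layers: graph convolution -> batch norm -> ReLU -> dropout
        for i, conv in enumerate(self.convs[:-1]):
            x = conv(x, adj_t)
            x = self.bns[i](x)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        # Output layer: final convolution, then log-probabilities over the classes
        x = self.convs[-1](x, adj_t)
        return x.log_softmax(dim=-1)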
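
To connect a single graph convolution to the matrix form of GCN, here is a minimal dense sketch of one propagation step, $\hat{D}^{-1/2} (A + I) \hat{D}^{-1/2} H W$, on a toy graph. It assumes a small dense adjacency matrix and made-up inputs; `gcn_propagate` is a hypothetical helper for illustration only, not how GCNConv is implemented (GCNConv works on sparse inputs and also has a bias term).

```python
import torch

def gcn_propagate(A, H, W):
    """One dense GCN step: D^{-1/2} (A + I) D^{-1/2} H W, without nonlinearity."""
    A_hat = A + torch.eye(A.shape[0])   # add self-loops
    deg = A_hat.sum(dim=1)              # node degrees, self-loops included
    d_inv_sqrt = deg.pow(-0.5)          # safe: every degree is at least 1
    # Scale rows and columns by D^{-1/2} (symmetric normalization).
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return A_norm @ H @ W

# Toy undirected path graph 0 - 1 - 2 with 2 features per node (hypothetical data).
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
H = torch.ones(3, 2)
W = torch.eye(2)

out = gcn_propagate(A, H, W)
print(out.shape)
```

In the GCN class above, each hidden layer follows a step like this with batch normalization, ReLU, and dropout.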

In [113]:
model = GCN(data.num_features, args.hidden_channels,
            dataset.num_classes, args.num_layers,
            args.dropout).to(device)

NameError: name 'args' is not defined

In [114]:
print(dataset.num_classes)

40
