In [86]:
'''
Based on https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html

- useful:
1. https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-7d2250723780
2. https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html
3. https://arxiv.org/pdf/1609.02907.pdf
'''
from torch_geometric.nn import GCNConv
import torch
from torch_geometric.utils import dense_to_sparse
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.datasets import Planetoid

In [3]:
dataset = Planetoid(root='/tmp/Cora', name='Cora')

In [78]:
dataset.data.edge_index[0][:10], dataset.data.edge_index[1][:10]

(tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 2]),
 tensor([ 633, 1862, 2582,    2,  652,  654,    1,  332, 1454, 1666]))

In [None]:
# TODO : Understand the structure of the dataset; it is a graph.
# TODO : Understand GCNConv layer from PyTorch Geometric. -> https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html#the-messagepassing-base-class
# https://tkipf.github.io/graph-convolutional-networks/

## 1. Understanding GCN Layers

### Self loops

To add self loops one adds the identity matrix to the *adjacency matrix* **A**:


$
A \to \tilde{A} = A + I
$

### Normalization

A major limitation is that **A** is typically not normalized and therefore the multiplication with **A** will completely change the scale of the feature vectors.

Normalizing **A** such that all rows sum to one, i.e. $D^{-1}A$, where **D** is the diagonal node degree matrix, gets rid of this problem. Multiplying with $D^{-1}A$ now corresponds to taking the average of neighboring node features. In practice, dynamics get more interesting when we use a symmetric normalization, i.e. $D^{-1/2}AD^{-1/2}$.

$$
D_{ij} =  \delta_{ij}\sum_{k}A_{kj} \implies D^{-1/2}_{ij} = \frac{1}{\sqrt{ \sum_{k}A_{kj}}}\delta_{ij}
$$
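A minimal sketch of both normalizations on a toy graph. The 3-node path graph 0 - 1 - 2 is an assumption for illustration and not part of the notebook's data; the names `A`, `row_norm` and `sym_norm` are only illustrative.

```python
import torch

# Toy undirected path graph 0 - 1 - 2 (an assumption for illustration).
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])

deg = A.sum(dim=1)                      # node degrees: the diagonal of D

# Row normalization D^{-1} A: every row sums to one (neighbor average).
row_norm = torch.diag(1.0 / deg) @ A

# Symmetric normalization D^{-1/2} A D^{-1/2}: entries A_ij / sqrt(d_i * d_j).
d_inv_sqrt = torch.diag(deg.pow(-0.5))
sym_norm = d_inv_sqrt @ A @ d_inv_sqrt

print(row_norm.sum(dim=1))                    # row sums of D^{-1} A
print(torch.allclose(sym_norm, sym_norm.T))   # is the symmetric version symmetric?
```

The sketch divides by the degree, so it assumes there are no isolated nodes. Once self-loops are added every degree is at least 1 and the problem disappears.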

After including self-loops and normalization we are left with:
$$
H^{(i+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(i)}W^{(i)}\right)
$$
where the weights $W$ are shared between all the nodes (similarly to a CNN on a grid) $\implies$ two nodes that are far apart but have similar neighboring features and structures should be classified similarly.
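The propagation rule can be sketched with dense matrices in a few lines. This is only an illustration under assumptions: `gcn_layer_dense` is a made-up helper name, $\sigma$ is taken to be ReLU, and the toy graph and the constant weights are arbitrary (in practice $W$ is learned).

```python
import torch

def gcn_layer_dense(A, H, W):
    """One GCN propagation step with a dense adjacency matrix (illustrative helper)."""
    A_tilde = A + torch.eye(A.size(0))          # add self-loops
    deg_tilde = A_tilde.sum(dim=1)              # degrees including the self-loop
    d_inv_sqrt = torch.diag(deg_tilde.pow(-0.5))
    A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt   # symmetric normalization
    return torch.relu(A_hat @ H @ W)            # sigma = ReLU in this sketch

A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
H = torch.eye(3)                 # 3 nodes with one-hot input features
W = torch.ones(3, 2)             # maps 3 features to 2 (fixed here, learned in practice)

H_next = gcn_layer_dense(A, H, W)
print(H_next.shape)              # torch.Size([3, 2])
```

The number of rows (nodes) is preserved, while the number of columns changes from 3 to 2 because of the shape of `W`.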

[PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html) defines **message passing** through the following equation:

$$
\mathbf{x}_i^{(k)} = \gamma^{(k)} \left( \mathbf{x}_i^{(k-1)}, \square_{j \in \mathcal{N}(i)} \, \phi^{(k)}\left(\mathbf{x}_i^{(k-1)}, \mathbf{x}_j^{(k-1)},\mathbf{e}_{j,i}\right) \right),
$$

with $\mathbf{x}^{(k-1)}_i \in \mathbb{R}^F$, $\mathbf{e}_{j,i} \in \mathbb{R}^D$ and:

- $\mathbf{x}^{(k-1)}_i \rightarrow$ features of node $i$ in layer $(k-1)$.
- $\mathbf{e}_{j,i} \rightarrow$ edge features from node $j$ to node $i$.
- $\square \rightarrow$ denotes a differentiable, permutation invariant function, e.g., sum, mean or max.
- $\phi \rightarrow$ is a differentiable function such as an MLP (Multi Layer Perceptron). In the [code implementation](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/nn/conv/gcn_conv.html#GCNConv) of the GCN Layer it corresponds to the function **message()**.
- $\gamma \rightarrow$ is a differentiable function such as an MLP (Multi Layer Perceptron). In the [code implementation](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/nn/conv/message_passing.html#MessagePassing) of the **MessagePassing** base class it corresponds to the function **update()**.
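The three ingredients can be made concrete without the library. Below is a minimal sketch in plain PyTorch, assuming sum aggregation and edges stored as `(source j, target i)` columns. `propagate_sketch` and its arguments are hypothetical names, not the PyTorch Geometric API.

```python
import torch

def propagate_sketch(x, edge_index, message, update):
    """x: [num_nodes, F]; edge_index: [2, num_edges] with rows (source j, target i)."""
    src, dst = edge_index
    msgs = message(x[dst], x[src])                 # phi(x_i, x_j), one row per edge
    aggregated = torch.zeros(x.size(0), msgs.size(1), dtype=msgs.dtype)
    aggregated.index_add_(0, dst, msgs)            # square = sum over incoming edges
    return update(x, aggregated)                   # gamma(x_i, aggregated messages)

# Toy path graph 0 - 1 - 2, each undirected edge stored in both directions.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
x = torch.tensor([[1.], [2.], [3.]])

out = propagate_sketch(
    x, edge_index,
    message=lambda x_i, x_j: x_j,                  # pass the neighbor feature on unchanged
    update=lambda x_i, agg: agg,                   # identity update
)
print(out)   # node 0 receives x_1, node 1 receives x_0 + x_2, node 2 receives x_1
```

Swapping the `message` or `update` lambdas (or replacing the sum by a mean or max) gives other members of the same family of layers.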

For more clarity we can write the indices that are not shown in the PyTorch Geometric documentation as *Greek letters*.

A simple example could be:
$
x^{(k)}_{i,\alpha} = \sigma\left(A_{ij}x^{(k-1)}_{j,\beta}W^{(k-1)}_{\beta\alpha}\right)
$

Where $\sigma$ is a non-linear activation function such as the ReLU function.

$$
x^{(k)}_{i,\alpha}
\quad
   \begin{cases}
      k, & \text{labels the layer}\\
      i, & \text{indexes the nodes}\\
      \alpha, & \text{indexes the features}
    \end{cases}
$$

$A$ is a square matrix, because we want to preserve the number of nodes. $W$, instead, can be a non-square matrix, so that the number of features per node can change from one layer to the next.


- $\square \rightarrow$ the sum over the neighbors of node $i$, $\sum_{j \in \mathcal{N}(i)}$, which is what the multiplication with $A$ implements $\left(= \sum_j A_{ij}\,(\cdot)_j\right)$
- $\phi \rightarrow$ $\phi^{(k)}\left(\mathbf{x}_i^{(k-1)}, \mathbf{x}_j^{(k-1)},\mathbf{e}_{j,i}\right) \equiv \phi^{(k)}\left(\mathbf{x}_j^{(k-1)}\right) = \mathbf{x}_j^{(k-1)}W^{(k-1)} \left(= x^{(k-1)}_{j,\beta}W^{(k-1)}_{\beta\alpha}\right)$
- $\gamma \rightarrow$ the activation $\sigma$ applied to the aggregated messages; it does not use $\mathbf{x}_i^{(k-1)}$ directly.

Where in the parentheses I have included all the indices and used the Einstein summation convention.
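The index notation translates directly into `torch.einsum`. A small sketch, assuming random toy tensors with 4 nodes, 3 input features and 2 output features (all sizes are arbitrary choices for illustration) and ReLU as $\sigma$:

```python
import torch

torch.manual_seed(0)
A = torch.randint(0, 2, (4, 4)).float()   # toy adjacency, A_ij
x = torch.randn(4, 3)                      # x_{j,beta}: 4 nodes, 3 features
W = torch.randn(3, 2)                      # W_{beta,alpha}: 3 -> 2 features

# Repeated indices j and beta (written b) are summed over (Einstein convention).
x_next = torch.relu(torch.einsum('ij,jb,ba->ia', A, x, W))

# The same contraction written as matrix products.
x_next_matmul = torch.relu(A @ x @ W)

print(x_next.shape)                        # torch.Size([4, 2])
```

The free indices `i` and `a` on the right of `->` are exactly the node and feature indices of $x^{(k)}_{i,\alpha}$.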

Let's work it out for the particular example of the GCN Layer:


$$
\mathbf{x}_i^{(k)} = \sum_{j \in \mathcal{N}(i) \cup \{ i \}} \frac{1}{\sqrt{\deg(i)} \cdot \sqrt{\deg(j)}} \cdot \left( \mathbf{\Theta} \cdot \mathbf{x}_j^{(k-1)} \right)\, ,\quad \left(=  \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} x^{(k-1)}\mathbf{\Theta}\right)
$$

where $\deg(i)$ is the degree of node $i$, counted on $\tilde{A}$ (i.e. including the self-loop) so that it matches $\tilde{D}$ $\rightarrow$ [Degree (graph theory)](https://en.wikipedia.org/wiki/Degree_(graph_theory))

- $\square \rightarrow$  $\sum_{j \in \mathcal{N}(i) \cup \{ i \}}$
- $\phi \rightarrow$ $\phi^{(k)}\left(\mathbf{x}_i^{(k-1)}, \mathbf{x}_j^{(k-1)},\mathbf{e}_{j,i}\right) \equiv \phi^{(k)}\left(\mathbf{x}_i^{(k-1)}, \mathbf{x}_j^{(k-1)}\right) =  \frac{1}{\sqrt{\deg(i)} \cdot \sqrt{\deg(j)}} \cdot \left( \mathbf{\Theta} \cdot \mathbf{x}_j^{(k-1)} \right)$
- $\gamma \rightarrow$ is just the identity function, basically it does nothing.

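Note that $\phi$ does not map one-to-one onto *message* in the PyTorch Geometric code: the product with the weights $\mathbf{\Theta}$ and the computation of the normalization coefficients both happen before *message* is called (in *forward*), and *message* only scales each already-transformed neighbor feature by its coefficient. Mathematically the layer is still: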

$$
\mathbf{x}_i^{(k)} = \square_{j \in \mathcal{N}(i) \cup \{ i \}} \, \phi^{(k)}\left(\mathbf{x}_i^{(k-1)}, \mathbf{x}_j^{(k-1)}\right) = \sum_{j \in \mathcal{N}(i) \cup \{ i \}} \frac{1}{\sqrt{\deg(i)} \cdot \sqrt{\deg(j)}} \cdot \left( \mathbf{\Theta} \cdot \mathbf{x}_j^{(k-1)} \right)
$$
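A quick way to check that the sum over neighbors and the matrix form agree is to compute both on a toy graph. The sketch below is plain PyTorch and not the PyTorch Geometric implementation. It assumes that degrees are counted after adding self-loops and that `edge_index` stores `(source j, target i)`; the helper name `gcn_propagate_sketch` is made up for this note.

```python
import torch

def gcn_propagate_sketch(x, edge_index, theta):
    n = x.size(0)
    loops = torch.arange(n)
    src = torch.cat([edge_index[0], loops])         # j, with self-loops appended
    dst = torch.cat([edge_index[1], loops])         # i
    deg = torch.zeros(n).index_add_(0, dst, torch.ones(dst.size(0)))
    norm = deg[dst].pow(-0.5) * deg[src].pow(-0.5)  # 1 / (sqrt(deg(i)) * sqrt(deg(j)))
    h = x @ theta                                   # Theta applied to every node once
    out = torch.zeros(n, theta.size(1))
    out.index_add_(0, dst, norm.unsqueeze(1) * h[src])  # sum of scaled messages
    return out

# Toy path graph 0 - 1 - 2, each undirected edge stored in both directions.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
torch.manual_seed(0)
x = torch.randn(3, 4)
theta = torch.randn(4, 2)

# Dense reference: D~^{-1/2} A~ D~^{-1/2} x Theta
A_tilde = torch.eye(3)
A_tilde[edge_index[1], edge_index[0]] = 1.0
d_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
dense = d_inv_sqrt @ A_tilde @ d_inv_sqrt @ x @ theta

edge_wise = gcn_propagate_sketch(x, edge_index, theta)
print(torch.allclose(edge_wise, dense, atol=1e-5))
```

Here the linear transformation and the coefficients are computed once, outside the per-edge step, and the per-edge step only scales and sums.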

## 2. Exploring the dataset

In [122]:
dataset??

In [118]:
dataset.data??

In [109]:
dataset.data.x.nonzero()

tensor([[   0,   19],
        [   0,   81],
        [   0,  146],
        ...,
        [2707, 1328],
        [2707, 1412],
        [2707, 1414]])

In [115]:
torch.unique(dataset.data.x)

tensor([0., 1.])

In [114]:
dataset.data.x.histc(bins=2)

tensor([3831348.,   49216.])
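The two bin counts already say a lot about the features. A small arithmetic sketch using only the counts printed above and the shape `[2708, 1433]` of `dataset.data.x`:

```python
# Counts copied from the histc output: [number of zeros, number of ones].
num_zeros, num_ones = 3831348, 49216
num_nodes, num_features = 2708, 1433      # shape of dataset.data.x

total = num_zeros + num_ones
density = num_ones / total                # fraction of non-zero entries
ones_per_node = num_ones / num_nodes      # average number of active features per node

print(total == num_nodes * num_features)  # the two bins cover every entry
print(f'density: {density:.4f}, active features per node: {ones_per_node:.1f}')
```

So roughly 1.3% of the entries are non-zero, which is about 18 active features per node on average: the feature matrix is binary and very sparse.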

In [111]:
dataset.data.x[2707, 1328]

tensor(1.)

- Data set structure:

In [116]:
dataset.__dict__

{'name': 'Cora',
 'root': '/tmp/Cora',
 'transform': None,
 'pre_transform': None,
 'pre_filter': None,
 '__indices__': None,
 'data': Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708]),
 'slices': {'x': tensor([   0, 2708]),
  'edge_index': tensor([    0, 10556]),
  'y': tensor([   0, 2708]),
  'train_mask': tensor([   0, 2708]),
  'val_mask': tensor([   0, 2708]),
  'test_mask': tensor([   0, 2708])},
 'split': 'public'}

- Data structure:

In [73]:
dataset.data.__dict__

{'x': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 'edge_index': tensor([[   0,    0,    0,  ..., 2707, 2707, 2707],
         [ 633, 1862, 2582,  ...,  598, 1473, 2706]]),
 'edge_attr': None,
 'y': tensor([3, 4, 4,  ..., 3, 3, 3]),
 'pos': None,
 'norm': None,
 'face': None,
 'train_mask': tensor([ True,  True,  True,  ..., False, False, False]),
 'val_mask': tensor([False, False, False,  ..., False, False, False]),
 'test_mask': tensor([False, False, False,  ...,  True,  True,  True])}

In [60]:
print('Node feature matrix with shape [num_nodes, num_node_features]: ', dataset.data.x.shape)
print('\nGraph connectivity in COO format with shape [2, num_edges]: ', dataset.data.edge_index.shape)
print('\nNumber of classes: ', dataset.num_classes)

Node feature matrix with shape [num_nodes, num_node_features]:  torch.Size([2708, 1433])

Graph connectivity in COO format with shape [2, num_edges]:  torch.Size([2, 10556])

Number of classes:  7


In [64]:
max(dataset.data.edge_index[0]), max(dataset.data.edge_index[1])

(tensor(2707), tensor(2707))
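Since `edge_index` is in COO format, each column is one directed edge `(source, target)`, so an undirected graph stores every edge twice. That appears to be the case here: both `(1, 2)` and `(2, 1)` show up among the first ten edges printed earlier. A minimal sketch on a hypothetical 4-node graph (not Cora) that rebuilds the dense adjacency matrix and the node degrees from such an edge list:

```python
import torch

# Hypothetical undirected graph with edges 0-1, 1-2, 2-3, stored in both directions.
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])

# Assumes the node with the largest index has at least one edge.
num_nodes = int(edge_index.max()) + 1

# Dense adjacency matrix: A[source, target] = 1 for every column of edge_index.
A = torch.zeros(num_nodes, num_nodes)
A[edge_index[0], edge_index[1]] = 1.0

# Node degree = number of edges leaving each node.
deg = torch.bincount(edge_index[0], minlength=num_nodes)

print(torch.equal(A, A.T))   # True for an undirected graph
print(deg)                   # tensor([1, 2, 2, 1])
```

For Cora the dense matrix would be 2708 x 2708 with only 10556 non-zero entries, which is why the sparse COO representation is used instead.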

## 3. The model in PyTorch

In [47]:
GCNConv??

In [101]:
class Net(torch.nn.Module):
    def __init__(self, n_hidden=16, n_output=1):
        super(Net, self).__init__()
        self.conv1 = GCNConv(dataset.num_node_features, n_hidden)
        self.conv2 = GCNConv(n_hidden, n_output)

    def forward(self, data):
        '''
        x (Tensor): Node feature matrix. Shape [num_nodes, num_node_features]
        edge_index (LongTensor): Graph connectivity in COO format. Shape [2, num_edges]
        '''
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

In [124]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net(n_output=dataset.num_classes).to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

In [54]:
data

Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])

In [135]:
model._modules['conv1']._parameters['weight'][:3]

tensor([[-0.0053, -0.0312,  0.0071, -0.0277,  0.0302, -0.0191,  0.0528,  0.0823,
         -0.0631,  0.0729,  0.0596,  0.0298,  0.0294, -0.0091,  0.0151,  0.0407],
        [-0.0509,  0.2361, -0.0555,  0.2472, -0.0025,  0.1550, -0.0704, -0.0828,
         -0.0008,  0.0061, -0.0575, -0.0340,  0.0319, -0.0304,  0.1290, -0.1260],
        [-0.0963,  0.0383, -0.0828, -0.0552,  0.0221,  0.0051,  0.0631,  0.1872,
         -0.0259, -0.0057,  0.1380,  0.1858, -0.1658,  0.0753, -0.0261, -0.0763]],
       grad_fn=<SliceBackward>)

In [128]:
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
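Two details of the loop are easy to miss: the loss only uses the nodes selected by `train_mask`, and `F.nll_loss` on `log_softmax` outputs is the same as cross-entropy on the raw scores. A toy sketch with made-up logits, labels and mask (none of these values come from Cora):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)                          # 5 nodes, 3 classes (made up)
y = torch.tensor([0, 2, 1, 1, 0])
train_mask = torch.tensor([True, True, False, False, True])

log_probs = F.log_softmax(logits, dim=1)            # what the model's forward returns

# Loss over the training nodes only, as in the loop above.
loss_masked = F.nll_loss(log_probs[train_mask], y[train_mask])

# Equivalent formulation on the raw scores.
loss_ce = F.cross_entropy(logits[train_mask], y[train_mask])

print(torch.allclose(loss_masked, loss_ce))
```

The forward pass still runs on the whole graph, so unlabeled nodes contribute their features through the convolutions even though they do not enter the loss.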

In [121]:
model(data).shape

torch.Size([2708, 7])

In [None]:
model.eval()
with torch.no_grad():  # no gradients are needed for evaluation
    pred = model(data).argmax(dim=1)  # predicted class per node
correct = int((pred[data.test_mask] == data.y[data.test_mask]).sum())
acc = correct / int(data.test_mask.sum())
print('Accuracy: {:.4f}'.format(acc))