## Optional expert task 09.3: Generalisation of Convolution in Graph Neural Networks

ITU KSADMAL1KU - Advanced Machine Learning for Computer Science 2023

by Stefan Heinrich, with material by Kevin Murphy.

This notebook is based on material by Emanuele Rodolà, Luca Moschella, and Antonio Norelli.

All info and static material: https://learnit.itu.dk/course/view.php?id=3022225

-------------------------------------------------------------------------------

*Note: the notebook includes a good amount of reading material and code! You are not supposed to understand it all, but follow the questions from the tasks and the inline hints closely.*

In [None]:
# @title #### import dependencies
from __future__ import print_function, division

!sudo pip install python-igraph
!sudo pip install plotly

from IPython.display import HTML

from typing import Mapping, Union, Optional, List
from pathlib import Path

import os
import math
import pickle
import numpy as np
import argparse

import torch
import torch.nn as nn
from torch.nn import Module
import torch.nn.functional as F
import torch.optim as optim

import networkx as nx
from sklearn.metrics import classification_report

import plotly as py
import plotly.graph_objects as go
import plotly.express as px
import igraph
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import torchvision
from torchvision import datasets, models, transforms

from tqdm.notebook import tqdm
from tqdm import trange

In [None]:
# @title #### for reproducibility
# Whenever we use randomness, we should make sure that our results are still reproducible, so we use fixed random seeds.

import random
np.random.seed(4)
random.seed(4)

torch.cuda.manual_seed(4)
torch.manual_seed(4)
torch.backends.cudnn.deterministic = True  # Note that this Deterministic mode can have a performance impact
torch.backends.cudnn.benchmark = False

### Introduction: Digesting *a priori* the information in the structure 

We have already learned how to take the structure information into account in a simple case: when the domain $\Omega$ is the Euclidean $\mathbb{R}^2$, i.e. when the relations between variables are represented by a 2-dimensional grid, such as the pixels of an image.

![immagine griglia 2d](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/09/pics/euclidean.PNG) 

The filters of a convolutional neural network process each pixel by considering only its neighbors, and are applied with the same weights all over the image. In this way **CNNs are digesting *a priori* the structure information**, using the properties of the domain -- the translational invariance and the locality of neighbours -- to **reduce the free parameters of the model**. This leads to a crucial speed-up of the training process and allows larger and more powerful models.

CNNs can be naturally extended to general Euclidean domains $\mathbb{R}^n$, but what can we do when $\Omega \neq \mathbb{R}^n$?

In many different fields such as Biology, Physics, Social Sciences and Computer Graphics we have to process signals defined on non-Euclidean domains, such as Graphs $\Omega = G(\mathcal{V, E})$ or [Manifolds](https://en.wikipedia.org/wiki/Manifold) $\Omega = \mathcal{X}$, where $\mathcal{V} = \left\{1,2,\dots,n\right\}$ are vertices or nodes and $\mathcal{E}\subseteq \mathcal{V}\times \mathcal{V}$ are edges or connections.

We would like to come up with a solution analogous to CNNs: digesting *a priori* the structure information to reduce the free parameters has proved to be very convenient in a learning setting based on gradient descent optimization.


#### Representing non-Euclidean data in a Euclidean memory

Working with non-Euclidean domains provides a further challenge in representing the data. 

In the Euclidean setting, encoding the data in ordered matrices, vectors or tensors is so natural and effective that we do not even think of alternatives, and indeed the computer memory structure is itself Euclidean.

In the non-Euclidean setting we have many alternatives. Consider for instance a manifold $\mathcal{X}$: it can be represented by a triangle mesh with vertices, edges and faces $\mathcal{V,E,F}$, by an n-polygonal mesh, where we also admit non-triangular faces, by a simple point cloud or by a subdivision surface.

And even when we have chosen the representation, we still have to come up with a final encoding procedure to store our data in matrices and tensors, as required by the Euclidean structure of the physical memory in our computers. Is it always going to be this way? Maybe one day we will have new pieces of hardware structured as graphs, like human brains.


#### More background: Generalizing the convolution operation

In this section we will explore the Spectral convolution, a generalization of the convolution operation to non-Euclidean domains.



Given two functions $f, g: [-\pi, \pi] \rightarrow \mathbb{R}$ their convolution is defined as:

  $$
  (f\star g)(x) = \int_{-\pi}^\pi f(x') g(x-x') dx' 
  $$

Convolution is a linear operation, so in the discrete case it can be written as a matrix multiplication.


The convolution of two vectors $\mathbf{f} = (f_1, \cdots, f_{n})^\top$ and $\mathbf{g} = (g_1, \cdots, g_{n})^\top$ is:


$$
\mathbf{f}\star \mathbf{g}
=
\mathbf{G} \mathbf{f} =
\begin{pmatrix}
g_{1} & g_2 & \cdots & \cdots & g_{n} \\
g_{n} & g_1 & g_2 & \cdots & g_{n-1} \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
g_{3} & g_{4} &  \cdots &g_1 & g_2 \\
g_{2} & g_{3} &  \cdots &\cdots & g_1 \\
\end{pmatrix}
\begin{pmatrix}
f_1\\
\vdots\\
f_n\\
\end{pmatrix}
$$


Notice that this is a circulant matrix, i.e. each row vector is rotated one element to the right relative to the preceding row vector. 
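
As a small numerical illustration, the sketch below builds a circulant matrix with the same layout as $\mathbf{G}$ above (each row is the previous one rotated one step to the right) and checks that multiplying by it matches an explicit index-by-index sum. The helper name `circulant_from_first_row` and the example vectors are hypothetical and chosen only for this sketch.

```python
import numpy as np

def circulant_from_first_row(g):
    """Stack rotated copies of g: row k is g rotated k steps to the right."""
    return np.stack([np.roll(g, k) for k in range(len(g))])

# Hypothetical example vectors
g = np.array([1.0, 2.0, 0.0, -1.0])
f = np.array([3.0, 0.0, 1.0, 2.0])

G = circulant_from_first_row(g)
out_matrix = G @ f

# Same computation written as an explicit sum over indices (0-based):
# entry (i, j) of G is g[(j - i) mod n]
n = len(g)
out_loop = np.array(
    [sum(g[(j - i) % n] * f[j] for j in range(n)) for i in range(n)]
)

print(G)
print(np.allclose(out_matrix, out_loop))
```

The whole operator is determined by the $n$ entries of $\mathbf{g}$, which is the weight sharing that convolutional layers rely on.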

If a linear operator admits an eigendecomposition, we can express it in the basis of its eigenvectors $V = \{v_1, \cdots, v_n\}$ through a diagonal matrix.

It turns out that all circulant matrices are diagonalized by the same basis $\Phi_\mathcal{E} = \{\phi_1, \cdots, \phi_n\}$:

$$
\mathbf{G}
=
\begin{pmatrix}
g_{1} & g_2 & \cdots & \cdots & g_{n} \\
g_{n} & g_1 & g_2 & \cdots & g_{n-1} \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
g_{3} & g_{4} &  \cdots &g_1 & g_2 \\
g_{2} & g_{3} &  \cdots &\cdots & g_1 \\
\end{pmatrix}
=
\Phi_\mathcal{E}
\begin{pmatrix}
\hat{g_1} & &\\
& \ddots &\\
& & \hat{g_n}\\
\end{pmatrix}
\Phi^\top_\mathcal{E}
$$

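This basis $\Phi_\mathcal{E}$ is very special: it is the discretized Fourier basis in the Euclidean domain. Since its vectors are complex, $\Phi^\top_\mathcal{E}$ must be read as the conjugate transpose, and the vectors are taken up to a normalization factor $1/\sqrt{n}$: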

$$
\Phi_\mathcal{E} = \{\phi_1, \cdots, \phi_n\}
\;\;\;\;\;\;
\text{with}
\;\;\;\;\;\;
\phi_k = 
\begin{pmatrix}
w_n^{0k} \\
w_n^{1k} \\
w_n^{2k} \\
\vdots \\
w_n^{(n-1)k}
\end{pmatrix}
\;\;\;\;\;\;
\text{and}
\;\;\;\;\;\;
w_n^{jk}=e^{\frac{2 \pi i}{n}jk}
$$

The expression of $\mathbf{G}$ as $\Phi_\mathcal{E} \mathbf{\hat{G}} \Phi^\top_\mathcal{E}$ will be our bridge towards non-Euclidean domains.
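
The diagonalization can be checked numerically. The sketch below is only an illustration under two assumptions that make the check work in practice: the Fourier vectors are normalized by $1/\sqrt{n}$, and the conjugate transpose is used because the basis is complex. The random vector `g` is a hypothetical example.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)
g = rng.normal(size=n)  # hypothetical filter values

# Circulant matrix with the same layout as G in the text
G = np.stack([np.roll(g, k) for k in range(n)])

# Column k of Phi is phi_k with entries w_n^{jk} = exp(2*pi*i*j*k/n),
# scaled by 1/sqrt(n) so that Phi is unitary
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
Phi = np.exp(2j * np.pi * j * k / n) / np.sqrt(n)

# Express G in the Fourier basis
G_hat = Phi.conj().T @ G @ Phi
off_diag = G_hat - np.diag(np.diag(G_hat))

print(np.abs(off_diag).max())  # numerically zero
# With this layout the diagonal equals n * ifft(g)
print(np.allclose(np.diag(G_hat), n * np.fft.ifft(g)))
```

Any other choice of `g` gives a different diagonal but the same basis, which is the property the text relies on.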

 

In fact we know a **generalization of the Fourier basis to graphs and manifolds**, the eigenvectors $\Phi$ of the Laplacian operator:

$$\Delta = \Phi \Lambda \Phi^\top$$ 

where $\Lambda$ is the diagonal matrix containing the eigenvalues of the Laplacian.

![fourier laplacian](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/09/pics/fourier.png) 
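
To make the decomposition concrete on a graph, here is a minimal sketch. It assumes the combinatorial Laplacian $\Delta = D - A$ (degree matrix minus adjacency matrix), which is one common definition and not necessarily the one used later in this notebook; the toy edge list is hypothetical.

```python
import numpy as np

# Hypothetical toy graph on 4 nodes (undirected)
edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
n = 4

A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

D = np.diag(A.sum(axis=1))
L = D - A  # assumption: combinatorial graph Laplacian

# L is symmetric, so eigh returns real eigenvalues (ascending)
# and orthonormal eigenvectors as the columns of Phi
eigvals, Phi = np.linalg.eigh(L)

L_rebuilt = Phi @ np.diag(eigvals) @ Phi.T
print(np.allclose(L, L_rebuilt))
print(eigvals)
```

Because this graph Laplacian is real and symmetric, $\Phi$ is real and orthogonal, so the plain transpose $\Phi^\top$ is enough here.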

On these non-Euclidean domains, the idea is to first calculate the eigenvectors of the Laplacian, which constitute the generalized Fourier basis $\Phi$, and then **define the convolution operator** as:

$$
\mathbf{W}
=
\Phi
\begin{pmatrix}
\hat{w_1} & &\\
& \ddots &\\
& & \hat{w_n}\\
\end{pmatrix}
\Phi^\top
$$

where $\hat{w_i}$ are learnable parameters.

Notice that in the Euclidean case this expression coincides with the standard convolution defined above, since the eigenvectors of the Laplacian in that case *are* the Euclidean Fourier basis. This is a desired property.

However, this definition has several drawbacks:
- The coefficients $\hat{w_i}$ depend on the basis $\Phi$, so learned filters do not generalize across domains: the addition of a single node to a graph, or the small differences in a mesh after a change of pose, can change the basis $\Phi$ drastically. For instance, a convolutional filter with parameters $\hat{w_i}$ tuned to spot edges behaves completely differently on a slightly different domain.

![conv_broken](https://raw.githubusercontent.com/lucmos/DLAI-s2-2020-tutorials/master/09/pics/broken_filter.png) 
- The number of trainable parameters per filter depends on $n$, the size of the domain. We want a convolutional filter with a fixed number of parameters like in the Euclidean case.
- Since the trainable parameters $\hat{w_i}$ are not properly constrained, there is a high chance that the learned filter is not localized in space, as seen in the lecture slides.

To address these problems we put some constraints on the matrix $\mathbf{\hat{W}}$, parametrizing it in a different way.

Instead of having one degree of freedom per element of the diagonal ($n$ learnable parameters), we substitute $\mathbf{\hat{W}}$ with the fixed eigenvalues of the Laplacian $\Lambda$ altered by a single parametrized transformation $\tau_\alpha(\lambda)$, which depends on a fixed number of learnable parameters $\alpha$.
Our new $\mathbf{W}$ will be:

$$
\mathbf{W}
=
\Phi
\begin{pmatrix}
\tau_\alpha(\lambda_1) & &\\
& \ddots &\\
& & \tau_\alpha(\lambda_n)\\
\end{pmatrix}
\Phi^\top
$$
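
As an illustration of why this parametrization helps, the sketch below assumes one possible choice of $\tau_\alpha$, a polynomial $\tau_\alpha(\lambda) = \sum_k \alpha_k \lambda^k$. This choice, the combinatorial Laplacian $D - A$, the path graph and the values in `alpha` are all assumptions made for this example. With a polynomial, the spectral operator equals $\sum_k \alpha_k \Delta^k$, so it uses a fixed number of parameters and only mixes nodes within a few hops.

```python
import numpy as np

# Path graph on 6 nodes: 0 - 1 - 2 - 3 - 4 - 5 (hypothetical example)
n = 6
A = np.zeros((n, n))
for u in range(n - 1):
    A[u, u + 1] = A[u + 1, u] = 1.0
L = np.diag(A.sum(axis=1)) - A  # assumption: combinatorial Laplacian

eigvals, Phi = np.linalg.eigh(L)

# Hypothetical coefficients of a degree-2 polynomial tau_alpha
alpha = np.array([0.5, -1.0, 0.25])

def tau(lam, alpha):
    """Evaluate sum_k alpha[k] * lam**k element-wise."""
    return sum(a * lam**k for k, a in enumerate(alpha))

# Spectral definition: Phi diag(tau(lambda_i)) Phi^T
W_spectral = Phi @ np.diag(tau(eigvals, alpha)) @ Phi.T

# Equivalent spatial form: sum_k alpha[k] * L^k (no eigendecomposition needed)
W_spatial = sum(a * np.linalg.matrix_power(L, k) for k, a in enumerate(alpha))

print(np.allclose(W_spectral, W_spatial))
print(np.round(W_spectral[0], 3))  # zero beyond 2 hops from node 0
```

The number of parameters (three here) does not depend on $n$, and the filter is localized: entries linking nodes more than two hops apart vanish.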

In the second part of this tutorial we will explore an application of this generalized convolution operation on graphs. We will define the Laplacian operator on graphs, then we will compute its eigendecomposition and define a proper transformation $\tau_\alpha$.

### GCN & CORA

The code in the following sections comes mainly from [this repository](https://github.com/tkipf/pygcn), a PyTorch implementation of Graph Convolutional Networks by Thomas Kipf, and it is inspired by the paper [Semi-Supervised Classification with Graph Convolutional Networks](https://arxiv.org/abs/1609.02907) by Thomas Kipf and Max Welling.


#### The CORA dataset

The CORA dataset is much bigger than the Karate graph; the task is again node classification.

In the CORA graph:
- Each **node** is a Machine Learning paper. 
- An **edge** represents one citation from one paper to another.
- Each node is classified into one of seven possible Machine Learning sub-fields:
 - Case Based
 - Genetic Algorithms
 - Neural Networks
 - Probabilistic Methods
 - Reinforcement Learning
 - Rule Learning
 - Theory

We will use a subset of the CORA dataset, preprocessed as suggested by [Thomas Kipf](https://github.com/tkipf/pygcn/tree/master/data/cora):

- It considers only papers that are cited or cite at least once.
- The words are [stemmed](https://en.wikipedia.org/wiki/Stemming)
- Stopwords and infrequent words are removed.

This subset contains **2708** papers with **1433** unique words.


--- 

[Here](http://networkrepository.com/cora.php) you can have fun exploring the complete CORA dataset, and many other graph datasets.

In [None]:
classes = [
    'Case_Based',
    'Genetic_Algorithms',
    'Neural_Networks',
    'Probabilistic_Methods',
    'Reinforcement_Learning',
    'Rule_Learning',
    'Theory',
]

##### Dataset format

In [None]:
# The preprocessed CORA is contained in this repository under ./data/cora
!git clone https://github.com/tkipf/pygcn.git

The directory `pygcn/data/cora` contains two files: `cora.content` and `cora.cites`.

The `cora.content` contains the description of each node. For each line it contains:
 - The id of the node.
 - The [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) text representation.
 - The label of that node.

In [None]:
import pandas
headers = ['PaperID'] + [f'word{i}' for i in range(1433)] + ['label']
pandas.read_csv('pygcn/data/cora/cora.content', sep="\t", names=headers)

The `cora.cites` contains the relationships between nodes. For each line it contains:

- The first entry is the id of the cited paper.
- The second entry is the id of the citing paper.

That is, the direction of the entry is right to left.
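
To see how such an edge list turns into a matrix, here is a minimal sketch with three made-up `(cited, citing)` pairs. The paper IDs are hypothetical and this is not the loader used by the repository; it only illustrates the direction convention described above.

```python
import numpy as np

# Hypothetical (cited, citing) pairs, in the same column order as cora.cites
cites = [(101, 205), (101, 307), (205, 307)]

# Map arbitrary paper IDs to consecutive row/column indices
paper_ids = sorted({pid for pair in cites for pid in pair})
index = {pid: i for i, pid in enumerate(paper_ids)}

n = len(paper_ids)
A = np.zeros((n, n), dtype=int)
for cited, citing in cites:
    # Directed edge from the citing paper (right column) to the cited one (left)
    A[index[citing], index[cited]] = 1

# Undirected version: keep an edge if it exists in either direction
A_sym = np.maximum(A, A.T)

print(A)
print(A_sym)
```

Whether to keep the direction or symmetrize is a modelling choice; the sketch shows both.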

In [None]:
import pandas
headers = ['Cited PaperID', 'Citing PaperID']
pandas.read_csv('pygcn/data/cora/cora.cites', sep="\t", names=headers)

##### Data loading

The repository provides python functions to parse the preprocessed data.

The `load_data` function returns the adjacency matrix, the node features, the label of each node, and the indices of the train, validation and test splits:


In [None]:
# Add the folder to the python path
import sys
sys.path.insert(0,'./pygcn/pygcn')

from utils import load_data
adj, features, labels, idx_train, idx_val, idx_test = load_data(path='pygcn/data/cora/')

In [None]:
# As expected its shape is num_paper*num_paper
adj.shape  

In [None]:
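# The adjacency matrix is... not just an adjacency matrix!
# It has been normalized (self-loops added, then rows scaled to sum to 1),
# so it plays the role of a normalized Laplacian-like operator.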
# You can see the function used to build and normalize the Laplacian here:
# https://github.com/tkipf/pygcn/blob/1600b5b748b3976413d1e307540ccc62605b4d6d/pygcn/utils.py#L56
print(torch.sum(adj.to_dense(), 0))
print(torch.sum(adj.to_dense(), 1))
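
The two printed sums can be understood with a small sketch. It assumes that the normalization adds self-loops and then divides each row by its sum, which is my reading of the linked function and should be checked against the repository; the 3-node graph is hypothetical.

```python
import numpy as np

# Hypothetical undirected path graph 0 - 1 - 2
A = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

A_tilde = A + np.eye(3)          # add self-loops
row_sums = A_tilde.sum(axis=1)   # degree of each node, self-loop included
A_hat = A_tilde / row_sums[:, None]  # assumption: row normalization

print(A_hat.sum(axis=1))  # every row sums to 1
print(A_hat.sum(axis=0))  # columns generally do not
```

Under this assumption, multiplying by the normalized matrix replaces each node's features with the average over the node and its neighbours.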

In [None]:
# Each paper has 1433 features. The BOW representation of its text
features.shape  

#### Utility functions
Ignore the methods in this section. You do not need to understand this code for the tasks; it is only used for some nicer visualisations.

In [None]:
# @title ##### Model training

def get_predictions(output, labels):
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    return correct

def accuracy(output, labels):
    correct = get_predictions(output, labels)
    correct = correct.sum()
    return correct / len(labels)


def plot_loss(losses):
    fig = go.Figure()

    fig.add_trace(go.Scatter(
        x=list(range(len(losses))),
        y=losses,
        # name="Name of Trace 1"       # this sets its legend entry
    ))

    fig.update_layout(
        title="Train loss",
        xaxis_title="Epoch",
        yaxis_title="Loss",
        font=dict(
            family="Courier New, monospace",
            size=18,
            color="#7f7f7f"
        )
    )
    return fig


In [None]:
# @title ##### Visualisations
plt_glayt = 'fr'

def refresh_bar(bar, desc):
    bar.set_description(desc)
    bar.refresh()

def plot_graph(adj, node_colors, colors_legend = classes, title='CORA graph', 
               layt = None):
    N = adj.shape[0]
    adj = adj.coalesce()
    edgeA, edgeB = adj.indices()[0, :], adj.indices()[1, :]
    edgeA = edgeA.tolist()
    edgeB = edgeB.tolist()

    G = igraph.Graph.Adjacency((adj.to_dense() > 0).tolist())
    if layt is None:
        layt=G.layout(plt_glayt, dim=3)

    Xn=[layt[k][0] for k in range(N)]# x-coordinates of nodes
    Yn=[layt[k][1] for k in range(N)]# y-coordinates
    Zn=[layt[k][2] for k in range(N)]# z-coordinates
    Xe=[]
    Ye=[]
    Ze=[]
    for e in zip(edgeA, edgeB):
        Xe+=[layt[e[0]][0],layt[e[1]][0], None]# x-coordinates of edge ends
        Ye+=[layt[e[0]][1],layt[e[1]][1], None]
        Ze+=[layt[e[0]][2],layt[e[1]][2], None]

    trace1=go.Scatter3d(x=Xe,
                y=Ye,
                z=Ze,
                mode='lines',
                line=dict(color='rgb(125,125,125)', width=1),
                hoverinfo='none'
                )

    trace2=go.Scatter3d(x=Xn,
                y=Yn,
                z=Zn,
                mode='markers',
                name='actors',
                marker=dict(symbol='circle',
                                size=6,
                                color=node_colors,
                                colorscale='Viridis',
                                line=dict(color='rgb(50,50,50)', width=0.5)
                                ),
                text=colors_legend,
                hoverinfo='text'
                )

    axis=dict(showbackground=False,
            showline=False,
            zeroline=False,
            showgrid=False,
            showticklabels=False,
            title=''
            )

    layout = go.Layout(
            title=title,
            width=800,
            height=800,
            showlegend=False,
            scene=dict(
                xaxis=dict(axis),
                yaxis=dict(axis),
                zaxis=dict(axis),
            ),
        margin=dict(
            t=100
        ),
        hovermode='closest', 
        )

    data=[trace1, trace2]
    fig=go.Figure(data=data, layout=layout)
    return fig, layt

#### CORA Graph

Let's visualize the graph!

As you can see, the graph has many disconnected components, many of which are very small. The preprocessing ensures that each component has at least 2 nodes.


Note that in the visualization, nodes with the same colors have the same label.


In [None]:
fig, layt = plot_graph(adj, labels)
fig

#### MLP approach

The simplest approach is to use a Multi Layer Perceptron on the features of each node, independently.

This means that we aim to predict the sub-field of each machine learning paper by looking exclusively at its text, encoded as a BOW. We are not considering the structure of the graph at all, i.e. the citations that link the papers.



In [None]:
def mlp_accuracy(model):
    """
    Perform a forward pass `y_pred = model(x)` and compute the accuracy
    between `y_pred` and `y_true`
    """
    model.eval()
    y_pred = model(features[idx_test])
    acc = accuracy(y_pred, labels[idx_test])
    print(f"Accuracy: {acc:.5}")
    return acc

In [None]:
# Model definition
mlp = nn.Sequential(nn.Linear(1433, 500), 
                    nn.ReLU(),
                    nn.Linear(500, 100),
                    nn.ReLU(),
                    nn.Linear(100, 7))

In [None]:
print("Accuracy before training")
_ = mlp_accuracy(mlp)

Cora graph visualization:
- Yellow nodes: correct predictions
- Blue nodes: wrong predictions

In [None]:
correct = get_predictions(mlp(features), labels)
fig, layt = plot_graph(adj, correct, title="MLP performance before training", layt=layt)
fig


In [None]:
opt = optim.Adam(mlp.parameters())

losses = []
mlp.train()

for epoch in trange(500):
    opt.zero_grad()
    output = mlp(features[idx_train])
    loss = F.cross_entropy(output, labels[idx_train])  # train only on the train samples
    loss.backward()
    opt.step()

    losses.append(loss.item())

plot_loss(losses)

In [None]:
print("Accuracy after training")
accmlp = mlp_accuracy(mlp)

Cora graph visualization:
- Yellow nodes: correct predictions
- Blue nodes: wrong predictions

In [None]:
correct = get_predictions(mlp(features), labels)
fig, layt = plot_graph(adj, correct, title="MLP performance after training", layt=layt)
fig


#### Graph convolutional network


We can define an equivalent of `nn.Linear` that also uses the adjacency matrix in the forward pass:

In [None]:
from torch.nn import Parameter

import math
class GraphConvolution(Module):

    def __init__(self, in_features, out_features):
        super(GraphConvolution, self).__init__()

        # An nn.Parameter is a normal tensor
        # that is automatically registered as a model parameter
        # so that it is included in `model.parameters()`.
        self.weight = Parameter(torch.FloatTensor(in_features, out_features))
        self.bias = Parameter(torch.FloatTensor(out_features))
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input, adj):
        support = torch.mm(input, self.weight)
        output = torch.spmm(adj, support)  # sparse matrix multiplication
        return output + self.bias

-------------------------------------------------------------------------------
#### **Tasks:**
At first, let's truly understand the GCN model definition.

* Did you notice anything strange in the forward pass?
* What is the shape of the adj matrix? What is the shape of the output?

Secondly, in contrast with this GCN, we can see that the MLP above, as expected, cannot learn much about the nodes that we didn't train on. Thus understand and discuss again:

* Why was the GCN able to achieve this?
-------------------------------------------------------------------------------

In [None]:
#@title Solution 👀

# We are performing the forward pass using the whole graph!
# 
# We are not doing any batching, 
# since we need to perform a matrix multiplication 
# w.r.t the whole adjacency matrix!
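
The shapes involved can be checked on a toy example. The sketch below restates the forward pass of `GraphConvolution` with dense NumPy arrays instead of sparse PyTorch tensors; the 4-node row-normalized matrix `adj` and the random features and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, in_features, out_features = 4, 3, 2

# Hypothetical normalized adjacency of a 4-node path graph with self-loops
adj = np.array([
    [1/2, 1/2, 0.0, 0.0],
    [1/3, 1/3, 1/3, 0.0],
    [0.0, 1/3, 1/3, 1/3],
    [0.0, 0.0, 1/2, 1/2],
])
x = rng.normal(size=(N, in_features))                   # one feature row per node
weight = rng.normal(size=(in_features, out_features))
bias = np.zeros(out_features)

support = x @ weight              # (N, out_features): per-node linear map
output = adj @ support + bias     # (N, out_features): mix along the edges

# Node 0 only sees itself and node 1
row0_by_hand = 0.5 * support[0] + 0.5 * support[1]

print(adj.shape, output.shape)
print(np.allclose(output[0], row0_by_hand))
```

Every output row depends on the features of neighbouring nodes, including nodes without training labels, which is why the layer needs the full `(N, N)` matrix at once.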

These layers can be combined together to build complex models.

In [None]:
class GCN(nn.Module):
    def __init__(self, nfeat, nhid, nclass):
        super(GCN, self).__init__()
        self.gc1 = GraphConvolution(nfeat, nhid)
        self.gc2 = GraphConvolution(nhid, nclass)

    def forward(self, x, adj):
        x = F.relu(self.gc1(x, adj))
        x = self.gc2(x, adj)
        return x

In [None]:
def gcn_accuracy(model):
    """
    Perform a forward pass `y_pred = model(x)` and compute the accuracy
    between `y_pred` and `y_true`.

    It is particularly tricky to perform batching in GCN.
    As you can see, here the forward pass is performed on the whole graph
    """
    model.eval()
    y_pred = model(features, adj)  # Do you notice the difference?
    acc = accuracy(y_pred[idx_test], labels[idx_test]) 
    print(f"Accuracy: {acc:.5}")
    return acc

In [None]:
gcn = GCN(1433, 50, 7)

print("Accuracy before training")
_ = gcn_accuracy(gcn)

#### Visualization

Cora graph visualization:
- Yellow nodes: correct predictions
- Blue nodes: wrong predictions

In [None]:
correct = get_predictions(gcn(features, adj), labels)
fig, layt = plot_graph(adj, correct, title="GCN performance before training", layt=layt)
fig


In [None]:
opt = optim.Adam(gcn.parameters())

losses = []
gcn.train()

for epoch in trange(1000):
    opt.zero_grad()
    output = gcn(features, adj)  # compute all outputs, even for the nodes in the test set
    loss = F.cross_entropy(output[idx_train], labels[idx_train])  # Train only on the train samples!
    loss.backward()
    opt.step()

    losses.append(loss.item())

plot_loss(losses)

In [None]:
print("Accuracy after training")
accgcn = gcn_accuracy(gcn)

In [None]:
correct = get_predictions(gcn(features, adj), labels)
fig, layt = plot_graph(adj, correct, title="GCN performance after training", layt=layt)
fig

In [None]:
def num_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad) 

print('Number of parameters MLP: ', num_params(mlp))
print('Number of parameters GCN: ', num_params(gcn))
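
The two counts can also be derived by hand from the layer sizes defined above. In the sketch below, the helper `layer_params` is a hypothetical name; it counts a weight matrix of shape `(n_in, n_out)` plus a bias of length `n_out`, which applies to both `nn.Linear` and the `GraphConvolution` layer defined in this notebook.

```python
def layer_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one layer."""
    return n_in * n_out + n_out

# MLP: 1433 -> 500 -> 100 -> 7
mlp_total = layer_params(1433, 500) + layer_params(500, 100) + layer_params(100, 7)

# GCN: 1433 -> 50 -> 7
gcn_total = layer_params(1433, 50) + layer_params(50, 7)

print(mlp_total)  # 767807
print(gcn_total)  # 72057
```

With these sizes the GCN has roughly a tenth of the MLP's parameters.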

#### Performance comparison

In [None]:
fig = go.Figure([go.Bar(x=['MLP', 'GCN'], y=[accmlp.item(), accgcn.item()])])
fig.update_layout(title='Performance comparison',
                  yaxis_title="Accuracy",  # a fraction in [0, 1], not a percentage
                  xaxis_title="Model type")
fig.show()

### **Credits**

- Deep Learning & Applied AI @Sapienza Course material by Emanuele Rodolà, Luca Moschella, and Antonio Norelli - [Original sources](https://erodola.github.io/DLAI-s2-2021/)
- Geometric deep learning [tutorial](https://vistalab-technion.github.io/cs236781/tutorials/tutorial_09/)
- Kipf T, Welling M. Semi-Supervised Classification with Graph Convolutional Networks (2016).
- Bronstein M. M., et al. (2017) Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Process Mag 34(4).