# Graph Neural Networks

From MLPs to GCNs and GATs.

In [None]:
# https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm

In [None]:
# create a new environment with Poetry
#!pip install poetry
!poetry init --no-interaction
!poetry add torch torchvision torchaudio torch-geometric matplotlib scikit-learn

The Cora dataset is a benchmark dataset for graph neural networks. The dataset contains data about 2708 scientific publications. These publications are the nodes of the graph. An edge between nodes (publications) is created when a publication references the other one. The target is to predict the subject of each paper, there are seven classes in total.

In [19]:
#!pip install torch-geometric
from torch_geometric.datasets import Planetoid

dataset= Planetoid(root='.', name='Cora', force_reload=True)
data= dataset[0]

Processing...
Done!


# Neural Network - MLP

In [200]:
"""
Activation functions implemented: relu, tanh.
"""

class MLP_Hidden(nn.Module):

    def __init__(self, input_dim, output_dim, layer_norm, activation, dropout) -> None:

        super(MLP_Hidden, self).__init__()
        self.fc_hn= nn.Linear(input_dim, output_dim)
        self.norm= None
        if layer_norm:
            self.norm= nn.LayerNorm(output_dim)

        if 'relu' in activation:
            self.activ= nn.ReLU()
        else:
            self.activ= nn.Tanh()

        self.dropout= None
        if dropout> 0.0:
            self.dropout= nn.Dropout(p=dropout)


    def forward(self, x):
        x= self.fc_hn(x)
        if self.norm is not None:
            x= self.norm(x)
        x= self.activ(x)
        if self.dropout is not None:
            x= self.dropout(x)

        return x



class MLP(nn.Module):

    def __init__(self, input_dim, hidden_dim=[16,], output_dim=1, layer_norm=False,
                 activation='relu', dropout=0.0) -> None:
        """
        Implements a customizable MLP.
        """

        super(MLP, self).__init__()
        if isinstance(hidden_dim, int):
            hidden_dim= [hidden_dim]
        n_hidden_layers= len(hidden_dim)

        if n_hidden_layers== 0:
            raise Exception('hidden_dim cannot be an empty list')

        self.fc_hnin= MLP_Hidden(input_dim, hidden_dim[0], layer_norm, activation, dropout)

        if n_hidden_layers> 1:
            self.fc_hn= nn.Sequential(*[
                MLP_Hidden(d, hidden_dim[i+1], layer_norm, activation, dropout)
                for i, d in enumerate(hidden_dim[:-1])
            ])
        else: self.fc_hn= None

        self.fc_hnout= nn.Linear(hidden_dim[-1], output_dim)


    def forward(self, x):  # no graph structure, only node features
        x= self.fc_hnin(x)
        if self.fc_hn is not None:
            x= self.fc_hn(x)
        x= self.fc_hnout(x)

        return F.log_softmax(x, dim=1)


In [82]:
data_size= data.x.shape[0]
dev_size= 500
test_size= 500
train_size= data_size - dev_size - test_size

train_mask= torch.tensor([i< train_size for i in range(data_size)])
dev_mask= torch.tensor([i>= train_size and i< (data_size - test_size) for i in range(data_size)])
test_mask= torch.tensor([i>= (train_size + dev_size) for i in range(data_size)])

data.train_mask= train_mask
data.val_mask= dev_mask
data.test_mask= test_mask

In [101]:
device= torch.device('cuda' if torch.cuda.is_available() else 'cpu')

data= data.to(device)

Xtr, Ytr= data.x[data.train_mask], data.y[data.train_mask]
Xdev, Ydev= data.x[data.val_mask], data.y[data.val_mask]
Xte, Yte= data.x[data.test_mask], data.y[data.test_mask]

edge_idx= data.edge_index

num_inputs= data.x.shape[1]          # used for input_dim
num_labels= len(set(data.y.numpy())) # used for output_dim

In [211]:
model= MLP(input_dim=num_inputs, hidden_dim=[32,], output_dim=num_labels,
           layer_norm=True, dropout=0.1).to(device)

total_params= sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of parameters: {total_params}')

Number of parameters: 46183


# Graph Convolutional Network - GCN

There are three common types of prediction tasks in graphs:
- You can predict on graph level. The input of the model is many different graphs, and every graph gets one classification. For example the class a molecule belongs to: every molecule is represented by one graph, and every molecule needs a prediction. Another example is image classification. Yes, images can also be represented as graphs!
- Another way to use GNNs is by predicting on node level. The input of the GNN is one graph, and every node needs a prediction. This prediction is a characteristic of the node. Node regression is of course possible as well. Compared to classification, you only need to change the output layer activation function, the loss function, evaluation metric, and obviously the target.
- Finally, we can predict on edge level. The value of an edge is predicted, or the likelihood of an edge that will appear soon. An example is recommended friends on social media (a.k.a. link prediction).

For understanding one node, we need to look at its neighborhood and include that information in the GNN.

There is one important step we should take before actually implementing a GNN, and that is normalization. Imagine, without normalization, nodes with more connections (e.g. one node having 10 neighbors vs. another with just 1) can dominate the learning process. The node with 10 neighbors would aggregate far more information than the one with 1, leading to imbalance and unstable learning. Normalization ensures that each node's contribution is appropriately scaled, so the network learns from the graph structure rather than being skewed by uneven data distribution.

In GNNs it's common to use symmetric normalization. The idea is to normalize each node's aggregated features by the square root of its degree (the number of neighbors, including itself for self-loops). This helps to ensure that nodes with different degrees contribute equally during aggregation.

In [205]:
import torch_geometric.nn as gnn

class GCN_Hidden(nn.Module):

    def __init__(self, input_dim, output_dim, activation) -> None:
        super(GCN_Hidden, self).__init__()
        self.conv_hn= gnn.GCNConv(input_dim, output_dim)
        if 'relu' in activation:
            self.activ= nn.ReLU()
        else:
            self.activ= nn.Tanh()


    def forward(self, x, e):
        x= self.conv_hn(x, e)
        x= self.activ(x)

        return x



class GCN(nn.Module):
    """
    Implementing a Graph Convolutional Network.
    """

    def __init__(self, input_dim, hidden_dim=[16,], output_dim=1, layer_norm=False,
                 activation='relu', dropout=0.0) -> None:
        super(GCN, self).__init__()
        if isinstance(hidden_dim, int):
            hidden_dim= [hidden_dim]
        n_hidden_layers= len(hidden_dim)

        if n_hidden_layers== 0:
            raise Exception('hidden_dim cannot be an empty list')

        self.conv_hnin= GCN_Hidden(input_dim, hidden_dim[0], activation)

        if n_hidden_layers> 1:
            self.conv_hn= nn.Sequential(*[
                GCN_Hidden(d, hidden_dim[i+1], activation) for i, d in enumerate(hidden_dim[:-1])
            ])
        else:
            self.conv_hn= None

        self.conv_hnout= gnn.GCNConv(hidden_dim[-1], output_dim)


    def forward(self, x, e):
        x= self.conv_hnin(x, e)
        if self.conv_hn is not None:
            x= self.conv_hn(x, e)
        x= self.conv_hnout(x, e)

        return F.log_softmax(x, dim=1)


In [206]:
model= GCN(input_dim=num_inputs, hidden_dim=[32,], output_dim=num_labels).to(device)

total_params= sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of parameters: {total_params}')

Number of parameters: 46119


Traditional neural networks can be efficiently batched during training. For graph neural networks, it's harder to batch the data because nodes have different neighbors, resulting in potentially uneven mini-batches. Efficient sampling techniques (like GraphSAGE) or mini-batch training are necessary for scalability.

In [208]:
# training procedure - we train 10 times and calculate the average accuracy and standard deviation

def supervised_training(model_class, learning_rate=1e-3, epochs=500, eval_interval=50, batches=True,
                        verbose=False):
    if batches:
        batch_size= 100
        epoch_size= round(Xtr.shape[0]/ batch_size)-1
    else:
        batch_size= Xtr.shape[0]
        epoch_size= 1

    results= []

    for i in tqdm(range(10)):
        if verbose: print(f'Training {model_class} iteration {i+1}')

        model= model_class(input_dim=num_inputs, hidden_dim=[32,], output_dim=num_labels,
                           layer_norm=True, dropout=0.1).to(device)

        # create a PyTorch optimizer
        optimizer= torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=5e-4)

        # loss function
        class_weights= torch.bincount(data.y) / len(data.y)
        loss_fn= nn.CrossEntropyLoss(weight=1/class_weights).to(device)

        # training loop
        for epoch in range(epochs):
            # iterating over all batches
            for i in range(epoch_size):
                # --- minibatch construction ---
                Xb= Xtr[(i * batch_size):((i+1) * batch_size)]
                Yb= Ytr[(i * batch_size):((i+1) * batch_size)]

                # --- forward pass ---
                if isinstance(model, MLP):
                    y_pred= model(Xb)
                elif isinstance(model, GCN):
                    y_pred= model(data.x, data.edge_index)[data.train_mask]
                tr_loss= loss_fn(y_pred, Yb)

                # --- backward pass ---
                model.train(True)
                optimizer.zero_grad()
                tr_loss.backward()

                # --- update ---
                optimizer.step()

            # --- track stats ---
            if epoch% eval_interval== 0:
                model.eval()
                with torch.no_grad():
                    if isinstance(model, MLP):
                        y_pred= model(Xdev)
                    elif isinstance(model, GCN):
                        y_pred= model(data.x, data.edge_index)[data.val_mask]

                    val_loss= loss_fn(y_pred, Ydev)
                    val_acc= (y_pred.argmax(dim=1)== Ydev).sum().item()/ Ydev.shape[0]
                    if verbose:
                        print(f'Epoch {epoch} | Training Loss: {tr_loss.item():.2f} | Validation Loss: {val_loss.item():.2f} | Validation Acc: {val_acc:>5.2f}')

        # final evaluation on the test set
        model.eval()
        with torch.no_grad():
            if isinstance(model, MLP):
                y_pred= model(Xte)
            elif isinstance(model, GCN):
                y_pred= model(data.x, data.edge_index)[data.test_mask]

            test_loss= loss_fn(y_pred, Yte)
            test_acc= (y_pred.argmax(dim=1)== Yte).sum().item()/ Yte.shape[0]
            if verbose: print(f'{model_class} Test Loss: {test_loss.item():.2f} | Test Acc: {test_acc:>5.2f}')
            results.append([val_acc, test_acc])

    return torch.tensor(results)


In [100]:
# print average on test set and standard deviation
results= supervised_training(MLP, learning_rate=0.01, epochs=1000, eval_interval=100)
print(f'MLP - Test Accuracy: {100*results[:,1].mean():.2f} ± {100*results[:,1].std():.2f}')

100%|██████████| 10/10 [05:56<00:00, 35.61s/it]

MLP - Test Accuracy: 71.42 ± 1.48





In [209]:
# print average on test set and standard deviation
results= supervised_training(GCN, learning_rate=0.01, epochs=1000, eval_interval=100, batches=False)
print(f'GCN - Test Accuracy: {100*results[:,1].mean():.2f} ± {100*results[:,1].std():.2f}')

100%|██████████| 10/10 [03:55<00:00, 23.55s/it]

GCN - Test Accuracy: 87.58 ± 0.18





The graph structure should really make a difference for the problem you are trying to solve. The structure should be meaningful for the prediction task at hand. Testing is important here. You can try to formulate the graph in different ways to see if one way of formulating works better than another one.

Training a graph neural network takes more time than training a normal neural network. So if the results improve only a little bit and training time is important, the normal neural network can be the best choice. Also, the effectiveness among types of graph neural networks (GCN, GAT, GraphSAGE) can vary greatly based on the problem.

Just like in standard neural networks, transfer learning (pre-training a GNN on a large dataset and fine-tuning on the target dataset) can be effective for GNNs. Checking for available pre-trained models for your task can be valuable.

As we've seen, simply adding graph information to a basic neural network can dramatically boost performance, as was the case when we moved from a normal neural network to a GCN for the Cora dataset. By aggregating information from neighboring nodes, GCNs can provide a richer representation of the data, leading to more accurate predictions. But, it's crucial to remember that GNNs aren't a magic bullet for every problem. The graph structure must be truly meaningful to the prediction task, and the increase in training complexity might not always justify the performance boost, especially when training time is critical.

# Graph Attention Network - GAT