<a href="https://colab.research.google.com/github/dp457/Graph-Neural-Network/blob/main/Introduction_to_Graph_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install torch_geometric


# Data Handling of Graphs

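A single graph in PyTorch Geometric is described by an instance of `torch_geometric.data.Data`, which holds the following attributes by default:
* `data.x`: node feature matrix with shape **[num_nodes, num_node_features]**.
* `data.edge_index`: graph connectivity in COO format with shape **[2, num_edges]** and type `torch.long`.
* `data.edge_attr`: edge feature matrix with shape **[num_edges, num_edge_features]**.
* `data.y`: target to train against, e.g. node-level targets of shape **[num_nodes, \*]** or graph-level targets of shape **[1, \*]**.
* `data.pos`: node position matrix with shape **[num_nodes, num_dimensions]**.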









In [2]:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)

In [4]:
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)
data = Data(x=x, edge_index=edge_index)
data

Data(x=[3, 1], edge_index=[2, 4])
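
The `[2, num_edges]` layout can be easier to read when derived from a list of undirected edges. The following is a minimal pure-Python sketch, not PyTorch Geometric code; the helper name `to_coo` is hypothetical. It writes each undirected edge once per direction, which is why the three-node path graph above needs four columns rather than two.

```python
def to_coo(undirected_edges):
    """Expand undirected (u, v) pairs into a two-row COO edge list.

    Hypothetical helper for illustration only; each undirected edge is
    stored twice, once per direction.
    """
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]


# The path graph 0 - 1 - 2 used above.
coo = to_coo([(0, 1), (1, 2)])
num_edges = len(coo[0])
print(coo)        # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(num_edges)  # 4
```

The nested list holds the same values as `edge_index` above and could be passed to `torch.tensor(..., dtype=torch.long)`.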

Besides holding a number of node-level, edge-level or graph-level attributes, *Data* provides a number of utility functions.

In [5]:
data.num_nodes

3

In [6]:
data.num_edges

4

In [7]:
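# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = data.to(device)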

## Benchmark Dataset

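PyTorch Geometric contains a large number of common benchmark datasets, including the graph classification datasets from the TUDataset collection, the QM7 and QM9 datasets, and a handful of 3D mesh/point cloud datasets such as FAUST, ModelNet and ShapeNet.

Initializing a dataset is straightforward: it is downloaded and processed automatically on first use.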

In [8]:
from torch_geometric.datasets import TUDataset

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')

Downloading https://www.chrsmrrs.com/graphkerneldatasets/ENZYMES.zip
Processing...
Done!


In [10]:
print(len(dataset))
print(dataset.num_classes)

600
6


In [11]:
dataset.num_node_features

3

In [13]:
data = dataset[0]
data

Data(edge_index=[2, 168], x=[37, 3], y=[1])

This graph has 37 nodes with 3 features each. The 168 columns of `edge_index` correspond to 168/2 = 84 undirected edges, and `y=[1]` shows that the graph carries exactly one graph-level target, i.e. one class label. A dataset can be indexed with slices, long tensors or bool tensors, which makes it easy to split. For example, a 90/10 train/test split:

In [15]:
train_dataset = dataset[:540]  # first 540 graphs (90%)
test_dataset = dataset[540:]   # remaining 60 graphs (10%)
train_dataset

ENZYMES(540)
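
A plain slice keeps the graphs in their stored order, which can skew the split if the dataset happens to be stored in label order (an assumption worth checking, not a stated property of ENZYMES). Below is a minimal sketch of a shuffled 540/60 split over indices using only the standard library; the seed and variable names are illustrative.

```python
import random

num_graphs = 600          # len(dataset) for ENZYMES, as printed above
indices = list(range(num_graphs))

rng = random.Random(0)    # fixed seed (illustrative) for a repeatable split
rng.shuffle(indices)

train_idx = indices[:540]
test_idx = indices[540:]

print(len(train_idx), len(test_idx))  # 540 60
```

Since a dataset accepts long tensors as indices, something like `dataset[torch.tensor(train_idx)]` should select the training graphs. PyTorch Geometric also offers `dataset.shuffle()` for the same purpose.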

Next we download Cora, a standard benchmark dataset for semi-supervised node classification.

In [16]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name = 'Cora')
dataset

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!


Cora()

In [17]:
len(dataset)

1

In [18]:
dataset.num_classes

7

In [19]:
dataset.num_node_features

1433

In [21]:
data = dataset[0]
data

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

In [23]:
data.is_undirected()

True

In [24]:
data.train_mask.sum().item()

140

In [25]:
data.val_mask.sum().item()

500

In [26]:
data.test_mask.sum().item()

1000

Here `data.y` holds the label of each node, and three additional node-level boolean masks select the subsets used at each stage:

*   *train_mask* denotes which nodes to train against (140 nodes).
*   *val_mask* denotes which nodes to use for validation, e.g. to perform early stopping (500 nodes).
*   *test_mask* denotes which nodes to test against (1000 nodes).
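
To make the role of a mask concrete, here is a small pure-Python sketch of how a boolean mask restricts a metric to a subset of nodes. The function name `masked_accuracy` and the toy values are made up for illustration and are not taken from Cora.

```python
def masked_accuracy(pred, target, mask):
    """Accuracy over the positions where mask is True."""
    selected = [(p, t) for p, t, m in zip(pred, target, mask) if m]
    correct = sum(1 for p, t in selected if p == t)
    return correct / len(selected)


# Toy example with 6 nodes (made-up values, not Cora).
pred      = [0, 1, 2, 1, 0, 2]
target    = [0, 1, 1, 1, 2, 2]
test_mask = [False, False, True, True, True, True]

print(masked_accuracy(pred, target, test_mask))  # 0.5
```

With tensors, the same selection is written as `pred[mask]` and `target[mask]`.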



# Mini-batches

Neural networks are usually trained in a batch-wise fashion. PyTorch Geometric achieves parallelization over a mini-batch by creating a sparse block-diagonal adjacency matrix (defined by `edge_index`) and concatenating the feature and target matrices in the node dimension. Because the graphs in a batch stay disconnected from one another, examples in one batch can have differing numbers of nodes and edges.

$\mathbf{A} = \begin{bmatrix}
\mathbf{A}_1 & & \\
& \ddots & \\
& & \mathbf{A}_n
\end{bmatrix}$, $\mathbf{X} = \begin{bmatrix}
\mathbf{X}_1 \\
\vdots \\
\mathbf{X}_n
\end{bmatrix}$, $\mathbf{Y} = \begin{bmatrix}
\mathbf{Y}_1 \\
\vdots \\
\mathbf{Y}_n
\end{bmatrix}$

The **DataLoader** class handles this concatenation automatically, following the block structure above.
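
The concatenation can be sketched in plain Python. This is an illustrative stand-in under simplifying assumptions (graphs as dicts of lists, a hypothetical `collate` helper), not the PyTorch Geometric implementation: node indices of each graph are shifted by the number of nodes that came before it, which is what placing the adjacency blocks on the diagonal amounts to.

```python
def collate(graphs):
    """Merge graphs into one disconnected graph (block-diagonal adjacency).

    Each graph is a dict with 'x' (list of node feature rows) and
    'edge_index' (two-row COO list). Hypothetical helper, not the
    PyTorch Geometric implementation.
    """
    x, src, dst, batch, ptr = [], [], [], [], [0]
    for graph_id, g in enumerate(graphs):
        offset = ptr[-1]                      # nodes seen so far
        x += g["x"]                           # stack features along the node dimension
        src += [u + offset for u in g["edge_index"][0]]
        dst += [v + offset for v in g["edge_index"][1]]
        batch += [graph_id] * len(g["x"])     # node -> graph assignment
        ptr.append(offset + len(g["x"]))      # cumulative node counts
    return {"x": x, "edge_index": [src, dst], "batch": batch, "ptr": ptr}


g1 = {"x": [[-1.0], [0.0], [1.0]], "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]}
g2 = {"x": [[5.0], [6.0]], "edge_index": [[0, 1], [1, 0]]}

merged = collate([g1, g2])
print(merged["edge_index"])  # [[0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]]
print(merged["batch"])       # [0, 0, 0, 1, 1]
print(merged["ptr"])         # [0, 3, 5]
```

Note that `ptr` has one more entry than there are graphs, which is consistent with `ptr=[33]` for batches of 32 graphs.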

In [27]:
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES', use_node_attr=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
  print(batch)

DataBatch(edge_index=[2, 3536], x=[970, 21], y=[32], batch=[970], ptr=[33])
DataBatch(edge_index=[2, 4020], x=[1275, 21], y=[32], batch=[1275], ptr=[33])
DataBatch(edge_index=[2, 3972], x=[1030, 21], y=[32], batch=[1030], ptr=[33])
DataBatch(edge_index=[2, 4502], x=[1219, 21], y=[32], batch=[1219], ptr=[33])
DataBatch(edge_index=[2, 3608], x=[913, 21], y=[32], batch=[913], ptr=[33])
DataBatch(edge_index=[2, 3926], x=[989, 21], y=[32], batch=[989], ptr=[33])
DataBatch(edge_index=[2, 4190], x=[1079, 21], y=[32], batch=[1079], ptr=[33])
DataBatch(edge_index=[2, 3696], x=[962, 21], y=[32], batch=[962], ptr=[33])
DataBatch(edge_index=[2, 4208], x=[1087, 21], y=[32], batch=[1087], ptr=[33])
DataBatch(edge_index=[2, 4138], x=[1032, 21], y=[32], batch=[1032], ptr=[33])
DataBatch(edge_index=[2, 4212], x=[1089, 21], y=[32], batch=[1089], ptr=[33])
DataBatch(edge_index=[2, 3762], x=[930, 21], y=[32], batch=[930], ptr=[33])
DataBatch(edge_index=[2, 4176], x=[1131, 21], y=[32], batch=[1131], ptr=[3

In [28]:
batch.num_graphs

24

The **batch** attribute is a column vector that maps each node to its respective graph in the batch. It can be used, for example, to average node features over the node dimension for each graph individually:


In [31]:
from torch_geometric.utils import scatter
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES', use_node_attr=True)
loader = DataLoader(dataset, batch_size = 32, shuffle=True)

for data in loader:
  x = scatter(data.x, data.batch, dim=0, reduce='mean')
  print(x.size())

torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([32, 21])
torch.Size([24, 21])
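
The mean reduction used above turns one row per node into one row per graph, which is why every full batch yields 32 rows and the last batch 24. A minimal pure-Python sketch of the same idea follows; `scatter_mean` is a hypothetical stand-in written for illustration, not the library function.

```python
def scatter_mean(rows, index):
    """Average the rows that share the same index value.

    Pure-Python stand-in for a mean reduction along dim 0; assumes index
    values cover 0..max(index) with at least one row each.
    """
    num_groups = max(index) + 1
    num_features = len(rows[0])
    sums = [[0.0] * num_features for _ in range(num_groups)]
    counts = [0] * num_groups
    for row, g in zip(rows, index):
        counts[g] += 1
        for j, value in enumerate(row):
            sums[g][j] += value
    return [[s / c for s in group] for group, c in zip(sums, counts)]


# Five nodes with two features each; the first three belong to graph 0.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [10.0, 0.0], [20.0, 2.0]]
batch = [0, 0, 0, 1, 1]

print(scatter_mean(x, batch))  # [[3.0, 4.0], [15.0, 1.0]]
```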


# Learning Methods on Graphs

After learning about data handling, datasets and loaders, it's time to implement our first GNN. We will use a simple two-layer GCN and replicate the node classification experiment on the Cora citation dataset.

In [36]:
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/Cora', name='Cora')

In [37]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.conv1 = GCNConv(dataset.num_node_features, 16)
    self.conv2 = GCNConv(16, dataset.num_classes)

  def forward(self, data):
    x, edge_index = data.x, data.edge_index
    x = self.conv1(x, edge_index)
    x = F.relu(x) # nonlinearity not integrated in the conv calls hence called separately
    x = F.dropout(x, training=self.training)
    x = self.conv2(x, edge_index)

    return F.log_softmax(x, dim=1)
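
As a rough picture of what each `GCNConv` call does to the features, the sketch below applies the symmetric normalization from the GCN formulation, $\hat{\mathbf{D}}^{-1/2} (\mathbf{A} + \mathbf{I}) \hat{\mathbf{D}}^{-1/2} \mathbf{X}$, to the three-node path graph from the first section. It is plain Python for scalar features and deliberately leaves out the learnable weight matrix and bias; `gcn_propagate` is a hypothetical name.

```python
import math


def gcn_propagate(x, edge_index, num_nodes):
    """Symmetrically normalized neighbor aggregation with self-loops.

    Computes D^-1/2 (A + I) D^-1/2 x for scalar node features x; the
    learnable weight matrix and bias of a real layer are left out.
    """
    # Add a self-loop for every node.
    src = list(edge_index[0]) + list(range(num_nodes))
    dst = list(edge_index[1]) + list(range(num_nodes))

    # Degree of each node, counted on the graph with self-loops.
    deg = [0] * num_nodes
    for v in dst:
        deg[v] += 1

    out = [0.0] * num_nodes
    for u, v in zip(src, dst):
        out[v] += x[u] / math.sqrt(deg[u] * deg[v])
    return out


edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]   # path graph 0 - 1 - 2
x = [-1.0, 0.0, 1.0]
h = gcn_propagate(x, edge_index, num_nodes=3)
print([round(v, 6) for v in h])  # [-0.5, 0.0, 0.5]
```

Each node ends up with a degree-weighted mix of its own feature and those of its neighbors; stacking two such layers, as the model above does, lets information travel two hops.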


In [38]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
  optimizer.zero_grad()
  out = model(data)
  loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
  loss.backward()
  optimizer.step()
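
The model returns log-probabilities via `log_softmax`, and the loop pairs them with `nll_loss`. Together the two amount to the usual cross-entropy loss. The following standard-library sketch checks this for a single made-up row of logits; the helper names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]   # made-up scores for three classes
target = 0

loss = nll(log_softmax(logits), target)

# Cross-entropy computed directly from softmax probabilities.
probs = [math.exp(z) / sum(math.exp(t) for t in logits) for z in logits]
cross_entropy = -math.log(probs[target])

print(abs(loss - cross_entropy) < 1e-12)  # True
```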

In [39]:
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.8060
