# Graph Neural Networks with PyTorch Geometric (PyG)

This tutorial walks through building a simple Graph Convolutional Network (GCN) for node classification on the Cora citation network using PyTorch Geometric. You'll learn to:

- Represent graphs with `torch_geometric.data.Data`
- Build a GCN with `torch_geometric.nn.GCNConv`
- Train with train/val/test masks (Planetoid split)
- Compute masked accuracy and evaluate the model

Prerequisites: basic PyTorch (tensors and modules) and familiarity with neural network training loops.

## 1. Setup and Imports

In [2]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.data import Data

This cell imports PyTorch and PyTorch Geometric modules we'll use:
- `torch` for tensors and training utilities
- `torch.nn.functional as F` for activation functions
- `GCNConv` for the graph convolution layers
- `Data` to hold graph structures (`x` and `edge_index`)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # pick GPU if available, else CPU
print(f"PyTorch: {torch.__version__}")  # print PyTorch version
print(f"Using device: {device}")  # confirm which device we'll use

## 2. A Tiny Graph Example
We start with a toy graph to see how `Data(x, edge_index)` represents a graph in PyG.

In [None]:
node_features = torch.tensor([[-1.0, 1.0],  # Node 0 features
                               [0.0, 1.0],   # Node 1 features
                               [1.0, 1.0],   # Node 2 features
                               [1.0, -1.0]], # Node 3 features
                              dtype=torch.float)

edge_index = torch.tensor([[0, 0, 1, 2, 3],  # sources
                           [1, 2, 2, 3, 0]], # targets
                          dtype=torch.long)

graph_data = Data(x=node_features, edge_index=edge_index)
graph_data

Data(x=[4, 2], edge_index=[2, 5])

The toy graph has four nodes with two features each and five directed edges: column `j` of `edge_index` is an edge from node `edge_index[0, j]` to node `edge_index[1, j]`. The simple GNN defined below takes its input in exactly this form.
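
To make the `edge_index` layout concrete, the sketch below rebuilds the toy graph's connectivity with plain Python lists instead of tensors, so it runs without PyTorch. It reads each column as a directed `(source, target)` pair, builds an adjacency list, and then adds the reverse of every edge to obtain an undirected graph. The names `out_neighbors` and `undirected_edge_index` are illustrative and not part of PyG; in PyG itself, `torch_geometric.utils.to_undirected` performs the symmetrization on tensors.

```python
# Plain-Python copy of the toy graph's edge_index (same values as the tensor version).
edge_index = [[0, 0, 1, 2, 3],   # sources
              [1, 2, 2, 3, 0]]   # targets
num_nodes = 4

# Directed adjacency list: node -> list of nodes it points to.
out_neighbors = {node: [] for node in range(num_nodes)}
for src, dst in zip(*edge_index):
    out_neighbors[src].append(dst)

# Symmetrize: keep every edge in both directions, de-duplicated and sorted.
forward = {(s, d) for s, d in zip(*edge_index)}
backward = {(d, s) for s, d in forward}
undirected = sorted(forward | backward)

# Back to the two-row layout that PyG expects.
undirected_edge_index = [[s for s, _ in undirected],
                         [d for _, d in undirected]]
```

The toy graph stores only one direction per edge, so node 1 sends a message to node 2 but not the other way round. After symmetrization the five directed edges become ten.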

In [None]:
class SimpleGNN(torch.nn.Module):
  def __init__(self, num_features, num_classes):
    super().__init__()
    self.conv1 = GCNConv(in_channels=num_features, out_channels=16)  # first GCN layer
    self.conv2 = GCNConv(in_channels=16, out_channels=num_classes)   # output layer -> num_classes

  def forward(self, data):
    x, edge_index = data.x, data.edge_index  # node features and graph edges
    x = self.conv1(x, edge_index)            # message passing layer
    x = F.relu(x)                             # non-linearity
    x = self.conv2(x, edge_index)            # logits per class
    x = F.log_softmax(x, dim=1)              # log-probs for NLLLoss
    return x
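
For intuition about what each `GCNConv` layer computes, here is a minimal sketch of the propagation rule from the GCN paper, `D^(-1/2) (A + I) D^(-1/2) X`, written in plain Python without the learned weight matrix or bias. It is an illustration under simplifying assumptions (undirected, unweighted edges), not PyG's actual implementation, and `gcn_aggregate` is a hypothetical helper name.

```python
import math

def gcn_aggregate(features, edges, num_nodes):
    """One GCN propagation step without weights: D^-1/2 (A + I) D^-1/2 X."""
    # Add a self-loop for every node, as the GCN paper does.
    edges = list(edges) + [(i, i) for i in range(num_nodes)]
    # Degree counts incoming edges, including the self-loop.
    degree = [0] * num_nodes
    for _, dst in edges:
        degree[dst] += 1
    out = [[0.0] * len(features[0]) for _ in range(num_nodes)]
    for src, dst in edges:
        # Symmetric normalization: 1 / sqrt(deg(src) * deg(dst)).
        weight = 1.0 / math.sqrt(degree[src] * degree[dst])
        for k, value in enumerate(features[src]):
            out[dst][k] += weight * value
    return out

# Undirected path graph 0 - 1 - 2, stored with both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
features = [[1.0], [2.0], [3.0]]
aggregated = gcn_aggregate(features, edges, num_nodes=3)
```

Each node's new features are a normalized sum over itself and its neighbours. `GCNConv` combines this aggregation with a learned linear transformation, and stacking two such layers lets information travel two hops.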

## 3. Dataset: Cora (Planetoid)
We'll use the Cora citation network from the Planetoid benchmark. The dataset provides node features, labels, and train/val/test masks.

In [None]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')  # download/cache Cora under /tmp/Cora

data = dataset[0].to(device)  # single graph -> move to device
print(data)  # quick overview of tensors and masks

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

This cell downloads and loads the Cora dataset:
- `root` controls the cache folder (change to a persistent path if you want it saved)
- `dataset[0]` returns the single `Data` graph with features, labels, and masks
- We move `data` to the selected device

In [None]:
dataset.num_node_features, dataset.num_classes, data.x.shape, data.edge_index.shape, data.y.shape

(1433, 7, torch.Size([2708, 1433]), torch.Size([2, 10556]), torch.Size([2708]))

Quick sanity-check on shapes:
- `num_node_features` and `num_classes` from the dataset
- `x` (#nodes, #features), `edge_index` (2, #edges), `y` (#nodes)

In [None]:
model = SimpleGNN(num_features=dataset.num_node_features, num_classes=dataset.num_classes).to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01, weight_decay=5e-4)
loss_fn = torch.nn.NLLLoss()

Instantiate the model and training utilities:
- `SimpleGNN` with two GCN layers
- `Adam` optimizer with a small L2 penalty (`weight_decay=5e-4`)
- `NLLLoss` because the model returns log-probabilities
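
The pairing of `log_softmax` and `NLLLoss` can be checked by hand. The following sketch re-implements both in plain Python (no tensors) purely for illustration; `log_softmax`, `nll_loss`, and `cross_entropy` here are local helper functions, not the PyTorch ones, and the logits are made-up values.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax for one row of logits."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_norm for v in logits]

def nll_loss(log_probs, targets):
    """Mean negative log-probability of the target class for each row."""
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

def cross_entropy(logits, targets):
    """Cross-entropy on raw logits: log-softmax followed by NLL."""
    return nll_loss([log_softmax(row) for row in logits], targets)

logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # two nodes, three classes
targets = [0, 2]
loss = cross_entropy(logits, targets)

# Equal scores for all seven classes give a loss of exactly ln(7).
uniform_loss = cross_entropy([[0.0] * 7], [3])
```

Because cross-entropy is just these two steps fused, a model that returns raw logits with `CrossEntropyLoss` trains the same way as one that returns log-probabilities with `NLLLoss`.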

In [None]:
y_pred = model(data)
y_true = data.y
print(y_pred.shape, y_true.shape)
loss_fn(y_pred[data.train_mask], y_true[data.train_mask])

tensor(1.9413, grad_fn=<NllLossBackward0>)

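A first forward pass to verify the output shapes and compute the loss on the training nodes only (selected with `train_mask`). The untrained loss of about 1.94 is close to ln(7) ≈ 1.946, the value for a uniform guess over Cora's seven classes.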

In [None]:
len(data.train_mask)  # mask length equals the total number of nodes, not the size of the training split

2708

In [None]:
len(y_pred[data.train_mask])  # the mask selects one prediction per training node

140

In [None]:
def masked_accuracy(logits, y, mask):
    preds = logits.argmax(dim=1)
    correct = (preds[mask] == y[mask]).sum()
    return (correct.float() / mask.sum()).item()

masked_accuracy(model(data), data.y, data.train_mask)

Utility to compute accuracy on a subset of nodes defined by a boolean mask (train/val/test).
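
As a worked example of the masking logic, the sketch below mirrors `masked_accuracy` on plain Python lists. The scores, labels, and mask are made-up toy values, and `masked_accuracy_lists` is a hypothetical stand-in for the tensor version.

```python
def masked_accuracy_lists(scores, labels, mask):
    """Accuracy over the rows where mask is True."""
    # Argmax over classes for every node.
    preds = [max(range(len(row)), key=row.__getitem__) for row in scores]
    # Keep only the (prediction, label) pairs selected by the mask.
    selected = [(p, y) for p, y, keep in zip(preds, labels, mask) if keep]
    return sum(p == y for p, y in selected) / len(selected)

scores = [[0.1, 2.0, 0.3],   # node 0 -> predicted class 1
          [1.5, 0.2, 0.1],   # node 1 -> predicted class 0
          [0.0, 0.1, 0.9],   # node 2 -> predicted class 2
          [0.4, 0.3, 0.2]]   # node 3 -> predicted class 0
labels = [1, 2, 2, 0]
mask = [True, True, True, False]  # evaluate nodes 0, 1, and 2 only

# Nodes 0 and 2 are correct, node 1 is wrong, node 3 is ignored.
accuracy = masked_accuracy_lists(scores, labels, mask)
```

Node 3 is predicted correctly but does not count, because the mask excludes it. The same holds on Cora: the model produces predictions for all nodes, and the mask decides which ones enter the metric.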

In [None]:
for epoch in range(200):
  model.train()
  logits = model(data)  # forward pass
  loss = loss_fn(logits[data.train_mask], data.y[data.train_mask])  # compute loss on train nodes
  optimizer.zero_grad()  # reset gradients
  loss.backward()  # backprop
  optimizer.step()  # update weights

  if epoch % 10 == 0:  # log every 10 epochs
    model.eval()
    with torch.inference_mode():
      logits = model(data)  # recompute logits without grad
      train_acc = masked_accuracy(logits, data.y, data.train_mask)  # train accuracy
      val_acc = masked_accuracy(logits, data.y, data.val_mask)  # val accuracy
    print(f'Epoch {epoch:03d}: Loss={loss.item():.4f} | Train Acc={train_acc:.3f} | Val Acc={val_acc:.3f}')

Epoch 000: Loss = 1.9413
Accuracy: {'accuracy': 0.404}
Epoch 010: Loss = 0.6166
Accuracy: {'accuracy': 0.726}
Epoch 020: Loss = 0.1024
Accuracy: {'accuracy': 0.768}
Epoch 030: Loss = 0.0211
Accuracy: {'accuracy': 0.756}
Epoch 040: Loss = 0.0076
Accuracy: {'accuracy': 0.75}
Epoch 050: Loss = 0.0042
Accuracy: {'accuracy': 0.75}
Epoch 060: Loss = 0.0030
Accuracy: {'accuracy': 0.75}
Epoch 070: Loss = 0.0024
Accuracy: {'accuracy': 0.746}
Epoch 080: Loss = 0.0021
Accuracy: {'accuracy': 0.744}
Epoch 090: Loss = 0.0018
Accuracy: {'accuracy': 0.744}
Epoch 100: Loss = 0.0016
Accuracy: {'accuracy': 0.744}
Epoch 110: Loss = 0.0015
Accuracy: {'accuracy': 0.744}
Epoch 120: Loss = 0.0013
Accuracy: {'accuracy': 0.746}
Epoch 130: Loss = 0.0012
Accuracy: {'accuracy': 0.746}
Epoch 140: Loss = 0.0011
Accuracy: {'accuracy': 0.746}
Epoch 150: Loss = 0.0010
Accuracy: {'accuracy': 0.746}
Epoch 160: Loss = 0.0010
Accuracy: {'accuracy': 0.746}
Epoch 170: Loss = 0.0009
Accuracy: {'accuracy': 0.746}
Epoch 180: Lo

Training loop:
- Forward pass on all nodes, compute NLL loss on training nodes only
- Backprop and optimizer step
- Every 10 epochs, switch to eval mode and compute accuracy on the train and val masks
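
The loop computes validation accuracy but does not act on it. One common use, sketched here as an assumption rather than something this tutorial does, is to keep the checkpoint with the best validation accuracy and stop once it has not improved for `patience` evaluations. `select_checkpoint` and the accuracy values are hypothetical.

```python
def select_checkpoint(val_accuracies, patience):
    """Return (best_index, stop_index) for patience-based early stopping."""
    best_index, best_value = 0, float("-inf")
    for i, value in enumerate(val_accuracies):
        if value > best_value:
            # New best: remember where it happened.
            best_index, best_value = i, value
        elif i - best_index >= patience:
            # No improvement for `patience` evaluations: stop here.
            return best_index, i
    return best_index, len(val_accuracies) - 1

# Hypothetical validation accuracies, one per evaluation (not results from this tutorial).
history = [0.40, 0.72, 0.77, 0.76, 0.75, 0.75, 0.75]
best, stopped = select_checkpoint(history, patience=3)
```

In a real training loop, the model's `state_dict()` would be saved whenever a new best is found and restored before the final test evaluation.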

## 4. Evaluation on Test Split
After training, evaluate the model on the held-out test nodes.

In [None]:
model.eval()
with torch.inference_mode():
    logits = model(data)
    test_acc = masked_accuracy(logits, data.y, data.test_mask)
print({"test_accuracy": round(test_acc, 4)})

Because the test nodes played no part in training, this accuracy is the figure to report.

## 5. Notes and References

- Loss: We used `NLLLoss` with `log_softmax` outputs. Alternatively, output raw logits and use `CrossEntropyLoss`.
- Masks: `train_mask`, `val_mask`, `test_mask` are boolean vectors that select nodes for each split.
- Device: We moved both the data object and model to the selected device.

References:
- PyTorch Geometric docs: https://pytorch-geometric.readthedocs.io/
- GCN paper: https://arxiv.org/abs/1609.02907