# CS224W Final Project: Tutorial on the Augmentation of Graphs in PyG

### Jerry Chan, Jihee Suh, John So

Data augmentation is a widely used technique that leverages existing data to further train a model, improving its performance and generalization. For structured data formats such as images, augmentation methods can be quite straightforward, including operations like cropping, resizing, rotating, and adding noise. These augmentations are useful for reducing overfitting to the training dataset and adding invariance to certain transformations, such as color shifts, different camera models, and even different camera poses. 

In graphs, where local neighborhoods are not laid out on a regular grid as they are in images, a good augmentation scheme is less obvious, but augmentation remains a powerful way to reduce overfitting. Beyond robustness, graph augmentations can help with known issues such as over-smoothing, the choice of aggregation scheme, and graph structure learning, leading to more powerful GNNs.

In this tutorial, we aim to provide an intuitive explanation of various graph augmentation schemes. PyG provides several tools to manipulate the underlying graph structure and features, both ahead of time and dynamically during training. Using PyG, we will walk through setting up different graph learning problems, applying each augmentation scheme, and demonstrating when each is a good idea.


## Installation and Setup

### Notebook setup: install PyG + torch

In [2]:
import torch
torch_version = str(torch.__version__)
if "2.4.0" not in torch_version:
  !pip install torch==2.4.0

In [3]:
print(torch_version)

2.4.0+cu121


In [4]:
scatter_src = f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"
sparse_src = f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"
!pip install torch-scatter -f $scatter_src
!pip install torch-sparse -f $sparse_src
!pip install torch-geometric
!pip install ogb



In [5]:
import os
import random

import numpy as np
from tqdm import tqdm

import torch
from torch_geometric.nn.models import GraphSAGE
from torch_geometric.loader import NeighborLoader
import torch_geometric.transforms as T
from torch_geometric.utils import to_undirected
from torch_geometric.datasets import KarateClub
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

In [6]:
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

### Dataset and Tasks



To demonstrate the power of graph augmentations, let's consider a realistic graph to benchmark on. The Open Graph Benchmark (OGB) provides many such graphs, including ogbn-products and ogbn-arxiv. ogbn-products is a graph of ~2.5M nodes and ~62M edges that represents an Amazon product co-purchasing network. The task is to predict the category of a product in a multi-class classification setup. ogbn-arxiv is also a multi-class classification task, but at a smaller scale, with ~170K nodes and ~1.2M edges; it represents the citation network between arXiv papers.

Both datasets come with predefined splits: ogbn-products uses the top 8% of products by sales ranking as the training set, the next 2% as the validation set, and the rest as the test set, while ogbn-arxiv trains on papers published until 2017, validates on papers from 2018, and tests on the rest. We will leverage these splits for an inductive learning setting and see how different augmentations affect generalization.
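To make the sales-rank split concrete, here is a minimal sketch on synthetic data. The `sales_rank` array and the `rank_based_split` helper are hypothetical and only mimic the 8% / 2% / 90% protocol described above; in this notebook the real indices come from `dataset.get_idx_split()`.

```python
import numpy as np

def rank_based_split(sales_rank, train_frac=0.08, valid_frac=0.02):
    """Split node indices by ascending rank (rank 0 = best seller). Hypothetical helper."""
    order = np.argsort(sales_rank)            # best-selling nodes first
    n = len(order)
    n_train = int(round(train_frac * n))
    n_valid = int(round(valid_frac * n))
    return {
        "train": order[:n_train],
        "valid": order[n_train:n_train + n_valid],
        "test": order[n_train + n_valid:],
    }

rng = np.random.default_rng(0)
sales_rank = rng.permutation(1000)            # made-up ranks for 1000 products
split = rank_based_split(sales_rank)
print({name: len(idx) for name, idx in split.items()})
```

Because the test set consists of the least popular products, a model is evaluated on nodes that differ systematically from the ones it was trained on, which is why generalization matters here.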

Here, we use ogbn-products.

In [7]:
def load_dataset(transform=None):
    dataset = PygNodePropPredDataset(name='ogbn-products', root='./products/', transform=transform)
    print(dataset, flush=True)
    data = dataset[0]
    print(data, flush=True)
    return dataset

In [8]:
dataset = load_dataset()



PygNodePropPredDataset()
Data(num_nodes=2449029, edge_index=[2, 123718280], x=[2449029, 100], y=[2449029, 1])


### Training and Evaluation Utilities

To keep experiments consistent, we will use one network architecture throughout this tutorial. We will use a 2-layer GraphSAGE model followed by dropout, ReLU activation, and a linear layer for the classification head. The GraphSAGE model learns to generate node embeddings by sampling and aggregating features from a node’s neighborhood, and generalizes well to previously unseen nodes. This model can be stacked with multiple layers, iterating the process of sampling and aggregating for each layer. For a simple demonstration, we limit the model to 2 layers.
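As a rough illustration of the "aggregate features from a node's neighborhood" step, the sketch below implements a single mean-aggregation layer in plain NumPy. It is a simplification under stated assumptions: `sage_mean_layer` is a hypothetical helper, the weights are fixed identity matrices rather than learned parameters, and neighbor sampling is omitted entirely. It is not PyG's `GraphSAGE` implementation.

```python
import numpy as np

def sage_mean_layer(x, edge_index, w_self, w_neigh):
    """Combine each node's features with the mean of its in-neighbors' features."""
    src, dst = edge_index                       # messages flow src -> dst
    agg = np.zeros_like(x)
    np.add.at(agg, dst, x[src])                 # sum neighbor features per target node
    deg = np.zeros(x.shape[0])
    np.add.at(deg, dst, 1.0)                    # count incoming messages per node
    agg = agg / np.maximum(deg, 1.0)[:, None]   # mean; nodes without neighbors keep zeros
    return x @ w_self + agg @ w_neigh

x = np.array([[1.0], [2.0], [3.0]])
edge_index = np.array([[1, 2], [0, 0]])         # nodes 1 and 2 send messages to node 0
w = np.eye(1)                                   # identity weights, for illustration only
h = sage_mean_layer(x, edge_index, w, w)
print(h)
```

Node 0 ends up with its own feature plus the mean of its two neighbors, while nodes 1 and 2 receive no messages and keep their own features. Stacking two such layers lets information travel two hops, which matches the 2-layer setup used in this tutorial.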

In [9]:
def train(model, optimizer, dataloader, transform=None):
    model.train()
    total_loss = 0
    total_correct = 0
    n = 0

    for batch in tqdm(dataloader):
        optimizer.zero_grad()
        batch = batch.to(device)
        if transform is not None:
            batch = transform(batch)
        output = model(batch.x, batch.edge_index)[:batch.batch_size]
        y = batch.y[:batch.batch_size].squeeze().to(torch.long)
        loss = model.loss_fn(output, y)
    
        loss.backward()
        optimizer.step()

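        # Note: loss is already the mean over the batch, so total_loss / n below is
        # that mean divided again by the batch size. Multiply by batch.batch_size
        # here if a true per-node average loss is wanted.
        total_loss += loss.item()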
        total_correct += int(output.argmax(dim=-1).eq(y).sum())
        n += batch.batch_size

    return total_loss/n, total_correct/n

# Evaluation loop: same bookkeeping as train(), but without gradient updates
@torch.no_grad()
def test(model, dataloader, transform=None):
    model.eval()
    total_loss = 0
    total_correct = 0
    n = 0

    for batch in tqdm(dataloader):
        batch = batch.to(device)
        if transform is not None:
            batch = transform(batch)
        out = model(batch.x, batch.edge_index)[:batch.batch_size]
        y = batch.y[:batch.batch_size].squeeze().to(torch.long)
        loss = model.loss_fn(out, y)

        total_loss += loss.item()
        total_correct += int(out.argmax(dim=-1).eq(y).sum())
        n += batch.batch_size

    return total_loss/n, total_correct/n

In [10]:
input_dim = dataset[0].x.shape[1]
hidden_dim = 128
learning_rate = 0.0001
num_epochs = 20
batch_size = 32
num_layers = 2

fan_out = 10
num_workers = 2

In [11]:
def get_model(input_dim=input_dim, dataset=dataset):
    class GraphSAGENodeClassification(torch.nn.Module):
        def __init__(self, input_dim, hidden_dim, num_layers, num_classes):
            super().__init__()
            self.graph_sage = GraphSAGE(in_channels=input_dim, hidden_channels=hidden_dim, num_layers=num_layers)
            self.cls_head = torch.nn.Sequential(
                torch.nn.Dropout(0.1),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden_dim, num_classes),
            )
            self.loss_fn = torch.nn.CrossEntropyLoss()

        def forward(self, x, edge_index):
            h = self.graph_sage(x, edge_index)
            return self.cls_head(h)

    model = GraphSAGENodeClassification(input_dim, hidden_dim, num_layers, dataset.num_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    model.to(device)
    return model, optimizer

In [12]:
def get_dataloader(dataset, split):
    data = dataset[0]

    return NeighborLoader(
        data,
        input_nodes=split,
        num_neighbors=[fan_out] * num_layers,
        batch_size=batch_size,
        shuffle=True,
        pin_memory=True,
        num_workers=num_workers
    )

In [13]:
split_idx = dataset.get_idx_split()
train_loader = get_dataloader(dataset, split_idx['train'])



In [14]:
val_loader = get_dataloader(dataset, split_idx['valid'])

In [15]:
test_loader = get_dataloader(dataset, split_idx['test'])

In [19]:
def benchmark(model, optimizer, train_loader, val_loader, test_loader, transform=None):
    all_train_acc, all_val_acc, all_test_acc = [], [], []
    best_val_ind, best_val_acc = 0, 0
    for epoch in range(1, num_epochs + 1):
        train_loss, train_acc = train(model, optimizer, train_loader, transform)
        val_loss, val_acc = test(model, val_loader, transform)
        test_loss, test_acc = test(model, test_loader, transform)

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_val_ind = epoch

        print(f'Train {train_loss:.4f} ({100.0 * train_acc:.2f}%) | Val {val_loss:.4f} ({100.0 * val_acc:.2f}%) | Test {test_loss:.4f} ({100.0 * test_acc:.2f}%)')

        all_train_acc.append(train_acc)
        all_val_acc.append(val_acc)
        all_test_acc.append(test_acc)
    
    return {
        'all_train_acc': np.array(all_train_acc),
        'all_val_acc': np.array(all_val_acc),
        'all_test_acc': np.array(all_test_acc),
        'best_val_ind': best_val_ind,
        'model': model
    }

In [22]:
!pip install seaborn



In [23]:
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, Optional

# Set a clean, modern aesthetic
plt.style.use('seaborn-v0_8-pastel')
sns.set_palette("deep")

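def plot(x: Optional[np.ndarray] = None,
         y: Optional[Dict[str, np.ndarray]] = None,
         xlabel: str = "",
         ylabel: str = "accuracy"):
  # Plot one curve per entry of `y`; if `x` is omitted, the array index is used.
  y = {} if y is None else y  # avoid a shared mutable default argument

  plt.figure(figsize=(6, 4), dpi=300)

  for key, value in y.items():
    if x is not None:
      plt.plot(x, value, label=key)
    else:
      plt.plot(value, label=key)

  plt.grid(True, linestyle='--', linewidth=0.5, color='grey', alpha=0.7)
  plt.title('training accuracy', fontsize=16)
  # Label the x-axis whenever a label is given, even if `x` itself is omitted.
  if xlabel:
    plt.xlabel(xlabel)
  plt.ylabel(ylabel)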

  plt.legend(frameon=True, fancybox=True, framealpha=0.7)
  plt.tight_layout()
  plt.gca().set_facecolor('none')
  plt.gcf().patch.set_alpha(0.0)

  plt.show()

We run this setup without any transformations to get a baseline performance.

In [None]:
model, optimizer = get_model()
results = benchmark(model, optimizer, train_loader, val_loader, test_loader)

to_plot = {
    "train": results['all_train_acc'],
    "val": results['all_val_acc'],
}

plot(y=to_plot, xlabel="epoch")

100%|███████████████████████████████████████████████████████████████████████████████| 6145/6145 [01:03<00:00, 97.48it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 1229/1229 [00:12<00:00, 101.08it/s]
100%|████████████████████████████████████████████████████████████████████████████| 69160/69160 [09:35<00:00, 120.24it/s]


Train 0.0288 (76.68%) | Val 0.0188 (84.67%) | Test 0.0485 (68.50%)


100%|██████████████████████████████████████████████████████████████████████████████| 6145/6145 [00:58<00:00, 104.15it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 1229/1229 [00:10<00:00, 115.78it/s]
 34%|██████████████████████████                                                  | 23773/69160 [02:58<05:29, 137.82it/s]

## Node Feature Augmentation

### Why Positional Encoding?

In Graph Neural Networks (GNNs), nodes whose neighborhoods are structurally identical (isomorphic) inevitably receive identical embeddings unless they carry distinguishing features. If the nodes already have meaningful features that identify them, no further work is needed. Otherwise, the features need to be augmented. Basing this augmentation on the position of the node in the graph is known as positional encoding, and it provides an inductive bias that helps with learning. A simplified example is provided below to illustrate how it works.
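The following sketch shows an extreme version of this problem, assuming a toy mean-aggregation layer (`mean_layer` and `run_layers` are hypothetical helpers, not PyG functions). With constant features, every node of a path graph collapses to the same embedding, even though end nodes and middle nodes play different structural roles. Appending a simple structural feature (the node degree, used here only as a stand-in for a positional encoding) separates them again.

```python
import numpy as np

def mean_layer(x, adj):
    """Average each node's own features with the mean of its neighbors' features."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    return 0.5 * x + 0.5 * (adj @ x) / deg

def run_layers(x, adj, num_layers=3):
    for _ in range(num_layers):
        x = mean_layer(x, adj)
    return x

# Path graph 0 - 1 - 2 - 3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

x_const = np.ones((4, 1))
h_const = run_layers(x_const, adj)                  # constant input features

degree = adj.sum(axis=1, keepdims=True)             # stand-in structural feature
h_aug = run_layers(np.hstack([x_const, degree]), adj)

print(h_const.ravel())   # every node has the same embedding
print(h_aug[:, 1])       # end nodes and middle nodes now differ
```

Nodes 0 and 3 (and likewise 1 and 2) still share an embedding after augmentation, because they are genuinely symmetric in this graph.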

In [None]:
import networkx as nx
from torch_geometric.utils import to_networkx
from pylab import show

Here, we create a small random graph to demonstrate on. We set the number of nodes to 10, make a constant feature of 1 for every node, and create 20 random edges. Feel free to adjust the number of nodes and edges if you'd like.

In [None]:
from torch_geometric.data import Data

num_nodes = 10
simple_x = torch.ones(num_nodes)
simple_edge_index = torch.randint(num_nodes, (2, num_nodes*2))
simple_data = Data(x=simple_x, edge_index=simple_edge_index, num_nodes=num_nodes)
simple_data, simple_data.x[0]

This random graph is visualized below, based on its features. All the nodes are the same color, blue, because we have constant features.

In [None]:
G = to_networkx(simple_data)
plt.figure(figsize=(3,3))
pos = nx.spring_layout(G)
node_color = [simple_data.x[node] for node in G.nodes()]
nx.draw(G, pos=pos, cmap=plt.get_cmap('coolwarm'), node_color=node_color)
show()

### Random Walk Positional Encoding (RWPE)

PyG supports two kinds of positional encoding, the first being Random Walk Positional Encoding (RWPE). As the name implies, this method is based on the random walk diffusion process. Given a hyperparameter `walk_length` and a node `v`, we calculate the probability that a random walk starting at `v` lands back on `v` after 1, 2, 3, ..., `walk_length` steps. This gives a vector of length `walk_length` that we concatenate to the node's original feature vector. Note that these return probabilities depend only on the graph structure around `v`, so two nodes with exactly the same neighborhood structure still receive the same positional encoding. The upside is that if the structures differ anywhere within reach of a returning walk of up to `walk_length` steps, the encoding helps differentiate the nodes without stacking as many GNN layers.
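To see where these numbers come from, the return probabilities can be computed by hand from powers of the row-normalized adjacency matrix. The sketch below is a dense NumPy illustration on a triangle graph; `random_walk_pe` is a hypothetical helper, and it assumes an unweighted graph without duplicate edges, so details may differ from what `T.AddRandomWalkPE` does internally.

```python
import numpy as np

def random_walk_pe(adj, walk_length):
    """Probability of being back at the start node after 1..walk_length steps."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    transition = adj / deg                   # row-normalized transition matrix
    power = np.eye(adj.shape[0])
    pe = []
    for _ in range(walk_length):
        power = power @ transition           # k-step transition probabilities
        pe.append(np.diag(power))            # diagonal = probability of returning
    return np.stack(pe, axis=-1)             # shape [num_nodes, walk_length]

triangle = np.ones((3, 3)) - np.eye(3)       # every node connected to the other two
pe = random_walk_pe(triangle, walk_length=3)
print(pe)
```

On the triangle, a walk cannot return after one step, returns with probability 1/2 after two steps, and with probability 1/4 after three. All three nodes get the same vector, which reflects the fact that they are structurally indistinguishable.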

In [None]:
walk_length = 3
rwpe = T.AddRandomWalkPE(walk_length=walk_length, attr_name=None)
rwpe(simple_data), rwpe(simple_data).x[0]

To visualize the changed node features, we map the norm of the feature vector to colors. We can see that some of the nodes have clearly differentiated themselves, as we intended.

In [None]:
plt.figure(figsize=(3,3))
rwpe_x = rwpe(simple_data).x  # apply the transform once and reuse the result
node_color = [float(torch.norm(rwpe_x[node])) for node in G.nodes()]
nx.draw(G, pos=pos, cmap=plt.get_cmap('coolwarm'), node_color=node_color)
show()

### Laplacian Eigenvector Positional Encoding (LapPE)

If we want a positional encoding that distinguishes nodes more strongly, we can use the Laplacian Eigenvector Positional Encoding (LapPE). This takes the `lappe_k` eigenvectors of the graph's Laplacian matrix with the smallest non-trivial eigenvalues (where `lappe_k` is a hyperparameter) and concatenates them to the node feature matrix. The resulting encodings tend to differ from node to node, and they are distance-sensitive with respect to the Euclidean norm: nodes that are close in the graph tend to receive nearby encodings. One thing to be careful of is that, since `lappe_k` is the number of eigenvectors to use, it should stay below the number of nodes.
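As a sketch of the underlying computation, the snippet below builds the symmetric normalized Laplacian of a small undirected path graph with NumPy and keeps the eigenvectors that follow the trivial one. The helpers `normalized_laplacian` and `laplacian_pe` are hypothetical, and the choice of the symmetric normalization is an assumption made for illustration; `T.AddLaplacianEigenvectorPE` may differ in details.

```python
import numpy as np

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for an undirected graph given as a dense matrix."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1.0))
    d_inv_sqrt[deg == 0] = 0.0                      # isolated nodes contribute nothing
    return np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def laplacian_pe(adj, k):
    """Eigenvalues and eigenvectors 1..k, skipping the trivial eigenvalue 0."""
    eigvals, eigvecs = np.linalg.eigh(normalized_laplacian(adj))  # ascending order
    return eigvals[1:k + 1], eigvecs[:, 1:k + 1]

# Path graph 0 - 1 - 2 - 3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

vals, pe = laplacian_pe(adj, k=2)
print(vals)
print(pe)        # one row per node, one column per eigenvector
```

Each row of `pe` is the positional encoding of one node. Requesting `k` eigenvectors needs `k + 1` eigenpairs in total, which is why `k` has to stay below the number of nodes.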

In [None]:
lappe_k = 3
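# simple_data was built from random directed edges, so its Laplacian is not
# symmetric; is_undirected=True would assume that it is.
lappe = T.AddLaplacianEigenvectorPE(k=lappe_k, is_undirected=False, attr_name=None)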
lappe(simple_data), lappe(simple_data).x[0]

Below, we see that the node features have become much more varied. This method has a limitation as well, however: eigenvectors are sign-ambiguous. An eigenvector with its signs flipped is still an eigenvector for the same eigenvalue, and since there is no canonical way to choose between the two, a model using this PE scheme must learn to be invariant to the sign flip.
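One common way to encourage this invariance is to flip the sign of each eigenvector at random during training, so the model sees both versions. The sketch below illustrates the idea on a made-up encoding matrix; `random_sign_flip` is a hypothetical helper and is not part of the benchmark code in this tutorial.

```python
import numpy as np

def random_sign_flip(pe, rng):
    """Multiply each eigenvector (column) by +1 or -1, chosen independently at random."""
    signs = rng.choice([-1.0, 1.0], size=(1, pe.shape[1]))
    return pe * signs

rng = np.random.default_rng(0)
pe = np.array([[0.5, -0.2],
               [0.1, 0.7],
               [-0.6, 0.3]])                  # made-up encodings: 3 nodes, 2 eigenvectors
flipped = random_sign_flip(pe, rng)
print(flipped)
```

The flip is applied per column rather than per entry, because the ambiguity concerns each eigenvector as a whole: relative signs between nodes within one eigenvector carry information and are preserved.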

In [None]:
plt.figure(figsize=(3,3))
lappe_x = lappe(simple_data).x  # apply the transform once and reuse the result
node_color = [float(torch.norm(lappe_x[node])) for node in G.nodes()]
nx.draw(G, pos=pos, cmap=plt.get_cmap('coolwarm'), node_color=node_color)
show()

### Benchmark Node Feature Augmentation Methods

Now, we'll benchmark these methods on ogbn-products. ogbn-products has node features that are already quite distinctive for each node, so positional encodings are not expected to improve performance by much. The benchmark is still useful as a check that adding them does not hurt performance, which would make them a safe addition to try on other datasets.

In [None]:
walk_length = 4
rwpe = T.AddRandomWalkPE(walk_length=walk_length, attr_name=None)

In [None]:
model, optimizer = get_model(input_dim=input_dim+walk_length)
results = benchmark(model, optimizer, train_loader, val_loader, test_loader, rwpe)

to_plot = {
    "train": results['all_train_acc'],
    "val": results['all_val_acc'],
}

plot(y=to_plot, xlabel="epoch")

In [None]:
lappe_k = 4
lappe = T.AddLaplacianEigenvectorPE(k=lappe_k, is_undirected=True, attr_name=None)

In [None]:
model, optimizer = get_model(input_dim=input_dim+lappe_k)
results = benchmark(model, optimizer, train_loader, val_loader, test_loader, lappe)

to_plot = {
    "train": results['all_train_acc'],
    "val": results['all_val_acc'],
}

plot(y=to_plot, xlabel="epoch")

Next, head over to (link) to see the graph structure transformations!