<a href="https://colab.research.google.com/github/deokhwajeong/stanford-xcs224w/blob/main/XCS224W_Colab2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CS224W - Colab 2**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scpd-proed/XCS224W-Colab2/blob/main/Notebook/XCS224W_Colab2.ipynb)

Before opening the Colab with the badge, you need to allow Google Colab to access private GitHub repositories. For instructions, see [this tutorial](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb#:~:text=Navigate%20to%20http%3A%2F%2Fcolab,to%20read%20the%20private%20files.).

If you open the Colab with this badge, make sure to **save a copy to Drive** from the 'File' menu before running the notebook.

In Colab 2, you will construct your first graph neural network using PyTorch Geometric (PyG) and apply the model on two Open Graph Benchmark (OGB) datasets. These two datasets will be used to benchmark your model's performance on two different graph-based tasks: 1) node property prediction (predicting the properties of single nodes) and 2) graph property prediction (predicting properties of entire graphs or subgraphs).

First, you will learn how PyTorch Geometric stores graphs as PyTorch tensors.

Then, you will load and inspect one of the Open Graph Benchmark (OGB) datasets by using the `ogb` package. OGB is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. The `ogb` package not only provides data loaders for each dataset but also model evaluators.

Lastly, you will build your own graph neural network using PyTorch Geometric. You will train and evaluate your model on the OGB node property prediction and graph property prediction tasks.

**Note**: Make sure to **sequentially run all the cells in each section**, so that the intermediate variables / packages will carry over to the next cell.

Have fun and good luck on Colab 2 :)

## Building + Debugging Notes
While working through this Colab and future Colabs, we strongly encourage you to follow a couple of building / debugging strategies:
- During debugging make sure to run your notebook using the CPU runtime. You can change the notebook runtime by selecting `Runtime` and then `Change runtime type`. From the dropdown, select `None` as the `hardware accelerator`.
- When working with PyTorch and neural network models, understanding the shapes of different tensors, especially the input and output tensors, is incredibly helpful.
- When training models, it is helpful to start by only running 1 epoch or even just a couple of batch iterations. This way you can check that all your tensor shapes and logic match up, while also tracking expected behavior, such as a decreasing training loss. Remember to comment out / save the default number of epochs that we provide you.
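
One way to follow the last tip without losing the provided defaults is to leave the original settings untouched and layer a temporary override on top. The sketch below is only illustrative: `DEFAULT_ARGS`, `make_args`, and the `debug` flag are hypothetical names, and the values are placeholders copied from the node-prediction settings used later in this notebook.

```python
# Hypothetical helper: keep the provided defaults intact and override
# only what is needed while debugging.
DEFAULT_ARGS = {'num_layers': 3, 'hidden_dim': 256, 'dropout': 0.5,
                'lr': 0.01, 'epochs': 100}

def make_args(debug=False):
    """Return a copy of the defaults, with a single epoch when debugging."""
    args = dict(DEFAULT_ARGS)   # shallow copy, so DEFAULT_ARGS is never mutated
    if debug:
        args['epochs'] = 1      # just enough to check shapes and the loss trend
    return args

debug_args = make_args(debug=True)
full_args = make_args(debug=False)
print(debug_args['epochs'], full_args['epochs'])
```

Switching back to the full run is then a matter of flipping one flag rather than retyping the original number of epochs.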


# Device
For the final testing of your models you will want to use a GPU for this Colab to run quickly.

Please click `Runtime` and then `Change runtime type`. Then set the `hardware accelerator` to **GPU**.

# Setup
As discussed in Colab 0 and 1, the installation of PyG on Colab can be a little bit tricky. First, let us install PyTorch and check which version you are running.

In [None]:
import os
# Install PyTorch
if 'IS_GRADESCOPE_ENV' not in os.environ:
    !pip install torch==2.5.1+cu124 -f https://download.pytorch.org/whl/torch

Looking in links: https://download.pytorch.org/whl/torch
Collecting torch==2.5.1+cu124
  Downloading https://download.pytorch.org/whl/cu124/torch-2.5.1%2Bcu124-cp311-cp311-linux_x86_64.whl (908.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 908.3/908.3 MB 1.5 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.5.1+cu124)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.5.1+cu124)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.5.1+cu124)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.5.1+cu124)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nv

In [None]:
import torch
print("PyTorch has version {}".format(torch.__version__))

PyTorch has version 2.5.1+cu124


Download the necessary packages for PyG. Make sure that your version of torch matches the output from the cell above. In case of any issues, more information can be found on [PyG's installation page](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html).
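
Because the wheel index used in the install cell embeds the PyTorch version string, it can help to derive the URL from that string rather than typing it by hand. This is a minimal sketch, assuming the URL pattern shown in this notebook (`https://pytorch-geometric.com/whl/torch-<version>.html`) still applies. `pyg_wheel_index` is a hypothetical helper, and the version is passed in as plain text so the snippet runs without PyTorch installed.

```python
def pyg_wheel_index(torch_version):
    """Build the wheel index URL for a version string such as '2.5.1+cu124'."""
    # Assumed pattern, copied from the install cell in this notebook.
    return f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"

url = pyg_wheel_index("2.5.1+cu124")
print(url)
```

In a live session you could pass `torch.__version__` instead of the hard-coded string.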

In [None]:
# Install torch geometric
if 'IS_GRADESCOPE_ENV' not in os.environ:
  !pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-2.5.1+cu124.html
  !pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-2.5.1+cu124.html
  !pip install torch-geometric
  !pip install ogb

Looking in links: https://pytorch-geometric.com/whl/torch-2.5.1+cu124.html
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-2.5.0%2Bcu124/torch_scatter-2.1.2%2Bpt25cu124-cp311-cp311-linux_x86_64.whl (10.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.8/10.8 MB 27.7 MB/s eta 0:00:00
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.1.2+pt25cu124
Looking in links: https://pytorch-geometric.com/whl/torch-2.5.1+cu124.html
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/torch-2.5.0%2Bcu124/torch_sparse-0.6.18%2Bpt25cu124-cp311-cp311-linux_x86_64.whl (5.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.2/5.2 MB 49.1 MB/s eta 0:00:00
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.18+pt25cu124
Collecting torch-geometric
  Downloading torch_geometric-2.6.1-py3-none-any.whl.metadata (63 kB)

In [None]:
import torch_geometric
torch_geometric.__version__

'2.6.1'

# 1) PyTorch Geometric (Datasets and Data)


PyTorch Geometric has two modules for storing and/or transforming graphs into tensor format. One is `torch_geometric.datasets`, which contains a variety of common graph datasets. The other is `torch_geometric.data`, which provides the data handling of graphs as PyTorch tensors.

In this section, you will learn how to use `torch_geometric.datasets` and `torch_geometric.data` together.

## PyG Datasets

The `torch_geometric.datasets` module has many common graph datasets. Here you will explore its usage through one example dataset.

In [None]:
from torch_geometric.datasets import TUDataset

# Load the ENZYMES dataset (600 graphs)
root = './enzymes'
name = 'ENZYMES'
pyg_dataset = TUDataset(root, name)
print(pyg_dataset)   # should output: ENZYMES(600)


Downloading https://www.chrsmrrs.com/graphkerneldatasets/ENZYMES.zip
Processing...


ENZYMES(600)


Done!


## Question 1: How many classes and features are in the ENZYMES dataset? (2 points)

In [None]:
# ──────────────────────────────────────────────────────────────
# Question 1: implement the functions and print the results

def get_num_classes(pyg_dataset):
    """Return the number of classes in a PyG dataset object."""
    return pyg_dataset.num_classes

def get_num_features(pyg_dataset):
    """Return the node feature dimension of a PyG dataset object."""
    return pyg_dataset.num_node_features

# Compute the results
num_classes  = get_num_classes(pyg_dataset)
num_features = get_num_features(pyg_dataset)

# Print
print(f"{name} dataset has {num_classes} classes")
print(f"{name} dataset has {num_features} features")
# ──────────────────────────────────────────────────────────────


ENZYMES dataset has 6 classes
ENZYMES dataset has 3 features


## PyG Data

Each PyG dataset stores a list of `torch_geometric.data.Data` objects, where each `torch_geometric.data.Data` object represents a graph. You can easily get the `Data` object by indexing into the dataset.

For more information such as what is stored in the `Data` object, please refer to the [documentation](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data).
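
As the shapes printed in this notebook suggest, a `Data` object stores connectivity in an `edge_index` of shape `[2, num_edges]`, with one column per directed edge, so an undirected edge appears twice. The sketch below mimics that layout with plain Python lists standing in for tensors. `to_edge_index` is a hypothetical helper for illustration, not a PyG function.

```python
def to_edge_index(undirected_edges):
    """Return [sources, targets] with each undirected edge stored in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]   # u -> v and v -> u
        targets += [v, u]
    return [sources, targets]

# A triangle on nodes 0, 1, 2.
edge_index = to_edge_index([(0, 1), (1, 2), (0, 2)])
num_directed = len(edge_index[0])    # analogous to edge_index.size(1)
num_undirected = num_directed // 2   # each undirected edge was stored twice
print(edge_index)
print(num_directed, num_undirected)
```

This is why counting undirected edges from an `edge_index` involves halving the number of columns, provided the graph really stores both directions of every edge.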

## Question 2: What is the label of the graph with index 100 in the ENZYMES dataset? (1 point)

In [None]:
# Question 2: get the label of the graph with index 100 in the ENZYMES dataset

def get_graph_class(pyg_dataset, idx):
    # Return the y value of the graph at the given index as an integer
    return int(pyg_dataset[idx].y)

# Inspect the dataset directly (Gradescope guard removed)
graph_0 = pyg_dataset[0]
print(graph_0)

# Print the label of the graph with index 100
idx = 100
label = get_graph_class(pyg_dataset, idx)
print(f"Graph with index {idx} has label {label}")


Data(edge_index=[2, 168], x=[37, 3], y=[1])
Graph with index 100 has label 4


## Question 3: How many edges does the graph with index 200 have? (1 point)

In [None]:
# Question 3: count the edges of the graph with index 200 in the ENZYMES dataset

def get_graph_num_edges(pyg_dataset, idx):
    """
    Return the number of undirected edges in the graph at the given index.
    The edge_index tensor has shape [2, 2 * num_edges], so the column count is halved.
    """
    edge_index = pyg_dataset[idx].edge_index
    return edge_index.size(1) // 2

# Inspect the graph directly
graph_200 = pyg_dataset[200]
print(graph_200)   # Data(edge_index=[2, XXX], ...)

# Print the number of edges of the graph with index 200
idx = 200
num_edges = get_graph_num_edges(pyg_dataset, idx)
print(f"Graph with index {idx} has {num_edges} edges")


Data(edge_index=[2, 106], x=[29, 3], y=[1])
Graph with index 200 has 53 edges


# 2) Open Graph Benchmark (OGB)

The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. Its datasets are automatically downloaded, processed, and split using the OGB Data Loader. A model's performance over these datasets can then be evaluated using the OGB Evaluator in a unified manner.

## Dataset and Data

OGB also supports PyG dataset and data classes. Here you will explore the `ogbn-arxiv` dataset.

In [None]:
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

# Gradescope guard removed so that this also always runs on Gradescope
dataset_name = 'ogbn-arxiv'
dataset      = PygNodePropPredDataset(name=dataset_name, transform=None)
print(f"The {dataset_name} dataset has {len(dataset)} graph")

# Extract the graph
data = dataset[0]
print(data)


Downloading http://snap.stanford.edu/ogb/data/nodeproppred/arxiv.zip


Downloaded 0.08 GB: 100%|██████████| 81/81 [00:01<00:00, 49.23it/s]


Extracting dataset/arxiv.zip


Processing...


Loading necessary files...
This might take a while.
Processing graphs...


100%|██████████| 1/1 [00:00<00:00, 5127.51it/s]


Converting graphs into PyG objects...


100%|██████████| 1/1 [00:00<00:00, 2669.83it/s]

Saving...



Done!
  self.data, self.slices = torch.load(self.processed_paths[0])


The ogbn-arxiv dataset has 1 graph
Data(num_nodes=169343, edge_index=[2, 1166243], x=[169343, 128], node_year=[169343, 1], y=[169343, 1])


## Question 4: How many features are in the ogbn-arxiv graph? (1 point)

In [None]:
# Question 4: return the number of features in the ogbn-arxiv graph
def graph_num_features(data):
    # The second dimension of data.x is the number of features
    return data.x.size(1)

# Compute and print right away
num_features = graph_num_features(data)
print(f"The graph has {num_features} features")



The graph has 128 features


# 3) GNN: Node Property Prediction

In this section you will build your first graph neural network using PyTorch Geometric. Then you will apply it to the task of node property prediction (node classification).

Specifically, you will use GCN as the foundation for your graph neural network ([Kipf et al. (2017)](https://arxiv.org/pdf/1609.02907.pdf)). To do so, you will work with PyG's built-in `GCNConv` layer.
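
For reference, the layer-wise propagation rule from the cited paper can be summarized as below. Here $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $\tilde{D}$ is its diagonal degree matrix, $H^{(l)}$ holds the node representations at layer $l$ (with $H^{(0)} = X$), $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity such as ReLU. This is a summary of the paper's formulation rather than a description of every detail of `GCNConv`, which handles the normalized aggregation and linear transform while the nonlinearity is applied separately in the model.

$$
H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)
$$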

## Setup

In [None]:
import torch
import pandas as pd
import torch.nn.functional as F
print(torch.__version__)

# The PyG built-in GCNConv
from torch_geometric.nn import GCNConv

import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

2.5.1+cu124


## Load and Preprocess the Dataset

In [None]:
# ──────────────────────────────────────────────────────────────
# Load and preprocess the ogbn-arxiv dataset

from ogb.nodeproppred import PygNodePropPredDataset, Evaluator
import torch_geometric.transforms as T

# Gradescope guard removed so that this always runs
dataset_name = 'ogbn-arxiv'
dataset      = PygNodePropPredDataset(name=dataset_name, transform=T.ToSparseTensor())
print(f"Loaded {dataset_name} with {len(dataset)} graph")

# Extract the graph and symmetrize its adjacency matrix
data = dataset[0]
data.adj_t = data.adj_t.to_symmetric()

# Select the device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {device}")

# Move the data and the split indices to the device
data = data.to(device)
split_idx = dataset.get_idx_split()
train_idx = split_idx['train'].to(device)
valid_idx = split_idx['valid'].to(device)
test_idx  = split_idx['test'].to(device)
# ──────────────────────────────────────────────────────────────


Loaded ogbn-arxiv with 1 graph


  self.data, self.slices = torch.load(self.processed_paths[0])


Device: cuda


## GCN Model

Now that you have loaded the datasets, you will implement your own GCN model!

Please follow the figure below to help in implementing the `forward` function.


![test](https://drive.google.com/uc?id=128AuYAXNXGg7PIhJJ7e420DoPWKb-RtL)

In [None]:
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout, return_embeds=False):
        super().__init__()
        # convolution + batchnorm lists
        self.convs = torch.nn.ModuleList()
        self.bns   = torch.nn.ModuleList()

        # First layer (input → hidden)
        self.convs.append(GCNConv(input_dim, hidden_dim))
        self.bns.append(torch.nn.BatchNorm1d(hidden_dim))

        # Intermediate layers (hidden → hidden)
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_dim, hidden_dim))
            self.bns.append(torch.nn.BatchNorm1d(hidden_dim))

        # Last layer (hidden → output)
        self.convs.append(GCNConv(hidden_dim, output_dim))

        # Log-softmax for the output
        self.soft_max = torch.nn.LogSoftmax(dim=1)
        self.dropout  = dropout
        self.return_embeds = return_embeds

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, adj_t):
        # Intermediate layers: conv → BN → ReLU → Dropout
        for i, conv in enumerate(self.convs[:-1]):
            x = conv(x, adj_t)
            x = self.bns[i](x)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        # Last layer
        x = self.convs[-1](x, adj_t)
        if not self.return_embeds:
            x = self.soft_max(x)
        return x
# Quick sanity check that the model is well defined
# model = GCN(input_dim=128, hidden_dim=64, output_dim=5, num_layers=3, dropout=0.5)
# model.eval()  # switch to evaluation mode
# # Dummy input: 10 nodes with 128-dimensional features
# x_dummy = torch.randn(10, 128)
# # Arbitrary adjacency in sparse tensor format (here the identity matrix, i.e. self-loops only)
# adj_dummy = torch.eye(10).to_sparse()
# out = model(x_dummy, adj_dummy)
# print(out.shape)


In [None]:
def train(model, data, train_idx, optimizer, loss_fn):
    model.train()
    optimizer.zero_grad()

    # Model output
    out = model(data.x, data.adj_t)

    # Training loss: flatten the target to a 1-D tensor
    y = data.y[train_idx].view(-1)         # shape (N,)
    loss = loss_fn(out[train_idx], y)      # out[train_idx]: (N, num_classes)

    loss.backward()
    optimizer.step()
    return loss.item()


In [None]:


@torch.no_grad()
def test(model, data, split_idx, evaluator, save_model_results=False):
    model.eval()
    out = model(data.x, data.adj_t)        # (num_nodes, num_classes)

    # Prediction: index of the highest-scoring class, kept 2-D with shape (N, 1).
    y_pred = out.argmax(dim=-1, keepdim=True)  # (num_nodes, 1)

    # data.y already has shape (num_nodes, 1), so it is used as-is.
    results = {}
    for split in ['train','valid','test']:
        mask = split_idx[split]
        results[split] = evaluator.eval({
            'y_true': data.y[mask],           # (Ns,1)
            'y_pred': y_pred[mask],           # (Ns,1)
        })['acc']

    # Save the predictions
    if save_model_results:
        import pandas as pd
        df = pd.DataFrame({'y_pred': y_pred.cpu().numpy().squeeze()})
        df.to_csv('ogbn-arxiv_node.csv', index=False)

    return results['train'], results['valid'], results['test']





In [None]:
# Please do not change the args
if 'IS_GRADESCOPE_ENV' not in os.environ:
  args = {
      'device': device,
      'num_layers': 3,
      'hidden_dim': 256,
      'dropout': 0.5,
      'lr': 0.01,
      'epochs': 100,
  }
  args

In [None]:
# Guard so that this cell is skipped in the Gradescope environment
if 'IS_GRADESCOPE_ENV' not in os.environ:
    # Create the GCN model
    model = GCN(
        data.num_features,
        args['hidden_dim'],
        dataset.num_classes,
        args['num_layers'],
        args['dropout']
    ).to(device)

    # (Commented out because compile support was unstable in PyG 2.3.1 / Torch 2.0.1)
    # try:
    #     model = torch_geometric.compile(model)
    #     print("GCN Model compiled")
    # except Exception as err:
    #     print(f"Model compile not supported: {err}")

    # Initialize the OGB evaluator
    evaluator = Evaluator(name='ogbn-arxiv')


In [None]:
import copy
if 'IS_GRADESCOPE_ENV' not in os.environ:
  # reset the parameters to initial random value
  model.reset_parameters()

  optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
  loss_fn = F.nll_loss

  best_model = None
  best_valid_acc = 0

  for epoch in range(1, 1 + args["epochs"]):
    loss = train(model, data, train_idx, optimizer, loss_fn)
    result = test(model, data, split_idx, evaluator)
    train_acc, valid_acc, test_acc = result
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        best_model = copy.deepcopy(model)
    print(f'Epoch: {epoch:02d}, '
          f'Loss: {loss:.4f}, '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')

Epoch: 01, Loss: 4.2698, Train: 24.14%, Valid: 28.07% Test: 25.24%
Epoch: 02, Loss: 2.3786, Train: 24.45%, Valid: 22.29% Test: 27.86%
Epoch: 03, Loss: 1.9599, Train: 28.01%, Valid: 24.46% Test: 29.41%
Epoch: 04, Loss: 1.7893, Train: 46.04%, Valid: 47.25% Test: 49.48%
Epoch: 05, Loss: 1.6790, Train: 46.84%, Valid: 44.82% Test: 41.68%
Epoch: 06, Loss: 1.5856, Train: 44.17%, Valid: 43.63% Test: 41.44%
Epoch: 07, Loss: 1.5194, Train: 40.99%, Valid: 43.38% Test: 45.37%
Epoch: 08, Loss: 1.4712, Train: 39.45%, Valid: 41.96% Test: 45.63%
Epoch: 09, Loss: 1.4160, Train: 39.31%, Valid: 41.40% Test: 45.45%
Epoch: 10, Loss: 1.3899, Train: 38.51%, Valid: 40.22% Test: 44.61%
Epoch: 11, Loss: 1.3631, Train: 37.89%, Valid: 39.34% Test: 44.04%
Epoch: 12, Loss: 1.3309, Train: 38.05%, Valid: 39.15% Test: 43.87%
Epoch: 13, Loss: 1.3006, Train: 38.64%, Valid: 39.17% Test: 43.79%
Epoch: 14, Loss: 1.2851, Train: 39.85%, Valid: 40.66% Test: 44.98%
Epoch: 15, Loss: 1.2647, Train: 42.94%, Valid: 44.86% Test: 48

## Question 5: What are your `best_model` validation and test accuracies? (20 points)

Run the cell below to see the results of your best model and save your model's predictions to a file named *ogbn-arxiv_node.csv*.

You can view this file by clicking on the *Folder* icon on the left side panel. As in Colab 1, when you submit your assignment, you will have to download this file and attach it to your submission.

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  best_result = test(best_model, data, split_idx, evaluator, save_model_results=True)
  train_acc, valid_acc, test_acc = best_result
  print(f'Best model: '
        f'Train: {100 * train_acc:.2f}%, '
        f'Valid: {100 * valid_acc:.2f}% '
        f'Test: {100 * test_acc:.2f}%')

Best model: Train: 73.80%, Valid: 72.01% Test: 71.19%


# 4) GNN: Graph Property Prediction

In this section you will create a graph neural network for graph property prediction (graph classification).


## Load and preprocess the dataset

In [None]:
# 4) GNN: Graph Property Prediction
# Load and preprocess the OGBG‐MolHIV dataset for graph classification

from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
from torch_geometric.loader import DataLoader
from tqdm import tqdm
import torch

# Load the dataset (read the plain edge_index, without converting to an adjacency matrix)
dataset = PygGraphPropPredDataset(name='ogbg-molhiv', transform=None)
print(f"{dataset.name} dataset has {len(dataset)} graphs")

# Select the device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Get the train/validation/test index splits
split_idx = dataset.get_idx_split()
train_idx = split_idx['train']
valid_idx = split_idx['valid']
test_idx  = split_idx['test']

# Create the PyG DataLoaders
batch_size = 32
train_loader = DataLoader(dataset[train_idx], batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(dataset[valid_idx], batch_size=batch_size, shuffle=False)
test_loader  = DataLoader(dataset[test_idx], batch_size=batch_size, shuffle=False)

# Initialize the OGB evaluator
evaluator = Evaluator(name='ogbg-molhiv')



Downloading http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip


Downloaded 0.00 GB: 100%|██████████| 3/3 [00:00<00:00,  7.54it/s]
Processing...


Extracting dataset/hiv.zip
Loading necessary files...
This might take a while.
Processing graphs...


100%|██████████| 41127/41127 [00:00<00:00, 69702.85it/s]


Converting graphs into PyG objects...


100%|██████████| 41127/41127 [00:01<00:00, 20773.54it/s]


Saving...
ogbg-molhiv dataset has 41127 graphs
Using device: cuda


Done!
  self.data, self.slices = torch.load(self.processed_paths[0])


In [None]:
# Load the dataset splits into corresponding dataloaders
# We will train the graph classification task on a batch of 32 graphs
# Shuffle the order of graphs for training set

train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, num_workers=0)
valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, num_workers=0)
test_loader  = DataLoader(dataset[split_idx["test"]],  batch_size=32, shuffle=False, num_workers=0)


## Initialize Model Training Parameters
During debugging and testing we recommend setting `epochs` to a lower value such as 1 or 2.

In [None]:
# Initialize model training parameters for OGBG-MolHIV graph classification
import os

# (Gradescope guard removed)

args = {
    'device'    : device,   # cuda or cpu
    'num_layers': 5,        # number of graph conv layers
    'hidden_dim': 256,      # hidden dimensionality
    'dropout'   : 0.5,      # dropout probability
    'lr'        : 0.001,    # learning rate
    'epochs'    : 15,       # number of training epochs
    'batch_size': 32        # batch size for DataLoader
}

args


{'device': 'cuda',
 'num_layers': 5,
 'hidden_dim': 256,
 'dropout': 0.5,
 'lr': 0.001,
 'epochs': 15,
 'batch_size': 32}

## Graph Prediction Model

### Graph Mini-Batching
Before diving into the actual model, we introduce the concept of mini-batching with graphs. In order to parallelize the processing of a mini-batch of graphs, PyG combines the graphs into a single disconnected graph data object (*torch_geometric.data.Batch*). *torch_geometric.data.Batch* inherits from *torch_geometric.data.Data* (introduced earlier) and contains an additional attribute called `batch`.

The `batch` attribute is a vector mapping each node to the index of its corresponding graph within the mini-batch:

    batch = [0, ..., 0, 1, ..., n - 2, n - 1, ..., n - 1]

This attribute is crucial for tracking which graph each node belongs to. It can be used, for example, to average the node embeddings of each graph individually to compute graph-level embeddings.
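
To make the role of `batch` concrete, here is a plain-Python sketch of per-graph mean pooling. It is meant to illustrate the idea behind global mean pooling, not PyG's actual implementation. `mean_pool` and the toy values are hypothetical, and the sketch assumes every graph in the mini-batch has at least one node.

```python
def mean_pool(node_embeddings, batch, num_graphs):
    """Average node embeddings per graph, where batch[i] is the graph index of node i."""
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for emb, g in zip(node_embeddings, batch):
        counts[g] += 1
        for d in range(dim):
            sums[g][d] += emb[d]
    # Assumes counts[g] > 0 for every graph.
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

# Two graphs in one mini-batch: nodes 0-1 belong to graph 0, nodes 2-4 to graph 1.
x = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [3.0, 6.0], [6.0, 3.0]]
batch = [0, 0, 1, 1, 1]
graph_embeddings = mean_pool(x, batch, num_graphs=2)
print(graph_embeddings)
```

Each row of the result is one graph-level embedding, regardless of how many nodes the graph contributed to the mini-batch.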



### Implementation
Now, you have all of the tools to implement a GCN Graph Prediction model!  

To do so, you will reuse your existing GCN model to generate `node_embeddings` for a graph and then use `Global Pooling` over these node embeddings to create a graph-level embedding that can be used to predict graph properties. Remember that the `batch` attribute will be essential for performing Global Pooling over your mini-batch of graphs.

In [None]:
from ogb.graphproppred.mol_encoder import AtomEncoder
from torch_geometric.nn import global_mean_pool

class GCN_Graph(torch.nn.Module):
    def __init__(self, hidden_dim, output_dim, num_layers, dropout):
        super(GCN_Graph, self).__init__()

        # 1) Atom feature encoder
        self.node_encoder = AtomEncoder(hidden_dim)

        # 2) Reuse your earlier node-level GCN (return_embeds must be set to True)
        self.gnn_node = GCN(
            hidden_dim,   # input_dim = hidden_dim (encoded atom features)
            hidden_dim,   # hidden_dim of GCN
            hidden_dim,   # output_dim = hidden_dim (we only need embeddings)
            num_layers,
            dropout,
            return_embeds=True
        )

        # 3) Global pooling layer
        self.pool = global_mean_pool

        # 4) Linear layer for the final graph-level prediction
        self.linear = torch.nn.Linear(hidden_dim, output_dim)

    def reset_parameters(self):
        # AtomEncoder has no reset_parameters method, so it is not called.
        # self.node_encoder.reset_parameters()

        # Reset only the parts that actually need re-initialization
        self.gnn_node.reset_parameters()
        self.linear.reset_parameters()

    def forward(self, batched_data):
        # batched_data: torch_geometric.data.Batch
        x      = batched_data.x
        edge_index = batched_data.edge_index
        batch  = batched_data.batch   # graph index of each node

        # 1) AtomEncoder → GCN to produce node embeddings
        x = self.node_encoder(x)
        x = self.gnn_node(x, edge_index)

        # 2) Aggregate node embeddings into graph-level embeddings
        x = self.pool(x, batch)

        # 3) Predict the graph-level property
        out = self.linear(x)
        return out


In [None]:
from tqdm import tqdm

def train(model, device, data_loader, optimizer, loss_fn):
    model.train()
    total_loss = 0.0

    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
        batch = batch.to(device)

        # 1) Reset the gradients
        optimizer.zero_grad()

        # 2) Forward pass to get the predictions
        out = model(batch)  # shape [batch_size, num_tasks]

        # 3) Keep only the labeled samples
        #    OGB marks missing labels as NaN; since NaN != NaN, this
        #    comparison is True only for labeled entries
        mask = batch.y == batch.y
        if mask.sum() == 0:
            continue  # skip batches without any labeled graph

        # 4) Compute the loss
        #    Cast batch.y to float32 for use with losses such as MSE or BCE
        y_true = batch.y[mask].to(torch.float32)
        y_pred = out[mask].view(-1)

        loss = loss_fn(y_pred, y_true)

        # 5) Backpropagate and update the parameters
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / (step + 1)


In [None]:
# The evaluation function
def eval(model, device, loader, evaluator, save_model_results=False, save_file=None):
    model.eval()
    y_true = []
    y_pred = []

    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
        batch = batch.to(device)

        if batch.x.shape[0] == 1:
            pass
        else:
            with torch.no_grad():
                pred = model(batch)

            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())

    y_true = torch.cat(y_true, dim = 0).numpy()
    y_pred = torch.cat(y_pred, dim = 0).numpy()

    input_dict = {"y_true": y_true, "y_pred": y_pred}

    if save_model_results:
        print("Saving Model Predictions")

        # Create a pandas dataframe with two columns
        # y_pred | y_true
        data = {}
        data['y_pred'] = y_pred.reshape(-1)
        data['y_true'] = y_true.reshape(-1)

        df = pd.DataFrame(data=data)
        # Save to csv
        df.to_csv('ogbg-molhiv_graph_' + save_file + '.csv', sep=',', index=False)

    return evaluator.eval(input_dict)

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  model = GCN_Graph(args['hidden_dim'],
              dataset.num_tasks, args['num_layers'],
              args['dropout']).to(device)
  # Disable compile as this does not seem to work yet in PyTorch 2.0.1/PyG 2.3.1
  # try:
  #   model = torch_geometric.compile(model)
  #   print("Graph Prediction Model compiled")
  # except Exception as err:
  #   print(f"Model compile not supported: {err}")

  evaluator = Evaluator(name='ogbg-molhiv')

In [None]:
import copy

if 'IS_GRADESCOPE_ENV' not in os.environ:
  model.reset_parameters()

  optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
  loss_fn = torch.nn.BCEWithLogitsLoss()

  best_model = None
  best_valid_acc = 0

  for epoch in range(1, 1 + args["epochs"]):
    print('Training...')
    loss = train(model, device, train_loader, optimizer, loss_fn)

    print('Evaluating...')
    train_result = eval(model, device, train_loader, evaluator)
    val_result = eval(model, device, valid_loader, evaluator)
    test_result = eval(model, device, test_loader, evaluator)

    train_acc, valid_acc, test_acc = train_result[dataset.eval_metric], val_result[dataset.eval_metric], test_result[dataset.eval_metric]
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        best_model = copy.deepcopy(model)
    print(f'Epoch: {epoch:02d}, '
          f'Loss: {loss:.4f}, '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')

Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 62.25it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 134.66it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 104.04it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 105.38it/s]


Epoch: 01, Loss: 0.1568, Train: 72.18%, Valid: 74.10% Test: 69.49%
Training...


Iteration: 100%|██████████| 1029/1029 [00:13<00:00, 77.95it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 134.71it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 147.35it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 146.74it/s]


Epoch: 02, Loss: 0.1482, Train: 75.97%, Valid: 76.18% Test: 73.22%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 62.49it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 134.38it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 152.21it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 151.18it/s]


Epoch: 03, Loss: 0.1454, Train: 76.18%, Valid: 73.96% Test: 65.14%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 84.79it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 142.23it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 107.80it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 142.14it/s]


Epoch: 04, Loss: 0.1426, Train: 77.22%, Valid: 74.59% Test: 72.50%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 84.55it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 146.06it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 146.06it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 146.51it/s]


Epoch: 05, Loss: 0.1409, Train: 77.39%, Valid: 74.58% Test: 68.36%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 83.63it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 134.63it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 145.75it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 150.14it/s]


Epoch: 06, Loss: 0.1396, Train: 78.48%, Valid: 74.93% Test: 69.81%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 79.50it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 134.68it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 146.86it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 143.77it/s]


Epoch: 07, Loss: 0.1385, Train: 78.61%, Valid: 75.22% Test: 73.51%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 83.48it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 133.94it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 144.52it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 143.92it/s]


Epoch: 08, Loss: 0.1377, Train: 78.84%, Valid: 74.40% Test: 70.95%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 83.83it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 141.73it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 106.01it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 145.38it/s]


Epoch: 09, Loss: 0.1360, Train: 80.21%, Valid: 75.34% Test: 74.15%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 84.74it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 145.49it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 148.18it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 151.79it/s]


Epoch: 10, Loss: 0.1351, Train: 79.77%, Valid: 77.35% Test: 73.30%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 84.83it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 135.08it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 147.23it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 146.60it/s]


Epoch: 11, Loss: 0.1340, Train: 79.42%, Valid: 73.36% Test: 70.74%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 84.62it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 136.01it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 148.75it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 145.39it/s]


Epoch: 12, Loss: 0.1325, Train: 78.22%, Valid: 75.62% Test: 74.18%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 85.17it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 134.47it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 148.74it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 145.93it/s]


Epoch: 13, Loss: 0.1330, Train: 80.85%, Valid: 75.74% Test: 73.01%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 84.30it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 146.15it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 127.15it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 112.58it/s]


Epoch: 14, Loss: 0.1314, Train: 80.43%, Valid: 75.94% Test: 71.00%
Training...


Iteration: 100%|██████████| 1029/1029 [00:12<00:00, 85.53it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 142.77it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 148.65it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 147.50it/s]

Epoch: 15, Loss: 0.1315, Train: 81.78%, Valid: 75.81% Test: 72.01%





## Question 6: What are your `best_model` validation and test ROC-AUC scores? (20 points)

Run the cell below to see the results of your best model and save your model's predictions over the validation and test datasets. The resulting files are named *ogbg-molhiv_graph_valid.csv* and *ogbg-molhiv_graph_test.csv*.

Again, you can view these files by clicking on the *Folder* icon on the left side panel. As in Colab 1, when you submit your assignment, you will have to download these files and attach them to your submission.
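
Besides the file browser, the saved predictions can also be checked programmatically. The sketch below is one possible way to do this with Python's standard `csv` module. The helper name `summarize_csv` is hypothetical, and it assumes (the notebook does not confirm this) that the first row of each file is a header.

```python
import csv

def summarize_csv(path):
    """Return (first_row, number_of_remaining_rows) for a CSV file."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return [], 0
    # Assumption: the first row is a header; adjust if the file has none.
    return rows[0], len(rows) - 1
```

For example, `summarize_csv("ogbg-molhiv_graph_valid.csv")` could be called once the file has been saved, assuming it was written to the current working directory.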

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  train_acc = eval(best_model, device, train_loader, evaluator)[dataset.eval_metric]
  valid_acc = eval(best_model, device, valid_loader, evaluator, save_model_results=True, save_file="valid")[dataset.eval_metric]
  test_acc  = eval(best_model, device, test_loader, evaluator, save_model_results=True, save_file="test")[dataset.eval_metric]

  print(f'Best model: '
      f'Train: {100 * train_acc:.2f}%, '
      f'Valid: {100 * valid_acc:.2f}% '
      f'Test: {100 * test_acc:.2f}%')

Iteration: 100%|██████████| 1029/1029 [00:07<00:00, 134.99it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 144.77it/s]


Saving Model Predictions


Iteration: 100%|██████████| 129/129 [00:00<00:00, 145.87it/s]

Saving Model Predictions
Best model: Train: 79.77%, Valid: 77.35% Test: 73.30%
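
The reported best model matches epoch 10 in the training log above, which is the epoch with the highest validation ROC-AUC. As a small illustration of that selection rule, the sketch below picks the best epoch from the (validation, test) percentages copied from the log for epochs 2 through 15. The variable names are illustrative and are not taken from the notebook's training loop.

```python
# (validation %, test %) ROC-AUC per epoch, copied from the training log above.
log = {
    2: (76.18, 73.22), 3: (73.96, 65.14), 4: (74.59, 72.50),
    5: (74.58, 68.36), 6: (74.93, 69.81), 7: (75.22, 73.51),
    8: (74.40, 70.95), 9: (75.34, 74.15), 10: (77.35, 73.30),
    11: (73.36, 70.74), 12: (75.62, 74.18), 13: (75.74, 73.01),
    14: (75.94, 71.00), 15: (75.81, 72.01),
}

# Select the epoch by validation score only; the test score is just read off.
best_epoch = max(log, key=lambda epoch: log[epoch][0])
best_valid, test_at_best = log[best_epoch]
print(best_epoch, best_valid, test_at_best)  # 10 77.35 73.3
```

Note that the highest test score in this excerpt (74.18% at epoch 12) is not the one reported: the model is chosen on the validation split, and the test score is only read off afterwards.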





## Question 7 (Optional): Experiment with the two other global pooling layers in PyTorch Geometric.
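
Global pooling reduces the node embeddings of each graph in a batch to a single vector per graph. PyTorch Geometric provides `global_mean_pool`, `global_add_pool`, and `global_max_pool` for this. The NumPy sketch below only mirrors what these three reductions compute; `pool_nodes` is a hypothetical helper written for illustration and is not PyG's implementation.

```python
import numpy as np

def pool_nodes(x, batch, mode):
    """Reduce node features x [num_nodes, dim] to one row per graph id in batch."""
    num_graphs = int(batch.max()) + 1
    if mode == "max":
        out = np.full((num_graphs, x.shape[1]), -np.inf)
        np.maximum.at(out, batch, x)   # element-wise max over each graph's nodes
        return out
    out = np.zeros((num_graphs, x.shape[1]))
    np.add.at(out, batch, x)           # sum over each graph's nodes
    if mode == "mean":
        counts = np.bincount(batch, minlength=num_graphs)
        out = out / counts[:, None]    # divide each sum by the graph's node count
    return out

# Three nodes: the first two belong to graph 0, the last one to graph 1.
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
batch = np.array([0, 0, 1])

sum_out = pool_nodes(x, batch, "sum")    # [[4, 6], [5, 0]]
mean_out = pool_nodes(x, batch, "mean")  # [[2, 3], [5, 0]]
max_out = pool_nodes(x, batch, "max")    # [[3, 4], [5, 0]]
```

Sum pooling keeps information about graph size, mean pooling normalizes it away, and max pooling keeps only the largest activation per feature.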

In [None]:
# Q7: compare global pooling layers (version that casts data.x to float)

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool, global_add_pool, global_max_pool
from torch_geometric.loader import DataLoader

# 1) Define GNNConvLayer: a stack of GCNConv layers, each followed by ReLU
class GNNConvLayer(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_layers):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        self.convs.append(GCNConv(in_dim, hidden_dim))
        for _ in range(num_layers - 1):
            self.convs.append(GCNConv(hidden_dim, hidden_dim))
    def forward(self, x, edge_index):
        for conv in self.convs:
            x = conv(x, edge_index).relu()
        return x

# 2) Define GraphPoolNet (casts x to float before the conv layers)
class GraphPoolNet(torch.nn.Module):
    def __init__(self, conv_layers, hidden_dim, num_classes, pool_fn):
        super().__init__()
        self.gnn  = conv_layers
        self.pool = pool_fn
        self.lin  = torch.nn.Linear(hidden_dim, num_classes)
    def forward(self, data):
        # Cast integer node features to float so GCNConv can consume them
        x = data.x.float()
        edge_index, batch = data.edge_index, data.batch
        x = self.gnn(x, edge_index)
        x = self.pool(x, batch)
        return self.lin(x)

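# 3) Experiment settings
hidden_dim  = 128
num_classes = dataset.num_classes
n_epochs    = 20
# Note: this single conv stack is passed to every run below, so its weights
# are shared across the three pooling experiments rather than re-initialized.
conv_layers = GNNConvLayer(dataset.num_features, hidden_dim, num_layers=2)

# train_loader, valid_loader, test_loader, and device are reused from earlier cells.
# Evaluation helper (note: this measures plain accuracy, not ROC-AUC).
@torch.no_grad()
def evaluate(model, loader):
    model.eval()                             # switch to evaluation mode
    correct = 0
    for batch in loader:
        batch = batch.to(device)            # move the batch to the target device
        batch.x = batch.x.float()           # int -> float (avoids a GCNConv dtype error)
        out = model(batch)                  # forward pass
        pred = out.argmax(dim=1)            # predicted class
        correct += int((pred == batch.y.view(-1)).sum())  # accumulate correct predictions
    return correct / len(loader.dataset)    # fraction of graphs classified correctly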


# 4) Training and evaluation helper
def run_experiment(pool_fn, name):
    model     = GraphPoolNet(conv_layers, hidden_dim, num_classes, pool_fn).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_valid = 0.0

    for epoch in range(1, n_epochs+1):
        model.train()
        for batch in train_loader:
            batch = batch.to(device)
            optimizer.zero_grad()
            out = model(batch)
            loss = F.cross_entropy(out, batch.y.view(-1))
            loss.backward()
            optimizer.step()

        model.eval()
        train_acc = evaluate(model, train_loader)
        valid_acc = evaluate(model, valid_loader)
        test_acc  = evaluate(model, test_loader)
        best_valid = max(best_valid, valid_acc)

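    # Note: the metrics printed below come from the final epoch;
    # best_valid is tracked above but not reported.
    print(f"{name} pooling → "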
          f"Train: {train_acc*100:.2f}%  "
          f"Valid: {valid_acc*100:.2f}%  "
          f"Test : {test_acc*100:.2f}%")

# 5) Compare the three pooling methods
run_experiment(global_mean_pool, 'mean')  # mean
run_experiment(global_add_pool,  'sum')   # sum
run_experiment(global_max_pool,  'max')   # max


mean pooling → Train: 96.23%  Valid: 97.96%  Test : 96.84%
sum pooling → Train: 96.43%  Valid: 97.54%  Test : 96.99%
max pooling → Train: 96.23%  Valid: 97.96%  Test : 96.69%
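
The percentages above come from the `evaluate` helper in this cell, which measures classification accuracy, whereas Question 6 reports ROC-AUC, so the two sets of numbers are not directly comparable. When one class is rare, accuracy can be high even for a model that never predicts the rare class. The toy example below illustrates this with made-up labels (not taken from ogbg-molhiv) and a plain-Python ROC-AUC. Whether that effect explains the results here would need checking with the OGB evaluator.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels: 1 positive among 25 graphs (made up for illustration).
y_true = [0] * 24 + [1]

# A model that gives every graph the same low score and always predicts class 0.
const_scores = [0.1] * 25
const_pred = [0] * 25

acc_const = accuracy(y_true, const_pred)    # 0.96
auc_const = roc_auc(y_true, const_scores)   # 0.5
```

The constant model reaches 96% accuracy but only chance-level ROC-AUC, which is why ROC-AUC is the more informative metric for comparing pooling layers on an imbalanced task.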


# Submission

You will need to submit four files on Gradescope to complete this notebook.

1.   Your completed *XCS224W_Colab2.ipynb*. From the "File" menu select "Download .ipynb" to save a local copy of your completed Colab.
2.  *ogbn-arxiv_node.csv*
3.  *ogbg-molhiv_graph_valid.csv*
4.  *ogbg-molhiv_graph_test.csv*

Download the CSV files by selecting the *Folder* icon on the left panel.

To submit your work, zip the four files listed above and submit the archive to Gradescope. **NOTE:** DO NOT rename any of the downloaded files.
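
If you prefer to build the archive with a script, the sketch below is one way to do it with Python's standard `zipfile` module. The helper `zip_submission` and the archive name used in the usage note are hypothetical, and the flat archive layout (all files at the archive root) is an assumption, since the instructions above do not specify one.

```python
import os
import zipfile

def zip_submission(paths, archive_path):
    """Write the given files into a zip archive, keeping their original base names."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Missing files: {missing}")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            # arcname drops directories so each file sits at the archive root
            # under its unchanged file name.
            zf.write(p, arcname=os.path.basename(p))
    return archive_path

# The four file names listed in the Submission section.
SUBMISSION_FILES = [
    "XCS224W_Colab2.ipynb",
    "ogbn-arxiv_node.csv",
    "ogbg-molhiv_graph_valid.csv",
    "ogbg-molhiv_graph_test.csv",
]
```

With all four files downloaded into the current directory, `zip_submission(SUBMISSION_FILES, "colab2_submission.zip")` would create the archive. The function raises `FileNotFoundError` if any file is missing, so an incomplete submission is caught before upload.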