# Hyperparameter Tuning Workflow

This notebook performs end-to-end hyperparameter optimization for GCN and GAT models that predict NO₂ on road segments. It is organized as follows.

## Contents

1. **Environment Setup**  
   Install and configure PyTorch-Geometric, Optuna, and other dependencies.  
   _Code block 1_

2. **Core Imports & Drive Mount**  
   Import data, geospatial, graph, ML, plotting, and GNN libraries; mount Google Drive.  
   _Code block 2_

3. **Data Preparation & Graph Construction**  
   - Load road-network GeoJSON; compute centroids & feature matrix.  
   - Build spatial NetworkX graph of touching segments.  
   _First part of Code block 3_

4. **Outlier Detection**  
   Identify spatial outliers in NO₂ from 1-hop neighbor residuals with MAD-based thresholds, then classify them as measurement errors or real extremes.  
   _Middle of Code block 3_

5. **Graph Augmentation & Data Builder**  
   - Add far-range “feature_sim” edges via grouped KNN & spatial filters.  
   - Create `torch_geometric.data.Data` objects with 80/20 train/test masks; segments classified as measurement errors are dropped from the training mask.  
   _Second half of Code block 3_

6. **Optuna Hyperparameter Tuning**  
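   Use Optuna's default Tree-structured Parzen Estimator (TPE) sampler to optimize GCN & GAT hyperparameters:  
   layers, hidden size, dropout, learning rate, weight decay, activation.  
   `HyperbandPruner` is imported, but the study as written passes no pruner and reports no intermediate values, so no trials are pruned.  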
   _Last part of Code block 3_

7. **Results**  
   Print best hyperparameters for GCN and GAT.  
   _End of Code block 3_

> **Tip:** Run each section in order to reproduce the full optimization pipeline.  

### 1. Environment Setup

Installs Optuna and the matching PyTorch-Geometric wheels for your current PyTorch version so you can run GNNs and hyperparameter searches.
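
The `-f` wheel index used in the install commands is derived from the installed PyTorch version string. A minimal sketch of that string construction follows; `pyg_wheel_index` is a hypothetical helper introduced only for illustration, and the version string is an example value, not one taken from this notebook's run.

```python
def pyg_wheel_index(torch_version: str) -> str:
    """Build the PyG wheel index URL for a given PyTorch version string.

    Hypothetical helper for illustration; the notebook inlines the same
    pattern via `{torch.__version__}` in its pip commands.
    """
    return f"https://data.pyg.org/whl/torch-{torch_version}.html"


# Example version string (an assumption); in the notebook use torch.__version__.
example_url = pyg_wheel_index("2.3.0+cu121")
print(example_url)
```

If pip finds no matching wheel at that index, it typically falls back to building the extension from source, which is much slower, so it can be worth printing the URL and checking it before running the install cell.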


In [None]:
# Ensure we have the correct PyTorch version for PyG wheels
import torch

!pip install optuna -q
!pip uninstall torch-scatter torch-sparse torch-geometric torch-cluster -y -q
!pip install torch-scatter -f https://data.pyg.org/whl/torch-{torch.__version__}.html -q
!pip install torch-sparse  -f https://data.pyg.org/whl/torch-{torch.__version__}.html -q
!pip install torch-cluster -f https://data.pyg.org/whl/torch-{torch.__version__}.html -q
!pip install git+https://github.com/pyg-team/pytorch_geometric.git -q


### 2. Core Imports & Drive Mount

Loads all required Python packages for data I/O, geospatial processing, graph operations, ML metrics, plotting, and GNN layers, then mounts Google Drive for easy data access.


In [None]:
# data & geospatial
import numpy as np
import pandas as pd
import geopandas as gpd
import os

# graph handling
import networkx as nx

# scikit-learn utilities
from sklearn.neighbors    import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.metrics       import mean_squared_error, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression
from scipy.stats           import pearsonr

# progress bars
from tqdm.auto import tqdm

# miscellaneous
import random
from collections import Counter

# plotting
import matplotlib.pyplot as plt

# PyTorch & PyTorch-Geometric
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.utils import from_networkx
from torch_geometric.data  import Data
from torch_geometric.nn    import GCNConv, GATConv

# Optuna for hyperparameter search
import optuna
from optuna.samplers import TPESampler
from optuna.pruners  import HyperbandPruner

# Mount Google Drive for data I/O
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 3. Full Hyperparameter Tuning Pipeline

Defines fixed “best” graph-augmentation parameters; loads, preprocesses, and outlier-filters your road-segment data; augments the graph; wraps everything into PyTorch-Geometric `Data` objects; and runs an Optuna study to tune both GCN and GAT models end-to-end.
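
The outlier step compares each segment's NO₂ value with the mean of its 1-hop neighbors and flags residuals that exceed a group-specific multiple of the median absolute deviation (MAD). The sketch below reproduces that logic on a toy path graph using only the standard library. The graph, the values, and the function names are made up for illustration; only the `group_thresh` multipliers are taken from the notebook.

```python
from statistics import median

# Toy path graph 0-1-2-...-9 (hypothetical data, not from the notebook).
n_nodes = 10
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j < n_nodes]
             for i in range(n_nodes)}
no2 = dict(enumerate([20.0, 21.0, 20.0, 21.0, 20.0,
                      80.0, 21.0, 20.0, 21.0, 20.0]))
is_highway = {i: False for i in range(n_nodes)}
group_thresh = {True: 9.0, False: 5.0}  # MAD multipliers, as in the notebook


def neighbor_residuals(adjacency, values):
    """Residual of each node's value against the mean of its 1-hop neighbors."""
    return {
        n: values[n] - sum(values[m] for m in nbrs) / len(nbrs)
        for n, nbrs in adjacency.items() if nbrs
    }


def mad_outliers(adjacency, values, is_highway, group_thresh):
    """Flag nodes whose residual deviates from the median by more than k * MAD."""
    res = neighbor_residuals(adjacency, values)
    med = median(res.values())
    mad = median(abs(r - med) for r in res.values())
    return [n for n, r in res.items()
            if abs(r - med) > group_thresh[is_highway[n]] * mad]


print(mad_outliers(adjacency, no2, is_highway, group_thresh))
```

On this toy input the spike at node 5 is flagged, and so are its neighbors 4 and 6, because their neighbor means are pulled up by the spike. This side effect is one reason a follow-up classification into errors and real extremes is useful.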

In [None]:
# Model hyperparameter tuning for GCN and GAT,
# using the best graph-augmentation parameters found earlier

# Fixed best augmentation parameters
best_graph_params_1 = {
    'top_n': 1000,
    'neighbors': 120,
    'sim_thresh': 0.9888,
    'min_dist': 197,
    'max_dist': 1152,
    'hop_thresh': 5,
    'max_edges': 3000,
    'per_node_cap': 4
}

best_graph_params_2 = {
    'top_n': 1000,
    'neighbors': 180,
    'sim_thresh': 0.9834,
    'min_dist': 167,
    'max_dist': 1583,
    'hop_thresh': 4,
    'max_edges': 4500,
    'per_node_cap': 2
}

# Data prep
fp = '/content/drive/MyDrive/Universiteit Utrecht/Thesis/data/road_network_lufeature.geojson'
gdf = gpd.read_file(fp).to_crs(epsg=28992).reset_index(drop=True)
gdf['centroid'] = gdf.geometry.centroid
coords = np.column_stack([gdf.centroid.x, gdf.centroid.y])

target_col = 'NO2d'
feature_cols = [c for c in gdf.columns if any(s in c for s in ['AGRI','INDUS','NATUR','PORT','RES','TRANS','URBG','WATER','POP','EEA','HHOLD','RDL','TLOA','HLOA','MRDL','TMLOA','HMLOA','TRAF','DINV'])]
feature_matrix = gdf[feature_cols].to_numpy()

# Graph & outliers
G = nx.Graph()
sidx = gdf.sindex
for idx, row in gdf.iterrows():
    G.add_node(idx, **row.drop('geometry').to_dict())
for idx, geom in enumerate(gdf.geometry):
    for j in sidx.intersection(geom.bounds):
        if idx != j and geom.touches(gdf.geometry[j]):
            G.add_edge(idx, j)

gdf['is_highway'] = gdf['TRAFMAJOR'] > 20000
group_thresh = {True: 9.0, False: 5.0}

def detect_outliers(G, gdf):
    nodes = list(G.nodes())
    vals, neigh_means = [], []
    for n in nodes:
        v = gdf.at[n, target_col]
        if pd.isna(v):
            vals.append(np.nan); neigh_means.append(np.nan)
            continue
        sp = nx.single_source_shortest_path_length(G, n, cutoff=1)
        neigh = [m for m in sp if m != n]
        nbr_vals = gdf.loc[neigh, target_col].dropna().values
        vals.append(v)
        neigh_means.append(np.nan if nbr_vals.size == 0 else nbr_vals.mean())
    vals = np.array(vals); neigh_means = np.array(neigh_means)
    valid = ~np.isnan(vals) & ~np.isnan(neigh_means)
    residuals = vals[valid] - neigh_means[valid]
    valid_nodes = np.array(nodes)[valid]
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))
    cutoffs = np.array([
        (group_thresh.get(gdf.at[n, 'is_highway'], 3.0)) * mad for n in valid_nodes
    ])
    is_outlier = np.abs(residuals - med) > cutoffs
    return valid_nodes[is_outlier].tolist()

outliers = detect_outliers(G, gdf)

# ────────────────────────────────────────────────────────────
# Classify outliers: measurement errors vs real extremes
# ────────────────────────────────────────────────────────────
palmes_fp  = '/content/drive/MyDrive/Universiteit Utrecht/Thesis/data/road_palmes_25m.geojson'
palmes_gdf = gpd.read_file(palmes_fp).to_crs(gdf.crs)

hot_thresh = np.percentile(gdf['NO2d'].dropna(), 75)
mask_valid = gdf['NO2d'].notna() & gdf['TRAFNEAR'].notna()
X_tr = gdf.loc[mask_valid, 'TRAFNEAR'].values.reshape(-1,1)
y_tr = gdf.loc[mask_valid, 'NO2d'].values
lr = LinearRegression().fit(X_tr, y_tr)
traffic_rmse = np.sqrt(mean_squared_error(y_tr, lr.predict(X_tr)))

error_segs, real_segs = [], []
for seg in outliers:
    seg_val   = gdf.at[seg, 'NO2d']
    traf_near = gdf.at[seg, 'TRAFNEAR']
    pt        = gdf.geometry[seg]

    # 1) Palmes nearby?
    nearby_pal = palmes_gdf[palmes_gdf.geometry.distance(pt) <= 50]
    if len(nearby_pal):
        pal_val = nearby_pal['mean_annual_palmes_no2'].mean()
        if seg_val > 2 * pal_val:
            error_segs.append(seg)
        else:
            real_segs.append(seg)
        continue

    # 2) Neighbors high?
    nbrs     = list(gdf.sindex.intersection(pt.buffer(200).bounds))
    nbr_vals = gdf.loc[nbrs, 'NO2d'].dropna()
    if len(nbr_vals) and (nbr_vals > hot_thresh).mean() > 0.5:
        real_segs.append(seg)
        continue

    # 3) Traffic prediction
    if not np.isnan(traf_near):
        pred_traf = lr.predict([[traf_near]])[0]
        if abs(seg_val - pred_traf) > 3 * traffic_rmse:
            error_segs.append(seg)
        else:
            real_segs.append(seg)
        continue

    # default
    real_segs.append(seg)

print(f"Classified {len(error_segs)} errors, {len(real_segs)} real extremes.")

# Graph augmentation + Data builder
def augment_grouped_far_knn(G, gdf, groups, coords, feature_matrix, feature_cols,
                            top_n, neighbors, sim_thresh, min_dist, max_dist,
                            hop_thresh, max_edges, per_node_cap,
                            road_id_col="ROAD_FID", suffix='grp_far_knn'):
    road_ids = gdf[road_id_col].to_numpy()
    col_to_idx = {c:i for i,c in enumerate(feature_cols)}
    candidates = set()
    for cols in groups.values():
        if any(c not in col_to_idx for c in cols): continue
        intensity = gdf[cols].sum(axis=1)
        top_idx   = intensity.nlargest(top_n).index.to_numpy()
        if top_idx.size < 2: continue
        idxs = [col_to_idx[c] for c in cols]
        subF = feature_matrix[top_idx][:, idxs]
        subF /= np.linalg.norm(subF, axis=1, keepdims=True).clip(1e-6)
        nbr = NearestNeighbors(n_neighbors=min(neighbors+1, len(top_idx)), metric='cosine', n_jobs=-1).fit(subF)
        dists, nn_idxs = nbr.kneighbors(subF)
        sims = 1 - dists
        for ii, src in enumerate(top_idx):
            close = set(nx.single_source_shortest_path_length(G, int(src), cutoff=hop_thresh))
            for rank, dst_j in enumerate(nn_idxs[ii,1:], start=1):
                if sims[ii, rank] < sim_thresh: break
                dst = top_idx[dst_j]; u, v = sorted((int(src), int(dst)))
                if road_ids[src] == road_ids[dst]: continue
                dxy = np.hypot(*(coords[src] - coords[dst]))
                if dxy < min_dist or dxy > max_dist or dst in close: continue
                candidates.add((u, v))
    final, counts = [], Counter()
    for u, v in random.sample(list(candidates), len(candidates)):
        if counts[u] < per_node_cap and counts[v] < per_node_cap:
            final.append((u, v)); counts[u] += 1; counts[v] += 1
        if len(final) >= max_edges: break
    G2 = G.copy(); G2.add_edges_from(final, feature_sim=suffix)
    return G2, final

def build_data_mask_missing(G, gdf, feature_cols, target_col, outliers=None):
    gdf2 = gdf.reset_index(drop=True)
    G2 = nx.relabel_nodes(G, {old:new for new,old in enumerate(gdf.index)})
    X = StandardScaler().fit_transform(gdf2[feature_cols].values)
    y = gdf2[target_col].values.reshape(-1,1)
    edges = np.array(list(G2.edges())).T
    edge_index = torch.tensor(np.concatenate([edges, edges[::-1]], axis=1), dtype=torch.long)
    data = Data(
        x=torch.tensor(X, dtype=torch.float),
        edge_index=edge_index,
        y=torch.tensor(y, dtype=torch.float)
    )
    valid_idx = np.where(~np.isnan(y.flatten()))[0]
    perm = torch.randperm(len(valid_idx))
    n_train = int(0.8 * len(valid_idx))
    train_idx = valid_idx[perm[:n_train].numpy()]
    test_idx  = valid_idx[perm[n_train:].numpy()]
    train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
    test_mask  = torch.zeros(data.num_nodes, dtype=torch.bool)
    train_mask[train_idx] = True; test_mask[test_idx] = True
    if outliers is not None:
        train_mask[outliers] = False
    data.train_mask = train_mask
    data.test_mask = test_mask
    return data

# Groups used for augmentation
groups = {
    'industrial':        ['INDUS_300','INDUS_1000'],
    'residential':       ['RES_300','RES_1000'],
    'agriculture':       ['AGRI_300','AGRI_1000'],
    'natural':           ['NATUR_300','NATUR_1000'],
    'port':              ['PORT_300','PORT_1000'],
    'urb_built':         ['URBG_300','URBG_1000'],
    'water':             ['WATER_300','WATER_1000'],
    'traffic':           ['TRAFNEAR','TRAFMAJOR'],
    'pop':               ['POP_300','POP_1000'],
    'population_density':['EEA_300','EEA_1000'],
}

# Dynamic model tuning setup
def tune_model(model_cls, graph_params, n_trials=200):
    def objective(trial):
        num_layers = trial.suggest_int('num_layers', 2, 4)
        hidden_units = trial.suggest_categorical('hidden_units', [16, 32, 64])
        dropout = trial.suggest_float('dropout', 0.0, 0.5)
        lr = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
        weight_decay = trial.suggest_float('weight_decay', 1e-5, 1e-3, log=True)
        activation = trial.suggest_categorical('activation', ['relu', 'elu'])

        G_aug, _ = augment_grouped_far_knn(G, gdf, groups, coords, feature_matrix, feature_cols, **graph_params)
        data = build_data_mask_missing(G_aug, gdf, feature_cols, target_col, outliers=error_segs)
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        data = data.to(device)

        model = model_cls(len(feature_cols), hidden_units, 1,
                          num_layers=num_layers, dropout=dropout, activation=activation).to(device)
        opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
        for _ in range(100):
            model.train(); opt.zero_grad()
            out = model(data.x, data.edge_index)
            loss = F.mse_loss(out[data.train_mask], data.y[data.train_mask])
            loss.backward(); opt.step()
        model.eval()
        with torch.no_grad():
            pred = model(data.x, data.edge_index)[data.test_mask]
            true = data.y[data.test_mask]
            return torch.sqrt(F.mse_loss(pred, true)).item()

    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=n_trials)
    return study

# Define final model classes
class GCN(torch.nn.Module):
    def __init__(self, in_c, h_c, out_c, num_layers=2, dropout=0.0, activation='relu'):
        super().__init__()
        acts = {'relu': F.relu, 'elu': F.elu}; self.activation = acts[activation]
        self.dropout = dropout
        self.layers = torch.nn.ModuleList(
            [GCNConv(in_c, h_c)] + [GCNConv(h_c, h_c) for _ in range(num_layers-2)] + [GCNConv(h_c, out_c)])
    def forward(self, x, edge_index):
        for layer in self.layers[:-1]:
            x = F.dropout(self.activation(layer(x, edge_index)), p=self.dropout, training=self.training)
        return self.layers[-1](x, edge_index)

class GAT(torch.nn.Module):
    def __init__(self, in_c, h_c, out_c, num_layers=2, dropout=0.0, activation='elu', heads=2):
        super().__init__()
        acts = {'relu': F.relu, 'elu': F.elu}; self.activation = acts[activation]
        self.dropout = dropout; self.heads = heads
        self.layers = torch.nn.ModuleList()
        self.layers.append(GATConv(in_c, h_c, heads=heads))
        for _ in range(num_layers - 2):
            self.layers.append(GATConv(h_c*heads, h_c, heads=heads))
        self.out = GATConv(h_c*heads, out_c, heads=1, concat=False)
    def forward(self, x, edge_index):
        for layer in self.layers:
            x = F.dropout(self.activation(layer(x, edge_index)), p=self.dropout, training=self.training)
        return self.out(x, edge_index)

# Run all tuning
print("\nTuning GCN on best graph params set 1")
study_gcn1 = tune_model(GCN, best_graph_params_1)
print("Best GCN1 params:", study_gcn1.best_params)

print("\nTuning GAT on best graph params set 2")
study_gat2 = tune_model(lambda *args, **kwargs: GAT(*args, **kwargs, heads=2), best_graph_params_2)
print("Best GAT2 params:", study_gat2.best_params)

[I 2025-05-28 12:01:39,739] A new study created in memory with name: no-name-9c353112-d614-4ef8-a8ea-7676b6165b81


Classified 125 errors, 2047 real extremes.

Tuning GCN on best graph params set 1


[I 2025-05-28 12:02:50,800] Trial 0 finished with value: 10.014951705932617 and parameters: {'num_layers': 4, 'hidden_units': 64, 'dropout': 0.06794594729446835, 'learning_rate': 0.0011098520917628392, 'weight_decay': 0.00010131845793380289, 'activation': 'elu'}. Best is trial 0 with value: 10.014951705932617.
[I 2025-05-28 12:03:34,021] Trial 1 finished with value: 15.754504203796387 and parameters: {'num_layers': 3, 'hidden_units': 64, 'dropout': 0.4605081754398706, 'learning_rate': 0.0004370832660900373, 'weight_decay': 9.552961597986633e-05, 'activation': 'elu'}. Best is trial 0 with value: 10.014951705932617.
[I 2025-05-28 12:04:12,512] Trial 2 finished with value: 27.092082977294922 and parameters: {'num_layers': 4, 'hidden_units': 32, 'dropout': 0.25776143866288703, 'learning_rate': 0.00013549072762894274, 'weight_decay': 2.3338429124208e-05, 'activation': 'relu'}. Best is trial 0 with value: 10.014951705932617.
[I 2025-05-28 12:04:41,604] Trial 3 finished with value: 28.4972686

Best GCN1 params: {'num_layers': 4, 'hidden_units': 64, 'dropout': 0.1507780398347278, 'learning_rate': 0.009136733981799275, 'weight_decay': 4.291139762395118e-05, 'activation': 'relu'}

Tuning GAT on best graph params set 2


[I 2025-05-28 15:38:05,226] Trial 0 finished with value: 29.332439422607422 and parameters: {'num_layers': 2, 'hidden_units': 16, 'dropout': 0.004784694951301849, 'learning_rate': 0.00010931456009924177, 'weight_decay': 0.00013185192665880906, 'activation': 'relu'}. Best is trial 0 with value: 29.332439422607422.
[I 2025-05-28 15:38:49,871] Trial 1 finished with value: 11.73042106628418 and parameters: {'num_layers': 3, 'hidden_units': 16, 'dropout': 0.04667060136148127, 'learning_rate': 0.0014109688804894748, 'weight_decay': 0.000554035744047727, 'activation': 'elu'}. Best is trial 1 with value: 11.73042106628418.
[I 2025-05-28 15:40:06,084] Trial 2 finished with value: 12.8055419921875 and parameters: {'num_layers': 3, 'hidden_units': 32, 'dropout': 0.10937970045735468, 'learning_rate': 0.0005867465409834917, 'weight_decay': 4.560642279760382e-05, 'activation': 'relu'}. Best is trial 1 with value: 11.73042106628418.
[I 2025-05-28 15:41:59,425] Trial 3 finished with value: 8.317344665

Best GAT2 params: {'num_layers': 4, 'hidden_units': 64, 'dropout': 0.024560349448039174, 'learning_rate': 0.009996786387592884, 'weight_decay': 2.345218108111075e-05, 'activation': 'relu'}
