## Environment Setup & Dependencies

Before running any graph-building code, we need to install PyTorch Geometric (PyG) and Optuna.  
This cell:

1. Installs **Optuna** for hyperparameter tuning.  
2. Removes any pre-installed PyG packages to avoid version conflicts.  
3. Installs `torch-scatter`, `torch-sparse`, `torch-cluster` for the current PyTorch version.  
4. Installs the full `torch-geometric` package from the official PyG GitHub.

> **Note:** Make sure your Colab or local environment’s PyTorch version matches the wheels pulled by these commands.
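
The wheel index URL in the install commands is built by plain string substitution, so it can help to print it before installing. A minimal sketch follows; the helper name `pyg_wheel_index` and the version string `2.3.0+cu121` are illustrative assumptions, not values taken from this notebook's environment.

```python
def pyg_wheel_index(torch_version: str) -> str:
    """Build the wheel index URL using the same pattern as the install commands."""
    return f"https://data.pyg.org/whl/torch-{torch_version}.html"


# Hypothetical version string; in the notebook this would be torch.__version__.
example_url = pyg_wheel_index("2.3.0+cu121")
print(example_url)
```

Passing `torch.__version__` instead of the example string gives the page that the `-f` flag points to.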

In [None]:
import torch

!pip install optuna -q
!pip uninstall torch-scatter torch-sparse torch-geometric torch-cluster -y -q
!pip install torch-scatter -f https://data.pyg.org/whl/torch-{torch.__version__}.html -q
!pip install torch-sparse -f https://data.pyg.org/whl/torch-{torch.__version__}.html -q
!pip install torch-cluster -f https://data.pyg.org/whl/torch-{torch.__version__}.html -q
!pip install git+https://github.com/pyg-team/pytorch_geometric.git -q


## 1) Core Imports & Drive Mount

This cell pulls in all the Python libraries used in this notebook:

- **Basic tooling**: `os`, `random`, `pickle`, `numpy`, `pandas`  
- **Spatial & graph**: `geopandas`, `networkx`  
- **Machine learning**: `sklearn` utilities, `torch` & `torch_geometric` for GNNs  
- **Hyperparameter tuning**: `optuna` and its `TPESampler`  
- **Metrics**: `mean_squared_error`  

It also mounts your Google Drive at `/content/drive`, so that raw and processed data can be loaded from your Drive folder.
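
The mount step only works inside Colab. As a hedged sketch for running the notebook elsewhere, the hypothetical helper `resolve_data_root` below falls back to a local folder when `google.colab` cannot be imported. The fallback folder name `data` is an assumption.

```python
from pathlib import Path


def resolve_data_root(local_root: str = "data") -> Path:
    """Mount Drive when running in Colab; otherwise return a local folder."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return Path(local_root)
    drive.mount("/content/drive")
    return Path("/content/drive/MyDrive")


root = resolve_data_root()
print(root)
```

File paths can then be built relative to `root` rather than hard-coding the Drive prefix.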

In [None]:
# ────────────────────────────────────────────────────────────
# 1) Core Imports & Drive Mount
# ────────────────────────────────────────────────────────────
import os
import random
import pickle
import numpy as np
import pandas as pd
import geopandas as gpd
import networkx as nx
import torch
import torch.nn.functional as F
import optuna
from optuna.samplers import TPESampler
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import KFold
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv
from sklearn.metrics import mean_squared_error
from collections import Counter

# mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Code Block Overview

This code block implements the full SCFL graph‐building and tuning pipeline in ten steps:

1. **Load & Prepare Data**  
   - Read 50 m road‐segment GeoJSON from Google Drive  
   - Reproject to EPSG:28992  
   - Compute segment centroids and stack X/Y into `coords`  

2. **Extract Target & Features**  
   - Set `target_col = 'NO2d'` (annual‐mean NO₂)  
   - Build `feature_cols` by filtering GIS buffer columns (traffic, land‐use, population, etc.)  
   - Assemble `feature_matrix` (NumPy) for similarity calculations  

3. **Build Base Graph**  
   - Initialize `G = nx.Graph()`  
   - Add one node per segment (with its metadata)  
   - Add an edge between segments whose geometries touch  

4. **Detect Outliers**  
   - Label highways vs. local roads by `TRAFMAJOR > 20000`  
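   - Define `detect_outliers(G, gdf, ...)` to compute each node’s NO₂ residual against its 1-hop neighbor mean, then flag nodes whose residual deviates from the median residual by more than threshold × MAD (threshold 9.0 for highways, 5.0 otherwise)  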
   - Run it to produce `error_segs` and print the count  

5. **Define SCFL Augmentation Function**  
   - `augment_grouped_far_knn(...)` groups features into semantic domains (traffic, residential, etc.)  
   - Selects top-n anchors per domain, normalizes features, finds nearest neighbors by cosine similarity  
   - Filters candidate pairs by `sim_thresh`, `min_dist`, `max_dist`, and `hop_thresh`, and skips pairs that share the same road ID (`ROAD_FID`)  
   - Samples up to `per_node_cap` edges per node, caps at `max_edges`  
   - Returns augmented graph `G2` and new edge list  

6. **Define Data-Builder Helper**  
   - `build_data_mask_missing(...)` relabels nodes, standardizes features, and builds a `torch_geometric.data.Data` object  
   - Creates an 80/20 train/test mask over non-NaN NO₂ nodes  
   - Excludes flagged `error_segs` from the training mask  

7. **Specify Feature Groups**  
   - Defines the `groups` dictionary mapping domain names (e.g. “traffic”, “residential”) to their feature column lists  

8. **Define GNN Models & CV Helper**  
   - `GCN` and `GAT` classes: three‐layer graph convolution/attention networks with ReLU/ELU  
   - `cross_val_rmse(G_aug, model_cls, fixed_init, fixed_opt)`: builds data, runs 3-fold CV (50 epochs per fold), returns mean RMSE  

9. **Optuna Tuning for GCN**  
   - Fixed init: `in_c=len(feature_cols)`, `h_c=64`, `out_c=1`; optimizer: `lr=0.00914`, `weight_decay=4.29e-05`  
   - `objective_gcn(trial)`: samples SCFL params (`top_n`, `neighbors`, `sim_thresh`, `min_dist`, `max_dist`, `hop_thresh`, `max_edges`, `per_node_cap`), builds `G_aug`, computes CV RMSE, tracks best RMSE & graph  
   - Run 140 trials (80 startup) via `TPESampler`, minimize RMSE  
   - Print best params/RMSE and pickle `best_graph_gcn` to `outputs/G_gcn_augmented.gpickle`  

10. **Optuna Tuning for GAT**  
    - Fixed init: `in_c=len(feature_cols)`, `h_c=16`, `out_c=1`, `heads=2`; optimizer same as GCN  
    - `objective_gat(trial)`: inherits `top_n` from GCN study, samples remaining SCFL params, builds `G_aug`, computes CV RMSE, tracks best RMSE & graph  
    - Run 140 trials (80 startup) via `TPESampler`, minimize RMSE  
    - Print best params/RMSE and pickle `best_graph_gat` to `outputs/G_gat_augmented.gpickle`  
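
The outlier rule in step 4 can be sketched on a toy graph. This is a simplified illustration, assuming a plain adjacency dict and a single threshold `k` instead of the notebook's per-road-type thresholds. The function name `mad_outliers` and the toy values are made up for the example.

```python
from statistics import median


def mad_outliers(adj, values, k=5.0):
    """Flag nodes whose residual from the 1-hop neighbour mean deviates
    from the median residual by more than k * MAD."""
    residuals = {}
    for node, nbrs in adj.items():
        v = values.get(node)
        nbr_vals = [values[m] for m in nbrs if values.get(m) is not None]
        if v is None or not nbr_vals:
            continue  # skip unmeasured nodes and nodes without measured neighbours
        residuals[node] = v - sum(nbr_vals) / len(nbr_vals)
    med = median(residuals.values())
    mad = median(abs(r - med) for r in residuals.values())
    return sorted(n for n, r in residuals.items() if abs(r - med) > k * mad)


# Toy path graph 0-1-2-3-4-5-6 with a spike at node 3 (values are invented).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
values = {0: 20, 1: 21, 2: 22, 3: 60, 4: 22, 5: 21, 6: 20}

print(mad_outliers(adj, values, k=5.0))   # [2, 3, 4]
print(mad_outliers(adj, values, k=20.0))  # [3]
```

With `k=5.0` the two neighbors of the spike are flagged as well, because the spike inflates their neighbor means; a larger `k` isolates the spike itself.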

Finally, both best graphs are saved under `outputs/` for downstream use.
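
A saved graph can be read back with `pickle.load`. The sketch below is a round-trip illustration only: the helper name `load_pickled_graph` is hypothetical, and a plain adjacency dict stands in for the networkx graph so the example runs without extra packages.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickled_graph(path):
    """Read back an object written with pickle.dump (only use on trusted files)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object; in the notebook this would be best_graph_gcn or best_graph_gat.
stand_in = {0: [1], 1: [0, 2], 2: [1]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "G_gcn_augmented.gpickle"
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    restored = load_pickled_graph(path)

print(restored == stand_in)
```

Unpickling executes code embedded in the file, so this should only be used on files you produced yourself.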

In [None]:
# ────────────────────────────────────────────────────────────
# 1) Load & Prepare Data
# ────────────────────────────────────────────────────────────
fp = '/content/drive/MyDrive/Universiteit Utrecht/Thesis/data/road_network_lufeature.geojson'
gdf = gpd.read_file(fp).to_crs(epsg=28992).reset_index(drop=True)
gdf['centroid'] = gdf.geometry.centroid
coords = np.column_stack([gdf.centroid.x, gdf.centroid.y])

target_col = 'NO2d'
feature_cols = [
    c for c in gdf.columns
    if any(s in c for s in ['AGRI','INDUS','NATUR','PORT','RES','TRANS','URBG','WATER',
                             'POP','EEA','HHOLD','RDL','TLOA','HLOA','MRDL','TMLOA','HMLOA',
                             'TRAF','DINV'])
]
feature_matrix = gdf[feature_cols].to_numpy()

# Build base graph
G = nx.Graph()
sidx = gdf.sindex
for idx, row in gdf.iterrows():
    G.add_node(idx, **row.drop('geometry').to_dict())
for idx, geom in enumerate(gdf.geometry):
    for j in sidx.intersection(geom.bounds):
        if idx != j and geom.touches(gdf.geometry[j]):
            G.add_edge(idx, j)

# Detect outliers
gdf['is_highway'] = gdf['TRAFMAJOR'] > 20000
group_thresh = {True: 9.0, False: 5.0}

def detect_outliers(G, gdf, target_col='NO2d', hop=1):
    nodes = list(G.nodes())
    vals, neigh_means = [], []
    for n in nodes:
        v = gdf.at[n, target_col]
        if pd.isna(v):
            vals.append(np.nan); neigh_means.append(np.nan)
            continue
        sp = nx.single_source_shortest_path_length(G, n, cutoff=hop)
        neigh = [m for m in sp if m != n]
        nbr_vals = gdf.loc[neigh, target_col].dropna().values
        vals.append(v)
        neigh_means.append(np.nan if nbr_vals.size == 0 else nbr_vals.mean())
    vals, neigh_means = np.array(vals), np.array(neigh_means)
    valid = ~np.isnan(vals) & ~np.isnan(neigh_means)
    residuals = vals[valid] - neigh_means[valid]
    valid_nodes = np.array(nodes)[valid]
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))
    cutoffs = np.array([group_thresh[gdf.at[n,'is_highway']]*mad for n in valid_nodes])
    return valid_nodes[np.abs(residuals - med) > cutoffs].tolist()

error_segs = detect_outliers(G, gdf)
print(f"⚠️ Detected {len(error_segs)} outlier segments.")

# ────────────────────────────────────────────────────────────
# 2) Augmentation & Data‐builder Helpers
# ────────────────────────────────────────────────────────────
def augment_grouped_far_knn(
    G, gdf, groups, coords, feature_matrix, feature_cols,
    top_n, neighbors, sim_thresh, min_dist, max_dist,
    hop_thresh, max_edges, per_node_cap,
    road_id_col="ROAD_FID", suffix='grp_far_knn'
):
    road_ids = gdf[road_id_col].to_numpy()
    col_to_idx = {c:i for i,c in enumerate(feature_cols)}
    candidates = set()
    for cols in groups.values():
        if any(c not in col_to_idx for c in cols):
            continue
        intensity = gdf[cols].sum(axis=1)
        top_idx = intensity.nlargest(top_n).index.to_numpy()
        if top_idx.size < 2:
            continue
        idxs = [col_to_idx[c] for c in cols]
        subF = feature_matrix[top_idx][:, idxs]
        subF /= np.linalg.norm(subF, axis=1, keepdims=True).clip(1e-6)
        nbr = NearestNeighbors(
            n_neighbors=min(neighbors+1, len(top_idx)),
            metric='cosine', n_jobs=-1
        ).fit(subF)
        dists, nn_idxs = nbr.kneighbors(subF)
        sims = 1 - dists
        for ii, src in enumerate(top_idx):
            close = set(nx.single_source_shortest_path_length(G, int(src), cutoff=hop_thresh))
            for rank, dst_j in enumerate(nn_idxs[ii,1:], start=1):
                if sims[ii, rank] < sim_thresh:
                    break
                dst = top_idx[dst_j]
                u,v = sorted((int(src), int(dst)))
                if road_ids[src] == road_ids[dst]:
                    continue
                dxy = np.hypot(*(coords[src] - coords[dst]))
                if dxy < min_dist or dxy > max_dist or dst in close:
                    continue
                candidates.add((u,v))
    final, counts = [], Counter()
    for u,v in random.sample(list(candidates), len(candidates)):
        if counts[u] < per_node_cap and counts[v] < per_node_cap:
            final.append((u,v))
            counts[u] += 1; counts[v] += 1
        if len(final) >= max_edges:
            break
    G2 = G.copy()
    G2.add_edges_from(final, feature_sim=suffix)
    return G2, final

def build_data_mask_missing(G, gdf, feature_cols, target_col, outliers=None):
    gdf2 = gdf.reset_index(drop=True)
    G2 = nx.relabel_nodes(G, {old:new for new,old in enumerate(gdf.index)})
    X = StandardScaler().fit_transform(gdf2[feature_cols])
    y = gdf2[target_col].values.reshape(-1,1)
    edges = np.array(list(G2.edges())).T
    edge_index = torch.tensor(
        np.concatenate([edges, edges[::-1]], axis=1),
        dtype=torch.long
    )
    data = Data(
        x=torch.tensor(X, dtype=torch.float),
        edge_index=edge_index,
        y=torch.tensor(y, dtype=torch.float)
    )
    valid_idx = np.where(~np.isnan(y.flatten()))[0]
    perm = torch.randperm(len(valid_idx))
    n_train = int(0.8 * len(valid_idx))
    train_idx = valid_idx[perm[:n_train].numpy()]
    test_idx  = valid_idx[perm[n_train:].numpy()]
    train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
    test_mask  = torch.zeros(data.num_nodes, dtype=torch.bool)
    train_mask[train_idx] = True
    test_mask[test_idx] = True
    if outliers is not None:
        train_mask[outliers] = False
    data.train_mask, data.test_mask = train_mask, test_mask
    return data

groups = {
  'industrial':['INDUS_300','INDUS_1000'],
  'residential':['RES_300','RES_1000'],
  'agriculture':['AGRI_300','AGRI_1000'],
  'natural':['NATUR_300','NATUR_1000'],
  'port':['PORT_300','PORT_1000'],
  'urb_built':['URBG_300','URBG_1000'],
  'water':['WATER_300','WATER_1000'],
  'traffic':['TRAFNEAR','TRAFMAJOR'],
  'pop':['POP_300','POP_1000'],
  'population_density':['EEA_300','EEA_1000']
}

# ────────────────────────────────────────────────────────────
# 3) GNN Definitions & 3-Fold CV helper
# ────────────────────────────────────────────────────────────
class GCN(torch.nn.Module):
    def __init__(self, in_c, h_c, out_c):
        super().__init__()
        self.conv1 = GCNConv(in_c,   h_c)
        self.conv2 = GCNConv(h_c,   h_c)
        self.conv3 = GCNConv(h_c, out_c)
    def forward(self, x, e):
        x = F.relu(self.conv1(x,e))
        x = F.relu(self.conv2(x,e))
        return F.relu(self.conv3(x,e))

class GAT(torch.nn.Module):
    def __init__(self, in_c, h_c, out_c, heads=2):
        super().__init__()
        self.g1 = GATConv(in_c,      h_c,   heads=heads)
        self.g2 = GATConv(h_c*heads, h_c,   heads=heads)
        self.g3 = GATConv(h_c*heads, out_c, heads=1, concat=False)
    def forward(self, x, e):
        x = F.elu(self.g1(x,e))
        x = F.elu(self.g2(x,e))
        return self.g3(x,e)

def cross_val_rmse(G_aug, model_cls, fixed_init, fixed_opt):
    data = build_data_mask_missing(G_aug, gdf, feature_cols, target_col, outliers=error_segs)
    y = data.y.numpy().flatten()
    valid = np.where(~np.isnan(y))[0]
    kf = KFold(n_splits=3, shuffle=True, random_state=0)
    rmses = []
    for train_idx, test_idx in kf.split(valid):
        tm = torch.zeros(data.num_nodes, dtype=torch.bool)
        te = torch.zeros(data.num_nodes, dtype=torch.bool)
        tm[valid[train_idx]], te[valid[test_idx]] = True, True
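        # These per-fold masks replace the ones set in build_data_mask_missing,
        # so the outlier exclusion applied there does not carry over to CV training.
        data.train_mask, data.test_mask = tm, te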
        dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        model = model_cls(**fixed_init).to(dev)
        opt   = torch.optim.Adam(model.parameters(), **fixed_opt)
        for _ in range(50):
            model.train(); opt.zero_grad()
            out = model(data.x.to(dev), data.edge_index.to(dev))
            loss = F.mse_loss(out[data.train_mask], data.y[data.train_mask].to(dev))
            loss.backward(); opt.step()
        model.eval()
        with torch.no_grad():
            p = model(data.x.to(dev), data.edge_index.to(dev))[data.test_mask].cpu().numpy()
            t = data.y[data.test_mask].cpu().numpy()
        rmses.append(np.sqrt(mean_squared_error(t, p)))
    return float(np.mean(rmses))

# ────────────────────────────────────────────────────────────
# 4) Optuna for GCN (80 startup, 140 trials)
# ────────────────────────────────────────────────────────────
best_rmse_gcn = float('inf')
best_graph_gcn = None

fixed_gcn_init  = {'in_c': len(feature_cols), 'h_c': 64, 'out_c': 1}
fixed_gcn_optim = {'lr': 0.009136733981799275, 'weight_decay': 4.291139762395118e-05}

def objective_gcn(trial):
    global best_rmse_gcn, best_graph_gcn
    random.seed(trial.number); np.random.seed(trial.number)
    aug = {
        'top_n':        trial.suggest_int('top_n',        500,1500,step=500),
        'neighbors':    trial.suggest_int('neighbors',    10,200,  step=10),
        'sim_thresh':   trial.suggest_float('sim_thresh', 0.95,0.9999),
        'min_dist':     trial.suggest_int('min_dist',     50,500),
        'max_dist':     trial.suggest_int('max_dist',     500,5000,log=True),
        'hop_thresh':   trial.suggest_int('hop_thresh',   1,5),
        'max_edges':    trial.suggest_int('max_edges',    500,5000,step=500),
        'per_node_cap': trial.suggest_int('per_node_cap', 1,10)
    }
    G_aug,_ = augment_grouped_far_knn(
        G, gdf, groups, coords,
        feature_matrix, feature_cols,
        **aug, suffix='gcn_aug'
    )
    rmse = cross_val_rmse(G_aug, GCN, fixed_gcn_init, fixed_gcn_optim)
    if rmse < best_rmse_gcn:
        best_rmse_gcn, best_graph_gcn = rmse, G_aug.copy()
    return rmse

study_gcn = optuna.create_study(
    direction='minimize',
    sampler=TPESampler(n_startup_trials=80)
)
study_gcn.optimize(objective_gcn, n_trials=140)

print("🔍 GCN best params:", study_gcn.best_params)
print("🔍 GCN best RMSE :", best_rmse_gcn)

os.makedirs('outputs', exist_ok=True)
with open('outputs/G_gcn_augmented.gpickle','wb') as f:
    pickle.dump(best_graph_gcn, f)

# ────────────────────────────────────────────────────────────
# 5) Optuna for GAT (reuse GCN's top_n)
# ────────────────────────────────────────────────────────────
best_rmse_gat = float('inf')
best_graph_gat = None

fixed_gat_init  = {'in_c': len(feature_cols), 'h_c': 16, 'out_c': 1, 'heads': 2}
fixed_gat_optim = fixed_gcn_optim

def objective_gat(trial):
    global best_rmse_gat, best_graph_gat
    random.seed(trial.number); np.random.seed(trial.number)
    aug = dict(study_gcn.best_params)
    aug.update({
        'neighbors':    trial.suggest_int('neighbors',    10,200,step=10),
        'sim_thresh':   trial.suggest_float('sim_thresh', 0.95,0.9999),
        'min_dist':     trial.suggest_int('min_dist',     50,500),
        'max_dist':     trial.suggest_int('max_dist',     500,5000,log=True),
        'hop_thresh':   trial.suggest_int('hop_thresh',   1,5),
        'max_edges':    trial.suggest_int('max_edges',    500,5000,step=500),
        'per_node_cap': trial.suggest_int('per_node_cap', 1,10)
    })
    G_aug,_ = augment_grouped_far_knn(
        G, gdf, groups, coords,
        feature_matrix, feature_cols,
        **aug, suffix='gat_aug'
    )
    rmse = cross_val_rmse(G_aug, GAT, fixed_gat_init, fixed_gat_optim)
    if rmse < best_rmse_gat:
        best_rmse_gat, best_graph_gat = rmse, G_aug.copy()
    return rmse

study_gat = optuna.create_study(
    direction='minimize',
    sampler=TPESampler(n_startup_trials=80)
)
study_gat.optimize(objective_gat, n_trials=140)

print("🔍 GAT best params:", study_gat.best_params)
print("🔍 GAT best RMSE :", best_rmse_gat)

with open('outputs/G_gat_augmented.gpickle','wb') as f:
    pickle.dump(best_graph_gat, f)

print("✅ All done – best graphs saved under outputs/")

[I 2025-06-25 09:27:16,826] A new study created in memory with name: no-name-3b100393-51d7-469f-a550-5a7109c65f56


⚠️ Detected 2172 outlier segments.


[I 2025-06-25 09:27:26,117] Trial 0 finished with value: 9.352686432673037 and parameters: {'top_n': 1000, 'neighbors': 40, 'sim_thresh': 0.9650662615265146, 'min_dist': 280, 'max_dist': 597, 'hop_thresh': 2, 'max_edges': 3000, 'per_node_cap': 6}. Best is trial 0 with value: 9.352686432673037.
[I 2025-06-25 09:27:35,939] Trial 1 finished with value: 9.378615261819787 and parameters: {'top_n': 1500, 'neighbors': 90, 'sim_thresh': 0.9715342842733867, 'min_dist': 54, 'max_dist': 590, 'hop_thresh': 4, 'max_edges': 5000, 'per_node_cap': 5}. Best is trial 0 with value: 9.352686432673037.
[I 2025-06-25 09:27:43,228] Trial 2 finished with value: 9.414966863575286 and parameters: {'top_n': 1500, 'neighbors': 30, 'sim_thresh': 0.963277739615809, 'min_dist': 64, 'max_dist': 2587, 'hop_thresh': 1, 'max_edges': 3500, 'per_node_cap': 4}. Best is trial 0 with value: 9.352686432673037.
[I 2025-06-25 09:27:59,180] Trial 3 finished with value: 9.349824060259325 and parameters: {'top_n': 1500, 'neighbors

🔍 GCN best params: {'top_n': 1500, 'neighbors': 90, 'sim_thresh': 0.9994085326942802, 'min_dist': 113, 'max_dist': 3430, 'hop_thresh': 3, 'max_edges': 500, 'per_node_cap': 8}
🔍 GCN best RMSE : 9.220642745613558


[I 2025-06-25 09:52:04,083] A new study created in memory with name: no-name-142de6ae-58a8-4ead-9254-8be218836b90
[I 2025-06-25 09:52:12,947] Trial 0 finished with value: 9.63631166702234 and parameters: {'neighbors': 20, 'sim_thresh': 0.9547961436429523, 'min_dist': 342, 'max_dist': 3263, 'hop_thresh': 4, 'max_edges': 2000, 'per_node_cap': 4}. Best is trial 0 with value: 9.63631166702234.
[I 2025-06-25 09:52:30,693] Trial 1 finished with value: 9.64065273154508 and parameters: {'neighbors': 200, 'sim_thresh': 0.9973481986980512, 'min_dist': 96, 'max_dist': 1253, 'hop_thresh': 2, 'max_edges': 1500, 'per_node_cap': 7}. Best is trial 0 with value: 9.63631166702234.
[I 2025-06-25 09:52:38,321] Trial 2 finished with value: 9.645546111073735 and parameters: {'neighbors': 30, 'sim_thresh': 0.9550878698143159, 'min_dist': 176, 'max_dist': 534, 'hop_thresh': 3, 'max_edges': 2500, 'per_node_cap': 2}. Best is trial 0 with value: 9.63631166702234.
[I 2025-06-25 09:52:55,735] Trial 3 finished with

🔍 GAT best params: {'neighbors': 180, 'sim_thresh': 0.9929976273907586, 'min_dist': 263, 'max_dist': 726, 'hop_thresh': 2, 'max_edges': 500, 'per_node_cap': 5}
🔍 GAT best RMSE : 9.39702667166012
✅ All done – best graphs saved under outputs/
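
The tuned `sim_thresh`, `min_dist`, and `max_dist` values reported above act as filters on candidate edge pairs. A simplified sketch of that filter for a single pair follows. It leaves out the hop-distance and same-road checks that `augment_grouped_far_knn` also applies, and the helper names and example numbers are assumptions for illustration.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two feature vectors (guarded against zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, 1e-12)


def keep_pair(feat_u, feat_v, xy_u, xy_v, sim_thresh, min_dist, max_dist):
    """True if the pair is similar enough in feature space and the centroids
    lie within the [min_dist, max_dist] window (same units as the coordinates)."""
    sim = cosine_sim(feat_u, feat_v)
    dist = math.hypot(xy_u[0] - xy_v[0], xy_u[1] - xy_v[1])
    return sim >= sim_thresh and min_dist <= dist <= max_dist


# Two segments with proportional features whose centroids are 500 units apart.
u_feat, v_feat = [1.0, 2.0], [2.0, 4.0]
u_xy, v_xy = (0.0, 0.0), (300.0, 400.0)

print(keep_pair(u_feat, v_feat, u_xy, v_xy, 0.99, 100, 1000))  # True
print(keep_pair(u_feat, v_feat, u_xy, v_xy, 0.99, 100, 400))   # False
```

A tight `max_dist` such as the one in the GAT result rejects the second call's kind of pair purely on distance, regardless of how similar the features are.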
