# Exploring the Impacts of Architecture and Scale on GNN Performance on Relational Data
By: Joseph Guman, Atindra Jha, and Christopher Pondoc

## Introduction
Welcome back to Relbench! In this tutorial, we'll dive a bit deeper into the benchmark + Relational Deep Learning and explore several choices around architecture, scale, and generalizability. In particular, we'll look to answer the following questions:

1. Can we train our Relational Deep Learning on one entity classification task and expect strong zero-shot performance on another entity classification task? What happens if we finetune the model?
2. How does our choice of using embedding models to generate expressive node features impact our performance on node classification tasks?
3. How can we alter and/or extend the architecture of our existing Relational Deep Learning model to improve performance on different tasks?

This notebook already assumes you've looked through the tutorials on [loading in data](https://github.com/snap-stanford/relbench/blob/main/tutorials/load_data.ipynb) and [training a model](https://github.com/snap-stanford/relbench/blob/main/tutorials/train_model.ipynb), as our walkthrough uses those guides as a launchpad to explore deeper questions. If you haven't had a chance to look through those notebooks, we suggest starting there first.

With all that being said, let's get started!

## Question 1: Can we generalize?
Let's take a look at our first question, which involves looking at whether our Relational Deep Learning model can generalize to other tasks with/without finetuning.

Let's first start by looking setting up Relbench. As with the other tutorials, we're taking a look at the `rel-f1` dataset and focusing on node classification tasks. We'll begin by training a model on the `driver-dnf` task, which predicts whether a driver will not finish a race in the next month.

In [1]:
from src.tasks.tasks import initialize_task, db_to_graph
import torch
from torch.nn import BCEWithLogitsLoss
from torch_geometric.seed import seed_everything

# Set up dataset and task, define metrics and loss
dataset, task, train_table, val_table, test_table = initialize_task(
    "rel-f1", "driver-dnf"
)
loss_fn = BCEWithLogitsLoss()

# Set up device
seed_everything(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

  from .autonotebook import tqdm as notebook_tqdm


We can then preprocess all of our Relbench data.

In [2]:
import os
from relbench.modeling.graph import make_pkey_fkey_graph
from torch_frame.config.text_embedder import TextEmbedderConfig
from src.embeddings.glove import GloveTextEmbedding

# Preprocess the database data and set up our text embedder
db, col_to_stype_dict = db_to_graph(dataset)
text_embedder_cfg = TextEmbedderConfig(
    text_embedder=GloveTextEmbedding(device=device), batch_size=128
)

# Load in data used to train model
root_dir = "./data"
data, col_stats_dict = make_pkey_fkey_graph(
    db,
    col_to_stype_dict=col_to_stype_dict,
    text_embedder_cfg=text_embedder_cfg,
    cache_dir=os.path.join(root_dir, f"rel-f1_materialized_cache"),
)

Loading Database object from /home/cpondoc/.cache/relbench/rel-f1/db...
Done in 0.02 seconds.


  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)


Next, let's load in the data and have our model set up.

In [3]:
from src.models.loader import get_loader
from src.models.rdl import RDLModel

# Set up data loader and model
loader_dict, entity_table = get_loader(train_table, val_table, test_table, task, data)
model = RDLModel(
    data=data,
    col_stats_dict=col_stats_dict,
    num_layers=2,
    channels=128,
    out_channels=1,
    aggr="sum",
    norm="batch_norm",
).to(device)

# if you try out different RelBench tasks you will need to change these
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
epochs = 10

Finalize, let's initialize our training run, and evaluate our model!

In [4]:
from src.models.training import eval_model, training_run

# Get model after a training run
state_dict = training_run(
    model, device, optimizer, task, loader_dict, val_table, loss_fn, entity_table
)
model.load_state_dict(state_dict)

# Evaluate on val and test set
eval_model(model, loader_dict, "val", task, device, val_table)
eval_model(model, loader_dict, "test", task, device, None)

100%|██████████| 23/23 [00:02<00:00,  8.16it/s]


Epoch: 01, Train loss: 0.37055541320443, Val metrics: {'average_precision': np.float64(0.8444772957169963), 'accuracy': 0.7791519434628975, 'f1': np.float64(0.8758689175769613), 'roc_auc': np.float64(0.6132607709750567)}


100%|██████████| 23/23 [00:02<00:00,  9.09it/s]


Epoch: 02, Train loss: 0.34091723174308564, Val metrics: {'average_precision': np.float64(0.8770578216363556), 'accuracy': 0.6802120141342756, 'f1': np.float64(0.7968574635241302), 'roc_auc': np.float64(0.6600272108843537)}


100%|██████████| 23/23 [00:02<00:00,  9.13it/s]


Epoch: 03, Train loss: 0.31163817321572, Val metrics: {'average_precision': np.float64(0.8869699922190497), 'accuracy': 0.7102473498233216, 'f1': np.float64(0.8244111349036403), 'roc_auc': np.float64(0.6658684807256235)}


100%|██████████| 23/23 [00:02<00:00,  9.13it/s]


Epoch: 04, Train loss: 0.30985885829284515, Val metrics: {'average_precision': np.float64(0.8864504621890734), 'accuracy': 0.7332155477031802, 'f1': np.float64(0.8405491024287223), 'roc_auc': np.float64(0.668172335600907)}


100%|██████████| 23/23 [00:02<00:00,  9.14it/s]


Epoch: 05, Train loss: 0.3080302636223894, Val metrics: {'average_precision': np.float64(0.8866194986803377), 'accuracy': 0.6501766784452296, 'f1': np.float64(0.7525), 'roc_auc': np.float64(0.6612426303854875)}


100%|██████████| 23/23 [00:02<00:00,  9.12it/s]


Epoch: 06, Train loss: 0.3044538261825432, Val metrics: {'average_precision': np.float64(0.8908993442679494), 'accuracy': 0.6236749116607774, 'f1': np.float64(0.7178807947019867), 'roc_auc': np.float64(0.6761541950113379)}


100%|██████████| 23/23 [00:02<00:00,  9.19it/s]


Epoch: 07, Train loss: 0.2984335782332358, Val metrics: {'average_precision': np.float64(0.8952525511929295), 'accuracy': 0.7084805653710248, 'f1': np.float64(0.8127128263337117), 'roc_auc': np.float64(0.6868027210884354)}


100%|██████████| 23/23 [00:02<00:00,  9.11it/s]


Epoch: 08, Train loss: 0.29515692535384636, Val metrics: {'average_precision': np.float64(0.8951346352196462), 'accuracy': 0.7579505300353356, 'f1': np.float64(0.849615806805708), 'roc_auc': np.float64(0.6960544217687075)}


100%|██████████| 23/23 [00:02<00:00,  9.07it/s]


Epoch: 09, Train loss: 0.28932201302905053, Val metrics: {'average_precision': np.float64(0.8969770863752542), 'accuracy': 0.7367491166077739, 'f1': np.float64(0.8289322617680827), 'roc_auc': np.float64(0.7071020408163264)}


100%|██████████| 23/23 [00:02<00:00,  9.01it/s]


Epoch: 10, Train loss: 0.2824940806036503, Val metrics: {'average_precision': np.float64(0.9012322189618995), 'accuracy': 0.7685512367491166, 'f1': np.float64(0.8542825361512792), 'roc_auc': np.float64(0.7172063492063492)}
Best val metrics: {'average_precision': np.float64(0.9010571403282265), 'accuracy': 0.7685512367491166, 'f1': np.float64(0.8542825361512792), 'roc_auc': np.float64(0.716734693877551)}
Best test metrics: {'average_precision': np.float64(0.832616785573719), 'accuracy': 0.7165242165242165, 'f1': np.float64(0.8145386766076421), 'roc_auc': np.float64(0.7156492460840287)}


As we can see, we are able to roughly replicate the results from the [core Relbench paper](https://huggingface.co/spaces/relbench/leaderboard). However, do the results generalize? To do so, let's load in the data for the other entity classification task within `rel-f1` -- `driver-top3` -- and see how we do. 

In [5]:
# Reuse functions to set up `driver-top3 task`
dataset, task, train_table, val_table, test_table = initialize_task(
    "rel-f1", "driver-top3"
)
db, col_to_stype_dict = db_to_graph(dataset)
data, col_stats_dict = make_pkey_fkey_graph(
    db,
    col_to_stype_dict=col_to_stype_dict,
    text_embedder_cfg=text_embedder_cfg,
    cache_dir=os.path.join(root_dir, f"rel-f1_materialized_cache"),
)

loader_dict, entity_table = get_loader(train_table, val_table, test_table, task, data)
model = RDLModel(
    data=data,
    col_stats_dict=col_stats_dict,
    num_layers=2,
    channels=128,
    out_channels=1,
    aggr="sum",
    norm="batch_norm",
).to(device)
model.load_state_dict(state_dict)
eval_model(model, loader_dict, "test", task, device, None)

  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)


Best test metrics: {'average_precision': np.float64(0.11076771178438702), 'accuracy': 0.22176308539944903, 'f1': np.float64(0.2206896551724138), 'roc_auc': np.float64(0.21797920150501673)}


Unfortunately, trying out our model zero-shot does not yield amazing results. However, what happens if we use this model as a starting point for finetuning on the task? Let's experiment on fine-tuning this model with fewer epochs on the `driver-top3` task and checking its performance.

In [6]:
# Get model after a training run
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
state_dict = training_run(
    model,
    device,
    optimizer,
    task,
    loader_dict,
    val_table,
    loss_fn,
    entity_table,
    epochs=5,
    state_dict=state_dict,
)
model.load_state_dict(state_dict)

# Evaluate on val and test set
eval_model(model, loader_dict, "val", task, device, val_table)
eval_model(model, loader_dict, "test", task, device, None)

100%|██████████| 3/3 [00:00<00:00,  7.35it/s]


Epoch: 01, Train loss: 1.6957652979397544, Val metrics: {'average_precision': np.float64(0.15162000257737335), 'accuracy': 0.7857142857142857, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.37188009532171074)}


100%|██████████| 3/3 [00:00<00:00,  7.64it/s]


Epoch: 02, Train loss: 0.5086759428403507, Val metrics: {'average_precision': np.float64(0.16236274838638717), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.41851964666463604)}


100%|██████████| 3/3 [00:00<00:00,  7.62it/s]


Epoch: 03, Train loss: 0.47670420909579914, Val metrics: {'average_precision': np.float64(0.19884117072961685), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.5449821719732669)}


100%|██████████| 3/3 [00:00<00:00,  7.46it/s]


Epoch: 04, Train loss: 0.47285433696185053, Val metrics: {'average_precision': np.float64(0.2379809358928051), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6302521008403362)}


100%|██████████| 3/3 [00:00<00:00,  7.58it/s]


Epoch: 05, Train loss: 0.45695781685736825, Val metrics: {'average_precision': np.float64(0.23609459043443015), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6117432047445844)}
Best val metrics: {'average_precision': np.float64(0.23799640103799408), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6303058536847574)}
Best test metrics: {'average_precision': np.float64(0.14348596749935752), 'accuracy': 0.8236914600550964, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.42791074414715724)}


Nice! It looks like after we finetune even after just one epoch. we're able to practically replicate the Relbench results. Finally, let's compare this approach to simply training on the task from scratch.

In [7]:
# Define a new model, don't load in old weights.
base_model = RDLModel(
    data=data,
    col_stats_dict=col_stats_dict,
    num_layers=2,
    channels=128,
    out_channels=1,
    aggr="sum",
    norm="batch_norm",
).to(device)
base_optimizer = torch.optim.Adam(base_model.parameters(), lr=0.005)
base_state_dict = training_run(
    base_model,
    device,
    base_optimizer,
    task,
    loader_dict,
    val_table,
    loss_fn,
    entity_table,
    epochs=10,
    state_dict=state_dict,
)
base_model.load_state_dict(base_state_dict)

# Evaluate on val and test set
eval_model(base_model, loader_dict, "val", task, device, val_table)
eval_model(base_model, loader_dict, "test", task, device, None)

100%|██████████| 3/3 [00:00<00:00,  7.54it/s]


Epoch: 01, Train loss: 0.5566302544085607, Val metrics: {'average_precision': np.float64(0.24943764144239966), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6100499901453118)}


100%|██████████| 3/3 [00:00<00:00,  7.54it/s]


Epoch: 02, Train loss: 0.4479153604086293, Val metrics: {'average_precision': np.float64(0.27089084918611356), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.5955815161885649)}


100%|██████████| 3/3 [00:00<00:00,  7.54it/s]


Epoch: 03, Train loss: 0.4455351232601287, Val metrics: {'average_precision': np.float64(0.27087269697561744), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6119402985074627)}


100%|██████████| 3/3 [00:00<00:00,  7.56it/s]


Epoch: 04, Train loss: 0.4430711813662022, Val metrics: {'average_precision': np.float64(0.30132359680667475), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6051316048807581)}


100%|██████████| 3/3 [00:00<00:00,  7.50it/s]


Epoch: 05, Train loss: 0.4411608953653929, Val metrics: {'average_precision': np.float64(0.30049981481987753), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6059737327766929)}


100%|██████████| 3/3 [00:00<00:00,  7.60it/s]


Epoch: 06, Train loss: 0.4364322723720308, Val metrics: {'average_precision': np.float64(0.27369696964151424), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6059020623174644)}


100%|██████████| 3/3 [00:00<00:00,  7.52it/s]


Epoch: 07, Train loss: 0.4313462432983269, Val metrics: {'average_precision': np.float64(0.2761524694266338), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6089301392198672)}


100%|██████████| 3/3 [00:00<00:00,  6.88it/s]


Epoch: 08, Train loss: 0.4214343483018064, Val metrics: {'average_precision': np.float64(0.24169877695992698), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6213470462811991)}


100%|██████████| 3/3 [00:00<00:00,  7.58it/s]


Epoch: 09, Train loss: 0.4054367639845773, Val metrics: {'average_precision': np.float64(0.31272663451023), 'accuracy': 0.7976190476190477, 'f1': np.float64(0.0), 'roc_auc': np.float64(0.6301266775366863)}


100%|██████████| 3/3 [00:00<00:00,  7.55it/s]


Epoch: 10, Train loss: 0.37522076117666403, Val metrics: {'average_precision': np.float64(0.3136183362728637), 'accuracy': 0.8061224489795918, 'f1': np.float64(0.16176470588235295), 'roc_auc': np.float64(0.6341402232534805)}
Best val metrics: {'average_precision': np.float64(0.3136183362728637), 'accuracy': 0.8078231292517006, 'f1': np.float64(0.16296296296296298), 'roc_auc': np.float64(0.6341402232534805)}
Best test metrics: {'average_precision': np.float64(0.38240391056125356), 'accuracy': 0.7920110192837465, 'f1': np.float64(0.03821656050955414), 'roc_auc': np.float64(0.8257995401337793)}


Ultimately, we don't see much of a difference from starting from random weights to using a model pre-initialized from another entity classification task.

## Question 2: Different expressiveness of node features?
Next, let's take a look at using different embedding models for node features.

The embedding models are used to help turn the tabular data into usable node features. In the Relbench tutorial, the team uses GloVe embeddings, but the paper also mentions utilizing BERT-style embeddings. In traditional NLP, BERT embeddings are much more popular given that they are contextual -- the vector representation depends on the surrounding words, compared to static embeddings used by GloVe -- and can handle words outside of their vocabulary. In addition, their embedding size is $768$ compared to GloVe's $300$, which introduces an opportunity for more expressiveness.

As an investigation, let's switch out our GloVe embedding model with BERT and retrain a new model from scratch on the `driver-dnf` task.

In [8]:
from src.embeddings.bert import BertTextEmbedding

dataset, task, train_table, val_table, test_table = initialize_task(
    "rel-f1", "driver-dnf"
)

# Preprocess the database data and set up our text embedder
db, col_to_stype_dict = db_to_graph(dataset)
text_embedder_cfg = TextEmbedderConfig(
    text_embedder=BertTextEmbedding(device=device), batch_size=128
)

# Load in data used to train model
data, col_stats_dict = make_pkey_fkey_graph(
    db,
    col_to_stype_dict=col_to_stype_dict,
    text_embedder_cfg=text_embedder_cfg,
    cache_dir=os.path.join(root_dir, f"rel-f1_materialized_cache"),
)
loader_dict, entity_table = get_loader(train_table, val_table, test_table, task, data)

# Initialize new, untrained model using BERT embeddings
bert_model = RDLModel(
    data=data,
    col_stats_dict=col_stats_dict,
    num_layers=2,
    channels=128,
    out_channels=1,
    aggr="sum",
    norm="batch_norm",
).to(device)
bert_optimizer = torch.optim.Adam(bert_model.parameters(), lr=0.005)
bert_state_dict = training_run(
    bert_model,
    device,
    bert_optimizer,
    task,
    loader_dict,
    val_table,
    loss_fn,
    entity_table,
    epochs=10,
    state_dict=state_dict,
)
bert_model.load_state_dict(bert_state_dict)

# Evaluate on val and test set
eval_model(bert_model, loader_dict, "val", task, device, val_table)
eval_model(bert_model, loader_dict, "test", task, device, None)

  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
  tf_dict, col_stats = torch.load(path)
100%|██████████| 23/23 [00:02<00:00,  8.88it/s]


Epoch: 01, Train loss: 0.38314242033533874, Val metrics: {'average_precision': np.float64(0.827504479911991), 'accuracy': 0.7791519434628975, 'f1': np.float64(0.8758689175769613), 'roc_auc': np.float64(0.5801179138321995)}


100%|██████████| 23/23 [00:02<00:00,  8.92it/s]


Epoch: 02, Train loss: 0.35447051484947734, Val metrics: {'average_precision': np.float64(0.8537097626599583), 'accuracy': 0.7791519434628975, 'f1': np.float64(0.8758689175769613), 'roc_auc': np.float64(0.6382040816326531)}


100%|██████████| 23/23 [00:02<00:00,  8.94it/s]


Epoch: 03, Train loss: 0.3318094138068119, Val metrics: {'average_precision': np.float64(0.885627601042355), 'accuracy': 0.7703180212014135, 'f1': np.float64(0.868421052631579), 'roc_auc': np.float64(0.6843900226757369)}


100%|██████████| 23/23 [00:02<00:00,  8.99it/s]


Epoch: 04, Train loss: 0.3180197037527265, Val metrics: {'average_precision': np.float64(0.8909164634008148), 'accuracy': 0.773851590106007, 'f1': np.float64(0.8717434869739479), 'roc_auc': np.float64(0.6725260770975057)}


100%|██████████| 23/23 [00:02<00:00,  8.93it/s]


Epoch: 05, Train loss: 0.31070933652983446, Val metrics: {'average_precision': np.float64(0.8958781058503079), 'accuracy': 0.6943462897526502, 'f1': np.float64(0.7962308598351001), 'roc_auc': np.float64(0.6887619047619048)}


100%|██████████| 23/23 [00:02<00:00,  8.96it/s]


Epoch: 06, Train loss: 0.30829004632837465, Val metrics: {'average_precision': np.float64(0.8978716984479074), 'accuracy': 0.7579505300353356, 'f1': np.float64(0.8586171310629515), 'roc_auc': np.float64(0.6953832199546485)}


100%|██████████| 23/23 [00:02<00:00,  8.98it/s]


Epoch: 07, Train loss: 0.3102427912723714, Val metrics: {'average_precision': np.float64(0.8904171744850806), 'accuracy': 0.6925795053003534, 'f1': np.float64(0.7967289719626168), 'roc_auc': np.float64(0.6832834467120181)}


100%|██████████| 23/23 [00:02<00:00,  8.81it/s]


Epoch: 08, Train loss: 0.30448004862729594, Val metrics: {'average_precision': np.float64(0.8935382263041846), 'accuracy': 0.7455830388692579, 'f1': np.float64(0.8441558441558441), 'roc_auc': np.float64(0.6918367346938776)}


100%|██████████| 23/23 [00:02<00:00,  8.94it/s]


Epoch: 09, Train loss: 0.30562818383901, Val metrics: {'average_precision': np.float64(0.8961956614695582), 'accuracy': 0.7473498233215548, 'f1': np.float64(0.8486772486772487), 'roc_auc': np.float64(0.6951473922902494)}


100%|██████████| 23/23 [00:02<00:00,  8.88it/s]


Epoch: 10, Train loss: 0.2950989901235985, Val metrics: {'average_precision': np.float64(0.8984410320980137), 'accuracy': 0.6872791519434629, 'f1': np.float64(0.7920094007050529), 'roc_auc': np.float64(0.6895056689342404)}
Best val metrics: {'average_precision': np.float64(0.8978319810638167), 'accuracy': 0.7579505300353356, 'f1': np.float64(0.8586171310629515), 'roc_auc': np.float64(0.6953106575963719)}
Best test metrics: {'average_precision': np.float64(0.8548379578973325), 'accuracy': 0.7264957264957265, 'f1': np.float64(0.8254545454545454), 'roc_auc': np.float64(0.7167862196847705)}


We ultimately don't see that drastic of a difference between using BERT embeddings and GloVe embeddings. Despite being trained differently, the fact that the models are close in size and perform similarly on [general embedding benchmarks](https://huggingface.co/spaces/mteb/leaderboard) may suggest that the results will not be that drastic. 

### Challenge
We encourage you to try larger models with even larger embedding dimensions -- to do so, use our `CustomTextEmbedding` class! To use this class, import it as below, and then specify the name of a model as used on HuggingFace:

```python
from src.embeddings.custom import CustomTextEmbedding
text_embedder_cfg = TextEmbedderConfig(
    text_embedder=CustomTextEmbedding(model_name=<INSERT_HUGGINGFACE_MODEL_HERE>, device=device), batch_size=128
)
```

## Question 3: Different RDL model architectures?
Finally, we experiment with different RDL model architectures.