# OGB Benchmarks

This notebook shows you how to re-create the benchmarks from the paper [Open Graph Benchmark: Datasets for Machine Learning on Graphs](https://arxiv.org/pdf/2005.00687.pdf).

We'll train on a small dataset, the ESOL solubility dataset. The reported results are in Table 28 of the paper's appendix.

In [None]:
from hbond_benchmark.train import parse_args, train

In [None]:
name = 'esol'  # we'll use the esol dataset

args = [
    f'--dataset_name={name}',  # name of one of the builtin datasets, e.g. esol, pcba, hiv
    '--dataset_root=data/',  # store the dataset here
    '--hbonds',
    '--n_runs=10',  # the paper uses the mean and std over ten runs
    '--max_epochs=100',  # the paper uses 100 epochs per run
    '--residual',  # the reference implementation uses residual connections
    '--num_sanity_val_steps=0',  # skip the validation sanity check that runs before training
]

In [None]:
args.append('--gpus=1')  # train on a single GPU

In [None]:
gnn_types = ['gcn', 'gin']
virtual_node = [True, False]

## Run the experiment

We compare four architectures with the full featurization: GCN and GIN, each with and without a virtual node.

In [None]:
for gnn in gnn_types:
    for virt in virtual_node:
        _args = args.copy()
        _args.append(f'--default_root_dir=models/{name}/{gnn}/{virt}')  # where the models will be stored
        _args.append(f'--gnn_type={gnn}')
        if virt:
            _args.append('--virtual_node')
        train(parse_args(_args))
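
The loop above builds one argument list per (GNN type, virtual node) pair. As a quick check before launching any training, the same lists can be generated and printed on their own. This is a minimal sketch: `build_run_args` is a hypothetical helper written for illustration, not part of `hbond_benchmark`, and the base list is shortened to a single flag.

```python
from itertools import product


def build_run_args(base_args, name, gnn_types, virtual_node):
    """Yield one argument list per (gnn_type, virtual_node) combination.

    Hypothetical helper: it mirrors the training loop but never calls train().
    """
    for gnn, virt in product(gnn_types, virtual_node):
        run_args = list(base_args)  # copy so the shared base list is never mutated
        run_args.append(f'--default_root_dir=models/{name}/{gnn}/{virt}')
        run_args.append(f'--gnn_type={gnn}')
        if virt:
            run_args.append('--virtual_node')
        yield run_args


# Shortened base list, for illustration only
configs = list(build_run_args(['--hbonds'], 'esol', ['gcn', 'gin'], [True, False]))
for cfg in configs:
    print(cfg)
```

Each printed list can be passed to `parse_args` exactly as in the loop, so the two stay interchangeable.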
        

## Compiling the results

We can extract the test-set performance and the best validation score across epochs from the TensorBoard logs.

In [None]:
from pathlib import Path
from tensorboard.backend.event_processing import event_accumulator
import pandas as pd

In [None]:
test_runs = []
valid_runs = []

for gnn in gnn_types:
    for virt in virtual_node:
        p = Path(f'models/{name}/{gnn}/{virt}')
        event_files = list(p.glob('*/*/*tfevents*'))
        for ef in event_files:
            ea = event_accumulator.EventAccumulator(str(ef))
            ea.Reload()
            tags = ea.Tags()['scalars']
            if any('test' in x for x in tags):
                # this event file holds the final test metric
                tag = [x for x in tags if 'test' in x][0]
                row = {'gnn_type': gnn, 'virtual': virt, tag: ea.Scalars(tag)[0].value}
                test_runs.append(row)
            else:
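                # validation log: take the first tag that is neither bookkeeping nor a training metric
                tag = [x for x in tags if x not in ['hp_metric', 'epoch'] and 'train' not in x][0]
                valid = [x.value for x in ea.Scalars(tag)]
                # RMSE is an error (lower is better); the other metrics are scores (higher is better)
                best = min(valid) if 'rmse' in tag else max(valid)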
                row = {'gnn_type': gnn, 'virtual': virt, tag: best}
                valid_runs.append(row)
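
The cell above keeps only the best validation value. If the point at which it occurred is also of interest, the step stored with each scalar can be kept alongside it. The sketch below runs without any log files: `ScalarEvent` is a local stand-in for the records returned by `ea.Scalars(tag)` (which carry `wall_time`, `step` and `value`), `best_event` is a hypothetical helper, and the values are toy numbers. Depending on the logger, `step` may be a global training step rather than an epoch index.

```python
from collections import namedtuple

# Local stand-in for TensorBoard scalar records (wall_time, step, value)
ScalarEvent = namedtuple('ScalarEvent', ['wall_time', 'step', 'value'])


def best_event(events, lower_is_better):
    """Return the record with the best value (hypothetical helper)."""
    pick = min if lower_is_better else max
    return pick(events, key=lambda e: e.value)


# Toy values for illustration only, not real results
toy_events = [
    ScalarEvent(0.0, 0, 1.9),
    ScalarEvent(1.0, 1, 1.2),
    ScalarEvent(2.0, 2, 1.4),
]

best_toy = best_event(toy_events, lower_is_better=True)  # RMSE-style metric
print(best_toy.step, best_toy.value)
```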

In [None]:
pd.DataFrame(valid_runs).groupby(['gnn_type', 'virtual']).mean().round(3)

In [None]:
pd.DataFrame(valid_runs).groupby(['gnn_type', 'virtual']).std().round(3)

In [None]:
pd.DataFrame(test_runs).groupby(['gnn_type', 'virtual']).mean().round(3)

In [None]:
pd.DataFrame(test_runs).groupby(['gnn_type', 'virtual']).std().round(3)
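
The mean and standard deviation can also be shown side by side in one table, together with the number of runs behind each row. This is a minimal sketch on toy rows: the column name `test_rmse` is an assumption about the logged tag, and the values are placeholders, not results.

```python
import pandas as pd

# Toy rows shaped like the entries of test_runs; values are placeholders
toy_runs = [
    {'gnn_type': 'gcn', 'virtual': True, 'test_rmse': 1.0},
    {'gnn_type': 'gcn', 'virtual': True, 'test_rmse': 1.2},
    {'gnn_type': 'gin', 'virtual': False, 'test_rmse': 0.9},
    {'gnn_type': 'gin', 'virtual': False, 'test_rmse': 1.1},
]

summary = (
    pd.DataFrame(toy_runs)
    .groupby(['gnn_type', 'virtual'])['test_rmse']
    .agg(['mean', 'std', 'count'])  # one column per statistic
    .round(3)
)
print(summary)
```

The `count` column makes it easy to confirm that every configuration contributed the expected ten runs before comparing against the paper.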