# InfoGain experiments

This notebook helps reproduce the experiments in the paper "InfoGain: Furthering the Design of Diffusion Wavelets for Graph-Structured Data" (Johnson et al.).

In [3]:
import sys
sys.path.insert(0, '../')
import pandas as pd
import numpy as np
import pickle
import torch
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.datasets import TUDataset
from torch_geometric.datasets import LRGBDataset

import infogain
import results_utilities as ru

# these objects control the rounding and other settings of
# the final results dataframes produced
col_rounds = {
    ('acc', 'mean'): 2,
    ('acc', 'std'): 2,
    ('bal_acc', 'mean'): 2,
    ('bal_acc', 'std'): 2,
    ('specificity', 'mean'): 2,
    ('specificity', 'std'): 2,
    ('f1', 'mean'): 2,
    ('f1', 'std'): 2,
    ('f1_neg', 'mean'): 2,
    ('f1_neg', 'std'): 2,
    ('sec_per_epoch', 'mean'): 2,
    ('sec_per_epoch', 'std'): 2,
    ('n_epochs', 'mean'): 0,
    ('n_epochs', 'std'): 0,
    ('train_min_per_fold', 'mean'): 2,
    ('train_min_per_fold', 'std'): 2
}
mean_std_colnames = ('acc', 'sec_per_epoch', 'n_epochs', 'train_min_per_fold')
final_colnames = {
    'acc': 'Accuracy',
    'bal_acc': 'Balanced Accuracy',
    'f1': 'F1 score',
    'sec_per_epoch': 'Sec. per epoch',
    'n_epochs': 'Num. epochs',
    'train_min_per_fold': 'Min. per fold'
}

# Run experiments {-}

First, make sure all necessary dependencies are installed (see the repo's README for the list). Second, set the directories specific to your machine in the `__post_init__` method of `args_template.py`. Then, change any desired dataset-specific arguments in that dataset's args file (these are found in the 'infogain_testing' folder of this codebase). Finally, set the remaining experiment settings as command-line arguments in the cell below before running it. These are:

- Options for the script to call: for 10-fold CV experiments, call the `experiments_cv.py` script; for train-valid-test experiments (e.g. for the 'peptides-func' dataset, a benchmark that comes pre-split), call the `experiments_tvt.py` script. You may need to provide the absolute path to the script.
- Options for `--machine`: the string you use here should match that in the `__post_init__` method of `args_template.py` (which helps set directories 'under the hood').
- Options for `--model_key`: 'mfcn_p' or 'legs'
- Options for `--dataset`: 'dd', 'ptc_mr', 'nci1', 'mutag', 'proteins', 'peptides-func'
- Options for `--p_wavelet_scales_type` [mfcn_p models only]: 'custom' [for 'LS-InfoGain' models] or 'dyadic' [for 'LS-dyadic' models]
- Use the boolean flag `--use_all_orig_feats` for 'LEGS-full' and 'LS-dyadic-full' models; otherwise, omit this flag, and the features listed in the `EXCLUDED_FEAT_IDX` attribute of the dataset's args file will be excluded from the dataset.
- Options for `--J`: 4 [for $t_J = 16$] or 5 [for $t_J = 32$]
- Set `--verbosity` to 0 for minimal print output, or 1 for epoch summary output as the experiment runs.
- Other settings: set according to the hyperparameter tables in the manuscript.
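
When sweeping over several datasets or models, the command in the next cell can also be assembled programmatically. The following is a minimal sketch, not part of the codebase: the helper `build_command` is hypothetical, and it only assumes the flag names listed above, with boolean flags such as `--use_all_orig_feats` passed as bare switches.

```python
import shlex


def build_command(script, **flags):
    """Assemble an argument list from keyword arguments (hypothetical helper).

    True becomes a bare switch, False/None is omitted,
    and any other value becomes `--name value`.
    """
    parts = ["python3", script]
    for name, value in flags.items():
        if value is True:
            parts.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            parts.extend([f"--{name}", str(value)])
    return parts


cmd = build_command(
    "experiments_tvt.py",
    machine="desktop",
    model_key="mfcn_p",
    p_wavelet_scales_type="custom",
    J=4,
    dataset="peptides-func",
    use_all_orig_feats=True,
    verbosity=1,
)

# show the command as a single shell-safe string
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch the experiment from Python instead of from a `!` shell cell.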

In [None]:
! python3 "experiments_tvt.py" \
--machine "desktop" \
--model_key "mfcn_p" \
--p_wavelet_scales_type "custom" \
--J 4 \
--dataset "peptides-func" \
--use_all_orig_feats \
--n_epochs 1000 \
--burn_in 100 \
--validate_every_n_epochs 5 \
--patience 50 \
--learn_rate 0.005 \
--batch_size 512 \
--verbosity 1

# Summarize results {-}

The methods in the cell below process experiment results into a summary results table. Note that they can process experiments for multiple models on the same dataset. 
1. Make sure the cell is importing the correct args file as `a` (at top).
2. Add each model results directory name (auto-created when an experiment is run) to the `model_dirs` tuple. These directory names contain the model key plus a UTC timestamp, for example, "mfcn_p_2025-03-03-041407". (Replace this placeholder in `model_dirs` with your own directory names.)
3. If desired, add the directory names as keys of `model_suffix_dict`, with values set to the suffixes to display with the model names in the results table (this can help keep track of ablation models).
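
To keep many result directories in order, the directory names can be split back into a model key and a timestamp. The sketch below is illustrative only: `parse_model_dir` is a hypothetical helper, and the timestamp layout `YYYY-MM-DD-HHMMSS` is an assumption inferred from the example name above.

```python
from datetime import datetime, timezone


def parse_model_dir(dirname):
    """Split '<model_key>_<timestamp>' into (model_key, UTC datetime).

    Assumes the timestamp follows the last underscore and is laid out
    as YYYY-MM-DD-HHMMSS (an assumption, not confirmed by the codebase).
    """
    model_key, _, stamp = dirname.rpartition("_")
    when = datetime.strptime(stamp, "%Y-%m-%d-%H%M%S")
    return model_key, when.replace(tzinfo=timezone.utc)


key, when = parse_model_dir("mfcn_p_2025-03-03-041407")
print(key, when.isoformat())
```

Sorting a list of directory names by the parsed datetime gives the runs in the order they were launched.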

In [4]:
# import the dataset's args
import infogain_testing.peptides_func_args as a
# import infogain_testing.dd_args as a
# import infogain_testing.ptc_args as a
# import infogain_testing.nci1s_args as a
# import infogain_testing.proteins_args as a
# import infogain_testing.mutag_args as a
args = a.Args()

# use this tuple to collect model results to tabularize
model_dirs = (
    "mfcn_p_2025-03-03-041407",
)
# use this dict to add suffixes to the model names in the results table
model_suffix_dict = {
    "mfcn_p_2025-03-03-041407": "InfoGain-32-drop",
}

results_df = ru.get_cv_results_df(
    args, 
    model_dirs, 
    include_times=True,
    validate_every=args.VALIDATE_EVERY_N_EPOCHS,
    model_suffix_dict=model_suffix_dict,
    decimal_round=4
).sort_values('model')

# for k-fold CV experiments, this method summarizes results
# into metric means and standard deviations (for TVT experiments,
# these will not be shown) in a dataframe; it also produces a 
# LaTeX table string
df, df_latex = ru.get_mean_pm_std_df(
    df=results_df,
    mean_std_colnames=mean_std_colnames,
    col_rounds=col_rounds,
    final_colnames=final_colnames,
    mean_subcol_tuple=('mean', ),
    std_subcol_tuple=('std', )
)

# inspect results
# print(df_latex)
df

| model | acc | sec_per_epoch | n_epochs | train_min_per_fold |
|---|---|---|---|---|
| mfcn_p-InfoGain-32-drop | $\mathbf{81.17}$ | $5.60 \pm 0.06$ | $350$ | $32.66$ |
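
The "mean $\pm$ std" cells in a table like this one can be approximated with plain pandas. The sketch below is not the implementation of `ru.get_mean_pm_std_df`; the helper `mean_pm_std`, the model names, and the per-fold numbers are all made up for illustration.

```python
import pandas as pd

# toy per-fold accuracies (made-up numbers, for illustration only)
folds = pd.DataFrame({
    "model": ["model_a"] * 3 + ["model_b"] * 3,
    "acc": [80.0, 82.0, 81.0, 70.0, 75.0, 71.0],
})


def mean_pm_std(df, metric, decimals=2):
    """Return one LaTeX 'mean \\pm std' string per model (hypothetical helper)."""
    # pandas' 'std' aggregation is the sample standard deviation (ddof=1)
    stats = df.groupby("model")[metric].agg(["mean", "std"])
    return stats.apply(
        lambda r: f"${r['mean']:.{decimals}f} \\pm {r['std']:.{decimals}f}$",
        axis=1,
    )


summary = mean_pm_std(folds, "acc")
print(summary)
```

The `decimals` argument plays a role similar to the per-column entries of `col_rounds`, here applied to a single metric.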
