# TCGA-GBMLGG Analysis Demo

- **Cohort**: Focuses on the TCGA-GBMLGG dataset, a vital resource merging Glioblastoma Multiforme (GBM) and Lower-Grade Glioma (LGG).
- **Goal**: Perform histological subtype classification.
- **Prediction Target**: Predict whether a tumor is an `astrocytoma`, `oligodendroglioma`, or `oligoastrocytoma` based on its multi-omics profile.

**Data Source:** Broad Institute FireHose (http://firebrowse.org/?cohort=GBMLGG)

In [1]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/GBMLGG")

mirna_raw = pd.read_csv(root/"GBMLGG.miRseq_RPKM_log2.txt", sep="\t",index_col=0,low_memory=False)                            
rna_raw = pd.read_csv(root / "GBMLGG.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t",index_col=0,low_memory=False)
meth_raw = pd.read_csv(root/"GBMLGG.meth.by_mean.data.txt", sep='\t',index_col=0,low_memory=False)
clinical_raw = pd.read_csv(root / "GBMLGG.clin.merged.picked.txt",sep="\t", index_col=0, low_memory=False)

# display shapes and first few rows-columns of each file
display(mirna_raw.iloc[:3,:5])
display(mirna_raw.shape)

display(rna_raw.iloc[:3,:5])
display(meth_raw.shape)

display(meth_raw.iloc[:3,:5])
display(meth_raw.shape)

display(clinical_raw.iloc[:3,:5])
display(clinical_raw.shape)

Unnamed: 0_level_0,TCGA-06-0675-11,TCGA-06-0678-11,TCGA-06-0680-11,TCGA-06-0681-11,TCGA-06-AABW-11
gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
hsa-let-7a-1,12.847399,13.789578,13.603454,13.346797,13.545128
hsa-let-7a-2,13.850719,14.79297,14.597877,14.34426,14.554888
hsa-let-7a-3,12.873946,13.810832,13.611074,13.364372,13.583039


(548, 531)

Unnamed: 0_level_0,TCGA-02-0047-01,TCGA-02-0055-01,TCGA-02-2483-01,TCGA-02-2485-01,TCGA-02-2486-01
gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
?|100133144,1.619742,,1.5591,3.999567,2.475344
?|100134869,2.757258,3.972445,3.801138,3.902759,2.264506
?|10357,5.773564,4.97244,5.915141,6.520796,5.966629


(20115, 685)

Unnamed: 0_level_0,TCGA-06-0125-01,TCGA-06-0125-02,TCGA-06-0152-01,TCGA-06-0152-02,TCGA-06-0171-01
Hybridization REF,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Composite Element REF,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value
A1BG,0.438986043005,0.565094788162,0.461699906718,0.534127262606,0.455267108058
A1CF,0.681141812896,0.724487443757,0.601439733092,0.632221318323,0.691054589549


(20115, 685)

Unnamed: 0_level_0,tcga-06-6391,tcga-19-a6j4,tcga-cs-6665,tcga-cs-6670,tcga-db-a4xc
Hybridization REF,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Composite Element REF,value,value,value,value,value
years_to_birth,44,68,51,43,26
vital_status,1,1,0,0,0


(14, 1110)

## Data Processing Summary

1. **Transpose Data:** All raw data (miRNA, RNA, etc.) is flipped so rows represent patients and columns represent features.
2. **Standardize Patient IDs:** Patient IDs in all tables are cleaned to the 12-character TCGA format (e.g., `TCGA-AB-1234`) for matching.
3. **Handle Duplicates:** Duplicate patient rows are averaged in the omics data. The first entry is kept for duplicate patients in the clinical data.
4. **Find Common Patients:** The script identifies the list of patients that exist in *all* datasets.
5. **Subset Data:** All data tables are filtered down to *only* this common list of patients, ensuring alignment.
6. **Extract Target:** The `histological_type` column is pulled from the processed clinical data to be used as the prediction target (y-variable).

In [None]:
mirna = mirna_raw.T
rna = rna_raw.T
meth = meth_raw.T
clinical = clinical_raw.T

print(f"miRNA (samples, features): {mirna.shape}")
print(f"RNA (samples, features): {rna.shape}")
print(f"Methylation (samples, features): {meth.shape}")
print(f"Clinical (samples, features): {clinical.shape}")

def trim_barcode(idx):
    return idx.to_series().str.slice(0, 12)

# standarized patient IDs across all files
meth.index = trim_barcode(meth.index)
rna.index = trim_barcode(rna.index)
mirna.index = trim_barcode(mirna.index)
clinical.index = clinical.index.str.upper()
clinical.index.name = "Patient_ID"

# convert all data to numeric, coercing errors to NaN
meth = meth.apply(pd.to_numeric, errors='coerce')
rna = rna.apply(pd.to_numeric, errors='coerce')
mirna = mirna.apply(pd.to_numeric, errors='coerce')

# for any duplicate columns in the omics data, we average their values
meth = meth.groupby(meth.index).mean()
rna = rna.groupby(rna.index).mean()
mirna = mirna.groupby(mirna.index).mean()

# for any duplicate rows in the clinical data, we keep the first occurrence
clinical = clinical[~clinical.index.duplicated(keep='first')]

print(f"\nMethylation shape: {meth.shape}")
print(f"RNA shape: {rna.shape}")
print(f"miRNA shape: {mirna.shape}")
print(f"Clinical shape: {clinical.shape}")

for df in [meth, rna, mirna]:
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")
    
    df.fillna(df.mean(), inplace=True)

# to see which pateints are common across all data files
common_patients = sorted(list(set(meth.index)&set(rna.index)&set(mirna.index)&set(clinical.index)))

print(f"\nFound: {len(common_patients)} patients across all data types.")

# subset to only common patients
meth_processed = meth.loc[common_patients]
rna_processed= rna.loc[common_patients]
mirna_processed = mirna.loc[common_patients]
clinical_processed = clinical.loc[common_patients]

# extract target labels from clinical data
targets = clinical_processed['histological_type']

miRNA (samples, features): (531, 548)
RNA (samples, features): (701, 18328)
Methylation (samples, features): (685, 20115)
Clinical (samples, features): (1110, 14)

Methylation shape: (658, 20115)
RNA shape: (681, 18328)
miRNA shape: (517, 548)
Clinical shape: (1110, 14)

Found: 511 patients across all data types.


In [4]:
display(mirna_processed.iloc[:3,:5])
display(mirna_processed.shape)

display(rna_processed.iloc[:3,:5])
display(rna_processed.shape)

display(meth_processed.iloc[:3,:5])
display(meth_processed.shape)

display(clinical_processed.iloc[:3,:5])
display(clinical_processed.shape)

display(targets.value_counts())

gene,hsa_let_7a_1,hsa_let_7a_2,hsa_let_7a_3,hsa_let_7b,hsa_let_7c
TCGA-CS-4938,12.622353,13.632728,12.651613,14.20893,14.376942
TCGA-CS-4941,11.809808,12.815815,11.820061,13.047853,11.955006
TCGA-CS-4942,11.113995,12.128618,11.165523,12.48179,11.858545


(511, 548)

gene,unknown_100133144,unknown_100134869,unknown_10357,unknown_10431,unknown_155060
TCGA-CS-4938,3.123352,4.50794,8.069184,9.724198,7.51179
TCGA-CS-4941,5.187819,4.404406,7.291745,8.608326,8.344526
TCGA-CS-4942,3.562316,3.462602,7.53246,9.279502,7.034985


(511, 18328)

Hybridization REF,Composite Element REF,A1BG,A1CF,A2BP1,A2LD1
TCGA-CS-4938,,0.683179,0.776869,0.652055,0.919739
TCGA-CS-4941,,0.521934,0.784401,0.563447,0.865717
TCGA-CS-4942,,0.610067,0.828194,0.607771,0.875369


(511, 20115)

Hybridization REF,Composite Element REF,years_to_birth,vital_status,days_to_death,days_to_last_followup
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TCGA-CS-4938,value,31,0,,3574.0
TCGA-CS-4941,value,67,1,234.0,
TCGA-CS-4942,value,44,1,1335.0,


(511, 14)

histological_type
astrocytoma          193
oligodendroglioma    191
oligoastrocytoma     127
Name: count, dtype: int64

In [None]:
import bioneuralnet as bnn

# drop unwanted columns from clinical data
clinical_processed.drop(columns=["Composite Element REF"], errors="ignore", inplace=True)

# we transform the methylation beta values to M-values and drop unwanted columns
meth_m = meth_processed.drop(columns=["Composite Element REF"], errors="ignore")

# convert beta values to M-values using bioneuralnet utility with small epsilon to avoid log(0)
meth_m = bnn.utils.beta_to_m(meth_m, eps=1e-6) 

# lastly we turn the target labels into numerical classes
mapping = {"astrocytoma": 0, "oligodendroglioma": 1, "oligoastrocytoma": 2}
target_labels = targets.map(mapping).to_frame(name="target")

# as a safety check we align the indices once more
X_meth = meth_m.loc[common_patients]
X_rna = rna_processed.loc[common_patients]
X_mirna = mirna_processed.loc[common_patients]
Y_labels = target_labels.loc[common_patients]
clinical_final = clinical_processed.loc[common_patients]

print(f"\nDNA_Methylation shape: {X_meth.shape}")
print(f"RNA shape: {X_rna.shape}")
print(f"miRNA shape: {X_mirna.shape}")
print(f"Clinical shape: {clinical_final.shape}")
print(Y_labels.value_counts())

2025-11-08 13:06:12,202 - bioneuralnet.utils.data - INFO - Starting Beta-to-M value conversion (shape: (511, 20114)). Epsilon: 1e-06
2025-11-08 13:06:13,301 - bioneuralnet.utils.data - INFO - Beta-to-M conversion complete.



Methylation shape: (511, 20114)
RNA shape: (511, 18328)
miRNA shape: (511, 548)
Clinical shape: (511, 13)
target
0         193
1         191
2         127
Name: count, dtype: int64


## Feature Selection Methodology

### Supported Methods and Interpretation

**BioNeuralNet** provides three techniques for feature selection, allowing for different views of the data's statistical profile:

- **Variance Thresholding:** Identifies features with the **highest overall variance** across all samples.

- **ANOVA F-test:** Pinpoints features that best **distinguish between the target classes** (KIRC, KIRP, and KICH).

- **Random Forest Importance:** Assesses **feature utility** based on its contribution to a predictive non-linear model.

### GBMLGG Cohort Selection Strategy

A dimensionality reduction step was essential for managing the high-feature-count omics data:

- **High-Feature Datasets:** Both DNA Methylation (20,114) and RNA (18,328) required significant feature reduction.

- **Filtering Process:** The **top 6,000 features** were initially extracted from the Methylation and RNA datasets using all three methods.

- **Final Set:** A consensus set was built by finding the intersection of features selected by the ANOVA F-test and Random Forest Importance, ensuring both statistical relevance and model-based utility.

- **Low-Feature Datasets:** The miRNA data (548 features) was passed through **without selection**, as its feature count was already manageable.

In [9]:
import bioneuralnet as bnn

# feature selection
meth_highvar = bnn.utils.select_top_k_variance(X_meth, k=6000)
rna_highvar = bnn.utils.select_top_k_variance(X_rna, k=6000)

meth_af = bnn.utils.top_anova_f_features(X_meth, Y_labels, max_features=6000)
rna_af = bnn.utils.top_anova_f_features(X_rna, Y_labels, max_features=6000)

meth_rf = bnn.utils.select_top_randomforest(X_meth, Y_labels, top_k=6000)
rna_rf = bnn.utils.select_top_randomforest(X_rna, Y_labels, top_k=6000)

meth_var_set = set(meth_highvar.columns)
meth_anova_set = set(meth_af.columns)
meth_rf_set = set(meth_rf.columns)

rna_var_set = set(rna_highvar.columns)
rna_anova_set = set(rna_af.columns)
rna_rf_set = set(rna_rf.columns)

meth_inter1 = list(meth_anova_set & meth_var_set)
meth_inter2 = list(meth_rf_set & meth_var_set)
meth_inter3 = list(meth_anova_set & meth_rf_set)
meth_all_three = list(meth_anova_set & meth_var_set & meth_rf_set)

rna_inter4 = list(rna_anova_set & rna_var_set)
rna_inter5 = list(rna_rf_set & rna_var_set)
rna_inter6 = list(rna_anova_set & rna_rf_set)
rna_all_three = list(rna_anova_set & rna_var_set & rna_rf_set)

2025-11-08 13:15:33,012 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-08 13:15:33,012 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-11-08 13:15:33,012 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-08 13:15:33,085 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-11-08 13:15:35,778 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-08 13:15:35,779 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-11-08 13:15:35,779 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-08 13:15:35,843 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-11-08 13:15:38,824 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-08 13:15:38,824 - bioneuralnet.

In [11]:
print("FROM THE 6000 Methylation feature selection:\n")
print(f"Anova-F & variance selection share: {len(meth_inter1)} features")
print(f"Random Forest & variance selection share: {len(meth_inter2)} features")
print(f"Anova-F & Random Forest share: {len(meth_inter3)} features")
print(f"All three methods agree on: {len(meth_all_three)} features")

FROM THE 6000 Methylation feature selection:

Anova-F & variance selection share: 2704 features
Random Forest & variance selection share: 1768 features
Anova-F & Random Forest share: 1823 features
All three methods agree on: 809 features


In [12]:
print("\nFROM THE 6000 RNA feature selection:\n")
print(f"Anova-F & variance selection share: {len(rna_inter4)} features")
print(f"Random Forest & variance selection share: {len(rna_inter5)} features")
print(f"Anova-F & Random Forest share: {len(rna_inter6)} features")
print(f"All three methods agree on: {len(rna_all_three)} features")


FROM THE 6000 RNA feature selection:

Anova-F & variance selection share: 2183 features
Random Forest & variance selection share: 1977 features
Anova-F & Random Forest share: 2127 features
All three methods agree on: 763 features


## Feature Selection Summary: ANOVA-RF Intersection

The chosen strategy for feature selection is based on the **overlap** between features identified by the **ANOVA F-test** and **Random Forest Importance**. This approach offers comprehensive filtering by balancing class-based relevance (ANOVA) with non-linear model importance (Random Forest). The resulting feature sets are considered the most robust for downstream analysis.

### Feature Overlap Results

The following table details the number of features resulting from the intersection of different selection methods for each omics data type.

| Omics Data Type | ANOVA-F & Variance | RF & Variance | ANOVA-F & Random Forest (Selected) | All Three Agree |
| :--- | :--- | :--- | :--- | :--- |
| **Methylation** | 2,704 features | 1,768 features | **1,823 features** | 809 features |
| **RNA** | 2,183 features | 1,977 features | **2,127 features** | 763 features |

In [13]:
X_meth_selected = X_meth[meth_inter3]
X_rna_selected = X_rna[rna_inter6]

print("\nFinal Shapes for Modeling")
print(f"Methylation (X1): {X_meth_selected.shape}")
print(f"RNA-Seq (X2): {X_rna_selected.shape}")
print(f"miRNA-Seq (X3): {X_mirna.shape}")
print(f"Labels (Y): {Y_labels.shape}")


Final Shapes for Modeling
Methylation (X1): (511, 1823)
RNA-Seq (X2): (511, 2127)
miRNA-Seq (X3): (511, 548)
Labels (Y): (511, 1)


## Data Availability

To facilitate rapid experimentation and reproduction of our results, the fully processed and feature-selected dataset used in this analysis has been made available directly within the package.

Users can load this dataset, bypassing all preceding data acquisition, preprocessing, and feature selection steps. This allows users to proceed immediately from this step.

In [None]:
import bioneuralnet as bnn

tgca_gbmlgg = bnn.datasets.DatasetLoader("gbmlgg")
display(tgca_gbmlgg.shape)

# The dataset is returned as a dictionary. We extract each file independetly based on the name( Key).
dna_meth = tgca_gbmlgg.data["meth"]
rna = tgca_gbmlgg.data["rna"]
mirna = tgca_gbmlgg.data["mirna"]
clinical = tgca_gbmlgg.data["clinical"]
target = tgca_gbmlgg.data["target"]

{'mirna': (511, 548),
 'target': (511, 1),
 'clinical': (511, 13),
 'rna': (511, 2127),
 'meth': (511, 1823)}

In [None]:
# BioNeuralNet provides a preprocessing function to handle clinical data
clinical = tgca_gbmlgg.data["clinical"]

# For more details on the preprocessing functions, see `bioneuralnet.utils.preprocess``
clinical_preprocessed = bnn.utils.preprocess_clinical(
    clinical, 
    target, 
    top_k=7, 
    scale=False, 
    ignore_columns=[ "days_to_last_followup",  "years_to_birth", "days_to_death", "date_of_initial_pathologic_diagnosis"])

display(clinical_preprocessed.iloc[:3,:5])

2025-11-08 13:33:26,861 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-08 13:33:26,862 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 717 NaNs after median imputation
2025-11-08 13:33:26,862 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 1 columns dropped due to zero variance
2025-11-08 13:33:26,944 - bioneuralnet.utils.preprocess - INFO - Selected top 7 features by RandomForest importance


Unnamed: 0_level_0,histological_type_oligodendroglioma,histological_type_oligoastrocytoma,karnofsky_performance_score,radiation_therapy_no,radiation_therapy_yes
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TCGA-CS-4938,False,False,90.0,True,False
TCGA-CS-4941,False,False,90.0,False,True
TCGA-CS-4942,False,False,70.0,False,True


In [18]:
import pandas as pd

X_train_full = pd.concat([dna_meth, rna, mirna], axis=1)

print(f"Nan values in X_train_full: {X_train_full.isna().sum().sum()}")
X_train_full = X_train_full.dropna()
print(f"Nan value in X_train_full after dropping: {X_train_full.isna().sum().sum()}")

print(f"X_train_full shape: {X_train_full.shape}")
# building the graph using the similarity graph function with k=15
A_train = bnn.utils.gen_similarity_graph(X_train_full, k=15)

print(f"\nNetwork shape: {A_train.shape}")

Nan values in X_train_full: 0
Nan value in X_train_full after dropping: 0
X_train_full shape: (511, 4498)

Network shape: (4498, 4498)


## Reproducibility and Seeding

To ensure our experimental results are fully reproducible, a single global seed is set at the beginning of the analysis.

This utility function propagates the seed to all sources of randomness, including `random`, `numpy`, and `torch` (for both CPU and GPU).

Critically, it also configures the PyTorch cuDNN backend to use deterministic algorithms.

In [1]:
import bioneuralnet as bnn

SEED = 118
bnn.utils.set_seed(SEED)

2025-11-08 16:33:32,623 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 118
2025-11-08 16:33:32,633 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-08 16:33:32,634 - bioneuralnet.utils.data - INFO - Seed setting complete


## Hyperparameter Tuning

- First step on our benchmark analysis is dedicated to **hyperparameter tuning** for:
    - **SAGE:** GraphSAGE (Graph Sample and Aggregate)
    - **GCN:** Graph Convolutional Network
    - **GAT:** Graph Attention Network
    
- We use `tune=True` and `repeat_num=5` within **DPMON** (our classification framework) to let the framework automatically search for the best architectural and training parameters.

- After the tuning phase, we will manually set the best-performing parameters for each model.
- We then use these optimal configurations in the final cross-validation to rigorously test each model performance and stability.

In [None]:
from pathlib import Path

# We rename the target to phenotype so we keep it consistent with DPMON expectations.
target = target.rename(columns={"target": "phenotype"})

dpmon_sage_tune = bnn.downstream_task.DPMON(
    adjacency_matrix=A_train,
    omics_list=[dna_meth, rna, mirna],
    phenotype_data=target,
    clinical_data=clinical_preprocessed,
    model="SAGE",
    tune=True,
    repeat_num=5,
    gpu=True, cuda=0,
    output_dir=Path("/home/vicente/Github/BioNeuralNet/dpmon_tuning/GBM_SAGE"),
)

dpmon_sage_tune.run()

dpmon_gcn_tune = bnn.downstream_task.DPMON(
    adjacency_matrix=A_train,
    omics_list=[dna_meth, rna, mirna],
    phenotype_data=target,
    clinical_data=clinical_preprocessed,
    model="GCN",
    tune=True,
    repeat_num=5,
    gpu=True, cuda=0,
    output_dir=Path("/home/vicente/Github/BioNeuralNet/dpmon_tuning/GBM_GCN"),
)

dpmon_gcn_tune.run()


dpmon_gat_tune = bnn.downstream_task.DPMON(
    adjacency_matrix=A_train,
    omics_list=[dna_meth, rna, mirna],
    phenotype_data=target,
    clinical_data=clinical_preprocessed,
    model="GAT",
    tune=True,
    repeat_num=5,
    gpu=True, cuda=0,
    output_dir=Path("/home/vicente/Github/BioNeuralNet/dpmon_tuning/GBM_GAT"),
)

dpmon_gcn_tune.run()

2025-11-08 14:31:25,659 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_tuning/GBM_SAGE
INFO:bioneuralnet.downstream_task.dpmon:Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_tuning/GBM_SAGE
2025-11-08 14:31:25,660 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
INFO:bioneuralnet.downstream_task.dpmon:Initialized DPMON with the provided parameters.
2025-11-08 14:31:25,661 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
INFO:bioneuralnet.downstream_task.dpmon:Starting DPMON run.
2025-11-08 14:31:25,674 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
INFO:bioneuralnet.downstream_task.dpmon:Running hyperparameter tuning for DPMON.
2025-11-08 14:31:25,675 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
INFO:bioneuralnet.downstream_task.dpmon:Using GPU 0
DEBUG:bioneuralnet.downstream_task.dpmon:Slici

(     Actual  Predicted
 0         0          0
 1         0          0
 2         0          0
 3         0          0
 4         0          0
 ..      ...        ...
 506       0          0
 507       0          0
 508       0          0
 509       2          2
 510       2          2
 
 [511 rows x 2 columns],
 0.5659491193737769)

## Analysis of Hyperparameter Optimization

**Hyperparameter tuning using `SEED = 118`**

### Tuning Search Space

**DPMON** parameter search space is extensive, offering over 12,000 discrete combinations plus a continuous space for learning rates. By sampling **50 trials** from this space, we are effectively probing it to find the distinct, optimal architecture best suited for each GNN (SAGE, GCN, and GAT).


| Parameter | SAGE (GraphSAGE) | GCN (Graph Convolutional) | GAT (Graph Attention) |
| :--- | :--- | :--- | :--- |
| **gnn\_layer\_num** | 2 | 64 | 16 |
| **gnn\_hidden\_dim** | 128 | 16 | 4 |
| **lr (Learning Rate)** | 0.005197 | 0.004397 | 0.003226 |
| **weight\_decay** | 0.046079 | 0.000405 | 0.046863 |
| **nn\_hidden\_dim1** | 1 | 64 | 16 |
| **nn\_hidden\_dim2** | 32 | 8 | 128 |
| **num\_epochs** | 4096 | 1024 | 512 |

### SAGE Model Configuration
The optimal SAGE configuration was **shallow but wide**, requiring an unusually long training time. This completely reverses the previous "deep and narrow" result, indicating that a high-dimensional feature representation was prioritized over recursive aggregation in this specific run.

### GCN Model Configuration
The GCN model favored a highly atypical architecture: **deep yet narrow**. The success of this extremely deep GCN suggests the model learned highly specialized, low-dimensional features over many layers, despite the typical risk of over-smoothing.

### GAT Model Configuration
The GAT model found a balanced, efficient architecture: **moderate depth and narrow**. It was the fastest to converge, suggesting GAT's attention mechanism effectively selected features, allowing it to perform well with a minimal, low-dimensional representation.

---

## Cross-Validation Results with 5 folds

| Model | Avg. Accuracy | Avg. F1 Weighted | Avg. F1 Macro |
| :--- | :--- | :--- | :--- |
| **GAT** | **0.9213** +/- 0.1321 | **0.9045** +/- 0.1656 | **0.8906** +/- 0.1910 |
| **GCN** | 0.8821 +/- 0.1353 | 0.8639 +/- 0.1666 | 0.8644 +/- 0.1620 |
| **SAGE** | 0.7567 +/- 0.1846 | 0.7311 +/- 0.2160 | 0.7135 +/- 0.2361 |

The cross-validation results demonstrate that the optimal architectures found with `SEED = 118` performed best with GAT, followed by GCN, and then SAGE.

### 1. GAT Performance

The GAT model achieved the highest average accuracy and F1 scores, suggesting its attention mechanism effectively utilized the narrow-but-moderately-deep architecture (16 layers, 4 hidden dim) for classification.

- Avg. Accuracy: 0.9213 (+/- 0.1321)
- Avg. F1 Weighted: 0.9045 (+/- 0.1656)

### 2. GCN Performance

GCN performed very strongly, showing that its highly unusual deep-and-narrow architecture (64 layers, 16 hidden dim) was successful. However, like the other models, it exhibited high standard deviations, indicating its performance was also unstable across the folds.

- Avg. Accuracy: 0.8821 (+/- 0.1353)
- Avg. F1 Weighted: 0.8639 (+/- 0.1666)

### 3. SAGE Performance

SAGE yielded the lowest average performance and the highest variability (largest +/- values). This suggests that the shallow-but-wide architecture (2 layers, 128 hidden dim) was less stable and less effective for this specific classification task compared to the other two models.

- Avg. Accuracy: 0.7567 (+/- 0.1846)
- Avg. F1 Weighted: 0.7311 (+/- 0.2160)


### Summary

While the GAT and GCN models achieved high average performance, all three models suffered from high standard deviations. This indicates that while the architectures are capable of high accuracy, their performance is highly sensitive.

**These results are shown below**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, accuracy_score
from bioneuralnet.downstream_task import DPMON

X_meth_full = dna_meth
X_rna_full = rna
X_mirna_full = mirna
C_full = clinical_preprocessed
Y_full = target
A_full = A_train

patient_indices = np.arange(len(Y_full))
y_target_classes = Y_full.squeeze()

sage_params = {
    'layer_num': 2,
    'gnn_hidden_dim': 128,
    'lr': 0.005197038211089667,
    'weight_decay': 0.04607928341516355,
    'nn_hidden_dim1': 16,
    'nn_hidden_dim2': 32,
    'num_epochs': 4096
}

N_FOLDS = 5
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED) 
output_dir_base_sage = Path("/home/vicente/Github/BioNeuralNet/dpmon_cv_results_SAGE_FINAL")

accuracy_scores_sage = []
f1_macro_scores_sage = []
f1_weighted_scores_sage = []

for fold_num, (train_index, test_index) in enumerate(skf.split(patient_indices, y_target_classes)):
    print(f"\nRunning DPMON SAGE Fold {fold_num + 1}/{N_FOLDS}")

    X_train_omics = [X_meth_full.iloc[train_index], X_rna_full.iloc[train_index], X_mirna_full.iloc[train_index]]
    Y_train = Y_full.iloc[train_index]
    C_train = C_full.iloc[train_index]
    A_train_fold = A_full.iloc[train_index, train_index] 
    
    dpmon_fold = DPMON(
        adjacency_matrix=A_train_fold, omics_list=X_train_omics, phenotype_data=Y_train,
        clinical_data=C_train, repeat_num=1, tune=False, gpu=True, cuda=0,
        output_dir=output_dir_base_sage / f"fold_{fold_num + 1}",
        **sage_params
    )

    predictions_df, _ = dpmon_fold.run() 
    actual = predictions_df["Actual"]
    pred = predictions_df["Predicted"]

    test_f1_macro = f1_score(actual, pred, average='macro')
    test_f1_weighted = f1_score(actual, pred, average='weighted')
    test_acc = accuracy_score(actual, pred)

    accuracy_scores_sage.append(test_acc)
    f1_macro_scores_sage.append(test_f1_macro)
    f1_weighted_scores_sage.append(test_f1_weighted)
    
    print(f"SAGE Fold {fold_num + 1} | Accuracy: {test_acc:.4f} | F1-Macro: {test_f1_macro:.4f}")

mean_acc_sage = np.mean(accuracy_scores_sage)
std_acc_sage = np.std(accuracy_scores_sage)

mean_f1_macro_sage = np.mean(f1_macro_scores_sage)
std_f1_macro_sage = np.std(f1_macro_scores_sage)

mean_f1_weighted_sage = np.mean(f1_weighted_scores_sage)
std_f1_weighted_sage = np.std(f1_weighted_scores_sage)

print("\nClassification with SAGE and 5-FOLD Cross-Validation")
print(f"Avg. Accuracy: {mean_acc_sage:.4f} +/- {std_acc_sage:.4f}")
print(f"Avg. F1 Weighted: {mean_f1_weighted_sage:.4f} +/- {std_f1_weighted_sage:.4f}")
print(f"Avg. F1 Macro: {mean_f1_macro_sage:.4f} +/- {std_f1_macro_sage:.4f}")


2025-11-08 14:48:37,559 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_SAGE_FINAL/fold_1
INFO:bioneuralnet.downstream_task.dpmon:Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_SAGE_FINAL/fold_1
2025-11-08 14:48:37,559 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
INFO:bioneuralnet.downstream_task.dpmon:Initialized DPMON with the provided parameters.
2025-11-08 14:48:37,560 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
INFO:bioneuralnet.downstream_task.dpmon:Starting DPMON run.


2025-11-08 14:48:37,563 - bioneuralnet.downstream_task.dpmon - INFO - Running standard training for DPMON.
INFO:bioneuralnet.downstream_task.dpmon:Running standard training for DPMON.
2025-11-08 14:48:37,564 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
INFO:bioneuralnet.downstream_task.dpmon:Using GPU 0
DEBUG:bioneuralnet.downstream_task.dpmon:Slicing omics dataset based on network nodes.
DEBUG:bioneuralnet.downstream_task.dpmon:Building PyTorch Geometric Data object from adjacency matrix.
2025-11-08 14:48:37,568 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 408
INFO:bioneuralnet.downstream_task.dpmon:Number of nodes in network: 408
DEBUG:bioneuralnet.downstream_task.dpmon:Using clinical vars for node features: ['histological_type_oligodendroglioma', 'histological_type_oligoastrocytoma', 'karnofsky_performance_score', 'radiation_therapy_no', 'radiation_therapy_yes', 'gender_male', 'ethnicity_hispanic or latino']



Running DPMON SAGE Fold 1/5


2025-11-08 14:48:37,844 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.1088
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.9256
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.8021
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.7217
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.6897
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.6741
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.6397
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.6261
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.7526
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.6725
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6405
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 1 | Accuracy: 0.4877 | F1-Macro: 0.3619

Running DPMON SAGE Fold 2/5


2025-11-08 14:48:47,181 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.0817
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.9337
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.8185
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.7151
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.6747
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.6485
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.6163
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.5984
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.5942
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.7387
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6746
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 2 | Accuracy: 0.9927 | F1-Macro: 0.9929

Running DPMON SAGE Fold 3/5


2025-11-08 14:48:56,591 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.1013
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.9286
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.8025
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.7211
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.6890
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.6270
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.6042
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.6004
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.6075
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.7369
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6518
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 3 | Accuracy: 0.8924 | F1-Macro: 0.8957

Running DPMON SAGE Fold 4/5


2025-11-08 14:49:06,007 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.1064
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.9469
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.8426
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.7706
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.7505
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.7652
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.6936
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.6603
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.6351
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.6497
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6740
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 4 | Accuracy: 0.6088 | F1-Macro: 0.5212

Running DPMON SAGE Fold 5/5


2025-11-08 14:49:15,290 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.0972
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.9515
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.8594
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.7877
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.7337
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.6766
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.6315
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.5964
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.5930
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.7565
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6802
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 5 | Accuracy: 0.8020 | F1-Macro: 0.7958

Classification with SAGE and 5-FOLD Cross-Validation
Avg. Accuracy: 0.7567 +/- 0.1846
Avg. F1 Weighted: 0.7311 +/- 0.2160
Avg. F1 Macro: 0.7135 +/- 0.2361


In [None]:
output_dir_base_gcn = Path("/home/vicente/Github/BioNeuralNet/dpmon_cv_results_GCN_FINAL")

gcn_params = {
    'layer_num': 64,
    'gnn_hidden_dim': 16,
    'lr': 0.004397636847528492,
    'weight_decay': 0.00040569459559797747,
    'nn_hidden_dim1': 4,
    'nn_hidden_dim2': 8,
    'num_epochs': 1024
}

skf_gcn = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED) 

accuracy_scores_gcn = []
f1_macro_scores_gcn = []
f1_weighted_scores_gcn = []

for fold_num, (train_index, test_index) in enumerate(skf_gcn.split(patient_indices, y_target_classes)):
    print(f"\nDPMON GCN Fold {fold_num + 1}/{N_FOLDS}")

    X_train_omics = [X_meth_full.iloc[train_index], X_rna_full.iloc[train_index], X_mirna_full.iloc[train_index]]
    Y_train = Y_full.iloc[train_index]
    C_train = C_full.iloc[train_index]
    A_train_fold = A_full.iloc[train_index, train_index] 
    
    dpmon_fold = DPMON(
        adjacency_matrix=A_train_fold, omics_list=X_train_omics, phenotype_data=Y_train,
        clinical_data=C_train, repeat_num=1, tune=False, gpu=True, cuda=0,
        output_dir=output_dir_base_gcn / f"fold_{fold_num + 1}",
        **gcn_params
    )

    predictions_df, _ = dpmon_fold.run() 
    actual = predictions_df["Actual"]
    pred = predictions_df["Predicted"]

    test_acc = accuracy_score(actual, pred)
    test_f1 = f1_score(actual, pred, average='macro')
    test_f1w = f1_score(actual, pred, average='weighted')

    accuracy_scores_gcn.append(test_acc)
    f1_macro_scores_gcn.append(test_f1)
    f1_weighted_scores_gcn.append(test_f1w)
    
    print(f"GCN Fold {fold_num + 1} | Accuracy: {test_acc:.4f} | F1-Macro: {test_f1:.4f}")

mean_acc_gcn = np.mean(accuracy_scores_gcn)
std_acc_gcn = np.std(accuracy_scores_gcn)

mean_f1_macro_gcn = np.mean(f1_macro_scores_gcn)
std_f1_macro_gcn = np.std(f1_macro_scores_gcn)

mean_f1_weighted_gcn = np.mean(f1_weighted_scores_gcn)
std_f1_weighted_gcn = np.std(f1_weighted_scores_gcn)

print("Classification with GCN and 5-FOLD Cross-Validation")
print(f"Avg. Accuracy: {mean_acc_gcn:.4f} +/- {std_acc_gcn:.4f}")
print(f"Avg. F1 Weighted: {mean_f1_weighted_gcn:.4f} +/- {std_f1_weighted_gcn:.4f}")
print(f"Avg. F1 Macro: {mean_f1_macro_gcn:.4f} +/- {std_f1_macro_gcn:.4f}")

2025-11-08 14:54:20,426 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GCN_FINAL/fold_1
INFO:bioneuralnet.downstream_task.dpmon:Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GCN_FINAL/fold_1
2025-11-08 14:54:20,427 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
INFO:bioneuralnet.downstream_task.dpmon:Initialized DPMON with the provided parameters.
2025-11-08 14:54:20,428 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
INFO:bioneuralnet.downstream_task.dpmon:Starting DPMON run.
2025-11-08 14:54:20,432 - bioneuralnet.downstream_task.dpmon - INFO - Running standard training for DPMON.
INFO:bioneuralnet.downstream_task.dpmon:Running standard training for DPMON.
2025-11-08 14:54:20,432 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
INFO:bioneuralnet.downstream_task.dpmon:Using GPU 0
DEBUG:bioneuralnet.downstream_


DPMON GCN Fold 1/5


2025-11-08 14:54:20,706 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/1024], Loss: 1.0924
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/1024], Loss: 1.0665
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/1024], Loss: 1.0243
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/1024], Loss: 0.9860
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/1024], Loss: 0.9358
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/1024], Loss: 0.8739
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/1024], Loss: 0.8153
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/1024], Loss: 0.7720
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/1024], Loss: 0.7396
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/1024], Loss: 0.7083
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/1024], Loss: 0.6841
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/1024], Loss

GCN Fold 1 | Accuracy: 0.8260 | F1-Macro: 0.8104

DPMON GCN Fold 2/5


2025-11-08 14:54:53,425 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/1024], Loss: 1.1103
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/1024], Loss: 1.0513
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/1024], Loss: 0.9953
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/1024], Loss: 0.9295
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/1024], Loss: 0.8594
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/1024], Loss: 0.8122
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/1024], Loss: 0.7760
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/1024], Loss: 0.7423
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/1024], Loss: 0.7196
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/1024], Loss: 0.6947
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/1024], Loss: 0.6594
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/1024], Loss

GCN Fold 2 | Accuracy: 0.9633 | F1-Macro: 0.9606

DPMON GCN Fold 3/5


2025-11-08 14:55:25,969 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/1024], Loss: 1.1165
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/1024], Loss: 1.0359
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/1024], Loss: 0.9772
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/1024], Loss: 0.9249
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/1024], Loss: 0.8825
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/1024], Loss: 0.8390
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/1024], Loss: 0.8053
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/1024], Loss: 0.7805
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/1024], Loss: 0.7575
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/1024], Loss: 0.7333
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/1024], Loss: 0.6981
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/1024], Loss

GCN Fold 3 | Accuracy: 0.9927 | F1-Macro: 0.9929

DPMON GCN Fold 4/5


2025-11-08 14:55:58,723 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/1024], Loss: 1.1040
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/1024], Loss: 1.0583
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/1024], Loss: 1.0211
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/1024], Loss: 0.9599
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/1024], Loss: 0.9015
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/1024], Loss: 0.8252
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/1024], Loss: 0.7679
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/1024], Loss: 0.7182
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/1024], Loss: 0.6866
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/1024], Loss: 0.6494
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/1024], Loss: 0.6294
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/1024], Loss

GCN Fold 4 | Accuracy: 0.9878 | F1-Macro: 0.9886

DPMON GCN Fold 5/5


2025-11-08 14:56:31,765 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/1024], Loss: 1.1184
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/1024], Loss: 1.0624
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/1024], Loss: 1.0131
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/1024], Loss: 0.9584
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/1024], Loss: 0.8998
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/1024], Loss: 0.8351
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/1024], Loss: 0.7644
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/1024], Loss: 0.6976
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/1024], Loss: 0.6529
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/1024], Loss: 0.6262
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/1024], Loss: 0.6071
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/1024], Loss

GCN Fold 5 | Accuracy: 0.6406 | F1-Macro: 0.5693
Classification with GCN and 5-FOLD Cross-Validation
Avg. Accuracy: 0.8821 +/- 0.1353
Avg. F1 Weighted: 0.8639 +/- 0.1666
Avg. F1 Macro: 0.8644 +/- 0.1620


In [None]:
output_dir_base_gat = Path("/home/vicente/Github/BioNeuralNet/dpmon_cv_results_GAT_FINAL")

gat_params = {
    'layer_num': 16,
    'gnn_hidden_dim': 4,
    'lr': 0.003226646486974539,
    'weight_decay': 0.04686394640851084,
    'nn_hidden_dim1': 16,
    'nn_hidden_dim2': 128,
    'num_epochs': 512
}

skf_gat = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED) 

accuracy_scores_gat = []
f1_macro_scores_gat = []
f1_weighted_scores_gat = []

for fold_num, (train_index, test_index) in enumerate(skf_gat.split(patient_indices, y_target_classes)):
    print(f"\nDPMON GAT Fold {fold_num + 1}/{N_FOLDS}")

    X_train_omics = [X_meth_full.iloc[train_index], X_rna_full.iloc[train_index], X_mirna_full.iloc[train_index]]
    Y_train = Y_full.iloc[train_index]
    C_train = C_full.iloc[train_index]
    A_train_fold = A_full.iloc[train_index, train_index] 
    
    dpmon_fold = DPMON(
        adjacency_matrix=A_train_fold, omics_list=X_train_omics, phenotype_data=Y_train,
        clinical_data=C_train, repeat_num=1, tune=False, gpu=True, cuda=0,
        output_dir=output_dir_base_gat / f"fold_{fold_num + 1}",
        **gat_params
    )

    predictions_df, _ = dpmon_fold.run() 
    actual = predictions_df["Actual"]
    pred = predictions_df["Predicted"]

    test_acc = accuracy_score(actual, pred)
    test_f1 = f1_score(actual, pred, average='macro')
    test_f1w = f1_score(actual, pred, average='weighted')

    accuracy_scores_gat.append(test_acc)
    f1_macro_scores_gat.append(test_f1)
    f1_weighted_scores_gat.append(test_f1w)
    
    print(f"GAT Fold {fold_num + 1} | Accuracy: {test_acc:.4f} | F1-Macro: {test_f1:.4f}")

mean_acc_gat = np.mean(accuracy_scores_gat)
std_acc_gat = np.std(accuracy_scores_gat)

mean_f1_macro_gat = np.mean(f1_macro_scores_gat)
std_f1_macro_gat = np.std(f1_macro_scores_gat)

mean_f1_weighted_gat = np.mean(f1_weighted_scores_gat)
std_f1_weighted_gat = np.std(f1_weighted_scores_gat)

print("Classification with GAT and 5-FOLD Cross-Validation")
print(f"Avg. Accuracy: {mean_acc_gat:.4f} +/- {std_acc_gat:.4f}")
print(f"Avg. F1 Weighted: {mean_f1_weighted_gat:.4f} +/- {std_f1_weighted_gat:.4f}")
print(f"Avg. F1 Macro: {mean_f1_macro_gat:.4f} +/- {std_f1_macro_gat:.4f}")

2025-11-08 14:59:36,165 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GAT_FINAL/fold_1
INFO:bioneuralnet.downstream_task.dpmon:Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GAT_FINAL/fold_1
2025-11-08 14:59:36,166 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
INFO:bioneuralnet.downstream_task.dpmon:Initialized DPMON with the provided parameters.
2025-11-08 14:59:36,167 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
INFO:bioneuralnet.downstream_task.dpmon:Starting DPMON run.
2025-11-08 14:59:36,172 - bioneuralnet.downstream_task.dpmon - INFO - Running standard training for DPMON.
INFO:bioneuralnet.downstream_task.dpmon:Running standard training for DPMON.
2025-11-08 14:59:36,172 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
INFO:bioneuralnet.downstream_task.dpmon:Using GPU 0
DEBUG:bioneuralnet.downstream_


DPMON GAT Fold 1/5


2025-11-08 14:59:36,451 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.0867
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.8907
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.7761
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6571
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.6022
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5800
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5704
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5694
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.6826
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.6143
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5748
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5676
DEB

GAT Fold 1 | Accuracy: 0.9951 | F1-Macro: 0.9951

DPMON GAT Fold 2/5


2025-11-08 14:59:41,556 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1090
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.9313
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.8095
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6731
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.6409
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.6002
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5774
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5719
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5712
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.7861
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.6458
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5931
DEB

GAT Fold 2 | Accuracy: 0.9804 | F1-Macro: 0.9786

DPMON GAT Fold 3/5


2025-11-08 14:59:46,639 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.0975
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.8928
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.7488
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6503
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.6088
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5910
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5808
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5784
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.7435
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.6711
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.6085
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5877
DEB

GAT Fold 3 | Accuracy: 1.0000 | F1-Macro: 1.0000

DPMON GAT Fold 4/5


2025-11-08 14:59:51,714 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1198
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.9304
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.8175
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6965
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.6062
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5719
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5650
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5648
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.7880
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.7030
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.6407
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5923
DEB

GAT Fold 4 | Accuracy: 0.6577 | F1-Macro: 0.5092

DPMON GAT Fold 5/5


2025-11-08 14:59:56,870 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1063
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.9135
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.8270
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.7274
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.6529
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5984
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5754
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5671
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.6975
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.6613
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.6108
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5794
DEB

GAT Fold 5 | Accuracy: 0.9731 | F1-Macro: 0.9704
Classification with GAT and 5-FOLD Cross-Validation
Avg. Accuracy: 0.9213 +/- 0.1321
Avg. F1 Weighted: 0.9045 +/- 0.1656
Avg. F1 Macro: 0.8906 +/- 0.1910
