# TCGA-KIPAN Analysis Demo

- **Cohort**: Focuses on the TCGA-KIPAN dataset, a vital resource merging three major kidney cancer subtypes:
    - `Kidney Renal Clear Cell Carcinoma` (KIRC)
    - `Kidney Renal Papillary Cell Carcinoma` (KIRP)
    - `Kidney Chromophobe `(KICH)

- **Goal**: Perform histological subtype classification.
- **Prediction Target**: Predict the specific kidney cancer subtype (`KIRC`, `KIRP`, or `KICH`) from its multi-omics profile.

**Data Source:** Broad Institute FireHose (`http://firebrowse.org/?cohort=KIPAN`)

In [None]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/KIPAN")

mirna_raw = pd.read_csv(root/"KIPAN.miRseq_RPKM_log2.txt", sep="\t",index_col=0,low_memory=False)                            
rna_raw = pd.read_csv(root / "KIPAN.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t",index_col=0,low_memory=False)
meth_raw = pd.read_csv(root/"KIPAN.meth.by_mean.data.txt", sep='\t',index_col=0,low_memory=False)
clinical_raw = pd.read_csv(root / "KIPAN.clin.merged.picked.txt",sep="\t", index_col=0, low_memory=False)

# display all shapes and first few rows of each dataset
display(mirna_raw.iloc[:3,:5])
display(mirna_raw.shape)

display(rna_raw.iloc[:3,:5])
display(meth_raw.shape)

display(meth_raw.iloc[:3,:5])
display(meth_raw.shape)

display(clinical_raw.iloc[:3,:5])
display(clinical_raw.shape)

Unnamed: 0_level_0,TCGA-KL-8323-01,TCGA-KL-8324-11,TCGA-KL-8324-01,TCGA-KL-8325-01,TCGA-KL-8326-11
gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
hsa-let-7a-1,13.350564,12.66929,12.138842,12.186424,12.251624
hsa-let-7a-2,14.345422,13.671356,13.139199,13.182016,13.244735
hsa-let-7a-3,13.354717,12.696054,12.157156,12.17879,12.260827


(472, 1005)

Unnamed: 0_level_0,TCGA-KL-8323-01,TCGA-KL-8324-11,TCGA-KL-8324-01,TCGA-KL-8325-01,TCGA-KL-8326-11
gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
?|100133144,3.145221,2.847776,3.356102,2.591009,2.097307
?|100134869,2.462733,2.555203,4.644127,3.272561,3.044499
?|10357,6.624215,6.822777,5.91635,6.813293,6.846008


(20117, 867)

Unnamed: 0_level_0,TCGA-KL-8323-01,TCGA-KL-8324-01,TCGA-KL-8325-01,TCGA-KL-8326-01,TCGA-KL-8327-01
Hybridization REF,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Composite Element REF,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value
A1BG,0.586272204702,0.258615822462,0.346998160003,0.303477029822,0.31424190458
A1CF,0.612681857814,0.585391864334,0.469213665938,0.572474360793,0.595785276504


(20117, 867)

Unnamed: 0_level_0,tcga-kl-8328,tcga-kl-8339,tcga-km-8439,tcga-km-8441,tcga-km-8442
Hybridization REF,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Composite Element REF,value,value,value,value,value
years_to_birth,60,67,30,61,38
vital_status,0,1,0,0,0


(20, 941)

## Data Processing Summary

1. **Transpose Data:** All raw data (miRNA, RNA, etc.) is flipped so rows represent patients and columns represent features.
2. **Standardize Patient IDs:** Patient IDs in all tables are cleaned to the 12-character TCGA format (e.g., `TCGA-AB-1234`) for matching.
3. **Handle Duplicates:** Duplicate patient rows are averaged in the omics data. The first entry is kept for duplicate patients in the clinical data.
4. **Find Common Patients:** The script identifies the list of patients that exist in *all* datasets.
5. **Subset Data:** All data tables are filtered down to *only* this common list of patients, ensuring alignment.
6. **Extract Target:** The `histological_type` column is pulled from the processed clinical data to be used as the prediction target (y-variable).

In [None]:
mirna = mirna_raw.T
rna = rna_raw.T
meth = meth_raw.T
clinical = clinical_raw.T

print(f"miRNA (samples, features): {mirna.shape}")
print(f"RNA (samples, features): {rna.shape}")
print(f"Methylation (samples, features): {meth.shape}")
print(f"Clinical (samples, features): {clinical.shape}")

def trim_barcode(idx):
    return idx.to_series().str.slice(0, 12)

# standarized patient IDs across all files
meth.index = trim_barcode(meth.index)
rna.index = trim_barcode(rna.index)
mirna.index = trim_barcode(mirna.index)
clinical.index = clinical.index.str.upper()
clinical.index.name = "Patient_ID"

# convert all data to numeric, coercing errors to NaN
meth = meth.apply(pd.to_numeric, errors='coerce')
rna = rna.apply(pd.to_numeric, errors='coerce')
mirna = mirna.apply(pd.to_numeric, errors='coerce')

# for any duplicate columns in the omics data, we average their values
meth = meth.groupby(meth.index).mean()
rna = rna.groupby(rna.index).mean()
mirna = mirna.groupby(mirna.index).mean()

# for any duplicate rows in the clinical data, we keep the first occurrence
clinical = clinical[~clinical.index.duplicated(keep='first')]

print(f"\nMethylation shape: {meth.shape}")
print(f"RNA shape: {rna.shape}")
print(f"miRNA shape: {mirna.shape}")
print(f"Clinical shape: {clinical.shape}")

for df in [meth, rna, mirna]:
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")
    
    df.fillna(df.mean(), inplace=True)

# to see which pateints are common across all data files
common_patients = sorted(list(set(meth.index)&set(rna.index)&set(mirna.index)&set(clinical.index)))

print(f"\nFound: {len(common_patients)} patients across all data types.")

# subset to only common patients
meth_processed = meth.loc[common_patients]
rna_processed= rna.loc[common_patients]
mirna_processed = mirna.loc[common_patients]
clinical_processed = clinical.loc[common_patients]

# extract target labels from clinical data
targets = clinical_processed['histological_type']

miRNA (samples, features): (1005, 472)
RNA (samples, features): (1020, 18272)
Methylation (samples, features): (867, 20117)
Clinical (samples, features): (941, 20)

Methylation shape: (660, 20117)
RNA shape: (889, 18272)
miRNA shape: (873, 472)
Clinical shape: (941, 20)

Found: 658 patients across all data types.


In [3]:
display(mirna_processed.iloc[:3,:5])
display(mirna_processed.shape)

display(rna_processed.iloc[:3,:5])
display(rna_processed.shape)

display(meth_processed.iloc[:3,:5])
display(meth_processed.shape)

display(clinical_processed.iloc[:3,:5])
display(clinical_processed.shape)

display(targets.value_counts())

gene,hsa_let_7a_1,hsa_let_7a_2,hsa_let_7a_3,hsa_let_7b,hsa_let_7c
TCGA-2K-A9WE,12.933499,13.933025,12.938528,12.861969,11.474055
TCGA-2Z-A9J1,12.535658,13.536437,12.531655,12.710724,10.355773
TCGA-2Z-A9J2,11.832278,12.838388,11.840725,11.038718,8.36021


(658, 472)

gene,unknown_100133144,unknown_100134869,unknown_10357,unknown_10431,unknown_155060
TCGA-2K-A9WE,2.336112,2.520498,5.772965,9.610685,8.198804
TCGA-2Z-A9J1,3.006962,3.558929,6.177374,10.177077,7.656137
TCGA-2Z-A9J2,1.516973,1.736691,4.853207,10.345265,6.263288


(658, 18272)

Hybridization REF,Composite Element REF,A1BG,A1CF,A2BP1,A2LD1
TCGA-2K-A9WE,,0.498835,0.814418,0.536619,0.77175
TCGA-2Z-A9J1,,0.400956,0.554575,0.51705,0.50513
TCGA-2Z-A9J2,,0.438116,0.656936,0.535795,0.678655


(658, 20117)

Hybridization REF,Composite Element REF,years_to_birth,vital_status,days_to_death,days_to_last_followup
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TCGA-2K-A9WE,value,53,0,,214
TCGA-2Z-A9J1,value,71,0,,2298
TCGA-2Z-A9J2,value,71,0,,1795


(658, 20)

histological_type
kidney clear cell renal carcinoma        318
kidney papillary renal cell carcinoma    274
kidney chromophobe                        66
Name: count, dtype: int64

In [4]:
import bioneuralnet as bnn

# drop unwanted columns from clinical data
clinical_processed.drop(columns=["Composite Element REF"], errors="ignore", inplace=True)

# we transform the methylation beta values to M-values and drop unwanted columns
meth_m = meth_processed.drop(columns=["Composite Element REF"], errors="ignore")

# convert beta values to M-values using bioneuralnet utility with small epsilon to avoid log(0)
meth_m = bnn.utils.beta_to_m(meth_m, eps=1e-6) 

# lastly we turn the target labels into numerical classes
mapping = {"kidney clear cell renal carcinoma": 0, "kidney papillary renal cell carcinoma": 1, "kidney chromophobe": 2}
target_labels = targets.map(mapping).to_frame(name="target")

# as a safety check we align the indices once more
X_meth = meth_m.loc[common_patients]
X_rna = rna_processed.loc[common_patients]
X_mirna = mirna_processed.loc[common_patients]
Y_labels = target_labels.loc[common_patients]
clinical_final = clinical_processed.loc[common_patients]

print(f"\nDNA_Methylation shape: {X_meth.shape}")
print(f"RNA shape: {X_rna.shape}")
print(f"miRNA shape: {X_mirna.shape}")
print(f"Clinical shape: {clinical_final.shape}")
print(Y_labels.value_counts())

2025-11-08 15:12:24,843 - bioneuralnet.utils.data - INFO - Starting Beta-to-M value conversion (shape: (658, 20116)). Epsilon: 1e-06
2025-11-08 15:12:25,944 - bioneuralnet.utils.data - INFO - Beta-to-M conversion complete.



DNA_Methylation shape: (658, 20116)
RNA shape: (658, 18272)
miRNA shape: (658, 472)
Clinical shape: (658, 19)
target
0         318
1         274
2          66
Name: count, dtype: int64


## Feature Selection Methodology

### Supported Methods and Interpretation

**BioNeuralNet** provides three techniques for feature selection, allowing for different views of the data's statistical profile:

- **Variance Thresholding:** Identifies features with the **highest overall variance** across all samples.

- **ANOVA F-test:** Pinpoints features that best **distinguish between the target classes** (KIRC, KIRP, and KICH).

- **Random Forest Importance:** Assesses **feature utility** based on its contribution to a predictive non-linear model.

### KIPAN Cohort Selection Strategy

A dimensionality reduction step was essential for managing the high-feature-count omics data:

- **High-Feature Datasets:** Both DNA Methylation (20,116) and RNA (18,272) required significant feature reduction.

- **Filtering Process:** The **top 6,000 features** were initially extracted from the Methylation and RNA datasets using all three methods.

- **Final Set:** A consensus set was built by finding the intersection of features selected by the ANOVA F-test and Random Forest Importance, ensuring both statistical relevance and model-based utility.

- **Low-Feature Datasets:** The miRNA data (472 features) was passed through **without selection**, as its feature count was already manageable.

In [5]:
import bioneuralnet as bnn

# feature selection
meth_highvar = bnn.utils.select_top_k_variance(X_meth, k=6000)
rna_highvar = bnn.utils.select_top_k_variance(X_rna, k=6000)

meth_af = bnn.utils.top_anova_f_features(X_meth, Y_labels, max_features=6000)
rna_af = bnn.utils.top_anova_f_features(X_rna, Y_labels, max_features=6000)

meth_rf = bnn.utils.select_top_randomforest(X_meth, Y_labels, top_k=6000)
rna_rf = bnn.utils.select_top_randomforest(X_rna, Y_labels, top_k=6000)

meth_var_set = set(meth_highvar.columns)
meth_anova_set = set(meth_af.columns)
meth_rf_set = set(meth_rf.columns)

rna_var_set = set(rna_highvar.columns)
rna_anova_set = set(rna_af.columns)
rna_rf_set = set(rna_rf.columns)

meth_inter1 = list(meth_anova_set & meth_var_set)
meth_inter2 = list(meth_rf_set & meth_var_set)
meth_inter3 = list(meth_anova_set & meth_rf_set)
meth_all_three = list(meth_anova_set & meth_var_set & meth_rf_set)

rna_inter4 = list(rna_anova_set & rna_var_set)
rna_inter5 = list(rna_rf_set & rna_var_set)
rna_inter6 = list(rna_anova_set & rna_rf_set)
rna_all_three = list(rna_anova_set & rna_var_set & rna_rf_set)

2025-11-08 15:12:53,133 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-08 15:12:53,133 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-11-08 15:12:53,134 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-08 15:12:53,224 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-11-08 15:12:55,954 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-08 15:12:55,955 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-11-08 15:12:55,955 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-08 15:12:56,036 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-11-08 15:12:58,995 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-08 15:12:58,995 - bioneuralnet.

In [6]:
print("FROM THE 6000 Methylation feature selection:\n")
print(f"Anova-F & variance selection share: {len(meth_inter1)} features")
print(f"Random Forest & variance selection share: {len(meth_inter2)} features")
print(f"Anova-F & Random Forest share: {len(meth_inter3)} features")
print(f"All three methods agree on: {len(meth_all_three)} features")

FROM THE 6000 Methylation feature selection:

Anova-F & variance selection share: 1875 features
Random Forest & variance selection share: 1656 features
Anova-F & Random Forest share: 2102 features
All three methods agree on: 666 features


In [7]:
print("\nFROM THE 6000 RNA feature selection:\n")
print(f"Anova-F & variance selection share: {len(rna_inter4)} features")
print(f"Random Forest & variance selection share: {len(rna_inter5)} features")
print(f"Anova-F & Random Forest share: {len(rna_inter6)} features")
print(f"All three methods agree on: {len(rna_all_three)} features")


FROM THE 6000 RNA feature selection:

Anova-F & variance selection share: 2271 features
Random Forest & variance selection share: 2141 features
Anova-F & Random Forest share: 2284 features
All three methods agree on: 943 features


## Feature Selection Summary: ANOVA-RF Intersection

The final set of features was determined by the **intersection** of those highlighted by the **ANOVA F-test** and **Random Forest Importance**. This methodology provides a balanced filter, capturing features with both high class-separability (ANOVA) and significant predictive value in a non-linear model (Random Forest). The resulting feature pool is considered highly relevant for the subsequent modeling tasks.

### Feature Overlap Results

The table below quantifies the shared features identified by the different selection techniques for each omics type.

| Omics Data Type | ANOVA-F & Variance | RF & Variance | ANOVA-F & Random Forest (Selected) | All Three Agree |
| :--- | :--- | :--- | :--- | :--- |
| **Methylation** | 1,875 features | 1,656 features | **2,102 features** | 666 features |
| **RNA** | 2,271 features | 2,141 features | **2,284 features** | 943 features |

In [8]:
X_meth_selected = X_meth[meth_inter3]
X_rna_selected = X_rna[rna_inter6]

print("\nFinal Shapes for Modeling")
print(f"Methylation (X1): {X_meth_selected.shape}")
print(f"RNA-Seq (X2): {X_rna_selected.shape}")
print(f"miRNA-Seq (X3): {X_mirna.shape}")
print(f"Labels (Y): {Y_labels.shape}")


Final Shapes for Modeling
Methylation (X1): (658, 2102)
RNA-Seq (X2): (658, 2284)
miRNA-Seq (X3): (658, 472)
Labels (Y): (658, 1)


## Data Availability

To facilitate rapid experimentation and reproduction of our results, the fully processed and feature-selected dataset used in this analysis has been made available directly within the package.

Users can load this dataset, bypassing all preceding data acquisition, preprocessing, and feature selection steps. This allows users to proceed immediately from this step.

In [None]:
import bioneuralnet as bnn

tgca_kipan = bnn.datasets.DatasetLoader("kipan")
display(tgca_kipan.shape)

# The dataset is returned as a dictionary. We extract each file independetly based on the name (Key).
dna_meth = tgca_kipan.data["meth"]
rna = tgca_kipan.data["rna"]
mirna = tgca_kipan.data["mirna"]
clinical = tgca_kipan.data["clinical"]
target = tgca_kipan.data["target"]

{'mirna': (658, 472),
 'target': (658, 1),
 'clinical': (658, 19),
 'rna': (658, 2284),
 'meth': (658, 2102)}

In [None]:
# BioNeuralNet provides a preprocessing function to handle clinical data
clinical = tgca_kipan.data["clinical"]

# for more details on the preprocessing functions, see `bioneuralnet.utils.preprocess``
clinical_preprocessed = bnn.utils.preprocess_clinical(
    clinical, 
    target, 
    top_k=7, 
    scale=False, 
    ignore_columns=[ "days_to_last_followup",  "years_to_birth", "days_to_death", "date_of_initial_pathologic_diagnosis"])

display(clinical_preprocessed.iloc[:3,:5])

2025-11-08 15:17:58,736 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-08 15:17:58,737 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 2306 NaNs after median imputation
2025-11-08 15:17:58,737 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-08 15:17:58,826 - bioneuralnet.utils.preprocess - INFO - Selected top 7 features by RandomForest importance


Unnamed: 0_level_0,histological_type_kidney clear cell renal carcinoma,histological_type_kidney papillary renal cell carcinoma,radiation_therapy_no,pathology_M_stage_mx,ethnicity_not hispanic or latino
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TCGA-2K-A9WE,False,True,True,False,True
TCGA-2Z-A9J1,False,True,True,True,True
TCGA-2Z-A9J2,False,True,True,True,True


In [10]:
import pandas as pd

X_train_full = pd.concat([dna_meth, rna, mirna], axis=1)

print(f"Nan values in X_train_full: {X_train_full.isna().sum().sum()}")
X_train_full = X_train_full.dropna()
print(f"Nan value in X_train_full after dropping: {X_train_full.isna().sum().sum()}")

print(f"X_train_full shape: {X_train_full.shape}")
# building the graph using the similarity graph function with k=15
A_train = bnn.utils.gen_similarity_graph(X_train_full, k=15)

print(f"\nNetwork shape: {A_train.shape}")

Nan values in X_train_full: 0
Nan value in X_train_full after dropping: 0
X_train_full shape: (658, 4858)

Network shape: (4858, 4858)


## Reproducibility and Seeding

To ensure our experimental results are fully reproducible, a single global seed is set at the beginning of the analysis.

This utility function propagates the seed to all sources of randomness, including `random`, `numpy`, and `torch` (for both CPU and GPU).

Critically, it also configures the PyTorch cuDNN backend to use deterministic algorithms.

In [1]:
import bioneuralnet as bnn

SEED = 118
bnn.utils.set_seed(SEED)

2025-11-08 16:34:44,524 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 118
2025-11-08 16:34:44,525 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-08 16:34:44,525 - bioneuralnet.utils.data - INFO - Seed setting complete


## Hyperparameter Tuning

- First step on our benchmark analysis is dedicated to **hyperparameter tuning** for:
    - **SAGE:** GraphSAGE (Graph Sample and Aggregate)
    - **GCN:** Graph Convolutional Network
    - **GAT:** Graph Attention Network
    
- We use `tune=True` and `repeat_num=5` within **DPMON** (our classification framework) to let the framework automatically search for the best architectural and training parameters.

- After the tuning phase, we will manually set the best-performing parameters for each model.
- We then use these optimal configurations in the final cross-validation to rigorously test each model performance and stability.

In [None]:
from pathlib import Path

# We rename the target to phenotype so we keep it consistent with DPMON expectations.
target = target.rename(columns={"target": "phenotype"})

dpmon_sage_tune = bnn.downstream_task.DPMON(
    adjacency_matrix=A_train,
    omics_list=[dna_meth, rna, mirna],
    phenotype_data=target,
    clinical_data=clinical_preprocessed,
    model="SAGE",
    tune=True,
    repeat_num=5,
    gpu=True, cuda=0,
    output_dir=Path("/home/vicente/Github/BioNeuralNet/dpmon_tuning/KIPAN_SAGE"),
)

dpmon_sage_tune.run()

dpmon_gcn_tune = bnn.downstream_task.DPMON(
    adjacency_matrix=A_train,
    omics_list=[dna_meth, rna, mirna],
    phenotype_data=target,
    clinical_data=clinical_preprocessed,
    model="GCN",
    tune=True,
    repeat_num=5,
    gpu=True, cuda=0,
    output_dir=Path("/home/vicente/Github/BioNeuralNet/dpmon_tuning/KIPAN_GCN"),
)

dpmon_gcn_tune.run()


dpmon_gat_tune = bnn.downstream_task.DPMON(
    adjacency_matrix=A_train,
    omics_list=[dna_meth, rna, mirna],
    phenotype_data=target,
    clinical_data=clinical_preprocessed,
    model="GAT",
    tune=True,
    repeat_num=5,
    gpu=True, cuda=0,
    output_dir=Path("/home/vicente/Github/BioNeuralNet/dpmon_tuning/KIPAN_GAT"),
)

dpmon_gcn_tune.run()

2025-11-08 15:19:55,640 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_tuning/KIPAN_SAGE
INFO:bioneuralnet.downstream_task.dpmon:Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_tuning/KIPAN_SAGE
2025-11-08 15:19:55,642 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
INFO:bioneuralnet.downstream_task.dpmon:Initialized DPMON with the provided parameters.
2025-11-08 15:19:55,644 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
INFO:bioneuralnet.downstream_task.dpmon:Starting DPMON run.
2025-11-08 15:19:55,684 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
INFO:bioneuralnet.downstream_task.dpmon:Running hyperparameter tuning for DPMON.
2025-11-08 15:19:55,684 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
INFO:bioneuralnet.downstream_task.dpmon:Using GPU 0
DEBUG:bioneuralnet.downstream_task.dpmon:S

(     Actual  Predicted
 0         1          1
 1         1          1
 2         1          1
 3         1          1
 4         1          1
 ..      ...        ...
 653       1          1
 654       1          1
 655       1          1
 656       1          1
 657       1          1
 
 [658 rows x 2 columns],
 0.8936170212765958)

## Analysis of Hyperparameter Optimization

**Hyperparameter tuning using `SEED = 118`**

### Tuning Search Space

**DPMON** parameter search space is extensive, offering over 12,000 discrete combinations plus a continuous space for learning rates. By sampling **50 trials** from this space, we are effectively probing it to find the distinct, optimal architecture best suited for each GNN (SAGE, GCN, and GAT).


| Parameter | SAGE (GraphSAGE) | GCN (Graph Convolutional) | GAT (Graph Attention) |
| :--- | :--- | :--- | :--- |
| **gnn\_layer\_num** | 2 | 2 | 16 |
| **gnn\_hidden\_dim** | 128 | 16 | 4 |
| **lr (Learning Rate)** | 0.005197 | 0.005345 | 0.003226 |
| **weight\_decay** | 0.046079 | 0.003940 | 0.046863 |
| **nn\_hidden\_dim1** | 16 | 64 | 16 |
| **nn\_hidden\_dim2** | 32 | 16 | 128 |
| **num\_epochs** | 4096 | 512 | 512 |


### SAGE Model Configuration
The optimal SAGE configuration was **shallow (2 layers) and wide (128 hidden dim)**. This confirms a preference for a high-dimensional feature representation over recursive depth. It also required the longest training time at 4,096 epochs.

### GCN Model Configuration
In a major shift, the GCN model favored a classic, shallow architecture: **2 layers and 16 hidden dim**. This highly efficient configuration converged in just 512 epochs and achieved a perfect (1.0) tuning accuracy, indicating it found a strong, generalizable pattern quickly.

### GAT Model Configuration
The GAT configuration favored a **deep (16 layers) and narrow (4 hidden dim)** structure. It was also highly efficient, converging in 512 epoch*, suggesting GATs attention mechanism works well even when stacked deeply with low-dimensional features.

---

## Cross-Validation Results with 5 folds

| Model | Avg. Accuracy | Avg. F1 Weighted | Avg. F1 Macro |
| :--- | :--- | :--- | :--- |
| **GCN** | **0.9525** +/- 0.0267 | **0.9563** +/- 0.0241 | 0.9232 +/- 0.0410 |
| **GAT** | 0.9506 +/- 0.0759 | 0.9528 +/- 0.0718 | **0.9349** +/- 0.0954 |
| **SAGE** | 0.9206 +/- 0.0581 | 0.9119 +/- 0.0695 | 0.8695 +/- 0.1418 |

The cross-validation results for the KIPAN dataset show a new performance ranking. GCN emerged as the top model, followed closely by GAT, with SAGE also performing strongly.

### 1. GAT Performance
The GAT model was a top contender, achieving an average accuracy of 0.9506. Its deep-and-narrow architecture (16 layers, 4 hidden dim) proved highly effective on this dataset. Its stability was significantly improved, though its variance was the highest of the three.

- Avg. Accuracy: 0.9506 (+/- 0.0759)
- Avg. F1 Weighted: 0.9528 (+/- 0.0718)

### 2. GCN Performance
GCN was the **top-performing and most stable model**. Its shallow-and-narrow architecture (2 layers, 16 hidden dim) translated perfectly from its 1.0 tuning accuracy. Most importantly, it had an **extremely low standard deviation** (+/- 0.0267), indicating its performance was highly reliable across all folds.

- Avg. Accuracy: 0.9525 (+/- 0.0267)
- Avg. F1 Weighted: 0.9563 (+/- 0.0241)

### 3. SAGE Performance
SAGE delivered a strong and reliable performance, with its average accuracy exceeding 0.92. Its shallow-and-wide architecture (2 layers, 128 hidden dim) was effective and showed good stability with a low standard deviation (+/- 0.0581), second only to GCN.

- Avg. Accuracy: 0.9206 (+/- 0.0581)
- Avg. F1 Weighted: 0.9119 (+/- 0.0695)

### Summary
In summary, all three models performed exceptionally well on the KIPAN dataset, with GCN taking the top spot. The most significant finding was the **dramatic increase in stability** across all models in comparison to the **GBMLGG** dataset.

**These results are shown below**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, accuracy_score
from bioneuralnet.downstream_task import DPMON

X_meth_full = dna_meth
X_rna_full = rna
X_mirna_full = mirna
C_full = clinical_preprocessed
Y_full = target
A_full = A_train

patient_indices = np.arange(len(Y_full))
y_target_classes = Y_full.squeeze()

sage_params = {
    'layer_num': 2,
    'gnn_hidden_dim': 128,
    'lr': 0.005197038211089667,
    'weight_decay': 0.04607928341516355,
    'nn_hidden_dim1': 16,
    'nn_hidden_dim2': 32,
    'num_epochs': 4096
}

N_FOLDS = 5
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED) 
output_dir_base_sage = Path("/home/vicente/Github/BioNeuralNet/dpmon_cv_results_SAGE_FINAL/kipan")

accuracy_scores_sage = []
f1_macro_scores_sage = []
f1_weighted_scores_sage = []

for fold_num, (train_index, test_index) in enumerate(skf.split(patient_indices, y_target_classes)):
    print(f"\nRunning DPMON SAGE Fold {fold_num + 1}/{N_FOLDS}")

    X_train_omics = [X_meth_full.iloc[train_index], X_rna_full.iloc[train_index], X_mirna_full.iloc[train_index]]
    Y_train = Y_full.iloc[train_index]
    C_train = C_full.iloc[train_index]
    A_train_fold = A_full.iloc[train_index, train_index] 
    
    dpmon_fold = DPMON(
        adjacency_matrix=A_train_fold, omics_list=X_train_omics, phenotype_data=Y_train,
        clinical_data=C_train, repeat_num=1, tune=False, gpu=True, cuda=0,
        output_dir=output_dir_base_sage / f"fold_{fold_num + 1}",
        **sage_params
    )

    predictions_df, _ = dpmon_fold.run() 
    actual = predictions_df["Actual"]
    pred = predictions_df["Predicted"]

    test_f1_macro = f1_score(actual, pred, average='macro')
    test_f1_weighted = f1_score(actual, pred, average='weighted')
    test_acc = accuracy_score(actual, pred)

    accuracy_scores_sage.append(test_acc)
    f1_macro_scores_sage.append(test_f1_macro)
    f1_weighted_scores_sage.append(test_f1_weighted)
    
    print(f"SAGE Fold {fold_num + 1} | Accuracy: {test_acc:.4f} | F1-Macro: {test_f1_macro:.4f}")

mean_acc_sage = np.mean(accuracy_scores_sage)
std_acc_sage = np.std(accuracy_scores_sage)

mean_f1_macro_sage = np.mean(f1_macro_scores_sage)
std_f1_macro_sage = np.std(f1_macro_scores_sage)

mean_f1_weighted_sage = np.mean(f1_weighted_scores_sage)
std_f1_weighted_sage = np.std(f1_weighted_scores_sage)

print("\nClassification with SAGE and 5-FOLD Cross-Validation")
print(f"Avg. Accuracy: {mean_acc_sage:.4f} +/- {std_acc_sage:.4f}")
print(f"Avg. F1 Weighted: {mean_f1_weighted_sage:.4f} +/- {std_f1_weighted_sage:.4f}")
print(f"Avg. F1 Macro: {mean_f1_macro_sage:.4f} +/- {std_f1_macro_sage:.4f}")


2025-11-08 15:26:59,129 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_SAGE_FINAL/kipan/fold_1
INFO:bioneuralnet.downstream_task.dpmon:Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_SAGE_FINAL/kipan/fold_1
2025-11-08 15:26:59,130 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
INFO:bioneuralnet.downstream_task.dpmon:Initialized DPMON with the provided parameters.
2025-11-08 15:26:59,130 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
INFO:bioneuralnet.downstream_task.dpmon:Starting DPMON run.
2025-11-08 15:26:59,134 - bioneuralnet.downstream_task.dpmon - INFO - Running standard training for DPMON.
INFO:bioneuralnet.downstream_task.dpmon:Running standard training for DPMON.
2025-11-08 15:26:59,135 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
INFO:bioneuralnet.downstream_task.dpmon:Using GPU 0
DEBUG:bioneuraln


Running DPMON SAGE Fold 1/5


2025-11-08 15:26:59,508 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.1157
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.7539
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.6388
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.5929
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.5838
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.6241
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.5939
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.5926
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.5936
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.6172
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6017
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 1 | Accuracy: 0.8536 | F1-Macro: 0.8497

Running DPMON SAGE Fold 2/5


2025-11-08 15:27:09,019 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.0483
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.7539
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.6408
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.5960
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.5945
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.5898
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.5890
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.5922
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.6043
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.5967
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6115
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 2 | Accuracy: 0.9658 | F1-Macro: 0.9481

Running DPMON SAGE Fold 3/5


2025-11-08 15:27:18,465 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.0858
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.7628
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.6428
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.5980
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.5876
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.5833
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.6239
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.6056
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.5980
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.6230
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6086
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 3 | Accuracy: 0.8536 | F1-Macro: 0.6027

Running DPMON SAGE Fold 4/5


2025-11-08 15:27:27,686 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.1399
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.8178
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.6703
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.6048
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.5849
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.5843
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.6277
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.6065
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.5975
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.6135
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6136
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 4 | Accuracy: 0.9962 | F1-Macro: 0.9972

Running DPMON SAGE Fold 5/5


2025-11-08 15:27:36,838 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/4096], Loss: 1.0923
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/4096], Loss: 0.7478
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/4096], Loss: 0.6389
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/4096], Loss: 0.5939
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/4096], Loss: 0.5952
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/4096], Loss: 0.5867
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/4096], Loss: 0.5878
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/4096], Loss: 0.6292
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/4096], Loss: 0.6002
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/4096], Loss: 0.5945
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/4096], Loss: 0.6100
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/4096], Loss

SAGE Fold 5 | Accuracy: 0.9336 | F1-Macro: 0.9498

Classification with SAGE and 5-FOLD Cross-Validation
Avg. Accuracy: 0.9206 +/- 0.0581
Avg. F1 Weighted: 0.9119 +/- 0.0695
Avg. F1 Macro: 0.8695 +/- 0.1418


In [None]:
output_dir_base_gcn = Path("/home/vicente/Github/BioNeuralNet/dpmon_cv_results_GCN_FINAL/kipan")

gcn_params = {
    'layer_num': 2,
    'gnn_hidden_dim': 16,
    'lr': 0.005345517248048608,
    'weight_decay': 0.003940908881257014,
    'nn_hidden_dim1': 64,
    'nn_hidden_dim2': 16,
    'num_epochs': 512
}

skf_gcn = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED) 

accuracy_scores_gcn = []
f1_macro_scores_gcn = []
f1_weighted_scores_gcn = []

for fold_num, (train_index, test_index) in enumerate(skf_gcn.split(patient_indices, y_target_classes)):
    print(f"\nDPMON GCN Fold {fold_num + 1}/{N_FOLDS}")

    X_train_omics = [X_meth_full.iloc[train_index], X_rna_full.iloc[train_index], X_mirna_full.iloc[train_index]]
    Y_train = Y_full.iloc[train_index]
    C_train = C_full.iloc[train_index]
    A_train_fold = A_full.iloc[train_index, train_index] 
    
    dpmon_fold = DPMON(
        adjacency_matrix=A_train_fold, omics_list=X_train_omics, phenotype_data=Y_train,
        clinical_data=C_train, repeat_num=1, tune=False, gpu=True, cuda=0,
        output_dir=output_dir_base_gcn / f"fold_{fold_num + 1}",
        **gcn_params
    )

    predictions_df, _ = dpmon_fold.run() 
    actual = predictions_df["Actual"]
    pred = predictions_df["Predicted"]

    test_acc = accuracy_score(actual, pred)
    test_f1 = f1_score(actual, pred, average='macro')
    test_f1w = f1_score(actual, pred, average='weighted')

    accuracy_scores_gcn.append(test_acc)
    f1_macro_scores_gcn.append(test_f1)
    f1_weighted_scores_gcn.append(test_f1w)
    
    print(f"GCN Fold {fold_num + 1} | Accuracy: {test_acc:.4f} | F1-Macro: {test_f1:.4f}")

mean_acc_gcn = np.mean(accuracy_scores_gcn)
std_acc_gcn = np.std(accuracy_scores_gcn)

mean_f1_macro_gcn = np.mean(f1_macro_scores_gcn)
std_f1_macro_gcn = np.std(f1_macro_scores_gcn)

mean_f1_weighted_gcn = np.mean(f1_weighted_scores_gcn)
std_f1_weighted_gcn = np.std(f1_weighted_scores_gcn)

print("Classification with GCN and 5-FOLD Cross-Validation")
print(f"Avg. Accuracy: {mean_acc_gcn:.4f} +/- {std_acc_gcn:.4f}")
print(f"Avg. F1 Weighted: {mean_f1_weighted_gcn:.4f} +/- {std_f1_weighted_gcn:.4f}")
print(f"Avg. F1 Macro: {mean_f1_macro_gcn:.4f} +/- {std_f1_macro_gcn:.4f}")

2025-11-08 15:33:44,850 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GCN_FINAL/kipan/fold_1
INFO:bioneuralnet.downstream_task.dpmon:Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GCN_FINAL/kipan/fold_1
2025-11-08 15:33:44,850 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
INFO:bioneuralnet.downstream_task.dpmon:Initialized DPMON with the provided parameters.
2025-11-08 15:33:44,851 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
INFO:bioneuralnet.downstream_task.dpmon:Starting DPMON run.
2025-11-08 15:33:44,854 - bioneuralnet.downstream_task.dpmon - INFO - Running standard training for DPMON.
INFO:bioneuralnet.downstream_task.dpmon:Running standard training for DPMON.
2025-11-08 15:33:44,855 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
INFO:bioneuralnet.downstream_task.dpmon:Using GPU 0
DEBUG:bioneuralnet


DPMON GCN Fold 1/5


2025-11-08 15:33:45,203 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1004
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.7773
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.6632
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6013
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5762
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5659
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5610
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5592
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5891
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5699
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5648
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5622
DEB

GCN Fold 1 | Accuracy: 0.9411 | F1-Macro: 0.9050

DPMON GCN Fold 2/5


2025-11-08 15:33:46,728 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1012
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.7873
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.6815
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6175
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5828
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5693
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5780
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5666
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5619
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5594
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5587
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5628
DEB

GCN Fold 2 | Accuracy: 0.9962 | F1-Macro: 0.9922

DPMON GCN Fold 3/5


2025-11-08 15:33:48,278 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.0940
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.8781
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.7433
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6428
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5911
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5722
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5701
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5625
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5602
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5592
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.6029
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5689
DEB

GCN Fold 3 | Accuracy: 0.9297 | F1-Macro: 0.8885

DPMON GCN Fold 4/5


2025-11-08 15:33:49,781 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1172
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.7594
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.6614
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6023
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5756
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5665
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5649
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5695
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5629
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5607
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5593
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5623
DEB

GCN Fold 4 | Accuracy: 0.9260 | F1-Macro: 0.8836

DPMON GCN Fold 5/5


2025-11-08 15:33:51,307 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.0933
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.7802
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.6822
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.6217
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5861
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5681
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5625
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5617
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5903
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5703
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5622
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5602
DEB

GCN Fold 5 | Accuracy: 0.9696 | F1-Macro: 0.9465
Classification with GCN and 5-FOLD Cross-Validation
Avg. Accuracy: 0.9525 +/- 0.0267
Avg. F1 Weighted: 0.9563 +/- 0.0241
Avg. F1 Macro: 0.9232 +/- 0.0410


In [None]:
output_dir_base_gat = Path("/home/vicente/Github/BioNeuralNet/dpmon_cv_results_GAT_FINAL/kipan")

gat_params = {
    'layer_num': 16,
    'gnn_hidden_dim': 4,
    'lr': 0.003226646486974539,
    'weight_decay': 0.04686394640851084,
    'nn_hidden_dim1': 16,
    'nn_hidden_dim2': 128,
    'num_epochs': 512
}

skf_gat = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED) 

accuracy_scores_gat = []
f1_macro_scores_gat = []
f1_weighted_scores_gat = []

for fold_num, (train_index, test_index) in enumerate(skf_gat.split(patient_indices, y_target_classes)):
    print(f"\nDPMON GAT Fold {fold_num + 1}/{N_FOLDS}")

    X_train_omics = [X_meth_full.iloc[train_index], X_rna_full.iloc[train_index], X_mirna_full.iloc[train_index]]
    Y_train = Y_full.iloc[train_index]
    C_train = C_full.iloc[train_index]
    A_train_fold = A_full.iloc[train_index, train_index] 
    
    dpmon_fold = DPMON(
        adjacency_matrix=A_train_fold, omics_list=X_train_omics, phenotype_data=Y_train,
        clinical_data=C_train, repeat_num=1, tune=False, gpu=True, cuda=0,
        output_dir=output_dir_base_gat / f"fold_{fold_num + 1}",
        **gat_params
    )

    predictions_df, _ = dpmon_fold.run() 
    actual = predictions_df["Actual"]
    pred = predictions_df["Predicted"]

    test_acc = accuracy_score(actual, pred)
    test_f1 = f1_score(actual, pred, average='macro')
    test_f1w = f1_score(actual, pred, average='weighted')

    accuracy_scores_gat.append(test_acc)
    f1_macro_scores_gat.append(test_f1)
    f1_weighted_scores_gat.append(test_f1w)
    
    print(f"GAT Fold {fold_num + 1} | Accuracy: {test_acc:.4f} | F1-Macro: {test_f1:.4f}")

mean_acc_gat = np.mean(accuracy_scores_gat)
std_acc_gat = np.std(accuracy_scores_gat)

mean_f1_macro_gat = np.mean(f1_macro_scores_gat)
std_f1_macro_gat = np.std(f1_macro_scores_gat)

mean_f1_weighted_gat = np.mean(f1_weighted_scores_gat)
std_f1_weighted_gat = np.std(f1_weighted_scores_gat)

print("Classification with GAT and 5-FOLD Cross-Validation")
print(f"Avg. Accuracy: {mean_acc_gat:.4f} +/- {std_acc_gat:.4f}")
print(f"Avg. F1 Weighted: {mean_f1_weighted_gat:.4f} +/- {std_f1_weighted_gat:.4f}")
print(f"Avg. F1 Macro: {mean_f1_macro_gat:.4f} +/- {std_f1_macro_gat:.4f}")

2025-11-08 15:34:22,035 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GAT_FINAL/kipan/fold_1
INFO:bioneuralnet.downstream_task.dpmon:Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GAT_FINAL/kipan/fold_1
2025-11-08 15:34:22,036 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
INFO:bioneuralnet.downstream_task.dpmon:Initialized DPMON with the provided parameters.
2025-11-08 15:34:22,036 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
INFO:bioneuralnet.downstream_task.dpmon:Starting DPMON run.
2025-11-08 15:34:22,043 - bioneuralnet.downstream_task.dpmon - INFO - Running standard training for DPMON.
INFO:bioneuralnet.downstream_task.dpmon:Running standard training for DPMON.
2025-11-08 15:34:22,043 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
INFO:bioneuralnet.downstream_task.dpmon:Using GPU 0
DEBUG:bioneuralnet


DPMON GAT Fold 1/5


2025-11-08 15:34:22,399 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.0508
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.6822
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.5946
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.5773
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5742
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5686
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5653
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5773
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5670
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5647
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5648
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.6165
DEB

GAT Fold 1 | Accuracy: 0.9981 | F1-Macro: 0.9986

DPMON GAT Fold 2/5


2025-11-08 15:34:27,589 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1539
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.7013
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.6103
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.5820
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5710
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5670
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5686
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5684
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5686
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5922
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5725
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5685
DEB

GAT Fold 2 | Accuracy: 0.9981 | F1-Macro: 0.9961

DPMON GAT Fold 3/5


2025-11-08 15:34:32,671 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1048
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.6906
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.6025
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.5764
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5668
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5636
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5957
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5715
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5649
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5652
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.6052
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5787
DEB

GAT Fold 3 | Accuracy: 0.8023 | F1-Macro: 0.7514

DPMON GAT Fold 4/5


2025-11-08 15:34:37,616 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.0746
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.6985
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.6015
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.5743
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5665
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5683
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5749
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5671
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5666
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5673
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5970
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5726
DEB

GAT Fold 4 | Accuracy: 0.9564 | F1-Macro: 0.9300

DPMON GAT Fold 5/5


2025-11-08 15:34:42,765 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/1
INFO:bioneuralnet.downstream_task.dpmon:Training iteration 1/1
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [1/512], Loss: 1.1169
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [10/512], Loss: 0.6862
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [20/512], Loss: 0.5982
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [30/512], Loss: 0.5749
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [40/512], Loss: 0.5687
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [50/512], Loss: 0.5799
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [60/512], Loss: 0.5677
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [70/512], Loss: 0.5653
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [80/512], Loss: 0.5977
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [90/512], Loss: 0.5740
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [100/512], Loss: 0.5664
DEBUG:bioneuralnet.downstream_task.dpmon:Epoch [110/512], Loss: 0.5663
DEB

GAT Fold 5 | Accuracy: 0.9981 | F1-Macro: 0.9986
Classification with GAT and 5-FOLD Cross-Validation
Avg. Accuracy: 0.9506 +/- 0.0759
Avg. F1 Weighted: 0.9528 +/- 0.0718
Avg. F1 Macro: 0.9349 +/- 0.0954
