# TCGA-PAAD Analysis Demo

- **Cohort**: The **TCGA-PAAD** (Pancreatic Adenocarcinoma) dataset from The Cancer Genome Atlas.

- **Goal**: Predict survival status from a multi-omics profile, framed as a binary classification task.
- **Prediction Target**: The **Overall Survival (OS)** event indicator (1 = death observed, 0 = censored), predicted from the patient's integrated molecular data (RNA, Methylation, CNV) and clinical features.

**Data Sources:**

- **Omics Data**: [https://xenabrowser.net/datapages/](https://xenabrowser.net/datapages/)
- **Clinical Data**: Broad Institute FireHose ([http://firebrowse.org/?cohort=PAAD](http://firebrowse.org/?cohort=PAAD))

In [1]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/PAAN")

cnv_gistic_raw = pd.read_csv(root/"Gistic2_CopyNumber_Gistic2_all_thresholded_by_genes.txt", sep="\t",index_col=0,low_memory=False)                            
rna_raw = pd.read_csv(root / "HiSeqV2.txt", sep="\t",index_col=0,low_memory=False)
meth_raw = pd.read_csv(root/"HumanMethylation450.txt", sep='\t',index_col=0,low_memory=False)
clinical_raw = pd.read_csv(root / "PAAD.clin.merged.picked.txt",sep="\t", index_col=0, low_memory=False)
target = pd.read_csv(root / "survival_PAAD_survival.txt", sep="\t", index_col=0, low_memory=False)

probe_map_meth = pd.read_csv("probeMap_illuminaMethyl450_hg19_GPL16304_TCGAlegacy.txt", sep="\t", index_col=0, low_memory=False)

# display all shapes and first few rows of each dataset
display(cnv_gistic_raw.iloc[:3,:5])
display(cnv_gistic_raw.shape)

display(rna_raw.iloc[:3,:5])
display(rna_raw.shape)

display(meth_raw.iloc[:3,:5])
display(meth_raw.shape)

display(clinical_raw.iloc[:3,:5])
display(clinical_raw.shape)

display(target.iloc[:3,:5])
display(target.shape)

print("\nprobe map methylation:")
display(probe_map_meth.iloc[:3,:5])
display(probe_map_meth.shape)



Unnamed: 0_level_0,TCGA-2J-AAB1-01,TCGA-2J-AAB4-01,TCGA-2J-AAB6-01,TCGA-2J-AAB8-01,TCGA-2J-AAB9-01
Gene Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ACAP3,0,0,1,0,0
ACTRT2,0,0,1,0,0
AGRN,0,0,1,0,0


(24776, 184)

Unnamed: 0_level_0,TCGA-2L-AAQL-01,TCGA-2J-AABI-01,TCGA-3A-A9J0-01,TCGA-3A-A9I7-01,TCGA-2J-AABO-01
sample,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ARHGEF10L,10.243,9.5734,10.3872,10.1431,9.8022
HIF3A,6.6983,6.4428,5.5034,4.6491,7.2137
RNF17,0.0,0.0,0.0,1.1814,0.0


(20530, 183)

Unnamed: 0_level_0,TCGA-S4-A8RP-01,TCGA-IB-A6UG-01,TCGA-US-A776-01,TCGA-FZ-5926-01,TCGA-HZ-A8P1-01
sample,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
cg13332474,0.032,0.3927,0.0351,0.0442,0.0398
cg00651829,0.3976,0.1451,0.0209,0.0153,0.5791
cg17027195,0.0464,0.5564,0.0417,0.0342,0.6533


(485577, 195)

Unnamed: 0_level_0,tcga-2j-aabr,tcga-2j-aabt,tcga-3a-a9i5,tcga-3a-a9ij,tcga-3a-a9il
Hybridization REF,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Composite Element REF,value,value,value,value,value
years_to_birth,60,72,57,65,39
vital_status,0,0,0,0,0


(20, 185)

Unnamed: 0_level_0,_PATIENT,OS,OS.time,DSS,DSS.time
sample,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TCGA-2J-AAB1-01,TCGA-2J-AAB1,1,66,1.0,66
TCGA-2J-AAB4-01,TCGA-2J-AAB4,0,729,0.0,729
TCGA-2J-AAB6-01,TCGA-2J-AAB6,1,293,1.0,293


(196, 10)


probe map methylation:


Unnamed: 0_level_0,gene,chrom,chromStart,chromEnd,strand
#id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
cg13332474,.,chr7,25935146,25935148,.
cg00651829,"RSPH14,GNAZ",chr22,23413065,23413067,.
cg17027195,AUTS2,chr7,69064092,69064094,.


(395985, 5)

## Data Processing Summary

1.  **Transpose Data:** The omics tables (`cnv_gistic`, `rna`, `meth`) and the clinical table were transposed so that rows represent patients and columns represent features.
2.  **Standardize Patient IDs:** Omics sample barcodes were trimmed to the 12-character TCGA patient format (e.g., `TCGA-2J-AAB1`), and clinical and outcome IDs were upper-cased so that all tables can be matched.
3.  **Handle Duplicates:** Duplicate patient rows were averaged in the omics data. The first entry was kept for duplicate patients in the clinical and outcomes data.
4.  **Find Common Patients:** The script identified the patients that exist in *all five* datasets (`meth`, `rna`, `cnv_gistic`, `clinical`, and `outcomes`).
5.  **Subset Data:** All data tables were filtered down to *only* this common list of patients before any heavier processing, so every table has the same rows in the same order.
6.  **Process Methylation:** Methylation probes were mapped to gene names. Probes without a gene were dropped, and probes mapping to multiple genes were split into one row per gene. Beta values were then averaged by gene to create a single feature per gene.
7.  **Extract Target:** The `OS` (Overall Survival) column was pulled from the processed `outcomes` data to be used as the prediction target (y-variable).
8.  **Process CNV Data:** The GISTIC-thresholded CNV data was split into two separate binary matrices: `cnv_amp_processed` (for gains/amplifications) and `cnv_del_processed` (for losses/deletions).
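
The methylation step in the summary above (drop unmapped probes, split multi-gene probes, average by gene) can be illustrated on a toy table. This is a minimal sketch with made-up probe IDs, gene names, and beta values; it mirrors the `split`/`explode`/`groupby` pattern used in the notebook but is not the notebook's exact code.

```python
import pandas as pd

# Toy beta values: probes in rows, patients in columns (same layout as the raw file).
beta = pd.DataFrame(
    {"P1": [0.10, 0.30, 0.50], "P2": [0.20, 0.40, 0.60]},
    index=["cg01", "cg02", "cg03"],
)

# Toy probe map: "." marks a probe without a gene, "," separates multiple genes.
probe_map = pd.DataFrame(
    {"gene": ["GENE_A", "GENE_A,GENE_B", "."]},
    index=["cg01", "cg02", "cg03"],
)

# Drop unmapped probes, then give every (probe, gene) pair its own row.
genes = probe_map.loc[probe_map["gene"] != ".", "gene"].str.split(",").explode()

# Attach gene names to the beta values and average all probes of the same gene.
gene_level = beta.join(genes, how="inner").groupby("gene").mean().T
print(gene_level)
```

Here `cg02` contributes to both `GENE_A` and `GENE_B`, while `cg03` is discarded because it has no gene annotation.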

In [2]:
# --- 1. Load, Transpose, and Standardize IDs ---
cnv_gistic = cnv_gistic_raw.T
rna = rna_raw.T
meth_transposed = meth_raw.T
clinical = clinical_raw.T

def trim_barcode(idx):
    return idx.to_series().str.slice(0, 12)

# Standardize all patient IDs FIRST
meth_transposed.index = trim_barcode(meth_transposed.index)
rna.index = trim_barcode(rna.index)
cnv_gistic.index = trim_barcode(cnv_gistic.index)
clinical.index = clinical.index.str.upper()
clinical.index.name = "Patient_ID"

outcomes = target.copy() 
outcomes = outcomes.set_index('_PATIENT')
outcomes.index = outcomes.index.str.upper()
outcomes.index.name = "Patient_ID"

# --- 2. Handle Duplicate Patients ---
# Average any patients that appear twice (e.g., from different vials)
meth_transposed = meth_transposed.groupby(meth_transposed.index).mean()
rna = rna.groupby(rna.index).mean()
cnv_gistic = cnv_gistic.groupby(cnv_gistic.index).mean()
clinical = clinical[~clinical.index.duplicated(keep='first')]
outcomes = outcomes[~outcomes.index.duplicated(keep='first')]



In [3]:
# --- 3. Find Common Patients & Subset ---
common_patients = sorted(list(
    set(meth_transposed.index) &
    set(rna.index) &
    set(cnv_gistic.index) &
    set(clinical.index) &
    set(outcomes.index)
))

print(f"\nFound: {len(common_patients)} patients across all data types.")

# SUBSET THE DATA *BEFORE* HEAVY PROCESSING
meth_processed = meth_transposed.loc[common_patients]
rna_processed = rna.loc[common_patients]
cnv_gistic_processed = cnv_gistic.loc[common_patients]
clinical_processed = clinical.loc[common_patients]
outcomes_processed = outcomes.loc[common_patients]


Found: 177 patients across all data types.
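
As a small illustration of the alignment logic above, the sketch below trims sample barcodes to patient IDs, averages duplicate rows, and intersects the indices of two tables. The barcodes and values are invented for the example.

```python
import pandas as pd

# Two samples from the same (made-up) patient plus one sample from another patient.
rna_toy = pd.DataFrame(
    {"g1": [1.0, 3.0, 5.0]},
    index=["TCGA-AA-0001-01", "TCGA-AA-0001-11", "TCGA-AA-0002-01"],
)
outcomes_toy = pd.DataFrame({"OS": [1, 0]}, index=["TCGA-AA-0001", "TCGA-AA-0003"])

# Keep the first 12 characters: project, tissue source site, participant.
rna_toy.index = rna_toy.index.str.slice(0, 12)

# Average rows that now share a patient ID.
rna_toy = rna_toy.groupby(level=0).mean()

# Only patients present in every table survive the intersection.
common_toy = sorted(set(rna_toy.index) & set(outcomes_toy.index))
print(common_toy)
```

One design point worth checking on real data: in TCGA barcodes the suffix after the patient ID encodes the sample type (for example `01` for primary tumor and `11` for solid tissue normal), so averaging after trimming would blend tumor and normal samples if both are present for a patient.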


In [4]:
meth_processed_T = meth_processed.T

probe_cleaned_gene = probe_map_meth.dropna(subset=['gene'])
probe_cleaned_gene = probe_cleaned_gene[probe_cleaned_gene['gene'] != '.']
probe_cleaned_gene['gene'] = probe_cleaned_gene['gene'].str.split(',')
probe_map_meth_exploded = probe_cleaned_gene.explode('gene')

meth_with_genes = meth_processed_T.join(probe_map_meth_exploded['gene'])
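# .copy() makes an independent frame so the in-place assignment below does not raise SettingWithCopyWarning
meth_with_genes_filtered = meth_with_genes.dropna(subset=['gene']).copy()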


data_columns = meth_with_genes_filtered.columns.drop('gene')
meth_with_genes_filtered.loc[:, data_columns] = meth_with_genes_filtered[data_columns].apply(pd.to_numeric)
meth_processed = meth_with_genes_filtered.groupby('gene').mean().T

rna_processed = rna_processed.apply(pd.to_numeric, errors='coerce')
cnv_gistic_processed = cnv_gistic_processed.apply(pd.to_numeric, errors='coerce')

dfs_to_process = {
    "meth_": meth_processed,
    "rna_": rna_processed
}


for prefix, df in dfs_to_process.items():
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")

    # Prefix every column with its omics type so feature names stay unique across tables
    dfs_to_process[prefix] = df.add_prefix(prefix)

# Reassign the cleaned and prefixed DataFrames
meth_processed = dfs_to_process["meth_"]
rna_processed = dfs_to_process["rna_"]


targets = outcomes_processed['OS']

In [5]:
# cnv_gistic = cnv_gistic_raw.T
# rna = rna_raw.T
# meth_transposed = meth_raw.T
# clinical = clinical_raw.T

# probe_cleaned_gene = probe_map_meth.dropna(subset=['gene'])
# probe_cleaned_gene = probe_cleaned_gene[probe_cleaned_gene['gene'] != '.']
# probe_cleaned_gene['gene'] = probe_cleaned_gene['gene'].str.split(',')
# probe_map_meth_exploded = probe_cleaned_gene.explode('gene')

# meth_with_genes = meth_transposed.join(probe_map_meth_exploded['gene'])
# meth_with_genes_filtered = meth_with_genes.dropna(subset=['gene'])
# data_columns = meth_with_genes_filtered.columns.drop('gene')
# meth_with_genes_filtered[data_columns] = meth_with_genes_filtered[data_columns].apply(pd.to_numeric)

# meth = meth_with_genes_filtered.groupby('gene').mean()
# print(f"DNA Meth shape (all genes): {meth.shape}")


# def trim_barcode(idx):
#     return idx.to_series().str.slice(0, 12)

# meth.index = trim_barcode(meth.index)
# rna.index = trim_barcode(rna.index)
# cnv_gistic.index = trim_barcode(cnv_gistic.index)

# clinical.index = clinical.index.str.upper()
# clinical.index.name = "Patient_ID"
# clinical = clinical[~clinical.index.duplicated(keep='first')]

# outcomes = target.copy() 
# outcomes = outcomes.set_index('_PATIENT')
# outcomes.index = outcomes.index.str.upper()
# outcomes.index.name = "Patient_ID"
# outcomes = outcomes[~outcomes.index.duplicated(keep='first')]

# rna = rna.apply(pd.to_numeric, errors='coerce')
# cnv_gistic = cnv_gistic.apply(pd.to_numeric, errors='coerce')


# meth = meth.groupby(meth.index).mean()
# rna = rna.groupby(rna.index).mean()
# cnv_gistic = cnv_gistic.groupby(cnv_gistic.index).mean()

# print(f"\nMethylation shape: {meth.shape}")
# print(f"RNA shape: {rna.shape}")
# print(f"cnv_gistic shape: {cnv_gistic.shape}")
# print(f"Clinical shape: {clinical.shape}")
# print(f"Outcomes shape: {outcomes.shape}")

# for df in [meth, rna, cnv_gistic]:
#     df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
#     df.columns = df.columns.str.replace(r"\|", "_", regex=True)
#     df.columns = df.columns.str.replace("-", "_", regex=False)
#     df.columns = df.columns.str.replace(r"_+", "_", regex=True)
#     df.columns = df.columns.str.strip("_")
    
#     df.fillna(df.mean(), inplace=True)

# common_patients = sorted(list(
#     set(meth.index) &
#     set(rna.index) &
#     set(cnv_gistic.index) &
#     set(clinical.index) &
#     set(outcomes.index)
# ))

# print(f"\nFound: {len(common_patients)} patients across all data types.")

# # subset to only common patients
# meth_processed = meth.loc[common_patients]
# rna_processed= rna.loc[common_patients]
# cnv_gistic_processed = cnv_gistic.loc[common_patients]
# clinical_processed = clinical.loc[common_patients]
# outcomes_processed = outcomes.loc[common_patients]
# outcomes = outcomes_processed['OS']


In [6]:
display(cnv_gistic_processed.iloc[:3,:5])
display(cnv_gistic_processed.shape)

display(rna_processed.iloc[:3,:5])
display(rna_processed.shape)

display(meth_processed.iloc[:3,:5])
display(meth_processed.shape)

display(clinical_processed.iloc[:3,:5])
display(clinical_processed.shape)

display(targets.value_counts())

Gene Symbol,ACAP3,ACTRT2,AGRN,ANKRD65,ATAD3A
TCGA-2J-AAB1,0.0,0.0,0.0,0.0,0.0
TCGA-2J-AAB4,0.0,0.0,0.0,0.0,0.0
TCGA-2J-AAB6,1.0,1.0,1.0,1.0,1.0


(177, 24776)

sample,rna_ARHGEF10L,rna_HIF3A,rna_RNF17,rna_RNF10,rna_RNF11
TCGA-2J-AAB1,9.791,8.731,0.5732,11.9111,10.2312
TCGA-2J-AAB4,10.5186,5.9351,0.0,11.7311,10.7823
TCGA-2J-AAB6,10.2412,6.6414,0.0,12.2415,11.4475


(177, 20530)

gene,meth_5S_rRNA,meth_7SK,meth_A1BG,meth_A1BG_AS1,meth_A1CF
TCGA-2J-AAB1,0.5889,0.025075,0.616812,0.615167,0.595
TCGA-2J-AAB4,0.60546,0.026175,0.379675,0.356613,0.516583
TCGA-2J-AAB6,0.64782,0.023825,0.422,0.392127,0.475


(177, 34013)

Hybridization REF,Composite Element REF,years_to_birth,vital_status,days_to_death,days_to_last_followup
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TCGA-2J-AAB1,value,65,1,66.0,
TCGA-2J-AAB4,value,48,0,,729.0
TCGA-2J-AAB6,value,75,1,293.0,


(177, 20)

OS
1    93
0    84
Name: count, dtype: int64

In [7]:
import bioneuralnet as bnn

cnv_amp_processed = cnv_gistic_processed.isin([1, 2]).astype(int)
cnv_del_processed = cnv_gistic_processed.isin([-1, -2]).astype(int)

processed_dfs = {
    "amp_": cnv_amp_processed,
    "del_": cnv_del_processed
}

for prefix, df in processed_dfs.items():
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")
    processed_dfs[prefix] = df.add_prefix(prefix)

cnv_amp_processed = processed_dfs["amp_"]
cnv_del_processed = processed_dfs["del_"]

clinical_processed.drop(columns=["Composite Element REF"], errors="ignore", inplace=True)

meth_m_values = bnn.utils.beta_to_m(meth_processed, eps=1e-6) 
Y_labels = outcomes_processed['OS'].to_frame(name="target")

X_meth = meth_m_values.loc[common_patients]
X_rna = rna_processed.loc[common_patients]
clinical = clinical_processed.loc[common_patients]
Y_labels = Y_labels.loc[common_patients]

X_cnv_amp = cnv_amp_processed.loc[common_patients]
X_cnv_del = cnv_del_processed.loc[common_patients]


2025-11-09 15:29:28,190 - bioneuralnet.utils.data - INFO - Starting Beta-to-M value conversion (shape: (177, 34013)). Epsilon: 1e-06
2025-11-09 15:29:29,690 - bioneuralnet.utils.data - INFO - Beta-to-M conversion complete.
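
For reference, the standard Beta-to-M transform is `M = log2(beta / (1 - beta))`. The sketch below is a scalar version that assumes `eps` is used to clip beta away from 0 and 1 before taking the logarithm; how `bnn.utils.beta_to_m` applies `eps` internally is not shown in this notebook, so treat the clipping as an assumption. The function name `beta_to_m_scalar` is hypothetical.

```python
import math

def beta_to_m_scalar(beta: float, eps: float = 1e-6) -> float:
    """Convert one methylation beta value to an M-value.

    Clipping to [eps, 1 - eps] is an assumption made here to keep the
    logarithm finite at beta = 0 or beta = 1.
    """
    clipped = min(max(beta, eps), 1.0 - eps)
    return math.log2(clipped / (1.0 - clipped))

# Beta values taken from the first row of the gene-level methylation table above.
for beta in (0.5889, 0.025075, 0.5):
    print(beta, round(beta_to_m_scalar(beta), 6))
```

The first two inputs are the `meth_5S_rRNA` and `meth_7SK` beta values of `TCGA-2J-AAB1`, and the results agree with the M-values displayed for that patient after conversion (about 0.5185 and -5.2810).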


In [8]:
display(X_cnv_amp.iloc[:3,:5])
display(X_cnv_amp.shape)

display(X_cnv_del.iloc[:3,:5])
display(X_cnv_del.shape)

display(X_rna.iloc[:3,:5])
display(X_rna.shape)

display(X_meth.iloc[:3,:5])
display(X_meth.shape)

display(clinical.iloc[:3,:5])
display(clinical.shape)

display(Y_labels.value_counts())

Gene Symbol,amp_ACAP3,amp_ACTRT2,amp_AGRN,amp_ANKRD65,amp_ATAD3A
TCGA-2J-AAB1,0,0,0,0,0
TCGA-2J-AAB4,0,0,0,0,0
TCGA-2J-AAB6,1,1,1,1,1


(177, 24776)

Gene Symbol,del_ACAP3,del_ACTRT2,del_AGRN,del_ANKRD65,del_ATAD3A
TCGA-2J-AAB1,0,0,0,0,0
TCGA-2J-AAB4,0,0,0,0,0
TCGA-2J-AAB6,0,0,0,0,0


(177, 24776)

sample,rna_ARHGEF10L,rna_HIF3A,rna_RNF17,rna_RNF10,rna_RNF11
TCGA-2J-AAB1,9.791,8.731,0.5732,11.9111,10.2312
TCGA-2J-AAB4,10.5186,5.9351,0.0,11.7311,10.7823
TCGA-2J-AAB6,10.2412,6.6414,0.0,12.2415,11.4475


(177, 20530)

gene,meth_5S_rRNA,meth_7SK,meth_A1BG,meth_A1BG_AS1,meth_A1CF
TCGA-2J-AAB1,0.518533,-5.28097,0.686782,0.676744,0.554968
TCGA-2J-AAB4,0.61786,-5.217401,-0.708259,-0.851325,0.095734
TCGA-2J-AAB6,0.87928,-5.356592,-0.453826,-0.632451,-0.14439


(177, 34013)

Hybridization REF,years_to_birth,vital_status,days_to_death,days_to_last_followup,tumor_tissue_site
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TCGA-2J-AAB1,65,1,66.0,,pancreas
TCGA-2J-AAB4,48,0,,729.0,pancreas
TCGA-2J-AAB6,75,1,293.0,,pancreas


(177, 19)

target
1         93
0         84
Name: count, dtype: int64
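
The amplification and deletion matrices shown above come from GISTIC2 thresholded calls, which take the values -2, -1, 0, 1, and 2 (deep deletion, shallow deletion, neutral, gain, high-level amplification). A minimal sketch of the split on invented data:

```python
import pandas as pd

# Toy GISTIC2 thresholded calls: patients in rows, genes in columns.
gistic_toy = pd.DataFrame(
    {"GENE_A": [2, 0, -1], "GENE_B": [1, -2, 0]},
    index=["P1", "P2", "P3"],
)

# 1 where the gene is gained or amplified, 0 otherwise.
amp_toy = gistic_toy.isin([1, 2]).astype(int).add_prefix("amp_")

# 1 where the gene is lost or deleted, 0 otherwise.
del_toy = gistic_toy.isin([-1, -2]).astype(int).add_prefix("del_")

print(amp_toy)
print(del_toy)
```

Because `isin` matches exact values, any non-integer call (for instance 0.5, which can appear if duplicate samples with different calls were averaged) would be encoded as 0 in both matrices. Whether that occurs in this cohort has not been checked here.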

## Feature Selection Methodology

### Supported Methods and Interpretation

**BioNeuralNet** provides three techniques for feature selection, allowing for different views of the data's statistical profile:

- **Variance Thresholding:** Identifies features with the **highest overall variance** across all samples.

- **ANOVA F-test:** Pinpoints features that best **distinguish between the target classes** (e.g., Alive vs. Deceased).

- **Random Forest Importance:** Assesses **feature utility** based on its contribution to a predictive non-linear model.
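
To make the difference between the three criteria concrete, the sketch below ranks features on synthetic data using pandas and scikit-learn. It is an illustration of the general techniques, not BioNeuralNet's implementation; the data, the feature names, and the value of `k` are all made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)  # balanced binary labels
X = pd.DataFrame(rng.normal(size=(60, 8)), columns=[f"f{i}" for i in range(8)])
X["f0"] += 3.0 * y   # f0 separates the two classes
X["f7"] *= 10.0      # f7 has large variance but carries no label signal

k = 3

# 1) Variance: label-free, rewards spread only.
top_var = X.var().nlargest(k).index.tolist()

# 2) ANOVA F-test: rewards a difference in class means.
f_stat, _ = f_classif(X, y)
top_anova = pd.Series(f_stat, index=X.columns).nlargest(k).index.tolist()

# 3) Random forest: rewards usefulness inside a non-linear model.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_rf = pd.Series(forest.feature_importances_, index=X.columns).nlargest(k).index.tolist()

print("variance:", top_var)
print("anova:   ", top_anova)
print("forest:  ", top_rf)
```

In this toy setup the variance ranking is led by the noisy `f7`, while the ANOVA ranking is led by the informative `f0`, which is why the label-aware methods are the ones intersected later.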

### PAAD Cohort Selection Strategy

A dimensionality reduction step was essential for managing the high-feature-count omics data:

- **High-Feature Datasets:** DNA Methylation (34,013), RNA (20,530), CNV Amplification (24,776), and CNV Deletion (24,776) all required significant feature reduction.

- **Filtering Process:** As an example strategy, the **top 6,000 features** could be extracted from each high-feature omics dataset using all three methods.

- **Final Set:** A consensus set could be built for each omics type by finding the intersection of features selected by the ANOVA F-test and Random Forest Importance, ensuring both statistical relevance and model-based utility.

- **Low-Feature Datasets:** The **Clinical** data (19 features) was passed through **without selection**, as its feature count was already manageable.
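
The consensus step described above reduces to set intersections. A minimal sketch with invented feature names (the helper `consensus` is hypothetical, not part of BioNeuralNet):

```python
def consensus(selected: dict[str, set[str]], methods: tuple[str, ...]) -> list[str]:
    """Return the features chosen by every listed method, in sorted order."""
    return sorted(set.intersection(*(selected[m] for m in methods)))

selected = {
    "variance": {"g1", "g2", "g3"},
    "anova": {"g2", "g3", "g4"},
    "rf": {"g3", "g4", "g5"},
}

anova_rf = consensus(selected, ("anova", "rf"))
all_three = consensus(selected, ("variance", "anova", "rf"))
print(anova_rf)   # ['g3', 'g4']
print(all_three)  # ['g3']
```

Sorting the result is a deliberate choice in this sketch: converting a set of strings straight to a list gives an order that can change between Python sessions, which in turn changes the column order of the selected feature table.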

In [None]:
import bioneuralnet as bnn

meth_highvar = bnn.utils.select_top_k_variance(X_meth, k=6000)
meth_af = bnn.utils.top_anova_f_features(X_meth, Y_labels, max_features=6000)
meth_rf = bnn.utils.select_top_randomforest(X_meth, Y_labels, top_k=6000)

rna_highvar = bnn.utils.select_top_k_variance(X_rna, k=6000)
rna_af = bnn.utils.top_anova_f_features(X_rna, Y_labels, max_features=6000)
rna_rf = bnn.utils.select_top_randomforest(X_rna, Y_labels, top_k=6000)

cnv_amp_highvar = bnn.utils.select_top_k_variance(X_cnv_amp, k=6000)
cnv_amp_af = bnn.utils.top_anova_f_features(X_cnv_amp, Y_labels, max_features=6000)
cnv_amp_rf = bnn.utils.select_top_randomforest(X_cnv_amp, Y_labels, top_k=6000)

cnv_del_highvar = bnn.utils.select_top_k_variance(X_cnv_del, k=6000)
cnv_del_af = bnn.utils.top_anova_f_features(X_cnv_del, Y_labels, max_features=6000)
cnv_del_rf = bnn.utils.select_top_randomforest(X_cnv_del, Y_labels, top_k=6000)

meth_var_set = set(meth_highvar.columns)
meth_anova_set = set(meth_af.columns)
meth_rf_set = set(meth_rf.columns)

rna_var_set = set(rna_highvar.columns)
rna_anova_set = set(rna_af.columns)
rna_rf_set = set(rna_rf.columns)

cnv_amp_var_set = set(cnv_amp_highvar.columns)
cnv_amp_anova_set = set(cnv_amp_af.columns)
cnv_amp_rf_set = set(cnv_amp_rf.columns)

cnv_del_var_set = set(cnv_del_highvar.columns)
cnv_del_anova_set = set(cnv_del_af.columns)
cnv_del_rf_set = set(cnv_del_rf.columns)

meth_inter1 = list(meth_anova_set & meth_var_set)
meth_inter2 = list(meth_rf_set & meth_var_set)
meth_inter3 = list(meth_anova_set & meth_rf_set)
meth_all_three = list(meth_anova_set & meth_var_set & meth_rf_set)

rna_inter1 = list(rna_anova_set & rna_var_set)
rna_inter2 = list(rna_rf_set & rna_var_set)
rna_inter3 = list(rna_anova_set & rna_rf_set)
rna_all_three = list(rna_anova_set & rna_var_set & rna_rf_set)

cnv_amp_inter1 = list(cnv_amp_anova_set & cnv_amp_var_set)
cnv_amp_inter2 = list(cnv_amp_rf_set & cnv_amp_var_set)
cnv_amp_inter3 = list(cnv_amp_anova_set & cnv_amp_rf_set)
cnv_amp_all_three = list(cnv_amp_anova_set & cnv_amp_var_set & cnv_amp_rf_set)

cnv_del_inter1 = list(cnv_del_anova_set & cnv_del_var_set)
cnv_del_inter2 = list(cnv_del_rf_set & cnv_del_var_set)
cnv_del_inter3 = list(cnv_del_anova_set & cnv_del_rf_set)
cnv_del_all_three = list(cnv_del_anova_set & cnv_del_var_set & cnv_del_rf_set)

2025-11-09 15:29:33,506 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-09 15:29:33,507 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 6049 NaNs after median imputation
2025-11-09 15:29:33,507 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-09 15:29:33,552 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-11-09 15:29:37,474 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-09 15:29:37,475 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 6049 NaNs after median imputation
2025-11-09 15:29:37,475 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-09 15:29:37,580 - bioneuralnet.utils.preprocess - INFO - Selected 6000 features by ANOVA (task=classification), 0 significant, 6000 padded
2025-11-09 15:29:41,338 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infini

In [10]:
print("FROM THE 6000 Methylation feature selection:\n")
print(f"Anova-F & variance selection share: {len(meth_inter1)} features")
print(f"Random Forest & variance selection share: {len(meth_inter2)} features")
print(f"Anova-F & Random Forest share: {len(meth_inter3)} features")
print(f"All three methods agree on: {len(meth_all_three)} features")

print("\nFROM THE 6000 RNA feature selection:\n")
print(f"Anova-F & variance selection share: {len(rna_inter1)} features")
print(f"Random Forest & variance selection share: {len(rna_inter2)} features")
print(f"Anova-F & Random Forest share: {len(rna_inter3)} features")
print(f"All three methods agree on: {len(rna_all_three)} features")

print("\nFROM THE 6000 CNV Amplification feature selection:\n")
print(f"Anova-F & variance selection share: {len(cnv_amp_inter1)} features")
print(f"Random Forest & variance selection share: {len(cnv_amp_inter2)} features")
print(f"Anova-F & Random Forest share: {len(cnv_amp_inter3)} features")
print(f"All three methods agree on: {len(cnv_amp_all_three)} features")

print("\nFROM THE 6000 CNV Deletion feature selection:\n")
print(f"Anova-F & variance selection share: {len(cnv_del_inter1)} features")
print(f"Random Forest & variance selection share: {len(cnv_del_inter2)} features")
print(f"Anova-F & Random Forest share: {len(cnv_del_inter3)} features")
print(f"All three methods agree on: {len(cnv_del_all_three)} features")

FROM THE 6000 Methylation feature selection:

Anova-F & variance selection share: 1416 features
Random Forest & variance selection share: 1069 features
Anova-F & Random Forest share: 1150 features
All three methods agree on: 279 features

FROM THE 6000 RNA feature selection:

Anova-F & variance selection share: 1910 features
Random Forest & variance selection share: 1815 features
Anova-F & Random Forest share: 1910 features
All three methods agree on: 589 features

FROM THE 6000 CNV Amplification feature selection:

Anova-F & variance selection share: 741 features
Random Forest & variance selection share: 2310 features
Anova-F & Random Forest share: 967 features
All three methods agree on: 216 features

FROM THE 6000 CNV Deletion feature selection:

Anova-F & variance selection share: 1361 features
Random Forest & variance selection share: 2470 features
Anova-F & Random Forest share: 1600 features
All three methods agree on: 518 features


## Feature Selection Summary: ANOVA-RF Intersection

The final set of features was determined by the **intersection** of those highlighted by the **ANOVA F-test** and **Random Forest Importance**. This combination keeps features that show both class separability (ANOVA) and predictive value in a non-linear model (Random Forest). The resulting feature pool is the input to the subsequent modeling tasks.

### Feature Overlap Results

The table below quantifies the shared features identified by the different selection techniques for each omics type.

| Omics Data Type | ANOVA-F & Variance | RF & Variance | ANOVA-F & Random Forest (Selected) | All Three Agree |
| :--- | :--- | :--- | :--- | :--- |
| **Methylation** | 1,416 features | 1,069 features | **1,150 features** | 279 features |
| **RNA** | 1,910 features | 1,815 features | **1,910 features** | 589 features |
| **CNV Amplification** | 741 features | 2,310 features | **967 features** | 216 features |
| **CNV Deletion** | 1,361 features | 2,470 features | **1,600 features** | 518 features |

In [11]:
# Subset each omics dataframe using the selected feature lists
X_meth_selected = X_meth[meth_inter3]
X_rna_selected = X_rna[rna_inter3]
X_cnv_amp_selected = X_cnv_amp[cnv_amp_inter3]
X_cnv_del_selected = X_cnv_del[cnv_del_inter3]

# Clinical data is low-feature and kept as is
clinical_selected = clinical

print("\nFinal Shapes for Modeling")
print(f"Methylation (X_meth_selected): {X_meth_selected.shape}")
print(f"RNA-Seq (X_rna_selected): {X_rna_selected.shape}")
print(f"CNV Amplification (X_cnv_amp_selected): {X_cnv_amp_selected.shape}")
print(f"CNV Deletion (X_cnv_del_selected): {X_cnv_del_selected.shape}")
print(f"Clinical (clinical_selected): {clinical_selected.shape}")
print(f"Labels (Y_labels): {Y_labels.shape}")


Final Shapes for Modeling
Methylation (X_meth_selected): (177, 1150)
RNA-Seq (X_rna_selected): (177, 1910)
CNV Amplification (X_cnv_amp_selected): (177, 967)
CNV Deletion (X_cnv_del_selected): (177, 1600)
Clinical (clinical_selected): (177, 19)
Labels (Y_labels): (177, 1)


In [12]:
# Before saving the data, make sure there are no missing values

# check NaNs in each omics dataset
print(f"NaNs in dna_meth: {X_meth_selected.isna().sum().sum()}")
print(f"NaNs in rna: {X_rna_selected.isna().sum().sum()}")
print(f"NaNs in cnv_amp: {X_cnv_amp_selected.isna().sum().sum()}")
print(f"NaNs in cnv_del: {X_cnv_del_selected.isna().sum().sum()}")

# Impute missing values using BioNeuralNet KNN imputation
X_meth_selected = bnn.utils.impute_omics_knn(X_meth_selected, n_neighbors=5)
print(f"\nNaNs in dna_meth after: {X_meth_selected.isna().sum().sum()}")

2025-11-09 15:30:04,993 - bioneuralnet.utils.data - INFO - Starting KNN imputation (k=5) on DataFrame (shape: (177, 1150)).
2025-11-09 15:30:05,004 - bioneuralnet.utils.data - INFO - KNN imputation complete


NaNs in dna_meth: 25
NaNs in rna: 0
NaNs in cnv_amp: 0
NaNs in cnv_del: 0

NaNs in dna_meth after: 0
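
For readers unfamiliar with KNN imputation, the sketch below uses scikit-learn's `KNNImputer` on a tiny invented table. Whether `bnn.utils.impute_omics_knn` wraps this class is an assumption; the example only illustrates the technique, in which each missing entry is filled from the same feature in the most similar patients.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy matrix: patients in rows, features in columns, two missing entries.
X_toy = pd.DataFrame(
    {
        "f1": [0.1, 0.2, np.nan, 0.4],
        "f2": [1.0, 1.1, 1.2, 1.3],
        "f3": [5.0, 5.1, 5.2, np.nan],
    },
    index=["P1", "P2", "P3", "P4"],
)

# Distances between patients are computed on the features both patients have observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = pd.DataFrame(
    imputer.fit_transform(X_toy), index=X_toy.index, columns=X_toy.columns
)

print(X_filled.isna().sum().sum())
```

Observed values are left untouched; only the `NaN` cells are replaced.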


## Data Availability

To facilitate rapid experimentation and reproduction of our results, the fully processed and feature-selected dataset used in this analysis ships with the package.

Loading it bypasses all preceding data acquisition, preprocessing, and feature selection steps, so the analysis can continue directly from here.

In [1]:
import bioneuralnet as bnn

tgca_paad = bnn.datasets.DatasetLoader("paad")
display(tgca_paad.shape)

# The dataset is returned as a dictionary; each table is extracted independently by its name (key).
cnv_amp = tgca_paad.data["cnv_amp"]
cnv_del = tgca_paad.data["cnv_del"]
clinical = tgca_paad.data["clinical"]
target = tgca_paad.data["target"]
dna_meth = tgca_paad.data["meth"]
rna = tgca_paad.data["rna"]

{'cnv_amp': (177, 967),
 'cnv_del': (177, 1600),
 'target': (177, 1),
 'clinical': (177, 19),
 'rna': (177, 1910),
 'meth': (177, 1150)}

In [2]:
print(clinical.columns)

Index(['years_to_birth', 'vital_status', 'days_to_death',
       'days_to_last_followup', 'tumor_tissue_site', 'pathologic_stage',
       'pathology_T_stage', 'pathology_N_stage', 'pathology_M_stage', 'gender',
       'date_of_initial_pathologic_diagnosis', 'radiation_therapy',
       'histological_type', 'number_pack_years_smoked',
       'year_of_tobacco_smoking_onset', 'residual_tumor',
       'number_of_lymph_nodes', 'race', 'ethnicity'],
      dtype='object')


In [3]:
# BioNeuralNet provides a preprocessing function to handle clinical data
clinical = tgca_paad.data["clinical"]

# for more details on the preprocessing functions, see `bioneuralnet.utils.preprocess`
clinical_preprocessed = bnn.utils.preprocess_clinical(
    clinical, 
    target, 
    top_k=7, 
    scale=False, 
    ignore_columns=[
        # we ignore the following to avoid data leakage
        "vital_status",
        "days_to_death",
        "days_to_last_followup",
        "tumor_tissue_site",
        "histological_type",
        "date_of_initial_pathologic_diagnosis",
        "year_of_tobacco_smoking_onset"
    ])

display(clinical_preprocessed.iloc[:3,:5])

2025-11-09 19:19:59,488 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-09 19:19:59,488 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 125 NaNs after median imputation
2025-11-09 19:19:59,488 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-09 19:19:59,566 - bioneuralnet.utils.preprocess - INFO - Selected top 7 features by RandomForest importance


Unnamed: 0_level_0,years_to_birth,number_of_lymph_nodes,number_pack_years_smoked,gender_male,pathology_M_stage_mx
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TCGA-2J-AAB1,65,7.0,25.0,True,False
TCGA-2J-AAB4,48,0.0,25.0,True,False
TCGA-2J-AAB6,75,0.0,25.0,True,False
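
Two ideas in the clinical step above are excluding columns that reveal the label and turning categorical columns into indicator columns such as `gender_male`. The sketch below shows both with plain pandas on invented rows. Using `pd.get_dummies` with `drop_first=True` is an assumption about the encoding, and the feature ranking done by `preprocess_clinical` is not reproduced here.

```python
import pandas as pd

clinical_toy = pd.DataFrame(
    {
        "years_to_birth": [65, 48, 75],
        "gender": ["male", "male", "female"],
        "vital_status": [1, 0, 1],             # effectively the OS label itself
        "days_to_death": [66.0, None, 293.0],  # only filled in for deceased patients
    },
    index=["P1", "P2", "P3"],
)

# Drop columns that would let a model read the outcome directly.
leaky_columns = ["vital_status", "days_to_death"]
features_toy = clinical_toy.drop(columns=leaky_columns)

# One-hot encode the categorical column, dropping the first level to avoid redundancy.
encoded_toy = pd.get_dummies(features_toy, columns=["gender"], drop_first=True)
print(encoded_toy)
```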


In [4]:
import pandas as pd
import bioneuralnet as bnn

X_train_full = pd.concat([dna_meth, rna, cnv_amp, cnv_del], axis=1)

print(f"Nan values in X_train_full: {X_train_full.isna().sum().sum()}")
print(f"X_train_full shape: {X_train_full.shape}")

A_train = bnn.utils.gen_similarity_graph(X_train_full, k=15)
print(f"\nNetwork shape: {A_train.shape}")

Nan values in X_train_full: 0
X_train_full shape: (177, 5627)

Network shape: (5627, 5627)
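
Note that the network has one node per feature (1150 + 1910 + 967 + 1600 = 5627), not one per patient. The sketch below builds a comparable feature-by-feature k-nearest-neighbour graph with NumPy. The similarity measure (absolute Pearson correlation) and the symmetrization rule are assumptions chosen for illustration; the exact behaviour of `bnn.utils.gen_similarity_graph` is not documented in this notebook, and `knn_feature_graph` is a hypothetical name.

```python
import numpy as np

def knn_feature_graph(X: np.ndarray, k: int) -> np.ndarray:
    """Build a symmetric features-by-features adjacency from a samples-by-features matrix."""
    centered = X - X.mean(axis=0)
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0                  # guard against constant columns
    unit = centered / norms
    sim = np.abs(unit.T @ unit)              # absolute Pearson correlation
    np.fill_diagonal(sim, 0.0)               # no self-loops

    # Keep only the k strongest neighbours of every feature.
    top_k = np.argsort(sim, axis=1)[:, -k:]
    rows = np.arange(sim.shape[0])[:, None]
    adjacency = np.zeros_like(sim)
    adjacency[rows, top_k] = sim[rows, top_k]

    # An edge is kept if either endpoint selected it.
    return np.maximum(adjacency, adjacency.T)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 6))            # 20 samples, 6 features
A_demo = knn_feature_graph(X_demo, k=2)
print(A_demo.shape)
```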


In [None]:
# import os
# import random
# import logging 
# import warnings 
# import numpy as np
# import torch
# import ray
# import bioneuralnet as bnn
# from bioneuralnet.utils import logger 

# os.environ["CUDA_VISIBLE_DEVICES"] = "0"


# if not ray.is_initialized():

#     # 3. Set the ray init logging level to INFO
#     ray.init(logging_level=logging.INFO) 
    
#     # Ignore common warnings
#     warnings.filterwarnings("ignore", category=UserWarning)
#     warnings.filterwarnings("ignore", category=DeprecationWarning)


## Reproducibility and Seeding

To make the experimental results reproducible, a single global seed is set at the beginning of the analysis.

The `bnn.utils.set_seed` utility propagates the seed to all sources of randomness, including `random`, `numpy`, and `torch` (for both CPU and GPU). Critically, it also configures the PyTorch cuDNN backend to use deterministic algorithms.

**For each DPMON outer iteration, the seed is incremented (`SEED + r`) to generate a different internal train/test split.**
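
A seeding helper of this kind typically looks like the sketch below. It is a hypothetical stand-in (`set_global_seed`), not the source of `bnn.utils.set_seed`; NumPy and PyTorch are imported lazily so the example also runs where they are not installed.

```python
import random

def set_global_seed(seed: int) -> None:
    """Seed the common random number generators (illustrative helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)                    # CPU generator
        torch.cuda.manual_seed_all(seed)           # all GPUs; ignored without CUDA
        torch.backends.cudnn.deterministic = True  # prefer deterministic kernels
        torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
    except ImportError:
        pass

# Re-seeding with the same value reproduces the same random stream.
set_global_seed(118)
first = random.random()
set_global_seed(118)
second = random.random()
print(first == second)

# One distinct seed per outer iteration, as in the loop further below.
iteration_seeds = [118 + r for r in range(3)]
```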

In [None]:
import bioneuralnet as bnn

SEED = 118
bnn.utils.set_seed(SEED)

2025-11-09 19:20:00,656 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 118
2025-11-09 19:20:00,657 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-09 19:20:00,657 - bioneuralnet.utils.data - INFO - Seed setting complete


## Classification using DPMON: Training and Evaluation

* Run several outer iterations (`n_repeats` in the code below, currently set to 5), each with a different seed.
* Each iteration performs hyperparameter tuning.
* After tuning, train `repeat_num = 3` models with the best hyperparameters.
* Collect predictions from the best model of each iteration.
* Compute **Accuracy**, **F1 Weighted**, and **F1 Macro +/- standard deviation** across iterations.

This demonstrates the **end-to-end BioNeuralNet pipeline** in action.
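
The two F1 variants reported here differ only in how per-class scores are averaged: macro F1 is the plain mean over classes, while weighted F1 weights each class by its number of true samples. A small worked example in plain Python, using invented labels:

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 1]

labels = sorted(set(y_true_toy))
f1_by_class = {c: per_class_f1(y_true_toy, y_pred_toy, c) for c in labels}
support = {c: y_true_toy.count(c) for c in labels}

f1_macro = sum(f1_by_class.values()) / len(labels)
f1_weighted = sum(f1_by_class[c] * support[c] for c in labels) / len(y_true_toy)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

print(f1_by_class)  # {0: 0.5, 1: 0.75}
print(f1_macro)     # 0.625
```

With nearly balanced classes, as in this cohort (93 vs. 84), the macro and weighted scores stay close to each other.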

### Analysis of Hyperparameter Optimization

The table below lists the best hyperparameter configuration found for each of the three GNN architectures.

| Parameter | SAGE (GraphSAGE) | GCN (Graph Convolutional) | GAT (Graph Attention) |
| :--- | :--- | :--- | :--- |
| **gnn_layer_num** | 2 | 8 | 2 |
| **gnn_hidden_dim** | 128 | 32 | 128 |
| **lr (Learning Rate)** | 0.005197 | 0.000401 | 0.005197 |
| **weight_decay** | 0.046079 | 0.007823 | 0.046079 |
| **nn_hidden_dim1** | 16 | 32 | 16 |
| **nn_hidden_dim2** | 32 | 32 | 32 |
| **num_epochs** | 4096 | 2048 | 4096 |

### Results

| Model | Accuracy | F1 Weighted | F1 Macro |
| :--- | :--- | :--- | :--- |
| SAGE | 0.8512 ± 0.1985 | 0.8129 ± 0.2528 | 0.8083 ± 0.2592 |
| GCN  | 0.9962 ± 0.0053 | 0.9962 ± 0.0053 | 0.9962 ± 0.0053 |
| GAT  | 0.9171 ± 0.1172 | 0.9129 ± 0.1232 | 0.9138 ± 0.1220 |


In [None]:
from sklearn.metrics import f1_score, accuracy_score
from bioneuralnet.downstream_task import DPMON
from pathlib import Path
import numpy as np

output_dir_base_sage = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_SAGE_FINAL/paad")
target = target.rename(columns={"target": "phenotype"})

n_repeats = 5
all_preds = []

for r in range(n_repeats):
    bnn.utils.set_seed(SEED+r)
    dpmon_repeat = DPMON(
        adjacency_matrix=A_train,
        omics_list=[dna_meth, rna, cnv_amp, cnv_del],
        phenotype_data=target,
        clinical_data=clinical_preprocessed,
        repeat_num=3,
        model='SAGE',
        tune=True,
        gpu=True,
        cuda=0,
        output_dir=output_dir_base_sage,
    )
    
    predictions_df, _ = dpmon_repeat.run()
    all_preds.append(predictions_df["Predicted"].values)

all_preds = np.array(all_preds)


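# Score each repeat against the true labels.
# Assumes predictions_df rows are in the same sample order as `target`.
y_true = target["phenotype"].values
f1_macro_list = [f1_score(y_true, pred, average='macro') for pred in all_preds]
f1_weighted_list = [f1_score(y_true, pred, average='weighted') for pred in all_preds]
accuracy_list = [accuracy_score(y_true, pred) for pred in all_preds]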

avg_f1_macro = np.mean(f1_macro_list)
std_f1_macro = np.std(f1_macro_list)

avg_f1_weighted = np.mean(f1_weighted_list)
std_f1_weighted = np.std(f1_weighted_list)

avg_acc = np.mean(accuracy_list)
std_acc = np.std(accuracy_list)

print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
print(f"F1 Weighted: {avg_f1_weighted:.4f} +/- {std_f1_weighted:.4f}")
print(f"F1 Macro: {avg_f1_macro:.4f} +/- {std_f1_macro:.4f}")

2025-11-09 19:20:01,858 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 118
2025-11-09 19:20:01,859 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-09 19:20:01,859 - bioneuralnet.utils.data - INFO - Seed setting complete
2025-11-09 19:20:01,860 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_results_SAGE_FINAL/paad
2025-11-09 19:20:01,860 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-11-09 19:20:01,860 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-11-09 19:20:01,872 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-11-09 19:20:01,872 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-11-09 19:20:02,095 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 5627
2025-11-09 19:20:05,835 - bioneuralnet

Accuracy: 0.8512 +/- 0.1985
F1 Weighted: 0.8129 +/- 0.2528
F1 Macro: 0.8083 +/- 0.2592


In [None]:
output_dir_base_gcn = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GCN_FINAL/paad")
n_repeats = 5
all_preds = []

for r in range(n_repeats):
    bnn.utils.set_seed(SEED+r)
    dpmon_repeat = DPMON(
        adjacency_matrix=A_train,
        omics_list=[dna_meth, rna, cnv_amp, cnv_del],
        phenotype_data=target,
        clinical_data=clinical_preprocessed,
        repeat_num=3,
        model='GCN',
        tune=True,
        gpu=True,
        cuda=0,
        output_dir=output_dir_base_gcn,
    )
    
    predictions_df, _ = dpmon_repeat.run()
    all_preds.append(predictions_df["Predicted"].values)

all_preds = np.array(all_preds)


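# Score each repeat against the true labels.
# Assumes predictions_df rows are in the same sample order as `target`.
y_true = target["phenotype"].values
f1_macro_list = [f1_score(y_true, pred, average='macro') for pred in all_preds]
f1_weighted_list = [f1_score(y_true, pred, average='weighted') for pred in all_preds]
accuracy_list = [accuracy_score(y_true, pred) for pred in all_preds]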

avg_f1_macro = np.mean(f1_macro_list)
std_f1_macro = np.std(f1_macro_list)

avg_f1_weighted = np.mean(f1_weighted_list)
std_f1_weighted = np.std(f1_weighted_list)

avg_acc = np.mean(accuracy_list)
std_acc = np.std(accuracy_list)

print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
print(f"F1 Weighted: {avg_f1_weighted:.4f} +/- {std_f1_weighted:.4f}")
print(f"F1 Macro: {avg_f1_macro:.4f} +/- {std_f1_macro:.4f}")

2025-11-09 19:23:39,936 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 118
2025-11-09 19:23:39,937 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-09 19:23:39,937 - bioneuralnet.utils.data - INFO - Seed setting complete
2025-11-09 19:23:39,937 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GCN_FINAL/paad
2025-11-09 19:23:39,937 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-11-09 19:23:39,938 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-11-09 19:23:39,945 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-11-09 19:23:39,945 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-11-09 19:23:40,164 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 5627
2025-11-09 19:23:43,898 - bioneuraln

Accuracy: 0.9962 +/- 0.0053
F1 Weighted: 0.9962 +/- 0.0053
F1 Macro: 0.9962 +/- 0.0053


In [None]:
output_dir_base_gat = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GAT_FINAL/paad")

n_repeats = 5
all_preds = []

for r in range(n_repeats):
    bnn.utils.set_seed(SEED+r)
    dpmon_repeat = DPMON(
        adjacency_matrix=A_train,
        omics_list=[dna_meth, rna, cnv_amp, cnv_del],
        phenotype_data=target,
        clinical_data=clinical_preprocessed,
        repeat_num=3,
        model='GAT',
        tune=True,
        gpu=True,
        cuda=0,
        output_dir=output_dir_base_gat,
    )
    
    predictions_df, _ = dpmon_repeat.run()
    all_preds.append(predictions_df["Predicted"].values)

all_preds = np.array(all_preds)

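# Score each repeat against the true labels.
# Assumes predictions_df rows are in the same sample order as `target`.
y_true = target["phenotype"].values
f1_macro_list = [f1_score(y_true, pred, average='macro') for pred in all_preds]
f1_weighted_list = [f1_score(y_true, pred, average='weighted') for pred in all_preds]
accuracy_list = [accuracy_score(y_true, pred) for pred in all_preds]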

avg_f1_macro = np.mean(f1_macro_list)
std_f1_macro = np.std(f1_macro_list)

avg_f1_weighted = np.mean(f1_weighted_list)
std_f1_weighted = np.std(f1_weighted_list)

avg_acc = np.mean(accuracy_list)
std_acc = np.std(accuracy_list)

print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
print(f"F1 Weighted: {avg_f1_weighted:.4f} +/- {std_f1_weighted:.4f}")
print(f"F1 Macro: {avg_f1_macro:.4f} +/- {std_f1_macro:.4f}")


2025-11-09 19:26:32,516 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 118
2025-11-09 19:26:32,517 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-09 19:26:32,517 - bioneuralnet.utils.data - INFO - Seed setting complete
2025-11-09 19:26:32,518 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_cv_results_GAT_FINAL/paad
2025-11-09 19:26:32,518 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-11-09 19:26:32,518 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-11-09 19:26:32,524 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-11-09 19:26:32,524 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-11-09 19:26:32,748 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 5627
2025-11-09 19:26:36,495 - bioneuraln

Accuracy: 0.9171 +/- 0.1172
F1 Weighted: 0.9129 +/- 0.1232
F1 Macro: 0.9138 +/- 0.1220


## Values below are placeholders for now; the cells still need to be run.

In [None]:
import warnings 
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.exceptions import ConvergenceWarning
from scipy.stats import loguniform, randint

X = pd.concat([dna_meth, rna, cnv_amp, cnv_del, clinical_preprocessed], axis=1)
y = target['phenotype']
print(f"Successfully created X matrix with shape: {X.shape}")
print(f"Successfully created y vector with shape: {y.shape}")

all_results = {
    "LogisticRegression": {"acc": [], "f1_w": [], "f1_m": []},
    "MLP": {"acc": [], "f1_w": [], "f1_m": []},
    "XGBoost": {"acc": [], "f1_w": [], "f1_m": []},
}

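N_REPEATS = 5          # independent train/test splits, one per seed
TEST_SPLIT_SIZE = 0.7  # fraction of samples held out for testing (70% as written)
CV_FOLDS = 3           # folds used inside RandomizedSearchCV
N_ITER_SEARCH = 10     # parameter settings sampled per model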

pipe_lr = Pipeline([('scaler', StandardScaler()),
    ('model', LogisticRegression(
        solver='lbfgs',
        max_iter=1000,
        penalty=None 
    ))
])

pipe_mlp = Pipeline([('scaler', StandardScaler()),
    ('model', MLPClassifier(
        max_iter=500,
        early_stopping=True,
        n_iter_no_change=10
    ))
])

pipe_xgb = Pipeline([('scaler', StandardScaler()),
    ('model', XGBClassifier(
        eval_metric='logloss'
    ))
])

params_lr = {
    'model__penalty': ['l2'], 
    'model__C': loguniform(1e-4, 1e2)
}

params_mlp = {
    'model__hidden_layer_sizes': [(100,), (100, 50), (50, 50)],
    'model__activation': ['relu', 'tanh'],
    'model__alpha': loguniform(1e-5, 1e-1),
    'model__learning_rate_init': loguniform(1e-4, 1e-2)
}

params_xgb = {
    'model__n_estimators': randint(100, 500),
    'model__learning_rate': loguniform(0.01, 0.3),
    'model__max_depth': randint(3, 10),
    'model__subsample': [0.7, 0.8, 0.9, 1.0],
    'model__colsample_bytree': [0.7, 0.8, 0.9, 1.0]
}

models_to_tune = {
    "LogisticRegression": (pipe_lr, params_lr),
    "MLP": (pipe_mlp, params_mlp),
    "XGBoost": (pipe_xgb, params_xgb)
}

for r in range(N_REPEATS):
    seed = SEED + r
    print(f"\nRunning Repeat {r+1}/{N_REPEATS} (Seed: {seed})")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=TEST_SPLIT_SIZE, 
        random_state=seed, 
        stratify=y)
    
    cv_splitter = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=seed)
    for name, (pipeline, params) in models_to_tune.items():
        print(f"Tuning {name}")
        search = RandomizedSearchCV(
            estimator=pipeline,
            param_distributions=params,
            n_iter=N_ITER_SEARCH,
            cv=cv_splitter,
            scoring='f1_weighted',
            n_jobs=-1,
            random_state=seed
        )
        
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=ConvergenceWarning)
            warnings.filterwarnings("ignore", category=UserWarning)
            search.fit(X_train, y_train)

        print(f"Best params for {name}: {search.best_params_}")
        
        best_model = search.best_estimator_
        preds = best_model.predict(X_test)

        acc = accuracy_score(y_test, preds)
        f1_w = f1_score(y_test, preds, average='weighted', zero_division=0)
        f1_m = f1_score(y_test, preds, average='macro', zero_division=0)

        all_results[name]["acc"].append(acc)
        all_results[name]["f1_w"].append(f1_w)
        all_results[name]["f1_m"].append(f1_m)

print(f"Tuned Model Results (Averaged over {N_REPEATS} runs)")
print(f"(Tuning used {N_ITER_SEARCH} search iterations with {CV_FOLDS}-fold CV)")

for model_name, metrics in all_results.items():
    avg_acc = np.mean(metrics["acc"])
    std_acc = np.std(metrics["acc"])
    
    avg_f1_w = np.mean(metrics["f1_w"])
    std_f1_w = np.std(metrics["f1_w"])
    
    avg_f1_m = np.mean(metrics["f1_m"])
    std_f1_m = np.std(metrics["f1_m"])

    print(f"Results for {model_name}:")
    print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
    print(f"F1 Weighted: {avg_f1_w:.4f} +/- {std_f1_w:.4f}")
    print(f"F1 Macro: {avg_f1_m:.4f} +/- {std_f1_m:.4f}")


In [None]:
import bioneuralnet as bnn

gnn_plot_data = {
    "Accuracy": {
        "SAGE": (0.9903, 0.0173), 
        "GCN": (0.9720, 0.0379), 
        "GAT": (0.9626, 0.0575)
    },
    "F1 Weighted": {
        "SAGE": (0.9903, 0.0173), 
        "GCN": (0.9631, 0.0557), 
        "GAT": (0.9546, 0.0732)
    },
    "F1 Macro": {
        "SAGE": (0.9876, 0.0203), 
        "GCN": (0.9150, 0.1410), 
        "GAT": (0.9107, 0.1569)
    }
}

baseline_plot_data = {
    "Accuracy": {
        "SAGE": (0.9903, 0.0173), 
        "LogReg": (0.9553, 0.0090), 
        "XGBoost": (0.9527, 0.0059), 
        "MLP": (0.9362, 0.0133)
    },
    "F1 Weighted": {
        "SAGE": (0.9903, 0.0173), 
        "LogReg": (0.9557, 0.0088), 
        "XGBoost": (0.9529, 0.0058), 
        "MLP": (0.9379, 0.0125)
    },
    "F1 Macro": {
        "SAGE": (0.9876, 0.0203), 
        "LogReg": (0.9413, 0.0124), 
        "XGBoost": (0.9451, 0.0106), 
        "MLP": (0.9138, 0.0170)
    }
}


bnn.metrics.plot_multiple_metrics(
    gnn_plot_data,
    title_map={
        "Accuracy": "GNNs Comparison: Accuracy",
        "F1 Weighted": "GNNs Comparison: F1 Weighted",
        "F1 Macro": "GNNs Comparison: F1 Macro"
    }
)

bnn.metrics.plot_multiple_metrics(
    baseline_plot_data,
    title_map={
        "Accuracy": "SAGE vs. Baselines: Accuracy",
        "F1 Weighted": "SAGE vs. Baselines: F1 Weighted",
        "F1 Macro": "SAGE vs. Baselines: F1 Macro"
    }
)

