# TCGA-PAAD Analysis: *Overall Survival* classification

- **Cohort**: The **TCGA-PAAD** (Pancreatic Adenocarcinoma) cohort from The Cancer Genome Atlas.

- **Goal**: Classify patient survival from a multi-omics profile.
- **Prediction Target**: **Overall Survival (OS)** status, predicted from the patient's omics data (RNA, methylation, CNV) and clinical features.

**Data Sources:**
- Omics Data and Target: [https://xenabrowser.net/datapages/](https://xenabrowser.net/datapages/)
- Clinical Data: [Broad Institute FireHose](http://firebrowse.org/?cohort=PAAD)

In [None]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/PAAN")

cnv_gistic_raw = pd.read_csv(root / "Gistic2_CopyNumber_Gistic2_all_thresholded_by_genes.txt", sep="\t", index_col=0, low_memory=False)
rna_raw = pd.read_csv(root / "HiSeqV2.txt", sep="\t", index_col=0, low_memory=False)
meth_raw = pd.read_csv(root / "HumanMethylation450.txt", sep="\t", index_col=0, low_memory=False)
clinical_raw = pd.read_csv(root / "PAAD.clin.merged.picked.txt", sep="\t", index_col=0, low_memory=False)
target = pd.read_csv(root / "survival_PAAD_survival.txt", sep="\t", index_col=0, low_memory=False)
probe_map_meth = pd.read_csv(root / "probeMap_illuminaMethyl450_hg19_GPL16304_TCGAlegacy.txt", sep="\t", index_col=0, low_memory=False)

# display all shapes and first few rows of each dataset
display(cnv_gistic_raw.iloc[:5,:5])
display(cnv_gistic_raw.shape)

display(rna_raw.iloc[:5,:5])
display(rna_raw.shape)

display(meth_raw.iloc[:5,:5])
display(meth_raw.shape)
display(probe_map_meth.iloc[:5,:5])
display(probe_map_meth.shape)

display(clinical_raw.iloc[:5,:5])
display(clinical_raw.shape)

display(target.iloc[:5,:5])
display(target.shape)

In [None]:
import bioneuralnet as bnn
cnv_gistic_processed = cnv_gistic_raw.apply(pd.to_numeric, errors="coerce")

# collapse GISTIC to -1 / 0 / +1 per gene
cnv_signed = cnv_gistic_processed.copy()
cnv_signed[cnv_signed > 0] = 1
cnv_signed[cnv_signed < 0] = -1

cnv_signed_T = cnv_signed.T 
rna_numeric = rna_raw.apply(pd.to_numeric, errors="coerce")
rna_T = rna_numeric.T

# make numeric (beta values)
meth_numeric = meth_raw.apply(pd.to_numeric, errors="coerce")

# build gene map from probe_map_meth
gene_map = probe_map_meth[['gene']].copy()
gene_map = gene_map.dropna(subset=['gene'])
gene_map = gene_map[gene_map['gene'] != '.']

gene_map['gene'] = gene_map['gene'].str.split(',')
gene_map_exploded = gene_map.explode('gene')

meth_with_gene = meth_numeric.join(gene_map_exploded['gene'], how='inner')

data_cols = meth_with_gene.columns.drop('gene')
meth_gene_beta = (
    meth_with_gene
    .groupby('gene')[data_cols]
    .mean()
    .T
)

meth_gene_m = bnn.utils.beta_to_m(meth_gene_beta, eps=1e-8)

clinical = clinical_raw.copy()
clinical = clinical.T
clinical.index = clinical.index.str.upper().str.slice(0, 12)
clinical.index.name = "Patient_ID"
clinical = clinical[~clinical.index.duplicated(keep='first')]
clinical = clinical.drop(columns=["Hybridization REF", "Composite Element REF"], errors="ignore")

outcomes = target.copy()
outcomes = outcomes.set_index('_PATIENT')
outcomes.index = outcomes.index.str.upper().str.slice(0, 12)
outcomes.index.name = "Patient_ID"
outcomes = outcomes[~outcomes.index.duplicated(keep='first')]

cnv_signed_T.index = cnv_signed_T.index.str.upper().str.slice(0, 12)
rna_T.index = rna_T.index.str.upper().str.slice(0, 12)
meth_gene_m.index  = meth_gene_m.index.str.upper().str.slice(0, 12)

cnv_signed_T = cnv_signed_T.groupby(cnv_signed_T.index).mean()
rna_T = rna_T.groupby(rna_T.index).mean()
meth_gene_m  = meth_gene_m.groupby(meth_gene_m.index).mean()

common_patients = sorted(
    set(cnv_signed_T.index)
    & set(rna_T.index)
    & set(meth_gene_m.index)
    & set(outcomes.index)
)
X_cnv = cnv_signed_T.loc[common_patients]
X_rna = rna_T.loc[common_patients]
X_meth = meth_gene_m.loc[common_patients]
Y_labels = outcomes.loc[common_patients, "OS"]
clinical = clinical.loc[common_patients]

def clean_cols(df, prefix):
    cols = df.columns
    cols = cols.str.replace(r"\?", "unknown_", regex=True)
    cols = cols.str.replace(r"\|", "_", regex=True)
    cols = cols.str.replace("-", "_", regex=False)
    cols = cols.str.replace(r"_+", "_", regex=True)
    cols = cols.str.strip("_")
    df.columns = cols
    return df.add_prefix(prefix)

X_cnv = clean_cols(X_cnv, "cnv_")
X_rna = clean_cols(X_rna, "rna_")
X_meth = clean_cols(X_meth, "meth_")
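
The methylation step above converts gene-level beta values to M-values with `bnn.utils.beta_to_m`. As a rough illustration, the sketch below applies the standard logit transform, M = log2(beta / (1 - beta)), after clipping beta into [eps, 1 - eps]. The helper name `beta_to_m_sketch` and the clipping strategy are assumptions made for this example; they are not taken from BioNeuralNet's implementation.

```python
import numpy as np
import pandas as pd


def beta_to_m_sketch(beta: pd.DataFrame, eps: float = 1e-8) -> pd.DataFrame:
    """Hypothetical helper: logit-transform beta values (0..1) into M-values."""
    # Clip so that beta = 0 or beta = 1 does not produce -inf / +inf.
    clipped = beta.clip(lower=eps, upper=1.0 - eps)
    return np.log2(clipped / (1.0 - clipped))


# Toy gene-level beta matrix (patients x genes); all values are made up.
toy_beta = pd.DataFrame(
    {"GENE_A": [0.5, 0.8], "GENE_B": [0.2, 0.0]},
    index=["PATIENT_1", "PATIENT_2"],
)
toy_m = beta_to_m_sketch(toy_beta)
print(toy_m.round(3))
```

A beta of 0.5 maps to an M-value of 0, and the clipping keeps a beta of exactly 0 finite instead of producing negative infinity.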

In [None]:
display(X_cnv.iloc[:5,:5])
display(X_cnv.shape)

display(X_rna.iloc[:5,:5])
display(X_rna.shape)

display(X_meth.iloc[:5,:5])
display(X_meth.shape)

display(clinical.iloc[:5,:5])
display(clinical.shape)
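
The tables displayed above share one patient index because the preprocessing cell truncates each sample barcode to its first 12 characters, averages rows that collapse onto the same patient, and keeps only patients present in every table. A minimal sketch of those three steps follows; the barcodes, column names, and values are invented for illustration.

```python
import pandas as pd

# Toy sample-level table: two samples from one patient plus one other patient.
# Barcodes are invented and only loosely follow the TCGA layout.
toy = pd.DataFrame(
    {"gene_x": [1.0, 3.0, 5.0]},
    index=["tcga-aa-0001-01A", "TCGA-AA-0001-01B", "TCGA-BB-0002-01A"],
)

# Normalize case and keep the first 12 characters (the patient-level prefix).
toy.index = toy.index.str.upper().str.slice(0, 12)

# Average rows that now share a patient ID.
toy_by_patient = toy.groupby(toy.index).mean()

# Keep only patients that also appear in a (toy) outcome table.
toy_outcomes = pd.Series({"TCGA-AA-0001": 1, "TCGA-CC-0003": 0}, name="OS")
shared = sorted(set(toy_by_patient.index) & set(toy_outcomes.index))
print(toy_by_patient.loc[shared])
```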

## Feature Selection Methodology

### Supported Methods and Interpretation

**BioNeuralNet** provides three techniques for feature selection, allowing for different views of the data's statistical profile:

- **Variance Thresholding:** Identifies features with the **highest overall variance** across all samples.

- **ANOVA F-test:** Pinpoints features that best **distinguish between the target classes** (e.g., Alive vs. Deceased).

- **Random Forest Importance:** Assesses **feature utility** based on its contribution to a predictive non-linear model.
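
To make the difference between the first two criteria concrete, the sketch below ranks toy features by sample variance and by a hand-computed one-way ANOVA F statistic. The data are synthetic, and the helper names `top_k_variance` and `anova_f` are hypothetical; this is not how the BioNeuralNet utilities are implemented, and Random Forest importance is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 40 samples x 5 features with binary labels (synthetic values).
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 5))
X[:, 0] += 3.0 * y  # feature 0 separates the two classes
X[:, 1] *= 5.0      # feature 1 has large variance but no class signal


def top_k_variance(X, k):
    """Indices of the k columns with the largest sample variance."""
    return np.argsort(X.var(axis=0, ddof=1))[::-1][:k]


def anova_f(X, y):
    """One-way ANOVA F statistic per column (between / within mean squares)."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = len(y) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)


f_scores = anova_f(X, y)
print("top variance feature:", int(top_k_variance(X, 1)[0]))
print("top ANOVA feature:", int(np.argmax(f_scores)))
```

The two criteria can disagree: a feature with large variance need not carry any class signal, which is one reason to compare several selectors.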

### PAAD Cohort Selection Strategy

A dimensionality reduction step was essential for managing the high-feature-count omics data:

- **High-Feature Datasets:** DNA Methylation (34,013), RNA (20,530), and CNV (24,776) all required significant feature reduction.

- **Filtering Process:** The **top 6,000 features** were extracted from each high-feature omics dataset with each of the three methods.

- **Final Set:** A consensus set was built for each omics type from the intersection of the features selected by the ANOVA F-test and Random Forest Importance, so that retained features are both statistically relevant and useful to a predictive model.

- **Low-Feature Datasets:** The **Clinical** data (19 features) was passed through **without selection**, as its feature count was already manageable.
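
The consensus step reduces to set intersections over the selected column names. The sketch below uses hypothetical feature names to show the ANOVA and Random Forest intersection alongside the stricter three-way agreement.

```python
# Hypothetical feature names returned by each selector for one omics type.
anova_selected = ["rna_TP53", "rna_KRAS", "rna_SMAD4", "rna_CDKN2A"]
rf_selected = ["rna_KRAS", "rna_GATA6", "rna_SMAD4", "rna_MYC"]
variance_selected = ["rna_KRAS", "rna_MYC", "rna_TP53"]

anova_set = set(anova_selected)
rf_set = set(rf_selected)
var_set = set(variance_selected)

# Consensus: features chosen by both the ANOVA F-test and Random Forest.
consensus = sorted(anova_set & rf_set)

# Stricter alternative: features chosen by all three methods.
all_three = sorted(anova_set & rf_set & var_set)

print(consensus)
print(all_three)
```

Sorting the intersection is a design choice in this sketch: it fixes the column order, which plain set iteration does not guarantee across Python sessions.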

In [None]:
import bioneuralnet as bnn

print(f"METH before: {X_meth.shape}")
X_meth = bnn.impute_omics_knn(X_meth, n_neighbors=5)
print(f"METH after: {X_meth.shape}\n")

print(f"RNA before: {X_rna.shape}")
X_rna = bnn.impute_omics_knn(X_rna, n_neighbors=5)
print(f"RNA after: {X_rna.shape}\n")

print(f"CNV before: {X_cnv.shape}")
X_cnv = bnn.impute_omics_knn(X_cnv, n_neighbors=5)
print(f"CNV after: {X_cnv.shape}\n")


In [None]:
import bioneuralnet as bnn

meth_highvar = bnn.utils.select_top_k_variance(X_meth, k=6000)
meth_af = bnn.utils.top_anova_f_features(X_meth, Y_labels, max_features=6000)
meth_rf = bnn.utils.select_top_randomforest(X_meth, Y_labels, top_k=6000)

rna_highvar = bnn.utils.select_top_k_variance(X_rna, k=6000)
rna_af = bnn.utils.top_anova_f_features(X_rna, Y_labels, max_features=6000)
rna_rf = bnn.utils.select_top_randomforest(X_rna, Y_labels, top_k=6000)

cnv_highvar = bnn.utils.select_top_k_variance(X_cnv, k=6000)
cnv_af = bnn.utils.top_anova_f_features(X_cnv, Y_labels, max_features=6000)
cnv_rf = bnn.utils.select_top_randomforest(X_cnv, Y_labels, top_k=6000)

meth_var_set = set(meth_highvar.columns)
meth_anova_set = set(meth_af.columns)
meth_rf_set = set(meth_rf.columns)

rna_var_set = set(rna_highvar.columns)
rna_anova_set = set(rna_af.columns)
rna_rf_set = set(rna_rf.columns)

cnv_var_set = set(cnv_highvar.columns)
cnv_anova_set = set(cnv_af.columns)
cnv_rf_set = set(cnv_rf.columns)

# sorted() gives a deterministic feature order; the order of list(set) can vary between sessions
meth_inter1 = sorted(meth_anova_set & meth_var_set)
meth_inter2 = sorted(meth_rf_set & meth_var_set)
meth_inter3 = sorted(meth_anova_set & meth_rf_set)
meth_all_three = sorted(meth_anova_set & meth_var_set & meth_rf_set)

rna_inter1 = sorted(rna_anova_set & rna_var_set)
rna_inter2 = sorted(rna_rf_set & rna_var_set)
rna_inter3 = sorted(rna_anova_set & rna_rf_set)
rna_all_three = sorted(rna_anova_set & rna_var_set & rna_rf_set)

cnv_inter1 = sorted(cnv_anova_set & cnv_var_set)
cnv_inter2 = sorted(cnv_rf_set & cnv_var_set)
cnv_inter3 = sorted(cnv_anova_set & cnv_rf_set)
cnv_all_three = sorted(cnv_anova_set & cnv_var_set & cnv_rf_set)

In [None]:
print("FROM THE 6000 Methylation feature selection:\n")
print(f"Anova-F & variance selection share: {len(meth_inter1)} features")
print(f"Random Forest & variance selection share: {len(meth_inter2)} features")
print(f"Anova-F & Random Forest share: {len(meth_inter3)} features")
print(f"All three methods agree on: {len(meth_all_three)} features")

print("\nFROM THE 6000 RNA feature selection:\n")
print(f"Anova-F & variance selection share: {len(rna_inter1)} features")
print(f"Random Forest & variance selection share: {len(rna_inter2)} features")
print(f"Anova-F & Random Forest share: {len(rna_inter3)} features")
print(f"All three methods agree on: {len(rna_all_three)} features")

print("\nFROM THE 6000 CNV feature selection:\n")
print(f"Anova-F & variance selection share: {len(cnv_inter1)} features")
print(f"Random Forest & variance selection share: {len(cnv_inter2)} features")
print(f"Anova-F & Random Forest share: {len(cnv_inter3)} features")
print(f"All three methods agree on: {len(cnv_all_three)} features")

## Feature Selection Summary: ANOVA-RF Intersection

The final feature set for each omics type is the **intersection** of the features selected by the **ANOVA F-test** and by **Random Forest Importance**. This filter keeps features that both separate the target classes (ANOVA) and contribute to a non-linear predictive model (Random Forest). The resulting features are carried forward to the modeling steps.

### Feature Overlap Results

The table below quantifies the shared features identified by the different selection techniques for each omics type.

| Omics Data Type | ANOVA-F & Variance | RF & Variance | ANOVA-F & Random Forest (Selected) | All Three Agree |
| :--- | :--- | :--- | :--- | :--- |
| **DNA Methylation** | 1,410 features | 1,076 features | **1,152 features** | 280 features |
| **RNA** | 1,910 features | 1,815 features | **1,910 features** | 589 features |
| **CNV** | 1,884 features | 1,516 features | **1,035 features** | 477 features |

In [None]:
# Subset each omics dataframe using the selected feature lists
X_meth_selected = X_meth[meth_inter3]
X_rna_selected = X_rna[rna_inter3]
X_cnv_selected = X_cnv[cnv_inter3]

print("\nFinal Shapes for Modeling")
print(f"Methylation (X_meth_selected): {X_meth_selected.shape}")
print(f"RNA-Seq (X_rna_selected): {X_rna_selected.shape}")
print(f"CNV (X_cnv_selected): {X_cnv_selected.shape}")
print(f"Clinical (clinical): {clinical.shape}")
print(f"Labels (Y_labels): {Y_labels.shape}")

## Data Availability

To support rapid experimentation and reproduction of these results, the fully processed, feature-selected dataset used in this analysis is bundled with the package.

Loading it bypasses all preceding data acquisition, preprocessing, and feature selection steps, so users can start directly from this point.

In [None]:
# For training purposes, variance selection is applied afterwards to further reduce the number of features
import bioneuralnet as bnn

tcga_paad = bnn.datasets.DatasetLoader("paad")
display(tcga_paad.shape)

# The dataset is returned as a dictionary. Each table is extracted independently by name (key).
cnv = tcga_paad["cnv"]
clinical = tcga_paad["clinical"]
target = tcga_paad["target"]
dna_meth = tcga_paad["meth"]
rna = tcga_paad["rna"]

display(cnv.iloc[:3,:5])
display(dna_meth.iloc[:3,:5])
display(rna.iloc[:3,:5])
display(clinical.iloc[:3,:5])
display(target.iloc[:2,:5])

In [None]:
# Variance selection
cnv = bnn.utils.select_top_k_variance(cnv, k=500)
dna_meth = bnn.utils.select_top_k_variance(dna_meth, k=500)
rna = bnn.utils.select_top_k_variance(rna, k=500)

In [None]:
from bioneuralnet.utils import preprocess_clinical

dataleak_columns = [
    "vital_status",
    "days_to_death",
    "days_to_last_followup",
    "date_of_initial_pathologic_diagnosis",
]

clinical_for_model = preprocess_clinical(
    clinical,
    top_k=10,
    scale=False,
    ignore_columns=dataleak_columns,
    nan_threshold=0.6,
)

display(clinical_for_model.iloc[:5,:5])
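
The `dataleak_columns` list above removes clinical fields that record the outcome or follow-up itself, so the classifier cannot read the label from its inputs. As a minimal sketch of the same idea in plain pandas, the example below drops those columns from an invented clinical table; it does not reproduce what `preprocess_clinical` does internally.

```python
import pandas as pd

# Toy clinical table; column names mirror those excluded above, values are invented.
toy_clinical = pd.DataFrame(
    {
        "years_to_birth": [65, 70, 58],
        "vital_status": [1, 0, 1],
        "days_to_death": [320.0, None, 150.0],
        "days_to_last_followup": [None, 900.0, None],
    },
    index=["P1", "P2", "P3"],
)
leak_cols = [
    "vital_status",
    "days_to_death",
    "days_to_last_followup",
    "date_of_initial_pathologic_diagnosis",
]

# errors="ignore" skips any listed column that is absent from this table.
toy_features = toy_clinical.drop(columns=leak_cols, errors="ignore")
print(list(toy_features.columns))
```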

## Reproducibility and Seeding

To make the experimental results reproducible, a single global seed is set before graph construction and model training.

`bnn.utils.set_seed` propagates the seed to all sources of randomness, including `random`, `numpy`, and `torch` (for both CPU and GPU). It also configures the PyTorch cuDNN backend to use deterministic algorithms.
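
For readers who want to see what such a utility typically does, here is a minimal sketch under stated assumptions. The function `set_seed_sketch` is hypothetical and is not BioNeuralNet's implementation; NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed_sketch(seed: int) -> None:
    """Hypothetical stand-in for a global seeding utility."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding with the same value replays the same random draws.
set_seed_sketch(1883)
first = [random.random() for _ in range(3)]
set_seed_sketch(1883)
second = [random.random() for _ in range(3)]
print(first == second)
```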


In [None]:
import bioneuralnet as bnn
import pandas as pd

SEED = 1883
bnn.utils.set_seed(SEED)

In [None]:
from bioneuralnet.utils import find_optimal_graph

omics_paad = pd.concat([cnv, dna_meth, rna], axis=1)

optimal_graph, best_params, results_df = find_optimal_graph(
    omics_data=omics_paad,
    y_labels=target,
    methods=['threshold', 'correlation','similarity','gaussian'],
    seed=SEED,
    verbose=False,
    trials=30,
    omics_list=[cnv, dna_meth, rna],
    centrality_mode="eigenvector",
)
display(optimal_graph.iloc[:5,:5])
display(best_params)

results_df.sort_values("score", ascending=False, inplace=True)
display(results_df)
# Example best_params from a previous run:
# {'k': 20, 'metric': 'cosine', 'mutual': False, 'linkage_mode': 'eigen_corr', 'epsilon': 0.0001}
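
The recorded parameters include `k`, `metric: 'cosine'`, and `mutual`, which suggests a k-nearest-neighbor similarity graph over features. The sketch below shows one generic way to build such a graph with NumPy. The function `knn_cosine_graph` is hypothetical, ignores the other recorded parameters, and should not be read as the internals of `find_optimal_graph`.

```python
import numpy as np
import pandas as pd


def knn_cosine_graph(omics: pd.DataFrame, k: int = 2, mutual: bool = False) -> pd.DataFrame:
    """Hypothetical sketch: feature-by-feature kNN graph from cosine similarity."""
    values = omics.to_numpy(dtype=float)
    # Normalize each feature (column) to unit length; guard against all-zero columns.
    norms = np.linalg.norm(values, axis=0)
    norms[norms == 0] = 1.0
    unit = values / norms
    sim = unit.T @ unit
    np.fill_diagonal(sim, -np.inf)  # a feature is never its own neighbor

    # Mark the k most similar features in each row.
    keep = np.zeros_like(sim, dtype=bool)
    top_k = np.argsort(sim, axis=1)[:, -k:]
    np.put_along_axis(keep, top_k, True, axis=1)

    # mutual=True keeps an edge only if both endpoints pick each other;
    # mutual=False keeps it if either endpoint does.
    keep = (keep & keep.T) if mutual else (keep | keep.T)
    np.fill_diagonal(sim, 0.0)
    # Edge weights are raw cosine similarities and may be negative.
    adjacency = np.where(keep, sim, 0.0)
    return pd.DataFrame(adjacency, index=omics.columns, columns=omics.columns)


rng = np.random.default_rng(1883)
toy_omics = pd.DataFrame(rng.normal(size=(30, 6)), columns=[f"feat_{i}" for i in range(6)])
toy_graph = knn_cosine_graph(toy_omics, k=2, mutual=False)
print(toy_graph.round(2))
```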

In [None]:
# Graph analysis: the optimal-graph search uses a proxy to evaluate graph properties, but the final
# graph is built without the target variable. The search only helps to find the best graph parameters.
from bioneuralnet.utils import graph_analysis

graph_analysis(optimal_graph, graph_name="Optimal")

In [None]:
from pathlib import Path
from bioneuralnet.downstream_task import DPMON

output_dir_base = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GAT_FINAL/paad")

current_output_dir = output_dir_base / "Gat_best_graph"
current_output_dir.mkdir(parents=True, exist_ok=True)

dpmon_params_base = {
    "adjacency_matrix": optimal_graph,
    "omics_list": omics_paad,
    "phenotype_data": target,
    "phenotype_col": "target",
    "clinical_data": clinical_for_model,
    "model": 'GAT',
    "tune": True, 
    "cv": True,   
    "n_folds": 5,
    'repeat_num': 1,
    "gpu": True,
    "cuda": 0,
    "seed": SEED,
    "output_dir": current_output_dir
}

dpmon_tuned = DPMON(**dpmon_params_base)
predictions_df, metrics, embeddings = dpmon_tuned.run()

graph_metrics = metrics

# Report the cross-validated average and standard deviation of each metric
for metric_name in ["Accuracy", "F1 Macro", "F1 Weighted", "Recall", "AUC", "AUPR"]:
    metric_row = graph_metrics.loc[graph_metrics['Metric'] == metric_name].iloc[0]
    print(f"{metric_name} (Avg +/- Std): {metric_row['Average']:.4f} +/- {metric_row['StdDev']:.4f}")

In [None]:
from bioneuralnet.downstream_task import DPMON


output_dir_base = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GCN_FINAL/paad")

current_output_dir = output_dir_base / "Gcn_best_graph"
current_output_dir.mkdir(parents=True, exist_ok=True)

dpmon_params_base = {
    "adjacency_matrix": optimal_graph,
    "omics_list": omics_paad,
    "phenotype_data": target,
    "phenotype_col": "target",
    "clinical_data": clinical_for_model,
    "model": 'GCN',
    "tune": True, 
    "cv": True,   
    "n_folds": 5,
    'repeat_num': 1,
    "gpu": True,
    "cuda": 0,
    "seed": SEED,
    "output_dir": current_output_dir
}

dpmon_tuned = DPMON(**dpmon_params_base)
predictions_df, metrics, embeddings = dpmon_tuned.run()
graph_metrics = metrics

# Report the cross-validated average and standard deviation of each metric
for metric_name in ["Accuracy", "F1 Macro", "F1 Weighted", "Recall", "AUC", "AUPR"]:
    metric_row = graph_metrics.loc[graph_metrics['Metric'] == metric_name].iloc[0]
    print(f"{metric_name} (Avg +/- Std): {metric_row['Average']:.4f} +/- {metric_row['StdDev']:.4f}")

In [None]:
from bioneuralnet.downstream_task import DPMON
output_dir_base = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GIN_FINAL/paad")


current_output_dir = output_dir_base / "gin_best_graph"
current_output_dir.mkdir(parents=True, exist_ok=True)

dpmon_params_base = {
    "adjacency_matrix": optimal_graph,
    "omics_list": omics_paad,
    "phenotype_data": target,
    "phenotype_col": "target",
    "clinical_data": clinical_for_model,
    "model": 'GIN',
    "tune": True, 
    "cv": True,   
    "n_folds": 5,
    'repeat_num': 1,
    "gpu": True,
    "cuda": 0,
    "seed": SEED,
    "output_dir": current_output_dir
}

dpmon_tuned = DPMON(**dpmon_params_base)
predictions_df, metrics, embeddings = dpmon_tuned.run()
graph_metrics = metrics

# Report the cross-validated average and standard deviation of each metric
for metric_name in ["Accuracy", "F1 Macro", "F1 Weighted", "Recall", "AUC", "AUPR"]:
    metric_row = graph_metrics.loc[graph_metrics['Metric'] == metric_name].iloc[0]
    print(f"{metric_name} (Avg +/- Std): {metric_row['Average']:.4f} +/- {metric_row['StdDev']:.4f}")

In [None]:
from bioneuralnet.downstream_task import DPMON

output_dir_base = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_SAGE_FINAL/paad")

current_output_dir = output_dir_base / "sage_best_graph"
current_output_dir.mkdir(parents=True, exist_ok=True)

dpmon_params_base = {
    "adjacency_matrix": optimal_graph,
    "omics_list": omics_paad,
    "phenotype_data": target,
    "phenotype_col": "target",
    "clinical_data": clinical_for_model,
    "model": 'SAGE',
    "tune": True, 
    "cv": True,   
    "n_folds": 5,
    'repeat_num': 1,
    "gpu": True,
    "cuda": 0,
    "seed": SEED,
    "output_dir": current_output_dir
}

dpmon_tuned = DPMON(**dpmon_params_base)
predictions_df, metrics, embeddings = dpmon_tuned.run()
graph_metrics = metrics

# Report the cross-validated average and standard deviation of each metric
for metric_name in ["Accuracy", "F1 Macro", "F1 Weighted", "Recall", "AUC", "AUPR"]:
    metric_row = graph_metrics.loc[graph_metrics['Metric'] == metric_name].iloc[0]
    print(f"{metric_name} (Avg +/- Std): {metric_row['Average']:.4f} +/- {metric_row['StdDev']:.4f}")

In [None]:
from sklearn.model_selection import StratifiedKFold, ParameterSampler, RepeatedStratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, average_precision_score
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from scipy.stats import loguniform, randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
import numpy as np
import pandas as pd

X = pd.concat([dna_meth, rna, cnv, clinical_for_model], axis=1)
y = target['target']
print(f"Successfully created X matrix with shape: {X.shape}")
print(f"Successfully created y vector with shape: {y.shape}")

pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(solver='lbfgs', max_iter=1000, penalty=None, random_state=SEED))
])

pipe_mlp = Pipeline([
    ('scaler', StandardScaler()),
    ('model', MLPClassifier(max_iter=500, early_stopping=True, n_iter_no_change=10, random_state=SEED))
])

pipe_xgb = Pipeline([
    ('scaler', StandardScaler()),
    ('model', XGBClassifier(eval_metric='logloss', tree_method='hist', max_bin=128, random_state=SEED))
])
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=SEED))
])

pipe_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC(probability=True, random_state=SEED))
])

pipe_dt = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state=SEED))
])

params_lr = {'model__penalty': ['l2'], 'model__C': loguniform(1e-4, 1e2)}

params_mlp = {
    'model__hidden_layer_sizes': [(100,), (100, 50), (50, 50)],
    'model__activation': ['relu', 'tanh'],
    'model__alpha': loguniform(1e-5, 1e-1),
    'model__learning_rate_init': loguniform(1e-4, 1e-2)
}
params_xgb = {
    'model__n_estimators': randint(50, 200),
    'model__learning_rate': loguniform(0.01, 0.3),
    'model__max_depth': randint(3, 7),
    'model__subsample': [0.8, 1.0], 
    'model__colsample_bytree': [0.8, 1.0]
}
params_rf = {
    'model__n_estimators': randint(100, 300),
    'model__max_depth': [10, 20, 30, None],
    'model__min_samples_split': randint(2, 10),
    'model__min_samples_leaf': randint(1, 5),
    'model__max_features': ['sqrt', 'log2']
}
params_svm = {
    'model__C': loguniform(1e-2, 1e3),
    'model__kernel': ['rbf', 'linear'],
    'model__gamma': ['scale', 'auto']
}

params_dt = {
    'model__max_depth': randint(3, 15),
    'model__min_samples_split': randint(2, 20),
    'model__criterion': ['gini', 'entropy']
}

models_to_tune = {
    "LogisticRegression": (pipe_lr, params_lr),
    "SVM": (pipe_svm, params_svm),
    "MLP": (pipe_mlp, params_mlp),
    "XGBoost": (pipe_xgb, params_xgb),
    "RandomForest": (pipe_rf, params_rf),
    "DecisionTree": (pipe_dt, params_dt),
}

all_results = {
    "LogisticRegression": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "MLP": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "XGBoost": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "RandomForest": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "SVM": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "DecisionTree": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
}


for model_name, (pipeline, param_dist) in models_to_tune.items():
    print(f"Evaluating model with nested CV: {model_name}")
    
    outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=SEED)
    
    # Outer folds estimate generalization; the inner folds (built per outer fold) select hyperparameters
    for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
        X_train_outer, X_test_outer = X.iloc[train_idx], X.iloc[test_idx]
        y_train_outer, y_test_outer = y.iloc[train_idx], y.iloc[test_idx]
        inner_cv = StratifiedKFold(n_splits=5, shuffle=True)
        
        best_score_fold = -np.inf
        best_params_fold = None
        # Whether to seed the sampler is a judgment call. A fixed seed draws the same hyperparameter candidates in every fold;
        # no seed draws different candidates per fold, which adds randomness and may yield better generalization.
        param_sampler = list(ParameterSampler(param_dist, n_iter=20))
        
        for params in param_sampler:
            inner_scores = []
            
            for inner_train_idx, inner_val_idx in inner_cv.split(X_train_outer, y_train_outer):
                X_train_inner = X_train_outer.iloc[inner_train_idx]
                X_val_inner = X_train_outer.iloc[inner_val_idx]
                y_train_inner = y_train_outer.iloc[inner_train_idx]
                y_val_inner = y_train_outer.iloc[inner_val_idx]
                
                inner_pipeline = clone(pipeline)
                inner_pipeline.set_params(**params)
                inner_pipeline.fit(X_train_inner, y_train_inner)
                
                y_val_pred = inner_pipeline.predict(X_val_inner)
                score = f1_score(y_val_inner, y_val_pred, average='weighted', zero_division=0)
                inner_scores.append(score)
            
            mean_score = np.mean(inner_scores)
            if mean_score > best_score_fold:
                best_score_fold = mean_score
                best_params_fold = params
        
        print(f"Outer fold {fold_idx}: best params (inner CV F1-W={best_score_fold:.4f})")
        print(f"{best_params_fold}")
        
        final_pipeline = clone(pipeline)
        final_pipeline.set_params(**best_params_fold)
        final_pipeline.fit(X_train_outer, y_train_outer)
        
        preds = final_pipeline.predict(X_test_outer)
        
        if hasattr(final_pipeline, "predict_proba"):
            proba = final_pipeline.predict_proba(X_test_outer)
        else:
            proba = None
        
        acc = accuracy_score(y_test_outer, preds)
        f1_w = f1_score(y_test_outer, preds, average='weighted', zero_division=0)
        f1_m = f1_score(y_test_outer, preds, average='macro', zero_division=0)
        recall = recall_score(y_test_outer, preds, average='macro', zero_division=0)
        
        auc = np.nan
        aupr = np.nan
        
        if proba is not None:
            try:
                if len(np.unique(y)) == 2:
                    auc = roc_auc_score(y_test_outer, proba[:, 1])
                    aupr = average_precision_score(y_test_outer, proba[:, 1])
                else:
                    auc = roc_auc_score(y_test_outer, proba, multi_class='ovr', average='macro')
                    y_test_bin = label_binarize(y_test_outer, classes=np.unique(y))
                    aupr = average_precision_score(y_test_bin, proba, average='weighted')
            except Exception:
                auc = np.nan
                aupr = np.nan

        print(f"Fold {fold_idx} results: Acc={acc:.4f}, F1-W={f1_w:.4f}, "
              f"F1-M={f1_m:.4f}, Recall={recall:.4f}, AUC={auc:.4f}, AUPR={aupr:.4f}")
        
        all_results[model_name]["acc"].append(acc)
        all_results[model_name]["f1_w"].append(f1_w)
        all_results[model_name]["f1_m"].append(f1_m)
        all_results[model_name]["recall"].append(recall)
        all_results[model_name]["auc"].append(auc)
        all_results[model_name]["aupr"].append(aupr)

print("\nFINAL BASELINE RESULTS\n")
for model_name, metrics in all_results.items():
    avg_acc = np.mean(metrics["acc"])
    std_acc = np.std(metrics["acc"])
    avg_f1_w = np.mean(metrics["f1_w"])
    std_f1_w = np.std(metrics["f1_w"])
    avg_f1_m = np.mean(metrics["f1_m"])
    std_f1_m = np.std(metrics["f1_m"])
    avg_recall = np.mean(metrics["recall"])
    std_recall = np.std(metrics["recall"])
    avg_auc = np.nanmean(metrics["auc"])
    std_auc = np.nanstd(metrics["auc"])
    avg_aupr = np.nanmean(metrics["aupr"])
    std_aupr = np.nanstd(metrics["aupr"])
    
    print(f"\n{model_name}:")
    print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
    print(f"F1 Weighted: {avg_f1_w:.4f} +/- {std_f1_w:.4f}")
    print(f"F1 Macro: {avg_f1_m:.4f} +/- {std_f1_m:.4f}")
    print(f"Recall: {avg_recall:.4f} +/- {std_recall:.4f}")
    print(f"AUC: {avg_auc:.4f} +/- {std_auc:.4f}")
    print(f"AUPR: {avg_aupr:.4f} +/- {std_aupr:.4f}")
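
The key property of the nested scheme above is that hyperparameter tuning never sees an outer test fold. The sketch below checks this on indices alone, using only the standard library. The splitter `k_fold_indices` is a simplified, non-stratified stand-in written for this example, not the scikit-learn splitter used in the cell.

```python
import random


def k_fold_indices(indices, n_splits, seed):
    """Shuffle indices and yield n_splits (train, test) index lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


n_samples = 20
leaks = 0
for outer_train, outer_test in k_fold_indices(range(n_samples), n_splits=5, seed=1883):
    # The inner loop only ever sees outer_train; outer_test stays untouched
    # until the tuned model is scored once.
    for inner_train, inner_val in k_fold_indices(outer_train, n_splits=4, seed=0):
        leaks += len(set(outer_test) & (set(inner_train) | set(inner_val)))

print("samples leaked from outer test folds into tuning:", leaks)
```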

In [None]:
import bioneuralnet as bnn
# Plotting function; the values below were entered manually

baseline_data = {
    "Accuracy": {
        "GIN": (0.7005, 0.0583),
        "GraphSAGE": (0.7005, 0.0934),
        "GAT": (0.6894, 0.0646),
        "GCN": (0.6608, 0.0469),
        "LogReg": (0.7062, 0.0560),
        "SVM": (0.6778, 0.0620),
        "XGBoost": (0.6779, 0.0721),
        "MLP": (0.6710, 0.0667),
        "Random Forest": (0.6665, 0.0830),
        "Decision Tree": (0.5715, 0.0654),
    },
    "F1 Weighted": {
        "GIN": (0.6973, 0.0601),
        "GraphSAGE": (0.6925, 0.0984),
        "GAT": (0.6774, 0.0701),
        "GCN": (0.6515, 0.0487),
        "LogReg": (0.7025, 0.0586),
        "SVM": (0.6740, 0.0630),
        "XGBoost": (0.6715, 0.0740),
        "MLP": (0.6678, 0.0690),
        "Random Forest": (0.6592, 0.0870),
        "Decision Tree": (0.5663, 0.0659),
    },
    "F1 Macro": {
        "GIN": (0.6966, 0.0602),
        "GraphSAGE": (0.6924, 0.0911),
        "GAT": (0.6762, 0.0629),
        "GCN": (0.6505, 0.0435),
        "LogReg": (0.7008, 0.0597),
        "SVM": (0.6725, 0.0637),
        "XGBoost": (0.6693, 0.0751),
        "MLP": (0.6662, 0.0697),
        "Random Forest": (0.6568, 0.0883),
        "Decision Tree": (0.5648, 0.0668),
    },
}

baseline_data_cont = {
    "Recall": {
        "GIN": (0.7002, 0.0589),
        "GraphSAGE": (0.7032, 0.0734),
        "GAT": (0.6883, 0.0553),
        "GCN": (0.6607, 0.0427),
        "LogReg": (0.7032, 0.0585),
        "SVM": (0.6755, 0.0631),
        "XGBoost": (0.6737, 0.0736),
        "MLP": (0.6686, 0.0682),
        "Random Forest": (0.6620, 0.0846),
        "Decision Tree": (0.5700, 0.0671),
    },
    "AUC": {
        "GIN": (0.7471, 0.0573),
        "GraphSAGE": (0.7482, 0.0772),
        "GAT": (0.7418, 0.0784),
        "GCN": (0.7390, 0.0779),
        "LogReg": (0.7905, 0.0590),
        "SVM": (0.7602, 0.0604),
        "XGBoost": (0.7454, 0.0852),
        "MLP": (0.7363, 0.0720),
        "Random Forest": (0.7509, 0.0813),
        "Decision Tree": (0.5802, 0.0747),
    },
    "AUPR": {
        "GIN": (0.7634, 0.0743),
        "GraphSAGE": (0.7812, 0.0722),
        "GAT": (0.7570, 0.0872),
        "GCN": (0.7661, 0.0739),
        "LogReg": (0.8042, 0.0689),
        "SVM": (0.7824, 0.0574),
        "XGBoost": (0.7624, 0.0836),
        "MLP": (0.7533, 0.0824),
        "Random Forest": (0.7770, 0.0734),
        "Decision Tree": (0.5869, 0.0559),
    },
}

bnn.metrics.plot_multiple_metrics(
    baseline_data,
    title_map={
        "Accuracy": "GNNs vs Baselines: Accuracy",
        "F1 Weighted": "GNNs vs Baselines: F1 Weighted",
        "F1 Macro": "GNNs vs Baselines: F1 Macro",
    }
)

bnn.metrics.plot_multiple_metrics(
    baseline_data_cont,
    title_map={
        "Recall": "GNNs vs Baselines: Recall",
        "AUC": "GNNs vs Baselines: AUC",
        "AUPR": "GNNs vs Baselines: AUPR",
    }
)