# TCGA-BRCA Demo

### Dataset Source

- **Omics Data**: [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)
- **Clinical and PAM50 Data**: [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)

### Dataset Overview

- **Original Data**:

    - **Methylation**: 20,107 × 885
    - **mRNA**: 18,321 × 1,212
    - **miRNA**: 503 × 1,189
    - **PAM50**: 1,087 × 1
    - **Clinical**: 1,098 × 101

- **PAM50 Subtype Counts** (769-patient intersection):

    - **LumA**: 419
    - **LumB**: 140
    - **Basal**: 130
    - **Her2**: 46
    - **Normal**: 34

### Patients in Every Dataset

- Total patients present in methylation, mRNA, miRNA, PAM50, and clinical: **769**

### Final Shapes (Per-Patient)

After aggregating multiple aliquots by mean, all modalities align on 769 patients:

- **Methylation**: 769 × 20,106
- **mRNA**: 769 × 18,321
- **miRNA**: 769 × 503
- **PAM50**: 769 × 1
- **Clinical**: 769 × 119

### Data Summary Table

| Stage                          | Clinical    | Methylation  | miRNA       | mRNA           | PAM50 (Subtype Counts)                                         | Notes                                   |
| ------------------------------ | ----------- | ------------ | ----------- | -------------- | -------------------------------------------------------------- | --------------------------------------- |
| **Original Raw Data**          | 1,098 × 101 | 20,107 × 885 | 503 × 1,189 | 18,321 × 1,212 | LumA: 509<br>LumB: 209<br>Basal: 192<br>Her2: 82<br>Normal: 40 | Raw FireHose & TCGAbiolinks files       |
| **Patient-Level Intersection** | 769 × 119   | 769 × 20,106 | 769 × 503   | 769 × 18,321   | LumA: 419<br>LumB: 140<br>Basal: 130<br>Her2: 46<br>Normal: 34 | Patients with complete data in all sets |

### Reference Links

- [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)
- [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)
- [Direct Download BRCA](http://firebrowse.org/?cohort=BRCA&download_dialog=true)


In [1]:
import pandas as pd
from pathlib import Path

# Local folder holding the FireHose and TCGAbiolinks downloads; adjust to your machine
root = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA")

# Raw FireHose tables: rows are features, columns are sample barcodes
mirna_raw = pd.read_csv(root / "BRCA.miRseq_RPKM_log2.txt", sep="\t", index_col=0, low_memory=False)
rna_raw = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t", index_col=0, low_memory=False)
meth_raw = pd.read_csv(root / "BRCA.meth.by_mean.data.txt", sep="\t", index_col=0, low_memory=False)
clinical_raw = pd.read_csv(root / "BRCA.clin.merged.picked.txt", sep="\t", index_col=0, low_memory=False)

# display all shapes and first few rows of each dataset
display(mirna_raw.iloc[:3,:5])
display(mirna_raw.shape)

display(rna_raw.iloc[:3,:5])
display(rna_raw.shape)

display(meth_raw.iloc[:3,:5])
display(meth_raw.shape)

display(clinical_raw.iloc[:3,:5])
display(clinical_raw.shape)

gene,TCGA-3C-AAAU-01,TCGA-3C-AALI-01,TCGA-3C-AALJ-01,TCGA-3C-AALK-01,TCGA-4H-AAAK-01
hsa-let-7a-1,13.129765,12.918069,13.012033,13.144697,13.411684
hsa-let-7a-2,14.117933,13.9223,14.010002,14.141721,14.413518
hsa-let-7a-3,13.147714,12.913194,13.028483,13.151281,13.420481


(503, 1189)

gene,TCGA-3C-AAAU-01,TCGA-3C-AALI-01,TCGA-3C-AALJ-01,TCGA-3C-AALK-01,TCGA-4H-AAAK-01
?|100133144,4.032489,3.211931,3.538886,3.595671,2.77543
?|100134869,3.692829,4.119273,3.206237,3.469873,3.850979
?|10357,5.704604,6.124231,7.26957,7.168565,6.395968


(18321, 1212)

Hybridization REF,TCGA-3C-AAAU-01,TCGA-3C-AALI-01,TCGA-3C-AALJ-01,TCGA-3C-AALK-01,TCGA-4H-AAAK-01
Composite Element REF,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value
A1BG,0.483716119676,0.637191226131,0.656092398242,0.615194471357,0.612080370511
A1CF,0.295827203492,0.458972998571,0.489725289638,0.625765223243,0.507736509665


(20107, 885)

Hybridization REF,tcga-5l-aat0,tcga-5l-aat1,tcga-a1-a0sp,tcga-a2-a04v,tcga-a2-a04y
Composite Element REF,value,value,value,value,value
years_to_birth,42,63,40,39,53
vital_status,0,0,0,1,0


(18, 1097)

## TCGAbiolinks: PAM50

This section uses the `TCGAbiolinks` R package to download the molecular subtype labels and the clinical data. The script installs `TCGAbiolinks` if it is missing and then loads it. It retrieves the PAM50 subtype labels with `TCGAquery_subtype()` and writes them to a CSV file. It then downloads the clinical data with `GDCquery_clinic()`, which already returns a data frame, and saves it as a second CSV file.

```R
  # Install TCGAbiolinks
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install("TCGAbiolinks")
  }

  # Load the library
  library(TCGAbiolinks)

  # Download PAM50 subtype labels
  pam50_df <- TCGAquery_subtype(tumor = "BRCA")[ , c("patient", "BRCA_Subtype_PAM50")]
  write.csv(pam50_df, file = "BRCA_PAM50_labels.csv", row.names = FALSE, quote = FALSE)

  # Download clinical data (GDCquery_clinic() returns a data frame directly)
  clin_df <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
  write.csv(clin_df, file = "BRCA_clinical_data.csv", row.names = FALSE, quote = FALSE)
```

In [2]:
pam50 = pd.read_csv(root /"BRCA_PAM50_labels.csv",index_col=0)
clinical_biolinks = pd.read_csv(root /"BRCA_clinical_data.csv",index_col=1)

display(pam50.iloc[:5,:5])
display(pam50.shape)
display(clinical_biolinks.iloc[:5,:5])
display(clinical_biolinks.shape)

patient,BRCA_Subtype_PAM50
TCGA-3C-AAAU,LumA
TCGA-3C-AALI,Her2
TCGA-3C-AALJ,LumB
TCGA-3C-AALK,LumA
TCGA-4H-AAAK,LumA


(1087, 1)

submitter_id,project,synchronous_malignancy,ajcc_pathologic_stage,days_to_diagnosis,laterality
TCGA-A7-A0DC,TCGA-BRCA,No,Stage IA,0.0,
TCGA-D8-A1XG,TCGA-BRCA,No,Stage IIIB,0.0,Right
TCGA-AN-A0FY,TCGA-BRCA,No,Stage IA,0.0,Left
TCGA-B6-A0RN,TCGA-BRCA,No,Stage IA,0.0,Right
TCGA-AR-A1AJ,TCGA-BRCA,No,Stage I,0.0,Right


(1098, 101)

## Data Processing Summary

- **Transpose Data**: All raw omics and FireHose clinical tables are transposed so that rows represent patients and columns represent features.
- **Standardize Patient IDs**: Patient IDs in all tables are cleaned to the 12-character TCGA format (e.g., `TCGA-AB-1234`) for matching.
- **Handle Duplicates**: Duplicate patient rows are averaged in the omics data. The first entry is kept for duplicate patients in the clinical data.
- **Impute Missing Values (KNN)**: Missing data (NaNs) in the omics datasets are estimated and filled using **K-Nearest Neighbors (KNN)** imputation.
- **Find Common Patients**: The script identifies the largest common cohort of patients that exist in all datasets.
- **Subset Data**: All data tables are filtered down to only this common list of patients, ensuring perfect alignment.
- **Extract Target**: The **PAM50 subtype** column is pulled from the corresponding data table to be used as the final prediction target (y-variable).
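
The ID standardization, duplicate averaging, and intersection steps above can be sketched on a toy table. The barcodes and values below are made up for illustration; only the pandas calls mirror the processing described.

```python
import pandas as pd

# Toy omics table: rows are aliquot barcodes, columns are features (made-up values)
omics = pd.DataFrame(
    {"geneA": [1.0, 3.0, 5.0], "geneB": [2.0, 4.0, 6.0]},
    index=["TCGA-AB-1234-01A", "TCGA-AB-1234-01B", "TCGA-CD-5678-01A"],
)
# Toy label table indexed by 12-character patient IDs (made-up values)
labels = pd.DataFrame(
    {"subtype": ["LumA", "Basal"]},
    index=["TCGA-AB-1234", "TCGA-ZZ-0000"],
)

# Standardize IDs to the 12-character patient barcode
omics.index = omics.index.str.slice(0, 12)

# Average aliquots that now share a patient ID
omics = omics.groupby(omics.index).mean()

# Keep only patients present in both tables
common = sorted(set(omics.index) & set(labels.index))
omics, labels = omics.loc[common], labels.loc[common]

print(omics)
```

Here the two aliquots of `TCGA-AB-1234` collapse into one averaged row, and the patients missing from either table are dropped.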

In [3]:
import bioneuralnet as bnn

meth = meth_raw.T
rna = rna_raw.T
mirna = mirna_raw.T
clinical_firehose = clinical_raw.T


print(f"miRNA (samples, features): {mirna.shape}")
print(f"RNA (samples, features): {rna.shape}")
print(f"Methylation (samples, features): {meth.shape}")
print(f"Clinical (samples, features): {clinical_firehose.shape}")

def trim_barcode(idx):
    return idx.to_series().str.slice(0, 12)

# Standardize patient IDs across all files
meth.index = trim_barcode(meth.index)
rna.index = trim_barcode(rna.index)
mirna.index = trim_barcode(mirna.index)
clinical_firehose.index = clinical_firehose.index.str.upper()
clinical_firehose.index.name = "Patient_ID"

meth = meth.apply(pd.to_numeric, errors='coerce')
meth = meth.drop(columns=["Composite Element REF"], errors="ignore")

rna = rna.apply(pd.to_numeric, errors='coerce')
mirna = mirna.apply(pd.to_numeric, errors='coerce')

meth = meth.groupby(meth.index).mean()
rna = rna.groupby(rna.index).mean()
mirna = mirna.groupby(mirna.index).mean()

# For any duplicate rows in the clinical data, we keep the first occurrence
clinical = clinical_firehose[~clinical_firehose.index.duplicated(keep='first')]

for df in [meth, rna, mirna]:
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")

print(f"\nMethylation shape: {meth.shape}")
print(f"RNA shape: {rna.shape}")
print(f"miRNA shape: {mirna.shape}")
print(f"Clinical shape: {clinical.shape}")

# --- 3. KNN Imputation ---

# Rebuild the column indexes as plain pandas Index objects so the labels
# line up with the underlying data arrays before imputation
meth.columns = pd.Index(meth.columns.tolist())
rna.columns = pd.Index(rna.columns.tolist())
mirna.columns = pd.Index(mirna.columns.tolist())

# Impute missing values in each omics table with KNN (k=5)
meth = bnn.utils.data.impute_omics_knn(meth, n_neighbors=5)
rna = bnn.utils.data.impute_omics_knn(rna, n_neighbors=5)
mirna = bnn.utils.data.impute_omics_knn(mirna, n_neighbors=5)
# --- 4. Clinical Merge and Final Alignment ---

# Handle duplicate patient IDs in clinical data (Keep first occurrence)
clinical_biolinks = clinical_biolinks[~clinical_biolinks.index.duplicated(keep='first')]
clinical_firehose = clinical_firehose[~clinical_firehose.index.duplicated(keep='first')]

# Intersect patients common to both clinical sources and merge
common_clinical_patients = clinical_biolinks.index.intersection(clinical_firehose.index)
clinical_biolinks = clinical_biolinks.loc[common_clinical_patients]
clinical_firehose = clinical_firehose.loc[common_clinical_patients]
clinical = pd.concat([clinical_biolinks, clinical_firehose], axis=1)
clinical.index.name = "Patient_ID"


# Determine the final list of patients present in ALL datasets
common_patients = sorted(
    set(meth.index) & 
    set(rna.index) & 
    set(mirna.index) & 
    set(pam50.index) & 
    set(clinical.index)
)

print(f"\nFound: {len(common_patients)} patients across all data types.")

meth = meth.loc[common_patients]
rna = rna.loc[common_patients]
mirna = mirna.loc[common_patients]
pam50 = pam50.loc[common_patients]
clinical = clinical.loc[common_patients]


targets = pam50['BRCA_Subtype_PAM50'] 

print("\nFinal shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical: {clinical.shape}")
print(f"targets: {targets.shape}")


miRNA (samples, features): (1189, 503)
RNA (samples, features): (1212, 18321)
Methylation (samples, features): (885, 20107)
Clinical (samples, features): (1097, 18)

Methylation shape: (791, 20106)
RNA shape: (1093, 18321)
miRNA shape: (1079, 503)
Clinical shape: (1097, 18)


2025-11-12 23:39:27,784 - bioneuralnet.utils.data - INFO - Starting KNN imputation (k=5) on DataFrame (shape: (791, 20106)).
2025-11-12 23:39:30,051 - bioneuralnet.utils.data - INFO - KNN imputation complete
2025-11-12 23:39:30,423 - bioneuralnet.utils.data - INFO - Starting KNN imputation (k=5) on DataFrame (shape: (1093, 18321)).
2025-11-12 23:39:51,503 - bioneuralnet.utils.data - INFO - KNN imputation complete
2025-11-12 23:39:51,514 - bioneuralnet.utils.data - INFO - Starting KNN imputation (k=5) on DataFrame (shape: (1079, 503)).
2025-11-12 23:39:52,287 - bioneuralnet.utils.data - INFO - KNN imputation complete



Found: 769 patients across all data types.

Final shapes:
meth: (769, 20106)
rna: (769, 18321)
mirna: (769, 503)
pam50: (769, 1)
clinical: (769, 119)
targets: (769,)


In [4]:
# drop unwanted columns from clinical data
clinical.drop(columns=["Composite Element REF"], errors="ignore", inplace=True)

# we transform the methylation beta values to M-values and drop unwanted columns
meth_m = meth.drop(columns=["Composite Element REF"], errors="ignore")

# convert beta values to M-values using bioneuralnet utility with small epsilon to avoid log(0)
meth_m = bnn.utils.beta_to_m(meth_m, eps=1e-6) 

# lastly we turn the target labels into numerical classes
mapping_brca = {
    'LumA': 0, 
    'Her2': 1, 
    'LumB': 2, 
    'Basal': 3, 
    'Normal': 4
}
target_labels = targets.map(mapping_brca).to_frame(name="target")

X_meth = meth_m.loc[targets.index]
X_rna = rna.loc[targets.index]
X_mirna = mirna.loc[targets.index]
Y_labels = target_labels.loc[targets.index]
clinical_processed = clinical.loc[targets.index]


2025-11-12 23:39:52,362 - bioneuralnet.utils.data - INFO - Starting Beta-to-M value conversion (shape: (769, 20106)). Epsilon: 1e-06
2025-11-12 23:39:53,549 - bioneuralnet.utils.data - INFO - Beta-to-M conversion complete.
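
For reference, the M-value is conventionally defined as M = log2(β / (1 − β)). The sketch below applies that formula with NumPy. Clipping β to [eps, 1 − eps] is an assumption about how `eps` guards against log(0), and `beta_to_m_sketch` is a hypothetical helper, not the BioNeuralNet implementation.

```python
import numpy as np
import pandas as pd

def beta_to_m_sketch(beta: pd.DataFrame, eps: float = 1e-6) -> pd.DataFrame:
    """Hypothetical helper: base-2 logit transform of methylation beta values."""
    # Assumed handling of exact 0 and 1: clip into the open interval (0, 1)
    clipped = beta.clip(lower=eps, upper=1 - eps)
    return np.log2(clipped / (1 - clipped))

# Made-up beta values, including an exact zero
toy = pd.DataFrame({"A1BG": [0.5, 0.8], "A1CF": [0.2, 0.0]}, index=["p1", "p2"])
m_values = beta_to_m_sketch(toy)
print(m_values)
```

As a sanity check on the formula, the raw A1BG beta value of about 0.4837 shown earlier for TCGA-3C-AAAU maps to roughly −0.094, which agrees with the M-value displayed for that patient in the next cell.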


In [5]:
display(X_meth.iloc[:3,:5])
display(X_meth.shape)

display(X_rna.iloc[:3,:5])
display(X_rna.shape)

display(X_mirna.iloc[:3,:5])
display(X_mirna.shape)

display(clinical_processed.iloc[:3,:5])
display(clinical_processed.shape)

display(Y_labels.value_counts())


patient,A1BG,A1CF,A2BP1,A2LD1,A2M
TCGA-3C-AAAU,-0.094004,-1.251175,-2.113585,0.765262,0.345896
TCGA-3C-AALI,0.812517,-0.237291,-1.658888,0.99744,0.630221
TCGA-3C-AALJ,0.931878,-0.059301,-1.369104,1.628617,0.97213


(769, 20106)

patient,unknown_100133144,unknown_100134869,unknown_10357,unknown_10431,unknown_155060
TCGA-3C-AAAU,4.032489,3.692829,5.704604,8.672694,10.21311
TCGA-3C-AALI,3.211931,4.119273,6.124231,9.139279,9.011343
TCGA-3C-AALJ,3.538886,3.206237,7.26957,10.410275,9.209506


(769, 18321)

patient,hsa_let_7a_1,hsa_let_7a_2,hsa_let_7a_3,hsa_let_7b,hsa_let_7c
TCGA-3C-AAAU,13.129765,14.117933,13.147714,14.595135,8.41489
TCGA-3C-AALI,12.918069,13.9223,12.913194,14.512657,9.646536
TCGA-3C-AALJ,13.012033,14.010002,13.028483,13.419612,9.312455


(769, 503)

patient,project,synchronous_malignancy,ajcc_pathologic_stage,days_to_diagnosis,laterality
TCGA-3C-AAAU,TCGA-BRCA,No,Stage X,0.0,Left
TCGA-3C-AALI,TCGA-BRCA,No,Stage IIB,0.0,Right
TCGA-3C-AALJ,TCGA-BRCA,No,Stage IIB,0.0,Right


(769, 118)

target
0         419
2         140
3         130
1          46
4          34
Name: count, dtype: int64

## Feature Selection Methodology for BRCA

### Supported Methods and Interpretation

BioNeuralNet provides three techniques for feature selection, allowing for different views of the data's statistical profile:

- **Variance Thresholding**: Identifies features with the highest overall variance across all samples.
- **ANOVA F-test**: Pinpoints features that best distinguish between the target classes (LumA, LumB, Her2, Basal, and Normal).
- **Random Forest Importance**: Assesses feature utility based on its contribution to a predictive non-linear model.

### BRCA Cohort Selection Strategy

A dimensionality reduction step was essential for managing the high-feature-count omics data, given the complexity of the BRCA network:

- **High-Feature Datasets**: Both DNA Methylation (20,106 features) and RNA (18,321 features) required significant feature reduction.
- **Filtering Process**: The top 6,000 features were initially extracted from the Methylation and RNA datasets using all three methods.
- **Final Set**: A consensus set was built by finding the intersection of features selected by the ANOVA F-test and Random Forest Importance, ensuring both statistical relevance and model-based utility.
- **Low-Feature Datasets**: The miRNA data (503 features) was passed through without selection, as its feature count was already manageable.
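
The selection strategy above can be illustrated with plain scikit-learn on synthetic data. This is a minimal sketch, not the BioNeuralNet implementation: the data come from `make_classification`, the feature names and `k = 15` are arbitrary, and the random forest settings are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

# Synthetic stand-in for an omics matrix (not TCGA data)
X_arr, y = make_classification(
    n_samples=200, n_features=50, n_informative=8, n_classes=3,
    n_clusters_per_class=1, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])
k = 15  # number of features kept per method (illustrative)

# 1) Variance ranking: the k features with the largest variance
top_var = set(X.var().nlargest(k).index)

# 2) ANOVA F-test ranking: the k features with the largest F statistic
f_scores, _ = f_classif(X, y)
top_anova = set(pd.Series(f_scores, index=X.columns).nlargest(k).index)

# 3) Random forest ranking: the k features with the largest importance
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_rf = set(pd.Series(rf.feature_importances_, index=X.columns).nlargest(k).index)

# Consensus set: features chosen by both ANOVA and random forest
consensus = sorted(top_anova & top_rf)
print(f"ANOVA & RF share: {len(consensus)} features")
print(f"All three agree on: {len(top_anova & top_rf & top_var)} features")
```

Intersecting two rankings keeps only features that both criteria support, so the consensus set is never larger than `k` and is usually noticeably smaller.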

In [6]:
import bioneuralnet as bnn

# feature selection
meth_highvar = bnn.utils.select_top_k_variance(X_meth, k=6000)
rna_highvar = bnn.utils.select_top_k_variance(X_rna, k=6000)

meth_af = bnn.utils.top_anova_f_features(X_meth, Y_labels, max_features=6000)
rna_af = bnn.utils.top_anova_f_features(X_rna, Y_labels, max_features=6000)

meth_rf = bnn.utils.select_top_randomforest(X_meth, Y_labels, top_k=6000)
rna_rf = bnn.utils.select_top_randomforest(X_rna, Y_labels, top_k=6000)

meth_var_set = set(meth_highvar.columns)
meth_anova_set = set(meth_af.columns)
meth_rf_set = set(meth_rf.columns)

rna_var_set = set(rna_highvar.columns)
rna_anova_set = set(rna_af.columns)
rna_rf_set = set(rna_rf.columns)

meth_inter1 = list(meth_anova_set & meth_var_set)
meth_inter2 = list(meth_rf_set & meth_var_set)
meth_inter3 = list(meth_anova_set & meth_rf_set)
meth_all_three = list(meth_anova_set & meth_var_set & meth_rf_set)

rna_inter4 = list(rna_anova_set & rna_var_set)
rna_inter5 = list(rna_rf_set & rna_var_set)
rna_inter6 = list(rna_anova_set & rna_rf_set)
rna_all_three = list(rna_anova_set & rna_var_set & rna_rf_set)

2025-11-12 23:39:56,760 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-12 23:39:56,761 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-11-12 23:39:56,761 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-12 23:39:56,869 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-11-12 23:39:59,747 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-12 23:39:59,748 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-11-12 23:39:59,748 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-11-12 23:39:59,842 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-11-12 23:40:03,008 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-12 23:40:03,009 - bioneuralnet.

In [7]:
print("FROM THE 6000 Methylation feature selection:\n")
print(f"Anova-F & variance selection share: {len(meth_inter1)} features")
print(f"Random Forest & variance selection share: {len(meth_inter2)} features")
print(f"Anova-F & Random Forest share: {len(meth_inter3)} features")
print(f"All three methods agree on: {len(meth_all_three)} features")

FROM THE 6000 Methylation feature selection:

Anova-F & variance selection share: 2092 features
Random Forest & variance selection share: 1870 features
Anova-F & Random Forest share: 2203 features
All three methods agree on: 814 features


In [8]:
print("\nFROM THE 6000 RNA feature selection:\n")
print(f"Anova-F & variance selection share: {len(rna_inter4)} features")
print(f"Random Forest & variance selection share: {len(rna_inter5)} features")
print(f"Anova-F & Random Forest share: {len(rna_inter6)} features")
print(f"All three methods agree on: {len(rna_all_three)} features")


FROM THE 6000 RNA feature selection:

Anova-F & variance selection share: 2359 features
Random Forest & variance selection share: 2191 features
Anova-F & Random Forest share: 2500 features
All three methods agree on: 1124 features


## Feature Selection Summary: ANOVA-RF Intersection

The final feature set is the **intersection** of the features ranked in the top 6,000 by the **ANOVA F-test** and by **Random Forest Importance**. This filter keeps features that both separate the classes well (ANOVA) and contribute to a non-linear predictive model (Random Forest), yielding 2,203 methylation features and 2,500 RNA features for the subsequent modeling tasks.

### Feature Overlap Results

The table below quantifies the shared features identified by the different selection techniques for each omics type.

| Omics Data Type | ANOVA-F & Variance | RF & Variance | ANOVA-F & Random Forest (Selected) | All Three Agree |
| :--- | :--- | :--- | :--- | :--- |
| **Methylation** | 2,092 features | 1,870 features | **2,203 features** | 814 features |
| **RNA** | 2,359 features | 2,191 features | **2,500 features** | 1,124 features |

In [9]:
X_meth_selected = X_meth[meth_inter3]
X_rna_selected = X_rna[rna_inter6]

print("\nFinal Shapes for Modeling")
print(f"Methylation (X1): {X_meth_selected.shape}")
print(f"RNA-Seq (X2): {X_rna_selected.shape}")
print(f"miRNA-Seq (X3): {X_mirna.shape}")
print(f"Labels (Y): {Y_labels.shape}")



Final Shapes for Modeling
Methylation (X1): (769, 2203)
RNA-Seq (X2): (769, 2500)
miRNA-Seq (X3): (769, 503)
Labels (Y): (769, 1)


## Data Availability

To facilitate rapid experimentation and reproduction of our results, the fully processed and feature-selected dataset used in this analysis has been made available directly within the package.

Loading this dataset bypasses the data acquisition, preprocessing, and feature selection steps above, so users can start directly from the modeling stage.

In [None]:
out_dir = Path("/home/vicente/Github/BioNeuralNet/bioneuralnet/datasets/brca")
X_meth_selected.to_csv(out_dir / "meth.csv", index=True)
X_rna_selected.to_csv(out_dir / "rna.csv", index=True)
X_mirna.to_csv(out_dir / "mirna.csv", index=True)

clinical_processed.to_csv(out_dir / "clinical.csv", index=True)
Y_labels.to_csv(out_dir / "target.csv", index=True)

In [39]:
import bioneuralnet as bnn

tcga_brca = bnn.datasets.DatasetLoader("brca")
display(tcga_brca.shape)

dna_meth = tcga_brca.data["meth"]
rna = tcga_brca.data["rna"]
mirna = tcga_brca.data["mirna"]
clinical = tcga_brca.data["clinical"]
target = tcga_brca.data["target"]


{'mirna': (769, 503),
 'target': (769, 1),
 'clinical': (769, 118),
 'rna': (769, 2500),
 'meth': (769, 2203)}

In [40]:
cols_before = clinical.shape[1]

# Drop sparse clinical columns: keep a column only if it has at least
# `min_non_na` non-missing values (here, half the number of columns)
min_non_na = clinical.shape[1] // 2
clinical.dropna(inplace=True, axis=1, thresh=min_non_na)
cols_after = clinical.shape[1]
print(f"Columns dropped by dropna: {cols_before - cols_after}")
print(f"Final shape of clinical data: {clinical.shape}")


Columns dropped by dropna: 49
Final shape of clinical data: (769, 69)
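
In pandas, `thresh` combined with `axis=1` is the minimum number of non-missing values a column needs in order to be kept. The toy example below, with made-up column names and values, shows that behaviour.

```python
import numpy as np
import pandas as pd

# Made-up clinical-style table with different amounts of missing data per column
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],    # 3 non-missing values
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 1 non-missing value
    "complete": [1, 2, 3, 4],                     # 4 non-missing values
})

# Keep columns that have at least 3 non-missing values
kept = toy.dropna(axis=1, thresh=3)
print(list(kept.columns))
```

Because the threshold is a count of rows with data, a rule such as "observed in at least half of the patients" would be expressed relative to the number of rows, for example `thresh=len(toy) // 2`.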


In [41]:
import bioneuralnet as bnn

# for more details on the preprocessing functions, see `bioneuralnet.utils.preprocess`
clinical_preprocessed = bnn.utils.preprocess_clinical(
    clinical, 
    target, 
    top_k=7, 
    scale=False,
    ignore_columns = [
    'days_to_birth',
    'years_to_birth',
    'age_at_index',
    'updated_datetime',
    'bcr_patient_barcode',
    'diagnosis_id',
    'icd_10_code',
    'ajcc_staging_system_edition',
    'date_of_initial_pathologic_diagnosis',
    'gender.1', 
    'race.1', 
    'ethnicity.1',
    'number_of_lymph_nodes',
    'vital_status',
    'vital_status.1',
    'days_to_death',
    'days_to_death.1',
    'days_to_last_followup',
    'treatments_radiation_days_to_treatment_end',
    'treatments_radiation_days_to_treatment_start',
    'ajcc_pathologic_stage',
    'pathologic_stage',
    "ajcc_pathologic_t",
    'pathology_T_stage',
    'pathology_N_stage',
    'pathology_M_stage',]
)

#print(clinical_preprocessed.columns)
display(clinical_preprocessed.iloc[:3,:10])

2025-11-13 00:41:28,700 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-11-13 00:41:28,700 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 15 NaNs after median imputation
2025-11-13 00:41:28,701 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 1 columns dropped due to zero variance
2025-11-13 00:41:29,026 - bioneuralnet.utils.preprocess - INFO - Selected top 7 features by RandomForest importance


patient,age_at_diagnosis,year_of_diagnosis,laterality_Right,country_of_residence_at_enrollment_United States,race_black or african american,method_of_diagnosis_Core Biopsy,metastasis_at_diagnosis_Missing
TCGA-3C-AAAU,20211.0,2004.0,False,True,False,False,True
TCGA-3C-AALI,18538.0,2003.0,True,True,True,True,True
TCGA-3C-AALJ,22848.0,2011.0,True,True,True,True,True


## Building a Multi-Omics Network

We build a k-NN cosine similarity graph (k = 15) over the 5,206 concatenated methylation, RNA, and miRNA features to capture relationships across omics layers.
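
As a rough illustration of the idea, the sketch below builds a feature-by-feature k-NN graph from cosine similarity with NumPy and scikit-learn. The notebook does not show the internals of `bnn.utils.gen_similarity_graph`, so the helper name `knn_cosine_graph`, the symmetrization rule, and the use of raw similarities as edge weights are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def knn_cosine_graph(X: pd.DataFrame, k: int) -> pd.DataFrame:
    """Hypothetical helper: k-NN graph whose nodes are the columns (features) of X."""
    sim = cosine_similarity(X.T.to_numpy())          # feature-by-feature similarity
    ranked = sim.copy()
    np.fill_diagonal(ranked, -np.inf)                # a feature is not its own neighbor
    neighbors = np.argsort(-ranked, axis=1)[:, :k]   # k most similar features per row
    mask = np.zeros(sim.shape, dtype=bool)
    rows = np.repeat(np.arange(sim.shape[0]), k)
    mask[rows, neighbors.ravel()] = True
    mask |= mask.T                                   # keep an edge if either end selected it
    adj = np.where(mask, sim, 0.0)                   # edge weight = cosine similarity
    return pd.DataFrame(adj, index=X.columns, columns=X.columns)

# Random stand-in data: 30 samples by 8 features
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(30, 8)), columns=[f"feat{i}" for i in range(8)])
A_toy = knn_cosine_graph(X_toy, k=3)
print(A_toy.shape)
```

The adjacency matrix is square in the number of features, which is consistent with the 5,206 × 5,206 network reported in the cell output.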


In [42]:
import pandas as pd

X_train_full = pd.concat([dna_meth, rna, mirna], axis=1)

print(f"Nan values in X_train_full: {X_train_full.isna().sum().sum()}")
X_train_full = X_train_full.dropna()
print(f"Nan value in X_train_full after dropping: {X_train_full.isna().sum().sum()}")

print(f"X_train_full shape: {X_train_full.shape}")
# building the graph using the similarity graph function with k=15
A_train = bnn.utils.gen_similarity_graph(X_train_full, k=15)

print(f"\nNetwork shape: {A_train.shape}")

Nan values in X_train_full: 0
Nan value in X_train_full after dropping: 0
X_train_full shape: (769, 5206)

Network shape: (5206, 5206)


## Reproducibility and Seeding

To ensure our experimental results are fully reproducible, a single global seed is set at the beginning of the analysis.

This utility function propagates the seed to all sources of randomness, including `random`, `numpy`, and `torch` (for both CPU and GPU). Critically, it also configures the PyTorch cuDNN backend to use deterministic algorithms.

**For each of the five DPMON outer repeats, the seed is incremented by one (1804 to 1808) so that every repeat uses a different internal train/test split.**
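
A seeding utility of this kind typically looks like the sketch below. `set_seed_sketch` is a hypothetical stand-in written for illustration and is not the code behind `bnn.utils.set_seed`; the PyTorch part is skipped when `torch` is not installed.

```python
import random

import numpy as np

def set_seed_sketch(seed: int) -> None:
    """Hypothetical helper: seed Python, NumPy and (if available) PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:  # torch is optional for this sketch
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable auto-tuning
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# One seed per outer repeat: 1804, 1805, ..., 1808
BASE_SEED = 1804
for r in range(5):
    set_seed_sketch(BASE_SEED + r)
```

Re-seeding with the same value reproduces the same random draws, while the incremented seeds give each repeat its own stream.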

In [44]:
import bioneuralnet as bnn

SEED = 1804
bnn.utils.set_seed(SEED)

2025-11-13 00:41:54,623 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 1804
2025-11-13 00:41:54,625 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-13 00:41:54,625 - bioneuralnet.utils.data - INFO - Seed setting complete


## Classification using DPMON: Training and Evaluation


### SAGE Analysis of Hyperparameter Optimization

| Seed | Internal Avg Acc | Internal Std Dev | gnn_layer_num | gnn_hidden_dim | lr | weight_decay | nn_hidden_dim1 | nn_hidden_dim2 | num_epochs |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **1804** | 0.8062 | 0.2817 | 64 | 8 | 0.004644 | 0.001269 | 64 | 128 | 512 |
| **1805** | 0.8266 | 0.3003 | 16 | 8 | 0.000322 | 0.018485 | 128 | 128 | 1024 |
| **1806** | 0.7629 | 0.3341 | 16 | 64 | 0.000316 | 0.065466 | 64 | 32 | 4096 |
| **1807** | 0.9046 | 0.0470 | 4 | 32 | 0.004406 | 0.017536 | 32 | 64 | 8192 |
| **1808** | 0.9905 | 0.0059 | 8 | 32 | 0.005560 | 0.000111 | 32 | 64 | 2048 |

### SAGE Final Results

| | Mean | Std Dev |
|:---|---:|---:|
| **Accuracy** | **0.9774** | **0.0190** |
| **F1 Weighted** | **0.9684** | **0.0286** |
| **F1 Macro** | **0.8957** | **0.1017** |


### GCN Analysis of Hyperparameter Optimization

| Seed | Internal Avg Acc | Internal Std Dev | gnn_layer_num | gnn_hidden_dim | lr | weight_decay | nn_hidden_dim1 | nn_hidden_dim2 | num_epochs |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **1804** | 1.0000 | 0.0000 | 64 | 16 | 0.000302 | 0.000130 | 64 | 16 | 2048 |
| **1805** | 0.6680 | 0.0686 | 128 | 64 | 0.055114 | 0.002403 | 16 | 4 | 64 |
| **1806** | 1.0000 | 0.0000 | 4 | 128 | 0.000714 | 0.012667 | 64 | 128 | 8192 |
| **1807** | 0.6606 | 0.1201 | 8 | 64 | 0.020924 | 0.009291 | 16 | 128 | 256 |
| **1808** | 0.9991 | 0.0008 | 8 | 32 | 0.005560 | 0.000111 | 32 | 64 | 2048 |

### GCN Final Results

| | Mean | Std Dev |
|:---|---:|---:|
| **Accuracy** | **0.9009** | **0.1238** |
| **F1 Weighted** | **0.8697** | **0.1673** |
| **F1 Macro** | **0.7880** | **0.2738** |


### GAT Analysis of Hyperparameter Optimization

| Seed | Internal Avg Acc | Internal Std Dev | gnn_layer_num | gnn_hidden_dim | lr | weight_decay | nn_hidden_dim1 | nn_hidden_dim2 | num_epochs |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **1804** | 0.7195 | 0.0251 | 8 | 64 | 0.042316 | 0.025413 | 64 | 16 | 64 |
| **1805** | 0.9991 | 0.0015 | 16 | 8 | 0.000322 | 0.018485 | 128 | 128 | 1024 |
| **1806** | 0.7009 | 0.3219 | 4 | 128 | 0.000714 | 0.012667 | 64 | 128 | 8192 |
| **1807** | 0.5388 | 0.1508 | 8 | 64 | 0.020924 | 0.009291 | 16 | 128 | 256 |
| **1808** | 0.7560 | 0.0529 | 128 | 4 | 0.003024 | 0.078813 | 8 | 128 | 8192 |

### GAT Final Results

| | Mean | Std Dev |
|:---|---:|---:|
| **Accuracy** | **0.8398** | **0.1419** |
| **F1 Weighted** | **0.8008** | **0.1781** |
| **F1 Macro** | **0.6495** | **0.2936** |

In [45]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.metrics import f1_score, accuracy_score
from bioneuralnet.downstream_task import DPMON

output_dir_base_sage =  Path("/home/vicente/Github/BioNeuralNet/dpmon_results_SAGE_FINAL/brca")
target = target.rename(columns={"target": "phenotype"})

n_repeats = 5
all_preds = []

for r in range(n_repeats):
    bnn.utils.set_seed(SEED+r)
    dpmon_repeat = DPMON(
        adjacency_matrix=A_train,
        omics_list=[dna_meth, rna, mirna],
        phenotype_data=target,
        clinical_data=clinical_preprocessed,
        repeat_num=3,
        model='SAGE',
        tune=True,
        gpu=True,
        cuda=0,
        output_dir=output_dir_base_sage,
    )
    
    predictions_df, _ = dpmon_repeat.run()
    all_preds.append(predictions_df["Predicted"].values)

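all_preds = np.array(all_preds)

# Flatten the phenotype column to a 1-D array of true labels for scikit-learn
y_true = target["phenotype"].to_numpy()

# One score per repeat
f1_macro_list = [f1_score(y_true, pred, average='macro') for pred in all_preds]
f1_weighted_list = [f1_score(y_true, pred, average='weighted') for pred in all_preds]
accuracy_list = [accuracy_score(y_true, pred) for pred in all_preds]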

avg_f1_macro = np.mean(f1_macro_list)
std_f1_macro = np.std(f1_macro_list)

avg_f1_weighted = np.mean(f1_weighted_list)
std_f1_weighted = np.std(f1_weighted_list)

avg_acc = np.mean(accuracy_list)
std_acc = np.std(accuracy_list)

print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
print(f"F1 Weighted: {avg_f1_weighted:.4f} +/- {std_f1_weighted:.4f}")
print(f"F1 Macro: {avg_f1_macro:.4f} +/- {std_f1_macro:.4f}")

2025-11-13 00:41:54,650 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 1804
2025-11-13 00:41:54,651 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-13 00:41:54,652 - bioneuralnet.utils.data - INFO - Seed setting complete
2025-11-13 00:41:54,653 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_results_SAGE_FINAL/brca
2025-11-13 00:41:54,653 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-11-13 00:41:54,653 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-11-13 00:41:54,711 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-11-13 00:41:54,711 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-11-13 00:41:54,929 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 5206
2025-11-13 00:41:58,530 - bioneuralne

Accuracy: 0.9774 +/- 0.0190
F1 Weighted: 0.9684 +/- 0.0286
F1 Macro: 0.8957 +/- 0.1017


In [46]:
output_dir_base_gcn = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GCN_FINAL/brca")

n_repeats = 5
all_preds = []

for r in range(n_repeats):
    bnn.utils.set_seed(SEED+r)
    dpmon_repeat = DPMON(
        adjacency_matrix=A_train,
        omics_list=[dna_meth, rna, mirna],
        phenotype_data=target,
        clinical_data=clinical_preprocessed,
        repeat_num=3,
        model='GCN',
        tune=True,
        gpu=True,
        cuda=0,
        output_dir=output_dir_base_gcn,
    )
    
    predictions_df, _ = dpmon_repeat.run()
    all_preds.append(predictions_df["Predicted"].values)

all_preds = np.array(all_preds)

f1_macro_list = [f1_score(target, pred, average='macro') for pred in all_preds]
f1_weighted_list = [f1_score(target, pred, average='weighted') for pred in all_preds]
accuracy_list = [accuracy_score(target, pred) for pred in all_preds]

avg_f1_macro = np.mean(f1_macro_list)
std_f1_macro = np.std(f1_macro_list)

avg_f1_weighted = np.mean(f1_weighted_list)
std_f1_weighted = np.std(f1_weighted_list)

avg_acc = np.mean(accuracy_list)
std_acc = np.std(accuracy_list)

print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
print(f"F1 Weighted: {avg_f1_weighted:.4f} +/- {std_f1_weighted:.4f}")
print(f"F1 Macro: {avg_f1_macro:.4f} +/- {std_f1_macro:.4f}")

2025-11-13 00:48:22,330 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 1804
2025-11-13 00:48:22,331 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-13 00:48:22,331 - bioneuralnet.utils.data - INFO - Seed setting complete
2025-11-13 00:48:22,331 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_results_GCN_FINAL/brca
2025-11-13 00:48:22,332 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-11-13 00:48:22,332 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-11-13 00:48:22,350 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-11-13 00:48:22,350 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-11-13 00:48:22,561 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 5206
2025-11-13 00:48:26,095 - bioneuralnet

Accuracy: 0.9009 +/- 0.1238
F1 Weighted: 0.8697 +/- 0.1673
F1 Macro: 0.7880 +/- 0.2738


In [47]:
output_dir_base_gat = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GAT_FINAL/brca")

n_repeats = 5
all_preds = []

for r in range(n_repeats):
    bnn.utils.set_seed(SEED+r)
    dpmon_repeat = DPMON(
        adjacency_matrix=A_train,
        omics_list=[dna_meth, rna, mirna],
        phenotype_data=target,
        clinical_data=clinical_preprocessed,
        repeat_num=3,
        model='GAT',
        tune=True,
        gpu=True,
        cuda=0,
        output_dir=output_dir_base_gat,
    )
    
    predictions_df, _ = dpmon_repeat.run()
    all_preds.append(predictions_df["Predicted"].values)

all_preds = np.array(all_preds)

f1_macro_list = [f1_score(target, pred, average='macro') for pred in all_preds]
f1_weighted_list = [f1_score(target, pred, average='weighted') for pred in all_preds]
accuracy_list = [accuracy_score(target, pred) for pred in all_preds]

avg_f1_macro = np.mean(f1_macro_list)
std_f1_macro = np.std(f1_macro_list)

avg_f1_weighted = np.mean(f1_weighted_list)
std_f1_weighted = np.std(f1_weighted_list)

avg_acc = np.mean(accuracy_list)
std_acc = np.std(accuracy_list)

print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
print(f"F1 Weighted: {avg_f1_weighted:.4f} +/- {std_f1_weighted:.4f}")
print(f"F1 Macro: {avg_f1_macro:.4f} +/- {std_f1_macro:.4f}")

2025-11-13 00:59:06,919 - bioneuralnet.utils.data - INFO - Setting global seed for reproducibility to: 1804
2025-11-13 00:59:06,920 - bioneuralnet.utils.data - INFO - CUDA available. Applying seed to all GPU operations
2025-11-13 00:59:06,920 - bioneuralnet.utils.data - INFO - Seed setting complete
2025-11-13 00:59:06,921 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_results_GAT_FINAL/brca
2025-11-13 00:59:06,921 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-11-13 00:59:06,921 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-11-13 00:59:06,944 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-11-13 00:59:06,944 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-11-13 00:59:07,192 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 5206
2025-11-13 00:59:11,207 - bioneuralnet

Accuracy: 0.8398 +/- 0.1419
F1 Weighted: 0.8008 +/- 0.1781
F1 Macro: 0.6495 +/- 0.2936
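
The three GNN cells above repeat the same mean/std aggregation with `np.std`, which defaults to the population standard deviation (`ddof=0`). With only five repeats, the sample standard deviation (`ddof=1`) is noticeably larger, so it helps to state which one is reported. The sketch below shows the difference using only the standard library; the helper name `summarize` and the toy scores are hypothetical and are not values from the runs above.

```python
from statistics import fmean, pstdev, stdev


def summarize(scores, sample=False):
    """Return (mean, std) for a list of per-repeat scores.

    sample=False uses the population std, matching np.std's default (ddof=0).
    sample=True uses the n-1 denominator, matching np.std(..., ddof=1).
    """
    spread = stdev(scores) if sample else pstdev(scores)
    return fmean(scores), spread


# Hypothetical per-repeat accuracies, for illustration only.
toy_scores = [0.90, 0.92, 0.94, 0.96, 0.98]

mean_pop, std_pop = summarize(toy_scores)
mean_smp, std_smp = summarize(toy_scores, sample=True)

print(f"Accuracy: {mean_pop:.4f} +/- {std_pop:.4f}  (population std)")
print(f"Accuracy: {mean_smp:.4f} +/- {std_smp:.4f}  (sample std)")
```

The same helper could replace the repeated `np.mean` / `np.std` pairs for each metric list, assuming the lists hold one score per repeat.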


## Classification Model Training and Evaluation Summary

### LogisticRegression: Best Hyperparameters per Seed

| Seed | model__C | model__penalty |
| :---: | :---: | :---: |
| 1804 | 8.40099 | l2 |
| 1805 | 14.91826 | l2 |
| 1806 | 0.17344 | l2 |
| 1807 | 4.16578 | l2 |
| 1808 | 0.22553 | l2 |

### LogisticRegression: Final Results

| Metric | Score |
| :---: | :---: |
| Accuracy | 0.8486 +/- 0.0115 |
| F1 Weighted | 0.8401 +/- 0.0102 |
| F1 Macro | 0.7390 +/- 0.0142 |

### MLP: Best Hyperparameters per Seed

| Seed | model__activation | model__alpha | model__hidden_layer_sizes | model__learning_rate_init |
| :---: | :---: | :---: | :---: | :---: |
| 1804 | relu | 6.5803e-05 | (50, 50) | 0.000928 |
| 1805 | tanh | 0.000868 | (100, 50) | 0.000201 |
| 1806 | relu | 0.000209 | (100,) | 0.000121 |
| 1807 | tanh | 0.001400 | (100,) | 0.000763 |
| 1808 | relu | 0.075812 | (100, 50) | 0.000703 |

### MLP: Final Results

| Metric | Score |
| :---: | :---: |
| Accuracy | 0.8134 +/- 0.0178 |
| F1 Weighted | 0.8059 +/- 0.0167 |
| F1 Macro | 0.6874 +/- 0.0420 |

### XGBoost: Best Hyperparameters per Seed

| Seed | model__colsample_bytree | model__learning_rate | model__max_depth | model__n_estimators | model__subsample |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 1804 | 0.8 | 0.166 | 1 | 812 | 0.7 |
| 1805 | 0.8 | 0.290 | 7 | 933 | 0.7 |
| 1806 | 0.9 | 0.229 | 1 | 538 | 0.7 |
| 1807 | 0.8 | 0.060 | 1 | 735 | 0.9 |
| 1808 | 0.8 | 0.198 | 8 | 837 | 0.9 |

### XGBoost: Final Results

| Metric | Score |
| :---: | :---: |
| Accuracy | 0.8390 +/- 0.0091 |
| F1 Weighted | 0.8210 +/- 0.0100 |
| F1 Macro | 0.7024 +/- 0.0185 |
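
In each results table above, F1 Macro trails F1 Weighted. That pattern is expected when classes are imbalanced: macro F1 gives every class equal weight, so a poorly predicted minority class pulls it down, while weighted F1 scales each class by its support. The pure-Python sketch below illustrates this on made-up labels; the function names and the toy data are hypothetical and only borrow two subtype names for readability.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from true and predicted label lists."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1.values()) / len(f1)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by true-class support."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(f1[c] * support[c] for c in f1) / len(y_true)


# Toy labels: 8 majority-class samples, 2 minority-class samples,
# with one minority sample misclassified as the majority class.
y_true_toy = ["LumA"] * 8 + ["Normal"] * 2
y_pred_toy = ["LumA"] * 8 + ["LumA", "Normal"]

toy_macro = macro_f1(y_true_toy, y_pred_toy)
toy_weighted = weighted_f1(y_true_toy, y_pred_toy)

print(f"F1 Weighted: {toy_weighted:.4f}")
print(f"F1 Macro: {toy_macro:.4f}")
```

A single minority-class error costs the macro score far more than the weighted score here, which is why both are worth reporting on a dataset with uneven subtype counts.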

In [48]:
import warnings 
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.exceptions import ConvergenceWarning
from scipy.stats import loguniform, randint

X = pd.concat([dna_meth, rna, mirna, clinical_preprocessed], axis=1)
y = target['phenotype']
print(f"Successfully created X matrix with shape: {X.shape}")
print(f"Successfully created y vector with shape: {y.shape}")

all_results = {
    "LogisticRegression": {"acc": [], "f1_w": [], "f1_m": []},
    "MLP": {"acc": [], "f1_w": [], "f1_m": []},
    "XGBoost": {"acc": [], "f1_w": [], "f1_m": []},
}

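N_REPEATS = 5          # independent train/test splits, seeded SEED, SEED+1, ...
TEST_SPLIT_SIZE = 0.7  # fraction of patients held out for testing
CV_FOLDS = 3           # stratified folds used inside each randomized search
N_ITER_SEARCH = 10     # parameter settings sampled per search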

pipe_lr = Pipeline([('scaler', StandardScaler()),
    ('model', LogisticRegression(
        solver='lbfgs',
        max_iter=1000,
        penalty=None 
    ))
])

pipe_mlp = Pipeline([('scaler', StandardScaler()),
    ('model', MLPClassifier(
        max_iter=500,
        early_stopping=True,
        n_iter_no_change=10
    ))
])

pipe_xgb = Pipeline([('scaler', StandardScaler()),
    ('model', XGBClassifier(
        eval_metric='mlogloss'  # multi-class log loss; 'logloss' is the binary variant
    ))
])

params_lr = {
    'model__penalty': ['l2'], 
    'model__C': loguniform(1e-4, 1e2)
}

params_mlp = {
    'model__hidden_layer_sizes': [(100,), (100, 50), (50, 50)],
    'model__activation': ['relu', 'tanh'],
    'model__alpha': loguniform(1e-5, 1e-1),
    'model__learning_rate_init': loguniform(1e-4, 1e-2)
}

params_xgb = {
    'model__n_estimators': randint(100, 500),
    'model__learning_rate': loguniform(0.01, 0.3),
    'model__max_depth': randint(3, 10),
    'model__subsample': [0.7, 0.8, 0.9, 1.0],
    'model__colsample_bytree': [0.7, 0.8, 0.9, 1.0]
}

models_to_tune = {
    "LogisticRegression": (pipe_lr, params_lr),
    "MLP": (pipe_mlp, params_mlp),
    "XGBoost": (pipe_xgb, params_xgb)
}

for r in range(N_REPEATS):
    seed = SEED + r
    print(f"\nRunning Repeat {r+1}/{N_REPEATS} (Seed: {seed})")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=TEST_SPLIT_SIZE, 
        random_state=seed, 
        stratify=y)
    
    cv_splitter = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=seed)
    for name, (pipeline, params) in models_to_tune.items():
        print(f"Tuning {name}")
        search = RandomizedSearchCV(
            estimator=pipeline,
            param_distributions=params,
            n_iter=N_ITER_SEARCH,
            cv=cv_splitter,
            scoring='f1_weighted',
            n_jobs=-1,
            random_state=seed
        )
        
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=ConvergenceWarning)
            warnings.filterwarnings("ignore", category=UserWarning)
            search.fit(X_train, y_train)

        print(f"Best params for {name}: {search.best_params_}")
        
        best_model = search.best_estimator_
        preds = best_model.predict(X_test)

        acc = accuracy_score(y_test, preds)
        f1_w = f1_score(y_test, preds, average='weighted', zero_division=0)
        f1_m = f1_score(y_test, preds, average='macro', zero_division=0)

        all_results[name]["acc"].append(acc)
        all_results[name]["f1_w"].append(f1_w)
        all_results[name]["f1_m"].append(f1_m)

print(f"Tuned Model Results (Averaged over {N_REPEATS} runs)")
print(f"(Tuning was {N_ITER_SEARCH} iterations with {CV_FOLDS}-fold CV)")

for model_name, metrics in all_results.items():
    avg_acc = np.mean(metrics["acc"])
    std_acc = np.std(metrics["acc"])
    
    avg_f1_w = np.mean(metrics["f1_w"])
    std_f1_w = np.std(metrics["f1_w"])
    
    avg_f1_m = np.mean(metrics["f1_m"])
    std_f1_m = np.std(metrics["f1_m"])

    print(f"Results for {model_name}:")
    print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
    print(f"F1 Weighted: {avg_f1_w:.4f} +/- {std_f1_w:.4f}")
    print(f"F1 Macro: {avg_f1_m:.4f} +/- {std_f1_m:.4f}")


Successfully created X matrix with shape: (769, 5213)
Successfully created y vector with shape: (769,)

Running Repeat 1/5 (Seed: 1804)
Tuning LogisticRegression
Best params for LogisticRegression: {'model__C': 8.400991552100372, 'model__penalty': 'l2'}
Tuning MLP
Best params for MLP: {'model__activation': 'relu', 'model__alpha': 6.580281251856646e-05, 'model__hidden_layer_sizes': (50, 50), 'model__learning_rate_init': 0.0009282784959657257}
Tuning XGBoost
Best params for XGBoost: {'model__colsample_bytree': 0.8, 'model__learning_rate': 0.16614177474236685, 'model__max_depth': 8, 'model__n_estimators': 127, 'model__subsample': 0.7}

Running Repeat 2/5 (Seed: 1805)
Tuning LogisticRegression
Best params for LogisticRegression: {'model__C': 14.918264811461304, 'model__penalty': 'l2'}
Tuning MLP
Best params for MLP: {'model__activation': 'tanh', 'model__alpha': 0.0008678071941085742, 'model__hidden_layer_sizes': (100, 50), 'model__learning_rate_init': 0.00020060261445052188}
Tuning XGBoost

## Run the cell below once the numbers are updated

In [None]:
import bioneuralnet as bnn

gnn_plot_data = {
    "Accuracy": {
        "SAGE": (0.9774, 0.0190),
        "GCN": (0.9009, 0.1238),
        "GAT": (0.8398, 0.1419)
    },
    "F1 Weighted": {
        "SAGE": (0.9684, 0.0286),
        "GCN": (0.8697, 0.1673),
        "GAT": (0.8008, 0.1781)
    },
    "F1 Macro": {
        "SAGE": (0.8957, 0.1017),
        "GCN": (0.7880, 0.2738),
        "GAT": (0.6495, 0.2936)
    }
}

baseline_plot_data = {
    "Accuracy": {
        "SAGE": (0.9774, 0.0190),
        "LogReg": (0.8486, 0.0115),
        "XGBoost": (0.8390, 0.0091),
        "MLP": (0.8134, 0.0178)
    },
    "F1 Weighted": {
        "SAGE": (0.9684, 0.0286),
        "LogReg": (0.8401, 0.0102),
        "XGBoost": (0.8210, 0.0100),
        "MLP": (0.8059, 0.0167)
    },
    "F1 Macro": {
        "SAGE": (0.8957, 0.1017),
        "LogReg": (0.7390, 0.0142),
        "XGBoost": (0.7024, 0.0185),
        "MLP": (0.6874, 0.0420)
    }
}


bnn.metrics.plot_multiple_metrics(
    gnn_plot_data,
    title_map={
        "Accuracy": "GNNs Comparison: Accuracy",
        "F1 Weighted": "GNNs Comparison: F1 Weighted",
        "F1 Macro": "GNNs Comparison: F1 Macro"
    }
)

bnn.metrics.plot_multiple_metrics(
    baseline_plot_data,
    title_map={
        "Accuracy": "SAGE vs. Baselines: Accuracy",
        "F1 Weighted": "SAGE vs. Baselines: F1 Weighted",
        "F1 Macro": "SAGE vs. Baselines: F1 Macro"
    }
)
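
The `gnn_plot_data` and `baseline_plot_data` dictionaries above are typed in by hand, so they can drift from the printed results. As a hedged alternative, the sketch below derives the same `{metric: {model: (mean, std)}}` layout from per-repeat score lists shaped like `all_results`. The helper `build_plot_data`, the `KEY_TO_TITLE` mapping, and the toy values are hypothetical; that `plot_multiple_metrics` accepts this layout is inferred only from the calls above.

```python
from statistics import fmean, pstdev

# Maps the short keys used in all_results to the metric titles used for plotting.
KEY_TO_TITLE = {"acc": "Accuracy", "f1_w": "F1 Weighted", "f1_m": "F1 Macro"}


def build_plot_data(results):
    """Convert {model: {key: [scores]}} into {title: {model: (mean, std)}}."""
    plot_data = {title: {} for title in KEY_TO_TITLE.values()}
    for model_name, metrics in results.items():
        for key, title in KEY_TO_TITLE.items():
            scores = metrics[key]
            # Population std, matching np.std's default used elsewhere in this notebook.
            plot_data[title][model_name] = (
                round(fmean(scores), 4),
                round(pstdev(scores), 4),
            )
    return plot_data


# Hypothetical per-repeat scores, for illustration only.
toy_results = {
    "ModelA": {"acc": [0.80, 0.90], "f1_w": [0.70, 0.80], "f1_m": [0.60, 0.70]},
    "ModelB": {"acc": [0.85, 0.85], "f1_w": [0.75, 0.85], "f1_m": [0.50, 0.70]},
}

toy_plot_data = build_plot_data(toy_results)
print(toy_plot_data["Accuracy"])
```

Calling `build_plot_data(all_results)` after the baseline cell would give a dictionary in the same shape as `baseline_plot_data`, minus the SAGE entry, which comes from a separate run.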

