# TCGA-LGG Analysis: *Vital Status* classification

- **Cohort**: Focuses on the TCGA-LGG dataset for Lower-Grade Glioma (LGG).
- **Goal**: Classification
- **Prediction Target**: Predict each patient's `Vital Status` from their omics profiles.

**Data Source:** [Broad Institute FireHose](http://firebrowse.org/?cohort=LGG)

In [None]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/LGG")

mirna_raw = pd.read_csv(root / "LGG.miRseq_RPKM_log2.txt", sep="\t", index_col=0, low_memory=False)
rna_raw = pd.read_csv(root / "LGG.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t", index_col=0, low_memory=False)
meth_raw = pd.read_csv(root / "LGG.meth.by_mean.data.txt", sep="\t", index_col=0, low_memory=False)
clinical_raw = pd.read_csv(root / "LGG.clin.merged.picked.txt", sep="\t", index_col=0, low_memory=False)

# display shapes and first few rows-columns of each file
display(mirna_raw.iloc[:3,:5])
display(mirna_raw.shape)

display(rna_raw.iloc[:3,:5])
display(rna_raw.shape)

display(meth_raw.iloc[:3,:5])
display(meth_raw.shape)

display(clinical_raw.iloc[:3,:5])
display(clinical_raw.shape)

## Data Processing Summary

1. **Transpose Data:** All raw data (miRNA, RNA, etc.) is flipped so rows represent patients and columns represent features.
2. **Standardize Patient IDs:** Patient IDs in all tables are cleaned to the 12-character TCGA format (e.g., `TCGA-AB-1234`) for matching.
3. **Handle Duplicates:** Duplicate patient rows are averaged in the omics data. The first entry is kept for duplicate patients in the clinical data.
4. **Find Common Patients:** The script identifies the list of patients that exist in *all* datasets.
5. **Subset Data:** All data tables are filtered down to *only* this common list of patients, ensuring alignment.
6. **Extract Target:** The `vital_status` column is pulled from the processed clinical data to be used as the prediction target (y-variable).
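
The six steps above can be sketched on a toy example. This is a minimal illustration with made-up barcodes and values (the names `toy_omics` and `toy_clinical` are hypothetical), not the notebook's actual data:

```python
import pandas as pd

# Toy omics table after transposing (step 1): rows are sample barcodes, columns are features.
toy_omics = pd.DataFrame(
    {"gene_a": [1.0, 3.0, 5.0], "gene_b": [2.0, 4.0, 6.0]},
    index=["TCGA-AB-1234-01A-11R", "TCGA-AB-1234-02A-11R", "TCGA-CD-5678-01A-11R"],
)
# Toy clinical table: lowercase patient IDs, one row per patient.
toy_clinical = pd.DataFrame(
    {"vital_status": ["0", "1", "0"]},
    index=["tcga-ab-1234", "tcga-cd-5678", "tcga-ef-9012"],
)

# Step 2: trim barcodes to the 12-character patient ID; upper-case the clinical IDs.
toy_omics.index = toy_omics.index.str.slice(0, 12)
toy_clinical.index = toy_clinical.index.str.upper()

# Step 3: average duplicate patient rows in the omics table.
toy_omics = toy_omics.groupby(level=0).mean()

# Steps 4-5: keep only patients present in every table.
common = sorted(set(toy_omics.index) & set(toy_clinical.index))
omics_aligned = toy_omics.loc[common]

# Step 6: pull the prediction target for the same patients, in the same order.
y_toy = toy_clinical.loc[common, "vital_status"]

print(omics_aligned)
print(y_toy)
```

The two samples from patient `TCGA-AB-1234` collapse into a single averaged row, and `TCGA-EF-9012` is dropped because it has no omics data.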

In [None]:
mirna = mirna_raw.T
rna = rna_raw.T
meth = meth_raw.T
clinical = clinical_raw.T

print(f"miRNA (samples, features): {mirna.shape}")
print(f"RNA (samples, features): {rna.shape}")
print(f"Methylation (samples, features): {meth.shape}")
print(f"Clinical (samples, features): {clinical.shape}")

def trim_barcode(idx):
    return idx.to_series().str.slice(0, 12)

# standardize patient IDs across all files
meth.index = trim_barcode(meth.index)
rna.index = trim_barcode(rna.index)
mirna.index = trim_barcode(mirna.index)
clinical.index = clinical.index.str.upper()
clinical.index.name = "Patient_ID"

# convert all data to numeric, coercing errors to NaN
meth = meth.apply(pd.to_numeric, errors='coerce')
rna = rna.apply(pd.to_numeric, errors='coerce')
mirna = mirna.apply(pd.to_numeric, errors='coerce')

# for any duplicate patient rows in the omics data, we average their values
meth = meth.groupby(meth.index).mean()
rna = rna.groupby(rna.index).mean()
mirna = mirna.groupby(mirna.index).mean()

# for any duplicate rows in the clinical data, we keep the first occurrence
clinical = clinical[~clinical.index.duplicated(keep='first')]

print(f"\nMethylation shape: {meth.shape}")
print(f"RNA shape: {rna.shape}")
print(f"miRNA shape: {mirna.shape}")
print(f"Clinical shape: {clinical.shape}")

for df in [meth, rna, mirna]:
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")
    
    df.fillna(df.mean(), inplace=True)

# find which patients are common across all data files
common_patients = sorted(list(set(meth.index)&set(rna.index)&set(mirna.index)&set(clinical.index)))

print(f"\nFound: {len(common_patients)} patients across all data types.")

# subset to only common patients
meth_processed = meth.loc[common_patients]
rna_processed= rna.loc[common_patients]
mirna_processed = mirna.loc[common_patients]
clinical_processed = clinical.loc[common_patients]

# extract target labels from clinical data
targets = clinical_processed['vital_status']

In [None]:
display(mirna_processed.iloc[:3,:5])
display(mirna_processed.shape)

display(rna_processed.iloc[:3,:5])
display(rna_processed.shape)

display(meth_processed.iloc[:3,:5])
display(meth_processed.shape)

display(clinical_processed.iloc[:3,:5])
display(clinical_processed.shape)

display(targets.value_counts())

In [None]:
import bioneuralnet as bnn

# drop unwanted columns from clinical data
clinical_processed = clinical_processed.drop(columns=["Composite Element REF"], errors="ignore")

# we transform the methylation beta values to M-values and drop unwanted columns
meth_m = meth_processed.drop(columns=["Composite Element REF"], errors="ignore")

# convert beta values to M-values using bioneuralnet utility with small epsilon to avoid log(0)
meth_m = bnn.utils.beta_to_m(meth_m, eps=1e-6) 

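# lastly we turn the vital status labels into numerical classes (0 = alive, 1 = deceased);
# assumption: `vital_status` is stored as "0"/"1" (the text labels are accepted as well)
mapping = {"0": 0, "1": 1, "alive": 0, "dead": 1}
target_labels = targets.astype(str).str.strip().str.lower().map(mapping).to_frame(name="target")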

# as a safety check we align the indices once more
X_meth = meth_m.loc[common_patients]
X_rna = rna_processed.loc[common_patients]
X_mirna = mirna_processed.loc[common_patients]
Y_labels = target_labels.loc[common_patients]
clinical_final = clinical_processed.loc[common_patients]

print(f"\nDNA_Methylation shape: {X_meth.shape}")
print(f"RNA shape: {X_rna.shape}")
print(f"miRNA shape: {X_mirna.shape}")
print(f"Clinical shape: {clinical_final.shape}")
print(Y_labels.value_counts())

## Feature Selection Methodology

### Supported Methods and Interpretation

**BioNeuralNet** provides three techniques for feature selection, allowing for different views of the data's statistical profile:

- **Variance Thresholding:** Identifies features with the **highest overall variance** across all samples.

- **ANOVA F-test:** Pinpoints features that best **distinguish between the target classes** (here, the two `Vital Status` classes).

- **Random Forest Importance:** Assesses **feature utility** based on its contribution to a predictive non-linear model.
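
The three criteria can be illustrated on synthetic data. The sketch below uses pandas and scikit-learn directly as stand-ins; it is not BioNeuralNet's implementation, and the names `X_demo`, `y_demo`, and `k` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 100
y_demo = np.repeat([0, 1], n // 2)  # two balanced classes

X_demo = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
X_demo["f0"] += 3.0 * y_demo  # class-informative feature
X_demo["f1"] *= 10.0          # high variance, unrelated to the class

k = 2  # number of features each method keeps

# Variance: unsupervised, ranks features by spread alone.
top_var = set(X_demo.var().nlargest(k).index)

# ANOVA F-test: ranks features by between-class vs. within-class variation.
f_stat, _ = f_classif(X_demo, y_demo)
top_anova = set(pd.Series(f_stat, index=X_demo.columns).nlargest(k).index)

# Random Forest: ranks features by impurity-based importance in a fitted model.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)
top_rf = set(pd.Series(rf.feature_importances_, index=X_demo.columns).nlargest(k).index)

# Consensus of the two supervised criteria.
consensus = sorted(top_anova & top_rf)
print(top_var, top_anova, top_rf, consensus)
```

The noisy feature `f1` is picked by the variance criterion even though it carries no class signal, which is why a supervised criterion is useful alongside it.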

### LGG Cohort Selection Strategy

A dimensionality reduction step was essential for managing the high-feature-count omics data:

- **High-Feature Datasets:** Both DNA Methylation (20,114) and RNA (18,328) required significant feature reduction.

- **Filtering Process:** Each of the three methods was first used to extract its own **top 6,000 features** from the Methylation and RNA datasets.

- **Final Set:** A consensus set was built by finding the intersection of features selected by the ANOVA F-test and Random Forest Importance, ensuring both statistical relevance and model-based utility.

- **Low-Feature Datasets:** The miRNA data (548 features) was passed through **without selection**, as its feature count was already manageable.

In [None]:
import bioneuralnet as bnn

# feature selection
meth_highvar = bnn.utils.select_top_k_variance(X_meth, k=6000)
rna_highvar = bnn.utils.select_top_k_variance(X_rna, k=6000)

meth_af = bnn.utils.top_anova_f_features(X_meth, Y_labels, max_features=6000)
rna_af = bnn.utils.top_anova_f_features(X_rna, Y_labels, max_features=6000)

meth_rf = bnn.utils.select_top_randomforest(X_meth, Y_labels, top_k=6000)
rna_rf = bnn.utils.select_top_randomforest(X_rna, Y_labels, top_k=6000)

meth_var_set = set(meth_highvar.columns)
meth_anova_set = set(meth_af.columns)
meth_rf_set = set(meth_rf.columns)

rna_var_set = set(rna_highvar.columns)
rna_anova_set = set(rna_af.columns)
rna_rf_set = set(rna_rf.columns)

meth_inter1 = list(meth_anova_set & meth_var_set)
meth_inter2 = list(meth_rf_set & meth_var_set)
meth_inter3 = list(meth_anova_set & meth_rf_set)
meth_all_three = list(meth_anova_set & meth_var_set & meth_rf_set)

rna_inter4 = list(rna_anova_set & rna_var_set)
rna_inter5 = list(rna_rf_set & rna_var_set)
rna_inter6 = list(rna_anova_set & rna_rf_set)
rna_all_three = list(rna_anova_set & rna_var_set & rna_rf_set)

In [None]:
print("FROM THE 6000 Methylation feature selection:\n")
print(f"Anova-F & variance selection share: {len(meth_inter1)} features")
print(f"Random Forest & variance selection share: {len(meth_inter2)} features")
print(f"Anova-F & Random Forest share: {len(meth_inter3)} features")
print(f"All three methods agree on: {len(meth_all_three)} features")

In [None]:
print("\nFROM THE 6000 RNA feature selection:\n")
print(f"Anova-F & variance selection share: {len(rna_inter4)} features")
print(f"Random Forest & variance selection share: {len(rna_inter5)} features")
print(f"Anova-F & Random Forest share: {len(rna_inter6)} features")
print(f"All three methods agree on: {len(rna_all_three)} features")

## Feature Selection Summary: ANOVA-RF Intersection

The chosen strategy for feature selection is based on the **overlap** between features identified by the **ANOVA F-test** and **Random Forest Importance**. This keeps only features that are both statistically associated with the class labels (ANOVA) and useful to a non-linear model (Random Forest). The resulting feature sets are used for downstream analysis.

### Feature Overlap Results

The following table details the number of features resulting from the intersection of different selection methods for each omics data type.

| Omics Data Type | ANOVA-F & Variance | RF & Variance | ANOVA-F & Random Forest (Selected) | All Three Agree |
| :--- | :--- | :--- | :--- | :--- |
| **Methylation** | 2,704 features | 1,768 features | **1,823 features** | 809 features |
| **RNA** | 2,183 features | 1,977 features | **2,127 features** | 763 features |
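
To put the selected intersection sizes in context, the short calculation below expresses them as a share of each top-6,000 list and as a Jaccard index. It assumes both methods returned exactly 6,000 features per omics type; the counts are copied from the table above, and `overlap_stats` is a hypothetical name:

```python
K = 6000  # features kept per method before intersecting (assumed exact)

# ANOVA-F & Random Forest intersection sizes, copied from the table above.
selected = {"Methylation": 1823, "RNA": 2127}

overlap_stats = {}
for omic, n_shared in selected.items():
    share_of_top_k = n_shared / K            # fraction of each top-K list that survives
    jaccard = n_shared / (2 * K - n_shared)  # |A ∩ B| / |A ∪ B| when |A| = |B| = K
    overlap_stats[omic] = (share_of_top_k, jaccard)
    print(f"{omic}: share of top-{K} = {share_of_top_k:.3f}, Jaccard = {jaccard:.3f}")
```

Under that assumption, roughly a third of each method's top-6,000 list is shared with the other method.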

In [None]:
X_meth_selected = X_meth[meth_inter3]
X_rna_selected = X_rna[rna_inter6]

print("\nFinal Shapes for Modeling")
print(f"Methylation (X1): {X_meth_selected.shape}")
print(f"RNA-Seq (X2): {X_rna_selected.shape}")
print(f"miRNA-Seq (X3): {X_mirna.shape}")
print(f"Labels (Y): {Y_labels.shape}")

## Data Availability

To facilitate rapid experimentation and reproduction of our results, the fully processed and feature-selected dataset used in this analysis is available directly within the package.

Loading it bypasses all preceding data acquisition, preprocessing, and feature selection steps, so the analysis can resume from this point.

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import bioneuralnet as bnn

tcga_lgg = bnn.datasets.DatasetLoader("lgg")
display(tcga_lgg.shape)

# The dataset is returned as a dictionary. We extract each table independently by its name (key).
dna_meth = tcga_lgg["meth"]
rna = tcga_lgg["rna"]
mirna = tcga_lgg["mirna"]
clinical = tcga_lgg["clinical"]
target = tcga_lgg["target"]


In [None]:
dna_meth = bnn.utils.select_top_k_variance(dna_meth, k=500)
rna = bnn.utils.select_top_k_variance(rna, k=500)
mirna = bnn.utils.select_top_k_variance(mirna, k=500)

print(dna_meth.shape)
print(rna.shape)

In [None]:
from bioneuralnet.utils import preprocess_clinical
clinical_for_model = preprocess_clinical(
    X=clinical, 
    top_k=10,
    scale=False,
    ignore_columns=[
        # target-related or outcome-related or irrelevant columns
        "days_to_last_followup", 
        "days_to_death",
        "date_of_initial_pathologic_diagnosis",
        "histological_type"
    ]
)
display(clinical_for_model.iloc[:5,:5])

## Reproducibility and Seeding

To ensure our experimental results are fully reproducible, a single global seed is set at the beginning of the analysis.

The `bnn.utils.set_seed` utility propagates the seed to all sources of randomness, including `random`, `numpy`, and `torch` (for both CPU and GPU). It also configures the PyTorch cuDNN backend to use deterministic algorithms.
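
For readers unfamiliar with this pattern, a seeding helper of this kind typically looks like the sketch below. `seed_everything` is a hypothetical name and this is not the BioNeuralNet source; the PyTorch part is skipped when `torch` is not installed:

```python
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Hypothetical helper illustrating global seeding; not the library's code."""
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's legacy global RNG
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing more to seed
    torch.manual_seed(seed)           # CPU generator (also seeds CUDA generators)
    torch.cuda.manual_seed_all(seed)  # explicit for all GPUs; no-op without CUDA
    torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable autotuner, which can vary runs


seed_everything(1883)
first = (random.random(), np.random.rand())
seed_everything(1883)
second = (random.random(), np.random.rand())
print(first == second)  # True: re-seeding reproduces the same draws
```

Deterministic cuDNN settings can slow training, which is the usual trade-off for run-to-run reproducibility.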

In [None]:
import bioneuralnet as bnn

SEED = 1883
bnn.utils.set_seed(SEED)

## Classification using DPMON: Training and Evaluation

In [None]:
from bioneuralnet.utils import find_optimal_graph

omics_lgg = pd.concat([mirna, dna_meth, rna], axis=1)

optimal_graph, best_params, results_df = find_optimal_graph(
    omics_data=omics_lgg,
    y_labels=target,
    methods=['correlation','threshold'],
    seed=SEED,
    verbose=False,
    trials=10,
    omics_list=[mirna, dna_meth, rna],
    centrality_mode="eigenvector",
)
display(optimal_graph.iloc[:5,:5])
display(best_params)

results_df.sort_values("score", ascending=False, inplace=True)
display(results_df)

In [None]:
# Graph analysis: the graph search uses a proxy task to evaluate graph properties, but the final
# graph is built without the target variable. The search only helps to find the best graph parameters.
from bioneuralnet.utils import graph_analysis

graph_analysis(optimal_graph, graph_name="Optimal")

In [None]:
from pathlib import Path
from bioneuralnet.downstream_task import DPMON

output_dir_base = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GAT_FINAL/lgg")

current_output_dir = output_dir_base / "Gat_best_graph"
current_output_dir.mkdir(parents=True, exist_ok=True)

dpmon_params_base = {
    "adjacency_matrix": optimal_graph,
    "omics_list": omics_lgg,
    "phenotype_data": target,
    "phenotype_col": "target",
    "clinical_data": clinical_for_model,
    "model": 'GAT',
    "tune": True, 
    "cv": True,   
    "n_folds": 5,
    'repeat_num': 1,
    "gpu": True,
    "cuda": 0,
    "seed": SEED,
    "output_dir": current_output_dir
}

dpmon_tuned = DPMON(**dpmon_params_base)
predictions_df, metrics, embeddings = dpmon_tuned.run()

graph_metrics = metrics

acc_row = graph_metrics.loc[graph_metrics['Metric'] == 'Accuracy'].iloc[0]
f1_macro_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Macro'].iloc[0]
f1_weighted_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Weighted'].iloc[0]
recall_row = graph_metrics.loc[graph_metrics['Metric'] == 'Recall'].iloc[0]
auc_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUC'].iloc[0]
aupr_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUPR'].iloc[0]

acc_avg, acc_std = acc_row['Average'], acc_row['StdDev']
f1_macro_avg, f1_macro_std = f1_macro_row['Average'], f1_macro_row['StdDev']
f1_weighted_avg, f1_weighted_std = f1_weighted_row['Average'], f1_weighted_row['StdDev']
recall_avg, recall_std = recall_row['Average'], recall_row['StdDev']
auc_avg, auc_std = auc_row['Average'], auc_row['StdDev']
aupr_avg, aupr_std = aupr_row['Average'], aupr_row['StdDev']

print(f"Accuracy (Avg +/- Std): {acc_avg:.4f} +/- {acc_std:.4f}")
print(f"F1 Macro (Avg +/- Std): {f1_macro_avg:.4f}  +/- {f1_macro_std:.4f}")
print(f"F1 Weighted (Avg +/- Std): {f1_weighted_avg:.4f} +/- {f1_weighted_std:.4f}")
print(f"Recall: {recall_avg:.4f} +/- {recall_std:.4f}")
print(f"AUC: {auc_avg:.4f} +/- {auc_std:.4f}")
print(f"AUPR: {aupr_avg:.4f} +/- {aupr_std:.4f}")

In [None]:
from bioneuralnet.downstream_task import DPMON


output_dir_base = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GCN_FINAL/lgg")

current_output_dir = output_dir_base / "Gcn_best_graph"
current_output_dir.mkdir(parents=True, exist_ok=True)

dpmon_params_base = {
    "adjacency_matrix": optimal_graph,
    "omics_list": omics_lgg,
    "phenotype_data": target,
    "phenotype_col": "target",
    "clinical_data": clinical_for_model,
    "model": 'GCN',
    "tune": True, 
    "cv": True,   
    "n_folds": 5,
    'repeat_num': 1,
    "gpu": True,
    "cuda": 0,
    "seed": SEED,
    "output_dir": current_output_dir
}

dpmon_tuned = DPMON(**dpmon_params_base)
predictions_df, metrics, embeddings = dpmon_tuned.run()
graph_metrics = metrics

acc_row = graph_metrics.loc[graph_metrics['Metric'] == 'Accuracy'].iloc[0]
f1_macro_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Macro'].iloc[0]
f1_weighted_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Weighted'].iloc[0]
recall_row = graph_metrics.loc[graph_metrics['Metric'] == 'Recall'].iloc[0]
auc_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUC'].iloc[0]
aupr_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUPR'].iloc[0]

acc_avg, acc_std = acc_row['Average'], acc_row['StdDev']
f1_macro_avg, f1_macro_std = f1_macro_row['Average'], f1_macro_row['StdDev']
f1_weighted_avg, f1_weighted_std = f1_weighted_row['Average'], f1_weighted_row['StdDev']
recall_avg, recall_std = recall_row['Average'], recall_row['StdDev']
auc_avg, auc_std = auc_row['Average'], auc_row['StdDev']
aupr_avg, aupr_std = aupr_row['Average'], aupr_row['StdDev']

print(f"Accuracy (Avg +/- Std): {acc_avg:.4f} +/- {acc_std:.4f}")
print(f"F1 Macro (Avg +/- Std): {f1_macro_avg:.4f}  +/- {f1_macro_std:.4f}")
print(f"F1 Weighted (Avg +/- Std): {f1_weighted_avg:.4f} +/- {f1_weighted_std:.4f}")
print(f"Recall: {recall_avg:.4f} +/- {recall_std:.4f}")
print(f"AUC: {auc_avg:.4f} +/- {auc_std:.4f}")
print(f"AUPR: {aupr_avg:.4f} +/- {aupr_std:.4f}")

In [None]:
from bioneuralnet.downstream_task import DPMON
output_dir_base = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_GIN_FINAL/lgg")


current_output_dir = output_dir_base / "gin_best_graph"
current_output_dir.mkdir(parents=True, exist_ok=True)

dpmon_params_base = {
    "adjacency_matrix": optimal_graph,
    "omics_list": omics_lgg,
    "phenotype_data": target,
    "phenotype_col": "target",
    "clinical_data": clinical_for_model,
    "model": 'GIN',
    "tune": True, 
    "cv": True,   
    "n_folds": 5,
    'repeat_num': 1,
    "gpu": True,
    "cuda": 0,
    "seed": SEED,
    "output_dir": current_output_dir
}

dpmon_tuned = DPMON(**dpmon_params_base)
predictions_df, metrics, embeddings = dpmon_tuned.run()
graph_metrics = metrics

acc_row = graph_metrics.loc[graph_metrics['Metric'] == 'Accuracy'].iloc[0]
f1_macro_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Macro'].iloc[0]
f1_weighted_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Weighted'].iloc[0]
recall_row = graph_metrics.loc[graph_metrics['Metric'] == 'Recall'].iloc[0]
auc_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUC'].iloc[0]
aupr_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUPR'].iloc[0]

acc_avg, acc_std = acc_row['Average'], acc_row['StdDev']
f1_macro_avg, f1_macro_std = f1_macro_row['Average'], f1_macro_row['StdDev']
f1_weighted_avg, f1_weighted_std = f1_weighted_row['Average'], f1_weighted_row['StdDev']
recall_avg, recall_std = recall_row['Average'], recall_row['StdDev']
auc_avg, auc_std = auc_row['Average'], auc_row['StdDev']
aupr_avg, aupr_std = aupr_row['Average'], aupr_row['StdDev']

print(f"Accuracy (Avg +/- Std): {acc_avg:.4f} +/- {acc_std:.4f}")
print(f"F1 Macro (Avg +/- Std): {f1_macro_avg:.4f}  +/- {f1_macro_std:.4f}")
print(f"F1 Weighted (Avg +/- Std): {f1_weighted_avg:.4f} +/- {f1_weighted_std:.4f}")
print(f"Recall: {recall_avg:.4f} +/- {recall_std:.4f}")
print(f"AUC: {auc_avg:.4f} +/- {auc_std:.4f}")
print(f"AUPR: {aupr_avg:.4f} +/- {aupr_std:.4f}")

In [None]:
from bioneuralnet.downstream_task import DPMON
output_dir_base = Path("/home/vicente/Github/BioNeuralNet/dpmon_results_SAGE_FINAL/lgg")

current_output_dir = output_dir_base / "sage_best_graph"
current_output_dir.mkdir(parents=True, exist_ok=True)

dpmon_params_base = {
    "adjacency_matrix": optimal_graph,
    "omics_list": omics_lgg,
    "phenotype_data": target,
    "phenotype_col": "target",
    "clinical_data": clinical_for_model,
    "model": 'SAGE',
    "tune": True, 
    "cv": True,   
    "n_folds": 5,
    'repeat_num': 1,
    "gpu": True,
    "cuda": 0,
    "seed": SEED,
    "output_dir": current_output_dir
}

dpmon_tuned = DPMON(**dpmon_params_base)
predictions_df, metrics, embeddings = dpmon_tuned.run()
graph_metrics = metrics

acc_row = graph_metrics.loc[graph_metrics['Metric'] == 'Accuracy'].iloc[0]
f1_macro_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Macro'].iloc[0]
f1_weighted_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Weighted'].iloc[0]
recall_row = graph_metrics.loc[graph_metrics['Metric'] == 'Recall'].iloc[0]
auc_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUC'].iloc[0]
aupr_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUPR'].iloc[0]

acc_avg, acc_std = acc_row['Average'], acc_row['StdDev']
f1_macro_avg, f1_macro_std = f1_macro_row['Average'], f1_macro_row['StdDev']
f1_weighted_avg, f1_weighted_std = f1_weighted_row['Average'], f1_weighted_row['StdDev']
recall_avg, recall_std = recall_row['Average'], recall_row['StdDev']
auc_avg, auc_std = auc_row['Average'], auc_row['StdDev']
aupr_avg, aupr_std = aupr_row['Average'], aupr_row['StdDev']

print(f"Accuracy (Avg +/- Std): {acc_avg:.4f} +/- {acc_std:.4f}")
print(f"F1 Macro (Avg +/- Std): {f1_macro_avg:.4f}  +/- {f1_macro_std:.4f}")
print(f"F1 Weighted (Avg +/- Std): {f1_weighted_avg:.4f} +/- {f1_weighted_std:.4f}")
print(f"Recall: {recall_avg:.4f} +/- {recall_std:.4f}")
print(f"AUC: {auc_avg:.4f} +/- {auc_std:.4f}")
print(f"AUPR: {aupr_avg:.4f} +/- {aupr_std:.4f}")

In [None]:
from sklearn.model_selection import StratifiedKFold, ParameterSampler, RepeatedStratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, average_precision_score
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from scipy.stats import loguniform, randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
import numpy as np
import pandas as pd

X = pd.concat([dna_meth, rna, mirna, clinical_for_model], axis=1)
y = target['target']
print(f"Successfully created X matrix with shape: {X.shape}")
print(f"Successfully created y vector with shape: {y.shape}")


pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(solver='lbfgs', max_iter=1000, penalty=None, random_state=SEED))
])

pipe_mlp = Pipeline([
    ('scaler', StandardScaler()),
    ('model', MLPClassifier(max_iter=500, early_stopping=True, n_iter_no_change=10, random_state=SEED))
])

pipe_xgb = Pipeline([
    ('scaler', StandardScaler()),
    ('model', XGBClassifier(eval_metric='logloss', tree_method='hist', max_bin=128, random_state=SEED))
])
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=SEED))
])

pipe_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC(probability=True, random_state=SEED))
])

pipe_dt = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state=SEED))
])

params_lr = {'model__penalty': ['l2'], 'model__C': loguniform(1e-4, 1e2)}

params_mlp = {
    'model__hidden_layer_sizes': [(100,), (100, 50), (50, 50)],
    'model__activation': ['relu', 'tanh'],
    'model__alpha': loguniform(1e-5, 1e-1),
    'model__learning_rate_init': loguniform(1e-4, 1e-2)
}
params_xgb = {
    'model__n_estimators': randint(50, 200),
    'model__learning_rate': loguniform(0.01, 0.3),
    'model__max_depth': randint(3, 7),
    'model__subsample': [0.8, 1.0], 
    'model__colsample_bytree': [0.8, 1.0]
}
params_rf = {
    'model__n_estimators': randint(100, 300),
    'model__max_depth': [10, 20, 30, None],
    'model__min_samples_split': randint(2, 10),
    'model__min_samples_leaf': randint(1, 5),
    'model__max_features': ['sqrt', 'log2']
}
params_svm = {
    'model__C': loguniform(1e-2, 1e3),
    'model__kernel': ['rbf', 'linear'],
    'model__gamma': ['scale', 'auto']
}

params_dt = {
    'model__max_depth': randint(3, 15),
    'model__min_samples_split': randint(2, 20),
    'model__criterion': ['gini', 'entropy']
}


models_to_tune = {
    "LogisticRegression": (pipe_lr, params_lr),
    "SVM": (pipe_svm, params_svm),
    "MLP": (pipe_mlp, params_mlp),
    "XGBoost": (pipe_xgb, params_xgb),
    "RandomForest": (pipe_rf, params_rf),
    "DecisionTree": (pipe_dt, params_dt),
}

all_results = {
    "LogisticRegression": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "MLP": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "XGBoost": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "RandomForest": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "SVM": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
    "DecisionTree": {"acc": [], "f1_w": [], "f1_m": [], "recall": [], "auc": [], "aupr": []},
}


for model_name, (pipeline, param_dist) in models_to_tune.items():
    print(f"Evaluating model with nested CV: {model_name}")
    
    outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=SEED)
    
    # outer folds estimate generalization; inner folds (below) select the best hyperparameters
    for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
        X_train_outer, X_test_outer = X.iloc[train_idx], X.iloc[test_idx]
        y_train_outer, y_test_outer = y.iloc[train_idx], y.iloc[test_idx]
        inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
        
        best_score_fold = -np.inf
        best_params_fold = None
        # May or may not want to set a seed here. A fixed seed = same hyperparameter candidates in each fold.
        # No seed = different candidates in each fold. This adds more randomness and may yield better generalization.
        param_sampler = list(ParameterSampler(param_dist, n_iter=20))
        
        for params in param_sampler:
            inner_scores = []
            
            for inner_train_idx, inner_val_idx in inner_cv.split(X_train_outer, y_train_outer):
                X_train_inner = X_train_outer.iloc[inner_train_idx]
                X_val_inner = X_train_outer.iloc[inner_val_idx]
                y_train_inner = y_train_outer.iloc[inner_train_idx]
                y_val_inner = y_train_outer.iloc[inner_val_idx]
                
                inner_pipeline = clone(pipeline)
                inner_pipeline.set_params(**params)
                inner_pipeline.fit(X_train_inner, y_train_inner)
                
                y_val_pred = inner_pipeline.predict(X_val_inner)
                score = f1_score(y_val_inner, y_val_pred, average='weighted', zero_division=0)
                inner_scores.append(score)
            
            mean_score = np.mean(inner_scores)
            if mean_score > best_score_fold:
                best_score_fold = mean_score
                best_params_fold = params
        
        print(f"Outer fold {fold_idx}: best params (inner CV F1-W={best_score_fold:.4f})")
        print(f"{best_params_fold}")
        
        final_pipeline = clone(pipeline)
        final_pipeline.set_params(**best_params_fold)
        final_pipeline.fit(X_train_outer, y_train_outer)
        
        preds = final_pipeline.predict(X_test_outer)
        
        if hasattr(final_pipeline, "predict_proba"):
            proba = final_pipeline.predict_proba(X_test_outer)
        else:
            proba = None
        
        acc = accuracy_score(y_test_outer, preds)
        f1_w = f1_score(y_test_outer, preds, average='weighted', zero_division=0)
        f1_m = f1_score(y_test_outer, preds, average='macro', zero_division=0)
        recall = recall_score(y_test_outer, preds, average='macro', zero_division=0)
        
        auc = np.nan
        aupr = np.nan
        
        if proba is not None:
            try:
                if len(np.unique(y)) == 2:
                    auc = roc_auc_score(y_test_outer, proba[:, 1])
                    aupr = average_precision_score(y_test_outer, proba[:, 1])
                else:
                    auc = roc_auc_score(y_test_outer, proba, multi_class='ovr', average='macro')
                    y_test_bin = label_binarize(y_test_outer, classes=np.unique(y))
                    aupr = average_precision_score(y_test_bin, proba, average='weighted')
            except Exception:
                auc = np.nan
                aupr = np.nan

        print(f"Fold {fold_idx} results: Acc={acc:.4f}, F1-W={f1_w:.4f}, "
              f"F1-M={f1_m:.4f}, Recall={recall:.4f}, AUC={auc:.4f}, AUPR={aupr:.4f}")
        
        all_results[model_name]["acc"].append(acc)
        all_results[model_name]["f1_w"].append(f1_w)
        all_results[model_name]["f1_m"].append(f1_m)
        all_results[model_name]["recall"].append(recall)
        all_results[model_name]["auc"].append(auc)
        all_results[model_name]["aupr"].append(aupr)

print("\nFINAL BASELINE RESULTS\n")
for model_name, metrics in all_results.items():
    avg_acc = np.mean(metrics["acc"])
    std_acc = np.std(metrics["acc"])
    avg_f1_w = np.mean(metrics["f1_w"])
    std_f1_w = np.std(metrics["f1_w"])
    avg_f1_m = np.mean(metrics["f1_m"])
    std_f1_m = np.std(metrics["f1_m"])
    avg_recall = np.mean(metrics["recall"])
    std_recall = np.std(metrics["recall"])
    avg_auc = np.nanmean(metrics["auc"])
    std_auc = np.nanstd(metrics["auc"])
    avg_aupr = np.nanmean(metrics["aupr"])
    std_aupr = np.nanstd(metrics["aupr"])
    
    print(f"\n{model_name}:")
    print(f"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}")
    print(f"F1 Weighted: {avg_f1_w:.4f} +/- {std_f1_w:.4f}")
    print(f"F1 Macro: {avg_f1_m:.4f} +/- {std_f1_m:.4f}")
    print(f"Recall: {avg_recall:.4f} +/- {std_recall:.4f}")
    print(f"AUC: {avg_auc:.4f} +/- {std_auc:.4f}")
    print(f"AUPR: {avg_aupr:.4f} +/- {std_aupr:.4f}")