# TCGA-BRCA Demo

## Dataset Source

- **Omics Data**: [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)
- **Clinical and PAM50 Data**: [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)

## Dataset Overview

**Original Data**:

- **Methylation**: 20,107 × 885
- **mRNA**: 18,321 × 1,212
- **miRNA**: 503 × 1,189
- **PAM50**: 1,087 × 1
- **Clinical**: 1,098 × 101

**Note:** Omics matrices are features × samples; the PAM50 and clinical tables are samples × fields.

**PAM50 Subtype Counts**:

- **LumA**: 419
- **LumB**: 140
- **Basal**: 130
- **Her2**: 46
- **Normal**: 34

## Patients in Every Dataset

- Total patients present in methylation, mRNA, miRNA, PAM50, and clinical: **769**

## Final Shapes (Per-Patient)

After aggregating multiple aliquots by mean, all modalities align on 769 patients:

- **Methylation**: 769 × 20,107
- **mRNA**: 769 × 18,321
- **miRNA**: 769 × 503
- **PAM50**: 769 × 1
- **Clinical**: 769 × 119

## Data Summary Table

| Stage                          | Clinical    | Methylation  | miRNA       | mRNA           | PAM50 (Subtype Counts)                                         | Notes                                   |
| ------------------------------ | ----------- | ------------ | ----------- | -------------- | -------------------------------------------------------------- | --------------------------------------- |
| **Original Raw Data**          | 1,098 × 101 | 20,107 × 885 | 503 × 1,189 | 18,321 × 1,212 | LumA: 509<br>LumB: 209<br>Basal: 192<br>Her2: 82<br>Normal: 40 | Raw FireHose & TCGAbiolinks files       |
| **Patient-Level Intersection** | 769 × 119   | 769 × 20,107 | 769 × 503   | 769 × 18,321   | LumA: 419<br>LumB: 140<br>Basal: 130<br>Her2: 46<br>Normal: 34 | Patients with complete data in all sets |

## Reference Links

- [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)
- [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)
- [Direct Download BRCA](http://firebrowse.org/?cohort=BRCA&download_dialog=true)


## Raw Data Overview

Inspecting the FireHose files directly after download, we first noticed:

- **Different sample sizes** for each data type
- **Multi-index structure** in some datasets
- **Presence of NaN values**, especially in clinical data

**Dataset Shapes:**
- `mirna` shape: **(503, 1189)**
- `rna` shape: **(18321, 1212)**
- `meth` shape: **(20107, 885)**
- `clinical` shape: **(18, 1097)**

**Additional Notes:**
- `mirna`, `rna`, and `meth` use gene names as index and patient/sample IDs as columns.
- `meth` and `clinical` datasets include metadata rows (e.g., "Beta_Value", "value") as part of a multi-index.
- `clinical` data contains missing values (e.g., in "days_to_death"), which will require preprocessing.

In [2]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA")

mirna_raw = pd.read_csv(root/"BRCA.miRseq_RPKM_log2.txt", sep="\t",index_col=0,low_memory=False)                            
rna_raw = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t",index_col=0,low_memory=False)
meth_raw = pd.read_csv(root/"BRCA.meth.by_mean.data.txt", sep='\t',index_col=0,low_memory=False)
clinical_raw = pd.read_csv(root / "BRCA.clin.merged.picked.txt",sep="\t", index_col=0, low_memory=False)

# display all shapes and first few rows of each dataset
print(f"mirna shape: {mirna_raw.shape}, rna shape: {rna_raw.shape}, meth shape: {meth_raw.shape}, clinical shape: {clinical_raw.shape}")
print(mirna_raw.head())
print(rna_raw.head())
print(meth_raw.head())
print(clinical_raw.head())

mirna shape: (503, 1189), rna shape: (18321, 1212), meth shape: (20107, 885), clinical shape: (18, 1097)
              TCGA-3C-AAAU-01  TCGA-3C-AALI-01  ...  TCGA-E2-A10E-01  TCGA-E2-A10F-01
gene                                            ...                                  
hsa-let-7a-1        13.129765        12.918069  ...        14.060268        12.990403
hsa-let-7a-2        14.117933        13.922300  ...        15.047592        14.006035
hsa-let-7a-3        13.147714        12.913194  ...        14.074978        13.018659
hsa-let-7b          14.595135        14.512657  ...        16.370741        15.439239
hsa-let-7c           8.414890         9.646536  ...        10.885520        11.385638

[5 rows x 1189 columns]
             TCGA-3C-AAAU-01  TCGA-3C-AALI-01  ...  TCGA-Z7-A8R5-01  TCGA-Z7-A8R6-01
gene                                           ...                                  
?|100133144         4.032489         3.211931  ...         1.178747         2.783771
?|100134869  

## TCGAbiolinks: PAM50

This section demonstrates how to use the `TCGAbiolinks` R package to access and download clinical and molecular subtype data. It begins by ensuring `TCGAbiolinks` is installed, then loads the package. It retrieves PAM50 molecular subtype labels using `TCGAquery_subtype()` and writes them to a CSV file. Additionally, it downloads clinical data using `GDCquery_clinic()` and formats it with `GDCprepare_clinic()`, saving the result as another CSV file.

```R
  # Install TCGAbiolinks
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install("TCGAbiolinks")
  }

  # Load the library
  library(TCGAbiolinks)

  # Download PAM50 subtype labels
  pam50_df <- TCGAquery_subtype(tumor = "BRCA")[ , c("patient", "BRCA_Subtype_PAM50")]
  write.csv(pam50_df, file = "BRCA_PAM50_labels.csv", row.names = FALSE, quote = FALSE)

  # Download clinical data
  clin_raw <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
  clin_df <- GDCprepare_clinic(clin_raw, clinical.info = "patient")
  write.csv(clin_df, file = "BRCA_clinical_data.csv", row.names = FALSE, quote = FALSE)
```

## Preprocessing: Phase 1

- Loaded raw data from FireHose and TCGAbiolinks
- Transposed `mirna`, `meth`, and `rna` to have samples as rows
- Standardized sample IDs (e.g., trimmed barcodes, uppercased indices)
- Aligned clinical data from both sources and merged them
- Filtered to patients present in all datasets
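
The ID standardization and intersection steps above can be sketched on toy data. This is a minimal illustration, not the notebook's exact code: the barcodes and values are made up, and the helper name `to_patient_id` is hypothetical. It assumes the patient ID is the first three dash-separated barcode fields (`TCGA-XX-XXXX`).

```python
import pandas as pd

# Toy aliquot-level tables (hypothetical barcodes and values, not real TCGA data)
mirna_toy = pd.DataFrame(
    {"hsa-let-7a-1": [13.1, 12.9, 14.0]},
    index=["TCGA-3C-AAAU-01", "tcga-3c-aali-01", "TCGA-E2-A10E-01"],
)
rna_toy = pd.DataFrame(
    {"A1BG": [4.0, 3.2, 2.8]},
    index=["TCGA-3C-AAAU-01", "TCGA-3C-AALI-01", "TCGA-Z7-A8R5-01"],
)

def to_patient_id(index: pd.Index) -> pd.Index:
    """Uppercase the barcodes and keep the first three fields (TCGA-XX-XXXX)."""
    ids = index.str.upper().str.extract(r"^(TCGA-\w{2}-\w{4})", expand=False)
    return pd.Index(ids, name="patient")

mirna_toy.index = to_patient_id(mirna_toy.index)
rna_toy.index = to_patient_id(rna_toy.index)

# Patients present in both tables
shared = sorted(set(mirna_toy.index) & set(rna_toy.index))
print(shared)
```

Trimming before intersecting matters because the omics files are indexed by sample or aliquot barcodes, while the PAM50 and clinical tables are indexed by patient.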

In [3]:
import pandas as pd

# from Firehose
mirna = pd.read_csv(root/"BRCA.miRseq_RPKM_log2.txt", sep="\t",index_col=0,low_memory=False)
meth = pd.read_csv(root/"BRCA.meth.by_mean.data.txt", sep='\t',index_col=0,low_memory=False)                             
rna = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t",index_col=0,low_memory=False)
clinical_firehose = pd.read_csv(root / "BRCA.clin.merged.picked.txt",sep="\t", index_col=0, low_memory=False).T

# from TCGABiolinks
pam50 = pd.read_csv(root /"BRCA_PAM50_labels.csv",index_col=0)
clinical_biolinks = pd.read_csv(root /"BRCA_clinical_data.csv",index_col=1)

print("Initial shapes")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical TCGABioLinks: {clinical_biolinks.shape}")
print(f"clinical FireHose: {clinical_firehose.shape}")

meth = meth.T
rna = rna.T
mirna = mirna.T

print("\nAfter transpose")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")

def trim(idx):
    return idx.to_series().str.extract(r'(^TCGA-\w\w-\w\w\w\w)')[0]

meth.index = trim(meth.index)
rna.index = trim(rna.index)
mirna.index = trim(mirna.index)
pam50.index = pam50.index.str.upper()
clinical_biolinks.index = clinical_biolinks.index.str.upper()
clinical_firehose.index = clinical_firehose.index.str.upper()

# Keep only patients present in both clinical sources
common = clinical_biolinks.index.intersection(clinical_firehose.index)
print(f"Patients in both clinical datasets: {len(common)}")
clinical_biolinks = clinical_biolinks.loc[common]
clinical_firehose = clinical_firehose.loc[common]

clinical = pd.concat([clinical_biolinks, clinical_firehose], axis=1)

print(f"Combined Clinical shape {clinical.shape}")

common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(pam50.index) & set(clinical.index))
print(f"Patients in every dataset: {len(common)}")

meth = meth.loc[common]
rna = rna.loc[common]
mirna = mirna.loc[common]
pam50 = pam50.loc[common]
clinical = clinical.loc[common]

print("\nFinal shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical: {clinical.shape}\n")

Initial shapes
meth: (20107, 885)
rna: (18321, 1212)
mirna: (503, 1189)
pam50: (1087, 1)
clinical TCGABioLinks: (1098, 101)
clinical FireHose: (1097, 18)

After transpose
meth: (885, 20107)
rna: (1212, 18321)
mirna: (1189, 503)
Patients in both clinical datasets: 1097
Combined Clinical shape (1097, 119)
Patients in every dataset: 769

Final shapes:
meth: (863, 20107)
rna: (865, 18321)
mirna: (855, 503)
pam50: (769, 1)
clinical: (769, 119)



## Handling Multiple Aliquots per Sample

To ensure each patient appears only once across datasets:

- Identified and counted patients with multiple aliquots in `meth`, `rna`, and `mirna`
- Converted all data to numeric (with coercion for errors)
- Aggregated duplicate rows by computing the mean per patient
- Aligned all datasets to retain only shared patients across `meth`, `rna`, `mirna`, `pam50`, and `clinical`

**Duplicate summary:**
- meth: 91 patients with multiple aliquots (94 extra rows)
- rna: 93 patients (96 extra rows)
- mirna: 84 patients (86 extra rows)

**Final shapes after aggregation and filtering:**
- meth: (769, 20107)
- rna: (769, 18321)
- mirna: (769, 503)
- pam50: (769, 1)
- clinical: (769, 119)
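
To make the duplicate counting and mean aggregation concrete, here is a small sketch on a toy table. The patient IDs and values are hypothetical; only the pandas calls mirror the steps listed above.

```python
import pandas as pd

# Toy table with two aliquots for one patient (hypothetical IDs and values)
toy = pd.DataFrame(
    {"gene_a": ["1.0", "3.0", "5.0"], "gene_b": ["2.0", "n/a", "6.0"]},
    index=pd.Index(["TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0002"], name="patient"),
)

counts = toy.index.value_counts()
n_multiple = int((counts > 1).sum())                     # patients with more than one aliquot
extra_rows = int(counts[counts > 1].sum()) - n_multiple  # rows beyond the first per patient

numeric = toy.apply(pd.to_numeric, errors="coerce")  # "n/a" becomes NaN
per_patient = numeric.groupby(level=0).mean()        # mean skips NaN by default

print(n_multiple, extra_rows)
print(per_patient)
```

Because `mean()` ignores NaN, a value coerced to NaN in one aliquot does not wipe out the patient's measurement when another aliquot has a valid value.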

In [4]:
for name, df in [("meth", meth), ("rna", rna), ("mirna", mirna)]:
    counts = df.index.value_counts()
    n_multiple = (counts > 1).sum()
    total_duplicates = counts[counts > 1].sum() - n_multiple
    
    print(f"{name}:")
    print(f"patients with >1 aliquot: {n_multiple}")
    print(f"total duplicate rows: {total_duplicates}\n")

meth = meth.apply(pd.to_numeric, errors="coerce")
rna = rna.apply(pd.to_numeric, errors="coerce")
mirna = mirna.apply(pd.to_numeric, errors="coerce")

meth = meth.groupby(level=0).mean()
rna = rna.groupby(level=0).mean()
mirna = mirna.groupby(level=0).mean()

# Now each has one row per patient
print("Post-aggregation shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")

common = sorted( set(meth.index) & set(rna.index) & set(mirna.index)& set(pam50.index) & set(clinical.index) )
print(f"Patients in every dataset: {len(common)}")

meth = meth.loc[common]
rna = rna.loc[common]
mirna = mirna.loc[common]
pam50 = pam50.loc[common]
clinical = clinical.loc[common]

print("\nFinal shapes")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical:{clinical.shape}")

meth:
patients with >1 aliquot: 91
total duplicate rows: 94

rna:
patients with >1 aliquot: 93
total duplicate rows: 96

mirna:
patients with >1 aliquot: 84
total duplicate rows: 86

Post-aggregation shapes:
meth: (769, 20107)
rna: (769, 18321)
mirna: (769, 503)
Patients in every dataset: 769

Final shapes
meth: (769, 20107)
rna: (769, 18321)
mirna: (769, 503)
pam50: (769, 1)
clinical:(769, 119)


## Review Data

After preprocessing (phase 1), the datasets are aligned and filtered to include only patients present in all sources.

**Sample views:**
- `meth`: Methylation data with 20,107 features
- `rna`: Gene expression data with 18,321 features
- `mirna`: miRNA expression with 503 features
- `clinical`: Demographic and clinical information with 119 columns
- `pam50`: Subtype distribution  
    - LumA: 419  
    - LumB: 140  
    - Basal: 130  
    - Her2: 46  
    - Normal: 34

All datasets now share the same set of patient IDs.

In [5]:
print(meth.head())
print(rna.head())
print(mirna.head())
print(clinical.head())
print(pam50.value_counts())

Hybridization REF  Composite Element REF      A1BG  ...  psiTPTE22      tAKR
0                                                   ...                     
TCGA-3C-AAAU                         NaN  0.483716  ...   0.247304  0.506404
TCGA-3C-AALI                         NaN  0.637191  ...   0.163022  0.623865
TCGA-3C-AALJ                         NaN  0.656092  ...   0.252328  0.504451
TCGA-3C-AALK                         NaN  0.615194  ...   0.471956  0.682468
TCGA-4H-AAAK                         NaN  0.612080  ...   0.314877  0.744877

[5 rows x 20107 columns]
gene          ?|100133144  ?|100134869  ...  ZZZ3|26009  psiTPTE22|387590
0                                       ...                              
TCGA-3C-AAAU     4.032489     3.692829  ...   10.205129          0.785174
TCGA-3C-AALI     3.211931     4.119273  ...    8.667973          9.855788
TCGA-3C-AALJ     3.538886     3.206237  ...    8.992994          5.143969
TCGA-3C-AALK     3.595671     3.469873  ...    9.453001          

## Preprocessing: Phase 2

After reviewing the data, we applied the following steps to prepare it for downstream analysis.

1. **Methylation (β-value -> M-value):**
   - Clip β-values to \[ε, 1-ε] and apply the logit transform: M = log2(β / (1-β)).
   - Drop the original `Composite Element REF` column.

2. **mRNA & miRNA:**
   - Already in log_2 scale (RSEM normalized and RPKM).

3. **Quality Control:**
   - Count samples with all-zero rows in each modality.
   - Compute NaN counts post-transformation, then replace all NaNs with 0.

4. **Column Name Cleaning:**
   - Replace all `-` and `|` characters with `_`.
   - Replace `?` with `unknown`.

5. **Label Encoding:**
   - Map `PAM50` subtypes to integers: 
      - Normal = 0
      - Basal = 1 
      - Her2 = 2
      - LumA = 3
      - LumB = 4

6. **Alignment & Aggregation:**
   - Trim barcodes to patient level.
   - Aggregate duplicate aliquots by mean per patient.
   - Drop the `project` column from clinical.
   - Subset all tables to the common patient set (no missing or all-zero samples).
   - Set up a common index across all files.

7. **Final Output Shapes:**
   - Methylation M-value: 769 × 20,106
   - mRNA (log_2): 769 × 18,321
   - miRNA (log_2): 769 × 503
   - PAM50 labels: 769 × 1
   - Clinical covariates: 769 × 118
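
As a worked example of steps 1 and 4, the sketch below applies the β-to-M transform and the column-name cleaning to a toy table. The values are invented, the helper `clean_columns` is a hypothetical name, and `eps=1e-6` is an assumed clipping constant. A β-value of 0.5 maps to M = 0 and 0.8 maps to M = log2(4) = 2, while 0 and 1 stay finite thanks to clipping.

```python
import numpy as np
import pandas as pd

def beta_to_m(beta: pd.DataFrame, eps: float = 1e-6) -> pd.DataFrame:
    """Clip beta-values to [eps, 1 - eps], then apply the base-2 logit."""
    clipped = beta.clip(lower=eps, upper=1.0 - eps)
    return np.log2(clipped / (1.0 - clipped))

def clean_columns(columns: pd.Index) -> pd.Index:
    """Replace '?', '|' and '-' so names are safe identifiers; collapse repeated '_'."""
    cleaned = columns.str.replace("?", "unknown_", regex=False)
    cleaned = cleaned.str.replace("|", "_", regex=False)
    cleaned = cleaned.str.replace("-", "_", regex=False)
    cleaned = cleaned.str.replace(r"_+", "_", regex=True)
    return cleaned.str.strip("_")

# Toy beta-values (hypothetical), including the boundary values 0 and 1
toy_beta = pd.DataFrame({"A1BG": [0.5, 0.8, 0.0], "?|100133144": [0.2, 0.5, 1.0]})

toy_m = beta_to_m(toy_beta)
toy_m.columns = clean_columns(toy_m.columns)
print(toy_m)
```

Without clipping, β = 0 and β = 1 would produce -inf and +inf, which would propagate into any later variance or distance computation.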

In [6]:
import numpy as np
import pandas as pd

def beta_to_m(df, eps=1e-6):
    B = np.clip(df.values, eps, 1.0 - eps)
    M = np.log2(B / (1 - B))
    return pd.DataFrame(M, index=df.index, columns=df.columns)

# find rows that are all 0s
zeros_meth = (meth  == 0).all(axis=1).sum()
zeros_rna = (rna   == 0).all(axis=1).sum()
zeros_mirna = (mirna == 0).all(axis=1).sum()
print(f"All zeros: meth: {zeros_meth}, rna: {zeros_rna}, mirna: {zeros_mirna}")

# find rows with all nans
nan_meth = meth.isna().all(axis=1).sum()
nan_rna = rna.isna().all(axis=1).sum()
nan_mirna = mirna.isna().all(axis=1).sum()
nan_clinical = clinical.isna().all(axis=1).sum()
nan_pam50 = pam50.isna().all(axis=1).sum()
print(f"nan_meth: {nan_meth}, nan_rna: {nan_rna}, nan_mirna: {nan_mirna}, nan_clinical: {nan_clinical}, nan_pam50: {nan_pam50}")

# map PAM50 subtypes to integers
mapping = {"Normal":0, "Basal":1, "Her2":2, "LumA":3, "LumB":4}
pam50 = pam50["BRCA_Subtype_PAM50"].map(mapping).to_frame(name="pam50")

# drop and transform methylation
meth_clean = meth.drop(columns=["Composite Element REF"], errors="ignore")
meth_m = beta_to_m(meth_clean)
clinical = clinical.drop(columns=["project"], errors="ignore")

# clean column names and fill nans
for df in [meth_m, rna, mirna]:
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")
    df.fillna(0, inplace=True)

# check for nans after filling
print("NaN counts after filling:")
print(meth_m.isna().sum().sum(),rna.isna().sum().sum(),mirna.isna().sum().sum(),clinical.isna().sum().sum(),pam50.isna().sum().sum())

# align index to PAM50
X_meth = meth_m.loc[pam50.index]
X_rna = rna.loc[pam50.index]
X_mirna = mirna.loc[pam50.index]
clinical= clinical.loc[pam50.index]

print(f"new shapes: meth: {X_meth.shape}, rna: {X_rna.shape}, mirna: {X_mirna.shape}, pam50: {pam50.shape}, clinical: {clinical.shape}")
print(X_meth.head())
print(X_rna.head())
print(X_mirna.head())
print(clinical.head())
print(pam50.value_counts())

All zeros: meth: 0, rna: 0, mirna: 0
nan_meth: 0, nan_rna: 0, nan_mirna: 0, nan_clinical: 0, nan_pam50: 0
NaN counts after filling:
0 0 0 46476 0
new shapes: meth: (769, 20106), rna: (769, 18321), mirna: (769, 503), pam50: (769, 1), clinical: (769, 118)
Hybridization REF      A1BG      A1CF  ...  psiTPTE22      tAKR
patient                                ...                     
TCGA-3C-AAAU      -0.094004 -1.251175  ...  -1.605783  0.036955
TCGA-3C-AALI       0.812517 -0.237291  ...  -2.360128  0.729981
TCGA-3C-AALJ       0.931878 -0.059301  ...  -1.567104  0.025686
TCGA-3C-AALK       0.676913  0.741678  ...  -0.162004  1.103860
TCGA-4H-AAAK       0.657963  0.044649  ...  -1.121575  1.545812

[5 rows x 20106 columns]
gene          unknown_100133144  unknown_100134869  ...  ZZZ3_26009  psiTPTE22_387590
patient                                             ...                              
TCGA-3C-AAAU           4.032489           3.692829  ...   10.205129          0.785174
TCGA-3C-AALI  

## Save & Load

Our data is clean and consistently structured across all modalities.

- All-zero rows: meth: 0, rna: 0, mirna: 0  
- All-NaN rows: meth: 0, rna: 0, mirna: 0, clinical: 0, pam50: 0  
- NaN values exist in clinical data:
    - A total of 46,476 NaN entries
    - Will be addressed in the next step

**Final dataset shapes:**
- meth: 769 × 20,106  
- rna: 769 × 18,321  
- mirna: 769 × 503  
- clinical: 769 × 118  
- pam50: 769 × 1

**Saving files:**  
Set a common patient index across all datasets and saved each one as a `.csv` file.

**Verifying saved files:**  
Loaded each `.csv` and printed the head to confirm successful read/write with preserved structure and content.
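
A round-trip check like the one described can also be automated rather than inspected by eye. The sketch below is an illustration under assumptions: it writes a two-row toy frame to a temporary directory (not the project folder) and uses `pd.testing.assert_frame_equal` to confirm that values, column names, and the `patient` index survive.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a named patient index (values are illustrative)
toy = pd.DataFrame(
    {"A1BG": [-0.094004, 0.812517], "A1CF": [-1.251175, -0.237291]},
    index=pd.Index(["TCGA-3C-AAAU", "TCGA-3C-AALI"], name="patient"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_meth.csv"
    toy.to_csv(path, index=True)
    reloaded = pd.read_csv(path, index_col=0)

# Raises AssertionError if shape, labels, or values differ beyond float tolerance
pd.testing.assert_frame_equal(toy, reloaded)
print("round trip OK:", reloaded.shape)
```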


In [7]:
# Setting up a common index and saving to csv
X_meth.index.name = "patient"
X_rna.index.name = "patient"
X_mirna.index.name = "patient"
pam50.index.name = "patient"
clinical.index.name = "patient"

X_meth.to_csv(root / "meth.csv", index=True)
X_rna.to_csv(root / "rna.csv", index=True)
X_mirna.to_csv(root / "mirna.csv", index=True)
pam50.to_csv(root / "pam50.csv", index=True)
clinical.to_csv(root / "clinical.csv", index=True)

In [8]:
# To confirm our data saved and loads properly:
meth = pd.read_csv(root / "meth.csv", index_col=0)
rna = pd.read_csv(root / "rna.csv", index_col=0)
mirna = pd.read_csv(root / "mirna.csv", index_col=0)
pam50 = pd.read_csv(root / "pam50.csv", index_col=0)
clinical = pd.read_csv(root / "clinical.csv", index_col=0)
    
print(meth.head())
print(rna.head())
print(mirna.head())
print(clinical.head())
print(pam50.head())

                  A1BG      A1CF  ...  psiTPTE22      tAKR
patient                           ...                     
TCGA-3C-AAAU -0.094004 -1.251175  ...  -1.605783  0.036955
TCGA-3C-AALI  0.812517 -0.237291  ...  -2.360128  0.729981
TCGA-3C-AALJ  0.931878 -0.059301  ...  -1.567104  0.025686
TCGA-3C-AALK  0.676913  0.741678  ...  -0.162004  1.103860
TCGA-4H-AAAK  0.657963  0.044649  ...  -1.121575  1.545812

[5 rows x 20106 columns]
              unknown_100133144  unknown_100134869  ...  ZZZ3_26009  psiTPTE22_387590
patient                                             ...                              
TCGA-3C-AAAU           4.032489           3.692829  ...   10.205129          0.785174
TCGA-3C-AALI           3.211931           4.119273  ...    8.667973          9.855788
TCGA-3C-AALJ           3.538886           3.206237  ...    8.992994          5.143969
TCGA-3C-AALK           3.595671           3.469873  ...    9.453001          6.057699
TCGA-4H-AAAK           2.775430           3.8

## Feature Selection: Phase 1

To explore different ways of selecting informative features, we evaluated three built-in methods:
- variance thresholding  
- ANOVA F-test  
- random forest importance  

Each method highlights different statistical properties: overall variability, class-based separability, and model-derived relevance. Here, we applied all three and compared the overlap between selected features to assess their agreement.

**Methods applied:**
- Selected the top 6000 features for both methylation and RNA datasets using each method  
- Compared feature overlap across methods  
- miRNA was excluded due to its limited feature count (503 total)  
- Selection was necessary for methylation and RNA, which originally had over 20,000 and 18,000 features, respectively

**Methylation feature selection:**
- ANOVA F-test & variance share: 2,091 features  
- Random forest & variance share: 1,871 features  
- ANOVA F-test & random forest share: 2,201 features  
- All three methods agree on: 815 features

**RNA feature selection:**
- ANOVA F-test & variance share: 2,340 features  
- Random forest & variance share: 2,218 features  
- ANOVA F-test & random forest share: 2,546 features  
- All three methods agree on: 1,134 features

These overlaps suggest that while each method captures unique aspects of the data, there is meaningful agreement, particularly between ANOVA and random forest.
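
For readers without BioNeuralNet installed, the comparison can be approximated with scikit-learn on synthetic data. This is a rough sketch, not the library's implementation: `f_classif` and `RandomForestClassifier` stand in for the ANOVA and random forest selectors, the data and the `top_k` helper are hypothetical, and the internals of the `bioneuralnet.utils.preprocess` functions may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n_samples, n_features, k = 120, 40, 10

# Synthetic data: only the first five features carry class signal
y = rng.integers(0, 3, size=n_samples)
values = rng.normal(size=(n_samples, n_features))
values[:, :5] += 2.0 * y[:, None]
X = pd.DataFrame(values, columns=[f"f{i}" for i in range(n_features)])

def top_k(scores: pd.Series, k: int) -> set:
    """Names of the k highest-scoring features."""
    return set(scores.nlargest(k).index)

by_variance = top_k(X.var(), k)

f_scores, _ = f_classif(X, y)
by_anova = top_k(pd.Series(f_scores, index=X.columns), k)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
by_forest = top_k(pd.Series(forest.feature_importances_, index=X.columns), k)

all_three = by_variance & by_anova & by_forest
print("ANOVA & variance:", len(by_anova & by_variance))
print("Forest & variance:", len(by_forest & by_variance))
print("ANOVA & forest:", len(by_anova & by_forest))
print("All three:", len(all_three))
```

Variance is label-agnostic, so it can rank noisy but uninformative features highly, whereas the ANOVA and forest scores both use the labels. That is one plausible reason the two supervised methods overlap most.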

In [9]:
from bioneuralnet.utils.preprocess import select_top_k_variance
from bioneuralnet.utils.preprocess import top_anova_f_features
from bioneuralnet.utils.preprocess import select_top_randomforest

# feature selection
meth_highvar = select_top_k_variance(meth, k=6000)
rna_highvar = select_top_k_variance(rna, k=6000)

meth_af = top_anova_f_features(meth, pam50, max_features=6000)
rna_af = top_anova_f_features(rna, pam50, max_features=6000)

meth_rf = select_top_randomforest(meth, pam50, top_k=6000)
rna_rf = select_top_randomforest(rna, pam50, top_k=6000)

meth_var = list(meth_highvar.columns)
meth_anova = list(meth_af.columns)
meth_rf = list(meth_rf.columns)

rna_var = list(rna_highvar.columns)
rna_anova = list(rna_af.columns)
rna_rf = list(rna_rf.columns)

def ordered_overlap(first, *others):
    """Items of `first` that also appear in every list in `others`, in original order."""
    other_sets = [set(o) for o in others]
    return [x for x in first if all(x in s for s in other_sets)]

# Pairwise and three-way overlaps for methylation
inter1 = ordered_overlap(meth_anova, meth_var)
inter2 = ordered_overlap(meth_rf, meth_var)
inter3 = ordered_overlap(meth_anova, meth_rf)
meth_all_three = ordered_overlap(meth_anova, meth_rf, meth_var)

# Pairwise and three-way overlaps for RNA
inter4 = ordered_overlap(rna_anova, rna_var)
inter5 = ordered_overlap(rna_rf, rna_var)
inter6 = ordered_overlap(rna_anova, rna_rf)
rna_all_three = ordered_overlap(rna_anova, rna_rf, rna_var)


2025-05-28 11:30:33,011 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-28 11:30:33,011 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-28 11:30:33,011 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-28 11:30:33,115 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-05-28 11:30:35,816 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-28 11:30:35,817 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-28 11:30:35,817 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-28 11:30:35,911 - bioneuralnet.utils.preprocess - INFO - Selected top 6000 features by variance
2025-05-28 11:30:38,886 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-28 11:30:38,886 - bioneuralnet.

In [10]:
print("Methylation feature selection:\n")
print(f"Anova-F & variance selection share: {len(inter1)} features")
print(f"Random Forest & variance selection share: {len(inter2)} features")
print(f"Anova-F & Random Forest share: {len(inter3)} features")
print(f"All three methods agree on: {len(meth_all_three)} features")

Methylation feature selection:

Anova-F & variance selection share: 2091 features
Random Forest & variance selection share: 1871 features
Anova-F & Random Forest share: 2201 features
All three methods agree on: 815 features


In [11]:
print("\nRNA feature selection:\n")
print(f"Anova-F & variance selection share: {len(inter4)} features")
print(f"Random Forest & variance selection share: {len(inter5)} features")
print(f"Anova-F & Random Forest share: {len(inter6)} features")
print(f"All three methods agree on: {len(rna_all_three)} features")


RNA feature selection:

Anova-F & variance selection share: 2340 features
Random Forest & variance selection share: 2218 features
Anova-F & Random Forest share: 2546 features
All three methods agree on: 1134 features


In [12]:
out_dir = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA/ANOVA")
out_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if it does not exist

rna_af.to_csv(out_dir / "rna_anova.csv")
meth_af.to_csv(out_dir / "meth_anova.csv")

## Data Accessibility: Using `DatasetLoader`

To make this dataset easy to use, we packaged it into the `DatasetLoader` component. Due to GitHub and PyPI file size limits, we included only the top 6,000 features each for methylation and RNA, selected with the ANOVA F-test in the previous step.

If you have additional preprocessed or raw datasets you would like to contribute, feel free to reach out; we are happy to help expand the platform.

In [13]:
from bioneuralnet.datasets import DatasetLoader

tcga_brca = DatasetLoader("brca")

print(f"TCGA BRCA dataset shape: {tcga_brca.shape}")
brca_meth = tcga_brca.data["meth"]
brca_rna = tcga_brca.data["rna"]
brca_mirna = tcga_brca.data["mirna"]
brca_clinical = tcga_brca.data["clinical"]
brca_pam50 = tcga_brca.data["pam50"]


TCGA BRCA dataset shape: {'mirna': (769, 503), 'pam50': (769, 1), 'clinical': (769, 118), 'rna': (769, 6000), 'meth': (769, 6000)}


## Feature Selection: Phase 2

We used `preprocess_clinical` to reduce the clinical dataset to the top 10 most informative features based on random forest importance.

- Dropped samples with missing PAM50 labels  
- Subset all datasets to matched patients  
- Ignored non-informative age-related columns  
- No scaling applied

**Result:**
- Clinical data reduced to 10 features across the 769 patients  
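
The general idea behind this kind of clinical reduction can be sketched as follows. This is an assumption-laden illustration, not the source of `preprocess_clinical`: the function `select_clinical_features`, the toy columns, and the exact order of steps (one-hot encoding, median imputation, zero-variance filter, forest ranking) are hypothetical choices that merely match the log messages shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def select_clinical_features(clinical, labels, top_k=10, ignore_columns=()):
    """Hypothetical helper: encode, impute, filter, then rank by forest importance."""
    data = clinical.drop(columns=list(ignore_columns), errors="ignore")
    data = pd.get_dummies(data).astype(float)   # one-hot encode non-numeric fields
    data = data.fillna(data.median())           # median imputation per column
    data = data.loc[:, data.var() > 0]          # drop zero-variance columns
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data, labels)
    importance = pd.Series(forest.feature_importances_, index=data.columns)
    keep = importance.nlargest(min(top_k, data.shape[1])).index
    return data[keep]

# Toy clinical table (hypothetical fields and values)
rng = np.random.default_rng(0)
labels = pd.Series(rng.integers(0, 2, size=60), name="pam50")
toy_clinical = pd.DataFrame({
    "number_of_lymph_nodes": labels * 3 + rng.normal(size=60),
    "laterality": rng.choice(["Left", "Right"], size=60),
    "years_to_birth": rng.normal(60, 10, size=60),
    "constant_field": 1.0,
})
toy_clinical.loc[::10, "number_of_lymph_nodes"] = np.nan

selected = select_clinical_features(
    toy_clinical, labels, top_k=2, ignore_columns=["years_to_birth"]
)
print(selected.columns.tolist())
```

One-hot encoding explains why selected column names in the real output look like `laterality_Right`: a single categorical field becomes one boolean column per category.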

In [14]:
from bioneuralnet.utils.preprocess import preprocess_clinical

#shapes
print(f"RNA shape: {brca_rna.shape}")
print(f"METH shape: {brca_meth.shape}")
print(f"miRNA shape: {brca_mirna.shape}")
print(f"Clinical shape: {brca_clinical.shape}")
print(f"Phenotype shape: {brca_pam50.shape}")
print(f"Phenotype counts:\n{brca_pam50.value_counts()}")

#check nans in pam50
print(f"Nan values in pam50 {brca_pam50.isna().sum().sum()}")
brca_pam50 = brca_pam50.dropna()

X_rna = brca_rna.loc[brca_pam50.index]
X_meth = brca_meth.loc[brca_pam50.index]
X_mirna = brca_mirna.loc[brca_pam50.index]
clinical = brca_clinical.loc[brca_pam50.index]

# for more details on the preprocessing function, see bioneuralnet.utils.preprocess
clinical = preprocess_clinical(clinical, brca_pam50, top_k=10, scale=False, ignore_columns=["days_to_birth", "age_at_diagnosis", "days_to_last_followup", "age_at_index", "years_to_birth"])
print(clinical.head())

2025-05-28 11:31:03,616 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-28 11:31:03,616 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 31384 NaNs after median imputation
2025-05-28 11:31:03,617 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 39 columns dropped due to zero variance


RNA shape: (769, 6000)
METH shape: (769, 6000)
miRNA shape: (769, 503)
Clinical shape: (769, 118)
Phenotype shape: (769, 1)
Phenotype counts:
pam50
3        419
4        140
1        130
2         46
0         34
Name: count, dtype: int64
Nan values in pam50 0


2025-05-28 11:31:03,978 - bioneuralnet.utils.preprocess - INFO - Selected top 10 features by RandomForest importance


              year_of_diagnosis  number_of_lymph_nodes  ...  laterality_Right  primary_diagnosis_Infiltrating duct carcinoma, NOS
patient                                                 ...                                                                      
TCGA-3C-AAAU             2004.0                    4.0  ...             False                                              False 
TCGA-3C-AALI             2003.0                    1.0  ...              True                                               True 
TCGA-3C-AALJ             2011.0                    1.0  ...              True                                               True 
TCGA-3C-AALK             2011.0                    0.0  ...              True                                               True 
TCGA-4H-AAAK             2013.0                    4.0  ...             False                                              False 

[5 rows x 10 columns]


## Graph Construction

We built a k-NN cosine similarity graph to capture relationships between features across omics layers.

- Selected 1,000 features each from methylation and RNA, and all 503 from miRNA  
- Combined into `X_train_full` (769 × 2,503), no NaNs found  
- Transposed the matrix to treat features as nodes  
- Constructed a cosine similarity graph with `k=15`
- Graph shape: 2,503 × 2,503 (features × features)
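
To show what a k-NN cosine similarity graph over features looks like, here is a NumPy-only sketch on toy data. It is not the implementation of `gen_similarity_graph`: the function `knn_cosine_graph`, the choice to symmetrize by keeping an edge when either endpoint selects the other, and the use of raw cosine values as edge weights are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def knn_cosine_graph(features: pd.DataFrame, k: int) -> pd.DataFrame:
    """Rows of `features` are nodes; returns a symmetric weighted adjacency matrix."""
    values = features.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / np.clip(norms, 1e-12, None)      # guard against zero vectors
    sim = unit @ unit.T                              # pairwise cosine similarity

    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)                # exclude self-matches
    neighbors = np.argsort(-masked, axis=1)[:, :k]   # k most similar nodes per row

    chosen = np.zeros(sim.shape, dtype=bool)
    rows = np.repeat(np.arange(sim.shape[0]), k)
    chosen[rows, neighbors.ravel()] = True
    chosen |= chosen.T                               # keep an edge if either end chose it

    adj = np.where(chosen, sim, 0.0)
    return pd.DataFrame(adj, index=features.index, columns=features.index)

# Toy samples x features matrix; transpose so that features become nodes
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(30, 12)), columns=[f"feat_{i}" for i in range(12)])
A_toy = knn_cosine_graph(X_toy.T, k=3)
print(A_toy.shape)  # (12, 12)
```

The transpose is the key step: with samples as rows the graph would connect patients, whereas the feature-level graph used here connects genes, CpG-derived features, and miRNAs.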

In [15]:
from bioneuralnet.utils.preprocess import top_anova_f_features
from bioneuralnet.utils.graph import gen_similarity_graph

meth_sel = top_anova_f_features(X_meth, brca_pam50, max_features=1000)
rna_sel = top_anova_f_features(X_rna, brca_pam50 ,max_features=1000)
mirna_sel = top_anova_f_features(X_mirna, brca_pam50,max_features=503)
X_train_full = pd.concat([meth_sel, rna_sel, mirna_sel], axis=1)

# we check again for nan values then drop if any
print(f"Nan values in X_train_full: {X_train_full.isna().sum().sum()}")
X_train_full = X_train_full.dropna()
print(f"Nan value in X_train_full after dropping: {X_train_full.isna().sum().sum()}")

print(f"X_train_full shape: {X_train_full.shape}")
# building the graph using the similarity graph function with k=15
A_train = gen_similarity_graph(X_train_full.T, k=15)

print(f"\nNetwork shape: {A_train.shape}")

2025-05-28 11:31:04,871 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-28 11:31:04,871 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-28 11:31:04,871 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-28 11:31:04,914 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 6000 significant, 0 padded
2025-05-28 11:31:05,822 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-28 11:31:05,822 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-28 11:31:05,823 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-28 11:31:05,866 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 6000 significant, 0 padded
2025-05-28 11:31:05,945 - bioneuralnet.utils.preproc

Nan values in X_train_full: 0
Nan value in X_train_full after dropping: 0
X_train_full shape: (769, 2503)

Network shape: (2503, 2503)


## DPMON Run Summary

We evaluated **DPMON** for PAM50 subtype classification using multi-omics data (RNA, methylation, miRNA), a feature graph, and clinical covariates.

**Performance Metrics:**
- Accuracy: 0.9870
- F1-Weighted: 0.9875 
- F1-Macro: 0.9695
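
Macro F1 is typically lower than weighted F1 on imbalanced labels such as PAM50, because it averages per-class scores without regard to class size. The toy example below uses hypothetical labels, not DPMON predictions, and shows the effect with scikit-learn's `f1_score`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): class 0 is common, class 1 is rare
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(actual, predicted)
f1_weighted = f1_score(actual, predicted, average="weighted")
f1_macro = f1_score(actual, predicted, average="macro")

# Per-class F1: class 0 = 16/17, class 1 = 2/3
# weighted = 0.8 * 16/17 + 0.2 * 2/3; macro = (16/17 + 2/3) / 2
print(f"accuracy={acc:.4f} weighted={f1_weighted:.4f} macro={f1_macro:.4f}")
```

A single error on the rare class costs the macro score far more than the weighted one, so the gap between the two is a quick indicator of how the small classes are being handled.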

DPMON is an end-to-end optimized pipeline that fuses multi-omics data and network structure for disease prediction using GNNs.  

For implementation details, see the [documentation](https://bioneuralnet.readthedocs.io/en/latest/gnns.html#how-dpmon-uses-gnns-differently).  

For the full paper, see:
[2] Hussein, S., Ramos, V., et al. *Learning from Multi-Omics Networks to Enhance Disease Prediction: An Optimized Network Embedding and Fusion Approach.*  
**IEEE BIBM 2024**, Lisbon, Portugal, pp. 4371–4378. DOI: [10.1109/BIBM62325.2024.10822233](https://doi.org/10.1109/BIBM62325.2024.10822233)

In [None]:
from bioneuralnet.downstream_task import DPMON
from sklearn.metrics import accuracy_score, f1_score

save = Path("/home/vicente/Github/BioNeuralNet/dpmon_output")
brca_pam50 = brca_pam50.rename(columns={"pam50": "phenotype"})

dpmon = DPMON(
    adjacency_matrix=A_train,
    omics_list=[meth_sel, rna_sel, mirna_sel],
    phenotype_data=brca_pam50,
    clinical_data=clinical,
    repeat_num=5,
    tune=True,
    gpu=True, 
    cuda=0,
    output_dir=Path(save),
)

predictions_df, avg_accuracy = dpmon.run()
actual = predictions_df["Actual"]
pred = predictions_df["Predicted"]

dpmon_acc = accuracy_score(actual, pred)
dpmon_f1w = f1_score(actual, pred, average='weighted')
dpmon_f1m = f1_score(actual, pred, average='macro')

print(f"\nDPMON results:")
print(f"Accuracy: {dpmon_acc}")
print(f"F1 weighted: {dpmon_f1w}")
print(f"F1 macro: {dpmon_f1m}")

2025-05-29 13:09:32,081 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/dpmon_output
2025-05-29 13:09:32,081 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-05-29 13:09:32,082 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-05-29 13:09:32,096 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-05-29 13:09:32,096 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-05-29 13:09:32,177 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 2503
2025-05-29 13:09:34,491 - bioneuralnet.downstream_task.dpmon - INFO - Starting hyperparameter tuning for dataset shape: (769, 2504)
2025-05-29 13:10:23,658 - bioneuralnet.downstream_task.dpmon - INFO - Best trial config: {'gnn_layer_num': 4, 'gnn_hidden_dim': 64, 'lr': 0.02435222881645533, 'weight_decay': 0.0005853927207500042, 'nn_hidden_dim1': 64, 'n


DPMON results:
Accuracy: 0.9869960988296489
F1 weighted: 0.9874857727588546
F1 macro: 0.9695114345114344
