# TCGA-BRCA Demo

## Dataset Source

- **Omics Data**: [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)
- **Clinical and PAM50 Data**: [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)

## Dataset Overview

**Original Data**:

- **Methylation**: 20,107 × 885
- **mRNA**: 18,321 × 1,212
- **miRNA**: 503 × 1,189
- **PAM50**: 1,087 × 1
- **Clinical**: 1,098 × 101

**Note**: Omics matrices are features × samples; clinical matrices are samples × fields.

**PAM50 Subtype Counts**:

- **LumA**: 419
- **LumB**: 140
- **Basal**: 130
- **Her2**: 46
- **Normal**: 34

## Patients in Every Dataset

- Total patients present in methylation, mRNA, miRNA, PAM50, and clinical: **769**

## Final Shapes (Per-Patient)

After aggregating multiple aliquots by mean, all modalities align on 769 patients:

- **Methylation**: 769 × 20,107
- **mRNA**: 769 × 18,321
- **miRNA**: 769 × 503
- **PAM50**: 769 × 1
- **Clinical**: 769 × 119

## Data Summary Table

| Stage                          | Clinical    | Methylation  | miRNA       | mRNA           | PAM50 (Subtype Counts)                                         | Notes                                   |
| ------------------------------ | ----------- | ------------ | ----------- | -------------- | -------------------------------------------------------------- | --------------------------------------- |
| **Original Raw Data**          | 1,098 × 101 | 20,107 × 885 | 503 × 1,189 | 18,321 × 1,212 | LumA: 509<br>LumB: 209<br>Basal: 192<br>Her2: 82<br>Normal: 40 | Raw FireHose & TCGAbiolinks files       |
| **Patient-Level Intersection** | 769 × 119   | 769 × 20,107 | 769 × 503   | 769 × 18,321   | LumA: 419<br>LumB: 140<br>Basal: 130<br>Her2: 46<br>Normal: 34 | Patients with complete data in all sets |

## Reference Links

- [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)
- [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)
- [Direct Download BRCA](http://firebrowse.org/?cohort=BRCA&download_dialog=true)


## Let's take a look at the data from FireHose directly after download

In [1]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA")

mirna_raw = pd.read_csv(root / "BRCA.miRseq_RPKM_log2.txt", sep="\t", index_col=0, low_memory=False)
rna_raw = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t", index_col=0, low_memory=False)
meth_raw = pd.read_csv(root / "BRCA.meth.by_mean.data.txt", sep="\t", index_col=0, low_memory=False)
clinical_raw = pd.read_csv(root / "BRCA.clin.merged.picked.txt", sep="\t", index_col=0, low_memory=False)

print(f"mirna shape: {mirna_raw.shape}, rna shape: {rna_raw.shape}, meth shape: {meth_raw.shape}, clinical shape: {clinical_raw.shape}")
print(mirna_raw.head())
print(rna_raw.head())
print(meth_raw.head())
print(clinical_raw.head())

mirna shape: (503, 1189), rna shape: (18321, 1212), meth shape: (20107, 885), clinical shape: (18, 1097)
              TCGA-3C-AAAU-01  TCGA-3C-AALI-01  TCGA-3C-AALJ-01  \
gene                                                              
hsa-let-7a-1        13.129765        12.918069        13.012033   
hsa-let-7a-2        14.117933        13.922300        14.010002   
hsa-let-7a-3        13.147714        12.913194        13.028483   
hsa-let-7b          14.595135        14.512657        13.419612   
hsa-let-7c           8.414890         9.646536         9.312455   

              TCGA-3C-AALK-01  TCGA-4H-AAAK-01  TCGA-5L-AAT0-01  \
gene                                                              
hsa-let-7a-1        13.144697        13.411684        13.316301   
hsa-let-7a-2        14.141721        14.413518        14.310917   
hsa-let-7a-3        13.151281        13.420481        13.327144   
hsa-let-7b          14.667196        14.438548        14.576493   
hsa-let-7c          11.

## TCGAbiolinks

This section demonstrates how to use the `TCGAbiolinks` R package to access and download clinical and molecular subtype data. It begins by ensuring `TCGAbiolinks` is installed, then loads the package. It retrieves PAM50 molecular subtype labels using `TCGAquery_subtype()` and writes them to a CSV file. Additionally, it downloads clinical data using `GDCquery_clinic()` and formats it with `GDCprepare_clinic()`, saving the result as another CSV file.

```R
  # Install TCGAbiolinks
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install("TCGAbiolinks")
  }

  # Load the library
  library(TCGAbiolinks)

  # Download PAM50 subtype labels
  pam50_df <- TCGAquery_subtype(tumor = "BRCA")[ , c("patient", "BRCA_Subtype_PAM50")]
  write.csv(pam50_df, file = "BRCA_PAM50_labels.csv", row.names = FALSE, quote = FALSE)

  # Download clinical data
  clin_raw <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
  clin_df <- GDCprepare_clinic(clin_raw, clinical.info = "patient")
  write.csv(clin_df, file = "BRCA_clinical_data.csv", row.names = FALSE, quote = FALSE)
```

In [2]:
import pandas as pd

# from Firehose
mirna = pd.read_csv(root/"BRCA.miRseq_RPKM_log2.txt", sep="\t",index_col=0,low_memory=False)
meth = pd.read_csv(root/"BRCA.meth.by_mean.data.txt", sep='\t',index_col=0,low_memory=False)                             
rna = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t",index_col=0,low_memory=False)
clinical_firehose = pd.read_csv(root / "BRCA.clin.merged.picked.txt",sep="\t", index_col=0, low_memory=False).T

# from TCGABiolinks
pam50 = pd.read_csv(root /"BRCA_PAM50_labels.csv",index_col=0)
clinical_biolinks = pd.read_csv(root /"BRCA_clinical_data.csv",index_col=1)

print("Initial shapes")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical TCGABioLinks: {clinical_biolinks.shape}")
print(f"clinical FireHose: {clinical_firehose.shape}")

meth = meth.T
rna = rna.T
mirna = mirna.T

print("\nAfter tranpose")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")

def trim(idx):
    return idx.to_series().str.extract(r'(^TCGA-\w\w-\w\w\w\w)')[0]

meth.index = trim(meth.index)
rna.index = trim(rna.index)
mirna.index = trim(mirna.index)
pam50.index = pam50.index.str.upper()
clinical_biolinks.index = clinical_biolinks.index.str.upper()
clinical_firehose.index = clinical_firehose.index.str.upper()

# patients present in both clinical sources
common = clinical_biolinks.index.intersection(clinical_firehose.index)
print(f"Patients in both clinical datasets: {len(common)}")

clinical_biolinks = clinical_biolinks.loc[common]
clinical_firehose = clinical_firehose.loc[common]

clinical = pd.concat([clinical_biolinks, clinical_firehose], axis=1)

print(f"Combined Clinical shape {clinical.shape}")

common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(pam50.index) & set(clinical.index))
print(f"Patients in every dataset: {len(common)}")

meth = meth.loc[common]
rna = rna.loc[common]
mirna = mirna.loc[common]
pam50 = pam50.loc[common]
clinical = clinical.loc[common]

print("\nFinal shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical: {clinical.shape}\n")

Initial shapes
meth: (20107, 885)
rna: (18321, 1212)
mirna: (503, 1189)
pam50: (1087, 1)
clinical TCGABioLinks: (1098, 101)
clinical FireHose: (1097, 18)

After tranpose
meth: (885, 20107)
rna: (1212, 18321)
mirna: (1189, 503)
Patients in both clinical datasets: 1097
Combined Clinical shape (1097, 119)
Patients in every dataset: 769

Final shapes:
meth: (863, 20107)
rna: (865, 18321)
mirna: (855, 503)
pam50: (769, 1)
clinical: (769, 119)



## Handling Multiple Aliquots per Sample

This section addresses cases where some patients have multiple aliquots in the `meth`, `rna`, and `mirna` datasets. It first identifies and counts patients with duplicate entries. It then coerces all data to numeric types and aggregates the duplicates by taking the mean across aliquots for each patient, so that each patient has exactly one row. After aggregation, the datasets are aligned by keeping only the patients common to all five datasets (`meth`, `rna`, `mirna`, `pam50`, and `clinical`). The result is a set of matched samples ready for integrated analysis.
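
As a minimal sketch of the trim-and-average idea described above, the toy example below uses made-up barcodes and values (not the real TCGA data). The barcode pattern mirrors the one used earlier in this notebook; everything else is an illustrative assumption.

```python
import pandas as pd

# Toy sample-level matrix: rows are sample barcodes, columns are features.
# Barcodes and values are made up for illustration.
toy = pd.DataFrame(
    {"geneA": [1.0, 3.0, 5.0], "geneB": [2.0, 4.0, 6.0]},
    index=["TCGA-AA-0001-01", "TCGA-AA-0001-11", "TCGA-BB-0002-01"],
)

# Trim each barcode to its patient-level prefix (TCGA-XX-XXXX).
toy.index = toy.index.str.extract(r"^(TCGA-\w{2}-\w{4})", expand=False)

# Count patients that appear more than once after trimming.
counts = toy.index.value_counts()
n_multiple = int((counts > 1).sum())

# Average the duplicate rows so each patient keeps a single row.
per_patient = toy.groupby(level=0).mean()

print(f"patients with >1 row: {n_multiple}")
print(per_patient)
```

Here the two rows for `TCGA-AA-0001` collapse into one row holding their column-wise mean, while `TCGA-BB-0002` is left unchanged.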

In [3]:
for name, df in [("meth", meth), ("rna", rna), ("mirna", mirna)]:
    counts = df.index.value_counts()
    n_multiple = (counts > 1).sum()
    total_duplicates = counts[counts > 1].sum() - n_multiple
    
    print(f"{name}:")
    print(f"patients with >1 aliquot: {n_multiple}")
    print(f"total duplicate rows: {total_duplicates}\n")

meth = meth.apply(pd.to_numeric, errors="coerce")
rna = rna.apply(pd.to_numeric, errors="coerce")
mirna = mirna.apply(pd.to_numeric, errors="coerce")

meth = meth.groupby(level=0).mean()
rna = rna.groupby(level=0).mean()
mirna = mirna.groupby(level=0).mean()

# Now each has one row per patient
print("Post-aggregation shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")

common = sorted( set(meth.index) & set(rna.index) & set(mirna.index)& set(pam50.index) & set(clinical.index) )
print(f"Patients in every dataset: {len(common)}")

meth = meth.loc[common]
rna = rna.loc[common]
mirna = mirna.loc[common]
pam50 = pam50.loc[common]
clinical = clinical.loc[common]

print("\nFinal shapes")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical:{clinical.shape}")

meth:
patients with >1 aliquot: 91
total duplicate rows: 94

rna:
patients with >1 aliquot: 93
total duplicate rows: 96

mirna:
patients with >1 aliquot: 84
total duplicate rows: 86

Post-aggregation shapes:
meth: (769, 20107)
rna: (769, 18321)
mirna: (769, 503)
Patients in every dataset: 769

Final shapes
meth: (769, 20107)
rna: (769, 18321)
mirna: (769, 503)
pam50: (769, 1)
clinical:(769, 119)


## Review the first few rows of each file

In [4]:
print(meth.head())
print(rna.head())
print(mirna.head())
print(clinical.head())
print(pam50.value_counts())

Hybridization REF  Composite Element REF      A1BG      A1CF     A2BP1  \
0                                                                        
TCGA-3C-AAAU                         NaN  0.483716  0.295827  0.187700   
TCGA-3C-AALI                         NaN  0.637191  0.458973  0.240516   
TCGA-3C-AALJ                         NaN  0.656092  0.489725  0.279088   
TCGA-3C-AALK                         NaN  0.615194  0.625765  0.488889   
TCGA-4H-AAAK                         NaN  0.612080  0.507737  0.463845   

Hybridization REF     A2LD1       A2M     A2ML1    A4GALT     A4GNT      AAA1  \
0                                                                               
TCGA-3C-AAAU       0.629586  0.559654  0.835412  0.484800  0.690217  0.807805   
TCGA-3C-AALI       0.666272  0.607505  0.842391  0.550047  0.749890  0.395290   
TCGA-3C-AALJ       0.755630  0.662360  0.829020  0.476107  0.653756  0.795102   
TCGA-3C-AALK       0.745751  0.727982  0.835365  0.556016  0.652005  0.81642

## Preprocessing

After reviewing the data above, we applied the following steps to the data before further analysis.

1. Methylation (beta value B -> M-value)
   - Clip B-values to [ε, 1 - ε] with ε = 1e-6, then apply the logit transform: M = log_2(B / (1 - B)).
   - Drop the original `Composite Element REF` column.

2. mRNA & miRNA:
   - Already on a log_2 scale (RSEM-normalized mRNA and RPKM miRNA), so no further transform is applied.

3. Quality Control:
   - Count samples whose rows are all zeros in each modality.
   - Count all-NaN rows in each table, then replace the remaining NaNs in the omics matrices with 0 (clinical NaNs are left in place at this stage).

4. Column Name Cleaning:
   - Replace all `-` and `|` characters with `_`.
   - Replace `?` with `unknown_`, then collapse repeated underscores and strip leading or trailing ones.

5. Label Encoding:
   - Map `PAM50` subtypes to integers: 
      - Normal = 0
      - Basal = 1 
      - Her2 = 2
      - LumA = 3
      - LumB = 4

6. Alignment & Aggregation:
   - Trim barcodes to patient level.
   - Aggregate duplicate aliquots by mean per patient.
   - Drop the `project` column from clinical.
   - Subset all tables to the common patient set (no missing or all-zero samples).
   - Set up a common index across all files.

7. Final Output Shapes:
   - Methylation M-value: 769 × 20,106
   - mRNA (log_2): 769 × 18,321
   - miRNA (log_2): 769 × 503
   - PAM50 labels: 769 × 1
   - Clinical covariates: 769 × 118
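
To make steps 1 and 4 concrete, here is a small worked sketch on made-up beta values and made-up column names. The clipping constant `eps = 1e-6` is an assumption for this illustration, and the toy frame is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy beta values (made up); real methylation betas lie in [0, 1].
beta = pd.DataFrame(
    {"CpG-1|a": [0.5, 0.8], "?|CpG-2": [0.0, 0.2]},
    index=["P1", "P2"],
)

eps = 1e-6  # assumed clipping constant, keeps the logit finite at 0 and 1
clipped = beta.clip(lower=eps, upper=1 - eps)

# Logit transform in base 2: M = log2(B / (1 - B)).
m_values = np.log2(clipped / (1 - clipped))

# Column-name cleaning: '?' -> 'unknown_', '|' and '-' -> '_',
# then collapse repeated underscores and strip them from the ends.
cols = m_values.columns
cols = cols.str.replace("?", "unknown_", regex=False)
cols = cols.str.replace("|", "_", regex=False)
cols = cols.str.replace("-", "_", regex=False)
cols = cols.str.replace(r"_+", "_", regex=True).str.strip("_")
m_values.columns = cols

print(m_values)
```

A beta of 0.5 maps to an M-value of 0, 0.8 maps to about 2, and 0.2 maps to about -2. Without clipping, a beta of exactly 0 would map to negative infinity; with clipping it becomes a large negative but finite value.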

In [5]:
import numpy as np
import pandas as pd

def beta_to_m(df, eps=1e-6):
    B = np.clip(df.values, eps, 1.0 - eps)
    M = np.log2(B / (1 - B))
    return pd.DataFrame(M, index=df.index, columns=df.columns)

# find rows that are all 0s
zeros_meth = (meth  == 0).all(axis=1).sum()
zeros_rna = (rna   == 0).all(axis=1).sum()
zeros_mirna = (mirna == 0).all(axis=1).sum()
print(f"All zeros: meth: {zeros_meth}, rna: {zeros_rna}, mirna: {zeros_mirna}")

# find rows with all nans
nan_meth = meth.isna().all(axis=1).sum()
nan_rna = rna.isna().all(axis=1).sum()
nan_mirna = mirna.isna().all(axis=1).sum()
nan_clinical = clinical.isna().all(axis=1).sum()
nan_pam50 = pam50.isna().all(axis=1).sum()
print(f"nan_meth: {nan_meth}, nan_rna: {nan_rna}, nan_mirna: {nan_mirna}, nan_clinical: {nan_clinical}, nan_pam50: {nan_pam50}")

# map PAM50 subtypes to integers
mapping = {"Normal":0, "Basal":1, "Her2":2, "LumA":3, "LumB":4}
pam50 = pam50["BRCA_Subtype_PAM50"].map(mapping).to_frame(name="pam50")

# drop and transform methylation
meth_clean = meth.drop(columns=["Composite Element REF"], errors="ignore")
meth_m = beta_to_m(meth_clean)
clinical = clinical.drop(columns=["project"], errors="ignore")

# clean column names and fill nans
for df in [meth_m, rna, mirna]:
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")
    df.fillna(0, inplace=True)

# check for nans after filling
print("NaN counts after filling:")
print(meth_m.isna().sum().sum(),rna.isna().sum().sum(),mirna.isna().sum().sum(),clinical.isna().sum().sum(),pam50.isna().sum().sum())

# align index to PAM50
X_meth = meth_m.loc[pam50.index]
X_rna = rna.loc[pam50.index]
X_mirna = mirna.loc[pam50.index]
clinical= clinical.loc[pam50.index]

print(f"new shapes: meth: {X_meth.shape}, rna: {X_rna.shape}, mirna: {X_mirna.shape}, pam50: {pam50.shape}, clinical: {clinical.shape}")
print(X_meth.head())
print(X_rna.head())
print(X_mirna.head())
print(clinical.head())
print(pam50.value_counts())

All zeros: meth: 0, rna: 0, mirna: 0
nan_meth: 0, nan_rna: 0, nan_mirna: 0, nan_clinical: 0, nan_pam50: 0
NaN counts after filling:
0 0 0 46476 0
new shapes: meth: (769, 20106), rna: (769, 18321), mirna: (769, 503), pam50: (769, 1), clinical: (769, 118)
Hybridization REF      A1BG      A1CF     A2BP1     A2LD1       A2M     A2ML1  \
patient                                                                         
TCGA-3C-AAAU      -0.094004 -1.251175 -2.113585  0.765262  0.345896  2.343631   
TCGA-3C-AALI       0.812517 -0.237291 -1.658888  0.997440  0.630221  2.418135   
TCGA-3C-AALJ       0.931878 -0.059301 -1.369104  1.628617  0.972130  2.277584   
TCGA-3C-AALK       0.676913  0.741678 -0.064133  1.552454  1.420200  2.343133   
TCGA-4H-AAAK       0.657963  0.044649 -0.209004  1.212210  1.170304  2.021628   

Hybridization REF    A4GALT     A4GNT      AAA1      AAAS  ...    ZWILCH  \
patient                                                    ...             
TCGA-3C-AAAU      -0.08774

In [6]:
# Setting up a common index and saving to csv
X_meth.index.name = "patient"
X_rna.index.name = "patient"
X_mirna.index.name = "patient"
pam50.index.name = "patient"
clinical.index.name = "patient"

X_meth.to_csv(root / "meth.csv", index=True)
X_rna.to_csv(root / "rna.csv", index=True)
X_mirna.to_csv(root / "mirna.csv", index=True)
pam50.to_csv(root / "pam50.csv", index=True)
clinical.to_csv(root / "clinical.csv", index=True)

In [7]:
# To confirm our data saved and loads properly:
meth = pd.read_csv(root / "meth.csv", index_col=0)
rna = pd.read_csv(root / "rna.csv", index_col=0)
mirna = pd.read_csv(root / "mirna.csv", index_col=0)
pam50 = pd.read_csv(root / "pam50.csv", index_col=0)
clinical = pd.read_csv(root / "clinical.csv", index_col=0)
    
print(meth.head())
print(rna.head())
print(mirna.head())
print(clinical.head())
print(pam50.head())

                  A1BG      A1CF     A2BP1     A2LD1       A2M     A2ML1  \
patient                                                                    
TCGA-3C-AAAU -0.094004 -1.251175 -2.113585  0.765262  0.345896  2.343631   
TCGA-3C-AALI  0.812517 -0.237291 -1.658888  0.997440  0.630221  2.418135   
TCGA-3C-AALJ  0.931878 -0.059301 -1.369104  1.628617  0.972130  2.277584   
TCGA-3C-AALK  0.676913  0.741678 -0.064133  1.552454  1.420200  2.343133   
TCGA-4H-AAAK  0.657963  0.044649 -0.209004  1.212210  1.170304  2.021628   

                A4GALT     A4GNT      AAA1      AAAS  ...    ZWILCH     ZWINT  \
patient                                               ...                       
TCGA-3C-AAAU -0.087741  1.155791  2.071436 -2.650851  ... -2.972923 -4.132523   
TCGA-3C-AALI  0.289780  1.584114 -0.613329 -4.072465  ... -2.989465 -4.369032   
TCGA-3C-AALJ -0.137988  0.916964  1.956230 -3.781647  ... -2.969472 -4.488190   
TCGA-3C-AALK  0.324621  0.905816  2.152928 -3.894574  ... -2.5

## Feature Selection

To support downstream analysis, we provide three built-in feature selection methods:

- Variance thresholding
- ANOVA F-test
- Random Forest importance

These are designed to help users quickly identify the most informative features from high-dimensional omics datasets. Each method captures different statistical properties, ranging from general variability to class-based separability and model-derived relevance. In this section, we put all three to the test and examine how much they agree with each other.
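
To illustrate why these criteria can disagree, the sketch below ranks synthetic features by variance and by a hand-computed one-way ANOVA F statistic. It is a stand-in written with NumPy and pandas only: `top_k_variance` and `anova_f` are hypothetical helpers and are not the `bioneuralnet` implementations, whose internals are not shown here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 60 samples, 3 classes, 5 features (all values made up).
y = np.repeat([0, 1, 2], 20)
X = pd.DataFrame(rng.normal(size=(60, 5)), columns=[f"f{i}" for i in range(5)])
X["f0"] += 3.0 * y  # class-dependent mean: separates the classes
X["f1"] *= 5.0      # high variance, but unrelated to the class labels


def top_k_variance(df, k):
    """Hypothetical helper: names of the k columns with the largest variance."""
    return set(df.var().nlargest(k).index)


def anova_f(df, labels):
    """Hypothetical helper: one-way ANOVA F statistic for every column."""
    classes = np.unique(labels)
    grand_mean = df.mean()
    ss_between = sum(
        (labels == c).sum() * (df[labels == c].mean() - grand_mean) ** 2
        for c in classes
    )
    ss_within = sum(
        ((df[labels == c] - df[labels == c].mean()) ** 2).sum() for c in classes
    )
    df_between = len(classes) - 1
    df_within = len(df) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)


var_top = top_k_variance(X, k=2)
anova_top = set(anova_f(X, y).nlargest(2).index)
shared = var_top & anova_top

print(f"variance top-2: {sorted(var_top)}")
print(f"ANOVA top-2: {sorted(anova_top)}")
print(f"shared: {sorted(shared)}")
```

In this toy setup, `f1` ranks highly on variance only because it was rescaled, whereas `f0` ranks highly on both criteria because its mean shifts with the class label. Partial overlap of this kind is what the comparison in this section measures.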

In [None]:
from bioneuralnet.utils.preprocess import select_top_k_variance
from bioneuralnet.utils.preprocess import top_anova_f_features
from bioneuralnet.utils.preprocess import select_top_randomforest

# feature selection
meth_highvar = select_top_k_variance(meth, k=6000)
rna_highvar = select_top_k_variance(rna, k=6000)

meth_af = top_anova_f_features(meth, pam50, max_features=6000)
rna_af = top_anova_f_features(rna, pam50, max_features=6000)

meth_rf = select_top_randomforest(meth, pam50, top_k=6000)
rna_rf = select_top_randomforest(rna, pam50, top_k=6000)

meth_var = list(meth_highvar.columns)
meth_anova = list(meth_af.columns)
meth_rf = list(meth_rf.columns)

rna_var = list(rna_highvar.columns)
rna_anova = list(rna_af.columns)
rna_rf = list(rna_rf.columns)

def shared(first, *others):
    # keep the items of `first` that also appear in every other list
    lookups = [set(other) for other in others]
    return [x for x in first if all(x in lookup for lookup in lookups)]

# pairwise and three-way overlaps for methylation
inter1 = shared(meth_anova, meth_var)
inter2 = shared(meth_rf, meth_var)
inter3 = shared(meth_anova, meth_rf)
meth_all_three = shared(meth_anova, meth_rf, meth_var)

# pairwise and three-way overlaps for mRNA
inter4 = shared(rna_anova, rna_var)
inter5 = shared(rna_rf, rna_var)
inter6 = shared(rna_anova, rna_rf)
rna_all_three = shared(rna_anova, rna_rf, rna_var)


In [14]:
print("Methylation feature selection:\n")
print(f"Anova-F & variance selection share: {len(inter1)} features")
print(f"Random Forest & variance selection share: {len(inter2)} features")
print(f"Anova-F & Random Forest share: {len(inter3)} features")
print(f"All three methods agree on: {len(meth_all_three)} features")

Methylation feature selection:

Anova-F & variance selection share: 2091 features
Random Forest & variance selection share: 1871 features
Anova-F & Random Forest share: 2201 features
All three methods agree on: 815 features


In [15]:
print("\nRNA feature selection:\n")
print(f"Anova-F & variance selection share: {len(inter4)} features")
print(f"Random Forest & variance selection share: {len(inter5)} features")
print(f"Anova-F & Random Forest share: {len(inter6)} features")
print(f"All three methods agree on: {len(rna_all_three)} features")


RNA feature selection:

Anova-F & variance selection share: 2340 features
Random Forest & variance selection share: 2218 features
Anova-F & Random Forest share: 2546 features
All three methods agree on: 1134 features


In [9]:
out_dir = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA/ANOVA")
# create the output folder if it does not exist yet
out_dir.mkdir(parents=True, exist_ok=True)

rna_af.to_csv(out_dir / "rna_anova.csv")
meth_af.to_csv(out_dir / "meth_anova.csv")

## Easy Access via DatasetLoader

To make this data easier to work with, we have made it available through our `DatasetLoader` component. Because of GitHub and PyPI file size limits, the bundled methylation and mRNA matrices contain only the top 6,000 features selected by ANOVA F-test; miRNA keeps all 503 features. If you have additional pre-processed or raw datasets you would like to include, feel free to reach out. We are happy to support expanding the platform and adding new datasets.

In [10]:
from bioneuralnet.datasets import DatasetLoader

tcga_brca = DatasetLoader("brca")

print(f"TCGA BRCA dataset shape: {tcga_brca.shape}")
brca_meth = tcga_brca.data["meth"]
brca_rna = tcga_brca.data["rna"]
brca_mirna = tcga_brca.data["mirna"]
brca_clinical = tcga_brca.data["clinical"]
brca_pam50 = tcga_brca.data["pam50"]


TCGA BRCA dataset shape: {'mirna': (769, 503), 'pam50': (769, 1), 'clinical': (769, 118), 'rna': (769, 6000), 'meth': (769, 6000)}


In [11]:
from bioneuralnet.utils.preprocess import preprocess_clinical

#shapes
print(f"RNA shape: {brca_rna.shape}")
print(f"METH shape: {brca_meth.shape}")
print(f"miRNA shape: {brca_mirna.shape}")
print(f"Clinical shape: {brca_clinical.shape}")
print(f"Phenotype shape: {brca_pam50.shape}")
print(f"Phenotype counts:\n{brca_pam50.value_counts()}")

#check nans in pam50
print(f"Nan values in pam50 {brca_pam50.isna().sum().sum()}")
brca_pam50 = brca_pam50.dropna()

X_rna = brca_rna.loc[brca_pam50.index]
X_meth = brca_meth.loc[brca_pam50.index]
X_mirna = brca_mirna.loc[brca_pam50.index]
clinical = brca_clinical.loc[brca_pam50.index]

# for more details on the preprocessing function, see bioneuralnet.utils.preprocess
clinical = preprocess_clinical(clinical, brca_pam50, top_k=15, scale=True, ignore_columns=["days_to_birth", "age_at_diagnosis", "days_to_last_followup", "age_at_index", "years_to_birth"])
print(clinical.head())

2025-05-23 12:34:11,231 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-23 12:34:11,232 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 31384 NaNs after median imputation
2025-05-23 12:34:11,232 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 39 columns dropped due to zero variance


RNA shape: (769, 6000)
METH shape: (769, 6000)
miRNA shape: (769, 503)
Clinical shape: (769, 118)
Phenotype shape: (769, 1)
Phenotype counts:
pam50
3        419
4        140
1        130
2         46
0         34
Name: count, dtype: int64
Nan values in pam50 0


2025-05-23 12:34:11,612 - bioneuralnet.utils.preprocess - INFO - Selected top 15 features by RandomForest importance


              age_at_diagnosis  days_to_birth  years_to_birth  age_at_index  \
patient                                                                       
TCGA-3C-AAAU           20211.0       -20211.0            55.0          55.0   
TCGA-3C-AALI           18538.0       -18538.0            50.0          50.0   
TCGA-3C-AALJ           22848.0       -22848.0            62.0          62.0   
TCGA-3C-AALK           19074.0       -19074.0            52.0          52.0   
TCGA-4H-AAAK           18371.0       -18371.0            50.0          50.0   

              days_to_last_followup  year_of_diagnosis  number_of_lymph_nodes  \
patient                                                                         
TCGA-3C-AAAU                 4047.0              -1.50                    1.5   
TCGA-3C-AALI                 4005.0              -1.75                    0.0   
TCGA-3C-AALJ                 1474.0               0.25                    0.0   
TCGA-3C-AALK                 1448.0      

## Preparing Multi-Omics Data for downstream tasks

1. Check sample overlap.

2. Select top features.

    - The methylation and mRNA datasets have already been filtered down to the top 6,000 features using the ANOVA F-test, but this is still high-dimensional for most modeling tasks.

    - To make the data more tractable, we apply the ANOVA F-test again to select the top 1,000 most discriminative features from methylation and mRNA; miRNA keeps all 503 features.

3. Combine datasets.

    - Selected features from RNA, methylation, and miRNA are combined into a single dataset.

4. Clean missing values.

    - Count missing (NaN) values in the combined dataset and drop any rows that contain them.

5. Build similarity graph.

    - Creates a k-nearest neighbors graph from the transposed feature matrix.

    - Other supported methods include correlation-based graphs, soft-thresholding (WGCNA-style), Gaussian kernels, and mutual information networks.

Note: For more details on preprocessing functions and graph generation algorithms, see the [Utils documentation](https://bioneuralnet.readthedocs.io/en/latest/utils.html)
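
As a rough sketch of step 5, the example below builds a symmetric k-nearest-neighbor graph over features from a transposed toy matrix. The internals of `gen_similarity_graph` are not shown in this notebook, so `knn_similarity_graph` is a hypothetical stand-in, and the use of cosine similarity as the edge weight is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def knn_similarity_graph(features, k):
    """Hypothetical stand-in: symmetric kNN graph over the rows of `features`,
    weighted by cosine similarity (an assumption, not the library's method)."""
    values = features.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches from the search

    # Mark the k most similar neighbors of every node.
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    mask = np.zeros(sim.shape, dtype=bool)
    rows = np.repeat(np.arange(sim.shape[0]), k)
    mask[rows, neighbors.ravel()] = True
    mask |= mask.T  # keep an edge if either endpoint selected it

    adj = np.where(mask, sim, 0.0)
    return pd.DataFrame(adj, index=features.index, columns=features.index)


rng = np.random.default_rng(0)
# Toy samples x features matrix (8 patients, 6 features); values are made up.
X = pd.DataFrame(rng.normal(size=(8, 6)), columns=[f"feat{i}" for i in range(6)])

# Transpose so that each graph node is a feature rather than a patient.
A = knn_similarity_graph(X.T, k=2)
print(A.shape)
```

Because the matrix is transposed before the graph is built, the adjacency matrix has one row and one column per feature, which is why the network built later in this section is square in the number of selected features.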

In [12]:
from sklearn.metrics import accuracy_score, f1_score
from bioneuralnet.utils.preprocess import top_anova_f_features
from bioneuralnet.utils.graph import gen_similarity_graph

meth_sel = top_anova_f_features(X_meth, brca_pam50, max_features=1000)
rna_sel = top_anova_f_features(X_rna, brca_pam50, max_features=1000)
mirna_sel = top_anova_f_features(X_mirna, brca_pam50, max_features=503)
X_train_full = pd.concat([meth_sel, rna_sel, mirna_sel], axis=1)

# we check again for nan values then drop if any
print(f"Nan values in X_train_full: {X_train_full.isna().sum().sum()}")
X_train_full = X_train_full.dropna()
print(f"Nan value in X_train_full after dropping: {X_train_full.isna().sum().sum()}")

print(f"X_train_full shape: {X_train_full.shape}")
# building the graph using the similarity graph function with k=15
A_train = gen_similarity_graph(X_train_full.T, k=15)

print(f"\nNetwork shape: {A_train.shape}")

2025-05-23 12:34:12,482 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-23 12:34:12,482 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-23 12:34:12,482 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-23 12:34:12,527 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 6000 significant, 0 padded
2025-05-23 12:34:13,385 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-23 12:34:13,385 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-23 12:34:13,385 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-23 12:34:13,431 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 6000 significant, 0 padded
2025-05-23 12:34:13,506 - bioneuralnet.utils.preproc

Nan values in X_train_full: 0
Nan value in X_train_full after dropping: 0
X_train_full shape: (769, 2503)

Network shape: (2503, 2503)


In [13]:
from bioneuralnet.downstream_task import DPMON

save = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA/results")
brca_pam50 = brca_pam50.rename(columns={"pam50": "phenotype"})

dpmon = DPMON(
    adjacency_matrix=A_train,
    omics_list=[meth_sel, rna_sel, mirna_sel],
    phenotype_data=brca_pam50,
    clinical_data=clinical,
    repeat_num=3,
    tune=True,
    gpu=True, 
    cuda=0,
    output_dir=Path(save/"run"),
)

predictions_df, avg_accuracy = dpmon.run()
actual = predictions_df["Actual"]
pred = predictions_df["Predicted"]
dp_acc = accuracy_score(actual, pred)
dp_f1w = f1_score(actual, pred, average='weighted')
dp_f1m = f1_score(actual, pred, average='macro')

print(f"\nDPMON results:")
print(f"Accuracy: {dp_acc}")
print(f"F1 weighted: {dp_f1w}")
print(f"F1 macro: {dp_f1m}")

2025-05-23 12:34:14,025 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/run
2025-05-23 12:34:14,025 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-05-23 12:34:14,026 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-05-23 12:34:14,038 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-05-23 12:34:14,039 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-05-23 12:34:14,039 - bioneuralnet.downstream_task.dpmon - INFO - Slicing omics dataset based on network nodes.
2025-05-23 12:34:14,042 - bioneuralnet.downstream_task.dpmon - INFO - Building PyTorch Geometric Data object from adjacency matrix.
2025-05-23 12:34:14,109 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 2503
2025-05-23 12:34:14,109 - bioneuralnet.downstream_task.dpmon - INFO - Using clinical vars for nod



(tune_train_n pid=395677) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/vicente/ray_results/tune_dp/T89880_00000/checkpoint_000000)
(tune_train_n pid=395677) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/vicente/ray_results/tune_dp/T89880_00000/checkpoint_000001)
(tune_train_n pid=395677) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/vicente/ray_results/tune_dp/T89880_00000/checkpoint_000002)
(tune_train_n pid=395900) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/vicente/ray_results/tune_dp/T89880_00002/checkpoint_000003) [repeated 5x across cluster]
(tune_train_n pid=395978) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/vicente/ray_results/tune_dp/T89880_00003/checkpoint_000000) [repeated 97x across cluster]
(tune_train_n pid=396057) Checkpoint successfully created a


DPMON results:
Accuracy: 0.9570871261378413
F1 weighted: 0.9611444079317725
F1 macro: 0.9270923032997647
