# Datasets Guide


- The `DatasetLoader` class provides a simple interface to access example multi-omics datasets included in this package.  
- Each dataset is loaded as a dictionary of **pandas DataFrames**, exposed through the `.data` attribute, with table names as keys and the corresponding tables as values.
- Users can explore the structure of any dataset via the `.shape` property, which returns a mapping from table name to `(rows, columns)`.  
- Three datasets are available out of the box:

    1. **example1**:
        - Synthetic dataset designed for testing and demonstration  
        - Contains small DataFrames: `X1`, `X2`, `Y`, `clinical_data`  
        - Useful for quick checks of package functionality

    2. **monet**:  
        - Multi-omics benchmark dataset from the **Multi-Omics NETwork Analysis Workshop (MONET)**. 
        - Includes multiple DataFrames: `gene_data`, `mirna_data`, `phenotype`, `rppa_data`, `clinical_data`  
        - Workshop details: <https://coloradosph.cuanschutz.edu/research-and-practice/centers-programs/cida/learning/multi-omics-network-analysis-workshop>

    3. **brca**:
        - Breast cancer cohort from TCGA (BRCA project)  
        - Provides comprehensive omics DataFrames: `rna`, `mirna`, `meth`, `pam50`, `clinical`  
        - Full dataset description available at: <https://bioneuralnet.readthedocs.io/en/latest/TCGA-BRCA_Dataset.html>
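
As a rough illustration of the structure described in the list above, the sketch below builds a small dictionary of DataFrames by hand and derives the same kind of name-to-`(rows, columns)` mapping that `.shape` returns. The toy tables and the names `tables` and `shapes` are made up for this example. How `DatasetLoader` computes `.shape` internally is not shown in this guide, so treat this as an assumed equivalent rather than the package's implementation.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for a loaded dataset (hypothetical values, not package data).
rng = np.random.default_rng(0)
tables = {
    "X1": pd.DataFrame(
        rng.normal(size=(4, 3)), columns=["Gene_1", "Gene_2", "Gene_3"]
    ),
    "Y": pd.DataFrame({"phenotype": rng.normal(size=4)}),
}

# Mapping from table name to (rows, columns), the same layout `.shape` reports.
shapes = {name: df.shape for name, df in tables.items()}

for name, (rows, cols) in shapes.items():
    print(f"{name}: {rows} x {cols}")
```

Keeping the tables in a plain dictionary means any per-table check can be written as a single loop or comprehension over `tables.items()`.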



In [4]:
from bioneuralnet.datasets import DatasetLoader

for name in ["example1", "monet", "brca"]:
    ds = DatasetLoader(name)
    print(f"{name} shapes:\n")
    for tbl, (rows, cols) in ds.shape.items():
        print(f"{tbl}: {rows} x {cols}")
    print("\n")

example1 shapes:

X1: 358 x 500
X2: 358 x 100
Y: 358 x 1
clinical_data: 358 x 6


monet shapes:

gene_data: 107 x 5039
mirna_data: 107 x 789
phenotype: 106 x 1
rppa_data: 107 x 175
clinical_data: 107 x 5


brca shapes:

mirna: 769 x 503
pam50: 769 x 1
clinical: 769 x 118
rna: 769 x 6000
meth: 769 x 6000
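
The row counts above are not always identical across tables: in `monet`, `phenotype` has 106 rows while the other tables have 107. Before combining tables it can help to restrict them to the samples they share. The sketch below shows one way to do that on hypothetical toy tables, assuming the row index identifies samples consistently across tables. That assumption is worth verifying on the real data before relying on it.

```python
import pandas as pd

# Hypothetical toy tables: sample "s3" is missing from the phenotype table.
gene = pd.DataFrame({"g1": [0.1, 0.2, 0.3, 0.4]}, index=["s0", "s1", "s2", "s3"])
phenotype = pd.DataFrame({"label": [1, 0, 0]}, index=["s0", "s1", "s2"])
tables = {"gene": gene, "phenotype": phenotype}

# Report tables whose row count differs from the first table's.
expected_rows = len(gene)
mismatched = [name for name, df in tables.items() if len(df) != expected_rows]
print("Tables with a different row count:", mismatched)

# Keep only the samples present in every table.
common = gene.index
for df in tables.values():
    common = common.intersection(df.index)

aligned = {name: df.loc[common] for name, df in tables.items()}
for name, df in aligned.items():
    print(f"{name}: {df.shape[0]} x {df.shape[1]}")
```

After alignment every table has the same rows in the same order, so they can be passed together to downstream steps without silent misalignment.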




### Example 1: Synthetic dataset

In [None]:
from bioneuralnet.datasets import DatasetLoader

example = DatasetLoader("example1")
omics1 = example.data["X1"]
omics2 = example.data["X2"]
phenotype = example.data["Y"]
clinical = example.data["clinical_data"]

display(omics1.iloc[:, :5])
display(omics2.iloc[:, :5])
display(phenotype.iloc[:, :5])
display(clinical.iloc[:, :5])

,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5
Samp_1,22.485701,40.353720,31.025745,20.847206,26.697293
Samp_2,37.058850,34.052233,33.487020,23.531461,26.754628
Samp_3,20.530767,31.669623,35.189567,20.952544,25.018826
Samp_4,33.186888,38.480880,18.897097,31.823300,34.049383
Samp_5,28.961981,41.060494,28.494956,18.374495,30.815238
...,...,...,...,...,...
Samp_354,24.520652,28.595409,31.299666,32.095379,33.659730
Samp_355,31.252789,28.988087,29.574195,31.189288,32.098841
Samp_356,24.894826,25.944887,30.852641,26.705158,30.102546
Samp_357,17.034337,38.574705,25.095201,37.062442,35.417758


,Mir_1,Mir_2,Mir_3,Mir_4,Mir_5
Samp_1,15.223913,17.545826,15.784719,14.891983,10.348205
Samp_2,16.306965,16.672830,13.361529,14.488549,12.660905
Samp_3,16.545119,16.735005,14.617472,17.845267,13.822790
Samp_4,13.986899,16.207432,16.293078,17.725286,12.300565
Samp_5,16.338332,17.393869,16.397925,15.853725,13.387675
...,...,...,...,...,...
Samp_354,15.065065,16.079830,14.635616,17.013845,11.612843
Samp_355,15.997576,15.448951,15.355566,16.501752,11.701778
Samp_356,15.206862,14.395378,16.218001,16.044955,13.650741
Samp_357,14.474129,15.482863,15.512549,15.136613,14.531277


,phenotype
Samp_1,235.067423
Samp_2,253.544991
Samp_3,234.204994
Samp_4,281.035429
Samp_5,245.447781
...,...
Samp_354,236.120451
Samp_355,222.572359
Samp_356,268.472285
Samp_357,235.808167


PatientID,Age,Gender,BMI,Chronic_Bronchitis,Emphysema
Samp_1,78,0,31.2,1,1
Samp_2,68,1,19.2,1,0
Samp_3,54,1,19.3,0,1
Samp_4,47,1,36.2,0,0
Samp_5,60,1,26.2,0,1
...,...,...,...,...,...
Samp_354,71,0,23.0,1,0
Samp_355,62,1,25.5,0,1
Samp_356,61,0,21.1,1,0
Samp_357,64,0,37.6,0,0


### MONET: Dataset from the **Multi-Omics NETwork Analysis Workshop (MONET)**, University of Colorado Anschutz

In [None]:
from bioneuralnet.datasets import DatasetLoader

monet = DatasetLoader("monet")
gene = monet.data["gene_data"]
mirna = monet.data["mirna_data"]
phenotype = monet.data["phenotype"]
rppa = monet.data["rppa_data"]
clinical = monet.data["clinical_data"]

display(gene.iloc[:, :5])
display(mirna.iloc[:, :5])
display(phenotype.iloc[:, :5])
display(rppa.iloc[:, :5])
display(clinical.iloc[:, :5])

,A2ML1,AACSL,AADAC,AADAT,AATK
0,0.466671,0.074845,0.990309,-0.410873,1.897562
1,-0.524465,-0.146727,-0.735206,-0.628456,0.170962
2,-0.029879,-0.626509,-0.735206,-0.677892,0.020060
3,0.674895,-0.626509,-0.330409,-0.662162,-0.911966
4,-0.110607,-0.626509,-0.735206,-0.848542,0.042645
...,...,...,...,...,...
102,0.999600,1.343979,-0.735206,1.742674,0.253103
103,-0.919337,-0.626509,-0.735206,-1.380461,-0.899019
104,-0.606702,-0.626509,0.497658,-0.717505,-0.625122
105,1.911346,0.021380,-0.166624,0.785542,0.344953


,hsa-let-7a-1,hsa-let-7a-2,hsa-let-7a-3,hsa-let-7b,hsa-let-7c
0,-0.832527,-0.851616,-0.837155,-1.079659,-0.181270
1,0.229155,0.249696,0.234295,0.859289,-0.057729
2,0.414268,0.417023,0.408913,0.635059,1.195203
3,-0.855214,-0.869152,-0.862713,-1.955447,-0.572552
4,1.365310,1.356252,1.351750,1.259095,-0.316760
...,...,...,...,...,...
102,-1.402001,-1.401125,-1.349961,-1.534386,1.231456
103,2.551277,2.547046,2.563191,1.054769,0.981436
104,0.182138,0.188730,0.191094,-1.060615,0.345907
105,0.289489,0.292470,0.297778,-0.197850,1.040270


,0
0,1
1,0
2,0
3,0
4,0
...,...
101,0
102,1
103,0
104,0


,YWHAE,YWHAZ,EIF4EBP1,TP53BP1,ARAF
0,-0.357998,0.099812,-1.067285,-0.412211,-0.357998
1,-0.055031,-0.517445,0.032633,-0.743096,-0.055031
2,-0.137863,-0.559690,0.302764,-0.968388,-0.137863
3,-0.170726,-0.028206,-0.341461,0.282581,-0.170726
4,-1.430765,-0.138087,-0.545894,-0.616864,-1.430765
...,...,...,...,...,...
102,-0.708685,-0.778813,1.623365,-0.090612,-0.708685
103,0.261442,-0.407563,-0.567735,-0.186919,0.261442
104,1.350866,1.461061,-1.159541,-1.674874,1.350866
105,0.179510,-0.300029,-1.048938,-0.621680,0.179510


,overall_survival,status,years_to_birth,race,radiation_therapy
0,3015,0,37,blackorafricanamerican,yes
1,2348,1,73,white,yes
2,3011,0,41,asian,yes
3,3283,0,67,white,no
4,1873,0,42,white,no
...,...,...,...,...,...
102,2329,0,63,white,yes
103,1004,1,74,white,yes
104,984,0,46,white,yes
105,867,0,44,white,no


### BRCA: Breast cancer cohort dataset

In [None]:
from bioneuralnet.datasets import DatasetLoader

brca = DatasetLoader("brca")
rna = brca.data["rna"]
mirna = brca.data["mirna"]
meth = brca.data["meth"]
pam50 = brca.data["pam50"]
clinical = brca.data["clinical"]

display(rna.iloc[:, :5])
display(mirna.iloc[:, :5])
display(meth.iloc[:, :5])
display(pam50.iloc[:, :5])
display(clinical.iloc[:, :5])

patient,ESR1_2099,FOXA1_3169,MLPH_79083,AGR3_155465,TBC1D9_23158
TCGA-3C-AAAU,11.755706,12.411608,12.566775,12.049729,14.173402
TCGA-3C-AALI,6.098358,12.562596,14.101263,12.431691,11.295692
TCGA-3C-AALJ,12.869270,12.173717,12.315435,11.496098,12.314665
TCGA-3C-AALK,11.279211,12.843939,13.379291,10.153571,12.610953
TCGA-4H-AAAK,12.430008,12.731229,12.580920,10.253672,12.353710
...,...,...,...,...,...
TCGA-WT-AB44,12.154421,12.583949,15.223312,11.015164,11.632502
TCGA-XX-A899,11.415476,12.001547,13.067212,9.704339,12.868580
TCGA-XX-A89A,11.287576,11.988771,12.769825,10.190025,11.919563
TCGA-Z7-A8R5,11.688852,11.544861,12.522186,10.556924,12.289650


patient,hsa_let_7a_1,hsa_let_7a_2,hsa_let_7a_3,hsa_let_7b,hsa_let_7c
TCGA-3C-AAAU,13.129765,14.117933,13.147714,14.595135,8.414890
TCGA-3C-AALI,12.918069,13.922300,12.913194,14.512657,9.646536
TCGA-3C-AALJ,13.012033,14.010002,13.028483,13.419612,9.312455
TCGA-3C-AALK,13.144697,14.141721,13.151281,14.667196,11.511431
TCGA-4H-AAAK,13.411684,14.413518,13.420481,14.438548,11.693927
...,...,...,...,...,...
TCGA-WT-AB44,13.375715,14.366671,13.369827,14.514024,11.926315
TCGA-XX-A899,14.036155,15.036341,14.043313,14.339503,12.361761
TCGA-XX-A89A,13.679569,14.684855,13.691463,14.198207,12.684212
TCGA-Z7-A8R5,12.962088,13.966350,12.984897,14.320660,11.980246


patient,SFT2D2,IL17RA,MIR128_1,FOXA1,LOC145837
TCGA-3C-AAAU,-2.196646,-0.049742,3.355022,-3.934344,-1.801595
TCGA-3C-AALI,-2.436039,-0.217006,2.652026,-3.995267,-1.512691
TCGA-3C-AALJ,-2.390041,-0.360180,2.564778,-3.917724,-0.701434
TCGA-3C-AALK,-2.469813,-0.107791,2.718057,-4.100320,-0.756467
TCGA-4H-AAAK,-2.501687,-0.091774,3.086157,-3.628072,-0.090305
...,...,...,...,...,...
TCGA-WT-AB44,-2.358699,-0.092863,3.138854,-3.864208,-0.446164
TCGA-XX-A899,-2.633115,-0.192698,3.330302,-3.498419,0.144114
TCGA-XX-A89A,-2.602103,-0.287718,2.287165,-3.720622,-0.236061
TCGA-Z7-A8R5,-2.572044,-0.146791,3.000648,-3.335691,0.693710


patient,pam50
TCGA-3C-AAAU,3
TCGA-3C-AALI,2
TCGA-3C-AALJ,4
TCGA-3C-AALK,3
TCGA-4H-AAAK,3
...,...
TCGA-WT-AB44,3
TCGA-XX-A899,3
TCGA-XX-A89A,3
TCGA-Z7-A8R5,3


patient,synchronous_malignancy,ajcc_pathologic_stage,days_to_diagnosis,laterality,created_datetime
TCGA-3C-AAAU,No,Stage X,0.0,Left,
TCGA-3C-AALI,No,Stage IIB,0.0,Right,
TCGA-3C-AALJ,No,Stage IIB,0.0,Right,
TCGA-3C-AALK,No,Stage IA,0.0,Right,
TCGA-4H-AAAK,No,Stage IIIA,0.0,Left,
...,...,...,...,...,...
TCGA-WT-AB44,No,Stage IA,0.0,Left,
TCGA-XX-A899,No,Stage IIIA,0.0,Right,
TCGA-XX-A89A,No,Stage IIB,0.0,Left,
TCGA-Z7-A8R5,No,Stage IIIA,0.0,Left,
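
The `pam50` table shown above holds one integer-coded label per patient, so a natural first check is how many samples fall into each class. The sketch below does this on hypothetical toy labels shaped like that table. This guide does not state which subtype each integer code stands for, so the codes are left uninterpreted here.

```python
import pandas as pd

# Hypothetical integer-coded labels, indexed by patient like the `pam50` table.
pam50 = pd.DataFrame(
    {"pam50": [3, 2, 4, 3, 3]},
    index=pd.Index(["p1", "p2", "p3", "p4", "p5"], name="patient"),
)

# Number of samples per class code, ordered by code.
counts = pam50["pam50"].value_counts().sort_index()

# Share of the cohort that each class code represents.
fractions = counts / counts.sum()

print(counts)
print(fractions.round(2))
```

A strongly skewed distribution is worth knowing about early, since it affects how train/test splits and evaluation metrics should be chosen.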
