# Data preparation

Getting the desired data from the GDC portal can be complicated. This project needs every TCGA dataset that has gene expression data, together with the corresponding clinical data. You can collect these manually, but we have created scripts that automate the task. See https://github.com/jakakokosar/tcga-data for more information.

To run this notebook, prepare a folder that contains one CSV file per TCGA project:

```plaintext
tcga_data/
|-- TCGA-ACC.csv
|-- TCGA-BLCA.csv
|-- TCGA-BRCA.csv
|-- ... (all other projects)
```

The CSV files should have the following columns:

```plaintext
samples,time,event,<genes>...
```
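Before loading everything, it can help to confirm that a file follows this layout. The snippet below is a minimal sketch, not part of the original workflow: `gene_columns` is a hypothetical helper, the check assumes the three leading columns appear in exactly the order shown above, and the in-memory example uses made-up values with two Ensembl-style IDs as stand-ins for real gene columns.

```python
import csv
import io

# Leading columns, assumed to appear in exactly this order.
REQUIRED = ['samples', 'time', 'event']


def gene_columns(csv_file):
    """Read only the header row and return the names of the gene columns.

    Raises ValueError if the first three columns are not samples, time, event.
    """
    header = next(csv.reader(csv_file))
    if header[:3] != REQUIRED:
        raise ValueError(f'expected leading columns {REQUIRED}, got {header[:3]}')
    return header[3:]


# Tiny in-memory stand-in for a project file (values are made up).
example = io.StringIO(
    'samples,time,event,ENSG00000000003,ENSG00000000005\n'
    's1,120,1,5.1,0.0\n'
    's2,340,0,4.7,0.2\n'
)
genes = gene_columns(example)
print(len(genes), 'gene columns')
```

For a real file, pass an open handle instead, for example `open('tcga_data/TCGA-ACC.csv', newline='')`. Only the first line is read, so the check stays cheap even for wide expression matrices.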

In [None]:
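import glob
from pathlib import Path

import pandas as pd

# sort the paths so the projects are always loaded in the same order
files = sorted(glob.glob('tcga_data/*.csv'))
df_list = []
for f in files:
    # the first column ('samples') becomes the index
    df = pd.read_csv(f, index_col=0)
    # tag every row with its project name, taken from the file name (e.g. 'TCGA-ACC')
    df.insert(loc=0, column='tcga', value=Path(f).stem)
    df_list.append(df)
    print(f'completed {f}')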

### Now combine all datasets into one CSV file and save it to disk as 'TCGA-combined-temp.csv'.

In [5]:
# stack all projects row-wise; the sample IDs stay in the index
combined_df = pd.concat(df_list, axis=0)
combined_df.to_csv('TCGA-combined-temp.csv')
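One thing to keep in mind with row-wise concatenation: by default `pd.concat` keeps the union of all columns, so a column that is present in only some inputs is filled with `NaN` for the rows that lack it. The sketch below illustrates this on two toy frames. The column names `ENSG_A` and `ENSG_B` and all values are made up for illustration, and whether the real project files actually differ in their gene columns is not something this notebook states.

```python
import pandas as pd

# Two toy frames standing in for per-project tables (values are made up).
a = pd.DataFrame(
    {'tcga': ['TCGA-ACC'], 'time': [120], 'event': [1], 'ENSG_A': [5.1]},
    index=['s1'],
)
b = pd.DataFrame(
    {'tcga': ['TCGA-BLCA'], 'time': [340], 'event': [0], 'ENSG_B': [0.2]},
    index=['s2'],
)

toy = pd.concat([a, b], axis=0)

# Columns present in only one input contain NaN in the other input's rows.
missing_per_column = toy.isna().sum()
incomplete = missing_per_column[missing_per_column > 0].index.tolist()
print(incomplete)
```

Running the same `isna().sum()` check on the combined frame is a quick way to see whether any columns were only partially populated before the file is passed on to Orange.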

### Finally, open the CSV as an Orange data table and edit the domain accordingly.

In [None]:
import Orange
from Orange.data import ContinuousVariable, Domain

table = Orange.data.Table('TCGA-combined-temp.csv')

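# iterating over a Domain yields only attributes and class variables,
# so collect the meta variables explicitly as well
variables = table.domain.variables + table.domain.metas

# make all gene columns (names starting with 'ENSG') continuous attributes
genes = [ContinuousVariable(attr.name) for attr in variables if attr.name.startswith('ENSG')]

# meta attributes
metas = [attr for attr in variables if attr.name in ['time', 'event', 'tcga', 'samples']]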

# Transform the table to the new domain and save the result as CSV
table_transformed = table.transform(Domain(genes, metas=metas))
table_transformed.save('TCGA-combined.csv')

### To work with large files in Orange, they need to be saved in a Dask-supported format. A detailed explanation can be found here: https://orangedatamining.com/blog/2023/2023-10-24-dask-all-folks/

### A Dask-supported version of the dataset is already available here: https://file.biolab.si/dask/TCGA-combined.hdf5