# Handling TCGA data
### Overall goals
- Work with TCGA data (e.g. of a specific study or primary site)
- Download data (or additional data later on)
- Only download data matching specific cases

### Filtering and selecting TCGA data
- Filter and select TCGA data from the TCGA GDC data portal as explained in "TCGA_steps_explained.ipynb"
- Repeat these steps for every new analysis, including when new aspects or file types need to be considered later on

### Combine manifest with sample sheet, filter for relevant files
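- Adapt the data/config.yaml (analysis path, prior manifest files, prior sample sheets)
- Load and merge the manifest files and the sample sheets listed in the config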
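The cells in this notebook read four keys from data/config.yaml. As a minimal sketch, the check below lists the keys that are missing from a loaded config. The helper name `missing_config_keys` and the example values are illustrative assumptions (the analysis path is a placeholder); only the key names and the two file names are taken from this notebook.

```python
# Keys that the notebook cells read from data/config.yaml
REQUIRED_KEYS = (
    "analysis_path",
    "manifests_prior",
    "sample_sheets_prior",
    "tcga_user_token_file",
)


def missing_config_keys(conf):
    """Return the required keys that are absent from the loaded config dict."""
    return [key for key in REQUIRED_KEYS if key not in conf]


# Example config as it would look after loading the YAML file (values are placeholders)
example_conf = {
    "analysis_path": "/path/to/analysis/",
    "manifests_prior": ["2024-04-08_gdc_manifest_star_counts_lusc_prior.txt"],
    "sample_sheets_prior": ["2024-04-08_gdc_sample_sheet_star_counts_lusc_prior.tsv"],
    "tcga_user_token_file": False,  # False: download without a user token
}

print(missing_config_keys(example_conf))
```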

In [16]:
import pandas as pd
import yaml
import os

# load config file
with open('data/config.yaml', 'r') as streamfile:
    conf_yaml = yaml.load(streamfile, Loader=yaml.FullLoader)

# analysis path
analysis_path = conf_yaml['analysis_path']

# original (prior) manifest files, merge them
manifests = [pd.read_table('sample_sheets/manifests/'+i) for i in conf_yaml['manifests_prior']]
manifests_merge = pd.concat(manifests)

# original (prior) sample sheets, merge them
sample_sheets = [pd.read_table('sample_sheets/prior_sample_sheets/'+i) for i in conf_yaml['sample_sheets_prior']]
sample_sheets_merge = pd.concat(sample_sheets)

# some files list several case IDs ("A, B"); keep only the first one
sample_sheets_merge['Case ID'] = sample_sheets_merge['Case ID'].str.split(', ', expand=True)[0]


- Merge manifest and sample sheet
- If case IDs were selected in a previous analysis, keep only the files of those cases
- Write the filtered manifest to a new file for the gdc-client download
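The three steps above could be sketched as follows. This is a minimal example on made-up data, not the exact code of this analysis: it assumes the manifest carries the columns `id`, `filename`, `md5`, `size` and `state`, and that the sample sheet carries `File ID` and `Case ID`. The function name `filter_manifest`, the example case IDs and the output file name are hypothetical.

```python
import pandas as pd


def filter_manifest(manifest, sample_sheet, case_ids=None):
    """Join manifest and sample sheet on the file ID and keep the wanted cases."""
    merged = manifest.merge(sample_sheet, left_on="id", right_on="File ID", how="inner")
    if case_ids is not None:
        merged = merged[merged["Case ID"].isin(case_ids)]
    # keep only the original manifest columns for the download manifest
    return merged[list(manifest.columns)].reset_index(drop=True)


# made-up example data (assumed column layout)
manifest = pd.DataFrame({
    "id": ["f1", "f2", "f3"],
    "filename": ["f1.rna_seq.tsv", "f2.rna_seq.tsv", "f3.rna_seq.tsv"],
    "md5": ["m1", "m2", "m3"],
    "size": [10, 20, 30],
    "state": ["released", "released", "released"],
})
sample_sheet = pd.DataFrame({
    "File ID": ["f1", "f2", "f3"],
    "File Name": ["f1.rna_seq.tsv", "f2.rna_seq.tsv", "f3.rna_seq.tsv"],
    "Case ID": ["TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0001"],
})

filtered = filter_manifest(manifest, sample_sheet, case_ids={"TCGA-XX-0001"})
print(filtered)

# write a tab-separated manifest for the download (hypothetical file name)
# filtered.to_csv("manifest_filtered.txt", sep="\t", index=False)
```

Passing `case_ids=None` skips the filter, which corresponds to a first analysis without a previous selection of cases.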

### Download TCGA data via a manifest document and the GDC-client tool
- For restricted access files:
    - Log in with your NIH account
    - Download the access token and save it as a file with restricted permissions
- Download the gdc-client tool

```
gdc-client download -m manifest.txt -t user-token.txt
```


In [None]:
# download raw files into the following folder:

analysis_path + '00_raw_data'

### Prepare renaming files with sample sheet

In [7]:
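# folders for the raw downloads and for the renamed per-sample files
path_raw_data = os.path.join(analysis_path, '00_raw_data', '')
# assumption: the original does not define this folder; adapt the name to your layout
path_sample_data = os.path.join(analysis_path, '01_sample_data', '')

# categorize samples by analysis method (substring of the file name -> folder name)
method_dict = {'BRASS':'BRASS', 'CaVEMan':'CaVEMan', 'ASCAT':'CNV_segment', 'pindel':'Pindel', 'star_splice':'Splicing',
               'star_gene_counts':'STAR_counts'}

sample_sheets_merge['Folder'] = ''

for met, folder in method_dict.items():
    sample_sheets_merge.loc[sample_sheets_merge['File Name'].str.contains(met, regex=False), 'Folder'] = folder

# file suffix: everything after the first dot, so that the file extension is kept
sample_sheets_merge['File Suffix'] = sample_sheets_merge['File Name'].str.split('.', n=1, expand=True)[1]

# path of the downloaded file and target path after renaming to case_id.file_suffix
sample_sheets_merge['Path_raw'] = path_raw_data+sample_sheets_merge['Folder']+'/'+sample_sheets_merge['File ID']+'/'+sample_sheets_merge['File Name']
sample_sheets_merge['Path_sample'] = path_sample_data+sample_sheets_merge['Folder']+'/'+sample_sheets_merge['Case ID']+'.'+sample_sheets_merge['File Suffix']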



In [18]:
sample_sheets_merge.iloc[1000]['File Name']

'd9368b67-bd37-475a-810b-f0ec243c4e8f.rna_seq.augmented_star_gene_counts.tsv'
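For the file name shown above, the folder and the file suffix can be worked out by hand. The sketch below uses the same method dictionary on a single string. It assumes that the suffix is meant to be everything after the first dot, so that the file extension survives the renaming. The function name `classify_file` is hypothetical.

```python
# substring of the file name -> folder name (same mapping as in the cell above)
METHOD_FOLDERS = {
    "BRASS": "BRASS",
    "CaVEMan": "CaVEMan",
    "ASCAT": "CNV_segment",
    "pindel": "Pindel",
    "star_splice": "Splicing",
    "star_gene_counts": "STAR_counts",
}


def classify_file(file_name):
    """Return (folder, suffix) for one file name."""
    folder = ""
    for key, name in METHOD_FOLDERS.items():
        if key in file_name:
            folder = name  # a later match overwrites an earlier one
    # everything after the first dot; empty string if there is no dot
    _, _, suffix = file_name.partition(".")
    return folder, suffix


example = "d9368b67-bd37-475a-810b-f0ec243c4e8f.rna_seq.augmented_star_gene_counts.tsv"
print(classify_file(example))
# -> ('STAR_counts', 'rna_seq.augmented_star_gene_counts.tsv')
```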

In [5]:
sample_sheets.keys()

dict_keys(['2024-04-08_gdc_sample_sheet_splice_lusc_prior.tsv', '2024-04-08_gdc_sample_sheet_star_counts_lusc_prior.tsv'])



### Rename the downloaded files as case_id.file_suffix
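- The manifest contains only the file ID and the file name, which starts with a 36-character ID, but not the case ID
- Use the merged manifest and sample sheet to map each file to its case ID
- Rename the downloaded files and move them into a separate folder for each analysis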
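The renaming step could look like the sketch below. It takes pairs of source and target paths and moves each downloaded file to its case-based name. The function name `rename_downloads` and the demo paths are hypothetical, and the demo runs in a temporary folder with one made-up file.

```python
import shutil
import tempfile
from pathlib import Path


def rename_downloads(path_pairs):
    """Move each downloaded file to its case-based target path.

    path_pairs: iterable of (raw_path, sample_path) tuples.
    Returns the list of target paths that were written.
    """
    moved = []
    for raw_path, sample_path in path_pairs:
        raw_path, sample_path = Path(raw_path), Path(sample_path)
        if not raw_path.is_file():
            continue  # file not downloaded (yet): skip it
        if sample_path.exists():
            # two files of the same case and type would overwrite each other
            raise FileExistsError(f"target already exists: {sample_path}")
        sample_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(raw_path), str(sample_path))
        moved.append(sample_path)
    return moved


# demo in a temporary folder with one made-up file
base = Path(tempfile.mkdtemp())
raw = base / "00_raw_data" / "STAR_counts" / "file-id-1" / "file-id-1.rna_seq.tsv"
raw.parent.mkdir(parents=True)
raw.write_text("gene\tcount\n")
target = base / "01_sample_data" / "STAR_counts" / "TCGA-XX-0001.rna_seq.tsv"
missing = base / "00_raw_data" / "STAR_counts" / "file-id-2" / "file-id-2.rna_seq.tsv"

moved = rename_downloads([(raw, target), (missing, base / "unused.tsv")])
print(moved)
```

With the merged sample sheet, the pairs would come from `zip(sample_sheets_merge['Path_raw'], sample_sheets_merge['Path_sample'])`. Raising an error on an existing target is a deliberate choice here: a case can have more than one file of the same type, and silently overwriting one of them would lose data.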

### Analyze files
- Run the analysis with the Snakemake pipeline


### Download TCGA data via GDC-client

In [None]:
import pandas as pd
import os
import yaml

with open('data/config.yaml', 'r') as streamfile:
    config_file = yaml.load(streamfile, Loader=yaml.FullLoader)

tcga_manifest = 'manifests/2024-04-08_gdc_manifest_star_counts_lusc_prior.txt'
tcga_user_token_file = config_file['tcga_user_token_file']

# no token configured (False, None or empty): download without a user token
if not tcga_user_token_file:
    print(f'Download TCGA data with TCGA manifest {tcga_manifest.split("/")[-1]} without TCGA user token')
    command_download_tcga_data = f'gdc-client download -m {tcga_manifest}'
else:
    print(f'Download TCGA data with TCGA manifest {tcga_manifest.split("/")[-1]} with TCGA user token file {tcga_user_token_file.split("/")[-1]}')
    command_download_tcga_data = f'gdc-client download -m {tcga_manifest} -t {tcga_user_token_file}'

# os.system(command_download_tcga_data)
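As an alternative to building a shell string for `os.system`, the command can be assembled as an argument list and passed to `subprocess.run`, which avoids quoting problems with unusual paths and raises an error if the download fails. This is a minimal sketch: the function names are hypothetical, and the optional `-d` argument (gdc-client's download directory option) is an addition that is not used in the cell above.

```python
import shlex
import subprocess


def build_download_command(manifest, token_file=None, download_dir=None):
    """Assemble the gdc-client call as a list of arguments."""
    command = ["gdc-client", "download", "-m", manifest]
    if token_file:  # False, None or '' means: no user token
        command += ["-t", token_file]
    if download_dir:
        command += ["-d", download_dir]
    return command


def run_download(command):
    """Start the download; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


command = build_download_command(
    "manifests/2024-04-08_gdc_manifest_star_counts_lusc_prior.txt",
    token_file=False,
    download_dir="00_raw_data",
)
print(shlex.join(command))

# run_download(command)  # uncomment to start the download
```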
