# Handling TCGA data
### Overall goals
- Work with TCGA data (e.g. of a specific study or primary site)
- Download data (or further data later on)
- Only download data matching specific cases

### Filtering and selecting TCGA data
- Go to: https://portal.gdc.cancer.gov/analysis_page
- Click on "Repository"

<img src="figures/tcga_gdc_repository.png" style="width:1000px; position: relative; left: 40px">
<p></p>
<img src="figures/tcga_gdc_repository_files.png" style="width:1000px; position: relative; left: 40px">

- Filter by files:
    - Experimental strategy
    - Data type
    - Data format
    - Access
    - ...

<img src="figures/tcga_gdc_files_filter.png" style="width:300px; position: relative; left: 40px">

- Filter by cases:
    - Primary site
    - Disease Type
    - Project
    - Gender
    - ...

<img src="figures/tcga_gdc_cases_filter.png" style="width:1000px; position: relative; left: 40px">

- Select associated files or put all relevant data into the cart

<img src="figures/tcga_gdc_select_file_cart.png" style="width:800px; position: relative; left: 40px">

- Click on the Cart symbol

<img src="figures/tcga_gdc_cart_overview.png" style="width:1000px; position: relative; left: 40px">

- Download the Sample Sheet, the Metadata, and the Clinical data (TSV) for additional information

<img src="figures/tcga_gdc_cart_ass_data.png" style="width:180px; position: relative; left: 40px">

- Download the Manifest, which the gdc-client tool uses for the data download

<img src="figures/tcga_gdc_cart_manifest.png" style="width:120px; position: relative; left: 40px">

- Repeat these steps for every new analysis, and again whenever new aspects or file types need to be considered later on


### File locations for manifests and sample sheets
- Create an analysis folder and, inside it, a "sample_sheets" folder with the sub-folders "clinical_data", "manifests", and "sample_sheets_prior"

&lt;analysis_path&gt;

└── sample_sheets
<br>
&emsp;&emsp;├── clinical_data
<br>
&emsp;&emsp;├── manifests
<br>
&emsp;&emsp;└── sample_sheets_prior    
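
The folder layout above could also be created programmatically. The following is a minimal sketch; the function name `create_analysis_folders` is hypothetical and only illustrates the idea.

```python
from pathlib import Path


def create_analysis_folders(analysis_path):
    """Create the sample_sheets folder tree below analysis_path.

    Returns the list of sub-folder paths (hypothetical helper, not part of any library).
    """
    base = Path(analysis_path) / "sample_sheets"
    subfolders = ["clinical_data", "manifests", "sample_sheets_prior"]
    created = []
    for name in subfolders:
        folder = base / name
        # parents=True also creates analysis_path and sample_sheets if missing;
        # exist_ok=True makes the call safe to repeat.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_analysis_folders("my_analysis")` would create the three sub-folders below `my_analysis/sample_sheets`, where `my_analysis` stands in for the analysis path.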



### Combine manifest with sample sheet, filter for relevant files
- manifest file

<img src="figures/tcga_manifest_file_example.png" style="width:1000px; position: relative; left: 40px">

- sample sheet

<img src="figures/tcga_sample_sheet_example.png" style="width:1000px; position: relative; left: 40px">

- merge the manifest and the sample sheet
- if case IDs were selected in a previous analysis, filter for those case IDs
- create an adapted, filtered manifest file for the gdc-client download
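
The merge and filter steps above can be sketched with pandas as follows. This is only an illustration under assumptions: the manifest is assumed to have the columns `id`, `filename`, `md5`, `size`, `state`, and the sample sheet the columns `File ID` and `Case ID`; check the headers of your own files. The function names `filter_manifest` and `write_manifest` are hypothetical.

```python
import pandas as pd

# Assumed manifest header; adjust if your manifest file differs.
MANIFEST_COLUMNS = ["id", "filename", "md5", "size", "state"]


def filter_manifest(manifest, sample_sheet, case_ids=None):
    """Merge manifest and sample sheet on the file ID; optionally keep only given cases."""
    merged = manifest.merge(
        sample_sheet,
        left_on="id",          # file ID column in the manifest (assumed name)
        right_on="File ID",    # file ID column in the sample sheet (assumed name)
        how="inner",
        validate="one_to_one",  # fail loudly if a file ID occurs more than once
    )
    if case_ids is not None:
        merged = merged[merged["Case ID"].isin(set(case_ids))]
    return merged.reset_index(drop=True)


def write_manifest(merged, path):
    """Write only the manifest columns, tab-separated, as input for gdc-client."""
    merged[MANIFEST_COLUMNS].to_csv(path, sep="\t", index=False)
```

Both input tables could be loaded with `pd.read_csv(path, sep="\t")`. Keeping the merged table around is useful later, because it links each file ID to its case ID.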

### Download TCGA data via a manifest document and the GDC-client tool
- For restricted access files:
    - Log in at NIH
    - Download the access token and save it as a secured file
- Download the gdc-client tool

```
gdc-client download -m manifest.txt -t user-token.txt
```
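
One possible reading of "save as secured file" is to restrict the token file's permissions to its owner. The sketch below assumes a POSIX system; the function name `restrict_to_owner` is hypothetical.

```python
import os
import stat


def restrict_to_owner(token_path):
    """Make the token file readable and writable by its owner only (mode 600).

    Returns the resulting permission bits so the caller can check them.
    """
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

For example, `restrict_to_owner("user-token.txt")` would apply this to the token file used in the command above.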

### Rename the downloaded files as case_id.file_suffix
- the manifest identifies files only by ID and by a file name containing a 36-character identifier, not by case ID
- use the merged manifest and sample sheet to look up the case ID of each file
- rename the downloaded files and put them in a new folder for each analysis
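
The renaming step above could look like the following sketch. It assumes that gdc-client stored every file as `<download_dir>/<file_id>/<filename>`, and it copies rather than moves the files so the original download stays intact. The function name `collect_renamed_files` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def collect_renamed_files(download_dir, target_dir, records, file_suffix):
    """Copy each downloaded file to target_dir/<case_id>.<file_suffix>.

    records: iterable of (file_id, filename, case_id) tuples, for example
    taken from the merged manifest and sample sheet.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for file_id, filename, case_id in records:
        # Assumed download layout: one sub-folder per file ID.
        source = Path(download_dir) / file_id / filename
        destination = target / f"{case_id}.{file_suffix}"
        if not source.is_file():
            # Skip files that have not been downloaded (yet).
            continue
        if destination.exists():
            # A case with several files would otherwise be overwritten silently.
            raise FileExistsError(f"{destination} already exists")
        shutil.copy2(source, destination)
        copied.append(destination)
    return copied
```

Raising an error on an existing target name is a deliberate choice here: if one case has several files of the same type, the naming scheme `case_id.file_suffix` is not unique and needs an extra distinguishing part.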

### Analyze files
- with Snakemake pipeline


In [3]:
print('└── sample_sheets')
print('    ├── clinical_data')
print('    ├── manifests')
print('    └── sample_sheets_prior')

└── sample_sheets
    ├── clinical_data
    ├── manifests
    └── sample_sheets_prior


### Download TCGA data via GDC-client

In [None]:
import pandas as pd
import os
import yaml

# Read the analysis configuration; safe_load only builds plain Python objects
with open('data/config.yaml', 'r') as streamfile:
    config_file = yaml.safe_load(streamfile)

tcga_manifest = 'manifests/2024-04-08_gdc_manifest_star_counts_lusc_prior.txt'
# Path to the token file, or False if only open-access files are downloaded
tcga_user_token_file = config_file['tcga_user_token_file']

# "not ..." also covers None and an empty string, not only False
if not tcga_user_token_file:
    print(f'Download TCGA data with TCGA manifest {os.path.basename(tcga_manifest)} without TCGA user token')
    command_download_tcga_data = f'gdc-client download -m {tcga_manifest}'
else:
    print(f'Download TCGA data with TCGA manifest {os.path.basename(tcga_manifest)} with TCGA user token file {os.path.basename(tcga_user_token_file)}')
    command_download_tcga_data = f'gdc-client download -m {tcga_manifest} -t {tcga_user_token_file}'

# Uncomment to start the download
# os.system(command_download_tcga_data)
