# Handling TCGA data
### 0. Your input
- Choose the folder path your analysis results should be stored in

In [1]:
# Input your analysis folder (e.g. analysis_path = '/home/user/user_name1/TCGA_analysis')
analysis_path = '/home/alex/Dokumente/02_OLCIR/00_tcga_download_helper/07_improved_TCGA_download_helper_v1_data'

import os
if not os.path.exists(analysis_path):
    print('The analysis path is non-existent. Please check again and input your new analysis path.')

In [2]:
# Creates the folders "log_files" (with the subfolder "correct_files") and "sample_sheets" (with the subfolders "manifests", "sample_sheets_prior", and "clinical_data") in your analysis path
if not analysis_path.endswith('/'):
    analysis_path += '/'

os.makedirs(analysis_path+'log_files', exist_ok=True)
os.makedirs(analysis_path+'log_files/correct_files', exist_ok=True)
os.makedirs(analysis_path+'sample_sheets', exist_ok=True)
for folder_name in ['manifests', 'sample_sheets_prior', 'clinical_data']:
    os.makedirs(analysis_path+f'sample_sheets/{folder_name}', exist_ok=True)

- Filter and select TCGA data in the GDC Data Portal as explained in the ["README.md"](README.md) file
- Repeat these steps for every new analysis, including when you later add new aspects or file types
- Fill the folders "manifests" and "sample_sheets_prior" (optional: "clinical_data") in the "sample_sheets" folder of your analysis path with the corresponding files
- Adapt the ["data/config.yaml"](data/config.yaml) file
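
Before continuing, it can help to confirm that the required folders actually contain files. The following is a minimal sketch that assumes the folder layout created above; the helper name `find_empty_input_folders` is hypothetical and not part of the pipeline scripts.

```python
import os

def find_empty_input_folders(analysis_path, required=('manifests', 'sample_sheets_prior')):
    """Return the required sample_sheets subfolders that are missing or contain no files."""
    empty = []
    for folder_name in required:
        folder = os.path.join(analysis_path, 'sample_sheets', folder_name)
        # A folder counts as filled once it holds at least one regular file
        has_files = os.path.isdir(folder) and any(
            os.path.isfile(os.path.join(folder, entry)) for entry in os.listdir(folder)
        )
        if not has_files:
            empty.append(folder_name)
    return empty
```

Calling `find_empty_input_folders(analysis_path)` returns an empty list once both required folders are filled.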

### 1. Check validity of configuration file entries
- Checks whether all files and file paths listed in the configuration file exist.

In [None]:
import time
import os
start_time = time.strftime('%Y-%m-%d_%H:%M:%S', time.localtime())

input_config = 'data/config.yaml'
out_correct_file = analysis_path + f'log_files/correct_files/{start_time}_config_file_correct.txt'
log_file = analysis_path + f'log_files/{start_time}_check_config_file_log.txt'

%run -nt 'scripts_TCGA_pipeline/01_check_config_file.py' {input_config} {out_correct_file} {log_file}
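
The checks themselves live in `01_check_config_file.py`, which is not shown here. As a rough illustration of the idea only, a path check over a loaded configuration could look like the sketch below. The function name `find_missing_paths` and the way keys are passed in are assumptions, not the script's actual interface.

```python
import os

def find_missing_paths(config, path_keys):
    """Return (key, path) pairs for every listed config entry whose path does not exist."""
    missing = []
    for key in path_keys:
        value = config.get(key)
        # An entry may hold a single path or a list of paths
        paths = value if isinstance(value, list) else [value]
        for path in paths:
            # A key that is absent from the config is reported with path None
            if path is None or not os.path.exists(path):
                missing.append((key, path))
    return missing
```

An empty result means every listed entry points to an existing file or folder.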

### 2. Combine manifest file(s) with sample sheet, filter for relevant files to download
- This script merges the manifest file(s) and the sample sheet.
- If a previous selection of case IDs should be reused, the script filters for the case IDs of that previous analysis.
- The script creates an adapted, filtered manifest file for the gdc-client download.

In [None]:
input_config = 'data/config.yaml'
out_correct_file = analysis_path + f'log_files/correct_files/{start_time}_combine_manifest_sample_sheet_correct.txt'
log_file = analysis_path + f'log_files/{start_time}_combine_manifest_sample_sheet_log.txt'

%run -nt 'scripts_TCGA_pipeline/02_combine_manifest_sample_sheet.py' {input_config} {out_correct_file} {log_file}
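
The merge and filter logic can be pictured with the following sketch. It is not the code of `02_combine_manifest_sample_sheet.py`. It assumes tab-separated inputs with the column `id` in the manifest and the columns `File ID` and `Case ID` in the sample sheet; please verify these names against your own downloads. The function name `filter_manifest` is hypothetical.

```python
import csv
import io

def filter_manifest(manifest_text, sample_sheet_text, wanted_case_ids=None):
    """Join manifest rows to sample sheet rows by file ID and keep the wanted cases."""
    # Map each file ID in the sample sheet to its case ID
    sheet = csv.DictReader(io.StringIO(sample_sheet_text), delimiter='\t')
    case_by_file = {row['File ID']: row['Case ID'] for row in sheet}

    kept = []
    for row in csv.DictReader(io.StringIO(manifest_text), delimiter='\t'):
        case_id = case_by_file.get(row['id'])
        if case_id is None:
            continue  # file is not described in the sample sheet
        if wanted_case_ids is None or case_id in wanted_case_ids:
            kept.append(row)
    return kept
```

Passing `wanted_case_ids=None` keeps every manifest row that also appears in the sample sheet; passing a set of case IDs restricts the result to those cases.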

### 3. Download TCGA data via a manifest document and the gdc-client tool
- The script creates a new conda environment called "gdc_client" and downloads the gdc-client tool. If you have already installed the gdc-client in a separate conda environment, you can specify that in the configuration file.
- The script downloads the TCGA data listed in the manifest(s) specified in the previous steps and/or the configuration file via the gdc-client.
- The files from the manifest are downloaded into the following folder: \<your analysis_path\> + '00_raw_data'

In [None]:
input_config = 'data/config.yaml'
log_files_gdc_prefix = analysis_path + f'log_files/{start_time}_gdc_client_log'
# Use step-specific names so that the log files of step 2 are not reused
out_correct_file = analysis_path + f'log_files/correct_files/{start_time}_download_TCGA_data_correct.txt'
log_file = analysis_path + f'log_files/{start_time}_download_TCGA_data_log.txt'

%run -nt 'scripts_TCGA_pipeline/03_download_TCGA_data.py' {input_config} {log_files_gdc_prefix} {out_correct_file} {log_file}
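
How `03_download_TCGA_data.py` calls the tool is not shown here. As an illustration, a gdc-client download into the folder named above could be assembled as in the sketch below. It assumes the conda environment is called "gdc_client" and uses the gdc-client options `-m` (manifest) and `-d` (download directory); the function name `build_gdc_download_command` is hypothetical.

```python
import os

def build_gdc_download_command(manifest_path, analysis_path, conda_env='gdc_client'):
    """Return the argument list for a gdc-client download into 00_raw_data."""
    raw_data_dir = os.path.join(analysis_path, '00_raw_data')
    return [
        'conda', 'run', '-n', conda_env,
        'gdc-client', 'download',
        '-m', manifest_path,  # manifest listing the files to fetch
        '-d', raw_data_dir,   # target directory for the downloads
    ]
```

The returned list can be passed to `subprocess.run` without `shell=True`, which avoids quoting problems with paths that contain spaces.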

### 4. Rename file names
- This script changes the suffix of each downloaded file name to the case ID.
- The renamed files are sorted into new folders for each analysis.

In [None]:
input_config = 'data/config.yaml'
out_correct_file = analysis_path + f'log_files/correct_files/{start_time}_rename_files_correct.txt'
log_file = analysis_path + f'log_files/{start_time}_rename_files_log.txt'

%run -nt 'scripts_TCGA_pipeline/04_rename_files.py' {input_config} {out_correct_file} {log_file}

### 5. Analyze files with separate Snakemake pipeline (Snakefile_sample_analysis)
- If desired, a Snakemake pipeline can be used to analyze all downloaded data at once.
- The Snakemake pipeline is a template and is not ready to use for your analysis.
- Please adapt the rules for all of your methods in the Snakefile_sample_analysis to use Snakemake.
- Each rule requires a Python script with the analysis methods or an adapted shell command in the rule.
- The Python scripts for the analysis are located in the folder "scripts_snakemake".
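
A script for one rule could follow the skeleton below. This is only a sketch: the `analyze` function and its line-counting body are placeholders for your own method. When a rule runs a script through its `script:` directive, Snakemake provides an object named `snakemake` that gives access to the rule's inputs and outputs.

```python
def analyze(input_path, output_path):
    """Placeholder analysis: count the lines of the input file and write the count."""
    with open(input_path) as handle:
        n_lines = sum(1 for _ in handle)
    with open(output_path, 'w') as handle:
        handle.write(f'{n_lines}\n')
    return n_lines

# Snakemake injects an object named `snakemake` into scripts run via `script:`;
# outside of Snakemake the name is undefined and this block is skipped.
if 'snakemake' in globals():
    analyze(snakemake.input[0], snakemake.output[0])
```

Keeping the analysis in a plain function makes it possible to test the method in this notebook before wiring it into a rule.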

In [None]:
import subprocess
import yaml

input_config = 'data/config.yaml'

# safe_load only builds plain Python objects, which is sufficient for a configuration file
with open(input_config, 'r') as streamfile:
    config_file = yaml.safe_load(streamfile)

conda_snakemake = config_file['conda_snakemake']
snakemake_threads = config_file['snakemake_threads']
snakemake_methods = config_file['snakemake_methods']

os.makedirs(analysis_path+'02_results_raw', exist_ok=True)
for m in snakemake_methods:
    os.makedirs(analysis_path+'02_results_raw/'+m, exist_ok=True)
os.makedirs(analysis_path+'03_results_combined', exist_ok=True)

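# Run the Snakemake pipeline and wait until it has finished
command_snakemake = f'conda run -n {conda_snakemake} snakemake --cores {snakemake_threads} --use-conda -k'
process_snakemake = subprocess.run(command_snakemake, shell=True)
if process_snakemake.returncode != 0:
    print(f'Snakemake exited with return code {process_snakemake.returncode}. Please check the Snakemake output.')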