# Common Somatic Tertiary Analysis (COSTA) Notebooks

This series of notebooks performs common tertiary analyses of somatic genetic variants. The series consists of the following notebooks:

- Notebook 0: Somatic Variant Source Data (not in OpenBio)
- Notebook 1: Somatic VCF to annotated MAF
- Notebook 2: Kaplan-Meier Survival Curve: Phenotype Based Cohort
- Notebook 3: Population Level Somatic Mutation Analysis
- Notebook 4: Kaplan-Meier Survival Curve: Somatic Variant Based Cohort
- Notebook 5: Gene Level Somatic Mutation Analysis

# Notebook 1: Somatic VCF to annotated MAF
In this notebook, we download individual-level VCF files stored in the `source_vcf` folder of our project on the DNAnexus platform, convert them into individual MAF files, and save those files in a folder on the platform. The MAF files are annotated with transcript information from the VEP annotation database.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## 1. Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Bash
* Instance type: mem1_ssd1_v2_x16
* Runtime: ~55 min
* Data description: File inputs for this notebook are:

    * Individual level VCF files.
    
### Package and tools dependency:

| Package | License | 
| --- | --- |
| <a href="http://www.htslib.org/">samtools</a> | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT/Expat License</a> |
| <a href="http://www.htslib.org/doc/tabix.html">tabix</a> | <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a> |
| <a href="https://github.com/mskcc/vcf2maf">VCF2MAF</a> | <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a> |
| <a href="https://www.ensembl.org/vep">Ensembl-VEP</a> | <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a> |

#### Download Packages

Uncomment the install commands if you are comfortable with the library license and want to install and run the parts of the notebook that depend on the library.

_Note: Package installation takes ~35 minutes_

**Ensembl-vep**

In [None]:
# Install and update the dependencies of ensembl-vep. Uncomment to install
# apt-get update
# apt-get install cpanminus -y
# apt-get install libmysqlclient-dev -y
# cpanm DBI
# cpanm DBD::mysql

In [None]:
# Download and install ensembl-vep. Uncomment to install
# git clone https://github.com/Ensembl/ensembl-vep.git
# cd ensembl-vep
# perl INSTALL.pl --ASSEMBLY GRCh38 --AUTO aflc --SPECIES homo_sapiens -c /opt/notebooks/vep -n 
# cd ..

**Samtools**

In [None]:
# Download Samtools. Uncomment to download
# wget -q https://github.com/samtools/samtools/releases/download/1.15/samtools-1.15.tar.bz2

**Tabix**

In [None]:
# Download tabix. Uncomment to download
# wget -q https://github.com/samtools/htslib/releases/download/1.15/htslib-1.15.tar.bz2

**VCF2MAF**

In [None]:
# Download and extract vcf2maf. Uncomment to install
# export VCF2MAF_URL=`curl -sL https://api.github.com/repos/mskcc/vcf2maf/releases | grep -m1 tarball_url | cut -d\" -f4`
# curl -L -o mskcc-vcf2maf.tar.gz $VCF2MAF_URL; tar -zxf mskcc-vcf2maf.tar.gz; cd mskcc-vcf2maf-*
# perl vcf2maf.pl --man
# cd ..
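
The tarball extracts to a directory whose suffix changes between releases (for example `mskcc-vcf2maf-754d68a`), so later cells that hard-code the name can break after a fresh download. A minimal sketch for locating the directory without hard-coding the suffix, assuming the archive was extracted into the current directory; `find_vcf2maf_dir` is a hypothetical helper, not part of vcf2maf:

```shell
# Hypothetical helper: print the first directory matching mskcc-vcf2maf-* under
# the given base directory (default: current directory).
# Returns non-zero if no such directory exists.
find_vcf2maf_dir() {
    local base="${1:-.}"
    local dir
    for dir in "$base"/mskcc-vcf2maf-*/; do
        if [ -d "$dir" ]; then
            # Strip the trailing slash added by the glob pattern
            printf '%s\n' "${dir%/}"
            return 0
        fi
    done
    echo "No mskcc-vcf2maf-* directory found in $base" >&2
    return 1
}
```

For example, `VCF2MAF_DIR=$(find_vcf2maf_dir .)` stores the path, which can then be used as `perl "$VCF2MAF_DIR/vcf2maf.pl" --man`.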

_Note: At this point, we suggest creating a snapshot of the environment for reuse (DNAnexus → Create Snapshot). Once a snapshot is created, it can be used when launching a new JupyterLab instance and will contain all installed packages and any downloaded data._

## 2. Install downloaded packages

In [None]:
# Samtools. Uncomment to install
# tar -xf samtools-1.15.tar.bz2
# cd samtools-1.15
# ./configure --prefix=/where/to/install
# make
# make install
# export PATH=/where/to/install/bin:$PATH
# cd ..

In [None]:
# Tabix. Uncomment to install
# tar -xf htslib-1.15.tar.bz2
# cd htslib-1.15
# ./configure --prefix=/where/to/install
# make
# make install
# export PATH=/where/to/install/bin:$PATH
# cd ..
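
After installation, it can help to confirm that the tools are actually reachable on `PATH` before starting the conversion. A minimal sketch, assuming the install prefixes above were exported in the current session; `require_tools` is a hypothetical helper name:

```shell
# Hypothetical helper: report every command in the argument list that is not
# found on PATH. Returns 0 if all are present, 1 otherwise.
require_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this notebook, a call such as `require_tools samtools tabix perl` would flag any tool that still needs to be installed or added to `PATH`.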

## 3. Pre-process source data

In this notebook, we will convert 50 VCF files to their corresponding MAF files.

In [None]:
dx ls source_vcf | head -n 50 > vcf_names_50.txt

**Change Chromosome Notation.**

When working with additional VCF files, make sure to follow the `CHROM` naming guidelines specified by vcf2maf. For example, use the chromosome name "1" rather than "chr1". Any unmapped or alternate contigs (e.g. 11_KI270927v1_alt) should be filtered out before running vcf2maf. The cell below only removes the "chr" prefix.

In [None]:
mkdir -p vcf_without_chr

# Strip a leading "chr" from every line, i.e. from the CHROM column of each record
while IFS= read -r p; do
    awk '{gsub(/^chr/,""); print}' "/mnt/project/source_vcf/$p" > "vcf_without_chr/$p"
done < vcf_names_50.txt
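
The loop above removes only the "chr" prefix; records on contigs such as 11_KI270927v1_alt remain in the output. A minimal sketch of the filtering step, assuming that only chromosomes 1 to 22, X, Y, and M/MT should be kept and that the prefix has already been removed; `filter_canonical_contigs` is a hypothetical helper:

```shell
# Hypothetical helper: print header lines (starting with "#") plus records whose
# CHROM column is exactly 1-22, X, Y, M or MT. All other contigs are dropped.
# Assumes a tab-delimited VCF with the "chr" prefix already stripped.
filter_canonical_contigs() {
    awk -F'\t' '/^#/ || $1 ~ /^([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$/' "$1"
}
```

It could be applied per file, for example `filter_canonical_contigs vcf_without_chr/sample.vcf > sample.filtered.vcf`, where `sample.vcf` is a placeholder name.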

## 4. Run VCF2MAF

To convert a VCF to a MAF file, we need the tumor sample ID and the normal sample ID from the VCF. In our VCF files, the normal sample ID is the last column of the header line and the tumor sample ID is the second-to-last column.
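
Because the IDs are taken by position, a file with fewer than two sample columns would silently yield wrong values. A minimal sketch of a guarded lookup, assuming a tab-delimited `#CHROM` header line with the nine fixed VCF columns followed by the samples; `get_sample_ids` is a hypothetical helper:

```shell
# Hypothetical helper: print "<tumor_id> <normal_id>" from the last two columns
# of the #CHROM header line. Exits non-zero if the header is missing or has
# fewer than 11 columns (9 fixed VCF columns + 2 samples).
get_sample_ids() {
    awk -F'\t' '
        /^#CHROM/ {
            if (NF < 11) { exit 1 }
            print $(NF-1), $NF
            found = 1
            exit 0
        }
        END { if (!found) exit 1 }
    ' "$1"
}
```

In Bash, the two values can be captured with `read -r tumor_id normal_id < <(get_sample_ids "$vcf_file")`.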

The `--vep-forks` parameter sets the number of forked processes to use when running VEP. The maximum number of forks depends on the instance size: the fork count should be less than or equal to the core count given by the instance type suffix ("x16"). For the default instance type (mem1_ssd1_v2_x16), we recommend 16 forks. For instance type naming conventions, see https://documentation.dnanexus.com/developer/api/running-analyses/instance-types.
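
If the notebook is launched on a different instance type, the fork count can be derived from the machine instead of being hard-coded. A minimal sketch, assuming `nproc` (GNU coreutils) is available in the JupyterLab environment; `pick_vep_forks` is a hypothetical helper:

```shell
# Hypothetical helper: print the requested fork count (default 16), capped at
# the number of processing units reported by nproc.
pick_vep_forks() {
    local requested="${1:-16}"
    local cores
    cores=$(nproc)
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}
```

The result can be stored with `vep_forks=$(pick_vep_forks 16)` and passed to vcf2maf as `--vep-forks "$vep_forks"`.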

In [None]:
# Move vep cache to $HOME/.vep
mv /opt/notebooks/vep/ $HOME/.vep/

cd vcf_without_chr

# Make a directory for individual MAF files
mkdir ../individual_maf_files_50

# Convert VCF to MAF
for vcf_file in TCGA-*.vcf; do 
    echo "$vcf_file"
    vcf_fn="${vcf_file%.*}"
    # Split the #CHROM header line into an array of column names
    colnames=($(grep -m1 "^#CHROM" "$vcf_file"))
    tumor_id=${colnames[-2]}
    normal_id=${colnames[-1]}
    # NOTE: adjust "mskcc-vcf2maf-754d68a" to match the directory extracted from the vcf2maf tarball
    perl ../mskcc-vcf2maf-754d68a/vcf2maf.pl --input-vcf "$vcf_file" \
    --output-maf ../individual_maf_files_50/"$vcf_fn".maf \
    --ref-fasta $HOME/.vep/homo_sapiens/106_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
    --vep-path ../ensembl-vep \
    --ncbi-build GRCh38 \
    --tumor-id "$tumor_id" \
    --normal-id "$normal_id" \
    --vep-forks 16
    rm "$vcf_file"
done
cd ..
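
Since the loop deletes each input VCF after conversion, a failed conversion is easy to miss. A minimal sketch of a sanity check that compares the number of listed VCFs with the number of MAF files produced, assuming every name in the list matches the `TCGA-*.vcf` pattern used by the loop; `check_maf_count` is a hypothetical helper:

```shell
# Hypothetical helper: compare the number of non-empty lines in a list file with
# the number of .maf files directly inside a directory.
# Prints both counts and returns non-zero on a mismatch.
check_maf_count() {
    local list_file="$1"
    local maf_dir="$2"
    local expected found
    expected=$(grep -c . "$list_file")
    found=$(find "$maf_dir" -maxdepth 1 -type f -name '*.maf' | wc -l | tr -d ' ')
    echo "expected=$expected found=$found"
    [ "$expected" -eq "$found" ]
}
```

For this notebook, the call would be `check_maf_count vcf_names_50.txt individual_maf_files_50`.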

## 5. Upload the results to the platform
We upload the `individual_maf_files_50` folder using the `dx upload -r` command.

In [None]:
dx upload -r individual_maf_files_50