# Introduction
## Python 2 vs 3
## Conda
## Jupyter Notebook


# Preprocessing VCF - adding readcounts using bam-readcount and VAtools

## Setup

For this module we'll be working with a somatic exome VCF file created by the Mutect variant caller with some basic filtering already done. This VCF can be found in the `week_09` folder of the [bfx-workshop repository](https://github.com/genome/bfx-workshop). Let's create a working directory and download this file.

In [2]:
!echo $PWD
!mkdir -p $PWD/bfx_workshop_week_09
!wget https://github.com/genome/bfx-workshop/blob/master/week_09/mutect.filtered.vcf.gz?raw=true -O $PWD/bfx_workshop_week_09/mutect.filtered.vcf.gz

/home/john/mgi_workshop
--2020-11-13 10:21:19--  https://github.com/genome/bfx-workshop/blob/master/week_09/mutect.filtered.vcf.gz?raw=true
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/genome/bfx-workshop/raw/master/week_09/mutect.filtered.vcf.gz [following]
--2020-11-13 10:21:19--  https://github.com/genome/bfx-workshop/raw/master/week_09/mutect.filtered.vcf.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/genome/bfx-workshop/master/week_09/mutect.filtered.vcf.gz [following]
--2020-11-13 10:21:19--  https://raw.githubusercontent.com/genome/bfx-workshop/master/week_09/mutect.filtered.vcf.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.68.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199

We will also need the reference and tumor bam files used previously.

In [4]:
!docker run -v $PWD/bfx_workshop_week_09:/staging mgibio/data_downloader:0.1.0 gsutil -m cp gs://analysis-workflows-example-data/somatic_inputs/hla_and_brca_genes.fa /staging
!docker run -v $PWD/bfx_workshop_week_09:/staging mgibio/data_downloader:0.1.0 gsutil -m cp gs://analysis-workflows-example-data/somatic_inputs/hla_and_brca_genes.fa.fai /staging
!wget https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/bfx_workshop/tumor.bam -O $PWD/bfx_workshop_week_09/tumor.bam
!wget https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/bfx_workshop/tumor.bam.bai -O $PWD/bfx_workshop_week_09/tumor.bam.bai

Copying gs://analysis-workflows-example-data/somatic_inputs/hla_and_brca_genes.fa...
- [1/1 files][246.3 MiB/246.3 MiB] 100% Done                                    
Operation completed over 1 objects/246.3 MiB.                                    
Copying gs://analysis-workflows-example-data/somatic_inputs/hla_and_brca_genes.fa.fai...
/ [1/1 files][   54.0 B/   54.0 B] 100% Done                                    
Operation completed over 1 objects/54.0 B.                                       
--2020-11-13 10:26:28--  https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/bfx_workshop/tumor.bam
Resolving xfer.genome.wustl.edu (xfer.genome.wustl.edu)... 128.252.233.42
Connecting to xfer.genome.wustl.edu (xfer.genome.wustl.edu)|128.252.233.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 128053546 (122M)
Saving to: ‘/home/john/mgi_workshop/bfx_workshop_week_09/tumor.bam’


2020-11-13 10:26:30 (87.3 MB/s) - ‘/home/john/mgi_workshop/bfx_workshop_week_09/t

## Splitting multiallelic sites using vt decompose

Our VCF might contain variants with multiple alt alleles. In these cases the ALT field lists all of them, separated by commas. Take for example this variant:
```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Exome_Normal	Exome_Tumor
chr17	3017916	.	CGTGT	C,CGT	.	germline;multiallelic;normal_artifact	AS_FilterStatus=weak_evidence|SITE;AS_SB_TABLE=31,0|3,0|9,0;DP=48;ECNT=1;GERMQ=1;MBQ=30,30,30;MFRL=0,0,0;MMQ=60,60,60;MPOS=49,31;NALOD=0.710,-7.297e+00;NLOD=2.65,-6.178e+00;POPAF=6.00,6.00;RPA=16,14,15;RU=GT;STR;STRQ=93;TLOD=6.76,12.45	GT:AD:AF:DP:F1R2:F2R1:SB	0/0:9,0,3:0.066,0.271:12:0,0,0:8,0,3:9,0,3,0	0/1/2:22,3,6:0.117,0.205:31:0,0,0:22,3,6:22,0,9,0
```
This might happen if both chromosomes carry a mutation at the same position but the exact mutation differs between them. It might also happen if a subclonal mutation is present in only some tumor cells, or it might simply be an artifact.

It is usually easier to process a VCF once multiallelic sites have been split into separate lines, since some information is encoded on a per-allele basis (e.g., per-allele depth, per-allele VAF).

vt decompose is part of the [vt tool package](https://genome.sph.umich.edu/wiki/Vt) and is available as a container image on Quay at `quay.io/biocontainers/vt:0.57721--hf74b74d_1`.

In [5]:
!docker run -v $PWD/bfx_workshop_week_09:/data -it quay.io/biocontainers/vt:0.57721--hf74b74d_1 vt decompose /data/mutect.filtered.vcf.gz -s -o /data/mutect.filtered.decomposed.vcf.gz 

decompose v0.5

options:     input VCF file        /data/mutect.filtered.vcf.gz
         [s] smart decomposition   true (experimental)
         [o] output VCF file       /data/mutect.filtered.decomposed.vcf.gz


stats: no. variants                 : 768
       no. biallelic variants       : 747
       no. multiallelic variants    : 21

       no. additional biallelics    : 24
       total no. of biallelics      : 792

Time elapsed: 0.04s



After running vt decompose the above variant is split up into two lines and looks like this:
```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Exome_Normal	Exome_Tumor
chr17	3017916	.	CGTGT	C	.	germline;multiallelic;normal_artifact	AS_FilterStatus=weak_evidence|SITE;AS_SB_TABLE=31,0|3,0|9,0;DP=48;ECNT=1;GERMQ=1;MBQ=30,30;MFRL=0,0;MMQ=60,60;MPOS=49;NALOD=0.71;NLOD=2.65;POPAF=6;RPA=16,14;RU=GT;STR;STRQ=93;TLOD=6.76;OLD_MULTIALLELIC=chr17:3017916:CGTGT/C/CGT	GT:AD:AF:DP:F1R2:F2R1:SB	0/0:9,0:0.066:12:0,0:8,0:9,0,3,0	0/1/.:22,3:0.117:31:0,0:22,3:22,0,9,0
chr17	3017916	.	CGTGT	CGT	.	germline;multiallelic;normal_artifact	AS_FilterStatus=weak_evidence|SITE;AS_SB_TABLE=31,0|3,0|9,0;DP=48;ECNT=1;GERMQ=1;MBQ=30,30;MFRL=0,0;MMQ=60,60;MPOS=31;NALOD=-7.297;NLOD=-6.178;POPAF=6;RPA=16,15;RU=GT;STR;STRQ=93;TLOD=12.45;OLD_MULTIALLELIC=chr17:3017916:CGTGT/C/CGT	GT:AD:AF:DP:F1R2:F2R1:SB	0/0:9,3:0.271:12:0,0:8,3:9,0,3,0	0/./1:22,6:0.205:31:0,0:22,6:22,0,9,0
```
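
To make the per-allele bookkeeping concrete, here is a minimal pure-Python sketch of the idea behind decomposition: one site with N alt alleles becomes N pieces, each keeping the reference value plus the value for its own alt allele. This is an illustration only, not vt's implementation; the function name `split_multiallelic` and the dictionary layout are assumptions made for the example.

```python
def split_multiallelic(alts, ad, af):
    """Split per-allele values of one multiallelic site into biallelic pieces.

    alts: list of alt alleles, e.g. ["C", "CGT"]
    ad:   allelic depths, reference first, then one value per alt allele
    af:   allele fractions, one value per alt allele
    """
    records = []
    for i, alt in enumerate(alts):
        records.append({
            "ALT": alt,
            # keep the reference depth plus the depth of this alt allele only
            "AD": [ad[0], ad[i + 1]],
            "AF": [af[i]],
        })
    return records


# Tumor sample values from the chr17:3017916 example above
for rec in split_multiallelic(["C", "CGT"], [22, 3, 6], [0.117, 0.205]):
    print(rec)
```

The two dictionaries carry `AD` values of `22,3` and `22,6`, matching the tumor columns of the two decomposed lines.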

## bam-readcount
Some variant callers already output read depth and allelic depths, but running bam-readcount is useful when this information is not present in the VCF. It is also useful for adding RNA coverage information to your VCF if you run RNA-seq alongside somatic variant calling.

We will be using the `mgibio/bam_readcount_helper-cwl:1.1.1` Docker image to run bam-readcount. This image already has bam-readcount installed and also contains a script that takes care of creating a region list from your VCF, which is a required input to bam-readcount.
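
As a rough illustration of what a region list looks like, the sketch below turns variant positions into tab-separated `chromosome start end` lines. It is not the logic used by `bam_readcount_helper.py`; the function name `make_region_lines` and the choice to span the full REF allele are assumptions made for the example.

```python
def make_region_lines(variants):
    """Build tab-separated 'chrom start end' lines from (chrom, pos, ref) tuples.

    Positions are 1-based, as in a VCF. The end coordinate covers the whole
    REF allele (an assumption for this sketch). Duplicate regions, which occur
    after decomposing multiallelic sites, are written only once.
    """
    seen = set()
    lines = []
    for chrom, pos, ref in variants:
        region = (chrom, pos, pos + len(ref) - 1)
        if region not in seen:
            seen.add(region)
            lines.append("\t".join(str(field) for field in region))
    return lines


variants = [
    ("chr17", 3017916, "CGTGT"),  # decomposed site appears twice
    ("chr17", 3017916, "CGTGT"),
    ("chr17", 48916258, "GT"),
]
print("\n".join(make_region_lines(variants)))
```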

### Required inputs
- vcf
- sample name
- reference fasta
- bam file
- output file prefix
- output directory

In [6]:
!docker run -v $PWD/bfx_workshop_week_09:/data -it mgibio/bam_readcount_helper-cwl:1.1.1 python /usr/bin/bam_readcount_helper.py /data/mutect.filtered.decomposed.vcf.gz Exome_Tumor /data/hla_and_brca_genes.fa /data/tumor.bam Exome_Tumor /data

Complex variant or MNP will be skipped: chr17	3017916	CGTGT	CGT
Complex variant or MNP will be skipped: chr17	7578700	CA	CAA
Complex variant or MNP will be skipped: chr17	7578700	CA	CAAA
Complex variant or MNP will be skipped: chr17	8513688	GTT	GT
Complex variant or MNP will be skipped: chr17	17249049	GAA	GA
Complex variant or MNP will be skipped: chr17	32939409	GAA	GA
Complex variant or MNP will be skipped: chr17	39204770	GT	GTT
Complex variant or MNP will be skipped: chr17	42728741	CA	CAA
Complex variant or MNP will be skipped: chr17	48916258	GT	TT
Complex variant or MNP will be skipped: chr17	50964587	CA	CAA
Complex variant or MNP will be skipped: chr17	56321063	GAA	GA
Complex variant or MNP will be skipped: chr17	56321063	GAA	GAAA
Complex variant or MNP will be skipped: chr17	67077719	CA	CAA
Complex variant or MNP will be skipped: chr17	67909448	CT	CTT
Complex variant or MNP will be skipped: chr17	68045755	ATT	AT
Complex variant or MNP will be skipped: chr17	69520199

## VAtools
[VAtools](http://www.vatools.org) is a Python package that provides a suite of tools that help with processing VCF annotations. We will be using the [vcf-readcount-annotator tool](https://vatools.readthedocs.io/en/latest/vcf_readcount_annotator.html) included with VAtools to write the readcounts calculated in the previous step to our VCF. VAtools is available as a Docker image at `griffithlab/vatools:4.1.0`.

In [38]:
!docker run -v $PWD/bfx_workshop_week_09:/data -it griffithlab/vatools:4.1.0 vcf-readcount-annotator /data/mutect.filtered.decomposed.vcf.gz /data/Exome_Tumor_Exome_Tumor_bam_readcount_snv.tsv DNA -t snv -s Exome_Tumor -o /data/mutect.filtered.decomposed.readcount_snvs.vcf.gz
!docker run -v $PWD/bfx_workshop_week_09:/data -it griffithlab/vatools:4.1.0 vcf-readcount-annotator /data/mutect.filtered.decomposed.readcount_snvs.vcf.gz /data/Exome_Tumor_Exome_Tumor_bam_readcount_indel.tsv DNA -s Exome_Tumor -t indel -o /data/mutect.filtered.decomposed.readcount_snvs_indel.vcf.gz

# Parsing VCFs in Python

## PyVCF vs VCFPy

[PyVCF](https://pyvcf.readthedocs.io/en/latest/) is the "original" Python VCF parser. It does a good job reading VCFs but doesn't support modifying VCF entries very well. It also doesn't appear to be maintained anymore. [VCFPy](https://vcfpy.readthedocs.io/en/stable/) was created to solve this problem. For that reason we'll be using VCFPy for the next tasks.

First, we need to ensure that the `vcfpy` package is installed.

In [2]:
pip install vcfpy pysam

Note: you may need to restart the kernel to use updated packages.


### Reading in a VCF and exploring its contents

In [3]:
import vcfpy

Create the VCF reader object from your VCF path

In [4]:
vcf_reader = vcfpy.Reader.from_path("bfx_workshop_week_09/mutect.filtered.decomposed.readcount_snvs_indel.vcf.gz")

Which samples are in your VCF?

In [5]:
vcf_reader.header.samples.names

['Exome_Normal', 'Exome_Tumor']

Which FILTERS are defined in the VCF header?

In [11]:
vcf_reader.header.filter_ids()

['PASS',
 'FAIL',
 'base_qual',
 'clustered_events',
 'contamination',
 'duplicate',
 'fragment',
 'germline',
 'haplotype',
 'low_allele_frac',
 'map_qual',
 'multiallelic',
 'n_ratio',
 'normal_artifact',
 'orientation',
 'panel_of_normals',
 'position',
 'possible_numt',
 'slippage',
 'strand_bias',
 'strict_strand',
 'weak_evidence']

Similar methods `info_ids` and `format_ids` exist for the INFO and FORMAT fields.

Get information for a specific INFO header

In [10]:
vcf_reader.header.get_info_field_info('DP')

InfoHeaderLine('INFO', '<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">', {'ID': 'DP', 'Number': 1, 'Type': 'Integer', 'Description': 'Approximate read depth; some reads may have been filtered'})

Get information for each variant

In [16]:
for entry in vcf_reader:
    # Get all FILTER values applied to this variant (a list of strings)
    filters = entry.FILTER
    # Get the tumor sample's call, then its VAFs (one value per alt allele)
    call = entry.call_for_sample['Exome_Tumor']
    vafs = call.data['AF']

After you're done with all processing, you will need to close the file.

In [17]:
vcf_reader.close()

### Filtering a VCF

Let's create a filtered VCF so that only variants with a `PASS` filter and a VAF over 0.25 will remain. 

In [44]:
import vcfpy
vcf_reader = vcfpy.Reader.from_path("bfx_workshop_week_09/mutect.filtered.decomposed.vcf.gz")
vcf_writer = vcfpy.Writer.from_path("bfx_workshop_week_09/mutect.filtered.decomposed.pass_vaf_filtered.vcf.gz", vcf_reader.header)
for entry in vcf_reader:
    if 'PASS' in entry.FILTER and entry.call_for_sample['Exome_Tumor'].data['AF'][0] > 0.25:
        vcf_writer.write_record(entry)
vcf_reader.close()
vcf_writer.close()
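
The filter condition in the loop above can also be pulled out into a small function, which makes it easier to check without a VCF file. The sketch below is a hypothetical helper (`keep_variant` is not part of VCFPy) that takes plain Python values mirroring the ones used in the loop: a list of filter names and a list of allele fractions.

```python
def keep_variant(filters, allele_fractions, min_vaf=0.25):
    """Return True for PASS variants whose first alt allele exceeds min_vaf.

    filters:          list of FILTER names, e.g. ['PASS'] or ['germline']
    allele_fractions: list with one AF value per alt allele; after
                      decomposition there is exactly one
    """
    if "PASS" not in filters:
        return False
    if not allele_fractions:
        # no AF value recorded for this sample; nothing to compare against
        return False
    return allele_fractions[0] > min_vaf


print(keep_variant(["PASS"], [0.31]))
print(keep_variant(["PASS"], [0.117]))
print(keep_variant(["germline", "multiallelic"], [0.205]))
```

Inside the loop, the same check would read `keep_variant(entry.FILTER, entry.call_for_sample['Exome_Tumor'].data['AF'])`.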

### Creating a human-readable TSV file

In [41]:
import vcfpy
import csv

vcf_reader = vcfpy.Reader.from_path("bfx_workshop_week_09/mutect.filtered.decomposed.vcf.gz")
with open("bfx_workshop_week_09/out.tsv", 'w', newline='') as out_fh:
    headers = ['CHROM', 'POS', 'REF', 'ALT', 'FILTER', 'DEPTH', 'VAF']
    tsv_writer = csv.DictWriter(out_fh, delimiter = '\t', fieldnames = headers)
    tsv_writer.writeheader()
    for entry in vcf_reader:
        out = {
            'CHROM': entry.CHROM,
            'POS': entry.POS,
            'REF': entry.REF,
            # entry.ALT is a list of alt-allele objects; write their sequences
            'ALT': ','.join(alt.value for alt in entry.ALT),
            'FILTER': ','.join(entry.FILTER),
            'DEPTH': entry.call_for_sample['Exome_Tumor'].data['DP'],
            'VAF': ','.join( [str(vaf) for vaf in entry.call_for_sample['Exome_Tumor'].data['AF']] )
        }
        tsv_writer.writerow(out)
vcf_reader.close()
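
Once written, a tab-separated file like this can be loaded back with `csv.DictReader`. The sketch below uses an in-memory string in place of the real file so that it runs on its own; the two rows are taken from the decomposed chr17:3017916 example and laid out with the same columns as the `headers` list above.

```python
import csv
import io

# Two rows from the decomposed chr17:3017916 example, using the same
# column names as the headers list above
tsv_text = (
    "CHROM\tPOS\tREF\tALT\tFILTER\tDEPTH\tVAF\n"
    "chr17\t3017916\tCGTGT\tC\tgermline,multiallelic,normal_artifact\t31\t0.117\n"
    "chr17\t3017916\tCGTGT\tCGT\tgermline,multiallelic,normal_artifact\t31\t0.205\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

# Every value comes back as a string, so convert before comparing numbers
high_vaf = [row for row in rows if float(row["VAF"]) > 0.2]
for row in high_vaf:
    print(row["CHROM"], row["POS"], row["ALT"], row["VAF"])
```

To read the real output instead, replace `io.StringIO(tsv_text)` with a file handle opened on the TSV path.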



samtools pysam 
biotools? biopython for fasta