# Introduction
## Python 2 vs 3
## Conda
## Jupyter Notebook


# Preprocessing VCF - adding readcounts using bam-readcount and VAtools

TODO: add a section about vt decompose?

## bam-readcount
Some variant callers already output read depth and allelic depths, but bam-readcount is useful when this information is missing from the VCF. It is also useful if you run RNA-seq alongside somatic variant calling and want to add RNA coverage information to your VCF.

We will use the `mgibio/bam_readcount_helper-cwl:1.1.1` Docker image to run bam-readcount. The image already has bam-readcount installed and also contains a script that creates a region list from your VCF, which is a required input to bam-readcount.

### Required inputs
- vcf
- sample name
- reference fasta
- bam file
- output file prefix
- output directory

In [1]:
!docker run -v $PWD:/data -it mgibio/bam_readcount_helper-cwl:1.1.1 python /usr/bin/bam_readcount_helper.py vcf sample reference_fasta bam output_file_prefix output_directory

[E::hts_open_format] Failed to open file vcf
Traceback (most recent call last):
  File "/usr/bin/bam_readcount_helper.py", line 51, in <module>
    vcf_file = VCF(vcf_filename)
  File "cyvcf2/cyvcf2.pyx", line 195, in cyvcf2.cyvcf2.VCF.__init__
    raise IOError("Error opening %s" % fname)
IOError: Error opening vcf
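
The traceback above appears because the placeholder arguments (`vcf`, `sample`, and so on) are not real files. A minimal sketch of assembling the same command with real inputs follows, assuming the inputs sit in one host directory, which `-v` mounts at `/data` inside the container. The file names and the `build_helper_command` function are hypothetical; the argument order is the one used in the cell above. `-it` is left out because the sketch is meant to run non-interactively.

```python
import os
import shlex

IMAGE = "mgibio/bam_readcount_helper-cwl:1.1.1"


def build_helper_command(host_dir, vcf, sample, reference_fasta, bam,
                         output_prefix, mount_point="/data"):
    """Build the docker argument list for bam_readcount_helper.py.

    File arguments are names relative to host_dir; they are rewritten to
    paths under mount_point, where the container sees the mounted directory.
    """
    def in_container(name):
        return f"{mount_point}/{name}"

    return [
        "docker", "run",
        "-v", f"{os.path.abspath(host_dir)}:{mount_point}",
        IMAGE,
        "python", "/usr/bin/bam_readcount_helper.py",
        in_container(vcf),
        sample,
        in_container(reference_fasta),
        in_container(bam),
        output_prefix,
        mount_point,  # output directory: write results back into the mount
    ]


# Hypothetical file names, for illustration only.
command = build_helper_command(
    ".", "input.vcf", "TUMOR", "reference.fa", "tumor.bam", "tumor")
print(" ".join(shlex.quote(arg) for arg in command))
```

Passing `command` to `subprocess.run(command, check=True)` would then launch the container.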


## VAtools
[VAtools](http://www.vatools.org) is a Python package that provides a suite of tools for processing VCF annotations. We will use the [vcf-readcount-annotator tool](https://vatools.readthedocs.io/en/latest/vcf_readcount_annotator.html) included with VAtools to write the read counts calculated in the previous step to our VCF. VAtools is available as a Docker image at `griffithlab/vatools:4.1.0`.

In [2]:
!docker run -v $PWD:/data -it griffithlab/vatools:4.1.0 vcf-readcount-annotator input_vcf bam_readcount_file DNA

Traceback (most recent call last):
  File "/usr/local/bin/vcf-readcount-annotator", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/vatools/vcf_readcount_annotator.py", line 190, in main
    read_counts = parse_bam_readcount_file(args)
  File "/usr/local/lib/python3.6/dist-packages/vatools/vcf_readcount_annotator.py", line 54, in parse_bam_readcount_file
    with open(args.bam_readcount_file, 'r') as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'bam_readcount_file'


# Parsing VCFs in Python

## PyVCF vs VCFPy

[PyVCF](https://pyvcf.readthedocs.io/en/latest/) is the "original" Python VCF parser. It does a good job reading VCFs but doesn't support modifying VCF entries very well. It also doesn't appear to be maintained anymore. [VCFPy](https://vcfpy.readthedocs.io/en/stable/) was created to solve this problem. For that reason we'll be using VCFPy for the next tasks.

First, we need to ensure that the `vcfpy` package is installed.

In [3]:
pip install vcfpy

Note: you may need to restart the kernel to use updated packages.


### Reading in a VCF and exploring its contents

In [4]:
import vcfpy

Create the VCF reader object from your VCF path

In [5]:
vcf_reader = vcfpy.Reader.from_path("input.vcf")

FileNotFoundError: [Errno 2] No such file or directory: 'input.vcf'

Which samples are in your VCF?

In [None]:
vcf_reader.header.samples.names

Which FILTERS are defined in the VCF header?

In [None]:
vcf_reader.header.filter_ids()

Similar methods `info_ids` and `format_ids` exist for the INFO and FORMAT fields.

Get information for a specific INFO header

In [None]:
vcf_reader.header.get_info_field_info('sth')

Get information for each variant

In [None]:
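for entry in vcf_reader:
    # Check whether a specific FILTER is set (FILTER is a list of filter IDs)
    has_filter = 'sth' in entry.FILTER
    # Get the VAFs of a variant from the sample's call (assumes an AF FORMAT field)
    call = entry.call_for_sample['sample_name']
    vafs = call.data.get('AF')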
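
When a VCF carries allelic depths but no precomputed allele fraction, the VAF of each ALT allele can be derived from the depths. This is a minimal sketch, assuming an `AD`-style list with the reference count first followed by one count per ALT allele, and assuming the sum of those counts is the denominator (some pipelines divide by `DP` instead). `vafs_from_allelic_depths` is a hypothetical helper, not part of vcfpy.

```python
def vafs_from_allelic_depths(allelic_depths):
    """Return one VAF per ALT allele from [ref_count, alt1_count, ...]."""
    total = sum(allelic_depths)
    if total == 0:
        # No supporting reads at all: report 0.0 rather than dividing by zero
        return [0.0 for _ in allelic_depths[1:]]
    return [alt_count / total for alt_count in allelic_depths[1:]]


# 30 reference reads and 10 alternate reads
print(vafs_from_allelic_depths([30, 10]))
```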

After you're done with all processing, you will need to close the file.

In [None]:
vcf_reader.close()

### Filtering a VCF

Let's create a filtered VCF so that only variants with a `PASS` filter and a VAF over 0.25 will remain. 

In [None]:
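import vcfpy

vcf_reader = vcfpy.Reader.from_path("input.vcf")
vcf_writer = vcfpy.Writer.from_path("out.vcf", vcf_reader.header)

for entry in vcf_reader:
    call = entry.call_for_sample['sample_name']
    # AF may be a single value or one value per ALT allele
    vafs = call.data.get('AF')
    if not isinstance(vafs, list):
        vafs = [vafs]
    # Keep the record if an ALT allele in the genotype has a VAF over 0.25
    keep = any(
        alt.value in call.gt_bases and vaf is not None and vaf > 0.25
        for alt, vaf in zip(entry.ALT, vafs)
    )
    # Write each passing record once, even if several ALT alleles qualify
    if 'PASS' in entry.FILTER and keep:
        vcf_writer.write_record(entry)

vcf_reader.close()
vcf_writer.close()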
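
The keep-or-drop rule used in this section (a `PASS` filter, a VAF over 0.25, and the ALT allele present in the sample's genotype) can also be written as a plain function and checked without a VCF. The sketch below works on plain strings and floats; `keep_variant` and its parameters are hypothetical names. With vcfpy, the inputs would come from the record's FILTER list, the call's genotype bases, the ALT allele strings, and the `AF` values.

```python
def keep_variant(filters, genotype_bases, alt_alleles, vafs, min_vaf=0.25):
    """Return True if the record is PASS and at least one ALT allele found
    in the sample's genotype has a VAF strictly above min_vaf."""
    if "PASS" not in filters:
        return False
    return any(
        alt in genotype_bases and vaf is not None and vaf > min_vaf
        for alt, vaf in zip(alt_alleles, vafs)
    )


# Heterozygous A/G call with a VAF of 0.4 on a PASS record
print(keep_variant(["PASS"], ("A", "G"), ["G"], [0.4]))
```

Because the comparison is strict, a variant with a VAF of exactly 0.25 is dropped.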

### Creating a human-readable TSV file

In [None]:
import csv

import vcfpy

vcf_reader = vcfpy.Reader.from_path("input.vcf")
with open("out.tsv", 'w', newline='') as out_fh:
    headers = ['CHROM', 'POS', 'REF', 'ALT', 'FILTER', 'DEPTH', 'VAF']
    tsv_writer = csv.DictWriter(out_fh, delimiter='\t', fieldnames=headers)
    tsv_writer.writeheader()
    for entry in vcf_reader:
        call = entry.call_for_sample['sample_name']
        # AF may be a single value or one value per ALT allele
        vafs = call.data.get('AF')
        if not isinstance(vafs, list):
            vafs = [vafs]
        out = {
            'CHROM': entry.CHROM,
            'POS': entry.POS,
            'REF': entry.REF,
            # ALT entries are objects; alt.value holds the allele string
            'ALT': ','.join(alt.value for alt in entry.ALT),
            'FILTER': ','.join(entry.FILTER),
            'DEPTH': call.data.get('DP'),
            'VAF': ','.join(str(vaf) for vaf in vafs)
        }
        tsv_writer.writerow(out)
vcf_reader.close()
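
To see what the resulting file looks like, the same `csv.DictWriter` settings can be exercised on an in-memory buffer and read back. This sketch uses one made-up multi-allelic row rather than a real VCF record; multi-valued columns are comma-joined, which is safe because the column delimiter is a tab.

```python
import csv
import io

headers = ['CHROM', 'POS', 'REF', 'ALT', 'FILTER', 'DEPTH', 'VAF']

# A made-up multi-allelic row, for illustration only.
row = {
    'CHROM': '1',
    'POS': 100,
    'REF': 'A',
    'ALT': ','.join(['G', 'T']),
    'FILTER': ','.join(['PASS']),
    'DEPTH': 40,
    'VAF': ','.join(str(vaf) for vaf in [0.25, 0.5]),
}

# Write the header and the row as tab-separated text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, delimiter='\t', fieldnames=headers)
writer.writeheader()
writer.writerow(row)

# Read the text back; every value comes back as a string
buffer.seek(0)
rows = list(csv.DictReader(buffer, delimiter='\t'))
print(rows[0]['ALT'], rows[0]['VAF'])
```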



TODO: samtools, pysam
TODO: biotools? Biopython for FASTA