In [2]:
# RUN
import sys
sys.path.append("/opt/src")
import mip_functions as mip
import json
import os
import numpy as np
import pandas as pd

Classes reloading.
functions reloading


# Processing Variant Calls
Freebayes produces high quality vcf files with haplotype based variant calls. This is important for getting more accurate calls, especially for complex regions where SNVs may overlap with indels and there may be many possible alleles as opposed to a simple biallelic SNV call.   

haplotype based variant example:  

chr1  1000 AAA,AGC,TGC  

However, it may be desired to "decompose" these complex variants for some applications. For example, if we are interested in knowing the prevalence of a specific drug resistance mutation, it would make sense to combine all variants containing this mutation even though they may be part of different haplotypes, and hence are represented in the vcf in different variants.  

Decomposed variants:  

chr1  1000 A T  
chr1  1001 A G  
chr1  1002 A C  
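
The decomposition shown above can be sketched in a few lines of Python. This is only an illustration for block substitutions where REF and ALT have the same length; the function name `decompose_block_substitution` is hypothetical, and the pipeline itself relies on an external tool for this step.

```python
def decompose_block_substitution(chrom, pos, ref, alts):
    """Split equal-length block substitutions into unique single nucleotide changes."""
    snvs = set()
    for alt in alts:
        if len(alt) != len(ref):
            # indels and other length-changing alleles are left alone in this sketch
            continue
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt)):
            if ref_base != alt_base:
                snvs.add((chrom, pos + offset, ref_base, alt_base))
    return sorted(snvs)


# the haplotype based example from the text: REF AAA with ALT alleles AGC and TGC
decomposed = decompose_block_substitution("chr1", 1000, "AAA", ["AGC", "TGC"])
for variant in decomposed:
    print(*variant)
```

Both ALT haplotypes contribute the same changes at positions 1001 and 1002, so each of those appears once in the decomposed list.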

The vcf_to_tables_fb function takes the vcf file generated by freebayes and generates allele count and coverage data in table form. It is possible to decompose and aggregate amino acid and/or nucleotide level variants. Three files containing count data are generated: alternate_table.csv, reference_table.csv and coverage_table.csv, for the alt allele, ref allele and coverage count values for each variant, respectively.

It first separates the multiallelic calls to bi-allelic calls.

#### annotate, default=True
It then annotates variants using snpEff.

#### geneid_to_genename, default=None
Variant annotation provides a gene ID (e.g. PF3D7_0709000) but it does not provide common gene names (e.g. crt). If common names are used in target files, or they are desired in general, a tab separated gene ID to gene name file can be used. **gene_name and gene_id** columns are required. If no file is provided, gene name will be the same as the gene ID.
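
As a rough illustration of the expected file layout and the fallback behavior described above, the sketch below builds a one-row mapping in memory and looks up two gene IDs. The helper `gene_name_for` and the second gene ID are made up for the example; only the column names and the crt mapping come from the text.

```python
import io

import pandas as pd

# tab separated mapping with the two required columns
mapping_tsv = "gene_id\tgene_name\nPF3D7_0709000\tcrt\n"
mapping = pd.read_csv(io.StringIO(mapping_tsv), sep="\t")
id_to_name = dict(zip(mapping["gene_id"], mapping["gene_name"]))


def gene_name_for(gene_id):
    """Return the common name if one is mapped, otherwise the gene ID itself."""
    return id_to_name.get(gene_id, gene_id)


print(gene_name_for("PF3D7_0709000"))
print(gene_name_for("PF3D7_0000000"))  # hypothetical ID with no mapping
```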

#### aggregate_aminoacids, default=False
If amino acid level aggregation is requested, it decomposes multi amino acid missense variants into single components and aggregates the alternate allele and coverage counts per amino acid change. For example, the Asn75Glu change in the crt gene is a known drug resistance mutation in Plasmodium falciparum. There may be 3 separate variants in the vcf file that contain this mutation: Asn75Glu, MetAsn75IleGlu, Asn75Glu_del76-80*. All three have the missense variant Asn75Glu. While the first two are simple changes, the third is a complex change including a 5 amino acid deletion and a stop codon following Asn75Glu. In this case, it makes sense to combine the counts of the first two variants towards Asn75Glu counts, but the third one is debatable because of its complexity; i.e. the drug resistance mutation Asn75Glu is probably not that important in that context because of the stop codon following it. So we decompose and aggregate the simple changes but leave complex changes as they are. If amino acid aggregation is carried out, file names will contain the AA tag.
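
The aggregation of alternate counts can be pictured with a small pandas sketch. The counts are invented, the `components` mapping is supplied by hand rather than parsed from annotations, and the name Met74Ile for the first component of MetAsn75IleGlu is an assumption made for illustration. Coverage is handled separately by the pipeline and is not shown here.

```python
import pandas as pd

# invented alternate allele counts for two samples
counts = pd.DataFrame(
    {"sample1": [5, 3, 1], "sample2": [0, 2, 0]},
    index=["Asn75Glu", "MetAsn75IleGlu", "Asn75Glu_del76-80*"])

# hand-written decomposition: simple changes are split into single
# amino acid changes, the complex change is kept as it is
components = {
    "Asn75Glu": ["Asn75Glu"],
    "MetAsn75IleGlu": ["Met74Ile", "Asn75Glu"],  # Met74Ile is an assumed name
    "Asn75Glu_del76-80*": ["Asn75Glu_del76-80*"],
}

rows = []
for variant, parts in components.items():
    for part in parts:
        # each component inherits the counts of the variant it came from
        rows.append(counts.loc[variant].rename(part))

# sum the counts of identical single amino acid changes
aggregated = pd.DataFrame(rows).groupby(level=0).sum()
print(aggregated)
```

In this toy table the Asn75Glu row collects counts from the first two variants only, while the complex variant keeps its own row.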

#### target_aa_annotation, default=None
It is also possible to annotate the targeted variants (such as Asn75Glu above) in the generated tables as 'Targeted' in case some analysis should be carried out on targeted variants only. A tab separated file containing the annotation details is required for this operation. **gene_name, aminoacid_change and mutation_name** are required fields. If a variant's gene_name and aminoacid_change match a row in the target file, that variant will be marked as targeted and will have the corresponding mutation name. Note that if common gene name conversion (see above) is not used, the gene_name column in this file must match the actual gene ID and not the common name. It may be more convenient to keep the gene IDs in the target file as well and use that file for ID to name mapping. **aggregate_aminoacids must be set to True** for this option to be used.
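
A minimal sketch of how such a target file could be matched against variants is given below. The mutation name `crt-N75E`, the second variant and the "No" label for untargeted variants are assumptions for the example; the pipeline's own matching code may differ.

```python
import io

import pandas as pd

# target file with the three required columns (tab separated)
targets_tsv = (
    "gene_name\taminoacid_change\tmutation_name\n"
    "crt\tAsn75Glu\tcrt-N75E\n")
targets = pd.read_csv(io.StringIO(targets_tsv), sep="\t")

# two variants as they might appear after amino acid aggregation
variants = pd.DataFrame({
    "gene_name": ["crt", "crt"],
    "aminoacid_change": ["Asn75Glu", "Met74Ile"]})

# a variant is targeted when both gene_name and aminoacid_change match a row
annotated = variants.merge(
    targets, on=["gene_name", "aminoacid_change"], how="left")
annotated["Targeted"] = annotated["mutation_name"].notna().map(
    {True: "Yes", False: "No"})
print(annotated)
```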

#### aggregate_nucleotides, default=False
A similar aggregation can be done at the nucleotide level. If specified, biallelic variants will be decomposed using the tool **vt decompose_blocksub**. By default it decomposes block substitutions that do not include indels. However, it is also possible to decompose complex variants including indels by providing the -a option. For possible decompose options see vt help:
```bash
vt decompose_blocksub options : 
  -p  Output phased genotypes and PS tags for decomposed variants [false]
  -m  keep MNVs (multi-nucleotide variants) [false]
  -a  enable aggressive/alignment mode [false]
  -d  MNVs max distance (when -m option is used) [2]
  -o  output VCF file [-]
  -I  file containing list of intervals []
  -i  intervals []
  -?  displays help
```
If nucleotide level aggregation is done, the file names will include AN tag.
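
For orientation, the sketch below shows how a list of extra options such as `["-a"]` could end up on a vt command line. The helper `build_vt_command` and the file names are hypothetical, and how the pipeline actually invokes vt is not shown in this notebook.

```python
import shlex


def build_vt_command(input_vcf, output_vcf, decompose_options=()):
    """Assemble a vt decompose_blocksub command as an argument list."""
    return ["vt", "decompose_blocksub", *decompose_options,
            input_vcf, "-o", output_vcf]


# include indels and complex variants in the decomposition
cmd = build_vt_command("biallelic.vcf", "decomposed.vcf", ["-a"])
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a system where vt is installed.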

#### target_nt_annotation, default=None
Annotation of targeted nucleotides requires a file similar to the targeted amino acid annotation. However, the required fields for this annotation are: CHROM, POS, REF, ALT and mutation_name. **aggregate_nucleotides must be set to True** for this option to be used.

#### aggregate_none, default=False
It is also possible to generate count tables without doing any aggregation. This will generate the 3 count files, and all of the variant information included in the vcf file will be a separate column in the table's index. For annotated initial vcf files, or if annotate option is selected, each subfield in the INFO/ANN field will have its own column.

#### min_site_qual, default=-1
Filter variant sites for a minimum QUAL value assigned by the variant caller. This value is described in freebayes manual as:
```bash
Of primary interest to most users is the QUAL field, which estimates the probability that there is a polymorphism at the loci described by the record. In freebayes, this value can be understood as 1 - P(locus is homozygous given the data). It is recommended that users use this value to filter their results, rather than accepting anything output by freebayes as ground truth.

By default, records are output even if they have very low probability of variation, in expectation that the VCF will be filtered using tools such as vcffilter in vcflib, which is also included in the repository under vcflib/. For instance,

freebayes -f ref.fa aln.bam | vcffilter -f "QUAL > 20" >results.vcf

removes any sites with estimated probability of not being polymorphic less than phred 20 (aka 0.01), or probability of polymorphism > 0.99.

In simulation, the receiver-operator characteristic (ROC) tends to have a very sharp inflection between Q1 and Q30, depending on input data characteristics, and a filter setting in this range should provide decent performance. Users are encouraged to examine their output and both variants which are retained and those they filter out. Most problems tend to occur in low-depth areas, and so users may wish to remove these as well, which can also be done by filtering on the DP flag.
```
Therefore, a **minimum of 1** should be used as min_site_qual to remove low quality sites. If a site is annotated as **targeted**, the site will be kept regardless of its QUAL value. However, the alternate observation counts for the site may be reset to zero depending on the min_target_site_qual value described below.
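
Since QUAL is phred scaled, a quick conversion helps to judge what a given threshold means. The helper name below is made up for this sketch; the formula is the standard phred relationship used in the quoted freebayes text (phred 20 corresponds to 0.01).

```python
def qual_to_prob_not_polymorphic(qual):
    """Convert a phred scaled QUAL value to P(site is not polymorphic)."""
    return 10 ** (-qual / 10)


for qual in (1, 20, 30):
    print(qual, round(qual_to_prob_not_polymorphic(qual), 4))
```

A threshold of 1 is therefore very permissive: it only removes sites whose estimated probability of not being polymorphic is above roughly 0.79.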

#### min_target_site_qual, default=-1
If a variant site is targeted but the site qual is lower than this,
reset the alternate observation counts to 0. It may be best to leave
this at the default value since there is usually additional evidence
that a targeted variant exists in a sample compared to a de novo
variant, i.e. those variants that are targeted had been observed in other samples/studies.
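
The interplay between min_site_qual and min_target_site_qual can be summarized with a small decision function. This is a sketch of the behavior described in the text, not the pipeline's code; the function name is hypothetical and the handling of values exactly at the threshold is an assumption.

```python
def filter_site(qual, targeted, alt_count,
                min_site_qual=-1, min_target_site_qual=-1):
    """Return (keep_site, alt_count) for a single variant site."""
    if targeted:
        # targeted sites are always kept, but their alternate counts
        # are reset when the site quality is below the target threshold
        if qual < min_target_site_qual:
            alt_count = 0
        return True, alt_count
    # other sites are kept only if they reach the minimum site quality
    return qual >= min_site_qual, alt_count


print(filter_site(0.5, targeted=False, alt_count=7,
                  min_site_qual=1, min_target_site_qual=0))
print(filter_site(0.5, targeted=True, alt_count=7,
                  min_site_qual=1, min_target_site_qual=0))
print(filter_site(0.5, targeted=True, alt_count=7,
                  min_site_qual=1, min_target_site_qual=1))
```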

#### Example cell
```python
# provide a file that maps gene names to gene IDs
# this is necessary when targeted variant annotations use
# gene names instead of gene IDs
geneid_to_genename = "/opt/project_resources/geneid_to_genename.tsv"
# annotate targeted amino acid changes in the tables.
target_aa_annotation = "/opt/project_resources/targets.tsv"
# decompose multi amino acid changes and combine counts of
# resulting single amino acid changes
aggregate_aminoacids = True
# decompose MNVs and combine counts for resulting SNVs
aggregate_nucleotides = True
# annotate targeted nucleotide changes in the tables.
target_nt_annotation = None
```

In [None]:
# USER INPUT

# provide a file that maps gene names to gene IDs
# this is necessary when targeted variant annotations use
# gene names instead of gene IDs. Otherwise provide None
geneid_to_genename = None
# annotate targeted amino acid changes in the tables
# using the file, or otherwise provide None
target_aa_annotation = None
# decompose multi amino acid changes and combine counts of
# resulting single amino acid changes
aggregate_aminoacids = None
# decompose MNVs and combine counts for resulting SNVs
aggregate_nucleotides = None
# annotate targeted nucleotide changes in the tables.
target_nt_annotation = None

In [None]:
# OPTIONAL USER INPUT

# analysis settings dictionary
#settings = settings
# provide the path to the settings file
# if settings dictionary has not been loaded
settings_file = None
# use snpEff to annotate the variants
annotate = True
# additional vt options for decomposing nucleotides.
# Supply ["-a"] to include indels and complex variants
# in decomposition, or other options shown above if desired.
decompose_options = []
# was the initial vcf file annotated by snpEff?
annotated_vcf = False
# create tables for variants as they are in the vcf file
# without decomposing complex variants or indels.
# Multiallelic variants will be split into biallelic.
aggregate_none = True
# filter variant sites for quality
min_site_qual = 1
# reset targeted variant counts to zero
# when the site quality is below this value
min_target_site_qual = 0
# reset genotypes in the vcf file to NA
# and depth to 0 if FORMAT/GQ value for a variant/sample
# is below this value:
min_genotype_qual = 1
# reset alt allele count in the vcf file to 0
# if FORMAT/QA value divided by FORMAT/AO for a variant/sample
# is below this value:
min_mean_alt_qual = 15 # average quality cut off for variants
# There are also available, similar filters for:
# min_mean_ref_qual : resetting low qual reference allele counts
# min_alt_qual : similar to min_mean_alt_qual, but for total qual score
# min_ref_qual : similar to min_alt_qual but for reference alleles

# prefix for output files, if desired.
# this is useful when different quality thresholds etc will be used
# to avoid overwriting the files. For example, if min_genotype_qual = 1
# and min_mean_alt_qual = 15 is used, a suitable prefix could be
# "gq1.mqa15."
output_prefix = ""

In [12]:
# RUN

# input vcf file
vcf_file = vcf_file.split("/")[-1]
mip.vcf_to_tables_fb(
     vcf_file,
     settings=settings,
     settings_file=settings_file,
     annotate=annotate,
     geneid_to_genename=geneid_to_genename,
     target_aa_annotation=target_aa_annotation,
     aggregate_aminoacids=aggregate_aminoacids,
     target_nt_annotation=target_nt_annotation, 
     aggregate_nucleotides=aggregate_nucleotides, 
     decompose_options=decompose_options,
     annotated_vcf=annotated_vcf,
     aggregate_none=aggregate_none,
     min_site_qual=min_site_qual,
     min_target_site_qual=min_target_site_qual,
     min_genotype_qual=min_genotype_qual,
     min_mean_alt_qual=min_mean_alt_qual,
     output_prefix=output_prefix)

## Tables created
alternate_XX_table.csv files will contain the ALT allele count for that table type while coverage_XX_table.csv will contain the depth of coverage at each locus.
### Nucleotide changes (aggregated)
For some projects we may be interested in specific single nucleotide changes. For these, it would make sense to decompose multi nucleotide changes and combine counts of the same single nucleotide changes. The following tables will be generated for count, coverage and genotype data for aggregated nucleotide changes:  

**alternate_AN_table.csv** file in the analysis directory is created if aggregate_nucleotides option was selected when creating data tables. This table has the UMI counts for each alternate nucleotide.  

**coverage_AN_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_AN_table.csv** file contains the aggregated value of the genotypes called by freebayes: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1. When calls from multiple variants are aggregated; if all 0/0 then -> 0, if any 0/0 and non-0/0 then -> 1, if all 1/1 then -> 2
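
The aggregation rule for genotype codes can be written as a short function. This sketch covers the three cases stated above and makes two assumptions that the text does not spell out: missing calls (-1) are ignored unless every call is missing, and any other mixture of calls is treated as 1. The function name is hypothetical.

```python
def aggregate_genotypes(calls):
    """Combine genotype codes (0, 1, 2, -1) of variants being aggregated."""
    observed = [call for call in calls if call != -1]
    if not observed:
        return -1  # nothing was called
    if all(call == 0 for call in observed):
        return 0   # all 0/0
    if all(call == 2 for call in observed):
        return 2   # all 1/1
    return 1       # mixture of 0/0 and non-0/0 calls


for calls in ([0, 0], [0, 2], [2, 2], [-1, -1]):
    print(calls, aggregate_genotypes(calls))
```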

### Amino acid changes (aggregated)
For some projects we may be interested in the amino acid changes, particularly specific, targeted amino acid changes, such as drug resistance mutations in *Plasmodium falciparum*, which is the data set provided for the pipeline test. For these types of projects, we may want to analyze the variants from the amino acid perspective, rather than as nucleotide changes, which are the standard output of variant callers.  

**alternate_AA_table.csv** file in the analysis directory is created if aggregate_aminoacids option was selected when creating data tables. This table has the UMI counts for each alternate amino acid.  

**coverage_AA_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_AA_table.csv** file contains the aggregated value of the genotypes called by freebayes: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1. When calls from multiple variants are aggregated; if all 0/0 then -> 0, if any 0/0 and non-0/0 then -> 1, if all 1/1 then -> 2

### Nucleotide changes (not aggregated)
For some projects we may be interested in keeping composite variants as they are called by the pipeline. These will include MNVs, complex variants including indels, etc. The following tables will be generated for count, coverage and genotype data for original nucleotide changes:  

**alternate_table.csv** file in the analysis directory is created if aggregate_none option was selected when creating data tables. This table has the UMI counts for each alternate nucleotide.  

**coverage_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_table.csv** file contains the aggregated value of the genotypes called by freebayes: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1.

# Calling genotypes, prevalences and filtering data
The original vcf file created by freebayes contains the genotypes determined by the program itself. In addition, genotype values for aggregated and non-aggregated nucleotides and amino acids are also available as genotypes_XX_table.csv files as described above.  

However, the default parameters generating the vcf file are not very strict. In this part of the analysis we will apply various filters to the count tables and generate genotype calls based on those filters.

### Choose which tables to analyse
Select the type of data to analyse. Make sure the count file matches the coverage file, e.g. for alternate_XX_table and coverage_XX_table, XX must be the same value (AA, AN or nothing).

#### Example cell
```python
mutation_count_file = "/opt/analysis/q1.mqa15.alternate_AA_table.csv"
mutation_coverage_file = "/opt/analysis/q1.mqa15.coverage_AA_table.csv"
```

In [None]:
# USER INPUT

mutation_count_file = 
mutation_coverage_file = 

In [None]:
# RUN
mutation_counts = pd.read_csv(mutation_count_file,
                              header=list(range(6)),
                              index_col=0)
mutation_counts.head()

In [None]:
# RUN
mutation_coverage = pd.read_csv(mutation_coverage_file,
                                index_col=0,
                                header=list(range(6)))
mutation_coverage.head()

### Set your filters   
1.  **min_coverage**: how many UMIs are needed at a genomic position for a sample to reliably call genotypes. If we set min_coverage = 10, any locus within a sample that is covered below this threshold will have an NA genotype.
2.  **min_count**: if a genomic position has enough coverage, how many UMIs supporting an ALT allele are needed for a reliable call. If we set min_count = 2, any mutation call that has fewer than 2 UMIs supporting the ALT allele will revert to REF.
3.  **min_freq**: a minimum within sample allele frequency threshold to consider a variant valid. If set to 0.01, for example, a locus whose ALT allele is at 0.005 frequency within the sample would be called REF; if the within sample AF is between 0.01 and 0.99, it would be called HET; and if it is > 0.99, it would be called homozygous ALT.

#### Example cell
```python
# filter mutation counts for minimum count parameter
# by setting counts to zero if it is below threshold
min_count = 2
# filter loci without enough coverage by setting
# coverage to zero if it is below threshold
min_coverage = 10
# call genotypes using the minimum within sample
# allele frequency
min_freq = 0
```

In [15]:
# USER INPUT 

# filter mutation counts for minimum count parameter
# by setting counts to zero if it is below threshold
min_count = 
# filter loci without enough coverage by setting
# coverage to zero if it is below threshold
min_coverage = 
# call genotypes using the minimum within sample
# allele frequency
min_freq = 

In [None]:
# RUN

# import the PCA module which has genotype calling and
# filtering functions 
import PCA

gt_calls = PCA.call_genotypes(mutation_counts, mutation_coverage,
                              min_count, min_coverage, min_freq)
gt_calls.keys()

### What are the dataframes generated by the call_genotypes function and how are they generated?

**filtered_mutation_counts**: take the mutation_counts table, if a cell's value is below *min_count*, reset that cell's value to zero, otherwise leave as is.  

In [None]:
# RUN
filtered_mutation_counts = gt_calls["filtered_mutation_counts"]
filtered_mutation_counts.head()

**filtered_mutation_coverage**: take the mutation_coverage table, if a cell's value is below *min_coverage*, reset that cell's value to zero, otherwise leave as is.

In [None]:
# RUN
filtered_mutation_coverage = gt_calls["filtered_mutation_coverage"]
filtered_mutation_coverage.head()

**wsaf**: divide *filtered_mutation_counts* table by *filtered_mutation_coverage* table, yielding within sample allele frequencies.  

In [None]:
# RUN
freq = gt_calls["wsaf"]
freq.head()

**genotypes**: take the *wsaf* table, if a cell's value is less than *min_freq* set the genotype value to 0 (homozygous wild type); if the cell's value is more than (*1 - min_freq*) set the genotype value to 2 (homozygous mutant), if the cell's value is between *min_freq* and (*1 - min_freq*) set the genotype value to 1 (heterozygous/mixed).  

In [None]:
# RUN
genotypes = gt_calls["genotypes"]
genotypes.head()

**prevalences**: take the *genotypes* table, if a cell's value is 2, reset its value to 1; otherwise leave as is.
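
The five steps described in this section can be reproduced on a toy table to see how they fit together. The function below is a sketch written from the descriptions above, not the code of `PCA.call_genotypes`. It assumes that positions with zero filtered coverage yield a missing value, and that frequencies exactly at `min_freq` or `1 - min_freq` are assigned to the homozygous classes (so that a frequency of 0 is still called 0 when `min_freq = 0`).

```python
import numpy as np
import pandas as pd


def call_genotypes_sketch(counts, coverage, min_count, min_coverage, min_freq):
    # reset values below the thresholds to zero
    f_counts = counts.where(counts >= min_count, 0)
    f_coverage = coverage.where(coverage >= min_coverage, 0)
    # within sample allele frequency; zero coverage becomes missing (assumption)
    wsaf = f_counts / f_coverage.replace(0, np.nan)
    freq = wsaf.to_numpy(dtype=float)
    # 0: homozygous wild type, 1: heterozygous/mixed, 2: homozygous mutant
    calls = np.select(
        [np.isnan(freq), freq <= min_freq, freq >= 1 - min_freq],
        [np.nan, 0.0, 2.0],
        default=1.0)
    genotypes = pd.DataFrame(calls, index=wsaf.index, columns=wsaf.columns)
    # prevalence only records presence or absence of the mutation
    prevalences = genotypes.replace(2.0, 1.0)
    return {"filtered_mutation_counts": f_counts,
            "filtered_mutation_coverage": f_coverage,
            "wsaf": wsaf,
            "genotypes": genotypes,
            "prevalences": prevalences}


samples = ["s1", "s2", "s3", "s4"]
toy_counts = pd.DataFrame({"v1": [0, 5, 20, 1], "v2": [10, 0, 3, 4]},
                          index=samples)
toy_coverage = pd.DataFrame({"v1": [20, 20, 20, 20], "v2": [10, 5, 30, 40]},
                            index=samples)
toy_calls = call_genotypes_sketch(toy_counts, toy_coverage,
                                  min_count=2, min_coverage=10, min_freq=0.01)
print(toy_calls["genotypes"])
```

In the toy data, sample s4 has a single UMI for v1, which is below `min_count` and is therefore called 0, while sample s2 has too little coverage at v2 and ends up with a missing genotype.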

In [None]:
# RUN
prevalences = gt_calls["prevalences"]
prevalences.head()

## Filter genotypes / prevalences
It is generally a good idea to do some basic noise removal once the genotypes are created. Some suggestions are provided here.

### Filter variants that are always at low WSAF
If a variant is only seen at a low frequency within samples, it is a good indication that it could be just noise. Here we will set a number of samples and minimum WSAF threshold to remove such noise.

```python
num_samples_wsaf = 2
min_wsaf = 0.5
wsaf_filter = ((freq > min_wsaf).sum()) >= num_samples_wsaf
```

The above options will keep the variants that are at > 0.5 WSAF in at least 2 samples.

In [None]:
# USER INPUT
num_samples_wsaf = 
min_wsaf = 

In [None]:
wsaf_filter = ((freq > min_wsaf).sum()) >= num_samples_wsaf
print(("{} of {} variants will remain after the wsaf filter").format(
    wsaf_filter.sum(), freq.shape[1]))

### Filter variants that are observed with low UMI counts
If a variant is only supported by a low number of UMIs across the entire sample set, it is another indication of noise.

```python
num_samples_umi = 2
min_umi = 3
umi_filter = ((filtered_mutation_counts >= min_umi).sum()) >= num_samples_umi
```

The above options will keep the variants that are supported by at least 3 UMIs in at least 2 samples.

In [None]:
# USER INPUT
num_samples_umi = 
min_umi = 

In [None]:
# RUN
umi_filter = ((filtered_mutation_counts >= min_umi).sum()) >= num_samples_umi
print(("{} of {} variants will remain after the UMI filter").format(
    umi_filter.sum(), freq.shape[1]))

### Keep variants that were targeted
In most projects there are a number of variants that we would like to report, even if they are not seen in the sample set. We would like to stop those variants from being removed by the above filters.

In [None]:
# RUN
targ = freq.columns.get_level_values("Targeted") == "Yes"

### Combine filters
Keep the variants that are either targeted or passing filters
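
A toy example may help to see how the three masks interact. The tables here have only three column levels, with invented level names and values apart from "Targeted"; the real tables carry more header levels. The thresholds follow the "at least N samples" wording used in the filter descriptions.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("crt", "Asn75Glu", "Yes"),
     ("crt", "Met74Ile", "No"),
     ("crt", "Ala220Ser", "No")],
    names=["Gene", "AA Change", "Targeted"])
samples = ["s1", "s2", "s3"]
toy_freq = pd.DataFrame(
    [[0.0, 0.9, 0.02], [0.0, 0.8, 0.0], [0.0, 0.0, 0.01]],
    index=samples, columns=columns)
toy_umi = pd.DataFrame(
    [[0, 9, 1], [0, 8, 0], [0, 0, 1]],
    index=samples, columns=columns)

# variants above 0.5 WSAF in at least 2 samples
toy_wsaf_filter = (toy_freq > 0.5).sum() >= 2
# variants supported by at least 3 UMIs in at least 2 samples
toy_umi_filter = (toy_umi >= 3).sum() >= 2
# targeted variants are kept even if they are never observed
toy_targ = toy_freq.columns.get_level_values("Targeted") == "Yes"

toy_mask = (toy_wsaf_filter & toy_umi_filter) | toy_targ
toy_kept = toy_freq.loc[:, toy_mask.to_numpy()]
print(toy_kept)
```

The targeted variant is never observed in the toy data but stays in the table, the second variant passes both filters, and the third is removed as likely noise.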

In [None]:
variant_mask = targ | (wsaf_filter & umi_filter)
print(("{} variants will remain in the final call set.\n"
       "{} variants were targeted and will be kept; and {} will be removed by "
       "the combined UMI and WSAF filters.").format(
    variant_mask.sum(), targ.sum(), (~variant_mask).sum()))

## Filter data tables with the combined filters

In [None]:
filtered_genotypes = genotypes.loc[:, variant_mask]
filtered_genotypes.head()

In [None]:
filtered_prevalences = prevalences.loc[:, variant_mask]
filtered_prevalences.head()

## Save the tables you want to keep

In [None]:
filtered_prevalences.to_csv("filtered_prevalences.csv")