# How to use code cells in this notebook
If a code cell starts with 
```python
# RUN
```
Run the cell by CTRL+Enter, or the Run button above.  

If a code cell starts with
```python
# USER INPUT
```
User input is needed before running the cell. Usually there will be a cell preceding this which gives an example for the values to be provided.

If a code cell starts with
```python
# OPTIONAL USER INPUT
```
User input is optional before running the cell. Defaults are provided, so either make sure the default settings will work for your run, or change them appropriately.

If a cell starts with
#### Example cell
These cells are not code cells. They show examples of user inputs, taken from the test data analysis, for the code cell that follows, so that the user can see the expected formatting.

**Important note on entering input:** When entering user input, please make sure you follow the formatting provided in the example cells. For example, when the parameter is text, make sure you have quotation marks around the parameters but when it is a number, do not enclose in quotes. If it is a list, then provide a list in brackets.
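As a brief illustration of the formatting rule above, the sketch below shows the three kinds of values. The variable names are hypothetical and are not parameters used by this notebook.

```python
# Hypothetical variable names, used only to illustrate value formatting.
text_value = "pf"            # text: enclosed in quotation marks
number_value = 20            # number: no quotation marks
list_value = ["-t", "20"]    # list: items inside square brackets

# A number wrapped in quotes is text, so it does not equal the number itself.
quoted_number = "20"
print(type(number_value).__name__, type(quoted_number).__name__)
print(quoted_number == number_value)
```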

In [None]:
# RUN
import sys
sys.path.append("/opt/src")
import mip_functions as mip
import probe_summary_generator
import pickle
import json
import copy
import math
import os
import numpy as np
import subprocess
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from matplotlib.lines import Line2D
plt.rcParams['svg.fonttype'] = 'none'
import pandas as pd
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
import allel
wdir = "/opt/analysis/"
data_dir = "/opt/data/"



#### Example cell
```python

# provide the MIPWrangler output files
# which must be located in the /opt/data directory within the container.
# if more than one run is to be merged, provide all files
info_file = "allInfo.tsv.gz"

# sample sheets associated with each wrangler file,
# in the same order as the wrangler files.
sample_sheet = "sample_sheet.tsv"


# No input below
info_file = [data_dir + info_file]
sample_sheet = [data_dir + sample_sheet]
pd.read_table(sample_sheet[0]).groupby(["sample_set", "probe_set"]).first()
```

In [None]:
# USER INPUT

# provide the MIPWrangler output file
# which must be located in the /opt/data directory within the container.
info_file = ""

# sample sheet associated with wrangler file,
# you should only have one sample sheet (if there are multiple sample sheets, merge them first)
sample_sheet = ""

# No input below
info_file = [data_dir + info_file]
sample_sheet=[data_dir+sample_sheet]
pd.read_table(sample_sheet[0]).groupby(["sample_set", "probe_set"]).first()

The above table shows the unique sample_set, probe_set combinations in the sample sheets provided. Select which combinations should be used for this analysis.

#### Example cell
Which sample sets and probe sets would you like to analyze? These are listed in your sample sheet under the "sample_set" and "probe_set" columns.  Enter a single sample_set and a single probe_set.

If a sample was captured/sequenced with multiple probe sets at the same time, there might optionally be multiple comma delimited probe sets in the probe set column from the sample sheet (e.g. DR1,VAR4 if sequencing was performed on DR1 and VAR4 probe sets).  You only need to enter the probe set you are interested in analyzing here.

```python
sample_set = "PRX-00,PRX-04,PRX-07"
probe_set = "DR23K"
```
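The comma delimited probe set field described above can be pictured with the small sketch below. The helper `has_probe_set` is hypothetical and only illustrates the idea; it is not necessarily how `mip_functions` matches probe sets internally.

```python
def has_probe_set(probe_set_field, probe_set):
    """Return True if probe_set is one of the comma delimited entries.

    Hypothetical helper for illustration only.
    """
    entries = [entry.strip() for entry in probe_set_field.split(",")]
    return probe_set in entries


# A sample captured with both DR1 and VAR4 matches either probe set.
print(has_probe_set("DR1,VAR4", "VAR4"))
print(has_probe_set("DR1,VAR4", "DR23K"))
```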

In [None]:
# USER INPUT
sample_set = ""
probe_set = ""

### Specify the species
For the species, the options are: "pf" for *Plasmodium falciparum*, "pv" for *Plasmodium vivax*, "hg19" for *Homo sapiens* genome assembly hg19/GRCh37, and "hg38" for *Homo sapiens* genome assembly hg38/GRCh38.

#### Example cell
```python
species = "pf"
```

In [None]:
# USER INPUT
species = ""

#### Example cell
```python
# available cpu count
processorNumber = 20
freebayes_threads = 8

## extra bwa options for haplotype alignment
# use "-a" for getting all alignments
# use "-L 500" to penalize soft clipping 
# use "-t" to set number of available processors
bwaExtra = ["-t", str(processorNumber)]
```

In [None]:
# OPTIONAL USER INPUT
# available cpu count
processorNumber = 20
freebayes_threads = 8

## extra bwa options for haplotype alignment
# use "-a" for getting all alignments
# use "-L 500" to penalize soft clipping 
# use "-t" to set number of available processors
bwaExtra = ["-t", str(processorNumber)]

### Get/Set the analysis settings
Use the settings template for the specified species to get the analysis settings, and update them with the values specified in the above cell. This will create a template_settings.txt file in your analysis directory and a settings.txt file to be used for the analysis. These files will also serve as a reference of the analysis settings for the sake of reproducibility.  

The last step of the below cell attempts to save a file to the /opt/project_resources directory. If you do not have write permission to the location, you cannot save that file. However, if a file has been previously saved in the directory, it will be fine.

In [None]:
# RUN

# extract the settings template
settings = mip.get_analysis_settings("/opt/resources/templates/analysis_settings_templates/settings.txt")

# update bwa settings with the options set above
bwaOptions = [settings["bwaOptions"]]
bwaOptions.extend(bwaExtra)

# create a dictionary for which settings should be updated
# using the user specified parameters.

settings['processorNumber']=processorNumber
settings['freebayes_threads']=freebayes_threads
settings['bwaOptions']=bwaOptions
settings['species']=species
settings['mipSetKey']=[probe_set.strip(), ""]
# create a settings file in the analysis directory.
settings_file='settings.txt'
settings_path = os.path.join(wdir, settings_file)
mip.write_analysis_settings(settings, settings_path)
#reparse settings from settings file
settings = mip.get_analysis_settings(wdir + settings_file)
print(settings['mipSetKey'])
# create probe sets dictionary
try:
    mip.update_probe_sets("/opt/project_resources/mip_ids/mipsets.csv",
                         "/opt/project_resources/mip_ids/probe_sets.json")
except IOError:
    pass

# Process run data
The first section of the data analysis involves processing the MIPWrangler output file, mapping haplotypes, and creating summary files and plots showing how the sequencing runs went.

## MIPWrangler output file processing
Libraries are labeled by combining three fields in the sample sheet: sample_name-sample_set-replicate, which makes the Sample ID.

The below operation just filters and renames some columns from the original file.
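As a minimal sketch of the labeling rule described above, the Sample ID can be thought of as the three sample sheet fields joined with hyphens. The function name and the example values are hypothetical; the actual labeling is done inside `mip.process_info_file`.

```python
def make_sample_id(sample_name, sample_set, replicate):
    """Join the three sample sheet fields into a Sample ID (illustrative only)."""
    return "-".join([str(sample_name), str(sample_set), str(replicate)])


# Hypothetical sample sheet row: sample "D10", sample set "PRX-00", replicate 1.
print(make_sample_id("D10", "PRX-00", 1))
```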

In [None]:
# RUN
mip.process_info_file(wdir,
                      settings_file, 
                      info_file,
                      sample_sheet,
                      settings["mipsterFile"],
                      sample_set.strip(),
                      probe_set.strip())

## Filter and map haplotype sequences
Align each haplotype sequence to the reference genome. Remove off target haplotypes. All haplotype mappings will be saved to the disk so off targets can be inspected if needed. 

Some filters can be applied to remove noise and speed up processing:
*  minHaplotypeBarcodes: minimum total UMI cut off across all samples.
*  minHaplotypeSamples: minimum number of samples a haplotype is observed in.
*  minHaplotypeSampleFraction: minimum fraction of samples a haplotype is observed in.  

It is usually better not to filter at this step (the default filtering levels below filter nothing) unless the downstream operations are difficult to compute. However, filters can and should be applied after variant calls are made.
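To make the three filters concrete, here is a small sketch applied to made-up UMI counts. It assumes that a haplotype must pass all three thresholds to be kept; how `mip.map_haplotypes` combines them is not shown in this notebook, so treat the combination rule, the data layout and the function name as assumptions.

```python
# Hypothetical counts: haplotype -> {sample: UMI count}
haplotype_umis = {
    "hapA": {"s1": 40, "s2": 12, "s3": 7},
    "hapB": {"s1": 1},
    "hapC": {"s2": 3, "s3": 2},
}
total_samples = 3


def passes_filters(sample_counts, min_barcodes, min_samples,
                   min_sample_fraction, n_samples):
    """Assumed rule: a haplotype is kept only if it passes all three filters."""
    total_umis = sum(sample_counts.values())
    observed_in = sum(1 for count in sample_counts.values() if count > 0)
    return (total_umis >= min_barcodes
            and observed_in >= min_samples
            and observed_in / n_samples >= min_sample_fraction)


# Default levels from the cell below: nothing is filtered out.
kept_default = [hap for hap, counts in haplotype_umis.items()
                if passes_filters(counts, 1, 1, 0.0001, total_samples)]
# Stricter, made-up levels: the single-UMI haplotype is removed.
kept_strict = [hap for hap, counts in haplotype_umis.items()
               if passes_filters(counts, 5, 2, 0.5, total_samples)]
print(kept_default)
print(kept_strict)
```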

#### Example cell
```python
# filter haplotype sequences based on the number of total supporting UMIs
settings["minHaplotypeBarcodes"] = 1
# filter haplotype sequences based on the number of samples they were observed in
settings["minHaplotypeSamples"] = 1
# filter haplotype sequences based on the fraction of samples they were observed in
settings["minHaplotypeSampleFraction"] = 0.0001
```

In [None]:
# OPTIONAL USER INPUT
# filter haplotype sequences based on the number of total supporting UMIs
settings["minHaplotypeBarcodes"] = 1
# filter haplotype sequences based on the number of samples they were observed in
settings["minHaplotypeSamples"] = 1
# filter haplotype sequences based on the fraction of samples they were observed in
settings["minHaplotypeSampleFraction"] = 0.0001 

In [None]:
#RUN
mip.map_haplotypes(settings)
mip.get_haplotype_counts(settings)

### Preview the mapping results
Plotting the probe coverage by samples is a good way to see overall experiment performance. The chart below shows a heatmap of how many UMIs are present, on a log scale.

Dark columns point to poorly performing probes, whereas dark rows indicate poor samples. Note that this excludes samples with no reads at all. Data is pulled from the file "UMI_counts.csv".

In [None]:
# RUN
# alternate version of the chart above
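graphing_list, rows = [], []
with open("UMI_counts.csv") as umi_counts_file:
    for line_number, line in enumerate(umi_counts_file):
        line = line.strip().split(',')
        if line_number == 0:
            # the first line holds the column (MIP) names
            columns = line[1:]
        if line_number > 2:
            # lines after the first three hold one sample each:
            # sample ID first, then the UMI counts
            rows.append(line[0])
            int_line = [int(float(value)) for value in line[1:]]
            log_line = [math.log(number + 1, 2) for number in int_line]
            graphing_list.append(log_line)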
fig = px.imshow(graphing_list, aspect='auto', labels=dict(x="mips", y="samples",
color='log2 of umi_counts+1'), x=columns, y=rows)
fig.update_xaxes(side="top")
#fig.update_layout(width=2000, height=4000, autosize=False)
fig.update_layout(height=1000)
fig.show()

### Look at summary stats 
There are summary statistics and meta data (if provided) we can use to determine if coverage is enough, whether further sequencing is necessary, and how to proceed if further sequencing will be needed.

In [None]:
# RUN
sample_summary = pd.read_csv(wdir + "sample_summary.csv")
sample_summary.head()

### Plot total UMI count vs probe coverage
A scatter plot of total UMI count vs number of probes covered at a certain UMI count is a good way to see the relationship between total coverage and probe coverage, which is useful in determining how to proceed to the next experiments or analyses.

In [None]:
# RUN
fig = px.scatter(sample_summary, 
        x="UMI Count",
        y="targets_with_>=10_UMIs",
        height=700,
        hover_name="Sample ID",
        title="UMI Count vs. Probe Coverage",
        hover_data=["Read Count"]
        )
fig.show()

## Repooling capture reactions for further sequencing.
### Factors to consider:
1. What do we want to accomplish? In most cases, we would like to get enough coverage for a number of probes for each sample. For example, the test data contains **50 probes** in total. Let's say it is sufficient to have a coverage of **10** or more for each probe for a sample. Once a sample reaches that, we would not want to sequence any more of it. 
```python
target_coverage_count = 50
target_coverage_key='targets_with_>=10_UMIs'
```
Alternatively, we can set a goal of a fraction of total probes to reach a certain coverage rather than an absolute number of probes. For 95% of the maximum number of probes observed (47 in this case): 
```python
target_coverage_fraction = 0.95
target_coverage_key='targets_with_>=10_UMIs'
```
Although we set our goal to 47 probes, it is likely that some samples will never reach that number regardless of how much we sequence, for example if there is a deletion in the region. So it makes sense to set a total coverage threshold after which we don't expect more data. Looking at the plot above, it seems that after 10000 UMI counts we would reach our goal for most samples. 
```python
high_UMI_threshold = 10000
```
Another metric for deciding whether to sequence a sample further is the average read count per UMI. This value indicates how many times, on average, each unique molecular index in the sample has been sequenced, so when the value is high, it is unlikely that we would get more UMIs by sequencing the same library more. In that case it makes more sense to do a fresh MIP capture from these samples if more data is needed.
```python
UMI_coverage_threshold=10
```
Some samples perform very poorly for one reason or another. There are two options for these samples for repooling consideration: 1) Repool as much as we can for the next run, 2) Assuming there is a problem in the capture reaction, set up a new MIP capture reaction for these samples. It makes more sense to use option 1 if this is the first sequencing data using this library. Use option 2 if this library has already been repooled at a higher volume but is still producing poor data.
```python
UMI_count_threshold=100 # samples with a total UMI count below this value are considered low coverage
low_coverage_action='Repool' # what to do for low coverage samples (Repool or Recapture)
```
Sometimes a handful of samples show uneven coverage of loci, i.e. they have very good coverage of a handful of loci but poor coverage of others, which may point to a problem with the sample or the experiment in general. These samples are identified by comparing the subset of samples that reached the goal we set (_completed_ samples) with those that have not. We look at the number of UMIs per probe for _completed_ samples, take the 25th percentile (or another percentile as set), and assume that if a sample has on average this many UMIs per target, it should have reached the set goal. For example, if _completed_ samples, i.e. samples that cover 47 probes at 10 UMIs or more, have on average 10000 total UMIs, they would have ~213 (10000/47) UMIs per target covered. If an _incomplete_ sample has 5000 total UMIs and only 10 targets covered, this value would be 500 for that sample and it would be flagged as **uneven coverage** in the repooling document.
```python
assesment_key='targets_with_>=1_UMIs' # coverage key to compare "complete" and "incomplete" samples
good_coverage_quantile=0.25 # percentile to set the threshold
```
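The uneven coverage check described above can be sketched with made-up numbers as follows. This is only an illustration under stated assumptions: the sample values are hypothetical, the comparison rule (flag an incomplete sample whose UMIs per target meet or exceed the 25th percentile of completed samples) is inferred from the description, and the actual logic lives in `mip.repool`.

```python
import statistics

# Hypothetical (total UMIs, targets covered) pairs for completed samples,
# i.e. samples that cover 47 probes at 10 UMIs or more.
completed = [(9400, 47), (10340, 47), (11280, 47), (12220, 47), (13160, 47)]
# Hypothetical incomplete samples: name -> (total UMIs, targets covered)
incomplete = {"sample_x": (5000, 10), "sample_y": (300, 5)}

umis_per_target = sorted(total / targets for total, targets in completed)
# statistics.quantiles with n=4 returns the three quartile cut points;
# the first one is the 25th percentile (good_coverage_quantile = 0.25).
threshold = statistics.quantiles(umis_per_target, n=4, method="inclusive")[0]

# Assumed rule: an incomplete sample with at least this many UMIs per covered
# target should have reached the goal, so it is flagged as uneven coverage.
uneven = [name for name, (total, targets) in incomplete.items()
          if targets > 0 and total / targets >= threshold]
print(threshold, uneven)
```

Here `sample_x` has 500 UMIs per covered target, well above the threshold, while `sample_y` simply has too little data and would fall under the low coverage rules instead.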

#### Example cell
```python
high_UMI_threshold = 10000
target_coverage_count = None
target_coverage_fraction = 0.95
target_coverage_key = 'targets_with_>=10_UMIs'
UMI_coverage_threshold = 10
UMI_count_threshold = 100
low_coverage_action = 'Recapture'
assesment_key = 'targets_with_>=1_UMIs'
good_coverage_quantile = 0.25
```

In [None]:
# USER INPUT
high_UMI_threshold = 
low_coverage_action = 

In [None]:
# OPTIONAL USER INPUT
target_coverage_count = None
target_coverage_fraction = 0.95
target_coverage_key = 'targets_with_>=10_UMIs'
UMI_coverage_threshold = 10
UMI_count_threshold = 100
assesment_key = 'targets_with_>=1_UMIs'
good_coverage_quantile = 0.25

In [None]:
# RUN
meta = pd.read_csv(wdir + "run_meta.csv")
data_summary = pd.merge(sample_summary, meta)
mip.repool(wdir, 
           data_summary, 
           high_UMI_threshold, 
           target_coverage_count=target_coverage_count, 
           target_coverage_fraction=target_coverage_fraction, 
           target_coverage_key=target_coverage_key,
           UMI_coverage_threshold=UMI_coverage_threshold,
           UMI_count_threshold=UMI_count_threshold, 
           low_coverage_action=low_coverage_action,
           assesment_key=assesment_key,
           good_coverage_quantile=good_coverage_quantile,
           output_file='repool.csv')

### Inspect the repool document
The "Library to completion" field in the repool document gives the volume of each sample that should be pooled for re-sequencing. These values are only rough estimates, and care should be taken to make sure there will be enough material to sequence.

In [None]:
# RUN
pd.read_csv(wdir + "repool.csv").head()

# Variant Calling
The second part of the analysis involves variant calling and variant analysis. 

### Options for freebayes wrapper
```Python
align = True # Default is set to true, fastq files and bam files per sample
# will be created in 'fastq_dir' and 'bam_dir'. 
# it should be set to false if bam files are available.

settings = settings # analysis settings dictionary created above.

bam_files = None # default is to use all bam files within the bam_dir.
# if specific files should be used, then they can be specified in a list.

verbose = True # prints errors and warnings as well as saving to disk.
# if set to false, it will print that there is an error which will
# be saved to disk which should be inspected for details.

targets_file = None # force calls on specific loci even if there is
# no observations satisfying filter criteria. Useful in cases of targeted
# mutations such as drug resistance mutations.
# Usually a file at "/opt/project_resources/targets.tsv" would be present
# if the project requires it. Then targets_file should be set to this path.

# paths for input-output files with default values that can be left unchanged
fastq_dir, bam_dir, vcf_file, settings_file, errors_file, warnings_file

# additional options to pass to freebayes directly:
options = [] # see below for suggestions and possibilities.
```
#### Additional options for freebayes caller. 
Most of the freebayes options are shown below in the **freebayes help** section at the bottom of this document. Some options are integrated into the python wrapper freebayes_call, but others should be added depending on the data type, species etc.

integrated options:
```bash
    -r region
            limit calls to a specific region. 
            This is done internally, splitting the results into contigs and processing each contig
            separately (in parallel if multiple cpus are available).
            Per-contig vcf files are concatenated at the end into a single file.
    -@ targets.vcf
            force calls on positions provided in the vcf file
            a vcf file is generated if a tab separated file containing targets is provided.
    -L --bam-list
            a list of bam files to be used. By default, all bams in bams directory will be used.
            A list of specific bams can be specified to freebayes_call as bam_files option.
```
options to consider adding for parasite sequencing:
```bash
    --pooled-continuous
             This option does not make assumptions about the ploidy when making genotype calls.
             It makes sense for a mixed ploidy sample such as parasite infected blood DNA.
             variants are still called as diploid. 
    --min-alternate-fraction 0.01
             since we assume a pooled continuous sample, it would be better to set a within
             sample allele frequency threshold to remove noise. 
             this is likely not needed when dealing with a diploid sample because a frequency 
             of 0.01 will likely be considered noise for a diploid sample.
    --min-alternate-count 2
             number of reads supporting a variant to consider for genotype calls.
             setting this to at least 2 is good. It will be possible to process
             variants with 1 read in postprocessing steps if a specific variant
             is observed in at least one sample at > 1 reads. So this removes the 
             variant from consideration if no sample has > 1 reads supporting it.
    --min-alternate-total 10
             total read support for a variant across samples.
```
options to consider for human sequences:
```bash
    --min-mapping-quality 0
             default for this setting is 1. I do not think this is helping much in 
             addressing mapping issues. However, reads in copy number variant regions
             may have 0 mapping quality. These would be worth to keep, but they
             should be handled appropriately at postprocessing steps.
    --min-alternate-count 2
    --min-alternate-fraction 0.05 (default)
    --min-alternate-total 10
```

####  Example cell
```python
# provide freebayes options.
# These will be directly passed to freebayes

# example for plasmodium falciparum calls
original_options = ["--pooled-continuous",
           "--min-alternate-fraction", "0.01",
           "--min-alternate-count", "2",
           "--haplotype-length", "3",
           "--min-alternate-total", "10",
           "--use-best-n-alleles", "70",
           "--genotype-qualities", "--gvcf",
           "--gvcf-dont-use-chunk", "true"]

# example for human genome calls with gvcf output
original_options = ["--haplotype-length", "-1",
           "--use-best-n-alleles", "50",
           "--genotype-qualities", "--gvcf",
           "--gvcf-dont-use-chunk", "true"]

# example for human genome calls without gvcf output
original_options = ["--haplotype-length", "-1",
           "--use-best-n-alleles", "50",
           "--genotype-qualities"]
```

In [None]:
# USER INPUT

# provide freebayes options.
# These will be directly passed to freebayes
original_options = 

In [None]:
# OPTIONAL USER INPUT

align=True
verbose=True
# where to save generated fastq files
fastq_dir="/opt/analysis/padded_fastqs"
# where to save generated bam files
bam_dir="/opt/analysis/padded_bams"
# where to save the output vcf file
vcf_file="/opt/analysis/variants.vcf.gz"
# where is the targeted variants file
targets_file="/opt/project_resources/targets.tsv"
# where to save errors and warnings generated by freebayes
errors_file="/opt/analysis/freebayes_errors.txt"
warnings_file="/opt/analysis/freebayes_warnings.txt"

In [None]:
# OPTIONAL USER INPUT

# freebayes caller creates fastq files from the haplotype sequences
# by default 20 bp flanking sequence from the reference genome is added
# to ensure correct deletion calls when they are towards the ends.
# This assumes the 20 bp flank is wild type, however the sequence
# is given a quality of 1, which should help avoiding some issues.
# If this is not desired, set the below parameter to 0
fastq_padding = 20

In [None]:
# RUN
from multiprocessing import Pool
import multiprocessing
import multiprocessing.pool
import copy

freebayes_command_dict, contig_vcf_gz_paths = mip.freebayes_call(
        settings=settings,
        options=copy.deepcopy(original_options),
        align=align,
        verbose=verbose,
        fastq_dir=fastq_dir,
        bam_dir=bam_dir,
        vcf_file=vcf_file,
        targets_file=targets_file,
        bam_files=None,
        errors_file=errors_file,
        warnings_file=warnings_file,
        fastq_padding=fastq_padding)
freebayes_commands=list(freebayes_command_dict.values())
pool = Pool(int(settings["freebayes_threads"]))
# run the freebayes worker program in parallel
# create a results container for the return values from the worker function

results = []
errors = []
# map_async passes the list of return values to callback, but passes a single
# exception object to error_callback, so errors are appended rather than extended
pool.map_async(mip.freebayes_worker, freebayes_commands, callback=results.extend,
                   error_callback=errors.append)
#print(results)
pool.close()
pool.join()
# uncomment these print statements if you get any errors, for more details on which contigs failed to run in freebayes
#print('\n\n\n\n\n')
#print(results, '\n\n\n')
#print(errors, '\n\n\n')

mip.concatenate_headers(settings=settings, wdir='/opt/analysis', freebayes_settings=original_options, vcf_paths=contig_vcf_gz_paths)

# Potential Exit Point
The above cell should create the vcf file **variants.vcf.gz** in the analysis directory (assuming the vcf_file parameter was not changed). You can use this file in any downstream pipeline that utilizes vcf files. The variants are called rather generously, i.e. even when there is a good chance that a called variant is not there, with the assumption that the vcf will be further processed using whatever metric is deemed suitable for the data set.  

In addition, you should now have a **padded_fastqs** subdirectory in your analysis directory containing fastq files for each sample. These fastq files contain 1 read per UMI and they are stitched together and cleaned up using MIPWrangler. You should be able to use these files in any pipeline that accepts fastq inputs (virtually all bioinformatics pipelines).  

Finally, there is a **padded_bams** folder containing bam files for each sample obtained by mapping the *padded fastqs* to the reference genome.  

---
The next steps in this notebook are dealing with postprocessing the vcf file in the ways that we found useful so far.

# Processing Variant Calls
Freebayes produces high quality vcf files with haplotype based variant calls. This is important for getting more accurate calls, especially for complex regions where SNVs may overlap with indels and there may be many possible alleles as opposed to a simple biallelic SNV call.   

haplotype based variant example:  

chr1  1000 AAA,AGC,TGC  

However, it may be desired to "decompose" these complex variants for some applications. For example, if we are interested in knowing the prevalence of a specific drug resistance mutation, it would make sense to combine all variants containing this mutation even though they may be part of different haplotypes, and hence are represented in the vcf in different variants.  

Decomposed variants:  

chr1  1000 A T  
chr1  1001 A G  
chr1  1002 A C  
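The decomposition shown above can be sketched in a few lines. This is an illustration only, assuming equal-length block substitutions without indels; the notebook itself relies on **vt decompose_blocksub** for this step, and the function name below is hypothetical.

```python
def decompose_block_substitution(chrom, pos, ref, alt):
    """Split an equal-length block substitution into single nucleotide changes.

    Illustrative sketch only; indels are not handled.
    """
    if len(ref) != len(alt):
        raise ValueError("only equal-length substitutions are handled here")
    return [(chrom, pos + offset, ref_base, alt_base)
            for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
            if ref_base != alt_base]


# The two alternate haplotypes from the example above share two SNVs.
for alt in ["AGC", "TGC"]:
    print(alt, decompose_block_substitution("chr1", 1000, "AAA", alt))
```

Because both alternate haplotypes contain the changes at positions 1001 and 1002, their counts can be combined for those positions after decomposition.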

The vcf_to_tables function takes the vcf file generated by freebayes and generates allele count and coverage data in table form. It is possible to decompose and aggregate amino acid and/or nucleotide level variants. Three files containing count data are generated: alternate_table.csv, reference_table.csv and coverage_table.csv, holding the alt allele, ref allele and coverage count values for each variant, respectively.

It first separates the multiallelic calls to bi-allelic calls.

#### annotate, default=True
It then annotates variants using snpEff.

#### geneid_to_genename, default=None
Variant annotation provides a gene ID (e.g. PF3D7_0709000) but it does not provide common gene names (e.g. crt). If common names are used in target files, or they are desired in general, a tab separated gene ID to gene name file can be used. **gene_name and gene_id** columns are required. If no file is provided, gene name will be the same as the gene ID.

#### aggregate_aminoacids, default=False
If aminoacid level aggregation is requested, it decomposes multi amino acid missense variants into single components and aggregates the alternate allele and coverage counts per amino acid change. For example, the Asn75Glu change in the crt gene is a known drug resistance mutation in Plasmodium falciparum. There may be 3 separate variants in the vcf file that contain this mutation: Asn75Glu, MetAsn75IleGlu, Asn75Glu_del76-80*. All three have the missense variant Asn75Glu. While the first two are simple changes, the third is a complex change including a 5 amino acid deletion and a stop codon following Asn75Glu. In this case, it makes sense to combine the counts of the first two variants towards Asn75Glu counts, but the third one is debatable because of the complexity; i.e. the drug resistance mutation Asn75Glu is probably not that important in that context because of the stop codon following it. So we decompose and aggregate the simple changes but leave complex changes as they are. If aminoacid aggregation is carried out, file names will contain the AA tag.
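The aggregation rule described above can be sketched with made-up counts. The data layout, the counts and the component name `Met74Ile` are assumptions for illustration; the actual decomposition is carried out by `vcf_to_tables` on annotated variants.

```python
from collections import Counter

# Hypothetical alternate allele counts for one sample. Each variant lists its
# single amino acid components and whether it is a complex change.
variants = [
    {"name": "Asn75Glu", "components": ["Asn75Glu"],
     "complex": False, "alt_count": 10},
    {"name": "MetAsn75IleGlu", "components": ["Met74Ile", "Asn75Glu"],
     "complex": False, "alt_count": 4},
    {"name": "Asn75Glu_del76-80*", "components": ["Asn75Glu"],
     "complex": True, "alt_count": 3},
]

aggregated = Counter()
for variant in variants:
    if variant["complex"]:
        # complex changes are left as they are
        aggregated[variant["name"]] += variant["alt_count"]
    else:
        # simple changes contribute their count to every component
        for component in variant["components"]:
            aggregated[component] += variant["alt_count"]

print(dict(aggregated))
```

With these numbers, Asn75Glu collects the counts of the two simple variants, while the complex variant keeps its own row.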

#### target_aa_annotation, default=None
It is also possible to annotate the targeted variants (such as Asn75Glu above) in the generated tables as 'Targeted' in case some analysis should be carried out on targeted variants only. A tab separated file containing the annotation details is required for this operation. **gene_name, aminoacid_change and mutation_name** are required fields. If a variant's gene_name and aminoacid_change match a row in the target file, that variant will be marked as targeted and will have the corresponding mutation name. Note that if common gene name conversion (see above) is not used, the gene_name column in this file must match the actual gene ID and not the common name. It may be more convenient to keep the gene IDs in the target file as well and use that file for ID to name mapping. **aggregate_aminoacids must be set to True** for this option to be used.

#### aggregate_nucleotides, default=False
A similar aggregation can be done at nucleotide level. If specified, biallelic variants will be decomposed using the tool **vt decompose_blocksub**. By default it decomposes block substitutions that do not include indels. However, it is also possible to decompose complex variants including indels by providing -a option. For possible decompose options see vt help:
```bash
vt decompose_blocksub options : 
  -p  Output phased genotypes and PS tags for decomposed variants [false]
  -m  keep MNVs (multi-nucleotide variants) [false]
  -a  enable aggressive/alignment mode [false]
  -d  MNVs max distance (when -m option is used) [2]
  -o  output VCF file [-]
  -I  file containing list of intervals []
  -i  intervals []
  -?  displays help
```
If nucleotide level aggregation is done, the file names will include AN tag.

#### target_nt_annotation, default=None
Annotation of targeted nucleotides requires a file similar to the targeted amino acid annotation. However, the required fields for this annotation are: CHROM, POS, REF, ALT and mutation_name. **aggregate_nucleotides must be set to True** for this option to be used.

#### aggregate_none, default=False
It is also possible to generate count tables without doing any aggregation. This will generate the 3 count files, and all of the variant information included in the vcf file will be a separate column in the table's index. For annotated initial vcf files, or if annotate option is selected, each subfield in the INFO/ANN field will have its own column.

#### min_site_qual, default=-1
Filter variant sites for a minimum QUAL value assigned by the variant caller. This value is described in freebayes manual as:
```bash
Of primary interest to most users is the QUAL field, which estimates the probability that there is a polymorphism at the loci described by the record. In freebayes, this value can be understood as 1 - P(locus is homozygous given the data). It is recommended that users use this value to filter their results, rather than accepting anything output by freebayes as ground truth.

By default, records are output even if they have very low probability of variation, in expectation that the VCF will be filtered using tools such as vcffilter in vcflib, which is also included in the repository under vcflib/. For instance,

freebayes -f ref.fa aln.bam | vcffilter -f "QUAL > 20" >results.vcf

removes any sites with estimated probability of not being polymorphic less than phred 20 (aka 0.01), or probability of polymorphism > 0.99.

In simulation, the receiver-operator characteristic (ROC) tends to have a very sharp inflection between Q1 and Q30, depending on input data characteristics, and a filter setting in this range should provide decent performance. Users are encouraged to examine their output and both variants which are retained and those they filter out. Most problems tend to occur in low-depth areas, and so users may wish to remove these as well, which can also be done by filtering on the DP flag.
```
Therefore, a **minimum of 1** should be used as a min_site_qual to remove low quality sites. If a site is annotated as **targeted**, the site will be kept regardless of its qual value, however, the alternate observation counts for the site may be reset to zero depending on the min_target_site_qual value described below.
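As a rough sketch of what these QUAL values mean, a phred-scaled value Q corresponds to a probability of 10^(-Q/10) that the site is not polymorphic. The helper names below are hypothetical, and the use of `>=` for the site filter is an assumption; the filtering itself is done by `vcf_to_tables`.

```python
def phred_to_probability(qual):
    """Convert a phred-scaled QUAL value to P(site is not polymorphic)."""
    return 10 ** (-qual / 10)


def keep_site(qual, min_site_qual, targeted=False):
    """Assumed rule: targeted sites are always kept, others need enough QUAL."""
    return targeted or qual >= min_site_qual


for qual in (1, 20, 30):
    print(qual, round(phred_to_probability(qual), 4))

print(keep_site(0.5, min_site_qual=1))
print(keep_site(0.5, min_site_qual=1, targeted=True))
```

A QUAL of 20 corresponds to a 0.01 probability of not being polymorphic, which matches the figure quoted from the freebayes manual, whereas a QUAL of 1 still leaves roughly a 0.79 probability that the site is not polymorphic.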

#### min_target_site_qual, default=-1
If a variant site is targeted but the site qual is lower than this,
reset the alternate observation counts to 0. It may be best to leave
this at the default value since there is usually additional evidence
that a targeted variant exists in a sample compared to a de novo
variant, i.e. those variants that are targeted had been observed in other samples/studies.

#### Example cell
```python
# provide a file that maps gene names to gene IDs
# this is necessary when targeted variant annotations use
# gene names instead of gene IDs
geneid_to_genename = "/opt/project_resources/geneid_to_genename.tsv"
# annotate targeted amino acid changes in the tables.
target_aa_annotation = "/opt/project_resources/targets.tsv"
# decompose multi amino acid changes and combine counts of
# resulting single amino acid changes
aggregate_aminoacids = True
# decompose MNVs and combine counts for resulting SNVs
aggregate_nucleotides = True
# annotate targeted nucleotide changes in the tables.
target_nt_annotation = None
```

In [None]:
# USER INPUT

# provide a file that maps gene names to gene IDs
# this is necessary when targeted variant annotations use
# gene names instead of gene IDs. Otherwise provide None
geneid_to_genename = 
# annotate targeted amino acid changes in the tables
# using the file, or otherwise provide None
target_aa_annotation = 
# decompose multi amino acid changes and combine counts of
# resulting single amino acid changes
aggregate_aminoacids = 
# decompose MNVs and combine counts for resulting SNVs
aggregate_nucleotides = 
# annotate targeted nucleotide changes in the tables.
target_nt_annotation = 

In [None]:
# OPTIONAL USER INPUT

# analysis settings dictionary
settings = settings
# provide the path to the settings file
# if settings dictionary has not been loaded
settings_file = None
# use snpEff to annotate the variants
annotate = True
# additional vt options for decomposing nucleotides.
# Supply ["-a"] to include indels and complex variants
# in decomposition, or other options shown above if desired.
decompose_options = []
# was the initial vcf file annotated by snpEff?
annotated_vcf = False
# create tables for variants as they are in the vcf file
# without decomposing complex variants or indels.
# Multiallelic variants will be split into biallelic.
aggregate_none = True
# filter variant sites for quality
min_site_qual = 1
# reset targeted variant counts to zero
# when the site quality is below this value
min_target_site_qual = -1
# reset genotypes in the vcf file to NA
# and depth to 0 if FORMAT/GQ value for a variant/sample
# is below this value:
min_genotype_qual = -1
# reset alt allele count in the vcf file to 0
# if FORMAT/QA value divided by FORMAT/AO for a variant/sample
# is below this value:
min_mean_alt_qual = -1 # average quality cut off for variants
# Similar filters are also available for:
# min_mean_ref_qual : resetting low qual reference allele counts
# min_alt_qual : similar to min_mean_alt_qual, but for total qual score
# min_ref_qual : similar to min_alt_qual but for reference alleles

# prefix for output files, if desired.
# this is useful when different quality thresholds etc will be used
# to avoid overwriting the files. For example, if min_genotype_qual = 1
# and min_mean_alt_qual = 15 is used, a suitable prefix could be
# "gq1.mqa15."
output_prefix = ""

In [None]:
# RUN

# input vcf file
vcf_file = vcf_file.split("/")[-1]
mip.vcf_to_tables_fb(
    vcf_file,
    settings=settings,
    settings_file=settings_file,
    annotate=annotate,
    geneid_to_genename=geneid_to_genename,
    target_aa_annotation=target_aa_annotation,
    aggregate_aminoacids=aggregate_aminoacids,
    target_nt_annotation=target_nt_annotation,
    aggregate_nucleotides=aggregate_nucleotides,
    decompose_options=decompose_options,
    annotated_vcf=annotated_vcf,
    aggregate_none=aggregate_none,
    min_site_qual=min_site_qual,
    min_target_site_qual=min_target_site_qual,
    min_genotype_qual=min_genotype_qual,
    min_mean_alt_qual=min_mean_alt_qual,
    output_prefix=output_prefix)
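
The min_mean_alt_qual setting used in this call resets the alternate allele count when FORMAT/QA divided by FORMAT/AO falls below the threshold. The sketch below illustrates that rule on plain NumPy arrays; the function `apply_mean_alt_qual_filter` and the example values are hypothetical, and the actual implementation inside `mip_functions` may differ.

```python
import numpy as np


def apply_mean_alt_qual_filter(ao, qa, min_mean_alt_qual):
    """Reset alternate counts whose mean base quality (QA / AO)
    is below min_mean_alt_qual. Hypothetical illustration only."""
    ao = np.asarray(ao, dtype=float)
    qa = np.asarray(qa, dtype=float)
    # mean quality per alternate observation; 0 where there are none
    mean_qual = np.divide(qa, ao, out=np.zeros_like(qa), where=ao > 0)
    filtered = ao.copy()
    filtered[mean_qual < min_mean_alt_qual] = 0
    return filtered


ao = [10, 4, 0]     # alternate observation counts for three samples
qa = [350, 40, 0]   # summed alternate base qualities for the same samples
print(apply_mean_alt_qual_filter(ao, qa, min_mean_alt_qual=15))
print(apply_mean_alt_qual_filter(ao, qa, min_mean_alt_qual=-1))
```

In this example the second sample has a mean alternate quality of 10, so a threshold of 15 resets its count to 0, while the default of -1 leaves every count unchanged.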

## Tables created
alternate_XX_table.csv files will contain the ALT allele count for that table type while coverage_XX_table.csv will contain the depth of coverage at each locus.
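
Because each alternate table has a matching coverage table, a within-sample allele frequency can be obtained by dividing one by the other. The following is a minimal sketch using small in-memory tables; the layout (samples as rows, variants as columns) and all names and values are assumptions for illustration and may not match the header structure of the actual csv files.

```python
import pandas as pd

# hypothetical tables: rows are samples, columns are variants
samples = ["sample1", "sample2"]
variants = ["varA", "varB"]
alternate = pd.DataFrame([[5, 0], [20, 10]], index=samples, columns=variants)
coverage = pd.DataFrame([[50, 0], [20, 40]], index=samples, columns=variants)

# mask zero coverage so that uncovered loci become NaN
# instead of raising a division warning
allele_freq = alternate / coverage.where(coverage > 0)
print(allele_freq)
```

Loci without coverage are left as missing values here, which keeps them distinguishable from loci that were covered but showed no alternate allele.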
### Nucleotide changes (aggregated)
For some projects we may be interested in specific single nucleotide changes. For these, it makes sense to decompose multi-nucleotide changes and combine the counts of identical single nucleotide changes. Three tables will be generated, for the count, coverage, and genotype data of aggregated nucleotide changes:  

**alternate_AN_table.csv** file in the analysis directory is created if aggregate_nucleotides option was selected when creating data tables. This table has the UMI counts for each alternate nucleotide.  

**coverage_AN_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_AN_table.csv** file contains the aggregated value of the genotypes called by freebayes: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1. When calls from multiple variants are aggregated: if all are 0/0, the result is 0; if 0/0 and non-0/0 calls are mixed, the result is 1; if all are 1/1, the result is 2.
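
The idea of decomposing multi-nucleotide variants (MNVs) and combining counts can be sketched in plain Python. This is an illustration of the concept only: the pipeline performs decomposition with vt, and the helper `decompose_mnv` and the example calls below are hypothetical.

```python
from collections import Counter


def decompose_mnv(pos, ref, alt):
    """Split an equal-length REF/ALT pair into single nucleotide changes."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length substitutions")
    return [(pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# hypothetical calls for one sample: (position, REF, ALT, UMI count)
calls = [
    (100, "AC", "GT", 7),  # MNV covering positions 100 and 101
    (100, "A", "G", 3),    # SNV at position 100
]

snv_counts = Counter()
for pos, ref, alt, count in calls:
    for snv in decompose_mnv(pos, ref, alt):
        snv_counts[snv] += count

print(dict(snv_counts))
```

Here the A>G change at position 100 is supported by both calls, so its aggregated count is the sum of the two, while the C>T change at position 101 keeps the count of the MNV alone.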

### Amino acid changes (aggregated)
For some projects we may be interested in amino acid changes, particularly specific, targeted amino acid changes such as drug resistance mutations in *Plasmodium falciparum*, which is the data set provided for testing the pipeline. For these types of projects, we may want to analyze the variants from the amino acid perspective rather than as nucleotide changes, which are the standard output of variant callers.  

**alternate_AA_table.csv** file in the analysis directory is created if aggregate_aminoacids option was selected when creating data tables. This table has the UMI counts for each alternate amino acid.  

**coverage_AA_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_AA_table.csv** file contains the aggregated value of the genotypes called by freebayes: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1. When calls from multiple variants are aggregated: if all are 0/0, the result is 0; if 0/0 and non-0/0 calls are mixed, the result is 1; if all are 1/1, the result is 2.
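
The genotype aggregation rule can be written as a small function. This is a sketch of the rule as stated above, not the pipeline's own code; the function `aggregate_genotypes` is hypothetical, and two behaviours are assumptions because the rule does not cover them: missing calls (-1) are ignored unless all calls are missing, and any mixture not otherwise listed (for example 0/1 together with 1/1) is reported as 1.

```python
def aggregate_genotypes(codes):
    """Combine genotype codes (0, 1, 2, or -1 for missing)
    from several variants into a single code."""
    # assumption: missing calls do not contribute to the aggregate
    called = [c for c in codes if c != -1]
    if not called:
        return -1
    if all(c == 0 for c in called):
        return 0
    if all(c == 2 for c in called):
        return 2
    # assumption: every remaining combination is treated as mixed
    return 1


print(aggregate_genotypes([0, 0]))
print(aggregate_genotypes([0, 2]))
print(aggregate_genotypes([2, 2]))
print(aggregate_genotypes([-1, -1]))
```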

### Nucleotide changes (not aggregated)
For some projects we may be interested in keeping composite variants as they are called by the pipeline. These will include MNVs, complex variants including indels, etc. Three tables will be generated, for the count, coverage, and genotype data of the original nucleotide changes:  

**alternate_table.csv** file in the analysis directory is created if aggregate_none option was selected when creating data tables. This table has the UMI counts for each alternate nucleotide.  

**coverage_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_table.csv** file contains the genotypes called by freebayes, encoded as: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1.