# Purpose of this notebook:
This notebook converts microhaplotypes to VCF files and variant tables (counts of how many UMIs were associated with
each mutation in each sample). This notebook uses a tool called freebayes to convert microhaplotypes into individual
variants, and snpEff to assign amino acid numbers and names to mutations that fall within protein coding genes.

This notebook is meant to be run after completing the wrangler_QC_repool_stats notebook.

**WARNING**: If you run this notebook without completing the wrangler_QC_repool_stats notebook you will get an error.

# How to use code cells in this notebook
If a code cell starts with # RUN, run the cell with CTRL+Enter or the Run button above. If a code cell starts with # USER INPUT, user input is needed before running the cell. If a code cell starts with # OPTIONAL USER INPUT, the cell must be run, but default values are pre-filled and the user can change them if needed.

**Important note on entering input:** When entering user input, please make sure you follow the formatting provided in the comments and default values (ignoring the # that denotes the comment). For example, if there are quotes, brackets, commas, or spaces in the examples, make sure your values follow these conventions also.

In [12]:
# RUN
import sys
sys.path.append("/opt/src")
import mip_functions as mip
import copy
import os
import warnings
warnings.filterwarnings('ignore')
wdir = "/opt/user/stats_and_variant_calling/"
data_dir = "/opt/user/wrangled_data/"

In [2]:
# USER INPUT
# set the number of processors to use to parallelize non-freebayes portions of this notebook. You can set this number
# high and it will help the job complete faster with relatively low likelihood of crashing your machine.
processorNumber = 20

# set the number of processors to use to parallelize freebayes - freebayes is very memory intensive. Set this number
# high enough to take advantage of the number of processors in your machine, but low enough to avoid running out of
# the RAM available on your machine and crashing the run. The "correct" number will depend on your machine stats
# and the size of your dataset, but I like to set this number as low as I can afford to go without the run taking
# forever - out of memory crashes are very frequent when this number is set too high with large datasets.
freebayes_threads = 8

In [4]:
# RUN

# extract the settings from the previous jupyter notebook
settings_file='settings.txt'
settings = mip.get_analysis_settings(wdir + settings_file)
settings['processorNumber']=processorNumber
settings['freebayes_threads']=freebayes_threads
settings_path = os.path.join(wdir, settings_file)
mip.write_analysis_settings(settings, settings_path)

# Variant Calling

### Options for freebayes wrapper
```Python
align = True # Default is set to true, fastq files and bam files per sample
# will be created in 'fastq_dir' and 'bam_dir'. 
# it should be set to false if bam files are available.

settings = settings # analysis settings dictionary created above.

bam_files = None # default is to use all bam files within the bam_dir.
# if specific files should be used, then they can be specified in a list.

verbose = True # prints errors and warnings as well as saving to disk.
# if set to false, it will print that there is an error which will
# be saved to disk which should be inspected for details.

targets_file = None # force calls on specific loci even if there are
# no observations satisfying filter criteria. Useful in cases of targeted
# mutations such as drug resistance mutations.
# Usually a file at "/opt/project_resources/targets.tsv" would be present
# if the project requires it. Then targets_file should be set to this path.

# paths for input-output files with default values that can be left unchanged
fastq_dir, bam_dir, vcf_file, settings_file, errors_file, warnings_file

# additional options to pass to freebayes directly:
options = [] # see below for suggestions and possibilities.
```
#### Additional options for freebayes caller. 
Most of the freebayes options are shown below in the **freebayes help** section at the bottom of this document. Some options are integrated into the python wrapper freebayes_call, but others should be added depending on the data type, species etc.

integrated options:
```bash
    -r region
            limit calls to a specific region. 
            This is done internally, splitting the results into contigs and processing each contig
            separately (in parallel if multiple cpus are available).
            Per-contig vcf files are concatenated at the end into a single file.
    -@ targets.vcf
            force calls on positions provided in the vcf file
            a vcf file is generated if a tab separated file containing targets is provided.
    -L --bam-list
            a list of bam files to be used. By default, all bams in bams directory will be used.
            A list of specific bams can be specified to freebayes_call as bam_files option.
```
options to consider adding for parasite sequencing:
```bash
    --pooled-continuous
             This option does not make assumptions about the ploidy when making genotype calls.
             It makes sense for a mixed ploidy sample such as parasite infected blood DNA.
             variants are still called as diploid. 
    --min-alternate-fraction 0.01
             since we assume a pooled continuous sample, it is better to set a
             within-sample allele frequency threshold to remove noise.
             this is likely not needed when dealing with a diploid sample because a frequency 
             of 0.01 will likely be considered noise for a diploid sample.
    --min-alternate-count 2
             number of reads supporting a variant to consider for genotype calls.
             setting this to at least 2 is good. Variants with 1 read can still
             be processed in postprocessing steps if the variant is observed
             with > 1 reads in at least one sample. So this removes the
             variant from consideration if no sample has > 1 reads supporting it.
    --min-alternate-total 10
             total read support for a variant across samples.
```
options to consider for human sequences:
```bash
    --min-mapping-quality 0
             default for this setting is 1. I do not think this is helping much in 
             addressing mapping issues. However, reads in copy number variant regions
             may have 0 mapping quality. These are worth keeping, but they
             should be handled appropriately at postprocessing steps.
    --min-alternate-count 2
    --min-alternate-fraction 0.05 (default)
    --min-alternate-total 10
```

####  Example cell
```python
# provide freebayes options.
# These will be directly passed to freebayes

# example for plasmodium falciparum calls
original_options = ["--pooled-continuous",
           "--min-alternate-fraction", "0.01",
           "--min-alternate-count", "2",
           "--haplotype-length", "3",
           "--min-alternate-total", "10",
           "--use-best-n-alleles", "70",
           "--genotype-qualities", "--gvcf",
           "--gvcf-dont-use-chunk", "true"]

# example for human genome calls with gvcf output
original_options = ["--haplotype-length", "-1",
           "--use-best-n-alleles", "50",
           "--genotype-qualities", "--gvcf",
           "--gvcf-dont-use-chunk", "true"]

# example for human genome calls without gvcf output
original_options = ["--haplotype-length", "-1",
           "--use-best-n-alleles", "50",
           "--genotype-qualities"]
```

In [5]:
# USER INPUT

# provide freebayes options.
# These will be directly passed to freebayes
original_options = ["--pooled-continuous",
           "--min-alternate-fraction", "0.01",
           "--min-alternate-count", "2",
           "--haplotype-length", "3",
           "--min-alternate-total", "10",
           "--use-best-n-alleles", "70",
           "--genotype-qualities", "--gvcf",
           "--gvcf-dont-use-chunk", "true"]

In [6]:
# OPTIONAL USER INPUT

align=True
verbose=True
# where to save generated fastq files
fastq_dir="/opt/user/stats_and_variant_calling/padded_fastqs"
# where to save generated bam files
bam_dir="/opt/user/stats_and_variant_calling/padded_bams"
# where to save the output vcf file
vcf_file="/opt/user/stats_and_variant_calling/variants.vcf.gz"
# where is the targeted variants file
targets_file="/opt/project_resources/targets.tsv"
# where to save errors and warnings generated by freebayes
errors_file="/opt/user/stats_and_variant_calling/freebayes_errors.txt"
warnings_file="/opt/user/stats_and_variant_calling/freebayes_warnings.txt"

In [7]:
# OPTIONAL USER INPUT

# freebayes caller creates fastq files from the haplotype sequences
# by default 20 bp flanking sequence from the reference genome is added
# to ensure correct deletion calls when they are towards the ends.
# This assumes the 20 bp flank is wild type, however the sequence
# is given a quality of 1, which should help avoiding some issues.
# If this is not desired, set the below parameter to 0
fastq_padding = 20
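
The padding idea described in the comments above can be sketched as follows. This is only an illustration under stated assumptions, not the code used by `mip_functions`: the function name `pad_haplotype`, the haplotype quality of 30, and the Phred+33 encoding are all hypothetical choices made for the example. Only the flank quality of 1 comes from the comments above.

```python
def pad_haplotype(haplotype, left_flank, right_flank,
                  haplotype_qual=30, flank_qual=1):
    """Return (sequence, quality_string) for a padded read.

    Assumptions: qualities are Phred+33 encoded and the haplotype bases
    share one quality value (30 is a hypothetical placeholder).
    flank_qual=1 mirrors the low quality given to the reference flanks.
    """
    sequence = left_flank + haplotype + right_flank
    flank_char = chr(flank_qual + 33)
    hap_char = chr(haplotype_qual + 33)
    quality = (flank_char * len(left_flank)
               + hap_char * len(haplotype)
               + flank_char * len(right_flank))
    return sequence, quality


# toy example with 2 bp flanks instead of the default 20 bp
seq, qual = pad_haplotype("ACGT", "TT", "GG")
print(seq)
print(qual)
```

Because the flank bases carry a very low quality, a caller that weighs observations by base quality should give them little influence, which is the rationale stated in the comments.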

In [8]:
# RUN
from multiprocessing import Pool
import multiprocessing
import multiprocessing.pool
import copy
import gzip

freebayes_command_dict, contig_vcf_gz_paths = mip.freebayes_call(
        settings=settings,
        options=copy.deepcopy(original_options),
        align=align,
        verbose=verbose,
        fastq_dir=fastq_dir,
        bam_dir=bam_dir,
        vcf_file=vcf_file,
        targets_file=targets_file,
        bam_files=None,
        errors_file=errors_file,
        warnings_file=warnings_file,
        fastq_padding=fastq_padding)
freebayes_commands=list(freebayes_command_dict.values())
pool = Pool(int(settings["freebayes_threads"]))
# run the freebayes worker program in parallel
# create a results container for the return values from the worker function

results = []
errors = []
# error_callback receives a single exception object, so it is appended (not extended)
pool.map_async(mip.freebayes_worker, freebayes_commands, callback=results.extend,
                   error_callback=errors.append)
#print(results)
pool.close()
pool.join()
#uncomment these print statements if you get any errors, for more details on which contigs failed to run in freebayes
#print('\n\n\n\n\n')
#print(results, '\n\n\n')
#print(errors, '\n\n\n')

mip.concatenate_headers(settings=settings, wdir='/opt/user/stats_and_variant_calling', freebayes_settings=original_options, vcf_paths=contig_vcf_gz_paths)

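# drop lines containing "<*>" (gVCF reference-block records), keeping actual mutations;
# the with statement closes both files, including the input file
with gzip.open(vcf_file, 'rt') as file_in, \
        gzip.open(vcf_file.replace('.vcf.gz', '_mutations_only.vcf.gz'), 'wt') as file_out:
    for line in file_in:
        if "<*>" not in line:
            file_out.write(line)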

Checking the headers and starting positions of 7 files
Concatenating /opt/user/stats_and_variant_calling/contig_vcfs/chr13_0.vcf.gz	0.142431 seconds
Concatenating /opt/user/stats_and_variant_calling/contig_vcfs/chr14_0.vcf.gz	0.069318 seconds
Concatenating /opt/user/stats_and_variant_calling/contig_vcfs/chr14_1.vcf.gz	0.060304 seconds
Concatenating /opt/user/stats_and_variant_calling/contig_vcfs/chr4_0.vcf.gz	0.069989 seconds
Concatenating /opt/user/stats_and_variant_calling/contig_vcfs/chr5_0.vcf.gz	0.155633 seconds
Concatenating /opt/user/stats_and_variant_calling/contig_vcfs/chr7_0.vcf.gz	0.127667 seconds
Concatenating /opt/user/stats_and_variant_calling/contig_vcfs/chr8_0.vcf.gz	0.089557 seconds


did a reheader


# Potential Exit Point
The above cell should create the vcf file **variants.vcf.gz** in the analysis directory (assuming the vcf_file parameter was not changed). You can use this file in any downstream pipeline that utilizes vcf files. The variants are called rather generously, i.e. even when there is a good chance that a called variant is not there, with the assumption that the vcf will be further processed using whatever metric is deemed suitable for the data set.  

In addition, you should now have a **padded_fastqs** subdirectory in your analysis directory containing fastq files for each sample. These fastq files contain 1 read per UMI and they are stitched together and cleaned up using MIPWrangler. You should be able to use these files in any pipeline that accepts fastq inputs (virtually all bioinformatics pipelines).  

Finally, there is a **padded_bams** folder containing bam files for each sample obtained by mapping the *padded fastqs* to the reference genome.  

---
The next steps in this notebook postprocess the vcf file in ways that we have found useful so far.

# Processing Variant Calls
Freebayes produces high quality vcf files with haplotype based variant calls. This is important for getting more accurate calls, especially for complex regions where SNVs may overlap with indels and there may be many possible alleles as opposed to a simple biallelic SNV call.   

haplotype based variant example:  

chr1  1000 AAA,AGC,TGC  

However, it may be desired to "decompose" these complex variants for some applications. For example, if we are interested in knowing the prevalence of a specific drug resistance mutation, it would make sense to combine all variants containing this mutation even though they may be part of different haplotypes, and hence are represented in the vcf in different variants.  

Decomposed variants:  

chr1  1000 A T  
chr1  1001 A G  
chr1  1002 A C  
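
The decomposition and aggregation idea can be sketched in a few lines. This is a simplified illustration, not the implementation used by the pipeline (which relies on vt, described further down): it only handles equal-length block substitutions, and the UMI counts in `haplotype_counts` are made-up numbers for the example.

```python
from collections import Counter


def decompose_block_substitution(chrom, pos, ref, alt):
    """Split an equal-length block substitution into single-nucleotide changes."""
    if len(ref) != len(alt):
        raise ValueError("indels are not handled in this sketch")
    return [(chrom, pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# hypothetical UMI counts for the two alternate haplotypes at chr1:1000 (ref AAA)
haplotype_counts = {"AGC": 5, "TGC": 3}

snv_counts = Counter()
for alt, count in haplotype_counts.items():
    for snv in decompose_block_substitution("chr1", 1000, "AAA", alt):
        snv_counts[snv] += count

for snv, count in sorted(snv_counts.items()):
    print(snv, count)
```

Both haplotypes contribute to the A>G change at position 1001 and the A>C change at position 1002, so their counts are summed, while only TGC contributes to the A>T change at position 1000.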

The vcf_to_tables_fb function takes the vcf file generated by freebayes and generates allele count and coverage data in table form. It is possible to decompose and aggregate amino acid and/or nucleotide level variants. 3 files containing count data are generated: alternate_table.csv, reference_table.csv, coverage_table.csv, for alt allele, ref allele and coverage count values for each variant, respectively.

It first separates the multiallelic calls to bi-allelic calls.

#### annotate, default=True
It then annotates variants using snpEff.

#### geneid_to_genename, default=None
Variant annotation provides a gene ID (e.g. PF3D7_0709000) but it does not provide common gene names (e.g. crt). If common names are used in target files, or they are desired in general, a tab separated gene ID to gene name file can be used. **gene_name and gene_id** columns are required. If no file is provided, gene name will be the same as the gene ID.
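
A minimal sketch of how such a mapping file could be read and applied is shown below. It assumes only what is stated above (a tab separated file with **gene_name** and **gene_id** columns, and a fallback to the gene ID); the helper name `load_gene_names` and the in-memory example file are hypothetical and this is not the pipeline's own code.

```python
import csv
import io


def load_gene_names(handle):
    """Read a tab separated file with gene_name and gene_id columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: row["gene_name"] for row in reader}


# in-memory stand-in for a geneid_to_genename file
example_file = io.StringIO("gene_name\tgene_id\ncrt\tPF3D7_0709000\n")
gene_names = load_gene_names(example_file)

# fall back to the gene ID when no common name is available
for gene_id in ("PF3D7_0709000", "UNMAPPED_ID"):
    print(gene_id, "->", gene_names.get(gene_id, gene_id))
```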

#### aggregate_aminoacids, default=False
If amino acid level aggregation is requested, it decomposes multi amino acid missense variants into single components and aggregates the alternate allele and coverage counts per amino acid change. For example, the Asn75Glu change in the crt gene is a known drug resistance mutation in Plasmodium falciparum. There may be 3 separate variants in the vcf file that contain this mutation: Asn75Glu, MetAsn75IleGlu, Asn75Glu_del76-80*. All three have the missense variant Asn75Glu. While the first two are simple changes, the third is a complex change including a 5 amino acid deletion and a stop codon following Asn75Glu. In this case, it makes sense to combine the counts of the first two variants towards Asn75Glu counts, but the third one is debatable because of its complexity; i.e. the drug resistance mutation Asn75Glu is probably not that important in that context because of the stop codon following it. So we decompose and aggregate the simple changes but leave complex changes as they are. If amino acid aggregation is carried out, file names will contain the AA tag.

#### target_aa_annotation, default=None
It is also possible to annotate the targeted variants (such as Asn75Glu above) in the generated tables as 'Targeted' in case some analysis should be carried out on targeted variants only. A tab separated file containing the annotation details is required for this operation. **gene_name, aminoacid_change and mutation_name** are required fields. If a variant's gene_name and aminoacid_change match a row in the target file, that variant will be marked as targeted and will have the corresponding mutation name. Note that if common gene name conversion (see above) is not used, the gene_name column in this file must match the actual gene ID and not the common name. It may be more convenient to keep the gene IDs in the target file as well and use that file for ID to name mapping. **aggregate_aminoacids must be set to True** for this option to be used.
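
The matching rule can be illustrated with a short sketch. It assumes only the three required columns named above; the helper names and the mutation name `crt-Asn75Glu` are hypothetical, and the actual table layout produced by the pipeline may differ.

```python
import csv
import io


def load_targets(handle):
    """Index a tab separated targets file by (gene_name, aminoacid_change)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row["gene_name"], row["aminoacid_change"]): row["mutation_name"]
            for row in reader}


def annotate_variant(gene_name, aminoacid_change, targets):
    """Mark a variant as targeted when both fields match a target row."""
    mutation_name = targets.get((gene_name, aminoacid_change))
    return {"targeted": mutation_name is not None,
            "mutation_name": mutation_name}


# in-memory stand-in for a targets file; the mutation name is made up
example_file = io.StringIO(
    "gene_name\taminoacid_change\tmutation_name\n"
    "crt\tAsn75Glu\tcrt-Asn75Glu\n"
)
targets = load_targets(example_file)

print(annotate_variant("crt", "Asn75Glu", targets))
print(annotate_variant("crt", "Met74Ile", targets))
```

Note that the lookup uses the gene name exactly as it appears in the variant annotation, which is why the gene_name column must hold gene IDs when no ID to name conversion is applied.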

#### aggregate_nucleotides, default=False
A similar aggregation can be done at nucleotide level. If specified, biallelic variants will be decomposed using the tool **vt decompose_blocksub**. By default it decomposes block substitutions that do not include indels. However, it is also possible to decompose complex variants including indels by providing -a option. For possible decompose options see vt help:
```bash
vt decompose_blocksub options : 
  -p  Output phased genotypes and PS tags for decomposed variants [false]
  -m  keep MNVs (multi-nucleotide variants) [false]
  -a  enable aggressive/alignment mode [false]
  -d  MNVs max distance (when -m option is used) [2]
  -o  output VCF file [-]
  -I  file containing list of intervals []
  -i  intervals []
  -?  displays help
```
If nucleotide level aggregation is done, the file names will include AN tag.

#### target_nt_annotation, default=None
Annotation of targeted nucleotides requires a file similar to the targeted amino acid annotation. However, the required fields for this annotation are: CHROM, POS, REF, ALT and mutation_name. **aggregate_nucleotides must be set to True** for this option to be used.

#### aggregate_none, default=False
It is also possible to generate count tables without doing any aggregation. This will generate the 3 count files, and all of the variant information included in the vcf file will be a separate column in the table's index. For annotated initial vcf files, or if annotate option is selected, each subfield in the INFO/ANN field will have its own column.

#### min_site_qual, default=-1
Filter variant sites for a minimum QUAL value assigned by the variant caller. This value is described in freebayes manual as:
```bash
Of primary interest to most users is the QUAL field, which estimates the probability that there is a polymorphism at the loci described by the record. In freebayes, this value can be understood as 1 - P(locus is homozygous given the data). It is recommended that users use this value to filter their results, rather than accepting anything output by freebayes as ground truth.

By default, records are output even if they have very low probability of variation, in expectation that the VCF will be filtered using tools such as vcffilter in vcflib, which is also included in the repository under vcflib/. For instance,

freebayes -f ref.fa aln.bam | vcffilter -f "QUAL > 20" >results.vcf

removes any sites with estimated probability of not being polymorphic less than phred 20 (aka 0.01), or probability of polymorphism > 0.99.

In simulation, the receiver-operator characteristic (ROC) tends to have a very sharp inflection between Q1 and Q30, depending on input data characteristics, and a filter setting in this range should provide decent performance. Users are encouraged to examine their output and both variants which are retained and those they filter out. Most problems tend to occur in low-depth areas, and so users may wish to remove these as well, which can also be done by filtering on the DP flag.
```
Therefore, a **minimum of 1** should be used as a min_site_qual to remove low quality sites. If a site is annotated as **targeted**, the site will be kept regardless of its qual value, however, the alternate observation counts for the site may be reset to zero depending on the min_target_site_qual value described below.
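
To get a feel for what a given QUAL threshold means, the phred scale can be converted back to a probability. The sketch below uses the standard phred relationship (P = 10^(-Q/10)); the function name is hypothetical.

```python
def qual_to_not_polymorphic_probability(qual):
    """Convert a phred-scaled QUAL value to the estimated probability
    that the site is NOT polymorphic."""
    return 10 ** (-qual / 10)


for qual in (1, 10, 20, 30):
    p = qual_to_not_polymorphic_probability(qual)
    print(f"QUAL {qual}: P(not polymorphic) = {p:.4f}")
```

With this conversion, QUAL 20 corresponds to 0.01 as in the quoted manual text, while QUAL 1 corresponds to roughly 0.79, so a min_site_qual of 1 only removes sites that are more likely than not to be non-polymorphic.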

#### min_target_site_qual, default=-1
If a variant site is targeted but the site qual is lower than this,
reset the alternate observation counts to 0. It may be best to leave
this at the default value since there is usually additional evidence
that a targeted variant exists in a sample compared to a de novo
variant, i.e. targeted variants have already been observed in other samples/studies.
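
The interaction between min_site_qual and min_target_site_qual described above can be summarized in a small sketch. This is an interpretation of the text, not the pipeline's code: the function name is hypothetical, and treating "lower than" as a strict comparison is an assumption.

```python
def filter_site(qual, alt_count, targeted,
                min_site_qual=-1, min_target_site_qual=-1):
    """Return the alt count to report, or None if the site is dropped.

    Assumption: a site fails a threshold when qual is strictly below it.
    """
    if targeted:
        # targeted sites are always kept, but low quality ones lose their alt counts
        if qual < min_target_site_qual:
            return 0
        return alt_count
    if qual < min_site_qual:
        return None  # non-targeted low quality site is removed
    return alt_count


print(filter_site(0.5, 7, targeted=False, min_site_qual=1))
print(filter_site(0.5, 7, targeted=True, min_site_qual=1))
print(filter_site(0.5, 7, targeted=True, min_site_qual=1,
                  min_target_site_qual=1))
```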

#### Example cell
```python
# provide a file that maps gene names to gene IDs
# this is necessary when targeted variant annotations use
# gene names instead of gene IDs
geneid_to_genename = "/opt/project_resources/geneid_to_genename.tsv"
# annotate targeted amino acid changes in the tables.
target_aa_annotation = "/opt/project_resources/targets.tsv"
# decompose multi amino acid changes and combine counts of
# resulting single amino acid changes
aggregate_aminoacids = True
# decompose MNVs and combine counts for resulting SNVs
aggregate_nucleotides = True
# annotate targeted nucleotide changes in the tables.
target_nt_annotation = None
```

In [9]:
# USER INPUT

# provide a file that maps gene names to gene IDs
# this is necessary when targeted variant annotations use
# gene names instead of gene IDs. Otherwise provide None
geneid_to_genename = '/opt/project_resources/geneid_to_genename.tsv'
# annotate targeted amino acid changes in the tables
# using the file, or otherwise provide None
target_aa_annotation = '/opt/project_resources/targets.tsv'
# decompose multi amino acid changes and combine counts of
# resulting single amino acid changes
aggregate_aminoacids = True
# decompose MNVs and combine counts for resulting SNVs
aggregate_nucleotides = True
# annotate targeted nucleotide changes in the tables.
target_nt_annotation = None

In [10]:
# OPTIONAL USER INPUT

# analysis settings dictionary
settings = settings
# provide the path to the settings file
# if settings dictionary has not been loaded
settings_file = None
# use snpEff to annotate the variants
annotate = True
# additional vt options for decomposing nucleotides.
# Supply ["-a"] to include indels and complex variants
# in decomposition, or other options shown above if desired.
decompose_options = []
# was the initial vcf file annotated by snpEff?
annotated_vcf = False
# create tables for variants as they are in the vcf file
# without decomposing complex variants or indels.
# Multiallelic variants will be split into biallelic.
aggregate_none = True
# filter variant sites for quality
min_site_qual = 1
# reset targeted variant counts to zero
# when the site quality is below this value
min_target_site_qual = -1
# reset genotypes in the vcf file to NA
# and depth to 0 if FORMAT/GQ value for a variant/sample
# is below this value:
min_genotype_qual = -1
# reset alt allele count in the vcf file to 0
# if FORMAT/QA value divided by FORMAT/AO for a variant/sample
# is below this value:
min_mean_alt_qual = -1 # average quality cut off for variants
# There are also available, similar filters for:
# min_mean_ref_qual : resetting low qual reference allele counts
# min_alt_qual : similar to min_mean_alt_qual, but for total qual score
# min_ref_qual : similar to min_alt_qual but for reference alleles

# prefix for output files, if desired.
# this is useful when different quality thresholds etc will be used
# to avoid overwriting the files. For example, if min_genotype_qual = 1
# and min_mean_alt_qual = 15 is used, a suitable prefix could be
# "gq1.mqa15."
output_prefix = ""

In [11]:
# RUN

# input vcf file
vcf_file = vcf_file.split("/")[-1].replace('.vcf.gz','_mutations_only.vcf.gz')
mip.vcf_to_tables_fb(
     vcf_file,
     settings=settings,
     settings_file=settings_file,
     annotate=annotate,
     geneid_to_genename=geneid_to_genename,
     target_aa_annotation=target_aa_annotation,
     aggregate_aminoacids=aggregate_aminoacids,
     target_nt_annotation=target_nt_annotation, 
     aggregate_nucleotides=aggregate_nucleotides, 
     decompose_options=decompose_options,
     annotated_vcf=annotated_vcf,
     aggregate_none=aggregate_none,
     min_site_qual=min_site_qual,
     min_target_site_qual=min_target_site_qual,
     min_genotype_qual=min_genotype_qual,
     min_mean_alt_qual=min_mean_alt_qual,
     output_prefix=output_prefix)

decompose_blocksub v0.5

options:     input VCF file        /opt/user/stats_and_variant_calling/split.variants_mutations_only.vcf.gz
         [o] output VCF file       /opt/user/stats_and_variant_calling/decomposed.variants_mutations_only.vcf.gz
         [a] align/aggressive mode false


stats: no. variants                       : 789
       no. biallelic block substitutions  : 76

       no. additional SNPs                : 416
       no. variants after decomposition   : 1129

Time elapsed: 0.34s



## Tables created
alternate_XX_table.csv files will contain the ALT allele count for that table type while coverage_XX_table.csv will contain the depth of coverage at each locus.
### Nucleotide changes (aggregated)
For some projects we may be interested in specific single nucleotide changes. For these, it would make sense to decompose multi nucleotide changes and combine counts of the same single nucleotide changes. Two tables will be generated for count and coverage data for aggregated nucleotide changes:  

**alternate_AN_table.csv** file in the analysis directory is created if aggregate_nucleotides option was selected when creating data tables. This table has the UMI counts for each alternate nucleotide.  

**coverage_AN_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_AN_table.csv** file contains the aggregated value of the genotypes called by freebayes: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1. When calls from multiple variants are aggregated; if all 0/0 then -> 0, if any 0/0 and non-0/0 then -> 1, if all 1/1 then -> 2
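
The genotype coding and aggregation rule can be sketched as below. The text above does not say how missing calls or a mix of 0/1 and 1/1 calls are combined, so this sketch makes two labeled assumptions: missing calls are ignored unless every call is missing, and any mix that is neither all 0/0 nor all 1/1 is reported as 1. The function name is hypothetical.

```python
GENOTYPE_CODES = {"0/0": 0, "0/1": 1, "1/1": 2, ".": -1}


def aggregate_genotypes(genotypes):
    """Combine genotype calls from several variants into one value."""
    codes = [GENOTYPE_CODES[g] for g in genotypes]
    # assumption: missing calls (-1) are ignored when other calls exist
    called = [c for c in codes if c != -1]
    if not called:
        return -1
    if all(c == 0 for c in called):
        return 0
    if all(c == 2 for c in called):
        return 2
    # assumption: every other combination is treated as mixed
    return 1


print(aggregate_genotypes(["0/0", "0/0"]))
print(aggregate_genotypes(["0/0", "1/1"]))
print(aggregate_genotypes(["1/1", "1/1"]))
```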

### Amino acid changes (aggregated)
For some projects we may be interested in the amino acid changes, particularly specific, targeted amino acid changes, such as drug resistance mutations in *Plasmodium falciparum*, which is the data set provided for testing the pipeline. For these types of projects, we may want to analyze the variants from the amino acid perspective, rather than as nucleotide changes, which is the standard output of variant callers.  

**alternate_AA_table.csv** file in the analysis directory is created if aggregate_aminoacids option was selected when creating data tables. This table has the UMI counts for each alternate amino acid.  

**coverage_AA_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_AA_table.csv** file contains the aggregated value of the genotypes called by freebayes: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1. When calls from multiple variants are aggregated; if all 0/0 then -> 0, if any 0/0 and non-0/0 then -> 1, if all 1/1 then -> 2

### Nucleotide changes (not aggregated)
For some projects we may be interested in keeping composite variants as they are called by the pipeline. These will include MNVs, complex variants including indels, etc. Two tables will be generated for count and coverage data for original nucleotide changes:  

**alternate_table.csv** file in the analysis directory is created if aggregate_none option was selected when creating data tables. This table has the UMI counts for each alternate nucleotide.  

**coverage_table.csv** file is the corresponding coverage depth for each variant's position.  

**genotypes_table.csv** file contains the aggregated value of the genotypes called by freebayes: 0/0->0, 0/1->1, 1/1->2, N/A (.) ->-1.

## freebayes help documentation
Below are the various sections of freebayes --help output showing examples and options.
```bash
citation: Erik Garrison, Gabor Marth
          "Haplotype-based variant detection from short-read sequencing"
          arXiv:1207.3907 (http://arxiv.org/abs/1207.3907)

author:   Erik Garrison <erik.garrison@bc.edu>, Marth Lab, Boston College, 2010-2014
version:  v1.3.1-dirty
```


### overview:
```bash
    To call variants from aligned short-read sequencing data, supply BAM files and
    a reference.  FreeBayes will provide VCF output on standard out describing SNPs,
    indels, and complex variants in samples in the input alignments.

    By default, FreeBayes will consider variants supported by at least 2
    observations in a single sample (-C) and also by at least 20% of the reads from
    a single sample (-F).  These settings are suitable to low to high depth
    sequencing in haploid and diploid samples, but users working with polyploid or
    pooled samples may wish to adjust them depending on the characteristics of
    their sequencing data.

    FreeBayes is capable of calling variant haplotypes shorter than a read length
    where multiple polymorphisms segregate on the same read.  The maximum distance
    between polymorphisms phased in this way is determined by the
    --max-complex-gap, which defaults to 3bp.  In practice, this can comfortably be
    set to half the read length.

    Ploidy may be set to any level (-p), but by default all samples are assumed to
    be diploid.  FreeBayes can model per-sample and per-region variation in
    copy-number (-A) using a copy-number variation map.

    FreeBayes can act as a frequency-based pooled caller and describe variants
    and haplotypes in terms of observation frequency rather than called genotypes.
    To do so, use --pooled-continuous and set input filters to a suitable level.
    Allele observation counts will be described by AO and RO fields in the VCF output.

```

### examples:
```bash
    # call variants assuming a diploid sample
    freebayes -f ref.fa aln.bam >var.vcf

    # call variants assuming a diploid sample, providing gVCF output
    freebayes -f ref.fa --gvcf aln.bam >var.gvcf

    # require at least 5 supporting observations to consider a variant
    freebayes -f ref.fa -C 5 aln.bam >var.vcf

    # discard alignments overlapping positions where total read depth is greater than 200
    freebayes -f ref.fa -g 200 aln.bam >var.vcf

    # use a different ploidy
    freebayes -f ref.fa -p 4 aln.bam >var.vcf

    # assume a pooled sample with a known number of genome copies
    freebayes -f ref.fa -p 20 --pooled-discrete aln.bam >var.vcf

    # generate frequency-based calls for all variants passing input thresholds
    freebayes -f ref.fa -F 0.01 -C 1 --pooled-continuous aln.bam >var.vcf

    # use an input VCF (bgzipped + tabix indexed) to force calls at particular alleles
    freebayes -f ref.fa -@ in.vcf.gz aln.bam >var.vcf

    # generate long haplotype calls over known variants
    freebayes -f ref.fa --haplotype-basis-alleles in.vcf.gz \
                        --haplotype-length 50 aln.bam

    # naive variant calling: simply annotate observation counts of SNPs and indels
    freebayes -f ref.fa --haplotype-length 0 --min-alternate-count 1 \
        --min-alternate-fraction 0 --pooled-continuous --report-monomorphic >var.vcf
```

### input:
```bash
   -b --bam FILE   Add FILE to the set of BAM files to be analyzed.
   -L --bam-list FILE
                   A file containing a list of BAM files to be analyzed.
   -c --stdin      Read BAM input on stdin.  
   -f --fasta-reference FILE
                   Use FILE as the reference sequence for analysis.
                   An index file (FILE.fai) will be created if none exists.
                   If neither --targets nor --region are specified, FreeBayes
                   will analyze every position in this reference.
   -t --targets FILE
                   Limit analysis to targets listed in the BED-format FILE.
   -r --region <chrom>:<start_position>-<end_position>
                   Limit analysis to the specified region, 0-base coordinates,
                   end_position not included (same as BED format).
                   Either '-' or '..' maybe used as a separator.
   -s --samples FILE
                   Limit analysis to samples listed (one per line) in the FILE.
                   By default FreeBayes will analyze all samples in its input
                   BAM files.
   --populations FILE
                   Each line of FILE should list a sample and a population which
                   it is part of.  The population-based bayesian inference model
                   will then be partitioned on the basis of the populations.
   -A --cnv-map FILE
                   Read a copy number map from the BED file FILE, which has
                   either a sample-level ploidy:
                      sample_name copy_number
                   or a region-specific format:
                      seq_name start end sample_name copy_number
                   ... for each region in each sample which does not have the
                   default copy number as set by --ploidy. These fields can be delimited
                   by space or tab.

```
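
Because `--region` expects 0-based coordinates with the end position excluded (as in BED), intervals taken from 1-based, end-inclusive sources such as VCF positions need their start shifted by one. The sketch below shows that conversion; the function name and the example chromosome name are hypothetical.

```python
def to_freebayes_region(chrom, start_1based, end_1based_inclusive):
    """Convert a 1-based, end-inclusive interval to a freebayes --region string.

    freebayes regions are 0-based with the end excluded, so only the start
    shifts by one; the inclusive 1-based end equals the exclusive 0-based end.
    """
    if start_1based < 1 or end_1based_inclusive < start_1based:
        raise ValueError("expected 1 <= start <= end")
    start_0based = start_1based - 1
    return f"{chrom}:{start_0based}-{end_1based_inclusive}"


# Hypothetical interval: bases 101 through 200 (1-based, inclusive) on "chr1".
region = to_freebayes_region("chr1", 101, 200)
print(region)
```

The resulting string can be passed after `-r` in a freebayes command.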

### output:
```bash
   -v --vcf FILE   Output VCF-format results to FILE. (default: stdout)
   --gvcf
                   Write gVCF output, which indicates coverage in uncalled regions.
   --gvcf-chunk NUM
                   When writing gVCF output emit a record for every NUM bases.
   -& --gvcf-dont-use-chunk BOOL
                   When writing the gVCF output emit a record for all bases if
                   set to "true", will also route an int to --gvcf-chunk
                   similar to --output-mode EMIT_ALL_SITES from GATK
   -@ --variant-input VCF
                   Use variants reported in VCF file as input to the algorithm.
                   Variants in this file will be included in the output even if
                   there is not enough support in the data to pass input filters.
   -l --only-use-input-alleles
                   Only provide variant calls and genotype likelihoods for sites
                   and alleles which are provided in the VCF input, and provide
                   output in the VCF for all input alleles, not just those which
                   have support in the data. 
   --haplotype-basis-alleles VCF
                   When specified, only variant alleles provided in this input
                   VCF will be used for the construction of complex or haplotype
                   alleles.
   --report-all-haplotype-alleles
                   At sites where genotypes are made over haplotype alleles,
                   provide information about all alleles in output, not only
                   those which are called.   
   --report-monomorphic
                   Report even loci which appear to be monomorphic, and report all
                   considered alleles, even those which are not in called genotypes.
                   Loci which do not have any potential alternates have '.' for ALT.
   -P --pvar N     Report sites if the probability that there is a polymorphism
                   at the site is greater than N.  default: 0.0.  Note that post-
                   filtering is generally recommended over the use of this parameter.
   --strict-vcf
                   Generate strict VCF format (FORMAT/GQ will be an int)

```
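
The `--pvar` entry above recommends post-filtering over filtering during calling. One simple form of post-filtering is to keep only records whose QUAL column meets a threshold. The sketch below does this on plain VCF text lines; the function name, the threshold of 20, and the example records are assumptions for illustration, not values used elsewhere in this notebook.

```python
def filter_vcf_by_qual(lines, min_qual=20.0):
    """Keep header lines and records whose QUAL (6th column) is >= min_qual.

    Records with a missing QUAL ('.') are dropped. The threshold is an
    illustrative assumption, not a recommended value.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # always keep meta and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


# Hypothetical VCF content, trimmed to the eight fixed columns.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t55.3\t.\tTYPE=snp\n",
    "chr1\t250\t.\tC\tT\t3.1\t.\tTYPE=snp\n",
]
filtered = filter_vcf_by_qual(example_vcf, min_qual=20.0)
print("".join(filtered))
```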

### population model:
```bash
   -T --theta N    The expected mutation rate or pairwise nucleotide diversity
                   among the population under analysis.  This serves as the
                   single parameter to the Ewens Sampling Formula prior model
                   default: 0.001
   -p --ploidy N   Sets the default ploidy for the analysis to N.  default: 2
   -J --pooled-discrete
                   Assume that samples result from pooled sequencing.
                   Model pooled samples using discrete genotypes across pools.
                   When using this flag, set --ploidy to the number of
                   alleles in each sample or use the --cnv-map to define
                   per-sample ploidy.
   -K --pooled-continuous
                   Output all alleles which pass input filters, regardless of
                   genotyping outcome or model.
```
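
For `--pooled-discrete`, the text above suggests `--cnv-map` when pools differ in size. The sample-level format is one `sample_name copy_number` pair per line, delimited by space or tab. A minimal sketch of producing those lines follows; the sample names and copy numbers are hypothetical, and tab is chosen arbitrarily as the delimiter.

```python
def cnv_map_lines(sample_ploidy):
    """Format a {sample_name: copy_number} mapping as sample-level cnv-map lines."""
    lines = []
    for sample, copies in sample_ploidy.items():
        if int(copies) < 1:
            raise ValueError(f"copy number for {sample} must be at least 1")
        lines.append(f"{sample}\t{int(copies)}")
    return lines


def write_cnv_map(path, sample_ploidy):
    """Write the cnv-map lines to path, one sample per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(cnv_map_lines(sample_ploidy)) + "\n")


# Hypothetical pools: the copy number is the number of alleles in each pool.
pools = {"pool_A": 20, "pool_B": 8}
lines = cnv_map_lines(pools)
print("\n".join(lines))
```

The file written by `write_cnv_map` would then be passed to freebayes with `-A` alongside `--pooled-discrete`.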
### reference allele:
```bash
   -Z --use-reference-allele
                   This flag includes the reference allele in the analysis as
                   if it is another sample from the same population.
   --reference-quality MQ,BQ
                   Assign mapping quality of MQ to the reference allele at each
                   site and base quality of BQ.  default: 100,60
```
### allele scope:
```bash
   -n --use-best-n-alleles N
                   Evaluate only the best N SNP alleles, ranked by sum of
                   supporting quality scores.  (Set to 0 to use all; default: all)
   -E --max-complex-gap N
      --haplotype-length N
                   Allow haplotype calls with contiguous embedded matches of up
                   to this length. Set N=-1 to disable clumping. (default: 3)
   --min-repeat-size N
                   When assembling observations across repeats, require the total repeat
                   length at least this many bp.  (default: 5)
   --min-repeat-entropy N
                   To detect interrupted repeats, build across sequence until it has
                   entropy > N bits per bp. Set to 0 to turn off. (default: 1)
   --no-partial-observations
                   Exclude observations which do not fully span the dynamically-determined
                   detection window.  (default, use all observations, dividing partial
                   support across matching haplotypes when generating haplotypes.)

  These flags are meant for testing.
  They are not meant for filtering the output.
  They actually filter the input to the algorithm by throwing away alignments.
  This hurts performance by hiding information from the Bayesian model.
  Do not use them unless you can validate that they improve results!

   -I --throw-away-snp-obs     Remove SNP observations from input.
   -i --throw-away-indels-obs  Remove indel observations from input.
   -X --throw-away-mnp-obs     Remove MNP observations from input.
   -u --throw-away-complex-obs Remove complex allele observations from input.

  If you need to break apart haplotype calls to obtain one class of alleles,
  run the call with default parameters, then normalize and subset the VCF:
    freebayes ... | vcfallelicprimitives -kg >calls.vcf
  For example, this would retain only biallelic SNPs.
    <calls.vcf vcfsnps | vcfbiallelic >biallelic_snp_calls.vcf
```
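
As a rough Python counterpart to the `vcfsnps | vcfbiallelic` step above, records can be screened on the ALT column and the `TYPE` field that freebayes writes into INFO. This is only a sketch under that assumption: it does not decompose complex alleles the way `vcfallelicprimitives` does, and the function name and example records are hypothetical.

```python
def is_biallelic_snp(record_line):
    """Return True if a VCF record has exactly one ALT allele and INFO TYPE=snp."""
    fields = record_line.rstrip("\n").split("\t")
    alt = fields[4]
    info = fields[7]
    if alt == "." or "," in alt:
        return False  # monomorphic or multiallelic site
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    return info_map.get("TYPE") == "snp"


# Hypothetical records, trimmed to the eight fixed VCF columns.
records = [
    "chr1\t100\t.\tA\tG\t50\t.\tAO=12;TYPE=snp",
    "chr1\t200\t.\tA\tG,T\t50\t.\tAO=5,4;TYPE=snp,snp",
    "chr1\t300\t.\tAT\tA\t50\t.\tAO=7;TYPE=del",
]
snp_flags = [is_biallelic_snp(rec) for rec in records]
print(snp_flags)
```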
### indel realignment:
```bash
   -O --dont-left-align-indels
                   Turn off left-alignment of indels, which is enabled by default.

```

### input filters:
```bash
   -4 --use-duplicate-reads
                   Include duplicate-marked alignments in the analysis.
                   default: exclude duplicates marked as such in alignments
   -m --min-mapping-quality Q
                   Exclude alignments from analysis if they have a mapping
                   quality less than Q.  default: 1
   -q --min-base-quality Q
                   Exclude alleles from analysis if their supporting base
                   quality is less than Q.  default: 0
   -R --min-supporting-allele-qsum Q
                   Consider any allele in which the sum of qualities of supporting
                   observations is at least Q.  default: 0
   -Y --min-supporting-mapping-qsum Q
                   Consider any allele in which the sum of mapping qualities of
                   supporting reads is at least Q.  default: 0
   -Q --mismatch-base-quality-threshold Q
                   Count mismatches toward --read-mismatch-limit if the base
                   quality of the mismatch is >= Q.  default: 10
   -U --read-mismatch-limit N
                   Exclude reads with more than N mismatches where each mismatch
                   has base quality >= mismatch-base-quality-threshold.
                   default: ~unbounded
   -z --read-max-mismatch-fraction N
                   Exclude reads with more than N [0,1] fraction of mismatches where
                   each mismatch has base quality >= mismatch-base-quality-threshold
                   default: 1.0
   -$ --read-snp-limit N
                   Exclude reads with more than N base mismatches, ignoring gaps
                   with quality >= mismatch-base-quality-threshold.
                   default: ~unbounded
   -e --read-indel-limit N
                   Exclude reads with more than N separate gaps.
                   default: ~unbounded
   -0 --standard-filters  Use stringent input base and mapping quality filters
                   Equivalent to -m 30 -q 20 -R 0 -S 0
   -F --min-alternate-fraction N
                   Require at least this fraction of observations supporting
                   an alternate allele within a single individual in order
                   to evaluate the position.  default: 0.05
   -C --min-alternate-count N
                   Require at least this count of observations supporting
                   an alternate allele within a single individual in order
                   to evaluate the position.  default: 2
   -3 --min-alternate-qsum N
                   Require at least this sum of quality of observations supporting
                   an alternate allele within a single individual in order
                   to evaluate the position.  default: 0
   -G --min-alternate-total N
                   Require at least this count of observations supporting
                   an alternate allele within the total population in order
                   to use the allele in analysis.  default: 1
   --min-coverage N
                   Require at least this coverage to process a site. default: 0
   --limit-coverage N
                   Downsample per-sample coverage to this level if greater than this coverage.
                   default: no limit
   -g --skip-coverage N
                   Skip processing of alignments overlapping positions with coverage >N.
                   This filters sites above this coverage, but will also reduce data nearby.
                   default: no limit


```
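
To see how `--min-alternate-fraction` (`-F`) and `--min-alternate-count` (`-C`) interact, the sketch below applies both thresholds to the observation counts of a single sample. It is a simplified illustration of the rule as described in the help text, not freebayes' actual implementation, and the function name and counts are hypothetical.

```python
def passes_alt_filters(alt_obs, total_obs, min_fraction=0.05, min_count=2):
    """Check one sample's alternate allele against the -F and -C thresholds.

    Defaults mirror the documented freebayes defaults (0.05 and 2).
    Both conditions must hold for the allele to pass.
    """
    if total_obs <= 0:
        return False
    return alt_obs >= min_count and (alt_obs / total_obs) >= min_fraction


# 3 alternate observations out of 200 is a 1.5% fraction.
default_result = passes_alt_filters(3, 200)  # fails the 5% default
relaxed_result = passes_alt_filters(3, 200, min_fraction=0.01, min_count=1)
print(default_result, relaxed_result)
```

With the relaxed settings from the pooled example (`-F 0.01 -C 1`), the same low-frequency allele is kept, which is why those values are used when rare variants matter.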

### population priors:
```bash
   -k --no-population-priors
                   Equivalent to --pooled-discrete --hwe-priors-off and removal of
                   Ewens Sampling Formula component of priors.
```
### mappability priors:
```bash
   -w --hwe-priors-off
                   Disable estimation of the probability of the combination
                   arising under HWE given the allele frequency as estimated
                   by observation frequency. 
   -V --binomial-obs-priors-off
                   Disable incorporation of prior expectations about observations.
                   Uses read placement probability, strand balance probability,
                   and read position (5'-3') probability.
   -a --allele-balance-priors-off
                   Disable use of aggregate probability of observation balance between alleles
                   as a component of the priors.
```
### genotype likelihoods:
```bash
   --observation-bias FILE
                   Read length-dependent allele observation biases from FILE.
                   The format is [length] [alignment efficiency relative to reference]
                   where the efficiency is 1 if there is no relative observation bias.
   --base-quality-cap Q
                   Limit estimated observation quality by capping base quality at Q.
   --prob-contamination F
                   An estimate of contamination to use for all samples.  default: 10e-9
   --legacy-gls    Use legacy (polybayes equivalent) genotype likelihood calculations
   --contamination-estimates FILE
                   A file containing per-sample estimates of contamination, such as
                   those generated by VerifyBamID.  The format should be:
                       sample p(read=R|genotype=AR) p(read=A|genotype=AA)
                   Sample '*' can be used to set default contamination estimates.
```
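
The `--contamination-estimates` file lists one sample per line followed by two probabilities, with `*` available as a default entry. The sketch below formats such lines; the function name, the sample name, the space delimiter, and all probability values are placeholders chosen for illustration, not measured estimates.

```python
def contamination_lines(estimates, default=None):
    """Format per-sample contamination estimates, one sample per line.

    estimates maps sample -> (p(read=R|genotype=AR), p(read=A|genotype=AA)).
    default, if given, is written first under the '*' sample name.
    """
    lines = []
    if default is not None:
        lines.append(f"* {default[0]} {default[1]}")
    for sample, (p_ref_given_het, p_alt_given_hom) in estimates.items():
        for p in (p_ref_given_het, p_alt_given_hom):
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"probability out of range for {sample}: {p}")
        lines.append(f"{sample} {p_ref_given_het} {p_alt_given_hom}")
    return lines


# Placeholder values only; real estimates would come from a tool such as VerifyBamID.
contam = contamination_lines({"sample_1": (0.49, 0.98)}, default=(0.5, 0.999))
print("\n".join(contam))
```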
### algorithmic features:
```bash
   --report-genotype-likelihood-max
                   Report genotypes using the maximum-likelihood estimate provided
                   from genotype likelihoods.
   -B --genotyping-max-iterations N
                   Iterate no more than N times during genotyping step. default: 1000.
   --genotyping-max-banddepth N
                   Integrate no deeper than the Nth best genotype by likelihood when
                   genotyping. default: 6.   
   -W --posterior-integration-limits N,M
                   Integrate all genotype combinations in our posterior space
                   which include no more than N samples with their Mth best
                   data likelihood. default: 1,3.
   -N --exclude-unobserved-genotypes
                   Skip sample genotypings for which the sample has no supporting reads.
   -S --genotype-variant-threshold N
                   Limit posterior integration to samples where the second-best
                   genotype likelihood is no more than log(N) from the highest
                   genotype likelihood for the sample.  default: ~unbounded
   -j --use-mapping-quality
                   Use mapping quality of alleles when calculating data likelihoods.
   -H --harmonic-indel-quality
                   Use a weighted sum of base qualities around an indel, scaled by the
                   distance from the indel.  By default use a minimum BQ in flanking sequence.
   -D --read-dependence-factor N
                   Incorporate non-independence of reads by scaling successive
                   observations by this factor during data likelihood

```