Different callers represent the same variant in different ways, which complicates the comparison between different call sets. For this reason, a procedure to normalize the variants is necessary.

This document presents the tools and the procedure used to standardize the variants. The procedure handles:
* multiallelic variants
* MNPs
* normalization of SNPs and indels

**NOTE:** Freebayes can identify complex variants, which the Freebayes manual describes as "composite insertion and substitution events". I have not found a way to normalize them.

# Multiallelic variants
I use bcftools to split multiallelic sites into separate records. For example, we have the following multiallelic SNP site:

This splitting is necessary for the downstream normalization steps to be effective.
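
To make the idea of splitting concrete, here is a minimal Python sketch of the concept. It is not how bcftools is implemented: it only looks at CHROM, POS, REF and ALT, whereas a real tool also has to adjust the per-allele INFO and genotype fields. The function name `split_multiallelic` and the example site are hypothetical.

```python
from typing import List, Tuple

# A variant record reduced to the fields needed here: (CHROM, POS, REF, ALT).
Record = Tuple[str, int, str, str]


def split_multiallelic(chrom: str, pos: int, ref: str, alts: str) -> List[Record]:
    """Return one biallelic record per comma-separated ALT allele."""
    return [(chrom, pos, ref, alt) for alt in alts.split(",")]


# Hypothetical site with two ALT alleles (not taken from a real call set).
for record in split_multiallelic("chr1", 1000, "A", "C,T"):
    print(*record, sep="\t")
```

Each output record carries a single ALT allele, so the later steps can treat every allele independently.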

The command used for this splitting is:

**Note:** If you want to normalize only a certain variant type (e.g. SNPs or indels), select that type after splitting the multiallelic variants, because a single multiallelic record can contain alleles of different variant types. For example:
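
The reason for selecting after splitting can be sketched in Python as follows. The classification rule in `variant_type` is an assumption made for illustration (alleles of different length are called indels, equal-length alleles with one mismatch are called SNPs) and is not the exact rule of any particular tool. The record is hypothetical.

```python
def variant_type(ref: str, alt: str) -> str:
    """Rough allele classification based on length and number of mismatches."""
    if len(ref) != len(alt):
        return "indel"
    mismatches = sum(r != a for r, a in zip(ref, alt))
    if mismatches == 0:
        return "ref"
    return "snp" if mismatches == 1 else "mnp"


# Hypothetical multiallelic record that mixes a SNP and an insertion.
chrom, pos, ref, alts = "chr1", 2000, "A", "G,AT"

# Split first, then filter: the record as a whole has no single type.
split = [(chrom, pos, ref, alt) for alt in alts.split(",")]
snps_only = [rec for rec in split if variant_type(rec[2], rec[3]) == "snp"]
print(snps_only)  # [('chr1', 2000, 'A', 'G')]
```

Filtering the unsplit record by type would have to either keep or drop both alleles together, which is why the selection comes second.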

# MNPs
Multinucleotide polymorphisms (MNPs) can be decomposed into SNVs. For example, we have the following MNP site:

After decomposition we will have:

We use vcfallelicprimitives (https://github.com/vcflib/vcflib.git) with the following command line:

**Note:** This tool can also normalize SNPs and indels. According to its manual, vcfallelicprimitives can:

**Note:** In addition to decomposing MNPs, this tool normalizes the variants, but the sites will be unsorted after normalization.
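
The MNP decomposition described in this section can be illustrated with a short Python sketch. It shows the concept only and does not reproduce vcfallelicprimitives: genotypes and INFO fields are ignored, and the function name `decompose_mnp` and the example alleles are hypothetical.

```python
from typing import List, Tuple


def decompose_mnp(chrom: str, pos: int, ref: str, alt: str) -> List[Tuple[str, int, str, str]]:
    """Turn an equal-length REF/ALT pair into one SNV per mismatching base."""
    if len(ref) != len(alt):
        raise ValueError("REF and ALT must have the same length for an MNP")
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # matching bases carry no variant and are skipped
    ]


# Hypothetical MNP: the first and third bases differ, the middle one matches.
print(decompose_mnp("chr1", 100, "ACG", "TCA"))
# [('chr1', 100, 'A', 'T'), ('chr1', 102, 'G', 'A')]
```

Each resulting SNV gets its own position, which makes it directly comparable with SNVs reported by other callers.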

# Normalization of SNPs and INDELs

vcfallelicprimitives can also be used for this step. An example VCF with SNPs and indels before normalization:

chr1    10236   .       AACCCTA AACCCCTAA   TYPE=complex
chr1    10236   .       AACCCTA AA      4.0116e-12      .       AB=0.178106;ABP=3110.69;AC=2;AF=0.123864;AN=2;AO=3622;CIGAR=1M5D1M;DP=38235;DPB=39399;DPRA=1.82578;EPP=609.749;EPPR=510.378;GTI=10;LEN=5;MEANALT=3.95724;MQM=18.8738;MQMR=16.732;NS=2640;NUMALT=7;ODDS=35949.8;PAIRED=0.739647;PAIREDR=0.849491;PAO=133.083;PQA=1866.46;PQR=21537.7;PRO=1137.83;QA=74586;QR=648816;RO=24570;RPL=2101;RPP=204.69;RPPR=2429.69;RPR=1521;RUN=1;SAF=1561;SAP=152.891;SAR=2061;SRF=12850;SRP=115.861;SRR=11720;TYPE=del;technology.ILLUMINA=1
chr1    10236   .       AACCCTA AACCCCTA        4.0116e-12      .       AB=0.0855323;ABP=1642.82;AC=0;AF=0.00530303;AN=2;AO=680;CIGAR=2M1I5M;DP=38235;DPB=39399;DPRA=1.66044;EPP=52.1111;EPPR=510.378;GTI=10;LEN=1;MEANALT=4.97588;MQM=17.6588;MQMR=16.732;NS=2640;NUMALT=7;ODDS=35949.8;PAIRED=0.907353;PAIREDR=0.849491;PAO=789.833;PQA=13880.2;PQR=21537.7;PRO=1137.83;QA=9327;QR=648816;RO=24570;RPL=463;RPP=196.259;RPPR=2429.69;RPR=217;RUN=1;SAF=515;SAP=394.195;SAR=165;SRF=12850;SRP=115.861;SRR=11720;TYPE=ins;technology.ILLUMINA=1
chr1    10236   .       AACCCTA AACCCTAA        4.0116e-12      .       AB=0.2171;ABP=2596.63;AC=0;AF=0.0655303;AN=2;AO=3968;CIGAR=6M1I1M;DP=38235;DPB=39399;DPRA=1.8104;EPP=667.588;EPPR=510.378;GTI=10;LEN=1;MEANALT=3.92714;MQM=19.0071;MQMR=16.732;NS=2640;NUMALT=7;ODDS=35949.8;PAIRED=0.748488;PAIREDR=0.849491;PAO=938.083;PQA=16850;PQR=21537.7;PRO=1137.83;QA=92054;QR=648816;RO=24570;RPL=2023;RPP=6.33975;RPPR=2429.69;RPR=1945;RUN=1;SAF=1802;SAP=75.5182;SAR=2166;SRF=12850;SRP=115.861;SRR=11720;TYPE=ins;technology.ILLUMINA=1
chr1    10236   .       AACCCTA ACCCCTA 4.0116e-12      .       AB=0.0903194;ABP=2924.49;AC=0;AF=0.0306818;AN=2;AO=1229;CIGAR=1M1X5M;DP=38235;DPB=39399;DPRA=1.66887;EPP=46.5617;EPPR=510.378;GTI=10;LEN=1;MEANALT=4.64391;MQM=18.6876;MQMR=16.732;NS=2640;NUMALT=7;ODDS=35949.8;PAIRED=0.7738;PAIREDR=0.849491;PAO=240;PQA=4013.38;PQR=21537.7;PRO=1137.83;QA=15017;QR=648816;RO=24570;RPL=490;RPP=112.557;RPPR=2429.69;RPR=739;RUN=1;SAF=724;SAP=87.7507;SAR=505;SRF=12850;SRP=115.861;SRR=11720;TYPE=snp;technology.ILLUMINA=1
chr1    10236   .       AACCCTA AGCCCTA 4.0116e-12      .       AB=0.0830189;ABP=403.225;AC=0;AF=0.0373106;AN=2;AO=138;CIGAR=1M1X5M;DP=38235;DPB=39399;DPRA=1.567;EPP=10.6262;EPPR=510.378;GTI=10;LEN=1;MEANALT=5.39344;MQM=19.9855;MQMR=16.732;NS=2640;NUMALT=7;ODDS=35949.8;PAIRED=0.92029;PAIREDR=0.849491;PAO=211.5;PQA=3878.38;PQR=21537.7;PRO=1137.83;QA=1498;QR=648816;RO=24570;RPL=96;RPP=48.8945;RPPR=2429.69;RPR=42;RUN=1;SAF=38;SAP=63.4968;SAR=100;SRF=12850;SRP=115.861;SRR=11720;TYPE=snp;technology.ILLUMINA=1
chr1    10236   .       AACCCTA AACCCAA 4.0116e-12      .       AB=0.0832408;ABP=1362.29;AC=0;AF=0.00625;AN=2;AO=464;CIGAR=5M1X1M;DP=38235;DPB=39399;DPRA=1.53169;EPP=25.9418;EPPR=510.378;GTI=10;LEN=1;MEANALT=5.18135;MQM=17.7306;MQMR=16.732;NS=2640;NUMALT=7;ODDS=35949.8;PAIRED=0.821121;PAIREDR=0.849491;PAO=662.083;PQA=10627.5;PQR=21537.7;PRO=1137.83;QA=5643;QR=648816;RO=24570;RPL=295;RPP=77.3084;RPPR=2429.69;RPR=169;RUN=1;SAF=296;SAP=79.6857;SAR=168;SRF=12850;SRP=115.861;SRR=11720;TYPE=snp;technology.ILLUMINA=1
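
One component of normalization, trimming the bases shared by REF and ALT, can be sketched in Python using alleles from the records above. This is a simplified illustration under stated assumptions: it only trims shared leading and trailing bases and does not left-align indels, which would require the reference sequence. The function name `trim_alleles` is hypothetical and the code is not the vcfallelicprimitives implementation.

```python
from typing import Tuple


def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Remove bases shared by REF and ALT, keeping at least one base in each."""
    # Trim shared trailing bases; POS is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases; POS moves right by one for each base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# SNP written with padding bases (the TYPE=snp record with ALT ACCCCTA).
print(trim_alleles(10236, "AACCCTA", "ACCCCTA"))  # (10237, 'A', 'C')
# Deletion (the TYPE=del record with ALT AA).
print(trim_alleles(10236, "AACCCTA", "AA"))       # (10236, 'AACCCT', 'A')
```

After trimming, the SNP is reported at the single base that actually changes, which agrees with its `CIGAR=1M1X5M` annotation.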

And after the normalization:

# Sorting positions
The vcfallelicprimitives step produces an unsorted VCF, which needs to be sorted by doing:
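
Conceptually, sorting means keeping the header lines in place and ordering the records by chromosome and then by position. The following Python sketch illustrates that idea only and is not the command used in this procedure. It assumes tab-separated records and orders chromosomes lexicographically, whereas dedicated tools typically follow the contig order of the header. The function name `sort_vcf_lines` and the example lines are hypothetical.

```python
from typing import List, Tuple


def sort_vcf_lines(lines: List[str]) -> List[str]:
    """Return header lines first, then records sorted by (CHROM, POS)."""
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def sort_key(line: str) -> Tuple[str, int]:
        chrom, pos = line.split("\t")[:2]
        return chrom, int(pos)  # compare positions numerically, not as text

    return header + sorted(records, key=sort_key)


# Hypothetical unsorted input.
unsorted_lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t10241\t.\tT\tA",
    "chr1\t10236\t.\tAACCCT\tA",
    "chr1\t10237\t.\tA\tC",
]
for line in sort_vcf_lines(unsorted_lines):
    print(line)
```

Converting POS to an integer matters: compared as text, position 9999 would sort after 10236.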

# Removing identical sites
The decomposition and normalization steps may introduce duplicated sites that need to be removed.
For example:

chr20   10023499        .       A       G
chr20   10023499        .       A       G
chr20   10023499        .       A       G

The duplicated records should be removed so that only one copy of each site remains. In our procedure we use vt uniq (https://github.com/atks/vt.git) with the following command:

After vt uniq:
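
As an illustration of the idea behind this step (not of vt's internals), a minimal Python sketch can treat two records as duplicates when CHROM, POS, REF and ALT all match. That definition of a duplicate is an assumption made here for simplicity, and the function name `drop_duplicates` is hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def drop_duplicates(records: List[Record]) -> List[Record]:
    """Keep the first occurrence of each record and preserve input order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# The three identical records from the example above.
records = [("chr20", 10023499, "A", "G")] * 3
print(drop_duplicates(records))  # [('chr20', 10023499, 'A', 'G')]
```

Because the first occurrence is kept and order is preserved, a sorted input stays sorted after deduplication.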