This repository contains several variant calling workflows. The workflows are designed to identify variants from sequence data produced by several different types of NGS technologies. The pipelines follow the rigorous standards described in the reference material associated with the Genome Analysis Toolkit at https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/best-practices-workflows and the same general workflow depicted below:
The pipelines are ordered in a logical sequence: initial discovery of variants with a large, cost-effective SNP array, followed by validation of the variants via targeted sequencing. To keep targeted sequencing cost-effective, thorough variant discovery practices are employed to reduce the number of false positives. The pipelines also differ in how the samples are considered, which is described further in the individual sections for each workflow.
This pipeline identifies variants from Illumina's Infinium GSA SNP-array IDAT files. The pipeline uses Illumina's proprietary iaap-cli command-line software to convert the IDAT files to GTC files. The rest of the pipeline relies on bcftools plugins and the Genome Analysis Toolkit to annotate, phase, filter, and identify variants. The pipeline is more thoroughly described .
- Convert IDAT to GTC (Illumina's iaap-cli gencall)
- Convert GTC to VCF (bcftools +gtc2vcf)
- Annotate variants (bcftools annotate)
- Extract ACMG59 table (bcftools view & bcftools query)
- Perform SNP QC (GATK)
- Phase genotypes
- Run MoChA (MoChA)
- Compute principal components and ancestry
- Extract final tables
- Generate MoChA call plots (MoChA)
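
As a rough illustration of the first two steps above, the sketch below builds the `iaap-cli gencall` and `bcftools +gtc2vcf` command lines in Python without running them. The flags follow the usage shown in the gtc2vcf plugin documentation. The helper names (`idat_to_gtc_cmd`, `gtc_to_vcf_cmd`) and every file path are hypothetical placeholders, not the scripts or inputs this repository actually uses.

```python
import shlex


def idat_to_gtc_cmd(bpm, egt, idat_dir, gtc_dir):
    """Build the iaap-cli gencall argv that converts IDAT files to GTC files."""
    return [
        "iaap-cli", "gencall",
        bpm,       # BPM manifest file for the array
        egt,       # EGT cluster file for the array
        gtc_dir,   # output folder for the GTC files
        "--idat-folder", idat_dir,
        "--output-gtc",
    ]


def gtc_to_vcf_cmd(bpm, csv, egt, gtc_dir, fasta, out_bcf):
    """Build the bcftools +gtc2vcf argv that converts GTC files to a BCF."""
    return [
        "bcftools", "+gtc2vcf",
        "--bpm", bpm,
        "--csv", csv,
        "--egt", egt,
        "--gtcs", gtc_dir,
        "--fasta-ref", fasta,
        "--output-type", "b",
        "--output", out_bcf,
    ]


if __name__ == "__main__":
    # Placeholder paths (assumptions); substitute the files for your array and reference.
    commands = [
        idat_to_gtc_cmd("GSA.bpm", "GSA.egt", "idats/", "gtcs/"),
        gtc_to_vcf_cmd("GSA.bpm", "GSA.csv", "GSA.egt", "gtcs/", "ref.fa", "cohort.bcf"),
    ]
    for argv in commands:
        print(shlex.join(argv))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` once the tools are installed. Keeping the commands as lists rather than shell strings avoids quoting problems with paths that contain spaces.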
This is a targeted sequencing variant analysis workflow. This analysis requires Illumina MiSeq or Illumina NextSeq targeted sequencing data as input.
This pipeline identifies variants from targeted sequencing data. Data produced by an Illumina MiSeq can be input into this pipeline. The pipeline relies on BWA, Picard, bedtools, bcftools, and the Genome Analysis Toolkit to align reads and identify variants.
- Align data (BWA MEM)
- Remove duplicates (picard)
- Recalibrate base quality scores (GATK)
- Estimate target coverages (bedtools coverage)
- Run Mutect2 (GATK)
- Run HaplotypeCaller (GATK)
- Merge calls (bcftools merge)
- Annotate variants (bcftools annotate & bcftools csq)
- Extract final table with ACMG59 (bcftools query)
- Generate IGV plots (IGV)
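
The alignment-through-calling steps above could be sketched as the ordered command lists below. This is a minimal illustration, not the repository's actual implementation: the function name `targeted_seq_cmds`, all file names, the `picard` wrapper executable, and the intermediate `samtools sort` step (which is not in the list above) are assumptions. Merging, annotation, table extraction, and IGV plotting are left out.

```python
import shlex


def targeted_seq_cmds(sample, r1, r2, ref, targets_bed, known_sites):
    """Return ordered (step, argv) pairs for alignment through variant calling."""
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    # bwa itself expands the literal "\t" sequences in the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        ("align", ["bwa", "mem", "-R", read_group, "-o", sam, ref, r1, r2]),
        # Assumption: coordinate sorting with samtools, not listed in the steps above.
        ("sort", ["samtools", "sort", "-o", sorted_bam, sam]),
        ("mark_duplicates", [
            "picard", "MarkDuplicates",
            f"I={sorted_bam}", f"O={dedup_bam}", f"M={sample}.dup_metrics.txt",
            "REMOVE_DUPLICATES=true", "CREATE_INDEX=true",
        ]),
        ("base_recalibrator", [
            "gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
            "--known-sites", known_sites, "-O", recal_table,
        ]),
        ("apply_bqsr", [
            "gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
            "--bqsr-recal-file", recal_table, "-O", recal_bam,
        ]),
        # Mean depth per target interval; bedtools writes the result to stdout.
        ("coverage", ["bedtools", "coverage", "-a", targets_bed, "-b", recal_bam, "-mean"]),
        ("mutect2", [
            "gatk", "Mutect2", "-R", ref, "-I", recal_bam,
            "-L", targets_bed, "-O", f"{sample}.mutect2.vcf.gz",
        ]),
        ("haplotype_caller", [
            "gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
            "-L", targets_bed, "-O", f"{sample}.hc.vcf.gz",
        ]),
    ]


if __name__ == "__main__":
    # Placeholder inputs (assumptions) used only to print the commands.
    steps = targeted_seq_cmds(
        "sample1", "sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
        "ref.fa", "targets.bed", "known_sites.vcf.gz",
    )
    for name, argv in steps:
        print(f"# {name}")
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before handing each list to `subprocess.run` or a workflow manager.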