This pipeline processes whole-exome sequencing data from FASTQ to annotated VCF.
The pipeline is deployed on a local university cluster.
This repository is intended for the author's personal use. Version 08.16
The main steps include:
- Source FASTQ import, trimming with Cutadapt and quality assessment with FastQC
- Alignment against b37 using BWA-MEM or BWA-Backtrack
- BAM file cleaning, merging, deduplication, preprocessing and QC (samtools, Picard, GATK, Qualimap etc.)
- Variant calling with GATK HaplotypeCaller in GVCF mode
- Variant assessment (custom R scripts and samtools vcfstats)
- Variant filtering by VQSR and custom hard filters (DP and QUAL)
- Variant annotation using kgen, ExAC and VEP
- Export of annotated variants to plain text files for downstream analysis in R.
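The hard-filtering step above can be illustrated with a minimal sketch. It is not the pipeline's own code: the function names, the thresholds (`min_dp=10`, `min_qual=30`) and the choice to read DP from the INFO column are all assumptions made for illustration.

```python
def passes_hard_filter(vcf_line, min_dp=10.0, min_qual=30.0):
    """Return True if a VCF data line meets the assumed DP and QUAL thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual_text, info_text = fields[5], fields[7]
    if qual_text == ".":
        return False  # missing QUAL is treated as a failure in this sketch
    # Parse INFO into a dictionary; flag entries without '=' get an empty value.
    info = {}
    for entry in info_text.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    dp_text = info.get("DP", "")
    if not dp_text:
        return False  # missing DP is treated as a failure in this sketch
    return float(qual_text) >= min_qual and float(dp_text) >= min_dp


def filter_vcf_lines(lines, min_dp=10.0, min_qual=30.0):
    """Keep header lines unchanged and drop records that fail the thresholds."""
    return [
        line for line in lines
        if line.startswith("#") or passes_hard_filter(line, min_dp, min_qual)
    ]
```

The thresholds are passed as parameters so that they can be adjusted after reviewing the metrics from the variant assessment step.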
Code is split into modules (steps), located in folders with self-explanatory names.
After each step the user assesses the results (metrics produced by FastQC, Picard, Qualimap, VQSR, vcfstats, VEP etc.) before taking the analysis to the next step.
Each step is started by the launcher script, which takes a job description file (templates are kept in the folder with the job description templates).
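As a rough illustration of how a launcher might read a job description file, the sketch below parses simple `key: value` lines. The file format, the `#` comment convention and the key names used in the example are hypothetical; the actual templates in this repository may look different.

```python
def parse_job_description(text):
    """Parse 'key: value' lines into a dict, ignoring blanks and '#' comments.

    The format is an assumption made for this sketch, not the pipeline's own.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"Expected 'key: value', got: {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical job description; none of these keys come from the repository.
example = """
# Job description for the alignment step
project: example_project
aligner: bwa-mem
data_type: PE
"""
job = parse_job_description(example)
print(job["aligner"])
```

Keeping the settings in a separate file means that the same step scripts can be reused across datasets by editing only the job description.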
Updates in version 08.16:
- Removed option for importing CRUK data
- Updated which results are kept on the HPC (logs etc.)
- Made adaptor trimming optional (cutadapt)
- Base quality fastq trimming from both ends (cutadapt)
- Allow for PE and SE data
- Allow for BWA-MEM and BWA-Backtrack
- Additional BAM checks and cleaning after BWA
- Updated variable names for target folders / intervals
- Removed resource monitoring
- Use a padding of 10 in all GATK steps
- Split and process multi-allelic variants
- Added ExAC annotations
- Added kgen AFs instead of masks (removed the tables with masks)
- Removed PDF reports (keep HTML only)
- Removed TXT outputs and some fields in VEP
- Removed "clean" and "full" outputs at some steps (keep only one output)
- Added some cleaning to exported VV table
- Switched to the latest version of GATK in most steps
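The idea of splitting multi-allelic variants, mentioned in the list above, can be sketched as follows. This is a simplified illustration and not the method used in the pipeline: it only splits the ALT column and the INFO fields assumed to hold one value per alternate allele (`AC` and `AF` here), and it does not rewrite genotype columns or normalise alleles.

```python
PER_ALT_KEYS = {"AC", "AF"}  # assumed to carry one value per ALT allele


def split_multiallelic(vcf_line, per_alt_keys=PER_ALT_KEYS):
    """Return one VCF record per ALT allele (sample columns are copied as-is)."""
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if len(alts) == 1:
        return [vcf_line.rstrip("\n")]  # already bi-allelic, nothing to split
    records = []
    for index, alt in enumerate(alts):
        info_out = []
        for entry in fields[7].split(";"):
            key, separator, value = entry.partition("=")
            values = value.split(",")
            # Pick the value that belongs to this ALT allele, if applicable.
            if separator and key in per_alt_keys and len(values) == len(alts):
                entry = f"{key}={values[index]}"
            info_out.append(entry)
        new_fields = (
            fields[:4] + [alt] + fields[5:7] + [";".join(info_out)] + fields[8:]
        )
        records.append("\t".join(new_fields))
    return records
```

After such a split each record describes a single alternate allele, which makes per-allele annotation and export to flat tables simpler.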