### Variant calling module

**CMM262, Winter 2022**

Kyle Gaulton, kgaulton@gmail.com
<br>
<br>
<br>

In this walkthrough we will call, filter, and annotate genetic variants from a sequence alignment file.

**Required Files in resources/:**<br>
*Human hg38 chr20 reference*<br>
chr20.fa.gz<br>
chr20.dict, chr20.fa.fai, chr20.fa.gzi
<br><br>
*Variant call sets*<br>
resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz<br>
resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz.tbi<br>
resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz<br>
resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz.tbi<br>
resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz<br>
resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz.tbi
<br><br>
*Annotation scripts*<br>
annovar/
   table_annovar.pl
   annotate_variation.pl
   humandb/*


**Download and prepare .bam file for genotyping**

In [None]:
# download a 500 kb region of chr20 from the 1000 Genomes CRAM for sample HG00249 and write it as BAM
/opt/conda/envs/r-bio/bin/samtools view -h -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/GBR/HG00249/alignment/HG00249.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram chr20:30000000-30500000 > HG00249.bam


In [None]:
# index bam file
/opt/conda/envs/r-bio/bin/samtools index HG00249.bam


In [None]:
# list directory contents
ls -la


In [None]:
# view the header (-h) and just the reads overlapping the first 1000 bases of the region
/opt/conda/envs/r-bio/bin/samtools view -h HG00249.bam chr20:30000000-30001000
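
The second column of each record printed by `samtools view` is the bitwise FLAG field. As a minimal sketch of how to read it, the snippet below decodes a FLAG value in plain shell. `decode_flag` is a hypothetical helper written for this walkthrough, not part of samtools, and the short labels are our own; the bit values follow the SAM specification. On a system with samtools, `samtools flags 99` gives a comparable answer.

```shell
# decode_flag: illustrative helper (not a samtools command) that lists
# which SAM FLAG bits are set in the integer passed as $1
decode_flag() {
  flag=$1
  out=""
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
              16:reverse 32:mate_reverse 64:read1 128:read2 \
              256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}    # numeric bit value before the colon
    name=${pair#*:}    # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then
      out="$out $name"
    fi
  done
  echo "${out# }"      # drop the leading space
}

decode_flag 99    # prints: paired proper_pair mate_reverse read1
```

Bit 1024 is the one that duplicate marking sets later in this walkthrough.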


In [None]:
# summarize alignments
/opt/conda/envs/r-bio/bin/samtools flagstat HG00249.bam


In [None]:
## prepare the alignment for duplicate marking/removal
# sort by read name so that mates are adjacent (required by fixmate)
/opt/conda/envs/r-bio/bin/samtools sort -n -o HG00249.sort.bam HG00249.bam
# fill in mate information and add mate score tags (-m), which markdup uses to choose the read to keep
/opt/conda/envs/r-bio/bin/samtools fixmate -m HG00249.sort.bam HG00249.sort.fixed.bam
# sort by coordinate again (required by markdup and for indexing)
/opt/conda/envs/r-bio/bin/samtools sort -o HG00249.resort.bam HG00249.sort.fixed.bam


In [None]:
# keep only alignments with a mapping quality (MAPQ) of at least 30
/opt/conda/envs/r-bio/bin/samtools view -b -q 30 -o HG00249.filter.bam HG00249.resort.bam
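
To make the `-q 30` filter concrete: MAPQ is the fifth column of a SAM record, and the filter keeps records where that value is at least 30. The sketch below applies the same rule with `awk` to a few made-up records (the read names, flags, and positions are invented for illustration and only the first five columns are shown).

```shell
# Toy SAM text: one header line plus three invented records (tab-separated)
toy_sam=$(
  printf '@SQ\tSN:chr20\tLN:64444167\n'
  printf 'read1\t99\tchr20\t30000100\t60\n'
  printf 'read2\t163\tchr20\t30000200\t12\n'
  printf 'read3\t83\tchr20\t30000300\t30\n'
)

# Skip header lines (they start with @) and keep records with MAPQ >= 30
kept_reads=$(printf '%s\n' "$toy_sam" \
  | awk -F'\t' '!/^@/ && $5 >= 30 { print $1 }' \
  | paste -s -d ' ' -)

echo "kept: $kept_reads"    # prints: kept: read1 read3
```

The threshold is inclusive, so `read3` with MAPQ exactly 30 is retained.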


In [None]:
# index bam
/opt/conda/envs/r-bio/bin/samtools index HG00249.filter.bam

In [None]:
# view stats for bam with quality filter
/opt/conda/envs/r-bio/bin/samtools flagstat HG00249.filter.bam


In [None]:
## remove duplicate reads (-r); without -r markdup only flags them, which is usually the safer choice
/opt/conda/envs/r-bio/bin/samtools markdup -r HG00249.filter.bam HG00249.rmdup.bam


In [None]:
## index bam
/opt/conda/envs/r-bio/bin/samtools index HG00249.rmdup.bam


In [None]:
## view stats for bam with duplicates removed
/opt/conda/envs/r-bio/bin/samtools flagstat HG00249.rmdup.bam


In [None]:
## view pileup of read counts per base
/opt/conda/envs/r-bio/bin/samtools mpileup -f chr20.fa.gz HG00249.rmdup.bam


**Call genetic variants with bcftools**

In [None]:
## call variants using bcftools and output to VCF
/opt/conda/envs/py-bio/bin/bcftools mpileup -Ou -f chr20.fa.gz HG00249.rmdup.bam | /opt/conda/envs/py-bio/bin/bcftools call -mv -Ov -o HG00249.bcftools.vcf


In [None]:
## keep variant calls with a quality score (QUAL) of at least 20
/opt/conda/envs/py-bio/bin/bcftools view -i 'QUAL>=20' HG00249.bcftools.vcf > HG00249.bcftools.filter.vcf
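
The QUAL filter above works on the sixth column of the VCF. As a rough illustration of what it keeps, the following sketch applies the same threshold with `awk` to a made-up three-record VCF (positions and scores are invented). It is not a replacement for bcftools, which also handles missing QUAL values and richer expressions.

```shell
# Toy VCF: two header lines plus three invented records (tab-separated)
toy_vcf=$(
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr20\t30000150\t.\tA\tG\t48.2\t.\t.\n'
  printf 'chr20\t30000420\t.\tC\tT\t7.3\t.\t.\n'
  printf 'chr20\t30000910\t.\tG\tGA\t20\t.\t.\n'
)

# Keep every header line, and records whose QUAL (column 6) is >= 20
passing=$(printf '%s\n' "$toy_vcf" | awk -F'\t' '/^#/ || $6 >= 20')

# Count the records (non-header lines) that survived
n_pass=$(printf '%s\n' "$passing" | grep -vc '^#')
echo "records with QUAL >= 20: $n_pass"    # prints: records with QUAL >= 20: 2
```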


In [None]:
# view top of file
head -n 5000 HG00249.bcftools.filter.vcf

In [None]:
# view summary of variant calls
/opt/conda/envs/py-bio/bin/bcftools stats HG00249.bcftools.filter.vcf
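
Among other things, `bcftools stats` reports counts of SNPs and indels and the transition/transversion (ts/tv) balance. The sketch below shows one simplified way such counts can be derived from the REF and ALT alleles, using made-up calls with only four columns (chromosome, position, REF, ALT). It assumes biallelic records; a multi-allelic ALT such as `G,T` would be miscounted here as an indel.

```shell
# Toy calls (invented): two transitions, one transversion, one insertion
toy_calls=$(
  printf 'chr20\t30000150\tA\tG\n'
  printf 'chr20\t30000420\tC\tT\n'
  printf 'chr20\t30000777\tA\tC\n'
  printf 'chr20\t30000910\tG\tGA\n'
)

summary=$(printf '%s\n' "$toy_calls" | awk -F'\t' '
  # alleles of unequal or multi-base length are treated as indels
  length($3) != 1 || length($4) != 1 { indel++; next }
  { pair = $3 $4 }
  # transitions are purine<->purine (A,G) or pyrimidine<->pyrimidine (C,T)
  pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC" { ts++; next }
  # every other single-base change is a transversion
  { tv++ }
  END { printf "ts=%d tv=%d indels=%d\n", ts, tv, indel }
')

echo "$summary"    # prints: ts=2 tv=1 indels=1
```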

**Call genetic variants using GATK**

In [None]:
# see all of the tools available in GATK
/opt/conda/envs/r-bio/bin/gatk --list

In [None]:
# base quality score recalibration (BQSR): build the recalibration table, then apply it to the BAM
/opt/conda/envs/r-bio/bin/gatk BaseRecalibrator -I HG00249.rmdup.bam -R chr20.fa.gz --known-sites resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz -O recal.table
/opt/conda/envs/r-bio/bin/gatk ApplyBQSR -R chr20.fa.gz -I HG00249.rmdup.bam --bqsr-recal-file recal.table -O HG00249.rmdup.recal.bam


In [None]:
# view the first 142 lines of the recalibration table
head -n 142 recal.table

In [None]:
# create initial call set using GATK HaplotypeCaller
/opt/conda/envs/r-bio/bin/gatk HaplotypeCaller -I HG00249.rmdup.recal.bam -O HG00249.gatk.vcf -R chr20.fa.gz


In [None]:
# check stats of initial VCF file
/opt/conda/envs/py-bio/bin/bcftools stats HG00249.gatk.vcf


In [None]:
# variant quality score recalibration (VQSR): fit the model on SNPs, then filter at 90% truth sensitivity
/opt/conda/envs/r-bio/bin/gatk VariantRecalibrator -R chr20.fa.gz -V HG00249.gatk.vcf --resource:hapmap,known=false,training=true,truth=true,prior=15.0 resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz --resource:omni,known=false,training=true,truth=false,prior=12.0 resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz --resource:1000G,known=false,training=true,truth=false,prior=10.0 resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz -an QD -an FS -mode SNP -O recal.var --tranches-file output.tranches --rscript-file output.plots.R
/opt/conda/envs/r-bio/bin/gatk ApplyVQSR -R chr20.fa.gz -V HG00249.gatk.vcf -O HG00249.gatk.filter.vcf --truth-sensitivity-filter-level 90.0 --tranches-file output.tranches --recal-file recal.var -mode SNP
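
Note that by default ApplyVQSR does not delete records: it writes `PASS` or a filter name into the FILTER column (column 7) and leaves failing variants in the output. Statistics computed on the whole file therefore include filtered calls. The sketch below tallies the FILTER column on a made-up VCF; the filter label `LowVQSLOD_example` is a placeholder, not the name GATK actually writes. With bcftools available, `bcftools view -f PASS` is one way to keep only passing records.

```shell
# Toy VQSR-style output (invented records; FILTER is column 7)
toy_vqsr=$(
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr20\t30000150\t.\tA\tG\t412.6\tPASS\t.\n'
  printf 'chr20\t30000420\t.\tC\tT\t35.1\tLowVQSLOD_example\t.\n'
  printf 'chr20\t30000910\t.\tG\tA\t288.0\tPASS\t.\n'
)

filter_tally=$(printf '%s\n' "$toy_vqsr" | awk -F'\t' '
  /^#/ { next }                    # skip header lines
  $7 == "PASS" { pass++; next }    # records that passed the filter
  { fail++ }                       # anything else was flagged
  END { printf "total=%d pass=%d filtered=%d\n", pass + fail, pass, fail }
')

echo "$filter_tally"    # prints: total=3 pass=2 filtered=1
```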


In [None]:
# check stats of filtered VCF file
/opt/conda/envs/py-bio/bin/bcftools stats HG00249.gatk.filter.vcf


**Annotate genetic variants**

In [None]:
# run variant annotation
perl annovar/table_annovar.pl HG00249.bcftools.filter.vcf annovar/humandb/ -buildver hg38 -out HG00249 -remove -protocol refGene -operation g -nastring . -vcfinput


In [None]:
# list the outputs: a VCF with the annotations included and a tab-delimited text file of annotations
ls -la *multianno*

In [None]:
# pull out variants annotated as upstream of a gene (within 1 kb of the transcription start site by ANNOVAR's default), a rough proxy for promoter variants
grep 'upstream' HG00249.hg38_multianno.txt
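
A plain `grep` matches the word anywhere on the line. As a slightly stricter sketch, the snippet below looks up the `Func.refGene` column by its header name and matches only that field, including combined values such as `upstream;downstream`. The table is a made-up, truncated stand-in for the multianno file, and the gene names (including the contrived one containing the word "upstream") are hypothetical.

```shell
# Toy annotation table (invented rows; real files have more columns)
toy_anno=$(
  printf 'Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n'
  printf 'chr20\t30000150\t30000150\tA\tG\tintronic\tGENE_A\n'
  printf 'chr20\t30000420\t30000420\tC\tT\tupstream\tGENE_B\n'
  printf 'chr20\t30000910\t30000910\tG\tA\tupstream;downstream\tGENE_B;GENE_C\n'
  printf 'chr20\t30001200\t30001200\tT\tC\tintergenic\tGENE_upstream_like\n'
)

# Substring match over whole lines, as grep does
naive_count=$(printf '%s\n' "$toy_anno" | grep -c 'upstream')

# Field-aware match: find the Func.refGene column, then test each
# semicolon-separated value in that column only
upstream_hits=$(printf '%s\n' "$toy_anno" | awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Func.refGene") col = i; next }
  col {
    n = split($col, parts, ";")
    for (j = 1; j <= n; j++) if (parts[j] == "upstream") { print $1 ":" $2; break }
  }
' | paste -s -d ' ' -)

echo "grep matched $naive_count lines; field match: $upstream_hits"
```

In this toy table the line-level match also picks up the last row because of its gene name, while the field-level match returns only the two upstream variants.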


**Convert genotypes to tab-delimited file**

In [None]:
# compress VCFs with bgzip and index them with tabix (-p vcf selects the VCF preset)
bgzip HG00249.bcftools.filter.vcf
tabix -p vcf HG00249.bcftools.filter.vcf.gz

bgzip HG00249.gatk.filter.vcf
tabix -p vcf HG00249.gatk.filter.vcf.gz

In [None]:
# convert the GATK VCF to a tab-delimited text file with one row per sample genotype (-g)
/opt/conda/envs/variant_calling/bin/vcf2tsv -g HG00249.gatk.filter.vcf.gz > HG00249.gatk.filter.txt