### Variant calling module

**CMM262, Winter 2024**

Kyle Gaulton, kgaulton@health.ucsd.edu
<br>
<br>

<b>In this walkthrough we will be calling and filtering genetic variants from a sequence alignment file</b>
<br><br>
<b><u>Required Files in resources:</u></b><br>
*<b>Human hg38 chr20 reference</b>*<br>
chr20.fa.gz, chr20.dict, chr20.fa.fai, chr20.fa.gzi  
<br>
*<b>Variant call sets</b>*<br>
resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz  
resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz.tbi  
resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz  
resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz.tbi  
resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz  
resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz.tbi  
<br>
*<b>Annotation scripts</b>*<br>
annovar/
   table_annovar.pl
   annotate_variation.pl
   humandb/*


<br>
<b><u>Download and prepare alignment file for genotyping</u></b>
<br><br>
Here we will use samtools to extract reads aligned to a part of chromosome 20 from a 1000 Genomes Project alignment file (in CRAM format) hosted remotely, and save this alignment to a local BAM file.   

In [1]:
samtools view -h -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/GBR/HG00249/alignment/HG00249.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram chr20:30000000-30500000 > HG00249.bam



<br>
Next we will use samtools to index the BAM file so that it can be used in downstream analysis tools

In [None]:
samtools index HG00249.bam


<br>
Let's view the contents of the directory to see what files we have

In [None]:
ls -la


<br>
Next we will use samtools to print out reads mapping to just the first 1000 bases of the extracted region (chr20:30000000-30001000) so we can examine the alignments

In [None]:
samtools view -h HG00249.bam chr20:30000000-30001000
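
The second column of each alignment line printed above is the bitwise FLAG defined in the SAM format. As a minimal sketch, the function below decodes a few of those bits; `decode_flag` is a hypothetical helper written for this walkthrough, not part of samtools (the built-in `samtools flags` command offers a similar lookup).

```shell
# decode_flag (hypothetical helper): print the names of selected SAM FLAG
# bits that are set in the integer given as $1.
decode_flag() {
    flag=$1
    out=""
    # each entry is bit_value:name, following the SAM specification
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 1024:duplicate; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( flag & bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

decode_flag 99     # prints: paired proper_pair mate_reverse read1
decode_flag 1097   # prints: paired mate_unmapped read1 duplicate
```

A FLAG such as 1097 is the kind of value we expect for reads whose mate fell outside the extracted region.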


<br>
And summarize the properties of the alignments using flagstat in samtools

In [None]:
samtools flagstat HG00249.bam


<br>
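Next we will run several commands to repair the mate information for paired reads so that we can then perform duplicate marking/removal. Since we extracted just a small portion of the chromosome, some reads will no longer have their mate in the file. The 'fixmate' command needs reads sorted by name, so we first sort by name, then fix the mate information (the -m option adds the mate score tags used later by 'markdup'), and finally re-sort by genomic coordinate 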

In [None]:
samtools sort -n -o HG00249.sort.bam HG00249.bam
samtools fixmate -m HG00249.sort.bam HG00249.sort.fixed.bam
samtools sort -o HG00249.resort.bam HG00249.sort.fixed.bam


<br>
Next we will filter alignments to remove those with low mapping confidence - keeping only reads with a mapping quality (MAPQ) of at least 30

In [None]:
samtools view -b -q 30 -o HG00249.filter.bam HG00249.resort.bam
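
Mapping qualities are Phred-scaled, so a score Q corresponds to an error probability of 10^(-Q/10). A small sketch of that conversion is shown here; `phred_to_prob` is a hypothetical helper name used only for illustration.

```shell
# phred_to_prob (hypothetical helper): convert a Phred-scaled score Q
# into the probability 10^(-Q/10) that the call (here, the mapping) is wrong.
phred_to_prob() {
    LC_ALL=C awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_prob 30   # prints 0.0010 (1 in 1000 chance the read is misplaced)
phred_to_prob 20   # prints 0.0100
```

With a threshold of 30 we therefore keep reads whose estimated chance of being mapped to the wrong location is at most about 1 in 1000.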


<br>
We need to index the new filtered BAM file before duplicate marking/removal

In [None]:
samtools index HG00249.filter.bam


<br>
Summarize the properties of the alignments in the filtered BAM using samtools - compare to the previous unfiltered BAM

In [None]:
samtools flagstat HG00249.filter.bam


<br>
Remove duplicate reads from the filtered BAM and save to a new BAM file (we could instead have 'marked' duplicates, which would have kept them in the BAM file and only set the duplicate bit in their flag)

In [None]:
samtools markdup -r HG00249.filter.bam HG00249.rmdup.bam
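
Marking a duplicate means setting bit 1024 in the FLAG column. As a rough illustration of what marking (rather than removing) would leave behind, the sketch below counts duplicate-flagged records in SAM text; `count_duplicates` is a hypothetical helper and the records are made up.

```shell
# count_duplicates (hypothetical helper): read SAM text on stdin and count
# alignment records whose FLAG (column 2) has the duplicate bit 1024 set.
# Lines starting with "@" are treated as header lines and skipped.
count_duplicates() {
    awk -F '\t' '!/^@/ { if (int($2 / 1024) % 2 == 1) n++ } END { print n + 0 }'
}

# Toy records (made up): read name, FLAG, chromosome
printf '@HD\tVN:1.6\nr1\t99\tchr20\nr2\t1123\tchr20\nr3\t147\tchr20\nr4\t1171\tchr20\n' \
    | count_duplicates   # prints 2
```

On a real BAM in which duplicates were marked instead of removed, `samtools view -c -f 1024` on that file should report the same kind of count.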


<br>
Index the new filtered, de-duped BAM file

In [None]:
samtools index HG00249.rmdup.bam


<br>
Summarize properties of alignments in filtered, de-duped BAM file

In [None]:
samtools flagstat HG00249.rmdup.bam


<br>
View pileup of filtered, de-duped read counts for each genomic position in the BAM file

In [None]:
samtools mpileup -f chr20.fa.gz HG00249.rmdup.bam
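
Each pileup line lists the chromosome, position, reference base, number of covering reads, the read bases and their qualities. A minimal sketch for summarizing the depth column follows; `mean_depth` is a hypothetical helper and the three input lines are invented.

```shell
# mean_depth (hypothetical helper): read pileup-style text on stdin and print
# the mean of column 4, the number of reads covering each reported position.
mean_depth() {
    LC_ALL=C awk -F '\t' '{ sum += $4; n++ }
        END { if (n) printf "%.2f\n", sum / n; else print "NA" }'
}

# Made-up pileup lines: chrom, position, ref base, depth, bases, qualities
printf 'chr20\t30000001\tA\t3\t...\tIII\nchr20\t30000002\tC\t5\t..,,.\tIIIII\nchr20\t30000003\tG\t4\t.,.T\tIIII\n' \
    | mean_depth   # prints 4.00
```

The same function could be placed at the end of a pipe from the mpileup command above. Note that, by default, only covered positions are reported, so this is the mean depth over covered bases.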


<br>
<b><u>Call genetic variants from alignment with bcftools</u></b>
<br><br>
From the filtered, de-duped BAM file - we will next identify genomic positions which are polymorphic in the sample
<br><br>
We will first use bcftools: the 'mpileup' command computes genotype likelihoods at each position, and its output is piped to the 'call' command, which writes the variant calls to a VCF file

In [None]:
bcftools mpileup -Ou -f chr20.fa.gz HG00249.rmdup.bam | bcftools call -mv -Ov -o HG00249.bcftools.vcf


<br>
Filter bcftools variant calls to keep those with a quality score of at least 20 and output to a filtered VCF file

In [None]:
bcftools view -i '%QUAL>=20' HG00249.bcftools.vcf > HG00249.bcftools.filter.vcf
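
To make the filter concrete, here is a sketch of the same idea in plain awk applied to a tiny made-up VCF. `filter_qual` is a hypothetical stand-in for illustration, not a replacement for bcftools.

```shell
# filter_qual (hypothetical helper): keep header lines (starting with "#")
# and records whose QUAL (column 6) is at least the threshold given as $1.
# Records with a missing QUAL (".") are dropped.
filter_qual() {
    awk -F '\t' -v min="$1" '/^#/ || ($6 != "." && $6 + 0 >= min)'
}

# Made-up records with QUAL 45.2, 12.7 and exactly 20
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr20\t30000100\t.\tA\tG\t45.2\t.\t.\nchr20\t30000250\t.\tC\tT\t12.7\t.\t.\nchr20\t30000400\t.\tG\tGA\t20\t.\t.\n' \
    | filter_qual 20
```

The record with QUAL 12.7 is dropped, while the one with QUAL exactly 20 is kept because the comparison is "greater than or equal to".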


<br>
Examine the first 5000 lines of the filtered VCF file - see what is in the header and the variant call lines

In [None]:
head -n 5000 HG00249.bcftools.filter.vcf


<br>
Summarize properties of the variant calls in the filtered VCF

In [None]:
bcftools stats HG00249.bcftools.filter.vcf
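
Among other things, the stats output separates SNPs from indels. A simplified sketch of that distinction is shown below; `count_types` is a hypothetical helper that only looks at the first ALT allele, so it will not match bcftools exactly at multi-allelic sites.

```shell
# count_types (hypothetical helper): tally VCF records on stdin as SNPs when
# REF and the first ALT allele are both a single base, otherwise as indels.
# Simplification: multi-allelic and symbolic alleles are not treated specially.
count_types() {
    awk -F '\t' '!/^#/ {
        split($5, alt, ",")
        if (length($4) == 1 && length(alt[1]) == 1) snp++; else indel++
    } END { printf "SNPs=%d indels=%d\n", snp, indel }'
}

# Made-up records: two single-base substitutions and one insertion
printf '#CHROM\tPOS\tID\tREF\tALT\nchr20\t30000100\t.\tA\tG\nchr20\t30000250\t.\tC\tT\nchr20\t30000400\t.\tG\tGA\n' \
    | count_types   # prints SNPs=2 indels=1
```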


<br>
<b><u>Call genetic variants using GATK</u></b>
<br><br>
First let's list out all of the tools that are available in GATK 

In [None]:
gatk --list


<br>
We will use the base recalibration tool to update the base quality scores based on comparison to known variant positions.  First, we use the BaseRecalibrator function which estimates the true error rate of bases in quality score bins.  Second, we use the output to update the quality scores in the BAM file

In [None]:
gatk BaseRecalibrator -I HG00249.rmdup.bam -R chr20.fa.gz --known-sites resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz -O recal.table
gatk ApplyBQSR -R chr20.fa.gz -I HG00249.rmdup.bam --bqsr-recal-file recal.table -O HG00249.rmdup.recal.bam


<br>
If we look at the output of BaseRecalibrator it shows the error rate of the original quality scores

In [None]:
head -n 142 recal.table
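
The idea behind the empirical qualities can be sketched with a simple calculation: if a group of bases shows a given number of mismatches at positions not known to be variant, the observed error rate can be converted back to a Phred score. This is a simplification under that assumption (GATK's own estimate is more involved), and `empirical_quality` is a hypothetical helper.

```shell
# empirical_quality (hypothetical helper): given the number of observed bases
# ($1) and the number of mismatches among them ($2), print the Phred-scaled
# quality -10 * log10(errors / observations).
# Simplified: no prior or smoothing is applied, unlike a real recalibration.
empirical_quality() {
    LC_ALL=C awk -v obs="$1" -v err="$2" 'BEGIN {
        if (obs <= 0 || err <= 0) { print "NA"; exit }
        printf "%.1f\n", -10 * log(err / obs) / log(10)
    }'
}

empirical_quality 100000 100    # prints 30.0 (1 mismatch per 1000 bases)
empirical_quality 100000 1000   # prints 20.0 (1 mismatch per 100 bases)
```

If bases reported as Q30 by the sequencer behaved like the second case, recalibration would lower their scores toward 20.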

<br>
Next we will use the BAM file with the recalibrated quality scores to call an initial set of variants using GATK HaplotypeCaller

In [None]:
gatk HaplotypeCaller -I HG00249.rmdup.recal.bam -O HG00249.gatk.vcf -R chr20.fa.gz


<br>
Then we will summarize the properties of this initial variant call set

In [None]:
bcftools stats HG00249.gatk.vcf


<br>
Next we will perform recalibration of variant quality scores and filtering.  First we will use the VariantRecalibrator command to model how variant annotations (here QD and FS) differ between known variant positions and likely errors.  Next we will use the output in ApplyVQSR to update the variant quality scores and produce a filtered VCF, using a truth-sensitivity level of 90%

In [None]:
gatk VariantRecalibrator -R chr20.fa.gz -V HG00249.gatk.vcf --resource:hapmap,known=false,training=true,truth=true,prior=15.0 resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz --resource:omni,known=false,training=true,truth=false,prior=12.0 resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz --resource:1000G,known=false,training=true,truth=false,prior=10.0 resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz -an QD -an FS -mode SNP -O recal.var --tranches-file output.tranches --rscript-file output.plots.R
gatk ApplyVQSR -R chr20.fa.gz -V HG00249.gatk.vcf -O HG00249.gatk.filter.vcf --truth-sensitivity-filter-level 90.0 --tranches-file output.tranches --recal-file recal.var -mode SNP
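
By default, ApplyVQSR keeps variants that fail the threshold in the output file and labels them in the FILTER column rather than deleting them. A minimal sketch for tallying those labels is shown here; `filter_summary` is a hypothetical helper, and the tranche label in the toy data is only illustrative of the kind of value to expect.

```shell
# filter_summary (hypothetical helper): count VCF records on stdin by the
# value of their FILTER field (column 7), sorted by label.
filter_summary() {
    awk -F '\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' \
        | LC_ALL=C sort
}

# Made-up records: two passing variants and one below the threshold
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\nchr20\t30000100\t.\tA\tG\t310.6\tPASS\nchr20\t30000250\t.\tC\tT\t35.1\tVQSRTrancheSNP90.00to100.00\nchr20\t30000400\t.\tG\tA\t220.0\tPASS\n' \
    | filter_summary
```

To keep only the passing records, the file can be subset with `bcftools view -f PASS`.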


<br>
Then we will summarize the properties of this filtered variant call set

In [None]:
bcftools stats HG00249.gatk.filter.vcf


<br>
<b><u>Convert genotypes to tab-delimited file</u></b>
<br><br>
Compress the VCFs and then use the 'tabix' command to index the VCFs

In [None]:
# compress each VCF with bgzip, then index it with tabix using the VCF preset
bgzip HG00249.bcftools.filter.vcf
tabix -p vcf HG00249.bcftools.filter.vcf.gz

bgzip HG00249.gatk.filter.vcf
tabix -p vcf HG00249.gatk.filter.vcf.gz

<br>
Output tab-delimited text file that can be used for additional analyses

In [None]:
# output text file
vcf2tsv -g HG00249.gatk.filter.vcf.gz > HG00249.gatk.filter.txt
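
Once the genotypes are in a tab-delimited file, standard text tools are enough for quick summaries. The sketch below pulls out one column by its header name; `tsv_column` is a hypothetical helper, and the column names in the toy input (CHROM, POS, QUAL) are an assumption about the header, which should be checked against the first line of the real file.

```shell
# tsv_column (hypothetical helper): print the values of the column named $1
# from tab-delimited text on stdin, locating the column via the header line.
tsv_column() {
    awk -F '\t' -v name="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next
        }
        { print $col }'
}

# Made-up table with an assumed header
printf 'CHROM\tPOS\tQUAL\nchr20\t30000100\t45.2\nchr20\t30000250\t12.7\n' \
    | tsv_column QUAL
```

For example, `tsv_column QUAL < HG00249.gatk.filter.txt` would list the quality values, provided the file has a column with that name.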