### Variant calling module

**CMM262, Winter 2023**

Kyle Gaulton, kgaulton@health.ucsd.edu
<br>
<br>

Note: this notebook should be run using the `Bash` kernel.

<b>In this walkthrough we will be calling, filtering and annotating genetic variants from a sequence alignment file</b>
<br><br>
<b><u>Required Files in resources:</u></b><br>
*Human hg38 chr20 reference*<br>
chr20.fa.gz 
chr20.dict, chr20.fa.gz.fai, chr20.fa.gz.gzi 
<br><br>
*Variant call sets*<br>
resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz 
resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz.tbi 
resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz 
resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz.tbi 
resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz 
resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz.tbi 
<br><br>
*Annotation scripts*<br>
annovar/
   table_annovar.pl
   annotate_variation.pl
   humandb/*


<br>
<b><u>Download and prepare alignment file for genotyping</u></b>
<br><br>
Here we will use samtools to extract reads aligned to part of chromosome 20 (chr20:30000000-30500000) from a remotely hosted 1000 Genomes Project CRAM file, and save them to a local BAM file.

In [1]:
#/opt/conda/envs/r-bio/bin/samtools view -h -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/GBR/HG00249/alignment/HG00249.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram chr20:30000000-30500000 > HG00249.bam
#this command takes too long to run, so we have the file pre-downloaded in the outputs directory!
cp outputs/HG00249.bam .
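
As a quick check on the size of the extracted region, the sketch below parses a samtools-style region string and reports its length. It assumes 1-based, inclusive coordinates; the helper name `region_length` is ours and not part of samtools.

```bash
# Length of a 1-based, inclusive region string of the form "contig:start-end".
# region_length is a hypothetical helper written for this walkthrough.
region_length() {
    range=${1#*:}      # drop the contig name and the colon
    start=${range%-*}  # text before the dash
    end=${range#*-}    # text after the dash
    echo $(( end - start + 1 ))
}

region_length chr20:30000000-30500000   # region extracted for this walkthrough
region_length chr20:30000000-30001000   # sub-region printed later with samtools view
```

Because both ends are included, the second region spans 1,001 positions rather than 1,000.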

<br>
Next we will use samtools to index the BAM file so that downstream analysis tools can use it.

In [2]:
/opt/conda/envs/r-bio/bin/samtools index HG00249.bam


<br>
Let's view the contents of the directory to see what files we have

In [3]:
ls -la 


total 186991
drwxrwx---  5 grader-bggn237-01 root        27 Feb 23 22:36 .
drwx------ 15 grader-bggn237-01 users       25 Feb 18 00:06 ..
drwxr-x---  3 grader-bggn237-01 root         9 Mar  8  2022 annovar
-rwxr-x---  1 grader-bggn237-01 root       169 Mar  8  2022 chr20.dict
-rwxr-x---  1 grader-bggn237-01 root  20832687 Mar  8  2022 chr20.fa.gz
-rwxr-x---  1 grader-bggn237-01 root        23 Mar  8  2022 chr20.fa.gz.fai
-rwxr-x---  1 grader-bggn237-01 root     16104 Mar  8  2022 chr20.fa.gz.gzi
-rw-r-----  1 grader-bggn237-01 root   3149791 Feb 23 22:38 HG00249.bam
-rw-rw----  1 grader-bggn237-01 root     42808 Feb 23 22:38 HG00249.bam.bai
-rw-rw----  1 grader-bggn237-01 root   2254621 Feb 23 22:36 HG00249.filter.bam
-rw-rw----  1 grader-bggn237-01 root     42760 Feb 23 22:36 HG00249.filter.bam.bai
-rw-rw----  1 grader-bggn237-01 root   3254290 Feb 23 22:36 HG00249.resort.bam
-rw-rw----  1 grader-bggn237-01 root   2228537 Feb 23 22:36 HG00249.rmdup.bam
-rw-rw----  1 grader-bggn237-01 

<br>
Next we will use samtools to print the reads mapping to just the first 1000 bases of the region, keeping only the last ten lines with `tail`, so we can examine the alignments

In [4]:
/opt/conda/envs/r-bio/bin/samtools view -h HG00249.bam chr20:30000000-30001000 | tail


ERR251020.50396426	163	chr20	30000839	39	100M	=	30001209	468	AATCTGAAACTGGATATTTGGAGAGCTTTGAGGCCTGTGGTGAAAAAGGAAACACCTTCACAAAAAAAACTAGAGCAGAAGCATTCTCAGAAACTTCTTT	<<<<BB<<B<BB'7<7<<BBB<B'7<<<B<BB<<<BB<<<<7BBB<<<<<<0<B7BB<BB<BBBBBBBB7B<<BB07<BBBBBBBBFBBB<BB<FBBBB7	AS:i:95	MC:Z:2S98M	MQ:i:39	XS:i:85	MD:Z:23C76	NM:i:1	RG:Z:ERR251020
ERR251019.22009737	163	chr20	30000882	0	100M	=	30001270	488	AAAAGGAAACACCTTCACAAAAAAAACTAGAGCAGAAGCATTCTCAGAAACTTCTTTGTGATGTGTGCATTCAACTCACAGAGTTGAACCTTTTTTTTTG	<<<<BBBBB<B<<B<<B<BBBBBBBB<BBBBBBBBBBBBBBB<B<BBBB<<BB<BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB<BFBBBBBB<BB	AS:i:100	MC:Z:100M	MQ:i:0	XS:i:100	MD:Z:100	NM:i:0	RG:Z:ERR251019
ERR251019.1922307	99	chr20	30000887	0	100M	=	30001261	474	GAAACACCTTCACAAAAAAAACTAGAGCAGAAGCATTCTCAGAAACTTCTTTGTGATGTGTGCATTCAACTCACAGTGTTGAACCTTTTTTTTTGATAGA	<<<B<B<BBBBB<BBBBBBBB<BBBBBBBBBBBBBBBBB<BBBBB<BBB<<BBBB<BBBBBBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBBBB<BB	AS:i:95	MC:Z:100M	MQ:i:0	XS:i:95	MD:Z:76A23	NM:i:1	RG:Z:ERR251019
ERR251019.21636
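
The second column of each alignment line is the bitwise FLAG, and the fifth is the mapping quality. A minimal sketch for decoding a FLAG value into the bits that are set is shown below. The bit values follow the SAM specification, while the function name `decode_flag` and the short labels are our own.

```bash
# Decode a SAM FLAG value (column 2) into the names of the bits that are set.
# Bit values follow the SAM specification; the labels are our own shorthand.
decode_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_flag 163   # FLAG of the first read shown above
decode_flag 99    # FLAG of the third read shown above
```

Both reads are paired, properly paired, and have a mate on the reverse strand; 163 is the second read of its pair and 99 is the first. samtools also provides a `flags` subcommand for the same conversion.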

<br>
And summarize the properties of the alignments using flagstat in samtools

In [5]:
/opt/conda/envs/r-bio/bin/samtools flagstat HG00249.bam


36654 + 0 in total (QC-passed reads + QC-failed reads)
42 + 0 secondary
0 + 0 supplementary
477 + 0 duplicates
36391 + 0 mapped (99.28% : N/A)
36612 + 0 paired in sequencing
18286 + 0 read1
18326 + 0 read2
34003 + 0 properly paired (92.87% : N/A)
36086 + 0 with itself and mate mapped
263 + 0 singletons (0.72% : N/A)
1159 + 0 with mate mapped to a different chr
421 + 0 with mate mapped to a different chr (mapQ>=5)
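
The percentages in this report use different denominators: mapped reads are divided by the total, while properly paired reads and singletons are divided by the reads paired in sequencing. The sketch below recomputes them from the counts above; `pct` is a hypothetical helper name.

```bash
# Counts copied from the flagstat output above.
total=36654
mapped=36391
paired=36612
proper=34003
singletons=263

# pct NUM DEN prints NUM/DEN as a percentage with two decimals.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", 100 * n / d }'
}

echo "mapped:          $(pct "$mapped" "$total")%"
echo "properly paired: $(pct "$proper" "$paired")%"
echo "singletons:      $(pct "$singletons" "$paired")%"
```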


<br>
Next we will run several commands to repair the mate information so that we can then mark or remove duplicates. `samtools fixmate` expects name-sorted input, so we sort by read name, run `fixmate -m` (which adds the mate score tags that `markdup` uses), and then re-sort by coordinate. Since we extracted just a small portion of the chromosome, some reads no longer have their mate in the file.

In [6]:
/opt/conda/envs/r-bio/bin/samtools sort -n -o HG00249.sort.bam HG00249.bam
/opt/conda/envs/r-bio/bin/samtools fixmate -m HG00249.sort.bam HG00249.sort.fixed.bam
/opt/conda/envs/r-bio/bin/samtools sort -o HG00249.resort.bam HG00249.sort.fixed.bam


<br>
Next we will filter alignments to remove those with low mapping quality (MAPQ), using a threshold of 30

In [7]:
/opt/conda/envs/r-bio/bin/samtools view -b -q 30 -o HG00249.filter.bam HG00249.resort.bam
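
Mapping quality is Phred-scaled, so a value Q corresponds to a nominal probability of 10^(-Q/10) that the read is placed at the wrong position. The sketch below shows what a threshold of 30 means under that definition; `phred_to_prob` is a hypothetical helper, and the values are the aligner's estimate rather than a measured error rate.

```bash
# Convert a Phred-scaled quality Q into a probability: p = 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", exp(-q / 10 * log(10)) }'
}

for q in 0 10 20 30 60; do
    printf 'MAPQ %2d -> P(wrong position) = %s\n' "$q" "$(phred_to_prob "$q")"
done
```

A threshold of 30 therefore keeps reads with a nominal misplacement probability of at most 1 in 1,000.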


<br>
Need to index the new filtered BAM file before duplicate marking/removal

In [8]:
/opt/conda/envs/r-bio/bin/samtools index HG00249.filter.bam


<br>
Summarize the properties of the alignments in the filtered BAM using samtools - compare to the previous unfiltered BAM

In [9]:
/opt/conda/envs/r-bio/bin/samtools flagstat HG00249.filter.bam


25416 + 0 in total (QC-passed reads + QC-failed reads)
2 + 0 secondary
0 + 0 supplementary
335 + 0 duplicates
25416 + 0 mapped (100.00% : N/A)
25178 + 0 paired in sequencing
12637 + 0 read1
12541 + 0 read2
24936 + 0 properly paired (99.04% : N/A)
25096 + 0 with itself and mate mapped
82 + 0 singletons (0.33% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


<br>
Remove duplicate reads from the filtered BAM and save the result to a new BAM file. Without `-r`, `markdup` would instead only mark duplicates, keeping them in the BAM file and just setting their duplicate flag

In [10]:
/opt/conda/envs/r-bio/bin/samtools markdup -r HG00249.filter.bam HG00249.rmdup.bam


<br>
Index the new filtered, de-duped BAM file

In [11]:
/opt/conda/envs/r-bio/bin/samtools index HG00249.rmdup.bam


<br>
Summarize properties of alignments in filtered, de-duped BAM file

In [12]:
/opt/conda/envs/r-bio/bin/samtools flagstat HG00249.rmdup.bam


25055 + 0 in total (QC-passed reads + QC-failed reads)
2 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
25055 + 0 mapped (100.00% : N/A)
24834 + 0 paired in sequencing
12462 + 0 read1
12372 + 0 read2
24596 + 0 properly paired (99.04% : N/A)
24756 + 0 with itself and mate mapped
78 + 0 singletons (0.31% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
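
To compare the three flagstat reports side by side, the sketch below tabulates the total read count at each stage and how many reads each step removed. The counts are copied from the outputs above; the function name and stage labels are our own.

```bash
# Total read counts from the three flagstat runs above.
summarize_stages() {
    awk 'BEGIN {
        label[1] = "original";       count[1] = 36654
        label[2] = "MAPQ >= 30";     count[2] = 25416
        label[3] = "duplicates out"; count[3] = 25055
        for (i = 1; i <= 3; i++) {
            removed = (i == 1) ? 0 : count[i - 1] - count[i]
            printf "%-15s %6d reads  %5.1f%% of original  (%d removed)\n",
                   label[i], count[i], 100 * count[i] / count[1], removed
        }
    }'
}

summarize_stages
```

The mapping quality filter accounts for nearly all of the loss; duplicate removal takes out 361 further reads, slightly more than the 335 that carried the duplicate flag in the filtered BAM.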


<br>
View the pileup of the filtered, de-duped reads: one line per genomic position, giving the reference base, read depth, read bases and base qualities

In [13]:
/opt/conda/envs/r-bio/bin/samtools mpileup -f chr20.fa.gz HG00249.rmdup.bam | head -n 20
/opt/conda/envs/r-bio/bin/samtools mpileup -f chr20.fa.gz HG00249.rmdup.bam | tail -n 20

[mpileup] 1 samples in 1 input files
chr20	29999912	t	1	^G.	1
chr20	29999913	t	1	.	5
chr20	29999914	t	1	.	B
chr20	29999915	g	1	.	B
chr20	29999916	t	1	.	B
chr20	29999917	g	1	.	B
chr20	29999918	t	1	.	B
chr20	29999919	t	1	.	B
chr20	29999920	g	1	.	B
chr20	29999921	t	1	.	B
chr20	29999922	g	1	.	B
chr20	29999923	t	1	.	B
chr20	29999924	g	1	.	B
chr20	29999925	c	1	.	B
chr20	29999926	a	1	.	B
chr20	29999927	t	1	.	B
chr20	29999928	t	1	.	B
chr20	29999929	c	1	.	<
chr20	29999930	a	1	.	B
chr20	29999931	a	1	.	B
[mpileup] 1 samples in 1 input files
chr20	30500008	A	2	,.	BB
chr20	30500009	G	2	,.	BB
chr20	30500010	C	2	,.	B<
chr20	30500011	T	2	,.	<B
chr20	30500012	T	2	,.	<7
chr20	30500013	T	2	,$.	<7
chr20	30500014	C	1	.	<
chr20	30500015	C	1	.	7
chr20	30500016	T	1	.	B
chr20	30500017	A	1	.	B
chr20	30500018	G	1	.	<
chr20	30500019	G	1	.	B
chr20	30500020	G	1	.	B
chr20	30500021	A	1	.	B
chr20	30500022	G	1	.	0
chr20	30500023	G	1	.	<
chr20	30500024	G	1	.	B
chr20	30500025	A	1	.	7
chr20	30500026	G	1	.	<
chr20	30500027
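
In the fifth pileup column, `.` and `,` mark bases that match the reference on the forward and reverse strand, `^` followed by one character marks the start of a read, and `$` marks the end of a read. The sketch below counts reference matches per position under those rules. It is a simplification that ignores indel notation, the helper name `count_ref_matches` is ours, and the sample lines use spaces where real pileup output uses tabs.

```bash
# Count reference-matching bases per position from pileup text.
# Columns: contig, position, reference base, depth, read bases, base qualities.
count_ref_matches() {
    awk '{
        bases = $5
        gsub(/\^./, "", bases)            # drop read-start marker plus its mapping-quality character
        gsub(/\$/, "", bases)             # drop read-end marker
        n_ref = gsub(/[.,]/, "", bases)   # "." and "," are matches on the forward and reverse strand
        printf "%s:%s depth=%s ref_matches=%d\n", $1, $2, $4, n_ref
    }'
}

# Three lines copied from the pileup output above.
count_ref_matches <<'EOF'
chr20 29999912 t 1 ^G. 1
chr20 30500012 T 2 ,. <7
chr20 30500013 T 2 ,$. <7
EOF
```

Positions where the match count is lower than the depth are the candidates that the variant callers below examine.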

<br>
<b><u>Call genetic variants from alignment with bcftools</u></b>
<br><br>
From the filtered, de-duped BAM file, we will next identify genomic positions that are polymorphic in the sample
<br><br>
We will first use bcftools: the `mpileup` command computes genotype likelihoods at each position, and its output is piped to the `call` command, which writes the variant calls to a VCF file

In [14]:
/opt/conda/envs/py-bio/bin/bcftools mpileup -Ou -f chr20.fa.gz HG00249.rmdup.bam | /opt/conda/envs/py-bio/bin/bcftools call -mv -Ov -o HG00249.bcftools.vcf


Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250


<br>
Filter bcftools variant calls to those with a quality score (QUAL) of at least 20 and output to a filtered VCF file

In [15]:
/opt/conda/envs/py-bio/bin/bcftools view -i '%QUAL>=20' HG00249.bcftools.vcf > HG00249.bcftools.filter.vcf
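
To make the filter explicit, the sketch below applies the same rule with awk: header lines are kept, and a record is kept only when its QUAL (column 6) reaches the threshold. It is an illustration of the logic rather than a replacement for `bcftools view`. The toy input contains two records abridged from this notebook and one hypothetical low-quality record, and the names `filter_by_qual` and `toy_vcf` are ours.

```bash
# Keep header lines and records whose QUAL (column 6) is at least the threshold.
filter_by_qual() {
    awk -F'\t' -v min="$1" '/^#/ || $6 + 0 >= min'
}

# Toy VCF with shortened INFO fields.
toy_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr20\t30493978\t.\tG\tA\t145.416\t.\tDP=8\n'
    printf 'chr20\t30495492\t.\tA\tC\t34.3353\t.\tDP=7\n'
    printf 'chr20\t30400000\t.\tC\tT\t12.5\t.\tDP=3\n'   # hypothetical low-quality record
}

toy_vcf | filter_by_qual 20
```

QUAL is Phred-scaled, so a threshold of 20 corresponds to a nominal 1 in 100 chance that the site is not actually variant.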


<br>
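Examine the filtered VCF file to see what is in the header and the variant call lines: `head` shows the start of the header, and `head -n 5000 | tail` shows the last ten of the first 5000 lines, which here are variant calls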

In [17]:
#head -n 5000 HG00249.bcftools.filter.vcf
head HG00249.bcftools.filter.vcf
head -n 5000 HG00249.bcftools.filter.vcf | tail

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.15+htslib-1.15.1
##bcftoolsCommand=mpileup -Ou -f chr20.fa.gz HG00249.rmdup.bam
##reference=file://chr20.fa.gz
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
chr20	30493978	.	G	A	145.416	.	DP=8;VDB=0.913132;SGB=-0.651104;MQSBZ=1.9189;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,5,3;MQ=54	GT:PL	1/1:175,24,0
chr20	30494099	.	G	A	44.0738	.	DP=14;VDB=0.249971;SGB=-0.556411;RPBZ=0.141421;MQBZ=1.1807;MQSBZ=0.444487;BQBZ=2.38952;SCBZ=-0.928191;FS=0;MQ0F=0;AC=1;AN=2;DP4=4,6,3,1;MQ=57	GT:PL	0/1:77,0,136
chr20	30494111	.	G	T	56.0046	.	DP=16;VDB=0.178985;SGB=-0.590765;RPBZ=-0.736374;MQBZ=-1.57877;MQSBZ=0.462177;BQBZ=1.86937;SCBZ=-0.984732;FS=0;MQ0F=0;AC=1;AN=2;DP4=5,6,3,2;MQ=57	GT:PL	0/1:89,0,159
chr20	30495492	.	A	C	34.3353	.	DP=7;VDB=0.549396;SGB=-0.511536;RPBZ=-0.353553;MQBZ=0.866025;MQ
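
One way to read these records is to compare the genotype (GT) with the read support in DP4, which lists high-quality reference-forward, reference-reverse, alternate-forward and alternate-reverse read counts. The sketch below computes the alternate-allele fraction for two of the records above, with INFO trimmed to the fields used. It assumes GT is the first FORMAT field, and `alt_fraction` is a hypothetical helper.

```bash
# Report genotype and alternate-allele read fraction from the DP4 INFO field.
alt_fraction() {
    awk -F'\t' '!/^#/ {
        n = split($8, info, ";")
        for (i = 1; i <= n; i++) {
            if (info[i] ~ /^DP4=/) {
                split(substr(info[i], 5), d, ",")
                ref = d[1] + d[2]
                alt = d[3] + d[4]
                split($10, sample, ":")   # GT is assumed to be the first FORMAT field
                if (ref + alt > 0)
                    printf "%s:%s GT=%s ref=%d alt=%d alt_frac=%.2f\n",
                           $1, $2, sample[1], ref, alt, alt / (ref + alt)
            }
        }
    }'
}

# Two records from the output above, with INFO trimmed to the fields used here.
{
    printf 'chr20\t30493978\t.\tG\tA\t145.416\t.\tDP=8;DP4=0,0,5,3;MQ=54\tGT:PL\t1/1:175,24,0\n'
    printf 'chr20\t30494099\t.\tG\tA\t44.0738\t.\tDP=14;DP4=4,6,3,1;MQ=57\tGT:PL\t0/1:77,0,136\n'
} | alt_fraction
```

The homozygous call (1/1) is supported only by alternate reads, while the heterozygous call (0/1) has a mix of reference and alternate reads.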

<br>
Summarize properties of the variant calls in the filtered VCF

In [18]:
/opt/conda/envs/py-bio/bin/bcftools stats HG00249.bcftools.filter.vcf | head -n 20
echo
/opt/conda/envs/py-bio/bin/bcftools stats HG00249.bcftools.filter.vcf | tail -n 20

# This file was produced by bcftools stats (1.15+htslib-1.15.1) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  HG00249.bcftools.filter.vcf
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	HG00249.bcftools.filter.vcf
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
# 
#   Note that rows containing multiple t
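
Among the summaries that `bcftools stats` reports is the balance of transitions and transversions among SNPs. As a sketch of how that classification works, the code below counts both classes from REF/ALT pairs, using the four records shown in the filtered VCF output earlier. It assumes the input contains single-nucleotide changes only, and `count_ts_tv` is a hypothetical helper.

```bash
# Classify single-nucleotide changes as transitions (A<->G, C<->T) or transversions.
# Input: one "REF ALT" pair per line, SNPs only.
count_ts_tv() {
    awk '{
        change = $1 $2
        if (change == "AG" || change == "GA" || change == "CT" || change == "TC") ts++
        else tv++
    }
    END { printf "ts=%d tv=%d\n", ts, tv }'
}

# REF and ALT of the four records shown in the filtered VCF output earlier.
count_ts_tv <<'EOF'
G A
G A
G T
A C
EOF
```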

<br>
<b><u>Call genetic variants using GATK</u></b>
<br><br>
First let's list out all of the tools that are available in GATK 

In [19]:
/opt/conda/envs/r-bio/bin/gatk --list


Using GATK jar /opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar --help
USAGE:  <program name> [-h]

Available Programs:
--------------------------------------------------------------------------------------
Base Calling:                                    Tools that process sequencing machine data, e.g. Illumina base calls, and detect sequencing level attributes, e.g. adapters
    CheckIlluminaDirectory (Picard)              Asserts the validity for specified Illumina basecalling data.
    CollectIlluminaBasecallingMetrics (Picard)   Collects Illumina Basecalling metrics for a sequencing run.
    CollectIlluminaLaneMet

<br>
We will use base quality score recalibration (BQSR) to update the base quality scores based on comparison to known variant positions.  First, we use BaseRecalibrator, which estimates the empirical error rate of bases in each quality score bin, treating mismatches outside the known variant sites as errors.  Second, we use its output with ApplyBQSR to update the quality scores in the BAM file

In [20]:
/opt/conda/envs/r-bio/bin/gatk BaseRecalibrator -I HG00249.rmdup.bam -R chr20.fa.gz --known-sites resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz -O recal.table
/opt/conda/envs/r-bio/bin/gatk ApplyBQSR -R chr20.fa.gz -I HG00249.rmdup.bam --bqsr-recal-file recal.table -O HG00249.rmdup.recal.bam


Using GATK jar /opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar BaseRecalibrator -I HG00249.rmdup.bam -R chr20.fa.gz --known-sites resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz -O recal.table
22:39:35.518 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 23, 2023 10:39:35 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
22:39:35.713 INFO  BaseRecalibrator - ------------------------------------------------------------
22:39:35.71

<br>
If we look at the output of BaseRecalibrator it shows the error rate of the original quality scores

In [21]:
head -n 142 recal.table

#:GATKReport.v1.1:5
#:GATKTable:2:17:%s:%s:;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value                                                                   
binary_tag_name             null                                                                    
covariate                   ReadGroupCovariate,QualityScoreCovariate,ContextCovariate,CycleCovariate
default_platform            null                                                                    
deletions_default_quality   45                                                                      
force_platform              null                                                                    
indels_context_size         3                                                                       
insertions_default_quality  45                                                                      
low_quality_tail            2                                      
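
The idea behind the recalibration can be sketched with simple arithmetic: within a bin of bases, count the observations and the mismatches, and Phred-scale the ratio. The code below uses the plain ratio Q = -10 * log10(errors / observations), which is a simplification and not GATK's exact estimator. The counts are hypothetical and `empirical_q` is our own helper name.

```bash
# Phred-scale an observed error rate: Q = -10 * log10(errors / observations).
# This plain ratio is a simplification, not the estimator GATK uses.
empirical_q() {
    awk -v e="$1" -v n="$2" 'BEGIN { printf "%.1f\n", -10 * log(e / n) / log(10) }'
}

# Hypothetical bin: bases reported as Q30, but 50 mismatches in 10000 observations.
empirical_q 50 10000
# The same bin if the reported score were accurate (10 mismatches in 10000 bases).
empirical_q 10 10000
```

In the first case the bases reported as Q30 behave more like Q23, which is the kind of discrepancy recalibration corrects.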

<br>
Next we will use the BAM file with the recalibrated quality scores to call an initial set of variants using GATK HaplotypeCaller

In [22]:
/opt/conda/envs/r-bio/bin/gatk HaplotypeCaller -I HG00249.rmdup.recal.bam -O HG00249.gatk.vcf -R chr20.fa.gz
#this will take about 8 mins to run

Using GATK jar /opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar HaplotypeCaller -I HG00249.rmdup.recal.bam -O HG00249.gatk.vcf -R chr20.fa.gz
22:39:59.616 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 23, 2023 10:39:59 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
22:39:59.758 INFO  HaplotypeCaller - ------------------------------------------------------------
22:39:59.759 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.0.0
22:39:59.75

<br>
Then we will summarize the properties of this initial variant call set

In [25]:
/opt/conda/envs/py-bio/bin/bcftools stats HG00249.gatk.vcf | head -n 50


# This file was produced by bcftools stats (1.15+htslib-1.15.1) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  HG00249.gatk.vcf
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	HG00249.gatk.vcf
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
# 
#   Note that rows containing multiple types will be counted m

<br>
Next we will perform recalibration of variant quality scores and filtering.  First we will use the VariantRecalibrator command to determine the error rate of variants across quality scores compared to known variant positions.  Next we will use its output (the recalibration file and the tranches file) in ApplyVQSR to update the variant quality scores and produce a filtered VCF

In [26]:
/opt/conda/envs/r-bio/bin/gatk VariantRecalibrator -R chr20.fa.gz -V HG00249.gatk.vcf --resource:hapmap,known=false,training=true,truth=true,prior=15.0 resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz --resource:omni,known=false,training=true,truth=false,prior=12.0 resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz --resource:1000G,known=false,training=true,truth=false,prior=10.0 resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz -an QD -an FS -mode SNP -O recal.var --tranches-file output.tranches --rscript-file output.plots.R
/opt/conda/envs/r-bio/bin/gatk ApplyVQSR -R chr20.fa.gz -V HG00249.gatk.vcf -O HG00249.gatk.filter.vcf --truth-sensitivity-filter-level 90.0 --tranches-file output.tranches --recal-file recal.var -mode SNP


Using GATK jar /opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar VariantRecalibrator -R chr20.fa.gz -V HG00249.gatk.vcf --resource:hapmap,known=false,training=true,truth=true,prior=15.0 resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz --resource:omni,known=false,training=true,truth=false,prior=12.0 resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz --resource:1000G,known=false,training=true,truth=false,prior=10.0 resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.chr20.vcf.gz -an QD -an FS -mode SNP -O recal.var --tranches-file output.tranches --rscript-file output.plots.R
22:48:23.910 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/envs/r-bio/share/gatk4-4.2.0.0-1/gatk
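
The `--truth-sensitivity-filter-level 90.0` option can be read as: choose the score cutoff that retains 90% of the variants found at truth sites, then filter everything scoring below it. The sketch below illustrates only that cutoff selection on ten hypothetical scores. It does not reproduce the model that VariantRecalibrator fits, and `sensitivity_cutoff` is a hypothetical helper.

```bash
# Find the score cutoff that keeps a target percentage of truth-site variants.
# Input: one score per line. The scores used below are hypothetical.
sensitivity_cutoff() {
    target=$1
    sort -rn | awk -v t="$target" '
        { score[NR] = $1 }
        END {
            keep = int((NR * t + 99) / 100)   # ceiling of NR * t / 100 for whole-number t
            print score[keep]
        }'
}

printf '%s\n' 5.2 4.8 4.1 3.9 3.3 2.7 2.0 1.1 0.4 -1.5 | sensitivity_cutoff 90
```

With ten truth-site variants, a 90% level keeps the nine highest scores, so the cutoff is the ninth-highest value. Lowering the level makes the filter stricter.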

<br>
Then we will summarize the properties of this filtered variant call set

In [28]:
/opt/conda/envs/py-bio/bin/bcftools stats HG00249.gatk.filter.vcf | head -n 50


# This file was produced by bcftools stats (1.15+htslib-1.15.1) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  HG00249.gatk.filter.vcf
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	HG00249.gatk.filter.vcf
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
# 
#   Note that rows containing multiple types wil

<br>
<b><u>Annotate genetic variants</u></b>
<br><br>
We will use the filtered bcftools VCF and functionally annotate variant calls using ANNOVAR

In [29]:
perl annovar/table_annovar.pl HG00249.bcftools.filter.vcf annovar/humandb/ -buildver hg38 -out HG00249 -remove -protocol refGene -operation g -nastring . -vcfinput


NOTICE: the --polish argument is set ON automatically (use --nopolish to change this behavior)

NOTICE: Running with system command <convert2annovar.pl  -includeinfo -allsample -withfreq -format vcf4 HG00249.bcftools.filter.vcf > HG00249.avinput>
NOTICE: Finished reading 4356 lines from VCF file
NOTICE: A total of 960 locus in VCF file passed QC threshold, representing 905 SNPs (537 transitions and 368 transversions) and 56 indels/substitutions
NOTICE: Finished writing allele frequencies based on 905 SNP genotypes (537 transitions and 368 transversions) and 56 indels/substitutions for 1 samples

NOTICE: Running with system command <annovar/table_annovar.pl HG00249.avinput annovar/humandb/ -buildver hg38 -outfile HG00249 -remove -protocol refGene -operation g -nastring . -otherinfo>
NOTICE: the --polish argument is set ON automatically (use --nopolish to change this behavior)
-----------------------------------------------------------------
NOTICE: Processing operation=g protocol=refGen

<br>
This step should produce a VCF with the annotations included as well as a text file of the annotations

In [30]:
ls -la *multianno*


-rw-rw---- 1 grader-bggn237-01 root 259271 Feb 23 22:48 HG00249.hg38_multianno.txt
-rw-rw---- 1 grader-bggn237-01 root 509120 Feb 23 22:48 HG00249.hg38_multianno.vcf


<br>
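Extract all variants that ANNOVAR annotates as `upstream` of a gene (by default, within 1 kb of the transcription start site), which we use here as a proxy for the promoter region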


In [31]:
grep 'upstream' HG00249.hg38_multianno.txt


chr20	30286638	30286638	G	A	upstream	LINC01597	dist=101	.	.	1	94.4151	5	chr20	30286638	.	G	A	94.4151	.	DP=5;VDB=0.930466;SGB=-0.590765;MQSBZ=-0.592349;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,2,3;MQ=47	GT:PL	1/1:124,15,0
chr20	30287288	30287288	C	A	upstream	LINC01597	dist=751	.	.	1	75.4196	5	chr20	30287288	.	C	A	75.4196	.	DP=5;VDB=0.0850649;SGB=-0.590765;MQSBZ=0;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,3,2;MQ=60	GT:PL	1/1:105,12,0
chr20	30376387	30376387	A	G	upstream	FRG1BP	dist=777	.	.	0.5	57.3799	10	chr20	30376387	.	A	G	57.3799	.	DP=10;VDB=0.948062;SGB=-0.590765;RPBZ=-0.209529;MQBZ=0;MQSBZ=0;BQBZ=-1.93649;SCBZ=0;FS=0;MQ0F=0;AC=1;AN=2;DP4=4,1,1,4;MQ=60	GT:PL	0/1:90,0,108
chr20	30376586	30376586	C	A	upstream	FRG1BP	dist=578	.	.	0.5	64.2769	6	chr20	30376586	.	C	A	64.2769	.	DP=6;VDB=0.765925;SGB=-0.556411;RPBZ=0;MQBZ=-0.707107;MQSBZ=-0.447214;BQBZ=1.90693;SCBZ=0;FS=0;MQ0F=0;AC=1;AN=2;DP4=0,2,1,3;MQ=57	GT:PL	0/1:98,0,23
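
A plain `grep` matches the word anywhere on the line, including combined labels such as `upstream;downstream`. If only variants labelled exactly `upstream` are wanted, matching on the function column is stricter. The sketch below assumes that column is the sixth one, as in the output above. It uses two rows from that output, trimmed to eight columns, plus one hypothetical row, and `upstream_only` is our own helper name.

```bash
# Select rows whose function column (column 6) is exactly "upstream"
# and print the position, gene, and distance to the gene.
upstream_only() {
    awk -F'\t' '$6 == "upstream" { print $1 ":" $2, $7, $8 }'
}

{
    # Two rows from the output above, trimmed to the first eight columns.
    printf 'chr20\t30286638\t30286638\tG\tA\tupstream\tLINC01597\tdist=101\n'
    printf 'chr20\t30376387\t30376387\tA\tG\tupstream\tFRG1BP\tdist=777\n'
    # Hypothetical row that a plain grep for "upstream" would also match.
    printf 'chr20\t30400100\t30400100\tC\tT\tupstream;downstream\tGENE_A;GENE_B\tdist=50;dist=900\n'
} | upstream_only
```

On the real file the same filter would be run as `upstream_only < HG00249.hg38_multianno.txt`.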


<br>
<b><u>Convert genotypes to tab-delimited file</u></b>
<br><br>
Compress the VCFs and then use the 'tabix' command to index the VCFs

In [32]:
/opt/conda/envs/r-bio/bin/bgzip HG00249.bcftools.filter.vcf
/opt/conda/envs/r-bio/bin/tabix -p vcf HG00249.bcftools.filter.vcf.gz

/opt/conda/envs/r-bio/bin/bgzip HG00249.gatk.filter.vcf
/opt/conda/envs/r-bio/bin/tabix -p vcf HG00249.gatk.filter.vcf.gz

<br>
Output a tab-delimited text file that can be used for additional analyses

In [None]:
# output text file


In [None]:
#/opt/conda/envs/variant_calling/bin/vcf2tsv -g HG00249.gatk.filter.vcf.gz > HG00249.gatk.filter.txt
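
Since the cell above is left as an exercise, here is one possible sketch of the conversion using only awk. It flattens each record to CHROM, POS, REF, ALT, QUAL and the genotype of the first sample, and assumes GT is the first FORMAT field. The function name `vcf_to_tsv` and the choice of columns are ours, not the output format of `vcf2tsv`.

```bash
# Flatten VCF records into a tab-delimited table: CHROM, POS, REF, ALT, QUAL, GT.
vcf_to_tsv() {
    awk -F'\t' -v OFS='\t' '
        BEGIN { print "CHROM", "POS", "REF", "ALT", "QUAL", "GT" }
        /^#/  { next }                   # skip header lines
        {
            split($10, sample, ":")      # GT is assumed to be the first FORMAT field
            print $1, $2, $4, $5, $6, sample[1]
        }'
}

# Two records from this notebook with INFO shortened.
{
    printf '##fileformat=VCFv4.2\n'
    printf 'chr20\t30493978\t.\tG\tA\t145.416\t.\tDP=8\tGT:PL\t1/1:175,24,0\n'
    printf 'chr20\t30494099\t.\tG\tA\t44.0738\t.\tDP=14\tGT:PL\t0/1:77,0,136\n'
} | vcf_to_tsv
```

On the compressed files from the previous step this could be run as `gzip -dc HG00249.bcftools.filter.vcf.gz | vcf_to_tsv > HG00249.bcftools.filter.txt`. `bcftools query` with a `-f` format string is a purpose-built alternative.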