### Variant calling module

**CMM262, Winter 2024**

Kyle Gaulton, kgaulton@health.ucsd.edu
<br>
<br>

<b>In this walkthrough we will functionally annotate variant calls from 23andMe</b>

<br>
<b><u>Download and format 23andMe file</u></b>
<br><br>
For this walkthrough we will use the Harvard Personal Genome Project, which makes genetic data from many individuals publicly available:  
https://my.pgp-hms.org/public_genetic_data  
<br><br>
We will download 23andMe genetic data from one of the individuals in this database.
<br><br>
If you have used 23andMe, you could instead download your own genotype data and annotate your own variants.
<br><br>

In [None]:
wget --mirror --no-parent --no-host --cut-dirs=1 https://3996cdadd6946ea4d2685f2a71949d6e-107.collections.ac2it.arvadosapi.com/_/

<br>
If we look at the file we can see that it isn't in a standard format such as VCF; it simply lists each variant and its genotype.

In [None]:
head -n 50 genome_Patrick_Finney_v4_Full_20170327075235\[1\].txt
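
To get a feel for the raw format, the sketch below tallies genotype calls in a small made-up file that mimics the 23andMe layout (comment lines starting with `#`, then tab-separated rsid, chromosome, position, genotype). The toy file, its rsids, and the variable names are invented for illustration; point `raw` at the downloaded file to try it on real data.

```shell
# Build a tiny made-up file in the 23andMe raw layout (hypothetical rsids).
workdir=$(mktemp -d)
raw="$workdir/toy_23andme.txt"
printf '# rsid\tchromosome\tposition\tgenotype\n' > "$raw"
printf 'rs0001\t1\t1000\tAA\n' >> "$raw"
printf 'rs0002\t1\t2000\tAG\n' >> "$raw"
printf 'rs0003\t2\t1500\t--\n' >> "$raw"
printf 'rs0004\tX\t3000\tAG\n' >> "$raw"

# Skip comment lines, then count how often each genotype string occurs.
genotype_counts=$(awk -F'\t' '!/^#/ {n[$4]++} END {for (g in n) print g, n[g]}' "$raw" | sort)
echo "$genotype_counts"

# Count no-calls, which 23andMe writes as "--".
no_calls=$(awk -F'\t' '!/^#/ && $4 == "--" {c++} END {print c + 0}' "$raw")
echo "no-calls: $no_calls"
```

A summary like this is a quick way to confirm that the file contains plain genotype strings rather than VCF-style allele codes.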

<br>
Therefore, before annotating the variant calls we first need to convert the 23andMe output to VCF.
<br><br>
We will use a Perl script, '23andme2vcf.pl', to do the conversion.

In [None]:
cp ~/public/variantcalling/resources/23andme_v3_hg19_ref.txt.gz .
perl ~/public/variantcalling/resources/23andme2vcf.pl genome_Patrick_Finney_v4_Full_20170327075235\[1\].txt my_vars.vcf

In [None]:
head -n 50 my_vars.vcf

<br>
<b><u>Functionally annotate 23andMe VCF</u></b>
<br><br>
Next we will use ANNOVAR to functionally annotate the variants in the VCF file for effects on protein-coding genes and to identify variants that are present in ClinVar.

In [None]:
cp ~/public/variantcalling/resources/annovar.tar.gz .
gunzip annovar.tar.gz
tar -xvf annovar.tar

In [None]:
perl annovar/table_annovar.pl my_vars.vcf annovar/humandb/ -buildver hg19 -out annotated -remove -protocol refGene,clinvar_20131105 -operation g,f -nastring . -vcfinput

<br>
This step should produce both a VCF with the annotations included and a text file of variant annotations.

In [None]:
ls -la *multianno*

<br>
If we look at the annotated text file we can see many columns, some of which are redundant, so first we want to trim the file down so it is a bit easier to read.

In [None]:
head -n 100 annotated.hg19_multianno.vcf

In [None]:
head -n 20 annotated.hg19_multianno.txt

In [None]:
cut -f1,2,4,5,7,9,10,11,17,24 annotated.hg19_multianno.txt > annotated.hg19_multianno.trim.txt
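
The column numbers passed to `cut` depend on the header of the annotation table, so it can help to print the header with an index next to each field before choosing them. The sketch below does this on a shortened toy table; the file name, gene name, and variable names are hypothetical, and the real multianno header has many more columns.

```shell
# Shortened toy table; the real multianno file has many more columns.
workdir=$(mktemp -d)
table="$workdir/toy_multianno.txt"
printf 'Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n' > "$table"
printf 'chr1\t1000\t1000\tA\tG\texonic\tGENE1\n' >> "$table"

# Print each header field with its 1-based column number.
numbered_header=$(awk -F'\t' 'NR == 1 {for (i = 1; i <= NF; i++) print i ": " $i}' "$table")
echo "$numbered_header"

# Keep a subset of columns by number, in the same way as the cut command above.
trimmed=$(cut -f1,2,4,5,7 "$table")
echo "$trimmed"
```

Running the same `awk` line on `annotated.hg19_multianno.txt` should show which annotation each selected column number corresponds to.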

In [None]:
head -n 20 annotated.hg19_multianno.trim.txt

<br>
Looking at the file, it is clear that genotypes are provided for all of the variants, including those that are homozygous for the reference allele.  Therefore we need to filter the file down to just those variants that are heterozygous or homozygous non-reference.

In [None]:
grep -v '0/0' annotated.hg19_multianno.trim.txt > annotated.hg19_multianno.trim.nohomref.txt

In [None]:
head -n 20 annotated.hg19_multianno.trim.nohomref.txt
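
Because `grep -v '0/0'` drops any line containing that string anywhere, a column-aware filter is a slightly stricter alternative. The sketch below is an illustration on toy data, not the method used in this walkthrough; it assumes the genotype is the last tab-separated column, which should be checked against the actual trimmed file. Gene names and file names are made up.

```shell
# Toy trimmed table; assumption: the genotype is the last tab-separated column.
workdir=$(mktemp -d)
trim="$workdir/toy_trim.txt"
printf 'chr1\t1000\tA\tG\tGENE1\t0/0\n' > "$trim"
printf 'chr1\t2000\tC\tT\tGENE2\t0/1\n' >> "$trim"
printf 'chr2\t1500\tG\tA\tGENE3\t1/1\n' >> "$trim"

# Keep rows whose last field is anything other than homozygous reference.
non_ref=$(awk -F'\t' '$NF != "0/0"' "$trim")
echo "$non_ref"

# Count the rows that survive the filter.
kept=$(printf '%s\n' "$non_ref" | wc -l | tr -d ' ')
echo "kept: $kept"
```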

<br>
Finally, we want to extract variants that may have clinical significance in ClinVar

In [None]:
grep '[=|]pathogenic' annotated.hg19_multianno.trim.nohomref.txt
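
The `[=|]` prefix in the pattern matters: it requires `pathogenic` to come directly after `=` or `|`, so hyphenated values that merely end in the same word are not matched. The sketch below shows the difference on made-up annotation strings; the exact values are assumptions chosen to illustrate the pattern, not lines taken from the ClinVar table.

```shell
# Made-up annotation strings in the style of CLINSIG values (illustrative only).
workdir=$(mktemp -d)
calls="$workdir/toy_clinsig.txt"
printf 'var1\tCLINSIG=pathogenic\n' > "$calls"
printf 'var2\tCLINSIG=non-pathogenic\n' >> "$calls"
printf 'var3\tCLINSIG=probable-pathogenic\n' >> "$calls"
printf 'var4\tCLINSIG=other|pathogenic\n' >> "$calls"

# A bare 'pathogenic' matches every line in the toy file.
loose=$(grep -c 'pathogenic' "$calls")

# Requiring '=' or '|' just before the word skips the hyphenated values.
strict=$(grep '[=|]pathogenic' "$calls" | cut -f1 | tr '\n' ' ')

echo "loose matches: $loose"
echo "strict matches: $strict"
```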

<br>
<b><u>Format VCF for genotype imputation using TOPMed</u></b>
<br><br>
Most of the variants in your genome aren't captured by the microarrays used by 23andMe, Ancestry, etc.  However, you can use imputation to accurately predict the genotypes of most (not all) variants in your genome.
<br><br>
In order to do this, we need to take several steps to format the VCF so that it can be uploaded to an imputation server 

First we need to strip out the 'chr' from the chromosome column in the VCF

In [3]:
awk '{gsub(/chr/, "")}1' my_vars.vcf > my_vars.no_chr.vcf
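
Note that a global substitution removes `chr` wherever it occurs, including inside header text. As a more conservative sketch, the variant below leaves header lines untouched and only strips a leading `chr` from data lines. The toy VCF and its `XX` tag are made up; if a real header declares `##contig` lines with `chr` names, those would need renaming too, which this sketch does not do.

```shell
# Toy VCF with a header line that happens to contain the word "chromosome".
workdir=$(mktemp -d)
vcf="$workdir/toy.vcf"
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '##INFO=<ID=XX,Number=1,Type=String,Description="Toy tag on chromosome">\n' >> "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n' >> "$vcf"
printf 'chr1\t1000\trs0001\tA\tG\t.\t.\t.\tGT\t0/1\n' >> "$vcf"
printf 'chrX\t3000\trs0004\tC\tT\t.\t.\t.\tGT\t1/1\n' >> "$vcf"

# Pass header lines through unchanged; strip a leading "chr" from data lines.
awk '/^#/ {print; next} {sub(/^chr/, ""); print}' "$vcf" > "$workdir/toy.no_chr.vcf"

# List the chromosome names that remain in the data lines.
chroms=$(grep -v '^#' "$workdir/toy.no_chr.vcf" | cut -f1 | tr '\n' ' ')
echo "$chroms"
```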

Next we need to compress and index the resulting VCF

In [5]:
/opt/conda/envs/variant_calling/bin/bgzip my_vars.no_chr.vcf
/opt/conda/envs/variant_calling/bin/tabix my_vars.no_chr.vcf.gz

Finally, we need to split the VCF by chromosome

In [6]:
/opt/conda/envs/variant_calling/bin/bcftools index -s my_vars.no_chr.vcf.gz | cut -f 1 | while read -r C; do
    /opt/conda/envs/variant_calling/bin/bcftools view -O z -o "split.${C}.vcf.gz" my_vars.no_chr.vcf.gz "${C}"
done
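
To see what the loop produces without needing bcftools, the sketch below performs the same kind of split with `awk` on a toy, uncompressed VCF: each output file receives the full header plus the records for one chromosome. This is only an illustration on made-up data; for the real compressed and indexed file, the bcftools command above is the one to use.

```shell
# Toy uncompressed VCF with records on two chromosomes (made-up variants).
workdir=$(mktemp -d)
vcf="$workdir/toy.no_chr.vcf"
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n' >> "$vcf"
printf '1\t1000\trs0001\tA\tG\t.\t.\t.\tGT\t0/1\n' >> "$vcf"
printf '1\t2000\trs0002\tC\tT\t.\t.\t.\tGT\t1/1\n' >> "$vcf"
printf '2\t1500\trs0003\tG\tA\t.\t.\t.\tGT\t0/1\n' >> "$vcf"

# List the chromosomes present, then write header plus matching records for each.
grep -v '^#' "$vcf" | cut -f1 | sort -u | while read -r C; do
  awk -v c="$C" '/^#/ || $1 == c' "$vcf" > "$workdir/split.${C}.vcf"
done

# Count the records that landed in each per-chromosome file.
records_chr1=$(grep -vc '^#' "$workdir/split.1.vcf")
records_chr2=$(grep -vc '^#' "$workdir/split.2.vcf")
echo "chr1: $records_chr1 records, chr2: $records_chr2 records"
```

A useful sanity check after any split is that the per-chromosome record counts add up to the number of records in the original file.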

These per-chromosome VCFs can then be uploaded to TOPMed or another imputation server (we will walk through how this works next).