### Variant calling module

**CMM262, Winter 2024**

Kyle Gaulton, kgaulton@health.ucsd.edu
<br>
<br>

<b>In this walkthrough we will functionally annotate variant calls from 23andMe</b>

<br>
<b><u>Download and format the 23andMe file</u></b>
<br><br>
The Harvard Personal Genome Project makes genetic data from many individuals publicly available:   
https://my.pgp-hms.org/public_genetic_data  
<br><br>
For this walkthrough we will download the 23andMe genetic data of one individual in this database.
<br><br>
If you have used 23andMe yourself, you should be able to download your own genotype data directly.
<br><br>

In [1]:
# The leading '!' passes the command to the system shell instead of parsing it as Python
!wget --mirror --no-parent --no-host --cut-dirs=1 https://3996cdadd6946ea4d2685f2a71949d6e-107.collections.ac2it.arvadosapi.com/_/

<br>
If we look at the file we can see that it isn't in a standard format such as VCF; it simply lists each variant (rsID, chromosome, position) together with its genotype.

In [2]:
more genome_Patrick_Finney_v4_Full_20170327075235\[1\].txt

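As a rough illustration of that layout, the sketch below builds a tiny made-up excerpt (the rsIDs, positions and genotypes are hypothetical, not taken from the downloaded file) and counts its data rows and no-calls. It assumes the usual 23andMe conventions: comment lines start with `#`, fields are tab-separated, and a missing genotype is written as `--`.

```shell
# Hypothetical three-variant excerpt in the 23andMe raw layout (not real data).
demo=$(mktemp -d)
printf '# rsid\tchromosome\tposition\tgenotype\n' >  "$demo/raw.txt"
printf 'rs0000001\t1\t1000\tAA\n'                 >> "$demo/raw.txt"
printf 'rs0000002\t1\t2000\tAG\n'                 >> "$demo/raw.txt"
printf 'rs0000003\t2\t3000\t--\n'                 >> "$demo/raw.txt"

# Count data rows, skipping comment lines that start with '#'.
n_rows=$(grep -vc '^#' "$demo/raw.txt")

# Count no-calls: data rows whose fourth (genotype) field is '--'.
n_nocall=$(awk -F'\t' '!/^#/ && $4 == "--" {n++} END {print n+0}' "$demo/raw.txt")

echo "rows=$n_rows nocalls=$n_nocall"
```

The same two commands could be pointed at the real genotype file in place of the demo file.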

<br>
Therefore, before annotating the variant calls we first need to convert the 23andMe output to VCF
<br><br>
We will use the Perl script '23andme2vcf.pl' to do the conversion

In [None]:
!perl 23andme2vcf.pl genome_Patrick_Finney_v4_Full_20170327075235\[1\].txt my_vars.vcf

In [None]:
more my_vars.vcf
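The core of such a conversion is turning a two-letter genotype into a VCF `GT` value, which requires knowing the reference allele at each site (the raw 23andMe file does not contain it). The sketch below shows that mapping on a made-up table; it is only an illustration of the idea, not how 23andme2vcf.pl is implemented, and the rows and the extra reference-allele column are hypothetical.

```shell
demo=$(mktemp -d)
# Hypothetical rows: rsid, chrom, pos, 23andMe genotype, reference allele.
printf 'rs0000001\t1\t1000\tAA\tA\n' >  "$demo/calls.txt"
printf 'rs0000002\t1\t2000\tAG\tA\n' >> "$demo/calls.txt"
printf 'rs0000003\t2\t3000\tTT\tC\n' >> "$demo/calls.txt"

# Compare each of the two alleles with the reference allele to derive GT.
awk -F'\t' -v OFS='\t' '{
    a1 = substr($4, 1, 1); a2 = substr($4, 2, 1)
    if (a1 == $5 && a2 == $5)      gt = "0/0"   # homozygous reference
    else if (a1 == $5 || a2 == $5) gt = "0/1"   # heterozygous
    else if (a1 == a2)             gt = "1/1"   # homozygous non-reference
    else                           gt = "1/2"   # two different non-reference alleles
    print $1, $4, $5, gt
}' "$demo/calls.txt" > "$demo/gt.txt"

cat "$demo/gt.txt"
```

This simplified version ignores no-calls, indels and haploid calls (for example on chrX in males or chrMT), which a real converter has to handle.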

<br>
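Next we will use ANNOVAR to functionally annotate the variants in the VCF file, both for their effects on protein-coding genes (refGene) and for their presence in ClinVar. The two protocols are paired, in order, with the operations `g` (gene-based) and `f` (filter-based).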

In [None]:
!perl annovar/table_annovar.pl my_vars.vcf annovar/humandb/ -buildver hg19 -out annotated -remove -protocol refGene,clinvar_20131105 -operation g,f -nastring . -vcfinput

<br>
This step should produce both a VCF with the annotations included and a tab-delimited text file of variant annotations

In [None]:
ls -la *multianno*

<br>
If we look at the annotated text file we can see many columns, some of them redundant, so we first trim the file down to the columns of interest to make it easier to read

In [None]:
more annotated.hg19_multianno.txt

In [None]:
!cut -f1,2,4,5,7,9,10,11,17,24 annotated.hg19_multianno.txt > annotated.hg19_multianno.trim.txt

In [None]:
more annotated.hg19_multianno.trim.txt
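The column numbers given to `cut` can be read off the header line. A minimal sketch, using a made-up file whose header contains only a handful of ANNOVAR-style column names (the real header has many more columns, and the gene name here is hypothetical):

```shell
demo=$(mktemp -d)
# Hypothetical two-line table with a shortened ANNOVAR-style header.
printf 'Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n' >  "$demo/anno.txt"
printf '1\t1000\t1000\tA\tG\texonic\tGENE1\n'                    >> "$demo/anno.txt"

# Print each header field on its own line, prefixed by its 1-based column number.
head -n 1 "$demo/anno.txt" | tr '\t' '\n' | awk '{print NR "\t" $0}' > "$demo/cols.txt"
cat "$demo/cols.txt"

# Keep columns 1, 2, 4, 5 and 7 of the demo file.
cut -f1,2,4,5,7 "$demo/anno.txt" > "$demo/anno.trim.txt"
cat "$demo/anno.trim.txt"
```

Running the `head | tr | awk` pipeline on annotated.hg19_multianno.txt is a quick way to check that the numbers passed to `cut` match the intended columns.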

<br>
Looking at the file now, it is clear that genotypes are provided for all of the variants, including those that are homozygous for the reference allele.  We therefore need to filter the file down to the variants that are heterozygous or homozygous for a non-reference allele.

In [None]:
!grep -v '0/0' annotated.hg19_multianno.trim.txt > annotated.hg19_multianno.trim.nohomref.txt
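Note that `grep -v '0/0'` drops any line containing `0/0` anywhere, not only in the genotype field. A slightly stricter sketch tests the genotype column itself; it assumes the genotype sits in the last column of the trimmed file, and the rows below are hypothetical.

```shell
demo=$(mktemp -d)
# Hypothetical trimmed table: chrom, position, gene, genotype (last column).
printf 'Chr\tStart\tGene\tGT\n' >  "$demo/trim.txt"
printf '1\t1000\tGENE1\t0/0\n'  >> "$demo/trim.txt"
printf '1\t2000\tGENE2\t0/1\n'  >> "$demo/trim.txt"
printf '2\t3000\tGENE3\t1/1\n'  >> "$demo/trim.txt"

# Keep the header line plus every row whose last field is not exactly 0/0.
awk -F'\t' 'NR == 1 || $NF != "0/0"' "$demo/trim.txt" > "$demo/nonref.txt"
cat "$demo/nonref.txt"
```

For this walkthrough the two approaches should give the same result as long as no other column happens to contain the string `0/0`.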

<br>
Finally, we extract the variants that ClinVar annotates as pathogenic, since these may have clinical significance

In [None]:
!grep 'CLINSIG=pathogenic' annotated.hg19_multianno.trim.nohomref.txt
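Before filtering, it can be useful to see which `CLINSIG` values are present at all. The sketch below tallies them on a made-up table; the rows, gene names and annotation strings are hypothetical and only mimic the `KEY=value;KEY=value` style of the ClinVar column.

```shell
demo=$(mktemp -d)
# Hypothetical rows; the last column mimics an ANNOVAR ClinVar annotation string.
printf '1\t1000\tGENE1\tCLINSIG=pathogenic;CLNDBN=Example_disease\n'     >  "$demo/nonref.txt"
printf '1\t2000\tGENE2\tCLINSIG=non-pathogenic;CLNDBN=Example_disease\n' >> "$demo/nonref.txt"
printf '2\t3000\tGENE3\t.\n'                                             >> "$demo/nonref.txt"

# Extract each CLINSIG value (everything up to the next ';') and count occurrences.
grep -o 'CLINSIG=[^;]*' "$demo/nonref.txt" | sort | uniq -c > "$demo/clinsig_counts.txt"
cat "$demo/clinsig_counts.txt"

# Count rows matching the same pattern used in the walkthrough.
n_path=$(grep -c 'CLINSIG=pathogenic' "$demo/nonref.txt")
echo "pathogenic rows: $n_path"
```

Because the pattern is anchored on `CLINSIG=`, a value such as `non-pathogenic` is not matched by `CLINSIG=pathogenic`.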