# Annotation of exome variants using Annovar

## Aim

Annotate exome variants with ANNOVAR to prepare the data for downstream rare-variant association analyses with `LMM.ipynb`.

## Description of the pipeline

This pipeline supports three scenarios, depending on the type of input data you start with:

### Scenario 1: you have multiple bim files (e.g. one per chromosome) and want to merge them into one file for later annotation with ANNOVAR

Run `bim_merge` to concatenate all the bim files, then run `annovar` to annotate all the variants at once.

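### Scenario 2: you want to work with either common or rare variants

Run `get_snps` with `--maf_filter` (common variants, passed to plink2 as `--maf`) or `--max_maf_filter` (rare variants, passed to plink2 as `--max-maf`), depending on the type of variants you would like to extract, then run `annovar`.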
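
The sketch below mirrors, in plain Python, how the `get_snps_1` step turns its filter parameters into plink2 flags: a flag is emitted only when its value is greater than zero. The function name `plink_filter_flags` and the common-variant threshold of 0.01 are hypothetical and used only for illustration; the defaults are the ones declared in the workflow.

```python
def plink_filter_flags(maf_filter=0.0, max_maf_filter=0.001,
                       geno_filter=0.0, hwe_filter=0.0, mind_filter=0.0):
    """Return plink2 filter flags; a filter is included only if its value is > 0."""
    candidates = [
        ("--maf", maf_filter),          # lower MAF bound (common variants)
        ("--max-maf", max_maf_filter),  # upper MAF bound (rare variants)
        ("--geno", geno_filter),
        ("--hwe", hwe_filter),
        ("--mind", mind_filter),
    ]
    flags = []
    for flag, value in candidates:
        if value > 0:
            flags += [flag, str(value)]
    return flags


# Workflow defaults: only the rare-variant filter is active.
rare = plink_filter_flags()
# Hypothetical common-variant run: 0.01 is an illustrative threshold.
common = plink_filter_flags(maf_filter=0.01, max_maf_filter=0.0)
print(rare)
print(common)
```

Because `max_maf_filter` defaults to 0.001, a common-variant run presumably needs it set to 0 explicitly; otherwise both bounds would be passed to plink2 at once.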

### Scenario 3: you already have a specific list of variants to annotate, stored in a bim file

Run `annovar`.

## Command interface

In [1]:
sos run annovar.ipynb -h


## Illustration with minimal working example

**Scenario 3:** On Yale's cluster. Modify the `--humandb` and `--ukbb` paths to match the location of the databases that ANNOVAR needs.

```
sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb annovar \
    --cwd output \
    --bim_name ukb23156_c22.merged.filtered.bim \
    --humandb /gpfs/ysm/datasets/db/annovar/humandb \
    --ukbb /gpfs/gibbs/pi/dewan/data/UKBiobank \
    --job_size 1 \
    --name_prefix mwe_chr22 \
    --container_annovar /gpfs/gibbs/pi/dewan/data/UKBiobank/annovar.sif
```

On Columbia's cluster

```
sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb annovar \
    --cwd output \
    --bim_name /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c22.merged.filtered.bim \
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb  \
    --ukbb /mnt/mfs/statgen/isabelle/REF/humandb \
    --job_size 1 \
    --name_prefix mwe_chr22 \
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif
```

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# Specific number of threads to use
parameter: numThreads = 2
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Human genome build
parameter: build = 'hg38'
# Name for the merged bimfiles
parameter: bim_name = path
# Prefix for the name based on common/rare variant filtering
parameter: name_prefix = str
# Load annovar module from cluster
parameter: annovar_module = '''
module load ANNOVAR/2020Jun08-foss-2018b-Perl-5.28.0
echo "Module annovar loaded"
{cmd}
'''
# Software container option
parameter: container_annovar = 'gaow/gatk4-annovar'
parameter: container_lmm = 'statisticalgenetics/lmm:2.0'

### Format file for plink .bim

A text file with no header line, and one line per variant with the following six fields:
1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
2. Variant identifier
3. Position in morgans or centimorgans (safe to use dummy value of '0')
4. Base-pair coordinate (1-based; limited to 2^31 - 2)
5. Allele 1 (corresponding to clear bits in .bed; usually minor)
6. Allele 2 (corresponding to set bits in .bed; usually major)

In these bim files, the second column (the variant identifier, e.g. `1:930232:C:T`) lists the alleles in ref/alt order.
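
As a minimal sketch of the six-field layout described above, the snippet below parses one `.bim` line into named fields. The names `BimRecord` and `parse_bim_line` are hypothetical, and the example row is illustrative: it reuses the identifier `1:930232:C:T` and assumes the reference allele sits in column 6, which is how the workflow's awk commands treat it.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chrom: str        # chromosome code or name
    variant_id: str   # variant identifier
    cm: float         # position in morgans or centimorgans (0 if unknown)
    bp: int           # 1-based base-pair coordinate
    allele1: str      # usually the minor allele
    allele2: str      # usually the major allele


def parse_bim_line(line: str) -> BimRecord:
    """Split one whitespace-delimited .bim line into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    chrom, variant_id, cm, bp, allele1, allele2 = fields
    return BimRecord(chrom, variant_id, float(cm), int(bp), allele1, allele2)


# Illustrative row, not taken from real data.
example = "1\t1:930232:C:T\t0\t930232\tT\tC"
record = parse_bim_line(example)

# The identifier itself encodes chrom:pos:ref:alt.
id_chrom, id_pos, id_ref, id_alt = record.variant_id.split(":")
print(record)
print(id_ref, id_alt)
```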

In [None]:
# Merge all the bimfiles into a single file to use later with awk
# Only need to run this cell once
[bim_merge: provides = bim_name]
parameter: bimfiles = paths
input: bimfiles
output: bim_name
task: trunk_workers = 1, walltime = '10h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
      cat ${_input} > ${_output}

In [None]:
# Get a list of common SNPs above (--maf) or rare SNPs below (--max-maf) certain MAF
[get_snps_1]
# bed files plink format
parameter: bfiles = paths
# Filter based on minor allele frequency (use when filtering common variants)
parameter: maf_filter = 0.0
# Filter based on the maximum maf allowed (use when filtering rare variants)
parameter: max_maf_filter = 0.001
# Filter out variants with missing call rate higher than this value
parameter: geno_filter = 0.0
# Filter according to Hardy-Weinberg equilibrium
parameter: hwe_filter = 0.0
# Filter out samples with missing rate higher than this value
parameter: mind_filter = 0.0
input: bfiles, group_by=1
output: f'{cwd}/cache/{_input:bn}.{name_prefix}.snplist'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    plink2 \
      --bfile ${_input:n}\
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} \
      ${('--max-maf %s' % max_maf_filter) if max_maf_filter > 0 else ''} \
      ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} \
      ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} \
      ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --write-snplist --no-id-header\
      --freq \
      --threads ${numThreads} \
      --out ${_output:n} 

In [None]:
# Merge all of the common_var.snplist into a single file and all the rare_var.snplist into another single file
[get_snps_2]
input: group_by='all'
output: f'{cwd}/cache/{name_prefix}.snplist'
task: trunk_workers = 1, walltime = '10h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output:n}.stdout' 
      cat ${_input} > ${_output}

In [None]:
# Search for common or rare variants in bimfile and generate annovar input file
[get_snps_3]
depends: bim_name
output: f'{cwd}/{_input:bn}.avinput'
task: trunk_workers = 1, walltime = '10h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    awk -F" " 'FNR==NR {lines[$1]; next} $2 in lines ' ${_input} ${bim_name} > ${_output:n}.tmp
    awk '{if ($2 ~ /D/) {print $1, $4, $4 + (length ($6) - length ($5)), $6, $5 } else {print $1, $4, $4, $6, $5 }}'  ${_output:n}.tmp >  ${_output}
    # remove temporary files
    rm -f ${_output:n}.tmp 

## Annovar details

For a list of available databases, see the [UCSC hg38 database directory](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/).

On Yale's Farnam HPC cluster, there is a folder for shared databases
```/gpfs/ysm/datasets/db/annovar/humandb``` 

and a folder for the x_ref database ```/gpfs/gibbs/pi/dewan/data/UKBiobank/mart_export_2019_LOFtools3.txt```

On Columbia's cluster, the folder for shared databases for build hg19 is under Isabelle's folder
```/mnt/mfs/statgen/isabelle/REF/humandb```

and the x_ref database is under that same folder ```/mnt/mfs/statgen/isabelle/REF/humandb```


### Important note

Please make sure you are using the correct build for your annotations. The UKBB exome data for 200K individuals need the hg38 build.

### Format file for annovar input

On each line, the first five space- or tab-delimited columns represent:

1. chromosome 
2. start position 
3. end position 
4. the reference nucleotides
5. the observed nucleotides
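
The conversion from a `.bim` row to an ANNOVAR input row can be sketched in Python as follows. This mirrors the awk logic used in the `annovar_1` step, under the workflow's own assumptions: column 6 is taken as the reference allele, column 5 as the observed allele, and an identifier containing `D` marks a variant whose end position is shifted by the allele length difference. The function name `bim_to_avinput` and the second example row (including its identifier) are hypothetical.

```python
def bim_to_avinput(line: str) -> str:
    """Convert one .bim line into an ANNOVAR input line.

    Output columns: chrom, start, end, ref, obs, variant ID.
    """
    chrom, variant_id, _cm, bp, allele1, allele2 = line.split()
    start = int(bp)
    if "D" in variant_id:
        # Same rule as the awk command: shift the end by the length difference.
        end = start + (len(allele2) - len(allele1))
    else:
        end = start
    return f"{chrom} {start} {end} {allele2} {allele1} {variant_id}"


# Illustrative rows, not taken from real data.
snv = bim_to_avinput("1 1:930232:C:T 0 930232 T C")
deletion = bim_to_avinput("22 22:1000:D:2 0 1000 A ACT")
print(snv)
print(deletion)
```

The variant identifier is kept as a sixth column so that each annotated row can be traced back to its `.bim` entry.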

In [None]:
# Create annovar input file
[annovar_1]
input: bim_name
output: f'{cwd}/{_input:bn}.{build}.avinput'
task: trunk_workers = 1, walltime = '10h', mem = '10G', cores = numThreads, tags = f'{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.err', stdout = f'{_output:n}.out' 
    awk '{if ($2 ~ /D/) {print $1, $4, $4 + (length ($6) - length ($5)), $6, $5, $2} else {print $1, $4, $4, $6, $5, $2}}'  ${_input} >  ${_output}

In [None]:
# Annotate the avinput file using ANNOVAR
[annovar_2]
# humandb path for ANNOVAR
parameter: humandb = path
parameter: ukbb = path
#add xreffile to option without -exonicsplicing
#mart_export_2019_LOFtools3.txt #xreffile latest option -> Phenotype description,HGNC symbol,MIM morbid description,CGD_CONDITION,CGD_inh,CGD_man,CGD_comm,LOF_tools
parameter: x_ref = path(f"{ukbb}/mart_export_2019_LOFtools3.txt")
# Annovar protocol
parameter: protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements30way', 'encRegTfbsClustered', 'gwasCatalog', 'gnomad211_genome', 'gnomad211_exome', 'gme', 'kaviar_20150923', 'abraom', 'avsnp150', 'dbnsfp41a', 'dbscsnv11', 'regsnpintron', 'clinvar_20200316', 'gene4denovo201907']
# Annovar operation
parameter: operation = ['g', 'g', 'g', 'gx', 'r', 'r', 'r', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
# Annovar args
parameter: arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
output: f'{cwd}/{_input:bn}.{build}_multianno.csv'
task: trunk_workers = 1, walltime = '60h', mem = '48G', cores = numThreads, tags = f'{_output:bn}'
bash: container=container_annovar, volumes=[f'{humandb:a}:{humandb:a}', f'{x_ref:ad}:{x_ref:ad}'], expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'
    #do not add -intronhgvs as option -> writes cDNA variants as HGVS but creates issues (+2 splice site reported only)
    #-nastring . can only be . for VCF files
    #regsnpintron might cause shifted lines (be careful when using it)
    table_annovar.pl \
        ${_input} \
        ${humandb} \
        -buildver ${build} \
        -out ${_output:nn}\
        -remove \
        -polish \
        -nastring . \
        -protocol ${",".join(protocol)} \
        -operation ${",".join(operation)} \
        -arg ${",".join(arg)} \
        -csvout \
        -xreffile ${x_ref}
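
`table_annovar.pl` pairs the entries of `-protocol`, `-operation` and `-arg` by position, so the three lists in `annovar_2` should stay the same length when a database is added or removed. The check below is a small sketch, not part of the workflow; the helper name `check_annovar_lists` is hypothetical, and the lists are copied from the defaults above.

```python
protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene',
            'phastConsElements30way', 'encRegTfbsClustered', 'gwasCatalog',
            'gnomad211_genome', 'gnomad211_exome', 'gme', 'kaviar_20150923',
            'abraom', 'avsnp150', 'dbnsfp41a', 'dbscsnv11', 'regsnpintron',
            'clinvar_20200316', 'gene4denovo201907']
operation = ['g', 'g', 'g', 'gx'] + ['r'] * 3 + ['f'] * 11
arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"',
       '"-splicing 12 -exonicsplicing"', '"-splicing 12"'] + [''] * 14


def check_annovar_lists(protocol, operation, arg):
    """Raise if the lists differ in length; otherwise return aligned triples."""
    if not (len(protocol) == len(operation) == len(arg)):
        raise ValueError(
            f"length mismatch: protocol={len(protocol)}, "
            f"operation={len(operation)}, arg={len(arg)}"
        )
    return list(zip(protocol, operation, arg))


triples = check_annovar_lists(protocol, operation, arg)
for name, op, extra in triples:
    print(f"{name:<26} {op:<3} {extra}")
```

Printing the aligned triples makes it easy to confirm, for example, that `ensGene` is the protocol run with the `gx` operation and therefore the one that uses the `-xreffile` cross-reference.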