# Annotation of exome variants using Annovar

## Aim

Prepare the data for downstream association analyses of rare variants with `LMM.ipynb`.

## Description of the pipeline

This pipeline provides 3 different possibilities depending on the type of input data you are starting with:

### Scenario 1: you have multiple bim files (e.g. one per chromosome) and you want to merge them into one file for later annotation with annovar

Run `bim_merge` to concatenate all the bim files and then run `annovar` to annotate all the variants at once

### Scenario 2: you want to work with either common or rare variants

Run `get_snps` with `--maf_filter` (common variants) or `--max_maf_filter` (rare variants), which are passed to plink2 as `--maf` and `--max-maf` respectively, and then run `annovar`

### Scenario 3: you already have a specific list of variants you would like to annotate stored in a bim file

Run `annovar`
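
For Scenario 2, the `get_snps` step only adds a frequency filter to the plink2 command when the corresponding threshold is positive. The sketch below illustrates that logic in plain Python; the function name `plink_filter_flags` is hypothetical and is not part of the workflow.

```python
def plink_filter_flags(maf_filter=0.0, max_maf_filter=0.0):
    """Return the plink2 frequency flags implied by the two thresholds.

    Hypothetical helper: a threshold of 0 means "do not apply this filter".
    """
    flags = []
    if maf_filter > 0:
        # Keep variants with MAF at or above the threshold (common variants)
        flags += ["--maf", str(maf_filter)]
    if max_maf_filter > 0:
        # Keep variants with MAF at or below the threshold (rare variants)
        flags += ["--max-maf", str(max_maf_filter)]
    return flags


rare_flags = plink_filter_flags(max_maf_filter=0.001)
common_flags = plink_filter_flags(maf_filter=0.01)
print(rare_flags)    # ['--max-maf', '0.001']
print(common_flags)  # ['--maf', '0.01']
```

Note that in the workflow `max_maf_filter` defaults to 0.001, so it likely needs to be set to 0 explicitly when extracting common variants.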

## Command interface

In [1]:
!sos run annovar.ipynb -h

ERROR: Notebook JSON is invalid: %s
usage: sos run annovar.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  bim_merge
  get_snps
  annovar

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --numThreads 2 (as int)
                        Specific number of threads to use
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --build hg38
                        Human genome build
  --bim-name VAL (as path, required)
                        Name for the merged bimfiles
  --name-prefix VAL (as str, required)
                        Prefix for the n

## Illustration with minimal working example

**Scenario 3:** On Yale's cluster. Modify the `--humandb` and `--ukbb` paths to match the location of the databases that annovar needs.

```
sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb annovar \
    --cwd output \
    --bim_name ukb23156_c22.merged.filtered.bim \
    --humandb /gpfs/ysm/datasets/db/annovar/humandb \
    --ukbb /gpfs/gibbs/pi/dewan/data/UKBiobank \
    --job_size 1 \
    --name_prefix mwe_chr22 \
    --container_annovar /gpfs/gibbs/pi/dewan/data/UKBiobank/annovar.sif
```

On Columbia's cluster running `annovar`

```
sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb annovar \
    --cwd output \
    --bim_name /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c22.merged.filtered.bim \
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb  \
    --ukbb /mnt/mfs/statgen/isabelle/REF/humandb \
    --job_size 1 \
    --name_prefix mwe_chr22 \
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif
```

On Columbia's cluster running `burden_files`

```
sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb burden_files\
    --cwd ~/output \
    --annotated_file /mnt/mfs/statgen/UKBiobank/results/annovar_exome/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.csv\
    --bim_name /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c1.merged.filtered.bim \
    --job_size 1 \
    --name_prefix test \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif
```

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# Specific number of threads to use
parameter: numThreads = 2
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Name for the merged bimfiles to use
parameter: bim_name = path
# Human genome build hg19 or hg38
parameter: build = 'hg38'
# Prefix for the name based on common/rare variant filtering
parameter: name_prefix = str
# Wall clock time expected
parameter: walltime = "15h"
# Memory expected
parameter: mem = "30G"
# Load annovar module from cluster
parameter: annovar_module = '''
module load ANNOVAR/2020Jun08-foss-2018b-Perl-5.28.0
echo "Module annovar loaded"
{cmd}
'''
# Software container option
parameter: container_annovar = 'gaow/gatk4-annovar'
parameter: container_lmm = 'statisticalgenetics/lmm:2.4'

### File format for plink .bim

A text file with no header line, and one line per variant with the following six fields:
1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
2. Variant identifier
3. Position in morgans or centimorgans (safe to use dummy value of '0')
4. Base-pair coordinate (1-based; limited to 2^31-2)
5. Allele 1 (corresponding to clear bits in .bed; usually minor)
6. Allele 2 (corresponding to set bits in .bed; usually major)

In the bim file, the second column (e.g. `1:930232:C:T`) holds the variant identifier, with the alleles given in ref/alt order.
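
As an illustration of the six-field layout described above, here is a minimal parsing sketch. The `BimRecord` type, the `parse_bim_line` function and the example line are hypothetical and only serve to show how the columns map to named fields.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chrom: str       # 1. chromosome code or name
    variant_id: str  # 2. variant identifier, e.g. 1:930232:C:T
    cm: float        # 3. position in morgans or centimorgans (often 0)
    bp: int          # 4. base-pair coordinate (1-based)
    allele1: str     # 5. allele 1 (usually minor)
    allele2: str     # 6. allele 2 (usually major)


def parse_bim_line(line: str) -> BimRecord:
    """Split one whitespace-delimited .bim line into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    chrom, variant_id, cm, bp, allele1, allele2 = fields
    return BimRecord(chrom, variant_id, float(cm), int(bp), allele1, allele2)


# Illustrative line only, not taken from a real dataset
record = parse_bim_line("1\t1:930232:C:T\t0\t930232\tT\tC")
print(record.variant_id, record.bp, record.allele1, record.allele2)
```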

## Step to merge *.bim files from plink formatted data (e.g. exome data in the UKBB, genotype array data)

In [None]:
# Merge all the *.bim files into a single file. Needs to be run once per type of data (e.g. genotype, exome)
[bim_from_plink]
# Path to the *.bim files to merge
parameter: bimfiles= paths
# Specify path of the merged bim file
parameter: bim_name = path
input: bimfiles 
output: bim_name
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
      # Overwrite rather than append so that re-running the step does not duplicate variants
      cat ${_input} > ${_output}

## Step to create a list of variants from *.bgen files and a merged *.bim file to annotate (e.g. imputed genotype data in the UKBB)

In [None]:
# Create a merged *.bim file from *.bgen files
[bim_from_bgen]
# Specify bgen files path
parameter: genoFile = paths
# Specify name of the merged bim file
parameter: bim_name = str
# The input here is the bgen file from which to extract the list of variants
input: genoFile, group_by=1
output: f'{cwd}/{_input:bn}.bim'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bgenix -g ${_input} -list | awk 'NR>2 { gsub("_",":",$1); print $3, $1, $4, $7, $6 }' | awk 'BEGIN{FS=OFS=" "}{$2 = $2 OFS 0}1'  > ${_output}
    cat ${_output} | awk '{x=$1+0;print x,$2,$3,$4,$5,$6}' >> ${cwd}/${bim_name}.merged.bim

In [None]:
# Get a list of common SNPs above (--maf) or rare SNPs below (--max-maf) certain MAF
[get_snps_1]
# bed files plink format
parameter: bfiles = paths
# Filter based on minor allele frequency (use when filtering common variants)
parameter: maf_filter = 0.0
# Filter based on the maximum maf allowed (use when filtering rare variants)
parameter: max_maf_filter = 0.001
# Filter out variants with missing call rate higher than this value
parameter: geno_filter = 0.0
# Filter according to Hardy-Weinberg Equilibrium
parameter: hwe_filter = 0.0
# Filter out samples with missing rate higher than this value
parameter: mind_filter = 0.0
input: bfiles, group_by=1
output: f'{cwd}/cache/{_input:bn}.{name_prefix}.snplist'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    plink2 \
      --bfile ${_input:n}\
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} \
      ${('--max-maf %s' % max_maf_filter) if max_maf_filter > 0 else ''} \
      ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} \
      ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} \
      ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --write-snplist --no-id-header\
      --freq \
      --threads ${numThreads} \
      --out ${_output:n} 

In [None]:
# Merge all of the common_var.snplist into a single file and all the rare_var.snplist into another single file
[get_snps_2]
input: group_by='all'
output: f'{cwd}/cache/{name_prefix}.snplist'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output:n}.stdout' 
      cat ${_input} > ${_output}

In [None]:
# Search for common or rare variants in bimfile and generate annovar input file
[get_snps_3]
depends: bim_name
output: f'{cwd}/{_input:bn}.avinput'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    awk -F" " 'FNR==NR {lines[$1]; next} $2 in lines ' ${_input} ${bim_name} > ${_output:n}.tmp
    awk '{if ($2 ~ /D/) {print $1, $4, $4 + (length ($6) - length ($5)), $6, $5 } else {print $1, $4, $4, $6, $5 }}'  ${_output:n}.tmp >  ${_output}
    # remove temporary files
    rm -f ${_output:n}.tmp 

## Annovar details

A list of available databases can be found [here](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/).

On Yale's Farnam HPC cluster, the shared databases are in `/gpfs/ysm/datasets/db/annovar/humandb`

and the x_ref database is at `/gpfs/gibbs/pi/dewan/data/UKBiobank/mart_export_2019_LOFtools3.txt`

On Columbia's cluster, the shared databases for build hg19 are under Isabelle's folder, `/mnt/mfs/statgen/isabelle/REF/humandb`

and the x_ref database is in that same folder, `/mnt/mfs/statgen/isabelle/REF/humandb`


### Important note

Please make sure you are using the correct build for your annotations. The UKBB exome data for 200K individuals requires the hg38 build.

### File format for annovar input

On each line, the first five space- or tab-delimited columns represent:

1. chromosome 
2. start position 
3. end position 
4. the reference nucleotides
5. the observed nucleotides
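
The conversion from a bim line to an avinput line can be sketched as follows. This is an illustration only, assuming that column 6 of the bim file holds the reference allele and column 5 the alternative allele, and that the end position is extended only when the reference allele is longer than the alternative. The function name `bim_to_avinput` and both example variants are hypothetical.

```python
def bim_to_avinput(line: str) -> str:
    """Convert one .bim line to an avinput-style line (sketch, see assumptions above)."""
    chrom, var_id, _cm, pos, alt, ref = line.split()
    start = int(pos)
    if len(ref) > len(alt):
        # Reference longer than alternative: extend the end by the length difference
        end = start + (len(ref) - len(alt))
    else:
        end = start
    # The variant ID is kept as an extra sixth column
    return f"{chrom} {start} {end} {ref} {alt} {var_id}"


snv = bim_to_avinput("1 1:930232:C:T 0 930232 T C")
deletion = bim_to_avinput("1 1:1000:CTG:C 0 1000 C CTG")
print(snv)       # 1 930232 930232 C T 1:930232:C:T
print(deletion)  # 1 1000 1002 CTG C 1:1000:CTG:C
```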

In [None]:
# Create annovar input file
[annovar_1]
input: bim_name
output: f'{cwd}/{_input:bn}.{build}.avinput'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.err', stdout = f'{_output:n}.out' 
    # $6 ref_allele, $5 alt_allele in the bim files 
    # Output as annovar avinput: chr, start, end (depends on allele lengths), reference, alternative, variant ID
    # Compare allele lengths rather than the allele strings themselves
    awk '{if (length($6) > length($5)) {print $1, $4, $4 + (length($6) - length($5)), $6, $5, $2} else {print $1, $4, $4, $6, $5, $2}}'   ${_input} >  ${_output}

In [None]:
# Annotate variants file using ANNOVAR
[annovar_2]
# humandb path for ANNOVAR
parameter: humandb = path
# Path to x-ref file
parameter: xref_path = path
# Annovar protocol
if build == 'hg19':
    protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements46way', 'gwasCatalog', 'gnomad211_exome', 'avsnp150', 'dbnsfp42a', 'dbscsnv11', 'gene4denovo201907']
    operation = ['g', 'g', 'g', 'g', 'r', 'r', 'f', 'f', 'f', 'f', 'f']
    arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '']
else:
    protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements30way', 'encRegTfbsClustered', 'gwasCatalog', 'gnomad30_genome', 'gnomad211_exome', 'gme', 'kaviar_20150923', 'avsnp150', 'dbnsfp41a', 'dbscsnv11', 'clinvar_20200316', 'gene4denovo201907']
    operation = ['g', 'g', 'g', 'gx', 'r', 'r', 'r', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
    arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '', '', '', '', '', '']

#add xreffile to option without -exonicsplicing
#mart_export_2019_LOFtools3.txt #xreffile latest option -> Phenotype description,HGNC symbol,MIM morbid description,CGD_CONDITION,CGD_inh,CGD_man,CGD_comm,LOF_tools
parameter: x_ref = path(f"{xref_path}/mart_export_2021_LOFtools.txt")
output: f'{cwd}/{_input:bn}.{build}_multianno.csv'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}', template = '{cmd}' if executable('annotate_variation.pl').target_exists() else annovar_module
bash: container=container_annovar, volumes=[f'{humandb:a}:{humandb:a}', f'{x_ref:ad}:{x_ref:ad}'], expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'
    #do not add -intronhgvs as option -> writes cDNA variants as HGVS but creates issues (+2 splice site reported only)
    #-nastring . can only be . for VCF files
    #regsnpintron might cause shifted lines (be careful when using it)
    table_annovar.pl \
        ${_input} \
        ${humandb} \
        -buildver ${build} \
        -out ${_output:nn}\
        -otherinfo\
        -remove \
        -polish \
        -nastring . \
        -protocol ${",".join(protocol)}\
        -operation ${",".join(operation)} \
        -arg ${",".join(arg)} \
        -csvout \
        -xreffile ${x_ref} 

## Generate files for burden_test regenie from the annotated file

This workflow's aim is to generate the `--anno_file` and the `--set_list` files needed to run regenie_burden in the LMM.ipynb

Required files

1. The anno_file: defines variant sets and functional annotations that will be used to generate the masks. The format is `chr:start:ref:alt gene_name functional_annot`
2. The set_list_file: lists the variants within each set/gene to use when building masks. The format is `set/gene chr start_pos` followed by a comma-separated list of the variants included in that gene
3. Mask file: this file specifies which annotation categories should be combined into masks

Optional files

4. Set inclusion/exclusion file: one column with a list of sets/genes to be included/excluded from the set-list-file
5. Alternative allele frequency (AAF) file: by default the AAF is computed from the sample, but you can specify an AAF for each variant using this file
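
To make the two required formats concrete, the sketch below builds anno_file and set_list_file lines from a handful of toy variants. The gene names, positions and annotation categories are hypothetical, and the code is a simplified illustration rather than the workflow's own implementation.

```python
from collections import defaultdict

# Hypothetical toy variants: (chrom, start, ref, alt, gene, annotation category)
variants = [
    ("1", 930232, "C", "T", "GENE_A", "missense"),
    ("1", 930300, "G", "A", "GENE_A", "LoF"),
    ("2", 5000, "A", "G", "GENE_B", "synonymous"),
]


def var_id(chrom, start, ref, alt):
    """Build a chr-prefixed chr:start:ref:alt identifier."""
    return f"chr{chrom}:{start}:{ref}:{alt}"


# anno_file: one line per variant -> variant ID, gene, annotation category
anno_lines = [f"{var_id(c, s, r, a)} {gene} {cat}" for c, s, r, a, gene, cat in variants]

# set_list_file: one line per gene -> gene, chromosome, first position, variant IDs
by_gene = defaultdict(list)
for c, s, r, a, gene, _cat in variants:
    by_gene[gene].append((c, s, var_id(c, s, r, a)))

set_list_lines = []
for gene, members in by_gene.items():
    chrom = members[0][0]
    start = min(s for _c, s, _v in members)  # numeric minimum of the positions
    ids = ",".join(v for _c, _s, v in members)
    set_list_lines.append(f"{gene} {chrom} {start} {ids}")

print("\n".join(anno_lines))
print("\n".join(set_list_lines))
```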


In [None]:
# Make the anno_file and set_list_file for regenie_burden analysis
[burden_files]
parameter: annotated_file = path
parameter: rsid = False
input: annotated_file
output:f'{cwd}/{_input:bn}.anno_file',
       f'{cwd}/{_input:bn}.aff_file',     
       f'{cwd}/{_input:bn}.set_list_file'
task: trunk_workers = 1, walltime = '10h', mem = '20G', cores = numThreads, tags = f'{_output[0]:bn}'
python: container=container_lmm, expand="${ }", stderr=f'{_output[0]:n}.err', stdout=f'{_output[0]:n}.out'
    import pandas as pd
    import numpy as np
    import csv

    columns= ['Chr', "Start", "Ref", "Alt", "Gene.refGene", "ExonicFunc.refGene", "avsnp150", "AF_nfe_exome" ]
    df = pd.read_csv('${_input}', compression='gzip', header=0, dtype='string', usecols=columns, index_col=False)

    # Get the refGene definition

    # Get only the first gene name present
    df1 = df["Gene.refGene"].str.split(";", n = 1, expand = True)
    df["Gene"]= df1[0]

    # Get the variants in the chr, pos, ref, alt format

    if ${rsid} == True:
        # Use the rsID where available and fall back to chr:pos:ref:alt otherwise
        has_rsid = df["avsnp150"].str.startswith('rs', na=False)
        chr_pos_id = df.Chr.str.cat(others=[df.Start, df.Ref, df.Alt], sep=':')
        df["varID"] = df["avsnp150"].where(has_rsid, chr_pos_id)
    else:
        df["varID"] = df.Chr.str.cat(others=[df.Start, df.Ref, df.Alt], sep=':')

    # Drop duplicated variants
    df2=df.drop_duplicates(subset='varID', keep='first')
    df2=df2.replace({'AF_nfe_exome':'.'},'0.0')
    df2['AF_nfe_freq'] = df2['AF_nfe_exome'].astype(float)
    # Find duplicated genes in the file
    genes = dict()
    def find_chrom(row):
        if row["Gene"] not in genes.keys():
            genes[row["Gene"]] = set()
        genes[row["Gene"]].add(row["Chr"])
    df2[["Chr", "Gene"]].apply(find_chrom, axis=1)
    
    # Rename the duplicated genes adding the chromosome number to make them unique
    def rename_chrom(row):
        if len(genes[row["Gene"]]) > 1:
            return f'{row["Gene"]}-{row["Chr"]}'
        return row["Gene"]
    df2["Gene"] = df2.apply(rename_chrom, axis=1)
    
    # Match annovar annotations with regenie_burden needs 
    annotation_mappings = {"nonsynonymous":'missense', "frameshift":'LoF', "stopgain":'LoF', "stoploss":'LoF', "synonymous":'synonymous'}
    def annotation(x):
        x = x.strip().split()
        for i in x:
            if i in annotation_mappings.keys():
                return annotation_mappings[i]
        return 'other'
    df2["anno_cat"] = df2["ExonicFunc.refGene"].apply(annotation)
    
    # Create the anno_file and add chr at the beginning to match the plink bim files
    with open('${_output[0]}', 'w') as output:
        for row in df2[["varID", "Gene", "anno_cat"]].to_numpy():
            if ${rsid} == True:
                output.write(f'{row[0]} {row[1]} {row[2]}\n')
            else: 
                output.write(f'chr{row[0]} {row[1]} {row[2]}\n')
    # Create the aaf-file for allele frequencies using gnomAD dataset
    with open('${_output[1]}', 'w') as output:
        for row in df2[["varID", "AF_nfe_freq"]].to_numpy():
            if ${rsid} == True:
                output.write(f'{row[0]} {row[1]}\n')
            else: 
                output.write(f'chr{row[0]} {row[1]}\n')
    
    # Create the set_list_file grouping variants per gene. Format: gene chrom start varIDs
    # Group by the column name (not a one-element list) so that each key is a plain string
    grouped = df2.groupby("Gene")
    with open('${_output[2]}', 'w') as output:
        for key, values in grouped:
            # Start is read as a string, so cast to int to get the numeric minimum
            start = values["Start"].astype(int).min()
            chrom = values["Chr"].min()
            # Variant IDs get the chr prefix unless rsIDs are used
            prefix = "" if ${rsid} == True else "chr"
            s = [prefix + str(v) for v in values["varID"]]
            output.write(str(key) + " " + str(chrom) + " " + str(start) + " " + ",".join(s) + "\n")
    