# VEP annotation of rare variants using plugins

This workflow uses the VEP offline command-line tool to annotate variants.
Plugins such as CADD, gnomAD and LOFTEE can be used to create custom annotations.

## Input file

The input file should be a VCF file. Please make sure the columns are:

`CHROM  POS ID REF  ALT QUAL  FILTER  INFO  FORMAT`

and that the chromosome column does not contain the `chr` prefix.


If you would like to know more about VEP input formats, please take a look at the [VEP documentation](http://useast.ensembl.org/info/docs/tools/vep/vep_formats.html).
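
If your VCF uses `chr`-prefixed chromosome names, the prefix has to be removed before annotation. The sketch below is one possible way to do that on a per-line basis; the function name `strip_chr_prefix` and the example lines are hypothetical and not part of this workflow. It does not handle other naming differences such as `chrM` versus `MT`.

```python
def strip_chr_prefix(line: str) -> str:
    """Return a VCF line with a leading 'chr' removed from the chromosome name."""
    if line.startswith("##contig=<ID=chr"):
        # Contig header lines also carry the chromosome name
        return line.replace("##contig=<ID=chr", "##contig=<ID=", 1)
    if line.startswith("#"):
        # All other header lines are left untouched
        return line
    if line.startswith("chr"):
        # Data line: CHROM is the first column, so drop the first three characters
        return line[3:]
    return line


# Hypothetical example lines
example = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t.\tPASS\t.",
    "2\t500\t.\tC\tT\t.\tPASS\t.",
]
cleaned = [strip_chr_prefix(line) for line in example]
for line in cleaned:
    print(line)
```

Lines that already lack the prefix pass through unchanged, so the function is safe to apply to a file that mixes both conventions.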


## Databases

This workflow requires several very large databases, which must be downloaded first.

**1. Download CADD databases**

Download the CADD files that match your genome build and place them in the folder `data/cadd`.

The databases for hg38 can be found on the CADD website (make sure the index files are in the same folder as the databases).

* SNVs (HG38): 
    - [tsv.gz](https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs_inclAnno.tsv.gz) (313G)
    - [tbi](https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs_inclAnno.tsv.gz.tbi) (2.7M)

If you would like to annotate CADD scores for indels, you also need to download the indel database.

* Indels (HG38)
    - [tsv.gz](https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.indel_inclAnno.tsv.gz) (7.2G)
    - [tbi](https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.indel_inclAnno.tsv.gz.tbi) (2.5M)
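
Because the index files must sit next to the databases, a quick check before launching a long job can save time. The sketch below is a hypothetical helper (the name `missing_indexes` is not part of this workflow) that lists every `*.tsv.gz` file in a folder that has no matching `.tbi` file.

```python
from pathlib import Path


def missing_indexes(folder, pattern="*.tsv.gz"):
    """List bgzipped tables in `folder` that have no '.tbi' index beside them."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} is not a directory")
    return sorted(
        f.name
        for f in folder.glob(pattern)
        if not Path(str(f) + ".tbi").exists()
    )
```

For example, `missing_indexes("data/cadd")` returns an empty list when every CADD table in that folder has its index.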
    
**2. VEP setup and databases**

In our case VEP runs from a container, which can be used with Singularity.

If you are working on Columbia's cluster, please load the latest version of Singularity first, otherwise you will get an error. By default the workflow uses the latest version of the container, but you can also pull the image manually with the commands below:

```
module load Singularity/3.11.4
singularity pull oras://ghcr.io/cumc/rare_variation_apptainer:latest
```

* Download the cache files for VEP annotation and place them in the folder `data/vep`. Extract the archive with `tar xzf` after downloading. Please make sure the VEP version matches the version of the cache files.

    - [Ensembl 110 / GRCh38](https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz) (13G)
    - [Ensembl 110 / GRCh37](https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh37.tar.gz) (20G)
    - You can find other VEP cache versions [here](https://ftp.ensembl.org/pub)

For full VEP installation and instructions look at the documentation [here](https://ftp.ensembl.org/pub)
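
To check that the extracted cache matches the version and assembly you plan to pass to the workflow, you can look for the expected subfolder. The sketch below assumes the usual VEP cache layout of `<dir_cache>/<species>/<version>_<assembly>`; the helper name `expected_cache_dir` is hypothetical.

```python
from pathlib import Path


def expected_cache_dir(dir_cache, version, assembly, species="homo_sapiens"):
    """Build the folder where the extracted cache is assumed to live."""
    return Path(dir_cache) / species / f"{version}_{assembly}"


cache = expected_cache_dir("data/vep", 110, "GRCh38")
status = "found" if cache.is_dir() else "missing"
print(f"{cache}: {status}")
```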

**3. Set up LOFTEE**

* Download the LOFTEE files to `data/vep`

    - GERP (GRCh38):
        * [bw](https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw) (12G)
    - Human ancestor (GRCh38):
        * [fa.gz](https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz) (844M)
        * [fai](https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.fai)
        * [gzi](https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.gzi)
    - PhyloCSF (GRCh38):
        * [sql.gz](https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/loftee.sql.gz) (29M), unzip after downloading

**4. Download GENCODE file**

* Download the GTF file corresponding to your build, unzip it, and place it in `data/gencode`

    - [Gencode v43 / GRCh37](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/GRCh37_mapping/gencode.v43lift37.annotation.gtf.gz) (62M)
    - [Gencode v44 / GRCh38](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz) (47M)

**5. Download gnomAD databases**

* If you wish to annotate both gnomAD genome and exome frequencies you will need to download both databases. Please keep in mind that these are very big databases. For more information please visit the [gnomAD website](https://gnomad.broadinstitute.org/downloads) 

    - [gnomAD_exome_v2.1.1 / GRCh38](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz) (86G)
    - [gnomAD_exome_index_v2.1.1 / GRCh38](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz.tbi) (1.3M)

    - [gnomAD_genome_v2.1.1 / GRCh37/hg19](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz) (461G)
    - [gnomAD_genome_index_v2.1.1 / GRCh37/hg19](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz.tbi) (2.73M)

## Run the workflow

```
sos run vep.ipynb \
--cwd output \
--vcf data/test.vcf \
--human_ancestor data/vep/human_ancestor.fa.gz \
--conservation_file data/vep/loftee.sql \
--gerp_bigwig data/vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw \
--cadd_snps data/cadd/whole_genome_SNVs_inclAnno.tsv.gz \
--cadd_indels data/cadd/gnomad.genomes.r3.0.indel_inclAnno.tsv.gz \
--dir_cache data/vep \
--walltime 30h \
--mem 30G
```


## Output file

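The annotation step writes VEP's tab-delimited output (`--tab`) to `*.rare.VEP.CADD_gnomAD.tsv`, together with a bgzip-compressed copy (`.tsv.gz`). The post-processing step then writes a plain tab-separated table with a single header row, `*.formatted_anno.tsv`.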

## Important notes

* Please be mindful that SNVs and indels that are not present in the CADD SNV or indel databases will not receive a CADD annotation, which could affect your downstream analysis.

* In this pipeline we use cache version 103 with VEP release 110. If this is not what you want, please change the `cache_version` parameter accordingly.

* If you would like to annotate with the ClinVar database, add the `clinvar_db` parameter, which on Columbia's cluster is `/mnt/vast/hpc/csg/data_public/clinvar/clinvar_20231028.vcf.gz`
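
To gauge how many variants were left without a CADD score, you can count the rows whose CADD column is empty in the formatted output. The sketch below runs on a made-up three-row table rather than real workflow output. It assumes the score column is named `CADD_PHRED` (the name used by the VEP CADD plugin) and that missing values are written as `-`, as in VEP's tab output.

```python
import csv
import io

# Hypothetical excerpt of a formatted annotation table
example = io.StringIO(
    "Uploaded_variation\tLocation\tCADD_PHRED\n"
    "var1\t1:12345\t23.1\n"
    "var2\t1:67890\t-\n"
    "var3\t2:500\t4.7\n"
)

rows = list(csv.DictReader(example, delimiter="\t"))
# "-" is assumed to mark a variant that was not found in the CADD databases
unscored = [row["Uploaded_variation"] for row in rows if row["CADD_PHRED"] == "-"]
print(f"{len(unscored)} of {len(rows)} variants have no CADD score")
```

On a real file, replace the `io.StringIO` object with an open file handle to the `*.formatted_anno.tsv` output.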

## Annotation

[global]
# the output directory for generated files
parameter: cwd = path
# Specific number of threads to use
parameter: numThreads = 2
# Input vcf file to annotate
parameter: vcf = path
# Human ancestor database
parameter: human_ancestor = path
# Conservation file path
parameter: conservation_file = path
# GERP bigwig
parameter: gerp_bigwig = path
# CADD database for SNVs
parameter: cadd_snps = path
# CADD database for indels
parameter: cadd_indels = path
# ClinVar database
parameter: clinvar_db = path('.')
# Cache version to use
parameter: cache_version = int
# Genome assembly to use
parameter: assembly = 'GRCh38'
# Cache dir
parameter: dir_cache = path
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
parameter: mem = '15G'
parameter: walltime = '10h'
# The container 
parameter: container = "oras://ghcr.io/cumc/rare_variation_apptainer:latest"

[default_1]
input: vcf
output: f'{cwd}/{_input:bn}.rare.VEP.CADD_gnomAD.tsv', f'{cwd}/{_input:bn}.rare.VEP.CADD_gnomAD.tsv.gz' 
parameter: vep_window = 10000
parameter: most_severe = False
parameter: everything = True
bash: container=container, entrypoint="micromamba run -a '' -n rare_variation", expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
  vep \
    --verbose \
    --tab \
    -i ${_input} \
    -o ${_output[0]} \
    --distance ${vep_window} \
    --no_stats \
    --cache_version ${cache_version} \
    --assembly ${assembly} \
    --force_overwrite \
    --offline \
    ${('--most_severe') if most_severe else ''} \
    --dir_cache ${dir_cache} \
    --dir_plugins $VEP_PLUGIN_DIR \
    ${('--everything') if everything else ''} \
    ${('--custom file='+ str(clinvar_db)+',short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN') if clinvar_db.is_file() else ''} \
    --plugin LoF,human_ancestor_fa:${human_ancestor},loftee_path:$VEP_PLUGIN_DIR,conservation_file:${conservation_file},gerp_bigwig:${gerp_bigwig} \
    --plugin CADD,snv=${cadd_snps},indels=${cadd_indels}
    bgzip --keep ${_output[0]}
    tabix ${_output[1]}

## Post-processing

[default_2]
input: f'{cwd}/{vcf:bn}.rare.VEP.CADD_gnomAD.tsv.gz'
output: f'{cwd}/{_input:bn}.formatted_anno.tsv'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    import gzip
    import pandas as pd

    # Collect the comment lines at the top of the gzip-compressed VEP output
    comments = []
    with gzip.open(${_input:r}, 'rt') as file:
        for line in file:
            if not line.startswith('#'):
                # Comment lines only appear at the top, so stop at the first data row
                break
            comments.append(line.strip())

    # The last comment line holds the column names, e.g. "#Uploaded_variation<TAB>Location..."
    header_comment = comments[-1]
    # Drop the leading '#' and split the header into column names
    columns = header_comment.lstrip('#').split('\t')
    # Read the data rows (comment lines are skipped) and write them with a clean header
    df = pd.read_csv(${_input:r}, compression='gzip', comment='#', sep='\t', names=columns)
    df.to_csv(${_output:r}, index=False, sep='\t', header=True)
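
The header recovery in this step can be tried in isolation on a tiny file. The sketch below uses only the standard library and a made-up three-line file that mimics the layout of VEP tab output (metadata lines starting with `##`, then one header line starting with `#`). The function name `read_vep_header` and the file contents are hypothetical.

```python
import gzip
import os
import tempfile


def read_vep_header(path):
    """Return column names from the last '#' line before the first data row."""
    header = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # data rows start here, no need to scan further
            header = line
    if header is None:
        raise ValueError("no header line found")
    return header.lstrip("#").rstrip("\n").split("\t")


# Write a tiny hypothetical file that mimics VEP tab output
tmp = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(tmp, "wt") as out:
    out.write("## ENSEMBL VARIANT EFFECT PREDICTOR\n")
    out.write("#Uploaded_variation\tLocation\tAllele\n")
    out.write("var1\t1:12345\tG\n")

columns = read_vep_header(tmp)
print(columns)
```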