# Variant Annotation with Hail: Variant Effect Predictor (VEP)

This notebook shows how to annotate genomic data with VEP using Hail and save the annotations as a Hail Table in an Apollo database (dnax://) on the DNAnexus platform. For guidance on launch specs for the JupyterLab with Spark Cluster app at different data sizes, see the documentation: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Additional documentation: https://hail.is/docs/0.2/methods/genetics.html#hail.methods.vep

Pre-conditions for running this notebook successfully:
- There is an existing Hail MatrixTable in DNAX
- The necessary configuration file is already uploaded to the project (see https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster#using-vep-with-hail for an example config file)
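
A quick pre-flight check of the configuration file can catch a malformed upload before the VEP step is launched. The sketch below assumes the config is a JSON object with the three top-level fields described in Hail's `vep` documentation (`command`, `env`, `vep_json_schema`). The helper names are hypothetical and not part of Hail or dxpy.

```python
import json

# Assumption: these are the top-level fields Hail's VEP documentation
# describes for the configuration file.
REQUIRED_KEYS = ("command", "env", "vep_json_schema")


def missing_vep_config_keys(config: dict) -> list:
    """Return the required top-level keys that are absent from `config`."""
    return [key for key in REQUIRED_KEYS if key not in config]


def load_and_check_vep_config(path: str) -> dict:
    """Load a JSON config from `path` and fail early if a required key is missing."""
    with open(path) as handle:
        config = json.load(handle)
    missing = missing_vep_config_keys(config)
    if missing:
        raise ValueError(f"VEP config {path} is missing keys: {missing}")
    return config


# Example usage inside the JupyterLab session (path taken from this notebook):
# config = load_and_check_vep_config("/mnt/project/use_cases/annotation/config.json")
```

This only checks that the keys are present; it does not validate the VEP command line or the schema string.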

## 1) Initiate Spark and Hail

In [None]:
# Running this cell prints a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output indicates that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

## 2) Read MT

In [None]:
# Define the MT URL

mt_url = "dnax://database-GFpXJ5j0vzZxPZQ2Ggf14x7q/geno.mt"

In [None]:
# Read the MT

mt = hl.read_matrix_table(mt_url)

In [None]:
# View structure of MT before annotation

mt.describe()

## 3) Annotate

All data uploaded to the project before running the JupyterLab app is mounted (https://documentation.dnanexus.com/user/jupyter-notebooks?#accessing-data) and can be accessed under `/mnt/project/<path_to_data>`. The corresponding file URL follows the format `file:///mnt/project/<path_to_data>`.
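
The mapping from a project path to its mounted location and file URL can be sketched as a small helper. This is a minimal illustration that assumes the `/mnt/project` mount point described above. The function name `project_file_url` is hypothetical.

```python
import posixpath

# Mount point described in the DNAnexus documentation linked above.
MOUNT_ROOT = "/mnt/project"


def project_file_url(path_in_project: str) -> str:
    """Build the file:// URL for a file that lives at `path_in_project` in the project."""
    # Strip any leading slash so the join stays under the mount root.
    local_path = posixpath.join(MOUNT_ROOT, path_in_project.lstrip("/"))
    return f"file://{local_path}"


config_url = project_file_url("use_cases/annotation/config.json")
print(config_url)
```

The resulting string has the same form as the config path passed to `hl.vep` in this notebook.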

Note: Only VEP 103 for GRCh38 is available on DNAnexus; custom annotations are not available.

*Additional documentation: https://hail.is/docs/0.2/methods/genetics.html#hail.methods.vep*

In [None]:
# Run VEP with the config file in the project

ann_mt = hl.vep(mt, "file:///mnt/project/use_cases/annotation/config.json")

In [None]:
# See details of MT after annotation

ann_mt.describe()

The output shows that `vep` and `vep_proc_id` have been added to the row fields of the MT.

## 4) Create VEP Annotated Table and save in Apollo Database

In [None]:
# Create Hail Table from MT

ann_tb = ann_mt.rows()

In [None]:
# Define database and table name

# Note: It is recommended to only use lowercase letters for the database name.
# If uppercase lettering is used, the database name will be lowercased when creating the database.
db_name = "database_name"
tb_name = "ann_vep.ht"
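
Because the database name is lowercased on creation, it can help to normalize the name up front so that later lookups use the same string. The sketch below is one way to do that. The character pattern (a letter followed by letters, digits, or underscores) is a conservative assumption rather than a documented DNAnexus rule, and `normalize_db_name` is a hypothetical helper.

```python
import re


def normalize_db_name(name: str) -> str:
    """Lowercase a database name and check it against a conservative pattern."""
    lowered = name.lower()
    # Assumption: restrict names to a letter followed by letters, digits, or underscores.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", lowered):
        raise ValueError(f"Unexpected characters in database name: {name!r}")
    return lowered


print(normalize_db_name("Database_Name"))
```

Using the normalized value for `db_name` keeps the `CREATE DATABASE` statement and any later lookup by name consistent.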

In [None]:
# Create database in DNAX

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()

In [None]:
# Store Table in DNAX

import dxpy

# Find the ID of the newly created database using dxpy.
# The name is lowercased because database names are lowercased on creation (see the note above).
db_id = dxpy.find_one_data_object(name=db_name.lower(), classname="database")['id']
url = f"dnax://{db_id}/{tb_name}"


# Note: Writing (saving/storing) the Table to the database can be computationally expensive
# depending on the size of the annotations.
# 
# Before this step, the Hail Table is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.Table.html#hail.Table.write for additional documentation.
ann_tb.write(url) # Note: output should describe size of Table (i.e. number of rows, partitions)
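
The `dnax://` URLs used in this notebook combine a database ID with a table or MatrixTable name. The sketch below builds and splits such URLs in plain Python. It assumes database IDs have the form `database-` followed by 24 alphanumeric characters, as in the example ID used earlier in this notebook. Both helper names are hypothetical.

```python
import re

# Assumption: database IDs look like "database-" plus 24 alphanumeric characters.
DB_ID_PATTERN = re.compile(r"database-[0-9A-Za-z]{24}")
DNAX_PREFIX = "dnax://"


def build_dnax_url(db_id: str, name: str) -> str:
    """Join a database ID and an object name into a dnax:// URL."""
    if not DB_ID_PATTERN.fullmatch(db_id):
        raise ValueError(f"Not a database ID: {db_id!r}")
    return f"{DNAX_PREFIX}{db_id}/{name}"


def split_dnax_url(url: str) -> tuple:
    """Split a dnax:// URL into (database ID, object name)."""
    if not url.startswith(DNAX_PREFIX):
        raise ValueError(f"Not a dnax URL: {url!r}")
    db_id, _, name = url[len(DNAX_PREFIX):].partition("/")
    return db_id, name


print(split_dnax_url("dnax://database-GFpXJ5j0vzZxPZQ2Ggf14x7q/geno.mt"))
```

In a later session, a table written this way can be read back with `hl.read_table(url)`.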