# 4. Running VEP on matrix tables
This script reads in the matrix tables produced in stage 1 and annotates each variant with its predicted functional consequence using VEP. It then writes these annotations out as Hail tables, which can be annotated back onto a matrix table where required.

For this stage, I used a mem2_ssd1_v2_x8 instance with 60 nodes. Setting up VEP on an instance is quite slow, so I used the same instance to loop through every chromosome once VEP had been set up. This stage cost ~£45 in total and took around 3 hours.

The output Hail tables are stored within DNAX and total 11 GB.
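
As noted above, the saved annotations can be joined back onto a matrix table when needed. A minimal sketch of that join, assuming the Hail table is keyed by the same row key as the matrix table (which should hold for tables made with `mt.rows()`). The function name `annotate_with_vep` and the field name `vep_annotations` are illustrative and not part of the original pipeline:

```python
def annotate_with_vep(mt, ht):
    """Join saved VEP annotations back onto a matrix table.

    `mt` is a Hail MatrixTable and `ht` a Hail Table keyed by the same
    row key (typically locus and alleles). The field name
    `vep_annotations` is a hypothetical choice for this sketch.
    """
    # Indexing a table by the matrix table's row key performs a keyed join:
    # each variant receives the matching row of `ht`, or missing if absent.
    return mt.annotate_rows(vep_annotations=ht[mt.row_key])
```

Here `ht` would typically come from `hl.read_table(...)` pointed at the `dnax://` URL written later in this script, after which flags such as `mt.vep_annotations.LoF_worstCsq` can be used to filter variants.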

## Set up Hail 

In [1]:
# Initialise Hail and Spark. Running this cell prints a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output indicates that Hail is ready to use in the notebook.

# Change the Spark configuration: raise the maximum Kryo serialization buffer size
# (a bare number is interpreted as MiB).
import pyspark.sql

config = pyspark.SparkConf().setAll([('spark.kryoserializer.buffer.max', '128')])
sc = pyspark.SparkContext(conf=config) 

from pyspark.sql import SparkSession

import hail as hl
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=sc)

# TODO: confirm that the Spark configuration change above has actually been applied.

import dxpy

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.2.3
SparkUI available at http://ip-10-60-175-114.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.116-cd64e0876c94
LOGGING: writing to /opt/notebooks/hail-20240802-1356-0.2.116-cd64e0876c94.log


## Annotate variants with VEP and write annotations out
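
The cell in this section flags each variant by testing whether VEP's most severe consequence falls in a fixed set of terms. The same logic in plain Python, as a sketch that needs no Hail session: the consequence sets are copied from the cell, while the function names `classify` and `gene_for_worst_consequence` and the toy records are hypothetical.

```python
# Consequence sets, copied from the annotation cell in this section.
PTV = {"splice_acceptor_variant", "splice_donor_variant",
       "stop_gained", "frameshift_variant"}
NS = PTV | {"inframe_insertion", "inframe_deletion", "missense_variant",
            "stop_lost", "start_lost"}
MISS = {"missense_variant"}
SYN = {"synonymous_variant"}


def classify(most_severe):
    """Return the four boolean flags for one most-severe consequence term."""
    return {
        "LoF_worstCsq": most_severe in PTV,
        "NS_worstCsq": most_severe in NS,
        "Miss_worstCsq": most_severe in MISS,
        "Syn_worstCsq": most_severe in SYN,
    }


def gene_for_worst_consequence(transcripts, most_severe):
    """Gene symbol of the first transcript carrying the most severe term.

    Mirrors the `find(...)` call in the cell: the first match wins, and
    None is returned when no transcript matches.
    """
    for t in transcripts:
        if most_severe in t["consequence_terms"]:
            return t["gene_symbol"]
    return None


# Toy records (hypothetical gene symbols, for illustration only).
transcripts = [
    {"gene_symbol": "GENE_A", "consequence_terms": ["intron_variant"]},
    {"gene_symbol": "GENE_B", "consequence_terms": ["stop_gained"]},
]
flags = classify("stop_gained")
gene = gene_for_worst_consequence(transcripts, "stop_gained")
print(flags, gene)
```

Because the non-synonymous set is a superset of the PTV set, a `stop_gained` variant is flagged as both LoF and non-synonymous.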

In [None]:
# Loop over chromosomes 1 to 23, annotating each in turn
for chr in range(1, 24):
    print(f'Annotating chr {chr} with VEP ... ')

    # Read in MT from stage 1
    mt=hl.read_matrix_table(f"dnax://database-GgPbpq8J637bkp84VQyQ83X9/chromosome_{chr}_post_genoqc_final.mt")
    # Check this table is as you'd expect
    print(mt.count())


    # Run VEP using a JSON config file that specifies the plugins, so variants are annotated with LoFTEE too
    ## This JSON file is saved on git alongside this script.
    mt=hl.vep(mt, "file:///mnt/project/WES_QC/annotations/04a_helper_file_config_plugin_details.json")


    # Annotate each row (variant) with the most severe consequence
    ## Set up annotations
    PTV_annotations = hl.set(["splice_acceptor_variant", "splice_donor_variant", "stop_gained", "frameshift_variant"])
    NS_annotations = hl.set(["splice_acceptor_variant", "splice_donor_variant", "stop_gained", "inframe_insertion", "inframe_deletion", "missense_variant", "stop_lost", "start_lost", "frameshift_variant"])
    Miss_annotations = hl.set(["missense_variant"])
    S_annotations = hl.set(["synonymous_variant"])
    # Annotate
    mt = mt.annotate_rows(LoF_worstCsq = (PTV_annotations.contains(mt.vep.most_severe_consequence)),
                          NS_worstCsq = (NS_annotations.contains(mt.vep.most_severe_consequence)),
                          Miss_worstCsq = (Miss_annotations.contains(mt.vep.most_severe_consequence)),
                          Syn_worstCsq = (S_annotations.contains(mt.vep.most_severe_consequence)),
                          gene_symbol_worstCsq = (mt.vep.transcript_consequences.find(lambda x : x.consequence_terms.contains(mt.vep.most_severe_consequence)).gene_symbol),
                          gene_id_worstCsq = (mt.vep.transcript_consequences.find(lambda x : x.consequence_terms.contains(mt.vep.most_severe_consequence)).gene_id)
                         )
    

    # Keep only the row (variant) fields, which hold the annotations, so they can be written out as a Hail table.
    ht=mt.rows()


    # Write out as a Hail table in DNAX, which can be annotated back onto matrix tables at a later point
    db_name = "vep_annotations_from_mts"
    ht_name = f"chr_{chr}_annotations_LoFTEE_updated.ht"
    # Create database in DNAX
    stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
    print(stmt)
    spark.sql(stmt).show()
    # Store HT in DNAX
    # Find database ID of newly created database using dxpy method
    db_uri = dxpy.find_one_data_object(name=db_name, classname="database")['id']
    url = f"dnax://{db_uri}/{ht_name}" # Note: the dnax url must follow this format to properly save the HT to DNAX
    # Before this step, the Hail table is just an object in memory. To persist it and be able to access
    # it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
    ht = ht.checkpoint(url) # Writes the table to DNAX and reads it back, so later steps use the saved copy


    # Check the file has the annotations you want
    ## When I first used VEP on the RAP, many LoFTEE annotations were missing.
    ## Below, I check whether most variants annotated as LoF have at least some transcripts with LoFTEE high- or low-confidence flags.
    LoF=ht.filter(ht.LoF_worstCsq)
    LoF.vep.transcript_consequences.lof.show(10) # Check that nearly every row has an HC or LC somewhere. If not, there is still an issue with LoFTEE!

    print(f"VEP annotations for chr {chr} complete!")
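
The final check above is done by eye on ten rows. The same idea can be expressed as a single number: the fraction of LoF variants with at least one transcript carrying a LoFTEE flag. A plain-Python sketch on toy records shaped like VEP output follows; the function name and the records are hypothetical, and `HC`/`LC` are the confidence flags mentioned in the comments above.

```python
def fraction_with_loftee_flag(variants):
    """Fraction of variants with an HC or LC LoFTEE flag on any transcript."""
    if not variants:
        return 0.0
    flagged = sum(
        any(t.get("lof") in ("HC", "LC") for t in v["transcript_consequences"])
        for v in variants
    )
    return flagged / len(variants)


# Toy LoF variants: two carry a flag, one has no LoFTEE annotation at all.
lof_variants = [
    {"transcript_consequences": [{"lof": "HC"}, {"lof": None}]},
    {"transcript_consequences": [{"lof": "LC"}]},
    {"transcript_consequences": [{"lof": None}]},
]
coverage = fraction_with_loftee_flag(lof_variants)
print(f"{coverage:.2f}")
```

A value well below 1 would suggest that the LoFTEE plugin is still not annotating properly. On the real table, Hail's `hl.agg.fraction` aggregator could likely compute the equivalent figure, though that variant is not tested here.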