# Import pVCF Genomic Data with Hail

This notebook shows how to import genomic data from pVCFs into a Hail MatrixTable (MT) and save it to an Apollo database (dnax://) on the DNAnexus platform. For guidance on launch specs for the JupyterLab with Spark Cluster app at different data sizes, see the documentation: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Pre-conditions for running this notebook successfully:
- pVCF(s) are uploaded to the project

## 1) Initialize Spark and Hail

In [None]:
# Running this cell prints a message in red; this is expected.
# The 'Welcome to Hail' message in the output indicates that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

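# Create (or reuse) a Spark session with Hive support enabled
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()

# Start Hail on the existing Spark context
hl.init(sc=spark.sparkContext)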

## 2) Locate and import data into a Hail MatrixTable

All data uploaded to the project before running the JupyterLab app is mounted (https://documentation.dnanexus.com/user/jupyter-notebooks?#accessing-data) and can be accessed at `/mnt/project/<path_to_data>`. The corresponding file URL has the form `file:///mnt/project/<path_to_data>`.
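As a minimal sketch of the path-to-URL mapping described above, the helper below prefixes a project-relative path with the mount point and the `file://` scheme. The name `to_file_url` and the file name `sample.vcf.gz` are hypothetical and used only for illustration; they are not part of any DNAnexus or Hail API.

```python
from pathlib import PurePosixPath

MOUNT_ROOT = PurePosixPath("/mnt/project")  # mount point of the project data


def to_file_url(path_in_project: str) -> str:
    """Return the file:// URL for a path relative to the project root (hypothetical helper)."""
    # Strip leading slashes so the path is always joined under the mount point
    relative = path_in_project.lstrip("/")
    return f"file://{MOUNT_ROOT / relative}"


# "sample.vcf.gz" is a made-up file name for illustration
print(to_file_url("use_cases/100_sample/vcf_format/sample.vcf.gz"))
```

The same helper also accepts a path containing a wildcard such as `*.vcf.gz`, since it only manipulates the string.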

In [None]:
# Define variables used in import

file_url = "file:///mnt/project/use_cases/100_sample/vcf_format/*.vcf.gz"  # a glob pattern (not a regex) can be used if the genomic data is split across multiple pVCFs
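Before importing, it can help to check which files the pattern actually selects. The sketch below uses Python's standard `glob` module on the local mount. `list_matching_files` is a hypothetical helper, and Python's glob rules may differ from Hail's in edge cases, so treat the listing as a sanity check only.

```python
import glob
from typing import List


def list_matching_files(file_url: str) -> List[str]:
    """List local files matched by a file:// glob pattern (hypothetical helper)."""
    prefix = "file://"
    if not file_url.startswith(prefix):
        raise ValueError(f"expected a file:// URL, got {file_url!r}")
    # Drop the scheme to get a local pattern such as /mnt/project/.../*.vcf.gz
    local_pattern = file_url[len(prefix):]
    return sorted(glob.glob(local_pattern))


# Prints nothing if no file matches (for example, outside the JupyterLab app)
for path in list_matching_files("file:///mnt/project/use_cases/100_sample/vcf_format/*.vcf.gz"):
    print(path)
```

An empty listing usually points to a wrong path or file extension, which is cheaper to catch here than during the import.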

In [None]:
# Import genomic data into an MT

mt = hl.import_vcf(
    file_url,
    force_bgz=True,  # treat .gz files as block-gzipped (BGZF)
    reference_genome="GRCh38",
    array_elements_required=False,  # allow missing elements in array fields, e.g. 1,.,5
)

In [None]:
# View basic properties of MT

print(f"Num partitions: {mt.n_partitions()}")
mt.describe()

## 3) Store Hail MT in DNAX

In [None]:
# Define database and MT names

db_name = "database_name"
mt_name = "geno.mt"

In [None]:
# Create database in DNAX

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()
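Because the database name is interpolated directly into the SQL string, it may be worth checking the name before running the statement. The sketch below does so with a hypothetical helper, `create_database_stmt`. The allowed character set (letters, digits, underscores) is a conservative assumption, not a documented DNAnexus rule.

```python
import re

# Assumption: restrict names to simple identifiers (letters, digits, underscores)
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_database_stmt(db_name: str) -> str:
    """Build the CREATE DATABASE statement after validating the name (hypothetical helper)."""
    if not _SIMPLE_NAME.fullmatch(db_name):
        raise ValueError(f"unexpected characters in database name: {db_name!r}")
    return f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"


print(create_database_stmt("database_name"))
```

In the notebook, the returned string could be passed to `spark.sql(...)` in place of the hand-built `stmt`.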

In [None]:
# Store MT in DNAX

import dxpy

# Look up the ID of the newly created database with dxpy
db_id = dxpy.find_one_data_object(name=db_name, classname="database")["id"]
url = f"dnax://{db_id}/{mt_name}"  # Note: the dnax URL must follow this format to save the MT to DNAX

# Before this step, the Hail MatrixTable is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.write for additional documentation.
mt.write(url)  # Note: the output should report the size of the MT (number of rows, columns, and partitions)
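The URL format used above can be captured in a small helper so that it is built the same way every time. This is a sketch only: `build_dnax_url` is a hypothetical name, and the check that database IDs start with `database-` is an assumption based on the usual DNAnexus `class-xxxx` ID convention.

```python
def build_dnax_url(db_id: str, mt_name: str) -> str:
    """Build a dnax:// URL of the form dnax://<database ID>/<MT name> (hypothetical helper)."""
    # Assumption: database IDs follow the DNAnexus "database-xxxx" convention
    if not db_id.startswith("database-"):
        raise ValueError(f"expected a database ID, got {db_id!r}")
    if not mt_name:
        raise ValueError("mt_name must not be empty")
    return f"dnax://{db_id}/{mt_name}"


# "database-xxxx" is a placeholder, not a real ID
print(build_dnax_url("database-xxxx", "geno.mt"))
```

In a later session, after Spark and Hail have been initialized as in section 1, the stored MT can be loaded again with `hl.read_matrix_table(url)`.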