# Import BGEN Genomic Data with Hail

This notebook shows how to import genomic data from BGEN files into a Hail MatrixTable and save it to an Apollo database (dnax://) on the DNAnexus platform. For guidance on launch specs for the JupyterLab with Spark Cluster app at different data sizes, see the documentation: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Pre-conditions for running this notebook successfully:
- BGEN file(s) are uploaded to the project
- If data is spread across multiple BGEN files, they should be organized into one directory in the project and have unique file names
- BGEN file(s) end in `.bgen`
- BGEN file(s) should have an accompanying `.sample` file if sample identifiers are not stored in the BGEN file
- BGEN file(s) are `zlib` compressed. Currently, Hail only supports `zlib` compression (does not support `zstd` compression): https://discuss.hail.is/t/index-bgen-zlib-compression-exception/2652
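
The compression and sample-identifier conditions above can be checked before launching a cluster by reading the flags field of the BGEN header. The following is a minimal sketch, assuming the header layout of the BGEN v1.2 specification (a 4-byte offset, then a header block whose last 4 bytes are the flags). The function name `read_bgen_flags` is hypothetical and is not part of Hail or dxpy.

```python
import struct

# Values of the two lowest flag bits, per the BGEN specification (assumed here).
COMPRESSION_NAMES = {0: "uncompressed", 1: "zlib", 2: "zstd"}


def read_bgen_flags(path):
    """Return compression, layout, and sample-identifier flag from a BGEN header."""
    with open(path, "rb") as handle:
        # Bytes 0-3: offset to the first variant block; bytes 4-7: header length L_H.
        _offset, header_length = struct.unpack("<II", handle.read(8))
        # The header block starts at byte 4, and its final 4 bytes hold the flags.
        handle.seek(4 + header_length - 4)
        (flags,) = struct.unpack("<I", handle.read(4))
    return {
        "compression": COMPRESSION_NAMES.get(flags & 0b11, "unknown"),
        "layout": (flags >> 2) & 0b1111,
        "has_sample_ids": bool(flags >> 31),
    }
```

If `compression` is reported as `zstd`, the file would need to be re-encoded before Hail can index it. If `has_sample_ids` is `False`, a `.sample` file is needed.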

## 1) Initiate Spark and Hail

In [None]:
# Running this cell prints a message in red; this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl
import os

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

## 2) Locate and import data into a Hail MatrixTable

All data uploaded to the project before running the JupyterLab app is mounted (https://documentation.dnanexus.com/user/jupyter-notebooks?#accessing-data) and can be accessed in `/mnt/project/<path_to_data>`. The file URL follows the format: `file:///mnt/project/<path_to_data>`

*If the Hail-specific index files for each BGEN file have already been uploaded to the platform, skip step 2a and build `index_file_map` manually in step 2b. Note: each BGEN file's index is a directory whose name must end in `.idx2` and which must contain two files: `index` and `metadata.json.gz`.*
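
The index directory layout described in the note can be verified with a few lines of standard-library Python. This is only a sketch: `check_index_dir` is a hypothetical helper, and it assumes the index directory is reachable through the local mount (for example under `/mnt/project`), not through HDFS.

```python
import os

# Files that each `.idx2` directory is expected to contain, per the note above.
REQUIRED_INDEX_FILES = {"index", "metadata.json.gz"}


def check_index_dir(index_dir):
    """Return a list of problems found for one `.idx2` directory (empty if none)."""
    problems = []
    if not index_dir.rstrip("/").endswith(".idx2"):
        problems.append("directory name does not end in .idx2")
    if not os.path.isdir(index_dir):
        problems.append("directory does not exist")
        return problems
    missing = REQUIRED_INDEX_FILES - set(os.listdir(index_dir))
    problems.extend(f"missing file: {name}" for name in sorted(missing))
    return problems
```

An empty list means the directory has the expected name and contents. It does not confirm that the index matches the BGEN file it was built from.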

In [4]:
# Define the path of BGEN files directory in the project

bgen_path = "/mnt/project/use_cases/BGEN/data/multiple_chromosomes"

### 2a) Create index files

If there are no Hail-specific index files for the BGEN files, create them with Hail's `index_bgen()` method. Index files created within this notebook must be written to HDFS so that they are available on all nodes. The HDFS URL should have the format `hdfs:///<name_of_bgen_file>.idx2` (each index must end with the extension `.idx2`).

Notes from Hail documentation:
- Hail only supports 8-bit probabilities (if using qctool to convert a VCF into BGEN, use the option `-bgen-bits 8`).
- While the `index_bgen()` method parallelizes over a list of BGEN files, each file is indexed serially by one core. Indexing several BGEN files on a large cluster is a waste of resources, so indexing should generally be done once, separately from large analyses.
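
Because each index is written to `hdfs:///<name_of_bgen_file>.idx2`, two BGEN files with the same file name would overwrite each other's index, which is why unique file names matter. The sketch below plans the index URLs up front and fails on a collision. `build_index_file_map` is a hypothetical helper that uses only the standard library and does not call Hail.

```python
import os


def build_index_file_map(bgen_paths):
    """Map each BGEN file URL to an HDFS index URL named after the file."""
    index_file_map = {}
    seen = {}  # file name -> first path that claimed it
    for path in sorted(bgen_paths):
        name = os.path.basename(path)
        if not name.endswith(".bgen"):
            continue  # skip .sample files and anything else that is not a BGEN
        if name in seen:
            raise ValueError(
                f"{path} and {seen[name]} would both use hdfs:///{name}.idx2"
            )
        seen[name] = path
        index_file_map[f"file://{path}"] = f"hdfs:///{name}.idx2"
    return index_file_map
```

The resulting dictionary has the same shape as the `index_file_map` argument that `hl.index_bgen()` and `hl.import_bgen()` accept.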

In [None]:
# Create an index file for each BGEN file in the directory.
# Only `.bgen` files are indexed; other files in the directory (e.g. `.sample`) are skipped.

for filename in sorted(os.listdir(bgen_path)):
    if not filename.endswith(".bgen"):
        continue
    file_url = f"file://{bgen_path}/{filename}"
    hl.index_bgen(path=file_url,
                  index_file_map={file_url: f"hdfs:///{filename}.idx2"},
                  reference_genome="GRCh38",
                  contig_recoding=None,
                  skip_invalid_loci=False)

### 2b) Create index file map dictionary

In [None]:
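# Create the index file map necessary for importing.
# Only `.bgen` files get an entry; other files in the directory (e.g. `.sample`) are skipped.

index_file_map = {}
for filename in sorted(os.listdir(bgen_path)):
    if filename.endswith(".bgen"):
        index_file_map[f"file://{bgen_path}/{filename}"] = f"hdfs:///{filename}.idx2"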

In [4]:
# If index files were created outside of this notebook, manually create the index_file_map dictionary using file URLs
#
# Example:
# index_file_map = {"file:///mnt/project/use_cases/BGEN/data/indexed_data/chr1.bgen":"file:///mnt/project/use_cases/BGEN/data/indexed_data/chr1.bgen.idx2",
#                   "file:///mnt/project/use_cases/BGEN/data/indexed_data/chr2.bgen":"file:///mnt/project/use_cases/BGEN/data/indexed_data/chr2.bgen.idx2"}

### 2c) Import BGEN

In [None]:
mt = hl.import_bgen(path=f"file://{bgen_path}/*.bgen", # a glob pattern can be used if genomic data is in multiple BGEN files
                    entry_fields=['GT', 'GP'],
                    sample_file=f"file://{bgen_path}/*.sample",
                    n_partitions=None,
                    block_size=None,
                    index_file_map=index_file_map,
                    variants=None)

In [None]:
# View basic properties of MT
# 
# Note: running 'mt.rows().count()' or 'mt.cols().count()' can be computationally 
# expensive and take longer for bigger datasets.

print(f"Num partitions: {mt.n_partitions()}")
mt.describe()

## 3) Store Hail MT in DNAX

In [9]:
# Define database and MT names

# Note: It is recommended to only use lowercase letters for the database name.
# If uppercase lettering is used, the database name will be lowercased when creating the database.
db_name = "database_name"
mt_name = "geno.mt"
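
Since the database name is lowercased on creation, normalizing it up front keeps later lookups by name consistent. This is a minimal sketch. `normalize_db_name` is a hypothetical helper, and the allowed character set (lowercase letters, digits, underscores) is an assumption based on typical Spark SQL identifiers, not something stated in this notebook.

```python
import re


def normalize_db_name(name):
    """Lowercase a database name and reject characters outside [a-z0-9_]."""
    lowered = name.lower()
    # Assumed identifier rule: letters, digits, and underscores only.
    if not re.fullmatch(r"[a-z0-9_]+", lowered):
        raise ValueError(f"unsupported characters in database name: {name!r}")
    return lowered
```

For example, `normalize_db_name("My_Database")` returns `"my_database"`, which is the name to use both in the `CREATE DATABASE` statement and when searching for the database afterwards.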

In [None]:
# Create database in DNAX

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()

In [None]:
# Store MT in DNAX

import dxpy

# Find the ID of the newly created database using dxpy.
# The name is lowercased to match how the platform stores database names.
db_uri = dxpy.find_one_data_object(name=db_name.lower(), classname="database")['id']
url = f"dnax://{db_uri}/{mt_name}" # Note: the dnax URL must follow this format to properly save the MT to DNAX

# Before this step, the Hail MatrixTable is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.write for additional documentation.
mt.write(url) # Note: output should describe size of MT (i.e. number of rows, columns, partitions) 
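
The `dnax://<database_id>/<mt_name>` format used above can be wrapped in a small helper that catches an ID of the wrong object class before the write starts. This is a sketch under assumptions: `build_dnax_url` is a hypothetical name, and the check relies only on database IDs beginning with `database-`.

```python
def build_dnax_url(database_id, mt_name):
    """Build the dnax:// URL for a MatrixTable stored in a DNAX database."""
    # Assumption: DNAnexus database object IDs start with "database-".
    if not database_id.startswith("database-"):
        raise ValueError(f"not a database ID: {database_id!r}")
    if not mt_name or "/" in mt_name:
        raise ValueError(f"MatrixTable name must be a single path segment: {mt_name!r}")
    return f"dnax://{database_id}/{mt_name}"


# The same URL can later be passed to hl.read_matrix_table() to load the stored MT.
```

Keeping the URL construction in one place means the write step and any later read use the same string.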