# Export Genomic Data as BGEN with Hail

This notebook shows how to export genomic data from a Hail MatrixTable as BGEN files. This is useful when downstream analysis uses regenie for GWAS (see step 6). For guidance on launch specs for the JupyterLab with Spark Cluster app at different data sizes, see the documentation: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Preconditions for running this notebook successfully:
- A Hail MatrixTable already exists in a DNAX database

## 1) Initiate Spark and Hail

In [None]:
# Running this cell prints a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output indicates that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

## 2) Read MT

In [2]:
# Define the MT URL

mt_url = "dnax://database-GFpXJ5j0vzZxPZQ2Ggf14x7q/geno.mt"

In [3]:
# Read the MT

mt = hl.read_matrix_table(mt_url)

In [None]:
# View structure of MT

mt.describe()

## 3) Export as BGEN: 1 file per chromosome

The notebook cannot write directly into the project, so it first writes to HDFS (see https://documentation.dnanexus.com/developer/apps/developing-spark-apps#spark-cluster-management-software). Once the BGEN files are in HDFS, the following steps copy them out of HDFS and upload them to the project.

*Additional documentation: https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_bgen*

In [38]:
# Create a set of unique chromosomes found in MT

chr_set = mt.aggregate_rows(hl.agg.collect_as_set(mt.locus.contig))
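
`chr_set` is a Python set, so iterating over it gives no guaranteed order. If a predictable export order is preferred, one option is to sort the contigs before looping. The helper below is a minimal sketch, not part of the original notebook: `contig_sort_key` is a hypothetical name, and it assumes contig names look like `chr1` or `1`, with `X`, `Y`, and `M`/`MT` as the only special cases.

```python
def contig_sort_key(contig):
    """Order contigs as 1..22, then X, Y, M/MT, then anything else alphabetically."""
    # Drop an optional "chr" prefix so "chr2" and "2" sort the same way
    name = contig[3:] if contig.lower().startswith("chr") else contig
    if name.isdigit():
        return (0, int(name), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    if name.upper() in special:
        return (1, special[name.upper()], "")
    return (2, 0, name)


# Hand-written example set; in the notebook this would be chr_set
example_contigs = {"chr10", "chr2", "chrX", "chr1", "chrM", "chrY"}
ordered_contigs = sorted(example_contigs, key=contig_sort_key)
print(ordered_contigs)
```

In the notebook, `sorted(chr_set, key=contig_sort_key)` could then replace `chr_set` in the export loop.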

In [None]:
# Filter the MT to one chromosome at a time and write each one to HDFS as its own BGEN file.
# The reference genome is hard-coded as GRCh38; change it if the MT uses a different build.

for chrom in chr_set:
    filtered_mt = hl.filter_intervals(
        mt, [hl.parse_locus_interval(chrom, reference_genome="GRCh38")]
    )
    # Writes <chrom>.bgen and <chrom>.sample to HDFS
    hl.export_bgen(filtered_mt, chrom)

## 4) Copy files from HDFS

In [None]:
%%bash
# Copy BGEN files from HDFS to the JupyterLab execution environment file system.
# The glob is quoted so that HDFS, not the local shell, expands it.

hdfs dfs -get "./*.bgen" .

In [None]:
%%bash
# Copy SAMPLE files from HDFS to the JupyterLab execution environment file system.
# The glob is quoted so that HDFS, not the local shell, expands it.

hdfs dfs -get "./*.sample" .
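
Before uploading, it can be worth confirming that every BGEN file copied out of HDFS has a matching SAMPLE file. The following is a minimal sketch, not part of the original notebook: `unpaired_exports` is a hypothetical helper, and it assumes the files sit in the current working directory and share a base name (for example `chr1.bgen` and `chr1.sample`).

```python
from pathlib import Path


def unpaired_exports(directory="."):
    """Return base names that have a .bgen without a .sample, or the reverse."""
    folder = Path(directory)
    bgen_stems = {path.stem for path in folder.glob("*.bgen")}
    sample_stems = {path.stem for path in folder.glob("*.sample")}
    # Symmetric difference: names present in exactly one of the two sets
    return sorted(bgen_stems ^ sample_stems)


unpaired = unpaired_exports(".")
if unpaired:
    print("Files without a partner:", unpaired)
else:
    print("Every BGEN file has a matching SAMPLE file.")
```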

## 5) Upload files to project

In [71]:
%%bash
# Upload BGEN and SAMPLE files to project

dx upload *.bgen
dx upload *.sample

## 6) Prep for GWAS with regenie (optional)

If the next step of your analysis is a GWAS with regenie, the regenie documentation suggests the following for running `step 2`:
- "Step 2 of regenie can be sped up by using BGEN files using v1.2 format with 8 bits encoding as well as having an accompanying .bgi index file."
- "We recommend to test chromosomes separately as these parameters may need to be altered when fitting the null model for each chromosome"

Conveniently, the steps above already created a separate BGEN file per chromosome, and Hail exports BGEN files in v1.2 format with 8-bit encoding, along with the SAMPLE file that regenie needs. For guidance on creating BGI files, see the documentation: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data
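
To confirm the layout of an exported file, the BGEN header can be read directly. The sketch below is not part of the original notebook. It follows the header layout described in the BGEN specification (a 4-byte offset, then a header block whose last 4 bytes are a flags field) and uses only the Python standard library. `read_bgen_header` is a hypothetical helper name. The per-variant bit depth is stored in each variant block rather than in the header, so this check covers the layout version and compression only.

```python
import struct


def read_bgen_header(path):
    """Parse the fixed-size fields of a BGEN header block."""
    with open(path, "rb") as handle:
        # Four little-endian uint32 values: offset, header length, variants, samples
        _offset, header_len, n_variants, n_samples = struct.unpack(
            "<4I", handle.read(16)
        )
        magic = handle.read(4)
        # The header block starts at byte 4 and its last 4 bytes hold the flags,
        # so the flags sit at absolute position 4 + header_len - 4 = header_len
        handle.seek(header_len)
        (flags,) = struct.unpack("<I", handle.read(4))
    return {
        "magic": magic,
        "n_variants": n_variants,
        "n_samples": n_samples,
        "compression": {0: "none", 1: "zlib", 2: "zstd"}.get(flags & 0b11, "unknown"),
        "layout": (flags >> 2) & 0b1111,  # 2 corresponds to BGEN v1.2
        "has_sample_ids": bool(flags >> 31),
    }
```

For example, `read_bgen_header("chr1.bgen")["layout"]` would be expected to return `2` for a v1.2 file, where `chr1.bgen` stands in for whichever file was exported.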

*Additional documentation:* 
- https://rgcgithub.github.io/regenie/
- https://bgen.readthedocs.io/en/latest/index.html
- https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md
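
As a rough illustration of the indexing step, the bgenix documentation linked above describes creating an index with `bgenix -g <file> -index`. The sketch below is not part of the original notebook and only builds and prints those commands for each local BGEN file. It assumes `bgenix` is installed in the environment where the commands are eventually run, and `bgenix_index_commands` is a hypothetical helper name. The DNAnexus documentation referenced in this section remains the recommended guide.

```python
import shlex
from pathlib import Path


def bgenix_index_commands(directory="."):
    """Build one 'bgenix -g <file> -index' command per BGEN file in a directory."""
    return [
        ["bgenix", "-g", str(path), "-index"]
        for path in sorted(Path(directory).glob("*.bgen"))
    ]


# Print the commands for review; nothing is executed here
for command in bgenix_index_commands("."):
    print(shlex.join(command))
```

Each command could then be run from a `%%bash` cell or passed to `subprocess.run(command, check=True)`.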