# Filter Genomic Data with Hail: Sample IDs

This notebook shows how to filter a Hail MatrixTable by sample ID and save the filtered Hail Table to an Apollo database (dnax://) on the DNAnexus platform. For guidance on launch specs for the JupyterLab with Spark Cluster app at different data sizes, see the documentation: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Note: For population scale data, sample IDs may be referred to as individual IDs. In this notebook, the word "sample" will be used.

Pre-conditions for running this notebook successfully:
- There is an existing Hail MatrixTable in DNAX

## 1) Initiate Spark and Hail

In [None]:
# Running this cell outputs a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output indicates that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

## 2) Read MT

The URL of a MatrixTable stored in an Apollo database has the form: `dnax://<database_ID>/<mt_name>`
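
As a minimal sketch of that URL pattern, the helper below assembles a `dnax://` URL from its two parts. The function name `build_dnax_url` is hypothetical, and both checks are assumptions made for illustration: the `database-` prefix is inferred from the ID used in this notebook, and the single-segment rule for the object name is not a documented DNAX requirement.

```python
def build_dnax_url(database_id: str, object_name: str) -> str:
    """Return a URL of the form dnax://<database_ID>/<object_name>."""
    # Assumption: database IDs start with "database-", as in this notebook.
    if not database_id.startswith("database-"):
        raise ValueError(f"expected an ID starting with 'database-', got {database_id!r}")
    # Assumption: the object name is a single path segment such as "geno.mt".
    if not object_name or "/" in object_name:
        raise ValueError(f"object name must be a single path segment, got {object_name!r}")
    return f"dnax://{database_id}/{object_name}"


# Placeholder ID for illustration only; substitute your own database ID.
example_url = build_dnax_url("database-XXXX", "geno.mt")
print(example_url)
```

The returned string can be passed to `hl.read_matrix_table` in the same way as the hard-coded URL in the next cell.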

In [2]:
# Define MT URL

mt_url = "dnax://database-GFpXJ5j0vzZxPZQ2Ggf14x7q/geno.mt"

In [23]:
# Read MT

mt = hl.read_matrix_table(mt_url)

In [None]:
# View basic properties of MT before filtering
# 
# Note: running 'count()' can be computationally expensive and take
# longer for bigger datasets (these lines can be commented out).

print(f"Num partitions: {mt.n_partitions()}")
print(f"Num samples: {mt.cols().count()}")
mt.describe()

## 3) Filter MT

Let's filter for the following sample IDs: `sample_1_0`, `sample_1_2`, and `sample_1_4`.

*Additional documentation on filtering with Hail: https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols*

In [25]:
# Define sample ID set to filter for

filter_ind_id = hl.set(["sample_1_0", "sample_1_2", "sample_1_4"])  # IDs must exactly match the values stored in the MT

In [26]:
# Filter by checking to see if sample ID (column 's') is in the set of sample IDs to be filtered for (defined above)

filtered_mt = mt.filter_cols(filter_ind_id.contains(mt.s), keep=True)
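
The `keep` argument of `filter_cols` decides whether the matching columns are retained (`keep=True`) or removed (`keep=False`). The plain-Python analogy below illustrates that behavior on an ordinary list. It does not call Hail, the helper `filter_ids` is hypothetical, and the sample IDs are made up for illustration.

```python
def filter_ids(all_ids, wanted, keep=True):
    """Model of the keep flag: keep=True retains matches, keep=False drops them."""
    wanted = set(wanted)
    return [s for s in all_ids if (s in wanted) == keep]


# Hypothetical sample IDs, for illustration only
all_ids = ["sample_1_0", "sample_1_1", "sample_1_2", "sample_1_3", "sample_1_4"]
wanted = ["sample_1_0", "sample_1_2", "sample_1_4"]

kept = filter_ids(all_ids, wanted, keep=True)
dropped = filter_ids(all_ids, wanted, keep=False)
print(kept)     # ['sample_1_0', 'sample_1_2', 'sample_1_4']
print(dropped)  # ['sample_1_1', 'sample_1_3']
```

In the notebook, passing `keep=False` to `filter_cols` would therefore exclude the listed samples instead of selecting them.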

In [None]:
# View basic properties of MT after filtering
# 
# Note: running 'count()' can be computationally expensive and take
# longer for bigger datasets (these lines can be commented out).

print(f"Num partitions: {filtered_mt.n_partitions()}")
print(f"Num samples: {filtered_mt.cols().count()}")
filtered_mt.describe()

After filtering, the number of samples should drop to at most three: one for each requested sample ID that is present in the MT.
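
If the sample count is lower than expected, a requested ID probably does not match the MT exactly. The sketch below shows one way to report which requested IDs were matched and which are missing. The helper `check_requested_ids` is hypothetical; in the notebook the `found` list could come from `filtered_mt.s.collect()`, while the values here are made up.

```python
def check_requested_ids(requested, found):
    """Split requested IDs into those found after filtering and those absent."""
    requested_set, found_set = set(requested), set(found)
    return {
        "matched": sorted(requested_set & found_set),
        "missing": sorted(requested_set - found_set),
    }


requested = ["sample_1_0", "sample_1_2", "sample_1_4"]
# Hypothetical result; in the notebook this could be filtered_mt.s.collect()
found = ["sample_1_0", "sample_1_2"]

report = check_requested_ids(requested, found)
print(report["matched"])  # ['sample_1_0', 'sample_1_2']
print(report["missing"])  # ['sample_1_4']
```

Collecting the IDs triggers a computation on the cluster, so this check is best reserved for small filter sets like the one used here.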

## 4) Create Hail Table from MT and store in DNAX

In [28]:
# Create a Hail Table of the column (sample) fields from the filtered MT

filtered_tb = filtered_mt.cols()

In [30]:
# Define database and table name

db_name = "database_name"
tb_name = "filtered_sampleid.ht"

In [None]:
# Create database in DNAX

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()
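
Because the database name is interpolated directly into the SQL string, it can help to validate it first. The following is a minimal sketch under stated assumptions: the helper `create_database_stmt` is hypothetical, and the identifier pattern (a lowercase letter followed by lowercase letters, digits, or underscores) is a conservative choice made for this example, not a documented DNAX rule.

```python
import re

# Assumption: a conservative identifier pattern chosen for this sketch.
_IDENTIFIER = re.compile(r"[a-z][a-z0-9_]*")


def create_database_stmt(db_name: str) -> str:
    """Build the CREATE DATABASE statement, rejecting unexpected names."""
    if not _IDENTIFIER.fullmatch(db_name):
        raise ValueError(f"unsupported database name: {db_name!r}")
    return f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"


stmt = create_database_stmt("database_name")
print(stmt)  # CREATE DATABASE IF NOT EXISTS database_name LOCATION 'dnax://'
```

The resulting string is the same statement built in the cell above and could be passed to `spark.sql` unchanged.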

In [None]:
# Store Table in DNAX

import dxpy

# Find the ID of the newly created database using dxpy
db_id = dxpy.find_one_data_object(name=db_name, classname="database")['id']
url = f"dnax://{db_id}/{tb_name}"

# Before this step, the Hail Table is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.Table.html#hail.Table.write for additional documentation.
filtered_tb.write(url) # Note: output should describe size of Table (i.e. number of rows, partitions)