# Replace Sample IDs with Hail

This notebook shows how to replace sample IDs in a Hail MatrixTable using a mapping Table. See the documentation for guidance on launch specs for the JupyterLab with Spark Cluster app at different data sizes: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Note: For population-scale data, sample IDs may be referred to as individual IDs. This notebook uses the word "sample" throughout.

Pre-conditions for running this notebook successfully:
- There is an existing Hail MatrixTable in DNAX
- There is a mapping file of the sample IDs in the project

## 1) Initiate Spark and Hail

In [None]:
# Running this cell will output a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

## 2) Read MT

The URL of a MatrixTable stored in an Apollo database should have the form: `dnax://<database_ID>/<mt_name>`
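As a small illustration of the URL format above, the sketch below assembles a `dnax://` URL from its two parts. The helper name `dnax_url` is hypothetical, and the check that the ID starts with `database-` is an assumption based on the example ID used in this notebook.

```python
def dnax_url(database_id: str, mt_name: str) -> str:
    """Build a URL of the form dnax://<database_ID>/<mt_name>."""
    # Assumption: database IDs look like "database-<alphanumeric>",
    # as in the example used in this notebook.
    if not database_id.startswith("database-"):
        raise ValueError(f"unexpected database ID: {database_id!r}")
    return f"dnax://{database_id}/{mt_name}"


example_url = dnax_url("database-GFpXJ5j0vzZxPZQ2Ggf14x7q", "geno.mt")
print(example_url)
```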

In [None]:
# Define the MT URL

mt_url = "dnax://database-GFpXJ5j0vzZxPZQ2Ggf14x7q/geno.mt"

In [None]:
# Read the MT

mt = hl.read_matrix_table(mt_url)

In [None]:
# View sample IDs in the MT
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

mt.s.show(5)

## 3) Create a Hail Table that maps sample IDs

All files uploaded to the project before the JupyterLab app is launched are mounted (https://documentation.dnanexus.com/user/jupyter-notebooks?#accessing-data) and can be accessed under `/mnt/project/<path_to_data>`. The file URL follows the format: `file:///mnt/project/<path_to_data>`
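The path-to-URL convention above can be sketched as a short helper. The function name `project_file_url` is hypothetical; it only joins strings and does not check that the file exists.

```python
from pathlib import PurePosixPath


def project_file_url(path_in_project: str) -> str:
    """Turn a path inside the project into a file:// URL under /mnt/project."""
    # Strip any leading slash so the path is joined under the mount point
    mounted = PurePosixPath("/mnt/project") / path_in_project.lstrip("/")
    return f"file://{mounted}"


mapping_url = project_file_url("use_cases/100_sample/sampleidmap.csv")
print(mapping_url)
```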

The mapping file being used in this notebook has two columns: 
- `old_sample_id`: the sample ID that is in the MT (this column will be the key)
- `new_sample_id`: the new sample ID that will replace the old one
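Before importing the file into Hail, it can help to check that the mapping is one-to-one. The following is a minimal sketch in plain Python, assuming the two column names described above; the function name `check_mapping` and the toy data are illustrative only and are not part of the original notebook.

```python
import csv
import io
from collections import Counter


def check_mapping(lines, old_col="old_sample_id", new_col="new_sample_id"):
    """Report duplicate or empty IDs in a sample ID mapping CSV."""
    reader = csv.DictReader(lines)
    header = reader.fieldnames or []
    missing_cols = [c for c in (old_col, new_col) if c not in header]
    if missing_cols:
        raise ValueError(f"mapping file is missing columns: {missing_cols}")

    rows = list(reader)
    old_counts = Counter(r[old_col] for r in rows)
    new_counts = Counter(r[new_col] for r in rows)
    return {
        # Old IDs listed more than once make the lookup ambiguous
        "duplicate_old_ids": sorted(k for k, n in old_counts.items() if n > 1),
        # New IDs listed more than once would give two samples the same ID
        "duplicate_new_ids": sorted(k for k, n in new_counts.items() if n > 1),
        # Old IDs whose replacement is blank
        "empty_new_ids": sorted(r[old_col] for r in rows if not r[new_col]),
    }


# Toy mapping with one repeated old ID (S2) and one repeated new ID (A1)
toy_file = io.StringIO(
    "old_sample_id,new_sample_id\nS1,A1\nS2,A2\nS2,A3\nS4,A1\n"
)
report = check_mapping(toy_file)
print(report)
```

For a real mapping file, the same function could be called on a file object opened from `/mnt/project/<path_to_data>`.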

In [None]:
# Import the mapping file as a Hail Table

# The key column's values and type must match the MT 's' column.
# Note: impute=True infers column types, so numeric-looking IDs may be read as integers.
mapping_table = hl.import_table("file:///mnt/project/use_cases/100_sample/sampleidmap.csv",
                                delimiter=',',
                                impute=True,
                                key='old_sample_id')

In [None]:
# View mapping Table
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

mapping_table.show(5)

## 4) Replace sample IDs in MT using the mapping Table

*Additional documentation: https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_cols*

In [None]:
# Annotate the 's' column with its new sample ID from the mapping Table by its key

replaced_mt = mt.annotate_cols(**mapping_table[mt.s])

In [None]:
# View basic structure of MT after annotating with new sample IDs

replaced_mt.describe()

We can see there's a new column field called "new_sample_id" in the MT. (This column is from the mapping Table we annotated with in the last step)
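Indexing the mapping Table by `mt.s` behaves like a left join, so any sample without an entry in the mapping gets a missing `new_sample_id`. The sketch below is one way to list such samples in plain Python. The function name `unmapped_samples` and the toy IDs are hypothetical; in the notebook the two lists could be obtained with `mt.s.collect()` and `mapping_table.old_sample_id.collect()`, assuming the sample list is small enough to collect to the driver.

```python
def unmapped_samples(sample_ids, mapped_ids):
    """Return sample IDs with no entry in the mapping, preserving order."""
    mapped = set(mapped_ids)
    return [s for s in sample_ids if s not in mapped]


# Toy values standing in for the collected MT sample IDs and mapping keys
mt_samples = ["S1", "S2", "S3"]
mapping_keys = ["S1", "S3"]

missing = unmapped_samples(mt_samples, mapping_keys)
print(missing)
```

An empty result means every sample in the MT will receive a new ID.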

*Additional documentation: https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.key_cols_by, https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.drop*

In [None]:
# Re-key the columns so that 's' holds the new sample IDs, then drop
# the 'new_sample_id' field created by annotating in the previous step

replaced_mt = replaced_mt.key_cols_by(s=replaced_mt.new_sample_id).drop("new_sample_id")

In [None]:
# View basic structure of MT after dropping column

replaced_mt.describe()

We can see that the "new_sample_id" column field has been dropped, leaving 's' as the sample ID field.

In [None]:
# View new sample IDs in the MT
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

replaced_mt.s.show(5)

We see that the sample IDs in the MT have been replaced with the new sample IDs!