# GWAS with Hail

This notebook shows how to run a GWAS for a single case–control trait using Firth logistic regression in Hail, and how to save the results as a Hail Table in an Apollo database (dnax://) on the DNAnexus platform. For guidance on launch specs for the JupyterLab with Spark Cluster app at different data sizes, see the documentation: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Note: For population-scale data, samples may be referred to as individuals. This notebook uses the word "sample" throughout.

Pre-conditions for running this notebook successfully:
- There is an existing Hail MatrixTable in DNAX
- There is an existing variant QC Table in DNAX (see *pre-GWAS with Hail: Locus QC*)
- There is an existing sample QC Table in DNAX (see *pre-GWAS with Hail: Sample QC*)
- There is phenotypic data for the samples

## 1) Initiate Spark and Hail

In [None]:
# Running this cell will output a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

## 2) Read MT

In [None]:
# define MT url

mt_url = "dnax://database-GFpXJ5j0vzZxPZQ2Ggf14x7q/geno.mt"

In [None]:
# read MT

mt = hl.read_matrix_table(mt_url)

In [None]:
# View structure of MT before adding pheno data

mt.describe()

## 3) Create pheno Table

Phenotypic trait data can come from different sources (e.g. cohorts from the Cohort Browser or a separate text file). In this notebook, we obtain the pheno data from a CSV file that was uploaded to the project. In this (very basic) example, we look at the phenotypic trait `is_case` for each sample. Its value indicates whether the sample is a case (`is_case=true`) or a control (`is_case=false`).
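
To make the expected shape of the pheno data concrete, here is a minimal plain-Python sketch that parses a small CSV with the same two columns and counts cases and controls. The file content and the helper name `load_pheno` are hypothetical; the real `pheno.csv` may contain more columns, and Hail's own import is what the notebook actually uses.

```python
import csv
import io

# Hypothetical file content; the real pheno.csv may have more columns.
PHENO_CSV = """sample_id,is_case
sample_1,true
sample_2,false
sample_3,true
"""


def load_pheno(text):
    """Parse the CSV text into a dict mapping sample_id -> is_case (bool)."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["sample_id"]: row["is_case"].strip().lower() == "true"
        for row in reader
    }


pheno = load_pheno(PHENO_CSV)
n_cases = sum(pheno.values())
n_controls = len(pheno) - n_cases
print(f"cases={n_cases}, controls={n_controls}")
```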

All data uploaded to the project before running the JupyterLab app is mounted (https://documentation.dnanexus.com/user/jupyter-notebooks?#accessing-data) and can be accessed at `/mnt/project/<path_to_data>`. The file URL follows the format: `file:///mnt/project/<path_to_data>`.
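
As a small illustration of this path convention, the sketch below builds the `file://` URL from a project-relative path. The helper name `project_file_url` is hypothetical and not part of any DNAnexus or Hail API; it only assumes the `/mnt/project` mount point described above.

```python
MOUNT_POINT = "/mnt/project"  # mount point described in the text above


def project_file_url(path_in_project):
    """Return the file:// URL for a path relative to the project root (hypothetical helper)."""
    # Strip leading slashes so the path is always placed under the mount point.
    relative_path = path_in_project.lstrip("/")
    return f"file://{MOUNT_POINT}/{relative_path}"


print(project_file_url("use_cases/GWAS/pheno.csv"))
```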

In [None]:
# Import the pheno CSV file as a Hail Table

pheno_table = hl.import_table("file:///mnt/project/use_cases/GWAS/pheno.csv",
                              delimiter=',',
                              impute=True,
                              key='sample_id') # specify the column that will be the key (values must match what is in the MT 's' column)

In [None]:
# View structure of pheno Table

pheno_table.describe()

## 4) Annotate MT with pheno Table

In [None]:
# Annotate the MT with pheno Table by matching the MT's column key ('s') with the pheno Table's key ('sample_id')

phenogeno_mt = mt.annotate_cols(**pheno_table[mt.s])

In [None]:
# View structure of MT after annotating with pheno Table

phenogeno_mt.describe()

We see that the pheno traits have been added to the column fields of the MT.
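
The annotation step behaves like a keyed lookup: every column of the MT is kept, and a sample whose ID has no row in the pheno Table receives missing values. The plain-Python sketch below mimics that behavior with a dictionary. The sample IDs, values, and the helper name `annotate_samples` are made up for illustration, and no Hail calls are involved.

```python
# Hypothetical column keys of the MT (the 's' field) and pheno rows keyed by sample_id.
mt_samples = ["sample_1", "sample_2", "sample_4"]
pheno_by_id = {
    "sample_1": {"is_case": True},
    "sample_2": {"is_case": False},
    "sample_3": {"is_case": True},  # present in the pheno data but not in the MT
}


def annotate_samples(samples, pheno):
    """Attach is_case to each sample; samples without a pheno row get None (missing)."""
    return [
        {"s": s, "is_case": pheno.get(s, {}).get("is_case")}
        for s in samples
    ]


annotated = annotate_samples(mt_samples, pheno_by_id)
unmatched = [row["s"] for row in annotated if row["is_case"] is None]
print(annotated)
print("Samples without pheno data:", unmatched)
```

Checking for unmatched samples before running the regression can catch ID mismatches early, for example IDs read as integers on one side and stored as strings on the other.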

## 5) Filter MT using QC Tables

#### 5a) Filter locus QC Table

In [None]:
# Define locus QC Table url

locus_qc_url = "dnax://database-GFpXJ5j0vzZxPZQ2Ggf14x7q/variant_qc.ht"

In [None]:
# Read locus QC Table

pre_locus_qc_tb = hl.read_table(locus_qc_url)

In [None]:
# View structure of locus QC Table

pre_locus_qc_tb.describe()

Let's filter for loci that have:
- an AF value between 0.001 and 0.999 (inclusive),
- a HWE p-value greater than or equal to 1e-10,
- a call rate greater than or equal to 0.9
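
The three criteria above can be read as a single predicate applied to each locus. The sketch below expresses that predicate in plain Python on made-up values. The `LocusQC` class and `passes_locus_qc` function are hypothetical stand-ins for the fields of the QC Table, not Hail objects.

```python
from dataclasses import dataclass


@dataclass
class LocusQC:
    """Hypothetical stand-in for the fields used from the 'variant_qc' struct."""
    af: float           # allele frequency
    p_value_hwe: float  # Hardy-Weinberg equilibrium p-value
    call_rate: float    # fraction of samples with a called genotype


def passes_locus_qc(qc, min_af=0.001, max_af=0.999, min_hwe_p=1e-10, min_call_rate=0.9):
    """Return True if the locus satisfies all three (inclusive) thresholds."""
    return (
        min_af <= qc.af <= max_af
        and qc.p_value_hwe >= min_hwe_p
        and qc.call_rate >= min_call_rate
    )


# Made-up loci, each failing at most one criterion.
loci = {
    "common_variant": LocusQC(af=0.75, p_value_hwe=0.4, call_rate=0.99),
    "rare_variant": LocusQC(af=0.9995, p_value_hwe=0.8, call_rate=0.99),
    "hwe_outlier": LocusQC(af=0.6, p_value_hwe=1e-12, call_rate=0.99),
    "poorly_called": LocusQC(af=0.6, p_value_hwe=0.5, call_rate=0.85),
}
kept = [name for name, qc in loci.items() if passes_locus_qc(qc)]
print(kept)
```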

In [None]:
# Filter QC Table using expressions
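# Note: Viewing the structure of the locus QC table in the cell above
# shows us that the "AF", "p_value_hwe", and "call_rate" fields are within
# the "variant_qc" struct field. "AF" is an array with one entry per allele;
# if the Table was produced by hl.variant_qc, AF[0] is the reference allele frequency.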

locus_qc_tb = pre_locus_qc_tb.filter(
    (pre_locus_qc_tb["variant_qc"]["AF"][0] >= 0.001) & 
    (pre_locus_qc_tb["variant_qc"]["AF"][0] <= 0.999) & 
    (pre_locus_qc_tb["variant_qc"]["p_value_hwe"] >= 1e-10) & 
    (pre_locus_qc_tb["variant_qc"]["call_rate"] >= 0.9)
)

In [None]:
# View number of loci in QC Table before and after filtering
#
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

print(f"Num loci before filtering: {pre_locus_qc_tb.count()}")
print(f"Num loci after filtering: {locus_qc_tb.count()}")

#### 5b) Filter sample QC Table

In [None]:
# Define sample QC Table url

sample_qc_url = "dnax://database-GFpXJ5j0vzZxPZQ2Ggf14x7q/sample_qc.ht"

In [None]:
# Read sample QC Table

pre_sample_qc_tb = hl.read_table(sample_qc_url)

In [None]:
# View structure of sample QC Table

pre_sample_qc_tb.describe()

Let's filter for samples that have a call rate greater than or equal to 0.99.
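
For intuition, a sample's call rate is the fraction of its genotypes that are not missing. The sketch below computes it in plain Python on made-up data and applies the 0.99 threshold. The function name `call_rate` and the genotype lists are assumptions for illustration; the notebook itself reads the precomputed value from the sample QC Table.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    if not genotypes:
        raise ValueError("genotypes must not be empty")
    n_called = sum(g is not None for g in genotypes)
    return n_called / len(genotypes)


# Hypothetical genotype calls (alt-allele counts) for two samples across 200 loci.
samples = {
    "sample_1": [0] * 199 + [None],       # 199 of 200 called
    "sample_2": [1] * 190 + [None] * 10,  # 190 of 200 called
}
passing = [s for s, gts in samples.items() if call_rate(gts) >= 0.99]
print(passing)
```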

In [None]:
# Filter sample QC Table using expressions
# Note: Viewing the structure of the sample QC table in the cell above
# shows us that the "call_rate" field is within the "sample_qc" struct field

sample_qc_tb = pre_sample_qc_tb.filter(
    pre_sample_qc_tb["sample_qc"]["call_rate"] >= 0.99) 

In [None]:
# View number of samples in QC Table before and after filtering
#
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

print(f"Num samples before filtering: {pre_sample_qc_tb.count()}")
print(f"Num samples after filtering: {sample_qc_tb.count()}")

#### 5c) Filter MT with both QC Tables
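
A semi-join keeps only the rows (or columns) whose key also appears in another table, without adding any fields from that table. The plain-Python sketch below shows the idea on simplified, made-up keys. The helper `semi_join` is hypothetical, and the keys are illustrative only (for example, a Hail row key usually also includes the alleles).

```python
def semi_join(keys, allowed):
    """Keep the keys that also appear in `allowed`, preserving the original order."""
    allowed_set = set(allowed)
    return [k for k in keys if k in allowed_set]


# Hypothetical keys: loci as (contig, position) tuples and sample IDs as strings.
mt_loci = [("chr1", 100), ("chr1", 200), ("chr2", 50)]
mt_samples = ["sample_1", "sample_2", "sample_3"]
qc_loci = [("chr1", 100), ("chr2", 50)]  # loci that passed locus QC
qc_samples = ["sample_1", "sample_3"]    # samples that passed sample QC

kept_loci = semi_join(mt_loci, qc_loci)
kept_samples = semi_join(mt_samples, qc_samples)
print(len(kept_loci), "loci x", len(kept_samples), "samples")
```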

In [None]:
# Filter the MT using the locus QC Table

qc_mt = phenogeno_mt.semi_join_rows(locus_qc_tb)

In [None]:
# Filter the MT using the sample QC Table

qc_mt = qc_mt.semi_join_cols(sample_qc_tb)

In [None]:
# View MT after QC filters
# 
# Note: running 'mt.rows().count()' or 'mt.cols().count()' can be computationally 
# expensive and take longer for bigger datasets (these lines can be commented out).

print(f"Num partitions: {qc_mt.n_partitions()}")
print(f"Num loci: {qc_mt.rows().count()}")
print(f"Num samples: {qc_mt.cols().count()}")
qc_mt.describe()

## 6) Run GWAS

Additional documentation: https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows
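
In this regression, each genotype enters the model as the number of non-reference alleles it carries. The plain-Python sketch below illustrates that encoding for a few calls written as tuples of allele indices (0 is the reference allele). The function here is a hypothetical stand-in that only mirrors the counting, not Hail's implementation.

```python
def n_alt_alleles(call):
    """Count non-reference alleles in a call given as allele indices (0 = reference)."""
    if call is None:
        return None  # a missing call has no defined count
    return sum(1 for allele in call if allele > 0)


for call in [(0, 0), (0, 1), (1, 1), (1, 2), None]:
    print(call, "->", n_alt_alleles(call))
```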

In [None]:
# Run Hail's logistic regression method

gwas = hl.logistic_regression_rows(test="firth",
                                   y=qc_mt.is_case, # the column field of the pheno trait we are looking at ('is_case')
                                   x=qc_mt.GT.n_alt_alleles(), # n_alt_alleles() returns the count of non-reference alleles
                                   covariates=[1.0]) # 1.0 is the intercept term, which must be included explicitly

In [None]:
# View structure of GWAS results Table

gwas.describe()

## 7) Save GWAS results Table in Apollo Database
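
The DNAX URLs used in this notebook combine a database ID and a table name in the form `dnax://<database_id>/<table_name>`. The sketch below assembles such a URL in plain Python. The helper name `dnax_url` and the placeholder ID `database-XXXX` are hypothetical; a real ID is looked up with dxpy once the database exists.

```python
def dnax_url(database_id, table_name):
    """Build a dnax:// URL from a database ID and a table name (hypothetical helper)."""
    if not database_id.startswith("database-"):
        raise ValueError(f"expected an ID starting with 'database-', got {database_id!r}")
    return f"dnax://{database_id}/{table_name}"


# "database-XXXX" is a placeholder, not a real database ID.
print(dnax_url("database-XXXX", "gwas.ht"))
```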

In [None]:
# Define database and table name

db_name = "database_name"
tb_name = "gwas.ht"

In [None]:
# Create database in DNAX

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()

In [None]:
# Store Table in DNAX

import dxpy

# Find the ID of the newly created database using a dxpy method
db_id = dxpy.find_one_data_object(name=db_name, classname="database")['id']
url = f"dnax://{db_id}/{tb_name}"

# Before this step, the Hail Table is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.Table.html#hail.Table.write for additional documentation.
gwas.write(url) # Note: output should describe size of Table (i.e. number of rows, partitions)