Notebook 3: Sequencing/diversity and QC metrics notebook

Plots that need to be added to nb (currently in the works by Ally) 
1. Contamination/Freemix - sample_filters.contamination and bam_metrics.freemix - *DONE*
2. Heterozygosity - plot distribution to show how it doesn’t work as expected with diverse populations (e.g. high for AFR, low for FIN) - *PENDING*
3. Plot fail_n_snp_residual and, using it as an example, explain the remaining filters in the description. These plots were created early in the project to show which gnomAD QC filters were dropping whole populations. They are currently in R and need to be implemented in Hail - *PENDING*

---------------------
Further edits needed in this nb:
- Add Ally's code for plots 

## Index
1. [General Overview](#1.-General-Overview)
2. [Read in Datasets and Annotate](#2.-Read-in-Datasets-and-Annotate)
3. [gnomAD Filter QC](#3.-gnomAD-Filter-QC)
4. [Remove Duplicate Sample](#4.-Remove-Duplicate-Sample)
5. [Filter to Only PASS Variants](#5.-Filter-to-Only-PASS-Variants)
6. [Write Out Matrix Table](#6.-Write-Out-Matrix-Table)

# 1. General Overview 
The purpose of this notebook is to set up the merged HGDP+1kGP dataset correctly so it can be used for subsequent analyses. It contains steps on how to: 

- Read in the merged Hail matrix table and annotate variant QC metadata (in the form of a Hail table) onto it
- Using plots, identify which gnomAD QC filters are removing populations entirely (fail_n_snp_residual used as an example)
- Retrieve those populations (mostly AFR and OCE populations)
- Remove a duplicate sample
- Filter the matrix table to only PASS variants (those which passed variant QC)
- Write out a matrix table and read it back in
- Plot certain fields from the matrix table:
    - Contamination
    - Freemix
    - Heterozygosity (plot the distribution to show how it doesn’t work as expected with diverse populations, e.g. high for AFR, low for FIN)

Author: Mary T. Yohannes

1a. Import needed libraries and packages 

In [7]:
# import hail
import hail as hl

# for renaming purposes
import re 

1b. Input and output path variables to be edited by users as needed 

In [1]:
# input 
input_path = 'gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/pre_qc_final.mt'

# output 
output_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/pre_running_varqc.mt'

# 2. Read in Datasets and Annotate

<details>
<summary> - More about the input dataset [click here] </summary>
<br>
This input matrix table combines 3 datasets: harmonized sample metadata for the HGDP+1KG dataset, gnomAD v3.1 sample QC metadata in which samples that failed gnomAD QC filters are flagged, and a densified HGDP+1KG matrix table. 
</details>

In [None]:
# read-in the matrix table (shortened as mt)
mt = hl.read_matrix_table(input_path) 

# how many snps and samples are there? counts 
print('Num of snps and samples prior to any analysis = ' + str(mt.count())) # 211358784 snps & 4151 samples 

# read in variant QC metadata containing information on which variants passed/failed gnomAD QC filters
var_meta = hl.read_table('gs://gcp-public-data--gnomad/release/3.1.1/ht/genomes/gnomad.genomes.v3.1.1.sites.ht')

# annotate variant QC metadata onto mt 
mt = mt.annotate_rows(**var_meta[mt.locus, mt.alleles]) 

# explore combined mt 
mt.describe()
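
The keyed lookup `var_meta[mt.locus, mt.alleles]` behaves like a left join on the variant key: each row of `mt` picks up the fields of the matching row in `var_meta`, and rows with no match receive missing values. The plain-Python analogue below is only a sketch of that behaviour. The toy variants and the helper `annotate_rows_like` are hypothetical and are not part of Hail.

```python
# Toy analogue of a keyed left join; all data and names here are illustrative.
rows = [
    {"locus": "chr1:1000", "alleles": ("A", "G")},
    {"locus": "chr1:2000", "alleles": ("C", "T")},
]

# Metadata keyed by (locus, alleles), standing in for the variant QC table.
var_meta = {
    ("chr1:1000", ("A", "G")): {"filters": set()},
}


def annotate_rows_like(rows, meta, fields=("filters",)):
    """Copy each metadata field onto the matching row; use None when there is no match."""
    annotated = []
    for row in rows:
        match = meta.get((row["locus"], row["alleles"]))
        extra = {f: (match[f] if match is not None else None) for f in fields}
        annotated.append({**row, **extra})
    return annotated


annotated = annotate_rows_like(rows, var_meta)
for row in annotated:
    print(row["locus"], row["filters"])
```

The second toy variant has no entry in the metadata, so its `filters` value stays missing (`None` here).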

<details>
<summary> - More information on Hail methods and expressions [click here] </summary>

- <a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_matrix_table"> More on  <i> read_matrix_table() </i></a>

- <a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.count"> More on  <i> count() </i></a>

- <a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_table"> More on  <i> read_table() </i></a>

- <a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on  <i> annotate_rows() </i></a>

- <a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.describe"> More on  <i> describe() </i></a>
</details>

# 3. gnomAD Filter QC

<details>
<summary> - Why are we doing this? [click here] </summary>
    
Nine of the 28 gnomAD sample filters were dropping large numbers of ancestrally diverse individuals, mostly from African (AFR) and Oceanian (OCE) populations. These filters use principal components from gnomAD's principal component analysis (PCA), which was computed on other samples, to residualize the distribution of each QC metric across populations and then identify outliers. A sample whose residual is identified as an outlier fails the filter, so a population that those principal components describe poorly can be removed almost entirely. 
</details>
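
To illustrate how a residual-based filter can flag samples, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: a single principal component, a regression fitted only on made-up reference samples, and a cutoff of 4 median absolute deviations (MADs). It is not gnomAD's actual implementation or thresholds.

```python
from statistics import median

# Hypothetical reference samples as (pc1, metric) pairs; values are made up.
reference = [(-2, 81), (-1, 89), (0, 100), (1, 111), (2, 119)]


def fit_line(points):
    """Ordinary least squares fit of metric = intercept + slope * pc1."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(reference)


def residual(pc1, metric):
    """Part of the metric that the principal component does not explain."""
    return metric - (intercept + slope * pc1)


# Centre and spread of the reference residuals.
ref_resid = [residual(x, y) for x, y in reference]
center = median(ref_resid)
mad = median(abs(r - center) for r in ref_resid)

N_MADS = 4  # assumed cutoff, for illustration only


def fails_filter(pc1, metric):
    """Flag a sample whose residual lies more than N_MADS MADs from the centre."""
    return abs(residual(pc1, metric) - center) > N_MADS * mad


# Two hypothetical samples with the same PC value but different metric values.
new_samples = {"typical": (1, 110), "high_metric": (1, 140)}
flags = {name: fails_filter(*values) for name, values in new_samples.items()}
print(flags)
```

In this toy setup, a sample whose metric is much higher than the fitted line predicts is flagged even if the value is normal for its own population, which is the kind of behaviour that can remove a whole population.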

In [18]:
# collect the names of the gnomAD sample QC filter fields in a set 
all_sample_filters = set(mt['sample_filters']) 

# select out the filters that are removing whole populations despite them passing all other gnomAD filters
# if a filter name starts with 'fail_', add it to a new set after removing 'fail_' from the name  
bad_sample_filters = {re.sub('fail_', '', x) for x in all_sample_filters if x.startswith('fail_')} 

# keep the samples that passed all gnomAD QC filters OR only failed the filters that were removing whole populations
mt_filt = mt.filter_cols(mt['sample_filters']['qc_metrics_filters'].difference(bad_sample_filters).length() == 0)

# how many samples were removed by the initial QC? (count each matrix table only once)
n_before = mt.count_cols()
n_after = mt_filt.count_cols()
print('Num of samples before initial QC = ' + str(n_before)) # 4151
print('Num of samples after initial QC = ' + str(n_after)) # 4120
print('Samples removed = ' + str(n_before - n_after)) # 31
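
The set logic behind this filter can be checked on a small scale in plain Python. The sketch below mirrors the two steps above (strip the `fail_` prefix, then keep samples whose failed filters are all in the "bad" set). The field names and the three toy samples are hypothetical stand-ins, not values read from the dataset.

```python
import re

# Hypothetical field names standing in for the fields of mt.sample_filters.
all_sample_filters = {
    "hard_filtered",
    "qc_metrics_filters",
    "fail_n_snp_residual",
    "fail_r_ti_tv_residual",
}

# Same renaming step as in the notebook: drop 'fail_' from matching names.
bad_sample_filters = {
    re.sub("fail_", "", x) for x in all_sample_filters if x.startswith("fail_")
}

# Hypothetical per-sample values of qc_metrics_filters (the filters each sample failed).
samples = {
    "passes_everything": set(),
    "fails_only_residual": {"n_snp_residual"},
    "fails_other_filter": {"n_snp_residual", "some_other_filter"},
}

# Keep a sample when nothing is left after removing the "bad" filters.
kept = sorted(
    name for name, failed in samples.items()
    if len(failed - bad_sample_filters) == 0
)
print(kept)
```

Only the sample that failed a filter outside the "bad" set is dropped. The other two are retained.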

<details>
<summary> - More information on Hail methods and expressions [click here] </summary>

- <a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols"> More on  <i> filter_cols() </i></a>

- <a href="https://hail.is/docs/0.2/hail.expr.SetExpression.html#hail.expr.SetExpression.difference"> More on  <i> difference() </i></a>

- <a href="https://hail.is/docs/0.2/hail.expr.CollectionExpression.html#hail.expr.CollectionExpression.length"> More on  <i> length() </i></a>
</details>

# 4. Remove Duplicate Sample

In [22]:
# NA06985 is a duplicate sample in the dataset, so keep only one column per sample key 
mt_filt = mt_filt.distinct_by_col()
print('Num of samples after removal of duplicate sample = ' + str(mt_filt.count_cols())) # 4119

Num of samples after removal of duplicate sample = 4119
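
As a rough picture of what deduplicating by column key does, the plain-Python sketch below keeps one entry per sample ID. The ID list is illustrative, and keeping the first occurrence is an assumption of this sketch; it is not a statement about which duplicate column Hail retains.

```python
# Illustrative column keys; "NA06985" appears twice, as in the dataset.
sample_ids = ["HG00096", "NA06985", "HGDP00001", "NA06985"]


def distinct_by_key(keys):
    """Keep exactly one entry per unique key (here, the first one seen)."""
    seen = set()
    unique = []
    for key in keys:
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


unique_ids = distinct_by_key(sample_ids)
print(len(sample_ids), "->", len(unique_ids))
```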


<details>
<summary> - More information on Hail methods and expressions [click here] </summary>
    
- <a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.distinct_by_col"> More on  <i> distinct_by_col() </i></a>
</details>

# 5. Filter to Only PASS Variants

In [23]:
# subset dataset to variants that passed gnomAD's variant QC 
mt_filt = mt_filt.filter_rows(hl.len(mt_filt.filters) !=0, keep=False) # ~13min to run 
print('Num of only PASS variants = ' + str(mt_filt.count()[0])) # 155648020

Num of only PASS variants = 155648020
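
One detail of this filter is easy to miss: Hail's `filter_rows()` removes rows where the expression evaluates to missing, regardless of `keep`. A variant with no `filters` annotation is therefore dropped along with the failing variants. The plain-Python sketch below mimics that behaviour on three hypothetical variants, with `None` standing in for a missing value.

```python
# Hypothetical variants mapped to their 'filters' annotation; None means missing.
variants = {
    "pass_variant": set(),
    "failed_variant": {"AS_VQSR"},
    "unannotated_variant": None,
}


def keep_pass_only(variants):
    """Mimic filter_rows(len(filters) != 0, keep=False) with missing-value handling."""
    kept = []
    for name, filters in variants.items():
        if filters is None:
            continue  # predicate is missing: the row is removed
        if len(filters) != 0:
            continue  # predicate is True and keep=False: the row is removed
        kept.append(name)
    return kept


kept = keep_pass_only(variants)
print(kept)
```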


<details>
<summary> - More information on Hail methods and expressions [click here] </summary>
    
- <a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_rows"> More on  <i> filter_rows() </i></a>

- <a href="https://hail.is/docs/0.2/functions/collections.html#hail.expr.functions.len"> More on  <i> hl.len() </i></a>
</details>

# 6. Write Out Matrix Table 

In [None]:
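# write out dataset since it is used across multiple tutorial notebooks 
mt_filt.write(output_path, overwrite=False) # >1hr to run

# read the written dataset back in so later steps use the saved copy
mt_filt = hl.read_matrix_table(output_path)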

<details>
<summary> - More information on Hail methods and expressions [click here] </summary>
    
- <a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.write"> More on  <i> write() </i></a>
</details>