Notebook 3: Sequencing/diversity and QC metrics notebook

Plots that still need to be added to this notebook (in progress, by Ally):
1. Contamination/Freemix - sample_filters.contamination and bam_metrics.freemix
2. Heterozygosity - plot the distribution to show how, as expected, it doesn't work with diverse populations (e.g. high for AFR, low for FIN)
3. Plot fail_n_snp_residual and, using that as an example, explain the rest of the filters in the description. These plots were created early on in the project and show which gnomAD QC filters were dropping whole populations. They are currently in R and need to be implemented in Hail.

## Index
- [General Overview](#General-Overview)
- [Read in Datasets and Annotate](#Read-in-Datasets-and-Annotate)
- [gnomAD Filter QC](#gnomAD-Filter-QC)
- [Remove Duplicate Sample](#Remove-Duplicate-Sample)
- [Filter to Only PASS Variants](#Filter-to-Only-PASS-Variants)
- [Write Out Matrix Table](#Write-Out-Matrix-Table)

# General Overview 
The purpose of this notebook is to set up the merged HGDP+1kGP dataset correctly so it can be used for subsequent analyses. It contains steps on how to:

- Read in the merged Hail matrix table and annotate variant QC metadata (in the form of a Hail table) onto it
- Using plots, identify which gnomAD QC filters are removing populations entirely (fail_n_snp_residual used as an example)
- Retrieve those populations (mostly AFR and OCE populations)
- Remove a duplicate sample
- Filter the matrix table to only PASS variants (those which passed variant QC)
- Write out a matrix table and read it back in
- Plot certain fields from the matrix table:
    - Contamination
    - Freemix
    - Heterozygosity (plot the distribution to show how, as expected, it doesn't work with diverse populations, e.g. high for AFR, low for FIN)

Author: Mary T. Yohannes

In [None]:
# import hail
import hail as hl

# Read in Datasets and Annotate

In [None]:
# read in the dataset Ally produced 
# metadata from Alicia + sample QC metadata from Julia + densified mt from Konrad
# no samples or variants removed yet  
mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/pre_qc_final.mt') 
print(mt.count()) # 211358784 snps & 4151 samples - counts prior to any analysis  

In [None]:
# read in variant QC metadata - the gnomAD v3.1.1 sites table (per the path below)
var_meta = hl.read_table('gs://gcp-public-data--gnomad/release/3.1.1/ht/genomes/gnomad.genomes.v3.1.1.sites.ht')

# annotate variant QC metadata onto mt 
mt = mt.annotate_rows(**var_meta[mt.locus, mt.alleles]) 

mt.describe()
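
The keyed annotation above can be pictured without Hail. The following is a minimal plain-Python sketch, not Hail code: the loci, alleles, and the single `filters` field are made-up stand-ins, and it assumes that a variant absent from the QC table ends up with missing values (shown here as `None`).

```python
# Toy stand-in for the matrix table rows, keyed by (locus, alleles). Values are hypothetical.
mt_rows = [
    {"locus": "chr1:1000", "alleles": ("A", "G")},
    {"locus": "chr1:2000", "alleles": ("C", "T")},
]

# Toy stand-in for the variant QC table, keyed the same way.
var_meta = {
    ("chr1:1000", ("A", "G")): {"filters": set()},
}
meta_fields = ["filters"]  # fields that get copied onto each row

annotated = []
for row in mt_rows:
    # Look up the QC record by the row key; None if the variant is not in the QC table.
    meta = var_meta.get((row["locus"], row["alleles"]))
    extra = {f: (meta[f] if meta is not None else None) for f in meta_fields}
    annotated.append({**row, **extra})

for row in annotated:
    print(row["locus"], row["filters"])
```

In this toy version the second variant has no QC record, so its `filters` value is `None` rather than an empty set. That distinction matters later when filtering to PASS variants.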

# gnomAD Filter QC

In [None]:
# collect the names of the gnomAD sample QC filters (the field names under "sample_filters") into a Python set - looks like: {'sex_aneuploidy', 'insert_size', ...}
# a set is unordered, and having the names in a set avoids issues later when filtering the matrix table using difference()
all_sample_filters = set(mt['sample_filters']) 

In [None]:
import re # for renaming purposes

# bad_sample_filters are filters that removed whole populations (mostly AFR and OCE populations) even though their samples passed all other gnomAD filters
# pick out the filter names that start with 'fail_' (9 filters) and strip that leading 'fail_' from each name
bad_sample_filters = {re.sub('^fail_', '', x) for x in all_sample_filters if x.startswith('fail_')} 
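
As a quick check of the renaming step, here is a small standalone sketch on a hand-written subset of filter names. The subset is illustrative only; the real `sample_filters` struct has more fields.

```python
import re

# Illustrative subset of the field names under `sample_filters` (not the full list).
all_sample_filters = {"sex_aneuploidy", "insert_size", "qc_metrics_filters", "fail_n_snp_residual"}

# Keep only names that start with 'fail_' and strip that leading prefix.
bad_sample_filters = {
    re.sub(r"^fail_", "", name)
    for name in all_sample_filters
    if name.startswith("fail_")
}
print(bad_sample_filters)
```

The stripped names are what get compared against the entries of `qc_metrics_filters`, which is why the `fail_` prefix has to go.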

In [None]:
# keep only samples that passed all gnomAD QC or that failed only filters in bad_sample_filters
# 'qc_metrics_filters' is under 'sample_filters' and holds the set of all QC filters a particular sample failed
# if a sample passed all gnomAD QC filters, its 'qc_metrics_filters' entry is an empty set
# difference() removes the names in bad_sample_filters from that set, so what is left are the failed filters that are NOT in bad_sample_filters
# if nothing is left (length 0), the sample either passed everything or failed only filters in bad_sample_filters - keep it
# if anything is left, the sample failed at least one filter that is not in bad_sample_filters - remove it
# same as gs://african-seq-data/hgdp_tgp/hgdp_tgp_dense_meta_filt.mt - 211358784 snps & 4120 samples  
mt_filt = mt.filter_cols(mt['sample_filters']['qc_metrics_filters'].difference(bad_sample_filters).length() == 0) 
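
The set-difference logic in the cell above can be tried on a few made-up samples in plain Python. This is only a sketch of the logic, not Hail code: the sample names and `other_metric` are hypothetical, and `bad_sample_filters` is cut down to a single entry.

```python
# Filters we are willing to ignore (cut down to one entry for the example).
bad_sample_filters = {"n_snp_residual"}

# Hypothetical samples mapped to the set of QC filters each one failed.
qc_metrics_filters = {
    "sample_A": set(),                               # passed everything -> keep
    "sample_B": {"n_snp_residual"},                  # failed only an ignored filter -> keep
    "sample_C": {"n_snp_residual", "other_metric"},  # also failed a non-ignored filter -> remove
}

# Keep a sample when nothing remains after removing the ignored filters.
kept = [
    sample
    for sample, failed in qc_metrics_filters.items()
    if len(failed - bad_sample_filters) == 0
]
print(kept)
```

Here `sample_C` is dropped because `other_metric` survives the difference, which mirrors the `length() == 0` test used on the matrix table.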

In [None]:
# How many samples were removed by the initial QC?
# count the columns (samples) once per matrix table instead of calling count() repeatedly
n_before = mt.count_cols() # 4151
n_after = mt_filt.count_cols() # 4120

print('Num of samples before initial QC = ' + str(n_before)) # 4151
print('Num of samples after initial QC = ' + str(n_after)) # 4120
print('Samples removed = ' + str(n_before - n_after)) # 31

# Remove Duplicate Sample

In [None]:
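# duplicate sample - NA06985
# distinct_by_col() keeps exactly one column for each unique column key, so the extra copy is dropped
mt_filt = mt_filt.distinct_by_col()
print('Num of samples after removal of duplicate sample = ' + str(mt_filt.count_cols())) # 4119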
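
Before dropping duplicates it can help to confirm which sample IDs are actually repeated. Below is a plain-Python sketch of that check and of the de-duplication itself. Apart from NA06985, the IDs are placeholders, and keeping the first occurrence is a choice made for this sketch rather than a statement about which copy Hail retains.

```python
from collections import Counter

# Hypothetical list of column keys (sample IDs); NA06985 appears twice.
sample_ids = ["NA06985", "sample_X", "NA06985", "sample_Y"]

# IDs that occur more than once.
duplicated = [s for s, n in Counter(sample_ids).items() if n > 1]
print(duplicated)

# Keep one entry per unique ID, preserving first-seen order.
deduped = list(dict.fromkeys(sample_ids))
print(len(sample_ids), "->", len(deduped))
```

The count dropping by exactly one matches the 4120 to 4119 change reported above.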

# Filter to Only PASS Variants

In [None]:
# subset to only PASS variants (those which passed variant QC) ~13min to run 
# a variant is PASS when its 'filters' set is empty, so remove every row with a non-empty set
mt_filt = mt_filt.filter_rows(hl.len(mt_filt.filters) != 0, keep=False)
print('Num of PASS variants = ' + str(mt_filt.count_rows())) # 155648020
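
The PASS rule is "the `filters` set is empty". The plain-Python sketch below applies it to three made-up variants. The variant names are hypothetical and `AS_VQSR` is used only as an illustrative filter label. The third variant models one with no QC record, and the sketch assumes such variants are dropped too, which should be checked against Hail's handling of missing predicates in `filter_rows`.

```python
# Hypothetical variants with their `filters` annotation.
variants = [
    {"variant": "var_1", "filters": set()},        # empty set -> PASS
    {"variant": "var_2", "filters": {"AS_VQSR"}},  # failed a variant filter -> not PASS
    {"variant": "var_3", "filters": None},         # no QC record (missing) -> assumed dropped
]

# Keep only variants whose filters set exists and is empty.
pass_variants = [
    v["variant"]
    for v in variants
    if v["filters"] is not None and len(v["filters"]) == 0
]
print(pass_variants)
```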

# Write Out Matrix Table 

In [None]:
# write out file since it is used across multiple nbs (>1hr to run)
mt_filt.write('gs://hgdp-1kg/hgdp_tgp/intermediate_files/pre_running_varqc.mt', overwrite=False)