# Metadata and QC

Authors: Zan Koenig, Mary Yohannes, & Ally Kim

## Index
1. [Set Default Paths](#1.-Set-Default-Paths)
2. [Read in Datasets and Apply Quality Control Filters](#2.-Read-in-Datasets-and-Apply-Quality-Control-Filters)
3. [Investigating gnomAD Sample Filters](#3.-Investigating-gnomAD-sample-filters)
4. [Plotting Results of gnomAD Sample Filter Investigation](#4.-Plotting-results-of-gnomAD-sample-filter-investigation)
5. [Pre-QC Plots](#5.-Pre-QC-Plots)
    1. [Number of SNPs](#5a.-Number-of-SNPs)
    2. [Mean Coverage](#5b.-Mean-Coverage)
    3. [Freemix](#5c.-Freemix)
6. [Post-QC Plots](#6.-Post-QC-Plots)
    1. [Number of SNPs](#6a.-Number-of-SNPs)
    2. [Mean Coverage](#6b.-Mean-Coverage)
    3. [Freemix](#6c.-Freemix)
    4. [Site Frequency Spectrum](#6d.-Site-Frequency-Spectrum)

# General Overview
The purpose of this script is to merge the metadata components needed for the HGDP+1kGP dataset and then apply QC filters to the resulting dataset. The metadata include sample and variant information (e.g. geographic region, and which samples and variants passed QC) that was initially located in different datasets. The QC filters use sample and variant flags from the metadata datasets; these flags were generated by running the dataset through the gnomAD QC pipeline. More information on the gnomAD QC pipeline can be found [here](https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#sample-and-variant-quality-control). To see how these filters were updated as a result of our analyses, see [gnomAD sample filters](#3.-Investigating-gnomAD-sample-filters) and the resulting gnomAD [minor release](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/#improvements-to-the-hgdp--1kg-subset-release).

**This script contains information on how to**:
- Use plots to identify which gnomAD QC filters are removing populations entirely (`fail_n_snp_residual` is used as an example)
- Retrieve populations being unduly removed by filters (mostly `AFR` and `OCE` populations)
- Filter Matrix Tables using a field within the Matrix Table
- Filter samples using a hardcoded list of samples to remove
- Plot certain fields from the Matrix Table:
    - Number of SNPs
    - Coverage
    - Site frequency spectrum
    - Freemix
    - Number of samples which failed a filter

In [1]:
import hail as hl

# For renaming purposes
import re

# The import statements below allow for plotting in hail
from hail.ggplot import *
import plotly
import pandas as pd

from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

# 1. Set Default Paths
These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets. 

By default all of the write sections are shown as markdown cells. If you would like to write out your own datasets, you can copy the code and paste it into a new code cell. 

[Back to Index](#Index)

In [None]:
# Path for HGDP+1kGP dataset prior to applying gnomAD QC filters
pre_qc_path = 'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt'

# Path for HGDP+1kGP dataset after applying gnomAD QC filters
post_qc_path = 'gs://hgdp-1kg/tutorial_datasets/metadata_and_qc/post_qc.mt'


# 2. Read in Datasets and Apply Quality Control Filters

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_matrix_table"> More on  <i> read_matrix_table() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_table"> More on  <i> read_table() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on  <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.describe"> More on  <i> describe() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# Read in the HGDP+1kGP pre-QC Matrix Table (shortened as mt)
mt = hl.read_matrix_table(pre_qc_path)

# Check how many SNPs and samples there are
print('Num of SNPs and samples prior to any analysis = ' + str(mt.count())) # 211358784 snps & 4151 samples 

# get mt schema
mt.describe()

In [None]:
# Apply sample QC filters to the dataset
# Keep only the samples that passed gnomAD's sample QC hard filters
mt = mt.filter_cols(~mt.sample_filters.hard_filtered)

# Apply variant QC filters to the dataset
# Keep only PASS variants (those which passed gnomAD's variant QC)
# PASS variants have an empty `filters` field: the field holds the
# variant QC filters that a variant failed, so any entry means failure
# This is the last step in the QC process
mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)

In [None]:
# Writing out the dataset after applying gnomAD sample and variant QC filters
# This is done to speed up downstream steps
mt.write(post_qc_path)
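
The two filters applied above can also be read as plain boolean tests. The sketch below mirrors that logic in plain Python on toy records, without Hail; the sample IDs, loci, and filter names are made up for illustration and are not taken from the dataset.

```python
# Toy stand-ins for Matrix Table columns (samples) and rows (variants).
# All IDs, loci and filter names here are hypothetical.
samples = [
    {"s": "sample_A", "hard_filtered": False},
    {"s": "sample_B", "hard_filtered": True},
    {"s": "sample_C", "hard_filtered": False},
]
variants = [
    {"locus": "chr1:1000", "filters": set()},                   # PASS: nothing failed
    {"locus": "chr1:2000", "filters": {"filter_x"}},            # failed one filter
    {"locus": "chr1:3000", "filters": {"filter_x", "filter_y"}},
]

# Same test as mt.filter_cols(~mt.sample_filters.hard_filtered)
kept_samples = [s["s"] for s in samples if not s["hard_filtered"]]

# Same test as mt.filter_rows(hl.len(mt.filters) != 0, keep=False)
pass_variants = [v["locus"] for v in variants if len(v["filters"]) == 0]

print(kept_samples)
print(pass_variants)
```

Only samples that are not hard-filtered and variants with an empty `filters` set survive, which is what the `filter_cols` and `filter_rows` calls express on the Matrix Table.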

# 3. Investigating gnomAD sample filters
   
Nine of the 28 gnomAD sample filters were removing large numbers of ancestrally diverse individuals, mostly from African (`AFR`) and Oceanian (`OCE`) populations. These filters rely on gnomAD’s ancestry principal component analysis (PCA), which captures genetic variance across the larger gnomAD callset, so smaller, under-represented groups such as those in the HGDP+1kGP callset can erroneously appear as outliers. Here we explore which of the original gnomAD sample QC filters remove entire populations.


<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols"> More on  <i> filter_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.SetExpression.html#hail.expr.SetExpression.difference"> More on  <i> difference() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.CollectionExpression.html#hail.expr.CollectionExpression.length"> More on  <i> length() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# Put the names of the gnomAD sample QC filter fields in a set
all_sample_filters = set(mt['sample_filters'])

# Select the filters that are removing whole populations which pass all other gnomAD filters:
# if a filter name starts with 'fail_', add it to a new set after removing 'fail_' from the name
bad_sample_filters = {re.sub('fail_', '', x) for x in all_sample_filters if x.startswith('fail_')}

# Keep the samples that passed all gnomAD QC filters OR only failed the filters that were removing whole populations
mt_filt = mt.filter_cols(mt['sample_filters']['qc_metrics_filters'].difference(bad_sample_filters).length() == 0)

# How many samples were removed by the initial QC?
# count_cols() counts samples only, and each count is computed once
n_before = mt.count_cols()
n_after = mt_filt.count_cols()
print('Num of samples before initial QC = ' + str(n_before)) # 4151
print('Num of samples after initial QC = ' + str(n_after)) # 4120
print('Samples removed = ' + str(n_before - n_after)) # 31
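
The filter in the cell above keeps a sample when the set of metric filters it failed, minus the "bad" filter names, is empty. A minimal sketch of that set logic in plain Python, without Hail; the field names, metric names, and sample IDs below are hypothetical stand-ins.

```python
import re

# Hypothetical field names of a sample_filters struct
all_sample_filters = {
    "hard_filtered",
    "qc_metrics_filters",
    "fail_metric_one_residual",
    "fail_metric_two_residual",
}

# Strip the 'fail_' prefix, as in the notebook cell
bad_sample_filters = {
    re.sub("fail_", "", x) for x in all_sample_filters if x.startswith("fail_")
}

# Hypothetical per-sample sets of failed metric filters
qc_metrics_filters = {
    "sample_A": set(),                                   # failed nothing
    "sample_B": {"metric_one_residual"},                 # failed only a "bad" filter
    "sample_C": {"metric_one_residual", "other_metric"}, # also failed something else
}

# Keep a sample if nothing remains after removing the "bad" filters
kept = sorted(
    s for s, failed in qc_metrics_filters.items()
    if len(failed - bad_sample_filters) == 0
)
print(kept)
```

In this toy example `sample_C` is the only one dropped, because it failed a filter outside the "bad" set.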

# 4. Plotting results of gnomAD sample filter investigation
[Back to Index](#Index)

In [None]:
# Import the population-level table of gnomAD filter failures (CSV, one row per population)
filepath = "gs://hgdp-1kg/hgdp_tgp/intermediate_files/failed_filters_population_level.csv"
filters = hl.import_table(filepath, delimiter=',')

In [None]:
# Grab only "sample_filters.fail_n_snp_residual" and convert the columns to usable types
# The slice on population removes the extraneous quotation marks around each value
n_snp_resid = filters.annotate(
    population = filters['"population"'][1:hl.len(filters['"population"'])-1],
    num_samples = hl.int(filters['"num_of_samples"']),
    fail_n_snp_resid = hl.int(filters['"sample_filters.fail_n_snp_residual"']),
    fail_gnomAD = hl.str(filters['"failed_gnomAD"']))

# Keep only the newly annotated fields
n_snp_resid = n_snp_resid.select("population", "num_samples", "fail_n_snp_resid", "fail_gnomAD")

# Calculate the ratio between the number of samples that failed and the total number of samples in the population
n_snp_resid = n_snp_resid.annotate(fail_ratio = n_snp_resid.fail_n_snp_resid/n_snp_resid.num_samples)

In [None]:
n_snp_resid.show()
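
The quote stripping and the fail ratio can be checked by hand on a few rows. The sketch below reproduces both steps in plain Python, without Hail; the population names and counts are made up, and the assumption that only the header and the population values carry quotation marks is mine, inferred from the column handling above.

```python
# Hypothetical CSV lines; keys keep their quotation marks, as in the imported table
raw_lines = [
    '"population","num_of_samples","sample_filters.fail_n_snp_residual"',
    '"pop_1",20,0',
    '"pop_2",10,10',
    '"pop_3",8,2',
]

header = raw_lines[0].split(",")
rows = []
for line in raw_lines[1:]:
    rec = dict(zip(header, line.split(",")))
    population = rec['"population"'][1:-1]  # drop the surrounding quotation marks
    num_samples = int(rec['"num_of_samples"'])
    failed = int(rec['"sample_filters.fail_n_snp_residual"'])
    rows.append({
        "population": population,
        "num_samples": num_samples,
        "fail_n_snp_resid": failed,
        "fail_ratio": failed / num_samples,
    })

# Populations in which every sample failed the filter
fully_removed = [r["population"] for r in rows if r["fail_ratio"] == 1.0]
print(fully_removed)
```

A ratio of 1.0 marks a population that the filter removes entirely, which is the pattern the scatter plot is meant to expose.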

In [None]:
# Generate a scatter plot of the fail ratio for the n_snp_residual filter across all populations, colored by gnomAD failure
plot_n_snp_resid = hl.ggplot.ggplot(n_snp_resid, hl.ggplot.aes(x=n_snp_resid.population, y=n_snp_resid.fail_ratio, \
                                                color=n_snp_resid.fail_gnomAD)) + \
    hl.ggplot.geom_point() +\
    hl.ggplot.ylab("Ratio of failed samples/total samples") + \
    hl.ggplot.ggtitle("Failure of gnomAD n_snp_residual filter by population")+\
    hl.ggplot.scale_x_discrete(breaks=list(range(78)))

plot_n_snp_resid.show()

# 5. Pre-QC Plots
When conducting quality control, it is often a good idea to create plots of your data and summary statistics. For example, we look at the number of SNPs and coverage before and after QC, so that after removing samples or variants we get a visual representation of changes in the dataset. This can flag issues for further investigation.

The following plots show the dataset prior to running any sample QC filters.
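
Alongside the plots, a before/after comparison of simple summary statistics can flag the same kinds of changes. A minimal sketch in plain Python, without Hail, using made-up per-sample SNP counts and a hypothetical cutoff that is not a gnomAD threshold:

```python
from statistics import median

def summarize(values):
    """Return a few summary statistics for a list of per-sample values."""
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
    }

# Hypothetical per-sample SNP counts before QC
n_snp_pre = [5_200_000, 5_400_000, 6_900_000, 7_100_000, 3_000_000]

# Hypothetical QC step: drop samples below an illustrative cutoff
n_snp_post = [v for v in n_snp_pre if v >= 5_000_000]

pre_summary = summarize(n_snp_pre)
post_summary = summarize(n_snp_post)
print(pre_summary)
print(post_summary)
```

A large shift in the minimum or median between the two summaries is a prompt to look at which samples were removed.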

[Back to Index](#Index)

In [None]:
# Dict that maps region name to plotting color for both pre and post QC plots
newnames = {'AMR':"#E41A1C",'AFR':"#984EA3", 'OCE':"#999999", 'CSA':"#FF7F00", 
            'EAS':"#4DAF4A", 'EUR':"#377EB8", 'MID':"#A65628" }

In [None]:
# Read in the pre-QC version of the dataset from the default path set in section 1
pre_qc = hl.read_matrix_table(pre_qc_path)
# As of Hail v. 0.2.82, ggplot only takes in tables as input
# Making a table of samples for plotting
pre_qc_col = pre_qc.cols()
pre_qc_row = pre_qc.rows()

#### 5a. Number of SNPs

[Back to Index](#Index)

In [None]:
# Plotting histogram of number of SNPs for each individual within each global region
n_snp_pre_qc = hl.ggplot.ggplot(pre_qc_col, hl.ggplot.aes(x = pre_qc_col.sample_qc.n_snp)) + \
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill = pre_qc_col.hgdp_tgp_meta.Genetic.region), min_val = 5000000, 
                             max_val = 7500000, bins = 200, position="identity", alpha = .7) + \
    hl.ggplot.xlab("Number of SNPs")+ \
    hl.ggplot.ggtitle("Number of SNPs, Pre-QC")+ \
    hl.ggplot.coord_cartesian(ylim = (0,260))


# Update colors
n_snp_pre_qc = n_snp_pre_qc.to_plotly()

n_snp_pre_qc.for_each_trace(
    lambda trace: trace.update(marker=dict(color = newnames[trace.name]))
)

# Show plot
n_snp_pre_qc.show()
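
The recoloring step looks up `newnames[trace.name]`, and a Python dict lookup raises a `KeyError` for any name without an entry. A minimal sketch of a pre-check in plain Python, assuming a hypothetical list of trace names; the `'XYZ'` label and the black fallback color are illustrative and not part of the dataset.

```python
# Region-to-color mapping, copied from the notebook
newnames = {'AMR': "#E41A1C", 'AFR': "#984EA3", 'OCE': "#999999", 'CSA': "#FF7F00",
            'EAS': "#4DAF4A", 'EUR': "#377EB8", 'MID': "#A65628"}

# Hypothetical trace names; 'XYZ' stands in for an unexpected region label
trace_names = ["AFR", "EUR", "EAS", "XYZ"]

# Names that would raise KeyError in newnames[trace.name]
missing = [name for name in trace_names if name not in newnames]

# Fall back to black (an arbitrary choice) instead of failing
colors = {name: newnames.get(name, "#000000") for name in trace_names}

print(missing)
print(colors)
```

Checking for missing names first makes it clear whether a failure comes from the data or from the color dict.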

#### 5b. Mean Coverage

[Back to Index](#Index)

In [None]:
# Create a density plot of mean coverage per individual
cov_pre_qc = hl.ggplot.ggplot(pre_qc_col, hl.ggplot.aes(x = pre_qc_col.bam_metrics.mean_coverage)) + \
    hl.ggplot.geom_density(hl.ggplot.aes(fill=pre_qc_col.project_meta.title),
                             alpha = .7) + \
    hl.ggplot.xlab("Coverage (x)")+ \
    hl.ggplot.ggtitle("Mean coverage, Pre-QC")


# Show plot
cov_pre_qc.show()

#### 5c. Freemix

[Back to Index](#Index)

In [None]:
# Plotting freemix colored by genetic region
freemix_pre_qc_pop = hl.ggplot.ggplot(pre_qc_col, hl.ggplot.aes(x = pre_qc_col.bam_metrics.freemix)) +\
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=pre_qc_col.hgdp_tgp_meta.Genetic.region), bins = 140) + \
    hl.ggplot.scale_y_log10("Count (log scale)") +\
    hl.ggplot.xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix, Pre-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

# Update colors
freemix_pre_qc_pop = freemix_pre_qc_pop.to_plotly()
freemix_pre_qc_pop.for_each_trace(lambda trace: trace.update(marker=dict(color = newnames[trace.name])))

# Show plot
freemix_pre_qc_pop.show()

In [None]:
# Plotting freemix colored by project
freemix_pre_qc_proj = hl.ggplot.ggplot(pre_qc_col, hl.ggplot.aes(x = pre_qc_col.bam_metrics.freemix)) +\
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=pre_qc_col.project_meta.title), position="identity", bins = 140,\
                            alpha = .5) + \
    hl.ggplot.scale_y_log10("Count (log scale)") +\
    hl.ggplot.xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix, Pre-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

# Show plot
freemix_pre_qc_proj.show()

# 6. Post-QC Plots

The following plots are the same as those made above with pre-QC data except now the dataset has gone through:
- sample filtering
- variant filtering
- duplicate removal
- PCA outlier removal

[Back to Index](#Index)

In [None]:
# Reading in the post-QC Matrix Table from the default path set in section 1
post_qc = hl.read_matrix_table(post_qc_path)

# Making a table of samples for plotting
post_qc_col = post_qc.cols()
post_qc_row = post_qc.rows()

#### 6a. Number of SNPs

[Back to Index](#Index)

In [None]:
# Plotting histogram of number of SNPs for each individual, with fill by genetic region
n_snp_post_qc = hl.ggplot.ggplot(post_qc_col, hl.ggplot.aes(x = post_qc_col.sample_qc.n_snp)) + \
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=post_qc_col.hgdp_tgp_meta.Genetic.region), min_val = 5000000, 
                             max_val = 7500000, bins = 200, position="identity", alpha = .7) + \
    hl.ggplot.xlab("Number of SNPs")+ \
    hl.ggplot.ggtitle("Number of SNPs, Post-QC") + \
    hl.ggplot.coord_cartesian(ylim = (0,260)) 


# Update colors
n_snp_post_qc = n_snp_post_qc.to_plotly()

n_snp_post_qc.for_each_trace(
    lambda trace: trace.update(marker=dict(color = newnames[trace.name]))
)

# Show plot
n_snp_post_qc.show()

#### 6b. Mean Coverage

[Back to Index](#Index)

In [None]:
# Create a density plot of mean coverage per individual from bam_metrics
# Separate by project (HGDP or 1kGP)
cov_post_qc = hl.ggplot.ggplot(post_qc_col, hl.ggplot.aes(x = post_qc_col.bam_metrics.mean_coverage)) + \
    hl.ggplot.geom_density(hl.ggplot.aes(fill=post_qc_col.project_meta.title),
                             alpha = .7) + \
    hl.ggplot.xlab("Coverage (x)")+ \
    hl.ggplot.ggtitle("Mean coverage, Post-QC")

cov_post_qc.show()

#### 6c. Freemix

[Back to Index](#Index)

In [None]:
# Plotting freemix colored by genetic region
freemix_post_qc_pop = hl.ggplot.ggplot(post_qc_col, hl.ggplot.aes(x = post_qc_col.bam_metrics.freemix)) +\
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=post_qc_col.hgdp_tgp_meta.Genetic.region), bins = 70,\
                            alpha = 1) + \
    hl.ggplot.scale_y_log10("Count (log scale)") +\
    hl.ggplot.xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix, Post-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

# Update colors
freemix_post_qc_pop = freemix_post_qc_pop.to_plotly()
freemix_post_qc_pop.for_each_trace(lambda trace: trace.update(marker=dict(color = newnames[trace.name])))

# Show plot
freemix_post_qc_pop.show()

In [None]:
# Plotting freemix colored by project
freemix_post_qc_proj = hl.ggplot.ggplot(post_qc_col, hl.ggplot.aes(x = post_qc_col.bam_metrics.freemix)) +\
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=post_qc_col.project_meta.title), position="identity", bins = 70,\
                            alpha = .5) + \
    hl.ggplot.scale_y_log10("Count (log scale)") +\
    hl.ggplot.xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix, Post-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

# Show plot
freemix_post_qc_proj.show()

#### 6d. Site Frequency Spectrum

[Back to Index](#Index)

Below is the code used to write out the file for site frequency spectrum plotting. Reading in this pre-aggregated file, rather than recomputing the aggregation, cuts down on runtime.

```python3
# Aggregating site frequency data for plotting
sfs_data = post_qc_row.aggregate(hl.agg.hist(post_qc_row.freq.AF[1], 0, 1, 250))
with hl.hadoop_open('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/sfs_post_qc.txt', 'w') as f:
    f.write(str(dict(sfs_data)))
```

In [None]:
import ast

# Load in the pre-aggregated site frequency data
# literal_eval parses the stored dict without executing arbitrary code
with hl.hadoop_open('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/sfs_post_qc.txt') as f:
    sfs_dict = ast.literal_eval(f.read())
sfs_struct = hl.Struct(**sfs_dict)

# Plot site frequency spectrum histogram
sfs_p = hl.plot.histogram(sfs_struct, log = True, legend = "Frequency of major allele at site")
show(sfs_p)
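
To make the stored file less opaque, here is a rough plain-Python sketch of an equal-width histogram over allele frequencies, followed by the write-as-text and read-back round trip. It does not use Hail; the function `hist` is a hypothetical stand-in, and its output field names (`bin_edges`, `bin_freq`, `n_smaller`, `n_larger`) and the choice to count the right endpoint in the last bin are assumptions about what `hl.agg.hist` produces.

```python
import ast

def hist(values, start, end, bins):
    """Equal-width histogram over [start, end]; the last bin includes `end`."""
    width = (end - start) / bins
    bin_edges = [start + i * width for i in range(bins + 1)]
    bin_freq = [0] * bins
    n_smaller = 0
    n_larger = 0
    for v in values:
        if v < start:
            n_smaller += 1
        elif v > end:
            n_larger += 1
        else:
            idx = min(int((v - start) / width), bins - 1)
            bin_freq[idx] += 1
    return {"bin_edges": bin_edges, "bin_freq": bin_freq,
            "n_smaller": n_smaller, "n_larger": n_larger}

# Made-up allele frequencies, binned into 4 bins for readability
allele_freqs = [0.001, 0.002, 0.25, 0.5, 0.5, 0.99, 1.0]
sfs = hist(allele_freqs, 0, 1, 4)

# Round trip: store as text, then parse back into a dict
text = str(sfs)
restored = ast.literal_eval(text)

print(sfs["bin_freq"])
print(restored == sfs)
```

The notebook uses 250 bins over the same range; fewer bins are used here only so the counts can be checked by eye.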