# Metadata and QC

Authors: Zan Koenig, Mary T. Yohannes, & Ally Kim

**To run this tutorial, you need to have started your cluster with `--packages gnomad`.**

*If you have not done this, you will need to shut down your current cluster and start a new one with the `--packages gnomad` argument.*

See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

## Index
1. [Set Default Paths](#1.-Set-Default-Paths)
2. [Read in Datasets and Apply Quality Control Filters](#2.-Read-in-Datasets-and-Apply-Quality-Control-Filters)
3. [Plots](#3.-Plots)
    1. [Pre-QC Plots](#3a.-Pre-QC-Plots)
        1. [Number of SNVs](#3a-1.-Number-of-SNVs)
        2. [Mean Coverage](#3a-2.-Mean-Coverage)
        3. [Freemix](#3a-3.-Freemix)
        4. [Heterozygosity](#3a-4.-Heterozygosity)
            1. [Expected Heterozygosity](#3a-4a.-Expected-Heterozygosity)
            2. [Actual Heterozygosity](#3a-4b.-Actual-Heterozygosity)
    2. [Post-QC Plots](#3b.-Post-QC-Plots)
        1. [Number of SNVs](#3b-1.-Number-of-SNVs)
        2. [Mean Coverage](#3b-2.-Mean-Coverage)
        3. [Freemix](#3b-3.-Freemix)
        4. [Heterozygosity](#3b-4.-Heterozygosity)
            1. [Expected Heterozygosity](#3b-4a.-Expected-Heterozygosity)
            2. [Actual Heterozygosity](#3b-4b.-Actual-Heterozygosity)
            3. [Difference Between Expected and Actual Heterozygosity](#3b-4c.-Difference-Between-Expected-and-Actual-Heterozygosity)
        5. [Site Frequency Spectrum](#3b-5.-Site-Frequency-Spectrum)
4. [Investigating gnomAD Sample Filters](#4.-Investigating-gnomAD-Sample-Filters)
    1. [Plotting Results of gnomAD Sample Filter Investigation](#4a.-Plotting-Results-of-gnomAD-Sample-Filter-Investigation)

# General Overview

The purpose of this script is to merge the metadata components needed for the HGDP+1kGP dataset and then apply QC filters to the resulting dataset. The metadata includes sample and variant information (e.g. genetic region and sample/variant QC status) that is initially located in different datasets. The QC filters use sample and variant flags from the metadata datasets. These flags were generated by running the dataset through the gnomAD QC pipeline. More information on the gnomAD QC pipeline can be found [here](https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#sample-and-variant-quality-control). To see how the gnomAD sample QC filters were updated as a result of our analyses, see [Investigating gnomAD Sample Filters](#4.-Investigating-gnomAD-Sample-Filters) and the resulting gnomAD [minor release](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/#improvements-to-the-hgdp--1kg-subset-release).

**This script contains information on how to**:
- Use plots to identify which gnomAD sample QC filters are removing populations entirely (`fail_n_snp_residual` is used as an example here)
- Retrieve populations being unduly removed by filters (mostly <code>AFR</code> and <code>OCE</code> populations)
- Filter matrix tables using a field within the matrix table (shortened as mt)
- Filter samples using a hardcoded list of samples to remove
- Plot certain fields from the matrix table:
    - Number of SNVs
    - Coverage
    - Site frequency spectrum
    - Freemix
    - Number of samples which failed a sample filter

In [None]:
import hail as hl

# For renaming purposes
import re

# Function from gnomAD to apply genotype filters 
from gnomad.utils.filtering import filter_to_adj

# For plotting in Hail
from hail.ggplot import *
import plotly

from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

# 1. Set Default Paths
These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets. 

By default all of the write sections are shown as markdown cells. If you would like to write out your own datasets, you can copy the code and paste it into a new code cell. 

[Back to Index](#Index)

In [None]:
# Path for HGDP+1kGP dataset prior to applying gnomAD QC filters
pre_qc_path = 'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt'

# Path for HGDP+1kGP dataset after applying gnomAD QC filters
post_qc_path = 'gs://hgdp-1kg/tutorial_datasets/metadata_and_qc/post_qc.mt'

# Path for gnomAD's HGDP+1kGP metadata for plotting 
metadata_path = 'gs://hgdp-1kg/tutorial_datasets/metadata_and_qc/gnomad_meta_v1.tsv'

## Paths for plotting 
# Pre-QC
pre_qc_cols_path = 'gs://hgdp-1kg/tutorial_datasets/plot_datasets/pre_qc_plotting.ht' 
exp_het_pre_qc_path = 'gs://hgdp-1kg/tutorial_datasets/plot_datasets/expected_hets_pre_qc.ht' # expected heterozygosity
act_het_pre_qc_path = 'gs://hgdp-1kg/tutorial_datasets/plot_datasets/actual_hets_pre_qc.ht' # actual heterozygosity

# Post-QC
post_qc_cols_path = 'gs://hgdp-1kg/tutorial_datasets/plot_datasets/post_qc_plotting.ht' 
exp_het_post_qc_path = 'gs://hgdp-1kg/tutorial_datasets/plot_datasets/expected_hets_post_qc.ht' # expected heterozygosity
act_het_post_qc_path = 'gs://hgdp-1kg/tutorial_datasets/plot_datasets/actual_hets_post_qc.ht' # actual heterozygosity
sfs_post_qc_path = 'gs://hgdp-1kg/tutorial_datasets/plot_datasets/sfs_post_qc.txt' # site frequency spectrum

# 2. Read in Datasets and Apply Quality Control Filters

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_matrix_table"> More on  <i> read_matrix_table() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_table"> More on  <i> read_table() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on  <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.describe"> More on  <i> describe() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# Read in the HGDP+1kGP pre-QC mt
pre_qc_mt = hl.read_matrix_table(pre_qc_path)

# Get mt schema
pre_qc_mt.describe()

In [None]:
# Validity check: number of variants and samples prior to applying QC filters
print('Num of SNVs and samples prior to any analysis = ' + str(pre_qc_mt.count())) 

**If running the cell below results in an error, double check that you used the  `--packages gnomad` argument when starting your cluster.**  
- See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

In [None]:
### Apply sample QC filters to dataset ### 
# This filters to only samples that passed gnomAD's sample QC hard filters  
post_qc_mt = pre_qc_mt.filter_cols(~pre_qc_mt.gnomad_sample_filters.hard_filtered) # removed 31 samples

### Apply variant QC filters to dataset ###
# This subsets to only PASS variants - those which passed gnomAD's variant QC
# PASS variants have an empty filters field, so rows with any entry in filters are removed
post_qc_mt = post_qc_mt.filter_rows(hl.len(post_qc_mt.filters) != 0, keep=False)

### Apply genotype QC filters to the dataset ###
# This is done using a function imported from gnomAD and is the last step in the QC process
post_qc_mt = filter_to_adj(post_qc_mt)

In [None]:
## Write out the dataset after applying gnomAD's sample, variant and genotype QC filters
## This is done to speed up downstream steps and took ~1hr & 9min to run 
#post_qc_mt.write(post_qc_path, overwrite=False)

In [None]:
# Read in the post-QC dataset that's been written out
post_qc_mt = hl.read_matrix_table(post_qc_path)

In [None]:
# Validity check: number of variants and samples after applying QC filters
print('Num of SNVs and samples after applying QC filters = ' + str(post_qc_mt.count())) 

After applying QC filters, the number of SNVs decreased from <code>189,381,961</code> to <code>159,795,273</code> and the number of samples decreased from <code>4151</code> to <code>4120</code>.
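
As a quick check on these numbers, the arithmetic below works out how many variants and samples the filters removed. The counts are copied from the sentence above; the variable names are illustrative and do not appear elsewhere in this notebook.

```python
# Counts reported above as (variants, samples), before and after QC
pre_qc_counts = (189_381_961, 4151)
post_qc_counts = (159_795_273, 4120)

# Differences between the pre-QC and post-QC counts
removed_variants = pre_qc_counts[0] - post_qc_counts[0]
removed_samples = pre_qc_counts[1] - post_qc_counts[1]

# Share of the original variants that did not survive QC
frac_variants_removed = removed_variants / pre_qc_counts[0]

print(f"Variants removed: {removed_variants:,} ({frac_variants_removed:.1%})")
print(f"Samples removed: {removed_samples}")
```

The sample difference should agree with the "removed 31 samples" comment in the sample QC cell.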

# 3. Plots

When conducting quality control, it is often a good idea to create plots of your data and summary statistics. For example, we look at the number of SNVs and coverage before and after QC so that after removing samples or variants we get a visual representation of the changes in the dataset. This can potentially flag issues for further investigation. 

**Make sure to run the next cell before attempting to run any of the plotting code chunks so that colors are mapped to region names and plots are generated without an error.**

[Back to Index](#Index)

In [None]:
# Dictionary mapping colors to region names 
cont_colors = {'AMR':"#E41A1C",
               'AFR':"#984EA3", 
               'OCE':"#999999",
               'CSA':"#FF7F00",
               'EAS':"#4DAF4A", 
               'EUR':"#377EB8", 
               'MID':"#A65628" }

In [None]:
# Run Hail's sample QC on both pre and post QC datasets 
pre_qc_mt = hl.sample_qc(pre_qc_mt)
post_qc_mt = hl.sample_qc(post_qc_mt)

# Read in gnomAD's HGDP+1kGP metadata for plotting 
metadata = hl.import_table(metadata_path, impute = True, key = 's')

# Add plot annotations for both datasets  
pre_qc_mt = pre_qc_mt.annotate_cols(subpop_color = metadata[pre_qc_mt.s]['hgdp_tgp_meta.Pop.colors'],
                    subpop_shapes = metadata[pre_qc_mt.s]['hgdp_tgp_meta.Pop.shapes'],
                    global_color = metadata[pre_qc_mt.s]['hgdp_tgp_meta.Continent.colors'])

post_qc_mt = post_qc_mt.annotate_cols(subpop_color = metadata[post_qc_mt.s]['hgdp_tgp_meta.Pop.colors'],
                    subpop_shapes = metadata[post_qc_mt.s]['hgdp_tgp_meta.Pop.shapes'],
                    global_color = metadata[post_qc_mt.s]['hgdp_tgp_meta.Continent.colors'])

<code>CHMI_CHMI3_WGS2</code> is a sample added by gnomAD for QC purposes, and thus doesn't have metadata information. To avoid a <code>None</code> error, we remove it prior to generating plots. It is removed from the dataset itself, together with PCA outliers, in [Notebook 2: PCA and Ancestry Analyses](https://github.com/atgu/hgdp_tgp/blob/master/tutorials/nb2.ipynb). 


In [None]:
# Remove CHMI_CHMI3_WGS2 from both pre and post QC datasets 
pre_qc_mt = pre_qc_mt.filter_cols(pre_qc_mt.s == 'CHMI_CHMI3_WGS2', keep = False)
post_qc_mt = post_qc_mt.filter_cols(post_qc_mt.s == 'CHMI_CHMI3_WGS2', keep = False)

# Subset to column fields only 
pre_qc_cols = pre_qc_mt.cols()
post_qc_cols = post_qc_mt.cols()

In [None]:
## Write out column fields to make plotting faster 
## Took 55min to run 
#pre_qc_cols.write(pre_qc_cols_path, overwrite = False)
#post_qc_cols.write(post_qc_cols_path, overwrite = False)

## 3a. Pre-QC Plots

The following plots show the dataset prior to running any QC filters.

[Back to Index](#Index)

In [None]:
# Read in the column fields of the pre-QC mt for plotting 
pre_qc_cols = hl.read_table(pre_qc_cols_path)

### 3a-1. Number of SNVs

Histogram of number of SNVs for each individual within each genetic region

[Back to Index](#Index)

In [None]:
# Make plot
p = ggplot(pre_qc_cols, aes(x = pre_qc_cols.sample_qc.n_snp, fill = pre_qc_cols.hgdp_tgp_meta.genetic_region)) + \
    geom_histogram(min_val = 4000000, max_val = 7000000, 
                    bins = 200, position="identity", alpha = .6) + \
    xlab("Number of SNVs")+ \
    ggtitle("Number of SNVs, Pre-QC")+ \
    coord_cartesian(ylim = (0,200)) +\
    scale_fill_manual(values=cont_colors) # use the colors specified above

# Show plot
p.show()

### 3a-2. Mean Coverage 

Density plot of mean coverage per individual

[Back to Index](#Index)

In [None]:
# Make plot
p = ggplot(pre_qc_cols, aes(x = pre_qc_cols.bam_metrics.mean_coverage)) + \
    geom_density(aes(fill = pre_qc_cols.hgdp_tgp_meta.project), alpha = .7) + \
    xlab("Coverage (x)")+ \
    ggtitle("Mean coverage, Pre-QC")


# Show plot
p.show()

### 3a-3. Freemix 

[Back to Index](#Index)

In [None]:
# Plot freemix colored by genetic region  
p = ggplot(pre_qc_cols, hl.ggplot.aes(x = pre_qc_cols.bam_metrics.freemix)) +\
    geom_histogram(aes(fill=pre_qc_cols.hgdp_tgp_meta.genetic_region), bins = 140) + \
    scale_y_log10("Count (log scale)") +\
    xlab("Freemix") + \
    ggtitle("Bam metrics: Freemix by genetic region, Pre-QC")+ \
    coord_cartesian(xlim = (0,.5))+\
    scale_fill_manual(values = cont_colors)

# Show plot
p.show()

In [None]:
# Plot freemix colored by project/study
p = ggplot(pre_qc_cols, aes(x = pre_qc_cols.bam_metrics.freemix)) +\
    geom_histogram(aes(fill=pre_qc_cols.hgdp_tgp_meta.project), position="identity", bins = 70,\
                            alpha = .5) + \
    scale_y_log10("Count (log scale)") +\
    xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix by project, Pre-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

# Show plot
p.show()

### 3a-4. Heterozygosity

[Back to Index](#Index)

### 3a-4a. Expected Heterozygosity

Within each population, we first compute the alternate-allele frequency <code>p</code> of each variant. We then compute the expected heterozygosity as the sum of <code>2pq</code> (with <code>q = 1 - p</code>) across all variants within that population.
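
The calculation can be illustrated on a tiny example in plain Python. This is only a sketch under stated assumptions: the genotypes, sample IDs and population labels are made up, all sites are treated as biallelic, and variants with no called genotypes in a population are skipped, similar in spirit to the NaN filter in the Hail cell below.

```python
from collections import defaultdict

# Toy data (hypothetical): alt-allele counts (0, 1 or 2) at four biallelic variants,
# with None marking a missing genotype call
genotypes = {
    "s1": [0, 1, 2, None],
    "s2": [1, 1, 0, None],
    "s3": [0, 0, 2, 1],
}
population = {"s1": "popA", "s2": "popA", "s3": "popB"}

# Group sample IDs by population label
samples_by_pop = defaultdict(list)
for sample, pop in population.items():
    samples_by_pop[pop].append(sample)

def expected_het(samples, n_variants=4):
    """Sum 2*p*(1-p) over variants, skipping variants with no called genotypes."""
    total = 0.0
    for v in range(n_variants):
        calls = [genotypes[s][v] for s in samples if genotypes[s][v] is not None]
        if not calls:
            continue  # allele frequency is undefined here, so the variant is ignored
        p = sum(calls) / (2 * len(calls))  # alt-allele frequency in this population
        total += 2 * p * (1 - p)
    return total

exp_het = {pop: expected_het(samples) for pop, samples in samples_by_pop.items()}
print(exp_het)
```

Because the result is a sum over variants, it scales with the number of variants in the dataset, which is why the plot's y-axis is labelled as an expected number of heterozygous sites rather than a rate.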

[Back to Index](#Index)

In [None]:
# Specify the equation for AF
af_equation = hl.agg.mean(pre_qc_mt.GT.n_alt_alleles()/2)

# Apply the equation specified above to aggregate AF over populations
pop_labels = pre_qc_mt.hgdp_tgp_meta.population # grab the column with labels
exp_het_pre_qc = pre_qc_mt.group_cols_by(pop_labels).aggregate(pop_af = af_equation)

# Set the equation for expected heterozygosity (2*p*(1-p)) for each population
var = 2*(exp_het_pre_qc.pop_af)*(1-exp_het_pre_qc.pop_af)

# Annotate column named "pop_var" with the equation specified above
exp_het_pre_qc = exp_het_pre_qc.annotate_cols(pop_var = hl.agg.filter(~hl.is_nan(exp_het_pre_qc.pop_af), # ignore NAs
                                                                  hl.agg.sum(var)))

# Grab only the column fields of the mt 
exp_het_pre_qc_cols = exp_het_pre_qc.cols()

# Identify genetic regions 
grouped = pre_qc_cols.group_by(pre_qc_cols.hgdp_tgp_meta.population).aggregate(
    region = hl.agg.collect_as_set(pre_qc_cols.hgdp_tgp_meta.genetic_region))
grouped = grouped.key_by(grouped.population)

# Annotate colors
exp_het_pre_qc_cols = exp_het_pre_qc_cols.annotate(region = grouped[exp_het_pre_qc_cols.population])

# Change set of colors to string
exp_het_pre_qc_cols = exp_het_pre_qc_cols.annotate(region = hl.str(exp_het_pre_qc_cols.region.region)[2:5])

# Sort by values
exp_het_pre_qc_cols = exp_het_pre_qc_cols.order_by(hl.desc(exp_het_pre_qc_cols.pop_var))

In [None]:
## Write out file to make plotting faster 
## Took 12min to run 
#exp_het_pre_qc_cols.write(exp_het_pre_qc_path, overwrite = False)

In [None]:
# Read table back in for plotting 
exp_het_pre_qc_cols = hl.read_table(exp_het_pre_qc_path)
exp_het_pre_qc_cols = exp_het_pre_qc_cols.filter(hl.is_missing(exp_het_pre_qc_cols.region), keep = False) # drop populations with a missing region

# Make plot 
p = ggplot(exp_het_pre_qc_cols, aes(x=exp_het_pre_qc_cols.population, y=exp_het_pre_qc_cols.pop_var)) + \
    geom_point(aes(color=exp_het_pre_qc_cols.region)) +\
    ylab("Expected number of heterozygous sites") +\
    ggtitle("Expected Heterozygosity, Pre-QC") +\
    scale_x_discrete(breaks=list(range(exp_het_pre_qc_cols.count()))) +\
    scale_color_manual(values=cont_colors)+\
    labs(color = 'Region') # points are colored by genetic region

# Show plot
p.show()

### 3a-4b. Actual Heterozygosity

To compute the number of heterozygous sites, we calculate the average <code>n_hets</code> for every locus within each subpopulation through an aggregator as we group by population. Then, we average the number of heterozygous sites across all loci within each population to result in one final number of average <code>n_hets</code>.
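
As a rough plain-Python illustration of the quantity being plotted, the sketch below counts heterozygous calls per sample and averages those counts within each population. The toy genotypes and labels are hypothetical, and a heterozygous call is assumed to be one with exactly one alternate allele at a biallelic site.

```python
from collections import defaultdict
from statistics import mean

# Toy data (hypothetical): alt-allele counts (0, 1 or 2) per variant,
# with None marking a missing genotype call
genotypes = {
    "s1": [0, 1, 2, None],
    "s2": [1, 1, 0, None],
    "s3": [0, 0, 2, 1],
}
population = {"s1": "popA", "s2": "popA", "s3": "popB"}

# Number of heterozygous calls per sample (missing calls are not counted)
n_het = {s: sum(1 for g in calls if g == 1) for s, calls in genotypes.items()}

# Collect the per-sample counts under each population label
hets_by_pop = defaultdict(list)
for sample, pop in population.items():
    hets_by_pop[pop].append(n_het[sample])

# Average number of heterozygous sites per sample within each population
mean_hets = {pop: mean(counts) for pop, counts in hets_by_pop.items()}
print(mean_hets)
```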

[Back to Index](#Index)

In [None]:
# Define n_het
n_het = pre_qc_mt.sample_qc.n_het

# Compute the mean number of heterozygous sites for each locus and subpopulation
act_het_pre_qc = pre_qc_mt.group_cols_by(pre_qc_mt.hgdp_tgp_meta.population).aggregate(mean_hets = hl.agg.mean(n_het))

# Average mean hets values across all loci for each population
act_het_pre_qc = act_het_pre_qc.annotate_cols(mean_hets_final = hl.agg.filter(~hl.is_nan(act_het_pre_qc.mean_hets), hl.agg.mean(act_het_pre_qc.mean_hets)))

# Subset to column fields only
act_het_pre_qc_cols = act_het_pre_qc.cols()

# Annotate colors
act_het_pre_qc_cols = act_het_pre_qc_cols.annotate(region = grouped[act_het_pre_qc_cols.population]) # "grouped" is from the previous plot (expected heterozygosity)

# Change set of colors to string
act_het_pre_qc_cols = act_het_pre_qc_cols.annotate(region = hl.str(act_het_pre_qc_cols.region.region)[2:5])

# Sort by values
act_het_pre_qc_cols = act_het_pre_qc_cols.order_by(hl.desc(act_het_pre_qc_cols.mean_hets_final))

In [None]:
## Write out file to make plotting faster 
## Took 22min to run
#act_het_pre_qc_cols.write(act_het_pre_qc_path, overwrite = False)

In [None]:
# Read table back in for plotting 
act_het_pre_qc_cols = hl.read_table(act_het_pre_qc_path)
act_het_pre_qc_cols = act_het_pre_qc_cols.filter(hl.is_missing(act_het_pre_qc_cols.region), keep = False) # drop populations with a missing region

# Make plot
p = ggplot(act_het_pre_qc_cols, hl.ggplot.aes(x=act_het_pre_qc_cols.population, y=act_het_pre_qc_cols.mean_hets_final, \
                                                           color=act_het_pre_qc_cols.region)) + \
    geom_point() +\
    ylab("Number of heterozygous sites") +\
    ggtitle("Actual Heterozygosity, Pre-QC") +\
    scale_x_discrete(breaks=list(range(act_het_pre_qc_cols.count()))) +\
    scale_color_manual(values=cont_colors)+\
    labs(color = 'Region') # points are colored by genetic region

# Show plot
p.show()


## 3b. Post-QC Plots

The following plots show the dataset after applying sample, variant and genotype QC filters.

[Back to Index](#Index)

In [None]:
# Read in the column fields of the post-QC mt for plotting 
post_qc_cols = hl.read_table(post_qc_cols_path)

### 3b-1. Number of SNVs

Histogram of number of SNVs for each individual within each genetic region

[Back to Index](#Index)

In [None]:
# Make plot
p = ggplot(post_qc_cols, aes(x = post_qc_cols.sample_qc.n_snp, fill = post_qc_cols.hgdp_tgp_meta.genetic_region)) + \
    geom_histogram(min_val = 4000000, max_val = 7000000, 
                    bins = 200, position="identity", alpha = .6) + \
    xlab("Number of SNVs")+ \
    ggtitle("Number of SNVs, Post-QC")+ \
    coord_cartesian(ylim = (0,200)) +\
    scale_fill_manual(values=cont_colors) # use the colors specified above

# Show plot
p.show()

### 3b-2. Mean Coverage 

Density plot of mean coverage per individual

[Back to Index](#Index)

In [None]:
# Make plot
p = ggplot(post_qc_cols, aes(x = post_qc_cols.bam_metrics.mean_coverage)) + \
    geom_density(aes(fill = post_qc_cols.hgdp_tgp_meta.project), alpha = .7) + \
    xlab("Coverage (x)")+ \
    ggtitle("Mean coverage, Post-QC")


# Show plot
p.show()

### 3b-3. Freemix 

[Back to Index](#Index)

In [None]:
# Plot freemix colored by genetic region  
p = ggplot(post_qc_cols, hl.ggplot.aes(x = post_qc_cols.bam_metrics.freemix)) +\
    geom_histogram(aes(fill=post_qc_cols.hgdp_tgp_meta.genetic_region), bins = 70) + \
    scale_y_log10("Count (log scale)") +\
    xlab("Freemix") + \
    ggtitle("Bam metrics: Freemix by genetic region, Post-QC")+ \
    coord_cartesian(xlim = (0,.5))+\
    scale_fill_manual(values = cont_colors)

# Show plot
p.show()

In [None]:
# Plot freemix colored by project/study
p = ggplot(post_qc_cols, aes(x = post_qc_cols.bam_metrics.freemix)) +\
    geom_histogram(aes(fill=post_qc_cols.hgdp_tgp_meta.project), position="identity", bins = 70,\
                            alpha = .5) + \
    scale_y_log10("Count (log scale)") +\
    xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix by project, Post-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

# Show plot
p.show()

### 3b-4. Heterozygosity

[Back to Index](#Index)

### 3b-4a. Expected Heterozygosity

[Back to Index](#Index)

In [None]:
# Specify the equation for AF
af_equation = hl.agg.mean(post_qc_mt.GT.n_alt_alleles()/2)

# Apply the equation specified above to aggregate AF over populations
pop_labels = post_qc_mt.hgdp_tgp_meta.population # grab the column with labels
exp_het_post_qc = post_qc_mt.group_cols_by(pop_labels).aggregate(pop_af = af_equation)

# Set the equation for expected heterozygosity (2*p*(1-p)) for each population
var = 2*(exp_het_post_qc.pop_af)*(1-exp_het_post_qc.pop_af)

# Annotate column named "pop_var" with the equation specified above
exp_het_post_qc = exp_het_post_qc.annotate_cols(pop_var = hl.agg.filter(~hl.is_nan(exp_het_post_qc.pop_af), # ignore NAs
                                                                  hl.agg.sum(var)))

# Grab only the column fields of the mt 
exp_het_post_qc_cols = exp_het_post_qc.cols()

# Identify genetic regions 
grouped = post_qc_cols.group_by(post_qc_cols.hgdp_tgp_meta.population).aggregate(
    region = hl.agg.collect_as_set(post_qc_cols.hgdp_tgp_meta.genetic_region))
grouped = grouped.key_by(grouped.population)

# Annotate colors
exp_het_post_qc_cols = exp_het_post_qc_cols.annotate(region = grouped[exp_het_post_qc_cols.population])

# Change set of colors to string
exp_het_post_qc_cols = exp_het_post_qc_cols.annotate(region = hl.str(exp_het_post_qc_cols.region.region)[2:5])

# Sort by values
exp_het_post_qc_cols = exp_het_post_qc_cols.order_by(hl.desc(exp_het_post_qc_cols.pop_var))

In [None]:
## Write out file to make plotting faster 
## Took 8min to run 
#exp_het_post_qc_cols.write(exp_het_post_qc_path, overwrite = False)

In [None]:
# Read table back in for plotting 
exp_het_post_qc_cols = hl.read_table(exp_het_post_qc_path)
exp_het_post_qc_cols = exp_het_post_qc_cols.filter(hl.is_missing(exp_het_post_qc_cols.region), keep = False) # drop populations with a missing region

# Make plot 
p = ggplot(exp_het_post_qc_cols, aes(x=exp_het_post_qc_cols.population, y=exp_het_post_qc_cols.pop_var)) + \
    geom_point(aes(color=exp_het_post_qc_cols.region)) +\
    ylab("Expected number of heterozygous sites") +\
    ggtitle("Expected Heterozygosity, Post-QC") +\
    scale_x_discrete(breaks=list(range(exp_het_post_qc_cols.count()))) +\
    scale_color_manual(values=cont_colors)+\
    labs(color = 'Region') # points are colored by genetic region

# Show plot
p.show()

### 3b-4b. Actual Heterozygosity

[Back to Index](#Index)

In [None]:
# Define n_het
n_het = post_qc_mt.sample_qc.n_het

# Compute the mean number of heterozygous sites for each locus and subpopulation
act_het_post_qc = post_qc_mt.group_cols_by(post_qc_mt.hgdp_tgp_meta.population).aggregate(mean_hets = hl.agg.mean(n_het))

# Average mean hets values across all loci for each population
act_het_post_qc = act_het_post_qc.annotate_cols(mean_hets_final = hl.agg.filter(~hl.is_nan(act_het_post_qc.mean_hets), hl.agg.mean(act_het_post_qc.mean_hets)))

# Subset to column fields only
act_het_post_qc_cols = act_het_post_qc.cols()

# Annotate colors
act_het_post_qc_cols = act_het_post_qc_cols.annotate(region = grouped[act_het_post_qc_cols.population]) # "grouped" is from the previous plot (expected heterozygosity)

# Change set of colors to string
act_het_post_qc_cols = act_het_post_qc_cols.annotate(region = hl.str(act_het_post_qc_cols.region.region)[2:5])

# Sort by values
act_het_post_qc_cols = act_het_post_qc_cols.order_by(hl.desc(act_het_post_qc_cols.mean_hets_final))

In [None]:
## Write out file to make plotting faster 
## Took 8min to run
#act_het_post_qc_cols.write(act_het_post_qc_path, overwrite = False)

In [None]:
# Read table back in for plotting 
act_het_post_qc_cols = hl.read_table(act_het_post_qc_path)
act_het_post_qc_cols = act_het_post_qc_cols.filter(hl.is_missing(act_het_post_qc_cols.region), keep = False) # drop populations with a missing region

# Make plot
p = ggplot(act_het_post_qc_cols, hl.ggplot.aes(x=act_het_post_qc_cols.population, y=act_het_post_qc_cols.mean_hets_final, \
                                                           color=act_het_post_qc_cols.region)) + \
    geom_point() +\
    ylab("Number of heterozygous sites") +\
    ggtitle("Actual Heterozygosity, Post-QC") +\
    scale_x_discrete(breaks=list(range(act_het_post_qc_cols.count()))) +\
    scale_color_manual(values=cont_colors)+\
    labs(color = 'Region') # points are colored by genetic region

# Show plot
p.show()

### 3b-4c. Difference Between Expected and Actual Heterozygosity

Comparing actual and expected heterozygosity for each population is a way to check for stratification or technical artifacts.
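
The comparison itself is a per-population subtraction. A minimal plain-Python sketch, using hypothetical values and population labels rather than results from this dataset:

```python
# Hypothetical per-population values, for illustration only
expected = {"popA": 1.375, "popB": 0.5, "popC": 2.0}
observed = {"popA": 1.5, "popB": 1.0}

# Observed minus expected, computed only for populations present in both tables
obs_min_ex = {
    pop: observed[pop] - expected[pop]
    for pop in observed
    if pop in expected
}
print(obs_min_ex)
```

Matching on the population label matters here: several populations share one genetic region, so a lookup by region would not pair each population with its own expected value.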

[Back to Index](#Index)

In [None]:
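# Key both tables by population so that each population is matched to its own expected value
# (several populations share a genetic region, so region is not a unique key)
exp_het_post_qc_cols = exp_het_post_qc_cols.key_by(exp_het_post_qc_cols.population)
act_het_post_qc_cols = act_het_post_qc_cols.key_by(act_het_post_qc_cols.population)

diff = act_het_post_qc_cols.annotate(expected = exp_het_post_qc_cols[act_het_post_qc_cols.population].pop_var)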

diff = diff.annotate(obs_min_ex = diff.mean_hets_final - diff.expected)

In [None]:
# Make plot
p = ggplot(diff, aes(x=diff.population, 
                     y=diff.obs_min_ex,
                     color=diff.region)) + \
    geom_point() +\
    ylab("Number of Heterozygous sites") +\
    ggtitle("Difference of # of Heterozygous Sites Between Actual & Expected, Post-QC") +\
    scale_x_discrete(breaks=list(range(diff.count()))) +\
    scale_color_manual(values=cont_colors)+\
    labs(color = 'Region') # points are colored by genetic region

# Show plot
p.show()

### 3b-5. Site Frequency Spectrum

[Back to Index](#Index)

In [None]:
# This code chunk takes ~9min to run

# Perform Hail's variant QC 
post_qc_rows = hl.variant_qc(post_qc_mt).rows() 

# Aggregate site frequency data for plotting
sfs_post_qc = post_qc_rows.aggregate(hl.agg.hist(post_qc_rows.variant_qc.AF[1], 0,1,250))

In [None]:
## Write out the site frequency spectrum struct into a text file to make plotting faster 
# with hl.hadoop_open(sfs_post_qc_path, 'w') as f:
#     f.write(str(dict(sfs_post_qc)))

In [None]:
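import ast

# Read in the site frequency spectrum dictionary that was written out as text
with hl.hadoop_open(sfs_post_qc_path) as f:
    sfs_dict = ast.literal_eval(f.read()) # parses the dict literal without executing arbitrary code
sfs_struct = hl.Struct(**sfs_dict)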

# Plot site frequency spectrum histogram using hl.plot
# Similar to the other plots, this can also be plotted using ggplot. However, it does take more time to run
sfs_p = hl.plot.histogram(sfs_struct, log = True, legend = "Frequency of Major Allele at Site")
show(sfs_p)

# 4. Investigating gnomAD Sample Filters
   
The sample QC step in section 2 already accounts for this (<code>post_qc_mt = pre_qc_mt.filter_cols(~pre_qc_mt.gnomad_sample_filters.hard_filtered)</code>). Here we show how we retrieved samples from populations that were wholly removed by the gnomAD sample QC filters, which we did before the dataset had a field separating correctly removed samples from wrongly removed ones (<code>pre_qc_mt.gnomad_sample_filters.hard_filtered</code>). As a validity check, you can run <code>pre_qc_mt.aggregate_cols(hl.agg.counter(pre_qc_mt.gnomad_sample_filters.hard_filtered))</code>; the count for <code>True</code> will be 31, equal to the number of samples that were correctly removed. 

9 out of the 28 gnomAD sample filters were dropping large numbers of ancestrally diverse individuals, mostly from African (`AFR`) and Oceanian (`OCE`) populations. The affected populations were: 
- BiakaPygmy
- MbutiPygmy
- Melanesian
- Papuan
- San

The filters rely on gnomAD’s ancestry principal component analysis (PCA), which captures genetic variance across the larger gnomAD callset, and smaller, under-represented groups such as those in the HGDP+1kGP callset can appear erroneously as outliers. Here we explore which original gnomAD sample QC filters remove entire populations.
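
The retrieval logic used in the cell below comes down to a set difference: a sample is kept if, after ignoring the filters known to remove whole populations, it has no failed filters left. A plain-Python sketch of that rule follows; apart from `fail_n_snp_residual`, the filter names and all sample IDs are hypothetical.

```python
# Filters treated as unreliable because they remove whole populations
# ("hypothetical_filter_b" is a made-up name used only for illustration)
population_removing_filters = {"fail_n_snp_residual", "hypothetical_filter_b"}

# Filters each (made-up) sample failed
failed_filters = {
    "sample_1": set(),                                           # passed everything
    "sample_2": {"fail_n_snp_residual"},                         # failed only an unreliable filter
    "sample_3": {"fail_n_snp_residual", "hypothetical_filter_c"} # also failed a reliable filter
}

# Keep a sample when nothing remains after removing the unreliable filters
kept = [
    sample
    for sample, failed in failed_filters.items()
    if len(failed - population_removing_filters) == 0
]
print(kept)
```

Under this rule `sample_2` is retrieved, while `sample_3` stays excluded because it also failed a filter outside the unreliable set.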


<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols"> More on  <i> filter_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.SetExpression.html#hail.expr.SetExpression.difference"> More on  <i> difference() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.CollectionExpression.html#hail.expr.CollectionExpression.length"> More on  <i> length() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# Put the gnomAD qc filters in a set - gnomAD has them in a column field
bad_sample_filters = set(pre_qc_mt.gnomad_sample_qc_metric_outlier_cutoffs.qc_metrics_stats) 

# Keep samples that passed all gnomAD QC filters OR only failed the filters that were removing populations wholly
mt_filt = pre_qc_mt.filter_cols(pre_qc_mt['gnomad_sample_filters']['qc_metrics_filters'].difference(bad_sample_filters).length() == 0)

# How many samples were removed by the gnomAD QC filters correctly? 
print('Num of samples at the beginning = ' + str(pre_qc_mt.count()[1])) 
print('Num of samples after retrieving wrongly removed ones = ' + str(mt_filt.count()[1])) 
print('Samples removed correctly by gnomAD filters = ' + str(pre_qc_mt.count()[1] - mt_filt.count()[1])) 

## 4a. Plotting Results of gnomAD Sample Filter Investigation

Here we only show <code>fail_n_snp_residual</code> as an example but the code can be implemented on any of the other gnomAD sample filters. 
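
The quantity plotted is, for each population, the number of samples failing the filter divided by the total number of samples. A small plain-Python sketch of that ratio is shown here, with made-up rows; the flag is compared as the string `'true'` because the metadata table in this section is imported without type imputation.

```python
from collections import Counter

# Hypothetical metadata rows: (population, filter flag stored as text)
rows = [
    ("San", "true"),
    ("San", "true"),
    ("French", "false"),
    ("French", "true"),
    ("French", "false"),
    ("French", "false"),
]

# Total samples and failing samples per population
n_samples = Counter(pop for pop, _ in rows)
n_failed = Counter(pop for pop, flag in rows if flag == "true")

# Ratio of failed samples to total samples in each population
fail_ratio = {pop: n_failed[pop] / n_samples[pop] for pop in n_samples}
print(fail_ratio)
```

A ratio of 1.0 means the filter removes every sample in that population, which is the pattern this section looks for.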

[Back to Index](#Index)

In [None]:
# Read in gnomAD's HGDP+1kGP metadata without imputing field types from the file
metadata = hl.import_table(metadata_path)

In [None]:
# Add gnomAD's sample filters into a list 
sample_filters = [name for name in list(metadata.row) if 'sample_filters.' in name][:-1]

# Within each population, count the total number of samples and the number of samples that failed each filter   
filters_tbl = (metadata.group_by(metadata['hgdp_tgp_meta.Population'])
               .aggregate(n = hl.agg.count(),
                          **{col: hl.agg.count_where(metadata[col] == 'true') for col in sample_filters}))   

# Add a column to indicate the 5 populations that were filtered out by gnomAD's sample filters 
filtered_samples = hl.set(["BiakaPygmy", "MbutiPygmy", "Melanesian", "Papuan", "San"])
filters_tbl = filters_tbl.annotate(failed_gnomAD = hl.if_else(filtered_samples.contains(filters_tbl['hgdp_tgp_meta.Population']), 'TRUE', 'FALSE'))

# Grab only "sample_filters.fail_n_snp_residual" column 
filters_tbl = filters_tbl.key_by() # unkey table first so the population column isn't duplicated 
n_snp_resid = filters_tbl.select(population = filters_tbl['hgdp_tgp_meta.Population'],
                       num_samples = filters_tbl['n'],
                       fail_n_snp_resid = filters_tbl['sample_filters.fail_n_snp_residual'],
                       fail_gnomAD = filters_tbl['failed_gnomAD'])

# Calculate the ratio between the number of samples that failed and the total number of samples in the population. 
n_snp_resid = n_snp_resid.annotate(fail_ratio = n_snp_resid.fail_n_snp_resid/n_snp_resid.num_samples)

# Generate a scatter plot of ratios across all populations colored by gnomAD failure 
p = ggplot(n_snp_resid, aes(x=n_snp_resid.population, 
                            y=n_snp_resid.fail_ratio, 
                            color=n_snp_resid.fail_gnomAD)) +\
    geom_point() +\
    ylab("Ratio of failed samples/total samples") + \
    ggtitle("Failure of gnomAD fail_n_snp_residual Filter by Population") +\
    labs(color = 'Failed gnomAD filters') +\
    scale_x_discrete(breaks=list(range(n_snp_resid.count())))

# Show Plot
p.show()

[Back to Index](#Index)