# Metadata and QC

Authors: Zan Koenig,  Mary T. Yohannes, & Ally Kim

**To run this tutorial, you need to have started your cluster with `--packages gnomad`.**

*If you have not done this, you will need to shut down your current cluster and start a new one with the `--packages gnomad` argument.* 

See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

## Index
1. [Set Default Paths](#1.-Set-Default-Paths)
2. [Read in Pre-QC Dataset and Apply Quality Control Filters](#2.-Read-in-Pre-QC-Dataset-and-Apply-Quality-Control-Filters)
3. [Data and Function Set Up for Plots](#3.-Data-and-Function-Set-Up-for-Plots)
4. [Plots](#4.-Plots)
    1. [Number of SNVs](#4.a.-Number-of-SNVs)
    2. [Mean Coverage](#4.b.-Mean-Coverage)
    3. [Freemix](#4.c.-Freemix)
    4. [Heterozygosity](#4.d.-Heterozygosity)
        1. [Expected Heterozygosity](#4.d.1.-Expected-Heterozygosity)
        2. [Actual Heterozygosity](#4.d.2.-Actual-Heterozygosity)
        3. [Difference Between Expected and Actual Heterozygosity (Post-QC only)](#4.d.3.-Difference-Between-Expected-and-Actual-Heterozygosity-(Post-QC-only))
    5. [Site Frequency Spectrum](#4.e.-Site-Frequency-Spectrum)
5. [Investigating gnomAD Sample Filters](#5.-Investigating-gnomAD-Sample-Filters)
    1. [Plotting Results of gnomAD Sample Filter Investigation](#5.a.-Plotting-Results-of-gnomAD-Sample-Filter-Investigation)

# General Overview

The purpose of this script is to merge the metadata components needed for the HGDP+1kGP dataset and apply QC filters to the resulting dataset. The metadata includes sample and variant information (e.g. genetic region and sample/variant QC status) that is initially located in different datasets. The QC filters are run using sample and variant flags from the metadata datasets. These flags were generated by running the dataset through the gnomAD QC pipeline. More information on the gnomAD QC pipeline can be found [here](https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#sample-and-variant-quality-control). To see how the gnomAD sample QC filters were updated as a result of our analyses, see [Investigating gnomAD Sample Filters](#5.-Investigating-gnomAD-Sample-Filters) and the resulting gnomAD [minor release](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/#improvements-to-the-hgdp--1kg-subset-release).

**This script contains information on how to**:
- Read in and write out a matrix table (shortened as mt) 
- Filter a matrix table using a field within the matrix table and a function imported from an external library
- Use plots to identify which gnomAD sample QC filters are removing populations entirely (`fail_n_snp_residual` is used as an example here)
- Retrieve populations being unduly removed by filters (mostly <code>AFR</code> and <code>OCE</code> populations)
- Filter a matrix table using a list of samples to remove
- Plot certain fields from the matrix table:
    - Number of SNVs
    - Coverage
    - Site frequency spectrum
    - Freemix
    - Number of samples which failed a sample filter

In [None]:
import hail as hl

# For renaming purposes
import re

# Function from gnomAD library to apply genotype filters 
from gnomad.utils.filtering import filter_to_adj

# For plotting in Hail
from hail.ggplot import *
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

In [None]:
# Initializing Hail 
hl.init()

In [1]:
# Allow output scrolling in Jupyter nb viewer for cells with long outputs 

from IPython.display import HTML
with open('format.css') as f:
    css = f.read()
HTML('<style>{}</style>'.format(css))

# 1. Set Default Paths

These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets. 

**By default all of the dataset write out sections are shown as markdown cells. If you would like to write out your own dataset, you can copy the code and paste it into a new code cell. Don't forget to change the paths in the following cell accordingly and edit the ```overwrite``` argument if you are writing out a dataset more than once.** 

[Back to Index](#Index)

In [3]:
# Path for HGDP+1kGP dataset prior to applying gnomAD QC filters
pre_qc_path = 'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt'

# Path for gnomAD's HGDP+1kGP metadata with updated population labels
metadata_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/metadata_and_qc/gnomad_meta_updated.tsv'

## Paths for plotting 
# Pre-QC
pre_qc_cols_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/plot_datasets/pre_qc_plotting.ht' 
exp_het_pre_qc_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/plot_datasets/expected_hets_pre_qc.ht'# expected heterozygosity
act_het_pre_qc_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/plot_datasets/actual_hets_pre_qc.ht' # actual heterozygosity

# Post-QC
post_qc_cols_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/plot_datasets/post_qc_plotting.ht' 
exp_het_post_qc_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/plot_datasets/expected_hets_post_qc.ht' # expected heterozygosity
act_het_post_qc_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/plot_datasets/actual_hets_post_qc.ht' # actual heterozygosity
sfs_post_qc_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/plot_datasets/sfs_post_qc.txt' # site frequency spectrum

# 2. Read in Pre-QC Dataset and Apply Quality Control Filters

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_matrix_table"> More on  <i> read_matrix_table() </i></a></li>
        
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.describe"> More on  <i> describe() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols"> More on  <i> filter_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_rows"> More on  <i> filter_rows() </i></a></li>
</ul>
</details>
    
[Back to Index](#Index)

In [4]:
# Read in the HGDP+1kGP pre-QC mt
pre_qc_mt = hl.read_matrix_table(pre_qc_path)

# Preview the first rows of the mt (use pre_qc_mt.describe() to print the schema)
pre_qc_mt.show()


locus,alleles
locus<GRCh38>,array<str>
chr1:10055,"[""T"",""C""]"
chr1:10061,"[""T"",""C""]"
chr1:10109,"[""A"",""T""]"
chr1:10109,"[""AACCCT"",""A""]"
chr1:10114,"[""T"",""C""]"
chr1:10114,"[""TA"",""T""]"
chr1:10116,"[""A"",""G""]"
chr1:10119,"[""CT"",""C""]"
chr1:10120,"[""T"",""A""]"
chr1:10122,"[""A"",""G""]"


In [5]:
# Validity check: number of variants and samples prior to applying QC filters
print('Num of SNVs and samples prior to any analysis = ' + str(pre_qc_mt.count())) 

Num of SNVs and samples prior to any analysis = (189381961, 4151)


The following function applies the quality control filters to the pre-QC dataset. Since the post-QC mt will not be written out, the same function is applied in the other notebooks where the post-QC dataset is used. 

**To avoid errors, make sure to run the next two cells before running any code that includes the post-QC dataset.**

**If running the cell below results in an error, double check that you used the  `--packages gnomad` argument when starting your cluster.**  

- See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.
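To make the order of operations in the QC function below easier to follow, here is a minimal pure-Python sketch on toy data. It only mimics the logic (sample filter, PASS variant filter, then dropping variants that are no longer observed in any remaining sample). The names `toy_samples`, `toy_variants`, and `toy_run_qc` are hypothetical, and the filter name inside the toy data is illustrative only; the real filtering is done by Hail on the matrix table.

```python
# Toy illustration only: sample names, variants and values are made up.
toy_samples = {
    "S1": {"hard_filtered": False},
    "S2": {"hard_filtered": True},   # fails the sample hard filters
    "S3": {"hard_filtered": False},
}

# Each variant carries a set of failed filters (empty set == PASS)
# and the number of alternate alleles each sample carries.
toy_variants = {
    "chr1:100": {"filters": set(), "n_alt": {"S1": 1, "S2": 0, "S3": 0}},
    "chr1:200": {"filters": {"AS_VQSR"}, "n_alt": {"S1": 1, "S2": 1, "S3": 2}},
    "chr1:300": {"filters": set(), "n_alt": {"S1": 0, "S2": 2, "S3": 0}},
}


def toy_run_qc(samples, variants, extra_removed=frozenset()):
    # 1. Keep samples that are not hard filtered and not explicitly removed
    kept_samples = {
        s for s, info in samples.items()
        if not info["hard_filtered"] and s not in extra_removed
    }
    # 2. Keep PASS variants, i.e. those whose filters set is empty
    passing = {v: d for v, d in variants.items() if len(d["filters"]) == 0}
    # 3. Keep variants that are non-reference in at least one remaining sample
    kept_variants = {
        v for v, d in passing.items()
        if any(d["n_alt"][s] > 0 for s in kept_samples)
    }
    return kept_samples, kept_variants


kept_samples, kept_variants = toy_run_qc(toy_samples, toy_variants)
print(sorted(kept_samples), sorted(kept_variants))
```

In this toy example `chr1:300` passes variant QC but is dropped anyway, because its only carrier (`S2`) was removed by the sample filter.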

[Back to Index](#Index)

In [6]:
# Set up function to:
# apply gnomAD's sample, variant and genotype QC filters
# remove two contaminated samples identified using CHARR - https://pubmed.ncbi.nlm.nih.gov/37425834/
# remove the gnomAD sample that's added for QC purposes
# only keep the variants which are found in the samples that are left 
# add gnomAD's HGDP+1kGP metadata with the updated population labels as a column field 

def run_qc(mt):
    
    ## Apply sample QC filters to dataset 
    # This filters to only samples that passed gnomAD's sample QC hard filters  
    mt = mt.filter_cols(~mt.gnomad_sample_filters.hard_filtered) # removed 31 samples
    
    ## Apply variant QC filters to dataset
    # This subsets to only PASS variants - those which passed gnomAD's variant QC
    # PASS variants have an empty filters field, so rows with a non-empty filters field are removed
    mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)
    
    # Remove the two contaminated samples identified by CHARR
    contaminated_samples = {'HGDP01371', 'LP6005441-DNA_A09'}
    contaminated_samples_list = hl.literal(contaminated_samples)
    mt = mt.filter_cols(~contaminated_samples_list.contains(mt['s']))
    
    # CHMI_CHMI3_WGS2 is a sample added by gnomAD for QC purposes and has no metadata info 
    mt = mt.filter_cols(mt.s == 'CHMI_CHMI3_WGS2', keep = False)

    # Only keep the variants which are found in the samples that are left 
    mt = mt.filter_rows(hl.agg.any(mt.GT.is_non_ref()))
    
    # Read in and add the metadata with the updated population labels as a column field 
    metadata = hl.import_table(metadata_path, impute = True, key = 's') 
    mt = mt.annotate_cols(meta_updated = metadata[mt.s])
    
    ## Apply genotype QC filters to the dataset
    # This is done using a function imported from gnomAD and is the last step in the QC process
    mt = filter_to_adj(mt)

    return mt

In [None]:
# Run QC 
post_qc_mt = run_qc(pre_qc_mt)

In [8]:
# Validity check: number of variants and samples after applying QC filters
# Took ~1hr to print
print('Num of SNVs and samples after applying QC filters = ' + str(post_qc_mt.count()))



Num of SNVs and samples after applying QC filters = (159339147, 4117)




After applying QC filters, the number of variants decreased from <code>189,381,961</code> to <code>159,339,147</code> and the number of samples decreased from <code>4151</code> to <code>4117</code>.
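As a quick sanity check, the sample count can be reconciled with the steps in `run_qc`: 31 samples removed by the hard filters (per the comment in the function), 2 contaminated samples, and the 1 gnomAD QC sample. The short sketch below just redoes that arithmetic with the counts printed above; the variable names are hypothetical.

```python
# Counts printed by the two validity checks above
n_variants_pre, n_samples_pre = 189_381_961, 4151
n_variants_post, n_samples_post = 159_339_147, 4117

# Samples removed by each step of run_qc
removed_by_step = {
    "gnomAD hard filters": 31,
    "contaminated samples (CHARR)": 2,
    "CHMI_CHMI3_WGS2": 1,
}

# The per-step removals should add up to the observed drop in samples
expected_samples_post = n_samples_pre - sum(removed_by_step.values())
assert expected_samples_post == n_samples_post

# Number and fraction of variants removed by QC
n_variants_removed = n_variants_pre - n_variants_post
frac_removed = n_variants_removed / n_variants_pre
print(f"{n_variants_removed:,} variants removed ({frac_removed:.1%} of the pre-QC total)")
```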

[Back to Index](#Index)

# 3. Data and Function Set Up for Plots  

When conducting quality control, it is often a good idea to create plots of your data and summary statistics. For example, we look at the number of SNVs and coverage before and after QC so that after removing samples or variants, we get a visual representation of the changes in the dataset. This can potentially flag issues for further investigation. 

**Make sure to run the next cell before attempting to run any of the plotting code chunks so that colors are mapped to region names and plots are generated without an error.**

[Back to Index](#Index)

In [9]:
post_qc_mt.meta_updated.describe()

--------------------------------------------------------
Type:
        struct {
        `project_meta.sample_id`: str, 
        `project_meta.research_project_key`: str, 
        `project_meta.seq_project`: str, 
        `project_meta.ccdg_alternate_sample_id`: str, 
        `project_meta.ccdg_gender`: str, 
        `project_meta.ccdg_center`: str, 
        `project_meta.ccdg_study`: str, 
        `project_meta.cram_path`: str, 
        `project_meta.project_id`: str, 
        `project_meta.v2_age`: str, 
        `project_meta.v2_sex`: str, 
        `project_meta.v2_hard_filters`: str, 
        `project_meta.v2_perm_filters`: str, 
        `project_meta.v2_pop_platform_filters`: str, 
        `project_meta.v2_related`: str, 
        `project_meta.v2_data_type`: str, 
        `project_meta.v2_product`: str, 
        `project_meta.v2_product_simplified`: str, 
        `project_meta.v2_qc_platform`: str, 
        `project_meta.v2_project_id`: str, 
        `project_meta.v2_project_descrip

In [4]:
# Dictionary mapping genetic region names to plot colors
cont_colors = {'AMR':"#E41A1C",
               'AFR':"#984EA3", 
               'OCE':"#999999",
               'CSA':"#FF7F00",
               'EAS':"#4DAF4A", 
               'EUR':"#377EB8", 
               'MID':"#A65628" }

In [None]:
# Run Hail's sample QC on both pre and post-QC datasets 
pre_qc_mt = hl.sample_qc(pre_qc_mt)
post_qc_mt = hl.sample_qc(post_qc_mt)

# Read in gnomAD's HGDP+1kGP metadata for plotting 
metadata = hl.import_table(metadata_path, impute = True, key = 's')

# Add plot annotations for both datasets  
pre_qc_mt = pre_qc_mt.annotate_cols(subpop_color = metadata[pre_qc_mt.s]['hgdp_tgp_meta.Pop.colors'],
                    subpop_shapes = metadata[pre_qc_mt.s]['hgdp_tgp_meta.Pop.shapes'],
                    global_color = metadata[pre_qc_mt.s]['hgdp_tgp_meta.Continent.colors'])

post_qc_mt = post_qc_mt.annotate_cols(subpop_color = metadata[post_qc_mt.s]['hgdp_tgp_meta.Pop.colors'],
                    subpop_shapes = metadata[post_qc_mt.s]['hgdp_tgp_meta.Pop.shapes'],
                    global_color = metadata[post_qc_mt.s]['hgdp_tgp_meta.Continent.colors'])

<code>CHMI_CHMI3_WGS2</code> is a sample added by gnomAD for QC purposes, and thus doesn't have metadata information. To avoid a <code>None</code> error, we remove it from the pre-QC dataset prior to generating plots. The post-QC dataset no longer contains it because <code>run_qc()</code> removes it. It is also removed, together with PCA outliers, in [Notebook 2: PCA and Ancestry Analyses](https://github.com/atgu/hgdp_tgp/blob/master/tutorials/nb2.ipynb). 

[Back to Index](#Index)

In [None]:
# Remove CHMI_CHMI3_WGS2 from pre-QC dataset 
# Already removed from post-QC mt during QC 
pre_qc_mt = pre_qc_mt.filter_cols(pre_qc_mt.s == 'CHMI_CHMI3_WGS2', keep = False)

# Subset to column fields only 
pre_qc_cols = pre_qc_mt.cols()
post_qc_cols = post_qc_mt.cols()

- Write out column fields to make plotting faster (took ~55min to run) 

```python3
    pre_qc_cols.write(pre_qc_cols_path, overwrite = False) # pre-QC
    post_qc_cols.write(post_qc_cols_path, overwrite = False) # post-QC
``` 

[Back to Index](#Index)

### *Number of SNVs*

Function to plot a histogram of number of SNVs for each individual within each genetic region

[Back to Index](#Index)

In [5]:
def plot_snvs(ht):
    p = ggplot(ht, aes(x = ht.sample_qc.n_snp, fill = ht.hgdp_tgp_meta.genetic_region)) + \
        geom_histogram(min_val = 4000000, max_val = 7000000, bins = 200, position="identity", alpha = .6) + \
        xlab("Number of SNVs")+ \
        coord_cartesian(ylim = (0,200)) +\
        scale_fill_manual(values=cont_colors) # use the colors specified above
    
    return p

### *Mean Coverage* 

Function to plot a density plot of mean coverage per individual

[Back to Index](#Index)

In [6]:
def plot_mean_cov(ht):
    p = ggplot(ht, aes(x = ht.bam_metrics.mean_coverage)) + \
        geom_density(aes(fill = ht.hgdp_tgp_meta.project), alpha = .7, smoothed=True) + \
        xlab("Coverage (x)")
    
    return p

### *Freemix*

[Back to Index](#Index)

Function to plot freemix colored by genetic region  

In [7]:
def plot_freemix_reg(ht, n_bins, xmax):
    p = ggplot(ht, aes(x = ht.bam_metrics.freemix)) +\
        geom_histogram(aes(fill=ht.hgdp_tgp_meta.genetic_region), bins = n_bins) + \
        scale_y_log10("Count (log scale)") +\
        xlab("Freemix") + \
        coord_cartesian(xlim = (0,xmax))+\
        scale_fill_manual(values = cont_colors)
    
    return p

Function to plot freemix colored by project/study

In [8]:
def plot_freemix_proj(ht, xmax):
    p = ggplot(ht, aes(x = ht.bam_metrics.freemix)) +\
        geom_histogram(aes(fill=ht.hgdp_tgp_meta.project), position="identity", bins = 70, alpha = .5) + \
        scale_y_log10("Count (log scale)") +\
        xlab("Freemix") + \
        coord_cartesian(xlim = (0,xmax))
    
    return p

### *Heterozygosity*

[Back to Index](#Index)

#### Expected Heterozygosity

Within each subpopulation, we first compute allele frequencies of each variant separately. Using this information, we then compute the expected heterozygosity by using the sum of variance (sum of <code>2pq</code>) across all alleles within each population. This is done for both pre and post-QC using the following function. 
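As a worked illustration of the calculation described above, the following pure-Python sketch computes the per-population sum of `2p(1-p)` on toy genotypes. The data and the helper names (`toy_genotypes`, `allele_frequency`, `expected_het`) are hypothetical; sites where no genotype is called are skipped, which is meant to mirror the NaN filter in the Hail function below.

```python
import math

# Toy data: for each population, a list of variants; each variant is a list of
# per-sample alternate-allele counts (0, 1, 2) or None for a missing call.
toy_genotypes = {
    "PopA": [[0, 1, 2, 1], [0, 0, 0, 0], [None, None, None, None]],
    "PopB": [[1, 1, 1, 1], [2, 2, 0, 0], [0, 1, None, 0]],
}


def allele_frequency(n_alt_per_sample):
    """Mean of n_alt / 2 over called genotypes; NaN if nothing is called."""
    called = [n for n in n_alt_per_sample if n is not None]
    if not called:
        return math.nan
    return sum(n / 2 for n in called) / len(called)


def expected_het(variants):
    """Sum of 2 * p * (1 - p) over all variants with a defined frequency."""
    total = 0.0
    for calls in variants:
        p = allele_frequency(calls)
        if not math.isnan(p):
            total += 2 * p * (1 - p)
    return total


exp_het = {pop: expected_het(variants) for pop, variants in toy_genotypes.items()}
for pop, value in exp_het.items():
    print(pop, round(value, 4))
```

For `PopA` only the first variant contributes (`p = 0.5`, so `2pq = 0.5`); the monomorphic site adds 0 and the all-missing site is ignored.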

[Back to Index](#Index)

In [17]:
# Set up function to compute expected heterozygosity
def run_exp_het(mt, ht):
    # Specify the equation for AF
    af_equation = hl.agg.mean(mt.GT.n_alt_alleles()/2)

    # Apply the equation specified above to aggregate AF over populations
    # pop_labels = mt.meta_updated.population # grab the column with labels
    exp_het = mt.group_cols_by(mt.meta_updated.population).aggregate(pop_af = af_equation)

    # Set the equation for expected heterozygosity (2*p*(1-p)) for each population
    var = 2*(exp_het.pop_af)*(1-exp_het.pop_af)

    # Annotate column named "pop_var" with the equation specified above
    exp_het = exp_het.annotate_cols(pop_var = hl.agg.filter(~hl.is_nan(exp_het.pop_af), # ignore NAs
                                                            hl.agg.sum(var)))

    # Grab only the column fields of the mt 
    exp_het_cols = exp_het.cols()

    # Identify genetic regions 
    grouped = ht.group_by(ht.meta_updated.population).aggregate(
        region = hl.agg.collect_as_set(ht.hgdp_tgp_meta.genetic_region))
    grouped = grouped.key_by(grouped.population)

    # Annotate each population with its genetic region
    exp_het_cols = exp_het_cols.annotate(region = grouped[exp_het_cols.population])

    # Convert the single-element region set to its three-letter string (e.g. AFR)
    exp_het_cols = exp_het_cols.annotate(region = hl.str(exp_het_cols.region.region)[2:5])

    # Sort by values
    exp_het_cols = exp_het_cols.order_by(hl.desc(exp_het_cols.pop_var))
    
    return exp_het_cols

In [18]:
post_qc_mt.show()

locus,alleles
locus<GRCh38>,array<str>
chr1:10114,"[""TA"",""T""]"
chr1:10119,"[""CT"",""C""]"
chr1:10126,"[""T"",""C""]"
chr1:10126,"[""T"",""G""]"
chr1:10132,"[""T"",""C""]"
chr1:10134,"[""ACCCTAACCCTAAC"",""A""]"
chr1:10137,"[""CTAACCCTAACCCCT"",""C""]"
chr1:10138,"[""T"",""TA""]"
chr1:10138,"[""TA"",""T""]"
chr1:10138,"[""TAACCC"",""T""]"


In [None]:
# Compute expected heterozygosity 

# pre-QC
exp_het_pre_qc_cols = run_exp_het(pre_qc_mt, pre_qc_cols)

# post-QC
exp_het_post_qc_cols = run_exp_het(post_qc_mt, post_qc_cols)

- Write out file to make plotting faster 

```python3
exp_het_pre_qc_cols.write(exp_het_pre_qc_path, overwrite = False) # pre-QC - took 12min to run
exp_het_post_qc_cols.write(exp_het_post_qc_path, overwrite = False) # post-QC - took 8min to run 

```

[Back to Index](#Index)

Function to plot expected heterozygosity

In [9]:
def plot_exp_het(ht):
    ht = ht.filter(hl.is_missing(ht.region), keep = False) # drop populations without a region label
    p = ggplot(ht, aes(x=ht.population, y=ht.pop_var)) + \
        geom_point(aes(color=ht.region)) +\
        ylab("Expected number of heterozygous sites") +\
        scale_x_discrete(breaks=list(range(ht.count()))) +\
        scale_color_manual(values=cont_colors)+\
        labs(color = 'Population')
    return p

#### Actual Heterozygosity

To compute the number of heterozygous sites, we calculate the average <code>n_hets</code> for every locus within each subpopulation through an aggregator as we group by population. Then, we average the number of heterozygous sites across all loci within each population to result in one final number of average <code>n_hets</code>. This is done for both pre and post-QC using the following function.
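The end result of this calculation can be sketched in plain Python: count heterozygous calls per sample, then average those counts within each population. This is only a toy approximation of what the Hail function below produces; the sample names, genotypes, and helper `count_hets` are hypothetical.

```python
from statistics import mean

# Toy data: sample -> (population, genotype calls at four sites)
toy_calls = {
    "S1": ("PopA", ["0/1", "0/0", "1/1", "0/1"]),
    "S2": ("PopA", ["0/0", "0/1", "0/1", "0/1"]),
    "S3": ("PopB", ["0/1", "0/0", "0/0", "0/0"]),
}


def count_hets(genotypes):
    """Number of calls whose two alleles differ."""
    n_het = 0
    for gt in genotypes:
        allele_1, allele_2 = gt.split("/")
        if allele_1 != allele_2:
            n_het += 1
    return n_het


# Per-sample heterozygous counts (analogous to sample_qc.n_het)
n_het = {sample: count_hets(gts) for sample, (_, gts) in toy_calls.items()}

# Group the per-sample counts by population and average them
by_pop = {}
for sample, (pop, _) in toy_calls.items():
    by_pop.setdefault(pop, []).append(n_het[sample])
mean_hets = {pop: mean(counts) for pop, counts in by_pop.items()}

print(mean_hets)
```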

[Back to Index](#Index)

In [21]:
# Set up function to compute actual heterozygosity
def run_act_het(mt, ht):

    # Run Hail sample QC
    mt = hl.sample_qc(mt)

    # Define n_het
    n_het = mt.sample_qc.n_het

    # Compute the mean number of heterozygous sites for each locus and subpopulation
    act_het = mt.group_cols_by(mt.meta_updated.population).aggregate(mean_hets = hl.agg.mean(n_het))

    # Average mean hets values across all loci for each population
    act_het = act_het.annotate_cols(mean_hets_final = hl.agg.filter(~hl.is_nan(act_het.mean_hets), hl.agg.mean(act_het.mean_hets)))

    # Subset to column fields only
    act_het_cols = act_het.cols()

    # Identify genetic regions
    grouped = ht.group_by(ht.meta_updated.population).aggregate(
        region = hl.agg.collect_as_set(ht.hgdp_tgp_meta.genetic_region))
    grouped = grouped.key_by(grouped.population)

    # Annotate each population with its genetic region
    act_het_cols = act_het_cols.annotate(region = grouped[act_het_cols.population])

    # Convert the single-element region set to its three-letter string (e.g. AFR)
    act_het_cols = act_het_cols.annotate(region = hl.str(act_het_cols.region.region)[2:5])

    # Sort by values
    act_het_cols = act_het_cols.order_by(hl.desc(act_het_cols.mean_hets_final))

    return act_het_cols

In [None]:
# Compute actual heterozygosity 

# pre-QC
act_het_pre_qc_cols = run_act_het(pre_qc_mt, pre_qc_cols)

# post-QC
act_het_post_qc_cols = run_act_het(post_qc_mt, post_qc_cols)

- Write out file to make plotting faster 

```python3
act_het_pre_qc_cols.write(act_het_pre_qc_path, overwrite = False) # pre-QC - took 22min to run
act_het_post_qc_cols.write(act_het_post_qc_path, overwrite = False) # post-QC - took 8min to run
```

[Back to Index](#Index)

Function to plot actual heterozygosity

In [10]:
def plot_act_het(ht):
    ht = ht.filter(hl.is_missing(ht.region), keep = False) # drop populations without a region label
    p = ggplot(ht, aes(x=ht.population, y=ht.mean_hets_final, color=ht.region)) + \
        geom_point() +\
        ylab("Number of heterozygous sites") +\
        scale_x_discrete(breaks=list(range(ht.count()))) +\
        scale_color_manual(values=cont_colors)+\
        labs(color = 'Population')
    return p 

# 4. Plots


The following plots show the dataset before and after running sample, variant and genotype QC filters.

**Make sure to run all code chunks in section 3 above for this section to run without any errors.**

**Reading in already-written out files makes plotting faster.**

[Back to Index](#Index)

In [11]:
# Read in the column fields of the pre and post-QC mt for plotting (makes plotting faster)
pre_qc_cols = hl.read_table(pre_qc_cols_path)
post_qc_cols = hl.read_table(post_qc_cols_path)

## 4.a. Number of SNVs

Histogram of number of SNVs for each individual within each genetic region

[Back to Index](#Index)

In [33]:
# Pre-QC
pre_snvs = plot_snvs(pre_qc_cols)
pre_snvs = pre_snvs + ggtitle("Number of SNVs, Pre-QC")
pre_snvs.show()

# Post-QC
post_snvs = plot_snvs(post_qc_cols)
post_snvs = post_snvs + ggtitle("Number of SNVs, Post-QC")
post_snvs.show()

## 4.b. Mean Coverage 

Density plot of mean coverage per individual

[Back to Index](#Index)

In [34]:
# Pre-QC
pre_mean_cov = plot_mean_cov(pre_qc_cols)
pre_mean_cov = pre_mean_cov + ggtitle("Mean coverage, Pre-QC")
pre_mean_cov.show()

# Post-QC
post_mean_cov = plot_mean_cov(post_qc_cols)
post_mean_cov = post_mean_cov + ggtitle("Mean coverage, Post-QC")
post_mean_cov.show()

## 4.c. Freemix 

[Back to Index](#Index)

Plot freemix colored by genetic region

In [38]:
# Pre-QC with 140 bins and max xlim of 0.5
pre_freemix_reg = plot_freemix_reg(pre_qc_cols, 140, 0.5)
pre_freemix_reg = pre_freemix_reg + ggtitle("Bam metrics: Freemix by genetic region, Pre-QC")
pre_freemix_reg.show()

# Post-QC with 70 bins and max xlim of 0.1 (3 samples with freemix > 0.1 are removed in the post-QC dataset) 
post_freemix_reg = plot_freemix_reg(post_qc_cols, 70, 0.1)
post_freemix_reg = post_freemix_reg + ggtitle("Bam metrics: Freemix by genetic region, Post-QC")
post_freemix_reg.show()

Plot freemix colored by project/study

In [40]:
# Pre-QC with max xlim of 0.5
pre_freemix_proj = plot_freemix_proj(pre_qc_cols, 0.5)
pre_freemix_proj = pre_freemix_proj + ggtitle("Bam metrics: Freemix by project, Pre-QC")
pre_freemix_proj.show()

# Post-QC max xlim of 0.1 (3 samples with freemix > 0.1 are removed in the post-QC dataset)
post_freemix_proj = plot_freemix_proj(post_qc_cols, 0.1)
post_freemix_proj = post_freemix_proj + ggtitle("Bam metrics: Freemix by project, Post-QC")
post_freemix_proj.show()


## 4.d. Heterozygosity 

[Back to Index](#Index)

### 4.d.1. Expected Heterozygosity

[Back to Index](#Index)

In [41]:
# Read in the pre and post-QC files which are specifically generated for expected heterozygosity plots
exp_het_pre_qc_cols = hl.read_table(exp_het_pre_qc_path) # pre-QC
exp_het_post_qc_cols = hl.read_table(exp_het_post_qc_path) # post-QC

In [42]:
# Pre-QC
pre_exp_het = plot_exp_het(exp_het_pre_qc_cols)
pre_exp_het = pre_exp_het + ggtitle("Expected heterozygosity, Pre-QC")
pre_exp_het.show()

# Post-QC
post_exp_het = plot_exp_het(exp_het_post_qc_cols)
post_exp_het = post_exp_het + ggtitle("Expected heterozygosity, Post-QC")
post_exp_het.show()

### 4.d.2. Actual Heterozygosity

[Back to Index](#Index)

In [43]:
# Read in the pre and post-QC files which are specifically generated for actual heterozygosity plots
act_het_pre_qc_cols = hl.read_table(act_het_pre_qc_path) # pre-QC
act_het_post_qc_cols = hl.read_table(act_het_post_qc_path) # post-QC

In [44]:
# Pre-QC
pre_act_het = plot_act_het(act_het_pre_qc_cols)
pre_act_het = pre_act_het + ggtitle("Actual heterozygosity, Pre-QC")
pre_act_het.show()

# Post-QC
post_act_het = plot_act_het(act_het_post_qc_cols)
post_act_het = post_act_het + ggtitle("Actual heterozygosity, Post-QC")
post_act_het.show()


### 4.d.3. Difference Between Expected and Actual Heterozygosity (Post-QC only)

Check for stratification/artifacts 
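The comparison is a per-population subtraction of the expected value from the observed one. A minimal sketch with made-up numbers is shown here; the dictionaries and values are hypothetical, and the lookup is done by population name so that each population is matched to its own expected value rather than to another population from the same region.

```python
# Toy values only: population -> number of heterozygous sites
toy_expected = {"PopA": 100.0, "PopB": 140.0}
toy_actual = {"PopA": 95.0, "PopB": 150.0, "PopC": 80.0}

# Observed minus expected, for populations present in both tables
obs_min_ex = {
    pop: toy_actual[pop] - toy_expected[pop]
    for pop in toy_actual
    if pop in toy_expected
}

for pop, value in sorted(obs_min_ex.items()):
    direction = "deficit" if value < 0 else "excess"
    print(f"{pop}: {value:+.1f} ({direction} of heterozygous sites)")
```

As a rough guide, a population with far fewer observed heterozygous sites than expected may reflect substructure or inbreeding, while a large excess can point to technical artifacts such as contamination; either pattern is a prompt for further checks rather than a conclusion.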

[Back to Index](#Index)

In [45]:
# Key both tables by population so that each population is matched to its own expected value
# (keying by region would not give unique keys, since several populations share a region)
exp_het_post_qc_cols = exp_het_post_qc_cols.key_by(exp_het_post_qc_cols.population)
act_het_post_qc_cols = act_het_post_qc_cols.key_by(act_het_post_qc_cols.population)

diff = act_het_post_qc_cols.annotate(expected = exp_het_post_qc_cols[act_het_post_qc_cols.population].pop_var)

diff = diff.annotate(obs_min_ex = diff.mean_hets_final - diff.expected)
diff = diff.filter(hl.is_missing(diff.region), keep = False) # drop populations without a region label


# Make plot
p = ggplot(diff, aes(x=diff.population, 
                     y=diff.obs_min_ex,
                     color=diff.region)) + \
    geom_point() +\
    ylab("Number of Heterozygous sites") +\
    ggtitle("Difference in # of heterozygous sites between Actual & Expected, Post-QC") +\
    scale_x_discrete(breaks=list(range(diff.count()))) +\
    scale_color_manual(values=cont_colors)+\
    labs(color = 'Population')

# Show plot
p.show()



## 4.e. Site Frequency Spectrum

[Back to Index](#Index)

In [None]:
# This code chunk takes ~14min to run

# Perform Hail's variant QC 
post_qc_rows = hl.variant_qc(post_qc_mt).rows() 

# Aggregate site frequency data for plotting
sfs_post_qc = post_qc_rows.aggregate(hl.agg.hist(post_qc_rows.variant_qc.AF[1], 0,1,250))

- Write out the site frequency spectrum struct into a text file to make plotting faster 

```python3
with hl.hadoop_open(sfs_post_qc_path, 'w') as f:
    f.write(str(dict(sfs_post_qc)))
```
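To show what is being written out, here is a small pure-Python sketch of a histogram with the same fields that `hl.agg.hist` returns (`bin_edges`, `bin_freq`, `n_smaller`, `n_larger`), followed by a round trip through its string form. The helper `toy_hist` and the toy frequencies are hypothetical, and the exact handling of values on bin edges in Hail may differ from this sketch.

```python
import ast


def toy_hist(values, start, end, bins):
    """Bin values into equal-width bins between start and end (inclusive)."""
    width = (end - start) / bins
    bin_edges = [start + i * width for i in range(bins + 1)]
    bin_freq = [0] * bins
    n_smaller = n_larger = 0
    for v in values:
        if v < start:
            n_smaller += 1
        elif v > end:
            n_larger += 1
        else:
            # Values equal to `end` go into the last bin
            idx = min(int((v - start) / width), bins - 1)
            bin_freq[idx] += 1
    return {
        "bin_edges": bin_edges,
        "bin_freq": bin_freq,
        "n_smaller": n_smaller,
        "n_larger": n_larger,
    }


# Toy allele frequencies, binned into 4 bins instead of 250
toy_afs = [0.01, 0.02, 0.26, 0.5, 0.99, 1.0]
hist = toy_hist(toy_afs, 0, 1, 4)

# Round trip: write the dict as text, then parse it back
text = str(hist)
restored = ast.literal_eval(text)
print(restored["bin_freq"])
```

Parsing with `ast.literal_eval` only accepts Python literals, which is a safer choice than `eval` when reading a file back in.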

[Back to Index](#Index)

In [46]:
# Read in the site frequency spectrum that was written out as text
import ast

with hl.hadoop_open(sfs_post_qc_path) as f:
    sfs_dict = ast.literal_eval(f.read()) # parse the dict literal without using eval()
sfs_struct = hl.Struct(**sfs_dict)

# Plot site frequency spectrum histogram using hl.plot
# Similar to the other plots, this can also be plotted using ggplot. However, it does take more time to run
sfs_p = hl.plot.histogram(sfs_struct, log = True, legend = "Frequency of alternate allele at site")
show(sfs_p)

# 5. Investigating gnomAD Sample Filters
   
The sample QC above already accounts for this (<code>mt = pre_qc_mt.filter_cols(~pre_qc_mt.gnomad_sample_filters.hard_filtered)</code>). Here we show how we retrieved samples from populations that were wholly removed by the gnomAD sample QC filters, before a field was available that indicated which samples should be removed (<code>pre_qc_mt.gnomad_sample_filters.hard_filtered</code>). As a validity check, you can run <code>pre_qc_mt.aggregate_cols(hl.agg.counter(pre_qc_mt.gnomad_sample_filters.hard_filtered))</code>; the count for <code>True</code> will be 31, equal to the number of samples that were correctly removed. 

9 out of the 28 gnomAD sample filters were dropping large numbers of ancestrally diverse individuals, mostly from African (`AFR`) and Oceanian (`OCE`) populations. The following populations were removed entirely: 
- Biaka
- Mbuti
- Bougainville
- PapuanSepik
- PapuanHighlands
- San

The filters rely on gnomAD’s ancestry principal component analysis (PCA) which captures genetic variance across the larger gnomAD callset, and smaller, under-represented groups such as those in the HGDP+1kGP callset can appear erroneously as outliers. Here we explore which original gnomAD sample QC filters removed entire populations.
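The retrieval logic used in the cell below is a set difference: a sample is kept if, after ignoring the filters that wrongly removed whole populations, it has no failed filters left. A minimal pure-Python sketch of that rule follows; the sample names and the filter name `some_other_filter` are hypothetical, and the set of "bad" filters here is a toy stand-in for the one built from the matrix table.

```python
# Toy stand-in for the filters that removed populations wholly
bad_sample_filters = {"fail_n_snp_residual"}

# Toy data: sample -> set of sample QC filters it failed
toy_failed_filters = {
    "S1": set(),                                         # passed everything
    "S2": {"fail_n_snp_residual"},                       # failed only a "bad" filter
    "S3": {"fail_n_snp_residual", "some_other_filter"},  # also failed another filter
}

# Keep samples whose failed filters are all in the "bad" set
kept = {
    sample
    for sample, failed in toy_failed_filters.items()
    if len(failed - bad_sample_filters) == 0
}

print(sorted(kept))
```

`S2` is retrieved because its only failure comes from a filter that is being ignored, while `S3` stays removed because it also failed a filter outside that set.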


<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols"> More on  <i> filter_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.SetExpression.html#hail.expr.SetExpression.difference"> More on  <i> difference() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.CollectionExpression.html#hail.expr.CollectionExpression.length"> More on  <i> length() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [47]:
# Put the gnomAD QC filters in a set - gnomAD has them in a column field
bad_sample_filters = set(pre_qc_mt.gnomad_sample_qc_metric_outlier_cutoffs.qc_metrics_stats) 

# Keep samples that passed all gnomAD QC filters OR only failed the filters that removed populations wholly
mt_filt = pre_qc_mt.filter_cols(pre_qc_mt['gnomad_sample_filters']['qc_metrics_filters'].difference(bad_sample_filters).length() == 0)

# How many samples were removed by the gnomAD QC filters correctly? 
# count_cols() counts samples only, which avoids also counting every variant
n_before = pre_qc_mt.count_cols()
n_after = mt_filt.count_cols()
print('Num of samples at the beginning = ' + str(n_before)) 
print('Num of samples after retrieving wrongly removed ones = ' + str(n_after)) 
print('Samples removed correctly by gnomAD filters = ' + str(n_before - n_after)) 

Num of samples at the beginning = 4150
Num of samples after retrieving wrongly removed ones = 4119



Samples removed correctly by gnomAD filters = 31


## 5.a. Plotting Results of gnomAD Sample Filter Investigation

Here we only show <code>fail_n_snp_residual</code> as an example but the code can be implemented on any of the other gnomAD sample filters. 
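The quantity plotted in this section is, for each population, the number of samples failing the filter divided by the total number of samples. The following pure-Python sketch computes that ratio on toy rows. The population names and rows are hypothetical; the flag is compared against the string `'true'` because the metadata is read in without type imputation in the next cell.

```python
from collections import defaultdict

FILTER_COL = "sample_filters.fail_n_snp_residual"

# Toy metadata rows: the filter flag is a string, not a boolean
toy_metadata = [
    {"population": "PopA", FILTER_COL: "true"},
    {"population": "PopA", FILTER_COL: "true"},
    {"population": "PopB", FILTER_COL: "false"},
    {"population": "PopB", FILTER_COL: "true"},
    {"population": "PopB", FILTER_COL: "false"},
    {"population": "PopB", FILTER_COL: "false"},
]

# Count total and failing samples per population
n_total = defaultdict(int)
n_failed = defaultdict(int)
for row in toy_metadata:
    pop = row["population"]
    n_total[pop] += 1
    if row[FILTER_COL] == "true":
        n_failed[pop] += 1

# Ratio of failed samples to total samples in each population
fail_ratio = {pop: n_failed[pop] / n_total[pop] for pop in n_total}
print(fail_ratio)
```

A ratio of 1.0, as for `PopA` in this toy example, corresponds to a population that the filter removes entirely.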

[Back to Index](#Index)

In [None]:
# Read in gnomAD's HGDP+1kGP metadata without imputing field types from the file
metadata = hl.import_table(metadata_path)

In [49]:
# Add gnomAD's sample filters into a list 
sample_filters = [name for name in list(metadata.row) if 'sample_filters.' in name][:-1]

# Within each population, count the total number of samples and the number of samples that failed each filter   
filters_ht = (metadata.group_by(metadata['population'])
               .aggregate(n = hl.agg.count(),
                          **{col: hl.agg.count_where(metadata[col] == 'true') for col in sample_filters}))   

# Add a column to indicate the 6 populations that were filtered out by gnomAD's sample filters 
filtered_samples = hl.set(["Biaka", "Mbuti", "Bougainville", "PapuanSepik", "PapuanHighlands", "San"])
filters_ht = filters_ht.annotate(failed_gnomAD = hl.if_else(filtered_samples.contains(filters_ht['population']), 'TRUE', 'FALSE'))

# Grab only "sample_filters.fail_n_snp_residual" column 
filters_ht = filters_ht.key_by() # unkey table first so the population column isn't duplicated 
n_snp_resid = filters_ht.select(population = filters_ht['population'],
                       num_samples = filters_ht['n'],
                       fail_n_snp_resid = filters_ht['sample_filters.fail_n_snp_residual'],
                       fail_gnomAD = filters_ht['failed_gnomAD'])

# Calculate the ratio between the number of samples that failed and the total number of samples in the population 
n_snp_resid = n_snp_resid.annotate(fail_ratio = n_snp_resid.fail_n_snp_resid/n_snp_resid.num_samples)

# Generate a scatter plot of ratios across all populations colored by gnomAD failure 
p = ggplot(n_snp_resid, aes(x=n_snp_resid.population, 
                            y=n_snp_resid.fail_ratio, 
                            color=n_snp_resid.fail_gnomAD)) +\
    geom_point() +\
    ylab("Ratio of failed samples/total samples") + \
    ggtitle("Failure of gnomAD's n_snp_residual sample filter (population-level)") +\
    labs(color = 'Failed gnomAD filters') +\
    scale_x_discrete(breaks=list(range(n_snp_resid.count())))

# Show Plot
p.show()


### NOTE: The gnomAD sample filters were first investigated and plotted (with better resolution) using R. Click [here](https://github.com/atgu/hgdp_tgp/blob/master/figure_generation/obtain_failed_samples_plot_ratio.Rmd) for more information.

[Back to Index](#Index)