## Index
1. [Data Management Function](#1.-Data-Management-Function)
2. [Read in Datasets and Annotate](#2.-Read-in-Datasets-and-Annotate)
3. [Investigating gnomAD Sample Filters](#3.-Investigating-gnomAD-sample-filters)
4. [Plotting Results of gnomAD Sample Filter Investigation](#4.-Plotting-results-of-gnomAD-sample-filter-investigation)
5. [Pre-QC Plots](#5.-Pre-QC-Plots)
    1. [Number of SNPs](#5a-Number-of-SNPs)
    2. [Mean Coverage](#5b-Mean-Coverage)
    3. [Freemix](#5c-Freemix)
6. [Post-QC Plots](#6.-Post-QC-Plots)
    1. [Number of SNPs](#6a-Number-of-SNPs)
    2. [Mean Coverage](#6b-Mean-Coverage)
    3. [Freemix](#6c-Freemix)
    4. [Site Frequency Spectrum](#6d-Site-Frequency-Spectrum)

# General Overview
The purpose of this script is to merge the metadata components needed for the HGDP+1kGP dataset and then run QC filters on the resulting dataset. The metadata, which include sample and variant information such as geographic region and which samples/variants passed QC, were initially located in different datasets. The QC filters were run using sample/variant flags from the metadata datasets. These flags were generated by running the dataset through the gnomAD QC pipeline. More information on the gnomAD QC pipeline can be found [here](https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#sample-and-variant-quality-control). To see how these filters were updated as a result of our analyses, see [gnomAD sample filters](#3.-Investigating-gnomAD-sample-filters) and the resulting gnomAD [minor release](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/#improvements-to-the-hgdp--1kg-subset-release).

**This script contains information on how to**:
- Annotate new fields onto a matrix table from another matrix table or hail table
- Unflatten a hail matrix table
- Harmonize datasets to prevent merge conflicts
- Use plots to identify which gnomAD QC filters are removing populations entirely (fail_n_snp_residual used as an example)
- Retrieve populations being unduly removed by filters (mostly AFR and OCE populations)
- Filter matrix tables using a field within the matrix table
- Filter samples using a hardcoded list of samples to remove
- Plot certain fields from the matrix table:
    - Number of SNPs
    - Coverage
    - Site frequency spectrum
    - Freemix
    - Number of samples which failed a filter

**Datasets merged are**:
- sample_meta: sample metadata table which contains harmonized metadata for the HGDP_1kGP dataset
- sample_qc_meta: gnomAD v3.1 sample QC metadata for the hgdp_1kg subset, which contains flags denoting which samples failed gnomAD QC filters
- dense_mt: densified hgdp_1kg matrix table with a field of flags (mt.filters) to denote which variants passed or failed gnomAD qc filters

Authors: Zan Koenig & Mary T. Yohannes

In [1]:
import hail as hl

# for renaming purposes
import re

# the import statements below allow for plotting in hail
from hail.ggplot import *
import plotly
import pandas as pd

from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

## Set Requester Pays Bucket
To run these tutorials, users must specify which project is billed for reading from the requester pays buckets. To change which project is billed, set the `GCP_PROJECT_NAME` variable to your own project.
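
To avoid editing the notebook for every user, the billing project could also be taken from an environment variable. The sketch below only builds the `spark_conf` dictionary that is handed to `hl.init` in the next cell. The helper name `requester_pays_conf` and the use of a `GCP_PROJECT_NAME` environment variable are assumptions made for illustration and are not part of the original tutorial.

```python
import os

def requester_pays_conf(project=None, buckets=("hgdp_tgp", "gcp-public-data--gnomad")):
    """Build the Spark settings for requester pays buckets (hypothetical helper)."""
    # Assumption: fall back to an environment variable when no project is passed in
    project = project or os.environ.get("GCP_PROJECT_NAME")
    if not project:
        raise ValueError("Set a billing project before initialising Hail")
    return {
        "spark.hadoop.fs.gs.requester.pays.mode": "CUSTOM",
        "spark.hadoop.fs.gs.requester.pays.buckets": ",".join(buckets),
        "spark.hadoop.fs.gs.requester.pays.project.id": project,
    }

conf = requester_pays_conf("my-billing-project")  # placeholder project name
print(conf["spark.hadoop.fs.gs.requester.pays.buckets"])
```

The resulting dictionary could then be passed along as `hl.init(spark_conf=conf)`.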

In [None]:
# setting requester pays bucket to use throughout tutorial
GCP_PROJECT_NAME = "diverse-pop-seq-ref" # change this to your project name
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'hgdp_tgp,gcp-public-data--gnomad',
    'spark.hadoop.fs.gs.requester.pays.project.id': GCP_PROJECT_NAME
})

# 1. Data Management Function
This function reads in the dataset at different stages of the tutorial and, given certain flags, lets the user specify which filters to run on the dataset. It reduces the number of times the data need to be written out, which decreases the computational and monetary cost of running the tutorials.

<br>
<details><summary>Click <u><span style="color:blue">here</span></u> for more information on the function arguments.</summary> 
    
<br> 
Click on each argument name to learn more!

<ul>    
<li><details><summary><u>
<span style="color:blue">default</span></u></summary>
    
<p>when <i><b>True</b></i>, will return a pre-QC matrix table</p></details></li>

<li><details><summary><u>
<span style="color:blue">post_qc</span></u></summary> 
    
<p>when <i><b>True</b></i>, will return a matrix table which has the following conducted:</p>
<ul> 
    <li>sample_qc filtering</li>
    <li>variant_qc filtering</li>
    <li>outlier removal</li>
    <li>duplicate removal</li>  
    </ul></details></li>


<li><details><summary><u>  
<span style="color:blue">sample_qc</span></u></summary>     
    
<p>when <i><b>True</b></i>, will return a matrix table with gnomAD's sample QC filters run on the dataset. For more information on gnomAD's sample QC steps click <a href="https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#sample-qc-hard-filtering"> here.</a></p></details></li>
    
<li><details><summary><u>  
<span style="color:blue">variant_qc</span></u></summary>     
    
<p>when <i><b>True</b></i>, will return a matrix table with gnomAD's variant quality control filters run on the dataset. For more information on gnomAD's variant QC steps click <a href="https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#variant-qc"> here.</a></p></details></li>

<li><details><summary><u> 
<span style="color:blue">duplicate</span></u></summary>     
    
<p>when <i><b>True</b></i>, will return a matrix table with any duplicates in the dataset removed. By default there are no duplicates in the dataset, but this step was included because it is a useful QC step to demonstrate.</p></details></li>
  
<li><details><summary><u>
<span style="color:blue">outlier_removal</span></u></summary>     
    
<p>when <i><b>True</b></i>, will return a matrix table with pca outliers removed. These outliers were determined by running pc_relate. More information on how we created the outlier list can be found <a href="https://nbviewer.org/github/atgu/hgdp_tgp/blob/master/tutorials/nb4.ipynb#5.-Outlier-Removal"> here.</a></p></details></li> 
    
<li><details><summary><u>  
<span style="color:blue">ld_pruning</span></u></summary>     
   
<p>when <i><b>True</b></i>, will return a matrix table which has the following conducted:</p>
<ul> 
    <li>sample_qc filtering</li>
    <li>variant_qc filtering</li>
    <li>outlier removal</li>
    <li>duplicate removal</li>
    <li>call rate filter to variants whose call rate is > 0.999</li>
    <li>allele frequency filter on variants to only keep variants with 0.05 < AF < 0.95</li>
    </ul></details></li>   
    
<li><details><summary><u>
<span style="color:blue">rel_unrel</span></u></summary>  

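<p>when <i><b>default</b></i>, will leave the matrix table selected by the other arguments unchanged</p>

<p>when <i><b>related_pre_outlier</b></i>, will return a matrix table with only related samples, before PCA outlier removal</p>

<p>when <i><b>unrelated_pre_outlier</b></i>, will return a matrix table with only unrelated samples, before PCA outlier removal</p>

<p>when <i><b>related_post_outlier</b></i>, will return a matrix table with only related samples, after PCA outlier removal</p>

<p>when <i><b>unrelated_post_outlier</b></i>, will return a matrix table with only unrelated samples, after PCA outlier removal</p></details></li></ul>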

</details> 
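
Because `rel_unrel` is a free-form string, a misspelled value matches none of the branches in the function below and the matrix table built so far is returned without any warning. One way to guard against this is to look the value up in a dictionary and raise an error for unknown options. The sketch below uses the paths that appear in the function; `REL_UNREL_PATHS` and `resolve_rel_unrel` are hypothetical names introduced for illustration.

```python
# Map each accepted rel_unrel value to the matrix table it reads
# (None means: keep the table built by the other arguments)
REL_UNREL_PATHS = {
    "default": None,
    "related_pre_outlier": "gs://hgdp-1kg/hgdp_tgp/rel_updated.mt",
    "unrelated_pre_outlier": "gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt",
    "related_post_outlier":
        "gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt",
    "unrelated_post_outlier":
        "gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt",
}

def resolve_rel_unrel(rel_unrel="default"):
    """Return the path for a rel_unrel option, or None when nothing needs to be read."""
    if rel_unrel not in REL_UNREL_PATHS:
        options = ", ".join(sorted(REL_UNREL_PATHS))
        raise ValueError(f"Unknown rel_unrel value {rel_unrel!r}; expected one of: {options}")
    return REL_UNREL_PATHS[rel_unrel]

print(resolve_rel_unrel("unrelated_pre_outlier"))
```

With a lookup like this, the repeated `if`/`elif` branches could collapse into a single `hl.read_matrix_table` call on the resolved path.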
    
[Back to Index](#Index)

In [None]:
import hail as hl

def read_qc(
        default: bool = False,
        post_qc:bool = False,
        sample_qc: bool = False,
        variant_qc: bool = False,
        duplicate: bool = False,
        outlier_removal: bool = False,
        ld_pruning: bool = False,
        rel_unrel: str = 'default',
        n_partitions: int = 0) -> hl.MatrixTable:
    """
    Wrapper function to get HGDP+1kGP data as Matrix Table at different stages of QC/filtering.
    By default, returns pre QC MatrixTable with qc filters annotated but not filtered.

    :param bool default: if True will return the pre-QC version of the dataset
    :param bool post_qc: if True will return a post QC matrix table that has gone through:
        - sample QC
        - variant QC
        - duplicate removal
        - outlier removal
    :param bool sample_qc: if True will return a post sample QC matrix table
    :param bool variant_qc: if True will return a post variant QC matrix table
    :param bool duplicate: if True will return a matrix table with duplicate samples removed
    :param bool outlier_removal: if True will return a matrix table with PCA outliers and duplicate samples removed
    :param bool ld_pruning: if True will return a matrix table that has gone through:
        - sample QC
        - variant QC
        - duplicate removal
        - LD pruning
        - additional variant filtering
    :param str rel_unrel: if 'default' will leave the matrix table built by the other arguments unchanged
        if 'related_pre_outlier' will return a matrix table with only related samples pre pca outlier removal
        if 'unrelated_pre_outlier' will return a matrix table with only unrelated samples pre pca outlier removal
        if 'related_post_outlier' will return a matrix table with only related samples post pca outlier removal
        if 'unrelated_post_outlier' will return a matrix table with only unrelated samples post pca outlier removal
    :param int n_partitions: if specified, will read in dataset with given number of partitions for the following arguments:
        - ld_pruning
        - rel_unrel
    """
    # Reading in all the tables and matrix tables needed to generate the pre_qc matrix table
    sample_meta = hl.import_table('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/gnomad_meta_v1.tsv')
    sample_qc_meta = hl.read_table('gs://hgdp_tgp/output/gnomad_v3.1_sample_qc_metadata_hgdp_tgp_subset.ht')
    dense_mt = hl.read_matrix_table(
        'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt')
    
    dense_mt = dense_mt.naive_coalesce(5000)


    # Takes a dict and converts it to a hail struct (works with nested dicts too)
    def dict_to_struct(d):
        fields = {}
        for k, v in d.items():
            if isinstance(v, dict):
                v = dict_to_struct(v)
            fields[k] = v
        return hl.struct(**fields)

    # un-flattening a hail table with nested structure
    # dict to hold struct names as well as nested field names
    d = {}

    # Getting the row field names
    row = sample_meta.row_value

    # returns a dict with the struct names as keys and their inner field names as values
    for name in row:
        def recur(dict_ref, split_name):
            if len(split_name) == 1:
                dict_ref[split_name[0]] = row[name]
                return
            existing = dict_ref.get(split_name[0])
            if existing is not None:
                assert isinstance(existing, dict), existing
                recur(existing, split_name[1:])
            else:
                existing = {}
                dict_ref[split_name[0]] = existing
                recur(existing, split_name[1:])
        recur(d, name.split('.'))

    # using the dict created from flattened struct, creating new structs now un-flattened
    sample_meta = sample_meta.select(**dict_to_struct(d))
    sample_meta = sample_meta.key_by('s')

    # grabbing the columns needed from HGDP metadata
    new_meta = sample_meta.select(sample_meta.hgdp_tgp_meta, sample_meta.bergstrom)

    # creating a table with gnomAD sample metadata and HGDP metadata
    ht = sample_qc_meta.annotate(**new_meta[sample_qc_meta.s])

    # stripping 'v3.1::' from the names to match with the densified MT
    ht = ht.key_by(s=ht.s.replace("v3.1::", ""))

    # Using the annotate_cols() method to annotate the combined gnomAD sample QC and HGDP metadata onto the matrix table
    mt = dense_mt.annotate_cols(**ht[dense_mt.s])

    print(f"sample_qc: {sample_qc}\nvariant_qc: {variant_qc}\nduplicate: {duplicate}" \
          f"\noutlier_removal: { outlier_removal}\nld_pruning: {ld_pruning}\nrel_unrel: {rel_unrel}")
    
    if default:
        print("Returning default preQC matrix table")
        # returns preQC dataset
        return mt
    
    if post_qc:
        print("Returning post sample and variant QC matrix table with duplicates and PCA outliers removed")
        sample_qc = True
        variant_qc = True
        duplicate = True
        outlier_removal = True
    
    if sample_qc:
        print("Running sample QC")
        # run data through sample QC
        # filtering samples to those who should pass gnomADs sample QC
        # this filters to only samples that passed gnomad sample QC hard filters
        mt = mt.filter_cols(~mt.sample_filters.hard_filtered)

    if variant_qc:
        print("Running variant QC")
        # run data through variant QC
        # Subsetting the variants in the dataset to only PASS variants (those which passed gnomAD's variant QC)
        # PASS variants are variants whose filters field is empty.
        # The filters field holds the variant QC filters that a variant failed,
        # so rows with a non-empty filters field are removed (keep=False)
        mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)

    if duplicate:
        print("Removing any duplicate samples")
        # Removing any duplicates in the dataset using hl.distinct_by_col() which removes
        # columns with a duplicate column key. It keeps one column for each unique key.
        # after updating to the new dense_mt, this step is no longer necessary to run
        mt = mt.distinct_by_col()

    if outlier_removal:
        print("Removing PCA outliers")
        # remove PCA outliers and duplicates
        # reading in the PCA outlier list
        # To read in the PCA outlier list, first need to read the file in as a list
        # using hl.hadoop_open here which allows one to read in files into hail from Google cloud storage
        pca_outlier_path = 'gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt'
        with hl.utils.hadoop_open(pca_outlier_path) as file:
            outliers = [line.rstrip('\n') for line in file]

        # Using hl.literal here to convert the list from a python object to a hail expression so that it can be used
        # to filter out samples
        outliers_list = hl.literal(outliers)

        # Using the list of PCA outliers, using the ~ operator which is a negation operator and obtains the complement
        # In this case the complement is samples which are not contained in the pca outlier list
        mt = mt.filter_cols(~outliers_list.contains(mt['s']))

    if ld_pruning:
        print("Returning ld pruned post variant and sample QC matrix table pre PCA outlier removal ")
        # read in dataset which has additional variant filtering and ld pruning run
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt')
         

    if rel_unrel == "default":
        # do nothing
        # created a default value because there are multiple options for rel/unrel datasets
        mt = mt

    elif rel_unrel == 'related_pre_outlier':
        print("Returning post sample and variant QC matrix table " \
              "pre PCA outlier removal with only related individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate 
        #   - filter to only related individuals   
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/rel_updated.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/rel_updated.mt')
        
    elif rel_unrel == 'unrelated_pre_outlier':
        print("Returning post sample and variant QC matrix table " \
              "pre PCA outlier removal with only unrelated individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate 
        #   - filter to only unrelated individuals
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt')

    elif rel_unrel == 'related_post_outlier':
        print("Returning post sample and variant QC matrix table " \
              "post PCA outlier removal with only related individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate 
        #   - filter to only related individuals
        #   - PCA outlier removal
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt')

    elif rel_unrel == 'unrelated_post_outlier':
        print("Returning post sample and variant QC matrix table " \
              "post PCA outlier removal with only unrelated individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate 
        #   - filter to only unrelated individuals
        #   - PCA outlier removal
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt')
        
    # Calculating both variant and sample_qc metrics on the mt before returning
    # so the stats are up to date with the version being written out
    mt = hl.sample_qc(mt)
    mt = hl.variant_qc(mt)
    
    return mt
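
The un-flattening step inside `read_qc` turns dotted column names such as `hgdp_tgp_meta.Genetic.region` back into nested structs. The same idea is easier to see in plain Python, with nested dictionaries standing in for Hail structs. This is only a sketch: `unflatten` is a hypothetical helper, and the `hgdp_tgp_meta.Population` field and all of the values are made up for illustration.

```python
def unflatten(flat):
    """Turn {'a.b': 1, 'a.c': 2, 's': 'x'} into {'a': {'b': 1, 'c': 2}, 's': 'x'}."""
    nested = {}
    for dotted_name, value in flat.items():
        parts = dotted_name.split('.')
        node = nested
        # Walk (or create) one dictionary level per prefix component
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"{dotted_name!r} clashes with a non-nested field")
        # The last component is the actual field name
        node[parts[-1]] = value
    return nested

flat_row = {
    "s": "sample_1",                          # made-up sample ID
    "hgdp_tgp_meta.Population": "PopA",       # made-up field and value
    "hgdp_tgp_meta.Genetic.region": "AFR",
}
print(unflatten(flat_row))
```

In the function above, the nested dictionary is then converted into a struct by `dict_to_struct`.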

# 2. Read in Datasets and Annotate
<br>
<details><summary>Click <u><span style="color:blue">here</span></u> for more information about the input dataset.</summary>

This input matrix table is a combination of 3 datasets: harmonized sample metadata for the HGDP+1KG dataset, gnomAD v3.1 sample QC metadata with samples that failed gnomAD QC filters flagged, and a densified HGDP+1KG matrix table.

</details>
<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_matrix_table"> More on  <i> read_matrix_table() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_table"> More on  <i> read_table() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on  <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.describe"> More on  <i> describe() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# read-in the matrix table (shortened as mt)
mt = read_qc(default=True)

# how many snps and samples are there? counts 
print('Num of snps and samples prior to any analysis = ' + str(mt.count())) # 211358784 snps & 4151 samples 

# explore combined mt 
mt.describe()

# 3. Investigating gnomAD sample filters

<br>
<details><summary>Click <u><span style="color:blue">here</span></u> to learn why we are doing this.</summary>
    
9 out of the 28 gnomAD sample filters were dropping large numbers of ancestrally diverse individuals (mostly from African (AFR) and Oceanian (OCE) populations). The filters use gnomAD's principal component analysis (PCA), which is obtained from other samples, to residualize the distribution of values from different populations and identify outliers. If a sample is identified as an outlier, it fails the filter.

</details>
<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols"> More on  <i> filter_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.SetExpression.html#hail.expr.SetExpression.difference"> More on  <i> difference() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.CollectionExpression.html#hail.expr.CollectionExpression.length"> More on  <i> length() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# put the gnomAD qc filters in a set 
all_sample_filters = set(mt['sample_filters']) 

# select out the filters that are removing whole populations despite them passing all other gnomAD filters
# if a filter name starts with 'fail_', add it to a new set after removing the 'fail_' prefix from the name
bad_sample_filters = {re.sub('^fail_', '', x) for x in all_sample_filters if x.startswith('fail_')}

# keep the samples that passed all gnomAD QC filters OR only failed the filters that were removing whole populations
mt_filt = mt.filter_cols(mt['sample_filters']['qc_metrics_filters'].difference(bad_sample_filters).length() == 0)

# how many samples were removed by the initial QC?
# count the columns once per matrix table instead of calling count() repeatedly
n_before = mt.count_cols()
n_after = mt_filt.count_cols()
print('Num of samples before initial QC = ' + str(n_before)) # 4151
print('Num of samples after initial QC = ' + str(n_after)) # 4120
print('Samples removed = ' + str(n_before - n_after)) # 31
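
The column filter above relies on a set difference: a sample is kept when nothing remains after the filters judged to be problematic are subtracted from the set of filters it failed. A plain-Python sketch of that logic on toy data follows. The sample names, `fail_example_metric` and `some_other_filter` are invented for illustration; only `fail_n_snp_residual` is a name taken from this tutorial.

```python
# Toy stand-in for the field names of mt.sample_filters
all_sample_filters = {"hard_filtered", "fail_n_snp_residual", "fail_example_metric"}

# Strip the 'fail_' prefix so the names match the entries of qc_metrics_filters
bad_sample_filters = {
    name[len("fail_"):] for name in all_sample_filters if name.startswith("fail_")
}

# Each sample maps to the set of QC metric filters it failed
qc_metrics_filters = {
    "sample_A": set(),                                    # passed everything
    "sample_B": {"n_snp_residual"},                       # failed only a "bad" filter
    "sample_C": {"n_snp_residual", "some_other_filter"},  # also failed another filter
}

# Keep a sample when no failed filter is left after removing the "bad" ones
kept = sorted(
    sample for sample, failed in qc_metrics_filters.items()
    if len(failed - bad_sample_filters) == 0
)
print(kept)
```

Here `sample_A` and `sample_B` are kept, while `sample_C` is dropped because it failed a filter outside the excluded set.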

# 4. Plotting results of gnomAD sample filter investigation
[Back to Index](#Index)

In [None]:
# read in the population-level table of failed filters
filepath = "gs://hgdp-1kg/hgdp_tgp/intermediate_files/failed_filters_population_level.csv"
filters = hl.import_table(filepath, delimiter=',')

In [None]:
# strip the extraneous quotation marks from the population names and convert the counts to integers
n_snp_resid = filters.annotate(population = filters['"population"'][1:hl.len(filters['"population"'])-1],
                               num_samples = hl.int(filters['"num_of_samples"']),
                               fail_n_snp_resid = hl.int(filters['"sample_filters.fail_n_snp_residual"']),
                               fail_gnomAD = hl.str(filters['"failed_gnomAD"']))

# grab only the columns needed for "sample_filters.fail_n_snp_residual"
n_snp_resid = n_snp_resid.select("population", "num_samples", "fail_n_snp_resid", "fail_gnomAD")

# calculate the ratio between the number of samples that failed and the total number of samples in the population. 
n_snp_resid = n_snp_resid.annotate(fail_ratio = n_snp_resid.fail_n_snp_resid/n_snp_resid.num_samples)

In [None]:
n_snp_resid.show()
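
The quote stripping and the ratio calculation can be checked on a small example outside of Hail. The sketch below parses made-up rows shaped like the CSV with Python's `csv` module, which removes the surrounding quotation marks while parsing, and then computes the same `fail_ratio`. The population names and counts are invented. Hail's `import_table` also documents a `quote` argument that might make the manual slicing above unnecessary, but that has not been tested here.

```python
import csv
import io

# Made-up rows: header names and string values are wrapped in quotation marks
raw = io.StringIO(
    '"population","num_of_samples","sample_filters.fail_n_snp_residual","failed_gnomAD"\n'
    '"PopA",20,5,"False"\n'
    '"PopB",10,10,"True"\n'
)

rows = []
for record in csv.DictReader(raw):
    n_total = int(record["num_of_samples"])
    n_failed = int(record["sample_filters.fail_n_snp_residual"])
    rows.append({
        "population": record["population"],
        # ratio of samples failing the filter to all samples in the population
        "fail_ratio": n_failed / n_total,
        "fail_gnomAD": record["failed_gnomAD"],
    })

for row in rows:
    print(row)
```

A ratio of 1.0, as for the second toy population, is the pattern that signals a filter removing a population entirely.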

In [None]:
# generate a scatter plot of the fail_n_snp_residual failure ratio across all populations, colored by gnomAD failure
plot_n_snp_resid = hl.ggplot.ggplot(n_snp_resid, hl.ggplot.aes(x=n_snp_resid.population, y=n_snp_resid.fail_ratio, \
                                                color=n_snp_resid.fail_gnomAD)) + \
    hl.ggplot.geom_point() +\
    hl.ggplot.ylab("Ratio of failed samples/total samples") + \
    hl.ggplot.ggtitle("Failure of gnomAD n_snp_residual filter by population")+\
    hl.ggplot.scale_x_discrete(breaks=list(range(78)))

plot_n_snp_resid.show()

# 5. Pre-QC Plots
When conducting quality control, it is often a good idea to plot metrics such as the number of SNPs and coverage, so that after removing samples or variants you can see how the dataset changed and spot anything that requires further investigation.

The following plots show the dataset prior to running any sample QC filters.

[Back to Index](#Index)

In [None]:
# Dict that maps each region name to its plotting color for both pre and post QC plots
newnames = {'AMR':"#E41A1C",'AFR':"#984EA3", 'OCE':"#999999", 'CSA':"#FF7F00", 
            'EAS':"#4DAF4A", 'EUR':"#377EB8", 'MID':"#A65628" }

In [None]:
# Using func to get pre_qc version of dataset
pre_qc = read_qc(default=True)
# As of hail v. 0.2.82, ggplot only takes in tables as input
# Making a table of samples for plotting
pre_qc_col = pre_qc.cols()
pre_qc_row = pre_qc.rows()

#### 5a. Number of SNPs

[Back to Index](#Index)

In [None]:
# Plotting histogram of number of SNPs for each individual within each global region
p = hl.ggplot.ggplot(pre_qc_col, hl.ggplot.aes(x = pre_qc_col.sample_qc.n_snp)) + \
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill = pre_qc_col.hgdp_tgp_meta.Genetic.region), min_val = 5000000, 
                             max_val = 7500000, bins = 200, position="identity", alpha = .7) + \
    hl.ggplot.xlab("Number of SNPs")+ \
    hl.ggplot.ggtitle("Number of SNPs, Pre-QC")+ \
    hl.ggplot.coord_cartesian(ylim = (0,260))


# Update colors
p = p.to_plotly()

p.for_each_trace(
    lambda trace: trace.update(marker=dict(color = newnames[trace.name]))
)

# Show plot
p.show()

#### 5b. Mean Coverage

[Back to Index](#Index)

In [None]:
# Create a density plot of mean coverage per individual
p = hl.ggplot.ggplot(pre_qc_col, hl.ggplot.aes(x = pre_qc_col.bam_metrics.mean_coverage)) + \
    hl.ggplot.geom_density(hl.ggplot.aes(fill=pre_qc_col.project_meta.title),
                             alpha = .7) + \
    hl.ggplot.xlab("Coverage (x)")+ \
    hl.ggplot.ggtitle("Mean coverage, Pre-QC")


# Show plot
p.show()

#### 5c. Freemix

[Back to Index](#Index)

In [None]:
# Plotting freemix colored by population 
freemix_pre_qc_pop = hl.ggplot.ggplot(pre_qc_col, hl.ggplot.aes(x = pre_qc_col.bam_metrics.freemix)) +\
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=pre_qc_col.hgdp_tgp_meta.Genetic.region), bins = 140) + \
    hl.ggplot.scale_y_log10("Count (log scale)") +\
    hl.ggplot.xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix, Pre-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

# update colors
freemix_pre_qc_pop = freemix_pre_qc_pop.to_plotly()
freemix_pre_qc_pop.for_each_trace(lambda trace: trace.update(marker=dict(color = newnames[trace.name])))

#show plot
freemix_pre_qc_pop.show()

In [None]:
# Plotting freemix colored by project
freemix_pre_qc_proj = hl.ggplot.ggplot(pre_qc_col, hl.ggplot.aes(x = pre_qc_col.bam_metrics.freemix)) +\
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=pre_qc_col.project_meta.title), position="identity", bins = 140,\
                            alpha = .5) + \
    hl.ggplot.scale_y_log10("Count (log scale)") +\
    hl.ggplot.xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix, Pre-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

#show plot
freemix_pre_qc_proj.show()

# 6. Post-QC Plots

The following plots are the same as those made above with pre-qc data except now the dataset has gone through:
- sample filtering
- variant filtering
- duplicate removal
- PCA outlier removal
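
The four steps listed above can be sketched on toy data in plain Python, in the order in which `read_qc` applies them. This is an illustration only, not the Hail implementation: the sample IDs, loci, flags and the filter name are invented, and lists of dictionaries stand in for the columns and rows of a matrix table.

```python
# Toy columns (samples) and rows (variants)
samples = [
    {"s": "S1", "hard_filtered": False},
    {"s": "S2", "hard_filtered": True},    # fails sample QC
    {"s": "S3", "hard_filtered": False},
    {"s": "S3", "hard_filtered": False},   # duplicate column key
    {"s": "S4", "hard_filtered": False},   # listed as a PCA outlier
]
variants = [
    {"locus": "chr1:100", "filters": set()},               # PASS: nothing failed
    {"locus": "chr1:200", "filters": {"example_filter"}},  # failed a variant filter
]
pca_outliers = {"S4"}

# 1. sample filtering: drop hard-filtered samples
kept = [col for col in samples if not col["hard_filtered"]]

# 2. variant filtering: keep only variants with an empty filters set
passing = [row["locus"] for row in variants if len(row["filters"]) == 0]

# 3. duplicate removal: keep the first column seen for each sample ID
seen, unique = set(), []
for col in kept:
    if col["s"] not in seen:
        seen.add(col["s"])
        unique.append(col)

# 4. PCA outlier removal: drop samples named in the outlier list
final = [col["s"] for col in unique if col["s"] not in pca_outliers]

print(final, passing)
```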

[Back to Index](#Index)

In [None]:
# Reading in postQC matrix table
post_qc = read_qc(post_qc=True)

# As of hail v. 0.2.82, ggplot only takes in tables as input
# Making a table of samples for plotting
post_qc_col = post_qc.cols()
post_qc_row = post_qc.rows()

#### 6a. Number of SNPs

[Back to Index](#Index)

In [None]:
# Plotting histogram of number of SNPs for each individual, filled by global region
n_snp_post_qc = hl.ggplot.ggplot(post_qc_col, hl.ggplot.aes(x = post_qc_col.sample_qc.n_snp)) + \
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=post_qc_col.hgdp_tgp_meta.Genetic.region), min_val = 5000000, 
                             max_val = 7500000, bins = 200, position="identity", alpha = .7) + \
    hl.ggplot.xlab("Number of SNPs")+ \
    hl.ggplot.ggtitle("Number of SNPs, Post-QC") + \
    hl.ggplot.coord_cartesian(ylim = (0,260)) 


# Update colors
n_snp_post_qc = n_snp_post_qc.to_plotly()

n_snp_post_qc.for_each_trace(
    lambda trace: trace.update(marker=dict(color = newnames[trace.name]))
)

#show plot
n_snp_post_qc.show()

#### 6b. Mean Coverage

[Back to Index](#Index)

In [None]:
# Create a density plot of mean coverage from bam_metrics
# Separate by project (HGDP or 1kGP)
cov_post_qc = hl.ggplot.ggplot(post_qc_col, hl.ggplot.aes(x = post_qc_col.bam_metrics.mean_coverage)) + \
    hl.ggplot.geom_density(hl.ggplot.aes(fill=post_qc_col.project_meta.title),
                             alpha = .7) + \
    hl.ggplot.xlab("Coverage (x)")+ \
    hl.ggplot.ggtitle("Mean coverage, Post-QC")

cov_post_qc.show()

#### 6c. Freemix

[Back to Index](#Index)

In [None]:
freemix_post_qc_pop = hl.ggplot.ggplot(post_qc_col, hl.ggplot.aes(x = post_qc_col.bam_metrics.freemix)) +\
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=post_qc_col.hgdp_tgp_meta.Genetic.region), bins = 70,\
                            alpha = 1) + \
    hl.ggplot.scale_y_log10("Count (log scale)") +\
    hl.ggplot.xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix, Post-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

# update colors
freemix_post_qc_pop = freemix_post_qc_pop.to_plotly()
freemix_post_qc_pop.for_each_trace(lambda trace: trace.update(marker=dict(color = newnames[trace.name])))

#show plot
freemix_post_qc_pop.show()

In [None]:
freemix_post_qc_proj = hl.ggplot.ggplot(post_qc_col, hl.ggplot.aes(x = post_qc_col.bam_metrics.freemix)) +\
    hl.ggplot.geom_histogram(hl.ggplot.aes(fill=post_qc_col.project_meta.title), position="identity", bins = 70,\
                            alpha = .5) + \
    hl.ggplot.scale_y_log10("Count (log scale)") +\
    hl.ggplot.xlab("Freemix") + \
    hl.ggplot.ggtitle("Bam metrics: Freemix, Post-QC")+ \
    hl.ggplot.coord_cartesian(xlim = (0,.5))

#show plot
freemix_post_qc_proj.show()

#### 6d. Site Frequency Spectrum

[Back to Index](#Index)

In [None]:
# # Aggregating site frequency data for plotting
# # Writing out an intermediate file to cut down on plotting time
# # This section is commented out since users will only need to read in the new dataset
# sfs_data = post_qc_row.aggregate(hl.agg.hist(post_qc_row.freq.AF[1], 0, 1, 250))
# with hl.hadoop_open('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/sfs_pre_qc.txt', 'w') as f:
#     f.write(str(dict(sfs_data)))
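
The commented-out aggregation above uses `hl.agg.hist` to bin allele frequencies into 250 equal-width bins between 0 and 1, and that summary is what gets plotted later. As a rough illustration of what such a summary contains, here is a plain-Python sketch. `fixed_width_hist` is a hypothetical helper, the field names mirror the Hail result as we understand it, and the treatment of values on the upper edge is an assumption rather than a statement about Hail's implementation.

```python
def fixed_width_hist(values, start, end, bins):
    """Count values into `bins` equal-width bins spanning [start, end]."""
    width = (end - start) / bins
    result = {
        "bin_edges": [start + i * width for i in range(bins + 1)],
        "bin_freq": [0] * bins,
        "n_smaller": 0,   # values below `start`
        "n_larger": 0,    # values above `end`
    }
    for value in values:
        if value < start:
            result["n_smaller"] += 1
        elif value > end:
            result["n_larger"] += 1
        else:
            # Assumption: a value equal to `end` is counted in the last bin
            index = min(int((value - start) / width), bins - 1)
            result["bin_freq"][index] += 1
    return result

toy_af = [0.0, 0.1, 0.5, 1.0]   # made-up allele frequencies
hist = fixed_width_hist(toy_af, 0, 1, 4)
print(hist["bin_freq"])
```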

In [None]:
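import ast

# Load in data
with hl.hadoop_open('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/sfs_post_qc.txt') as sfs_post_qc:
    # the file holds the text form of a Python dict, so parse it as a literal rather than with eval
    sfs_dict = ast.literal_eval(sfs_post_qc.read())
sfs_struct = hl.Struct(**sfs_dict)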

# Plot site frequency spectrum histogram
sfs_p = hl.plot.histogram(sfs_struct, log = True, legend = "Frequency of major allele at site")
show(sfs_p)
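
The histogram summary is stored as the text form of a Python dictionary, so it has to be parsed back into a dictionary when it is loaded. A minimal sketch of that round trip is shown below, using `ast.literal_eval`, which accepts only Python literals and so will not execute arbitrary code found in a file. The numbers are made up and only the shape of the dictionary follows the saved summary.

```python
import ast

# Made-up summary with the same keys as the saved histogram
sfs_data = {"bin_edges": [0.0, 0.5, 1.0], "bin_freq": [12, 3], "n_smaller": 0, "n_larger": 0}

# What gets written to the text file
text = str(sfs_data)

# What gets read back in
restored = ast.literal_eval(text)
print(restored == sfs_data)

# Anything that is not a plain literal is rejected instead of being executed
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```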

In [None]:
# We can plot this in ggplot as well, and I have included the code below. 
# But, there currently does not exist a way to directly use pre-aggregated data (which takes ~30 mins to compile)
# Plotting this way takes the same amount of time and resources as running the cell above.
p = hl.ggplot.ggplot(pre_qc_row, hl.ggplot.aes(x = pre_qc_row.freq.AF[1])) + \
    hl.ggplot.geom_histogram(bins = 200, position="identity", alpha = .7) + \
    hl.ggplot.xlab("Allele frequency")+ \
    hl.ggplot.ggtitle("Site Frequency Spectrum") + \
    hl.ggplot.scale_y_log10("Number of loci (log scale)")
    
p.show()