Notebook 1: Merging and annotating tables and matrix tables, Sample QC, PCA outlier removal, Variant QC

To do:
1. add in plotting code generated by Ally & make sure it runs
2. add more detailed descriptions of why each step is run
3. remove certain table/mt write-outs
4. potentially (need to discuss with Mary first) write a function to read data
5. create more detailed index

## Index
- [Functions](#Functions)
- [Reading in datasets](#Reading-in-datasets)
- [Combining metadata by merging hail tables and matrix tables](#Combining-metadata)
- [Annotating metadata onto matrix table](#Annotating-merged-metadata-onto-matrix-table)
- [Sample QC filtering](#Sample-QC-filtering)
- [PCA outlier removal](#PCA-outlier-removal)
- [Variant QC filtering](#Variant-QC-filtering)
- [Exporting datasets post QC](#Exporting-final-dataset-post-QC)

# General Overview
The purpose of this script is to merge the metadata components needed for the HGDP+1kGP dataset and then run QC filters on the resulting dataset. The metadata, which includes sample and variant information such as geographic region and which samples/variants passed QC, was initially spread across different datasets. The QC filters were run using sample/variant flags from the metadata datasets. These flags were generated by running the dataset through the gnomAD QC pipeline.

**This script contains information on how to**: 
- annotate new fields onto a matrix table from another matrix table or hail table
- unflatten a hail matrix table
- harmonize datasets to prevent merge conflicts
- filter matrix tables using a field within the matrix table
- filter samples using a hardcoded list of samples to remove
- write out a matrix table in vcf format

**Datasets merged are**: 
- sample_meta: sample metadata table which contains harmonized metadata for the HGDP_1kGP dataset
- sample_qc_meta: gnomAD v3.1 sample QC metadata for the hgdp_1kg subset, which contains flags to denote which samples failed gnomAD QC filters
- var_meta: hail table with a field of flags (annotated onto the matrix table as mt.filters) to denote which variants passed or failed gnomAD QC filters
- dense_mt: densified hgdp_1kg matrix table
    
**Author: Zan Koenig**

In [1]:
import hail as hl

In [2]:
hl.init()

Running on Apache Spark version 2.4.5
SparkUI available at http://znk-plink-m.c.diverse-pop-seq-ref.internal:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.64-1ef70187dc78
LOGGING: writing to /home/hail/hail-20211202-1706-0.2.64-1ef70187dc78.log


<a id='Functions'></a>

# Data reformatting and printing functions
For interactive analysis scripts such as this, defining functions at the top of the script makes them easy to use throughout. There are other ways to organize function definitions in python, and the best method depends on the intended usage of the script as well as the writer's personal preference.

In [3]:
# Takes a dict and converts it to a hail struct expression (works with nested dicts too)
def dict_to_struct(d):
    fields = {}
    for k, v in d.items():
        if isinstance(v, dict):
            v = dict_to_struct(v)
        fields[k] = v
    return hl.struct(**fields)

# Formats the output of mt.count() in a more user-friendly format
def print_count(mt):
    '''
    Prints out total sample/variant count for a mt
    :param mt: hail matrix table
    :return: print statement with number of samples and variants
    '''
    # mt.count() on a matrix table returns a tuple of two numbers
    # The first number is the number of rows, equivalent to the number of variants
    # The second number is the number of columns, equivalent to the number of samples
    n = mt.count()
    print('Number of Samples: {}\nNumber of Variants: {}'.format(n[1], n[0]))

<a id='Reading in datasets'></a>

# User specified output paths
Below are variables the user can set to specify their desired output paths. (These may later be replaced with a data-reading function that accepts user-specified inputs.)

In [None]:
# Here the user needs to specify the output path they want all the output datasets to be written to.
# There are default names which the data will be written out with
# users can alter the default names by changing the variables below.
output_path = input("Please input a path for datasets to be written out to: ")

In [None]:
# This will be a table of the merged metadata information needed to run QC filters.
# It is being written out before annotating onto the main matrix table since the steps taken during merging are computationally expensive.
# If you are going to run this tutorial multiple times, having this dataset output will save time.
metadata_table = 'hgdp_1kgp_tutorial_metadata'
# This is a pre-qc version of the hgdp+1kGP dataset. It contains all the metadata necessary to conduct qc but does not have any samples or variants removed yet
# Hail's hl.sample_qc() and hl.variant_qc() methods have been run on the dataset prior to outputting so the metrics in those fields are based off of the pre-qc sample/variant counts
pre_qc_dataset = 'hgdp_1kgp_tutorial_pre_qc'
# This is a version of the dataset which has had sample QC filters run on it. It is being written out because some of the preceding filtering steps take some time to run.
# Writing it out after those steps ensures that downstream steps take less time to compute
post_sample_qc = 'hgdp_1kgp_tutorial_post_sample_qc'
# This is the final, post_qc version of the dataset. It will have all the sample and variant QC filters run on it as well as having PCA outliers removed
post_qc = 'hgdp_1kgp_tutorial_post_qc'
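
As a minimal sketch of how the base output path and the default names above might be combined, the hypothetical helper `build_output_path` below joins the two and appends an extension. The helper name, the example bucket, and the `.ht`/`.mt` extension convention are assumptions for illustration and are not defined elsewhere in this notebook.

```python
def build_output_path(base, name, ext):
    """Join a base path and a dataset name, adding the extension only once.

    Hypothetical helper: the notebook itself does not define it.
    """
    base = base.rstrip('/')          # avoid a double slash when joining
    if not name.endswith(ext):
        name = name + ext
    return f'{base}/{name}'


# Example with a made-up bucket name
example_base = 'gs://my-bucket/tutorial/'
pre_qc_path = build_output_path(example_base, 'hgdp_1kgp_tutorial_pre_qc', '.mt')
print(pre_qc_path)
```

Building every output path through one function keeps the naming consistent if the base path changes.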

# Reading in datasets
Setting separate variables for paths before the datasets are read in makes it easier to update paths if datasets move in the future

In [4]:
# path for sample metadata file which contains metadata for the HGDP dataset
sample_meta_path = 'gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/gnomad_meta_v1.tsv'

# path for hail table which contains information on which samples passed gnomAD QC filters
sample_qc_meta_path = 'gs://hgdp_tgp/output/gnomad_v3.1_sample_qc_metadata_hgdp_tgp_subset.ht'

# path for hail table which contains info on which variants passed/failed gnomAD QC
var_metadata_path = 'gs://gcp-public-data--gnomad/release/3.1.1/ht/genomes/gnomad.genomes.v3.1.1.sites.ht'

# path for densified pre-qc matrix table which was generated from the original pre-qc sparse version of the hgdp_1kgp dataset
dense_mt_path = 'gs://hgdp_tgp/output/tgp_hgdp.mt'

# Path for a txt file which contains the latest list of PCA outliers
pca_outlier_path = 'gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt'

In [5]:
# reading in Alicia's sample metadata file (Note: this file uses the 'v3.1::' prefix as done in gnomAD)
sample_meta = hl.import_table(sample_meta_path, impute=True)

# reading in Julia's sample metadata file
sample_qc_meta = hl.read_table(sample_qc_meta_path)

# reading in variant qc information
var_meta = hl.read_table(var_metadata_path)

# reading in densified pre-qc matrix table
dense_mt = hl.read_matrix_table(dense_mt_path)

# To read in the PCA outlier list, first need to read the file in as a list
# using hl.utils.hadoop_open here, which allows one to read files into hail from Google cloud storage
with hl.utils.hadoop_open(pca_outlier_path) as file:
    outliers = [line.rstrip('\n') for line in file]

# Using hl.literal here to convert the list from a python object to a hail expression so that it can be used to filter out samples
outliers_list = hl.literal(outliers)

2021-11-02 17:02:28 Hail: INFO: Reading table to impute column types
2021-11-02 17:02:35 Hail: INFO: Loading 184 fields. Counts by type:
  str: 80
  bool: 44
  float64: 40
  int32: 20
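
The outlier list is read with `line.rstrip('\n')`, which would keep an empty string for any blank line in the file. A minimal sketch of a slightly more defensive reader is shown here; `read_sample_ids` is a hypothetical helper, `io.StringIO` stands in for the handle returned by `hl.utils.hadoop_open`, and the sample IDs are made up.

```python
import io


def read_sample_ids(handle):
    """Return the non-empty, whitespace-stripped lines of a text file handle."""
    return [line.strip() for line in handle if line.strip()]


# io.StringIO is a stand-in for an open file; the IDs below are made up
fake_file = io.StringIO('HG00001\nHG00002\n\nHGDP00003\n')
sample_ids = read_sample_ids(fake_file)
print(sample_ids)
```

The resulting python list could then be passed to `hl.literal` in the same way as `outliers` above.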


# Combining metadata
The HGDP+1kGP dataset is dense_mt, but in the state in which it was read in it does not contain all the information needed for filtering. That information includes the geographic region and population of each sample (from sample_meta) and the flags for which samples failed QC filters (from sample_qc_meta); the flags for which variants failed QC filters come from var_meta.
Before conducting QC, the different metadata datasets must be merged together. The first cell below is an example of having to alter the structure of a dataset before it can be merged with another.

In [6]:
# These bits below were written by Tim Poterba to help troubleshoot unflattening a ht with nested structure
# dict to hold struct names as well as nested field names
d = {}

# Getting just the row field names 
row = sample_meta.row_value

# returns a dict with the struct names as keys and their inner field names as values
for name in row:
    def recur(dict_ref, split_name):
        if len(split_name) == 1:
            dict_ref[split_name[0]] = row[name]
            return
        existing = dict_ref.get(split_name[0])
        if existing is not None:
            assert isinstance(existing, dict), existing  # fails on foo.bar and foo.bar.baz
            recur(existing, split_name[1:])
        else:
            existing = {}
            dict_ref[split_name[0]] = existing
            recur(existing, split_name[1:])
    recur(d, name.split('.'))


# using the dict created from flattened struct, creating new structs now unflattened
sample_meta = sample_meta.select(**dict_to_struct(d))
sample_meta = sample_meta.key_by('s')
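
The unflattening logic above can be sketched in plain python, independent of hail: dotted column names are split on `.` and turned into nested dicts. This is only an illustration of the idea under assumed inputs; `unflatten` is a hypothetical name, and the inner field names and values in the example are made up (only `s`, `hgdp_tgp_meta` and `bergstrom` appear in this notebook).

```python
def unflatten(flat):
    """Turn {'a.b': 1, 'a.c': 2, 'd': 3} into {'a': {'b': 1, 'c': 2}, 'd': 3}."""
    nested = {}
    for dotted_name, value in flat.items():
        parts = dotted_name.split('.')
        node = nested
        # Walk (and create) one dict level per name component except the last
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f'{dotted_name!r} collides with a non-struct field')
        leaf = parts[-1]
        if leaf in node:
            # e.g. 'foo.bar.baz' was seen first and now 'foo.bar' arrives
            raise ValueError(f'{dotted_name!r} collides with an existing struct')
        node[leaf] = value
    return nested


# Made-up example row with flattened column names
flat_row = {
    's': 'sample1',
    'hgdp_tgp_meta.Population': 'PopA',
    'hgdp_tgp_meta.Latitude': 1.5,
    'bergstrom.region': 'RegionA',
}
nested_row = unflatten(flat_row)
print(nested_row)
```

The two `ValueError` cases correspond to the situation the assertion in the cell above guards against, where a name is used both as a field and as a struct.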

In [7]:
# grabbing the columns needed from Alicia's metadata
new_meta = sample_meta.select(sample_meta.hgdp_tgp_meta, sample_meta.bergstrom)

# creating a table with Julia's metadata and Alicia's metadata
ht = sample_qc_meta.annotate(**new_meta[sample_qc_meta.s])

# stripping 'v3.1::' from the names to match with Konrad's MT
ht = ht.key_by(s=ht.s.replace("v3.1::", ""))
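
The key harmonization step removes the `v3.1::` text wherever it occurs in the sample ID. A minimal plain-python sketch of a stricter variant, which only strips the text when it is a leading prefix, is given here for comparison; `strip_prefix` is a hypothetical helper and the sample IDs are made up. Note also that hail's string `replace` treats its first argument as a regular expression, so the `.` in the pattern matches any character; for IDs like these that should make no practical difference.

```python
def strip_prefix(sample_id, prefix='v3.1::'):
    """Remove `prefix` only when it appears at the start of the ID."""
    if sample_id.startswith(prefix):
        return sample_id[len(prefix):]
    return sample_id


# Made-up sample IDs, one with and one without the gnomAD-style prefix
ids = ['v3.1::HG00001', 'HGDP00002']
stripped = [strip_prefix(s) for s in ids]
print(stripped)

# For IDs like these, a plain substring replacement gives the same result
replaced = [s.replace('v3.1::', '') for s in ids]
print(replaced == stripped)
```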

In [8]:
# When writing out any dataset, make sure the path is as intended and the resulting name is descriptive
# ht.write() takes the full output path, including the name of the resulting table, as its argument
ht.write('gs://hgdp-1kg/hgdp_tgp/hgdp_tgp_sample_metadata.ht')

2021-06-22 18:31:47 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-22 18:31:55 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-22 18:32:25 Hail: INFO: wrote table with 4150 rows in 155 partitions to gs://african-seq-data/hgdp_tgp/hgdp_tgp_sample_metadata.ht
    Total size: 1.69 MiB
    * Rows: 1.68 MiB
    * Globals: 6.51 KiB
    * Smallest partition: 1 rows (758.00 B)
    * Largest partition:  173 rows (68.18 KiB)


# Annotating merged metadata onto matrix table
Now that the two metadata datasets are merged together and in the proper format, the next step is to annotate the metadata onto the dense matrix table, which contains all of the pre-QC samples and variants.

In [8]:
# EDIT
# reading in table annotated with Alicia and Julia's respective metadata
ht = hl.read_table('gs://hgdp-1kg/hgdp_tgp/hgdp_tgp_sample_metadata.ht')

In [10]:
# The count() method returns the number of rows and columns of a matrix table, or the number of rows of a table.
# In this case, since it is a hail table, it only returns the number of rows
# Each row is a sample, so the number of rows is equal to the number of samples
ht.count()

4150

In [16]:
# ht.describe() gives you an overview of all the fields in a matrix table or table
ht.describe()

----------------------------------------
Global fields:
    'sex_imputation_ploidy_cutoffs': struct {
        x_ploidy_cutoffs: struct {
            upper_cutoff_X: float64, 
            lower_cutoff_XX: float64, 
            upper_cutoff_XX: float64, 
            lower_cutoff_XXX: float64
        }, 
        y_ploidy_cutoffs: struct {
            lower_cutoff_Y: float64, 
            upper_cutoff_Y: float64, 
            lower_cutoff_YY: float64
        }, 
        f_stat_cutoff: float64
    } 
    'population_inference_pca_metrics': struct {
        min_prob: float64, 
        include_unreleasable_samples: bool, 
        max_mislabeled_training_samples: int32, 
        known_pop_removal_iterations: int32, 
        n_pcs: int32
    } 
    'relatedness_inference_cutoffs': struct {
        min_individual_maf: float64, 
        min_emission_kinship: float64, 
        ibd0_0_max: float64, 
        second_degree_kin_cutoff: float64, 
        first_degree_kin_thresholds: tuple (
           

In [None]:
# Using the mt.annotate_cols() method to annotate the metadata onto the matrix table
# Using annotate_cols() in this way is essentially merging dense_mt with ht
# For this to work, both of the datasets to merge need to share the same key
# In this case that key is 's'
# The table ht is indexed by the equivalent key in dense_mt (dense_mt.s)
mt = dense_mt.annotate_cols(**ht[dense_mt.s])
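
Annotating columns by indexing a keyed table behaves like a left join: every column of the matrix table is kept, and a column whose key has no match in the table receives missing values. The plain-python sketch below illustrates that behaviour with dicts; `annotate_by_key` and all sample IDs and values are made up for illustration. The counts shown in this notebook (4150 rows in the metadata table, 4151 columns in the matrix table) are consistent with this kind of join.

```python
def annotate_by_key(column_keys, table):
    """Left join: keep every column key; unmatched keys get None as metadata."""
    return [{'s': s, 'meta': table.get(s)} for s in column_keys]


# Made-up data: three columns in the matrix table, two rows in the metadata table
column_keys = ['S1', 'S2', 'S3']
metadata = {'S1': {'region': 'A'}, 'S2': {'region': 'B'}}

annotated = annotate_by_key(column_keys, metadata)
unmatched = [col['s'] for col in annotated if col['meta'] is None]
print(len(annotated), unmatched)
```

Checking for columns with missing metadata after the annotation is a cheap way to catch key mismatches such as a leftover prefix.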

In [12]:
print_count(mt)

(211358784, 4151)

In [13]:
# writing out a pre-qc version of the dataset for Mary's PCA analyses
mt.write("gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/pre_qc_final.mt")

2021-07-08 15:38:02 Hail: INFO: wrote matrix table with 211358784 rows and 4151 columns in 5000 partitions to gs://african-seq-data/hgdp_tgp/hgdp_tgp_dense_meta_preQC.mt
    Total size: 3.32 TiB
    * Rows/entries: 3.32 TiB
    * Columns: 1.71 MiB
    * Globals: 11.00 B
    * Smallest partition: 10589 rows (32.13 MiB)
    * Largest partition:  183321 rows (4.39 GiB)


# Pre-QC Plots
When conducting quality control on a dataset, plotting metrics which describe the data is useful for assessing the effects of filters on the dataset.

When running QC filters individually, it can be useful to make plots before and after specific filters. In the case of this dataset, the results of the individual sample and variant filters have been merged into their respective fields, which allows us to remove all samples/variants that fail any QC step at once.

In [10]:
# Reading in the preQC dataset 
# This is a merged version of the metadata from different sources and the sample/variant dense dataset
mt = hl.read_matrix_table("gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/pre_qc_final.mt")

In [None]:
# Getting a preliminary count before filtering on the dataset
mt.count()

In [None]:
# As of hail v. 0.2.82, ggplot only takes tables as input, so we make a table from the columns of our pre-QC matrix table
ht = mt.cols()

# Create dictionary that maps color to region name
newnames = {"#E41A1C": 'AMR', "#984EA3": 'AFR', "#999999": 'OCE', "#FF7F00": 'CSA',
            "#4DAF4A": 'EAS', "#377EB8": 'EUR', "#A65628": 'MID'}


# Using ggplot, differentiate between populations
# The expression is wrapped in parentheses so that it can span several lines
p = (hl.ggplot.ggplot(ht, hl.ggplot.aes(x=ht.sample_qc.n_snp)) +
     hl.ggplot.geom_histogram(hl.ggplot.aes(fill=ht.hgdp_tgp_meta.Continent.colors), min_val=5000000,
                              max_val=7500000, bins=200, position="identity", alpha=.7) +
     hl.ggplot.xlab("Number of SNPs") +
     hl.ggplot.ggtitle("Number of SNPs, Pre-QC") +
     hl.ggplot.coord_cartesian(ylim=(0, 260)))

# Update the legend so that each geographic region name corresponds with the correct color
p = p.to_plotly()
p.for_each_trace(lambda t: t.update(name=newnames[t.name]))

#show plot
p.show()
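
The lookup `newnames[t.name]` raises a `KeyError` if a trace is named with a color that is not in the dictionary. A minimal sketch of a more forgiving lookup is shown here; `legend_label` is a hypothetical helper, and the second color in the example is made up to show the fallback.

```python
# Same color-to-region mapping as in the plotting cell
newnames = {"#E41A1C": 'AMR', "#984EA3": 'AFR', "#999999": 'OCE', "#FF7F00": 'CSA',
            "#4DAF4A": 'EAS', "#377EB8": 'EUR', "#A65628": 'MID'}


def legend_label(trace_name, mapping):
    """Return the region name for a color, or the original name if unmapped."""
    return mapping.get(trace_name, trace_name)


print(legend_label("#E41A1C", newnames))
print(legend_label("#000000", newnames))  # made-up color with no region
```

It could be used as `p.for_each_trace(lambda t: t.update(name=legend_label(t.name, newnames)))`.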

# Sample QC filtering
As previously mentioned, sample QC filtering for this dataset was conducted using metadata which was annotated onto the main matrix table. Sample QC was run using gnomAD's QC pipeline and the fields used to filter below contain information on whether samples passed or failed gnomAD QC. More details on the gnomAD sample qc steps can be found [here.](https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#sample-qc-hard-filtering)

For more information on how sample QC filters were developed see [nb3.](https://github.com/atgu/hgdp_tgp/blob/907619ac3fedf8c9239920c82a9842cf090fbc66/tutorials/nb3.ipynb)

In [11]:
# filtering samples to those who should pass QC
# this filters to only samples that passed gnomad sample QC hard filters
mt_filt = mt.filter_cols(~mt.sample_filters.hard_filtered)

# annotating partially filtered dataset with variant metadata
mt_filt = mt_filt.annotate_rows(**var_meta[mt_filt.locus, mt_filt.alleles])

In [12]:
# Checking the counts of samples/variants after filtering to those who passed sample QC
mt_filt.count()

(211358784, 4120)

In [12]:
mt_filt.write('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/hgdp_tgp_dense_meta_filt.mt', overwrite=True)

2021-11-02 19:18:12 Hail: INFO: wrote matrix table with 211358784 rows and 4120 columns in 5000 partitions to gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/hgdp_tgp_dense_meta_filt.mt
    Total size: 3.82 TiB
    * Rows/entries: 3.82 TiB
    * Columns: 1.70 MiB
    * Globals: 11.00 B
    * Smallest partition: 10589 rows (38.69 MiB)
    * Largest partition:  183321 rows (4.82 GiB)


# PCA outlier removal
For information on how PCA outliers were found see [nb4.](https://github.com/atgu/hgdp_tgp/blob/907619ac3fedf8c9239920c82a9842cf090fbc66/tutorials/nb4.ipynb)

In [13]:
# Reading in the annotated & partially filtered dataset
mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/hgdp_tgp_dense_meta_filt.mt')

In [14]:
# Checking the sample and variant count before removing PCA outliers
mt.count()

(211358784, 4120)

In [None]:
# Removing the PCA outliers: ~ is the negation operator, which here selects the complement
# In this case the complement is the samples which are not contained in the PCA outlier list
mt = mt.filter_cols(~outliers_list.contains(mt['s']))
# Removing any duplicates in the dataset using mt.distinct_by_col(), which removes columns with a duplicate column key. It keeps one column for each unique key.
mt = mt.distinct_by_col()
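
The two steps above (drop samples on the outlier list, then keep one column per sample ID) can be sketched in plain python as follows. `remove_outliers` is a hypothetical helper and the sample IDs are made up; keeping the first occurrence of a duplicate is a choice made for this sketch.

```python
def remove_outliers(sample_ids, outliers):
    """Drop IDs found in `outliers`, then keep the first copy of each remaining ID."""
    outlier_set = set(outliers)   # set membership tests are fast
    seen = set()
    kept = []
    for s in sample_ids:
        if s in outlier_set or s in seen:
            continue
        seen.add(s)
        kept.append(s)
    return kept


# Made-up IDs: 'S2' is duplicated and 'S3' is an outlier
kept_ids = remove_outliers(['S1', 'S2', 'S2', 'S3'], ['S3'])
print(kept_ids)
```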

In [16]:
# Getting a count of samples/variants after removing PCA outliers
mt.count()

(211358784, 4097)

# Variant QC filtering
Variant QC was run using annotated flags which denoted which variants passed/failed gnomAD's QC pipeline. More details on the variant QC steps conducted can be found on the gnomAD website [here.](https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#variant-qc)

In [17]:
# Subsetting the variants in the dataset to only PASS variants (those which passed variant QC)
# PASS variants are variants with an empty filters field. This field holds the names of any variant QC filters the variant failed
# keep=False removes the rows matching the condition, i.e. variants which failed at least one filter
# This is the last step in the QC process
mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)
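
The PASS logic can be illustrated with a small plain-python sketch: a variant is kept only when its collection of failed filters is empty. The loci and filter names below are illustrative values chosen for the example, not taken from this dataset.

```python
# Illustrative variants; 'filters' holds the names of any failed QC filters
variants = [
    {'locus': 'chr1:100', 'filters': set()},
    {'locus': 'chr1:200', 'filters': {'AC0'}},
    {'locus': 'chr1:300', 'filters': {'AS_VQSR', 'InbreedingCoeff'}},
]

# Keep only variants that failed no filter (the PASS variants)
pass_variants = [v for v in variants if len(v['filters']) == 0]
print([v['locus'] for v in pass_variants])
```

This keeps the same rows as removing everything with a non-empty `filters` field, which is what the `keep=False` call above does.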

In [18]:
# Checking the final count of the dataset before writing out the dataset to different formats
mt.count()

(155648020, 4097)

# Exporting final dataset post QC
To write out a matrix table in hail, use the mt.write() method, passing as a string the path where the matrix table should be written. Keep in mind that a matrix table is stored as a directory and is large, so writing out this dataset takes some time. On your Google cloud cluster, you can switch to worker nodes instead of secondary worker nodes in order to shorten the time it takes to write out the dataset.

In [19]:
# writing out the postQC dataset with PCA sample outliers removed and subset to PASS variants
mt.write('gs://hgdp-1kg/post_qc_final.mt', overwrite=True)

2021-11-02 21:19:22 Hail: INFO: wrote matrix table with 155648020 rows and 4097 columns in 5000 partitions to gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/new_hgdp_tgp_postQC.mt
    Total size: 3.09 TiB
    * Rows/entries: 3.09 TiB
    * Columns: 1.69 MiB
    * Globals: 11.00 B
    * Smallest partition: 0 rows (20.00 B)
    * Largest partition:  96270 rows (2.23 GiB)
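
The overview lists writing a matrix table out in VCF format. A minimal sketch of how that might look with `hl.export_vcf` is given below. The helper names `vcf_path_for` and `export_post_qc_vcf`, the example bucket, and the naming convention (swapping `.mt` for `.vcf.bgz`) are assumptions for illustration. Fields that do not fit the VCF model may need to be dropped or moved into `info` before exporting, which this sketch does not attempt.

```python
def vcf_path_for(mt_path):
    """Swap a trailing '.mt' for '.vcf.bgz' (naming convention is an assumption)."""
    stem = mt_path[:-3] if mt_path.endswith('.mt') else mt_path
    return stem + '.vcf.bgz'


def export_post_qc_vcf(mt_path, vcf_path=None):
    """Read a matrix table and export it as a block-gzipped VCF."""
    import hail as hl  # imported here so the path helper works without hail

    mt = hl.read_matrix_table(mt_path)
    # A path ending in .bgz asks hail to block-compress the output
    hl.export_vcf(mt, vcf_path or vcf_path_for(mt_path))


# Path derivation only; calling export_post_qc_vcf requires a running hail session
print(vcf_path_for('gs://my-bucket/post_qc_final.mt'))
```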
