## Index
- [Functions](#Functions)
- [Reading in datasets](#Reading-in-datasets)
- [Combining metadata](#Combining-metadata)
- [Annotating metadata onto matrix table](#Annotating-merged-metadata-onto-matrix-table)
- [Sample QC filtering](#Sample-QC-filtering)
- [PCA outlier removal](#PCA-outlier-removal)
- [Variant QC filtering](#Variant-QC-filtering)
- [Exporting datasets post QC](#Exporting-final-dataset-post-QC)

### The purpose of this script is to merge the metadata components needed for the HGDP+1kGP dataset and then run QC filters on the resulting dataset. The QC filters use sample and variant information from the metadata datasets.

**This script contains information on how to**: 
- annotate metadata onto a matrix table
- combine multiple tables and matrix tables
- harmonize datasets to prevent merge conflicts
- filter matrix tables using a field within the matrix table
- filter samples using a hardcoded list of samples to remove
- write out a matrix table in vcf format

**Datasets merged are**: 
- gnomAD sample metadata (sample_meta)
- gnomAD v3.1 sample QC metadata from Julia for the hgdp_1kg subset (jul_meta)
- gnomAD v3.1 variant QC metadata information (var_meta)
- densified hgdp_1kg matrix table (dense_mt)

**Author: Zan Koenig**

In [1]:
import hail as hl
import re

In [2]:
hl.init()

Running on Apache Spark version 2.4.5
SparkUI available at http://znk-plink-m.c.diverse-pop-seq-ref.internal:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.64-1ef70187dc78
LOGGING: writing to /home/hail/hail-20211202-1706-0.2.64-1ef70187dc78.log


# Functions
For interactive analysis scripts such as this one, defining functions at the top of the script makes them easy to find and reuse. There are other ways to organize function definitions in Python, and the best method depends on the intended usage of the script as well as the writer's personal preference.

In [3]:
# Takes a dict (possibly containing nested dicts) and converts it to a hail struct (nested dicts become nested structs)
def dict_to_struct(d):
    fields = {}
    for k, v in d.items():
        if isinstance(v, dict):
            v = dict_to_struct(v)
        fields[k] = v
    return hl.struct(**fields)

# Formats the output of mt.count() in a more user-friendly format
def print_count(mt):
    '''
    Prints out total sample/variant count for an mt
    :param mt: hail matrix table
    :return: print statement with number of samples and variants
    '''
    n = mt.count()
    print('Number of Samples: {}\nNumber of Variants: {}'.format(n[1], n[0]))

# Reading in datasets
Setting separate variables for paths before the datasets are read in makes it easier to update the paths if the datasets move in the future.

In [4]:
# path for Alicia's sample metadata file
sample_metadata_path = 'gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/gnomad_meta_v1.tsv'

# path for Julia's sample metadata file
jul_metadata_path = ('gs://hgdp_tgp/output/gnomad_v3.1_sample_qc_metadata_hgdp_tgp_subset.ht')

# path for variant qc info
var_metadata_path = 'gs://gcp-public-data--gnomad/release/3.1.1/ht/genomes/gnomad.genomes.v3.1.1.sites.ht'

# path for Konrad's densified matrix table
dense_mt_path = 'gs://hgdp_tgp/output/tgp_hgdp.mt'

In [5]:
# reading in Alicia's sample metadata file (Note: this file uses the 'v3.1::' prefix as done in gnomAD)
sample_meta = hl.import_table(sample_metadata_path, impute=True)

# reading in Julia's sample metadata file
jul_meta = hl.read_table(jul_metadata_path)

# reading in variant qc information
var_meta = hl.read_table(var_metadata_path)

# reading in densified matrix table
dense_mt = hl.read_matrix_table(dense_mt_path)

2021-11-02 17:02:28 Hail: INFO: Reading table to impute column types
2021-11-02 17:02:35 Hail: INFO: Loading 184 fields. Counts by type:
  str: 80
  bool: 44
  float64: 40
  int32: 20


# Combining metadata
Before conducting QC, the different metadata datasets must be merged together. The first cell below is an example of having to alter the structure of a dataset before being able to merge with another.

In [6]:
# These bits below were written by Tim Poterba to help troubleshoot unflattening a ht with nested structure
# dict to hold struct names as well as nested field names
d = {}

# Getting just the row field names 
row = sample_meta.row_value

# returns a dict with the struct names as keys and their inner field names as values
for name in row:
    def recur(dict_ref, split_name):
        if (len(split_name) == 1):
            dict_ref[split_name[0]] = row[name]
            return
        existing = dict_ref.get(split_name[0])
        if existing is not None:
            assert isinstance(existing, dict), existing  # fails on foo.bar and foo.bar.baz
            recur(existing, split_name[1:])
        else:
            existing = {}
            dict_ref[split_name[0]] = existing
            recur(existing, split_name[1:])
    recur(d, name.split('.'))


# using the dict created from flattened struct, creating new structs now unflattened
sample_meta = sample_meta.select(**dict_to_struct(d))
sample_meta = sample_meta.key_by('s')
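
The unflattening step above can be easier to follow on plain Python dicts. The sketch below is not the notebook's code: it mimics the same idea (splitting dotted column names into nested structure) without Hail, and the inner field names and values are hypothetical.

```python
def unflatten(flat):
    """Turn {'a.b': 1, 'a.c': 2, 's': 'x'} into {'a': {'b': 1, 'c': 2}, 's': 'x'}."""
    nested = {}
    for dotted_name, value in flat.items():
        parts = dotted_name.split('.')
        node = nested
        # Walk (or create) one dict per prefix component.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            # A prefix that is already a leaf value (e.g. 'foo.bar' then 'foo.bar.baz') is ambiguous.
            assert isinstance(node, dict), dotted_name
        node[parts[-1]] = value
    return nested


# Hypothetical flattened row, shaped like the dotted column names of an imported TSV.
flat_row = {
    's': 'v3.1::SAMPLE_A',
    'hgdp_tgp_meta.population': 'POP1',
    'hgdp_tgp_meta.latitude': 52.5,
    'bergstrom.region': 'REGION1',
}
nested_row = unflatten(flat_row)
print(nested_row)
```

In the notebook, the nested dict holds Hail expressions rather than plain values, and `dict_to_struct` turns it into nested structs.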

In [7]:
# grabbing the columns needed from Alicia's metadata
new_meta = sample_meta.select(sample_meta.hgdp_tgp_meta, sample_meta.bergstrom)

# creating a table with Julia's metadata and Alicia's metadata
ht = jul_meta.annotate(**new_meta[jul_meta.s])

# stripping 'v3.1::' from the names to match with Konrad's MT
ht = ht.key_by(s=ht.s.replace("v3.1::", ""))
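
Why the prefix has to be stripped before merging can be shown with a small plain-Python sketch. It assumes made-up sample IDs and metadata values; only the `v3.1::` prefix comes from the notebook.

```python
PREFIX = 'v3.1::'

# Hypothetical metadata keyed by prefixed sample IDs (as in the metadata tables).
metadata = {
    'v3.1::SAMPLE_A': {'pop': 'POP1'},
    'v3.1::SAMPLE_B': {'pop': 'POP2'},
}
# Hypothetical column keys of the matrix table, which carry no prefix.
mt_samples = ['SAMPLE_A', 'SAMPLE_B']

# Joining on the raw keys finds nothing, because no string matches exactly.
matches_before = [s for s in mt_samples if s in metadata]

# Re-key the metadata with the prefix removed, then join again.
rekeyed = {key.replace(PREFIX, ''): value for key, value in metadata.items()}
matches_after = [s for s in mt_samples if s in rekeyed]

print(len(matches_before), len(matches_after))
```

A keyed join only matches identical key values, so harmonizing the sample IDs first avoids silently ending up with no annotations.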

In [8]:
# When writing out any dataset, make sure the path is as intended and the resulting name is descriptive
# ht.write() takes the full output path, ending in the name of the resulting table or matrix table, as its argument
ht.write('gs://hgdp-1kg/hgdp_tgp/hgdp_tgp_sample_metadata.ht')

2021-06-22 18:31:47 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-22 18:31:55 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-22 18:32:25 Hail: INFO: wrote table with 4150 rows in 155 partitions to gs://african-seq-data/hgdp_tgp/hgdp_tgp_sample_metadata.ht
    Total size: 1.69 MiB
    * Rows: 1.68 MiB
    * Globals: 6.51 KiB
    * Smallest partition: 1 rows (758.00 B)
    * Largest partition:  173 rows (68.18 KiB)


# Annotating merged metadata onto matrix table
Now that the two metadata datasets are merged and in the proper format, the next step is to annotate the pre-QC dense matrix table, which contains all samples and variants, with the metadata.

In [8]:
# reading in table annotated with Alicia and Julia's respective metadata
ht = hl.read_table('gs://hgdp-1kg/hgdp_tgp/hgdp_tgp_sample_metadata.ht')

In [10]:
# The count() method returns the number of rows in a table, or the number of rows and columns in a matrix table
# In this case, since ht is a hail table, it returns a single number
# Each row is one sample, so the number of rows is equal to the number of samples
ht.count()

4150

In [16]:
# The describe() method gives you an overview of all of the fields in a matrix table or table
ht.describe()

----------------------------------------
Global fields:
    'sex_imputation_ploidy_cutoffs': struct {
        x_ploidy_cutoffs: struct {
            upper_cutoff_X: float64, 
            lower_cutoff_XX: float64, 
            upper_cutoff_XX: float64, 
            lower_cutoff_XXX: float64
        }, 
        y_ploidy_cutoffs: struct {
            lower_cutoff_Y: float64, 
            upper_cutoff_Y: float64, 
            lower_cutoff_YY: float64
        }, 
        f_stat_cutoff: float64
    } 
    'population_inference_pca_metrics': struct {
        min_prob: float64, 
        include_unreleasable_samples: bool, 
        max_mislabeled_training_samples: int32, 
        known_pop_removal_iterations: int32, 
        n_pcs: int32
    } 
    'relatedness_inference_cutoffs': struct {
        min_individual_maf: float64, 
        min_emission_kinship: float64, 
        ibd0_0_max: float64, 
        second_degree_kin_cutoff: float64, 
        first_degree_kin_thresholds: tuple (
           

In [None]:
# Using the annotate_cols() method to annotate the metadata onto the matrix table
# Using annotate_cols() in this way is essentially merging dense_mt with ht
# In order for this annotate_cols() call to work, both of the datasets to merge need to share the same key
# In this case that key is 's'
# The table ht is indexed by the equivalent key in dense_mt (dense_mt.s)
mt = dense_mt.annotate_cols(**ht[dense_mt.s])
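
Annotating columns by indexing a table behaves like a left join: the number of columns never changes, and column keys without a matching table row receive missing values. The plain-Python sketch below illustrates this with hypothetical sample IDs, using `None` to stand in for Hail's missing value.

```python
# Hypothetical column keys; 'SAMPLE_C' has no row in the metadata table.
mt_cols = ['SAMPLE_A', 'SAMPLE_B', 'SAMPLE_C']
metadata = {
    'SAMPLE_A': {'pop': 'POP1'},
    'SAMPLE_B': {'pop': 'POP2'},
}

# Look up each column key in the table; unmatched keys get None (Hail: missing).
annotated = [{'s': s, 'meta': metadata.get(s)} for s in mt_cols]

n_missing = sum(1 for col in annotated if col['meta'] is None)
print(len(annotated), n_missing)
```

In this notebook the metadata table has 4150 rows while the matrix table reports 4151 columns, so the two counts need not agree after annotation, and checking for unannotated samples may be worthwhile.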

In [12]:
# Since count() here is being used on a matrix table, the result has two numbers in the output
# When using count() on a matrix table, the first number is the number of rows, equivalent to the number of variants
# The second number is the number of columns, equivalent to the number of samples
mt.count()

(211358784, 4151)

In [13]:
# writing out a pre-qc version of the dataset for Mary's PCA analyses
mt.write("gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/hgdp_tgp_dense_meta_preQC.mt")

2021-07-08 15:38:02 Hail: INFO: wrote matrix table with 211358784 rows and 4151 columns in 5000 partitions to gs://african-seq-data/hgdp_tgp/hgdp_tgp_dense_meta_preQC.mt
    Total size: 3.32 TiB
    * Rows/entries: 3.32 TiB
    * Columns: 1.71 MiB
    * Globals: 11.00 B
    * Smallest partition: 10589 rows (32.13 MiB)
    * Largest partition:  183321 rows (4.39 GiB)


# Sample QC filtering
As previously mentioned, sample QC filtering for this dataset was conducted using metadata which was annotated onto the main matrix table. Sample QC was run using gnomAD's QC pipeline and the fields used to filter below contain information on whether samples passed or failed gnomAD QC. 

For more information on how sample QC filters were developed see [INSERT ADDITIONAL TUTORIAL NAME HERE]

In [10]:
# Reading in the preQC dataset 
# This is a merged version of the metadata from different sources and the sample/variant dense dataset
mt = hl.read_matrix_table("gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/hgdp_tgp_dense_meta_preQC.mt")

In [None]:
# Getting a preliminary count before filtering on the dataset
mt.count()

In [11]:
# filtering samples to those who should pass QC
# this filters to only samples that passed gnomad sample QC hard filters
mt_filt = mt.filter_cols(~mt.sample_filters.hard_filtered)

# annotating partially filtered dataset with variant metadata
mt_filt = mt_filt.annotate_rows(**var_meta[mt_filt.locus, mt_filt.alleles])
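
The sample filter in this cell keeps columns whose predicate evaluates to true. The sketch below mimics that logic in plain Python with hypothetical samples. It assumes Hail's documented behaviour that a column whose filter expression is missing is removed as well, which would matter for any sample lacking metadata.

```python
# Hypothetical samples; None stands in for a missing `hard_filtered` value.
samples = [
    {'s': 'SAMPLE_A', 'hard_filtered': False},
    {'s': 'SAMPLE_B', 'hard_filtered': True},
    {'s': 'SAMPLE_C', 'hard_filtered': None},
]


def passes(sample):
    """Mimic `~hard_filtered`: negate the flag, but let missing stay missing."""
    flag = sample['hard_filtered']
    return None if flag is None else not flag


# Keep only samples whose predicate is exactly True, so missing is dropped too.
kept = [sample['s'] for sample in samples if passes(sample) is True]
print(kept)
```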

In [12]:
# Checking the counts of samples/variants after filtering to those who passed sample QC
mt_filt.count()

(211358784, 4120)

In [12]:
mt_filt.write('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/hgdp_tgp_dense_meta_filt.mt', overwrite=True)

2021-11-02 19:18:12 Hail: INFO: wrote matrix table with 211358784 rows and 4120 columns in 5000 partitions to gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/hgdp_tgp_dense_meta_filt.mt
    Total size: 3.82 TiB
    * Rows/entries: 3.82 TiB
    * Columns: 1.70 MiB
    * Globals: 11.00 B
    * Smallest partition: 10589 rows (38.69 MiB)
    * Largest partition:  183321 rows (4.82 GiB)


# PCA outlier removal
For information on how PCA outliers were found see [INSERT TUTORIAL NAME HERE]

In [13]:
# Reading in the annotated & partially filtered dataset
mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/hgdp_tgp_dense_meta_filt.mt')

In [14]:
# Checking the sample and variant count before removing PCA outliers
mt.count()

(211358784, 4120)

In [15]:
# This is a set of outliers found by Mary during PCA analyses as well as one duplicate sample
outlier_set = {"NA20314","NA20299","HG01880","HG01881","HGDP00130","HGDP00013",
               "HGDP00150","HGDP00029","HGDP01298","HGDP01303","LP6005443-DNA_B02",
               "HGDP01300","HG01628","HG01629","HG01630","HG01694","HG01696",
               "HGDP00621","HGDP01270","HGDP01271","NA20274","HGDP00057"}
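# Wrapping the Python set in hl.literal() so it can be used in hail expressions
# Using mt.filter_cols() to remove the samples in the outlier set
set_to_remove = hl.literal(outlier_set)
mt = mt.filter_cols(~set_to_remove.contains(mt['s']))
# distinct_by_col() keeps one column per key, dropping columns with a duplicate sample ID
mt = mt.distinct_by_col()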
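
The two operations in this cell, removing a hardcoded set of samples and then dropping duplicate column keys, can be sketched in plain Python. The sample IDs below are hypothetical and the list stands in for the matrix table's columns.

```python
# Hypothetical column keys; 'SAMPLE_B' appears twice, as a duplicated sample would.
columns = ['SAMPLE_A', 'SAMPLE_B', 'SAMPLE_B', 'SAMPLE_C', 'SAMPLE_D']
outliers = {'SAMPLE_C'}

# Step 1: drop every column whose key is in the outlier set.
without_outliers = [s for s in columns if s not in outliers]

# Step 2: keep one column per key, preserving order.
deduplicated = list(dict.fromkeys(without_outliers))

print(len(columns), len(without_outliers), len(deduplicated))
```

In the notebook, the outlier set lists 22 sample IDs and the sample count falls from 4120 to 4097. That drop of 23 would be consistent with the 22 listed samples plus one duplicated column being removed, although the output does not confirm this directly.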

In [16]:
# Getting a count of samples/variants after removing PCA outliers
mt.count()

(211358784, 4097)

# Variant QC filtering

In [17]:
# Subsetting the variants in the dataset to only PASS variants (those which passed variant QC)
# PASS variants are variants with an empty filters field. This field holds the names of any variant QC filters the variant failed
# This is the last step in the QC process
mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)
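
The `keep=False` form used here removes rows where the predicate is true, which is the same as keeping rows whose `filters` collection is empty. The plain-Python sketch below shows the equivalence. The loci and filter names are illustrative only, and `filters` is assumed to be a set of failed filter names.

```python
# Hypothetical variants; `filters` holds the names of failed variant QC filters.
variants = [
    {'locus': 'chr1:1000', 'filters': set()},
    {'locus': 'chr1:2000', 'filters': {'AS_VQSR'}},
    {'locus': 'chr1:3000', 'filters': {'AC0', 'AS_VQSR'}},
]


def failed_any(variant):
    """True when the variant failed at least one filter."""
    return len(variant['filters']) != 0


# keep=False form: remove rows where the predicate is True.
kept_by_removal = [v['locus'] for v in variants if not failed_any(v)]

# Equivalent keep=True form: keep rows whose filter set is empty.
kept_by_empty = [v['locus'] for v in variants if len(v['filters']) == 0]

print(kept_by_removal)
```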

In [18]:
# Checking the final count of the dataset before writing out the dataset to different formats
mt.count()

(155648020, 4097)

# Exporting final dataset post QC
### Before exporting, the dataset fields must be formatted so it can be written out as a vcf
This is another example of how datasets need to be altered in order to be written out in a specific format. In this case, the types of certain fields within the matrix table need to be changed before it can be written out into vcf format.  

In [None]:
# changing field types (to float64 or bool) so they can be written out to vcf
mt_vcf = mt.annotate_rows(info = mt.info.annotate(
    QUALapprox=hl.float64(mt.info.QUALapprox),
    SB= mt.info.SB.map(lambda x: hl.float64(x)),
    MQ=hl.float64(mt.info.MQ),
    MQRankSum=hl.float64(mt.info.MQRankSum),
    VarDP=hl.float64(mt.info.VarDP),
    AS_ReadPosRankSum=hl.float64(mt.info.AS_ReadPosRankSum), 
    AS_pab_max=hl.float64(mt.info.AS_pab_max), 
    AS_QD=hl.float64(mt.info.AS_QD), 
    AS_MQ=hl.float64(mt.info.AS_MQ), 
    QD=hl.float64(mt.info.QD), 
    AS_MQRankSum=hl.float64(mt.info.AS_MQRankSum), 
    FS=hl.float64(mt.info.FS), 
    AS_FS=hl.float64(mt.info.AS_FS), 
    ReadPosRankSum=hl.float64(mt.info.ReadPosRankSum), 
    AS_QUALapprox=hl.float64(mt.info.AS_QUALapprox), 
    AS_SB_TABLE=mt.info.AS_SB_TABLE.map(lambda x: hl.float64(x)), 
    AS_VarDP=hl.float64(mt.info.AS_VarDP), 
    AS_SOR=hl.float64(mt.info.AS_SOR), 
    SOR=hl.float64(mt.info.SOR), 
    singleton=hl.bool(mt.info.singleton), 
    transmitted_singleton=hl.bool(mt.info.transmitted_singleton), 
    omni=hl.bool(mt.info.omni), 
    mills=hl.bool(mt.info.mills), 
    monoallelic=hl.bool(mt.info.monoallelic), 
    AS_VQSLOD=hl.float64(mt.info.AS_VQSLOD), 
    InbreedingCoeff=hl.float64(mt.info.InbreedingCoeff)
))
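
Writing one cast per field by hand makes copy-paste slips easy. One alternative, sketched here in plain Python on a hypothetical info record, is to list the field names once per target type and build the conversions in a loop. The values are made up, and only a few of the field names from the cell above are used.

```python
# Hypothetical info record with integer-typed values, as before the casts.
info = {'QUALapprox': 1200, 'VarDP': 35, 'SB': [10, 12, 8, 9], 'omni': 1, 'mills': 0}

# Field names grouped by target type; each name is written exactly once.
float_fields = ['QUALapprox', 'VarDP']
float_array_fields = ['SB']
bool_fields = ['omni', 'mills']

casts = {}
casts.update({name: float(info[name]) for name in float_fields})
casts.update({name: [float(x) for x in info[name]] for name in float_array_fields})
casts.update({name: bool(info[name]) for name in bool_fields})

# Merge the converted values over the original record.
converted = {**info, **casts}
print(converted)
```

In Hail, a dictionary built the same way with `hl.float64` and `hl.bool` in place of the built-ins could be unpacked into `mt.info.annotate(**casts)`. This is a suggestion rather than what the notebook does.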

In [None]:
# Dropping a field from the matrix table which is not needed in the final written out version
mt_vcf = mt_vcf.drop('gvcf_info')

In [None]:
# Writing out the dense, postQC dataset in vcf format
hl.export_vcf(mt_vcf, 'gs://african-seq-data/hgdp_tgp/hgdp_tgp_postqc.vcf.bgz', parallel='separate_header')

2021-08-06 13:56:41 Hail: WARN: export_vcf: ignored the following fields:
    'project_meta' (column)
    'subsets' (column)
    'bam_metrics' (column)
    'sex_imputation' (column)
    'sample_qc' (column)
    'population_inference' (column)
    'sample_filters' (column)
    'relatedness_inference' (column)
    'high_quality' (column)
    'release' (column)
    'hgdp_tgp_meta' (column)
    'bergstrom' (column)
    'a_index' (row)
    'was_split' (row)
    'freq' (row)
    'raw_qual_hists' (row)
    'popmax' (row)
    'qual_hists' (row)
    'faf' (row)
    'vep' (row)
    'vqsr' (row)
    'region_flag' (row)
    'allele_info' (row)
    'age_hist_het' (row)
    'age_hist_hom' (row)
    'cadd' (row)
    'revel' (row)
    'splice_ai' (row)
    'primate_ai' (row)
2021-08-06 13:56:44 Hail: WARN: export_vcf found row field rsid with type 'set<str>', but expected type str. Emitting missing ID.


In [19]:
# writing out the postQC dataset with PCA sample outliers removed and subset to PASS variants
mt.write('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/new_hgdp_tgp_postQC.mt', overwrite=True)

2021-11-02 21:19:22 Hail: INFO: wrote matrix table with 155648020 rows and 4097 columns in 5000 partitions to gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/new_hgdp_tgp_postQC.mt
    Total size: 3.09 TiB
    * Rows/entries: 3.09 TiB
    * Columns: 1.69 MiB
    * Globals: 11.00 B
    * Smallest partition: 0 rows (20.00 B)
    * Largest partition:  96270 rows (2.23 GiB)
