# Summarizing Data Post QC

Author: Zan Koenig

## Index
1. [Set Default Paths](#1.-Set-Default-Paths)
2. [Setting up data](#2.-Setting-up-data)
3. [Annotating table with relatedness information](#3.-Annotating-table-with-relatedness-information)
4. [Calculating statistics per population](#4.-Calculating-statistics-per-population)
5. [Formatting table for exporting](#5.-Formatting-table-for-exporting)
6. [Exporting final table](#6.-Exporting-final-table)

# General Overview:

The purpose of this script is to format and write out a TSV file, which will be used to create plots and summaries of the post-QC dataset in R.

**This script contains information on how to:**
- Select specific columns from a matrix table
- Annotate filter flags onto a matrix table
- Join the columns of two matrix tables
- Join two tables
- Group a table by genetic region and population
- Use `hl.agg.stats` to calculate statistics for a metric within a population
- Count the number of samples where a filter flag equals True



In [1]:
import hail as hl

# 1. Set Default Paths
These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets. 

By default all of the write sections are shown as markdown cells. If you would like to write out your own datasets, you can copy the code and paste it into a new code cell. 

[Back to Index](#Index)

In [57]:
# Input file 
post_qc_path = 'gs://hgdp-1kg/tutorial_datasets/metadata_and_qc/post_qc.mt'

# PCA outliers file 
outliers_path = 'gs://hgdp-1kg/tutorial_datasets/pca/pca_outliers.txt'

# Paths to related and unrelated Matrix Tables (without outliers) - written out in Notebook 2: PCA and Ancestry Analyses
unrelateds_path = 'gs://hgdp-1kg/tutorial_datasets/pca_results/unrelateds_without_outliers.mt'
relateds_path = 'gs://hgdp-1kg/tutorial_datasets/pca_results/relateds_without_outliers.mt'

# Path for final output table in tsv format
final_table_path = 'gs://hgdp-1kg/tutorial_datasets/metadata_and_qc/post_qc_summary.tsv'

# 2. Setting up data

This section prepares the dataset for downstream analyses. We first remove the PCA outliers, then create a table with only sample (column) data and select only the fields needed for the table we will write out. Finally, we can checkpoint that table to speed up downstream analysis steps.
 

<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<br>
<li><a href="https://hail.is/docs/0.2/utils/index.html#hail.utils.hadoop_open"> More on  <i> hadoop_open() </i></a></li>

<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.sample_qc"> More on  <i> sample_qc() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.cols"> More on  <i> cols() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.select"> More on  <i> select() </i></a></li>    
    
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_table"> More on  <i> read_table() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.checkpoint"> More on  <i> checkpoint() </i></a></li>
     
</details>

[Back to Index](#Index)

In [37]:
# Reading in a version of the dataset which has gnomAD's variant and sample QC filters applied to it
mt = hl.read_matrix_table(post_qc_path)

In [38]:
# Removing PCA outliers from the dataset
# The PCA outlier file is first read into a Python list, one sample ID per line
# hl.utils.hadoop_open allows files stored in Google Cloud Storage to be opened from Hail
with hl.utils.hadoop_open(outliers_path) as file:
    outliers = [line.rstrip('\n') for line in file]

# Using hl.literal here to convert the list from a python object to a hail expression so that it can be used to filter out samples
outliers_list = hl.literal(outliers)

# Filtering with the list of PCA outliers: the ~ operator negates the expression and so selects the complement
# In this case the complement is the set of samples which are not contained in the PCA outlier list
mt_without_outliers = mt.filter_cols(~outliers_list.contains(mt['s']))
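
The filtering logic in this cell can be previewed without Hail. The sketch below is a plain-Python analogue, not Hail code: `io.StringIO` stands in for the file handle returned by `hl.utils.hadoop_open`, and the sample IDs are made up for illustration.

```python
import io

# Stand-in for the outlier file: one sample ID per line (IDs are hypothetical)
outlier_file = io.StringIO("sample_B\nsample_D\n")

# Same list comprehension as in the cell above: strip the trailing newline from each line
outliers = [line.rstrip('\n') for line in outlier_file]

# Hypothetical sample IDs standing in for the matrix table's column key `s`
samples = ["sample_A", "sample_B", "sample_C", "sample_D"]

# Keep the complement: samples that are NOT in the outlier list
outlier_set = set(outliers)
kept = [s for s in samples if s not in outlier_set]

print(kept)
```

Comparing the length of `kept` against the expected sample count mirrors the `count()` validity check in the next cell.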

In [39]:
mt_without_outliers.count() 

(159795273, 4096)

In [40]:
# Grabbing only the columns from the Matrix Table (outputs table of just columns)
mt_col_table = mt_without_outliers.cols()

In [41]:
# Selecting only the fields needed for table 1 from the column table
mt_col_table = mt_col_table.select(mt_col_table.hgdp_tgp_meta.genetic_region,
                             mt_col_table.hgdp_tgp_meta.population,
                             mt_col_table.sample_qc.n_snp, 
                             mt_col_table.bam_metrics.mean_coverage)

In [42]:
# Validity check - there should be 4096 samples 
mt_col_table.count()

4096

In [1]:
## Writing out the column table as a checkpoint to make the downstream steps run faster
## This is done because running sample_qc is computationally expensive
## Note: checkpoint_path is not set in Section 1; define it before running this cell
#mt_col_table = mt_col_table.checkpoint(checkpoint_path, overwrite=True)

In [8]:
# Reading the checkpointed table back in: a table of only the columns, with only post-QC information
# This requires the checkpoint above to have been written to checkpoint_path
mt_col_table = hl.read_table(checkpoint_path)

In [10]:
# Since mt_col_table is a table, count() returns the number of rows, which is equal to the number of samples
# There should be 4096 samples
mt_col_table.count()

4096

# 3. Annotating table with relatedness information

Relatedness information is added to the dataset so that related individuals can be identified and the number of unrelated samples in each population can be counted.

<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<br>
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_cols"> More on  <i> annotate_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.counter"> More on  <i> counter() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.union_cols"> More on  <i> union_cols() </i></a></li>
    
</details>

[Back to Index](#Index)

In [43]:
# Need to get number of unrelateds annotated to the table

# Reading in the unrelated and related Matrix Tables which were written out in notebook 2: PCA and Ancestry Analyses
unrelateds = hl.read_matrix_table(unrelateds_path)
relateds = hl.read_matrix_table(relateds_path)

# Annotating both the unrelated and the related tables with a flag named unrelated 
# Set unrelated flag to True for those in the unrelated dataset, and False for those in the related dataset
unrelateds = unrelateds.annotate_cols(unrelated = True)
relateds = relateds.annotate_cols(unrelated = False)

# Using cols() to obtain two tables with only the columns from the original Matrix Tables
unrelateds_cols = unrelateds.cols()
relateds_cols = relateds.cols()

In [44]:
# Validity check 
print(unrelateds_cols.count(), relateds_cols.count()) # 3378 unrelated and 718 related samples = 4096 total samples 

3378 718


In [45]:
# Counting the number of samples per population in the unrelated and related column tables
unrelateds_count = unrelateds_cols.aggregate(hl.agg.counter(unrelateds_cols.hgdp_tgp_meta.population))
relateds_count = relateds_cols.aggregate(hl.agg.counter(relateds_cols.hgdp_tgp_meta.population))

# Validity check - print out the number of unrelated and related individuals per population 
print(f"Number of unrelated individuals per population: \
{unrelateds_count}\n\nNumber of related individuals per population: {relateds_count}")

Number of unrelated individuals per population: frozendict({'ACB': 94, 'ASW': 52, 'Adygei': 17, 'BEB': 99, 'Balochi': 23, 'BantuKenya': 10, 'BantuSouthAfrica': 8, 'Basque': 23, 'Bedouin': 42, 'BiakaPygmy': 22, 'Brahui': 23, 'Burusho': 24, 'CDX': 88, 'CEU': 120, 'CHB': 103, 'CHS': 103, 'CLM': 94, 'Cambodian': 9, 'Colombian': 3, 'Dai': 9, 'Daur': 9, 'Druze': 35, 'ESN': 103, 'FIN': 98, 'French': 27, 'GBR': 87, 'GIH': 100, 'GWD': 116, 'Han': 43, 'Hazara': 16, 'Hezhen': 8, 'IBS': 104, 'ITU': 102, 'Italian': 11, 'JPT': 102, 'Japanese': 29, 'KHV': 101, 'Kalash': 21, 'Karitiana': 10, 'LWK': 91, 'Lahu': 5, 'MSL': 83, 'MXL': 63, 'Makrani': 22, 'Mandenka': 20, 'Maya': 19, 'MbutiPygmy': 12, 'Melanesian': 11, 'Miao': 10, 'Mongola': 10, 'Mozabite': 25, 'Naxi': 8, 'Orcadian': 14, 'Oroqen': 8, 'PEL': 85, 'PJL': 97, 'PUR': 104, 'Palestinian': 38, 'Papuan': 17, 'Pathan': 24, 'Pima': 11, 'Russian': 25, 'STU': 98, 'San': 6, 'Sardinian': 27, 'She': 9, 'Sindhi': 22, 'Surui': 7, 'TSI': 103, 'Tu': 10, 'Tujia'

In [46]:
# Joining the columns of the unrelated and related datasets
mt_combined = unrelateds.union_cols(relateds)

# Validity check - count the number of unrelateds (True values) in the Matrix Table to make sure it is as expected
print(mt_combined.aggregate_cols(hl.agg.counter(mt_combined.unrelated))) # 3378 True and 718 False

# Creating a table with only the columns from the Matrix Table containing related information
# This is done since the final output will be a tsv and thus must be in table format
# Being a table of columns allows it to be annotated onto the existing mt_col_table as shown below
mt_combined_col_table = mt_combined.cols()

# Annotating the relatedness information onto the column table
mt_col_table = mt_col_table.annotate(unrelated = mt_combined_col_table[mt_col_table.s].unrelated)

frozendict({False: 718, True: 3378})
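
The annotation step above looks up each sample's key in a second keyed table. The sketch below is a plain-Python analogue of that lookup, using dictionaries keyed by sample ID. All IDs and values are hypothetical. It also shows why a completeness check is worthwhile: a key with no match yields a missing value (`None` here) rather than an error.

```python
# Hypothetical per-sample rows keyed by sample ID (values are made up)
col_rows = {
    "sample_A": {"population": "pop1", "n_snp": 100},
    "sample_B": {"population": "pop1", "n_snp": 120},
    "sample_C": {"population": "pop2", "n_snp": 90},
}

# Hypothetical relatedness flags keyed by the same sample IDs
unrelated_flags = {"sample_A": True, "sample_B": False}

# Look up each row's key in the second table; a key with no match yields None
annotated = {
    s: {**row, "unrelated": unrelated_flags.get(s)}
    for s, row in col_rows.items()
}

# Samples whose flag is missing, worth checking before aggregating
missing = [s for s, row in annotated.items() if row["unrelated"] is None]
print(missing)
```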


In [47]:
mt_col_table.show(5)

2022-11-18 21:04:21 Hail: INFO: Coerced sorted dataset
2022-11-18 21:04:21 Hail: INFO: Coerced sorted dataset


s,genetic_region,population,n_snp,mean_coverage,unrelated
str,str,str,int64,float64,bool
"HG00096","EUR","GBR",2566022,32.9,True
"HG00097","EUR","GBR",2569966,31.5,True
"HG00099","EUR","GBR",2567942,36.4,True
"HG00100","EUR","GBR",2576696,30.2,True
"HG00101","EUR","GBR",2565175,32.8,True


# 4. Calculating statistics per population
In this section, we will be using `hl.agg.stats()` which calculates the following metrics for a given expression:
- min
- max
- mean
- standard deviation
- number of non-missing records
- sum
 
Using `group_by()` on the table, we calculate these statistics for each of the 78 populations in this dataset.
We also count the number of unrelated samples within each population by using `hl.agg.count_where()`, which counts the rows where the `unrelated` flag is True.
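
To make the per-group fields concrete, here is a minimal plain-Python sketch of the same summary computed with the standard library. It is an illustration rather than Hail code: the rows are hypothetical, and it assumes the reported standard deviation is the population standard deviation (`statistics.pstdev`).

```python
import statistics
from collections import defaultdict

# Hypothetical rows: (genetic_region, population, n_snp, unrelated)
rows = [
    ("R1", "popA", 10, True),
    ("R1", "popA", 14, False),
    ("R1", "popB", 20, True),
    ("R1", "popB", 20, True),
]

# Group the rows by (genetic_region, population)
groups = defaultdict(list)
for region, pop, n_snp, unrelated in rows:
    groups[(region, pop)].append((n_snp, unrelated))

# Compute the six summary fields plus the count of unrelated samples per group
summary = {}
for key, members in groups.items():
    values = [v for v, _ in members]
    summary[key] = {
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),  # population stdev (assumption)
        "min": min(values),
        "max": max(values),
        "n": len(values),
        "sum": sum(values),
        "n_unrelated": sum(1 for _, u in members if u),
    }

print(summary[("R1", "popA")])
```

Each group produces one output row, so the number of keys in `summary` corresponds to the number of (region, population) pairs.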

<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<br>
<li><a href="https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.stats"> More on  <i> stats() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.group_by"> More on  <i> group_by() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.count_where"> More on  <i> count_where() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.TupleExpression.html#hail.expr.TupleExpression.show"> More on <i>show()</i></a></li>
     
</details>

[Back to Index](#Index)

In [49]:
# Calculating stats for each metric, grouped by genetic region and population
table = mt_col_table.group_by(
    mt_col_table.genetic_region, mt_col_table.population).aggregate(
    n_snp_stats = hl.agg.stats(mt_col_table.n_snp),
    cov_stats = hl.agg.stats(mt_col_table.mean_coverage),
    # unrelated is already a boolean, so it can be passed to count_where directly
    n_unrelated = hl.agg.count_where(mt_col_table.unrelated))

In [52]:
# Checking that each of the table fields contain what we'd expect
table.show()

2022-11-18 21:08:46 Hail: INFO: Coerced sorted dataset
2022-11-18 21:08:46 Hail: INFO: Coerced sorted dataset
2022-11-18 21:08:47 Hail: INFO: Ordering unsorted dataset with network shuffle


,,n_snp_stats,n_snp_stats,n_snp_stats,n_snp_stats,n_snp_stats,n_snp_stats,cov_stats,cov_stats,cov_stats,cov_stats,cov_stats,cov_stats,
genetic_region,population,mean,stdev,min,max,n,sum,mean,stdev,min,max,n,sum,n_unrelated
str,str,float64,float64,float64,float64,int64,float64,float64,float64,float64,float64,int64,float64,int64
"AFR","ACB",3140000.0,46300.0,2920000.0,3200000.0,114,358000000.0,31.7,2.48,28.4,42.0,114,3620.0,94
"AFR","ASW",3080000.0,62300.0,2880000.0,3180000.0,71,218000000.0,32.3,3.36,27.5,52.1,71,2290.0,52
"AFR","BantuKenya",3180000.0,20100.0,3140000.0,3210000.0,12,38200000.0,31.9,3.88,29.1,44.0,12,383.0,10
"AFR","BantuSouthAfrica",3270000.0,41500.0,3210000.0,3350000.0,8,26200000.0,38.9,10.3,30.5,64.5,8,311.0,8
"AFR","BiakaPygmy",3410000.0,9150.0,3390000.0,3430000.0,26,88600000.0,32.3,3.05,27.1,40.1,26,841.0,22
"AFR","ESN",3200000.0,8460.0,3180000.0,3230000.0,148,474000000.0,32.3,3.23,28.0,53.5,148,4780.0,103
"AFR","GWD",3190000.0,11900.0,3150000.0,3220000.0,176,561000000.0,32.5,2.71,28.2,43.4,176,5720.0,116
"AFR","LWK",3190000.0,9910.0,3150000.0,3220000.0,97,310000000.0,32.9,4.32,27.9,60.2,97,3190.0,91
"AFR","MSL",3230000.0,8520.0,3200000.0,3260000.0,98,317000000.0,31.8,2.69,27.5,45.6,98,3120.0,83
"AFR","Mandenka",3180000.0,9500.0,3160000.0,3210000.0,23,73200000.0,32.4,2.65,27.3,40.2,23,745.0,20


# 5. Formatting table for exporting
In this section, we flatten the table before exporting it. Without flattening, the aggregated statistics would be written out as nested structs (for example, `n_snp_stats` holding `mean`, `stdev`, and so on), which are difficult to work with outside Hail. After flattening, each nested field becomes its own top-level column, such as `n_snp_stats.mean`.
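
The effect of flattening can be sketched in plain Python with nested dictionaries. This is only an illustration of the naming scheme, not Hail's implementation; `flatten_row` is a hypothetical helper and the values are made up.

```python
def flatten_row(row, prefix=""):
    """Flatten nested dicts into a single level with dot-joined field names."""
    flat = {}
    for name, value in row.items():
        full_name = f"{prefix}{name}"
        if isinstance(value, dict):
            # Recurse into the nested struct, extending the dotted prefix
            flat.update(flatten_row(value, prefix=f"{full_name}."))
        else:
            flat[full_name] = value
    return flat


# Hypothetical row with one nested struct, shaped like the grouped table
nested = {
    "genetic_region": "R1",
    "population": "popA",
    "n_snp_stats": {"mean": 12.0, "n": 2},
    "n_unrelated": 1,
}

flat = flatten_row(nested)
print(list(flat))
```

Because the flattened names contain dots, they are accessed by string key (for example `flat["n_snp_stats.mean"]`) rather than by attribute.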

<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<br>
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.flatten"> More on  <i> flatten() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.key_by"> More on <i>key_by()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.describe"> More on <i>describe()</i></a></li>
    
</details>

[Back to Index](#Index)

In [53]:
# Flattening out the structs created from annotating the tables
table = table.flatten()

# Changing the keys of the table so that it is keyed by genetic region and population
table = table.key_by(table.genetic_region, table.population)

In [54]:
# Checking format of the flattened table
table.show()

2022-11-18 21:09:47 Hail: INFO: Coerced sorted dataset
2022-11-18 21:09:47 Hail: INFO: Coerced sorted dataset
2022-11-18 21:09:48 Hail: INFO: Ordering unsorted dataset with network shuffle


genetic_region,population,n_snp_stats.mean,n_snp_stats.stdev,n_snp_stats.min,n_snp_stats.max,n_snp_stats.n,n_snp_stats.sum,cov_stats.mean,cov_stats.stdev,cov_stats.min,cov_stats.max,cov_stats.n,cov_stats.sum,n_unrelated
str,str,float64,float64,float64,float64,int64,float64,float64,float64,float64,float64,int64,float64,int64
"AFR","ACB",3140000.0,46300.0,2920000.0,3200000.0,114,358000000.0,31.7,2.48,28.4,42.0,114,3620.0,94
"AFR","ASW",3080000.0,62300.0,2880000.0,3180000.0,71,218000000.0,32.3,3.36,27.5,52.1,71,2290.0,52
"AFR","BantuKenya",3180000.0,20100.0,3140000.0,3210000.0,12,38200000.0,31.9,3.88,29.1,44.0,12,383.0,10
"AFR","BantuSouthAfrica",3270000.0,41500.0,3210000.0,3350000.0,8,26200000.0,38.9,10.3,30.5,64.5,8,311.0,8
"AFR","BiakaPygmy",3410000.0,9150.0,3390000.0,3430000.0,26,88600000.0,32.3,3.05,27.1,40.1,26,841.0,22
"AFR","ESN",3200000.0,8460.0,3180000.0,3230000.0,148,474000000.0,32.3,3.23,28.0,53.5,148,4780.0,103
"AFR","GWD",3190000.0,11900.0,3150000.0,3220000.0,176,561000000.0,32.5,2.71,28.2,43.4,176,5720.0,116
"AFR","LWK",3190000.0,9910.0,3150000.0,3220000.0,97,310000000.0,32.9,4.32,27.9,60.2,97,3190.0,91
"AFR","MSL",3230000.0,8520.0,3200000.0,3260000.0,98,317000000.0,31.8,2.69,27.5,45.6,98,3120.0,83
"AFR","Mandenka",3180000.0,9500.0,3160000.0,3210000.0,23,73200000.0,32.4,2.65,27.3,40.2,23,745.0,20


In [55]:
# Checking on the format of the table after flattening to make sure it is what we'd expect
table.describe()

----------------------------------------
Global fields:
    'global_annotation_descriptions': struct {
        gnomad_sex_imputation_ploidy_cutoffs: struct {
            Description: str
        }, 
        gnomad_population_inference_pca_metrics: struct {
            Description: str
        }, 
        sample_hard_filter_cutoffs: struct {
            Description: str
        }, 
        gnomad_sample_qc_metric_outlier_cutoffs: struct {
            Description: str
        }, 
        gnomad_age_distribution: struct {
            Description: str, 
            sub_globals: struct {
                bin_edges: struct {
                    Description: str
                }, 
                bin_freq: struct {
                    Description: str
                }, 
                n_smaller: struct {
                    Description: str
                }, 
                n_larger: struct {
                    Description: str
                }
            }
        }, 
        hgdp_tgp

In [56]:
# One last validity check before writing out the dataset to make sure you still have the number of rows you expect
# In this case, since the data is grouped by genetic region, population
# The number of rows should be equal to the number of populations (78)
table.count()

2022-11-18 21:10:47 Hail: INFO: Ordering unsorted dataset with network shuffle


78

# 6. Exporting final table
<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<br>
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on  <i> export() </i></a></li>

</details>

[Back to Index](#Index)

In [58]:
# Writing out the final table in tsv format 
table.export(final_table_path, header=True)

2022-11-18 21:14:43 Hail: INFO: Coerced sorted dataset
2022-11-18 21:14:43 Hail: INFO: Coerced sorted dataset
2022-11-18 21:14:44 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-11-18 21:14:44 Hail: INFO: Coerced sorted dataset
2022-11-18 21:14:45 Hail: INFO: merging 16 files totalling 10.1K...
2022-11-18 21:14:46 Hail: INFO: while writing:
    gs://hgdp-1kg/tutorial_datasets/metadata_and_qc/post_qc_summary.tsv
  merge time: 240.849ms
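
As a quick check that the exported file is usable outside Hail, a tab-delimited file with a header row can be parsed with Python's standard `csv` module. The sketch below reads from an in-memory string instead of the real file; the column subset and values are hypothetical.

```python
import csv
import io

# Stand-in for the exported file: header row plus one made-up data row
tsv_text = (
    "genetic_region\tpopulation\tn_snp_stats.mean\tn_unrelated\n"
    "R1\tpopA\t12.0\t1\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

# Every value is read as a string, so numeric columns are converted explicitly
mean_n_snp = float(rows[0]["n_snp_stats.mean"])
n_unrelated = int(rows[0]["n_unrelated"])

print(reader.fieldnames, mean_n_snp, n_unrelated)
```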


[Back to Index](#Index)