# Summarizing Data Post QC

## Index
1. [Setting Default Paths](#1.-Set-Default-Paths)
2. [Organizing the dataset](#2.-Setting-up-data)
3. [Annotating table with relatedness information](#3.-Annotating-table-with-relatedness-information)
4. [Calculating statistics per population](#4.-Calculating-statistics-per-population)
5. [Formatting table for exporting](#5.-Formatting-table-for-exporting)
6. [Exporting final table](#6.-Exporting-final-table)

# General Overview:

The purpose of this script is to format and write out a tsv which will be used to create plots and summaries of the post-QC dataset in R.

**This script contains information on how to:**
- select specific columns from a matrix table
- annotate filter flags onto a matrix table
- join the columns of two matrix tables
- join two tables
- group a matrix table by region, population 
- use hl.agg.stats to calculate statistics for a metric within a population
- count the number of samples where a filter flag equals True  

Author: Zan Koenig

In [1]:
import hail as hl

# import the read_qc function
# tmp: this is commented out as the function will continue to change
#from read_qc_function import read_qc

Running on Apache Spark version 3.1.1
SparkUI available at http://znk-m.c.diverse-pop-seq-ref.internal:33345
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.77-684f32d73643
LOGGING: writing to /home/hail/hail-20211116-2136-0.2.77-684f32d73643.log


### tmp read_qc function
to be removed once tutorials & function are complete and we can troubleshoot importing

In [None]:
import hail as hl

def read_qc(
        raw: bool = False,
        post_qc:bool = False,
        sample_qc: bool = False,
        variant_qc: bool = False,
        outlier_removal: bool = False,
        ld_pruning: bool = False,
        rel_unrel: str = 'default',
        n_partitions: int = 0) -> hl.MatrixTable:
    """
    Wrapper function to get HGDP+1kGP data as Matrix Table at different stages of QC/filtering.
    By default (no flags set), returns the pre QC MatrixTable with qc filters annotated but not applied.

    :param bool raw: if True will return a preQC version of the dataset
    :param bool post_qc: if True will return a post QC matrix table that has gone through:
        - sample QC
        - variant QC
        - duplicate removal
        - outlier removal
    :param bool sample_qc: if True will return a post sample QC matrix table
    :param bool variant_qc: if True will return a post variant QC matrix table
    :param bool outlier_removal: if True will return a matrix table with PCA outliers removed
    :param bool ld_pruning: if True will return a matrix table that has gone through:
        - sample QC
        - variant QC
        - duplicate removal
        - LD pruning
        - additional variant filtering
    :param str rel_unrel: 'default' leaves the matrix table produced by the other arguments unchanged
        if 'all' will return the same matrix table as if ld_pruning is True
        if 'related_pre_outlier' will return a matrix table with only related samples pre pca outlier removal
        if 'unrelated_pre_outlier' will return a matrix table with only unrelated samples pre pca outlier removal
        if 'related_post_outlier' will return a matrix table with only related samples post pca outlier removal
        if 'unrelated_post_outlier' will return a matrix table with only unrelated samples post pca outlier removal
    :param int n_partitions: if specified, will read in dataset with given number of partitions for the following arguments:
        - ld_pruning
        - rel_unrel
    """
    # Reading in all the tables and matrix tables needed to generate the pre_qc matrix table
    sample_meta = hl.import_table('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/gnomad_meta_v1.tsv')
    sample_qc_meta = hl.read_table('gs://hgdp_tgp/output/gnomad_v3.1_sample_qc_metadata_hgdp_tgp_subset.ht')
    dense_mt = hl.read_matrix_table(
        'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt')
    
    dense_mt = dense_mt.naive_coalesce(5000)


    # Takes a dict and converts it to a hail struct (works with nested dicts too)
    def dict_to_struct(d):
        fields = {}
        for k, v in d.items():
            if isinstance(v, dict):
                v = dict_to_struct(v)
            fields[k] = v
        return hl.struct(**fields)

    # un-flattening a hail table with nested structure
    # dict to hold struct names as well as nested field names
    d = {}

    # Getting the row field names
    row = sample_meta.row_value

    # returns a dict with the struct names as keys and their inner field names as values
    for name in row:
        def recur(dict_ref, split_name):
            if len(split_name) == 1:
                dict_ref[split_name[0]] = row[name]
                return
            existing = dict_ref.get(split_name[0])
            if existing is not None:
                assert isinstance(existing, dict), existing
                recur(existing, split_name[1:])
            else:
                existing = {}
                dict_ref[split_name[0]] = existing
                recur(existing, split_name[1:])
        recur(d, name.split('.'))

    # using the dict created from flattened struct, creating new structs now un-flattened
    sample_meta = sample_meta.select(**dict_to_struct(d))
    sample_meta = sample_meta.key_by('s')

    # grabbing the columns needed from HGDP metadata
    new_meta = sample_meta.select(sample_meta.hgdp_tgp_meta, sample_meta.bergstrom)

    # creating a table with gnomAD sample metadata and HGDP metadata
    ht = sample_qc_meta.annotate(**new_meta[sample_qc_meta.s])

    # stripping 'v3.1::' from the names to match with the densified MT
    ht = ht.key_by(s=ht.s.replace("v3.1::", ""))

    # Using the annotate_cols() method to annotate the gnomAD sample QC metadata and HGDP metadata onto the matrix table
    mt = dense_mt.annotate_cols(**ht[dense_mt.s])
    
    if raw:
        print("Returning default preQC matrix table")
        # returns preQC dataset
        return mt
    
    if post_qc:
        print("Returning post sample and variant QC matrix table with duplicates and PCA outliers removed")
        sample_qc = True
        variant_qc = True
        duplicate = True
        outlier_removal = True
    
    if sample_qc:
        print("Applying sample QC")
        # Apply sample QC filters to dataset
        # This keeps only the samples that passed gnomAD's sample QC hard filters
        mt = mt.filter_cols(~mt.sample_filters.hard_filtered)

    if variant_qc:
        print("Applying variant QC")
        # Apply variant QC filters to dataset
        # Subsetting the variants in the dataset to only PASS variants (those which passed gnomAD's variant QC)
        # PASS variants are variants with an empty filters field.
        # This field holds the set of variant QC filters that a variant failed, if any
        # This is the last step in the QC process
        mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)

    if outlier_removal:
        print("Removing PCA outliers")
        # remove PCA outliers
        # reading in the PCA outlier list
        # To read in the PCA outlier list, first need to read the file in as a list
        # using hl.hadoop_open here which allows one to read in files into hail from Google cloud storage
        pca_outlier_path = 'gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt'
        with hl.utils.hadoop_open(pca_outlier_path) as file:
            outliers = [line.rstrip('\n') for line in file]

        # Using hl.literal here to convert the list from a python object to a hail expression so that it can be used
        # to filter out samples
        outliers_list = hl.literal(outliers)

        # Using the list of PCA outliers with the ~ (negation) operator to obtain the complement
        # In this case the complement is the samples which are not contained in the PCA outlier list
        mt = mt.filter_cols(~outliers_list.contains(mt['s']))

    if ld_pruning:
        print("Returning ld pruned post variant and sample QC matrix table pre PCA outlier removal ")
        # read in dataset which has additional variant filtering and ld pruning run
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt')

    if rel_unrel == "default":
        # do nothing
        # created a default value because there are multiple options for rel/unrel datasets
        pass

    elif rel_unrel == 'related_pre_outlier':
        print("Returning post sample and variant QC matrix table " \
              "pre PCA outlier removal with only related individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate 
        #   - filter to only related individuals   
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/rel_updated.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/rel_updated.mt')
        
    elif rel_unrel == 'unrelated_pre_outlier':
        print("Returning post QC matrix table with only unrelated individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate 
        #   - filter to only unrelated individuals
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt')

    elif rel_unrel == 'related_post_outlier':
        print("Returning post sample and variant QC matrix table " \
              "post PCA outlier removal with only related individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate 
        #   - filter to only related individuals
        #   - PCA outlier removal
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt')

    elif rel_unrel == 'unrelated_post_outlier':
        print("Returning post sample and variant QC matrix table " \
              "post PCA outlier removal with only unrelated individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate 
        #   - filter to only unrelated individuals
        #   - PCA outlier removal
        if n_partitions != 0:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt',
            _n_partitions = n_partitions)
        else:
            mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt')
        
    # Calculating both variant and sample_qc metrics on the mt before returning
    # so the stats are up to date with the version being written out
    mt = hl.sample_qc(mt)
    mt = hl.variant_qc(mt)
    
    return mt
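
The un-flattening step inside `read_qc` (the `recur` helper followed by `dict_to_struct`) can be hard to follow in place. The sketch below reproduces only the name-splitting logic in plain Python, without Hail, so nested dicts stand in for Hail structs. The function name `unflatten` and the sample ID are made up for illustration; the two `hgdp_tgp_meta` field names are the ones selected later in this notebook.

```python
from typing import Any, Dict


def unflatten(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn dotted names such as 'a.b.c' into nested dicts {'a': {'b': {'c': ...}}}."""
    nested: Dict[str, Any] = {}
    for dotted_name, value in flat.items():
        parts = dotted_name.split('.')
        node = nested
        # Walk down (creating as needed) one level per name component except the last
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            # A name cannot be both a leaf value and a parent of other fields
            assert isinstance(child, dict), child
            node = child
        # The last component is the field that holds the value
        node[parts[-1]] = value
    return nested


# Hypothetical flattened metadata row (sample ID is made up)
row = {
    's': 'sample_1',
    'hgdp_tgp_meta.Population': 'ACB',
    'hgdp_tgp_meta.Study.region': 'AFR',
}
nested_row = unflatten(row)
print(nested_row)
```

In `read_qc` the leaf values are Hail expressions (`row[name]`) rather than plain values, and the nested dict is then converted into `hl.struct` objects so that it can be passed to `select()`.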

# 1. Set Default Paths
These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets. The read_qc() function is intended to take the place of needing to write out and read in datasets by the user. 

By default, all of the write steps in the tutorials are commented out. If you would like to write out your own datasets, uncomment those sections and replace the paths with your own.
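
If you do replace the paths, one option is to derive both of them from a single prefix so that only one value needs editing. This is a minimal sketch, not part of the original tutorial: the helper `build_output_paths` and the bucket `gs://my-bucket/my-folder/` are hypothetical, while the two file names are the ones used in the cell below.

```python
import posixpath


def build_output_paths(output_prefix: str) -> dict:
    """Return the checkpoint and final-table paths under one user-chosen prefix."""
    # Drop any trailing slash so the join does not produce a double slash
    prefix = output_prefix.rstrip('/')
    return {
        'checkpoint_path': posixpath.join(prefix, 'table_1_checkpoint.ht'),
        'table_path': posixpath.join(prefix, 'table_1.tsv'),
    }


# Hypothetical bucket and folder; replace with your own
paths = build_output_paths('gs://my-bucket/my-folder/')
checkpoint_path = paths['checkpoint_path']
table_path = paths['table_path']
print(checkpoint_path)
print(table_path)
```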

[Back to Index](#Index)

In [None]:
# Setting up a default output path for any datasets to be written out to

# Default output path for a checkpoint dataset
# takes some time to write out but speeds up downstream analyses
checkpoint_path = 'gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/table_1_checkpoint.ht'

# Default output path for final table dataset
table_path = 'gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/table_1.tsv'

# 2. Setting up data

Here we walk through the steps that get the dataset ready for downstream analyses. First we create a table with only sample (column) data, then we select only the columns we need for the table we will write out. Finally, we checkpoint that table to speed up downstream analysis steps.

<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<br>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.sample_qc"> More on  <i> sample_qc() </i></a></li>   
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.cols"> More on  <i> cols() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.select"> More on  <i> select() </i></a></li>    
    
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_table"> More on  <i> read_table() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.checkpoint"> More on  <i> checkpoint() </i></a></li>
     
</details>

[Back to Index](#Index)

In [2]:
# Reading in the post QC version of the merged dataset (with metadata) using the read_qc function
mt = read_qc(post_qc=True)

In [4]:
# Grabbing only the columns from the matrix table (outputs table of just columns)
col_table = mt.cols()

2021-11-16 21:37:00 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'


In [5]:
# Selecting only the columns needed for table 1 from the col table
col_table = col_table.select(col_table.hgdp_tgp_meta.Study.region,
                             col_table.hgdp_tgp_meta.Population,
                             col_table.sample_qc.n_snp,
                             col_table.sample_qc.n_singleton,
                             col_table.bam_metrics.mean_coverage)
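
When nested fields are passed to `select()` positionally, each selected column is named after the last component of the field path, which is why later cells can refer to `col_table.region` or `col_table.n_snp` directly. The plain-Python sketch below mimics that naming on a single record. The helper `select_fields`, the sample ID, and the values are made up for illustration; the table key `s` is carried over unchanged.

```python
from typing import Any, Dict, Sequence


def select_fields(record: Dict[str, Any], field_paths: Sequence[str]) -> Dict[str, Any]:
    """Pick nested fields out of a record and name each by its last path component."""
    selected = {'s': record['s']}  # the key field is always kept
    for path in field_paths:
        parts = path.split('.')
        value = record
        # Follow the path one level at a time into the nested dicts
        for part in parts:
            value = value[part]
        selected[parts[-1]] = value
    return selected


# Hypothetical sample record with made-up values
sample = {
    's': 'sample_1',
    'hgdp_tgp_meta': {'Study': {'region': 'AFR'}, 'Population': 'ACB'},
    'sample_qc': {'n_snp': 6040000, 'n_singleton': 15700},
    'bam_metrics': {'mean_coverage': 31.7},
}
flat_sample = select_fields(sample, [
    'hgdp_tgp_meta.Study.region',
    'hgdp_tgp_meta.Population',
    'sample_qc.n_snp',
    'sample_qc.n_singleton',
    'bam_metrics.mean_coverage',
])
print(flat_sample)
```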

In [None]:
# Checking the counts for the table, there should be 4097 samples 
col_table.count()

In [None]:
# # writing out col_table as a checkpoint to make the downstream steps run faster
# # this is done because running sample_qc is computationally expensive
# col_table.checkpoint(checkpoint_path, overwrite=True)

In [8]:
# this is a table of only the columns with only postQC information
col_table = hl.read_table(checkpoint_path)

In [9]:
# Since col_table is a table, count prints the number of rows which is equal to the number of samples
# There should be 4097 samples
col_table.count()

4097

# 3. Annotating table with relatedness information

Relatedness information is added to the dataset so that we can count the number of unrelated individuals in each population.

<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<br>
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_cols"> More on  <i> annotate_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.counter"> More on  <i> counter() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.union_cols"> More on  <i> union_cols() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.cols"> More on  <i> cols() </i></a></li>
    
</details>

[Back to Index](#Index)

In [10]:
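# Need to get number of unrelateds annotated to the table
# Reading in the unrelated and related matrix tables
# read_qc selects these through its rel_unrel argument (assumed here to be the
# post-outlier versions, so the samples match the post-QC col_table)
unrelated = read_qc(rel_unrel='unrelated_post_outlier')
related = read_qc(rel_unrel='related_post_outlier')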

# Annotating both the unrelated and the related tables with a flag named unrelated 
# set unrelated flag to True for those in the unrelated dataset, and False for those in the related dataset
unrelated = unrelated.annotate_cols(unrelated = True)
related = related.annotate_cols(unrelated = False)

# using cols() to obtain two tables with only the columns from the original matrix tables
unrelated_cols = unrelated.cols()
related_cols = related.cols()

In [11]:
# Counting the number of samples per population in the related/unrelated column tables
related_count = related_cols.aggregate(hl.agg.counter(related_cols.hgdp_tgp_meta.Population))
unrelated_count = unrelated_cols.aggregate(hl.agg.counter(unrelated_cols.hgdp_tgp_meta.Population))

# Printing out the number of related and unrelated individuals per population as a validity check
print(f"Number of related individuals per population: \
{related_count}\n\nNumber of unrelated individuals per population: {unrelated_count}")

Number of related individuals per population: frozendict({'Naxi': 1, 'PJL': 48, 'Bedouin': 3, 'Pima': 3, 'IBS': 49, 'Mozabite': 1, 'Surui': 1, 'GWD': 60, 'PUR': 34, 'MbutiPygmy': 2, 'Hezhen': 1, 'She': 1, 'ITU': 4, 'BantuKenya': 2, 'Lahu': 3, 'Palestinian': 7, 'Maya': 3, 'STU': 16, 'ACB': 21, 'MXL': 34, 'GIH': 3, 'ASW': 19, 'Orcadian': 2, 'CEU': 55, 'Colombian': 4, 'Karitiana': 2, 'PEL': 37, 'Kalash': 2, 'CDX': 5, 'Makrani': 1, 'Druze': 7, 'CLM': 36, 'CHS': 60, 'BEB': 32, 'GBR': 2, 'Hazara': 4, 'BiakaPygmy': 4, 'YRI': 58, 'Melanesian': 2, 'KHV': 21, 'Japanese': 1, 'Mandenka': 3, 'MSL': 15, 'Sindhi': 1, 'LWK': 5, 'ESN': 45})

Number of unrelated individuals per population: frozendict({'Adygei': 17, 'Naxi': 8, 'PJL': 97, 'Tuscan': 8, 'Bedouin': 43, 'Pima': 11, 'BantuSouthAfrica': 8, 'IBS': 107, 'Italian': 11, 'Papuan': 17, 'Mozabite': 27, 'Surui': 7, 'Mongola': 10, 'Russian': 25, 'Basque': 23, 'FIN': 98, 'Sardinian': 27, 'GWD': 116, 'PUR': 104, 'MbutiPygmy': 12, 'Hezhen': 8, 'She': 9, 'I
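
The dictionaries printed above map each population to its sample count, which is the same tally that Python's `collections.Counter` produces over a list of labels. A minimal plain-Python analogue, with a made-up list of population labels:

```python
from collections import Counter

# Hypothetical population label for each related sample (made-up data)
related_populations = ['ACB', 'ACB', 'CEU', 'YRI', 'CEU', 'ACB']

# One entry per distinct label, valued by how many times it occurs
related_count = Counter(related_populations)
print(dict(related_count))

# The counts always add up to the number of samples that were tallied
total_related = sum(related_count.values())
print(total_related)
```

Summing the values of each dictionary is a quick way to confirm that the related and unrelated sample totals are what you expect.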

In [12]:
# Joining the columns of the unrelated and related datasets
mt_rel = unrelated.union_cols(related)

# counting the number of unrelated in the matrix table to make sure it is as expected
print(mt_rel.aggregate_cols(hl.agg.counter(mt_rel.unrelated)))

# creating a table with only the columns from the matrix table containing related information
# this is done since the final output will be a tsv and thus must be in table format
# Being a table of columns allows it to be annotated onto the existing col_table as shown below
rel_table = mt_rel.cols()

# annotating the relatedness information onto the column table
col_table = col_table.annotate(unrel = rel_table[col_table.s].unrelated)
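
The last step above is a keyed lookup: each row of `col_table` fetches the `unrelated` flag of the row in `rel_table` with the same sample ID. The sketch below shows the same idea with plain dicts. The sample IDs are made up, and `None` stands in for a missing value when a sample appears in neither dataset.

```python
# Hypothetical sample IDs in each dataset (made-up data)
unrelated_ids = ['s1', 's2', 's4']
related_ids = ['s3']

# Analogue of flagging both datasets and then combining their columns
unrelated_flag = {s: True for s in unrelated_ids}
unrelated_flag.update({s: False for s in related_ids})

# Analogue of col_table: one record per sample, keyed by 's'
col_rows = [{'s': 's1'}, {'s': 's3'}, {'s': 's5'}]

# Keyed lookup; samples absent from both datasets get None (missing)
annotated = [{**row, 'unrel': unrelated_flag.get(row['s'])} for row in col_rows]
print(annotated)
```

Checking for missing `unrel` values after the annotation is a simple way to spot samples that were present in `col_table` but in neither relatedness dataset.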

# 4. Calculating statistics per population
In this section, we will be using hl.agg.stats() which calculates the following metrics for a given expression:
- min
- max
- mean
- standard deviation
- number of non-missing records
- sum
 
Using `group_by()` on the table, we calculate these statistics for each of the 78 populations in this dataset.
We also use `hl.agg.count_where()` to count the samples whose `unrel` flag is True, which gives us the number of unrelated samples within each population.
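
To make the grouped aggregation concrete, here is a small plain-Python sketch of the same computation on made-up records. The helper `stats` is hypothetical and assumes that the reported standard deviation is the population standard deviation (dividing by n); that assumption is worth checking against the Hail documentation before relying on it.

```python
import statistics
from collections import defaultdict


def stats(values):
    """Return the six summary metrics, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        'mean': statistics.fmean(present),
        'stdev': statistics.pstdev(present),  # assumption: population stdev
        'min': min(present),
        'max': max(present),
        'n': len(present),
        'sum': sum(present),
    }


# Made-up sample records
samples = [
    {'region': 'AFR', 'Population': 'ACB', 'n_snp': 10.0, 'unrel': True},
    {'region': 'AFR', 'Population': 'ACB', 'n_snp': 14.0, 'unrel': False},
    {'region': 'AMR', 'Population': 'CLM', 'n_snp': 8.0, 'unrel': True},
]

# Group the records by (region, Population)
groups = defaultdict(list)
for sample in samples:
    groups[(sample['region'], sample['Population'])].append(sample)

# One summary row per group: metric statistics plus a count of True flags
summary = {
    key: {
        'n_snp_stats': stats([m['n_snp'] for m in members]),
        'n_unrelated': sum(1 for m in members if m['unrel'] is True),
    }
    for key, members in groups.items()
}
print(summary)
```

Each key of `summary` corresponds to one row of the grouped table, with one nested struct per metric and a single count of unrelated samples.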

<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<br>
<li><a href="https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.stats"> More on  <i> stats() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.group_by"> More on  <i> group_by() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.count_where"> More on  <i> count_where() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.TupleExpression.html#hail.expr.TupleExpression.show"> More on <i>show()</i></a></li>
     
</details>

[Back to Index](#Index)

In [13]:
# Calculating stats per population for each metric grouped by geographic region and population
table = col_table.group_by(
    col_table.region, col_table.Population).aggregate(
    n_snp_stats = hl.agg.stats(col_table.n_snp),
    n_singleton_stats = hl.agg.stats(col_table.n_singleton),
    cov_stats = hl.agg.stats(col_table.mean_coverage),
    n_unrelated = hl.agg.count_where(col_table.unrel == True))

In [14]:
# checking that each of the table fields contain what we'd expect
table.show()

2021-11-16 22:11:09 Hail: INFO: Coerced sorted dataset
2021-11-16 22:11:11 Hail: INFO: Ordering unsorted dataset with network shuffle


| region | Population | n_snp_stats.mean | n_snp_stats.stdev | n_snp_stats.min | n_snp_stats.max | n_snp_stats.n | n_snp_stats.sum | n_singleton_stats.mean | n_singleton_stats.stdev | n_singleton_stats.min | n_singleton_stats.max | n_singleton_stats.n | n_singleton_stats.sum | cov_stats.mean | cov_stats.stdev | cov_stats.min | cov_stats.max | cov_stats.n | cov_stats.sum | n_unrelated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | float64 | float64 | float64 | float64 | int64 | float64 | float64 | float64 | float64 | float64 | int64 | float64 | float64 | float64 | float64 | float64 | int64 | float64 | int64 |
| "AFR" | "ACB" | 6040000.0 | 68100.0 | 5730000.0 | 6140000.0 | 114 | 688000000.0 | 15700.0 | 8380.0 | 528.0 | 27500.0 | 114 | 1790000.0 | 31.7 | 2.48 | 28.4 | 42.0 | 114 | 3620.0 | 94 |
| "AFR" | "ASW" | 5940000.0 | 93800.0 | 5670000.0 | 6100000.0 | 71 | 422000000.0 | 14000.0 | 8230.0 | 532.0 | 27600.0 | 71 | 995000.0 | 32.3 | 3.36 | 27.5 | 52.1 | 71 | 2290.0 | 52 |
| "AFR" | "ESN" | 6130000.0 | 14600.0 | 6090000.0 | 6180000.0 | 148 | 907000000.0 | 8560.0 | 5680.0 | 549.0 | 20000.0 | 148 | 1270000.0 | 32.3 | 3.23 | 28.0 | 53.5 | 148 | 4780.0 | 103 |
| "AFR" | "GWD" | 6120000.0 | 19500.0 | 6060000.0 | 6170000.0 | 176 | 1080000000.0 | 10400.0 | 6800.0 | 573.0 | 26000.0 | 176 | 1830000.0 | 32.5 | 2.71 | 28.2 | 43.4 | 176 | 5720.0 | 116 |
| "AFR" | "LWK" | 6110000.0 | 17200.0 | 6060000.0 | 6160000.0 | 97 | 593000000.0 | 25800.0 | 7780.0 | 9390.0 | 39800.0 | 97 | 2510000.0 | 32.9 | 4.32 | 27.9 | 60.2 | 97 | 3190.0 | 92 |
| "AFR" | "MSL" | 6180000.0 | 15100.0 | 6130000.0 | 6220000.0 | 98 | 606000000.0 | 21300.0 | 8860.0 | 702.0 | 31100.0 | 98 | 2080000.0 | 31.8 | 2.69 | 27.5 | 45.6 | 98 | 3120.0 | 83 |
| "AFR" | "YRI" | 6130000.0 | 15000.0 | 6080000.0 | 6160000.0 | 175 | 1070000000.0 | 8380.0 | 5640.0 | 518.0 | 23000.0 | 175 | 1470000.0 | 32.3 | 3.59 | 26.9 | 56.1 | 175 | 5650.0 | 117 |
| "AMR" | "CLM" | 5290000.0 | 78700.0 | 5170000.0 | 5580000.0 | 130 | 687000000.0 | 8880.0 | 6560.0 | 284.0 | 26200.0 | 130 | 1150000.0 | 32.5 | 3.14 | 26.3 | 55.2 | 130 | 4230.0 | 94 |
| "AMR" | "MXL" | 5270000.0 | 31400.0 | 5180000.0 | 5340000.0 | 97 | 511000000.0 | 9270.0 | 7250.0 | 295.0 | 28300.0 | 97 | 899000.0 | 31.3 | 2.11 | 28.3 | 42.6 | 97 | 3040.0 | 63 |
| "AMR" | "PEL" | 5290000.0 | 50000.0 | 5230000.0 | 5600000.0 | 122 | 645000000.0 | 11700.0 | 8640.0 | 234.0 | 31500.0 | 122 | 1430000.0 | 31.8 | 1.94 | 28.2 | 38.7 | 122 | 3880.0 | 85 |


# 5. Formatting table for exporting
In this section we flatten the table before exporting so that it is in a usable format once written out. Without flattening, the newly annotated statistics would be written out as nested structs, which are difficult to work with outside Hail.
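
As a rough picture of what flattening does, the plain-Python sketch below turns a nested record into one with dotted column names, matching the `cov_stats.mean` style seen in the output further down. The helper `flatten` is an illustration only and is not Hail's implementation; the values are taken from the ACB row shown in the previous section.

```python
def flatten(record, prefix=''):
    """Collapse nested dicts into a single level with dotted field names."""
    flat = {}
    for name, value in record.items():
        full_name = f'{prefix}.{name}' if prefix else name
        if isinstance(value, dict):
            # Recurse into the nested struct, carrying the parent name along
            flat.update(flatten(value, full_name))
        else:
            flat[full_name] = value
    return flat


# One grouped row with a nested stats struct (values from the ACB row above)
row = {
    'region': 'AFR',
    'Population': 'ACB',
    'cov_stats': {'mean': 31.7, 'n': 114},
    'n_unrelated': 94,
}
flat_row = flatten(row)
print(flat_row)
```

After flattening, every column holds a single value, so each one maps onto exactly one column of the exported file.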

<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<br>
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.flatten"> More on  <i> flatten() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.key_by"> More on <i>key_by()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.TupleExpression.html#hail.expr.TupleExpression.show"> More on <i>show()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.describe"> More on <i>describe()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.count"> More on <i>count()</i></a></li>
    
</details>

[Back to Index](#Index)

In [21]:
# Flattening out the structs created from annotating the tables
table = table.flatten()

# Changing the keys of the table so that it is keyed by global region and population
table = table.key_by(table.region, table.Population)

In [17]:
# checking format of the flattened table
table.show()

2021-06-29 16:31:23 Hail: INFO: Coerced sorted dataset
2021-06-29 16:31:24 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-29 16:31:25 Hail: INFO: Coerced sorted dataset
2021-06-29 16:31:25 Hail: INFO: Coerced sorted dataset
2021-06-29 16:31:26 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-29 16:31:27 Hail: INFO: Coerced sorted dataset
2021-06-29 16:31:28 Hail: INFO: Coerced sorted dataset
2021-06-29 16:31:29 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-06-29 16:31:29 Hail: INFO: Coerced sorted dataset


| region | Population | n_unrelated | cov_stats.n | cov_stats.mean | cov_stats.stdev | n_snp.n_unrelated | n_snp.n_snp_stats.n | n_snp.n_snp_stats.mean | n_snp.n_snp_stats.stdev | n_singleton.n_unrelated | n_singleton.n_singleton_stats.n | n_singleton.n_singleton_stats.mean | n_singleton.n_singleton_stats.stdev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | int64 | int64 | float64 | float64 | int64 | int64 | float64 | float64 | int64 | int64 | float64 | float64 |
| "AFR" | "ACB" | 90 | 114 | 31.7 | 2.48 | 90 | 114 | 6040000.0 | 68100.0 | 90 | 114 | 15700.0 | 8380.0 |
| "AFR" | "ASW" | 49 | 72 | 32.3 | 3.34 | 49 | 72 | 5940000.0 | 106000.0 | 49 | 72 | 13900.0 | 8040.0 |
| "AFR" | "ESN" | 100 | 148 | 32.3 | 3.23 | 100 | 148 | 6130000.0 | 14600.0 | 100 | 148 | 8560.0 | 5680.0 |
| "AFR" | "GWD" | 112 | 176 | 32.5 | 2.71 | 112 | 176 | 6120000.0 | 19500.0 | 112 | 176 | 10400.0 | 6800.0 |
| "AFR" | "LWK" | 92 | 97 | 32.9 | 4.32 | 92 | 97 | 6110000.0 | 17200.0 | 92 | 97 | 25800.0 | 7770.0 |
| "AFR" | "MSL" | 80 | 98 | 31.8 | 2.69 | 80 | 98 | 6180000.0 | 15100.0 | 80 | 98 | 21300.0 | 8860.0 |
| "AFR" | "YRI" | 114 | 175 | 32.3 | 3.59 | 114 | 175 | 6130000.0 | 15000.0 | 114 | 175 | 8380.0 | 5640.0 |
| "AMR" | "CLM" | 91 | 130 | 32.5 | 3.14 | 91 | 130 | 5290000.0 | 78700.0 | 91 | 130 | 8880.0 | 6560.0 |
| "AMR" | "MXL" | 59 | 97 | 31.3 | 2.11 | 59 | 97 | 5270000.0 | 31400.0 | 59 | 97 | 9260.0 | 7240.0 |
| "AMR" | "PEL" | 81 | 122 | 31.8 | 1.94 | 81 | 122 | 5290000.0 | 50000.0 | 81 | 122 | 11700.0 | 8640.0 |


In [24]:
# Checking on the format of the table after flattening to make sure it is what we'd expect
table.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'region': str 
    'Population': str 
    'n_snp_stats.mean': float64 
    'n_snp_stats.stdev': float64 
    'n_snp_stats.min': float64 
    'n_snp_stats.max': float64 
    'n_snp_stats.n': int64 
    'n_snp_stats.sum': float64 
    'n_singleton_stats.mean': float64 
    'n_singleton_stats.stdev': float64 
    'n_singleton_stats.min': float64 
    'n_singleton_stats.max': float64 
    'n_singleton_stats.n': int64 
    'n_singleton_stats.sum': float64 
    'cov_stats.mean': float64 
    'cov_stats.stdev': float64 
    'cov_stats.min': float64 
    'cov_stats.max': float64 
    'cov_stats.n': int64 
    'cov_stats.sum': float64 
    'n_unrelated': int64 
----------------------------------------
Key: []
----------------------------------------


In [25]:
# one last validity check before writing out the dataset to make sure you still have the number of rows you expect
# in this case, since the data is grouped by global region, population
# the number of rows should be equal to the number of populations (78)
table.count()

2021-11-16 22:20:22 Hail: INFO: Ordering unsorted dataset with network shuffle


78

# 6. Exporting final table
<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<br>
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on  <i> export() </i></a></li>

</details>

[Back to Index](#Index)

In [1]:
# # writing out the final table as a tsv
# table.export(table_path, header=True)
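
For a sense of the file layout to expect when reading the result into R, the sketch below writes two flattened rows as tab-separated text using only the Python standard library. It illustrates the layout and is not what `export()` does internally; the rows use a subset of the columns, with values taken from the grouped table shown in section 4.

```python
import csv
import io

# Two flattened rows (a subset of columns, values from the table in section 4)
rows = [
    {'region': 'AFR', 'Population': 'ACB', 'cov_stats.mean': 31.7, 'n_unrelated': 94},
    {'region': 'AFR', 'Population': 'ASW', 'cov_stats.mean': 32.3, 'n_unrelated': 52},
]

# Write a header line followed by one tab-separated line per row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                        delimiter='\t', lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Because the column names contain dots after flattening, it is worth checking how your R import function treats them when building column names.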
