## PCA and Ancestry Analyses

Authors: Mary T. Yohannes and Ally Kim

**To run this tutorial, you need to have started your cluster with the `--packages gnomad` argument.**

*If you have not done this, you will need to shut down your current cluster and start a new one with the `--packages gnomad` argument.*

See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

# Index
1. [Setting Default Paths](#1.-Set-Default-Paths)
2. [Variant Filtering and LD Pruning](#2.-Variant-Filtering-and-LD-Pruning)
3. [Run PC Relate](#3.-Run-PC-Relate)
4. [PCA](#4.-PCA)
    1. [Function to Run PCA on Unrelated Individuals](#4a.-Function-to-Run-PCA-on-Unrelated-Individuals)
    2. [Function to Project Related Individuals](#4b.-Function-to-Project-Related-Individuals)
    3. [Global PCA](#4c.-Global-PCA)
    4. [Subcontinental PCA](#4d.-Subcontinental-PCA)
    5. [PCA Plots](#4e.-PCA-Plots)
        1. [Global PCA Plots](#4e-1.-Global-PCA-Plots)
        2. [Subcontinental PCA Plots](#4e-2.-Subcontinental-PCA-Plots)
5. [Outlier Removal](#5.-Outlier-Removal)
6. [Rerun PCA (without outliers)](#6.-Rerun-PCA-(without-outliers))
    1. [Global PCA (without outliers)](#6a.-Global-PCA-(without-outliers))
    2. [Subcontinental PCA (without outliers)](#6b.-Subcontinental-PCA-(without-outliers))
    3. [PCA Plots (without outliers)](#6c.-PCA-Plots-(without-outliers))
        1. [Global PCA Plots (without outliers)](#6c-1.-Global-PCA-Plots-(without-outliers))
        2. [Subcontinental PCA Plots (without outliers)](#6c-2.-Subcontinental-PCA-Plots-(without-outliers))
7. [Writing out Matrix Table](#7.-Write-Out-Matrix-Table)

# General Overview 
The purpose of this notebook is to further filter the post-QC matrix table to prepare it for LD pruning, compute relatedness, and run Principal Component Analysis (PCA).

**This script contains information on how to:**
- Read in a matrix table and run Hail common variant statistics 
- Filter using allele frequency and call rate
- Run LD pruning 
- Run relatedness and separate related and unrelated individuals
- Calculate PC scores and project samples onto a PC space
- Run global and subcontinental PCA and plot them 
- Remove PCA outliers (filter using sample IDs)
- Rerun global and subcontinental PCA
- Write out a matrix table 

In [5]:
import hail as hl

# Function from gnomAD for related sample projection 
from gnomad.sample_qc.ancestry import pc_project

# For plotting in Hail
from hail.ggplot import *
import plotly

from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

# 1. Set Default Paths
These default paths can be edited as needed. We recommend running these tutorials without writing out datasets.

By default, all of the write steps in the tutorials are commented out. If you would like to write out your own datasets, uncomment those sections and replace the paths with your own. Don't forget to change the read-in paths as well.

[Back to Index](#Index)

In [15]:
# input file 
post_qc_path = 'gs://hgdp-1kg/tutorial_datasets/metadata_and_qc/post_qc.mt'

# Path for gnomAD's HGDP+1kGP metadata for plotting 
metadata_path = 'gs://hgdp-1kg/tutorial_datasets/metadata_and_qc/gnomad_meta_v1.tsv'

# save the filtered and LD pruned mt as an intermediate file since LD pruning takes a while to rerun
ld_pruned_path = 'gs://hgdp-1kg/tutorial_datasets/pca_preprocessing/ld_pruned.mt'

# ht of related sample IDs for separating unrelated and related samples for PCA run 
related_sample_ids_path = 'gs://hgdp-1kg/tutorial_datasets/pca_preprocessing/related_sample_ids.ht'

# path for pre-outlier PCA results - global & subcontinental PCA 
pc_scores_with_outliers_path = 'gs://hgdp-1kg/tutorial_datasets/pca/pc_scores_with_outliers/'

# PCA outliers file 
outliers_path = 'gs://hgdp-1kg/tutorial_datasets/pca/pca_outliers.txt'

# path for post-outlier PCA results - global & subcontinental PCA 
pc_scores_without_outliers_path = 'gs://hgdp-1kg/tutorial_datasets/pca/pc_scores_without_outliers/'

# paths for unrelated and related datasets without outliers   
unrelateds_mt_without_outliers_path = 'gs://hgdp-1kg/tutorial_datasets/pca_results/unrelateds_without_outliers.mt'
relateds_mt_without_outliers_path = 'gs://hgdp-1kg/tutorial_datasets/pca_results/relateds_without_outliers.mt' 


# 2. Variant Filtering and LD Pruning
   
At this point, we have <code>159,795,273 SNVs</code>. For computational efficiency, we want fewer variants (~100-300k) for PCA, so we filter on allele frequency (<code>AF</code>) and missingness (<code>call rate</code>), then run LD pruning.

Linkage disequilibrium (LD) is the correlation between nearby variants such that the alleles at neighboring polymorphisms (observed on the same chromosome) are associated within a population more often than if they were unlinked.    
    
For more information on LD pruning click <a href="https://www.nature.com/articles/nrg2361"> here</a>.
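
As a rough illustration of what LD measures, the sketch below computes the squared Pearson correlation (r²) between the alternate-allele counts of two variants across samples. The genotype vectors and the `r_squared` helper are made up for illustration and are not part of Hail; this is only meant to show the quantity that an `r2` threshold is compared against.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)

# Hypothetical alt-allele counts (0, 1, or 2) for six samples at three variants
variant_a = [0, 1, 2, 2, 1, 0]
variant_b = [0, 1, 2, 2, 1, 0]  # same pattern as variant_a: perfectly correlated
variant_c = [1, 0, 2, 0, 2, 1]  # pattern with zero covariance with variant_a

r2_ab = r_squared(variant_a, variant_b)
r2_ac = r_squared(variant_a, variant_c)
print(r2_ab, r2_ac)
```

A pair like `variant_a` and `variant_b` carries redundant information for PCA, which is why one of the two is dropped during pruning.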


<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc"> More on  <i> variant_qc() </i></a></li>

<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.ld_prune"> More on  <i> ld_prune() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [3]:
# Read in the right input file 
mt = hl.read_matrix_table(post_qc_path)

Initializing Hail with default parameters...
Running on Apache Spark version 3.1.2
SparkUI available at http://mty-m.c.diverse-pop-seq-ref.internal:42443
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.83-b3151b4c4271
LOGGING: writing to /home/hail/hail-20221013-1222-0.2.83-b3151b4c4271.log


## 2a. Variant Filtering 

[Back to Index](#Index)

In [4]:
# Run Hail's common variant statistics (QC metrics) 
var_qc_mt = hl.variant_qc(mt) 


# Filter to variants with AF between 0.05 & 0.95, and call rate greater than 0.999    
filtered_mt = var_qc_mt.filter_rows(((var_qc_mt.variant_qc.AF[0] > 0.05) & (var_qc_mt.variant_qc.AF[1] > 0.05)) &
                                 ((var_qc_mt.variant_qc.AF[0] < 0.95) & (var_qc_mt.variant_qc.AF[1] < 0.95)) &
                                 (var_qc_mt.variant_qc.call_rate > 0.999))
# Took 9min to print 
print('Num of variants after filtering = ' + str(filtered_mt.count()[0])) 

Num of variants after filtering = 5194245


We started with <code>159,795,273</code> SNVs, then after filtering on allele frequency and call rate, we ended with <code>5,194,245</code> SNVs.
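
For a biallelic site, `AF[0]` and `AF[1]` sum to 1, so the four allele-frequency comparisons in the filter above reduce to a single condition on the alternate allele frequency: 0.05 < `AF[1]` < 0.95. The plain-Python sketch below checks this on a few made-up values. `passes_filter` and `passes_simplified` are hypothetical helper names, and the equivalence assumes biallelic sites only.

```python
def passes_filter(af_ref, af_alt, call_rate):
    """Mirror of the four-way AF filter plus the call-rate filter."""
    return (af_ref > 0.05 and af_alt > 0.05
            and af_ref < 0.95 and af_alt < 0.95
            and call_rate > 0.999)

def passes_simplified(af_alt, call_rate):
    """Same filter written in terms of the alternate allele frequency only."""
    return 0.05 < af_alt < 0.95 and call_rate > 0.999

# Made-up (alt allele frequency, call rate) pairs
examples = [(0.01, 1.0), (0.30, 1.0), (0.30, 0.99), (0.97, 1.0), (0.50, 0.9995)]

full_results = [passes_filter(1 - af, af, cr) for af, cr in examples]
simple_results = [passes_simplified(af, cr) for af, cr in examples]
print(full_results)
print(simple_results)
```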

## 2b. LD Pruning

[Back to Index](#Index)

We still have too many variants for PCA, and many of them are correlated with one another (non-independent). We address this by pruning SNVs based on LD.
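
The idea behind pruning can be sketched in plain Python: walk along the chromosome and keep a variant only if its r² with every already-kept variant inside the window is at or below the threshold. This is a simplified greedy scheme with made-up genotypes and a hypothetical `greedy_ld_prune` helper. It is not Hail's actual algorithm, which is distributed and more involved, but it shows what the `r2` and `bp_window_size` arguments control.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)

def greedy_ld_prune(variants, r2=0.1, bp_window_size=500_000):
    """variants: list of (position, genotype_vector) sorted by position.
    Returns the positions of the variants that are kept."""
    kept = []
    for pos, gt in variants:
        in_ld = any(
            pos - kept_pos <= bp_window_size and r_squared(gt, kept_gt) > r2
            for kept_pos, kept_gt in kept
        )
        if not in_ld:
            kept.append((pos, gt))
    return [pos for pos, _ in kept]

# Made-up variants on a single chromosome
variants = [
    (100_000, [0, 1, 2, 2, 1, 0]),
    (150_000, [0, 1, 2, 2, 1, 0]),  # perfectly correlated with the first: pruned
    (200_000, [1, 0, 2, 0, 2, 1]),  # uncorrelated with the first: kept
    (900_000, [0, 1, 2, 2, 1, 0]),  # correlated with the first but outside the window: kept
]
kept_positions = greedy_ld_prune(variants)
print(kept_positions)
```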

In [5]:
# Remove correlated variants 
# Took 1hr & 15min to run 
pruned_mt = hl.ld_prune(filtered_mt.GT, r2=0.1, bp_window_size=500000) 

2022-10-13 12:33:04 Hail: INFO: ld_prune: running local pruning stage with max queue size of 62138 variants
2022-10-13 12:43:55 Hail: INFO: wrote table with 385170 rows in 50000 partitions to /tmp/d8EzVgMgCMYVOIVHLo3RFT
    Total size: 16.87 MiB
    * Rows: 16.87 MiB
    * Globals: 11.00 B
    * Smallest partition: 0 rows (21.00 B)
    * Largest partition:  48 rows (1.66 KiB)
2022-10-13 13:34:11 Hail: INFO: Wrote all 190 blocks of 385170 x 4120 matrix with block size 4096.
2022-10-13 13:45:45 Hail: INFO: wrote table with 725305 rows in 189 partitions to /tmp/ZQjOBeHXVqj9w4EJmAxEqs
    Total size: 14.75 MiB
    * Rows: 9.07 MiB
    * Globals: 5.67 MiB
    * Smallest partition: 1 rows (63.00 B)
    * Largest partition:  12950 rows (155.08 KiB)
2022-10-13 13:45:47 Hail: WARN: over 400,000 edges are in the graph; maximal_independent_set may run out of memory


In [6]:
filtered_pruned_mt = filtered_mt.filter_rows(hl.is_defined(pruned_mt[filtered_mt.row_key])) 

In [11]:
# Took ~13min to print 
print('Num of variants after LD pruning = ' + str(filtered_pruned_mt.count()[0])) 

Num of variants after LD pruning = 199974


Since the number of variants after this step is now in the ~100-300k range, we proceed to the PCA analysis without any more adjustments.

## 2c. Write out an intermediate file

The LD pruning step takes a long time to run (over an hour in our case), so we write out an intermediate file to avoid repeating it in the downstream analysis steps. This write-out step should take around 16 minutes to run.


If the user wishes to export their own intermediate file, they can do so by changing the intermediate file path. Once a file has been written out, the <code>overwrite</code> argument can be used to replace it with a new file or keep the original one.  

[Back to Index](#Index)

In [None]:
## Writing out an intermediate file to speed up subsequent analyses
## Took ~16min to run
#filtered_pruned_mt.write(ld_pruned_path, overwrite=False) 

In [3]:
# Read the intermediate file back in for subsequent analyses
filtered_pruned_mt = hl.read_matrix_table(ld_pruned_path) 

Initializing Hail with default parameters...
Running on Apache Spark version 3.1.3
SparkUI available at http://mty-m.c.diverse-pop-seq-ref.internal:34597
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.105-acd89e80c345
LOGGING: writing to /home/hail/hail-20221201-1403-0.2.105-acd89e80c345.log


# 3. Run PC Relate   

When doing Principal Component Analysis (PCA), we need to separate the related and unrelated samples before computing the PC scores and plotting them. If we compute PCA with the related samples in the dataset, the population structure and clustering will be affected by the relatedness among those samples. Thus, we first identify the related individuals by computing relatedness estimates (the kinship statistic in this case) using a variant of the PC-Relate method in Hail. After getting the sample ID pairs for the related samples, we separate the filtered and pruned matrix table into related and unrelated samples.

<br>
We computed the kinship statistic with the following <code>pc_relate</code> settings:
    <ul>
        <li>a minimum minor allele frequency filter of 0.05</li>
        <li>excluding sample-pairs with kinship less than 0.05</li>
        <li>20 principal components to control for population structure</li>
    </ul>

<br>    
<details><summary>For more information on relatedness click <u><span style="color:blue">here</span></u>.</summary>
    <ul>
        <li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4716688/">Paper</a></li>
        <li><a href="https://hail.is/docs/0.2/methods/relatedness.html#relatedness">Hail documentation</a></li>
    </ul>
</details>

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    <ul>
        <li><a href="https://hail.is/docs/0.2/methods/relatedness.html#hail.methods.pc_relate"> More on  <i> pc_relate() </i></a>
        </li>
        <li><a href="https://hail.is/docs/0.2/methods/misc.html#hail.methods.maximal_independent_set"> More on  <i> maximal_independent_set() </i></a>
        </li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# Compute kinship statistic
relatedness_ht = hl.pc_relate(
    filtered_pruned_mt.GT, 
    min_individual_maf=0.05, 
    min_kinship=0.05, 
    statistics='kin', 
    k=20).key_by() 

Since running <code>hl.maximal_independent_set</code> took ~2hr and 22min, we wrote out the result and read it back in. This lets subsequent runs execute faster and saves time when running through the tutorial.
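
To give an intuition for what a maximal independent set is, the sketch below treats each related pair as an edge in a graph and greedily keeps samples so that no two kept samples are related. The `greedy_independent_set` helper and the sample names are hypothetical, and this greedy scheme is only an illustration, not Hail's implementation. As far as we understand the Hail documentation, passing `keep=False` (the third argument in the call below) returns the removed samples rather than the kept ones, which is why the result is used as the list of related samples.

```python
from collections import defaultdict

def greedy_independent_set(pairs):
    """Return (kept, removed) sample sets for a list of related pairs."""
    neighbors = defaultdict(set)
    for i, j in pairs:
        neighbors[i].add(j)
        neighbors[j].add(i)

    kept = set()
    # Visit samples with few relatives first so highly connected samples are the ones dropped
    for sample in sorted(neighbors, key=lambda s: (len(neighbors[s]), s)):
        if not neighbors[sample] & kept:
            kept.add(sample)

    removed = set(neighbors) - kept
    return kept, removed

# Made-up related pairs: a trio and a sibling pair
pairs = [('child', 'mother'), ('child', 'father'), ('sib1', 'sib2')]
kept, removed = greedy_independent_set(pairs)
print(sorted(kept), sorted(removed))
```

In this toy example the child is dropped instead of both parents, so more samples remain available for PCA.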

In [None]:
## Identify closely related individuals in pairs (list of sample IDs) 
#related_sample_ids = hl.maximal_independent_set(relatedness_ht.i, relatedness_ht.j, False) # 721 samples

## Write out the sample IDs 
#related_sample_ids.write(related_sample_ids_path, overwrite=False)

In [4]:
# Read the list of related-sample IDs back in
related_sample_ids = hl.read_table(related_sample_ids_path)

# Subset the filtered and pruned mt to unrelated samples 
# Sample IDs that are NOT present in the list of related individuals  
unrelateds_mt_preoutlier = filtered_pruned_mt.filter_cols(hl.is_defined(related_sample_ids[filtered_pruned_mt.col_key]), keep=False) 

# Do the same as above but this time subset to related samples 
# Sample IDs that are present in the list of related individuals    
relateds_mt_preoutlier = filtered_pruned_mt.filter_cols(hl.is_defined(related_sample_ids[filtered_pruned_mt.col_key]), keep=True) 

2022-11-08 18:43:23 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'


# 4. PCA

PCA is run on the unrelated samples first. Then, the related samples are projected onto the PC space of the unrelated samples to get their PC scores. This way the population structure and clustering are not affected by the relatedness among samples.
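
The two steps can be sketched with NumPy on random toy genotypes. This is a simplified illustration under our own assumptions (no missing genotypes, no monomorphic variants, and the usual Hardy-Weinberg scaling of each variant), not the code that Hail's `hwe_normalized_pca` or gnomAD's `pc_project` actually runs. The key point it shows is that the related samples are normalized with the allele frequencies of the unrelated samples and then multiplied by the loadings learned from the unrelated samples.

```python
import numpy as np

def hwe_normalize(genotypes, alt_af):
    """Center and scale alt-allele counts (samples x variants) with the given allele frequencies."""
    n_variants = genotypes.shape[1]
    return (genotypes - 2 * alt_af) / np.sqrt(n_variants * 2 * alt_af * (1 - alt_af))

rng = np.random.default_rng(0)
k = 3  # number of PCs to keep (the tutorial uses 20)

# Made-up alt-allele counts: 40 unrelated and 5 related samples at 100 variants
unrelated_gt = rng.integers(0, 3, size=(40, 100)).astype(float)
related_gt = rng.integers(0, 3, size=(5, 100)).astype(float)

# Allele frequencies are estimated from the unrelated samples only
alt_af = unrelated_gt.mean(axis=0) / 2

# PCA on the unrelated samples via SVD of the normalized genotype matrix
x = hwe_normalize(unrelated_gt, alt_af)
u, s, vt = np.linalg.svd(x, full_matrices=False)
loadings = vt[:k].T              # variants x k
unrelated_scores = x @ loadings  # samples x k

# Projection: normalize with the SAME frequencies, then apply the SAME loadings
related_scores = hwe_normalize(related_gt, alt_af) @ loadings

print(unrelated_scores.shape, related_scores.shape)
```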

[Back to Index](#Index)

## 4a. Function to Run PCA on Unrelated Individuals

[Back to Index](#Index)

In [3]:
def run_pca(mt: hl.MatrixTable):
    """
    Runs PCA on a dataset
    :param mt: dataset to run PCA on
    :return: loadings and pc scores of unrelated samples 
    """

    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=20, compute_loadings=True)
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, 21)})
    return pca_loadings, pca_scores 

## 4b. Function to Project Related Individuals

**If running the cell below results in an error, double check that you used the  `--packages gnomad` argument when starting your cluster.**  
- See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

[Back to Index](#Index)

In [4]:
def project_individuals(project_mt, pca_loadings, unrel_scores, out_path: str, reg_name:str, outlier_status:str):
    """
    Project samples into predefined PCA space
    :param project_mt: matrix table of related samples to project 
    :param pca_loadings: existing PCA space of unrelated samples 
    :param unrel_scores: unrelated samples' PC scores
    :param out_path: path for where to save PCA projection outputs
    :param reg_name: region name for saving output purposes
    :param outlier_status: is the dataset with or without outliers? 
    """
    ht_projections = pc_project(project_mt, pca_loadings)  
    ht_projections = ht_projections.transmute(**{f'PC{i}': ht_projections.scores[i - 1] for i in range(1, 21)}) 
    scores = unrel_scores.union(ht_projections) # combine the pc scores from both the unrelateds and relateds 
    scores.export(out_path + reg_name + '_scores_' + outlier_status + '.txt.bgz') # write output for plotting    

## 4c. Global PCA

We run global PCA to see the population structure and clustering at a continental level and to contextualize the data globally.

[Back to Index](#Index)

In [7]:
# This block took 23min to run 

# Dictionaries to hold unrelateds' PCA loadings and scores
loadings_dict = {}
unrel_scores_dict = {}

# Run PCA on unrelated samples as whole
loadings_dict['GLOBAL'], unrel_scores_dict['GLOBAL'] = run_pca(unrelateds_mt_preoutlier)  


# Project related samples onto unrelated-samples' PC space 
project_individuals(relateds_mt_preoutlier, loadings_dict['GLOBAL'], unrel_scores_dict['GLOBAL'], pc_scores_with_outliers_path, 'GLOBAL', 'with_outliers')


2022-11-07 18:53:11 Hail: INFO: Coerced sorted dataset
2022-11-07 18:54:09 Hail: INFO: hwe_normalize: found 199974 variants after filtering out monomorphic sites.
2022-11-07 18:54:11 Hail: INFO: Coerced sorted dataset
2022-11-07 18:55:23 Hail: INFO: pca: running PCA with 20 components...
2022-11-07 19:13:26 Hail: INFO: Coerced sorted dataset
2022-11-07 19:13:58 Hail: INFO: Coerced sorted dataset
2022-11-07 19:13:59 Hail: INFO: Coerced sorted dataset
2022-11-07 19:16:18 Hail: INFO: Coerced sorted dataset
2022-11-07 19:16:19 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-11-07 19:16:19 Hail: INFO: Coerced sorted dataset
2022-11-07 19:16:20 Hail: INFO: merging 32 files totalling 332.7K...
2022-11-07 19:16:20 Hail: INFO: while writing:
    gs://hgdp-1kg/tutorial_datasets/pca/pc_scores_with_outliers/global_scores.txt.bgz
  merge time: 301.736ms


## 4d. Subcontinental PCA 

We run subcontinental PCA to see the population structure and clustering at a subcontinental level and to contextualize the data within continental regions. This also helped us identify outliers, which were removed later on.

When running the following section, the notebook might freeze or throw an error after running PCA for 3-4 regions. Thus, we ran it in groups of 3-4 regions at a time. If you want to run subcontinental PCA, we recommend doing the same.
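
One way to organize such batches is a small chunking helper, sketched below. The `batches` function is hypothetical and not part of the tutorial code. The region order here simply follows the run-time breakdown reported in the cell in this section, and in practice you would set `regions` to one batch per run of the PCA loop.

```python
def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

all_regions = ['EAS', 'AMR', 'CSA', 'EUR', 'AFR', 'OCE', 'MID']
region_batches = list(batches(all_regions, 3))

for group in region_batches:
    print(group)
```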

<br>

<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 

<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca"> More on <i> hwe_normalized_pca() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate"> More on <i> annotate() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute"> More on <i> transmute() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on <i> export() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on <i> pc_project() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on <i> collect() </i></a></li>
</ul>
    
</details>

[Back to Index](#Index)

In [7]:
# Run time breakdown for this cell is as follows:
# 1hr & 42 min for EAS, AMR, CSA
# 1hr & 23 min for EUR, AFR and OCE
# 34 min for MID

# dictionaries to hold unrelateds' PCA loadings and scores
loadings_dict = {}
unrel_scores_dict = {}
regions = unrelateds_mt_preoutlier['hgdp_tgp_meta']['genetic_region'].collect() 
regions = list(dict.fromkeys(regions)) # remove duplicates while preserving order
# There are 7 regions: EUR, AFR, AMR, EAS, CSA, OCE, and MID

# for each region, run PCA on the unrelated samples 
for i in regions:  
    if i is not None: # exclude a none value
        # filter the unrelateds per region
        subcont_unrelateds = unrelateds_mt_preoutlier.filter_cols(unrelateds_mt_preoutlier['hgdp_tgp_meta']['genetic_region'] == i) 

        # run PCA
        loadings_dict[i], unrel_scores_dict[i] = run_pca(subcont_unrelateds)

        # filter the related mt per region 
        subcont_relateds = relateds_mt_preoutlier.filter_cols(relateds_mt_preoutlier['hgdp_tgp_meta']['genetic_region'] == i)  

        # project related samples onto unrelated-samples' PC space 
        project_individuals(subcont_relateds, loadings_dict[i], unrel_scores_dict[i], pc_scores_with_outliers_path, i, 'with_outliers')


2022-11-08 18:43:37 Hail: INFO: Coerced sorted dataset
2022-11-08 18:44:39 Hail: INFO: hwe_normalize: found 198216 variants after filtering out monomorphic sites.
2022-11-08 18:44:41 Hail: INFO: Coerced sorted dataset
2022-11-08 18:46:01 Hail: INFO: pca: running PCA with 20 components...
2022-11-08 19:14:33 Hail: INFO: Coerced sorted dataset
2022-11-08 19:15:09 Hail: INFO: Coerced sorted dataset
2022-11-08 19:15:10 Hail: INFO: Coerced sorted dataset
2022-11-08 19:17:55 Hail: INFO: Coerced sorted dataset
2022-11-08 19:17:55 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-11-08 19:17:56 Hail: INFO: Coerced sorted dataset
2022-11-08 19:17:57 Hail: INFO: merging 32 files totalling 16.3K...
2022-11-08 19:17:57 Hail: INFO: while writing:
    gs://hgdp-1kg/tutorial_datasets/pca/pc_scores_with_outliers/MID_scores.txt.bgz
  merge time: 292.596ms


## 4e. PCA Plots

The following PCA plots were made before any outliers were removed.

[Back to Index](#Index)

In [63]:
# Read in gnomAD's HGDP+1kGP metadata for plotting 
metadata = hl.import_table(metadata_path, impute = True, key = 's')

# Dictionary mapping colors to region names 
cont_colors = {'AMR':"#E41A1C",
               'AFR':"#984EA3", 
               'OCE':"#999999",
               'CSA':"#FF7F00",
               'EAS':"#4DAF4A", 
               'EUR':"#377EB8", 
               'MID':"#A65628" }

2022-12-17 00:13:55.976 Hail: INFO: Reading table to impute column types
2022-12-17 00:13:57.828 Hail: INFO: Loading <StructExpression of type struct{s: str, `project_meta.sample_id`: str, `project_meta.research_project_key`: str, `project_meta.seq_project`: str, `project_meta.ccdg_alternate_sample_id`: str, `project_meta.ccdg_gender`: str, `project_meta.ccdg_center`: str, `project_meta.ccdg_study`: str, `project_meta.cram_path`: str, `project_meta.project_id`: str, `project_meta.v2_age`: str, `project_meta.v2_sex`: str, `project_meta.v2_hard_filters`: str, `project_meta.v2_perm_filters`: str, `project_meta.v2_pop_platform_filters`: str, `project_meta.v2_related`: str, `project_meta.v2_data_type`: str, `project_meta.v2_product`: str, `project_meta.v2_product_simplified`: str, `project_meta.v2_qc_platform`: str, `project_meta.v2_project_id`: str, `project_meta.v2_project_description`: str, `project_meta.v2_internal`: str, `project_meta.v2_investigator`: str, `project_meta.v2_known_pop`:

In [65]:
# Initialize dictionary to save final data files to
scores_with_outliers = {}

# Loop through each region to create a curated dataset for each
regions = ['GLOBAL', 'AFR', 'AMR', 'CSA', 'EAS', 'EUR', 'MID', 'OCE']

for region in regions:
    
    # Import PC score tables
    scores = hl.import_table(pc_scores_with_outliers_path + region + '_scores_with_outliers.txt.bgz', impute = True)
    
    # Add information from the metadata - genetic region and populations 
    scores = scores.annotate(
        global_pop = metadata[scores.s]['hgdp_tgp_meta.Genetic.region'], 
        subpop = metadata[scores.s]['hgdp_tgp_meta.Population'],
        global_color = metadata[scores.s]['hgdp_tgp_meta.Continent.colors'],
        subpop_color = metadata[scores.s]['hgdp_tgp_meta.Pop.colors'],
        subpop_shapes = metadata[scores.s]['hgdp_tgp_meta.Pop.shapes'],
        proj_title = metadata[scores.s]['hgdp_tgp_meta.Project'])

    # Save annotated table to dictionary 
    # For plotting, the score files can be accessed by indexing the dictionary using region names 
    scores_with_outliers[region] = scores

2022-12-17 00:14:54.602 Hail: INFO: Reading table to impute column types
2022-12-17 00:14:55.168 Hail: INFO: Finished type imputation
  Loading field 's' as type str (imputed)
  Loading field 'PC1' as type float64 (imputed)
  Loading field 'PC2' as type float64 (imputed)
  Loading field 'PC3' as type float64 (imputed)
  Loading field 'PC4' as type float64 (imputed)
  Loading field 'PC5' as type float64 (imputed)
  Loading field 'PC6' as type float64 (imputed)
  Loading field 'PC7' as type float64 (imputed)
  Loading field 'PC8' as type float64 (imputed)
  Loading field 'PC9' as type float64 (imputed)
  Loading field 'PC10' as type float64 (imputed)
  Loading field 'PC11' as type float64 (imputed)
  Loading field 'PC12' as type float64 (imputed)
  Loading field 'PC13' as type float64 (imputed)
  Loading field 'PC14' as type float64 (imputed)
  Loading field 'PC15' as type float64 (imputed)
  Loading field 'PC16' as type float64 (imputed)
  Loading field 'PC17' as type float64 (imputed)


### 4e-1. Global PCA Plots

[Back to Index](#Index)

In [67]:
# get annotated score table from dictionary 
global_with_outliers = scores_with_outliers['GLOBAL']

# CHMI_CHMI3_WGS2 is a sample added by gnomAD for QC purposes and thus doesn't have metadata information. 
# To avoid a "None" error, we have to remove it before plotting. 
# From the dataset itself, it is removed together with PCA outliers in section 5 below. 
global_with_outliers = global_with_outliers.filter(global_with_outliers.s == 'CHMI_CHMI3_WGS2', keep = False)

# Make plot
p = ggplot(global_with_outliers, aes(x = global_with_outliers.PC1, y = global_with_outliers.PC2))+ \
    geom_point(aes(color = global_with_outliers.global_pop,
                   shape = global_with_outliers.proj_title),
                   size = 3, alpha = .5) +\
    xlab("PC1") + \
    ylab("PC2") + \
    ggtitle("Global PCA With Outliers")+\
    labs(shape = 'Project', color = 'Population') +\
    scale_color_manual(values=cont_colors)

# Show plot
p.show()

2022-12-17 00:15:49.455 Hail: INFO: Coerced sorted dataset
2022-12-17 00:15:49.830 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:15:50.326 Hail: INFO: Coerced sorted dataset
2022-12-17 00:15:50.679 Hail: INFO: Ordering unsorted dataset with network shuffle


### 4e-2. Subcontinental PCA Plots

[Back to Index](#Index)

In [73]:
# Initialize dictionary to save each plot
plots_with_outliers = {}

for region in regions[1:]: # skip "GLOBAL" and only plot PCA for the 7 genetic regions 
    
    # Get the annotated score table for this region
    subcont_with_outliers = scores_with_outliers[region]

    # Only plotting PC1 vs PC2 but you can change the PC values or make a for loop to plot the rest of the PCs
    p = ggplot(subcont_with_outliers, aes(x=subcont_with_outliers.PC1, y=subcont_with_outliers.PC2)) + \
        geom_point(aes(color = subcont_with_outliers.subpop, 
                       shape = subcont_with_outliers.proj_title),
                       size = 3, alpha = .3) +\
        xlab("PC1") + \
        ylab("PC2") + \
        ggtitle(region + " PCA With Outliers")+\
        labs(shape = 'Project', color = 'Population')

    # Add plot to dictionary with the region name as its key 
    plots_with_outliers[region] = p

In [81]:
# Show subcontinental PC1 vs PC2 plots one by one 
for region in regions[1:]:
    plots_with_outliers[region].show()

2022-12-17 00:28:09.291 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:09.624 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:28:10.157 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:10.486 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:28:13.225 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:13.565 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:28:14.084 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:14.453 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:28:17.187 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:17.531 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:28:18.034 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:18.363 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:28:21.108 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:21.467 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:28:22.003 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:22.354 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:28:25.181 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:25.523 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:28:26.024 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:26.358 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:28:29.042 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:29.389 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:28:29.911 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:30.248 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:28:32.933 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:33.268 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:28:33.742 Hail: INFO: Coerced sorted dataset
2022-12-17 00:28:34.067 Hail: INFO: Ordering unsorted dataset with network shuffle


# 5. Outlier Removal

After [plotting the PCs](https://github.com/atgu/hgdp_tgp/blob/master/plot_pca.Rmd) using R, 24 outliers were identified. 

| sample ID | Genetic region | Population |
| --- | --- | --- |
| HG01880 | AFR | ACB |
| HG01881 | AFR | ACB |
| NA20274 | AFR | ASW |
| NA20299 | AFR | ASW |
| NA20314 | AFR | ASW |
| HGDP00013 | CSA | Brahui |
| HGDP00029 | CSA | Brahui |
| HGDP00057 | CSA | Balochi |
| HGDP00130 | CSA | Makrani |
| HGDP00150 | CSA | Makrani |
| HGDP00175 (This sample was discovered in the second PCA rerun) | CSA | Sindhi |
| HGDP01298 | EAS | Uygur |
| HGDP01300 | EAS | Uygur |
| HGDP01303 | EAS | Uygur |
| LP6005443-DNA_B02 | EAS | Uygur |
| HG01628 | EUR | IBS | 
| HG01629 | EUR | IBS | 
| HG01630 | EUR | IBS | 
| HG01694 | EUR | IBS | 
| HG01696 | EUR | IBS |
| HGDP00621 | MID | Bedouin |
| HGDP01270 | MID | Mozabite |
| HGDP01271 | MID | Mozabite |
| CHMI_CHMI3_WGS2 | No metadata information available for this sample | |

<details>
<summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
- <a href="https://hail.is/docs/0.2/utils/index.html#hail.utils.hadoop_open"> More on  <i> hl.utils.hadoop_open() </i></a>
    
- <a href="https://hail.is/docs/0.2/functions/core.html#hail.expr.functions.literal"> More on  <i> hl.literal() </i></a>
</details>

[Back to Index](#Index)
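
The outlier file is read as plain text with one sample ID per line. The sketch below mimics that logic locally with Python's built-in `open` instead of `hl.utils.hadoop_open`, and a plain list instead of a matrix table. The temporary file, the short ID lists, and the variable names are made up for illustration.

```python
import os
import tempfile

# A few IDs for illustration only; the real file lists all 24 outliers
outlier_ids = ['HG01880', 'HGDP00013', 'CHMI_CHMI3_WGS2']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pca_outliers.txt')

    # Write one sample ID per line
    with open(path, 'w') as f:
        f.write('\n'.join(outlier_ids) + '\n')

    # Read the IDs back, stripping the trailing newline from each line
    with open(path) as f:
        outliers = [line.rstrip('\n') for line in f]

# Keep only the samples whose ID is not in the outlier list
all_samples = ['HG01880', 'HG00096', 'HGDP00013', 'NA12878', 'CHMI_CHMI3_WGS2']
outlier_set = set(outliers)
kept_samples = [s for s in all_samples if s not in outlier_set]

print(len(all_samples), len(kept_samples))
```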

In [5]:
# Read in the filtered and pruned dataset if you have not already done so
filtered_pruned_mt = hl.read_matrix_table(ld_pruned_path)

# Read in the PCA outliers file into a list
with hl.utils.hadoop_open(outliers_path) as file: 
    outliers = [line.rstrip('\n') for line in file]
    
# Capture and broadcast the list as an expression
outliers_list = hl.literal(outliers)

# Remove the 24 outliers from the pruned dataset 
mt_without_outliers = filtered_pruned_mt.filter_cols(~outliers_list.contains(filtered_pruned_mt['s']))

# Validity check 
print('Before outlier removal: ' + str(filtered_pruned_mt.count()[1]))
print('After outlier removal: ' + str(mt_without_outliers.count()[1])) 
num_outliers = filtered_pruned_mt.count()[1] - mt_without_outliers.count()[1]
print('Total samples removed: ' + str(num_outliers))

Initializing Hail with default parameters...
Running on Apache Spark version 3.1.3
SparkUI available at http://mty-m.c.diverse-pop-seq-ref.internal:44735
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.105-acd89e80c345
LOGGING: writing to /home/hail/hail-20221201-1844-0.2.105-acd89e80c345.log


Before outlier removal: 4120
After outlier removal: 4096
Total samples removed: 24


# 6. Rerun PCA (without outliers)

**Before running the sections below make sure you have run sections 4a (PCA) and 4b (Projection) above.**

Here we are using the dataset without outliers and new paths for the outputs.

[Back to Index](#Index)

In [6]:
# Read the list of related-sample IDs back in
related_sample_ids = hl.read_table(related_sample_ids_path)

# Divide the new dataset [one without the 24 outliers] to unrelated and related samples 
unrelateds_mt_postoutlier = mt_without_outliers.filter_cols(hl.is_defined(related_sample_ids[mt_without_outliers.col_key]), keep=False) 
relateds_mt_postoutlier = mt_without_outliers.filter_cols(hl.is_defined(related_sample_ids[mt_without_outliers.col_key]), keep=True)

# Validity check 
print(unrelateds_mt_postoutlier.count()[1], relateds_mt_postoutlier.count()[1])

2022-12-01 18:44:57.123 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2022-12-01 18:45:03.836 Hail: INFO: Coerced sorted dataset
2022-12-01 18:45:07.670 Hail: INFO: Coerced sorted dataset


3378 718
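
Because every sample is either in the related-sample table or not, the two subsets should add up to the post-outlier total (3378 + 718 = 4096 here). The sketch below illustrates the split and that check in plain Python. The IDs are hypothetical, and `split_by_relatedness` is an illustrative helper, not a Hail function.

```python
def split_by_relatedness(sample_ids, related_ids):
    """Partition samples into (unrelated, related) by membership in related_ids."""
    related = set(related_ids)
    unrelated_samples = [s for s in sample_ids if s not in related]
    related_samples = [s for s in sample_ids if s in related]
    return unrelated_samples, related_samples


# Hypothetical IDs for illustration only
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
related_ids = ["S2", "S6"]

unrelateds, relateds = split_by_relatedness(samples, related_ids)
print(len(unrelateds), len(relateds))

# The two subsets must cover the dataset exactly once
assert len(unrelateds) + len(relateds) == len(samples)

# The same check with the counts reported in this tutorial
assert 3378 + 718 == 4096
```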


## 6a. Global PCA (without outliers)

[Back to Index](#Index)

In [11]:
# This cell took 20 min to run

# Dictionaries to hold unrelateds' PCA loadings and scores
loadings_dict = {}
unrel_scores_dict = {}

# Run PCA on the unrelated samples as a whole
loadings_dict['GLOBAL'], unrel_scores_dict['GLOBAL'] = run_pca(unrelateds_mt_postoutlier)  


# Project related samples onto unrelated-samples' PC space 
project_individuals(relateds_mt_postoutlier, loadings_dict['GLOBAL'], unrel_scores_dict['GLOBAL'], pc_scores_without_outliers_path, 'GLOBAL', 'without_outliers')


2022-12-01 14:07:37.841 Hail: INFO: Coerced sorted dataset
2022-12-01 14:08:40.390 Hail: INFO: hwe_normalize: found 199974 variants after filtering out monomorphic sites.
2022-12-01 14:08:42.405 Hail: INFO: Coerced sorted dataset
2022-12-01 14:09:25.796 Hail: INFO: pca: running PCA with 20 components...
2022-12-01 14:25:24.972 Hail: INFO: Coerced sorted dataset
2022-12-01 14:25:55.666 Hail: INFO: Coerced sorted dataset
2022-12-01 14:25:56.586 Hail: INFO: Coerced sorted dataset
2022-12-01 14:27:52.175 Hail: INFO: Coerced sorted dataset
2022-12-01 14:27:52.784 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-01 14:27:53.276 Hail: INFO: Coerced sorted dataset
2022-12-01 14:27:54.739 Hail: INFO: merging 33 files totalling 330.9K...
2022-12-01 14:27:55.033 Hail: INFO: while writing:
    gs://hgdp-1kg/tutorial_datasets/pca/pc_scores_without_outliers/GLOBAL_scores_without_outliers.txt.bgz
  merge time: 293.530ms
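
To give some intuition for the projection step, the sketch below computes PC scores for a single sample from precomputed loadings and allele frequencies. It assumes the Hardy-Weinberg-style genotype normalization that usually accompanies `hwe_normalized_pca`. It is a simplified illustration with made-up numbers, not the code of `project_individuals` or of Hail's `pc_project`, and the function name `project_sample` is hypothetical.

```python
import math


def project_sample(genotypes, loadings, allele_freqs):
    """Project one sample onto precomputed PCs.

    genotypes: alt-allele counts (0, 1, 2) or None if missing, one per variant
    loadings: one list of loadings per variant, with one value per PC
    allele_freqs: alt-allele frequency of each variant in the reference samples
    """
    n_variants = len(genotypes)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for gt, variant_loadings, af in zip(genotypes, loadings, allele_freqs):
        # Skip missing calls and monomorphic variants, where the scaling is undefined
        if gt is None or not 0.0 < af < 1.0:
            continue
        # Center by the expected allele count (2 * af) and scale by the
        # Hardy-Weinberg standard deviation times sqrt(number of variants)
        gt_norm = (gt - 2 * af) / math.sqrt(n_variants * 2 * af * (1 - af))
        for k in range(n_pcs):
            scores[k] += variant_loadings[k] * gt_norm
    return scores


# Made-up example: 3 variants, 2 PCs
genotypes = [0, 1, 2]
allele_freqs = [0.5, 0.5, 0.5]
loadings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = project_sample(genotypes, loadings, allele_freqs)
print(scores)
```

The point of the design is that related samples never influence the loadings: they are only scored against axes learned from the unrelated samples.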


## 6b. Subcontinental PCA (without outliers)

When running the following section, the notebook might freeze after printing the log for <code>EUR</code>, <code>AFR</code> and <code>AMR</code>. If this happens, do not restart it. Let it run and track progress by watching the output files appear at the specified output path.  

When complete, check that there are 21 total output files (3 for each region) in your specified output path.

Once you have confirmed you have the desired outputs, do the following:
<ol type="1">
<li> Save, close, and halt the current notebook</li>
<li> Open a new session</li>
<li> Proceed to the next step (run the <code>project_individuals</code> function first)</li>
</ol>
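
The "21 output files, 3 per region" check can be scripted once you have a listing of the output path. The sketch below is one way to do it in plain Python, assuming each file name starts with the region code followed by an underscore, as in `EUR_scores_without_outliers.txt.bgz`. Only that one file name appears in the logs of this tutorial; the other two suffixes in the example are placeholders, and both helper functions are hypothetical.

```python
from collections import Counter

REGIONS = ["EUR", "AFR", "AMR", "EAS", "CSA", "OCE", "MID"]


def count_outputs_by_region(filenames, regions=REGIONS):
    """Count files per region, using the text before the first underscore as the region."""
    counts = Counter()
    for name in filenames:
        basename = name.rsplit("/", 1)[-1]  # drop any directory part
        prefix = basename.split("_", 1)[0]
        if prefix in regions:  # ignores other files, e.g. the GLOBAL outputs
            counts[prefix] += 1
    return {region: counts[region] for region in regions}


def incomplete_regions(filenames, expected_per_region=3, regions=REGIONS):
    """Return the regions whose file count differs from the expected number."""
    counts = count_outputs_by_region(filenames, regions)
    return {r: n for r, n in counts.items() if n != expected_per_region}


# Hypothetical listing: "placeholder_a" and "placeholder_b" stand in for the
# two output files whose real names are not shown in this section
suffixes = ["scores_without_outliers.txt.bgz", "placeholder_a", "placeholder_b"]
listing = [f"{region}_{suffix}" for region in REGIONS for suffix in suffixes]
listing.append("GLOBAL_scores_without_outliers.txt.bgz")
listing.remove("MID_placeholder_b")  # simulate a run that has not finished MID yet

counts = count_outputs_by_region(listing)
print(sum(counts.values()), "regional files found")
print("Regions with an unexpected file count:", incomplete_regions(listing))
```

An empty dictionary from `incomplete_regions` means all seven regions have their three files.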

<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 

<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca"> More on <i> hwe_normalized_pca() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate"> More on <i> annotate() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute"> More on <i> transmute() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on <i> export() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on <i> pc_project() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on <i> collect() </i></a></li>
    </ul>
    
</details>

[Back to Index](#Index)

In [7]:
# Run time breakdown for this cell is as follows:
# 1hr & 40min for EAS, AMR, CSA, OCE
# 1hr & 42min for EUR, AFR, MID

# Dictionaries to hold unrelateds' PCA loadings and scores  
loadings_dict = {}
unrel_scores_dict = {}
regions = mt_without_outliers['hgdp_tgp_meta']['genetic_region'].collect() 
regions = list(dict.fromkeys(regions)) # drop duplicates while preserving order
# There are 7 regions: EUR, AFR, AMR, EAS, CSA, OCE, and MID

# For each region, run PCA on the unrelated samples 
for i in regions:  
    if i is not None: # skip the missing (None) region value
        # Filter the unrelateds per region
        subcont_unrelateds = unrelateds_mt_postoutlier.filter_cols(unrelateds_mt_postoutlier['hgdp_tgp_meta']['genetic_region'] == i) 

        # Run PCA
        loadings_dict[i], unrel_scores_dict[i] = run_pca(subcont_unrelateds)

        # Filter the related mt per region 
        subcont_relateds = relateds_mt_postoutlier.filter_cols(relateds_mt_postoutlier['hgdp_tgp_meta']['genetic_region'] == i)  

        # Project related samples onto unrelated-samples' PC space 
        project_individuals(subcont_relateds, loadings_dict[i], unrel_scores_dict[i], pc_scores_without_outliers_path, i, 'without_outliers')


2022-12-01 18:45:49.861 Hail: INFO: Coerced sorted dataset
2022-12-01 18:46:43.062 Hail: INFO: hwe_normalize: found 197194 variants after filtering out monomorphic sites.
2022-12-01 18:46:45.326 Hail: INFO: Coerced sorted dataset
2022-12-01 18:47:23.280 Hail: INFO: pca: running PCA with 20 components...
2022-12-01 19:24:10.158 Hail: INFO: Coerced sorted dataset
2022-12-01 19:24:54.950 Hail: INFO: Coerced sorted dataset
2022-12-01 19:24:55.757 Hail: INFO: Coerced sorted dataset
2022-12-01 19:27:18.510 Hail: INFO: Coerced sorted dataset
2022-12-01 19:27:19.166 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-01 19:27:19.610 Hail: INFO: Coerced sorted dataset
2022-12-01 19:27:20.793 Hail: INFO: merging 33 files totalling 65.4K...
2022-12-01 19:27:21.034 Hail: INFO: while writing:
    gs://hgdp-1kg/tutorial_datasets/pca/pc_scores_without_outliers/EUR_scores_without_outliers.txt.bgz
  merge time: 240.442ms
2022-12-01 19:27:22.221 Hail: INFO: Coerced sorted dataset
2022-12-
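
The region list in the cell above is built by collecting one region value per sample and then removing duplicates with `dict.fromkeys`, which keeps the first occurrence of each value in its original order. A short stand-alone illustration with a hypothetical collected list:

```python
# Hypothetical per-sample values, as a collect() call might return them
collected = ["EUR", "AFR", "EUR", None, "AMR", "AFR"]

# dict keys are unique and remember insertion order, so this deduplicates in place
unique_regions = list(dict.fromkeys(collected))
print(unique_regions)

# Drop the missing value up front instead of testing for it inside the loop
regions_to_run = [r for r in unique_regions if r is not None]
print(regions_to_run)
```

Unlike `set(collected)`, this keeps the regions in a stable order from run to run, which makes the logs easier to compare.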

## 6c. PCA Plots (without outliers)

The following PCA plots were generated after removing the outliers.

[Back to Index](#Index)

In [82]:
# Read in gnomAD's HGDP+1kGP metadata for plotting 
metadata = hl.import_table(metadata_path, impute = True, key = 's')

# Dictionary mapping colors to region names 
cont_colors = {'AMR':"#E41A1C",
               'AFR':"#984EA3", 
               'OCE':"#999999",
               'CSA':"#FF7F00",
               'EAS':"#4DAF4A", 
               'EUR':"#377EB8", 
               'MID':"#A65628" }

2022-12-17 00:35:34.052 Hail: INFO: Reading table to impute column types
2022-12-17 00:35:35.830 Hail: INFO: Loading <StructExpression of type struct{s: str, `project_meta.sample_id`: str, `project_meta.research_project_key`: str, `project_meta.seq_project`: str, `project_meta.ccdg_alternate_sample_id`: str, `project_meta.ccdg_gender`: str, `project_meta.ccdg_center`: str, `project_meta.ccdg_study`: str, `project_meta.cram_path`: str, `project_meta.project_id`: str, `project_meta.v2_age`: str, `project_meta.v2_sex`: str, `project_meta.v2_hard_filters`: str, `project_meta.v2_perm_filters`: str, `project_meta.v2_pop_platform_filters`: str, `project_meta.v2_related`: str, `project_meta.v2_data_type`: str, `project_meta.v2_product`: str, `project_meta.v2_product_simplified`: str, `project_meta.v2_qc_platform`: str, `project_meta.v2_project_id`: str, `project_meta.v2_project_description`: str, `project_meta.v2_internal`: str, `project_meta.v2_investigator`: str, `project_meta.v2_known_pop`:

In [83]:
# Initialize dictionary to save final data files to
scores_without_outliers = {}

# Loop through each region to create a curated dataset for each
regions = ['GLOBAL', 'AFR', 'AMR', 'CSA', 'EAS', 'EUR', 'MID', 'OCE']

for region in regions:
    
    # Import PC score tables
    scores = hl.import_table(pc_scores_without_outliers_path + region + '_scores_without_outliers.txt.bgz', impute = True)
    
    # Add information from the metadata - genetic region and populations 
    scores = scores.annotate(
        global_pop = metadata[scores.s]['hgdp_tgp_meta.Genetic.region'], 
        subpop = metadata[scores.s]['hgdp_tgp_meta.Population'],
        global_color = metadata[scores.s]['hgdp_tgp_meta.Continent.colors'],
        subpop_color = metadata[scores.s]['hgdp_tgp_meta.Pop.colors'],
        subpop_shapes = metadata[scores.s]['hgdp_tgp_meta.Pop.shapes'],
        proj_title = metadata[scores.s]['hgdp_tgp_meta.Project'])

    # Save annotated table to dictionary 
    # For plotting, the score files can be accessed by indexing the dictionary using region names 
    scores_without_outliers[region] = scores

2022-12-17 00:36:15.415 Hail: INFO: Reading table to impute column types
2022-12-17 00:36:15.990 Hail: INFO: Finished type imputation
  Loading field 's' as type str (imputed)
  Loading field 'PC1' as type float64 (imputed)
  Loading field 'PC2' as type float64 (imputed)
  Loading field 'PC3' as type float64 (imputed)
  Loading field 'PC4' as type float64 (imputed)
  Loading field 'PC5' as type float64 (imputed)
  Loading field 'PC6' as type float64 (imputed)
  Loading field 'PC7' as type float64 (imputed)
  Loading field 'PC8' as type float64 (imputed)
  Loading field 'PC9' as type float64 (imputed)
  Loading field 'PC10' as type float64 (imputed)
  Loading field 'PC11' as type float64 (imputed)
  Loading field 'PC12' as type float64 (imputed)
  Loading field 'PC13' as type float64 (imputed)
  Loading field 'PC14' as type float64 (imputed)
  Loading field 'PC15' as type float64 (imputed)
  Loading field 'PC16' as type float64 (imputed)
  Loading field 'PC17' as type float64 (imputed)


### 6c-1. Global PCA Plots (without outliers)

[Back to Index](#Index)

In [86]:
# Get the annotated score table from the dictionary 
global_without_outliers = scores_without_outliers['GLOBAL']

# CHMI_CHMI3_WGS2 is a sample added by gnomAD for QC purposes and thus doesn't have metadata information. 
# To avoid a "None" error, we have to remove it before plotting. 
# From the dataset itself, it is removed together with the PCA outliers in section 5 above. 
global_without_outliers = global_without_outliers.filter(global_without_outliers.s == 'CHMI_CHMI3_WGS2', keep = False)

# Make plot
p = ggplot(global_without_outliers, aes(x = global_without_outliers.PC1, y = global_without_outliers.PC2))+ \
    geom_point(aes(color = global_without_outliers.global_pop,
                   shape = global_without_outliers.proj_title),
                   size = 3, alpha = .5) +\
    xlab("PC1") + \
    ylab("PC2") + \
    ggtitle("Global PCA Without Outliers")+\
    labs(shape = 'Project', color = 'Population') +\
    scale_color_manual(values=cont_colors)

# Show plot
p.show()

2022-12-17 00:44:40.961 Hail: INFO: Coerced sorted dataset
2022-12-17 00:44:41.303 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:44:41.792 Hail: INFO: Coerced sorted dataset
2022-12-17 00:44:42.124 Hail: INFO: Ordering unsorted dataset with network shuffle


### 6c-2. Subcontinental PCA Plots (without outliers)

[Back to Index](#Index)


In [85]:
# Initialize dictionary to save each plot
plots_without_outliers = {}

for region in regions[1:]: # skip "GLOBAL" and only plot PCA for the 7 genetic regions 
    
    # Get the annotated score table for this region
    subcont_without_outliers = scores_without_outliers[region]

    # Only plotting PC1 vs PC2 but you can change the PC values or make a for loop to plot the rest of the PCs
    p = ggplot(subcont_without_outliers, aes(x=subcont_without_outliers.PC1, y=subcont_without_outliers.PC2)) + \
        geom_point(aes(color = subcont_without_outliers.subpop, 
                       shape = subcont_without_outliers.proj_title),
                       size = 3, alpha = .3) +\
        xlab("PC1") + \
        ylab("PC2") + \
        ggtitle(region + " PCA Without Outliers")+\
        labs(shape = 'Project', color = 'Population')

    # Add plot to dictionary with the region name as its key 
    plots_without_outliers[region] = p
    
# Show subcontinental PC1 vs PC2 plots one by one 
for region in regions[1:]:
    plots_without_outliers[region].show()

2022-12-17 00:43:36.282 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:36.646 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:43:37.182 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:37.543 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:43:40.313 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:40.655 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:43:41.137 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:41.527 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:43:44.337 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:44.674 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:43:45.180 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:45.514 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:43:48.370 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:48.716 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:43:49.210 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:49.546 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:43:52.367 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:52.729 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:43:53.252 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:53.594 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:43:56.367 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:56.700 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:43:57.162 Hail: INFO: Coerced sorted dataset
2022-12-17 00:43:57.480 Hail: INFO: Ordering unsorted dataset with network shuffle


2022-12-17 00:44:00.247 Hail: INFO: Coerced sorted dataset
2022-12-17 00:44:00.625 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-17 00:44:01.147 Hail: INFO: Coerced sorted dataset
2022-12-17 00:44:01.519 Hail: INFO: Ordering unsorted dataset with network shuffle
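
The plotting loop above draws only PC1 against PC2 and notes that the remaining PCs can be plotted with a loop. One way to generate the axis pairs for such a loop is sketched below, assuming the 20 components reported in the PCA log and the `PC1`, `PC2`, ... column names of the score files. The helper `consecutive_pc_pairs` is hypothetical.

```python
def consecutive_pc_pairs(n_pcs):
    """Return (x, y) column-name pairs: (PC1, PC2), (PC3, PC4), and so on.

    If n_pcs is odd, the last PC has no partner and is left out.
    """
    return [(f"PC{i}", f"PC{i + 1}") for i in range(1, n_pcs, 2)]


pairs = consecutive_pc_pairs(20)
for x_name, y_name in pairs:
    # In the notebook, these names would select the score columns and label the axes
    print(x_name, "vs", y_name)
```

Inside the plotting loop, each pair could replace the hard-coded `PC1` and `PC2` columns and the matching `xlab` and `ylab` strings.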


# 7. Write Out Matrix Table 
[Back to Index](#Index)

In [8]:
# Write out the MTs of unrelated and related samples (without outliers) separately; takes about 10 min to run 
# Unrelated MT
unrelateds_mt_postoutlier.write(unrelateds_mt_without_outliers_path, overwrite=False)

# Related MT
relateds_mt_postoutlier.write(relateds_mt_without_outliers_path, overwrite=False)


2022-11-18 19:56:07 Hail: INFO: Coerced sorted dataset
2022-11-18 20:01:34 Hail: INFO: wrote matrix table with 199974 rows and 3378 columns in 50000 partitions to gs://hgdp-1kg/tutorial_datasets/pca_results/unrelateds_without_outliers.mt
2022-11-18 20:01:37 Hail: INFO: Coerced sorted dataset
2022-11-18 20:06:09 Hail: INFO: wrote matrix table with 199974 rows and 718 columns in 50000 partitions to gs://hgdp-1kg/tutorial_datasets/pca_results/relateds_without_outliers.mt


### NOTE: The PCA plots shown above can also be easily plotted in R. Click [here](https://github.com/atgu/hgdp_tgp/blob/master/plot_pca.Rmd) for more information. 

[Back to Index](#Index)