## PCA and Ancestry Analyses

Author: Mary T. Yohannes

#### To run this tutorial, you need to have started your cluster with the <code>--packages gnomad</code> argument.

If you have not done this, you will need to shut down your current cluster and start a new one with the <code>--packages gnomad</code> argument.

See the tutorials README for more information on how to start a cluster.

# Index
1. [Setting Default Paths](#1.-Set-Default-Paths)
2. [Variant Filtering and LD Pruning](#2.-Variant-Filtering-and-LD-Pruning)
3. [Run PC Relate](#3.-Run-PC-Relate)
4. [PCA](#4.-PCA)
    1. [Function to Run PCA on Unrelated Individuals](#4a.-Function-to-Run-PCA-on-Unrelated-Individuals)
    2. [Function to Project Related Individuals](#4b.-Function-to-Project-Related-Individuals)
    3. [Global PCA](#4c.-Global-PCA)
    4. [Subcontinental PCA](#4d.-Subcontinental-PCA)
5. [Outlier Removal](#5.-Outlier-Removal)
6. [Rerun PCA](#6.-Rerun-PCA)
    1. [Global PCA](#6a.-Global-PCA-(without-outliers))
    2. [Subcontinental PCA](#6b.-Subcontinental-PCA-(without-outliers))
7. [Writing out Matrix Table](#7.-Write-Out-Matrix-Table)

# General Overview 
The purpose of this notebook is to further filter the post-QC matrix table to prepare it for LD pruning, compute relatedness, and run Principal Component Analysis (PCA).

**This script contains information on how to:**
- Read in a matrix table and run Hail common variant statistics 
- Filter using allele frequency and call rate
- Run LD pruning 
- Run relatedness and separate related and unrelated individuals
- Calculate PC scores and project samples onto a PC space
- Run global and subcontinental PCA and plot them 
- Remove PCA outliers (filter using sample IDs)
- Rerun global and subcontinental PCA
- Write out a matrix table 

In [None]:
import hail as hl

# import the gnomAD method needed to project individuals onto an existing PC space
from gnomad.sample_qc.ancestry import pc_project

# 1. Set Default Paths
These default paths can be edited by users as needed. 

By default, we have commented out all of the write steps in the tutorials. If you would like to write out your own datasets, uncomment those sections and replace the paths with your own. Don't forget to change the read-in paths as well.

[Back to Index](#Index)

In [None]:
# input file 
input_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/pre_running_varqc.mt'

# save the filtered and LD pruned mt as an intermediate file since LD pruning takes a while to rerun
intermediate_file_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt'

# paths for unrelated and related samples (prior to outlier identification and removal) 
unrel_preoutlier_path = 'gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt'
rel_preoutlier_path = 'gs://hgdp-1kg/hgdp_tgp/rel_updated.mt' 

# path for pre-outlier PCA results - global & subcontinental PCA 
pca_preoutlier_path = 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/'

# outliers file 
outliers_path = 'gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt'

# path for post-outlier PCA results - global & subcontinental PCA 
pca_postoutlier_path = 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/'

# final output paths for unrelated and related samples (post-outlier)
unrel_final_output = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt'
rel_final_output = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt'

# 2. Variant Filtering and LD Pruning
   
At this point, we have 155,648,020 SNVs. PCA does not need that many, and running it on all of them is computationally expensive, so we reduce the set to roughly 100-300k variants: we filter on allele frequency (<code>AF</code>) and missingness (<code>call rate</code>), then run LD pruning.

Linkage disequilibrium (LD) is the correlation between nearby variants such that the alleles at neighboring polymorphisms (observed on the same chromosome) are associated within a population more often than if they were unlinked.    
    
For more information on LD pruning click <a href=""> here </a>


<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc"> More on  <i> variant_qc() </i></a></li>

<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.ld_prune"> More on  <i> ld_prune() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# read-in the right input file 
mt_filt = hl.read_matrix_table(input_path)

### 2a. Variant Filtering 

[Back to Index](#Index)

In [None]:
# run Hail's common variant statistics (QC metrics) 
mt_var = hl.variant_qc(mt_filt) 

# filter to variants with AF between 0.05 & 0.95, and call rate greater than 0.999    
mt_var_filt = mt_var.filter_rows((mt_var.variant_qc.AF[0] > 0.05) & 
                                 (mt_var.variant_qc.AF[0] < 0.95) & 
                                 (mt_var.variant_qc.call_rate > 0.999))

# this line takes ~20 min to print 
print('Num of variants after filtering = ' + str(mt_var_filt.count()[0]))

We started with 155,648,020 SNVs, then after filtering on allele frequency and call rate, we ended with 6,787,034 SNVs.
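
To make the two filters concrete, here is a minimal pure-Python sketch on toy genotypes. It is an illustration only and does not use Hail; the function names and toy variants are hypothetical. Genotypes are coded as alternate-allele counts, with `None` for a missing call. The notebook filters on `AF[0]`, while the sketch uses the alternate-allele frequency; for a biallelic site the two sum to 1, so the symmetric 0.05-0.95 bounds select the same variants.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # alt-allele count (0, 1 or 2), or None for a missing call


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype."""
    return sum(c is not None for c in calls) / len(calls)


def alt_allele_frequency(calls: List[Genotype]) -> float:
    """Alternate-allele frequency among the called (diploid) genotypes."""
    called = [c for c in calls if c is not None]
    return sum(called) / (2 * len(called))


def passes_filters(calls: List[Genotype], min_af: float = 0.05,
                   max_af: float = 0.95, min_call_rate: float = 0.999) -> bool:
    """Apply the same thresholds as the notebook: 0.05 < AF < 0.95, call rate > 0.999."""
    if call_rate(calls) <= min_call_rate:
        return False
    return min_af < alt_allele_frequency(calls) < max_af


# toy data (hypothetical variants, 10 samples each)
variants: Dict[str, List[Genotype]] = {
    'common': [0, 1, 2, 1, 0, 1, 1, 0, 2, 1],            # AF = 0.45, call rate = 1.0
    'monomorphic': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],       # AF = 0.0, fails the AF filter
    'poorly_called': [0, 1, None, 1, 0, 1, 1, 0, 2, 1],  # call rate = 0.9, fails
}

kept = [name for name, calls in variants.items() if passes_filters(calls)]
print(kept)
```

Only the `common` variant survives: the monomorphic site carries no information for PCA, and the poorly called site fails the call-rate threshold.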

### 2b. LD Pruning

[Back to Index](#Index)

After filtering, we still have too many variants for PCA, and many of them are correlated with their neighbors rather than independent. We address both problems by pruning SNVs based on LD.

In [None]:
# remove correlated variants 
pruned = hl.ld_prune(mt_var_filt.GT, r2=0.1, bp_window_size=500000) # ~113 min to run  
mt_var_pru_filt = mt_var_filt.filter_rows(hl.is_defined(pruned[mt_var_filt.row_key])) 
print('Num of variants after LD pruning = ' + str(mt_var_pru_filt.count()[0])) # 248,634 SNVs

Since the number of variants after this step is now in the ~100-300k range, we proceed to the PCA analysis without any more adjustments.
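
The idea behind LD pruning can be sketched in a few lines of pure Python. This is a simplified greedy version for intuition only: Hail's `ld_prune` is more involved and is not reproduced here, and the function names and toy variants are hypothetical. The sketch walks along the variants by position and keeps one only if its r² with every already-kept variant inside the window is at most the threshold (0.1 and 500,000 bp, matching the arguments used above).

```python
from typing import Dict, List


def r_squared(x: List[int], y: List[int]) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant is uncorrelated with everything
    return cov * cov / (var_x * var_y)


def greedy_ld_prune(positions: Dict[str, int], genotypes: Dict[str, List[int]],
                    r2: float = 0.1, bp_window_size: int = 500_000) -> List[str]:
    """Keep a variant only if it is not in LD with any kept variant in the window."""
    kept: List[str] = []
    for name in sorted(positions, key=positions.get):
        in_ld = any(
            positions[name] - positions[k] <= bp_window_size
            and r_squared(genotypes[name], genotypes[k]) > r2
            for k in kept
        )
        if not in_ld:
            kept.append(name)
    return kept


# toy data on one chromosome (hypothetical variants, 6 samples)
positions = {'rs_a': 1_000, 'rs_b': 2_000, 'rs_c': 3_000, 'rs_d': 900_000}
genotypes = {
    'rs_a': [0, 1, 2, 0, 1, 2],
    'rs_b': [0, 1, 2, 0, 1, 2],  # identical to rs_a and nearby: pruned
    'rs_c': [1, 0, 1, 1, 0, 1],  # uncorrelated with rs_a: kept
    'rs_d': [0, 1, 2, 0, 1, 2],  # identical to rs_a but outside the window: kept
}

pruned_set = greedy_ld_prune(positions, genotypes)
print(pruned_set)
```

Note that `rs_d` is kept even though it is perfectly correlated with `rs_a`, because the comparison is only made within the base-pair window.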

### 2c. Write out an intermediate file

LD pruning takes a long time to run (~113 min here), so we write the filtered and pruned matrix table out as an intermediate file; downstream steps can then read it back in instead of repeating the pruning. The write itself takes around 23 minutes.


To export your own intermediate file, change the intermediate file path. If a file already exists at that path, <code>overwrite=True</code> replaces it, while <code>overwrite=False</code> keeps the original and makes the write fail.

[Back to Index](#Index)

In [None]:
## writing out an intermediate file to speed up subsequent analyses; takes ~23 min to run
# mt_var_pru_filt.write(intermediate_file_path, overwrite=False) 

# read the intermediate file back in for subsequent analyses
mt_var_pru_filt = hl.read_matrix_table(intermediate_file_path) 

# 3. Run PC Relate   

Before running Principal Component Analysis (PCA), we need to separate the related and unrelated samples. If related samples are included when the PCs are computed, the relatedness among them distorts the population structure and clustering. We therefore first identify the related individuals by computing relatedness estimates (the kinship statistic, in this case) with Hail's variant of the PC-Relate method. Once we have the sample ID pairs for the related samples, we split the filtered and pruned mt into relateds and unrelateds.

<br>
We computed the kinship statistic with the following <code>pc_relate</code> settings:
    <ul>
        <li>a minimum minor allele frequency filter of 0.05</li>
        <li>excluding sample-pairs with kinship less than 0.05</li>
        <li>20 principal components to control for population structure</li>
    </ul>

<br>    
<details><summary>For more information on relatedness click <u><span style="color:blue">here</span></u>.</summary>
    <ul>
        <li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4716688/">Paper</a></li>
        <li><a href="https://hail.is/docs/0.2/methods/relatedness.html#relatedness">Hail documentation</a></li>
    </ul>
</details>

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    <ul>
        <li><a href="https://hail.is/docs/0.2/methods/relatedness.html#hail.methods.pc_relate"> More on  <i> pc_relate() </i></a>
        </li>
        <li><a href="https://hail.is/docs/0.2/methods/misc.html#hail.methods.maximal_independent_set"> More on  <i> maximal_independent_set() </i></a>
        </li>
    </ul>
</details>
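
The step from related pairs to a list of samples to remove can be illustrated with a small pure-Python sketch. It uses a simple greedy heuristic (repeatedly drop the sample with the most remaining relatives) and is not the algorithm behind `hl.maximal_independent_set`; the function name and sample IDs are hypothetical. The returned set plays the role of the related-individuals list used in the code cell of this section: samples in the set go to the related mt, and all others go to the unrelated mt.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def samples_to_remove(pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Greedily pick samples to drop until no related pair remains."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for i, j in pairs:
        neighbours[i].add(j)
        neighbours[j].add(i)

    removed: Set[str] = set()
    while True:
        # number of remaining relatives for each sample that still has any
        degree = {s: len(n) for s, n in neighbours.items() if n}
        if not degree:
            break
        # drop the sample with the most relatives (ties broken by sample ID)
        worst = max(sorted(degree), key=degree.get)
        removed.add(worst)
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return removed


# toy kinship pairs (hypothetical sample IDs); S6 has no relatives
related_pairs = [('S1', 'S2'), ('S2', 'S3'), ('S4', 'S5')]
all_samples = ['S1', 'S2', 'S3', 'S4', 'S5', 'S6']

to_remove = samples_to_remove(related_pairs)
unrelated = [s for s in all_samples if s not in to_remove]
related = [s for s in all_samples if s in to_remove]
print(unrelated, related)
```

Dropping `S2` breaks two related pairs at once, which is why it is chosen over `S1` or `S3`; this keeps the unrelated set as large as the heuristic can manage.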

[Back to Index](#Index)

In [None]:
# compute kinship statistic
# takes ~4min to run
relatedness_ht = hl.pc_relate(
    mt_var_pru_filt.GT, 
    min_individual_maf=0.05, 
    min_kinship=0.05, 
    statistics='kin', 
    k=20).key_by() 

# identify closely related individuals in pairs (list of sample IDs) 
# takes ~2hr & 22min to run
related_samples_to_remove = hl.maximal_independent_set(relatedness_ht.i, relatedness_ht.j, False) 

# subset the filtered and pruned mt to samples that are NOT present in the list of related individuals  
mt_unrel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=False) 

# do the same as above but this time subset to samples that are present in the related-individuals list   
mt_rel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=True) 

In [None]:
# # write out the unrelated and related mts since they are used beyond this notebook in other analyses     
# # unrelated mt
# mt_unrel.write(unrel_preoutlier_path, overwrite=False) 

# # related mt 
# mt_rel.write(rel_preoutlier_path, overwrite=False)

In [None]:
# read the related and unrelated mts back in 
# unrelated mt
mt_unrel = hl.read_matrix_table(unrel_preoutlier_path)

# related mt 
mt_rel = hl.read_matrix_table(rel_preoutlier_path)

# 4. PCA

PCA is run on the unrelated samples first. Then, the related samples are projected onto the PC space of the unrelated samples to get their PC scores. This way, the population structure and clustering are not affected by the relatedness among samples.
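
As a rough illustration of what "projecting onto the PC space" means, the pure-Python sketch below computes PC scores for one sample from per-variant loadings and allele frequencies. It assumes an HWE-style normalization of the genotypes; the exact scaling used by gnomAD's `pc_project` is an assumption here and is not taken from this notebook. The function name and toy values are hypothetical.

```python
import math
from typing import Dict, List, Optional


def project_sample(genotypes: Dict[str, Optional[int]],
                   loadings: Dict[str, List[float]],
                   pca_af: Dict[str, float], k: int) -> List[float]:
    """Project one sample onto k PCs defined by per-variant loadings."""
    n_variants = len(loadings)
    scores = [0.0] * k
    for variant, loading in loadings.items():
        g = genotypes.get(variant)
        p = pca_af[variant]
        if g is None or p <= 0.0 or p >= 1.0:
            continue  # skip missing calls and monomorphic variants
        # center by the expected alt-allele count and scale (assumed HWE normalization)
        g_norm = (g - 2 * p) / math.sqrt(n_variants * 2 * p * (1 - p))
        for pc in range(k):
            scores[pc] += loading[pc] * g_norm
    return scores


# toy PC space with two variants and two PCs (hypothetical values)
loadings = {'v1': [1.0, 0.0], 'v2': [0.0, 1.0]}
pca_af = {'v1': 0.5, 'v2': 0.5}
sample_genotypes = {'v1': 2, 'v2': 1}

scores = project_sample(sample_genotypes, loadings, pca_af, k=2)
# name the columns PC1..PCk, as the notebook does with transmute()
labelled = {f'PC{i}': scores[i - 1] for i in range(1, len(scores) + 1)}
print(labelled)
```

The key point is that the loadings and allele frequencies come from the unrelated samples only, so the related samples receive scores without influencing the PC axes.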

[Back to Index](#Index)

### 4a. Function to Run PCA on Unrelated Individuals

[Back to Index](#Index)

In [None]:
def run_pca(mt: hl.MatrixTable, reg_name: str, out_path: str, overwrite: bool = False) -> None:
    """
    Runs PCA on a data set
    :param mt: data set to run PCA on
    :param reg_name: region name, used as a prefix for the output files
    :param out_path: path for where to save the outputs
    :param overwrite: whether to overwrite an existing loadings table
    :return: None; scores and loadings are written to out_path
    """

    # compute the top 20 PCs (eigenvalues, per-sample scores, per-variant loadings)
    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=20, compute_loadings=True)
    # annotate the loadings with the alternate allele frequency, which is used later for projection
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)
    # flatten the scores array into columns PC1..PC20
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, 21)})
    
    pca_scores.export(out_path + reg_name + '_scores.txt.bgz')  # save per-sample PC scores
    pca_loadings.write(out_path + reg_name + '_loadings.ht', overwrite)  # save PCA loadings

### 4b. Function to Project Related Individuals
If this function is not working, make sure you used the <code>--packages gnomad</code> argument when starting your cluster (as noted at the beginning of the notebook above).

[Back to Index](#Index)

In [None]:
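# pc_project comes from gnomAD (imported again here so this cell can be run on its own)
from gnomad.sample_qc.ancestry import pc_project

def project_individuals(pca_loadings: hl.Table, project_mt: hl.MatrixTable, reg_name: str, out_path: str, overwrite: bool = False) -> None:
    """
    Project samples into a predefined PCA space
    :param pca_loadings: loadings table of the existing PCA space of unrelated samples (including the pca_af field added in run_pca)
    :param project_mt: matrix table of related samples to project  
    :param reg_name: region name, used as a prefix for the output file
    :param out_path: path for where to save PCA projection outputs
    :param overwrite: currently unused (the export below takes no overwrite flag); kept so the call signature matches run_pca
    :return: None; the projected scores are exported to out_path
    """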
    ht_projections = pc_project(project_mt, pca_loadings)  
    ht_projections = ht_projections.transmute(**{f'PC{i}': ht_projections.scores[i - 1] for i in range(1, 21)}) 
    ht_projections.export(out_path + reg_name + '_projected_scores.txt.bgz') # save output   

### 4c. Global PCA

We run a global PCA to see the population structure and clustering on a continental level and to contextualize the data globally.

[Back to Index](#Index)

In [None]:
# run PCA on the unrelated samples
run_pca(mt_unrel, 'global', pca_preoutlier_path, False)  

# read in the PCA loadings of the unrelated samples
loadings = hl.read_table(pca_preoutlier_path+'global_loadings.ht') 

# project the related samples onto the unrelated-samples' PC space 
project_individuals(loadings, mt_rel, 'global', pca_preoutlier_path, False) 

### 4d. Subcontinental PCA 

We run a subcontinental PCA to see the population structure and clustering on a subcontinental level and to contextualize the data within continental regions. This also helped us identify outliers, which were removed later on.

When running the following section, the notebook might freeze after printing the log for <code>EUR</code>, <code>AFR</code> and <code>AMR</code>. If this happens, do not restart it. Let it run and follow the progress with the outputs being generated at the path indicated.  

When complete, check that there are 21 total output files (3 for each region) in your specified output path.
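
The check for 21 files can be made less error-prone with a small helper. The pure-Python sketch below builds the expected file names from the suffixes used in `run_pca` and `project_individuals` and reports anything missing from a listing; the helper names are hypothetical, and how you obtain the listing of your output path is left to you.

```python
from typing import Iterable, List

# suffixes written by run_pca (scores, loadings) and project_individuals (projected scores)
SUFFIXES = ('_scores.txt.bgz', '_loadings.ht', '_projected_scores.txt.bgz')


def expected_outputs(regions: Iterable[str]) -> List[str]:
    """File names expected under the subcontinental PCA output path."""
    return [region + suffix for region in regions for suffix in SUFFIXES]


def missing_outputs(regions: Iterable[str], present: Iterable[str]) -> List[str]:
    """Expected file names that do not appear in the listing."""
    present_set = set(present)
    return [name for name in expected_outputs(regions) if name not in present_set]


region_names = ['EUR', 'AFR', 'AMR', 'EAS', 'CSA', 'OCE', 'MID']
expected = expected_outputs(region_names)

# toy listing: pretend the projection for MID has not been written yet
listing = [name for name in expected if name != 'MID_projected_scores.txt.bgz']

print(len(expected))
print(missing_outputs(region_names, listing))
```

With 7 regions and 3 files per region, the expected list has 21 entries; an empty result from `missing_outputs` means the run is complete.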

Once you have confirmed you have the desired outputs, do the following:
<ol type="1">
<li> Save, close, and halt the current notebook</li>
<li> Open a new session</li>
<li> Proceed to the next step (run the <code>project_individuals</code> function first)</li>
</ol>

<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 

<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca"> More on <i> hwe_normalized_pca() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate"> More on <i> annotate() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute"> More on <i> transmute() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on <i> export() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on <i> pc_project() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on <i> collect() </i></a></li>
</ul>
    
</details>

[Back to Index](#Index)

In [None]:
# obtain a list of the continental regions in the data set (used the unrelated data set since it had more samples) 
regions = mt_unrel['hgdp_tgp_meta']['Genetic']['region'].collect()
regions = list(dict.fromkeys(regions)) # drop duplicates while keeping the first-seen order
# There are 7 regions: EUR, AFR, AMR, EAS, CSA, OCE, and MID

# set argument values for PCA 
subcont_pca_prefix = pca_preoutlier_path+'subcont_pca/' # path for outputs 
overwrite = False

# for each region, run PCA on the unrelated samples (~27min to run)
for i in regions:  
    # filter the unrelateds per region
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i) 
    
    # run PCA
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# for each region, project the related samples onto the unrelated-samples' PC space (~2min to run)
for i in regions:
    # read in the PCA loadings of the unrelated samples for each region 
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') 
    
    # filter the related mt per region 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  
    
    # project 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) 

# 5. Outlier Removal

After plotting the PCs using R (link_the_plotting_Rmarkdown_here), 22 outliers were identified. 

| sample ID | Genetic region | Population |
| --- | --- | --- |
| HG01880 | AFR | ACB |
| HG01881 | AFR | ACB |
| NA20274 | AFR | ASW |
| NA20299 | AFR | ASW |
| NA20314 | AFR | ASW |
| HGDP00013 | CSA | Brahui |
| HGDP00029 | CSA | Brahui |
| HGDP00057 | CSA | Balochi |
| HGDP00130 | CSA | Makrani |
| HGDP00150 | CSA | Makrani |
| HGDP01298 | EAS | Uygur |
| HGDP01303 | EAS | Uygur |
| HGDP01300 | EAS | Uygur |
| LP6005443-DNA_B02 | EAS | Uygur |
| HG01628 | EUR | IBS | 
| HG01629 | EUR | IBS | 
| HG01630 | EUR | IBS | 
| HG01694 | EUR | IBS | 
| HG01696 | EUR | IBS |
| HGDP00621 | MID | Bedouin |
| HGDP01270 | MID | Mozabite |
| HGDP01271 | MID | Mozabite |

<details>
<summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
- <a href="https://hail.is/docs/0.2/utils/index.html#hail.utils.hadoop_open"> More on  <i> hl.utils.hadoop_open() </i></a>
    
- <a href="https://hail.is/docs/0.2/functions/core.html#hail.expr.functions.literal"> More on  <i> hl.literal() </i></a>
</details>
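
The outlier removal itself is a plain membership filter on sample IDs. The pure-Python sketch below mirrors that logic on a toy in-memory file: it reads one ID per line (skipping blank lines) and drops those IDs from a sample list. The helper names and the `SAMPLE_A`/`SAMPLE_B` IDs are hypothetical placeholders; the outlier IDs are taken from the table above.

```python
import io
from typing import Iterable, List, TextIO


def read_sample_ids(handle: TextIO) -> List[str]:
    """Read one sample ID per line, ignoring blank lines and surrounding whitespace."""
    return [line.strip() for line in handle if line.strip()]


def drop_outliers(sample_ids: Iterable[str], outliers: Iterable[str]) -> List[str]:
    """Keep only the samples that are not in the outlier list."""
    outlier_set = set(outliers)
    return [s for s in sample_ids if s not in outlier_set]


# toy outliers file with a stray blank line
toy_file = io.StringIO('HG01880\nHG01881\n\nNA20274\n')
toy_outliers = read_sample_ids(toy_file)

# toy sample list: two outliers plus two placeholder samples
toy_samples = ['HG01880', 'SAMPLE_A', 'NA20274', 'SAMPLE_B']
toy_kept = drop_outliers(toy_samples, toy_outliers)

print(toy_outliers)
print(toy_kept)
print(len(toy_samples) - len(toy_kept))
```

Because each matrix table contains only a subset of the outliers, the number of samples removed from one table can be smaller than the length of the outlier list; it is the total across the related and unrelated tables that should match.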

[Back to Index](#Index)

In [None]:
# read in the unrelated and related mts to remove outliers and rerun pca  
mt_unrel_unfiltered = hl.read_matrix_table(unrel_preoutlier_path) # unrelated mt
mt_rel_unfiltered = hl.read_matrix_table(rel_preoutlier_path) # related mt

# read the outliers file into a list
with hl.utils.hadoop_open(outliers_path) as file: 
    outliers = [line.rstrip('\n') for line in file]
    
# capture and broadcast the list as an expression
outliers_list = hl.literal(outliers)

# remove the 22 outliers from both mts
mt_unrel = mt_unrel_unfiltered.filter_cols(~outliers_list.contains(mt_unrel_unfiltered['s']))
mt_rel = mt_rel_unfiltered.filter_cols(~outliers_list.contains(mt_rel_unfiltered['s']))

# validity check: count the samples (columns) once per matrix table
n_unrel_before = mt_unrel_unfiltered.count_cols()
n_unrel_after = mt_unrel.count_cols()
n_rel_before = mt_rel_unfiltered.count_cols()
n_rel_after = mt_rel.count_cols()

print('Unrelated: Before outlier removal ' + str(n_unrel_before) + 
      ' | After outlier removal ' + str(n_unrel_after))

print('Related: Before outlier removal ' + str(n_rel_before) + 
      ' | After outlier removal ' + str(n_rel_after)) 

# the total should equal the number of outliers (22)
num_outliers = (n_unrel_before - n_unrel_after) + (n_rel_before - n_rel_after)
print('Total samples removed = ' + str(num_outliers))

# 6. Rerun PCA

**Before running the sections below make sure you have run sections 4a (PCA) and 4b (projection) above.**

Here we are using the updated unrelated and related mts (outliers removed) and new paths for the outputs.

[Back to Index](#Index)

### 6a. Global PCA (without outliers)

[Back to Index](#Index)

In [None]:
# run PCA on the unrelated samples  
run_pca(mt_unrel, 'global', pca_postoutlier_path, False)

# read in the PCA loadings of the unrelated samples  
loadings = hl.read_table(pca_postoutlier_path+'global_loadings.ht') 

# project the related samples onto the unrelated-samples' PC space 
project_individuals(loadings, mt_rel, 'global', pca_postoutlier_path, False) 

### 6b. Subcontinental PCA (without outliers)

When running the following section, the notebook might freeze after printing the log for <code>EUR</code>, <code>AFR</code> and <code>AMR</code>. If this happens, do not restart it. Let it run and follow the progress with the outputs being generated at the path indicated.  

When complete, check that there are 21 total output files (3 for each region) in your specified output path.

Once you have confirmed you have the desired outputs, do the following:
<ol type="1">
<li> Save, close, and halt the current notebook</li>
<li> Open a new session</li>
<li> Proceed to the next step (run the <code>project_individuals</code> function first)</li>
</ol>

<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 

<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca"> More on <i> hwe_normalized_pca() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate"> More on <i> annotate() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute"> More on <i> transmute() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on <i> export() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on <i> pc_project() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on <i> collect() </i></a></li>
    </ul>
    
</details>

[Back to Index](#Index)

In [None]:
# set argument values for PCA 
subcont_pca_prefix = pca_postoutlier_path+'subcont_pca/' # path for outputs 
overwrite = False 

# for each region, run PCA on the unrelated samples (~26 min to run) 
# "regions" is a list containing the 7 continental regions in the data set from section 4d
for i in regions: 
    # filter the unrelateds per region
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  
    
    # run PCA
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# for each region, project the related samples onto the unrelated-samples' PC space (~3min to run)
for i in regions:
    # read in the PCA loadings of the unrelated samples for each region
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') 
    
    # filter the relateds per region 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  
    
    # project 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) 

# 7. Write Out Matrix Table 
[Back to Index](#Index)

In [None]:
# # write out mts of unrelated and related samples separately (post-outlier removal) 
# #unrelated mt
# mt_unrel.write(unrel_final_output, overwrite=False)

# #related mt
# mt_rel.write(rel_final_output, overwrite=False)