## PCA and Ancestry Analyses

Authors: Mary T. Yohannes and Ally Kim

**To run this tutorial, you need to have started your cluster with `--packages gnomad`.**

*If you have not done this, you will need to shut down your current cluster and start a new one with the `--packages gnomad` argument.* 

See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

# Index
1. [Set Default Paths](#1.-Set-Default-Paths)
2. [Read in Pre-QC Dataset and Apply Quality Control Filters](#2.-Read-in-Pre-QC-Dataset-and-Apply-Quality-Control-Filters)
3. [Variant Filtering and LD Pruning](#3.-Variant-Filtering-and-LD-Pruning)
4. [Estimate Kinship using KING-robust](#4.-Estimate-Kinship-using-KING-robust)
5. [Functions for PCA Analyses](#5.-Functions-for-PCA-Analyses)
    1. [Run PCA on Unrelated Individuals](#5.a.-Run-PCA-on-Unrelated-Individuals)
    2. [Project Related Individuals](#5.b.-Project-Related-Individuals)
    3. [Plot Functions](#5.c.-Plot-Functions)
6. [Run PCA with Outliers](#6.-Run-PCA-with-Outliers)
    1. [Run Global PCA and Plot](#6.a.-Run-Global-PCA-and-Plot)
    2. [Run Subcontinental PCA and Plot](#6.b.-Run-Subcontinental-PCA-and-Plot)
7. [Outliers Removal](#7.-Outliers-Removal)
8. [Rerun PCA Without Outliers](#8.-Rerun-PCA-Without-Outliers)
    1. [Rerun Global PCA and Plot](#8.a.-Rerun-Global-PCA-and-Plot)
    2. [Rerun Subcontinental PCA and Plot](#8.b.-Rerun-Subcontinental-PCA-and-Plot)
9. [Writing out Matrix Tables](#9.-Write-Out-Matrix-Tables)

# General Overview 
The purpose of this notebook is to further filter the post-QC matrix table to prepare it for LD pruning, compute relatedness, and run Principal Component Analysis (PCA).

**This script contains information on how to:**
- Read in a matrix table (shortened as mt) and filter it using a field within the matrix table and a function imported from an external library
- Run Hail common variant statistics and filter using allele frequency & call rate
- Run LD pruning 
- Run relatedness and separate related and unrelated individuals
- Set up functions to make redundant calculations concise
- Calculate PC scores and project samples onto a PC space  
- Run global and subcontinental PCA and plot them 
- Remove PCA outliers (filter using sample IDs)
- Write out a matrix table 

In [None]:
import hail as hl

# Functions from gnomAD library to apply genotype filters and project related samples  
from gnomad.utils.filtering import filter_to_adj
from gnomad.sample_qc.ancestry import pc_project

# For plotting in Hail
from hail.ggplot import *
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

In [None]:
# Initializing Hail 
hl.init()

In [None]:
# Allow output scrolling in Jupyter nb viewer for cells with long outputs 

from IPython.display import HTML

# Read the stylesheet and close the file handle once it has been read
with open('format.css') as css_file:
    css = css_file.read()
HTML('<style>{}</style>'.format(css))

# 1. Set Default Paths

These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets.

**By default, all of the dataset write out sections are shown as markdown cells. If you would like to write out your own dataset, you can copy the code and paste it into a new code cell. Don't forget to change the paths in the following cell accordingly and edit the ```overwrite``` argument if you are writing out a dataset more than once.** 

[Back to Index](#Index)

In [2]:
# Path for HGDP+1kGP dataset prior to applying gnomAD QC filters
pre_qc_path = 'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt'

# Path for gnomAD's HGDP+1kGP metadata for plotting 
metadata_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/metadata_and_qc/gnomad_meta_updated.tsv'

# Save the filtered and LD pruned mt as an intermediate file since LD pruning takes a while to rerun
ld_pruned_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/pca_preprocessing/ld_pruned.mt'

# Hail table of related sample IDs for separating unrelateds and relateds for PCA run 
related_sample_ids_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/pca_preprocessing/related_sample_ids.ht'

# Path for with-outliers PCA results - global & subcontinental PCA 
pc_scores_with_outliers_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/pca/pc_scores_with_outliers/'

# PCA outliers file 
outliers_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/pca/pca_outliers.txt'

# Path for without-outliers PCA results - global & subcontinental PCA 
pc_scores_without_outliers_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/pca/pc_scores_without_outliers/'

# Paths for unrelated and related datasets without outliers   
unrelateds_mt_without_outliers_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/pca_results/unrelateds_without_outliers.mt'
relateds_mt_without_outliers_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg_v2/pca_results/relateds_without_outliers.mt' 

# 2. Read in Pre-QC Dataset and Apply Quality Control Filters

Since the post-QC mt was not written out, we run the same function as tutorial notebook 1 to apply the quality control filters to the pre-QC dataset.

**To avoid errors, make sure to run the next two cells before running any code that includes the post-QC dataset.**
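
The variant filter in `run_qc` relies on `keep=False`, which is easy to misread. Below is a minimal plain-Python sketch of the same logic. The `filter_rows` helper and the three example rows are hypothetical stand-ins, not Hail objects, and the filter names are only illustrative. A row passes variant QC when its `filters` set is empty, so the condition "filters is non-empty" combined with `keep=False` leaves only the PASS rows.

```python
def filter_rows(rows, predicate, keep=True):
    """Toy stand-in for a row filter: keep (or drop) rows where predicate is True."""
    return [row for row in rows if predicate(row) == keep]

# Hypothetical example rows; an empty `filters` set means the variant passed QC
rows = [
    {'locus': 'chr1:100', 'filters': set()},
    {'locus': 'chr1:200', 'filters': {'AC0'}},
    {'locus': 'chr1:300', 'filters': {'AS_VQSR'}},
]

# Same shape as the notebook's call: the condition selects FAILING rows, keep=False drops them
pass_rows = filter_rows(rows, lambda row: len(row['filters']) != 0, keep=False)
pass_loci = [row['locus'] for row in pass_rows]
print(pass_loci)
```

An equivalent way to write the same filter is to test for an empty set and use the default `keep=True`.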

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_matrix_table"> More on  <i> read_matrix_table() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols"> More on  <i> filter_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_rows"> More on  <i> filter_rows() </i></a></li>
</ul>
</details>

[Back to Index](#Index)

In [4]:
# Set up function to:
# apply gnomAD's sample, variant and genotype QC filters
# remove two contaminated samples identified using CHARR - https://pubmed.ncbi.nlm.nih.gov/37425834/
# remove the gnomAD sample that's added for QC purposes
# only keep the variants which are found in the samples that are left 
# add gnomAD's HGDP+1kGP metadata with the updated population labels as a column field 

def run_qc(mt):
    
    ## Apply sample QC filters to dataset 
    # This filters to only samples that passed gnomAD's sample QC hard filters  
    mt = mt.filter_cols(~mt.gnomad_sample_filters.hard_filtered) # removed 31 samples
    
    ## Apply variant QC filters to dataset
    # This subsets to only PASS variants - those which passed gnomAD's variant QC
    # PASS variants have an empty filters field, so we drop every variant whose filters field is non-empty
    mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)
    
    # Remove the two contaminated samples identified by CHARR
    contaminated_samples = {'HGDP01371', 'LP6005441-DNA_A09'}
    contaminated_samples_list = hl.literal(contaminated_samples)
    mt = mt.filter_cols(~contaminated_samples_list.contains(mt['s']))
    
    # CHMI_CHMI3_WGS2 is a sample added by gnomAD for QC purposes and has no metadata info 
    mt = mt.filter_cols(mt.s == 'CHMI_CHMI3_WGS2', keep = False)

    # Only keep the variants which are found in the samples that are left 
    mt = mt.filter_rows(hl.agg.any(mt.GT.is_non_ref()))
    
    # Read in and add the metadata with the updated population labels as a column field 
    metadata = hl.import_table(metadata_path, impute = True, key = 's') 
    mt = mt.annotate_cols(meta_updated = metadata[mt.s])
    
    ## Apply genotype QC filters to the dataset
    # This is done using a function imported from gnomAD and is the last step in the QC process
    mt = filter_to_adj(mt)

    return mt

In [8]:
# Read in the HGDP+1kGP pre-QC mt
pre_qc_mt = hl.read_matrix_table(pre_qc_path)

# Run QC 
post_qc_mt = run_qc(pre_qc_mt)

# Validity check: number of variants and samples after applying QC filters
# Took ~1hr to print 
print('Num of SNVs and samples after applying QC filters = ' + str(post_qc_mt.count()))

2023-12-17 19:18:46.415 Hail: INFO: Reading table to impute column types
2023-12-17 19:18:51.744 Hail: INFO: Loading <StructExpression of type struct{s: str, `project_meta.sample_id`: str, `project_meta.research_project_key`: str, `project_meta.seq_project`: str, `project_meta.ccdg_alternate_sample_id`: str, `project_meta.ccdg_gender`: str, `project_meta.ccdg_center`: str, `project_meta.ccdg_study`: str, `project_meta.cram_path`: str, `project_meta.project_id`: str, `project_meta.v2_age`: str, `project_meta.v2_sex`: str, `project_meta.v2_hard_filters`: str, `project_meta.v2_perm_filters`: str, `project_meta.v2_pop_platform_filters`: str, `project_meta.v2_related`: str, `project_meta.v2_data_type`: str, `project_meta.v2_product`: str, `project_meta.v2_product_simplified`: str, `project_meta.v2_qc_platform`: str, `project_meta.v2_project_id`: str, `project_meta.v2_project_description`: str, `project_meta.v2_internal`: str, `project_meta.v2_investigator`: str, `project_meta.v2_kno

Num of SNVs and samples after applying QC filters = (159339147, 4117)


# 3. Variant Filtering and LD Pruning

At this point, we have <code>159,339,147 SNVs</code>. We want fewer variants (~100-300k) for PCA for computational efficiency, so we apply filters on: allele frequency (<code>AF</code>) and missingness (<code>call rate</code>), and then run LD pruning.  

Linkage disequilibrium (LD) is the correlation between nearby variants such that the alleles at neighboring polymorphisms (observed on the same chromosome) are associated within a population more often than if they were unlinked.    
    
For more information on LD pruning click <a href="https://www.nature.com/articles/nrg2361"> here</a>.
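
To make the idea of LD pruning concrete, here is a small plain-Python sketch. It measures LD as the squared Pearson correlation (r²) between two genotype vectors and then prunes greedily within a base-pair window. This is a simplified illustration, not Hail's algorithm: `r_squared`, `greedy_prune`, and the toy genotypes are hypothetical, and `hl.ld_prune` may keep a different set of variants.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (alt-allele counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no correlation signal
    return cov * cov / (var_x * var_y)

def greedy_prune(variants, r2_threshold=0.1, window_bp=500_000):
    """variants: list of (position, genotypes) sorted by position on one chromosome."""
    kept = []
    for pos, gts in variants:
        # Drop the variant if it is in LD with any already-kept variant inside the window
        in_ld = any(
            pos - kept_pos <= window_bp and r_squared(gts, kept_gts) > r2_threshold
            for kept_pos, kept_gts in kept
        )
        if not in_ld:
            kept.append((pos, gts))
    return [pos for pos, _ in kept]

toy_variants = [
    (100, [0, 1, 2, 1, 0, 2]),
    (200, [0, 1, 2, 1, 0, 2]),        # identical to the first variant, r2 = 1
    (300, [1, 0, 1, 2, 1, 1]),        # uncorrelated with the first variant
    (1_000_000, [0, 1, 2, 1, 0, 2]),  # correlated, but outside the 500 kb window
]
kept_positions = greedy_prune(toy_variants)
print(kept_positions)
```

The default arguments mirror the `r2=0.1` and `bp_window_size=500000` values used in this tutorial.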


<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc"> More on  <i> variant_qc() </i></a></li>

<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.ld_prune"> More on  <i> ld_prune() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

## 3a. Variant Filtering 

[Back to Index](#Index)

In [9]:
# Run Hail's common variant statistics (QC metrics) 
var_qc_mt = hl.variant_qc(post_qc_mt) 

# Filter to variants with AF between 0.05 & 0.95, and call rate greater than 0.999    
filtered_mt = var_qc_mt.filter_rows(((var_qc_mt.variant_qc.AF[0] > 0.05) & (var_qc_mt.variant_qc.AF[1] > 0.05)) &
                                 ((var_qc_mt.variant_qc.AF[0] < 0.95) & (var_qc_mt.variant_qc.AF[1] < 0.95)) &
                                 (var_qc_mt.variant_qc.call_rate > 0.999))
# Took ~13min to print 
print('Num of variants after filtering = ' + str(filtered_mt.count()[0])) 



Num of variants after filtering = 5241811


After filtering on allele frequency and call rate, the number of SNVs decreased from <code>159,339,147</code> to <code>5,241,811</code>.
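
For readers who want to see what the two filters measure, the following plain-Python sketch computes the alternate allele frequency and call rate of a single biallelic variant and applies the same thresholds. The helper names and toy genotypes are hypothetical; `hl.variant_qc` computes these (and more) for every variant. For a biallelic site the two allele frequencies sum to 1, so requiring both to lie between 0.05 and 0.95 is the same as bounding the alternate allele frequency.

```python
def variant_stats(genotypes):
    """genotypes: alt-allele counts (0, 1, 2) per sample, or None when the call is missing."""
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    # Each called diploid genotype contributes two alleles
    alt_af = sum(called) / (2 * len(called)) if called else None
    return alt_af, call_rate

def passes_filters(genotypes, min_af=0.05, max_af=0.95, min_call_rate=0.999):
    alt_af, call_rate = variant_stats(genotypes)
    if alt_af is None:
        return False
    return min_af < alt_af < max_af and call_rate > min_call_rate

common_variant = [0, 1, 2, 1, 0, 1, 1, 0, 2, 1]   # alt AF 0.45, no missing calls
rare_variant = [0] * 19 + [1]                      # alt AF 0.025
missing_variant = [0, 1, 2, None]                  # call rate 0.75

for name, gts in [('common', common_variant), ('rare', rare_variant), ('missing', missing_variant)]:
    print(name, variant_stats(gts), passes_filters(gts))
```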

## 3b. LD Pruning

We still have more variants than we need for PCA, and many of them are correlated with their neighbors rather than independent. We address both issues by pruning SNVs based on LD.

[Back to Index](#Index)

In [None]:
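# Remove correlated variants 
# Note: hl.ld_prune returns a Table of the variants to keep (not a MatrixTable, despite the variable name)
# Took ~2hrs to run 
pruned_mt = hl.ld_prune(filtered_mt.GT, r2=0.1, bp_window_size=500000) 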

In [10]:
filtered_pruned_mt = filtered_mt.filter_rows(hl.is_defined(pruned_mt[filtered_mt.row_key])) 

In [16]:
# Took ~13min to print 
print('Num of variants after LD pruning = ' + str(filtered_pruned_mt.count()[0])) 

Num of variants after LD pruning = 200403


Since the number of variants is now in the ~100-300k range, we proceed to the PCA analysis without any more adjustments.

The LD pruning step takes a non-negligible amount of time to run, so to keep the downstream analysis steps fast, we write out an intermediate file. This write-out step should take around 16 minutes to run.

If you wish to export your own intermediate file, you can do so by changing the intermediate file path. Once a file has been written out, set the <code>overwrite</code> argument to <code>True</code> to replace it with a new file, or leave it as <code>False</code> to keep the original one (in which case Hail raises an error instead of overwriting).  

- Write out an intermediate file to speed up subsequent analyses (took ~16min to run)

```python3
filtered_pruned_mt.write(ld_pruned_path, overwrite=False) 
```

[Back to Index](#Index)

In [None]:
# Read the intermediate file back in for subsequent analyses
filtered_pruned_mt = hl.read_matrix_table(ld_pruned_path) 

# 4. Estimate Kinship using KING-robust

When doing Principal Component Analysis (PCA), we need to separate the related and unrelated samples before computing the PC scores and plotting them. This is because if we compute PCA with the related samples in the dataset, the population structure and clustering will be affected by the relatedness that exists among those samples. Thus, we first have to estimate kinship using KING-robust (<code>hl.king</code> in Hail) to identify unrelated and related sets before running PCA. 
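
As a rough illustration of what the kinship coefficient measures, here is a plain-Python sketch of the KING-robust between-family estimator for a single pair of samples, written from our reading of the KING paper linked below. It is an assumption-laden toy: the function name is hypothetical, sites with a missing call in either sample are simply skipped, and `hl.king` remains the implementation used in this tutorial.

```python
def king_robust_kinship(g_i, g_j):
    """g_i, g_j: alt-allele counts (0, 1, 2) per variant for two samples; None = missing."""
    # Assumption: only sites called in both samples are used
    pairs = [(a, b) for a, b in zip(g_i, g_j) if a is not None and b is not None]
    het_i = sum(1 for a, _ in pairs if a == 1)
    het_j = sum(1 for _, b in pairs if b == 1)
    both_het = sum(1 for a, b in pairs if a == 1 and b == 1)
    opposite_hom = sum(1 for a, b in pairs if {a, b} == {0, 2})
    min_het = min(het_i, het_j)
    if min_het == 0:
        return None  # the estimator is undefined without heterozygous sites
    return ((both_het - 2 * opposite_hom) / (2 * min_het)
            + 0.5
            - (het_i + het_j) / (4 * min_het))

sample_a = [0, 1, 2, 1, 1, 0]
duplicate_phi = king_robust_kinship(sample_a, sample_a)
discordant_phi = king_robust_kinship([1, 1, 0, 2], [1, 0, 2, 0])
print(duplicate_phi, discordant_phi)
```

A duplicated sample gives the maximum value of 0.5, while pairs with many opposite homozygous calls give negative values. This tutorial treats pairs with a coefficient of at least 0.05 as related.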

<br>  
<details><summary>For more information on relatedness click <u><span style="color:blue">here</span></u>.</summary>
    <ul>
        <li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4716688/">Paper</a></li>
        <li><a href="https://hail.is/docs/0.2/methods/relatedness.html#relatedness">Hail documentation</a></li>
    </ul>
</details>

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    <ul>
        <li><a href="https://hail.is/docs/0.2/methods/relatedness.html#hail.methods.king"> More on  <i> king() </i></a>
        </li>
        <li><a href="https://hail.is/docs/0.2/methods/misc.html#hail.methods.maximal_independent_set"> More on  <i> maximal_independent_set() </i></a>
        </li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# Took ~2hrs to run
kinship = hl.king(filtered_pruned_mt.GT) 

In [None]:
# Reformat the kinship mt into a table with three columns 
# Exclude pairs of samples with kinship lower than 0.05 
relatedness_ht = kinship.filter_entries((kinship.s_1 != kinship.s) & (kinship.phi >= 0.05)).entries()

In [None]:
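# Use a maximal independent set over the related pairs to decide which samples to set aside as "related"
# keep=False returns the sample IDs to REMOVE, so that the samples left behind are unrelated to each other
related_sample_ids = hl.maximal_independent_set(relatedness_ht.s_1, relatedness_ht.s, False) # 698 samples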

- Write out sample IDs of related individuals  

```python3
related_sample_ids.write(related_sample_ids_path, overwrite=False)
```
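
The sketch below shows, in plain Python, how a list of related pairs can be turned into a "kept" set and a "removed" set. It uses a simple greedy strategy (visit samples from fewest to most relatives and keep a sample only if none of its relatives is already kept), which yields a maximal independent set but not necessarily the same one Hail returns. The function name and the toy pairs are hypothetical.

```python
def greedy_independent_set(pairs):
    """pairs: iterable of (sample_1, sample_2) tuples that are related to each other."""
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    kept = set()
    # Visit samples with the fewest relatives first; ties are broken by sample ID
    for node in sorted(neighbors, key=lambda n: (len(neighbors[n]), n)):
        if neighbors[node].isdisjoint(kept):
            kept.add(node)

    removed = set(neighbors) - kept  # analogous to what keep=False returns
    return kept, removed

related_pairs = [('A', 'B'), ('B', 'C'), ('D', 'E')]
kept, removed = greedy_independent_set(related_pairs)
print(sorted(kept), sorted(removed))
```

Samples that never appear in any related pair are not part of this graph at all; in the tutorial they end up in the unrelated set together with the kept samples.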

In [4]:
# Read the list of related sample IDs back in
related_sample_ids = hl.read_table(related_sample_ids_path)

# Subset the filtered and pruned mt to unrelated samples only 
# Sample IDs that are NOT present in the list of related individuals  
unrelateds_mt_preoutlier = filtered_pruned_mt.filter_cols(hl.is_defined(related_sample_ids[filtered_pruned_mt.col_key]), keep=False) 

# Do the same as above but this time subset to related samples only 
# Sample IDs that are present in the list of related individuals    
relateds_mt_preoutlier = filtered_pruned_mt.filter_cols(hl.is_defined(related_sample_ids[filtered_pruned_mt.col_key]), keep=True) 

In [5]:
# Print SNV and sample counts for each mt 

print('Num of SNVs and samples for unrelateds = ' + str(unrelateds_mt_preoutlier.count())) # (200403, 3419) 
print('Num of SNVs and samples for relateds = ' + str(relateds_mt_preoutlier.count())) # (200403, 698)



Num of SNVs and samples for unrelateds = (200403, 3419)
Num of SNVs and samples for relateds = (200403, 698)


# 5. Functions for PCA Analyses

PCA is run on the unrelated samples first. Then, the related samples are projected onto the PC space of the unrelated samples to get their PC scores. This way the population structure and clustering are not affected by the relatedness among samples.  
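
To illustrate what "projecting" a sample means, here is a plain-Python sketch for one sample. Each genotype is centered and scaled using the allele frequency estimated in the unrelated samples, then multiplied by that variant's loadings and summed per PC. The normalization shown reflects our understanding of gnomAD's `pc_project` and should be treated as an assumption; the function name and toy numbers are hypothetical.

```python
import math

def project_sample(genotypes, allele_freqs, loadings):
    """
    genotypes:    alt-allele counts (0, 1, 2) per variant for one sample; None = missing
    allele_freqs: alt allele frequency per variant, estimated in the unrelated samples
    loadings:     per-variant list of loadings, one value per PC
    """
    n_variants = len(genotypes)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for g, p, variant_loadings in zip(genotypes, allele_freqs, loadings):
        # Skip missing calls and monomorphic variants (no usable signal)
        if g is None or not 0.0 < p < 1.0:
            continue
        # Hardy-Weinberg style normalization (assumed form)
        g_norm = (g - 2 * p) / math.sqrt(n_variants * 2 * p * (1 - p))
        for k, loading in enumerate(variant_loadings):
            scores[k] += loading * g_norm
    return scores

toy_scores = project_sample(
    genotypes=[2, 0],
    allele_freqs=[0.5, 0.5],
    loadings=[[1.0, 0.0], [0.0, 1.0]],
)
print(toy_scores)
```

This is why `run_pca` stores `pca_af` alongside the loadings: the projection needs the allele frequencies from the unrelated samples, not from the samples being projected.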

[Back to Index](#Index)

## 5.a. Run PCA on Unrelated Individuals

[Back to Index](#Index)

In [12]:
def run_pca(mt: hl.MatrixTable):
    """
    Runs PCA on a dataset
    :param mt: dataset to run PCA on
    :return: loadings and pc scores of unrelated samples 
    """
    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=20, compute_loadings=True)
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, 21)})
    
    return pca_loadings, pca_scores 

## 5.b. Project Related Individuals

**If running the cell below results in an error, double check that you used the  `--packages gnomad` argument when starting your cluster.**  
- See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

[Back to Index](#Index)

In [13]:
def project_individuals(project_mt, pca_loadings, unrel_scores, out_path: str, reg_name:str, outlier_status:str):
    """
    Project samples into predefined PCA space
    :param project_mt: matrix table of related samples to project 
    :param pca_loadings: existing PCA space of unrelated samples 
    :param unrel_scores: unrelated samples' PC scores
    :param out_path: path for where to save PCA projection outputs
    :param reg_name: region name for saving output purposes
    :param outlier_status: is the dataset with or without outliers? 
    """
    ht_projections = pc_project(project_mt, pca_loadings)  
    ht_projections = ht_projections.transmute(**{f'PC{i}': ht_projections.scores[i - 1] for i in range(1, 21)}) 
    scores = unrel_scores.union(ht_projections) # combine the pc scores from both the unrelateds and relateds 
    scores.export(out_path + reg_name + '_scores_' + outlier_status + '.txt.bgz') # write output for plotting    
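
The plot functions in section 5.c find the score files purely by name, so the naming convention used by `project_individuals` matters. The plain-Python sketch below reproduces that convention. The helper name and the `gs://my-bucket/pca/` path are hypothetical, and the status check is an extra safeguard that is not part of the notebook.

```python
VALID_STATUSES = ('with_outliers', 'without_outliers')

def scores_file_path(out_path, reg_name, outlier_status):
    """Build the export path the same way project_individuals does."""
    if outlier_status not in VALID_STATUSES:
        raise ValueError('outlier_status must be one of ' + ', '.join(VALID_STATUSES))
    # out_path is concatenated as-is, so it needs to end with a slash
    return out_path + reg_name + '_scores_' + outlier_status + '.txt.bgz'

example_path = scores_file_path('gs://my-bucket/pca/', 'AFR', 'with_outliers')
print(example_path)
```

Passing the same `out_path` and `outlier_status` to the plot functions lets them locate the `GLOBAL` file and each regional file without any extra bookkeeping.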

## 5.c. Plot Functions

[Back to Index](#Index)

### Global PCA

In [17]:
def plot_global_pca(scores_path, outlier_status):
    """
    :param scores_path: general path where PC score files are located 
    :param outlier_status: are outlier samples present in the scores file (use "with_outliers" OR "without_outliers")
    """
    # Dictionary mapping region names to colors 
    cont_colors = {'AMR':"#E41A1C",
               'AFR':"#984EA3", 
               'OCE':"#999999",
               'CSA':"#FF7F00",
               'EAS':"#4DAF4A", 
               'EUR':"#377EB8", 
               'MID':"#A65628" }
    
    # Import global PC scores table 
    global_scores = hl.import_table(scores_path + 'GLOBAL_scores_' + outlier_status + '.txt.bgz', impute = True)

    # Add information from the metadata for plotting purposes 
    global_scores = global_scores.annotate(
        global_pop = metadata[global_scores.s]['hgdp_tgp_meta.Genetic.region'], 
        subpop = metadata[global_scores.s]['population'],
        global_color = metadata[global_scores.s]['hgdp_tgp_meta.Continent.colors'],
        subpop_color = metadata[global_scores.s]['hgdp_tgp_meta.Pop.colors'],
        subpop_shapes = metadata[global_scores.s]['hgdp_tgp_meta.Pop.shapes'],
        proj_title = metadata[global_scores.s]['hgdp_tgp_meta.Project'])

    # Make plot
    # Only plotting PC1 vs PC2 here but you can change the PC values OR make a for loop to plot the rest of the PCs
    p = ggplot(global_scores, aes(x = global_scores.PC1, y = global_scores.PC2))+ \
        geom_point(aes(color = global_scores.global_pop,
                       shape = global_scores.proj_title),
                       size = 3, alpha = .5) +\
        xlab("PC1") + \
        ylab("PC2") + \
        ggtitle("Global PCA " + outlier_status.replace('_', ' '))+\
        labs(shape = 'Project', color = 'Population') +\
        scale_color_manual(values=cont_colors)

    return p 

### Subcontinental PCA

[Back to Index](#Index)

In [18]:
def plot_subcont_pca(scores_path, outlier_status):
    """
    :param scores_path: general path where PC score files are located  
    :param outlier_status: are outlier samples present in the scores file (use "with_outliers" OR "without_outliers")   
    """
    # Initialize a dictionary to save the subcontinental PCA plots by their respective regions 
    pca_plots = {}

    # Loop through each subcontinental region 
    regions = ['AFR', 'AMR', 'CSA', 'EAS', 'EUR', 'MID', 'OCE']
 
    for region in regions:
        # Import PC scores table 
        subcont_scores = hl.import_table(scores_path + region + '_scores_' + outlier_status + '.txt.bgz', impute = True)

        # Add information from the metadata for plotting purposes 
        subcont_scores = subcont_scores.annotate(
            global_pop = metadata[subcont_scores.s]['hgdp_tgp_meta.Genetic.region'], 
            subpop = metadata[subcont_scores.s]['population'],
            global_color = metadata[subcont_scores.s]['hgdp_tgp_meta.Continent.colors'],
            subpop_color = metadata[subcont_scores.s]['hgdp_tgp_meta.Pop.colors'],
            subpop_shapes = metadata[subcont_scores.s]['hgdp_tgp_meta.Pop.shapes'],
            proj_title = metadata[subcont_scores.s]['hgdp_tgp_meta.Project'])

        # Make plot 
        # Only plotting PC1 vs PC2 here but you can change the PC values OR make a for loop to plot the rest of the PCs
        p = ggplot(subcont_scores, aes(x=subcont_scores.PC1, y=subcont_scores.PC2)) + \
            geom_point(aes(color = subcont_scores.subpop, 
                           shape = subcont_scores.proj_title),
                           size = 3, alpha = .3) +\
            xlab("PC1") + \
            ylab("PC2") + \
            ggtitle(region + " PCA " + outlier_status.replace('_', ' '))+\
            labs(shape = 'Project', color = 'Population')

        # Add plot to dictionary with the region name as its key 
        pca_plots[region] = p

    return pca_plots

# 6. Run PCA with Outliers

In this section, we calculate PCA globally and subcontinentally, and plot the results using the functions written in section 5 above **so make sure all functions are run beforehand**. The following PCA plots are prior to the removal of any outliers.

[Back to Index](#Index)

In [None]:
# Read in gnomAD's HGDP+1kGP metadata for plot annotation
metadata = hl.import_table(metadata_path, impute = True, key = 's')

## 6.a. Run Global PCA and Plot

We are doing this to see the population structure and clustering on a continental level and contextualize the data globally.    

[Back to Index](#Index)

### 6.a.1. Calculate PC scores 

In [None]:
# This block took ~1hr to run 

# Dictionaries to hold unrelateds' PCA loadings and scores
loadings_dict = {}
unrel_scores_dict = {}

# Run PCA on unrelated samples as a whole
loadings_dict['GLOBAL'], unrel_scores_dict['GLOBAL'] = run_pca(unrelateds_mt_preoutlier)  

# Project related samples onto unrelated-samples' PC space 
project_individuals(relateds_mt_preoutlier, loadings_dict['GLOBAL'], unrel_scores_dict['GLOBAL'], pc_scores_with_outliers_path, 'GLOBAL', 'with_outliers')

### 6.a.2. Plot 

In [19]:
# Plot PCA 
global_with_outliers = plot_global_pca(pc_scores_with_outliers_path, "with_outliers") 

# Show PC1 Vs PC2
global_with_outliers.show()

2024-07-09 19:50:52.326 Hail: INFO: Reading table to impute column types
2024-07-09 19:50:53.156 Hail: INFO: Finished type imputation
  Loading field 's' as type str (imputed)
  Loading field 'PC1' as type float64 (imputed)
  Loading field 'PC2' as type float64 (imputed)
  Loading field 'PC3' as type float64 (imputed)
  Loading field 'PC4' as type float64 (imputed)
  Loading field 'PC5' as type float64 (imputed)
  Loading field 'PC6' as type float64 (imputed)
  Loading field 'PC7' as type float64 (imputed)
  Loading field 'PC8' as type float64 (imputed)
  Loading field 'PC9' as type float64 (imputed)
  Loading field 'PC10' as type float64 (imputed)
  Loading field 'PC11' as type float64 (imputed)
  Loading field 'PC12' as type float64 (imputed)
  Loading field 'PC13' as type float64 (imputed)
  Loading field 'PC14' as type float64 (imputed)
  Loading field 'PC15' as type float64 (imputed)
  Loading field 'PC16' as type float64 (imputed)
  Loading field 'PC17' as type float64 (imputed)


## 6.b. Run Subcontinental PCA and Plot

We are doing this to see the population structure and clustering on a subcontinental level, and contextualize data within continental regions. This helped us identify outliers which are removed later on.     

**When running the following code cell, the notebook might freeze/throw an error after running PCA for 3-4 regions. Thus, we run it in groups of 3-4 regions at a time. If you want to run subcontinental PCA, we recommend doing that.**
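
One way to follow this recommendation is to split the region list into small batches and run the PCA loop on one batch per cell execution. The helper below is a hypothetical plain-Python sketch; the region order simply mirrors the run-time breakdown noted in the code cell of this section.

```python
def batch_regions(regions, batch_size=3):
    """Split a list of region names into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    return [regions[i:i + batch_size] for i in range(0, len(regions), batch_size)]

all_regions = ['EAS', 'AMR', 'CSA', 'EUR', 'AFR', 'OCE', 'MID']
batches = batch_regions(all_regions, batch_size=3)

for batch_number, batch in enumerate(batches, start=1):
    # In the notebook, you would run the per-region PCA loop over `batch` here
    print(batch_number, batch)
```

Because the scores for each region are exported to their own file, running the batches in separate executions does not affect the plotting step.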

<br>

<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 

<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca"> More on <i> hwe_normalized_pca() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate"> More on <i> annotate() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute"> More on <i> transmute() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on <i> export() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on <i> pc_project() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on <i> collect() </i></a></li>
</ul>
    
</details>

[Back to Index](#Index)

### 6.b.1. Calculate PC scores  

In [None]:
# Run time breakdown for this cell is as follows:
# 1hr & 42 min for EAS, AMR, CSA
# 1hr & 23 min for EUR, AFR and OCE
# 34 min for MID

# Dictionaries to hold unrelateds' PCA loadings and scores
loadings_dict = {}
unrel_scores_dict = {}
regions = unrelateds_mt_preoutlier['hgdp_tgp_meta']['genetic_region'].collect() 
regions = list(dict.fromkeys(regions)) # remove duplicates while preserving order
# There are 7 regions: EUR, AFR, AMR, EAS, CSA, OCE, and MID

# For each region, run PCA on the unrelated samples 
for i in regions:  
    if i is not None: # exclude a none value
        # Filter the unrelateds per region
        subcont_unrelateds = unrelateds_mt_preoutlier.filter_cols(unrelateds_mt_preoutlier['hgdp_tgp_meta']['genetic_region'] == i) 

        # Run PCA
        loadings_dict[i], unrel_scores_dict[i] = run_pca(subcont_unrelateds)

        # Filter the related mt per region 
        subcont_relateds = relateds_mt_preoutlier.filter_cols(relateds_mt_preoutlier['hgdp_tgp_meta']['genetic_region'] == i)  

        # Project related samples onto unrelated-samples' PC space 
        project_individuals(subcont_relateds, loadings_dict[i], unrel_scores_dict[i], pc_scores_with_outliers_path, i, 'with_outliers')


### 6.b.2. Plot

In [20]:
# Plot PCA
subcont_with_outliers = plot_subcont_pca(pc_scores_with_outliers_path, "with_outliers") 

# Show subcontinental PC1 Vs PC2 plots one by one 
for region in ['AFR', 'AMR', 'CSA', 'EAS', 'EUR', 'MID', 'OCE']:
    subcont_with_outliers[region].show()
    
# If you are only interested in one subcontinental region, you can do the following. Using AFR as an example:
subcont_with_outliers["AFR"].show()

2024-07-09 19:53:00.286 Hail: INFO: Reading table to impute column types
2024-07-09 19:53:00.976 Hail: INFO: Finished type imputation
  Loading field 's' as type str (imputed)
  Loading field 'PC1' as type float64 (imputed)
  Loading field 'PC2' as type float64 (imputed)
  Loading field 'PC3' as type float64 (imputed)
  Loading field 'PC4' as type float64 (imputed)
  Loading field 'PC5' as type float64 (imputed)
  Loading field 'PC6' as type float64 (imputed)
  Loading field 'PC7' as type float64 (imputed)
  Loading field 'PC8' as type float64 (imputed)
  Loading field 'PC9' as type float64 (imputed)
  Loading field 'PC10' as type float64 (imputed)
  Loading field 'PC11' as type float64 (imputed)
  Loading field 'PC12' as type float64 (imputed)
  Loading field 'PC13' as type float64 (imputed)
  Loading field 'PC14' as type float64 (imputed)
  Loading field 'PC15' as type float64 (imputed)
  Loading field 'PC16' as type float64 (imputed)
  Loading field 'PC17' as type float64 (imputed)


2024-07-09 19:53:23.730 Hail: INFO: Ordering unsorted dataset with network shuffle


2024-07-09 19:53:27.216 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:27.715 Hail: INFO: Ordering unsorted dataset with network shuffle
2024-07-09 19:53:28.596 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:29.074 Hail: INFO: Ordering unsorted dataset with network shuffle


2024-07-09 19:53:32.544 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:33.060 Hail: INFO: Ordering unsorted dataset with network shuffle
2024-07-09 19:53:33.879 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:34.350 Hail: INFO: Ordering unsorted dataset with network shuffle


2024-07-09 19:53:37.814 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:38.318 Hail: INFO: Ordering unsorted dataset with network shuffle
2024-07-09 19:53:39.131 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:39.646 Hail: INFO: Ordering unsorted dataset with network shuffle


2024-07-09 19:53:42.974 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:43.487 Hail: INFO: Ordering unsorted dataset with network shuffle
2024-07-09 19:53:44.361 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:44.832 Hail: INFO: Ordering unsorted dataset with network shuffle


2024-07-09 19:53:48.079 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:48.563 Hail: INFO: Ordering unsorted dataset with network shuffle
2024-07-09 19:53:49.352 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:49.827 Hail: INFO: Ordering unsorted dataset with network shuffle


2024-07-09 19:53:53.062 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:53.508 Hail: INFO: Ordering unsorted dataset with network shuffle
2024-07-09 19:53:54.300 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:54.843 Hail: INFO: Ordering unsorted dataset with network shuffle


2024-07-09 19:53:57.927 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:58.408 Hail: INFO: Ordering unsorted dataset with network shuffle
2024-07-09 19:53:59.215 Hail: INFO: Coerced sorted dataset
2024-07-09 19:53:59.656 Hail: INFO: Ordering unsorted dataset with network shuffle


# 7. Outliers Removal

After [plotting the PCs](https://github.com/atgu/hgdp_tgp/blob/master/plot_pca.Rmd) using R, 23 outliers were identified. 

[Back to Index](#Index)

| sample ID | Genetic region | Population |
| --- | --- | --- |
| HG01880 | AFR | ACB |
| HG01881 | AFR | ACB |
| NA20274 | AFR | ASW |
| NA20299 | AFR | ASW |
| NA20314 | AFR | ASW |
| HGDP00013 | CSA | Brahui |
| HGDP00029 | CSA | Brahui |
| HGDP00057 | CSA | Balochi |
| HGDP00130 | CSA | Makrani |
| HGDP00150 | CSA | Makrani |
| HGDP00175 | CSA | Sindhi |
| HGDP01298 | EAS | Uygur |
| HGDP01300 | EAS | Uygur |
| HGDP01303 | EAS | Uygur |
| LP6005443-DNA_B02 | EAS | Uygur |
| HG01628 | EUR | IBS | 
| HG01629 | EUR | IBS | 
| HG01630 | EUR | IBS | 
| HG01694 | EUR | IBS | 
| HG01696 | EUR | IBS |
| HGDP00621 | MID | Bedouin |
| HGDP01270 | MID | Mozabite |
| HGDP01271 | MID | Mozabite |
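
The removal step itself is a set-membership filter on sample IDs. Here is a plain-Python sketch of that logic, including the same before/after validity check used in the code cell of this section. The helper name is hypothetical, and the short sample list is a toy example rather than the real dataset.

```python
def remove_outliers(sample_ids, outlier_lines):
    """Drop every sample whose ID appears in the lines of an outliers file."""
    # Strip newlines and ignore blank lines, e.g. a trailing empty line in the file
    outliers = {line.strip() for line in outlier_lines if line.strip()}
    kept = [s for s in sample_ids if s not in outliers]
    return kept, outliers

toy_samples = ['HG01880', 'HG01881', 'HG00096', 'HGDP00013', 'NA12878']
toy_outlier_lines = ['HG01880\n', 'HG01881\n', 'HGDP00013\n', '\n']

kept_samples, outlier_ids = remove_outliers(toy_samples, toy_outlier_lines)

# Validity check: the drop in sample count should equal the number of outliers
n_removed = len(toy_samples) - len(kept_samples)
print(kept_samples, n_removed)
```

If the number removed is smaller than the number of IDs in the file, some outlier IDs were not present in the dataset, which is worth investigating before moving on.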

<details>
<summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
- <a href="https://hail.is/docs/0.2/utils/index.html#hail.utils.hadoop_open">More on <i>hl.utils.hadoop_open()</i></a>
    
- <a href="https://hail.is/docs/0.2/functions/core.html#hail.expr.functions.literal">More on <i>hl.literal()</i></a>
</details>

[Back to Index](#Index)

In [21]:
# Read in the filtered and pruned dataset if this has not been done already
filtered_pruned_mt = hl.read_matrix_table(ld_pruned_path)

# Read in the PCA outliers file into a list
with hl.utils.hadoop_open(outliers_path) as file: 
    outliers = [line.rstrip('\n') for line in file]
    
# Capture and broadcast the list as an expression
outliers_list = hl.literal(outliers)

# Remove the 23 outliers from the pruned dataset 
mt_without_outliers = filtered_pruned_mt.filter_cols(~outliers_list.contains(filtered_pruned_mt['s']))

# Sanity check: count samples once per dataset, since each count is a separate pass over the data
n_before = filtered_pruned_mt.count_cols()
n_after = mt_without_outliers.count_cols()
print('Before outlier removal: ' + str(n_before))
print('After outlier removal: ' + str(n_after))
num_outliers = n_before - n_after
print('Total samples removed: ' + str(num_outliers))

Before outlier removal: 4117
After outlier removal: 4094
Total samples removed: 23
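
The count check above confirms how many samples were dropped, but not which IDs matched. As a minimal sketch in plain Python (no Hail required), the helper below compares the lines read from an outliers file against a list of sample IDs and reports IDs that are absent from the dataset or listed more than once. The function name `check_outlier_ids` and the toy IDs (including the made-up `HG99999`) are hypothetical and not part of the original notebook. In the notebook, the sample IDs could presumably be obtained with something like `filtered_pruned_mt.s.collect()`.

```python
from collections import Counter


def check_outlier_ids(outlier_ids, sample_ids):
    """Return (missing, duplicated) outlier IDs relative to the dataset's sample IDs."""
    # Strip whitespace and ignore blank lines, e.g. an empty last line in the file
    cleaned = [x.strip() for x in outlier_ids if x.strip()]
    sample_set = set(sample_ids)
    # IDs listed as outliers that do not occur in the dataset at all
    missing = sorted(set(cleaned) - sample_set)
    # IDs that appear more than once in the outliers file
    duplicated = sorted(k for k, n in Counter(cleaned).items() if n > 1)
    return missing, duplicated


# Toy example with a handful of IDs (HG99999 is made up for illustration)
samples = ['HG01880', 'HG01881', 'NA20274', 'HGDP00013']
outlier_lines = ['HG01880', 'NA20274', 'NA20274', 'HG99999', '']

missing, duplicated = check_outlier_ids(outlier_lines, samples)
print('Not found in dataset:', missing)
print('Listed more than once:', duplicated)
```

If both lists come back empty, the number of samples removed should equal the number of lines in the outliers file.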


# 8. Rerun PCA Without Outliers

**Before running this section, make sure to run all functions in section 5 above.**

Here we use the dataset without outliers and write the outputs to new paths.

[Back to Index](#Index)

In [22]:
# Read the list of related-sample IDs back in
related_sample_ids = hl.read_table(related_sample_ids_path)

# Divide the new dataset (the one without the 23 outliers) into unrelated and related samples 
unrelateds_without_outliers = mt_without_outliers.filter_cols(hl.is_defined(related_sample_ids[mt_without_outliers.col_key]), keep=False) 
relateds_without_outliers = mt_without_outliers.filter_cols(hl.is_defined(related_sample_ids[mt_without_outliers.col_key]), keep=True)

# Validity check 
print(unrelateds_without_outliers.count()[1], relateds_without_outliers.count()[1])

3400 694


## 8.a. Rerun Global PCA and Plot

[Back to Index](#Index)

### 8.a.1. Calculate PC scores  

In [None]:
# This cell took 20 min to run

# Dictionaries to hold unrelateds' PCA loadings and scores
loadings_dict = {}
unrel_scores_dict = {}

# Run PCA on unrelated samples as a whole  
loadings_dict['GLOBAL'], unrel_scores_dict['GLOBAL'] = run_pca(unrelateds_without_outliers)  

# Project related samples onto unrelated-samples' PC space 
project_individuals(relateds_without_outliers, loadings_dict['GLOBAL'], unrel_scores_dict['GLOBAL'], pc_scores_without_outliers_path, 'GLOBAL', 'without_outliers')

### 8.a.2. Plot

In [24]:
# Plot PCA 
global_without_outliers = plot_global_pca(pc_scores_without_outliers_path, "without_outliers") 

# Show PC1 vs. PC2
global_without_outliers.show()

2024-07-09 19:56:17.764 Hail: INFO: Reading table to impute column types
2024-07-09 19:56:18.496 Hail: INFO: Finished type imputation
  Loading field 's' as type str (imputed)
  Loading field 'PC1' as type float64 (imputed)
  Loading field 'PC2' as type float64 (imputed)
  Loading field 'PC3' as type float64 (imputed)
  Loading field 'PC4' as type float64 (imputed)
  Loading field 'PC5' as type float64 (imputed)
  Loading field 'PC6' as type float64 (imputed)
  Loading field 'PC7' as type float64 (imputed)
  Loading field 'PC8' as type float64 (imputed)
  Loading field 'PC9' as type float64 (imputed)
  Loading field 'PC10' as type float64 (imputed)
  Loading field 'PC11' as type float64 (imputed)
  Loading field 'PC12' as type float64 (imputed)
  Loading field 'PC13' as type float64 (imputed)
  Loading field 'PC14' as type float64 (imputed)
  Loading field 'PC15' as type float64 (imputed)
  Loading field 'PC16' as type float64 (imputed)
  Loading field 'PC17' as type float64 (imputed)


## 8.b. Rerun Subcontinental PCA and Plot

**The notebook might freeze or throw an error after running PCA for 3-4 regions in the following code cell. We therefore ran it in batches of 3-4 regions at a time, and we recommend doing the same if you want to run subcontinental PCA.**



<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 

<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca">More on <i>hwe_normalized_pca()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows">More on <i>annotate_rows()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate">More on <i>annotate()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute">More on <i>transmute()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export">More on <i>export()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project">More on <i>pc_project()</i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect">More on <i>collect()</i></a></li>
    </ul>
    
</details>

[Back to Index](#Index)

### 8.b.1. Calculate PC scores 

In [None]:
# Run time breakdown for this cell is as follows:
# 1hr & 40min for EAS, AMR, CSA, OCE
# 1hr & 42min for EUR, AFR, MID

# Dictionaries to hold unrelateds' PCA loadings and scores  
loadings_dict = {}
unrel_scores_dict = {}
regions = mt_without_outliers['hgdp_tgp_meta']['genetic_region'].collect() 
regions = list(dict.fromkeys(regions)) # drop duplicates while preserving order
# There are 7 regions: EUR, AFR, AMR, EAS, CSA, OCE, and MID

# For each region, run PCA on the unrelated samples 
for i in regions:  
    if i is not None: # skip the missing (None) region label
        # Filter the unrelateds per region
        subcont_unrelateds = unrelateds_without_outliers.filter_cols(unrelateds_without_outliers['hgdp_tgp_meta']['genetic_region'] == i) 

        # Run PCA
        loadings_dict[i], unrel_scores_dict[i] = run_pca(subcont_unrelateds)

        # Filter the related mt per region 
        subcont_relateds = relateds_without_outliers.filter_cols(relateds_without_outliers['hgdp_tgp_meta']['genetic_region'] == i)  

        # Project related samples onto unrelated-samples' PC space 
        project_individuals(subcont_relateds, loadings_dict[i], unrel_scores_dict[i], pc_scores_without_outliers_path, i, 'without_outliers')
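
Because the cell above was run in groups of 3-4 regions, it can help to split the region list into batches up front and process one batch per run. The following is a minimal sketch in plain Python. The helper name `batch_regions` and the batch size of 4 are assumptions for illustration, and the composition of each batch depends on the order in which regions first appear, so it will not necessarily match the grouping used for the timings above.

```python
def batch_regions(regions, batch_size):
    """Split region labels into consecutive batches, skipping missing values."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Drop duplicates (keeping first-seen order) and the None label
    unique = [r for r in dict.fromkeys(regions) if r is not None]
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


# Toy input mimicking a collected per-sample region column (with repeats and a None)
regions = ['EUR', 'AFR', None, 'AMR', 'EAS', 'EUR', 'CSA', 'OCE', 'MID']
batches = batch_regions(regions, 4)

for batch in batches:
    # In the notebook, the per-region PCA and projection steps would run here
    # for each region in `batch`, one batch per notebook run.
    print(batch)
```

With this toy input, the seven regions are split into one batch of four and one batch of three.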


### 8.b.2. Plot 

In [25]:
# Plot PCA
subcont_without_outliers = plot_subcont_pca(pc_scores_without_outliers_path, "without_outliers") 

# Show subcontinental PC1 vs. PC2 plots one by one 
for region in ['AFR', 'AMR', 'CSA', 'EAS', 'EUR', 'MID', 'OCE']:
    subcont_without_outliers[region].show()
    
# If you are only interested in one subcontinental region, you can do the following. Using AFR as an example:
subcont_without_outliers["AFR"].show()

2024-07-09 19:57:20.867 Hail: INFO: Reading table to impute column types
2024-07-09 19:57:21.444 Hail: INFO: Finished type imputation
  Loading field 's' as type str (imputed)
  Loading field 'PC1' as type float64 (imputed)
  Loading field 'PC2' as type float64 (imputed)
  Loading field 'PC3' as type float64 (imputed)
  Loading field 'PC4' as type float64 (imputed)
  Loading field 'PC5' as type float64 (imputed)
  Loading field 'PC6' as type float64 (imputed)
  Loading field 'PC7' as type float64 (imputed)
  Loading field 'PC8' as type float64 (imputed)
  Loading field 'PC9' as type float64 (imputed)
  Loading field 'PC10' as type float64 (imputed)
  Loading field 'PC11' as type float64 (imputed)
  Loading field 'PC12' as type float64 (imputed)
  Loading field 'PC13' as type float64 (imputed)
  Loading field 'PC14' as type float64 (imputed)
  Loading field 'PC15' as type float64 (imputed)
  Loading field 'PC16' as type float64 (imputed)
  Loading field 'PC17' as type float64 (imputed)




# 9. Write Out Matrix Tables

[Back to Index](#Index)

- Write out the matrix tables of unrelated and related samples (both without outliers) separately. This takes about 10 min to run.

``` python3
# Unrelated mt
unrelateds_without_outliers.write(unrelateds_mt_without_outliers_path, overwrite=False)

# Related mt
relateds_without_outliers.write(relateds_mt_without_outliers_path, overwrite=False)
```

### NOTE: The PCA plots shown above can also be easily plotted in R with better resolution. Click [here](https://github.com/atgu/hgdp_tgp/blob/master/figure_generation/plot_pca.Rmd) for more information. 

In [6]:
# Sample and variant count after removing PCA outliers from the post-QC mt (before variant filtering and LD pruning)

# Read in the PCA outliers file into a list
with hl.utils.hadoop_open(outliers_path) as file: 
    outliers = [line.rstrip('\n') for line in file]
    
# Capture and broadcast the list as an expression
outliers_list = hl.literal(outliers)

# Remove the 23 outliers from the post-QC mt dataset 
post_qc_mt_without_outliers = post_qc_mt.filter_cols(~outliers_list.contains(post_qc_mt['s']))

# Keep only the variants that are non-reference in at least one of the remaining samples 
p_o_mt = post_qc_mt_without_outliers.filter_rows(hl.agg.any(post_qc_mt_without_outliers.GT.is_non_ref()))

print(p_o_mt.count())

(153894851, 4094)
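
To illustrate why the variant count can drop once samples are removed, here is a toy sketch in plain Python of the same two-step logic: drop the outlier columns, then keep only variants with at least one non-reference call among the remaining samples. The data layout (a dict mapping each variant to per-sample alternate-allele counts, with `None` for a missing call), the function name, and all IDs are hypothetical. Missing calls are assumed not to count as non-reference.

```python
def drop_samples_and_empty_variants(genotypes, samples, to_remove):
    """Remove samples, then drop variants with no non-reference call left.

    genotypes: dict of variant -> list of alt-allele counts, one per sample
               (None marks a missing call).
    """
    removed = set(to_remove)
    keep_idx = [i for i, s in enumerate(samples) if s not in removed]
    kept_samples = [samples[i] for i in keep_idx]

    kept_variants = {}
    for variant, calls in genotypes.items():
        remaining = [calls[i] for i in keep_idx]
        # Keep the variant only if some remaining sample carries an alternate allele
        if any(c is not None and c > 0 for c in remaining):
            kept_variants[variant] = remaining
    return kept_variants, kept_samples


samples = ['s1', 's2', 's3']
genotypes = {
    'var_a': [0, 1, 0],      # alternate allele carried by s2
    'var_b': [0, 0, 2],      # alternate allele carried only by s3
    'var_c': [None, 0, 0],   # never non-reference
}

kept, kept_samples = drop_samples_and_empty_variants(genotypes, samples, ['s3'])
print((len(kept), len(kept_samples)))
```

Here `var_b` disappears because its only carrier was removed, and `var_c` is dropped because no sample is non-reference.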





[Back to Index](#Index)