Notebook 4: filtering to common independent SNPs, relatedness, PCA, joint calling with a new dataset, and applying a random forest (RF) to the new dataset. Analyses that need to be run:
1. Work with Lindo:
    - Joint calling with GGV, sample QC 
    - Use gnomAD RF (subset to variants in RF model) - doesn’t need VQSR 
    - Intersect HGDP+1kG+GGV, build an RF with 1kG+HGDP, apply it to a new dataset (GGV) - doesn’t need VQSR 
2. PCA plots - Ally 
    - already implemented in R, just need to plot it in Hail

## Index
- [General Overview](#General-Overview)
- [Variant Filter and LD Pruning](#Variant-Filter-and-LD-Pruning)
- [Run PC Relate](#Run-PC-Relate)
- [PCA](#PCA)
    - [Function to Run PCA on Unrelated Individuals](#Function-to-Run-PCA-on-Unrelated-Individuals)
    - [Function to Project Related Individuals](#Function-to-Project-Related-Individuals)
    - [Global PCA](#Global-PCA)
    - [Subcontinental PCA](#Subcontinental-PCA)
- [Outlier Removal](#Outlier-Removal)
- [Rerun PCA](#Rerun-PCA)
    - [Global PCA](#Global-PCA)
    - [Subcontinental PCA](#Subcontinental-PCA)
- [Write Out Matrix Table](#Write-Out-Matrix-Table)

# General Overview 
The purpose of this notebook is to further filter the matrix table obtained from notebook 3, run relatedness and PCA, joint call with a new dataset, and apply RF. It contains steps on how to:

- Read in a matrix table and run Hail common variant statistics  
- Filter using allele frequency and call rate
- Run LD pruning 
- Run relatedness and separate related and unrelated individuals
- Calculate PC scores and project samples onto an existing PC space  
- Run global and subcontinental PCA and plot them 
- Remove PCA outliers (filter using sample IDs)
- Joint call with a new data set
- Build/apply RF
- Write out a matrix table 

Author: Mary T. Yohannes

In [None]:
# import hail
import hail as hl

# Variant Filter and LD Pruning

In [None]:
# read-in the right intermediate file 
mt_filt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/intermediate_files/pre_running_varqc.mt')

In [None]:
# run common variant statistics (quality control metrics) - more info https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc  
mt_var = hl.variant_qc(mt_filt) 

In [None]:
# trying to get down to ~100-300k SNPs - might need to change values later accordingly  
# AF: allele freq and call_rate: fraction of calls neither missing nor filtered
# mt_var.variant_qc.AF[0] is the first element of the AF array in the variant_qc row field, i.e. the reference allele frequency  
mt_var_filt = mt_var.filter_rows((mt_var.variant_qc.AF[0] > 0.05) & (mt_var.variant_qc.AF[0] < 0.95) & (mt_var.variant_qc.call_rate > 0.999))
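
To make the thresholds above concrete, here is a minimal pure-Python sketch of the same per-variant test on toy data. It does not use Hail: `passes_filter` is a hypothetical helper, genotypes are coded as alternate-allele counts at a biallelic site, and `None` stands for a missing call.

```python
def passes_filter(genotypes, min_af=0.05, max_af=0.95, min_call_rate=0.999):
    """Return True if a toy variant passes the AF and call-rate thresholds."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False  # no calls at all, so nothing to estimate
    call_rate = len(called) / len(genotypes)   # fraction of non-missing calls
    alt_af = sum(called) / (2 * len(called))   # alternate allele frequency
    ref_af = 1 - alt_af                        # plays the role of variant_qc.AF[0]
    return min_af < ref_af < max_af and call_rate > min_call_rate


common_variant = [0, 1, 2, 1, 0, 1, 0, 0, 1, 2]   # ref AF = 0.6, fully called
rare_variant = [0] * 99 + [1]                     # ref AF = 0.995, too rare
poorly_called = [0, 1, None, 2]                   # call rate = 0.75

print(passes_filter(common_variant), passes_filter(rare_variant), passes_filter(poorly_called))
```

Because the bounds are symmetric (0.05 and 0.95), filtering on the reference allele frequency keeps the same biallelic variants as filtering on the alternate allele frequency would.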

In [None]:
# ~20min to run 
mt_var_filt.count() # started with 155648020 snps and ended up with 6787034 snps 

In [None]:
# LD pruning (~113 min to run) 
pruned = hl.ld_prune(mt_var_filt.GT, r2=0.1, bp_window_size=500000) 

In [None]:
# subset data even further   
mt_var_pru_filt = mt_var_filt.filter_rows(hl.is_defined(pruned[mt_var_filt.row_key])) 
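
For intuition about what the `r2` and `bp_window_size` arguments control, the following is a simplified greedy pruning sketch in plain Python. It is an illustration only and is not how `hl.ld_prune` is implemented; `r_squared` and `greedy_prune` are hypothetical helpers, and the input is assumed to be a position-sorted list of variants from a single chromosome with no missing calls.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic variant: correlation is undefined, treat as 0
    return cov * cov / (var_x * var_y)


def greedy_prune(variants, r2=0.1, bp_window_size=500_000):
    """Keep a variant only if no already-kept variant within the window exceeds r2."""
    kept = []
    for pos, gts in variants:
        independent = all(
            pos - kept_pos > bp_window_size or r_squared(gts, kept_gts) <= r2
            for kept_pos, kept_gts in kept
        )
        if independent:
            kept.append((pos, gts))
    return [pos for pos, _ in kept]


toy_variants = [
    (100, [0, 1, 2, 0, 1, 2]),
    (200, [0, 1, 2, 0, 1, 2]),        # identical to the first variant, pruned
    (300, [1, 0, 1, 1, 2, 1]),        # uncorrelated with the first variant, kept
    (1_000_000, [0, 1, 2, 0, 1, 2]),  # correlated but outside the window, kept
]
print(greedy_prune(toy_variants))
```

The last toy variant shows why the window matters: correlation is only checked between variants that lie within `bp_window_size` base pairs of each other.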

In [None]:
# write out the output as a temp file - make sure to save the file on this step b/c the pruning step takes a while to run
# saving took ~23 min 
mt_var_pru_filt.write('gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt', overwrite=False)

In [None]:
# after saving the pruned file to the cloud, reading it back in for the next steps 
# Duplicate file - gs://hgdp-1kg/filtered_n_pruned_output_updated.mt
mt_var_pru_filt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt') 

In [None]:
# how many snps are left after filtering and pruning? 
mt_var_pru_filt.count() # 248,634 snps 
# between ~100-300k so we proceed without any value adjustments  

# Run PC Relate   

In [None]:
# compute relatedness estimates between individuals using a variant of the PC-Relate method (https://hail.is/docs/0.2/methods/relatedness.html#hail.methods.pc_relate)
# only compute the kinship statistic using:
# a minimum minor allele frequency filter of 0.05, 
# excluding sample-pairs with kinship less than 0.05, and 
# 20 principal components to control for population structure 
# a hail table is produced (~4min to run) 
relatedness_ht = hl.pc_relate(mt_var_pru_filt.GT, min_individual_maf=0.05, min_kinship=0.05, statistics='kin', k=20).key_by()

In [None]:
# from the related pairs, identify the individuals to remove - with keep=False this returns a Hail table of the sample IDs to drop (~2 hr 22 min to run) - a previous run took ~13 min
related_samples_to_remove = hl.maximal_independent_set(relatedness_ht.i, relatedness_ht.j, False)

In [None]:
# using sample IDs (col_key of the MatrixTable), look up each sample in 'related_samples_to_remove'  
# keep=False drops the samples that are found there, leaving only the unrelated samples 
mt_unrel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=False) 

# do the same as above but with keep=True, leaving only the samples that are found in 'related_samples_to_remove'  
mt_rel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=True) 
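
The idea behind choosing which related samples to drop can be sketched on a toy kinship graph. The code below is a simplified greedy version in plain Python, written in the spirit of a highest-degree-first removal; it is not Hail's implementation, and `samples_to_remove` plus the sample names are hypothetical.

```python
from collections import defaultdict


def samples_to_remove(related_pairs):
    """Greedily drop the sample with the most related partners until no pair remains."""
    neighbors = defaultdict(set)
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    removed = set()
    while any(neighbors.values()):
        # sample with the most remaining relatives; sorting makes ties deterministic
        worst = max(sorted(neighbors), key=lambda s: len(neighbors[s]))
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
        removed.add(worst)
    return removed


# A is related to both B and C, so removing A alone resolves two pairs
toy_pairs = [('A', 'B'), ('A', 'C'), ('D', 'E')]
to_remove = samples_to_remove(toy_pairs)
unrelated = sorted({s for pair in toy_pairs for s in pair} - to_remove)
print(sorted(to_remove), unrelated)
```

Removing the most-connected sample first tends to keep more samples in the unrelated set than dropping one member of every pair independently.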

In [None]:
# write out mts of unrelated and related samples on to the cloud 

# unrelated mt
mt_unrel.write('gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt', overwrite=False) 

# related mt 
mt_rel.write('gs://hgdp-1kg/hgdp_tgp/rel_updated.mt', overwrite=False) 

In [None]:
# read saved mts back in 

# unrelated mt
mt_unrel = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt') 

# related mt 
mt_rel = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/rel_updated.mt') 

# PCA

### Function to Run PCA on Unrelated Individuals

In [None]:
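def run_pca(mt: hl.MatrixTable, reg_name: str, out_prefix: str, overwrite: bool = False):
    """
    Runs PCA on a dataset and saves the scores and loadings
    :param mt: dataset to run PCA on
    :param reg_name: region name for saving output purposes
    :param out_prefix: path for where to save the outputs
    :param overwrite: whether to overwrite an existing loadings table
    :return: None (outputs are written under out_prefix)
    """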

    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=20, compute_loadings=True)
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, 21)})
    
    pca_scores.export(out_prefix + reg_name + '_scores.txt.bgz')  # save individual-level genetic region PCs
    pca_loadings.write(out_prefix + reg_name + '_loadings.ht', overwrite)  # save PCA loadings
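
As a rough picture of the genotype normalization that happens before the PCA itself, here is a plain-Python sketch. It follows the Hardy-Weinberg scaling as I understand it from the Hail documentation, so treat the exact scaling factor as an assumption; `hwe_normalize` is a hypothetical helper, missing calls are `None`, and every variant is assumed to be polymorphic.

```python
from math import sqrt


def hwe_normalize(genotype_rows):
    """Center each variant at 2p and scale by sqrt(2p(1-p) * n_variants)."""
    n_variants = len(genotype_rows)
    normalized = []
    for gts in genotype_rows:
        called = [g for g in gts if g is not None]
        p = sum(called) / (2 * len(called))          # alt allele frequency
        scale = sqrt(2 * p * (1 - p) * n_variants)   # assumes 0 < p < 1
        # missing calls become 0, i.e. they are imputed to the variant mean
        normalized.append([0.0 if g is None else (g - 2 * p) / scale for g in gts])
    return normalized


toy_rows = [
    [0, 1, 2, 1],      # p = 0.5
    [0, 0, 1, None],   # p = 1/6, one missing call
]
toy_normalized = hwe_normalize(toy_rows)
print(toy_normalized[0])
```

The per-variant `pca_af` annotation stored with the loadings in `run_pca` is the same quantity as `p` here, which is what later allows new samples to be normalized consistently when they are projected.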

### Function to Project Related Individuals

In [None]:
# if running on GCS, need to add "--packages gnomad" when starting a cluster in order for the import to work  
from gnomad.sample_qc.ancestry import pc_project

def project_individuals(pca_loadings, project_mt, reg_name: str, out_prefix: str, overwrite: bool = False):
    """
    Project samples into predefined PCA space
    :param pca_loadings: existing PCA space - unrelated samples 
    :param project_mt: MatrixTable of data to project - related samples 
    :param reg_name: region name for saving output purposes
    :param out_prefix: path for where to save PCA projection outputs
    :param overwrite: currently unused by this function
    :return: None (projected scores are exported under out_prefix)
    """
    ht_projections = pc_project(project_mt, pca_loadings)  
    ht_projections = ht_projections.transmute(**{f'PC{i}': ht_projections.scores[i - 1] for i in range(1, 21)}) 
    ht_projections.export(out_prefix + reg_name + '_projected_scores.txt.bgz') # save output 
    #return ht_projections # return to user  
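
Conceptually, projecting a sample means normalizing its genotypes with the reference allele frequencies and taking a weighted sum with the loadings. The sketch below shows this for a single sample in plain Python. It reflects my reading of what `pc_project` computes and should be taken as an assumption rather than the library's exact code; `project_sample` and the toy values are hypothetical.

```python
from math import sqrt


def project_sample(genotypes, loadings, pca_af):
    """
    genotypes: alt-allele counts per variant for one sample (None = missing)
    loadings:  per-variant list of loadings, one value per PC
    pca_af:    per-variant alt allele frequency from the unrelated set
    """
    n_variants = len(loadings)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for g, variant_loadings, af in zip(genotypes, loadings, pca_af):
        if g is None or af <= 0 or af >= 1:
            continue  # skip missing calls and monomorphic variants
        z = (g - 2 * af) / sqrt(n_variants * 2 * af * (1 - af))
        for k in range(n_pcs):
            scores[k] += variant_loadings[k] * z
    return scores


toy_scores = project_sample(
    genotypes=[2, 0, None],
    loadings=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    pca_af=[0.5, 0.5, 0.5],
)
print(toy_scores)
```

The key point is that the allele frequencies come from the unrelated samples used to build the PCA, not from the related samples being projected.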

## Global PCA

In [None]:
# run 'run_pca' function for global pca   
run_pca(mt_unrel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/', False)

In [None]:
# run 'project_individuals' function for global pca 
loadings = hl.read_table('gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/global_loadings.ht') # read in the PCA loadings that were obtained from 'run_pca' function 
project_individuals(loadings, mt_rel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/', False) 

## Subcontinental PCA 

In [None]:
# obtain a list of the genetic regions in the dataset - used the unrelated dataset since it had more samples 
regions = mt_unrel['hgdp_tgp_meta']['Genetic']['region'].collect()
regions = list(dict.fromkeys(regions)) # 7 regions - ['EUR', 'AFR', 'AMR', 'EAS', 'CSA', 'OCE', 'MID']

In [None]:
# set argument values 
subcont_pca_prefix = 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/subcont_pca/subcont_pca_' # path for outputs 
overwrite = False 

In [None]:
# run 'run_pca' function for each region - nb freezes after printing the log for AMR  
# don't restart it - just let it run and you can follow the progress through the SparkUI
# even after all the outputs are produced and the run is complete, the code chunk will seem as if it's still running (* in the left square bracket)
# can check if the run is complete by either checking the output files in the Google cloud bucket or using the SparkUI 
# after checking the desired outputs are generated and the run is done, exit the current nb, open a new session, and proceed to the next step
# ~27min to run 
for i in regions:
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the unrelateds per region
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# run 'project_individuals' function for each region (~2min to run)
for i in regions:
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') # for each region, read in the PCA loadings that were obtained from 'run_pca' function 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the relateds per region 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) 

# Outlier Removal
#### After plotting the PCs, 22 outliers that need to be removed were identified (the table below will be completed for the final report)


| s | Genetic region | Population | Note |
| --- | --- | --- | -- |
| NA20314 | AFR | ASW | Clusters with AMR in global PCA | 
| NA20299 | - | - | - |
| NA20274 | - | - | - |
| HG01880 | - | - | - |
| HG01881 | - | - | - |
| HG01628 | - | - | - |
| HG01629 | - | - | - |
| HG01630 | - | - | - |
| HG01694 | - | - | - |
| HG01696 | - | - | - |
| HGDP00013 | - | - | - |
| HGDP00150 | - | - | - |
| HGDP00029 | - | - | - |
| HGDP01298 | - | - | - |
| HGDP00130 | CSA | Makrani | Closer to AFR than most CSA |
| HGDP01303 | - | - | - |
| HGDP01300 | - | - | - |
| HGDP00621 | MID | Bedouin | Closer to AFR than most MID |
| HGDP01270 | MID | Mozabite | Closer to AFR than most MID |
| HGDP01271 | MID | Mozabite | Closer to AFR than most MID |
| HGDP00057 | - | - | - | 
| LP6005443-DNA_B02 | - | - | - |

In [None]:
# read back in the unrelated and related mts to remove outliers and run pca 
# bucket was moved to another project so different paths are used from where these mts were previously saved 
mt_unrel_unfiltered = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt') # unrelated mt
mt_rel_unfiltered = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/rel_updated.mt') # related mt

In [None]:
# read the outliers file into a list
with hl.utils.hadoop_open('gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt') as file: 
    outliers = [line.rstrip('\n') for line in file]
    
# capture and broadcast the list as an expression
outliers_list = hl.literal(outliers)

In [None]:
# remove 22 outliers 
mt_unrel = mt_unrel_unfiltered.filter_cols(~outliers_list.contains(mt_unrel_unfiltered['s']))
mt_rel = mt_rel_unfiltered.filter_cols(~outliers_list.contains(mt_rel_unfiltered['s']))
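
The same read-and-filter logic can be checked on plain Python lists before running it on the matrix tables. This is a small sketch with hypothetical helpers (`parse_outliers`, `drop_outliers`); unlike the cell above, it also strips surrounding whitespace and skips blank lines, which is an added assumption about how the outlier file might be formatted.

```python
def parse_outliers(lines):
    """Turn raw file lines into a clean list of sample IDs."""
    return [s for s in (line.strip() for line in lines) if s]


def drop_outliers(sample_ids, outliers):
    """Keep only the samples that are not listed as outliers."""
    outlier_set = set(outliers)
    return [s for s in sample_ids if s not in outlier_set]


toy_file_lines = ['NA20314\n', 'HGDP00013 \n', '\n']
toy_samples = ['NA20314', 'HG00096', 'HGDP00013', 'HGDP00001']

toy_outliers = parse_outliers(toy_file_lines)
print(drop_outliers(toy_samples, toy_outliers))
```

A stray trailing space or empty line in the file would otherwise produce an ID that matches no sample, and the outlier would silently stay in the dataset.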

In [None]:
# sanity check - count_cols() counts samples only, and each count is computed once 
n_unrel_before, n_unrel_after = mt_unrel_unfiltered.count_cols(), mt_unrel.count_cols()
n_rel_before, n_rel_after = mt_rel_unfiltered.count_cols(), mt_rel.count_cols()
print('Unrelated: Before outlier removal: ' + str(n_unrel_before) + ' | After outlier removal: ' + str(n_unrel_after))
print('Related: Before outlier removal: ' + str(n_rel_before) + ' | After outlier removal: ' + str(n_rel_after))

num_outliers = (n_unrel_before - n_unrel_after) + (n_rel_before - n_rel_after)
print('Total samples removed = ' + str(num_outliers))

# Rerun PCA
### - The following steps are similar to the ones prior to removing the outliers except now we are using the updated unrelated & related dataset and a new GCS bucket path to save the outputs 

## Global PCA

In [None]:
# run 'run_pca' function for global pca - make sure the code block for the function (located above) is run prior to running this    
run_pca(mt_unrel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/', False)

In [None]:
# run 'project_individuals' function for global pca - make sure the code block for the function (located above) is run prior to running this    
loadings = hl.read_table('gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/global_loadings.ht') # read in the PCA loadings that were obtained from 'run_pca' function 
project_individuals(loadings, mt_rel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/', False) 

## Subcontinental PCA

In [None]:
# obtain a list of the genetic regions in the dataset - used the unrelated dataset since it had more samples  
regions = mt_unrel['hgdp_tgp_meta']['Genetic']['region'].collect()
regions = list(dict.fromkeys(regions)) # 7 regions - ['EUR', 'AFR', 'AMR', 'EAS', 'CSA', 'OCE', 'MID']

In [None]:
# set argument values 
subcont_pca_prefix = 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/subcont_pca/subcont_pca_' # path for outputs 
overwrite = False 

In [None]:
# run 'run_pca' function (located above) for each region 
# notebook became slow and got stuck - don't restart it, just let it run and you can follow the progress through the SparkUI
# after checking the desired outputs are generated (GCS bucket) and the run is done (SparkUI), exit the current nb, open a new session, and proceed to the next step
# took roughly 25-27 min  
for i in regions:
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the unrelateds per region
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# run 'project_individuals' function (located above) for each region - took ~3min 
for i in regions:
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') # for each region, read in the PCA loadings that were obtained from 'run_pca' function 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the relateds per region 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) 

# Write Out Matrix Table 

In [None]:
# write out mts of unrelated and related samples separately (post-outlier removal) 

mt_unrel.write('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt', overwrite=False) #unrelated mt
mt_rel.write('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt', overwrite=False) #related mt