Notebook 4: filtering to common independent SNPs, relatedness, PCA, joint calling with a new dataset, applying RF to the new dataset. Analyses that need to be run:
1. Work with Lindo: - *PENDING*
    - Joint calling with GGV, sample QC 
    - Use gnomAD RF (subset to variants in RF model) - doesn’t need VQSR 
    - Intersect HGDP+1kG+GGV, build a RF with 1kG+ HGDP, apply it to a new dataset (GGV) - doesn’t need VQSR 
2. PCA plots - Ally - *PENDING*
    - already implemented in R, just need to plot it in Hail
    
----------------------------------------
Further edits needed in this nb: 
- Update the global and subcontinental paths and add them to section 1
- Add the path to the PCA plotting Rmarkdown (in section 5) once available  
- Complete the table in section 5
- Add Ally's code for plots 

## Index
- [General Overview](#1.-General-Overview)
- [Variant Filtering and LD Pruning](#2.-Variant-Filtering-and-LD-Pruning)
- [Run PC-Relate](#3.-Run-PC-Relate)
- [PCA](#4.-PCA)
    - [Function to Run PCA on Unrelated Individuals](#4a.-Function-to-Run-PCA-on-Unrelated-Individuals)
    - [Function to Project Related Individuals](#4b.-Function-to-Project-Related-Individuals)
    - [Global PCA](#4c.-Global-PCA)
    - [Subcontinental PCA](#4d.-Subcontinental-PCA)
- [Outlier Removal](#5.-Outlier-Removal)
- [Rerun PCA](#6.-Rerun-PCA)
    - [Global PCA](#6a.-Global-PCA-(without-outliers))
    - [Subcontinental PCA](#6b.-Subcontinental-PCA-(without-outliers))
- [Write Out Matrix Table](#7.-Write-Out-Matrix-Table)

# 1. General Overview 
The purpose of this notebook is to further filter the matrix table obtained from notebook 3, run relatedness and Principal Component Analysis (PCA), joint call with a new data set, and apply a random forest (RF) model. It contains steps on how to:

- Read in a matrix table and run Hail common variant statistics  
- Filter using allele frequency and call rate
- Run LD pruning 
- Run relatedness and separate related and unrelated individuals
- Calculate PC scores and project samples on to a PC space  
- Run global and subcontinental PCA and plot them 
- Remove PCA outliers (filter using sample IDs)
- Joint call with a new data set
- Build/apply RF
- Write out a matrix table 

Author: Mary T. Yohannes

1a. Import needed libraries and packages 

In [None]:
# import hail
import hail as hl

1b. Input and output path variables to be edited by users as needed 

In [None]:
# input 
input_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/pre_running_varqc.mt'

# intermediate file
intermediate_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt'

# pre-outlier paths for unrelated and related samples 
unrel_path = 'gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt'
rel_path = 'gs://hgdp-1kg/hgdp_tgp/rel_updated.mt' 

# pre-outlier file path is missing - global & subcont pca results [here]

# outliers file 
outliers_path = 'gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt'

# post-outlier file path is missing - global & subcont pca results[here]

# final output paths for unrelated and related samples (post-outlier)
unrel_output = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt'
rel_output = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt'

# 2. Variant Filtering and LD Pruning

<details>
<summary> - Why are we doing this? [click here] </summary>
    
At this point, we have 155,648,020 SNPs. Since PCA needs far fewer variants (~100-300k), we filter on:
- AF - allele frequency 
- call rate - fraction of calls neither missing nor filtered

and then run LD pruning.     
</details>
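
As a rough illustration of the two filters, the sketch below computes alternate allele frequency and call rate from toy genotype calls in plain Python. The variant names, genotype values, and helper functions (`call_rate`, `alt_af`, `passes`) are made up for this example and are not part of Hail; only the thresholds come from this notebook.

```python
# Toy genotype calls: number of alternate alleles per sample (0, 1, 2),
# or None for a missing call. Variant names and values are hypothetical.
toy_calls = {
    "var_common": [0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 0, 1],
    "var_rare": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "var_missing": [1, None, 1, 2, 0, 0, 1, 1, 0, 1, 2, 1],
}


def call_rate(calls):
    """Fraction of samples with a non-missing call."""
    return sum(c is not None for c in calls) / len(calls)


def alt_af(calls):
    """Alternate allele frequency among the non-missing diploid calls."""
    called = [c for c in calls if c is not None]
    return sum(called) / (2 * len(called))


def passes(calls, min_af=0.05, max_af=0.95, min_call_rate=0.999):
    """Apply the same thresholds as the variant filter in this notebook."""
    return min_af < alt_af(calls) < max_af and call_rate(calls) > min_call_rate


kept = [name for name, calls in toy_calls.items() if passes(calls)]
print(kept)  # ['var_common']
```

Here `var_rare` fails the AF filter (1/24 is below 0.05) and `var_missing` fails the call rate filter (11/12 is below 0.999). For biallelic sites, filtering the reference allele frequency (`AF[0]`, as done in 2a) to the 0.05-0.95 range keeps the same variants as filtering the alternate allele frequency, since the two frequencies sum to 1.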

In [None]:
# read in the matrix table produced by notebook 3 (input_path)
mt_filt = hl.read_matrix_table(input_path)

2a. Variant Filtering 

In [None]:
# run Hail's common variant statistics (QC metrics) 
mt_var = hl.variant_qc(mt_filt) 

# filter to variants with AF between 0.05 & 0.95, and call rate greater than 0.999    
mt_var_filt = mt_var.filter_rows(
    (mt_var.variant_qc.AF[0] > 0.05)
    & (mt_var.variant_qc.AF[0] < 0.95)
    & (mt_var.variant_qc.call_rate > 0.999)
)
print('Num of variants after filtering = ' + str(mt_var_filt.count_rows())) # 6787034 snps; this line takes ~20min to run 

2b. LD Pruning

In [None]:
# remove correlated variants 
pruned = hl.ld_prune(mt_var_filt.GT, r2=0.1, bp_window_size=500000) # ~113 min to run  
mt_var_pru_filt = mt_var_filt.filter_rows(hl.is_defined(pruned[mt_var_filt.row_key])) 
print('Num of variants after LD pruning = ' + str(mt_var_pru_filt.count()[0])) # 248634 snps

- Since the number of variants is now in the ~100-300k range, we proceed to the PCA analysis without any further adjustments.  
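
To give a feel for what LD pruning does, here is a simplified greedy sketch in plain Python. It is not Hail's `ld_prune()` algorithm: it walks through variants in position order and keeps a variant only if it is not correlated (r² above the threshold) with an already-kept variant inside the window. The function names and toy genotypes are assumptions for illustration, and every toy variant is assumed to be polymorphic so the correlation is defined.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors.
    Assumes both vectors are polymorphic (non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)


def greedy_prune(variants, r2=0.1, bp_window_size=500_000):
    """Keep a variant only if no already-kept variant within the window
    has squared correlation above r2.
    `variants` is a list of (position, genotype_vector) sorted by position."""
    kept = []
    for pos, gts in variants:
        correlated = any(
            pos - kept_pos <= bp_window_size and r_squared(gts, kept_gts) > r2
            for kept_pos, kept_gts in kept
        )
        if not correlated:
            kept.append((pos, gts))
    return kept


# Hypothetical variants on one chromosome: (position, alt-allele counts per sample)
toy_variants = [
    (1_000, [0, 1, 2, 1, 0, 2]),
    (2_000, [0, 1, 2, 1, 0, 2]),    # identical to the first variant, r2 = 1
    (3_000, [1, 0, 1, 2, 1, 1]),    # uncorrelated with the first variant, r2 = 0
    (900_000, [0, 1, 2, 1, 0, 2]),  # identical, but outside the 500 kb window
]

print([pos for pos, _ in greedy_prune(toy_variants)])  # [1000, 3000, 900000]
```

The second variant is dropped because it duplicates the first, while the last one survives only because it lies more than 500 kb away from every kept variant.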

2c. Write out an intermediate file

In [None]:
# the pruning step takes a while to run, so we write out the filtered and pruned mt as an intermediate file
mt_var_pru_filt.write(intermediate_path, overwrite=False) # ~23 min to run

# read the intermediate file back in for subsequent analyses
mt_var_pru_filt = hl.read_matrix_table(intermediate_path) 

<details>
<summary> - More information on Hail methods and expressions [click here] </summary>

- <a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc"> More on <i> variant_qc() </i></a>

- <a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.ld_prune"> More on <i> ld_prune() </i></a>
</details>

# 3. Run PC-Relate   

<details>
<summary> - Why are we doing this? [click here] </summary>
    
When doing Principal Component Analysis (PCA), we need to separate the related and unrelated samples before computing the PC scores and plotting them. If PCA is computed with related samples in the data set, the population structure and clustering will be distorted by the relatedness among those samples. Thus, we first identify the related individuals by computing relatedness estimates (the kinship statistic in this case) using a variant of the PC-Relate method in Hail. We used a minimum minor allele frequency (MAF) filter of 0.05, excluded sample pairs with kinship less than 0.05, and used 20 principal components (PCs) to control for population structure. After getting the sample IDs of the related samples, we separate the filtered and pruned mt into related and unrelated samples.     
</details>
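
The separation step can be pictured with a small plain-Python sketch. It takes a list of sample pairs with kinship estimates, keeps the pairs at or above the threshold, and greedily removes the sample with the most related partners until no related pair remains. This greedy rule is only an illustration and is not how Hail's `maximal_independent_set()` is implemented; the sample IDs, kinship values, and the `samples_to_remove` helper are all hypothetical.

```python
# Hypothetical pairwise kinship estimates: (sample i, sample j, kinship)
toy_kinship = [
    ("S1", "S2", 0.25),
    ("S2", "S3", 0.24),
    ("S4", "S5", 0.12),
    ("S1", "S6", 0.01),  # below the threshold, so treated as unrelated
]
all_samples = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]


def samples_to_remove(pairs, min_kinship=0.05):
    """Greedily drop the sample with the most related partners
    until no related pair is left."""
    edges = {frozenset((i, j)) for i, j, kin in pairs if kin >= min_kinship}
    removed = set()
    while edges:
        degree = {}
        for edge in edges:
            for sample in edge:
                degree[sample] = degree.get(sample, 0) + 1
        # highest degree first; ties are broken alphabetically for determinism
        worst = min(degree, key=lambda s: (-degree[s], s))
        removed.add(worst)
        edges = {edge for edge in edges if worst not in edge}
    return removed


to_remove = samples_to_remove(toy_kinship)
unrelated = [s for s in all_samples if s not in to_remove]
related = [s for s in all_samples if s in to_remove]
print(unrelated)  # ['S1', 'S3', 'S5', 'S6', 'S7']
print(related)    # ['S2', 'S4']
```

Note that the "related" set holds only the samples that were removed, not every member of a related pair: `S1` and `S3` stay in the unrelated set once `S2` is gone.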

In [None]:
# compute the kinship statistic
relatedness_ht = hl.pc_relate(mt_var_pru_filt.GT, min_individual_maf=0.05, min_kinship=0.05, statistics='kin', k=20).key_by() # ~4min to run

# identify the related samples to remove so that no related pair remains (table of sample IDs) 
related_samples_to_remove = hl.maximal_independent_set(relatedness_ht.i, relatedness_ht.j, keep=False) # ~2hr & 22min to run

# subset the filtered and pruned mt to samples that are NOT present in the list of related individuals  
mt_unrel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=False) 

# do the same as above but this time subset to samples that are present in the related-individuals list   
mt_rel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=True) 

In [None]:
# write out the unrelated and related mts since they are used beyond this notebook in other analyses     
mt_unrel.write(unrel_path, overwrite=False) # unrelated mt
mt_rel.write(rel_path, overwrite=False) # related mt 

# read the saved mts back in
mt_unrel = hl.read_matrix_table(unrel_path) # unrelated mt
mt_rel = hl.read_matrix_table(rel_path) # related mt 

<details>
<summary> - More information on Hail methods and expressions [click here] </summary>

- <a href="https://hail.is/docs/0.2/methods/relatedness.html#hail.methods.pc_relate"> More on <i> pc_relate() </i></a>
    
- <a href="https://hail.is/docs/0.2/methods/misc.html?highlight=maximal_independent_set#hail.methods.maximal_independent_set"> More on <i> maximal_independent_set() </i></a>
</details>

# 4. PCA

<details>
<summary> - What are we doing here? [click here] </summary>
    
PCA is run on the unrelated samples first. Then, the related samples are projected onto the PC space of the unrelated samples to get their PC scores. This way, the population structure and clustering are not affected by the relatedness among samples.      
</details>
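
The projection idea can be sketched in plain Python. The sketch assumes the normalization that, to our understanding, the gnomAD `pc_project()` helper applies: each genotype is centered by twice the allele frequency and scaled by `sqrt(n_variants * 2 * af * (1 - af))`, then multiplied by the variant's loadings and summed over variants. The function name `project_sample` and the toy loadings, allele frequencies, and genotypes are hypothetical, and missing genotypes are not handled.

```python
import math


def project_sample(n_alt, loadings, pca_af):
    """Project one sample onto an existing PC space.

    n_alt:    alt-allele count per variant for the sample (no missing calls)
    loadings: per-variant list of loadings, one value per PC
    pca_af:   per-variant alt allele frequency from the unrelated samples
    """
    n_variants = len(n_alt)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for gt, variant_loadings, af in zip(n_alt, loadings, pca_af):
        # center and scale the genotype under Hardy-Weinberg assumptions
        gt_norm = (gt - 2 * af) / math.sqrt(n_variants * 2 * af * (1 - af))
        for pc in range(n_pcs):
            scores[pc] += variant_loadings[pc] * gt_norm
    return scores


# Two toy variants, two PCs; each variant loads on exactly one PC
toy_loadings = [[1.0, 0.0], [0.0, 1.0]]
toy_af = [0.5, 0.5]

print(project_sample([2, 0], toy_loadings, toy_af))  # [1.0, -1.0]
print(project_sample([1, 1], toy_loadings, toy_af))  # [0.0, 0.0]
```

A sample that is heterozygous at both variants matches the mean genotype exactly, so it lands at the origin of the PC space.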

### 4a. Function to Run PCA on Unrelated Individuals

In [None]:
def run_pca(mt: hl.MatrixTable, reg_name:str, out_prefix: str, overwrite: bool = False):
    """
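    Runs PCA on a data set and saves the scores and loadings
    :param mt: data set to run PCA on
    :param reg_name: region name for saving output purposes
    :param out_prefix: path for where to save the outputs
    :param overwrite: whether to overwrite an existing loadings table
    :return: None (outputs are written under out_prefix)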
    """

    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=20, compute_loadings=True)
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, 21)})
    
    pca_scores.export(out_prefix + reg_name + '_scores.txt.bgz')  # save individual-level-genetic-region PCs
    pca_loadings.write(out_prefix + reg_name + '_loadings.ht', overwrite)  # save PCA loadings

### 4b. Function to Project Related Individuals

<details>
<summary> - Something to note: [click here] </summary>
    
If this is being run on Google Cloud, add "--packages gnomad" when starting a cluster so that the library import works without an issue.      
</details>

In [None]:
from gnomad.sample_qc.ancestry import pc_project

def project_individuals(pca_loadings, project_mt, reg_name:str, out_prefix: str, overwrite: bool = False):
    """
    Project samples into predefined PCA space
    :param pca_loadings: existing PCA space of unrelated samples 
    :param project_mt: matrix table of related samples to project  
    :param reg_name: region name for saving output purposes
    :param out_prefix: path for where to save PCA projection outputs
    :param overwrite: currently unused; kept so the call matches run_pca
    :return: None (projected scores are exported under out_prefix)
    """
    ht_projections = pc_project(project_mt, pca_loadings)  
    ht_projections = ht_projections.transmute(**{f'PC{i}': ht_projections.scores[i - 1] for i in range(1, 21)}) 
    ht_projections.export(out_prefix + reg_name + '_projected_scores.txt.bgz') # save output   

### 4c. Global PCA

<details>
<summary> - Why are we doing this? [click here] </summary>
    
To see the population structure and clustering on a continental level and contextualize the data globally.      
</details>

In [None]:
# run PCA on the unrelated samples
run_pca(mt_unrel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/', False)  

# read in the PCA loadings of the unrelated samples
loadings = hl.read_table('gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/global_loadings.ht') 

# project the related samples onto the unrelated-samples' PC space 
project_individuals(loadings, mt_rel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/', False) 

### 4d. Subcontinental PCA 

<details>
<summary> - Why are we doing this? [click here] </summary>
    
To see the population structure and clustering on a subcontinental level and contextualize data within continental regions. This also helped us identify outliers which were removed later on. 
</details>
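
The per-region split used for the subcontinental PCA can be sketched with plain Python lists. The sample IDs and region assignments below are made up; the point is that `dict.fromkeys` removes duplicate region labels while preserving their first-seen order, after which samples are grouped by region.

```python
# Hypothetical (sample ID, region) pairs standing in for the column metadata
toy_samples = [
    ("SAMPLE_A", "EUR"),
    ("SAMPLE_B", "AFR"),
    ("SAMPLE_C", "EUR"),
    ("SAMPLE_D", "AMR"),
    ("SAMPLE_E", "AFR"),
]

# one region label per sample, with repeats
region_labels = [region for _, region in toy_samples]

# remove duplicates while keeping the order in which regions first appear
regions = list(dict.fromkeys(region_labels))

# group the samples by region; each group would get its own PCA run
by_region = {
    region: [sample for sample, r in toy_samples if r == region]
    for region in regions
}

print(regions)           # ['EUR', 'AFR', 'AMR']
print(by_region["AFR"])  # ['SAMPLE_B', 'SAMPLE_E']
```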

<details>
<summary> - Something to note: [click here] </summary>
 
When running the next code chunk, the notebook might freeze after printing the log for EUR, AFR and AMR. If this happens, don't restart it. Just let it run and follow the progress with the outputs being generated. Even after all the outputs have been generated (3 for each region so 21 in total), the code chunk will seem as if it's still running. So after checking that the desired outputs are there, just exit the current notebook, open a new session, and proceed to the next step. 
</details>

In [None]:
# obtain a list of the continental regions in the data set (used the unrelated data set since it had more samples) 
regions = mt_unrel['hgdp_tgp_meta']['Genetic']['region'].collect()
regions = list(dict.fromkeys(regions)) # remove duplicates while preserving order
# There are 7 regions: EUR, AFR, AMR, EAS, CSA, OCE, and MID

# set argument values for PCA 
subcont_pca_prefix = 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/subcont_pca/subcont_pca_' # path for outputs 
overwrite = False

# for each region, run PCA on the unrelated samples (~27min to run)
for i in regions:  
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the unrelateds per region
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# for each region, project the related samples onto the unrelated-samples' PC space (~2min to run)
for i in regions:
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') # read in the PCA loadings of the unrelated samples for each region 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the related mt per region 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) # project 

<details>
<summary> - More information on Hail methods and expressions [click here] </summary>

- <a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca"> More on <i> hwe_normalized_pca() </i></a>
    
- <a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on <i> annotate_rows() </i></a>
    
- <a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate"> More on <i> annotate() </i></a>
    
- <a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute"> More on <i> transmute() </i></a>
    
- <a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on <i> export() </i></a>
    
- <a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on <i> pc_project() </i></a>
    
- <a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on <i> collect() </i></a>
</details>

# 5. Outlier Removal

After plotting the PCs using R (link_the_plotting_Rmarkdown_here), 22 outliers were identified (complete_the_table)

| s | Genetic region | Population | Note |
| --- | --- | --- | -- |
| NA20314 | AFR | ASW | Clusters with AMR in global PCA | 
| NA20299 | - | - | - |
| NA20274 | - | - | - |
| HG01880 | - | - | - |
| HG01881 | - | - | - |
| HG01628 | - | - | - |
| HG01629 | - | - | - |
| HG01630 | - | - | - |
| HG01694 | - | - | - |
| HG01696 | - | - | - |
| HGDP00013 | - | - | - |
| HGDP00150 | - | - | - |
| HGDP00029 | - | - | - |
| HGDP01298 | - | - | - |
| HGDP00130 | CSA | Makrani | Closer to AFR than most CSA |
| HGDP01303 | - | - | - |
| HGDP01300 | - | - | - |
| HGDP00621 | MID | Bedouin | Closer to AFR than most MID |
| HGDP01270 | MID | Mozabite | Closer to AFR than most MID |
| HGDP01271 | MID | Mozabite | Closer to AFR than most MID |
| HGDP00057 | - | - | - | 
| LP6005443-DNA_B02 | - | - | - |
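
The removal step amounts to a set-membership filter on sample IDs followed by a count check. The plain-Python sketch below uses an in-memory file as a stand-in for the outliers file (one sample ID per line); the two outlier IDs are taken from the table above, while the other sample IDs and the blank-line handling are assumptions made for this example.

```python
import io

# Stand-in for the outliers file: one sample ID per line, with a stray blank line
outliers_file = io.StringIO("NA20314\nHGDP00130\n\n")

# strip whitespace and skip empty lines so they never match a sample ID
outliers = {line.strip() for line in outliers_file if line.strip()}

# Hypothetical sample lists for the unrelated and related data sets
unrel_ids = ["NA20314", "SAMPLE_A", "SAMPLE_B"]
rel_ids = ["HGDP00130", "SAMPLE_C"]

unrel_kept = [s for s in unrel_ids if s not in outliers]
rel_kept = [s for s in rel_ids if s not in outliers]

# sanity check: the number of removed samples should equal the number of outliers
num_removed = (len(unrel_ids) - len(unrel_kept)) + (len(rel_ids) - len(rel_kept))
print(num_removed)  # 2
```

If the removed count is lower than the number of listed outliers, some IDs in the file did not match any sample, which is worth checking before rerunning the PCA.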

In [None]:
# read in the unrelated and related mts to remove outliers and rerun pca  
mt_unrel_unfiltered = hl.read_matrix_table(unrel_path) # unrelated mt
mt_rel_unfiltered = hl.read_matrix_table(rel_path) # related mt

# read the outliers file into a list
with hl.utils.hadoop_open(outliers_path) as file: 
    outliers = [line.rstrip('\n') for line in file]
    
# capture and broadcast the list as an expression
outliers_list = hl.literal(outliers)

# remove the 22 outliers from both mts
mt_unrel = mt_unrel_unfiltered.filter_cols(~outliers_list.contains(mt_unrel_unfiltered['s']))
mt_rel = mt_rel_unfiltered.filter_cols(~outliers_list.contains(mt_rel_unfiltered['s']))

# sanity check: count the samples once per mt (count_cols() avoids counting rows) 
n_unrel_before, n_unrel_after = mt_unrel_unfiltered.count_cols(), mt_unrel.count_cols()
n_rel_before, n_rel_after = mt_rel_unfiltered.count_cols(), mt_rel.count_cols()
print('Unrelated: Before outlier removal ' + str(n_unrel_before) + ' | After outlier removal ' + str(n_unrel_after))
print('Related: Before outlier removal ' + str(n_rel_before) + ' | After outlier removal ' + str(n_rel_after))
num_outliers = (n_unrel_before - n_unrel_after) + (n_rel_before - n_rel_after)
print('Total samples removed = ' + str(num_outliers))

<details>
<summary> - More information on Hail methods and expressions [click here] </summary>

- <a href="https://hail.is/docs/0.2/utils/index.html#hail.utils.hadoop_open"> More on <i> hl.utils.hadoop_open() </i></a>
    
- <a href="https://hail.is/docs/0.2/functions/core.html#hail.expr.functions.literal"> More on <i> hl.literal() </i></a>
</details>

# 6. Rerun PCA

<details>
<summary> - What's different here? [click here] </summary> 

We are using:
- the updated unrelated and related mts (outliers removed)
- new paths for the outputs     
</details>

<details>
<summary> - Something to note: [click here] </summary>
 
Make sure the code blocks for the PCA (4a) and the projection (4b) functions in section 4 above are run prior to running the following.     
</details>

### 6a. Global PCA (without outliers)

In [None]:
# run PCA on the unrelated samples  
run_pca(mt_unrel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/', False)

# read in the PCA loadings of the unrelated samples  
loadings = hl.read_table('gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/global_loadings.ht') 

# project the related samples onto the unrelated-samples' PC space 
project_individuals(loadings, mt_rel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/', False) 

### 6b. Subcontinental PCA (without outliers)

In [None]:
# set argument values for PCA 
subcont_pca_prefix = 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/subcont_pca/subcont_pca_' # path for outputs 
overwrite = False 

# if not done so already, read "Something to note" in section 4d above before running the following code 
# for each region, run PCA on the unrelated samples (~26 min to run) 
for i in regions: # "regions" is a list containing the 7 continental regions found in the data set. It comes from the code chunk 4d above.    
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the unrelateds per region
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# for each region, project the related samples onto the unrelated-samples' PC space (~3min to run)
for i in regions:
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') # read in the PCA loadings of the unrelated samples for each region
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the relateds per region 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) # project 

# 7. Write Out Matrix Table 

In [None]:
# write out the updated unrelated and related mts separately (post-outlier removal), using the output paths set in 1b 
mt_unrel.write(unrel_output, overwrite=False) # unrelated mt 
mt_rel.write(rel_output, overwrite=False) # related mt 