Notebook 4: filtering to common independent SNPs, relatedness, PCA, joint calling with a new dataset, and applying RF to the new dataset. Analyses that need to be run:
1. Work with Lindo: - *PENDING*
    - Joint calling with GGV, sample QC 
    - Use gnomAD RF (subset to variants in RF model) - doesn’t need VQSR 
    - Intersect HGDP+1kG+GGV, build a RF with 1kG+ HGDP, apply it to a new dataset (GGV) - doesn’t need VQSR 
2. PCA plots - Ally - *PENDING*
    - already implemented in R, just need to plot it in Hail
    
----------------------------------------
Further edits needed in this nb: 
- Add all paths below in part 1
- Add desc above each code block if needed 
- Clean up comments inside the code block 
- Add Hail links below each code block
- Add Ally's code for plots 

# Index
1. [Set Default Output Paths](#1.-Set-Default-Output-Paths)
2. [Variant Filtering and LD Pruning](#2.-Variant-Filtering-and-LD-Pruning)
3. [Run PC Relate](#3.-Run-PC-Relate)
4. [PCA](#4.-PCA)
    1. [Function to Run PCA on Unrelated Individuals](#4a.-Function-to-Run-PCA-on-Unrelated-Individuals)
    2. [Function to Project Related Individuals](#4b.-Function-to-Project-Related-Individuals)
    3. [Global PCA](#4c.-Global-PCA)
    4. [Subcontinental PCA](#4d.-Subcontinental-PCA)
5. [Outlier Removal](#5.-Outlier-Removal)
6. [Rerun PCA](#6.-Rerun-PCA)
    1. [Global PCA](#6a.-Global-PCA)
    2. [Subcontinental PCA](#6b.-Subcontinental-PCA)
7. [Write Out Matrix Table](#7.-Write-Out-Matrix-Table)

# General Overview 
The purpose of this notebook is to further filter the matrix table obtained from notebook 3, run relatedness and Principal Component Analysis (PCA), joint call with new data set, and apply RF. 

**This script contains information on how to:**
- Read in a matrix table and run Hail common variant statistics  
- Filter using allele frequency and call rate
- Run LD pruning 
- Run relatedness and separate related and unrelated individuals
- Calculate PC scores and project samples on to a PC space  
- Run global and Subcontinental PCA and plot them 
- Remove PCA outliers (filter using sample IDs)
- Joint call with a new data set
- Build/apply RF
- Write out a matrix table 

Author: Mary T. Yohannes

In [None]:
# import hail
import hail as hl

# import the read_qc function
from read_qc_function import read_qc

# import the gnomAD method needed to project individuals onto an existing PC space
from gnomad.sample_qc.ancestry import pc_project

## Set Requester Pays Bucket
To run these tutorials, users must specify which project is to be billed. To change which project is billed, set the `GCP_PROJECT_NAME` variable to your own project.

In [None]:
# setting requester pays bucket to use throughout tutorial
GCP_PROJECT_NAME = "diverse-pop-seq-ref"  # change this to your own project
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'hgdp_tgp,gcp-public-data--gnomad',
    'spark.hadoop.fs.gs.requester.pays.project.id': GCP_PROJECT_NAME
})

# 1. Set Default Output Paths
These default paths can be edited by users as needed. We recommend running these tutorials without writing out datasets. The read_qc() function is intended to replace the need for users to write out and read in datasets themselves. 

By default, we have commented out all of the write steps in the tutorials. If you would like to write out your own datasets, uncomment those sections and replace the paths with your own. 

In [None]:
# input 
input_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/pre_running_varqc.mt'

# temporary file
intermediate_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt'

# pre-outlier paths for unrelated and related samples 
unrel_path = 'gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt'
rel_path = 'gs://hgdp-1kg/hgdp_tgp/rel_updated.mt' 

# pre-outlier file path is missing - global & subcont pca results [here]

# outliers file 
outliers_path = 'gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt'

# post-outlier file path is missing - global & subcont pca results[here]

# final output paths for unrelated and related samples (post-outlier)
unrel_output = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt'
rel_output = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt'

# 2. Variant Filtering and LD Pruning

<details><summary> <b>Why are we doing this?</b>
    <br>[click <span style="color:green">here</span> to expand] </summary>
    
At this point, we have 155,648,020 SNPs. Since PCA needs far fewer variants (~100-300k), we filter on:
- AF - allele frequency 
- call rate - fraction of calls neither missing nor filtered

and then run LD pruning.     
    
Linkage disequilibrium (LD) is the correlation between nearby variants such that the alleles at neighboring polymorphisms (observed on the same chromosome) are associated within a population more often than if they were unlinked.
    
For more information on LD pruning click <a href=""> here </a>
</details>

<br>
<details><summary> <b>More information on Hail methods and expressions</b> 
    <br>[click <span style="color:green">here</span> to expand] </summary>
<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc"> More on  <i> variant_qc() </i></a></li>

<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.ld_prune"> More on  <i> ld_prune() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# read-in the right intermediate file 
mt_filt = hl.read_matrix_table(input_path)

#### 2a. Variant Filtering 
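
To make the two filters concrete, here is a minimal pure-Python sketch on made-up genotypes. It does not use Hail, and the variant names and calls are hypothetical. It only illustrates how allele frequency and call rate are computed and how the thresholds from this section are applied.

```python
# Toy genotypes: number of alternate alleles per sample (0, 1 or 2),
# or None when the call is missing. All values are made up for illustration.
genotypes = {
    "common_snp": [0, 1, 2, 1, 0, 1, 1, 2],
    "monomorphic_snp": [0, 0, 0, 0, 0, 0, 0, 0],
    "low_call_rate_snp": [1, None, 2, None, 1, 0, 1, 2],
}


def call_rate(calls):
    """Fraction of samples with a non-missing call."""
    return sum(c is not None for c in calls) / len(calls)


def alt_allele_frequency(calls):
    """Alternate allele frequency among non-missing diploid calls."""
    called = [c for c in calls if c is not None]
    return sum(called) / (2 * len(called))


def passes_filters(calls, min_af=0.05, max_af=0.95, min_call_rate=0.999):
    """Apply the AF and call rate thresholds used in this section."""
    af = alt_allele_frequency(calls)
    return min_af < af < max_af and call_rate(calls) > min_call_rate


kept = [name for name, calls in genotypes.items() if passes_filters(calls)]
print(kept)  # ['common_snp']
```

Only the common, fully called variant survives: the monomorphic one fails the AF filter and the third one fails the call rate filter.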

In [None]:
# run Hail's common variant statistics (QC metrics) 
mt_var = hl.variant_qc(mt_filt) 

# filter to variants with AF between 0.05 & 0.95, and call rate greater than 0.999
# AF[0] is the reference allele frequency; the AF window is symmetric, so for biallelic sites the alternate AF is also kept between 0.05 & 0.95
mt_var_filt = mt_var.filter_rows((mt_var.variant_qc.AF[0] > 0.05) & (mt_var.variant_qc.AF[0] < 0.95) & (mt_var.variant_qc.call_rate > 0.999))
print('Num of variants after filtering = ' + str(mt_var_filt.count()[0])) # 6787034 snps; this line takes ~20min to run 

#### 2b. LD Pruning
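After AF and call rate filtering, 6,787,034 SNPs remain, which is still far more than the ~100-300k needed for PCA, so we remove correlated variants with LD pruning. Pruning leaves 248,634 SNPs, which is within the target range, so we proceed to the PCA analysis without any more adjustments.  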
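
As a rough illustration of what "correlated variants" means here, the sketch below computes the squared Pearson correlation (r2) between toy genotype vectors and then applies a simple greedy pruning rule with the same thresholds as this section (r2 of 0.1, 500 kb window). This is a simplified stand-in written in plain Python, not Hail's `ld_prune` algorithm, and the variant names, positions, and genotypes are all hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) * (a - mean_x) for a in x)
    var_y = sum((b - mean_y) * (b - mean_y) for b in y)
    return (cov * cov) / (var_x * var_y)


def greedy_prune(variants, r2=0.1, bp_window_size=500_000):
    """Keep a variant unless it is correlated (r2 above the threshold)
    with an already kept variant within the base-pair window."""
    kept = []
    for name, pos, calls in variants:
        correlated = any(
            abs(pos - kept_pos) <= bp_window_size and r_squared(calls, kept_calls) > r2
            for _, kept_pos, kept_calls in kept
        )
        if not correlated:
            kept.append((name, pos, calls))
    return [name for name, _, _ in kept]


# (name, position on one chromosome, alternate allele counts per sample)
variants = [
    ("snp_a", 100_000, [0, 1, 2, 1, 0, 2]),
    ("snp_b", 150_000, [0, 1, 2, 1, 0, 2]),  # identical to snp_a, r2 = 1
    ("snp_c", 300_000, [2, 0, 1, 1, 0, 2]),  # weakly correlated with snp_a
    ("snp_d", 900_000, [0, 1, 2, 1, 0, 2]),  # identical to snp_a but outside the window
]

print(greedy_prune(variants))  # ['snp_a', 'snp_c', 'snp_d']
```

`snp_b` is dropped because it carries the same information as `snp_a`, while `snp_d` is kept because it is more than 500 kb away from every kept variant and is therefore never compared.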

In [None]:
# remove correlated variants 
pruned = hl.ld_prune(mt_var_filt.GT, r2=0.1, bp_window_size=500000) # ~113 min to run  
mt_var_pru_filt = mt_var_filt.filter_rows(hl.is_defined(pruned[mt_var_filt.row_key])) 
print('Num of variants after LD pruning = ' + str(mt_var_pru_filt.count()[0])) # 248634 snps

#### 2c. Write out an intermediate file
The LD pruning step takes a non-negligible amount of time to run, so we write out an intermediate file to keep the downstream analysis steps from taking very long. This write-out step should take around 23 minutes to run. 

Because the notebook uses the read_qc function, you do not need to run the write-out step yourself. Instead, the function automatically reads in the version of the dataset we wrote out when creating these tutorials. 

If you wish to export your own intermediate file, change the intermediate path and then replace the read_qc() function call with `hl.read_matrix_table(intermediate_path)`.

In [None]:
# # writing out an intermediate file to speed up subsequent analyses
# # uncomment this line and update intermediate_path only if you want your own copy
# mt_var_pru_filt.write(intermediate_path, overwrite=False)

# read the intermediate file back in for subsequent analyses
mt_var_pru_filt = read_qc(ld_prune=True)

# 3. Run PC Relate   
<br>
<details><summary> <b>Why are we doing this?</b>
    <br>[click <span style="color:green">here</span> to expand] </summary>
<br>
Many genomic analyses filter out related individuals because relatedness can produce spurious associations. Here we estimate relatedness with the pc_relate method, which is a PCA-based relatedness method.
    
For more information on relatedness click <a href="https://hail.is/docs/0.2/methods/relatedness.html#relatedness"> here</a>
    
For more information on the pc_relate method click <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4716688/">here</a>
    
</details>

<br>
<details><summary> <b>What metrics are we using for pc_relate?</b> 
    <br>[click <span style="color:green">here</span> to expand] </summary>
<br>
We computed the kinship statistic with the following settings:
<ul>
<li>a minimum individual-specific minor allele frequency of 0.05</li>
<li>a minimum kinship of 0.05 (sample pairs with lower kinship are excluded)</li>
<li>20 principal components to control for population structure</li>
</ul>
    
For more information on the pc_relate method click <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4716688/">here</a>
    
</details>

<br>
<details><summary> <b>More information on Hail methods and expressions</b> 
    <br>[click <span style="color:green">here</span> to expand] </summary>
<ul>
<li><a href="https://hail.is/docs/0.2/methods/relatedness.html#hail.methods.pc_relate"> More on  <i> pc_relate() </i></a></li>

<li><a href="https://hail.is/docs/0.2/methods/misc.html#hail.methods.maximal_independent_set"> More on  <i> maximal_independent_set() </i></a></li>
    </ul>
</details>
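
To illustrate the idea behind removing samples so that no related pair remains, here is a small pure-Python sketch. It uses a simple greedy heuristic (repeatedly remove the sample involved in the most remaining related pairs), which is an assumption made for illustration and not necessarily how Hail's `maximal_independent_set` chooses samples. The sample IDs and pairs are hypothetical.

```python
def samples_to_remove(pairs):
    """Greedily pick samples to remove until no related pair is left."""
    remaining = list(pairs)
    removed = []
    while remaining:
        # count how many remaining related pairs each sample appears in
        degree = {}
        for pair in remaining:
            for sample in pair:
                degree[sample] = degree.get(sample, 0) + 1
        # remove the sample in the most pairs; ties are broken alphabetically
        worst = min(degree, key=lambda s: (-degree[s], s))
        removed.append(worst)
        remaining = [pair for pair in remaining if worst not in pair]
    return removed


all_samples = ["S1", "S2", "S3", "S4", "S5"]
# hypothetical pairs that passed the kinship threshold
related_pairs = [("S1", "S2"), ("S2", "S3"), ("S4", "S5")]

to_remove = samples_to_remove(related_pairs)
unrelated = [s for s in all_samples if s not in to_remove]
print(to_remove)  # ['S2', 'S4']
print(unrelated)  # ['S1', 'S3', 'S5']
```

Removing `S2` resolves two related pairs at once, so three of the five samples can be kept as unrelated.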

[Back to Index](#Index)

In [None]:
# compute relatedness estimates between individuals using a variant of the PC-Relate method
# takes ~4min to run
relatedness_ht = hl.pc_relate(mt_var_pru_filt.GT, min_individual_maf=0.05, min_kinship=0.05, statistics='kin', k=20).key_by()

In [None]:
# identify the samples to remove so that no related pair remains - returns a Hail Table of sample IDs
# keep=False returns the removed samples rather than the independent set
# takes ~2hr & 22 min to run - previous one took ~13min
related_samples_to_remove = hl.maximal_independent_set(relatedness_ht.i, relatedness_ht.j, keep=False)

In [None]:
# using sample IDs (col_key of the MatrixTable), 
# keep only the samples that are NOT found in 'related_samples_to_remove' - these are the unrelated samples 
mt_unrel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=False) 

# do the same as above but this time keep only the samples that ARE found in 'related_samples_to_remove' - these are the related samples 
mt_rel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=True) 

In [None]:
# # write out mts of unrelated and related samples on to the cloud 

# # unrelated mt
# mt_unrel.write(unrel_path, overwrite=False) 

# # related mt 
# mt_rel.write(rel_path, overwrite=False) 

In [None]:
# read saved mts back in 

# unrelated mt
mt_unrel = read_qc(unrelated=True)

# related mt 
mt_rel = read_qc(related=True) 

# 4. PCA
<br>
<details><summary> <b>Why are we doing this?</b> 
    <br>[click <span style="color:green">here</span> to expand] </summary>
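    PCA summarizes genome-wide genotype variation in a small number of axes that capture population structure. We compute the PCs on unrelated individuals only, so that family structure does not drive the components, and then project the related individuals onto that PC space. The resulting PCs are plotted and used to identify outliers.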
    
</details>

[Back to Index](#Index)

### 4a. Function to Run PCA on Unrelated Individuals

[Back to Index](#Index)
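
The function below relies on `hl.hwe_normalized_pca`, which standardizes genotypes under Hardy-Weinberg equilibrium before running PCA. As a rough sketch of that standardization, the plain-Python helper below centers each call by twice the allele frequency, scales it by the expected HWE standard deviation and the number of variants, and sets missing calls to 0 after centering. The helper name and the toy calls are hypothetical, and this is an illustration of the normalization only, not of the PCA itself.

```python
import math


def hwe_normalize(calls, n_variants):
    """Standardize one variant's alternate allele counts under HWE.

    calls: alternate allele counts per sample (0, 1, 2) or None if missing
    n_variants: total number of variants going into the PCA
    """
    called = [c for c in calls if c is not None]
    p = sum(called) / (2 * len(called))  # alternate allele frequency
    if not 0 < p < 1:
        raise ValueError("monomorphic variants cannot be normalized")
    scale = math.sqrt(2 * p * (1 - p) * n_variants)
    # missing calls are mean-imputed, which is 0 after centering
    return [0.0 if c is None else (c - 2 * p) / scale for c in calls]


print(hwe_normalize([0, 1, 2, 1], n_variants=2))  # [-1.0, 0.0, 1.0, 0.0]
```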

In [None]:
def run_pca(mt: hl.MatrixTable, reg_name: str, out_prefix: str, overwrite: bool = False):
    """
    Run PCA on a dataset and save the scores and loadings.

    :param mt: dataset to run PCA on
    :param reg_name: region name used in the output file names
    :param out_prefix: path prefix for where to save the outputs
    :param overwrite: whether to overwrite an existing loadings table
    :return: None; the scores and loadings are written to `out_prefix`
    """

    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=20, compute_loadings=True)
    # alternate allele frequency per variant, stored with the loadings for use when projecting samples
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)
    # split the scores array into one column per PC (PC1 ... PC20)
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, 21)})
    
    pca_scores.export(out_prefix + reg_name + '_scores.txt.bgz')  # save individual-level genetic region PCs
    pca_loadings.write(out_prefix + reg_name + '_loadings.ht', overwrite)  # save PCA loadings

### 4b. Function to Project Related Individuals
[Back to Index](#Index)
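
Projection reuses the loadings and allele frequencies saved from the unrelated samples. As we understand it, each projected score is the sum over variants of the loading multiplied by the HWE-normalized genotype. The plain-Python sketch below shows that calculation for one sample under this assumption. It is not the gnomAD `pc_project` implementation, and the variant names, loadings, and calls are hypothetical.

```python
import math

# hypothetical loadings for two variants and two PCs, with the allele
# frequency that was stored alongside them (the 'pca_af' annotation)
loadings = {
    "var1": {"loadings": [0.5, -0.5], "pca_af": 0.5},
    "var2": {"loadings": [1.0, 0.0], "pca_af": 0.5},
}

# alternate allele counts for one sample to project
sample_calls = {"var1": 2, "var2": 0}


def project_sample(calls, loadings, n_pcs):
    """Project one sample onto an existing PC space."""
    n_variants = len(loadings)
    scores = [0.0] * n_pcs
    for variant, info in loadings.items():
        g = calls.get(variant)
        af = info["pca_af"]
        # skip missing calls and variants that are monomorphic in the reference set
        if g is None or not 0 < af < 1:
            continue
        gt_norm = (g - 2 * af) / math.sqrt(n_variants * 2 * af * (1 - af))
        for k in range(n_pcs):
            scores[k] += info["loadings"][k] * gt_norm
    return scores


print(project_sample(sample_calls, loadings, n_pcs=2))  # [-0.5, -0.5]
```

Because the projection uses the allele frequencies from the unrelated samples, the related samples land in the same coordinate system as the original PCA scores.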

In [None]:
def project_individuals(pca_loadings, project_mt, reg_name: str, out_prefix: str, overwrite: bool = False):
    """
    Project samples into a predefined PCA space.

    :param pca_loadings: loadings table for the existing PCA space - unrelated samples 
    :param project_mt: MatrixTable of data to project - related samples 
    :param reg_name: region name used in the output file name
    :param out_prefix: path prefix for where to save the PCA projection output
    :param overwrite: currently unused; kept so the signature matches run_pca
    :return: None; the projected scores are exported to `out_prefix`
    """
    ht_projections = pc_project(project_mt, pca_loadings)  
    # split the scores array into one column per PC (PC1 ... PC20)
    ht_projections = ht_projections.transmute(**{f'PC{i}': ht_projections.scores[i - 1] for i in range(1, 21)}) 
    ht_projections.export(out_prefix + reg_name + '_projected_scores.txt.bgz') # save output 
    #return ht_projections # return to user  


### 4c. Global PCA

[Back to Index](#Index)

In [None]:
# run 'run_pca' function for global pca   
run_pca(mt_unrel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/', False)

In [None]:
# run 'project_individuals' function for global pca 
loadings = hl.read_table('gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/global_loadings.ht') # read in the PCA loadings that were obtained from 'run_pca' function 
project_individuals(loadings, mt_rel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/', False) 

### 4d. Subcontinental PCA 
[Back to Index](#Index)

In [None]:
# obtain a list of the genetic regions in the dataset - used the unrelated dataset since it had more samples 
regions = mt_unrel['hgdp_tgp_meta']['Genetic']['region'].collect()
regions = list(dict.fromkeys(regions)) # 7 regions - ['EUR', 'AFR', 'AMR', 'EAS', 'CSA', 'OCE', 'MID']

In [None]:
# set argument values 
subcont_pca_prefix = 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/subcont_pca/subcont_pca_' # path for outputs 
overwrite = False 

In [None]:
# run 'run_pca' function for each region (~27min to run)
# the nb may appear to freeze after printing the log for AMR - don't restart it, just let it run and follow the progress through the SparkUI
# even after all the outputs are produced and the run is complete, the code chunk may still look like it's running (* in the left square bracket)
# to check if the run is complete, look at the output files in the Google Cloud bucket or use the SparkUI 
# once the desired outputs are generated and the run is done, exit the current nb, open a new session, and proceed to the next step
for i in regions:
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the unrelateds per region
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# run 'project_individuals' function for each region (~2min to run)
for i in regions:
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') # for each region, read in the PCA loadings that were obtained from 'run_pca' function 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the relateds per region 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) 

# 5. Outlier Removal
After plotting the PCs, we identified 22 outliers that need to be removed (the table below will be completed for the final report).


| s | Genetic region | Population | Note |
| --- | --- | --- | --- |
| NA20314 | AFR | ASW | Clusters with AMR in global PCA | 
| NA20299 | - | - | - |
| NA20274 | - | - | - |
| HG01880 | - | - | - |
| HG01881 | - | - | - |
| HG01628 | - | - | - |
| HG01629 | - | - | - |
| HG01630 | - | - | - |
| HG01694 | - | - | - |
| HG01696 | - | - | - |
| HGDP00013 | - | - | - |
| HGDP00150 | - | - | - |
| HGDP00029 | - | - | - |
| HGDP01298 | - | - | - |
| HGDP00130 | CSA | Makrani | Closer to AFR than most CSA |
| HGDP01303 | - | - | - |
| HGDP01300 | - | - | - |
| HGDP00621 | MID | Bedouin | Closer to AFR than most MID |
| HGDP01270 | MID | Mozabite | Closer to AFR than most MID |
| HGDP01271 | MID | Mozabite | Closer to AFR than most MID |
| HGDP00057 | - | - | - | 
| LP6005443-DNA_B02 | - | - | - |
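
The removal step amounts to reading the sample IDs from the outliers file and dropping any sample whose ID is in that list. The plain-Python sketch below mirrors that logic on a tiny made-up example. It uses three IDs from the table above, while `SAMPLE_A` and `SAMPLE_B` are hypothetical non-outlier samples, and the in-memory list of lines stands in for the real outliers file.

```python
# stand-in for the lines of the outliers file (one sample ID per line)
outlier_lines = ["NA20314\n", "HGDP00130\n", "HGDP00621\n"]
outliers = {line.rstrip("\n") for line in outlier_lines}

# hypothetical sample IDs in the unrelated and related datasets
unrelated_ids = ["NA20314", "SAMPLE_A", "HGDP00621"]
related_ids = ["HGDP00130", "SAMPLE_B"]

# keep only the samples that are not in the outlier set
unrelated_kept = [s for s in unrelated_ids if s not in outliers]
related_kept = [s for s in related_ids if s not in outliers]

# sanity check: the number of removed samples should match the outlier list
n_removed = (len(unrelated_ids) - len(unrelated_kept)) + (len(related_ids) - len(related_kept))
print(unrelated_kept, related_kept, n_removed)  # ['SAMPLE_A'] ['SAMPLE_B'] 3
```

In the notebook itself, the same check should report 22 removed samples across the unrelated and related datasets.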
	
[Back to Index](#Index)

In [None]:
# read back in the unrelated and related mts to remove outliers and run pca 
# bucket was moved to another project so different paths are used from where these mts were previously saved 
mt_unrel_unfiltered = hl.read_matrix_table(unrel_path) # unrelated mt
mt_rel_unfiltered = hl.read_matrix_table(rel_path) # related mt

In [None]:
# read the outliers file into a list
with hl.utils.hadoop_open(outliers_path) as file: 
    outliers = [line.rstrip('\n') for line in file]
    
# capture and broadcast the list as an expression
outliers_list = hl.literal(outliers)

In [None]:
# remove 22 outliers 
mt_unrel = mt_unrel_unfiltered.filter_cols(~outliers_list.contains(mt_unrel_unfiltered['s']))
mt_rel = mt_rel_unfiltered.filter_cols(~outliers_list.contains(mt_rel_unfiltered['s']))

In [None]:
# sanity check - count the samples once per dataset to avoid repeating the computation
n_unrel_before = mt_unrel_unfiltered.count_cols()
n_unrel_after = mt_unrel.count_cols()
n_rel_before = mt_rel_unfiltered.count_cols()
n_rel_after = mt_rel.count_cols()

print('Unrelated: Before outlier removal: ' + str(n_unrel_before) + ' | After outlier removal: ' + str(n_unrel_after))
print('Related: Before outlier removal: ' + str(n_rel_before) + ' | After outlier removal: ' + str(n_rel_after))

num_outliers = (n_unrel_before - n_unrel_after) + (n_rel_before - n_rel_after)
print('Total samples removed = ' + str(num_outliers)) # should be 22

# 6. Rerun PCA

The following steps are similar to the ones we ran before removing the outliers, except that we now use the updated unrelated and related datasets and a new Google Cloud bucket path to save the outputs. 

[Back to Index](#Index)

### 6a. Global PCA
[Back to Index](#Index)

In [None]:
# run 'run_pca' function for global pca - make sure the code block for the function (located above) is run prior to running this    
run_pca(mt_unrel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/', False)

In [None]:
# run 'project_individuals' function for global pca - make sure the code block for the function (located above) is run prior to running this    
loadings = hl.read_table('gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/global_loadings.ht') # read in the PCA loadings that were obtained from 'run_pca' function 
project_individuals(loadings, mt_rel, 'global', 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/', False) 

### 6b. Subcontinental PCA

For the following section, the notebook may become slow, specifically when running the [run_pca function](#4.-PCA) for each region. If it looks stuck, do not restart it. Let it run and follow the progress through the SparkUI. 

Once the run appears complete, check that:
- the desired outputs are generated
    - verify by checking the specified output Google Cloud bucket
- the run is shown as done in the SparkUI

Then do the following:
1. Save, close, and halt the current notebook
2. Open a new session
3. Proceed to the next step (run the project_individuals function)
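
To know what to look for in the bucket, the sketch below lists the files that `run_pca` is expected to produce for each region, based on the file name suffixes used in that function. The helper name `expected_run_pca_outputs` is hypothetical, and the region list is the one noted in the code comments of this notebook.

```python
def expected_run_pca_outputs(prefix, regions):
    """List the output paths run_pca writes for each region."""
    paths = []
    for region in regions:
        paths.append(prefix + region + "_scores.txt.bgz")  # PCA scores
        paths.append(prefix + region + "_loadings.ht")  # PCA loadings
    return paths


subcont_pca_prefix = "gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/subcont_pca/subcont_pca_"
regions = ["EUR", "AFR", "AMR", "EAS", "CSA", "OCE", "MID"]

for path in expected_run_pca_outputs(subcont_pca_prefix, regions):
    print(path)
```

With seven regions this gives 14 paths. If all of them are present in the bucket and the SparkUI shows no active jobs, the run can be treated as complete.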

[Back to Index](#Index)

In [None]:
# obtain a list of the genetic regions in the dataset - used the unrelated dataset since it had more samples  
regions = mt_unrel['hgdp_tgp_meta']['Genetic']['region'].collect()
regions = list(dict.fromkeys(regions)) # 7 regions - ['EUR', 'AFR', 'AMR', 'EAS', 'CSA', 'OCE', 'MID']

In [None]:
# set argument values 
subcont_pca_prefix = 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/subcont_pca/subcont_pca_' # path for outputs 
overwrite = False 

In [None]:
# run 'run_pca' function for each region 
# takes ~25-27 min  
for i in regions:
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  # filter the unrelateds per region
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# run 'project_individuals' function (located above) for each region - took ~3min 
for i in regions:
    # for each region, read in the PCA loadings that were obtained from 'run_pca' function 
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') 
    # filter the relateds per region 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) 

# 7. Write Out Matrix Table 
[Back to Index](#Index)

In [None]:
# # write out mts of unrelated and related samples separately (post-outlier removal) 
# # the output paths are set in section 1 
# # unrelated mt
# mt_unrel.write(unrel_output, overwrite=False)
# # related mt
# mt_rel.write(rel_output, overwrite=False) 