PCA plots - Ally - *PENDING*
    - already implemented in R, just need to plot it in Hail
    
----------------------------------------
Further edits needed in this nb: 
- Add the path to the PCA plotting Rmarkdown (in section 5) once available  
- Complete the table in section 5 - need Alicia's help with that 
- Add Ally's code for plots 

# Index
1. [Setting Default Output Paths](#1.-Set-Default-Output-Paths)
2. [Variant Filtering and LD Pruning](#2.-Variant-Filtering-and-LD-Pruning)
3. [Run PC Relate](#3.-Run-PC-Relate)
4. [PCA](#4.-PCA)
    1. [Function to Run PCA on Unrelated Individuals](#4a.-Function-to-Run-PCA-on-Unrelated-Individuals)
    2. [Function to Project Related Individuals](#4b.-Function-to-Project-Related-Individuals)
    3. [Global PCA](#4c.-Global-PCA)
    4. [Subcontinental PCA](#4d.-Subcontinental-PCA)
5. [Outlier Removal](#5.-Outlier-Removal)
6. [Rerun PCA](#6.-Rerun-PCA)
    1. [Global PCA](#6a.-Global-PCA)
    2. [Subcontinental PCA](#6b.-Subcontinental-PCA)
7. [Writing out Matrix Table](#7.-Write-Out-Matrix-Table)

# General Overview 
The purpose of this notebook is to further filter the postQC matrix table to prepare it for LD pruning, compute relatedness and run Principal Component Analysis (PCA).

**This script contains information on how to:**
- Read in a matrix table and run Hail's common variant statistics  
- Filter using allele frequency and call rate
- Run LD pruning 
- Run relatedness and separate related and unrelated individuals
- Calculate PC scores and project samples on to a PC space  
- Run global and subcontinental PCA and plot them 
- Remove PCA outliers (filter using sample IDs)
- Rerun global and subcontinental PCA
- Write out a matrix table 

Author: Mary T. Yohannes

In [None]:
# import hail
import hail as hl

# import the read_qc function
# tmp: this is commented out as the function will continue to change
#from read_qc_function import read_qc

# importing methods from gnomAD needed to project individuals
from gnomad.sample_qc.ancestry import *

## Set Requester Pays Bucket
When running these tutorials, users must specify which project is to be billed. To change which project is billed, set the `GCP_PROJECT_NAME` variable to your own project.

In [None]:
# setting requester pays bucket to use throughout tutorial
GCP_PROJECT_NAME = "diverse-pop-seq-ref" # change this to your project name
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'hgdp_tgp,gcp-public-data--gnomad',
    'spark.hadoop.fs.gs.requester.pays.project.id': GCP_PROJECT_NAME
})

### tmp read_qc function
to be removed once tutorials & function are complete and we can troubleshoot importing

In [None]:
def read_qc(
        default: bool = False,
        post_qc:bool = False,
        sample_qc: bool = False,
        variant_qc: bool = False,
        duplicate: bool = False,
        outlier_removal: bool = False,
        ld_pruning: bool = False,
        rel_unrel: str = 'default') -> hl.MatrixTable:
    """
    Wrapper function to get HGDP+1kGP data as Matrix Table at different stages of QC/filtering.
    By default, returns pre QC MatrixTable with qc filters annotated but not filtered.

    :param bool default: if True will return the preQC version of the dataset
    :param bool post_qc: if True will return a post QC matrix table that has gone through:
        - sample QC
        - variant QC
        - duplicate removal
        - outlier removal
    :param bool sample_qc: if True will return a post sample QC matrix table
    :param bool variant_qc: if True will return a post variant QC matrix table
    :param bool duplicate: if True will return a matrix table with duplicate samples removed
    :param bool outlier_removal: if True will return a matrix table with PCA outliers and duplicate samples removed
    :param bool ld_pruning: if True will return a matrix table that has gone through:
        - sample QC
        - variant QC
        - duplicate removal
        - LD pruning
        - additional variant filtering
    :param str rel_unrel: 'default' leaves the matrix table selected by the other flags unchanged
        if 'all' will return the same matrix table as if ld_pruning is True
        if 'related_pre_outlier' will return a matrix table with only related samples pre pca outlier removal
        if 'unrelated_pre_outlier' will return a matrix table with only unrelated samples pre pca outlier removal
        if 'related_post_outlier' will return a matrix table with only related samples post pca outlier removal
        if 'unrelated_post_outlier' will return a matrix table with only unrelated samples post pca outlier removal
    """
    # Reading in all the tables and matrix tables needed to generate the pre_qc matrix table
    sample_meta = hl.import_table('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/gnomad_meta_v1.tsv')
    sample_qc_meta = hl.read_table('gs://hgdp_tgp/output/gnomad_v3.1_sample_qc_metadata_hgdp_tgp_subset.ht')
    dense_mt = hl.read_matrix_table(
        'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt')
    
    dense_mt = dense_mt.naive_coalesce(5000)


    # Takes a list of dicts and converts it to a struct format (works with nested structs too)
    def dict_to_struct(d):
        fields = {}
        for k, v in d.items():
            if isinstance(v, dict):
                v = dict_to_struct(v)
            fields[k] = v
        return hl.struct(**fields)

    # un-flattening a hail table with nested structure
    # dict to hold struct names as well as nested field names
    d = {}

    # Getting the row field names
    row = sample_meta.row_value

    # returns a dict with the struct names as keys and their inner field names as values
    for name in row:
        def recur(dict_ref, split_name):
            if len(split_name) == 1:
                dict_ref[split_name[0]] = row[name]
                return
            existing = dict_ref.get(split_name[0])
            if existing is not None:
                assert isinstance(existing, dict), existing
                recur(existing, split_name[1:])
            else:
                existing = {}
                dict_ref[split_name[0]] = existing
                recur(existing, split_name[1:])
        recur(d, name.split('.'))

    # using the dict created from flattened struct, creating new structs now un-flattened
    sample_meta = sample_meta.select(**dict_to_struct(d))
    sample_meta = sample_meta.key_by('s')

    # grabbing the columns needed from HGDP metadata
    new_meta = sample_meta.select(sample_meta.hgdp_tgp_meta, sample_meta.bergstrom)

    # creating a table with gnomAD sample metadata and HGDP metadata
    ht = sample_qc_meta.annotate(**new_meta[sample_qc_meta.s])

    # stripping 'v3.1::' from the names to match with the densified MT
    ht = ht.key_by(s=ht.s.replace("v3.1::", ""))

    # Using hl.annotate_cols() method to annotate the gnomAD variant QC metadata onto the matrix table
    mt = dense_mt.annotate_cols(**ht[dense_mt.s])
    

    print(f"sample_qc: {sample_qc}\nvariant_qc: {variant_qc}\nduplicate: {duplicate}" \
          f"\noutlier_removal: { outlier_removal}\nld_pruning: {ld_pruning}\nrel_unrel: {rel_unrel}")
    
    if default:
        print("Returning default preQC matrix table")
        # returns preQC dataset
        return mt
    
    if post_qc:
        print("Returning post sample and variant QC matrix table with duplicates and PCA outliers removed")
        sample_qc = True
        variant_qc = True
        duplicate = True
        outlier_removal = True
    
    if sample_qc:
        print("Running sample QC")
        # run data through sample QC
        # filtering samples to those who should pass gnomADs sample QC
        # this filters to only samples that passed gnomad sample QC hard filters
        mt = mt.filter_cols(~mt.sample_filters.hard_filtered)

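        # annotating partially filtered dataset with variant metadata
        # NOTE: var_meta (a table of variant metadata keyed by locus and alleles) is not defined
        # anywhere in this function; it must be loaded before this line can run
        mt = mt.annotate_rows(**var_meta[mt.locus, mt.alleles])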

    if variant_qc:
        print("Running variant QC")
        # run data through variant QC
        # Subsetting the variants in the dataset to only PASS variants (those which passed gnomAD's variant QC)
        # PASS variants are variants which have no entries in the filters field.
        # This field contains the names of any variant QC filters that the variant failed
        # This is the last step in the QC process
        mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)

    if duplicate:
        print("Removing any duplicate samples")
        # Removing any duplicates in the dataset using hl.distinct_by_col() which removes
        # columns with a duplicate column key. It keeps one column for each unique key.
        # after updating to the new dense_mt, this step is no longer necessary to run
        mt = mt.distinct_by_col()

    if outlier_removal:
        print("Removing PCA outliers")
        # remove PCA outliers and duplicates
        # reading in the PCA outlier list
        # To read in the PCA outlier list, first need to read the file in as a list
        # using hl.hadoop_open here which allows one to read in files into hail from Google cloud storage
        pca_outlier_path = 'gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt'
        with hl.utils.hadoop_open(pca_outlier_path) as file:
            outliers = [line.rstrip('\n') for line in file]

        # Using hl.literal here to convert the list from a python object to a hail expression so that it can be used
        # to filter out samples
        outliers_list = hl.literal(outliers)

        # Using the list of PCA outliers, using the ~ operator which is a negation operator and obtains the complement
        # In this case the complement is samples which are not contained in the pca outlier list
        mt = mt.filter_cols(~outliers_list.contains(mt['s']))

    if ld_pruning:
        print("Returning ld pruned post variant and sample QC matrix table pre PCA outlier removal ")
        # read in dataset which has additional variant filtering and ld pruning run
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt')

    if rel_unrel == "default":
        # do nothing
        # created a default value because there are multiple options for rel/unrel datasets
        mt = mt

    elif rel_unrel == 'related_pre_outlier':
        print("Returning post sample and variant QC matrix table " \
              "pre PCA outlier removal with only related individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate - filter to only related individuals
        mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/rel_updated.mt')

        
    elif rel_unrel == 'unrelated_pre_outlier':
        print("Returning post QC matrix table with only unrelated individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate - filter to only unrelated individuals
        mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt')


    elif rel_unrel == 'related_post_outlier':
        print("Returning post sample and variant QC matrix table " \
              "post PCA outlier removal with only related individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate - filter to only related individuals
        #   - PCA outlier removal
        mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt')


    elif rel_unrel == 'unrelated_post_outlier':
        print("Returning post sample and variant QC matrix table " \
              "post PCA outlier removal with only unrelated individuals")
        # data has gone through:
        #   - sample QC
        #   - variant QC
        #   - duplicate removal
        #   - LD pruning
        #   - pc_relate - filter to only unrelated individuals
        #   - PCA outlier removal
        mt = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt')
        
    # Calculating both variant and sample_qc metrics on the mt before returning
    # so the stats are up to date with the version being written out
    mt = hl.sample_qc(mt)
    mt = hl.variant_qc(mt)
    
    return mt
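
The un-flattening step inside `read_qc` (turning dotted column names such as `hgdp_tgp_meta.Genetic.region` into nested structs) can be easier to follow without Hail. The sketch below does the same thing on a plain Python dict; the function name `unflatten` and the toy row are hypothetical and are not part of the tutorial code.

```python
def unflatten(flat: dict) -> dict:
    """Turn {'a.b': 1, 'a.c': 2, 's': 'x'} into {'a': {'b': 1, 'c': 2}, 's': 'x'}."""
    nested: dict = {}
    for dotted_name, value in flat.items():
        parts = dotted_name.split('.')
        node = nested
        # walk (and create) one nested dict per prefix of the dotted name
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


# toy flattened metadata row (values are made up for illustration)
row = {
    's': 'sample_1',
    'hgdp_tgp_meta.Population': 'pop_1',
    'hgdp_tgp_meta.Genetic.region': 'EUR',
}
nested_row = unflatten(row)
print(nested_row)
```

In `read_qc` the resulting nested dict is then converted into Hail structs by `dict_to_struct`.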

# 1. Set Default Output Paths
These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets. The read_qc() function is intended to take the place of needing to write out and read in datasets by the user. 

By default, we have commented out all of the write steps of the tutorials. If you would like to write out your own datasets, uncomment those sections and replace the paths with your own. 

In [None]:
# input file 
input_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/pre_running_varqc.mt'

# save the filtered and LD pruned mt as an intermediate file since LD pruning takes a while to rerun
intermediate_file_path = 'gs://hgdp-1kg/hgdp_tgp/intermediate_files/filtered_n_pruned_output_updated.mt'

# paths for unrelated and related samples (prior to outlier identification and removal) 
unrel_preoutlier_path = 'gs://hgdp-1kg/hgdp_tgp/unrel_updated.mt'
rel_preoutlier_path = 'gs://hgdp-1kg/hgdp_tgp/rel_updated.mt' 

# path for pre-outlier PCA results - global & subcontinental PCA 
pca_preoutlier_path = 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/'

# outliers file 
outliers_path = 'gs://hgdp-1kg/hgdp_tgp/pca_outliers_v2.txt'

# path for post-outlier PCA results - global & subcontinental PCA 
pca_postoutlier_path = 'gs://hgdp-1kg/hgdp_tgp/pca_postoutlier/'

# final output paths for unrelated and related samples (post-outlier)
unrel_final_output = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt'
rel_final_output = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt'

# 2. Variant Filtering and LD Pruning
<br>
<details><summary> Click <u><span style="color:blue">here</span></u> to learn why we are doing this. </summary>
    
> At this point, we have 155,648,020 SNPs. Since PCA needs far fewer variants (~100-300k), we filter on:
> - AF - allele frequency 
> - call rate - fraction of calls neither missing nor filtered
>
> and then run LD pruning.     
>    
> Linkage disequilibrium (LD) is the correlation between nearby variants such that the alleles at neighboring polymorphisms (observed on the same chromosome) are associated within a population more often than if they were unlinked.
<br>    
For more information on LD pruning click <a href=""> here </a>
</details>

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc"> More on  <i> variant_qc() </i></a></li>

<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.ld_prune"> More on  <i> ld_prune() </i></a></li>
    </ul>
</details>

[Back to Index](#Index)

In [None]:
# read-in the input file using the read_qc function
mt_filt = read_qc(post_qc=True)

#### 2a. Variant Filtering 

In [None]:
# run Hail's common variant statistics (QC metrics) 
mt_var = hl.variant_qc(mt_filt) 

# filter to variants with AF between 0.05 & 0.95, and call rate greater than 0.999    
mt_var_filt = mt_var.filter_rows((mt_var.variant_qc.AF[0] > 0.05) & 
                                 (mt_var.variant_qc.AF[0] < 0.95) & (mt_var.variant_qc.call_rate > 0.999))

# Should print 6787034 snps; this line takes ~20min to run 
print('Num of variants after filtering = ' + str(mt_var_filt.count()[0]))
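
To make the two filters concrete, here is a minimal pure-Python sketch of what they check for a single variant. It is not how Hail computes these values internally; the helper names and toy genotypes are hypothetical. `AF[0]` in the cell above is the reference allele frequency, which the sketch mirrors.

```python
from typing import Optional, Sequence, Tuple


def ref_af_and_call_rate(genotypes: Sequence[Optional[int]]) -> Tuple[Optional[float], float]:
    """genotypes: alt-allele counts (0, 1, 2) per sample; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if not called:
        return None, call_rate
    alt_af = sum(called) / (2 * len(called))
    return 1 - alt_af, call_rate


def passes_filters(genotypes, af_min=0.05, af_max=0.95, min_call_rate=0.999) -> bool:
    ref_af, call_rate = ref_af_and_call_rate(genotypes)
    return ref_af is not None and af_min < ref_af < af_max and call_rate > min_call_rate


common_variant = [0, 1, 2, 1, 0, 1, 0, 1]       # ref AF 0.625, fully called
monomorphic = [0, 0, 0, 0, 0, 0, 0, 0]          # ref AF 1.0, fails the AF filter
poorly_called = [0, 1, None, 1, 0, 1, 0, 1]     # call rate 0.875, fails the call rate filter

for name, gts in [('common', common_variant), ('monomorphic', monomorphic), ('poorly called', poorly_called)]:
    print(name, passes_filters(gts))
```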

#### 2b. LD Pruning
Since the number of variants after this step is now in the ~100-300k range, we proceed to the PCA analysis without any more adjustments.  

In [None]:
# remove correlated variants 
pruned = hl.ld_prune(mt_var_filt.GT, r2=0.1, bp_window_size=500000) # ~113 min to run  
mt_var_pru_filt = mt_var_filt.filter_rows(hl.is_defined(pruned[mt_var_filt.row_key])) 
print('Num of variants after LD pruning = ' + str(mt_var_pru_filt.count()[0])) # 248634 snps
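
The idea behind LD pruning can be illustrated with a small greedy sketch: walk along the chromosome and keep a variant only if its squared correlation (r²) with every already-kept variant inside the window is at or below the threshold. This is a simplification for intuition only; `hl.ld_prune` uses its own distributed algorithm, and the function names and toy data below are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)


def greedy_prune(variants, r2=0.1, bp_window_size=500_000):
    """variants: list of (position, genotype_vector), sorted by position on one chromosome."""
    kept = []
    for pos, gts in variants:
        correlated = any(
            pos - kept_pos <= bp_window_size and r_squared(gts, kept_gts) > r2
            for kept_pos, kept_gts in kept
        )
        if not correlated:
            kept.append((pos, gts))
    return [pos for pos, _ in kept]


toy_variants = [
    (100, [0, 1, 2, 0, 1, 2]),
    (200, [0, 1, 2, 0, 1, 2]),        # identical to the first variant, so it is pruned
    (300, [1, 0, 1, 1, 2, 1]),        # uncorrelated with the first variant, so it is kept
    (1_000_000, [0, 1, 2, 0, 1, 2]),  # identical genotypes but outside the window, so it is kept
]
kept_positions = greedy_prune(toy_variants)
print(kept_positions)
```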

#### 2c. Write out an intermediate file
The LD pruning step takes a non-negligible amount of time to run, so we write out an intermediate file to keep the downstream analysis steps from taking a very long time. This write-out step should take around 23 minutes to run. 

Because of the read_qc function, however, you do not need to run the write-out step. Instead, the function will automatically read in the version of the dataset we wrote out when creating these tutorials. 

If you wish to export your own intermediate file, change `intermediate_file_path` and then replace the read_qc() function call with `hl.read_matrix_table(intermediate_file_path)`.

In [None]:
# # this step will take ~23 min
# # writing out an intermediate file to speed up subsequent analyses
# mt_var_pru_filt.write(intermediate_file_path, overwrite=False)

# read the intermediate file back in for subsequent analyses
mt_var_pru_filt = read_qc(ld_pruning=True)

# 3. Run PC Relate   
<br>
<details><summary>Click <u><span style="color:blue">here</span></u> to learn why we are doing this. </summary>
<br>
When doing Principal Component Analysis (PCA), we need to separate the related and unrelated samples before computing the PC scores and plotting them. If we compute PCA with the related samples in the data set, the population structure and clustering will be affected by the relatedness that exists among those samples. Thus, we first identify the related individuals by computing relatedness estimates (the kinship statistic in this case) using a variant of the PC-Relate method in Hail. We used a minimum minor allele frequency (MAF) filter of 0.05, excluded sample pairs with kinship less than 0.05, and used 20 principal components (PCs) to control for population structure. After getting the sample ID pairs for the related samples, we separate the filtered and pruned mt into relateds and unrelateds.
   
    
For more information on relatedness click <a href="https://hail.is/docs/0.2/methods/relatedness.html#relatedness"> here</a>
    
</details>

<br>
<details><summary> Click <u><span style="color:blue">here</span></u> to learn what metrics we used for pc_relate. 
 </summary>
    
<br>
We computed the kinship statistic using:
<ul>
<li>a minimum minor allele frequency filter of 0.05</li>
<li>excluding sample-pairs with kinship less than 0.05</li>
<li>20 principal components to control for population structure</li>
</ul>
    
For more information on the pc_relate method click <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4716688/">here</a>
    
</details>

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<ul>
<li><a href="https://hail.is/docs/0.2/methods/relatedness.html#hail.methods.pc_relate"> More on  <i> pc_relate() </i></a></li>

<li><a href="https://hail.is/docs/0.2/methods/misc.html#hail.methods.maximal_independent_set"> More on  <i> maximal_independent_set() </i></a></li>
</ul>
    
</details>

[Back to Index](#Index)

In [None]:
# compute kinship statistic
# takes ~4min to run
relatedness_ht = hl.pc_relate(
    mt_var_pru_filt.GT, min_individual_maf=0.05, min_kinship=0.05, statistics='kin', k=20).key_by() 

# identify closely related individuals in pairs (list of sample IDs) 
# takes ~2hr & 22min to run
related_samples_to_remove = hl.maximal_independent_set(relatedness_ht.i, relatedness_ht.j, False) 

# subset the filtered and pruned mt to samples that are NOT present in the list of related individuals  
mt_unrel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=False) 

# do the same as above but this time subset to samples that are present in the related-individuals list   
mt_rel = mt_var_pru_filt.filter_cols(hl.is_defined(related_samples_to_remove[mt_var_pru_filt.col_key]), keep=True) 
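
The goal of the `maximal_independent_set` step is to pick a set of samples to remove so that no related pair remains among the samples that are kept. The sketch below shows that goal with a simple greedy heuristic on plain Python tuples. It is not Hail's algorithm, and the function name and toy pairs are hypothetical.

```python
from collections import defaultdict


def samples_to_remove(pairs):
    """Greedily remove the sample with the most remaining relatives until no related pair is left."""
    neighbours = defaultdict(set)
    for i, j in pairs:
        neighbours[i].add(j)
        neighbours[j].add(i)

    removed = set()
    while any(neighbours.values()):
        # sample with the most remaining relatives (ties broken alphabetically)
        worst = max(sorted(neighbours), key=lambda s: len(neighbours[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.add(worst)
    return removed


# toy related pairs, as would come from the i and j columns of the kinship table
related_pairs = [('A', 'B'), ('A', 'C'), ('D', 'E')]
all_samples = ['A', 'B', 'C', 'D', 'E', 'F']

to_remove = samples_to_remove(related_pairs)
unrelated = [s for s in all_samples if s not in to_remove]
related = [s for s in all_samples if s in to_remove]
print(sorted(to_remove), unrelated)
```

As in the cell above, the "related" set holds only the removed samples, so a relative of a removed sample can still end up in the unrelated set.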

In [None]:
# # write out the unrelated and related mts since they are used beyond this notebook in other analyses     
# # unrelated mt
# mt_unrel.write(unrel_preoutlier_path, overwrite=False) 

# # related mt 
# mt_rel.write(rel_preoutlier_path, overwrite=False)

In [None]:
# read the related and unrelated mts back in using read_qc
# unrelated mt
mt_unrel = read_qc(rel_unrel='unrelated_pre_outlier') 

# related mt 
mt_rel = read_qc(rel_unrel='related_pre_outlier')

# 4. PCA
<br>
<details><summary>Click <u><span style="color:blue">here</span></u> to learn why we are doing this. </summary>
<br>
PCA is run on the unrelated samples first. Then, the related samples are projected onto the PC space of the unrelated samples to get their PC scores. This way, the population structure and clustering are not affected by the relatedness among samples.  
    
</details>

[Back to Index](#Index)

### 4a. Function to Run PCA on Unrelated Individuals

[Back to Index](#Index)

In [None]:
def run_pca(mt: hl.MatrixTable, reg_name:str, out_path: str, overwrite: bool = False):
    """
    Runs PCA on a data set
    :param mt: data set to run PCA on
    :param reg_name: region name for saving output purposes
    :param out_path: path for where to save the outputs
    :return:
    """

    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=20, compute_loadings=True)
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, 21)})
    
    pca_scores.export(out_path + reg_name + '_scores.txt.bgz')  # save individual-level-genetic-region PCs
    pca_loadings.write(out_path + reg_name + '_loadings.ht', overwrite)  # save PCA loadings
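
The "HWE normalized" part of `hwe_normalized_pca` refers to centering and scaling each variant's genotypes by its allele frequency before the PCA itself. The sketch below shows that per-variant standardization on a toy genotype vector, using the same allele frequency definition as `pca_af` above (mean alt-allele count divided by 2). It leaves out any overall scaling constant and missing-data handling that Hail applies, and the function name is hypothetical.

```python
import math


def hwe_normalize(genotypes):
    """genotypes: alt-allele counts (0/1/2) for one variant across samples, with no missing values."""
    p = sum(genotypes) / (2 * len(genotypes))  # alt allele frequency, as in pca_af
    if p == 0 or p == 1:
        raise ValueError('monomorphic variants cannot be normalized')
    sd = math.sqrt(2 * p * (1 - p))  # expected standard deviation under Hardy-Weinberg equilibrium
    return [(g - 2 * p) / sd for g in genotypes]


toy_genotypes = [0, 1, 2, 1]
normalized = hwe_normalize(toy_genotypes)
print(normalized)
```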

### 4b. Function to Project Related Individuals
<br>
<details><summary> For troubleshooting information click <u><span style="color:blue">here</span></u>. </summary>

> If this function is not working, make sure you used the  <code>--packages gnomad</code> argument when starting your cluster
    
</details>

[Back to Index](#Index)

In [None]:
from gnomad.sample_qc.ancestry import *

def project_individuals(pca_loadings, project_mt, reg_name:str, out_path: str, overwrite: bool = False):
    """
    Project samples into predefined PCA space
    :param pca_loadings: existing PCA space of unrelated samples 
    :param project_mt: matrix table of related samples to project  
    :param reg_name: region name for saving output purposes
    :param out_path: path for where to save PCA projection outputs
    :return:
    """
    ht_projections = pc_project(project_mt, pca_loadings)  
    ht_projections = ht_projections.transmute(**{f'PC{i}': ht_projections.scores[i - 1] for i in range(1, 21)}) 
    ht_projections.export(out_path + reg_name + '_projected_scores.txt.bgz') # save output   


### 4c. Global PCA

<br>
<details><summary> Click <u><span style="color:blue">here</span></u> to learn why are we doing this.</summary>
<br>
    
> To see the population structure and clustering on a continental level and contextualize the data globally.    
    
</details>

[Back to Index](#Index)

In [None]:
# run PCA on the unrelated samples
run_pca(mt_unrel, 'global', pca_preoutlier_path, False)  

# read in the PCA loadings of the unrelated samples
loadings = hl.read_table(pca_preoutlier_path+'global_loadings.ht') 

# project the related samples onto the unrelated-samples' PC space 
project_individuals(loadings, mt_rel, 'global', pca_preoutlier_path, False) 
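
Conceptually, projecting a related sample means standardizing its genotypes with the allele frequencies stored alongside the loadings (`pca_af`) and then taking a weighted sum with the loadings, one sum per PC. The sketch below shows this for a single sample in plain Python. It is an illustration under simplifying assumptions: the real `pc_project` helper may differ in scaling and in how it handles missing data, and the function name and toy numbers are hypothetical.

```python
import math


def project_sample(genotypes, afs, loadings):
    """
    genotypes: alt-allele counts per variant for one sample (None marks a missing call)
    afs: alt allele frequency per variant, taken from the unrelated samples
    loadings: one list of loadings per variant, with one value per PC
    """
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for g, p, variant_loadings in zip(genotypes, afs, loadings):
        if g is None or p <= 0 or p >= 1:
            continue  # skip missing calls and monomorphic variants
        z = (g - 2 * p) / math.sqrt(2 * p * (1 - p))
        for k in range(n_pcs):
            scores[k] += variant_loadings[k] * z
    return scores


toy_genotypes = [2, 0, None]
toy_afs = [0.5, 0.5, 0.2]
toy_loadings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # 3 variants, 2 PCs
toy_scores = project_sample(toy_genotypes, toy_afs, toy_loadings)
print(toy_scores)
```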

### 4d. Subcontinental PCA 
<br>

When running the following section, the notebook might freeze after printing the log for EUR, AFR and AMR. If this happens, do not restart it. Let it run and follow the progress with the outputs being generated.  

When complete, check for the following in your specified output path:
- 21 total output files (3 for each region)

Once you have confirmed you have the desired output do the following:
1. Save, close, and halt the current notebook
2. Open a new session
3. Proceed to the next step (run the project_individuals function)
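
The count of 21 follows from the file naming used in `run_pca` and `project_individuals`: each of the 7 regions gets a scores file, a loadings table, and a projected scores file. The sketch below lists the expected names, assuming the default `pca_preoutlier_path` from section 1. Note that the `_projected_scores.txt.bgz` files are written by `project_individuals`, so they only appear once the projection cell has run.

```python
regions = ['EUR', 'AFR', 'AMR', 'EAS', 'CSA', 'OCE', 'MID']
suffixes = ['_scores.txt.bgz', '_loadings.ht', '_projected_scores.txt.bgz']

# default prefix from section 1; change it if you are writing to your own bucket
subcont_pca_prefix = 'gs://hgdp-1kg/hgdp_tgp/pca_preoutlier/' + 'subcont_pca/'

expected_outputs = [subcont_pca_prefix + region + suffix for region in regions for suffix in suffixes]
for path in expected_outputs:
    print(path)
print(len(expected_outputs))  # 21
```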


<br>
<details><summary> Click <u><span style="color:blue">here</span></u> to learn why are we doing this. </summary>
<br>
    
> To see the population structure and clustering on a subcontinental level and contextualize data within continental regions. This also helped us identify outliers which were removed later on.     

</details>
<br>

<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 

<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca"> More on <i> hwe_normalized_pca() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate"> More on <i> annotate() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute"> More on <i> transmute() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on <i> export() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on <i> pc_project() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on <i> collect() </i></a></li>
</ul>
    
</details>

[Back to Index](#Index)

In [None]:
# obtain the continental region of every sample in the data set (used the unrelated data set since it had more samples) 
regions = mt_unrel['hgdp_tgp_meta']['Genetic']['region'].collect()
regions = list(dict.fromkeys(regions)) # remove duplicates while keeping the original order
# There are 7 regions: EUR, AFR, AMR, EAS, CSA, OCE, and MID

# set argument values for PCA 
subcont_pca_prefix = pca_preoutlier_path+'subcont_pca/' # path for outputs 
overwrite = False

# for each region, run PCA on the unrelated samples (~27min to run)
for i in regions:  
    # filter the unrelateds per region
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)
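
The `list(dict.fromkeys(...))` idiom used above removes duplicates while keeping the order in which each value first appears, which a plain `set()` would not guarantee. A small standalone illustration with a made-up list of per-sample regions:

```python
# one region label per sample, as returned by collect() (toy values)
regions_per_sample = ['EUR', 'AFR', 'EUR', 'AMR', 'AFR', 'EAS']

# dict keys are unique and keep insertion order, so this de-duplicates in place
unique_regions = list(dict.fromkeys(regions_per_sample))
print(unique_regions)  # ['EUR', 'AFR', 'AMR', 'EAS']
```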

In [None]:
# for each region, project the related samples onto the unrelated-samples' PC space (~2min to run)
for i in regions:
    # read in the PCA loadings of the unrelated samples for each region 
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') 
    
    # filter the related mt per region 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  
    
    # project 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) 

# 5. Outlier Removal

After plotting the PCs using R (link_the_plotting_Rmarkdown_here), 22 outliers were identified (complete_the_table)

| s | Genetic region | Population | Note |
| --- | --- | --- | -- |
| HG01880 | AFR | ACB | - |
| HG01881 | AFR | ACB | - |
| NA20274 | AFR | ASW | - |
| NA20299 | AFR | ASW | - |
| NA20314 | AFR | ASW | Clusters with AMR in global PCA | 
| HGDP00013 | CSA | Brahui | - |
| HGDP00029 | CSA | Brahui | - |
| HGDP00057 | CSA | Balochi | - | 
| HGDP00130 | CSA | Makrani | Closer to AFR than most CSA |
| HGDP00150 | CSA | Makrani | - |
| HGDP01298 | EAS | Uygur | - |
| HGDP01303 | EAS | Uygur | - |
| HGDP01300 | EAS | Uygur | - |
| LP6005443-DNA_B02 | EAS | Uygur | - |
| HG01628 | EUR | IBS | - |
| HG01629 | EUR | IBS | - |
| HG01630 | EUR | IBS | - |
| HG01694 | EUR | IBS | - |
| HG01696 | EUR | IBS | - |
| HGDP00621 | MID | Bedouin | Closer to AFR than most MID |
| HGDP01270 | MID | Mozabite | Closer to AFR than most MID |
| HGDP01271 | MID | Mozabite | Closer to AFR than most MID |

In [None]:
# read in the unrelated and related mts to remove outliers and rerun pca  
mt_unrel_unfiltered = read_qc(rel_unrel='unrelated_pre_outlier') # unrelated mt
mt_rel_unfiltered = read_qc(rel_unrel='related_pre_outlier') # related mt

# read the outliers file into a list
with hl.utils.hadoop_open(outliers_path) as file: 
    outliers = [line.rstrip('\n') for line in file]
    
# capture and broadcast the list as an expression
outliers_list = hl.literal(outliers)

# remove the 22 outliers from both mts
mt_unrel = mt_unrel_unfiltered.filter_cols(~outliers_list.contains(mt_unrel_unfiltered['s']))
mt_rel = mt_rel_unfiltered.filter_cols(~outliers_list.contains(mt_rel_unfiltered['s']))

# sanity check: count the samples (columns) once per mt and reuse the values
n_unrel_before = mt_unrel_unfiltered.count_cols()
n_unrel_after = mt_unrel.count_cols()
n_rel_before = mt_rel_unfiltered.count_cols()
n_rel_after = mt_rel.count_cols()

print('Unrelated: Before outlier removal ' + str(n_unrel_before) +
      ' | After outlier removal ' + str(n_unrel_after))

print('Related: Before outlier removal ' + str(n_rel_before) +
      ' | After outlier removal ' + str(n_rel_after))

# total number of samples removed across both mts
num_outliers = (n_unrel_before - n_unrel_after) + (n_rel_before - n_rel_after)
print('Total samples removed = ' + str(num_outliers))
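
The outlier removal logic itself is a plain membership filter on sample IDs. The sketch below reproduces it without Hail: an in-memory file stands in for `hl.utils.hadoop_open`, and the sample list is a toy mix of two IDs from the table above and two made-up IDs. Unlike the cell above, the hypothetical `read_ids` helper also skips blank lines, which is a small hardening choice rather than something the tutorial does.

```python
import io


def read_ids(handle):
    """Read one sample ID per line, ignoring blank lines."""
    return [line.rstrip('\n') for line in handle if line.strip()]


# stand-in for the outliers file opened with hl.utils.hadoop_open
outlier_file = io.StringIO('HG01880\nHGDP00013\n\n')
outliers = set(read_ids(outlier_file))

samples = ['HG01880', 'sample_a', 'HGDP00013', 'sample_b']
kept = [s for s in samples if s not in outliers]
n_removed = len(samples) - len(kept)

print('Before outlier removal ' + str(len(samples)) + ' | After outlier removal ' + str(len(kept)))
print('Total samples removed = ' + str(n_removed))
```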

# 6. Rerun PCA

**Before running the sections below make sure you have run sections 4a (PCA) and 4b (projection) above.**

<br>
<details><summary> To learn what is different from the prior PCA run click <u><span style="color:blue">here</span></u>.</summary>
<ul>
<li>updated unrelated and related mts (outliers removed)</li>
<li>new paths for the outputs</li>  
    </ul>
</details>

[Back to Index](#Index)

### 6a. Global PCA (without outliers)

In [None]:
# run PCA on the unrelated samples  
run_pca(mt_unrel, 'global', pca_postoutlier_path, False)

# read in the PCA loadings of the unrelated samples  
loadings = hl.read_table(pca_postoutlier_path+'global_loadings.ht') 

# project the related samples onto the unrelated-samples' PC space 
project_individuals(loadings, mt_rel, 'global', pca_postoutlier_path, False) 

### 6b. Subcontinental PCA (without outliers)

> When running the following section, the notebook might freeze after printing the log for EUR, AFR and AMR. If this happens, do not restart it. Let it run and follow the progress with the outputs being generated.  
>
> When complete, check for the following in your specified output path:
> - 21 total output files (3 for each region)
>
> Once you have confirmed you have the desired output do the following:
> 1. Save, close, and halt the current notebook
> 2. Open a new session
> 3. Proceed to the next step (run the project_individuals function)
>
<br>
<details><summary>For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 

<ul>
<li><a href="https://hail.is/docs/0.2/methods/genetics.html#hail.methods.hwe_normalized_pca"> More on <i> hwe_normalized_pca() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_rows"> More on <i> annotate_rows() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate"> More on <i> annotate() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute"> More on <i> transmute() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.Table.html#hail.Table.export"> More on <i> export() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on <i> pc_project() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on <i> collect() </i></a></li>
    </ul>
    
</details>

[Back to Index](#Index)

In [None]:
# set argument values for PCA 
subcont_pca_prefix = pca_postoutlier_path+'subcont_pca/' # path for outputs 
overwrite = False 


# for each region, run PCA on the unrelated samples (~26 min to run) 
# "regions" is a list containing the 7 continental regions in the data set from section 4d
for i in regions: 
    # filter the unrelateds per region
    subcont_unrel = mt_unrel.filter_cols(mt_unrel['hgdp_tgp_meta']['Genetic']['region'] == i)  
    run_pca(subcont_unrel, i, subcont_pca_prefix, overwrite)

In [None]:
# for each region, project the related samples onto the unrelated-samples' PC space (~3min to run)
for i in regions:
    # read in the PCA loadings of the unrelated samples for each region
    loadings = hl.read_table(subcont_pca_prefix + i + '_loadings.ht') 
    
    # filter the relateds per region 
    subcont_rel = mt_rel.filter_cols(mt_rel['hgdp_tgp_meta']['Genetic']['region'] == i)  
    
    # project 
    project_individuals(loadings, subcont_rel, i, subcont_pca_prefix, overwrite) 

# 7. Write Out Matrix Table 
[Back to Index](#Index)

In [None]:
# # write out mts of unrelated and related samples separately (post-outlier removal) 
# #unrelated mt
# mt_unrel.write('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt',
#                overwrite=False)
# #related mt
# mt_rel.write('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/related.mt',
#              overwrite=False)