# Assigning Ancestry Labels Using a Random Forest Model
Author: Lethukuthula Nkambule

**To run this tutorial, we suggest you start your cluster with the following commmand.** *If you have not done so, shut down your current cluster and start a new session as follows.* 

```python3
hailctl dataproc start qc-notebook5 --project [YOUR_PROJECT_NAME] --num-secondary-workers 50 --region=us-central1 --zone=us-central1-b --packages git+https://github.com/broadinstitute/gnomad_methods.git --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --big-executors
```

See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

## Index
1. [Set Default Paths](#1.-Set-Default-Paths)
2. [Read in Pre-QC Dataset and Apply Quality Control Filters](#2.-Read-in-Pre-QC-Dataset-and-Apply-Quality-Control-Filters)
3. [Intersecting Two Datasets](#3.-Intersecting-Two-Datasets)
4. [Applying gnomAD RF Model to HGDP+1kGP-GGV Intersect](#4.-Applying-gnomAD-RF-Model-to-Intersected-Dataset)
    1. [Plotting PCA After Applying gnomAD RF to HGDP+1kGP-GGV Intersect](#4.a.-Plotting-PCA-After-Applying-gnomAD-RF-to-Intersected-Dataset) 
5. [Building an RF Model Using HGDP+1kGP and Applying to a New Dataset](#5.-Building-a-Random-Forest-Model-from-HGDP+1kGP-and-Applying-to-a-New-Dataset)
    1. [Plotting PCA After Applying HGDP+1kGP RF to GGV](#5.a.-Plotting-PCA-After-Building-RF-Model-from-HGDP+1kGP-Dataset-and-Applying-It-to-GGV)

# General Overview:

The purpose of this notebook is to show how to use the HGDP+1kGP resource with an external dataset to do ancestry analyses. Specifically, we show how to apply a machine learning method called a random forest (RF) classifier trained on the population metadata labels from gnomAD to an external dataset to learn population labels in the new dataset. The gnomAD random forest has already been [generated and released previously](https://gnomad.broadinstitute.org/news/2021-09-using-the-gnomad-ancestry-principal-components-analysis-loadings-and-random-forest-classifier-on-your-dataset/).
We also show how to build a random forest classifier for HGDP+1kGP so you can build a new one for your dataset with an arbitrary set of SNVs. 

**This notebook contains information on how to:**
- Intersect two datasets
- Apply a random forest model 
- Build a random forest model
- Plot PCA after applying a RF model to a dataset 


**Abbreviations**<br>
HGDP: [Human Genome Diversity Project](https://www.internationalgenome.org/data-portal/data-collection/hgdp)<br>
1kGP: [1000 Genomes Project](https://www.internationalgenome.org/1000-genomes-summary)<br>
GGV: [Gambian Genome Variation Project](https://www.internationalgenome.org/gambian-genome-variation-project/)<br>

In [1]:
from typing import Tuple

import hail as hl
import pickle
import pandas as pd

# Functions from gnomAD library to apply genotype filters   
from gnomad.utils.filtering import filter_to_adj

from bokeh.io import show, output_notebook, output_file
from bokeh.layouts import column, row
from bokeh.plotting import figure
from bokeh.models.widgets import Panel, Tabs
from bokeh.models import ColumnDataSource, Legend, TableColumn, DataTable
from bokeh.palettes import Category10
from bokeh.transform import factor_cmap
from gnomad.sample_qc.ancestry import assign_population_pcs, pc_project
from sklearn.ensemble import RandomForestClassifier
output_notebook()

In [2]:
hl.init()

Running on Apache Spark version 3.1.3
SparkUI available at http://mty-m.c.diverse-pop-seq-ref.internal:46527
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.109-b71b065e4bb6
LOGGING: writing to /home/hail/hail-20230317-1629-0.2.109-b71b065e4bb6.log


# 1. Set Default Paths

These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets.

**By default, all of the dataset write out sections are shown as markdown cells. If you would like to write out your own dataset, you can copy the code and paste it into a new code cell. Don't forget to change the paths in the following cell accordingly.**  

[Back to Index](#Index)

In [3]:
# First input file path - HGDP+1kGP dataset prior to applying gnomAD QC filters
pre_qc_path = 'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt'

# Second input file path - GGV dataset 
ggv_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/data_intersection/gambian_genomes_merged_gvcfs.mt'

# Path for gnomAD loadings Hail Table
gnomad_loadings_path = 'gs://gcp-public-data--gnomad/release/3.1/pca/gnomad.v3.1.pca_loadings.ht'

# Path for gnomAD's RF model
gnomad_rf_path = 'gs://gcp-public-data--gnomad/release/3.1/pca/gnomad.v3.1.RF_fit.pkl'

# Path for HGDP+1kGP metadata obtained from gnomAD  
gnomad_metadata_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/metadata_and_qc/gnomad_meta_v1.tsv'

# File path for unrelated individuals without outliers - mt written out at the end of nb4 
unrelateds_without_outliers_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/pca_results/unrelateds_without_outliers.mt'

# Path for the intersected dataset 
data_intersect_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/data_intersection/hgdp_tgp_ggv_intersect.mt'

# Path for file containing SuperPopulation labels 
# Created using this script - https://github.com/atgu/GWASpy/blob/main/gwaspy/pca/filter_ref_data.py 
# Further filtered to only include unrelated samples and no outliers 
ref_info_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/data_intersection/hgdp_1kg_sample_info.unrelateds.pca_outliers_removed.with_project.tsv'

# 2. Read in Pre-QC Dataset and Apply Quality Control Filters

Since the post-QC mt was not written out, we run the same function as the previous tutorial notebooks to apply the quality control filters to the pre-QC dataset.

**To avoid errors, make sure to run the next two cells before running any code that includes the post-QC dataset.**

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_matrix_table"> More on  <i> read_matrix_table() </i></a></li>
        
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.describe"> More on  <i> describe() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/linalg/hail.linalg.BlockMatrix.html#hail.linalg.BlockMatrix.filter_cols"> More on  <i> filter_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/linalg/hail.linalg.BlockMatrix.html#hail.linalg.BlockMatrix.filter_rows"> More on  <i> filter_rows() </i></a></li>
</ul>
</details>

[Back to Index](#Index)

In [17]:
# Set up function to apply gnomAD's sample, variant and genotype QC filters

def run_qc(mt):
    
    ## Apply sample QC filters to dataset 
    # This filters to only samples that passed gnomAD's sample QC hard filters  
    mt = mt.filter_cols(~mt.gnomad_sample_filters.hard_filtered) # removed 31 samples
    
    ## Apply variant QC filters to dataset
    # This subsets to only PASS variants - those which passed gnomAD's variant QC
    # PASS variants have an entry in the filters field 
    mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)

    ## Apply genotype QC filters to the dataset
    # This is done using a function imported from gnomAD and is the last step in the QC process
    mt = filter_to_adj(mt)

    return mt

In [18]:
# Read in the HGDP+1kGP pre-QC mt
pre_qc_mt = hl.read_matrix_table(pre_qc_path)

# Run QC 
post_qc_mt = run_qc(pre_qc_mt)

# Repartition post-QC mt to 500 partitions
mt_hgdp_tgp = post_qc_mt.repartition(500)

print(f'Number of variants in HGDP+1kGP before intersecting: {mt_hgdp_tgp.count_rows()}') # 159795273

Number of variants in HGDP+1kGP before intersecting: 159795273


# 3. Intersecting Two Datasets 

The first step in building the random forest model is to intersect the HGDP+1kGP dataset with the Gambian Genome Variation Project dataset.  
<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<ul>
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.key_rows_by"> More on  <i> key_rows_by() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on  <i> collect() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.union_rows"> More on  <i> union_cols() </i></a></li>
</ul>
    
</details>

[Back to Index](#Index)

### Read GGV data

In [19]:
# Read-in GGV dataset - a sparse mt from combining GVCFs
mt_ggv = hl.read_matrix_table(ggv_path, _n_partitions=500)

# Hail still keeps the non-variant sites (contain only REF allele) so we have to filter to variant-sites only
mt_ggv = mt_ggv.filter_rows(hl.len(mt_ggv.alleles) > 1)

# The GGV dataset has multiallelic variants that need to be split
mt_ggv = hl.experimental.sparse_split_multi(mt_ggv) # split multiallelic sites

print(f'Number of variant and bi-allelic sites only in GGV before intersecting: {mt_ggv.count_rows()}') 

Number of variant and bi-allelic sites only in GGV before intersecting: 70471702


### Select only fields that will be used downstream
#### -  In order to intersect two datasets, three requirements must be met:

1. The row keys must match

2. The column key schemas and column schemas must match

3. The entry schemas must match

[Back to Index](#Index)

In [20]:
mt_hgdp_tgp_clean = mt_hgdp_tgp.select_cols() # select s (sampleID) field
mt_hgdp_tgp_clean = mt_hgdp_tgp_clean.select_rows(mt_hgdp_tgp_clean.rsid) # select rsid field
mt_hgdp_tgp_clean = mt_hgdp_tgp_clean.select_entries(mt_hgdp_tgp_clean.GT) # select GT field

# Collect sample ID list to be used later to check how they were classified by the RF model
hgdp_tgp_samples = mt_hgdp_tgp_clean.s.collect()

In [21]:
mt_ggv_clean = mt_ggv.select_cols()
mt_ggv_clean = mt_ggv_clean.select_rows(mt_ggv_clean.rsid)
mt_ggv_clean = mt_ggv_clean.select_entries(mt_ggv_clean.GT)

# Collect GGV samples to list so we can later use this to check how they were classified by the RF model
ggv_samples = mt_ggv_clean.s.collect()

### Intersect the two datasets

In [9]:
hgdp_tgp_ggv_intersect = mt_hgdp_tgp_clean.union_cols(mt_ggv_clean)

- Checkpoint the intersected dataset so that the following commands don't take a long time to run (took 10min to checkpoint) 

```python3
hgdp_tgp_ggv_intersect = hgdp_tgp_ggv_intersect.checkpoint(data_intersect_path)
```

[Back to Index](#Index)

In [4]:
# Read file back in 
hgdp_tgp_ggv_intersect = hl.read_matrix_table(data_intersect_path)

print(f'Number of variants after intersecting HGDP+1kGP with GGV: {hgdp_tgp_ggv_intersect.count_rows()}') 

Number of variants after intersecting HGDP+1kGP with GGV: 33882563


# 4. Applying gnomAD RF Model to Intersected Dataset
<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<ul>
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on  <i> pc_project() </i></a></li>

<li><a href="https://hail.is/docs/0.2/utils/index.html#hail.utils.hadoop_open"> More on  <i> hadoop_open() </i></a></li>

</ul>
    
</details>

[Back to Index](#Index)

In [10]:
# gnomAD loadings Hail Table
loadings_ht = hl.read_table(gnomad_loadings_path)

gnomad_loadings_count = loadings_ht.count()
print(f'Number of variants in gnomAD loadings: {gnomad_loadings_count}') 

Number of variants in gnomAD loadings: 76399


In [11]:
# Get the number of variants found in gnomAD loadings and hgdp_tgp_ggv_intersect
# Scores usually shrink towards zero for missingness > 5% and more samples will get classified as OTH

hgdp_tgp_ggv_intersect = hgdp_tgp_ggv_intersect.annotate_rows(
        pca_loadings=loadings_ht[hgdp_tgp_ggv_intersect.row_key]['loadings'],
        pca_af=loadings_ht[hgdp_tgp_ggv_intersect.row_key]['pca_af'],
    )

gnomad_loadings_data_interset_count = hgdp_tgp_ggv_intersect.aggregate_rows(hl.agg.count_where(
    hl.is_defined(hgdp_tgp_ggv_intersect.pca_loadings) & hl.is_defined(hgdp_tgp_ggv_intersect.pca_af)))

In [12]:
print(f'Number of variants common between HGDP+1kGP+GGV & gnomAD RF: {gnomad_loadings_data_interset_count}')
missingness = round((1 - (gnomad_loadings_data_interset_count/gnomad_loadings_count)) * 100, 2)
print(f'Level of missingness: {missingness}%') 

Number of variants common between HGDP+1kGP+GGV & gnomAD RF: 40005
Level of missingness: 47.64%


In [12]:
# Project HGDP+1kGP+GGV genotypes onto gnomAD loadings
ht = hl.experimental.pc_project(
    hgdp_tgp_ggv_intersect.GT,
    loadings_ht.loadings,
    loadings_ht.pca_af,
)

2023-03-17 16:31:57.044 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'


In [13]:
# Load gnomAD RF model
with hl.hadoop_open(gnomad_rf_path, 'rb') as f:
    fit = pickle.load(f)

In [14]:
# Reduce the scores to only those used in the RF model, this was 6 for v2 and 16 for v3.1 
num_pcs = fit.n_features_
ht = ht.annotate(scores=ht.scores[:num_pcs])

# Infer population labels in HGDP+1kGP+GGV using gnomAD RF model
ht, rf_model = assign_population_pcs(
    ht,
    pc_cols=[(i + 1) for i in range(num_pcs)],
    fit=fit,
) 

2023-03-17 16:33:22.188 Hail: INFO: Coerced sorted dataset
INFO (gnomad.sample_qc.ancestry 268): Found the following sample count after population assignment: oth: 3001, amr: 365, afr: 1085, sas: 58, nfe: 5


In [15]:
# PC scores are in one column saved as an array, split this into columns for each PC
gnomad_rf_output = ht.transmute(**{f'PC{i}': ht.pca_scores[i - 1] for i in range(1, num_pcs+1)})

# Convert Hail Table to Pandas DataFrame
gnomad_rf_output = gnomad_rf_output.to_pandas() 

2023-03-17 16:33:34.013 Hail: INFO: Coerced sorted dataset


## 4.a. Plotting PCA After Applying gnomAD RF to Intersected Dataset

[Back to Index](#Index)

In [22]:
# Create dictionary to use to store colors for each population label
color_map = {}

# Get a list of population labels (inferred) in the data
rf_labels_inferred = list(gnomad_rf_output['pop'].unique())

# Update the dictionary with unique colors for each population
for i in range(len(rf_labels_inferred)):
    color_map[rf_labels_inferred[i]] = Category10[len(rf_labels_inferred)][i]

tabs1 = []

# Split the dataframe into two dataframes: HGDP+1kGP and GGV
ref_samples_df1 = gnomad_rf_output[gnomad_rf_output['s'].isin(hgdp_tgp_samples)]
ggv_samples_df1 = gnomad_rf_output[gnomad_rf_output['s'].isin(ggv_samples)]

def plot_pca(
        ref_df: pd.DataFrame = None,
        data_df: pd.DataFrame = None,
        x_pc: str = None,
        y_pc: str = None
) -> Tuple[figure, figure, figure]:
    """
    This function is for plotting PCA scores

    :param pd.DataFrame ref_df: DataFrame with reference PCA scores to be plotted
    :param pd.DataFrame data_df: DataFrame with data PCA scores to be plotted
    :param str x_pc: x-axis (bottom) PC scores
    :param str y_pc: y-axis (left) PC scores
    
    :rtype: figure, figure, figure
    """
    pref = figure(width=600, height=500, background_fill_color='#fafafa', title = 'HGDP+1kGP')
    pref.add_layout(Legend(), 'right')
    pref.xaxis.axis_label = x_pc
    pref.yaxis.axis_label = y_pc
    
    pdata = figure(width=600, height=500, background_fill_color='#fafafa', title = 'GGV')
    pdata.add_layout(Legend(), 'right')
    pdata.xaxis.axis_label = x_pc
    pdata.yaxis.axis_label = y_pc
    
    pcomb = figure(width=600, height=500, background_fill_color='#fafafa', title = 'HGDP+1kGP+GGV')
    pcomb.add_layout(Legend(), 'right')
    pcomb.xaxis.axis_label = x_pc
    pcomb.yaxis.axis_label = y_pc
    pcomb.circle(ref_df[x_pc].tolist(), ref_df[y_pc].tolist(), size=3, color='grey',
                 alpha=0.3, legend_label='HGDP+1kGP')

    for pop, col in color_map.items():
        # reference
        if pop in list(ref_df['pop'].unique()):
            pref.circle(ref_df[(ref_df['pop'] == pop)][x_pc].tolist(), ref_df[(ref_df['pop'] == pop)][y_pc].tolist(),
                        size=3, color=col, alpha=0.8, legend_label=pop)
        
        # data
        if pop in list(data_df['pop'].unique()):
            pdata.circle(data_df[(data_df['pop'] == pop)][x_pc].tolist(), data_df[(data_df['pop'] == pop)][y_pc].tolist(),
                         size=3, color=col, alpha=0.8, legend_label=pop)
        
        # ref+data combined
        if pop in list(data_df['pop'].unique()):
            pcomb.circle(data_df[(data_df['pop'] == pop)][x_pc].tolist(), data_df[(data_df['pop'] == pop)][y_pc].tolist(),
                         size=3, color=col, alpha=0.8, legend_label=pop)
        
    return pref, pdata, pcomb


for i in range(1, num_pcs, 2):
    xpc = f'PC{i}'
    ypc = f'PC{i + 1}'
    
    p1, p2, p3 = plot_pca(ref_df=ref_samples_df1, data_df=ggv_samples_df1, x_pc=xpc, y_pc=ypc)
        
    tab = Panel(child=column(row(p1, p2), row(p3)), title=f'{xpc}v{ypc}')

    tabs1.append(tab)

In [23]:
show(Tabs(tabs=tabs1))

In [22]:
# Read-in truth population for HGDP+1kGP sample - from gnomAD metadata
truth_pop_labels = pd.read_csv(gnomad_metadata_path, sep='\t', low_memory=False)
truth_pop_labels = truth_pop_labels[['project_meta.sample_id', 'hgdp_tgp_meta.Project', 'hgdp_tgp_meta.Genetic.region']]
truth_pop_labels.columns = ['Sample', 'Project', 'SuperPop']

# Add population labels to the dataframe with inferred (using RF model) population labels
merged = pd.merge(left=gnomad_rf_output[['s', 'pop']], right=truth_pop_labels,
                  left_on='s', right_on='Sample', how='left')

# All GGV samples are AFR
merged['Project'].fillna('GGV', inplace=True)
merged.loc[(merged.Project == 'GGV'),'SuperPop'] = 'AFR'

In [23]:
# Convert this to a Hail Table so we can easily get counts (not straight forward to do in Pandas)
# First make sure we have every column type as string
merged = merged.astype({'s': str, 'pop': str, 'Sample': str, 'SuperPop': str, 'Project': str})

# Convert DataFrame to a Hail Table
t = hl.Table.from_pandas(merged) 

In [27]:
# Print counts of how many samples were classified: (1) correctly; (2) incorrectly; or (3) as OTH
(t.group_by(t.Project).aggregate(n=hl.agg.count(),
                                 match=hl.agg.count_where(t.pop.upper() == t.SuperPop),
                                 mismatch=hl.agg.count_where((t.pop.upper() != t.SuperPop) & (t.pop.upper() != 'OTH')),
                                 oth=hl.agg.count_where(t.pop.upper() == 'OTH'))).to_pandas()

2022-11-22 16:09:05 Hail: INFO: Coerced sorted dataset
2022-11-22 16:09:05 Hail: INFO: Coerced dataset with out-of-order partitions.


Unnamed: 0,Project,n,match,mismatch,oth
0,1000 Genomes,3176,899,62,2215
1,GGV,395,390,0,5
2,HGDP,943,150,12,781


### Note about the plots and table above:
Because the gnomAD random forest is trained on `76,399` SNVs and our dataset only has `40,005` of these, we are missing almost half (`47.64%`) of the training data. As a result, most of the samples are assigned `oth` or misclassified.

# 5. Building a Random Forest Model from HGDP+1kGP and Applying to a New Dataset 

In the following steps we are building a random forest (RF) model with unrelated individuals from the HGDP+1kGP dataset. This was done using global region labels. 
We then apply the model to the Gambian Genome Variation Project (GGV) dataset. 
    
[Back to Index](#Index)

In [26]:
def intersect_ref(
    ref_mt: hl.MatrixTable = None, 
    data_mt: hl.MatrixTable = None
) -> Tuple[hl.MatrixTable, hl.MatrixTable]:
    """
    This function is for intersecting reference data with input data

    :param hl.MatrixTable ref_mt: reference data to be intersected with input data
    :param hl.MatrixTable data_mt: input data to be intersected with reference data
    
    :rtype: hl.MatrixTable, hl.MatrixTable
    """
    
    data_in_ref = data_mt.filter_rows(hl.is_defined(ref_mt.rows()[data_mt.row_key]))
    print('sites common between the data and ref, inds in data: {}'.format(data_in_ref.count()))

    ref_in_data = ref_mt.filter_rows(hl.is_defined(data_mt.rows()[ref_mt.row_key]))
    print('sites commond between the ref and data, inds in ref: {}'.format(ref_in_data.count()))
    
    return ref_in_data, data_in_ref


def run_ref_pca(
    mt: hl.MatrixTable = None,
    npcs: int = 20
) -> Tuple[hl.Table, hl.Table]:
    """
    This function is for running PCA

    :param hl.MatrixTable mt: data to be used to run PCA
    :param int npcs: number of principal components to use in running PCA
    
    :rtype: hl.Table, hl.Table
    """
    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=npcs, compute_loadings=True)
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)

    # individual-level PCs
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, npcs+1)})
    
    return pca_loadings, pca_scores


def merge_data_with_ref(
    ref_scores: hl.Table = None,
    ref_info: str = ref_info_path,
    data_scores: hl.Table = None
) -> pd.DataFrame:
    """
    This function is for merging the reference scores DataFrame with the data scores DataFrame

    :param hl.Table ref_scores: Table with reference scores
    :param str ref_info: path to file containing SuperPopulation labels
    :param hl.Table data_scores: Table with data scores
    
    :rtype: pd.DataFrame
    """
    print('Merging data with ref')
    ref_info = hl.import_table(ref_info,
                           impute=True, key='Sample')
    ref_merge = ref_scores.annotate(SuperPop = ref_info[ref_scores.s].SuperPop)

    print('merging data and ref data')
    data_ref = ref_merge.union(data_scores, unify=True)
    print('Done merging data with ref')

    return data_ref

In [None]:
# Use pruned post-QC mt with unrelated individuals (without outliers) to speed up things
mt_unrel = hl.read_matrix_table(unrelateds_without_outliers_path, _n_partitions=500)

In [None]:
# Filter the HGDP+1kGP and GGV datasets to variants ONLY common between the two
# Took ~1hr to run
hgdp_tgp_in_ggv_mt, ggv_in_hgdp_tgp_mt = intersect_ref(ref_mt=mt_unrel, data_mt=mt_ggv)

In [31]:
# Compute loadings and scores for the HGDP+1kGP data 
# Took ~2hrs & 48min to run
ref_pca_loadings, ref_pca_scores = run_ref_pca(mt=hgdp_tgp_in_ggv_mt, npcs=20)

2022-11-22 18:04:52 Hail: INFO: hwe_normalize: found 191661 variants after filtering out monomorphic sites.
2022-11-22 18:44:30 Hail: INFO: pca: running PCA with 20 components...
2022-11-22 20:09:12 Hail: INFO: Coerced sorted dataset


In [32]:
# Project GGV's genotypes onto HGDP+1kGP PCs we computed above
data_projections_ht = pc_project(mt=ggv_in_hgdp_tgp_mt, loadings_ht=ref_pca_loadings,
                                 loading_location='loadings', af_location='pca_af')

# Instead of having all PCs in one column as an array, create a column for each PC
data_scores = data_projections_ht.transmute(**{f'PC{i}': data_projections_ht.scores[i - 1] for i in range(1, 20+1)})

In [33]:
# Took ~1hr to run
data_ref = merge_data_with_ref(ref_scores=ref_pca_scores, data_scores=data_scores)

data_ref_df = data_ref.to_pandas()

Merging data with ref


2022-11-22 20:13:37 Hail: INFO: Reading table to impute column types
2022-11-22 20:13:39 Hail: INFO: Finished type imputation
  Loading field 'Sample' as type str (imputed)
  Loading field 'SuperPop' as type str (imputed)
  Loading field 'Project' as type str (imputed)


merging data and ref data
Done merging data with ref


2022-11-22 21:14:18 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-11-22 21:14:19 Hail: INFO: Coerced sorted dataset
2022-11-22 21:14:19 Hail: INFO: Coerced sorted dataset


In [34]:
ht, rf_model = assign_population_pcs(
    data_ref_df,
    pc_cols=['PC{}'.format(i + 1) for i in range(20)],
    known_col="SuperPop",
) 

INFO (gnomad.sample_qc.ancestry 268): Found the following sample count after population assignment: EUR: 658, EAS: 717, AMR: 386, oth: 380, CSA: 666, AFR: 797, OCE: 28, MID: 140


Random forest feature importances are as follows: [0.19036443 0.17229619 0.16182895 0.13284561 0.09963888 0.05840517
 0.04459707 0.03327567 0.01894406 0.0139005  0.01497705 0.00370062
 0.01696946 0.00231301 0.01343959 0.00710826 0.00343708 0.0073683
 0.00384564 0.00074445]
Estimated error rate for RF model is 0.0014880952380952328


## 5.a. Plotting PCA After Building RF Model from HGDP+1kGP Dataset and Applying It to GGV

[Back to Index](#Index)

In [None]:
# Create dictionary to use to store colors for each population label
color_map = {}

# Get a list of population labels in the data
rf_pop_labels = data_ref_df['pop'].unique().tolist()

# Update the dictionary with unique colors for each population
for i in range(len(rf_pop_labels)):
    color_map[rf_pop_labels[i]] = Category10[len(rf_pop_labels)][i]
    
    
tabs2 = []

ref_samples_df2 = data_ref_df[data_ref_df['s'].isin(hgdp_tgp_samples)]
ggv_samples_df2 = data_ref_df[data_ref_df['s'].isin(ggv_samples)]

def plot_pca(
        ref_df: pd.DataFrame = None,
        data_df: pd.DataFrame = None,
        x_pc: str = None,
        y_pc: str = None
) -> Tuple[figure, figure]:
    """
    This function is for plotting PCA scores

    :param pd.DataFrame ref_df: DataFrame with reference PCA scores to be plotted
    :param pd.DataFrame data_df: DataFrame with data PCA scores to be plotted
    :param str x_pc: x-axis (bottom) PC scores
    :param str y_pc: y-axis (left) PC scores
    
    :rtype: figure, figure
    """
    pcomb1 = figure(width=600, height=500, background_fill_color='#fafafa', title = 'HGDP+1kGP and GGV')
    pcomb1.add_layout(Legend(), 'right')
    pcomb1.xaxis.axis_label = x_pc
    pcomb1.yaxis.axis_label = y_pc
    pcomb1.circle(data_df[x_pc].tolist(), data_df[y_pc].tolist(), size=3, color='grey',
                 alpha=0.3, legend_label='GGV')
    
    pcomb2 = figure(width=600, height=500, background_fill_color='#fafafa', title = 'GGV and HGDP+1kGP')
    pcomb2.add_layout(Legend(), 'right')
    pcomb2.xaxis.axis_label = x_pc
    pcomb2.yaxis.axis_label = y_pc
    pcomb2.circle(ref_df[x_pc].tolist(), ref_df[y_pc].tolist(), size=3, color='grey',
                 alpha=0.3, legend_label='HGDP+1kGP')

    for pop, col in color_map.items():
        # HGDP+1kGP colored and GGV grey
        if pop in ref_df['pop'].unique().tolist():
            pcomb1.circle(ref_df[(ref_df['pop'] == pop)][x_pc].tolist(), ref_df[(ref_df['pop'] == pop)][y_pc].tolist(),
                        size=3, color=col, alpha=0.8, legend_label=pop)
        
        # HGDP+1kGP grey and GGV colored
        if pop in data_df['pop'].unique().tolist():
            pcomb2.circle(data_df[(data_df['pop'] == pop)][x_pc].tolist(), data_df[(data_df['pop'] == pop)][y_pc].tolist(),
                         size=3, color=col, alpha=0.8, legend_label=pop)
        
    return pcomb1, pcomb2


for i in range(1, 20, 2):
    xpc = f'PC{i}'
    ypc = f'PC{i + 1}'
    
    p1, p2 = plot_pca(ref_df=ref_samples_df2, data_df=ggv_samples_df2, x_pc=xpc, y_pc=ypc)
        
    tab = Panel(child=column(row(p1, p2)), title=f'{xpc}v{ypc}')

    tabs2.append(tab)

In [36]:
show(Tabs(tabs=tabs2))

In [37]:
# Get counts by POP
ggv_samples_df2['pop'].value_counts()

AFR    394
Name: pop, dtype: int64

### Note about the plots:

We can see that all the GGV samples are getting classified as correclty `AFR`. So building our own model using the HGDP+1kGP data instead of using the gnomAD RF model did a better job at inferring ancestry labels in this case.

[Back to Index](#Index)