## Index
1. [Setting Default Paths](#1.-Set-Default-Paths)
2. [Intersecting Two Datasets](#2.-Intersecting-HGDP+1kGP-unrelateds-with-GGV)
3. [Applying gnomAD RF Model to HGDP+1kGP+GGV Intersect](#3.-Applying-gnomAD-RF-model-to-HGDP+1kGP+GGV-intersect)
    1. [Plotting PCA After Applying gnomAD RF to HGDP+1kGP+GGV Intersect](#3a.-Plotting-PCA-after-applying-gnomAD-RF-to-HGDP+1kGP+GGV-Intersect)
4. [Building an RF Model Using HGDP+1kGP and Applying to a New Dataset](#4.-Building-a-random-forest-model-from-HGDP+1kGP-and-applying-to-a-new-dataset)
    1. [Plotting PCA After Applying HGDP+1kGP RF to GGV](#4a.-Plotting-PCA-after-building-RF-model-from-HGDP+1kGP-dataset-and-applying-it-to-GGV)

# General Overview:

The purpose of this notebook is to show how to use the HGDP+1kGP resource with an external dataset for ancestry analyses. Specifically, we show how to apply a random forest (RF) classifier, a machine learning method trained here on the population metadata labels from gnomAD, to an external dataset in order to infer population labels for its samples. The gnomAD random forest has already been [generated and released previously](https://gnomad.broadinstitute.org/news/2021-09-using-the-gnomad-ancestry-principal-components-analysis-loadings-and-random-forest-classifier-on-your-dataset/).
We also show how to build a random forest classifier from HGDP+1kGP, so that you can train a new model for your own dataset using an arbitrary set of SNVs.

**This notebook contains information on how to:**
- Intersect two datasets
- Apply a random forest model 
- Build a random forest model
- Plot PCA after applying a RF model to a dataset 


**Abbreviations**<br>
HGDP: [Human Genome Diversity Project](https://www.internationalgenome.org/data-portal/data-collection/hgdp)<br>
1kGP: [1000 Genomes Project](https://www.internationalgenome.org/1000-genomes-summary)<br>
GGV: [Gambian Genome Variation Project](https://www.internationalgenome.org/gambian-genome-variation-project/)<br>

Author: Lethukuthula Nkambule

In [1]:
from typing import Tuple

import hail as hl
import pickle
import pandas as pd

from bokeh.io import show, output_notebook, output_file
from bokeh.layouts import column, row
from bokeh.plotting import figure
from bokeh.models.widgets import Panel, Tabs
from bokeh.models import ColumnDataSource, Legend, TableColumn, DataTable
from bokeh.palettes import Category10
from bokeh.transform import factor_cmap
from gnomad.sample_qc.ancestry import assign_population_pcs, pc_project
from sklearn.ensemble import RandomForestClassifier

In [2]:
output_notebook()

In [3]:
hl.init()

Running on Apache Spark version 3.1.3
SparkUI available at http://qc-notebook4-m.c.diverse-pop-seq-ref.internal:34227
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.97-937922d7f46c
LOGGING: writing to /home/hail/hail-20220928-1952-0.2.97-937922d7f46c.log


# 1. Set Default Paths
These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets. 

By default all of the write sections are shown as markdown cells. If you would like to write out your own datasets, you can copy the code and paste it into a new code cell. 

[Back to Index](#Index)

# 2. Intersecting HGDP+1kGP unrelateds with GGV
The first step in applying a random forest model to a new dataset is to intersect the HGDP+1kGP dataset with the Gambian Genome Variation Project (GGV) dataset, so that both contain the same set of variants.
<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<ul>
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.key_rows_by"> More on  <i> key_rows_by() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.collect"> More on  <i> collect() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.union_cols"> More on  <i> union_cols() </i></a></li>
</ul>
    
</details>

[Back to Index](#Index)

### 2a. Read HGDP+1kGP data

In [4]:
# use large HGDP+1kGP
mt_hgdp_tgp = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/intermediate_files/pre_running_varqc.mt',
                               _n_partitions=500)
#mt_unrel = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt')
print(f'Number of variants in HGDP+1kGP before intersecting: {mt_hgdp_tgp.count_rows()}')

Number of variants in HGDP+1kGP before intersecting: 155648020


### 2b. Read GGV data

In [5]:
# GGV dataset is a sparse MT from combining GVCFs.
mt_ggv = hl.read_matrix_table('gs://gnomaf/gambian-genomes/COMBINED_GVCFS/gambian_genomes_merged_gvcfs.mt',
                             _n_partitions=500)

# Hail still keeps the non-variant sites (contain only REF allele). So we have to filter to variant-sites only
mt_ggv = mt_ggv.filter_rows(hl.len(mt_ggv.alleles) > 1)

# The GGV dataset has multiallelic variants that need to be split
mt_ggv = hl.experimental.sparse_split_multi(mt_ggv) # split multiallelic sites

print(f'Number of variant and bi-allelic sites only in GGV before intersecting: {mt_ggv.count_rows()}')

Number of variant and bi-allelic sites only in GGV before intersecting: 70471702


### 2c. Select only fields that will be used downstream
In order to intersect two datasets with `union_cols`, three requirements must be met:

1. The row keys must match.

2. The column key schemas and column schemas must match.

3. The entry schemas must match.
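
These requirements matter because `union_cols` joins the two datasets on their row keys and, by default, keeps only the rows present in both, which is why the variant count drops after the intersection. The sketch below mimics that behavior on toy data in plain Python so it can be run without Hail. The dictionary layout and the `union_cols_inner` helper are illustrative assumptions, not Hail's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy layout: each dataset maps a row key (locus, alleles)
# to a list of genotype calls, one per sample.
RowKey = Tuple[str, Tuple[str, str]]
Dataset = Dict[RowKey, List[str]]


def union_cols_inner(left: Dataset, right: Dataset) -> Dataset:
    """Keep row keys found in both datasets and concatenate their samples."""
    shared = [key for key in left if key in right]
    return {key: left[key] + right[key] for key in shared}


ref: Dataset = {
    ("chr1:100", ("A", "G")): ["0/0", "0/1"],
    ("chr1:200", ("C", "T")): ["1/1", "0/1"],
}
new: Dataset = {
    ("chr1:200", ("C", "T")): ["0/0"],
    ("chr1:300", ("G", "A")): ["0/1"],
}

combined = union_cols_inner(ref, new)
print(len(combined))  # only chr1:200 is shared
print(combined[("chr1:200", ("C", "T"))])  # three samples after the union
```

Because the join is on the full row key, a variant must agree on both position and alleles to survive, which is one reason multiallelic sites are split before intersecting.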

In [6]:
mt_hgdp_tgp_clean = mt_hgdp_tgp.select_cols() # drop all column fields except the key, s (sample ID)
mt_hgdp_tgp_clean = mt_hgdp_tgp_clean.select_rows(mt_hgdp_tgp_clean.rsid) # keep only the rsid row field
mt_hgdp_tgp_clean = mt_hgdp_tgp_clean.select_entries(mt_hgdp_tgp_clean.GT) # keep only the GT entry field

# Collect sample ID list to be used later to check how they were classified by the RF model
hgdp_tgp_samples = mt_hgdp_tgp_clean.s.collect()

In [7]:
mt_ggv_clean = mt_ggv.select_cols()
mt_ggv_clean = mt_ggv_clean.select_rows(mt_ggv_clean.rsid)
mt_ggv_clean = mt_ggv_clean.select_entries(mt_ggv_clean.GT)

# collect GGV samples to list so we can later use this to check how they were classified by the RF model
ggv_samples = mt_ggv_clean.s.collect()

### 2d. Intersect the two datasets

In [8]:
hgdp_tgp_ggv_intersect = mt_hgdp_tgp_clean.union_cols(mt_ggv_clean)

In [9]:
# This step takes a while, so I've already checkpointed to save time
# hgdp_tgp_ggv_intersect = hgdp_tgp_ggv_intersect.checkpoint('gs://hgdp-1kg/tutorial_datasets/data_intersection/hgdp_tgp_ggv_intersect.mt')

In [10]:
hgdp_tgp_ggv_intersect = hl.read_matrix_table('gs://hgdp-1kg/tutorial_datasets/data_intersection/hgdp_tgp_ggv_intersect.mt')


In [11]:
print(f'Number of variants after intersecting HGDP+1kGP with GGV: {hgdp_tgp_ggv_intersect.count_rows()}')

Number of variants after intersecting HGDP+1kGP with GGV: 33908009


# 3. Applying gnomAD RF model to HGDP+1kGP+GGV intersect
<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
    
<ul>
<li><a href="https://hail.is/docs/0.2/experimental/index.html#hail.experimental.pc_project"> More on  <i> pc_project() </i></a></li>

<li><a href="https://hail.is/docs/0.2/utils/index.html#hail.utils.hadoop_open"> More on  <i> hadoop_open() </i></a></li>

</ul>
    
</details>

[Back to Index](#Index)

In [12]:
# gnomAD loadings Hail Table
loadings_ht = hl.read_table('gs://gcp-public-data--gnomad/release/3.1/pca/gnomad.v3.1.pca_loadings.ht')

gnomad_loadings_count = loadings_ht.count()
print(f'Number of variants in gnomAD loadings: {gnomad_loadings_count}')

Number of variants in gnomAD loadings: 76399


In [13]:
# Get the number of variants found in gnomAD loadings and hgdp_tgp_ggv_intersect
# Scores usually shrink towards zero for missingness > 5% and more samples will get classified as OTH

hgdp_tgp_ggv_intersect = hgdp_tgp_ggv_intersect.annotate_rows(
        pca_loadings=loadings_ht[hgdp_tgp_ggv_intersect.row_key]['loadings'],
        pca_af=loadings_ht[hgdp_tgp_ggv_intersect.row_key]['pca_af'],
    )

gnomad_loadings_data_interset_count = hgdp_tgp_ggv_intersect.aggregate_rows(hl.agg.count_where(
    hl.is_defined(hgdp_tgp_ggv_intersect.pca_loadings) & hl.is_defined(hgdp_tgp_ggv_intersect.pca_af)))

In [14]:
print(f'Number of variants common between HGDP+1kGP+GGV & gnomAD RF: {gnomad_loadings_data_interset_count}')
missingness = round((1-(gnomad_loadings_data_interset_count/gnomad_loadings_count))*100, 2)
print(f'Level of missingness: {missingness}%')

Number of variants common between HGDP+1kGP+GGV & gnomAD RF: 40005
Level of missingness: 47.64%
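
The missingness figure is simply the share of gnomAD loading variants that are absent from the intersected dataset. As a worked check of the arithmetic, using the two counts printed above (the `missingness_pct` helper is a hypothetical name introduced for illustration):

```python
def missingness_pct(n_shared: int, n_model: int) -> float:
    """Percentage of the model's variants that are absent from the dataset."""
    return round((1 - n_shared / n_model) * 100, 2)


# Counts taken from the notebook output above
pct = missingness_pct(40005, 76399)
print(f"{76399 - 40005} of 76399 loading variants are missing ({pct}%)")
```

This is far above the roughly 5% level mentioned in the comment above, so a large `oth` fraction should be expected downstream.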


In [15]:
# Project HGDP+1kGP+GGV genotypes onto gnomAD loadings
ht = hl.experimental.pc_project(
    hgdp_tgp_ggv_intersect.GT,
    loadings_ht.loadings,
    loadings_ht.pca_af,
)

2022-09-28 20:03:41 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
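
The earlier remark that scores shrink toward zero under high missingness follows from the projection being a sum over variants: a loading variant that is absent from the data contributes nothing to the sum. The sketch below is a simplified plain-Python version of a Hardy-Weinberg-normalized projection for one sample and one PC. The normalization shown is an assumption about the general approach, and the `project_sample` helper and toy numbers are hypothetical.

```python
import math
from typing import List, Optional


def project_sample(
    n_alt: List[Optional[int]],  # alt-allele counts; None = variant missing
    loadings: List[float],       # loading of each variant on one PC
    afs: List[float],            # reference allele frequency per variant
) -> float:
    """Project one sample onto one PC, skipping missing variants."""
    n_variants = len(loadings)  # normalize by the full loading set
    score = 0.0
    for g, w, p in zip(n_alt, loadings, afs):
        if g is None:
            continue  # a missing variant adds nothing to the score
        score += w * (g - 2 * p) / math.sqrt(n_variants * 2 * p * (1 - p))
    return score


loadings = [0.5, 0.5, 0.5, 0.5]
afs = [0.25, 0.25, 0.25, 0.25]

full = project_sample([2, 2, 2, 2], loadings, afs)
half = project_sample([2, 2, None, None], loadings, afs)
print(full, half)  # the half-missing score is half as large
```

In this toy case every variant pulls the score in the same direction, so losing half of them halves the score. With real data the effect is less uniform, but the projected samples still tend to move toward the origin and away from the reference clusters.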


In [16]:
# Load gnomAD RF model
with hl.hadoop_open('gs://gcp-public-data--gnomad/release/3.1/pca/gnomad.v3.1.RF_fit.pkl', 'rb') as f:
    fit = pickle.load(f)



In [17]:
# Keep only the PCs used by the RF model (6 for gnomAD v2, 16 for v3.1)
num_pcs = fit.n_features_
ht = ht.annotate(scores=ht.scores[:num_pcs])

# Infer population labels in HGDP+1kGP+GGV using gnomAD RF model
ht, rf_model = assign_population_pcs(
    ht,
    pc_cols=[(i + 1) for i in range(num_pcs)],
    fit=fit,
)

2022-09-28 20:04:28 Hail: INFO: Coerced sorted dataset
INFO (gnomad.sample_qc.ancestry 230): Found the following sample count after population assignment: oth: 3008, amr: 365, afr: 1074, sas: 61, nfe: 5
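
The large `oth` count above reflects how classifier probabilities are turned into labels: a sample keeps a population label only when the model is confident enough, and otherwise falls back to `oth`. The sketch below illustrates that rule in plain Python. The 0.9 threshold mirrors what we understand to be the default `min_prob` of the gnomAD helper; treat it, the `assign_label` function, and the toy probabilities as assumptions.

```python
from typing import Dict


def assign_label(
    probs: Dict[str, float],
    min_prob: float = 0.9,       # assumed default confidence threshold
    missing_label: str = "oth",
) -> str:
    """Return the most probable label, or missing_label if not confident."""
    best_pop = max(probs, key=probs.get)
    return best_pop if probs[best_pop] >= min_prob else missing_label


samples = {
    "sample_a": {"afr": 0.97, "amr": 0.02, "nfe": 0.01},
    "sample_b": {"afr": 0.55, "amr": 0.40, "nfe": 0.05},
}
labels = {s: assign_label(p) for s, p in samples.items()}
print(labels)
```

Samples whose projected scores have drifted away from the training clusters tend to receive split probabilities like `sample_b`, which is how missing variants translate into `oth` assignments.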


In [18]:
# PC scores are in one column saved as an array, split this into columns for each PC
gnomad_rf_output = ht.transmute(**{f'PC{i}': ht.pca_scores[i - 1] for i in range(1, num_pcs+1)})
gnomad_rf_output = gnomad_rf_output.to_pandas() # convert Hail Table to Pandas DataFrame

2022-09-28 20:04:45 Hail: INFO: Coerced sorted dataset


## 3a. Plotting PCA after applying gnomAD RF to HGDP+1kGP+GGV Intersect

[Back to Index](#Index)

In [19]:
# Create dictionary to use to store colors for each population label
color_map = {}

# get a list of population labels (inferred) in the data
rf_labels_inferred = gnomad_rf_output['pop'].unique().tolist()

# update the dictionary with unique colors for each population
for i in range(len(rf_labels_inferred)):
    color_map[rf_labels_inferred[i]] = Category10[len(rf_labels_inferred)][i]

tabs1 = []

# split the dataframe into two dataframes: HGDP+1kGP and GGV
ref_samples_df1 = gnomad_rf_output[gnomad_rf_output['s'].isin(hgdp_tgp_samples)]
ggv_samples_df1 = gnomad_rf_output[gnomad_rf_output['s'].isin(ggv_samples)]

def plot_pca(
        ref_df: pd.DataFrame = None,
        data_df: pd.DataFrame = None,
        x_pc: str = None,
        y_pc: str = None
) -> Tuple[figure, figure, figure]:
    """
    This function is for plotting PCA scores

    :param pd.DataFrame ref_df: DataFrame with reference PCA scores to be plotted
    :param pd.DataFrame data_df: DataFrame with data PCA scores to be plotted
    :param str x_pc: x-axis (bottom) PC scores
    :param str y_pc: y-axis (left) PC scores
    
    :rtype: figure, figure, figure
    """
    pref = figure(width=600, height=500, background_fill_color='#fafafa', title = 'HGDP+1kGP')
    pref.add_layout(Legend(), 'right')
    pref.xaxis.axis_label = x_pc
    pref.yaxis.axis_label = y_pc
    
    pdata = figure(width=600, height=500, background_fill_color='#fafafa', title = 'GGV')
    pdata.add_layout(Legend(), 'right')
    pdata.xaxis.axis_label = x_pc
    pdata.yaxis.axis_label = y_pc
    
    pcomb = figure(width=600, height=500, background_fill_color='#fafafa', title = 'HGDP+1kGP+GGV')
    pcomb.add_layout(Legend(), 'right')
    pcomb.xaxis.axis_label = x_pc
    pcomb.yaxis.axis_label = y_pc
    pcomb.circle(ref_df[x_pc].tolist(), ref_df[y_pc].tolist(), size=3, color='grey',
                 alpha=0.3, legend_label='HGDP+1kGP')

    for pop, col in color_map.items():
        # reference
        if pop in ref_df['pop'].unique().tolist():
            pref.circle(ref_df[(ref_df['pop'] == pop)][x_pc].tolist(), ref_df[(ref_df['pop'] == pop)][y_pc].tolist(),
                        size=3, color=col, alpha=0.8, legend_label=pop)
        
        # data
        if pop in data_df['pop'].unique().tolist():
            pdata.circle(data_df[(data_df['pop'] == pop)][x_pc].tolist(), data_df[(data_df['pop'] == pop)][y_pc].tolist(),
                         size=3, color=col, alpha=0.8, legend_label=pop)
        
        # ref+data combined
        if pop in data_df['pop'].unique().tolist():
            pcomb.circle(data_df[(data_df['pop'] == pop)][x_pc].tolist(), data_df[(data_df['pop'] == pop)][y_pc].tolist(),
                         size=3, color=col, alpha=0.8, legend_label=pop)
        
    return pref, pdata, pcomb


for i in range(1, num_pcs, 2):
    xpc = f'PC{i}'
    ypc = f'PC{i + 1}'
    
    p1, p2, p3 = plot_pca(ref_df=ref_samples_df1, data_df=ggv_samples_df1, x_pc=xpc, y_pc=ypc)
        
    tab = Panel(child=column(row(p1, p2), row(p3)), title=f'{xpc}v{ypc}')

    tabs1.append(tab)

In [20]:
show(Tabs(tabs=tabs1))


In [21]:
# Read in the true population labels for the HGDP+1kGP samples
# FR: col 138
# FU: 141
truth_pop_labels = pd.read_csv('gs://hgdp-1kg/hgdp_tgp/qc_and_figure_generation/gnomad_meta_v1.tsv',
                              sep='\t', low_memory=False)
truth_pop_labels = truth_pop_labels[['project_meta.sample_id', 'hgdp_tgp_meta.Project', 'hgdp_tgp_meta.Genetic.region']]
truth_pop_labels.columns = ['Sample', 'Project', 'SuperPop']

# Add population labels to the dataframe with inferred (using RF model) population labels
merged = pd.merge(left=gnomad_rf_output[['s', 'pop']], right=truth_pop_labels,
                  left_on='s', right_on='Sample', how='left')

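# GGV samples are absent from the metadata file, so their Project is missing after the left merge.
# Label them as GGV; all GGV samples are AFR
merged['Project'] = merged['Project'].fillna('GGV')
merged.loc[(merged.Project == 'GGV'),'SuperPop'] = 'AFR'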

In [22]:
# Convert this to a Hail Table so we can easily get counts (not straightforward to do in Pandas)
# first make sure we have every column type as string
merged = merged.astype({'s': str, 'pop': str, 'Sample': str, 'SuperPop': str, 'Project': str})

# convert DataFrame to a Hail Table
t = hl.Table.from_pandas(merged) 

In [23]:
# print counts of how many samples were classified: (1) correctly; (2) incorrectly; or (3) as OTH
(t.group_by(t.Project).aggregate(n=hl.agg.count(),
                                 match=hl.agg.count_where(t.pop.upper()==t.SuperPop),
                                 mismatch=hl.agg.count_where((t.pop.upper()!=t.SuperPop) & (t.pop.upper()!='OTH')),
                                 oth=hl.agg.count_where(t.pop.upper()=='OTH'))).to_pandas()

2022-09-28 20:05:23 Hail: INFO: Ordering unsorted dataset with network shuffle


| Project | n | match | mismatch | oth |
|---|---|---|---|---|
| 1000 Genomes | 3176 | 890 | 64 | 2222 |
| GGV | 394 | 390 | 0 | 4 |
| HGDP | 943 | 148 | 13 | 782 |

### Note about the plots and table above:
Because the gnomAD random forest was trained on 76,399 SNVs and our dataset contains only 40,005 of them, almost half (47.64%) of the variants the model expects are missing. As a result, most HGDP and 1kGP samples are assigned "oth" or misclassified, although 390 of the 394 GGV samples are still classified as AFR.
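
The three-way tally used for the table (match, mismatch, `oth`) can also be reproduced without Hail. The sketch below applies the same counting rule in plain Python to made-up records; the `tally` helper and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tally(records: List[Tuple[str, str, str]]) -> Dict[str, Dict[str, int]]:
    """Count matches, mismatches, and 'oth' calls per project.

    Each record is (project, inferred_pop, true_super_pop).
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"n": 0, "match": 0, "mismatch": 0, "oth": 0}
    )
    for project, inferred, truth in records:
        row = counts[project]
        row["n"] += 1
        if inferred.upper() == "OTH":
            row["oth"] += 1
        elif inferred.upper() == truth:
            row["match"] += 1
        else:
            row["mismatch"] += 1
    return dict(counts)


toy = [
    ("GGV", "afr", "AFR"),
    ("GGV", "oth", "AFR"),
    ("HGDP", "nfe", "EUR"),  # label vocabularies differ, so this is a mismatch
    ("HGDP", "afr", "AFR"),
]
result = tally(toy)
print(result)
```

Note that the comparison is a literal string match after upper-casing, so an inferred label such as `nfe` is counted as a mismatch against a truth label such as `EUR` even when the two refer to closely related groupings.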

# 4. Building a random forest model from HGDP+1kGP and applying to a new dataset
In the following steps we build a random forest (RF) model from unrelated individuals in the HGDP+1kGP dataset, using the global region labels as training labels.
We then apply the model to the Gambian Genome Variation Project (GGV) dataset.

[INSERT LINK] For more information on Random Forest models click [here]().

For more information on the GGV dataset click [here](https://www.internationalgenome.org/gambian-genome-variation-project/).
    

[Back to Index](#Index)

In [24]:
def intersect_ref(
    ref_mt: hl.MatrixTable = None, 
    data_mt: hl.MatrixTable = None
) -> Tuple[hl.MatrixTable, hl.MatrixTable]:
    """
    This function is for intersecting reference data with input data

    :param hl.MatrixTable ref_mt: reference data to be intersected with input data
    :param hl.MatrixTable data_mt: input data to be intersected with reference data
    
    :rtype: hl.MatrixTable, hl.MatrixTable
    """
    
    # keep only the rows of the input data whose row key is also present in the reference
    data_in_ref = data_mt.filter_rows(hl.is_defined(ref_mt.rows()[data_mt.row_key]))
    print('sites common between the data and ref, inds in data: {}'.format(data_in_ref.count()))

    # keep only the rows of the reference whose row key is also present in the input data
    ref_in_data = ref_mt.filter_rows(hl.is_defined(data_mt.rows()[ref_mt.row_key]))
    print('sites common between the ref and data, inds in ref: {}'.format(ref_in_data.count()))
    
    return ref_in_data, data_in_ref


def run_ref_pca(
    mt: hl.MatrixTable = None,
    npcs: int = 20
) -> Tuple[hl.Table, hl.Table]:
    """
    This function is for running PCA

    :param hl.MatrixTable mt: data to be used to run PCA
    :param int npcs: number of principal components to use in running PCA
    
    :rtype: hl.Table, hl.Table
    """
    pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=npcs, compute_loadings=True)
    pca_mt = mt.annotate_rows(pca_af=hl.agg.mean(mt.GT.n_alt_alleles()) / 2)
    pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)

    # individual-level PCs
    pca_scores = pca_scores.transmute(**{f'PC{i}': pca_scores.scores[i - 1] for i in range(1, npcs+1)})
    
    return pca_loadings, pca_scores


def merge_data_with_ref(
    ref_scores: hl.Table = None,
    ref_info: str = 'gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/hgdp_1kg_sample_info.unrelateds.pca_outliers_removed.with_project.tsv',
    data_scores: hl.Table = None
) -> hl.Table:
    """
    This function is for merging the reference scores Table with the data scores Table

    :param hl.Table ref_scores: Table with reference scores
    :param str ref_info: path to file containing SuperPopulation labels
    :param hl.Table data_scores: Table with data scores
    
    :rtype: hl.Table
    """
    print('Merging data with ref')
    ref_info = hl.import_table(ref_info,
                           impute=True, key='Sample')
    ref_merge = ref_scores.annotate(SuperPop = ref_info[ref_scores.s].SuperPop)

    print('merging data and ref data')
    data_ref = ref_merge.union(data_scores, unify=True)
    print('Done merging data with ref')

    return data_ref


In [25]:
# use pruned postQC MT with unrelated individuals to speed up things
mt_unrel = hl.read_matrix_table('gs://hgdp-1kg/hgdp_tgp/datasets_for_others/lindo/ds_without_outliers/unrelated.mt',
                               _n_partitions=500)

In [26]:
# filter the HGDP+1kGP and GGV datasets to variants ONLY common between the two
hgdp_tgp_in_ggv_mt, ggv_in_hgdp_tgp_mt = intersect_ref(ref_mt=mt_unrel, data_mt=mt_ggv)

sites common between the data and ref, inds in data: (237801, 394)
sites common between the ref and data, inds in ref: (237801, 3380)


In [27]:
# compute loadings and scores for the HGDP+1kGP data
ref_pca_loadings, ref_pca_scores = run_ref_pca(mt=hgdp_tgp_in_ggv_mt, npcs=20)

2022-09-28 20:26:59 Hail: INFO: hwe_normalize: found 237801 variants after filtering out monomorphic sites.
2022-09-28 20:34:15 Hail: INFO: pca: running PCA with 20 components...
2022-09-28 20:48:04 Hail: INFO: Coerced sorted dataset
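
`run_ref_pca` stores a per-variant allele frequency (`pca_af`) alongside the loadings so that new samples can later be normalized in the same way as the reference. The frequency is the mean alternate-allele count divided by two. A minimal plain-Python sketch of that calculation, assuming missing genotypes are skipped when averaging (the `alt_allele_freq` helper is hypothetical):

```python
from typing import List, Optional


def alt_allele_freq(n_alt: List[Optional[int]]) -> Optional[float]:
    """Mean alt-allele count over called genotypes, divided by 2 (diploid)."""
    called = [g for g in n_alt if g is not None]
    if not called:
        return None  # no called genotypes, so the frequency is undefined
    return sum(called) / len(called) / 2


# Four samples at one variant: hom-ref, het, hom-alt, and one missing call
af = alt_allele_freq([0, 1, 2, None])
print(af)  # 0.5
```

Computing the frequency on the reference only, and reusing it when projecting the new dataset, keeps the projected scores on the same scale as the reference scores.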


In [28]:
# Project GGV's genotypes onto HGDP+1kGP PCs we computed above
data_projections_ht = pc_project(mt=ggv_in_hgdp_tgp_mt, loadings_ht=ref_pca_loadings,
                                 loading_location='loadings', af_location='pca_af')

# Instead of having all PCs in one column as an array, create a column for each PC
data_scores = data_projections_ht.transmute(**{f'PC{i}': data_projections_ht.scores[i - 1] for i in range(1, 20+1)})

In [29]:
data_ref = merge_data_with_ref(ref_scores=ref_pca_scores, data_scores=data_scores)

data_ref_df = data_ref.to_pandas()

Merging data with ref


2022-09-28 20:49:30 Hail: INFO: Reading table to impute column types
2022-09-28 20:49:31 Hail: INFO: Finished type imputation
  Loading field 'Sample' as type str (imputed)
  Loading field 'SuperPop' as type str (imputed)
  Loading field 'Project' as type str (imputed)


merging data and ref data
Done merging data with ref


2022-09-28 21:04:30 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-09-28 21:04:30 Hail: INFO: Coerced sorted dataset
2022-09-28 21:04:30 Hail: INFO: Coerced sorted dataset


In [30]:
ht, rf_model = assign_population_pcs(
    data_ref_df,
    pc_cols=['PC{}'.format(i + 1) for i in range(20)],
    known_col="SuperPop",
)

INFO (gnomad.sample_qc.ancestry 230): Found the following sample count after population assignment: EUR: 663, oth: 324, EAS: 717, AMR: 385, CSA: 666, AFR: 852, OCE: 28, MID: 139


Random forest feature importances are as follows: [0.1910129  0.18270991 0.14926709 0.13097643 0.09713216 0.05345533
 0.0556701  0.03116968 0.02431842 0.01288155 0.01411129 0.00376679
 0.01476367 0.00142115 0.01438296 0.00527158 0.00414086 0.00751913
 0.00418878 0.00184024]
Estimated error rate for RF model is 0.004437869822485174
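
The reported error rate can be read as the fraction of held-out samples that the model misclassified. It is numerically consistent with 3 errors among 676 test samples, which is what a 20% hold-out of the 3,380 reference samples would give. That split is an inference from the numbers above, not something the notebook states, so treat the values in this sketch as assumptions.

```python
n_reference = 3380      # reference samples with known labels (from the output above)
test_fraction = 0.2     # assumed hold-out fraction
n_test = round(n_reference * test_fraction)
n_wrong = 3             # assumed number of misclassified held-out samples

error_rate = n_wrong / n_test
print(n_test, error_rate)
```

An error rate this low on held-out reference samples says the model separates the HGDP+1kGP regions well on these 20 PCs; it does not by itself guarantee the same accuracy on a projected external dataset.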


## 4a. Plotting PCA after building RF model from HGDP+1kGP dataset and applying it to GGV

[Back to Index](#Index)

In [31]:
# Create dictionary to use to store colors for each population label
color_map = {}

# get a list of population labels in the data
rf_pop_labels = data_ref_df['pop'].unique().tolist()

# update the dictionary with unique colors for each population
for i in range(len(rf_pop_labels)):
    color_map[rf_pop_labels[i]] = Category10[len(rf_pop_labels)][i]
    
    
tabs2 = []

ref_samples_df2 = data_ref_df[data_ref_df['s'].isin(hgdp_tgp_samples)]
ggv_samples_df2 = data_ref_df[data_ref_df['s'].isin(ggv_samples)]

def plot_pca(
        ref_df: pd.DataFrame = None,
        data_df: pd.DataFrame = None,
        x_pc: str = None,
        y_pc: str = None
) -> Tuple[figure, figure]:
    """
    This function is for plotting PCA scores

    :param pd.DataFrame ref_df: DataFrame with reference PCA scores to be plotted
    :param pd.DataFrame data_df: DataFrame with data PCA scores to be plotted
    :param str x_pc: x-axis (bottom) PC scores
    :param str y_pc: y-axis (left) PC scores
    
    :rtype: figure, figure
    """
    pcomb1 = figure(width=600, height=500, background_fill_color='#fafafa', title = 'HGDP+1kGP and GGV')
    pcomb1.add_layout(Legend(), 'right')
    pcomb1.xaxis.axis_label = x_pc
    pcomb1.yaxis.axis_label = y_pc
    pcomb1.circle(data_df[x_pc].tolist(), data_df[y_pc].tolist(), size=3, color='grey',
                 alpha=0.3, legend_label='GGV')
    
    pcomb2 = figure(width=600, height=500, background_fill_color='#fafafa', title = 'GGV and HGDP+1kGP')
    pcomb2.add_layout(Legend(), 'right')
    pcomb2.xaxis.axis_label = x_pc
    pcomb2.yaxis.axis_label = y_pc
    pcomb2.circle(ref_df[x_pc].tolist(), ref_df[y_pc].tolist(), size=3, color='grey',
                 alpha=0.3, legend_label='HGDP+1kGP')

    for pop, col in color_map.items():
        # HGDP+1kGP colored and GGV grey
        if pop in ref_df['pop'].unique().tolist():
            pcomb1.circle(ref_df[(ref_df['pop'] == pop)][x_pc].tolist(), ref_df[(ref_df['pop'] == pop)][y_pc].tolist(),
                        size=3, color=col, alpha=0.8, legend_label=pop)
        
        # HGDP+1kGP grey and GGV colored
        if pop in data_df['pop'].unique().tolist():
            pcomb2.circle(data_df[(data_df['pop'] == pop)][x_pc].tolist(), data_df[(data_df['pop'] == pop)][y_pc].tolist(),
                         size=3, color=col, alpha=0.8, legend_label=pop)
        
    return pcomb1, pcomb2


for i in range(1, 20, 2):
    xpc = f'PC{i}'
    ypc = f'PC{i + 1}'
    
    p1, p2 = plot_pca(ref_df=ref_samples_df2, data_df=ggv_samples_df2, x_pc=xpc, y_pc=ypc)
        
    tab = Panel(child=column(row(p1, p2)), title=f'{xpc}v{ypc}')

    tabs2.append(tab)

In [32]:
show(Tabs(tabs=tabs2))


In [33]:
# Get counts by POP
ggv_samples_df2['pop'].value_counts()

AFR    394
Name: pop, dtype: int64

### Note about the plots:
We can see that all the GGV samples are correctly classified as AFR. So, in this case, building our own model from the HGDP+1kGP data does a better job of inferring ancestry labels than using the gnomAD RF model.