# Example of using the gnomAD genetic ancestry principal components analysis loadings and random forest classifier

Please read our [blog post](https://gnomad.broadinstitute.org/news/2021-09-using-the-gnomad-ancestry-principal-components-analysis-loadings-and-random-forest-classifier-on-your-dataset/) on important caveats to consider before using the gnomAD ancestry principal components analysis (PCA) loadings and random forest (RF) classifier models on your own dataset.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-following-packages-are-required-for-this-example" data-toc-modified-id="The-following-packages-are-required-for-this-example-1">The following packages are required for this example</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2">Imports</a></span></li><li><span><a href="#Data-Loading" data-toc-modified-id="Data-Loading-3">Data Loading</a></span><ul class="toc-item"><li><span><a href="#Define-file-paths" data-toc-modified-id="Define-file-paths-3.1">Define file paths</a></span><ul class="toc-item"><li><span><a href="#v3-example-paths" data-toc-modified-id="v3-example-paths-3.1.1">v3 example paths</a></span></li><li><span><a href="#v2-example-paths" data-toc-modified-id="v2-example-paths-3.1.2">v2 example paths</a></span></li></ul></li><li><span><a href="#Define-the-number-of-PCs-used-for-v2-and-v3-genetic-ancestry-group-classification" data-toc-modified-id="Define-the-number-of-PCs-used-for-v2-and-v3-genetic-ancestry-group-classification-3.2">Define the number of PCs used for v2 and v3 genetic ancestry group classification</a></span></li><li><span><a href="#Define-the-RF-minimum-probability-used-for-v2-and-v3-genetic-ancestry-group-classification" data-toc-modified-id="Define-the-RF-minimum-probability-used-for-v2-and-v3-genetic-ancestry-group-classification-3.3">Define the RF minimum probability used for v2 and v3 genetic ancestry group classification</a></span></li><li><span><a href="#Load-ONNX-models" data-toc-modified-id="Load-ONNX-models-3.4">Load ONNX models</a></span></li><li><span><a href="#Load-gnomAD-v3.1-loadings-Hail-Table-and-the-VariantDataset-to-apply-projection-and-genetic-ancestry-group-assignment-to" data-toc-modified-id="Load-gnomAD-v3.1-loadings-Hail-Table-and-the-VariantDataset-to-apply-projection-and-genetic-ancestry-group-assignment-to-3.5">Load gnomAD v3.1 loadings Hail Table and the VariantDataset to apply projection and genetic ancestry group 
assignment to</a></span></li><li><span><a href="#Load-gnomAD-v2.1-precomputed-v2-PCA-scores" data-toc-modified-id="Load-gnomAD-v2.1-precomputed-v2-PCA-scores-3.6">Load gnomAD v2.1 precomputed v2 PCA scores</a></span></li></ul></li><li><span><a href="#Perform-PC-projection-using-the-v3.1-PCA-loadings" data-toc-modified-id="Perform-PC-projection-using-the-v3.1-PCA-loadings-4">Perform PC projection using the v3.1 PCA loadings</a></span><ul class="toc-item"><li><span><a href="#Create-dense-MatrixTable-of-only-the-variants-found-in-the-loadings-Table" data-toc-modified-id="Create-dense-MatrixTable-of-only-the-variants-found-in-the-loadings-Table-4.1">Create dense MatrixTable of only the variants found in the loadings Table</a></span></li><li><span><a href="#We-recommend-filtering-to-entries-meeting-GQ,-DP-and-het-AB-'adj'-thresholds" data-toc-modified-id="We-recommend-filtering-to-entries-meeting-GQ,-DP-and-het-AB-'adj'-thresholds-4.2">We recommend filtering to entries meeting GQ, DP and het AB 'adj' thresholds</a></span></li><li><span><a href="#Checkpoint-dense-MT-for-PC-projection" data-toc-modified-id="Checkpoint-dense-MT-for-PC-projection-4.3">Checkpoint dense MT for PC projection</a></span></li><li><span><a href="#Project-test-dataset-genotypes-onto-gnomAD-v3.1-loadings-and-checkpoint-the-scores" data-toc-modified-id="Project-test-dataset-genotypes-onto-gnomAD-v3.1-loadings-and-checkpoint-the-scores-4.4">Project test dataset genotypes onto gnomAD v3.1 loadings and checkpoint the scores</a></span></li></ul></li><li><span><a href="#Assign-genetic-ancestry-group-using-ONNX-RF-model" data-toc-modified-id="Assign-genetic-ancestry-group-using-ONNX-RF-model-5">Assign genetic ancestry group using ONNX RF model</a></span><ul class="toc-item"><li><span><a href="#v3.1-RF-model" data-toc-modified-id="v3.1-RF-model-5.1">v3.1 RF model</a></span></li><li><span><a href="#v2.1-RF-model" data-toc-modified-id="v2.1-RF-model-5.2">v2.1 RF model</a></span></li></ul></li></ul></div>

## The following packages are required for this example

In [1]:
!/opt/conda/default/bin/pip install onnxruntime onnx


## Imports

In [2]:
import onnx
import hail as hl
from gnomad.sample_qc.ancestry import apply_onnx_classification_model, assign_population_pcs
from gnomad.utils.filtering import filter_to_adj

from gnomad_qc.v2.resources.basics import get_gnomad_meta
from gnomad_qc.v4.resources.basics import get_checkpoint_path





## Data Loading

In [3]:
read_if_exists = False

### Define file paths

#### v3 example paths

In [4]:
# v3.1 PCA loadings.
gnomad_v3_loadings = "gs://gcp-public-data--gnomad/release/3.1/pca/gnomad.v3.1.pca_loadings.ht"

# v3.1 ONNX RF model.
gnomad_v3_onnx_rf = "gs://gcp-public-data--gnomad/release/3.1/pca/gnomad.v3.1.RF_fit.onnx"

# Test dataset to apply projection and genetic ancestry group assignment to.
# This will be the path to your dataset VDS.
vds_to_project = "gs://gnomad/v4.0/raw/exomes/testing/gnomad_v4.0_test.vds"

# v3.1 output paths.
test_mt_output_path = get_checkpoint_path("example_gnomad_v3.1_ancestry_rf", mt=True)
test_scores_output_path = get_checkpoint_path("example_gnomad_v3.1_ancestry_rf.scores")
gnomad_v3_assignment_path = get_checkpoint_path("example_gnomad_v3.1_ancestry_rf.assignment")

#### v2 example paths

The v2 example uses our precomputed v2 PCA scores. Projecting with the v2 loadings follows the same process shown for v3.1, but the dataset must be aligned to GRCh37 instead of GRCh38.


In [5]:
# v2.1 ONNX RF model.
gnomad_v2_onnx_rf = "gs://gcp-public-data--gnomad/release/2.1/pca/gnomad.r2.1.RF_fit.onnx"

# v2.1 output path.
gnomad_v2_assignment_path = get_checkpoint_path("example_gnomad_v2.1_ancestry_rf.assignment")

### Define the number of PCs used for v2 and v3 genetic ancestry group classification

In [6]:
v3_num_pcs = 16
v2_num_pcs = 6

### Define the RF minimum probability used for v2 and v3 genetic ancestry group classification

In [7]:
v3_min_prob = 0.75
v2_min_prob = 0.9
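
To illustrate how a minimum probability is typically applied: a sample receives the label of its most probable group only when that probability reaches the threshold, and is otherwise labelled `oth`. The sketch below is plain Python and is not the gnomAD implementation. `assign_label` is a hypothetical helper, and the probabilities are copied from one row of the v3.1 output shown at the end of this notebook.

```python
def assign_label(probs, min_prob, missing_label="oth"):
    """Return the most probable group, or missing_label if it is below min_prob."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= min_prob else missing_label


# prob_* values for one sample from the v3.1 example output.
probs = {
    "afr": 0.07, "ami": 0.0, "amr": 0.07, "asj": 0.0, "eas": 0.01,
    "fin": 0.0, "mid": 0.0, "nfe": 0.82, "oth": 0.0, "sas": 0.03,
}

# The same sample under the v3 (0.75) and v2 (0.9) thresholds.
label_v3_threshold = assign_label(probs, min_prob=0.75)
label_v2_threshold = assign_label(probs, min_prob=0.9)
print(label_v3_threshold, label_v2_threshold)
```

With a top probability of 0.82, this sample keeps its `nfe` label at the v3 threshold but would fall back to `oth` at the stricter v2 threshold, so the threshold choice directly affects how many samples end up unassigned.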

### Load ONNX models

In [8]:
with hl.hadoop_open(gnomad_v2_onnx_rf, "rb") as f:
    v2_onx_fit = onnx.load(f)

with hl.hadoop_open(gnomad_v3_onnx_rf, "rb") as f:
    v3_onx_fit = onnx.load(f)

Initializing Hail with default parameters...
Running on Apache Spark version 3.1.3
SparkUI available at http://jg2-m.c.broad-mpg-gnomad.internal:36357
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.109-b71b065e4bb6
LOGGING: writing to /home/hail/hail-20230710-1923-0.2.109-b71b065e4bb6.log


### Load gnomAD v3.1 loadings Hail Table and the VariantDataset to apply projection and genetic ancestry group assignment to

In [9]:
vds = hl.vds.read_vds(vds_to_project)
v3_loading_ht = hl.read_table(gnomad_v3_loadings)

### Load gnomAD v2.1 precomputed v2 PCA scores

As mentioned above, the v2 example will use our precomputed v2 PCA scores.

In [10]:
v2_meta_ht = get_gnomad_meta("exomes", full_meta=True)
v2_pcs_ht = v2_meta_ht.select(
    scores=hl.array([v2_meta_ht[f"PC{pc+1}"] for pc in range(v2_num_pcs)])
).select_globals()
v2_pcs_ht = v2_pcs_ht.filter(hl.is_defined(v2_pcs_ht.scores[0]))

## Perform PC projection using the v3.1 PCA loadings

### Create dense MatrixTable of only the variants found in the loadings Table

In [11]:
# Keep only the entry annotations needed downstream so that fewer fields are
# split and densified.
# This includes annotations needed for our standard genotype filter ('adj').
vds = hl.vds.VariantDataset(
    vds.reference_data, 
    vds.variant_data.select_entries("LA", "LGT", "GQ", "DP", "LAD")
)

# Split multiallelics.
vds = hl.vds.split_multi(vds, filter_changed_loci=True)

# Filter to variants in the loadings Table.
vds = hl.vds.filter_variants(vds, v3_loading_ht)

# Densify VDS.
mt = hl.vds.to_dense_mt(vds)

### We recommend filtering to entries meeting GQ, DP and het AB 'adj' thresholds

In [12]:
mt = filter_to_adj(mt)
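
As a rough illustration of what an 'adj' genotype filter checks, the sketch below applies GQ, DP and het allele balance (AB) thresholds to a single diploid genotype in plain Python. The default thresholds (GQ >= 20, DP >= 10, AB >= 0.2) are assumptions based on the commonly cited gnomAD defaults, `passes_adj` is a hypothetical helper, and details such as haploid calls and het non-ref genotypes are omitted. See `filter_to_adj` for the actual behaviour.

```python
def passes_adj(gq, dp, alt_ad, is_het, min_gq=20, min_dp=10, min_ab=0.2):
    """Return True if a diploid genotype meets the assumed 'adj' thresholds.

    gq: genotype quality, dp: read depth, alt_ad: reads supporting the alt allele.
    """
    if gq < min_gq or dp < min_dp:
        return False
    if is_het:
        # For a het call, the alt allele must be supported by enough reads.
        return alt_ad / dp >= min_ab
    return True


well_supported_het = passes_adj(gq=50, dp=30, alt_ad=12, is_het=True)
skewed_het = passes_adj(gq=50, dp=30, alt_ad=3, is_het=True)
low_gq_hom = passes_adj(gq=15, dp=30, alt_ad=30, is_het=False)
print(well_supported_het, skewed_het, low_gq_hom)
```

Entries that fail these checks are removed before projection, so low-confidence genotypes do not contribute to the PC scores.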

### Checkpoint dense MT for PC projection

In [13]:
mt = mt.checkpoint(
    test_mt_output_path, 
    overwrite=not read_if_exists, 
    _read_if_exists=read_if_exists
)

2023-07-10 19:29:31.600 Hail: INFO: wrote matrix table with 69018 rows and 649 columns in 3600 partitions to gs://gnomad-tmp/gnomad.exomes.v4.0.qc_data/example_gnomad_v3.1_ancestry_rf.mt


### Project test dataset genotypes onto gnomAD v3.1 loadings and checkpoint the scores

In [14]:
# Project new genotypes onto loadings.
v3_pcs_ht = hl.experimental.pc_project(
    mt.GT, v3_loading_ht.loadings, v3_loading_ht.pca_af,
)

# Checkpoint PC projection results.
v3_pcs_ht = v3_pcs_ht.checkpoint(
    test_scores_output_path, 
    overwrite=not read_if_exists, 
    _read_if_exists=read_if_exists
)

2023-07-10 19:29:33.405 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2023-07-10 19:29:48.687 Hail: INFO: Coerced sorted dataset
2023-07-10 19:29:51.320 Hail: INFO: wrote table with 649 rows in 16 partitions to gs://gnomad-tmp/gnomad.exomes.v4.0.qc_data/example_gnomad_v3.1_ancestry_rf.scores.ht
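
For intuition, projection amounts to a weighted sum of normalized genotypes per sample. The pure-Python sketch below assumes the normalization `(n_alt - 2 * af) / sqrt(n_variants * 2 * af * (1 - af))`, where `af` is the allele frequency stored with the loadings. This is our reading of what `hl.experimental.pc_project` does, not a copy of its code. `project_sample` and the toy inputs are hypothetical.

```python
import math


def project_sample(n_alt, loadings, afs):
    """Project one sample's genotypes onto precomputed PCA loadings.

    n_alt: alt-allele count per variant (0, 1, 2, or None if missing)
    loadings: per-variant list of loadings, one value per PC
    afs: per-variant allele frequency from the reference PCA
    """
    # Keep only variants with an informative reference allele frequency.
    kept = [(g, load, af) for g, load, af in zip(n_alt, loadings, afs) if 0 < af < 1]
    n_variants = len(kept)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for g, load, af in kept:
        if g is None:
            continue  # Missing genotypes contribute nothing to the sum.
        norm = (g - 2 * af) / math.sqrt(n_variants * 2 * af * (1 - af))
        for k in range(n_pcs):
            scores[k] += load[k] * norm
    return scores


# Toy data: four variants, two PCs, one missing genotype.
n_alt = [0, 1, 2, None]
loadings = [[0.5, 0.1], [-0.2, 0.3], [0.1, -0.4], [0.3, 0.3]]
afs = [0.5, 0.5, 0.5, 0.5]
scores = project_sample(n_alt, loadings, afs)
print(scores)
```

Because the reference allele frequencies and loadings are fixed, a new sample can be placed in the existing PC space without rerunning the PCA, which is what allows the pretrained RF model to be applied to the resulting scores.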


## Assign genetic ancestry group using ONNX RF model

### v3.1 RF model

In [15]:
ht, model = assign_population_pcs(
    v3_pcs_ht,
    pc_cols=v3_pcs_ht.scores[:v3_num_pcs],
    fit=v3_onx_fit,
    min_prob=v3_min_prob,
    apply_model_func=apply_onnx_classification_model,
)
ht = ht.checkpoint(
    gnomad_v3_assignment_path, 
    overwrite=not read_if_exists, 
    _read_if_exists=read_if_exists
)

ht.show()
ht.aggregate(hl.agg.counter(ht.pop))

INFO (gnomad.sample_qc.ancestry 369): Found the following sample count after population assignment: nfe: 378, oth: 32, afr: 28, amr: 60, eas: 42, sas: 49, asj: 25, fin: 35
2023-07-10 19:29:55.424 Hail: INFO: Coerced sorted dataset
2023-07-10 19:29:58.024 Hail: INFO: wrote table with 649 rows in 16 partitions to gs://gnomad-tmp/gnomad.exomes.v4.0.qc_data/example_gnomad_v3.1_ancestry_rf.assignment.ht


| s | pca_scores | pop | prob_afr | prob_ami | prob_amr | prob_asj | prob_eas | prob_fin | prob_mid | prob_nfe | prob_oth | prob_sas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | `array<float64>` | str | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| "ALSGEN-S1246_2013-0124" | [9.34e-02,-2.83e-02,4.36e-03,-2.02e-02,-1.16e-02,2.11e-03,2.98e-04,-3.83e-03,2.37e-03,2.95e-04,-5.25e-03,2.54e-03,-3.75e-03,2.57e-03,-1.46e-03,-3.87e-03] | "nfe" | 0.02 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.97 | 0.0 | 0.0 |
| "ALSGEN-S1246_2013-81" | [9.73e-02,-3.32e-02,5.29e-03,-1.79e-02,-1.59e-02,1.48e-02,-1.40e-03,-1.59e-03,1.62e-03,-4.94e-03,1.14e-02,-1.21e-04,1.48e-04,-5.34e-03,2.48e-03,-1.41e-03] | "nfe" | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.99 | 0.0 | 0.0 |
| "ALSGEN-S1246_2057-50" | [8.28e-02,-2.78e-02,7.51e-03,-1.64e-02,-1.23e-02,6.49e-03,-5.25e-03,-4.37e-03,2.19e-03,-1.92e-03,3.36e-03,-1.19e-03,3.38e-03,-2.32e-03,-3.73e-03,-2.49e-04] | "nfe" | 0.08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.91 | 0.0 | 0.01 |
| "ALSGEN-S1246_SLA2010-171" | [9.84e-02,-3.32e-02,-1.09e-03,-2.36e-02,-3.35e-03,-1.12e-02,-7.33e-03,1.16e-03,-3.01e-03,1.47e-02,-1.53e-02,1.95e-03,-9.43e-03,7.76e-03,-6.42e-05,1.65e-03] | "nfe" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.99 | 0.0 | 0.01 |
| "ALSGEN-S1246_SLA2010-368" | [9.31e-02,-3.20e-02,-7.34e-04,-3.51e-02,5.79e-03,-2.60e-02,-7.88e-03,-4.09e-03,-1.12e-03,8.30e-03,-1.82e-02,-3.86e-03,-1.25e-02,1.30e-02,8.98e-05,3.68e-03] | "nfe" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "ALSGEN-S1246_SLA2011-019" | [9.34e-02,-2.86e-02,-6.85e-03,-3.90e-02,1.63e-02,-3.72e-02,-2.91e-03,-6.41e-03,-5.20e-03,8.16e-03,-2.53e-02,-3.63e-03,-2.02e-02,2.23e-02,-3.74e-03,4.86e-03] | "nfe" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "ALSGEN-S1246_SLA2011-093" | [9.46e-02,-3.17e-02,-7.01e-03,-3.43e-02,5.28e-03,-3.19e-02,-4.50e-03,-2.45e-03,-3.98e-03,1.07e-02,-3.08e-02,-3.32e-03,-2.10e-02,1.87e-02,-2.91e-03,2.25e-03] | "nfe" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "ALSGEN-S1246_SLA2011-331" | [9.50e-02,-3.48e-02,-9.17e-03,-4.26e-02,-2.78e-03,-4.84e-02,-2.04e-02,-1.42e-02,-2.77e-03,9.42e-03,-3.62e-02,9.75e-04,-1.37e-02,1.34e-02,-1.03e-04,3.19e-03] | "nfe" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "ALSGEN-S1246_VABT120020" | [8.74e-02,-2.85e-02,8.87e-03,-8.42e-03,-5.52e-03,1.07e-02,3.90e-02,2.25e-02,-5.78e-03,6.61e-03,-3.89e-03,-7.40e-04,-6.29e-04,6.06e-04,-4.05e-03,-6.14e-04] | "nfe" | 0.07 | 0.0 | 0.07 | 0.0 | 0.01 | 0.0 | 0.0 | 0.82 | 0.0 | 0.03 |
| "Aspera-Columbia_alsag17p3" | [9.15e-02,-3.15e-02,1.92e-03,-2.80e-02,-1.16e-02,5.49e-03,-4.57e-03,-7.87e-03,2.89e-03,-3.70e-03,-3.05e-03,-5.60e-04,4.44e-03,-4.43e-03,-4.08e-03,-6.72e-04] | "nfe" | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.98 | 0.0 | 0.0 |


{'afr': 28,
 'amr': 60,
 'asj': 25,
 'eas': 42,
 'fin': 35,
 'nfe': 378,
 'oth': 32,
 'sas': 49}

### v2.1 RF model

In [16]:
ht, model = assign_population_pcs(
    v2_pcs_ht,
    pc_cols=v2_pcs_ht.scores,
    fit=v2_onx_fit,
    min_prob=v2_min_prob,
    apply_model_func=apply_onnx_classification_model,
)
ht = ht.checkpoint(
    gnomad_v2_assignment_path, 
    overwrite=not read_if_exists, 
    _read_if_exists=read_if_exists
)

ht.show()
ht.aggregate(hl.agg.counter(ht.pop))

INFO (gnomad.sample_qc.ancestry 369): Found the following sample count after population assignment: fin: 14181, nfe: 74477, oth: 4631, amr: 21237, eas: 11842, sas: 17305, afr: 10425, asj: 5968
2023-07-10 19:32:20.206 Hail: INFO: Coerced sorted dataset
2023-07-10 19:32:24.510 Hail: INFO: wrote table with 160066 rows in 16 partitions to gs://gnomad-tmp/gnomad.exomes.v4.0.qc_data/example_gnomad_v2.1_ancestry_rf.assignment.ht


| s | pca_scores | pop | prob_afr | prob_amr | prob_asj | prob_eas | prob_fin | prob_nfe | prob_sas |
|---|---|---|---|---|---|---|---|---|---|
| str | `array<float64>` | str | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| "00-0062" | [6.59e-02,-4.08e-02,-1.91e-02,8.17e-02,9.76e-02,-2.08e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0183" | [6.45e-02,-4.07e-02,-1.73e-02,7.24e-02,8.67e-02,-2.25e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0184" | [6.64e-02,-3.70e-02,-1.70e-02,9.22e-02,1.08e-01,-3.11e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0186" | [6.24e-02,-3.81e-02,-1.64e-02,6.25e-02,7.07e-02,-1.32e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0188" | [6.69e-02,-4.19e-02,-1.86e-02,9.69e-02,1.16e-01,-3.18e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0191" | [6.49e-02,-4.36e-02,-1.65e-02,7.96e-02,8.97e-02,-2.36e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0193" | [6.61e-02,-3.88e-02,-2.05e-02,8.33e-02,9.81e-02,-2.57e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0194" | [6.83e-02,-4.24e-02,-2.08e-02,1.00e-01,1.18e-01,-3.71e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0199" | [6.73e-02,-3.96e-02,-2.21e-02,9.67e-02,1.15e-01,-3.88e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| "00-0204" | [6.51e-02,-4.41e-02,-1.55e-02,5.75e-02,6.46e-02,-1.20e-02] | "fin" | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


{'afr': 10425,
 'amr': 21237,
 'asj': 5968,
 'eas': 11842,
 'fin': 14181,
 'nfe': 74477,
 'oth': 4631,
 'sas': 17305}