Skip to content

chienyuchen/TWB-PRS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

49 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TWB-PRS

In this repository, we release fourteen polygenic risk score (PRS) models using Taiwan Biobank data, including five binary phenotypes and nine quantitative phenotypes. Each model, derived from the GWAS-PRS analysis, can be used to estimate the genetic propensity for a specific trait at the individual level. Three well-known PRS algorithms are performed to build prediction models, including clumping and thresholding, Lassosum, and LDpred2.

Taiwan Biobank application: TWBR10411-03 (Title: Constructing Pharmacogenomics Testing Platform and Discovering Causal Genes of Abnormal Drug Responses; PI: Dr. Pei-Lung Chen and co-PI: Dr. Chien-Yu Chen)

Note that the effect size of SNPs in the PRS model is only used for the accumulation of risk scores. The cause of diseases can not be explained via SNPs with high effect sizes because the relation between the effect size and the pathogenicity is uncertain.

Dataset

Both Taiwan Biobank 1.0 (TWB1.0) and 2.0 (TWB2.0) are included in this study.

Dataset Genome Reference Sample Size SNP Number
TWB1.0 hg19 27,500 653,288
TWB2.0 hg38 68,978 748,344

Traits

PRS analysis is applied on fourteen traits selected from Taiwan Biobank. Quantitative traits are derived from the measurements directly, while binary traits are labelled using both measurements and self reports.

The distribution plots of quantitative traits are located at figures/distribution. Note that only values within ±5 standard deviation are plotted.

Quantitative traits:

  • Height
    • SEX is a very strong factor, so it serves as a covariate during GWAS.
    • Two PRS models are built for male and female, respectively.
  • Body mass index (BMI)
  • Blood pressure: systolic blood pressure (SBP), diastolic blood pressure (DBP)
  • Blood lipids: triacelglycerol (TG), low-density lipoprotein (LDL), high-density lipoprotein (HDL)
  • Glycated hemoglobin (HbA1c)
  • Uric acid

Binary traits:

  • Individuals meeting the measurement criteria or with self report are denoted as positive.
  • For analysis, measurement criteria are referred to the diagnostic criteria, though an exact diagnosis should be made by a clinical doctor.
  • The criterium of hypertriglyceridemia is derived from the indication for drugs.
  • Criteria:
Trait Measurement Self report
Obesity BMI ≥ 30 V
Hypertension (HTN) SBP ≥ 140 or DBP ≥ 90 V
Hypertriglyceridemia (HTG) TG ≥ 200 and HDL ≥ 40 X
Type 2 diabetes (T2D) fasting glucose ≥ 126 or HbA1c ≥ 6.5 V
Gout X V
  • Sample size:
Trait TWB1 Control TWB1 Case TWB1 Unknown TWB2 Control TWB2 Case TWB2 Unknown
OBESITY 25,257 2,230 13 63,932 5,042 4
HTN 20,379 7,116 5 51,991 16,972 15
HTG 25,793 1,706 1 65,919 3,056 3
T2D 24,506 2,993 1 61,804 7,171 3
GOUT 25,887 1,612 1 66,523 2,452 3

Pipeline

The adopted PRS pipeline follows a comprehensive tutorial by Choi and colleagues.

Choi, S. W., Mak, T. S.-H., & O’Reilly, P. F. (2020). Tutorial: A guide to performing polygenic risk score analyses. Nature Protocols, 15(9), 2759–2772. https://doi.org/10.1038/s41596-020-0353-1

There are six major steps of the GWAS-PRS analysis pipeline.

  1. Train-Test Split: Split the dataset into training and testing sets. The training set is used for training the PRS model while the testing set is used for validating the PRS model.
  2. Quality Control: Apply quality control on the training set.
  3. Base-Target Split: Split the training set into base and target sets. The base set is used for GWAS analysis while the target set is used for adjusting the effect size of SNPs.
  4. GWAS Analysis: Perform genome-wide association studies on the base set to get the P-value and effect size of each SNP.
  5. PRS Model: Adjust the effect size of SNPs from GWAS using the target set. For example, select representative SNPs based on the linkage disequilibrium.
  6. Validation: Evaluate the performance of PRS models on the testing set.

Performance

The Manhattan plot of each trait derived from GWAS is located at figures/manhattan.

Area under the receiver operating characteristic curve (AUROC) and Spearman's correlation are used to evaluate the performance of a binary trait and a quantitative trait respectively.

Performance of the binary trait

Performance of the quantitative trait

The quantile plot shows the risk stratification. For each model, samples in the test set are divided into 10 quantiles of increasing PRS. Then, in each quantile, the odds ratio is calculated for binary traits, while the mean of values is calculated for quantitative traits. A great difference between the first and the last group represents a good risk stratification. (Quantile plots are located at figures/quantile)

Quantile plot of the hypertriglyceridemia (binary trait)

Quantile plot of the LDL (quantitative trait)

Prediction

PRS prediction of the fourteen traits is available here. Genotype data from the same platform (TWB1.0 or TWB2.0) and population (Taiwanese) are recommended.

Package Requirement

  • plink1.9, plink2, and python3
  • python packages: numpy and pandas

or you can run it in docker

docker build . -t c4lab/twb-prs
docker run -it --rm c4lab/twb-prs -h

Usage

  • models/beta.tar.gz has to be pulled with git lfs and extracted before predicting
$ bash src/predictor.sh -h

Predict the polygenic risk score for individuals
options:
-i, input bfile prefix
-b, beta file (beta.csv)
-r, reference rank file (rank.csv)
-m, type of the trait; 'clf' (binary) or 'reg' (quantitative)
-d, output directory
-o, output basename

Example

$ bash src/predictor.sh \
    -i example/exp \
    -b models/beta/TWB2_HTG.beta.tsv \
    -r models/rank/TWB2_HTG.rank.csv \
    -m clf \
    -d example/output \
    -o exp

Data Source

Taiwan Biobank application: TWBR10411-03 (Title: Constructing Pharmacogenomics Testing Platform and Discovering Causal Genes of Abnormal Drug Responses; PI: Dr. Pei-Lung Chen and co-PI: Dr. Chien-Yu Chen)

The beta file is currently unavailable because of the data availability policy of Taiwan Biobank

Copyright and Usage

Copyright 2022 NTU c4Lab and Taiwan AI Labs. All rights Reserved. The released models can only be used for research or education and are not suitable for medical purposes.

About

PRS models using Taiwan Biobank data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •