CT-SLEB is a method for generating multi-ancestry polygenic risk scores (PRSs) that incorporate existing large GWAS from European (EUR) populations and smaller GWAS from non-EUR populations. The method has three key steps: (1) clumping and thresholding (CT) to select the SNPs to include in a PRS for the target population; (2) an empirical-Bayes (EB) method to estimate the coefficients of the SNPs; and (3) a super-learning (SL) model to combine a series of PRSs generated under different SNP selection thresholds. The method requires three independent datasets: (1) GWAS summary statistics from training datasets across EUR and non-EUR populations; (2) a tuning dataset for the target population to find optimal model parameters; and (3) a validation dataset for the target population to report the final prediction performance.
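The three steps can be illustrated with a toy sketch in base R. This is not the CTSLEB implementation and does not use the package's API. It rests on several simplifying assumptions: the data are simulated, SNPs are independent (so LD clumping is skipped), a single-ancestry normal-prior shrinkage stands in for the multi-ancestry empirical-Bayes step, and an ordinary linear regression stands in for the super learner. All names (`sim_geno`, `shrink`, `make_prs`, `p_cuts`) are hypothetical.

```r
set.seed(1)

# Simulated toy data (assumption: independent SNPs, so no LD clumping is needed)
n_snp  <- 200
n_tune <- 300
n_val  <- 300
maf    <- runif(n_snp, 0.05, 0.5)
beta   <- c(rnorm(20, sd = 0.3), rep(0, n_snp - 20))  # 20 causal SNPs

# Returns an n x n_snp matrix of 0/1/2 genotype counts
sim_geno <- function(n) sapply(maf, function(f) rbinom(n, 2, f))

g_tune <- sim_geno(n_tune)
y_tune <- drop(g_tune %*% beta) + rnorm(n_tune)
g_val  <- sim_geno(n_val)
y_val  <- drop(g_val %*% beta) + rnorm(n_val)

# Stand-in "GWAS summary statistics": noisy effect estimates with standard errors
se       <- rep(0.05, n_snp)
beta_hat <- beta + rnorm(n_snp, sd = se)
p_value  <- 2 * pnorm(-abs(beta_hat / se))

# Step 1 (thresholding only): candidate p-value cutoffs for SNP selection
p_cuts <- c(5e-8, 1e-4, 1e-2, 0.5)

# Step 2 (simplified shrinkage): posterior mean under beta ~ N(0, tau2),
# with tau2 estimated by method of moments from the selected SNPs
shrink <- function(b, s) {
  tau2 <- max(mean(b^2 - s^2), 0)
  b * tau2 / (tau2 + s^2)
}

# One PRS per cutoff: genotypes of selected SNPs times shrunken effects
make_prs <- function(geno, cut) {
  keep <- p_value < cut
  if (!any(keep)) return(rep(0, nrow(geno)))
  drop(geno[, keep, drop = FALSE] %*% shrink(beta_hat[keep], se[keep]))
}

prs_tune <- sapply(p_cuts, function(cut) make_prs(g_tune, cut))
prs_val  <- sapply(p_cuts, function(cut) make_prs(g_val, cut))
colnames(prs_tune) <- colnames(prs_val) <- paste0("prs_", seq_along(p_cuts))

# Step 3 (stand-in for super learning): fit the combination on the tuning data
fit <- lm(y ~ ., data = data.frame(y = y_tune, prs_tune))

# Report performance on the independent validation data only
pred_val <- predict(fit, newdata = data.frame(prs_val))
r2_val   <- cor(pred_val, y_val)^2
print(r2_val)
```

The sketch mirrors the data split described above: summary statistics define the candidate PRSs, the tuning data fit the combination, and the validation data are touched only once to report the final R².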
The 'CTSLEB' vignette is a good starting point for using the CTSLEB package.
To install CTSLEB, it's easiest to use the 'devtools' package.
install.packages("devtools")
library(devtools)
install_github("andrewhaoyu/CTSLEB")
The example dataset for the vignette can be downloaded through this link.
For data analyses, the pipelines require reference samples from different populations for the clumping step. We use data from the 1000 Genomes Project (Phase 3) in PLINK format. Other reference data can also be used as long as they are in PLINK format. The 1000 Genomes Project data in PLINK format can be downloaded through the following links:
Generating large-scale genotype data with realistic LD for diverse ancestries can be challenging and time-consuming. We have generated 600,000 independent subjects across five ancestries (African, American, European, East Asian, and South Asian) using the LD estimated from the 1000 Genomes Project (Phase 3). Each of the five ancestries contains 120,000 subjects. We release these data through Harvard Dataverse to help researchers develop and test methods in multi-ancestry GWAS settings. The data can be downloaded through the following link. All the simulated genotype data, phenotype data, and summary statistics used in the manuscript are available. The ReadMe file in the directory gives a detailed explanation of the data. Due to the file size limit on Harvard Dataverse, we restrict the shared data to SNPs on the HapMap3 (HM3) or Multi-Ethnic Genotyping Array (MEGA) chips, or both. If you need the data for all ~19 million SNPs, please contact Haoyu Zhang (andrew.haoyu@gmail.com).
The analyses also require PLINK 1.9 for clumping and PLINK 2.0 for calculating PRSs. Guidance can be found in the vignette.
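As a rough illustration of how the two PLINK versions divide the work, the sketch below assembles one clumping command and one scoring command from R without running them. The binary names, file prefixes, and clumping parameters (`--clump-r2 0.1`, `--clump-kb 500`) are assumptions chosen for illustration, not the settings used by CTSLEB; the vignette remains the authoritative guide. The summary statistics file is assumed to have `SNP` and `P` columns, which PLINK 1.9 reads by default, and the weights file is assumed to hold the variant ID, effect allele, and weight in columns 1 to 3 with a header row.

```r
# Hypothetical binary names and file prefixes; replace with your own
plink19_bin <- "plink"
plink2_bin  <- "plink2"

# PLINK 1.9: LD clumping of GWAS summary statistics against a reference panel
clump_args <- c(
  "--bfile", "ref_EUR",            # reference genotypes (.bed/.bim/.fam prefix)
  "--clump", "sumstats_EUR.txt",   # summary statistics with SNP and P columns
  "--clump-p1", "1",               # index-variant p-value threshold
  "--clump-r2", "0.1",             # LD threshold (illustrative value)
  "--clump-kb", "500",             # clumping window in kb (illustrative value)
  "--out", "clump_EUR"
)

# PLINK 2.0: score target genotypes with per-SNP weights
score_args <- c(
  "--bfile", "target_tune",
  "--score", "prs_weights.txt", "1", "2", "3", "header",
  "--out", "prs_tune"
)

# Build the full command lines as strings for inspection (nothing is executed)
build_cmd <- function(bin, args) paste(bin, paste(args, collapse = " "))
clump_cmd <- build_cmd(plink19_bin, clump_args)
score_cmd <- build_cmd(plink2_bin, score_args)
cat(clump_cmd, score_cmd, sep = "\n")
```

Once the binaries are on the `PATH` and the input files exist, the same argument vectors can be passed to `system2()`, for example `system2(plink19_bin, clump_args)`.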
Please direct any problems or questions to Haoyu Zhang (haoyu.zhang2@nih.gov).
Zhang, H., Zhan, J., Jin, J., Zhang, J., Lu, W., Zhao, R., ... & Chatterjee, N. (2023). A new method for multiancestry polygenic prediction improves performance across diverse populations. Nature Genetics, 55(10), 1757-1768.