PRS-CSx is a Python based command line tool that integrates GWAS summary statistics and external LD reference panels from multiple populations to improve cross-population polygenic prediction. Posterior SNP effect sizes are inferred under coupled continuous shrinkage (CS) priors across populations. PRS-CSx is an extension of the Bayesian polygenic prediction method PRS-CS (https://github.com/getian107/PRScs), described in the article:
T Ge, CY Chen, Y Ni, YCA Feng, JW Smoller. Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors. Nature Communications, 10:1776, 2019.
The development and evaluation of PRS-CSx are described in:
Y Ruan, YF Lin, YCA Feng, CY Chen, M Lam, Z Guo, Stanley Global Asia Initiatives, L He, A Sawa, AR Martin, S Qin, H Huang, T Ge. Improving polygenic prediction in ancestrally diverse populations. Nature Genetics, 54:573-580, 2022.
An application of the "meta" and "auto" version of PRS-CSx is described in:
T Ge et al. Development and validation of a trans-ancestry polygenic risk score for type 2 diabetes in diverse populations. Genome Medicine, 14:70, 2022.
🔴 Aug 10, 2023: Added BETA/OR + SE as a new input format (see the format of GWAS summary statistics below), which is now the recommended input data. When using BETA/OR + P as the input, p-values smaller than 1e-323 are truncated, which may reduce prediction accuracy for traits that have highly significant loci.
July 29, 2021: Changed default MCMC parameters.
Jun 4, 2021: Expanded reference panels to five populations.
May 26, 2021: Added suggestions for limiting the number of threads in scipy when running PRS-CS (see Computational Efficiency section below).
Apr 6, 2021: Added projection of the LD matrix to its nearest non-negative definite matrix.
Mar 4, 2021: LD reference panels constructed using the UK Biobank data are now available.
Jan 4, 2021: Improved the accuracy and robustness of random sampling from the generalized inverse Gaussian distribution. Prediction accuracy will probably slightly improve over the initial release.
Dec 20, 2020: Repository made public.
Clone this repository using the following git command:
git clone https://github.com/getian107/PRScsx.git
Alternatively, download the source files from the github website (
Download the LD reference panels and extract files:
LD reference panels constructed using the 1000 Genomes Project phase 3 samples:
AFR reference (~4.44G);
tar -zxvf ldblk_1kg_afr.tar.gz
AMR reference (~3.84G);
tar -zxvf ldblk_1kg_amr.tar.gz
EAS reference (~4.33G);
tar -zxvf ldblk_1kg_eas.tar.gz
EUR reference (~4.56G);
tar -zxvf ldblk_1kg_eur.tar.gz
SAS reference (~5.60G);
tar -zxvf ldblk_1kg_sas.tar.gz
LD reference panels constructed using the UK Biobank data (Notes):
AFR reference (~4.93G);
tar -zxvf ldblk_ukbb_afr.tar.gz
AMR reference (~4.10G);
tar -zxvf ldblk_ukbb_amr.tar.gz
EAS reference (~5.80G);
tar -zxvf ldblk_ukbb_eas.tar.gz
EUR reference (~6.25G);
tar -zxvf ldblk_ukbb_eur.tar.gz
SAS reference (~7.37G);
tar -zxvf ldblk_ukbb_sas.tar.gz
Note that these files are identical to the reference panels used in PRS-CS.
Therefore, there is no need to download again if you are already using PRS-CS.
For regions that don't have access to Dropbox, reference panels can be downloaded from the alternative download site.
Download the SNP information file and put it in the same folder containing the reference panels:
1000 Genomes reference: SNP info (~106M)
UK Biobank reference: SNP info (~108M)
Once Python and its dependencies have been installed, running
will print a list of command-line options.
python PRScsx.py --ref_dir=PATH_TO_REFERENCE --bim_prefix=VALIDATION_BIM_PREFIX --sst_file=SUM_STATS_FILE --n_gwas=GWAS_SAMPLE_SIZE --pop=POPULATION --out_dir=OUTPUT_DIR --out_name=OUTPUT_FILE_PREFIX [--a=PARAM_A --b=PARAM_B --phi=PARAM_PHI --n_iter=MCMC_ITERATIONS --n_burnin=MCMC_BURNIN --thin=MCMC_THINNING_FACTOR --chrom=CHROM --meta=META_FLAG --seed=SEED]
PATH_TO_REFERENCE (required): Full path to the directory that contains the SNP information file and LD reference panels. If the 1000 Genomes reference is used, the folder would contain the SNP information file
snpinfo_mult_1kg_hm3and one or more of the LD reference files:
ldblk_1kg_sas; if the UK Biobank reference is used, the folder would contain the SNP information file
snpinfo_mult_ukbb_hm3and one or more of the LD reference files:
VALIDATION_BIM_PREFIX (required): Full path and the prefix of the bim file for the target (validation/testing) dataset. This file is used to provide a list of SNPs that are available in the target dataset.
SUM_STATS_FILE (required): Full path and the file name of the GWAS summary statistics. Multiple GWAS summary statistics files are allowed and should be separated by comma. The summary statistics file must include either BETA/OR + SE or BETA/OR + P. When using BETA/OR + SE as the input, the file must have the following format (including the header line):
SNP A1 A2 BETA SE rs4970383 C A -0.0064 0.0090 rs4475691 C T -0.0145 0.0094 rs13302982 A G -0.0232 0.0199 ...
SNP A1 A2 OR SE rs4970383 A C 0.9825 0.0314 rs4475691 T C 0.9436 0.0319 rs13302982 A G 1.1337 0.0543 ...
where SNP is the rs ID, A1 is the effect allele, A2 is the alternative allele, BETA/OR is the effect/odds ratio of the A1 allele, SE is the standard error of the effect. Note that when OR is used, SE corresponds to the standard error of logOR.
When using BETA/OR + P as the input, the file must have the following format (including the header line):
SNP A1 A2 BETA P rs4970383 C A -0.0064 0.4778 rs4475691 C T -0.0145 0.1245 rs13302982 A G -0.0232 0.2429 ...
SNP A1 A2 OR P rs4970383 A C 0.9825 0.5737 rs4475691 T C 0.9436 0.0691 rs13302982 A G 1.1337 0.0209 ...
where SNP is the rs ID, A1 is the effect allele, A2 is the alternative allele, BETA/OR is the effect/odds ratio of the A1 allele, P is the p-value of the effect. Here, a standardized effect size is calculated using the p-value while BETA/OR is only used to determine the direction of an association. Therefore if z-scores or even +1/-1 indicating effect directions are presented in the BETA column, the algorithm should still work properly.
GWAS_SAMPLE_SIZE (required): Sample sizes of the GWAS, in the same order of the GWAS summary statistics files, separated by comma.
POPULATION (required): Population of the GWAS sample, in the same order of the GWAS summary statistics files, separated by comma. For both the 1000 Genomes reference and the UK Biobank reference, AFR, AMR, EAS, EUR and SAS are allowed.
OUTPUT_DIR (required): Output directory of the posterior effect size estimates.
OUTPUT_FILE_PREFIX (required): Output filename prefix of the posterior effect size estimates.
PARAM_A (optional): Parameter a in the gamma-gamma prior. Default is 1.
PARAM_B (optional): Parameter b in the gamma-gamma prior. Default is 0.5.
PARAM_PHI (optional): Global shrinkage parameter phi. If phi is not specified, it will be learnt from the data using a fully Bayesian approach. This usually works well for polygenic traits with very large GWAS sample sizes (hundreds of thousands of subjects). For GWAS with limited sample sizes (including most of the current disease GWAS), fixing phi to 1e-2 (for highly polygenic traits) or 1e-4 (for less polygenic traits), or doing a small-scale grid search (e.g., phi=1e-6, 1e-4, 1e-2, 1) to find the optimal phi value in the validation dataset often improves perdictive performance.
MCMC_ITERATIONS (optional): Total number of MCMC iterations. Default is 1,000 * the number of discovery populations.
MCMC_BURNIN (optional): Number of burnin iterations. Default is 500 * the number of discovery populations. Both
--n_burninneed to be specified to overwrite their default values.
MCMC_THINNING_FACTOR (optional): Thinning factor of the Markov chain. Default is 5.
CHROM (optional): The chromosome on which the model is fitted, separated by comma, e.g., --chrom=1,3,5. Parallel computation for the 22 autosomes is recommended. Default is iterating through 22 autosomes (can be time-consuming).
META_FLAG (optional): If True, return combined SNP effect sizes across populations using an inverse-variance-weighted meta-analysis of the population-specific posterior effect size estimates. Default is False.
SEED (optional): Non-negative integer which seeds the random number generator.
For each input GWAS, PRS-CSx writes posterior SNP effect size estimates for each chromosome to the user-specified directory. The output file contains chromosome, rs ID, base position, A1, A2 and posterior effect size estimate for each SNP. If
--meta=True, meta-analyzed posterior effect sizes will also be written to the output directory. An individual-level polygenic score can be produced by concatenating output files from all chromosomes and then using
--score command (https://www.cog-genomics.org/plink/1.9/score). If polygenic scores are generated by chromosome, use the 'sum' modifier so that they can be combined into a genome-wide score.
We recommend calculating one polygenic score for each discovery population using population-specific posterior SNP effect size estimates and learn a linear combination of the polygenic scores that most accurately predicts the trait in the validation dataset. The predictive performance of the method can be assessed in an independent dataset, using the optimal global shrinkage parameter phi (if grid search is used) and weights for the linear combination learnt in the validation dataset. Separate evaluation of each population-specific polygenic score is usually NOT the intended use of PRS-CSx. We recommend standardizing the polygenic scores (i.e., converting the scores to zero mean and unit variance) in both validation and testing datasets before learning/applying the linear combination.
PRS-CSx relies on scipy packages, which automatically use all available cores on a compute node. This can be problematic when running PRS-CSx on a compute cluster; PRS-CSx jobs may interfere with other jobs running on the same node, reducing computational efficiency. To resolve this issue, including the following code in the script to specify the number of threads in scipy:
export MKL_NUM_THREADS=$N_THREADS export NUMEXPR_NUM_THREADS=$N_THREADS export OMP_NUM_THREADS=$N_THREADS
For example, to use a single thread for the computation, set
The test data contains EUR and EAS GWAS summary statistics and a bim file for 1,000 SNPs on chromosome 22. An example to use the test data:
python PRScsx.py --ref_dir=path_to_ref --bim_prefix=path_to_bim/test --sst_file=path_to_sumstats/EUR_sumstats.txt,path_to_sumstats/EAS_sumstats.txt --n_gwas=200000,100000 --pop=EUR,EAS --chrom=22 --phi=1e-2 --out_dir=path_to_output --out_name=test
The test data analysis would be finished in approximately 1 min when using 8Gb of RAM.
Please direct any problems or questions to Tian Ge (email@example.com).