Guo-Bo Chen edited this page Feb 12, 2020 · 4 revisions

PPSR


Pseudo profile score regression for meta-analysis

A working example can be found at the bottom of the page.


Pseudo profile score regression (PPSR) uses pseudo profile scores to detect individuals who overlap between the cohorts that contribute summary statistics to an analysis such as a meta-analysis. It is designed for settings where only summary-level data are shared and access to genotype data is not possible.

The algorithm has three steps.

In Step 1: a meta-analysis analyst determines the number of profile scores and generates a score file for each cohort.

In Step 2: a GWAS analyst generates pseudo profile scores, which will be sent back to the meta-analysis analyst.

In Step 3: the meta-analyst will run PPSR to pinpoint overlapping samples.


Step 1: determine the number of profile scores

The first step is to determine the number of pseudo profile scores (PPS) required to detect overlapping individuals between cohorts without access to their genotypes. Assuming there are n1 and n2 individuals in cohort 1 and cohort 2, respectively, n=n1*n2 comparisons are required to detect overlapping individuals between the two cohorts.

With the linear regression method, this can be done as follows:

gear mwpower --reg 0.95 --alpha 0.01 --beta 0.05 --test n --out mw

This step calculates the number of pseudo profile scores needed to control the experiment-wide type I error rate at 0.01 and the type II error rate at 0.05 (power = 1 - type II error rate), given a regression coefficient cutoff of 0.95.

Setting --reg to 0.45 (or 0.4) aims to detect first-degree relatives, and setting it to 0.95 aims to detect duplicated individuals across cohorts.

For example, if cohort 1 and cohort 2 each have 1000 individuals, then n=1,000,000 and the required number of scores is K=41.57. We round this up and use K=42 in the steps below.

The required number of PPS is saved in mw.encode, which is needed in the next two steps.

In addition, this step gives a suggested number of SNPs for generating the PPS. In our experience, the best number of SNPs is 5 to 10 times K.
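The planning arithmetic in this step can be sketched in a few lines of Python. This is only an illustration: the function name `plan_ppsr` and its arguments are hypothetical, and K itself is taken from the `gear mwpower` output rather than recomputed here.

```python
import math

def plan_ppsr(n1, n2, k_estimate, snp_factor_range=(5, 10)):
    """Summarise the Step 1 planning numbers.

    k_estimate is the (possibly fractional) K reported by `gear mwpower`;
    this helper does not recompute it.
    """
    n_tests = n1 * n2              # value to pass to --test
    k = math.ceil(k_estimate)      # round K up to a whole number of scores
    low, high = snp_factor_range   # rule of thumb: 5 to 10 times K
    return {"tests": n_tests, "K": k, "snps_min": low * k, "snps_max": high * k}

print(plan_ppsr(1000, 1000, 41.57))
# {'tests': 1000000, 'K': 42, 'snps_min': 210, 'snps_max': 420}
```

For two cohorts of 1000 individuals, this suggests roughly 210 to 420 SNPs for K=42.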

Step 2: generate PPS given consensus SNPs

At this stage, use the mw.encode file generated in Step 1 together with a reference allele file provided by the meta-analyst.

gear mwscore --bfile set1 --encode mw.encode --refallele refA.txt --out set1
gear mwscore --bfile set2 --encode mw.encode --refallele refA.txt --out set2

The reference allele file reads as below

rs1001 A 0.4
rs2003 G 0.35
...

The first column is the SNP name, the second column is the reference allele, and the third column is the reference allele frequency. The reference allele frequency can be calculated from a reference cohort, such as HapMap. If the third column is absent, the allele frequency will be calculated from each GWAS cohort.
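Before sending a reference allele file to the cohorts, it can help to check that it parses as expected. The sketch below is not part of GEAR: the function name `parse_ref_alleles` is hypothetical, and the third example record (rs3005, with no frequency) is made up to show the optional column.

```python
def parse_ref_alleles(lines):
    """Parse records of the form: SNP name, reference allele, optional frequency."""
    records = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        snp, allele = fields[0], fields[1].upper()
        # The frequency column is optional; None means "estimate it in each cohort".
        freq = float(fields[2]) if len(fields) > 2 else None
        if freq is not None and not 0.0 <= freq <= 1.0:
            raise ValueError(f"{snp}: frequency {freq} is outside [0, 1]")
        records[snp] = (allele, freq)
    return records

example = ["rs1001 A 0.4", "rs2003 G 0.35", "rs3005 C"]  # rs3005 is hypothetical
print(parse_ref_alleles(example))
# {'rs1001': ('A', 0.4), 'rs2003': ('G', 0.35), 'rs3005': ('C', None)}
```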

After this step, set1.profile and set2.profile will be generated.

Notes:

  1. It is important that the cohorts being compared use the same encode file, which is provided by the meta-analysis analyst. Here it is mw.encode, as generated in Step 1.
  2. It is better to eliminate ambiguous loci, which have A/T or G/C allele pairs.
  3. GEAR will, however, automatically take care of strand issues. For example, if a locus is coded as A/G in set 1 but T/C in set 2, GEAR will detect this and flip the reference allele.
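The logic behind notes 2 and 3 can be illustrated with a small sketch. This is not GEAR's implementation; the helper names (`is_ambiguous`, `flip`, `harmonise`) are hypothetical and only show why A/T and G/C loci cannot be resolved by strand flipping while A/G versus T/C can.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_ambiguous(a1, a2):
    """True for A/T and G/C loci, which read the same on either strand."""
    return COMPLEMENT[a1] == a2

def flip(a1, a2):
    """Return the alleles as they would be read on the opposite strand."""
    return COMPLEMENT[a1], COMPLEMENT[a2]

def harmonise(alleles1, alleles2):
    """Classify a locus seen in two cohorts as 'same', 'flipped', or 'mismatch'."""
    if set(alleles1) == set(alleles2):
        return "same"
    if set(flip(*alleles1)) == set(alleles2):
        return "flipped"
    return "mismatch"

print(is_ambiguous("A", "T"))               # True
print(is_ambiguous("A", "G"))               # False
print(harmonise(("A", "G"), ("T", "C")))    # flipped
print(harmonise(("A", "G"), ("A", "C")))    # mismatch
```

An ambiguous locus such as A/T always comes back as "same", whichever strand each cohort used, which is why removing such loci beforehand is the safer choice.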

Step 3: detect overlapping individuals

gear mw --set1 set1.profile --set2 set2.profile --encode mw.encode --out overlap

The parameters specified in mw.encode will be used to detect overlapping individuals. If any are identified, they will be written to overlap.mw.
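To give an intuition for the regression test, the toy simulation below regresses one individual's K scores on another's. It is a sketch under stated assumptions, not GEAR's code: the standard-normal weights, the genotype standardisation, the simulated allele frequencies, and all function names are choices made here for illustration only.

```python
import math
import random

def standardise(genotypes, freqs):
    """Centre and scale 0/1/2 genotypes by allele frequency (an assumption)."""
    return [(g - 2 * p) / math.sqrt(2 * p * (1 - p)) for g, p in zip(genotypes, freqs)]

def pseudo_profile_scores(z, weights):
    """One score per weight vector: the weighted sum of standardised genotypes."""
    return [sum(w * v for w, v in zip(row, z)) for row in weights]

def regress(y, x):
    """Ordinary least squares of y on x; returns (slope, standard error)."""
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return slope, math.sqrt(rss / (k - 2) / sxx)

def draw_genotype(p, rng):
    """Sample a 0/1/2 genotype as two independent allele draws."""
    return (rng.random() < p) + (rng.random() < p)

rng = random.Random(1)
n_snps, n_scores = 300, 32  # toy sizes, roughly 10 SNPs per score
freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
weights = [[rng.gauss(0, 1) for _ in range(n_snps)] for _ in range(n_scores)]

person_a = [draw_genotype(p, rng) for p in freqs]
person_b = [draw_genotype(p, rng) for p in freqs]  # unrelated to person_a

scores_a = pseudo_profile_scores(standardise(person_a, freqs), weights)
scores_dup = pseudo_profile_scores(standardise(list(person_a), freqs), weights)
scores_b = pseudo_profile_scores(standardise(person_b, freqs), weights)

slope_dup, se_dup = regress(scores_dup, scores_a)
slope_unrel, se_unrel = regress(scores_b, scores_a)
print("duplicate:", slope_dup, se_dup)
print("unrelated:", slope_unrel, se_unrel)
```

In this toy setting the duplicated individual gives a slope of 1 with no residual variance, while the unrelated pair gives a slope scattered around 0, so a cutoff such as 0.95 separates the two cases.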

In addition, the user can reset the parameters on the command line. The command below uses 0.9 as the cutoff for the regression test, rather than the 0.95 set in the first step.

gear mw --set1 set1.profile --set2 set2.profile --encode mw.encode --out overlap --reg 0.9

NOTE: when the --verbose option is specified, all pairwise regression coefficients will be printed, regardless of their values.

gear mw --set1 set1.profile --set2 set2.profile --encode mw.encode --out overlap --verbose

Working example (Download data&script)

The example has two cohorts (set1.* and set2.*), each with 110 individuals and 10,000 SNP markers. Ten individuals are identical between the two cohorts.

The script (ppsr_demo.txt) reads as below. Because 110*110=12100, "--test" is set to 12100.

gear mwpower --reg 0.95 --alpha 0.01 --beta 0.05 --test 12100 --out mw
gear mwscore --bfile set1 --encode mw.encode --refallele refFreq.txt --out set1
gear mwscore --bfile set2 --encode mw.encode --refallele refFreq.txt --out set2
gear mw --set1 set1.profile --set2 set2.profile --encode mw.encode --out overlap

The final result is saved in overlap.mw, which includes the 10 pairs of possible overlapping samples. The columns of overlap.mw are annotated below. Note that because the regression coefficient is 1, there is no residual and the S.E. is zero.

Family ID1 (set1)  Individual ID1 (set1)  Family ID2 (set2)  Individual ID2 (set2)  Regression Coefficient  S.E. for RegCo  K
Fsample_0_3        1                      sample_0_3         1                      1.0                     0.0             32
Fsample_1_3        1                      sample_1_3         1                      1.0                     0.0             32
Fsample_2_3        1                      sample_2_3         1                      1.0                     0.0             32
Fsample_3_3        1                      sample_3_3         1                      1.0                     0.0             32
Fsample_4_3        1                      sample_4_3         1                      1.0                     0.0             32
Fsample_5_3        1                      sample_5_3         1                      1.0                     0.0             32
Fsample_6_3        1                      sample_6_3         1                      1.0                     0.0             32
Fsample_7_3        1                      sample_7_3         1                      1.0                     0.0             32
Fsample_8_3        1                      sample_8_3         1                      1.0                     0.0             32
Fsample_9_3        1                      sample_9_3         1                      1.0                     0.0             32

Alternatively, if you are interested in seeing all comparisons, the last command line can be modified by adding --verbose:

gear mwpower --reg 0.95 --alpha 0.01 --beta 0.05 --test 12100 --out mw
gear mwscore --bfile set1 --encode mw.encode --refallele refFreq.txt --out set1
gear mwscore --bfile set2 --encode mw.encode --refallele refFreq.txt --out set2
gear mw --set1 set1.profile --set2 set2.profile --encode mw.encode --out overlap --verbose

You will see all 12100 possible comparisons, in the same format as tabulated above.
