Home
One essential question in gene-based analysis (e.g. pathway- and network-based analysis) is how to combine SNP-level P-values into a single representative gene-level P-value. A popular approach is to take the best (smallest) SNP P-value among all SNPs annotated to a gene as the gene-level P-value. This approach is sensitive in capturing strong association signals, but it is biased toward genes annotated with more SNPs. Other approaches that aim to correct for gene-size bias have been proposed, such as ARTP and Fisher's combination method. Nonetheless, most of these approaches assume that SNP-level P-values are independent, which is inappropriate for gene-based analysis because of the well-known linkage disequilibrium (LD) pattern between SNPs. Phenotype permutation is regarded as the gold standard for preserving the LD pattern and is therefore promising for correcting gene-size bias [ref]. Yet, because it is computationally intensive and requires access to raw genotype data, it has rarely been applied to compute gene-level P-values. An alternative permutation strategy that accounts for the LD pattern without requiring genotype data is Circular Genomic Permutation (CGP) (https://www.ncbi.nlm.nih.gov/pubmed/22973544). Briefly, CGP treats the genome as a circle, running from chromosome 1 to chromosome X and then wrapping back to chromosome 1. The SNP-level P-values of a GWAS are ordered on the circle according to the positions of the SNPs. A CGP sample is generated by rotating the P-values by a random number of positions and reassigning them to the SNPs. Because SNP-level statistics are permuted in a circular manner, CGP keeps patterns of correlation in the permuted data similar to those in the original data.
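The rotation step can be illustrated with a minimal Python sketch. It is not taken from the fastCGP code base: the function name `cgp_sample` and the toy P-values are hypothetical, and the list is assumed to be already ordered by genomic position (chromosome 1 through X).

```python
import random

def cgp_sample(pvalues, shift):
    """Return one CGP sample: the genome-ordered P-values rotated by `shift`.

    Hypothetical helper for illustration. After rotation, the SNP at index i
    carries the P-value that was originally at index (i - shift) mod n.
    """
    n = len(pvalues)
    shift %= n
    if shift == 0:
        return list(pvalues)
    # Values falling off the end of the genome wrap around to the start.
    return pvalues[-shift:] + pvalues[:-shift]

# Toy P-values, assumed to be sorted by genomic position.
pvals = [0.50, 0.01, 0.02, 0.70, 0.90, 0.30]

rng = random.Random(0)                  # fixed seed so the run is repeatable
shift = rng.randrange(1, len(pvals))    # random non-zero rotation
permuted = cgp_sample(pvals, shift)
print(shift, permuted)
```

Neighbouring P-values stay neighbours after the rotation (apart from the single wrap-around point), which is why the local correlation structure induced by LD is largely retained.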
Here, we take advantage of the CGP strategy and propose a novel method, fastCGP, to compute gene-level P-values from SNP-level P-values. fastCGP first annotates a SNP to a gene if the SNP is located within the boundary of that gene (gene boundaries can be defined by the user, for example 20 kb upstream and downstream of the gene's coding region). Then, for each gene, the gene-level P-value is taken as the best SNP-level P-value. This P-value is further corrected for gene-size bias using a permutation test, in which the final gene-level P-value is defined as P = (k+1)/(K+1), where K is the total number of CGP samples and k is the number of those samples that are at least as extreme as the observed data. Specifically, instead of generating a fixed number of random CGP samples, fastCGP takes all non-repeating CGP samples into account to obtain the best P-value obtainable in this permutation framework. Under this specification, both k and K can be obtained analytically without generating any CGP sample (details are omitted here).
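To make the definition P = (k+1)/(K+1) concrete, the following sketch enumerates every rotation by brute force on a toy data set. It is not the analytical procedure used by fastCGP, whose details are omitted above. It rests on several assumptions made only for illustration: a gene is a contiguous index range `[start, end)` over genome-ordered SNPs, the non-repeating CGP samples are the L - 1 non-zero rotations, and a sample counts as extreme when its best P-value within the gene is less than or equal to the observed best P-value. The function name is hypothetical.

```python
def gene_pvalue_all_rotations(pvalues, start, end):
    """Brute-force gene-level P-value over all non-zero circular rotations.

    Illustrative only; costs O(L * gene size) per gene, whereas fastCGP
    obtains k and K analytically.
    """
    L = len(pvalues)
    observed = min(pvalues[start:end])   # best SNP-level P-value in the gene
    k = 0
    for shift in range(1, L):
        # After rotating by `shift`, gene position i holds the P-value
        # that was originally at index (i - shift) mod L.
        rotated_best = min(pvalues[(i - shift) % L] for i in range(start, end))
        if rotated_best <= observed:
            k += 1
    K = L - 1                            # assumed number of non-repeating samples
    return (k + 1) / (K + 1)

# Toy P-values, assumed to be sorted by genomic position.
pvals = [0.50, 0.01, 0.02, 0.70, 0.90, 0.30]

small_gene = gene_pvalue_all_rotations(pvals, 1, 2)  # one SNP, best P = 0.01
large_gene = gene_pvalue_all_rotations(pvals, 0, 3)  # three SNPs, best P = 0.01
print(small_gene, large_gene)
```

Both toy genes share the same best SNP-level P-value (0.01), yet the three-SNP gene receives the larger corrected P-value because more rotations place a small P-value somewhere inside it. This is the gene-size correction the permutation test is meant to provide.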
This analytical approach brings several advantages. First, the method is exact and therefore free of the run-to-run randomness of simulation-based methods that rely on random numbers, such as VEGAS, ARTP, and Pascal [refs]. Second, the computation is efficient. In our study of 2,370,689 SNPs and 24,120 genes, it takes around half an hour on a standard PC (Intel Core i7 3.40 GHz CPU, 8 GB RAM), which is much faster than many simulation-based competitors [refs]. Third, the resulting P-values are of high precision. According to the formulation of fastCGP, the precision of each gene-level P-value is 1/L, where L is the total number of SNPs analyzed in a GWAS. The more SNPs are analyzed, the higher the precision. In our case of analyzing 2,370,689 SNPs, the precision reaches 4.2 x 10^-7.
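The precision figure can be checked with a few lines of Python. The sketch assumes that the resolution of a permutation P-value of the form (k+1)/(K+1) is 1/(K+1), and that using all rotations gives K + 1 = L, which is consistent with the 1/L precision stated above. The 10,000-sample comparison is a hypothetical number chosen only for contrast, not a setting reported for any of the tools mentioned.

```python
def smallest_attainable_p(num_samples):
    """Smallest value (k + 1) / (K + 1) can take, reached when k = 0."""
    return 1 / (num_samples + 1)

L = 2_370_689                     # number of SNPs in the study described above

# All non-zero rotations: K = L - 1, so the resolution is 1 / L.
exact_resolution = smallest_attainable_p(L - 1)

# Hypothetical random-sampling run with 10,000 permutations, for contrast.
sampled_resolution = smallest_attainable_p(10_000)

print(f"{exact_resolution:.1e}")
print(f"{sampled_resolution:.1e}")
```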