/
DQ.R
397 lines (363 loc) · 15.8 KB
/
DQ.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
#' @title Improved q-values for discrete uniform and homogeneous tests
#' @aliases DQ
#' @description Performs the five versions of the q-value method considered in Cousido-Rocha et al. (2019). The q-value method is based on the false discovery rate (FDR), and the versions differ in the estimator of the proportion of true null hypotheses, \eqn{\pi_0}{π_0}, which is plugged in the FDR estimator. Specifically, we consider as possible estimators for \eqn{\pi_0}{π_0}: two usual estimators for continuous and possibly heterogeneous P-values; an estimator for discrete P-values defined in two steps: firstly the P-values are randomized, and then the usual \eqn{\pi_0}{π_0} estimator for continuous P-values is applied; and the estimators recently proposed for discrete P-values by Liang (2016) and Chen et al. (2014).
#'
#'
#' @param pv A vector of P-values.
#' @param ss Support of the discrete distribution of the P-values. Only required for “Liang”, “Chen” and “Rand” methods which are specifically proposed for discrete P-values. If the P-values are continuous the methods “ST” and “SS” do not need this argument, hence “ss=NULL” by default.
#' @param ss_inf Logical. Default is FALSE. A variable indicating whether the support of the discrete distribution of the P-values is finite or infinite. See details.
#' @param method The q-value method. By default the “Chen” method is computed.
#' @param lambda The value of the tuning parameter to estimate \eqn{\pi_0}{π_0} when the method is “ST”. See details.
#'
#' @details The function implements the five different versions of the q-value method in Cousido-Rocha et al. (2019).
#' Three versions are novel adaptions for the case of homogeneous discrete uniform P-values, whereas the other two
#' are classical versions of the q-value method for P-values which follow a continuous uniform distribution under the global null hypothesis.
#' The classical versions are the q-value method based on the \eqn{\pi_0}{π_0} estimator proposed in Storey (2002) with tunning parameter \eqn{\lambda}{lambda} = 0.5,
#' and the q-value procedure which uses the \eqn{\pi_0}{π_0} estimator in Storey and Tibshirani (2003) who proposed an automatic method to estimate \eqn{\pi_0}{π_0}.
#' We refer to these methods as “SS” and “ST”, respectively. On the other hand the three adaptations of the q-value method
#' for homogeneous discrete uniform P-values are: “Liang” which considers the \eqn{\pi_0}{π_0} estimator proposed in
#' Liang (2016); “Chen” which uses a simplification for homogeneous discrete P-values of the algorithm for
#' the estimation of \eqn{\pi_0}{π_0} proposed in Chen et al. (2014); and “Rand” which employs the standard q-value procedure but applied to
#' randomized P-values. For details of the different q-value versions, in
#' particular for the novel adaptations for homogeneous discrete uniform P-values, see Cousido-Rocha et al. (2019).
#'
#' As we mentioned above the novel adaptations of the q-value method are developed for homogeneous discrete uniform P-values.
#' Specifically, suppose that we test a large number of null hypothesis, p, and that the P-values \eqn{\{pv_1, \dots, pv_p\}}{{pv_1, ... , pv_p}}
#' are observations of the random variables \eqn{PV_i, i = 1, \dots , p.}{PV_i, i = 1, ... , p.}
#' Homogeneous means that all the P-values share an identical support \eqn{S}{S} with \eqn{0<t_1 < \dots <t_s<t_{s+1} \equiv 1}{0 < t_1 < ··· < t_s < t_{s+1} ≡ 1}.
#' On the other hand, making an abuse of language, we say that the P-values follow a discrete uniform distribution if it holds
#' \eqn{Pr(PV_i \leq t) = t}{Pr(PV_i ≤ t) = t} for \eqn{t \in S, i = 1, \dots, p}{t in S, i = 1, ..., p}.
#' The classical discrete uniform distribution is a particular case.
#'
#' The argument \code{“lambda”} must be a sequence of values in \eqn{[0, 1)}{[0, 1)}, for details of this parameter see Storey (2002) or Storey and Tibshirani (2003). The latter paper recommends the default value “lambda=seq(0.05,0.95,0.05)”.
#'
#' The support of the discrete distribution of the P-values can be finite or infinte. Hence the parameter “ss inf” must be “FALSE” if the support is finite and “TRUE” if the support is infinite. See examples where a poisson setting is considered.
#'
#' Finally, it is relevant to mention that Cousido-Rocha et al. (2019) verified (via simulations) that the results of the different q-values methods for dependent P-values are very similar to the ones corresponding to the independent setting.
#'
#'
#' @return A list containing the following components:
#' \item{pi0}{An estimate of the proportion of null P-values.}
#' \item{q.values}{A vector of the estimated q-values (the main quantities of interest).}
#'
#'
#'
#' @author
#' \itemize{
#' \item{Marta Cousido-Rocha}
#' \item{José Carlos Soage González}
#' \item{Jacobo de Uña-Álvarez}
#' \item{Sebastian Döhler}
#' }
#'
#'
#' @references
#'
#' \itemize{
#' \item{Chen, X., R. W. Doerge, and J. F. Heyse (2014). Methodology Multiple testing with discrete data: proportion of true null hypotheses and two adaptive FDR procedures. arXiv:1410.4274v2.}
#'
#' \item{Cousido-Rocha, M., J. de Uña-Álvarez, and S. Döhler (2019). Multiple testing methods for homogeneous discrete uniform P-values. Preprint.}
#'
#' \item{Gibbons, J. D. and S. Chakraborti (1992). Nonparametric Statistical Inference. Third Edition. Marcel Dekker, Inc, New York.}
#'
#' \item{Liang, K. (2016). False discovery rate estimation for large-scale homogeneous discrete p-values. Biometrics 72, 639-648.}
#'
#' \item{Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (3), 479-498.}
#'
#' \item{Storey, J. and R. Tibshirani (2003). Statistical significance for genomewide studies. Proceedings of National Academy of Science 100, 9440-9445.}
#' }
#'
#'
#'
#' @examples
#'
#' # We consider a simple simulated data set to illustrate the use of the DQ function.
#' # We have simulated the following situation.
#' # We have two groups, for example, 5 patients with tumor 1 and 5 patients
#' # with tumor 2. For each patient 100 variables are measured, for example,
#' # gene expression levels. Besides, the distributions of 30 of the variables
#' # are different in the two groups, and the differences are in location.
#'
#'
#' # We consider a collection of densities {f1=N(0,1), f2=N(0,1/4), f3=N(2,1), f4=N(2,1/4)}. In
#' # the first group (tumor 1) the sample of each variable (gene) comes from one of
#' # the four densities with the same probability 1/4. On the other hand, in the second
#' # group the sample of each variable comes from the same density as in the first
#' # group except for 30 randomly selected variables for which the density changes
#' # producing location differences. Specifically, if the variable follows
#' # f1 in the first group, its density, in the second group, is f3 producing a
#' # change on its location parameter.The situation for the other cases is as follows:
#' # the density f2 (f3 or f4) in group 1 leads to density f4 (f1 or f2, respectively)
#' # in the second one.
#'
#' set.seed(123)
#' p <- 100
#' n = m = 5
#' inds <- sample(1:4, p, replace = TRUE)
#' X <- matrix(rep(0, n * p), ncol = n)
#'
#' for (j in 1:p){
#' if (inds[j] == 1){
#' X[j, ] <- rnorm(n)}
#' if (inds[j] == 2){
#' X[j, ] <- rnorm(n, sd = sqrt(1/4))
#' }
#' if (inds[j] == 3){
#' X[j, ]<-rnorm(n, mean = 2)
#' }
#' if (inds[j] == 4){
#' X[j, ]<-rnorm(n, mean = 2, sd = sqrt(1/4))
#' }
#' }
#'
#' rho <- 0.3
#'
#' ind <- sample(1:p, rho * p)
#' li <- length(ind)
#' indsy <- inds
#'
#' for (l in 1:li){
#' if (indsy[ind[l]] == 1){indsy[ind[l]] = 3} else{
#' if (indsy[ind[l]] == 2){indsy[ind[l]] = 4} else {
#' if (indsy[ind[l]] == 3){indsy[ind[l]] = 1}
#' else{indsy[ind[l]] = 2}}}}
#'
#' Y <- matrix(rep(0, m * p), ncol = m)
#'
#' for (j in 1:p){
#' if (indsy[j] == 1){
#' Y[j, ] <- rnorm(m)}
#' if (indsy[j] == 2){
#' Y[j, ] <- rnorm(m, sd = sqrt(1/4))
#' }
#' if (indsy[j] == 3){
#' Y[j, ] <- rnorm(m, mean = 2)
#' }
#' if (indsy[j] == 4){
#' Y[j, ]<- rnorm(m, mean = 2, sd = sqrt(1/4))
#' }
#' }
#'
#'
#' # We can see which are the variables with different distributions in the two data sets.
#'
#' dif <- which(inds != indsy)
#'
#' # Cross table for (X,Y) density indexes:
#'
#' table(inds,indsy)
#'
#' # Our interest is to identify which variables have a different distribution in the two groups.
#' # Hence, since the differences between the distributions are in location, we applied
#' # Wilcoxon-Mann-Whitney test to verify for each variable the equality of distribution
#' # in the two groups.
#'
#' \donttest{
#'
#' library(exactRankTests)
#' library(coin)
#'
#' # We compute the P-values
#'
#' p <- nrow(X)
#' pv <- 1:p
#' for (i in 1:p){
#' pv[i] <- wilcox.exact(X[i, ], Y[i, ])$p.value
#' }
#'
#' # When the sample size is small, in this case n=m=5, the distribution of
#' # the Wilcoxon's statistic is calibrated using an exact permutation test. Hence,
#' # the P-values are homogeneous discrete uniform distributed with support points
#' # of such distribution:
#'
#' ss <- c(1, 2, 4, 7, 12, 19, 28, 39, 53,69, 87, 106, 126) / 126
#'
#' # When the number of P-values is large enough "ss" is equal to:
#' sort(unique(pv))
#'
#' # For details about the Wilcoxon-Mann-Whitney test and its exact distribution
#' # see Section 9.2 of Gibbons and Chakraborti (1992).
#'
#' # We apply Chen method:
#'
#' R <- DQ(pv, ss = ss, method = "Chen")
#'
#' # The estimate of the proportion of null P-values:
#'
#' R$pi0
#'
#' # Summary of the vector of the estimated q-values:
#'
#' summary(R$q.values)
#'
#' # How many null hypotheses are rejected?
#' alpha <- 0.05
#' sum(R$q.values < alpha)
#'
#' # Which variables correspond to such null hypotheses?
#'
#' which(R$q.values < alpha)
#'
#' # Classification table (Decision at nominal level alpha vs. reality):
#'
#' table(R$q.values < alpha,inds != indsy)
#'
#' # The conclusion from the previous table is that Chen method reports
#' # 21 true positives and 9 false negatives.
#'
#'
#' # We can also apply Liang and SS methods as follows.
#' RLiang <- DQ(pv, ss = ss, method = "Liang")
#' RSS <- DQ(pv, ss = ss, method = "SS")
#'
#' # The next graphic help us to see that Liang method (for discrete P-values)
#' # is more powerful than SS method (only suitable for continuous P-values).
#'
#' plot(RSS$q.values,RLiang$q.values)
#' abline(a = 0, b = 1, col = 2, lty = 2)
#'
#' # We consider a poisson setting to show a case where the support of the discrete
#' # distribution of the P-values is infinte.
#' # We generate 100 values of a poisson distribution with event rate 10.
#' # Then we compute the probability that each of the values come from a
#' # a poisson distribution with event rate 10. This set of probabilities
#' # are considered as our set of P-values.
#'
#' p <- 100
#' N <- rpois(p, lambda = 10)
#' pv <- 1:p
#' for(i in 1:p){
#' pv[i] <- ppois(N[i], lambda = 10)
#' }
#'
#' # It is well know that the support of a poisson distribution is infinite
#' # and equal to the natural numbers. Hence to know the support of the P-values
#' # defined above, we compute for 1,..., 50 their corresponding P-value.
#' # We only considered 50 values because for large values than 50 the P-value is 1 again.
#'
#' nn_0 <- 50
#' ss <- 1:(nn_0 + 1)
#' for (i in 1:(nn_0 + 1)){
#' ss[i] <- ppois(i - 1, lambda = 10)
#' }
#'
#' # We eliminate repeated values.
#' ss <- unique(ss)
#' # For Chen method the relevant support points are only the values below tau[100] = 0.5.
#' # We define the support ss as such values. Then, we can apply Chen method. Of
#' # course s_inf = TRUE.
#'
#' indicator <- which(ss <= 0.5)
#' ssi <- ss[indicator]
#' R <- DQ(pv, ss = ssi, ss_inf = "TRUE", method = "Chen")
#' # For Liang method the relevant support values are also the values below 0.5, hence
#' # ss defined above is suitable.
#' R <- DQ(pv, ss = ssi, ss_inf = "TRUE", method = "Liang")
#'
#' # For Rand method the relevant support values are those ones with match with
#' # the P-values, and also the largest support points smaller than each one of such P-values.
#' # Hence, ss only includes the relevant points, as we can see below.
#' pv_unique <- unique(pv)
#' p_u <- length(pv_unique)
#' ind <- 1:p_u
#' for(i in 1:p_u){
#' ind[i] <- which(ss == pv_unique[i])
#' }
#' p_u == length(ind)
#' ind_minus <- ind - 1
#' ind_final <- unique(sort(c(ind, ind_minus)))
#' ss <- ss[ind_final]
#' # Now, we can apply Rand method.
#' R <- DQ(pv, ss = ss, ss_inf = "TRUE", method = "Rand")
#'
#'}
#'
#'
#' @importFrom stats predict quantile runif smooth.spline density dnorm qnorm
#' @export
DQ <- function(pv, ss = NULL, ss_inf = FALSE, method = c("ST", "SS", "Liang", "Chen", "Rand"), lambda = seq(0.05, 0.95, 0.05)){
# cat("Call:", "\n")
# print(match.call())
match.arg(method)
if (missing(method)) {
method <- "ST"
cat("'ST' method used by default\n")
}
if(method == "ST" || method == "SS"){
ss <- NULL
}
if(ss_inf == FALSE) {
# Liang method
if(method == "Liang"){
p.grid <- ss
NN <- length(ss)
p.count.mat <- get.tbl.full(pv, NN, ss)
pi0_Liang <- as.numeric(est.pi0.disc(p.count.mat, p.grid)$pi0)
pi0_Liang <- max(pi0_Liang, 0)
pi0_Liang <- min(pi0_Liang, 1)
q_values_Liang <- q.value_pi0(pv, pi0_Liang)
return(list(pi0 = pi0_Liang, q.values = q_values_Liang))
}
# Chen method
if(method == "Chen"){
pi0_Chen <- pi0_G(pv, ss)
pi0_Chen <- max(pi0_Chen, 0)
pi0_Chen <- min(pi0_Chen, 1)
q_values_chen <- q.value_pi0(pv, pi0_Chen)
return(list(pi0 = pi0_Chen, q.values = q_values_chen))
}
# Randomized method
if(method == "Rand"){
pi0_Rd <- Rp(pv, ss)
pi0_Rd <- max(pi0_Rd, 0)
pi0_Rd <- min(pi0_Rd, 1)
q_values_Rd <- q.value_pi0(pv, pi0_Rd)
return(list(pi0 = pi0_Rd, q.values = q_values_Rd))
}
### Q-value method for continuous p-values
if(method == "ST"){
l_lambda <- length(lambda)
pi0<-1:l_lambda
for(i in 1:l_lambda){
pi0[i] <- sum(pv >= lambda[i]) / ((1 - lambda[i]) * length(pv))
}
spi0 <- smooth.spline(lambda, pi0, df = 3)
pi0Smooth <- predict(spi0, x = lambda)$y
pi0_ST <- min(pi0Smooth[l_lambda], 1)
pi0_ST <- max(pi0_ST, 0)
q.values_ST <- q.value_pi0(pv, pi0_ST)
return(list(pi0 = pi0_ST, q.values = q.values_ST))
}
if(method == "SS"){
lambda <- 0.5
pi0_SS <- sum(pv >= lambda) / ((1 - lambda) * length(pv))
pi0_SS <- min(pi0_SS, 1)
pi0_SS <- max(pi0_SS, 0)
q.values_SS <- q.value_pi0(pv, pi0_SS)
return(list(pi0 = pi0_SS, q.values = q.values_SS))
}
} else {
##### Liang method
if(method == "Liang"){
p.grid<-ss
NN <- length(ss)
p.count.mat <- get.tbl.full(pv, NN, ss)
pi0_Liang <- est.pi0.disc(p.count.mat, p.grid)$pi0 # DEVOLVE A FUNCIÓN pi0 método Liang
q_values_Liang <- q.value_pi0(pv, pi0_Liang) # DEVOLVE A FUNCIÓN q.values método Liang
return(list(pi0 = pi0_Liang, q.values = q_values_Liang))
}
### Chen method (the same as finite support)
if(method == "Chen"){
pv <- unlist(pv)
pi0_Chen <- pi0_G(pv,ss) #DEVOLVE A FUNCIÓN pi0 método Chen
q_values_chen <- q.value_pi0(pv, pi0_Chen) #DEVOLVE A FUNCIÓN q.values método chen
return(list(pi0 = pi0_Chen, q.values = q_values_chen))
}
### Randomized method
if(method == "Rand"){
pi0_Rd <- Rp(pv, ss) #DEVOLVE A FUNCIÓN pi0 método Rand
q_values_Rd <- q.value_pi0(pv, pi0_Rd) #DEVOLVE A FUNCIÓN q.values método Rand
return(list(pi0 = pi0_Rd, q.values = q_values_Rd))
}
}
}