
Different results with refit = T / refit = F #37

Open
florianhartig opened this issue Oct 21, 2017 · 3 comments

@florianhartig
Owner

A question from a user

I have a question regarding testTemporalAutocorrelation(). Here’s my code:

simulationOutput <- simulateResiduals(fittedModel = fittedmodel, n = 250, refit = T)
simulationOutput2 <- simulateResiduals(fittedModel = fittedmodel, n = 250)
testTemporalAutocorrelation(simulationOutput = simulationOutput, time = data2$date) # DW = 1.7327, p-value = 0.04495
testTemporalAutocorrelation(simulationOutput = simulationOutput2, time = data2$date) # DW = 2.1114, p-value = 0.7601

What does it mean when, in testing for temporal autocorrelation, the p-value is significant if I use the refit simulation but non-significant if I don't use refit?

@florianhartig
Copy link
Owner Author

Hi,

sorry that it took me so long to get back to you. This concerns something that I had not thought about in detail before, and I wanted to first think through the implications of this before giving a recommendation.

The core of the issue is that DHARMa employs a stochastic algorithm, a) to simulate synthetic data, and b) to additionally randomize the residuals to deal with the 0/1 problem for integer data (see the vignette for details).

The 0/1 randomization for integer data means that each individual residual may turn out slightly higher or lower if you re-simulate the residuals, even if you set nSim very high (essentially, setting nSim high removes the variation created by re-simulating the data, but the integer randomization is independent of nSim, so this random component always remains).

So, if you run your test several times, you will get a different p-value each time, and the same happens if you run two tests with different settings (e.g. in this case refit = T/F). Btw., this is one of the reasons why the vignette recommends setting the random seed.
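
To make this concrete, here is a minimal sketch (a made-up Poisson GLM, not the user's model): two independent calls to simulateResiduals on the same fitted model yield slightly different residuals and hence slightly different test results, while setting the seed before the simulation makes a run reproducible.

```r
# Toy example (hypothetical data): the integer randomization makes each
# residual simulation, and hence each test result, slightly different.
library(DHARMa)

set.seed(1)
dat <- data.frame(x = runif(100), time = 1:100)
dat$y <- rpois(100, lambda = exp(0.5 + dat$x))
fit <- glm(y ~ x, family = poisson, data = dat)

res1 <- simulateResiduals(fittedModel = fit, n = 250)
res2 <- simulateResiduals(fittedModel = fit, n = 250)

testTemporalAutocorrelation(res1, time = dat$time)$p.value
testTemporalAutocorrelation(res2, time = dat$time)$p.value  # slightly different

# setting the seed immediately before the simulation makes a run reproducible
set.seed(123); resA <- simulateResiduals(fittedModel = fit, n = 250)
set.seed(123); resB <- simulateResiduals(fittedModel = fit, n = 250)
identical(resA$scaledResiduals, resB$scaledResiduals)  # TRUE
```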

That sounds a bit odd, but it's not really a problem. Remember that, for any hypothesis test, the chance of getting significance although there is no effect is exactly alpha, the significance level. Normally your data is fixed, so you don't notice this, but it is still there: you have a 5% chance of getting p < 0.05 without an effect (lots of people have raised concerns about this lately, but that is another discussion).

I have checked, and the temporal autocorrelation tests in DHARMa have correct Type I error rates. Here is the p-value distribution of the temporal autocorrelation test for 50 experiments without temporal autocorrelation:

[Figure rplot02: histogram of p-values across the 50 experiments without temporal autocorrelation]

With temporal autocorrelation, the p-value distribution moves to the left, reflecting the higher tendency to return positives:

[Figure rplot01: histogram of p-values across the 50 experiments with temporal autocorrelation]
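
For reference, here is a rough sketch of the kind of simulation behind these plots (my own reconstruction with assumed settings, not the exact code used for the figures): each "experiment" simulates a new dataset, refits the model, and applies the temporal autocorrelation test; the ar argument controls whether AR(1) noise is added to the linear predictor.

```r
# Sketch of a Type I error / power check for testTemporalAutocorrelation.
# ar = 0: independent observations (H0 true); ar > 0: temporal autocorrelation
# induced via AR(1) noise on the linear predictor.
library(DHARMa)

oneExperiment <- function(ar = 0, n = 100) {
  x <- runif(n)
  e <- as.numeric(stats::filter(rnorm(n, sd = 0.3), filter = ar,
                                method = "recursive"))  # AR(1); white noise if ar = 0
  y <- rpois(n, lambda = exp(0.5 + x + e))
  fit <- glm(y ~ x, family = poisson)
  res <- simulateResiduals(fittedModel = fit, n = 250)
  # plot = FALSE suppresses the diagnostic plot in recent DHARMa versions
  testTemporalAutocorrelation(res, time = 1:n, plot = FALSE)$p.value
}

set.seed(1)
pNull <- replicate(50, oneExperiment(ar = 0))    # should be roughly uniform
pAC   <- replicate(50, oneExperiment(ar = 0.8))  # should pile up near zero
hist(pNull, breaks = 10)
hist(pAC, breaks = 10)
```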

In the simulations above, the original data were re-simulated as well, reflecting the idea that we run 50 completely new experiments. What happens if we keep the same dataset / analysis and just re-simulate the DHARMa residuals? Here are 3 different datasets, each with the residuals re-simulated 50 times.

[Figure rplot04: p-value distributions for the 3 fixed datasets, residuals re-simulated 50 times each]

What we see is that p-values vary, but not over the entire range (remember, when re-simulating the raw data as well, we get the entire range). The width of the distribution can vary depending on the data / model. The variation is smaller than before because we have removed the variability in the test data, but we pick up the variability created by the randomization of the integer residuals.

Now, from a pure hypothesis-testing viewpoint, i.e. if you just want a p-value, all this is really inconsequential. What we want is a test with an error rate of 5% if H0 is true, and proper power, and I think we have this already. Just don't re-simulate p-values: run one test, and that's that.

However, from a practical viewpoint, where you probably just want an indicator of how far the residuals are, in absolute terms, from what you would expect under H0, it seems to me that it makes perfect sense to re-run the DHARMa test a few times. If the p-values spread all over the place, you can probably say that there is no consistent evidence against the model, and that a low p-value observed in a single simulation was probably bad luck (the 1/20 chance).

-> that is what I would recommend for the moment
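
A sketch of this recommendation, reusing the object names from the code at the top of the thread (fittedmodel, data2$date): keep the data and the fitted model fixed, re-simulate the DHARMa residuals a handful of times, and look at how much the p-values spread.

```r
# Re-run the residual simulation + test several times on the SAME fitted
# model and inspect the spread of the resulting p-values.
library(DHARMa)

pvals <- replicate(20, {
  res <- simulateResiduals(fittedModel = fittedmodel, n = 250)
  testTemporalAutocorrelation(res, time = data2$date)$p.value
})

summary(pvals)
hist(pvals, breaks = 10)
# If the p-values scatter widely, with many clearly above 0.05, a single low
# value was probably the 1/20 bad luck; consistently low values are more
# worrying and point to a real problem with the model.
```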

I realize that this ad hoc solution is not very satisfying, and that one would like to somehow aggregate those replicate runs into a new p-value. But I have to think more about how best to do this in a statistically rigorous way. At the moment, we have a test with correct Type I error rates, and it seems relatively easy to screw this up with naive summaries, such as taking the mean of the p-values of several runs. I'll create a new ticket for this and give it some time to sink in.
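
As a minimal illustration of why a naive summary such as the mean breaks calibration (a toy calculation of my own, which even ignores that the replicate DHARMa p-values are not independent because they share the same data): under H0 a single p-value is Uniform(0,1), but the mean of several uniform p-values concentrates around 0.5 and essentially never falls below 0.05.

```r
# Under H0, a single p-value is uniform, so P(p < 0.05) = 0.05.
# The mean of several independent uniform p-values almost never falls below
# 0.05, so a "test" based on the mean no longer has the nominal Type I error.
set.seed(42)
mean(runif(1e5) < 0.05)                  # ~ 0.05, as it should be
meanOfFive <- replicate(1e5, mean(runif(5)))
mean(meanOfFive < 0.05)                  # ~ 0 (theoretically about 8e-6)
```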

@florianhartig
Owner Author

An afterthought / additional explanation, since I'm not sure this became sufficiently clear in the text above:

Even if I manage to do what's promised in #38, i.e. stabilize the value so that repeated simulations produce the same p-value, it is entirely open whether this adds anything in terms of power, i.e. whether it increases your probability of making a correct decision about whether your model corresponds to the data-generating model.

This is because a big part of the variability / Type I error is in the particular realization of the data. You may just be unlucky and have "drawn" a dataset that is unusual for the particular generating mechanism. If you are in the p < 0.05 part, i.e. your dataset is very unusual, the only thing that stabilizing the p-value would do is that you ALWAYS make the wrong decision and reject the model, whereas in the current version of DHARMa you would OFTEN make the wrong decision and reject the model.

Still, I guess that corresponds more to the behavior a user is expecting from a normal test, so as said, I will be thinking about what can be done about this.

@florianhartig
Owner Author

PPS: I realize these doubts somewhat contradict my recommendation to run the test a few times.

I guess the truth is that I really have to test whether there is an advantage to running multiple residual simulations. I think there is a pretty solid "psychological advantage", in the sense that one feels there should be a stable value for how well the residuals fit the model, and the procedure above delivers that. But, as said above, I have to see whether this really increases the power to detect a problem (and whatever you do, you will have a 5% chance of concluding the model is wrong when it isn't).

I think it could be theoretically possible that multiple simulations increase power, as this in some way averages over the stochasticity introduced by the integer randomization.
