calculation of p-values in testOutliers() #182

Closed
Istalan opened this issue Jun 2, 2020 · 14 comments

Comments

@Istalan

Istalan commented Jun 2, 2020

First of all, this is a lovely package and it does everything I ever wanted from a residual analysis.

Now, with the well-deserved flattery out of the way, here is my problem.

The Code with the Issue

The calculation of the p-value in testOutliers() in Test.R, lines 176 to 188:

  outLow = sum(simulationOutput$scaledResiduals == 0)
  outHigh = sum(simulationOutput$scaledResiduals == 1)
  trials = simulationOutput$nObs
  outFreqH0 = 1/(simulationOutput$nSim +1)

  out$testlow = binom.test(outLow, trials, p = outFreqH0, alternative = "less")
  out$testhigh = binom.test(outHigh, trials, p = outFreqH0, alternative = "greater")

  out$statistic = c(outLow = outLow, outHigh = outHigh, nobs = trials, freqH0 = outFreqH0)

  if(alternative == "greater") p = out$testlow$p.value
  if(alternative == "less") p = out$testhigh$p.value
  if(alternative == "two.sided") p = min(min(out$testlow$p.value, out$testhigh$p.value) * 2,1)

The Issue

This code conflates the testing for too-low outliers (outLow) with the testing for too few outliers (alternative = "less"), and likewise too-high with too many. To illustrate the point, let's consider a model fit with n = 1000 and 500 outliers whose residuals are too small:

out <- list() # just to make the code work with as little changes as possible

outLow = 500
outHigh = 0
trials = 1000
outFreqH0 = 1/(250 +1)

out$testlow = binom.test(outLow, trials, p = outFreqH0, alternative = "less")
out$testhigh = binom.test(outHigh, trials, p = outFreqH0, alternative = "greater")

out$statistic = c(outLow = outLow, outHigh = outHigh, nobs = trials, freqH0 = outFreqH0)
# running two.sided test
p = min(min(out$testlow$p.value, out$testhigh$p.value) * 2,1)
p # == 1 !!!

With a result as horrible as 500 outliers you'd expect an alarm, but in fact the p-value is 1, which is kind of its own issue.
A less exaggerated version of this would be n = 1000 and 0 outliers:

out <- list()

outLow = 0
outHigh = 0
trials = 1000
outFreqH0 = 1/(250 +1)

out$testlow = binom.test(outLow, trials, p = outFreqH0, alternative = "less")
out$testhigh = binom.test(outHigh, trials, p = outFreqH0, alternative = "greater")

out$statistic = c(outLow = outLow, outHigh = outHigh, nobs = trials, freqH0 = outFreqH0)
# running two.sided test
p = min(min(out$testlow$p.value, out$testhigh$p.value) * 2,1)
p # == 0.03692472

but what I would expect is:

p <- binom.test(0, 1000, p = 1/(250 +1), alternative = "two.sided")$p.value
p # == 0.03912085

As we can see, for matching numbers of low and high outliers the resulting p-values are almost, but not precisely, equal to the expected values.

Proposed Solution

After reading the documentation, I expected this test to basically be a binomial test on the total number of outliers:

p <- binom.test(outHigh + outLow, trials, p = outFreqH0, alternative = alternative)$p.value

Replacing the three lines of if-statements with this should just work. It is also highly unintuitive that the alternative argument does more than set the alternative argument in binom.test, so testlow and testhigh need to be changed as well:

out$testlow = binom.test(outLow, trials, p = outFreqH0, alternative = alternative)
out$testhigh = binom.test(outHigh, trials, p = outFreqH0, alternative = alternative)
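
For instance, applying the proposed one-liner to the zero-outlier example from above reproduces the expected value (a minimal sketch with the values hard-coded as in that example):

alternative <- "two.sided"
outLow <- 0; outHigh <- 0
trials <- 1000
outFreqH0 <- 1/(250 + 1)

binom.test(outHigh + outLow, trials, p = outFreqH0, alternative = alternative)$p.value
# == 0.03912085, matching the expectation above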

Also, I think it would be good to indicate in plotQQunif whether the outlier test was significant because of too many or too few outliers. That, in my opinion, makes a bit of a difference.

Closing remark and an offer

This mistake was easy to miss, since low and high outliers usually appear in roughly equal numbers, which produces roughly correct results. So most people who have used the current version probably don't have to worry.

I'm going to have to fix this for my personal use anyway, and if you want I can send you a pull request once I'm done. I could get to it around Thursday, so a timely reply would be nice, but of course that's very short notice, so I'm not holding my breath.

Finally: if I'm mistaken and the way it is written right now is actually well-established statistical practice, I'd like to a) request a citation for that and b) humbly apologize.

@florianhartig
Owner

Hi Lukas,

no, you're quite right - on revisiting the code, the current programming doesn't really make sense to me either. It seems I was solving two things with the same parameter, and using the alternative for the side of the simulation distribution (instead of the null distribution) is definitely not intuitive.

I have proposed a fix in this branch https://github.com/florianhartig/DHARMa/tree/182-outlier - this now allows you to specify the alternative and the margin, i.e. the side of the data simulations at which outliers are considered (see help).
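
For example, usage would then be something like this (a sketch; fittedModel stands for any model supported by DHARMa):

library(DHARMa)
simulationOutput <- simulateResiduals(fittedModel, n = 250)
testOutliers(simulationOutput, alternative = "two.sided", margin = "both")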

I'd appreciate feedback about whether this solves the problem. For instructions on how to install a branch, see the main DHARMa GitHub page. If this solves your problem, I can merge this into master and make a pre-release, so that you can cite this in a paper if needed.

I am still deliberating whether the default for the test should be "two.sided" or "greater". "two.sided", which is the current default, would also flag models with fewer outliers than expected, which strictly speaking is also a problem (akin to underdispersion), but I suspect that people would not be able to interpret this correctly, and I wonder if it's useful to flag a lack of outliers, which is not really a statistical problem (unless the underdispersion is severe). Opinions appreciated.

@florianhartig florianhartig added this to the 0.3.2 milestone Jun 2, 2020
@florianhartig
Owner

p.s. - by re-reading your text, I suspect your opinion is that I should move the default to greater, right? I'm tending towards this as well and will change this now in the code.

@florianhartig florianhartig changed the title p value of outlier test seems to be wrong calculation of p-values in testOutliers() Jun 2, 2020
@Istalan
Author

Istalan commented Jun 3, 2020

Hi Florian,

That was a quick response. I hope you didn't lose any sleep over this. My current residuals project is thankfully more conceptual, meaning there are no immediate deadlines. It should be usable by September, but the current version is, in my opinion, usable for what little residual analysis would make it into that paper. I'll come back to you if the main authors think it's more important.

Proposed fix

The proposed fix seems mostly correct, but for margin = "both" it should be the probability of being an outlier that doubles, not the number of trials.

Lines 178 to 181

 # calculations of trials and H0
  if(margin == "both") trials = 2* simulationOutput$nObs
  else trials = simulationOutput$nObs
  outFreqH0 = 1/(simulationOutput$nSim +1)

What I think would be correct:

# calculations of frequency and trials in H0
outFreqH0 = 1/(simulationOutput$nSim +1) * ifelse(margin == "both", 2, 1)
trials = simulationOutput$nObs

Default alternative

I actually don't have a strong opinion on that. Personally, I'd prefer the default to be two-sided, while appending a message for significance along the lines of "Outlier test significant, 0 outliers to 7.97 expected" (using n = 1000, nSim = 250, and 0 outliers). Since the default test has been two-sided until now and I assume no one has complained, that might be the way to go. Also, passing a one-sided test with p < 0.05 is a lower hurdle to pass, so some model fits for which the current version works more or less correctly, and that were non-significant, would suddenly become significant.

Then again, this is a lot of text, plus you are already running a dispersion test, which should hopefully detect any severe case of under-dispersion. In conclusion, only checking for too many outliers is probably fine from the statistics side of things. It's just a problem of consistency and of being perhaps too sensitive. Testing for p < 0.025 would fix that, but is it worth the extra coding?

If you were to implement my personal favorite you could do something like this:

out$signfMessage <- ifelse(out$p.value <= 0.05,
                           paste("Outlier test significant:", outliers, "outliers to", 
                                 round(trials*outFreqH0, digits = 2), "expected"),
                           "Outlier test n.s.")

You can then reference signfMessage wherever you want to display it, although this would clash with the exact wording in plotQQunif.

Small stuff in the new code

Being most likely the second pair of eyes ever to look over that code, there are a few small things that I would do differently. Here are lines 167 to 181 rewritten, doubling the frequency instead of the trials:

 # check inputs
alternative = match.arg(alternative)# used <- not =
margin = match.arg(margin)
simulationOutput = ensureDHARMa(simulationOutput, convert = "Model")


# calculation of outliers
if(margin == "both")  outliers = sum(simulationOutput$scaledResiduals %in% c(0, 1)) # this is shorter 
if(margin == "upper") outliers = sum(simulationOutput$scaledResiduals == 1) # margin can't have two values so else seems pointless
if(margin == "lower") outliers = sum(simulationOutput$scaledResiduals == 0) # same

# calculations of frequency and trials in H0
outFreqH0 = 1/(simulationOutput$nSim +1) * ifelse(margin == "both", 2, 1) # fixed 
trials = simulationOutput$nObs

Finally

Obviously, use this stuff at your own discretion. I'm not married to my ideas.

@florianhartig
Owner

Hi Lukas,

thanks, yes, you're right that you are the first other set of eyes to look at this, so this discussion is very helpful.

Wrt the frequency of outliers under H0 - you're right, this was again not correct, but I think your solution is not complete either. To summarise my thinking:

  • I simulate nSim times from a distribution (= the model) = SAMPLE
  • now simulate another point as the data (which is the same as drawing from the model under H0)

The question is now: what is the probability that the data point lies outside the SAMPLE? The answer is 1/(nSim + 1), and I had actually already confirmed this a year or so ago https://github.com/florianhartig/DHARMa/blob/master/Code/DHARMaDevelopment/outliers.R when I implemented the first version of this function. Moreover, we would expect the points to fall equally to both sides of the distribution.

Thus, it should actually be

outFreqH0 = 1/(simulationOutput$nSim +1) * ifelse(margin == "both", 1, 0.5)
trials = simulationOutput$nObs

Does that make sense?

Wrt the default alternative - I'll think about it again. In any case, I should probably modify the output to improve the interpretation of the test!

@Istalan
Author

Istalan commented Jun 3, 2020

Does that make sense?

No, I don't think so. The code you linked is this:

n = 10

mean((replicate(1000000, runif(1) < min(runif(n)))))

1/11

This shows that the probability of being smaller than the minimum of nSim simulated values is equal to 1/(nSim+1). So only the lower margin. Add in the upper margin and we get twice the value.

Mathematically:
A := original value smaller than the minimum of the simulated values
B := original value bigger than the maximum of the simulated values
Then A or B = being an outlier (this being a mathematical "or", not an exclusive one)

P(A or B) = P(A) + P(B) - P(A and B)
A and B is of course impossible - you can't be smaller and larger than all the simulated values at the same time - so P(A and B) = 0 and P(A or B) = 2/(nSim + 1).

If you want to torture one of your CPU cores for a minute or two, you can run this:

test_both <- function(n){
  orig_val <- runif(1)
  sim_vals <- runif(n)
  return((orig_val < min(sim_vals)) + (orig_val > max(sim_vals)))
}

mean_test_both <- function(n, m = 10^5){
  print(paste("calcualting for", n))
  mean(replicate(m, test_both(n)))
}

mean_test_both(10)
2/11

sim_result <- vapply(1:100, mean_test_both, FUN.VALUE = 0.5)

plot(sim_result)
curve(2/(x+1), add = TRUE)

plot(sim_result- 2/((1:100)+1))

P.S. I can actually prove the probability for uniform distributions mathematically. I also checked with the standard normal and it still held. I think it should work for all distributions as long as all values are iid, but I failed to prove that fully.

@Istalan
Author

Istalan commented Jun 3, 2020

P.P.S.
Never mind, the math proof for the general case is actually trivially easy. We have nSim simulations + 1 original value; they're all iid, so each one has equal probability of being the minimum, therefore the probability of the original value being the minimum is 1/(nSim + 1), regardless of the specific distribution.

Not even my idea, i got it from here: https://stats.stackexchange.com/a/132961
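
A quick sanity check of the distribution-free claim (same idea as the code above, just a sketch with normal instead of uniform draws):

# the extra draw should still fall below the minimum with probability 1/(n + 1)
n = 10
mean(replicate(10^6, rnorm(1) < min(rnorm(n))))
# ~ 1/11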

@florianhartig
Owner

Arrggh *$$%%%, what a mess. Yes, OK, you're right. It's not about the distribution, and the argument with the min is clear - I was again falling back into one-sided-test thinking. Whatever, I will finish this and merge it into trunk now.

I am also working on a more systematic calibration check for the p-values of all tests; if I had done this earlier, I would have spotted this problem a long time ago.

In any case, many thanks for your comments!

@florianhartig
Owner

OK, I have finished the function and updated the help. In the end, I decided to keep the two-sided test as the default. I added explanations in the help and in the examples, so I hope users will interpret it correctly, also in case the test is significant because of a lack of outliers. I will try to merge this into master now, and will then create an interim release. This should go to CRAN with DHARMa 0.3.2.

@Istalan
Author

Istalan commented Jun 4, 2020

Arrggh *$$%%%

I know that feeling^^

I'm happy I was helpful and I'll be on the lookout for new versions.

Best regards,
Lukas

@florianhartig
Owner

Just to note that this is now on CRAN with version 0.3.2

@Istalan
Author

Istalan commented Jun 22, 2020

Thank you. After explaining the situation once, I'm quite happy not having to do it again.

Apropos explanations, I've been looking through the other tests and they seem fine. I got a little confused by the getP() function used in testDispersion, because it has the same weird min(min(..)) logic, but it's actually correct there. The only thing you are ignoring is that you only have finitely many simulations, and therefore you could put confidence intervals on those p-values^^

I don't see that doing anyone any good, ever, but maybe note somewhere in the documentation that you're ignoring Monte Carlo error.
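
Just to sketch what I mean (toy numbers; simulated and observed here are made-up stand-ins, not DHARMa internals):

set.seed(1)
simulated <- rnorm(250)   # stand-in for nSim simulated test statistics
observed  <- 1.8          # stand-in for the observed statistic

k <- sum(simulated >= observed)
binom.test(k, length(simulated))$estimate   # empirical tail probability
binom.test(k, length(simulated))$conf.int   # its Monte Carlo confidence interval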

The reason I'm telling you right now is that I saw this question about dispersion: #117 (comment)

Here's my take on that.

Looking at:
https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#testing-for-overdispersioncomputing-overdispersion-factor

We can see that the test only works correctly under specific circumstances, like lambda > 5, which would mean linear predictors > log(5), or something like it. For negative linear predictors, one (or a few) unlikely events too many will have relatively large Pearson residuals, which changes the sum of squares massively.

DHARMa, using the simulation approach, will identify the probability of such "distorting" events. Therefore the p-value is not significant.
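
For reference, the Pearson-based check from that FAQ is essentially this (fit being a hypothetical fitted model, not anything from DHARMa):

pearson_chisq <- sum(residuals(fit, type = "pearson")^2)
pearson_chisq / df.residual(fit)                                  # dispersion ratio
pchisq(pearson_chisq, df = df.residual(fit), lower.tail = FALSE)  # its p-value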

Request: plot (squared) Pearson residuals against linear predictors.
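
(A sketch of the plot I mean, using the same hypothetical fit as above:)

plot(predict(fit, type = "link"),
     residuals(fit, type = "pearson")^2,
     xlab = "linear predictor", ylab = "squared Pearson residual")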

All the best,
Lukas

@florianhartig
Owner

Hi Lukas,

thanks for the comments - about the request - could you file a new issue where you explain why you want that, and how you envision the plots? Because at least lme4 already plots Pearson residuals, right?

About #117 - I'm not quite sure if I understand you correctly, but I would be interested to hear your thoughts about this - so you think the Pearson test is wrong, and DHARMa is actually right (and not just insensitive)?

Cheers,
F

@florianhartig
Owner

ps. - maybe move the last point to #117 ?

@Istalan
Author

Istalan commented Jun 22, 2020

Sorry, the communication was unclear. I meant that as a request for the person in the linked issue. I'm just going to comment there now and explain my perspective.
