calculation of p-values in testOutliers() #182
Hi Lukas, no, you're quite right. On revisiting the code, the current programming doesn't really make sense to me either: it seems I was solving two things with the same parameter, and using the alternative for the side of the simulation distribution (instead of the null distribution) is definitely not intuitive.

I have proposed a fix in this branch: https://github.com/florianhartig/DHARMa/tree/182-outlier. This now allows you to specify the alternative and the margin, i.e. the side of the data simulations at which outliers are considered (see help). I'd appreciate feedback on whether this solves the problem. For instructions on how to install a branch, see the main DHARMa GitHub page. If this solves your problem, I can merge it into master and make a pre-release, so that you can cite it in a paper if needed.

I am still deliberating whether the default for the test should be "two.sided" or "greater". "two.sided", which is the current default, would also flag models with fewer outliers than expected, which strictly speaking is also a problem (akin to underdispersion), but I suspect that people would not be able to interpret this correctly, and I wonder if it's useful to flag a lack of outliers, which is not really a statistical problem (unless the underdispersion is severe). Opinions appreciated.
---

p.s. - re-reading your text, I suspect your opinion is that I should move the default to "greater", right? I'm tending towards this as well and will change it now in the code.
---

Hi Florian,

That was a quick response, I hope you didn't lose any sleep over this. My current residuals project is thankfully more conceptual, meaning there are no immediate deadlines. It should be usable by September, but the current version is in my opinion usable for what little residual analysis would make it into that paper. I'll come back to you if the main authors think it's more important.

Proposed fix

The proposed fix seems mostly correct, but for lines 178 to 181:

```r
# calculations of trials and H0
if(margin == "both") trials = 2 * simulationOutput$nObs
else trials = simulationOutput$nObs
outFreqH0 = 1/(simulationOutput$nSim + 1)
```

what I think would be correct is:

```r
# calculations of frequency and trials in H0
outFreqH0 = 1/(simulationOutput$nSim + 1) * ifelse(margin == "both", 2, 1)
trials = simulationOutput$nObs
```

Default alternative

I actually don't have a strong opinion on that. Personally I'd prefer the default to be two-sided, with the significance message extended accordingly. Then again, this is a lot of text, plus you are already running a dispersion test, which should hopefully detect any severe case of under-dispersion. In conclusion, only checking for too many outliers is probably fine from the statistics side of things. It's just a problem of consistency and of perhaps being too sensitive. Testing for p < 0.025 would fix that, but is it worth the extra coding?

If you were to implement my personal favourite, you could do something like this:

```r
out$signfMessage <- ifelse(out$p.value <= 0.05,
                           paste("Outlier test significant:", outliers, "outliers to",
                                 round(trials*outFreqH0, digits = 2), "expected"),
                           "Outlier test n.s.")
```

You can then reference `out$signfMessage` wherever the result is reported.

Small stuff in the new code

Being most likely the second pair of eyes ever to look over that code, there are a few small things that I would do differently. Here are lines 167 to 181, rewritten with doubling freq instead of trials:

```r
# check inputs
alternative = match.arg(alternative) # used <- not =
margin = match.arg(margin)
simulationOutput = ensureDHARMa(simulationOutput, convert = "Model")

# calculation of outliers
if(margin == "both")  outliers = sum(simulationOutput$scaledResiduals %in% c(0, 1)) # this is shorter
if(margin == "upper") outliers = sum(simulationOutput$scaledResiduals == 1) # margin can't have two values, so else seems pointless
if(margin == "lower") outliers = sum(simulationOutput$scaledResiduals == 0) # same

# calculations of frequency and trials in H0
outFreqH0 = 1/(simulationOutput$nSim + 1) * ifelse(margin == "both", 2, 1) # fixed
trials = simulationOutput$nObs
```

Finally

Obviously, use this stuff at your own discretion. I'm not married to my ideas.
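For concreteness, the corrected frequency could be wired into a complete test along these lines. This is a sketch, not DHARMa's actual code: the `simulationOutput` list below is a hypothetical mock carrying only the fields used here (`nObs`, `nSim`, `scaledResiduals`); real objects come from `DHARMa::simulateResiduals()`.

```r
# Hypothetical mock of the DHARMa simulationOutput fields used by the test
set.seed(1)
simulationOutput <- list(
  nObs = 1000,
  nSim = 250,
  # scaled residuals are uniform under a well-specified model; outliers are
  # values outside the simulated range, encoded as exactly 0 or 1
  scaledResiduals = c(rep(0, 4), rep(1, 5), runif(991))
)
margin <- "both"

# count outliers on the requested side(s)
outliers <- sum(simulationOutput$scaledResiduals %in% c(0, 1))

# frequency of outliers under H0: 1/(nSim + 1) per margin, doubled for "both"
outFreqH0 <- 1 / (simulationOutput$nSim + 1) * ifelse(margin == "both", 2, 1)
trials <- simulationOutput$nObs

out <- binom.test(outliers, trials, p = outFreqH0, alternative = "two.sided")
out$p.value  # 9 observed vs ~8 expected under H0: not significant
```

With ~8 outliers expected under H0 and 9 observed, the two-sided binomial test stays non-significant, as it should for a well-calibrated model.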
---

Hi Lukas, thanks, yes, you're right that you are the first other set of eyes to look at this, so this discussion is very helpful.

Wrt the frequency of outliers under H0: you're right, this was not correct again, but I think your solution is not complete either. To summarise my thinking: the question is, what is the probability that the data point lies outside the SAMPLE? The answer is 1/(nSim + 1), and I had actually already confirmed this a year ago or so (https://github.com/florianhartig/DHARMa/blob/master/Code/DHARMaDevelopment/outliers.R) when I implemented the first version of this function. Moreover, we would expect that the points fall equally to both sides of the distribution. Thus, it should actually be 1/(nSim + 1) for both margins together.

Does that make sense? Wrt the default alternative, I'll think about it again. In any case, I should probably modify the output to improve the interpretation of the test!
---

No, I don't think so. The code you linked is this:

```r
n = 10
mean((replicate(1000000, runif(1) < min(runif(n)))))
# ~ 1/11
```

This shows the probability of being smaller than the minimum of nSim simulated values to be equal to 1/(nSim + 1). So only the lower margin. Add in the upper margin and we get twice the value. Mathematically: P(A or B) = P(A) + P(B) - P(A and B), and since a point cannot be below the minimum and above the maximum at the same time, P(A and B) = 0, giving 2/(nSim + 1) in total.

If you want to torture one of your CPU cores for a minute or two, you can run this:

```r
test_both <- function(n){
  orig_val <- runif(1)
  sim_vals <- runif(n)
  return((orig_val < min(sim_vals)) + (orig_val > max(sim_vals)))
}

mean_test_both <- function(n, m = 10^5){
  print(paste("calculating for", n))
  mean(replicate(m, test_both(n)))
}

mean_test_both(10)
# ~ 2/11

sim_result <- vapply(1:100, mean_test_both, FUN.VALUE = 0.5)
plot(sim_result)
curve(2/(x + 1), add = TRUE)
plot(sim_result - 2/((1:100) + 1))
```

P.S. I can actually prove the probability for uniform distributions mathematically. I also checked with the standard normal and it still held. I think it should work for all distributions as long as all values are iid, but I failed to prove that fully.
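For reference, the general iid proof is in fact short; here is a sketch of the rank/exchangeability argument (an editorial condensation, not text from the thread):

```latex
\text{Let } x_0 \text{ be the observed value and } x_1,\dots,x_n \text{ the simulations,}
\text{ iid from any continuous distribution.} \\
\text{By exchangeability, each of the } n+1 \text{ values is equally likely to be the smallest, so} \\
P\left(x_0 < \min_{1\le i\le n} x_i\right) = \frac{1}{n+1},
\qquad
P\left(x_0 > \max_{1\le i\le n} x_i\right) = \frac{1}{n+1}. \\
\text{The two events are disjoint for } n \ge 1 \text{, hence} \\
P(\text{outlier}) = \frac{1}{n+1} + \frac{1}{n+1} = \frac{2}{n+1}.
```

This needs nothing beyond iid and continuity, which is why the uniform and normal checks both give the same answer.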
---

P.P.S. Not even my idea, I got it from here: https://stats.stackexchange.com/a/132961
---

Arrggh *$$%%%, what a mess. Yes, OK, you're right. It's not about the distribution, and the argument with the min is clear, but I was again falling back into one-sided test thinking, whatever. I will finish this and merge it into trunk now. I am also working on a more systematic calibration check for the p-values of all tests; if I had done this earlier, I would have spotted this problem a long time ago. In any case, many thanks for your comments!
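One way such a calibration check could look (a sketch, not DHARMa code; the residual mechanism is collapsed into the binomial draw that the 2/(nSim + 1) argument implies):

```r
set.seed(42)
nObs <- 100; nSim <- 50; reps <- 500
outFreqH0 <- 2 / (nSim + 1)  # expected outlier frequency under H0

pvals <- replicate(reps, {
  # under H0 each observation falls outside its simulated range with
  # probability 2/(nSim + 1), so the outlier count is binomial
  outliers <- rbinom(1, size = nObs, prob = outFreqH0)
  binom.test(outliers, nObs, p = outFreqH0, alternative = "two.sided")$p.value
})

# for a calibrated test the rejection rate at level alpha stays near, or for
# a discrete test somewhat below, alpha
mean(pvals <= 0.05)
```

A full calibration check would repeat this over a grid of nObs and nSim and also inspect the whole p-value distribution (it should be roughly uniform), not just the rejection rate at one level.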
---

OK, I have finished the function and updated the help. In the end I decided to keep the two-sided test as the default. I added explanations in the help and in the examples, so I hope users will interpret it correctly, also in case the test is significant because of a lack of outliers. I will try to merge this into master now, and will then create an interim release. This should go to CRAN with DHARMa 0.3.2.
---

I know that feeling ^^ I'm happy I was helpful, and I'll be on the lookout for new versions. Best regards,
---

Just to note that this is now on CRAN with version 0.3.2.
---

Thank you. After explaining the situation once, I'm quite happy not having to do it again.

Apropos explanations, I've been looking through the other tests and they seem fine. I got a little confused about the getP() function used in testDispersion, because it has the same weird min(min(..)) logic, but it's actually correct there. The only thing you are ignoring is that you only have finitely many simulations, and therefore you could put confidence intervals on those p-values ^^ I don't see that doing anyone any good ever, but maybe note somewhere in the documentation that you're ignoring Monte Carlo error.

The reason I'm telling you right now is that I saw this question about dispersion: #117 (comment). Here's my take on that: we can see that the Pearson test only works correctly under specific circumstances, like lambda > 5, which would mean linear predictors > log(5), or something like it. For negative linear predictors, one (or a few) unlikely events too many will have a relatively large Pearson residual, which changes the sum of squares massively. DHARMa, using the simulation approach, will identify the probability of such "distorting" events; therefore its p-value is not significant. Request: plot (squared) Pearson residuals against linear predictors.

All the best,
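The small-lambda point can be illustrated with a toy check (my own sketch, not code from the thread): for a correctly specified Poisson model with lambda well below 5, the classic Pearson chi-square dispersion test compares a heavily skewed statistic against a chi-square reference, so its p-values are no longer calibrated.

```r
set.seed(7)
reps <- 200; n <- 100

pearson_p <- replicate(reps, {
  y <- rpois(n, lambda = 0.1)          # correctly specified, small lambda
  fit <- glm(y ~ 1, family = poisson)
  stat <- sum(residuals(fit, type = "pearson")^2)
  # classic dispersion test: compare to chi-square with residual df
  pchisq(stat, df = df.residual(fit), lower.tail = FALSE)
})

# nominal level is 0.05; with lambda this small the chi-square approximation
# is poor (a single extra count inflates the statistic by roughly 1/lambda),
# so the observed rejection rate can be well off the nominal level
mean(pearson_p <= 0.05)
```

This is exactly the regime the comment describes: the Pearson statistic is dominated by a handful of rare counts, while a simulation-based comparison accounts for how likely such counts are.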
---

Hi Lukas, thanks for the comments. About the request: could you file a new issue where you explain why you want that, and how you envision the plots? Because at least lme4 already plots Pearson residuals, right? About #117: I'm not quite sure I understand you correctly, but I would be interested to hear your thoughts on this. So you think the Pearson test is wrong, and DHARMa is actually right (and not just insensitive)? Cheers,
---

ps. - maybe move the last point to #117?
---

Sorry, the communication was unclear. I meant the request for the person in the linked issue. I'm just going to comment there now and explain my perspective.
---

First of all, this is a lovely package and it does everything I ever wanted from a residual analysis.
Now, with the well-deserved flattery out of the way, here is my problem.
The Code with the Issue
The calculation of the p-value of testOutliers in Test.R, lines 176 to 188.

The Issue
This code conflates the testing of too-low outliers (outLow) with the testing for too few outliers (alternative = "less"), and the same for too-high and too many. To illustrate the point, let's consider a model fit with n = 1000 and 500 outliers of too small a residual:

With such a horrible result as 500 outliers you'd expect an alarm, but in fact the p-value is 1, which is kind of its own issue.

A less exaggerated version of this would be with n = 1000 and 0 outliers:

but what I would expect is:

As we can see, for matching numbers of low and high outliers the resulting p-values are almost correct, but do not precisely equal the expected values.
Proposed Solution
After reading the documentation, I expected this test to basically be a binomial test on the number of outliers:

Replacing the 3 lines of if-statements with this should just work. It is also highly unintuitive that the alternative argument does more than set the alternative argument in binom.test, so testlow and testhigh need to be changed as well.

Also, I think it would be good to indicate in plotQQunif whether the outlier test was significant because of too many or too few outliers. That, in my opinion, makes a little bit of a difference.
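To make the proposal concrete, here is a hypothetical sketch (not the actual patch) of a plain binomial test on the outlier count, applied to the 500-outlier example above. The values nSim = 250 and the doubled H0 frequency for both margins are assumptions for illustration:

```r
nObs <- 1000; nSim <- 250
outliers <- 500                      # the pathological example from above
outFreqH0 <- 2 / (nSim + 1)          # expected outlier frequency under H0

out <- binom.test(outliers, nObs, p = outFreqH0, alternative = "two.sided")
out$p.value < 0.05                   # TRUE: 500 outliers raises the alarm
```

Unlike the conflated version, the binomial test alarms here regardless of which margin the outliers fall on, because the count itself is what is tested.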
Closing remark and an offer
This mistake was easy to miss, since low and high outliers usually appear in roughly the same numbers, and that produces roughly correct results. So most people who have used the current version probably don't have to worry.

I'm going to have to fix this for my personal use anyway, and if you want I can send you a pull request once I'm done. I could get to it around Thursday, so a timely reply would be nice, but of course that's very short notice, so I'm not holding my breath.

Finally: if I'm mistaken and the way it is written right now is actually well-established statistical practice, I'd like to a) request a citation for that and b) humbly apologize.