Model failing outlier test because of too few outliers? #197

Closed
DanielWhat opened this issue Jul 21, 2020 · 21 comments
@DanielWhat

I'm running a probit regression to try to explain a binary dependent variable with some binary and continuous regressors (independent variables). The residual diagnostics seem to be OK except for the outlier test. Strangely though, the outlier test is failing because there are too few outliers (rather than too many).

[screenshots of the DHARMa residual diagnostic plots]
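For reference, the test output below was presumably produced along these lines (the model formula and data names other than model_residuals are hypothetical):

# Hypothetical reconstruction of the workflow, not the original code
library(DHARMa)
fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = dat)
model_residuals <- simulateResiduals(fittedModel = fit)
plot(model_residuals)
testOutliers(model_residuals)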

DHARMa outlier test based on exact binomial test

data:  model_residuals
outliers at both margin(s) = 0, simulations = 72723, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.0009995002
95 percent confidence interval:
 0.000000e+00 5.072379e-05
sample estimates:
frequency of outliers (expected: 0.000999500249875062 ) 
                                                      0 

You indicate here that a lack of outliers can be caused by underdispersion. But my dispersion plot looks to be well behaved.

[screenshot of the DHARMa dispersion plot]

So given that the lack of outliers is not arising from underdispersion, does this mean I don't have to worry about my model having too few outliers?

Thanks

@florianhartig (Owner)

Hi,

thanks for bringing this to my attention - I think this is an issue with the test and not with your model. I also get significance for a correctly specified binomial model if the data size is large, which shouldn't be the case:

library(DHARMa)

testData = createData(sampleSize = 5000, fixedEffects = 1, family = binomial(), randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ 1, family = "binomial", data = testData)

res <- simulateResiduals(fittedModel = fittedModel)
testOutliers(res)

I have to investigate what the problem is here - possibly related to recent changes that I have made to testOutliers or to the PIT residual function that came with DHARMa 0.3.1.

@Istalan commented Aug 13, 2020

Hi Florian,

I spent a bit of time thinking about this and I think the issue is twofold:

1. Conceptually, outlier probabilities for discrete random variables are more complex.

The logic that the observed value is the maximal value among nSim+1 values with probability 1/(nSim+1) only works if the probability of multiple random numbers sharing the maximal value is 0. This is the case for all continuous random variables. For the traditional residuals you had this part, where the addition of U[-0.5, 0.5] noise made the residuals continuous:

file: helper.R Line: 78

if(integerResponse == T){
        scaledResiduals[i] <- DHARMa.ecdf(simulations[i,] + runif(nSim, -0.5, 0.5))(observed[i] + runif(1, -0.5, 0.5))
      }

Nothing like this happens in the PIT algorithm, as comparisons with the simulated responses happen without making them continuous. As a matter of fact, you can see that you have too few outliers for Poisson models with small lambdas:

testData = createData(sampleSize = 5000, fixedEffects = 1, family = poisson(), randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)

res <- simulateResiduals(fittedModel = fittedModel)
testOutliers(res)
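To illustrate the underlying point about ties, here is a quick sketch of my own (not DHARMa code): for a continuous variable the observation exceeds all nSim simulations with probability 1/(nSim+1), while for a low-mean Poisson, ties make this much rarer.

# Quick illustration (my own sketch, not DHARMa code)
nSim <- 250; nRep <- 10000
mean(replicate(nRep, rnorm(1) > max(rnorm(nSim))))           # close to 1/(nSim+1) ~ 0.004
mean(replicate(nRep, rpois(1, 0.5) > max(rpois(nSim, 0.5)))) # far below 1/(nSim+1)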

2. The specific implementation of PIT makes it impossible for a logistic regression to have an outlier:

file: helper.R Line: 88

    scaledResiduals = rep(NA, n)
    for (i in 1:n){
      if(length(unique(simulations[i,])) == 1) scaledResiduals[i] = runif(1, 0, 1) # offending part
      else{
        min <- sum(simulations[i,] < observed[i]) / length(simulations[i,])
        max <- sum(simulations[i,] <= observed[i]) / length(simulations[i,])
        if (min == max) scaledResiduals[i] = DHARMa.ecdf(simulations[i,])(observed[i])
        else{
          scaledResiduals[i] = runif(1, min, max)
        }        
      }
    }

This means scaledResiduals[i] == 1 only when all simulations are smaller than the observed value. Now if the observations take the values 0 and 1, then all simulations must be 0 for an upper outlier to occur - but then length(unique(simulations[i,])) == 1, so the if statement cancels the calculation and draws the residual from runif(1, 0, 1) instead. I honestly don't see why that branch is there at all.
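To make this concrete, a tiny illustration with made-up values:

# Hypothetical 0/1 case: observed response is 1, all 250 simulations are 0
simulations_i <- rep(0, 250)
observed_i <- 1
length(unique(simulations_i)) == 1  # TRUE, so the branch above fires and the residual
# is drawn from runif(1, 0, 1) instead of being 1 - the observation can never be an outlier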

Solutions

The fix for 2 is obvious, but I'm very uncertain about 1. The correct probabilities for outliers are difficult to calculate and depend on the type of model plus the fitted parameters. If we knew these, we could also avoid the simulation completely by implementing PIT with the exact CDFs, but that would be a substantial amount of effort for each type of model.

Maybe we're lucky and it works out that for the PIT residual u, P(u > 1 - 1/(nSim+1)) = 1/(nSim+1), as simulation error and the rarity of events cancel out. Sadly I haven't been able to work it out and really shouldn't spend the time on it this week. Perhaps I'll have time next week, but I wanted to write down and share what I think I've worked out so far.

P.S. Any specific reason for having your own ecdf function? It seems weird to me. Also, calling it in the PIT implementation is unnecessary. I'd write it like this:

scaledResiduals = rep(NA, n)
for (i in 1:n){
  min <- sum(simulations[i,] < observed[i]) / length(simulations[i,])
  max <- sum(simulations[i,] <= observed[i]) / length(simulations[i,])
  if (min == max) {
    scaledResiduals[i] = max
  }
  else{
    scaledResiduals[i] = runif(1, min, max)
  }
}

@Istalan commented Aug 15, 2020

Hi,
Me again. I haven't finished my other project, but I also couldn't keep my hands off this.

In short: P(u > 1 - 1/(nSim+1)) = 1/(nSim+1) appears to work as an outlier test, with the new definition of an outlier now being: values larger than all simulations, or larger than the second-largest simulation and "lucky"/"unlucky" in the randomisation. Everything is symmetric for the lower margin, of course.

Here's a test implementation which uses DHARMa the way I think it gets used in simulateResiduals:

require(DHARMa)


PIT_test_outlier <- function(fittedModel, nSim = 250){
  simulations <- getSimulations(fittedModel, nsim = nSim, type = "normal")
  observed <- getObservedResponse(fittedModel)
  n <- length(observed)
  
  scaledResiduals = rep(NA, n)
  for (i in 1:n){
    # renamed these, because I'm uncomfortable overwriting base R
    pSmaller <- sum(simulations[i,] < observed[i]) / length(simulations[i,])
    pSmallerOrEq <- sum(simulations[i,] <= observed[i]) / length(simulations[i,])
    if (pSmaller == pSmallerOrEq) {
      scaledResiduals[i] = pSmallerOrEq
    }
    else{
      scaledResiduals[i] = runif(1, pSmaller, pSmallerOrEq)
    }       
  }
# just a look at the distribution
  hist(scaledResiduals)
  
  return(binom.test(sum(scaledResiduals < (1/(nSim+1))) +  sum(scaledResiduals > (1-1/(nSim+1))), 
                    n = n, p = 2/(nSim+1)))
}

It passes the basic smell test:

testData = createData(sampleSize = 5000, fixedEffects = 1, family = binomial(), randomEffectVariance = 0)
fittedModel_bin <- glm(observedResponse ~ 1, family = "binomial", data = testData)

PIT_test_outlier(fittedModel_bin)

It can detect a relatively small amount of overdispersion:

testData = createData(sampleSize = 500, fixedEffects = 1, family = poisson(), randomEffectVariance = 0, 
                      overdispersion = 0.2)
fittedModel_pois <- glm(observedResponse ~ 1, family = "poisson", data = testData)

PIT_test_outlier(fittedModel_pois)

I've also run the example from issue #195, and since this doesn't use ecdf, the problem doesn't show up.

Have a good look at it, try it out some, and if you're OK with it, maybe I'd try to implement it in the actual package myself.

@florianhartig (Owner)

Hi Lukas,

I removed the DHARMa.ecdf in the PIT residuals to solve #195, so the current code for the PIT residuals is basically what you have.

Your definition of the outliers is an excellent idea. The only caveat is that it is not as straightforward as before, where an outlier was really outside the simulation envelope.

After much back and forth, I have nevertheless decided to use it in the outlier test. I am still a bit undecided whether I should use it in the plots; I feel it's hard to interpret and possibly confusing to a user that a particular data point would be highlighted just because the randomisation procedure put it in the "outlier range".

p.s.: I think it's fair to add you as a contributor to the package, will do so with the next CRAN update (which may be today).

@florianhartig (Owner)

Sorry, I forgot to link the changes, the commit is 258a328

@florianhartig (Owner)

OK, I take this back; the issue is not fully solved. I think there is a bias in the test when nSim is small, which comes out more strongly for the Poisson. Try e.g.

testData = createData(sampleSize = 50000, fixedEffects = 1, family = binomial(), randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ 1, family = "binomial", data = testData)

res <- simulateResiduals(fittedModel = fittedModel, n = 10)
testOutliers(res)

or

testData = createData(sampleSize = 50000, fixedEffects = 1, family = poisson(), randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)

res <- simulateResiduals(fittedModel = fittedModel, n = 100)
#plot(res)
testOutliers(res)

It seems the test becomes calibrated if we set nSim large, which, of course, is not really the point of the test in the first place; the point was to see if the outliers are an issue, and for the rest I have the dispersion test.
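For example, re-running the Poisson case above with a much larger nSim illustrates this (slow, purely to check the claim):

# Illustrative check: with many simulations the expected frequency 2/(nSim+1) fits much better
res <- simulateResiduals(fittedModel = fittedModel, n = 1000)
testOutliers(res)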

I think I will have to re-think the sense of this test entirely, but for the moment I will nevertheless push the current fix to CRAN, because it is vastly better than what is in there at the moment, and I won't have time to work on this during the next week.

@florianhartig (Owner)

Uff, turns out CRAN is on holidays and submissions are not possible - so, @Istalan, if you have any ideas feel free to comment. I will probably be able to work on this next Friday, and would like to update the package then one way or another.

@Istalan commented Aug 26, 2020

Hi Florian,

I believe the technical term is egg on my face. I've been looking at the issue, but so far I have come to no real conclusions. I'll write out where I'm at tomorrow, but I'm starting to think that it is not really solvable for Poisson models.

@Istalan commented Aug 27, 2020

Hi,

So I've done a bunch of simulations and I'm now quite certain that no a priori adjustment of the outlier frequency is possible for unbounded integer models. Specifically, for Poisson there seem to be three phases:

1.) The simulations only cover a modest portion of the probability mass of the distribution. Mostly observable with 50 or 100 simulations, where not only are there too many outliers but also too many extreme residuals in general. 20-30% too many outliers.

2.) (Only relevant for large lambda) The simulations insufficiently cover both tails. The overall distribution of the residuals is uniform, but there are still 20-30% too many outliers.

3.) For larger nSim, 0 is usually part of the simulation and the excess of outliers drops to 10-20%, as lower outliers are now in line.

When these happen depends on the lambda. Perhaps for extremely high nSim it works as intended, but there's no guarantee for that, since the relevant ratio of increasingly tiny probabilities doesn't have to change. Also, I'm working on 4 GB of RAM right now.

But what mostly happens is that the test loses its power once nSim gets close to n. For example, for nSim = n/10 the difference between 10 and 13 outliers is almost certainly not significant, even if the overabundance is still in full effect. The test remains overly sensitive, but perhaps this could be something like a guideline?
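To put numbers on the loss of power (made-up numbers, just for illustration):

# With ~10 expected outliers, observing 13 (a 30% excess) is nowhere near significant
nObs <- 5000
pExpected <- 10 / nObs                    # expected frequency that yields ~10 outliers
binom.test(13, n = nObs, p = pExpected)   # p-value far above 0.05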

Negative binomial has a similar issue, but in my test the overabundance was only about 10%. However, I don't know what a good range of values for testing it would be.

I've attached my code for the simulations. I hope you come to a different, more hopeful conclusion from them. Also, perhaps we should collect some models to do standard tests for future changes to DHARMa.

best regards,
Lukas

outlier_PIT.txt

@Istalan commented Aug 27, 2020

I hadn't saved my final changes:
outlier_PIT.txt

@Istalan commented Aug 31, 2020

For maybe a bit of additional illumination, I've looked into the theoretical probabilities. I figured them out for the case of the value being larger than all simulations, and while it hasn't revealed much, it does look pretty ¯\_(ツ)_/¯

i <- 0:50000
nSim <- 50

# generalizing
true_outlier <- function(lam){
  sum(ppois(i-1, lambda = lam)^nSim * dpois(i, lambda = lam))
}

lambdas <- exp(seq(-10, 10, by = 0.2))
# check if max(i) is large enough
ppois(max(i), max(lambdas)) == 1

true_prop <- sapply(lambdas, FUN = true_outlier)
#plot(lambdas, true_prop, log = "x")
plot(log(lambdas), true_prop, 
     ylab = paste("probability of value larger than all", nSim, "simulations")
)
abline(h = 1/(nSim+1)) 

As you can see, at the extremes it becomes either almost binomial or almost continuous.

A few checks against "reality"

n <- 10^5
true_outlier(1) * n
PIT_test_outlier(glm(rpois(n, 1) ~ 1, family = "poisson"), nSim)

true_outlier(exp(-4)) * n
PIT_test_outlier(glm(rpois(n,exp(-4)) ~ 1, family = "poisson"), nSim)

# the test really is calibrated for super large lambdas,...
true_outlier(exp(10)) * n
PIT_test_outlier(glm(rpois(n,exp(10)) ~ 1, family = "poisson"), nSim)

The same principles apply to the negative binomial, and it seems much more likely to be at one of the two extremes:


true_outlier_negBinom <- function(prob, size){
  sum(pnbinom(i-1, prob  = prob, size = size)^nSim * 
      dnbinom(i, prob  = prob, size = size)
    )
}

true_outlier_negBinom(size = 10, prob = exp(-1)) * n
fittedModel_negBinom <- MASS::glm.nb(rnbinom(n, size = 10, prob = exp(-1)) ~ 1)

PIT_test_outlier(fittedModel_negBinom, nSim)

props <- exp(seq(-8, 0, by = 0.2))

true_prop <- sapply(props, FUN = function(x){
  true_outlier_negBinom(size = 10, prob = x)}
  )
plot(log(props), true_prop)
abline(h = 1/(nSim+1))

@florianhartig (Owner)

OK, I'm just brainstorming here, what are the options:

  1. Keep the function as it is now (with the approximate, less uncalibrated frequencies)
  2. Somehow generate correct frequencies, e.g. via simulation / re-sampling
  3. Switch outliers back to the old residuals
  4. Remove outliers

@florianhartig (Owner)

What I'm thinking about option 2: it should be possible to take the simulations and bootstrap the PIT procedure to generate an approximate expectation for the outliers. I think I will try this now.
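Roughly, what I have in mind (a sketch only, not the final implementation): treat each simulation column in turn as if it were the observed data, compute PIT residuals against the remaining columns, and average the resulting outlier frequencies to get an expectation under H0.

# Sketch only; 'sims' is an nObs x nSim matrix of simulated responses,
# e.g. res$simulatedResponse for a DHARMa simulation output 'res'
expectedOutlierFreq <- function(sims){
  nSim <- ncol(sims)
  freqs <- sapply(1:nSim, function(i){
    res_i <- sapply(1:nrow(sims), function(j){
      pSmaller     <- mean(sims[j, -i] <  sims[j, i])
      pSmallerOrEq <- mean(sims[j, -i] <= sims[j, i])
      if (pSmaller == pSmallerOrEq) pSmallerOrEq else runif(1, pSmaller, pSmallerOrEq)
    })
    # nSim - 1 reference columns, so the outlier threshold is 1/nSim per margin
    mean(res_i < 1/nSim | res_i > 1 - 1/nSim)
  })
  mean(freqs)   # simulated expectation of the outlier frequency under H0
}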

@florianhartig (Owner)

OK, I have just pushed a version of the test with a bootstrap, by accident directly into master, but I will continue to work there ... it seems to work fine; the only issue is that it's quite slow, so the question is when to activate this.

@Istalan commented Sep 2, 2020

Bootstrapping the tests is a great idea.

I have a few notes on what you've written so far:

  1. A small one, but you wrote out$method = "DHARMa outlier test based on boostrap" in the approximate section (Line 216).

  2. You are using this code in your bootstrap.

 residuals <- getQuantile(simulations = simulationOutput$simulatedResponse[,-i], observed = simulationOutput$simulatedResponse[, i], ...)

simulations = simulationOutput$simulatedResponse[,-i] means that you're working with one fewer simulation, so you're really bootstrapping nSim - 1. How about: simulations = simulationOutput$simulatedResponse[, sample((1:nSim)[-i], nSim, replace = T)]

  3. More importantly, the approximate method with 2/(nSim+1) underestimates the amount of outliers by up to 30% (in my tests). It is therefore more suited to detecting a lack of outliers. If you wanted it to detect an excess, perhaps a one-sided test against H0: p = 4/(nSim+1) might work (see the sketch below). That would then only detect relatively large excesses. I'm assuming that there are no parameters for which a model would naturally exceed that; for Poisson it really should be bounded at about 30%. I'll make some investigations into this later.

Point 3 leads over to your question about use cases, and I'd say that, since 2/(nSim+1) works perfectly for continuous cases, it should be used for them. Otherwise I don't quite have an opinion yet.
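For concreteness, the one-sided variant from point 3 would look something like this (a sketch with placeholder counts, not DHARMa code):

# Sketch of a one-sided excess-outlier test (placeholder values)
nSim <- 250
nObs <- 5000
outlierCount <- 30   # placeholder: observed number of outliers at both margins
binom.test(outlierCount, n = nObs, p = 4/(nSim + 1), alternative = "greater")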

@florianhartig (Owner)

Hi Lukas,

  1. yes, all outputs need to be cleaned up still

  2. yes, it's more a jackknife at the moment really. I was thinking about replace = T, but then for some reason while programming I shied away from it, and I can't really give a good explanation for it. I guess I might as well go to replace = T and implement the standard bootstrap.

  3. OK, but then why 4, and not 3/(nSim + 1)? To be honest, in your simulations it seemed to me any value might appear.

@florianhartig (Owner)

OK, surprisingly, the real bootstrap does not seem to generate correct values, while the jackknife does ... I have no idea why, but check out replacing https://github.com/florianhartig/DHARMa/blob/master/DHARMa/R/tests.R#L233

I have not yet changed the expected outlier frequency from 2/(nSim+1), and I'm somewhat hesitant to change it to an arbitrary value.

Currently, I have

  • set the default to bootstrap
  • warned about the non-exactness of the binomial test for integer-valued distributions
  • switched the plot function to the binomial test for nObs > 500 for speed reasons, but added a warning message to re-run with the bootstrap

This is all still a bit shaky / hacky, and a more elegant (ideally exact) solution would be appreciated, but I am still tending to push this to CRAN this evening, as it is way better than the current version.

@Istalan commented Sep 2, 2020

While writing this I noticed that I was starting to ramble. I just don't have the time to really get into it today. I'll revisit it on Friday.

I think an exact solution is always going to depend on the exact parameters of the fitted distribution for each observation X. If you have this exact distribution you don't need to simulate; you can just use PIT as intended and there will be no residuals of exactly 0 or 1.
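For a Poisson model, for example, the exact randomised PIT would look something like this (a sketch, not DHARMa code):

# Exact randomised PIT for Poisson observations y with fitted mean lambda (sketch)
exactPIT_pois <- function(y, lambda){
  runif(length(y), min = ppois(y - 1, lambda), max = ppois(y, lambda))
}
hist(exactPIT_pois(rpois(10000, 3), 3))   # uniform on (0, 1), never exactly 0 or 1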

4 was just a random number that gave plenty of space to the unknown. What I've simulated has always been in the range of 2-2.7.

Also, if we have those exact distributions for integer values, then sum(P(X <= k-1)^nSim * P(X = k)) over all k in Z is going to give us the precise probability of X being larger than nSim iid values. For Poisson that looks like this:

true_outlier <- function(lam){
  sum(ppois(i-1, lambda = lam)^nSim * dpois(i, lambda = lam))
}

Sadly this changes a lot with lambda, going from 0 to 1/(nSim+1). I'm sure the lower bound is almost identical. Now extending the definition of an outlier to u > 1 - 1/(nSim+1) sort of evens out the changes by giving more new outliers when the amount of true outliers declines. There are just a few too many when it's not at the extremes. And I only see values between 1.35/(nSim+1) (a new high I saw today, at lambda = exp(0.9) and nSim = 25) and, of course, 1/(nSim+1), or for both margins 2-2.7.

I'm currently trying to use optim to find the maximal deviations, but it doesn't play well with the randomness and it's starting to look increasingly complicated. Perhaps u > 1 - 1/(nSim+1) was a bad idea, although while it changes, it never goes too far off.

I'll get back to you Friday evening.

@Istalan commented Sep 5, 2020

Hi,

So first off: I figured out what's wrong with replace = T.

sel = sample((1:simulationOutput$nSim)[-i], size = nboot, replace = T)

should use nSim, not nboot. Otherwise you only use 100 simulations, not 250. I've tested it and it seems to fix the issue.
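In other words, that line would presumably become:

sel = sample((1:simulationOutput$nSim)[-i], size = simulationOutput$nSim, replace = T)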

Otherwise I'm still very stuck on trying to find a better theoretical foundation for the alternative outlier definition. If S are the simulations, I've figured out P(max(S) = k), but the next step would be to find the distribution of how many simulated responses are equal to k. Let's call it B. My assumption would be:

B = 1 (the one value that must be the maximum) + Binom(p = P(X = k | X <= k), n = nSim - 1), with X iid just like everything else.

But simulations show that not only is my p off, but even with the empirical p, the distribution is not binomial. This troubles me immensely, as it implies a lack of independence.

But even if we were to figure this out, the best we could do is rigorously define an upper bound for what the outlier probability could be. Unless we come up with a very different definition of an outlier, the probability will continue to depend on the exact parameters.

To summarize: I'm out of ideas, and the bootstrap works.

Also, here's my code for the distributions, but I'm reluctant to clean it up much because it is starting to feel like a waste of time:

nSim <- 50
# this part works
p_k_is_max <- function(k, lam){
  ppois(k, lambda = lam)^nSim * (1-(1-dpois(k, lambda = lam))^nSim) 
}

p_k_is_max(0:10, lam = 1)
x <- replicate(100000, max(rpois(nSim, 1)))
table(x)/(p_k_is_max(2:10, lam = 1) * 100000)

# this part doesn't
# it just looks at how many extra maxima there should be
num_max_duplicates <- function(k, lam){
  dpois(k, lambda = lam)/ppois(k, lambda = lam) * (nSim -1)
}

num_max_duplicates(0:10, 1)

simul_max <- replicate(1000000,{ 
  x <- rpois(nSim, 1)
  k <- max(x)
  freq <- sum(x == k) - 1
  c(k, freq)
})


simul_max <- matrix(simul_max, ncol = 2, byrow = T)
colnames(simul_max) <- c("k", "num")
res_sim <- aggregate(num ~ k, data = simul_max, FUN = mean)
res_sim
table(simul_max[, 1])
hist(simul_max[simul_max[, 1] == 2, 2])

loc_k <- 3
loc_p <- (res_sim[res_sim$k == loc_k, 2])/(nSim-1) 
# close but no cigar
#qqplot(simul_max[simul_max[, 1] == loc_k, 2], rbinom(100000, prob = loc_p, size = nSim-1))

tab <- table(simul_max[simul_max[, 1] == loc_k, 2])/sum(simul_max[, 1] == loc_k)

tab/dbinom(as.integer(names(tab)), prob = loc_p, size = nSim-1)

@florianhartig (Owner)

Hi Lukas,

many thanks for your thoughts! I think I will wrap this up here and close this issue for the moment, and bring DHARMa 0.3.3 on its way (I had actually submitted this last week, but there was a CRAN complaint about another small thing, and I was on holidays over the weekend).

I'll open a new issue for 0.3.4 or onwards, to see if we can find a better solution to the problem of H0 for the binomial.

@florianhartig (Owner)

p.s.: I fixed the nboot typo, thanks for that!
