
Error in approxfun(vals, cumsum(tabulate(match(x, vals)))/(n + 1), method = "linear", : need at least two non-NA values to interpolate #195

Closed
b-jorges opened this issue Jul 20, 2020 · 10 comments

Comments

b-jorges commented Jul 20, 2020

Heyho,

I am getting the following error for one specification of a GLMM:

GLMM_RandomIntercepts_PSE = glmer(cbind(Yes, Total - Yes) ~ ConditionOfInterest*Difference + (1| ID), 
                                  family = binomial(link = "probit"),
                                  data = Psychometric)

I tried playing around a bit:

  • adding random slopes for "Difference" per ID (didn't work),
  • swapping the probit link for a logit link, keeping the random effects as is (worked),
  • adding random slopes for "Difference" per ID AND swapping the probit link for a logit link (didn't work).

For context: I am working on guidelines for how to best analyze a particular type of data, so I am evaluating the "best" model specification from several perspectives, one of which is the extent to which each specification meets the model assumptions. It would therefore be extremely useful to be able to compare all relevant specifications.

Here is a (trimmed) version of the analyses (includes simulation of data), including the four cases I am mentioning:
https://github.com/b-jorges/Power-Analyses-Psychophysics/blob/master/Sample%20Code.R

Do you see any simple fix for that?

Thanks,
Björn

PS: Awesome package!

florianhartig (Owner) commented:

OK, for future reference, I'll paste the reproducible example here:

require(dplyr)
require(tidyverse)
require(lme4)
require(DHARMa)


set.seed(1)


ID = paste0("S0",1:10)
ConditionOfInterest = c(0,1)
StandardValues = c(5,6,7,8)
reps = 1:100

PSE_Difference = 0.5
JND_Difference = 0.5

Multiplicator_PSE_Standard = 0
Multiplicator_SD_Standard = 0.15
Type_ResponseFunction = "Normal"
SD_ResponseFunction = 0.1
Mean_Variability_Between = 0.1
SD_Variability_Between = 0.1


Psychometric = expand.grid(ID=ID, ConditionOfInterest=ConditionOfInterest, StandardValues=StandardValues, reps = reps)

Psychometric = Psychometric %>%
  group_by(ID) %>%
  mutate(PSE_Factor_ID = rnorm(1,1,Mean_Variability_Between), # how much variability is in the means of the psychometric functions between subjects?
         SD_Factor_ID = rnorm(1,1,SD_Variability_Between)) # how much variability is in the standard deviations of the psychometric functions between subjects?

Psychometric = Psychometric %>%
  mutate(
    Mean_Standard = StandardValues+StandardValues*Multiplicator_PSE_Standard,
    SD_Standard = StandardValues*Multiplicator_SD_Standard,
    Mean = (Mean_Standard + (ConditionOfInterest==1)*Mean_Standard*PSE_Difference),
    SD = abs(SD_Standard + (ConditionOfInterest==1)*SD_Standard*JND_Difference))

Psychometric = Psychometric %>%
  mutate(
    Mean = Mean*PSE_Factor_ID,
    SD = SD*SD_Factor_ID)

if (Type_ResponseFunction == "Normal"){ # "Normal" (capitalized) matches the value set above
  
  Psychometric = Psychometric %>%
    mutate(
      staircase_factor = rnorm(length(reps),1,SD_ResponseFunction*(1+ConditionOfInterest*JND_Difference))) # rnorm (random draws), not pnorm
  
} else if (Type_ResponseFunction == "Cauchy"){
  
  Psychometric = Psychometric %>%
    mutate(
      staircase_factor = rcauchy(length(reps),1,SD_ResponseFunction*(1+ConditionOfInterest*JND_Difference)))
  
  #} else if (Type_ResponseFunction == "uniform"){
  #  
  #  Psychometric = Psychometric %>%
  #  mutate(
  #      staircase_factor = seq(SD_ResponseFunction[1],SD_ResponseFunction[2],(SD_ResponseFunction[2]-SD_ResponseFunction[1]/6)))
  
} else{
  
  print("distribution not valid")
  
}

Psychometric = Psychometric %>%
  mutate(
    staircase_factor = rcauchy(length(reps),1,SD_ResponseFunction), # note: this overwrites the staircase_factor set in the if/else above
    Presented_TestStimulusStrength = Mean*staircase_factor,
    Difference = Presented_TestStimulusStrength - StandardValues)

Psychometric = Psychometric %>%
  mutate(
    AnswerProbability = pnorm(Presented_TestStimulusStrength,Mean,SD),
    
    ##get binary answers ("Test was stronger" yes/no) from probabilities for each trial
    Answer = as.numeric(rbernoulli(length(AnswerProbability),AnswerProbability))
  )

###prepare for glmer()
Psychometric = Psychometric %>%
  filter(abs(staircase_factor-1) < 0.75) %>%
  group_by(ID,ConditionOfInterest,StandardValues,Difference) %>%
  mutate(Yes = sum(Answer==1),
         Total = length(ConditionOfInterest))




GLMM_RandomIntercepts_PSE = glmer(cbind(Yes, Total - Yes) ~ ConditionOfInterest + Difference + (1| ID), 
                                  family = binomial(link = "probit"),
                                  data = Psychometric)

Sim = simulateResiduals(GLMM_RandomIntercepts_PSE)
plot(Sim)

@florianhartig florianhartig changed the title Error for simulateResiduals() with glmerMOD object Error in approxfun(vals, cumsum(tabulate(match(x, vals)))/(n + 1), method = "linear", : need at least two non-NA values to interpolate Jul 20, 2020
florianhartig (Owner) commented:

The problem is that several data points in the simulation have zero variance. You can see this via

apply(getSimulations(GLMM_RandomIntercepts_PSE, 100), 1, var)

I think other users have had this problem before. I guess the best way to deal with it would be to set the residual to NA instead of throwing an error. I should have implemented that before and will do so now!
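To see why zero variance is fatal here, consider a hedged, self-contained sketch (the names below are illustrative, not DHARMa internals): an empirical CDF built via approxfun() needs at least two distinct support points, and a zero-variance simulation row provides only one.

```r
# Illustration (not DHARMa code): a zero-variance simulation row has only
# one unique value, which is too few support points for approxfun().
sims_row <- rep(3, 100)     # all 100 simulations for one observation are identical
vals <- unique(sims_row)
length(vals)                # 1 -> approxfun(vals, ...) would stop with
                            # "need at least two non-NA values to interpolate"
```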

florianhartig (Owner) commented:

After a bit of thinking, I guess it should evaluate to U[0,1] instead of NA. I have implemented this now and it seems to work OK; I will push it to master. If you install the development version, this should work now.

florianhartig (Owner) commented:

Note to self: this needs a small correction to check for < and >.


Istalan commented Aug 21, 2020

Hi,

This appears to be an issue stemming from the function DHARMa.ecdf. While I'm in favor of not calling it in the PIT algorithm, it would also be good to rewrite it so it can handle vectors with only one unique value. My proposal is to prepend a value just smaller than the data, min(x) - e, which gets a function value of 0:

DHARMa.ecdf <- function (x)
{
  x <- sort(x)
  n <- length(x)
  if (n < 1)
    stop(paste("DHARMa.ecdf - length vector < 1", x))
  e <- 10^-15 # numeric close to zero to identify a point just before the data with F(min(x) -e) = 0
  vals <- c(x[1] - e, unique(x))
 
  rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/ (n +1) ,
                    method = "linear", yleft = 0, yright = 1, ties = "ordered")
  class(rval) <- c("ecdf", "stepfun", class(rval))
  assign("nobs", n, envir = environment(rval))
  attr(rval, "call") <- sys.call()
  rval
}

This approach has some weirdness with regard to numerical precision, but if that is relevant to the data set, then R is probably not the right platform anyway. Also, weirdly, both this and the original function divide by n+1, which leads to F(max(x)) = 1 - 1/(n+1). This seems to be intentional, but I'm not all that convinced by the overall setup.

Here are my tests:

my_ecdf <- DHARMa.ecdf(rep(1, 250))

my_ecdf(-1)
my_ecdf(0)
e <- 10^-15 
my_ecdf(1-e)
# strange behaviour at the limit of numerical precision
my_ecdf(1-e/1.05)
my_ecdf(1-(e/1.06))
my_ecdf(1) # this also happens with the as-is version, since it divides by n+1
1-(1/251)
my_ecdf(1 + e) # making it slightly larger obtains the 1 

The default ecdf function behaves precisely as expected, but it is not symmetric or anything, so perhaps one could use something like (ecdf(x) + ecdf(x+e))/2. Sadly this doesn't work on its own, so a bit of additional effort would have to be invested.

my_ecdf <- ecdf(rep(1, 250))

my_ecdf(-1)
my_ecdf(0)
e <- 10^-15 
my_ecdf(1-e)
# numerical issues happen at smaller values
my_ecdf(1-e/10)
my_ecdf(1-(e/20))
my_ecdf(1) # takes the standard value of 1 at max(x)
1-(1/251)
my_ecdf(1 + e)
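The averaging idea mentioned above can be sketched as follows (purely illustrative, not a proposed patch): combine the standard right-continuous ecdf with a copy built on slightly shifted data, so the single unique value maps to 0.5 instead of 1.

```r
# Sketch only: average the standard ecdf with a copy shifted by e.
e <- 10^-15
x <- rep(1, 250)
F_right <- ecdf(x)       # right-continuous: F_right(1) = 1
F_shift <- ecdf(x + e)   # shifted data, all > 1: F_shift(1) = 0
F_mid <- function(q) (F_right(q) + F_shift(q)) / 2
F_mid(0)   # 0
F_mid(1)   # 0.5 at the single unique value, instead of 1
F_mid(2)   # 1
```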

florianhartig (Owner) commented:

Hi Lukas, I agree with not calling DHARMa.ecdf in the PIT calculation; this is indeed unnecessary. Your modifications may very well be a good idea, but at the moment my priority is to correct the current errors in DHARMa asap. I will thus open a new ticket for the ecdf and move this to the next commit.

I just modified the code, the issue above should be solved for a standard normal user!


James-Thorson-NOAA commented May 31, 2021

Hi all,

I just closed issue #286 but have some questions about the fix that was implemented here (which appears to be the same issue). Hopefully my questions are helpful and not annoying, and thanks for your attention.

My interpretation of PIT residuals is that:

  1. If all of simulatedResponse[i,] (i.e., the simulated approximation to the distribution for sample i) are less than observedResponse[i] (the observed value for sample i), then the quantile residual should be 1.
  2. Similarly, if all simulatedResponse[i,] are greater than observedResponse[i], then the quantile residual should be 0.

This is true even in instances where var(simulatedResponse[i,]) = 0, which were throwing an error. In such cases, the third possible scenario is:

  3. If all simulatedResponse[i,] are equal to observedResponse[i], then the quantile residual should be U(0,1).

I believe these three are a complete set of possible outcomes when var=0, and my interpretation of the proper fix appears to differ from some of the proposals discussed in the thread, i.e., using U(0,1) or fixing a value of 0 for all three cases when any simulated variance is zero (where by my logic these solutions are appropriate for some but not all such cases).

Do y'all want to discuss further here? If yes, are you willing to summarize what the ultimate fix was that was implemented?


Istalan commented May 31, 2021

Hi,

There was no need to apologize in the original thread; I just felt awkward answering without a preface.

Anyway, this issue was resolved by changing the PIT algorithm to no longer use the custom ecdf. You can find the current implementation here:

DHARMa/DHARMa/R/helper.R

Lines 87 to 93 in bcf9e1e

scaledResiduals = rep(NA, n)
for (i in 1:n){
  minSim <- mean(simulations[i,] < observed[i])
  maxSim <- mean(simulations[i,] <= observed[i])
  if (minSim == maxSim) scaledResiduals[i] = minSim
  else scaledResiduals[i] = runif(1, minSim, maxSim)
}

I agree with your interpretation, and I'm quite sure that we implemented that behavior. Plus, if all simulations and the observed value are identical, the residual is U[0, 1] distributed. The discussion on this issue is hard to follow, as it simultaneously took place in #197, which had a different focus.

Please feel free to double check the behavior of the current implementation. It can't hurt.
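For a quick standalone double check, the loop body from helper.R can be exercised on the three cases (sketch; `pit` here is just a hypothetical local wrapper, not a DHARMa export):

```r
# Standalone check of the PIT loop body (pit is a hypothetical wrapper).
pit <- function(sims, obs) {
  minSim <- mean(sims < obs)    # fraction of simulations strictly below the observation
  maxSim <- mean(sims <= obs)   # fraction of simulations at or below the observation
  if (minSim == maxSim) minSim else runif(1, minSim, maxSim)
}
pit(rep(1, 100), 2)       # all simulations below the observation -> 1
pit(rep(1, 100), 0)       # all simulations above the observation -> 0
r <- pit(rep(1, 100), 1)  # all equal to the observation -> a draw from U(0, 1)
```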

florianhartig (Owner) commented:

Hi @James-Thorson-NOAA - as @Istalan said, my reading is that the code is implementing the behavior that you describe.

James-Thorson-NOAA commented:

I agree, thanks for pointing me to the code; the snippet appears to do what I describe (and previous and ongoing use testing doesn't indicate any problems either).
