wonky plot from `check_model()` on a `glmmTMB` example #654

bbolker · 2023-11-21T15:13:19Z

This is from an nbinom1 model - the "overdispersion" and "normality of residuals" plots both look odd ...

library(glmmTMB)
library(dplyr) ## for mutate_at, %>%
#Build example data
x <- c("A", "B", "C", "D")
(time <- rep(x, each=20, times=3)) #time factor
y <- c("exposed", "ref1", "ref2")
(lake <- rep (y, each=80))  #lake factor
set.seed(123)
min <- runif(n=240, min=4.5, max=5.5) #mins used in model offset
set.seed(123)
(count <- rnbinom(n=240,mu=10,size=100)) #randomly generated negative binomial data

#make data frame
dat <- as.data.frame(cbind(time, lake, min, count)) 
dat <- dat %>% 
   mutate_at(c('min', 'count'), as.numeric)

#remove one combination of factors to make example rank deficient (all observations from time A and lake ref1)
dat2 <- filter(dat, time!="A" | lake !="ref1")

model <-glmmTMB(count~time*lake, family=nbinom1,
                      control = glmmTMBControl(rank_check = "adjust"),
                      offset=log(min), data=dat2)

library(performance)
check_model(model)

The text was updated successfully, but these errors were encountered:

strengejacke · 2024-02-04T23:34:07Z

A quick guess is that inappropriate residuals are calculated. This is the code to detect overdispersion for this specific model:

    d <- data.frame(Predicted = stats::predict(model, type = "response"))
    d$Residuals <- insight::get_response(model) - as.vector(d$Predicted)
    d$Res2 <- d$Residuals^2
    d$V <- d$Predicted * (1 + d$Predicted / insight::get_sigma(model))
    d$StdRes <- insight::get_residuals(model, type = "pearson")

And the qq-plot for glmmTMB simply uses stats::residuals(model).

If you don't have a suggestion for calculating the most appropriate residuals, the best solution is probably to go with simulated residuals and diagnostics based on DHARMa (#643)

bbolker · 2024-02-05T00:00:27Z

OK, the Q-Q plot should definitely be using stats::residuals(model, type = "pearson"). Or residuals(model, type = "deviance") if that's what the y-axis label says ... (we can't yet get standardized deviance residuals, because we still haven't got machinery to extract the hat matrix ...)

I don't know enough about how the columns in the overdispersion machinery are being used downstream or why you need to compute the variance yourself - it seems like it should be possibly to get it more reliably with built-in glmmTMB machinery (but I haven't dug around there ...)

Fixes #654

easystats/performance#654

strengejacke · 2024-02-05T07:43:27Z

I don't know enough about how the columns in the overdispersion machinery are being used downstream or why you need to compute the variance yourself - it seems like it should be possibly to get it more reliably with built-in glmmTMB machinery (but I haven't dug around there ...)

@bwiernik I think you wrote most/all of the code for the overdisperson plot/check.

easystats/performance#654

bwiernik · 2024-02-05T16:48:49Z

Not sure what the reason was for this decision (i.e. using halfnorm) - I can remember we had good reasons, but can't remember details. @easystats/core-team any ideas? @bbolker what would you suggest, the new solution presented below? (as long as we don't rely on DHARMa)

Base R switched to using half-normnal for binomial and count models. I would suggest we keep using it for glmmTMB for the relevant families?

bwiernik · 2024-02-05T16:50:10Z

I don't know enough about how the columns in the overdispersion machinery are being used downstream or why you need to compute the variance yourself - it seems like it should be possibly to get it more reliably with built-in glmmTMB machinery (but I haven't dug around there ...)

@bwiernik I think you wrote most/all of the code for the overdisperson plot/check.

I think I just wrote one set of code there and didn't check if anything was already available for glmmTMB models. It would be good to use existing machinery there if possible.

strengejacke · 2024-02-05T17:01:57Z

I would suggest we keep using it for glmmTMB for the relevant families?

That would be poisson and binomial, but not nbinom?

strengejacke · 2024-02-05T17:04:18Z

It would be good to use existing machinery there if possible.

Ok - but I'm not sure how to revise the relevant code section. Happy if someone could make a proposal.

bwiernik · 2024-02-05T18:16:08Z

I would suggest we keep using it for glmmTMB for the relevant families?

That would be poisson and binomial, but not nbinom?

Base R uses a half-normal with absolute standardized deviance residuals for any family of model fit with glm(). We should probably be a little smarter than that and limit it to non-Gaussian models.

If we can't get standardized residuals, then we probably need to adjust the reference distribution to not be a standard normal/half-normal.

bwiernik · 2024-02-05T18:41:44Z

Let me dig into these plots a bit and see what the most justifiable default should be.

strengejacke · 2024-02-05T19:46:26Z

or why you need to compute the variance yourself - it seems like it should be possibly to get it more reliably with built-in glmmTMB machinery (but I haven't dug around there ...)

@bbolker we compute the per-observation variance - not sure how to do this with family$variance()?

bbolker · 2024-02-05T19:59:09Z

Hmm. In principle it would be nice if it were sigma(model)^2*family(model)$variance(predict(model, type = "response")) (at least for models without ZI or dispersion components ...), but sigma.glmmTMB is wonky - returns different values, not necessarily equal to the sqrt of the scaling factor of the $variance() function, for different families (see ?sigma.glmmTMB): maybe there needs to be another function that will get us what we want (or a flag to sigma.glmmTMB ?)

bwiernik · 2024-02-05T21:14:24Z

Theory on half-normal plot for deviance residuals https://www.jstatsoft.org/article/view/v081i10

strengejacke · 2024-02-05T21:23:01Z

This looks better for glmmTMB now, but overdispersion plot for glm.nb looks strange to me. Maybe it's correct, though? See #680

library(glmmTMB)
library(performance)
library(MASS)
library(dplyr) ## for mutate_at, %>%
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:MASS':
#> 
#>     select
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

x <- c("A", "B", "C", "D")
time <- rep(x, each = 20, times = 3) # time factor
y <- c("exposed", "ref1", "ref2")
lake <- rep(y, each = 80) # lake factor
set.seed(123)
min <- runif(n = 240, min = 4.5, max = 5.5) # mins used in model offset
set.seed(123)
count <- rnbinom(n = 240, mu = 10, size = 100) # randomly generated negative binomial data

# make data frame
dat <- as.data.frame(cbind(time, lake, min, count))
dat <- dat %>%
  mutate_at(c("min", "count"), as.numeric)

# remove one combination of factors to make example rank deficient (all observations from time A and lake ref1)
dat2 <- filter(dat, time != "A" | lake != "ref1")

model <- glmmTMB(count ~ time * lake,
  family = nbinom1,
  control = glmmTMBControl(rank_check = "adjust"),
  offset = log(min), data = dat2
)
#> dropping columns from rank-deficient conditional model: timeD:lakeref1
check_model(model)
#> `check_outliers()` does not yet support models of class `glmmTMB`.

m2 <- glm.nb(count ~ time * lake + offset(log(min)), data = dat2)
check_model(m2)

^{Created on 2024-02-05 with reprex v2.1.0}

bbolker · 2024-02-06T01:30:17Z

We still need to be careful. m3 <- update(model, family = nbinom2); plot(check_overdispersion(m3)) is also busted (precisely because sigma() has the inverse meaning for nbinom2 ... (haven't dug in enough to see what's up with glm.nb)

strengejacke · 2024-02-06T07:55:55Z

precisely because sigma() has the inverse meaning for nbinom2

We can add a different handling for nbinom2, but what would be the solution in this case? 1/sigma?

strengejacke · 2024-02-06T08:03:50Z

If the new code is correct, this would be the result.

First, the new code:

  if (faminfo$is_negbin && !faminfo$is_zero_inflated) {
    if (inherits(model, "glmmTMB")) {
      d <- data.frame(Predicted = stats::predict(model, type = "response"))
      d$Residuals <- insight::get_residuals(model, type = "pearson")
      d$Res2 <- d$Residuals^2
      d$StdRes <- insight::get_residuals(model, type = "pearson")
      if (faminfo$family == "nbinom1") {
        # for nbinom1, we can use "sigma()"
        d$V <- insight::get_sigma(model)^2 * stats::family(model)$variance(d$Predicted)
      } else {
        # for nbinom2, "sigma()" has "inverse meaning" (see #654)
        d$V <- (1 / insight::get_sigma(model)^2) * stats::family(model)$variance(d$Predicted)
      }
    } else {
      ## FIXME: this is not correct for glm.nb models?
      d <- data.frame(Predicted = stats::predict(model, type = "response"))
      d$Residuals <- insight::get_response(model) - as.vector(d$Predicted)
      d$Res2 <- d$Residuals^2
      d$V <- d$Predicted * (1 + d$Predicted / insight::get_sigma(model))
      d$StdRes <- insight::get_residuals(model, type = "pearson")
    }
  }

Result:

model <- glmmTMB(count ~ time * lake,
  family = nbinom1,
  control = glmmTMBControl(rank_check = "adjust"),
  offset = log(min), data = dat2
)
m3 <- update(model, family = nbinom2)

p1 <- plot(check_overdispersion(model))
p2 <- plot(check_overdispersion(m3))

p1 + p2

bbolker · 2024-02-06T14:36:11Z

Looks good (although obviously glm.nb still needs to be fixed ... moving forward I'd like to try to find a way to fix the inconsistency of sigma() definitions (see also glmmTMB/glmmTMB#951). Maybe we could provide some kind of variance function that gave the predicted variance as a function of mu directly (rather than the predicted scaled variance ...)

strengejacke · 2024-02-06T14:48:10Z

While we're talking about variance, documentation etc.:

This is the variance-function from beta-families:

glmmTMB::beta_family()$variance
#> function (mu) 
#> {
#>     mu * (1 - mu)
#> }
#> <bytecode: 0x000001dc83b3ae58>
#> <environment: 0x000001dc83b3aa30>

The docs, however, say:

Could you clarify which one is correct? I used the one from the docs in insight::get_variance() instead of family$variance(), and if the latter is correct (not the docs), then this could resolve some issues (https://github.com/easystats/insight/issues?q=is%3Aissue+is%3Aopen+label%3Aget_variance)

bwiernik · 2024-02-06T15:02:40Z

Looks good (although obviously glm.nb still needs to be fixed ... moving forward I'd like to try to find a way to fix the inconsistency of sigma() definitions (see also glmmTMB/glmmTMB#951). Maybe we could provide some kind of variance function that gave the predicted variance as a function of mu directly (rather than the predicted scaled variance ...)

A "variance" option in predict() would be super useful

bbolker · 2024-02-06T15:04:19Z

That's the thing: the $variance() function as defined/implemented in base R does not return the variance for a given mean, it returns the scaled variance (even though ?family says it is "the variance as a function of the mean"). For example, gaussian$variance is function (mu) rep.int(1, length(mu)). We probably screwed up when we allowed sigma.glmmTMB to return things that are not scale parameters; it should be the case that

sigma(model)^2*family(model)$variance(predict(model, type = "response"))

works generally ...

It could be worth making breaking changes to sigma.glmmTMB to fix this ... although I guess adding a type = "variance" objection to predict would be the most straightforward/least-breaking solution ...

strengejacke · 2024-02-06T21:31:38Z

I think this is where my lack of statistical expertise prevents me from understanding how to proceed ;-)

The initial code base in insight::get_variance() was drafted by you, and I tried to enhance for further model families. Based on Nakagawa et al. 2017, you commented:

in general want log(1+var(x)/mu^2)

https://github.com/easystats/insight/blob/5f886703cf2303d8f0d0c44b89b29dd092046360/R/compute_variances.R#L603

This is, e.g., what is done for the beta-family, based on the docs (see my screenshot above):

# Get distributional variance for beta-family
# ----------------------------------------------
.variance_family_beta <- function(x, mu, phi) {
  if (inherits(x, "MixMod")) {
    stats::family(x)$variance(mu)
  } else {
    mu * (1 - mu) / (1 + phi)
  }
}

This sometimes leads to weird results when computing R2 or ICC for mixed models. For most families, performance::r2_nakagawa() is in line with results from other packages. However, "new" families, which are not yet supported by other packages, sometime produce weird results for performance::r2_nakagawa(). Not sure if the distributional variance (calculated in insight:::..compute_variance_distribution() and insight:::.variance_distributional()) is accurate for all model families.

Is there some way to "validate" the results against something? E.g. against non-mixed or Bayesian models, or some simulated data where the R2 is known? (not sure how to simulate such data, though)

strengejacke · 2024-02-12T12:07:20Z

Initial issue should be fixed, but re-open for the later discussed points here.

strengejacke · 2024-03-16T09:21:53Z

Q-Q plot now based on DHARMa (see #643), but still need to think about overdispersion plots.

library(glmmTMB)
library(performance)
library(datawizard)
# Build example data
x <- c("A", "B", "C", "D")
time <- rep(x, each = 20, times = 3) # time factor
y <- c("exposed", "ref1", "ref2")
lake <- rep(y, each = 80) # lake factor
set.seed(123)
min <- runif(n = 240, min = 4.5, max = 5.5) # mins used in model offset
set.seed(123)
count <- rnbinom(n = 240, mu = 10, size = 100) # randomly generated negative binomial data

# make data frame
dat <- as.data.frame(cbind(time, lake, min, count))
dat <- dat |>
  data_modify(.at = c("min", "count"), .modify = as.numeric)

# remove one combination of factors to make example rank deficient (all observations from time A and lake ref1)
dat2 <- data_filter(dat, time != "A" | lake != "ref1")

model <- glmmTMB(count ~ time * lake,
  family = nbinom1,
  control = glmmTMBControl(rank_check = "adjust"),
  offset = log(min), data = dat2
)
#> dropping columns from rank-deficient conditional model: timeD:lakeref1

check_model(model)
#> `check_outliers()` does not yet support models of class `glmmTMB`.

^{Created on 2024-03-16 with reprex v2.1.0}

strengejacke · 2024-03-17T12:09:53Z

@bwiernik Based on my comments here: #643 (comment)

the question is whether we need to do anything regarding the code of the overdispersion plot? The current code relies on residuals / Pearson residuals. To check for over-/underdispersion in more complex models, we now use simulated residuals based on the DHARMa package. Can these residuals possibly be used for the code to create overdispersion plots?

A draft to play with is .new_diag_overdispersion:

performance/R/check_model_diagnostics.R

Line 296 in 35b5e19

.new_diag_overdispersion <- function(model, ...) {

I'm not sure if this code really works well? I'm not fully understanding the implementation in .diag_overdispersion:

performance/R/check_model_diagnostics.R

Line 370 in 35b5e19

.diag_overdispersion <- function(model, ...) {

and how this "translated" into a function using simulated residuals?

strengejacke · 2024-03-17T14:45:18Z

Or is the major concern still the variance function and/or sigma?

bwiernik · 2024-03-17T14:47:25Z

I think we should be able to use the dharma residuals. Let me take a look

strengejacke mentioned this issue Nov 21, 2023

DHARMa implementation for new check_residuals() function #643

Merged

15 tasks

strengejacke added bug 🐛 Something isn't working 3 investigators ❔❓ Need to look further into this issue labels Nov 22, 2023

This comment was marked as outdated.

Sign in to view

strengejacke self-assigned this Feb 5, 2024

strengejacke added a commit that referenced this issue Feb 5, 2024

wonky plot from check_model() on a glmmTMB example

b4cf7a7

Fixes #654

strengejacke mentioned this issue Feb 5, 2024

wonky plot from check_model() on a glmmTMB example #680

Merged

strengejacke added a commit to easystats/see that referenced this issue Feb 5, 2024

glmmTMB shouldn't use qqhalfnormal

5c17618

easystats/performance#654

strengejacke mentioned this issue Feb 5, 2024

glmmTMB shouldn't use qqhalfnormal easystats/see#322

Merged

This comment was marked as outdated.

Sign in to view

strengejacke added a commit to easystats/see that referenced this issue Feb 5, 2024

glmmTMB shouldn't use qqhalfnormal (#322)

90cb572

easystats/performance#654

strengejacke added a commit to easystats/see that referenced this issue Feb 5, 2024

revert change (easystats/performance#654)

10d812d

strengejacke closed this as completed in #680 Feb 12, 2024

strengejacke reopened this Feb 12, 2024

strengejacke mentioned this issue Mar 17, 2024

Improving check_model() for GLMs #376

Open

This was referenced Mar 18, 2024

Revising check_model() #697

Open

Revising check_model() #698

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wonky plot from `check_model()` on a `glmmTMB` example #654

wonky plot from `check_model()` on a `glmmTMB` example #654

bbolker commented Nov 21, 2023

strengejacke commented Feb 4, 2024

bbolker commented Feb 5, 2024 •

edited

Loading

This comment was marked as outdated.

This comment was marked as outdated.

strengejacke commented Feb 5, 2024

bwiernik commented Feb 5, 2024

bwiernik commented Feb 5, 2024

strengejacke commented Feb 5, 2024

strengejacke commented Feb 5, 2024

bwiernik commented Feb 5, 2024

bwiernik commented Feb 5, 2024

strengejacke commented Feb 5, 2024

bbolker commented Feb 5, 2024

bwiernik commented Feb 5, 2024

strengejacke commented Feb 5, 2024

bbolker commented Feb 6, 2024

strengejacke commented Feb 6, 2024

strengejacke commented Feb 6, 2024

bbolker commented Feb 6, 2024

strengejacke commented Feb 6, 2024

bwiernik commented Feb 6, 2024

bbolker commented Feb 6, 2024 •

edited

Loading

strengejacke commented Feb 6, 2024

in general want log(1+var(x)/mu^2)

strengejacke commented Feb 12, 2024

strengejacke commented Mar 16, 2024

strengejacke commented Mar 17, 2024 •

edited

Loading

strengejacke commented Mar 17, 2024

bwiernik commented Mar 17, 2024

wonky plot from check_model() on a glmmTMB example #654

wonky plot from check_model() on a glmmTMB example #654

Comments

bbolker commented Nov 21, 2023

strengejacke commented Feb 4, 2024

bbolker commented Feb 5, 2024 • edited Loading

This comment was marked as outdated.

This comment was marked as outdated.

strengejacke commented Feb 5, 2024

bwiernik commented Feb 5, 2024

bwiernik commented Feb 5, 2024

strengejacke commented Feb 5, 2024

strengejacke commented Feb 5, 2024

bwiernik commented Feb 5, 2024

bwiernik commented Feb 5, 2024

strengejacke commented Feb 5, 2024

bbolker commented Feb 5, 2024

bwiernik commented Feb 5, 2024

strengejacke commented Feb 5, 2024

bbolker commented Feb 6, 2024

strengejacke commented Feb 6, 2024

strengejacke commented Feb 6, 2024

bbolker commented Feb 6, 2024

strengejacke commented Feb 6, 2024

bwiernik commented Feb 6, 2024

bbolker commented Feb 6, 2024 • edited Loading

strengejacke commented Feb 6, 2024

in general want log(1+var(x)/mu^2)

strengejacke commented Feb 12, 2024

strengejacke commented Mar 16, 2024

strengejacke commented Mar 17, 2024 • edited Loading

strengejacke commented Mar 17, 2024

bwiernik commented Mar 17, 2024

wonky plot from `check_model()` on a `glmmTMB` example #654

wonky plot from `check_model()` on a `glmmTMB` example #654

bbolker commented Feb 5, 2024 •

edited

Loading

bbolker commented Feb 6, 2024 •

edited

Loading

strengejacke commented Mar 17, 2024 •

edited

Loading