Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving check_model() for GLMs #376

Open
Tracked by #698 ...
bwiernik opened this issue Nov 8, 2021 · 57 comments
Open
Tracked by #698 ...

Improving check_model() for GLMs #376

bwiernik opened this issue Nov 8, 2021 · 57 comments
Labels
enhancement 💥 Implemented features can be improved or revised

Comments

@bwiernik
Copy link
Contributor

bwiernik commented Nov 8, 2021

The current selection of plots returned by check_model() for GLMs aren't ideal in a few ways.

1. They are missing a linearity check (fitted vs residuals). For binomial models, this should be a called to binned_residuals(). For other families, the standard check is fine.
2. For binomial models, the constant variance plot should be omitted.
3. For binomial models, the residual QQ plot is hard to interpret.
4. For non-bernoulli models, we should include a plot for checking overdispersion.

For the latter few points, the DHARMa package provides an easy-to-interpret approach for checking distributional assumptions from qq plots and problems with fitted vs residual plots using quantile residuals. We might consider soft-importing DHARMa or re-implementing those approaches.
https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html
florianhartig/DHARMa#33

@bwiernik bwiernik added the enhancement 💥 Implemented features can be improved or revised label Nov 8, 2021
@bbolker

This comment was marked as outdated.

@bwiernik

This comment was marked as outdated.

@bbolker

This comment was marked as outdated.

@bwiernik

This comment was marked as outdated.

@IndrajeetPatil

This comment was marked as outdated.

@DominiqueMakowski

This comment was marked as outdated.

@bwiernik

This comment was marked as outdated.

@bwiernik

This comment was marked as outdated.

@bbolker

This comment was marked as outdated.

@bwiernik

This comment was marked as outdated.

@bbolker

This comment was marked as outdated.

@bwiernik

This comment was marked as outdated.

@jebyrnes
Copy link

jebyrnes commented Dec 9, 2021

Would it be easier to pull in https://github.com/florianhartig/DHARMa as its very flexible in terms of model type?

@bbolker
Copy link

bbolker commented Dec 11, 2021

https://gist.github.com/bbolker/945e684dc148239364ab0f99c9a1beed

I realized that the residuals from binomials are not binomial (so constructing binomial CIs isn't useful); however I think that a binned plot using mean_cl_boot() to generate CIs is probably fine. See examples and decide what you think is useful or not ...

@bwiernik

This comment was marked as outdated.

@strengejacke

This comment was marked as outdated.

strengejacke added a commit that referenced this issue Feb 25, 2022
@bwiernik

This comment was marked as outdated.

strengejacke added a commit to easystats/see that referenced this issue Feb 25, 2022
@strengejacke

This comment was marked as outdated.

@strengejacke

This comment was marked as outdated.

@strengejacke

This comment was marked as outdated.

@bwiernik

This comment was marked as outdated.

@jebyrnes
Copy link

What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots.

@strengejacke
Copy link
Member

What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots.

Do you mean for binomial models?

@jebyrnes
Copy link

Yes? Or any glm. Also meant quantile not quartile. Stupid autocorrect. Ha!

strengejacke added a commit that referenced this issue Feb 25, 2022
@mattansb
Copy link
Member

mattansb commented Mar 9, 2022

I also like the first one (:

I don't think that (most people) can diagnose dispersion by looking a a qq-plot, but maybe that's just me (:

@strengejacke
Copy link
Member

@bwiernik How can we create the first plot variance versus expected)? We may include this in the forthcoming update.

@bwiernik
Copy link
Contributor Author

tl;dr I think we should go with the second plot below. What do you think?

Okay, so I looked into what SAS was doing with that first plot. Super disappointing. It's actually just plotting the model-predicted mean agains the model-predicted variance (so for a Poisson model, always a straight line). So there's no actual observed data there, only predictions. I wouldn't even call it a diagnostic plot -- you need to have already fit a model with over dispersion before it will confirm that, yes, there was over dispersion that needed to handled.

So I played around and came up with two ideas

image

In the top plot, we show the standardized residuals. These should follow a normal distribution, so most points should be within ±2 and almost no points should be outside ±4. There is a pretty big number larger that +2.

The second plot takes the squared residuals and fits a geom_smooth() to them. This gives us the observed average squared residual (i.e., variance) for each Predicted value. It then shows the comparison against the model-predicted variance. In this plot, we see that the observed variance is much higher than expected at a bunch of ranges.

I think the second plot is probably better. It doesn't have individual points, but it is still based on the actual observed residuals, and I think it more clearly indicates the violation of the assumption.

What do you think?

Here is an example of the above data with a model better handling the dispersion (negative binomial, zero-inflated, with a variable dispersion model):

image

@bwiernik
Copy link
Contributor Author

@bbolker Either of the above require computation of Pearson residuals and/or model-expected conditional variance. Do you think that a function could be added for the conditional variance in glmmTMB for the various count families, including with zero-inflation and variable dispersion (e.g., for nbinom2, V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob))?

@strengejacke
Copy link
Member

2nd plot looks great, and is easy to understand even for non-experts. Can you share some code, so I can implement it in performance (and see), or do you want to do that?

@bwiernik
Copy link
Contributor Author

I'll through some code up

@bwiernik
Copy link
Contributor Author

docvisit.txt

docvisit <- readr::read_table(file.path("docvisit.txt"))

docvisit

library(glmmTMB)

mp <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = poisson()
)

mnb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = nbinom2()
)

mzip <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  ziformula = ~ age,
  data = docvisit,
  family = poisson()
)

mzinb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = nbinom2()
)

mzinbd <- glmmTMB(
  doctorco ~ sex + illness + income + hscore + age,
  ziformula = ~ sex + illness + income + hscore + age,
  dispformula = ~ sex + illness + income + hscore + age,
  data = docvisit,
  family = nbinom2()
)

ep <- modelbased::estimate_expectation(mp) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = 0,
    Prob = 0,
    V = Predicted,
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = residuals(mp, type = "pearson")
  )

enb <- modelbased::estimate_expectation(mnb) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mnb, type = "disp"),
    Prob = 0,
    V = Predicted * (1 + Predicted / Disp),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = residuals(mnb, type = "pearson")
  )

ezip <- modelbased::estimate_expectation(mzip) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = 0
    Prob = predict(mzip, type = "zprob"),
    V = Predicted * (1 - Prob) * (1 + Predicted * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

ezinb <- modelbased::estimate_expectation(mzinb) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mzinb, type = "disp"),
    Prob = predict(mzinb, type = "zprob"),
    V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

ezinbd <- modelbased::estimate_expectation(mzinbd) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mzinbd, type = "disp"),
    Prob = predict(mzinbd, type = "zprob"),
    V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

p1 <- ggplot(ezinbd) + aes(x = Predicted) + 
  geom_point(aes(y = StdRes)) + 
  geom_hline(
    yintercept = c(-2, 2, -4, 4), 
    linetype = c("solid", "solid", "dashed", "dashed"),
    color = c(rep(see::social_colors("green"), 2), rep(see::social_colors("blue"), 2))
  ) +
  labs(
    title = "Overdispersion and zero-inflation", 
    subtitle = "Most points should be within ±2 (green), few points outside ±4 (blue)",
    x = "Predicted mean",
    y = "Standardized resiuduals"
  ) +
  see::theme_modern(plot.title.space = 2)

p2 <- ggplot(ezinbd) + aes(x = Predicted) + 
  geom_smooth(aes(y = V), color = see::social_colors("blue"), se = FALSE) + 
  geom_smooth(aes(y = Res2), color = see::social_colors("green")) + 
  labs(
    title = "Overdispersion and zero-inflation", 
    subtitle = "Observed residual variance (green) should follow predicted residual variance (blue)",
    x = "Predicted mean",
    y = "Residual variance"
  ) +
  see::theme_modern(plot.title.space = 2)

p1 / p2  

strengejacke added a commit that referenced this issue Mar 16, 2022
@strengejacke

This comment was marked as outdated.

@bwiernik
Copy link
Contributor Author

Awesome. Can you link to where this is implemented?

@strengejacke

This comment was marked as outdated.

@strengejacke
Copy link
Member

Awesome. Can you link to where this is implemented?

This works for check_model(), not yet for plot() (because it requires the forthcoming update from see on CRAN). I'll commit the latest changes to working PR branches, so you can reproduce the plots above.

@strengejacke
Copy link
Member

Ok, you need the latest insight from GitHub/r-universe, performance from this PR (#404), and latest see from GitHub / r-universe. Then, the above examples work. If you install performance from master instead of PR, "only" the check_model() including overdispersion plots work.

@strengejacke

This comment was marked as outdated.

@DominiqueMakowski
Copy link
Member

This design is somewhat a bit at odds with our traditional opinionated API, rather than having different plot_types, I'd just pick one version which we think it's the best and stick with it (until the next improvement)... or, if they show different stuff, have the two overdispersion plots side by side in the corner (subplots)?

@strengejacke
Copy link
Member

Yeah, I think for now we stick to plot 1, the default. I just implemented that option, which is undocumented for now, bnut we could make it in a similar way as with normality plots, where we have 3 different types:
https://easystats.github.io/see/articles/performance.html#check-for-normal-distributed-residuals

@mattansb
Copy link
Member

This design is somewhat a bit at odds with our traditional opinionated API, rather than having different plot_types, I'd just pick one version which we think it's the best and stick with it

Agreed.

@mccarthy-m-g

This comment was marked as outdated.

@strengejacke
Copy link
Member

@mccarthy-m-g suggestion looks rather easy to implement.

@mccarthy-m-g
Copy link
Collaborator

@strengejacke What would a PR for this involve? I could get a draft started if this is the solution you want to go for.

@bwiernik
Copy link
Contributor Author

Let's call the function check_residuals()

@mccarthy-m-g You can add the function to a new check_residuals.R file here in the performance package. Take a look at one of the other check_ functions like check_normality.R for an example of the documentation syntax and structure.

Then open a PR here and we can merge it in. After that, then we can move over to the see package repo and add the plotting function there.

@mccarthy-m-g
Copy link
Collaborator

Hi all, I just opened a new issue to discuss the implementation for check_residuals() (#595). There are a few things that should be resolved before getting a PR started.

@strengejacke
Copy link
Member

This is the current development stage. We see a mismatch between the tests based on simulated residuals and generated plots for following families/models:

  • mnb / nbinom2() (plot suggests underdispersion)
  • mzinb / nbinom2() with ZI (plot suggests underdispersion)
library(performance)
library(glmmTMB)
library(readr)
docvisit <- read_table2("C:/Users/Daniel/Downloads/docvisit.txt")

mp <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = poisson()
)
out <- check_overdispersion(mp)
out
#> # Overdispersion test
#> 
#>        dispersion ratio =    1.808
#>   Pearson's Chi-Squared = 9375.539
#>                 p-value =  < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mnb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mnb)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.005
#>           p-value = 0.816
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzip <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  ziformula = ~ age,
  data = docvisit,
  family = poisson()
)
out <- check_overdispersion(mzip)
out
#> # Overdispersion test
#> 
#>  dispersion ratio =   1.417
#>           p-value = < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzinb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mzinb)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.031
#>           p-value =  0.64
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzinbd <- glmmTMB(
  doctorco ~ sex + illness + income + hscore + age,
  ziformula = ~ sex + illness + income + hscore + age,
  dispformula = ~ sex + illness + income + hscore + age,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mzinbd)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.133
#>           p-value = 0.104
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Created on 2024-03-17 with reprex v2.1.0

Looks like "nbinom2()" is currently inaccurate, the code we use is here:

.diag_overdispersion <- function(model, ...) {

(also pinging @bbolker and cross referencing to #654)

This was referenced Mar 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement 💥 Implemented features can be improved or revised
Projects
None yet
Development

No branches or pull requests

8 participants