
ICC for beta_family not accurate #742

Closed
roaldarbol opened this issue Jul 5, 2024 · 25 comments
Labels
3 investigators ❔❓ Need to look further into this issue

Comments

@roaldarbol

roaldarbol commented Jul 5, 2024

The ICC generated from icc() for a glmmTMB model with a beta_family() family now seems suspiciously low, compared both to the variance decomposition of a similar Bayesian model and to the same model specified with a gaussian() family (the Bayesian and Gaussian results are largely in agreement).

So this is kind of a continuation of easystats/insight#664, but I decided to create a new issue rather than re-open it, as there seems to be an extra problem with the bootstrapping in icc(). Maybe the new {insight} developments have not made their way into the bootstrapping, I don't know. Currently, the estimate is very low while the CI is quite high, so the estimate falls well outside the CI.

The reprex is run on the latest commit on main in {performance}.

# Versions from latest commit to {performance}
packageVersion("insight")
#> [1] '0.20.1.11'
packageVersion("performance")
#> [1] '0.12.0.4'

Here's a reprex:

library(tibble)
library(glmmTMB)
#> Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
#> glmmTMB was built with TMB version 1.9.11
#> Current TMB version is 1.9.14
#> Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.21.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#> 
#> Attaching package: 'brms'
#> The following object is masked from 'package:glmmTMB':
#> 
#>     lognormal
#> The following object is masked from 'package:stats':
#> 
#>     ar
library(performance)
library(ggplot2)
library(paletteer)
library(forcats)
df_ratio_episode <- tibble::tibble(
  animal_id = factor(
    rep(
      c(
        "208", "209", "210", "214", "223", "228", "211", "213", "217", "222", "234",
        "241", "216", "230", "231", "240", "242", "244", "218", "220", "225", "237",
        "239", "219", "251", "252", "253", "254"
      ),
      each = 2L
    ),
    levels = c(
      "200", "204", "205", "206", "215", "224", "208", "209", "210", "214", "223",
      "228", "211", "213", "217", "222", "234", "241", "216", "230", "231", "240",
      "242", "244", "218", "220", "225", "237", "239", "245", "219", "236", "251",
      "252", "253", "254"
    )
  ),
  trial = rep(c(1, 2), 28),
  activity_ratio = c(
    0.1313027016689785, 0.08387917431645128, 0.1395420340967623,
    0.09844057594710427, 0.19414443290359096, 0.16304581176275632,
    0.17274983272168504, 0.17357956037939837, 0.09729583968716982,
    0.05138063319955499, 0.14298075594540044, 0.10179701101266003,
    0.09168390375802275, 0.11591243874797318, 0.2521345405747349,
    0.16335726666875724, 0.13436311090275369, 0.12012636336085161,
    0.13868852567209072, 0.12008249718946021, 0.27708418835127824,
    0.22042035159734397, 0.2649703945513039, 0.22158610629846917,
    0.2001770607989554, 0.2238562351804714, 0.1105503693420828,
    0.08255349183783911, 0.21927303214082697, 0.22211274055043914,
    0.10446530203550744, 0.11336175801811256, 0.0826812722435201,
    0.09328851878674252, 0.13701773797551595, 0.1297098120849381,
    0.05986226055235673, 0.14423247009476106, 0.19474645802355026,
    0.1713563584485577, 0.25663498351317365, 0.30249307043720924,
    0.09082761877930186, 0.10402396536249521, 0.21941679494558652,
    0.28459112981037343, 0.11218161441362348, 0.12449715062493952,
    0.18427917423975973, 0.14845015830783756, 0.19444224064643065,
    0.13471565660441723, 0.11247341287367296, 0.08660523675310272,
    0.1763980204528711, 0.1049572229068965
  ),
) |>
  dplyr::group_by(animal_id)

# Clearly quite a lot of the variance comes from Animal ID
df_ratio_episode |> 
  ggplot(aes(x = fct_reorder(animal_id, activity_ratio, .desc = TRUE), 
             y = activity_ratio)) +
  geom_line(aes(group = animal_id), alpha = 0.5) +
  geom_point(aes(colour = factor(trial))) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_colour_paletteer_d("nationalparkcolors::Badlands") +
  labs(x = "Animal ID",
       y = "Activity Ratio") +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_blank()
  )

# Fit beta GLMM with glmmTMB
glmm_ratio_beta <- glmmTMB::glmmTMB(activity_ratio ~ trial + (1 | animal_id),
                                        data = df_ratio_episode,
                                        family = glmmTMB::beta_family)

performance::icc(glmm_ratio_beta)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.077
#>   Unadjusted ICC: 0.077
performance::icc(glmm_ratio_beta, ci = 0.95, method = "boot")
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.077 [1.000, 1.000]
#>   Unadjusted ICC: 0.077 [0.889, 1.000]

# Same but Gaussian
glmm_ratio_gauss <- glmmTMB::glmmTMB(activity_ratio ~ trial + (1 | animal_id),
                                       data = df_ratio_episode,
                                       family = gaussian)

performance::icc(glmm_ratio_gauss)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.782
#>   Unadjusted ICC: 0.772
performance::icc(glmm_ratio_gauss, ci = 0.95, method = "boot")
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.782 [0.616, 0.891]
#>   Unadjusted ICC: 0.772 [0.592, 0.886]

# Meanwhile, the Bayesian equivalent gives a high variance ratio (the ICC itself is off, but I think that is somewhat expected, right?)
bayes_ratio_episode <- brms::brm(activity_ratio ~ trial + (1 | animal_id),
                                       data = df_ratio_episode,
                                       family = brms::Beta())
#> Compiling Stan program...
#> (local C toolchain warnings and sampling progress omitted; 4 chains, 2000 iterations each, roughly 1 second per chain)
performance::variance_decomposition(bayes_ratio_episode)
#> # Random Effect Variances and ICC
#> 
#> Conditioned on: all random effects
#> 
#> ## Variance Ratio (comparable to ICC)
#> Ratio: 0.74  CI 95%: [0.44 0.87]
#> 
#> ## Variances of Posterior Predicted Distribution
#> Conditioned on fixed effects: 0.00  CI 95%: [0.00 0.00]
#> Conditioned on rand. effects: 0.00  CI 95%: [0.00 0.01]
#> 
#> ## Difference in Variances
#> Difference: 0.00  CI 95%: [0.00 0.00]
# performance::icc(bayes_ratio_episode)


Created on 2024-07-05 with reprex v2.1.0

@strengejacke strengejacke added the 3 investigators ❔❓ Need to look further into this issue label Jul 5, 2024
@strengejacke
Member

The variance components for the beta-model don't look that bad, actually.

library(glmmTMB)
library(performance)

df_ratio_episode <- data.frame(
  animal_id = factor(
    rep(
      c(
        "208", "209", "210", "214", "223", "228", "211", "213", "217", "222", "234",
        "241", "216", "230", "231", "240", "242", "244", "218", "220", "225", "237",
        "239", "219", "251", "252", "253", "254"
      ),
      each = 2L
    ),
    levels = c(
      "200", "204", "205", "206", "215", "224", "208", "209", "210", "214", "223",
      "228", "211", "213", "217", "222", "234", "241", "216", "230", "231", "240",
      "242", "244", "218", "220", "225", "237", "239", "245", "219", "236", "251",
      "252", "253", "254"
    )
  ),
  trial = rep(c(1, 2), 28),
  activity_ratio = c(
    0.1313027016689785, 0.08387917431645128, 0.1395420340967623,
    0.09844057594710427, 0.19414443290359096, 0.16304581176275632,
    0.17274983272168504, 0.17357956037939837, 0.09729583968716982,
    0.05138063319955499, 0.14298075594540044, 0.10179701101266003,
    0.09168390375802275, 0.11591243874797318, 0.2521345405747349,
    0.16335726666875724, 0.13436311090275369, 0.12012636336085161,
    0.13868852567209072, 0.12008249718946021, 0.27708418835127824,
    0.22042035159734397, 0.2649703945513039, 0.22158610629846917,
    0.2001770607989554, 0.2238562351804714, 0.1105503693420828,
    0.08255349183783911, 0.21927303214082697, 0.22211274055043914,
    0.10446530203550744, 0.11336175801811256, 0.0826812722435201,
    0.09328851878674252, 0.13701773797551595, 0.1297098120849381,
    0.05986226055235673, 0.14423247009476106, 0.19474645802355026,
    0.1713563584485577, 0.25663498351317365, 0.30249307043720924,
    0.09082761877930186, 0.10402396536249521, 0.21941679494558652,
    0.28459112981037343, 0.11218161441362348, 0.12449715062493952,
    0.18427917423975973, 0.14845015830783756, 0.19444224064643065,
    0.13471565660441723, 0.11247341287367296, 0.08660523675310272,
    0.1763980204528711, 0.1049572229068965
  )
)

glmm_ratio_beta <- glmmTMB::glmmTMB(activity_ratio ~ trial + (1 | animal_id),
  data = df_ratio_episode,
  family = glmmTMB::beta_family
)
insight::get_variance(glmm_ratio_beta)
#> $var.fixed
#> [1] 0.003054337
#> 
#> $var.random
#> [1] 0.160573
#> 
#> $var.residual
#> [1] 1.92015
#> 
#> $var.distribution
#> [1] 1.92015
#> 
#> $var.dispersion
#> [1] 0
#> 
#> $var.intercept
#> animal_id 
#>  0.160573
glmm_ratio_gauss <- glmmTMB::glmmTMB(activity_ratio ~ trial + (1 | animal_id),
  data = df_ratio_episode,
  family = gaussian
)
insight::get_variance(glmm_ratio_gauss)
#> $var.fixed
#> [1] 4.886653e-05
#> 
#> $var.random
#> [1] 0.00282474
#> 
#> $var.residual
#> [1] 0.0007858734
#> 
#> $var.distribution
#> [1] 0.0007858734
#> 
#> $var.dispersion
#> [1] 0
#> 
#> $var.intercept
#>  animal_id 
#> 0.00282474

Created on 2024-07-05 with reprex v2.1.0

@roaldarbol
Author

roaldarbol commented Jul 5, 2024

The residual variance seems suspiciously high to me though. 🤔

@strengejacke
Member

We had this discussion before, but I can't find it right now. There seems to be a mismatch between the variance function from glmmTMB::beta_family():

> family(mod1)$variance
function (mu)
{
    mu * (1 - mu)
}

and what the docs suggest:

V = μ(1 − μ) / (ϕ + 1)

(which is mu * (1 - mu) / (1 + phi) in R)

The code base in insight is inconclusive, as well:

# Get distributional variance for beta-family
# ----------------------------------------------
.variance_family_beta <- function(model, mu, phi) {
  stats::family(model)$variance(mu)
  # was:
  # mu * (1 - mu) / (1 + phi)
  # but that code is not what "glmmTMB" uses for the beta family
  # mu * (1 - mu)
}
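To make the mismatch concrete, here is a small self-contained sketch (toy simulated data, purely illustrative) contrasting what family()$variance returns with the documented conditional variance:

```r
library(glmmTMB)

# Toy beta-distributed outcome, intercept-only model (illustrative only)
set.seed(42)
d <- data.frame(y = rbeta(200, shape1 = 2, shape2 = 5))
mod <- glmmTMB(y ~ 1, family = beta_family(), data = d)

mu  <- mean(predict(mod, type = "response"))
phi <- sigma(mod)  # glmmTMB's dispersion parameter for beta_family

# What family()$variance gives: only proportional to the variance, unscaled
v_family <- family(mod)$variance(mu)      # mu * (1 - mu)
# The documented conditional variance of a beta distribution
v_cond   <- mu * (1 - mu) / (1 + phi)

c(v_family = v_family, v_cond = v_cond)
```

With a moderate phi the two differ by a factor of 1 + phi, which is exactly the discrepancy discussed above.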

Tagging @bbolker

@strengejacke
Member

strengejacke commented Jul 5, 2024

It's hard to find something to validate against. The current implementation in insight and performance yields results similar to betareg for your example in easystats/insight#664:

library(glmmTMB)
set.seed(123)
a <- seq(from = 5, to = 95)
b1 <- jitter(a, factor = 20)
b2 <- jitter(a, factor = 20)
b2 <- b2 + 30
b3 <- jitter(a, factor = 20)
b3 <- b3 + 30
t_a <- rep('a', length(a))
t_b <- rep('b', length(a))

c <- as.factor(a)
d <- data.frame(id = c(c,c,c,c),
            value = c(a,b1,b2,b3),
            treatment = c(t_a, t_a, t_b, t_b))
d$value <- d$value / (max(d$value) + 0.1)
m <- glmmTMB::glmmTMB(value ~ treatment,
                              data = d,
                              family=beta_family)
performance::r2_nakagawa(m)
#> Random effect variances not available. Returned R2 does not account for random effects.
#> # R2 for Mixed Models
#> 
#>   Conditional R2: NA
#>      Marginal R2: 0.289

m2 <- betareg::betareg(value ~ treatment, data = d)
summary(m2)$pseudo.r.squared
#> [1] 0.2434683

The next results (including id) differ; the betareg value seems too high?

m <- glmmTMB::glmmTMB(value ~ treatment + (1 | id),
                              data = d,
                              family=beta_family)
performance::r2_nakagawa(m) # 0.685
m2 <- betareg::betareg(value ~ treatment + id, data = d)
summary(m2)$pseudo.r.squared # 0.940

But... for the example in this issue, using mu * (1 - mu) / (1 + phi) seems to fit better than the current implementation in insight (which is mu * (1 - mu)).

@roaldarbol
Author

roaldarbol commented Jul 5, 2024

I actually think the betareg version is on point - it helps to plot it out, and here you can really see that the ICC should be super high for the first example!

library(glmmTMB)
#> Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
#> glmmTMB was built with TMB version 1.9.11
#> Current TMB version is 1.9.14
#> Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(performance)
library(ggplot2)
library(paletteer)
set.seed(123)
a <- seq(from = 5, to = 95)
b1 <- jitter(a, factor = 20)
b2 <- jitter(a, factor = 20)
b2 <- b2 + 30
b3 <- jitter(a, factor = 20)
b3 <- b3 + 30
t_a <- rep('a', length(a))
t_b <- rep('b', length(a))

c <- as.factor(a)
d <- data.frame(id = c(c,c,c,c),
                value = c(a,b1,b2,b3),
                treatment = c(t_a, t_a, t_b, t_b))
d$value <- d$value / (max(d$value) + 0.1)

d |> 
  ggplot(aes(x = forcats::fct_reorder(id, value, .desc = TRUE), 
             y = value)) +
  geom_line(aes(group = id), alpha = 0.5) +
  geom_point() +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_blank()
  ) +
  facet_grid(~ treatment)

Created on 2024-07-05 with reprex v2.1.0

@roaldarbol
Author

Did some more digging. :-) It seems that for the Gaussian family, the conditional R2 of a model without random effects pretty much matches the Adjusted ICC of the corresponding random-effects model. For the beta family, that's not the case.

library(glmmTMB)
#> Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
#> glmmTMB was built with TMB version 1.9.11
#> Current TMB version is 1.9.14
#> Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(performance)
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.21.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#> 
#> Attaching package: 'brms'
#> The following object is masked from 'package:glmmTMB':
#> 
#>     lognormal
#> The following object is masked from 'package:stats':
#> 
#>     ar
library(betareg)
set.seed(123)
a <- seq(from = 5, to = 95)
b1 <- jitter(a, factor = 20)
b2 <- jitter(a, factor = 20)
b2 <- b2 + 30
b3 <- jitter(a, factor = 20)
b3 <- b3 + 30
t_a <- rep('a', length(a))
t_b <- rep('b', length(a))

c <- as.factor(a)
d <- data.frame(id = c(c,c,c,c),
                value = c(a,b1,b2,b3),
                treatment = c(t_a, t_a, t_b, t_b))
d$value <- d$value / (max(d$value) + 0.1)

# GLMMs and GLMs with beta and Gaussian families
glmm_beta <- glmmTMB::glmmTMB(value ~ treatment + (1|id),
                        data = d,
                        family=beta_family)
glm_beta <- betareg(value ~ treatment + id,
                           data = d)
glmm_gauss <- glmmTMB::glmmTMB(value ~ treatment + (1|id),
                         data = d,
                         family = gaussian)
glm_gauss <- glm(value ~ treatment + id,
                            data = d,
                            family = gaussian)
icc(glmm_beta)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.619
#>   Unadjusted ICC: 0.511
r2(glmm_beta)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.685
#>      Marginal R2: 0.174
r2(glm_beta)
#> # R2 for Beta Regression
#>   Pseudo R2: 0.941
icc(glmm_gauss)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.995
#>   Unadjusted ICC: 0.749
r2(glmm_gauss)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.996
#>      Marginal R2: 0.247
r2(glm_gauss)
#>   R2: 0.997

Created on 2024-07-05 with reprex v2.1.0

A side note, which might belong in a different issue: it seems that variance_decomposition() resembles the Unadjusted ICC rather than the Adjusted ICC (compare the Gaussian examples above and below). It would be great to have both the Adjusted and Unadjusted versions.

bayes_beta <- brms::brm(value ~ treatment + (1|id),
                       data = d,
                       family=Beta)
#> Compiling Stan program...
#> I just removed all the compiling for convenience....
bayes_beta_fe <- brms::brm(value ~ treatment + id,
                           data = d,
                           family = Beta)
#> Compiling Stan program...
#> I just removed all the compiling for convenience....
variance_decomposition(bayes_beta)
#> # Random Effect Variances and ICC
#> 
#> Conditioned on: all random effects
#> 
#> ## Variance Ratio (comparable to ICC)
#> Ratio: 0.61  CI 95%: [0.58 0.65]
#> 
#> ## Variances of Posterior Predicted Distribution
#> Conditioned on fixed effects: 0.02  CI 95%: [0.02 0.03]
#> Conditioned on rand. effects: 0.06  CI 95%: [0.06 0.06]
#> 
#> ## Difference in Variances
#> Difference: 0.04  CI 95%: [0.03 0.04]
r2(bayes_beta)
#> # Bayesian R2 with Compatibility Interval
#> 
#>   Conditional R2: 0.979 (95% CI [0.976, 0.981])
#>      Marginal R2: 0.326 (95% CI [0.311, 0.341])
r2(bayes_beta_fe)
#> # Bayesian R2 with Compatibility Interval
#> 
#>   Conditional R2: 0.979 (95% CI [0.976, 0.981])
variance_decomposition(bayes_gauss)
#> # Random Effect Variances and ICC
#> 
#> Conditioned on: all random effects
#> 
#> ## Variance Ratio (comparable to ICC)
#> Ratio: 0.75  CI 95%: [0.74 0.76]
#> 
#> ## Variances of Posterior Predicted Distribution
#> Conditioned on fixed effects: 0.01  CI 95%: [0.01 0.01]
#> Conditioned on rand. effects: 0.06  CI 95%: [0.06 0.06]
#> 
#> ## Difference in Variances
#> Difference: 0.04  CI 95%: [0.04 0.04]
r2(bayes_gauss)
#> # Bayesian R2 with Compatibility Interval
#> 
#>   Conditional R2: 0.996 (95% CI [0.996, 0.996])
#>      Marginal R2: 0.247 (95% CI [0.242, 0.252])
r2(bayes_gauss_fe)
#> # Bayesian R2 with Compatibility Interval
#> 
#>   Conditional R2: 0.996 (95% CI [0.996, 0.996])

@strengejacke
Member

betareg computes the R2 as the squared correlation between eta and linkfun(y), i.e. cor(eta, linkfun(y))^2. I'm not sure this always translates perfectly to the mixed-models scenario.
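For reference, that definition can be reproduced by hand; a sketch using betareg's bundled GasolineYield example data (the model here is illustrative, not one from this thread):

```r
library(betareg)

data("GasolineYield", package = "betareg")
m <- betareg(yield ~ batch + temp, data = GasolineYield)

eta <- predict(m, type = "link")  # linear predictor for the mean
# logit is betareg's default mean link, so linkfun(y) is qlogis(y)
r2_manual <- cor(eta, qlogis(GasolineYield$yield))^2

all.equal(r2_manual, summary(m)$pseudo.r.squared)
```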

@bbolker

bbolker commented Jul 5, 2024

This is definitely a definition of R^2, closest to Efron's pseudo-R^2 (although Efron computes the correlation on the response scale before squaring it ...). It would be interesting to see where betareg got that variant from; I haven't found it in any of my pseudo-R^2 references (Wikipedia, UCLA, my class notes, although I'm not sure what source I used for those; it doesn't seem to be in the textbook I used).

@strengejacke
Member

A quick search in the related publication (https://www.jstatsoft.org/article/view/v034i02) finds only one match for "R2". Maybe it is described in more detail elsewhere.

May I also point to this comment: #742 (comment)

What's your opinion on this?

@strengejacke
Member

strengejacke commented Jul 6, 2024

Here: https://stats.stackexchange.com/questions/233169/pseudo-r-squared-and-multicollinearity-checks-with-beta-regression

But what would be "the linear predictor for the mean"? Maybe that would help improve insight::get_variance().

Edit: found this
[screenshot of the pseudo-R² formula omitted]

@roaldarbol
Author

@strengejacke For the uninitiated, could you write a line or two about why we talk almost exclusively about R2 when the issue is about ICC? I imagine there's an implicit mathematical connection, but it's not entirely clear. :-) Also, feel free to change the issue title to reflect that it affects both ICC and R2 for the beta family, if they're two parts of the same puzzle. 😊

@bbolker

bbolker commented Jul 8, 2024

A bit out of order in the thread, but responding to the comment above about the mismatch between the variance function (mu * (1 - mu)) and the conditional variance of the distribution (mu * (1 - mu) / (1 + phi)).

?glmmTMB::beta_family says:

variance: a function of either 1 (mean) or 2 (mean and dispersion parameter) arguments giving a value proportional to the predicted variance (scaled by ‘sigma(.)’)

This is for consistency with the way that $variance is implemented/interpreted in stats::glm(). I think I've raised this point before ...
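For comparison, base R behaves the same way: family $variance functions return a value only proportional to the conditional variance, with the dispersion applied separately (a quick illustration):

```r
# Base-R family objects: $variance is the variance function V(mu),
# not the full conditional variance phi * V(mu)
gaussian()$variance(0.3)  # 1    (the true variance is sigma^2 * 1)
poisson()$variance(0.3)   # 0.3  (dispersion fixed at 1)
binomial()$variance(0.3)  # 0.3 * 0.7 = 0.21
```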

@strengejacke
Member

Thanks for clarification!

I think I've raised this point before ...

Yes, definitely, it's just I couldn't find the thread/post. :-/

@strengejacke
Member

strengejacke commented Jul 8, 2024

@strengejacke For the uninitiated, could you just write a line or two about why we talk about R2 almost exclusively when the issue is on ICC - I imagine there's a mathematical connection is implicit, but it's not entirely clear. :-) Also, feel free to change the issue title to reflect that it's an issue that affects both ICC and R2 for the beta family if they're two parts of the same puzzle. 😊

Sure. The ICC is relevant for mixed models only. It is calculated by dividing the random effect variance, σ²_i, by the total variance, i.e. the sum of the random effect variance and the residual variance, σ²_ε.

The R2 values for mixed models are calculated this way:

  • marginal R2: the fixed effects variance divided by the sum of the fixed, random, and residual variances.
  • conditional R2: the sum of the fixed and random effects variances divided by the sum of the fixed, random, and residual variances.

The main point is that both the ICC and the R2 for mixed models need the same variance components. That's why the two are related.
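As a sketch of that connection (toy simulated data, illustrative only), both quantities can be assembled from the same components returned by insight::get_variance():

```r
library(glmmTMB)
library(insight)

# Toy grouped data with a random intercept per group g
set.seed(1)
d <- data.frame(g = factor(rep(1:20, each = 5)), x = rnorm(100))
d$y <- 0.5 * d$x + rnorm(20)[d$g] + rnorm(100)

m <- glmmTMB(y ~ x + (1 | g), data = d)
v <- get_variance(m)

# Adjusted ICC: random / (random + residual)
icc_adj <- v$var.random / (v$var.random + v$var.residual)
# Marginal R2: fixed / (fixed + random + residual)
r2_marg <- v$var.fixed / (v$var.fixed + v$var.random + v$var.residual)
# Conditional R2: (fixed + random) / (fixed + random + residual)
r2_cond <- (v$var.fixed + v$var.random) /
  (v$var.fixed + v$var.random + v$var.residual)
```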

@roaldarbol
Author

roaldarbol commented Jul 8, 2024

Thanks for the clarification, that helps.

Just tested the latest changes. Ferrari's R2 looks more in line with what I'd expect. Will this eventually change the ICC too? A few notes on consistency as of the latest commit:

  • Currently, the default R2 for a glmmTMB and a glm with beta_family is not Ferrari's (I don't know what the default is for the glmmTMB model; for glm it's Nagelkerke's R2, which gives NaN)
  • The Ferrari R2 is slightly different for a glmmTMB and beta_family compared to betareg
  • The bootstrapped ICC and R2 for glmmTMB with beta_family are wrong
  • Bootstrapping for the betareg R2 does nothing
  • Bootstrapping for a glm with beta_family fails
library(glmmTMB)
#> Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
#> glmmTMB was built with TMB version 1.9.11
#> Current TMB version is 1.9.14
#> Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(performance)
library(betareg)
set.seed(123)
a <- seq(from = 5, to = 95)
b1 <- jitter(a, factor = 20)
b2 <- jitter(a, factor = 20)
b2 <- b2 + 30
b3 <- jitter(a, factor = 20)
b3 <- b3 + 30
t_a <- rep('a', length(a))
t_b <- rep('b', length(a))

c <- as.factor(a)
d <- data.frame(id = c(c,c,c,c),
                value = c(a,b1,b2,b3),
                treatment = c(t_a, t_a, t_b, t_b))
d$value <- d$value / (max(d$value) + 0.1)

# Specify models
glmm_beta <- glmmTMB::glmmTMB(value ~ treatment + (1|id),
                              data = d,
                              family=beta_family)
glm_betareg <- betareg(value ~ treatment + id,
                    data = d)
glm_betafamily <- glm(value ~ treatment + id,
                 data = d,
                 family = beta_family)
glmm_gauss <- glmmTMB::glmmTMB(value ~ treatment + (1|id),
                               data = d,
                               family = gaussian)
glm_gauss <- glm(value ~ treatment + id,
                 data = d,
                 family = gaussian)
# ICC
icc(glmm_beta)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.619
#>   Unadjusted ICC: 0.511
icc(glmm_gauss)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.995
#>   Unadjusted ICC: 0.749

# ICC Bootstrapped
icc(glmm_beta, method = "boot", ci = 0.95)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.619 [1.000, 1.000]
#>   Unadjusted ICC: 0.511 [0.660, 0.785]
icc(glmm_gauss, method = "boot", ci = 0.95)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.995 [0.993, 0.996]
#>   Unadjusted ICC: 0.749 [0.693, 0.795]

# R2 - default
r2(glmm_beta)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.685
#>      Marginal R2: 0.174
r2(glm_betareg)
#> # R2 for Beta Regression
#>   Pseudo R2: 0.941
r2(glm_betafamily)
#> # R2 for Generalized Linear Regression
#>   Nagelkerke's R2: NaN
r2(glmm_gauss)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.996
#>      Marginal R2: 0.247
r2(glm_gauss)
#>   R2: 0.997

# R2 - Ferrari
r2_ferrari(glmm_beta)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.941
r2_ferrari(glm_betareg)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.941
r2_ferrari(glm_betafamily)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.929

# R2 Bootstrapped
r2(glmm_beta, method = "boot", ci = 0.95)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.685 [1.000, 1.000]
#>      Marginal R2: 0.174 [0.214, 0.326]
r2(glm_betareg, method = "boot", ci = 0.95)
#> # R2 for Beta Regression
#>   Pseudo R2: 0.941
r2(glm_betafamily, method = "boot", ci = 0.95)
#> Error in stats::integrate(.dRsq, i, j, R2_pop = dots$R2_pop, R2_obs = dots$R2_obs, : non-finite function value
r2(glmm_gauss, method = "boot", ci = 0.95)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.996 [0.995, 0.997]
#>      Marginal R2: 0.247 [0.198, 0.302]
r2(glm_gauss, method = "boot", ci = 0.95)
#>   R2: 0.997 [0.995, 0.997]

Created on 2024-07-08 with reprex v2.1.0
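For context, the adjusted and unadjusted ICC differ only in whether the fixed-effects variance enters the denominator. A minimal base-R sketch with hypothetical variance components (the `var.fixed`/`var.random`/`var.residual` names mirror what `insight::get_variance()` returns; the numbers are made up for illustration):

```r
# Hypothetical variance components; names mirror insight::get_variance()
v <- list(var.fixed = 0.20, var.random = 0.45, var.residual = 0.28)

# Adjusted ICC: random-effect variance relative to random + residual variance
icc_adjusted <- v$var.random / (v$var.random + v$var.residual)

# Unadjusted ICC: the fixed-effects variance also enters the denominator,
# so it is always <= the adjusted ICC
icc_unadjusted <- v$var.random / (v$var.fixed + v$var.random + v$var.residual)

c(adjusted = icc_adjusted, unadjusted = icc_unadjusted)
```

For a beta-family model, the residual term is the family-specific distributional variance, which is exactly what the discussion below is about.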

@strengejacke (Member) commented Jul 8, 2024

and the conditional variance of the distribution (mu * (1 - mu) / (1 + phi))

@bbolker: That means that in insight::get_variance(), the conditional variance of the distribution, mu * (1 - mu) / (1 + phi), should be used?

@bbolker commented Jul 8, 2024

Yes, the conditional variance should definitely be divided by (1 + phi).

library(glmmTMB)
set.seed(101)
dd <- data.frame(x=rnorm(100))
## (I was originally planning to use an x covariate, now ignored)
dd$y <- simulate_new(~ 1, family = beta_family, newdata = dd,
           newparams = list(beta = c(1), betadisp = 2))[[1]]
m <- glmmTMB(y~1, family = beta_family, data = dd)
mu <- predict(m, type = "response")
var1 <- family(m)$var(mu)/(1+sigma(m)) ## 0.0187
var(dd$y) ## 0.0193
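The same relation can be checked without glmmTMB, directly against draws from `rbeta()` in the mean/precision parameterization (shape1 = mu * phi, shape2 = (1 - mu) * phi); the values of `mu` and `phi` below are arbitrary choices for illustration:

```r
set.seed(42)
mu  <- 0.7   # mean of the beta distribution
phi <- 8     # precision (what sigma() returns for a glmmTMB beta_family fit)

y <- rbeta(1e6, shape1 = mu * phi, shape2 = (1 - mu) * phi)

var(y)                     # empirical variance of the draws
mu * (1 - mu) / (1 + phi)  # analytical value: 0.21 / 9 = 0.02333...
```

The empirical and analytical values agree closely, confirming that the conditional variance of the beta family is mu * (1 - mu) / (1 + phi), not mu * (1 - mu).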

@strengejacke (Member)

What about the ordered beta family? Is the conditional variance also mu * (1 - mu), or is it likewise divided by (1 + phi)?

strengejacke added a commit to easystats/insight that referenced this issue Jul 8, 2024
strengejacke added a commit to easystats/insight that referenced this issue Jul 8, 2024
* Fix conditional distribution of Beta family
easystats/performance#742

* remove test
strengejacke added a commit that referenced this issue Jul 8, 2024
@strengejacke (Member) commented Jul 8, 2024

Currently, the default R2 for glmmTMB and glm with beta_family is not Ferrari's (I don't know what the default is for the GLMM; for glm it's Nagelkerke's R2, which gives NaN)

It's now implemented for glm (see easystats/insight@887bb59). For glmmTMB, r2_ferrari() is only used for non-mixed models; mixed models always default to r2_nakagawa() when r2() is called.
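For reference, Ferrari & Cribari-Neto's pseudo-R2 is simply the squared sample correlation between the link-transformed response and the linear predictor. A self-contained sketch on toy data (the data and the `lm()` standing in for the beta-regression mean model are assumptions for illustration, not what r2_ferrari() literally calls internally):

```r
set.seed(7)
n <- 150
x <- runif(n)
y <- plogis(0.5 + 1.5 * x + rnorm(n, sd = 0.4))  # response in (0, 1)

# Linear predictor on the logit scale; a plain lm() on the logit-transformed
# response stands in for the beta-regression mean model here
eta_hat <- fitted(lm(qlogis(y) ~ x))

# Ferrari & Cribari-Neto (2004): squared correlation between g(y) and eta-hat
r2_hand <- cor(qlogis(y), eta_hat)^2
r2_hand
```

Because the statistic depends only on the fitted linear predictor and the link, identical mean models should give identical values, which is why betareg and a non-mixed glmmTMB fit agree.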

The Ferrari R2 is slightly different for a glmmTMB and beta_family compared to betareg

Only if you fit a mixed model for glmmTMB. For non-mixed models, results are identical.

Bootstrapped ICC and R2 for glmmTMB with beta_family are wrong

Yeah, not sure about this one. Will look into it.

Bootstrapping for betareg R2 does nothing. Bootstrapping for glm with beta_family fails

Bootstrapping is not implemented for R2 for non-mixed models.
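As a workaround, a percentile bootstrap can be rolled by hand. The sketch below bootstraps a pseudo-R2 (squared correlation of the logit-transformed response with the fitted values of a plain `lm()`, standing in for the betareg fit to keep the example base-R only); the toy data and `n_boot` are arbitrary assumptions:

```r
set.seed(1)
d2 <- data.frame(x = runif(200))
d2$y <- plogis(1 + 2 * d2$x + rnorm(200, sd = 0.5))

# Pseudo-R2: squared correlation of g(y) with the fitted linear predictor
r2_stat <- function(data) {
  fit <- lm(qlogis(y) ~ x, data = data)
  cor(qlogis(data$y), fitted(fit))^2
}

n_boot <- 500  # number of resamples (an arbitrary choice)
boots <- replicate(
  n_boot,
  r2_stat(d2[sample(nrow(d2), replace = TRUE), ])
)
quantile(boots, c(0.025, 0.975))  # 95% percentile CI
```

Replacing `lm()` with the actual `betareg()` call (and `fit$pseudo.r.squared` as the statistic) would give a bootstrap CI for the betareg pseudo-R2 itself.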

You should be able to update insight, which now finally is supposed to return the correct ICC/R2 for mixed models from the beta-family.

@roaldarbol (Author)

The Ferrari R2 is slightly different for a glmmTMB and beta_family compared to betareg

Only if you fit a mixed model for glmmTMB. For non-mixed models, results are identical.

Actually, it's the other way around: the mixed model matches the betareg model; the GLM is different. See above (quoted below).

# R2 - Ferrari
r2_ferrari(glmm_beta)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.941
r2_ferrari(glm_betareg)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.941
r2_ferrari(glm_betafamily)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.929

You should be able to update insight, which now finally is supposed to return the correct ICC/R2 for mixed models from the beta-family.

Works really well now!!! 🥳🥳🥳

@strengejacke (Member)

library(glmmTMB)
set.seed(101)
dd <- data.frame(x=rnorm(100))
## (I was originally planning to use an x covariate, now ignored)
dd$y <- simulate_new(~ 1, family = beta_family, newdata = dd,
           newparams = list(beta = c(1), betadisp = 2))[[1]]
m <- glmmTMB(y~1, family = beta_family, data = dd)
mu <- predict(m, type = "response")
var1 <- family(m)$var(mu)/(1+sigma(m)) ## 0.0187
var(dd$y) ## 0.0193

Two questions: does this work for you? I get:

library(glmmTMB)
set.seed(101)
dd <- data.frame(x=rnorm(100))
## (I was originally planning to use an x covariate, now ignored)
dd$y <- simulate_new(~ 1, family = beta_family, newdata = dd,
           newparams = list(beta = c(1), betadisp = 2))[[1]]
#> Error in eval(family$initialize): y values must be 0 < y < 1

Second is #742 (comment).

@bbolker commented Jul 8, 2024

  1. I think this should work with the devel version; the simulation bug was fixed in "Simulate new init fix" (glmmTMB/glmmTMB#1008).
  2. Hmm. ordbeta uses the same $variance and definition of sigma() as beta (just added a note to ?sigma.glmmTMB): see https://github.com/glmmTMB/glmmTMB/blob/master/glmmTMB/src/glmmTMB.cpp#L733-L737. But I think the actual conditional variances will not be the same, because the parameterization is relative to the underlying/baseline beta distribution, not counting the modifications created by moving to the ordered-beta specification. You might be able to work this out by hand (by analogy with the variances of ZI models, hurdle models, etc.), or maybe look in Kubinec's paper / ask him if he knows the answer?

@roaldarbol (Author) commented Jul 9, 2024

For the confidence intervals, it seems that none of the CI methods work (neither the default, "boot", nor "analytical"):

icc(glmm_beta, ci = 0.95)
icc(glmm_beta, ci = 0.95, method = "boot")
icc(glmm_beta, ci = 0.95, method = "analytical")

(Installed both {performance} and {insight} from the latest commits, 06ed6f1 and easystats/insight@d346f95)

@strengejacke (Member)

  1. I think this should work with the devel version; the simulation bug was fixed in "Simulate new init fix" (glmmTMB/glmmTMB#1008).
  2. Hmm. ordbeta uses the same $variance and definition of sigma() as beta (just added a note to ?sigma.glmmTMB): see https://github.com/glmmTMB/glmmTMB/blob/master/glmmTMB/src/glmmTMB.cpp#L733-L737. But I think the actual conditional variances will not be the same, because the parameterization is relative to the underlying/baseline beta distribution, not counting the modifications created by moving to the ordered-beta specification. You might be able to work this out by hand (by analogy with the variances of ZI models, hurdle models, etc.), or maybe look in Kubinec's paper / ask him if he knows the answer?

Thanks! Will open a new issue and close this as resolved for now.

@roaldarbol (Author)

In case anyone tries icc() for the beta family and thinks the results look weird: I think so too. They are quite high (around 0.4–0.5) even when you draw two completely unrelated samples from a beta distribution. I would love to test it thoroughly, but I don't have the time currently. I posted an example here that could be used to dig in.
