`check_model`: missing outlier plot when scaling data #432

rempsyc · 2022-06-11T19:26:50Z

Summary: check_model fails to plot the outlier panel when scaling data because the scaled variables become incompatible matrix arrays.

Reprex:
The following works:

library(performance)
m <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars)
check_model(m)

Looks good. Let's scale the data:

library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
check_model(m2)

The outlier panel is missing. The reason is that the outlier check is failing silently.

check_model(m2, check = "outliers")
#> Error in unit(rep(0, TABLE_ROWS * dims[1]), "null"): 'x' and 'units' must have length > 0

The reason is that scaling changes the object class from numeric vector to matrix array.

class(mtcars2$mpg)
#> [1] "matrix" "array"
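To see what changed, it can help to inspect the object that scale() returns. The following is a minimal base R sketch; the names `x_scaled` and `x_plain` are only illustrative.

```r
# scale() returns a one-column matrix, even for a plain numeric vector
x_scaled <- scale(mtcars$mpg)

is.matrix(x_scaled)          # TRUE
dim(x_scaled)                # 32 rows, 1 column
names(attributes(x_scaled))  # includes "dim" plus the centering/scaling attributes

# as.numeric() strips the attributes, leaving a plain numeric vector
x_plain <- as.numeric(x_scaled)
is.matrix(x_plain)           # FALSE
```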

The solution is to convert the scaled columns back to plain vectors, with either as.numeric or as.vector:

mtcars3 <- mtcars %>%
  mutate(across(everything(), ~scale(.x) %>% as.numeric))

m3 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars3)
check_model(m3)

mtcars4 <- mtcars %>%
  mutate(across(everything(), ~scale(.x) %>% as.vector))

m4 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars4)
check_model(m4)

Note that scaling through lapply instead of dplyr::mutate works:

mtcars5 <- lapply(mtcars, scale) |> as.data.frame()

m5 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars5)
check_model(m5)
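A likely reason this variant works is that as.data.frame() turns matrix inputs into ordinary columns, so the one-column matrices returned by scale() end up as plain numeric vectors. A small check, as a sketch using base R only (`scaled_list` is an illustrative name):

```r
# Each list element is still a one-column matrix at this point
scaled_list <- lapply(mtcars, scale)
class(scaled_list$mpg)

# as.data.frame() converts the matrices into ordinary numeric columns
mtcars5 <- as.data.frame(scaled_list)
class(mtcars5$mpg)
is.matrix(mtcars5$mpg)   # FALSE
```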

The issue also emerges if only a single variable is scaled, which suggests that it actually stems from the base R scale function rather than from dplyr.

mtcars6 <- mtcars
mtcars6$wt <- scale(mtcars$wt)

m6 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars6)
check_model(m6)

Created on 2022-06-11 by the reprex package (v2.0.1)

This is confusing many students in an introductory R stats class here, because they are taught to scale their variables at the beginning of their script, and check_model then silently drops the outlier panel. It would be nice if check_model could automatically convert matrix arrays to numeric vectors where applicable.
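As a rough illustration of the conversion requested here, the sketch below flattens one-column matrix columns of a data frame back into vectors. `flatten_matrix_columns()` is a hypothetical helper written for this example; it is not a function from performance or insight, and it makes no claim about how those packages handle this internally.

```r
# Hypothetical helper (not part of performance or insight):
# convert one-column matrix columns of a data frame into plain vectors
flatten_matrix_columns <- function(data) {
  is_1col_matrix <- vapply(
    data,
    function(col) is.matrix(col) && ncol(col) == 1L,
    logical(1)
  )
  if (any(is_1col_matrix)) {
    # as.vector() drops the dim and scaling attributes
    data[is_1col_matrix] <- lapply(data[is_1col_matrix], as.vector)
  }
  data
}

mtcars6 <- mtcars
mtcars6$wt <- scale(mtcars$wt)
is.matrix(mtcars6$wt)        # TRUE

mtcars6_flat <- flatten_matrix_columns(mtcars6)
is.matrix(mtcars6_flat$wt)   # FALSE
```

Data frames without matrix columns pass through unchanged, so a helper like this could in principle be applied unconditionally before fitting.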


strengejacke · 2022-06-11T20:18:47Z

As a quick workaround, I would always recommend using a standardization function that preserves the vector class, e.g. datawizard::standardize().
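To illustrate what "preserves the vector class" means, here is a minimal base R sketch of z-standardization that never creates a matrix. `standardize_vec()` is a hypothetical helper for this example only; it is not the implementation of datawizard::standardize().

```r
# Hypothetical helper: center and scale, returning a plain numeric vector
standardize_vec <- function(x) {
  (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
}

z <- standardize_vec(mtcars$mpg)
is.matrix(z)                                  # FALSE
all.equal(z, as.numeric(scale(mtcars$mpg)))   # TRUE: same values as scale()
```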

I'll look into this. I'm not sure where exactly this fails, because check_outliers() seems to work.

strengejacke · 2022-06-12T18:36:59Z

The error comes from insight::get_predicted(). For now, I added a warning. Not quite sure how to best fix this issue.

library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
insight::get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit

Created on 2022-06-12 by the reprex package (v2.0.1)

bwiernik · 2022-06-12T19:09:35Z

The issue is scale()'s terrible behavior of always returning a matrix. Users should just never use scale().

We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a $?
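Such a check could look roughly like the following sketch, which inspects the model frame for matrix columns and warns about them. `warn_matrix_columns()` is a hypothetical name used for illustration, not an existing function in performance or insight. The sketch flags every matrix column, so it would also flag a matrix response built with cbind().

```r
# Hypothetical check (not an existing performance/insight function)
warn_matrix_columns <- function(model) {
  mf <- stats::model.frame(model)
  is_mat <- vapply(mf, is.matrix, logical(1))
  if (any(is_mat)) {
    warning(
      "Variables stored as matrices: ",
      paste(names(mf)[is_mat], collapse = ", "),
      ". Did you use scale() on your data?",
      call. = FALSE
    )
  }
  # Return the flagged names invisibly so callers can act on them
  invisible(names(mf)[is_mat])
}

mtcars6 <- mtcars
mtcars6$wt <- scale(mtcars$wt)
m6 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars6)

# Warns about 'wt' and returns the flagged names
flagged <- warn_matrix_columns(m6)
```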

rempsyc · 2022-06-12T19:27:19Z

Thank you, I like the warning, and also bwiernik's suggestion to throw an error. Out of curiosity, would there be any downside to automatically checking whether any variable is a matrix and, if so, converting it to a vector, with a similar warning about the conversion? It seems that matrix columns haven't been a problem for any of the other panels in check_model.

strengejacke · 2022-06-12T21:09:25Z

> We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a $?

The problem is that if get_predicted() is called without a data argument, get_data() is called, which coerces matrix columns into vectors. scale() causes no problem when it is called on the fly in the formula. If it is called before fitting the model, the variable names in the data are the same as the original variable names, but the variables are 1D matrices. get_data() returns a data frame in which the variable names are also the same as in the original data, but the data types are coerced to numeric. But predict() expects the same type, probably because the names are identical?
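The mismatch described above can be reproduced without insight. In base R, the terms of a fitted lm carry a dataClasses attribute recording the type of each variable at fit time, and predict() compares newdata against it. The following sketch scales a single predictor before fitting and then passes a flattened copy as newdata, mimicking the coercion described for get_data(); the object names are illustrative.

```r
mtcars6 <- mtcars
mtcars6$wt <- scale(mtcars$wt)
m6 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars6)

# Types recorded at fit time; a one-column numeric matrix is stored as "nmatrix.1"
attr(terms(m6), "dataClasses")

# Same column names, but wt is now a plain numeric vector
flattened <- mtcars6
flattened$wt <- as.vector(flattened$wt)

# predict() rejects the type mismatch; the error is captured instead of raised
res <- tryCatch(predict(m6, newdata = flattened), error = function(e) e)
inherits(res, "error")   # TRUE

# With the original matrix column, predict() works
head(predict(m6, newdata = mtcars6))
```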

At this point, it's difficult to check the original input type. I tried to read the dataClasses attribute of terms, but not all model types have a terms() method: easystats/insight@216d735

See the example below, which makes what I described above a bit clearer.

library(insight)
library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m1 <- lm(scale(mpg) ~ scale(wt) + scale(cyl) + scale(gear) + scale(disp), data = mtcars)

# model frame contains scaled variables, including column names with "scale()"
model.frame(m1) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ scale(mpg) : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#>   ..- attr(*, "scaled:center")= num 20.1
#>   ..- attr(*, "scaled:scale")= num 6.03
#>  $ scale(wt)  : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>   ..- attr(*, "scaled:center")= num 3.22
#>   ..- attr(*, "scaled:scale")= num 0.978
#>  $ scale(cyl) : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#>   ..- attr(*, "scaled:center")= num 6.19
#>   ..- attr(*, "scaled:scale")= num 1.79
#>  $ scale(gear): num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#>   ..- attr(*, "scaled:center")= num 3.69
#>   ..- attr(*, "scaled:scale")= num 0.738
#>  $ scale(disp): num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#>   ..- attr(*, "scaled:center")= num 231
#>   ..- attr(*, "scaled:scale")= num 124
#> ...

# get_data returns original data
get_data(m1) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ disp: num  160 160 108 258 360 ...
#> ...

# get_predicted and predict work
get_predicted(m1)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Predicted values:
#> 
#>  [1]  0.32445260  0.16397650  1.04543816  0.14430212 -0.47187319 -0.04790433
#>  [7] -0.55368454  0.54252269  0.56089726 -0.18283127 -0.18283127 -0.96536116
#> [13] -0.75139302 -0.78285892 -1.48188932 -1.60521739 -1.57854583  1.08719605
#> [19]  1.45189042  1.30814020  1.04950441 -0.57061221 -0.53325137 -0.73512269
#> [25] -0.68065788  1.25431100  1.09151240  1.45705867 -0.47507819  0.13139606
#> [31] -0.78441681  0.77093082
#> 
#> NOTE: Confidence intervals, if available, are stored as attributes and can be accessed using `as.data.frame()` on this output.
predict(m1, newdata = get_data(m1))
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>          0.32445260          0.16397650          1.04543816          0.14430212 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>         -0.47187319         -0.04790433         -0.55368454          0.54252269 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>          0.56089726         -0.18283127         -0.18283127         -0.96536116 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>         -0.75139302         -0.78285892         -1.48188932         -1.60521739 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>         -1.57854583          1.08719605          1.45189042          1.30814020 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>          1.04950441         -0.57061221         -0.53325137         -0.73512269 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>         -0.68065788          1.25431100          1.09151240          1.45705867 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>         -0.47507819          0.13139606         -0.78441681          0.77093082


m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)

#  model frame contains scaled variables, with variable names of original data
model.frame(m2) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#>   ..- attr(*, "scaled:center")= num 20.1
#>   ..- attr(*, "scaled:scale")= num 6.03
#>  $ wt  : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>   ..- attr(*, "scaled:center")= num 3.22
#>   ..- attr(*, "scaled:scale")= num 0.978
#>  $ cyl : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#>   ..- attr(*, "scaled:center")= num 6.19
#>   ..- attr(*, "scaled:scale")= num 1.79
#>  $ gear: num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#>   ..- attr(*, "scaled:center")= num 3.69
#>   ..- attr(*, "scaled:scale")= num 0.738
#>  $ disp: num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#>   ..- attr(*, "scaled:center")= num 231
#>   ..- attr(*, "scaled:scale")= num 124
#> ...

# get_data returns data that was used to fit model (i.e. scaled variables),
# but coerces 1D-matrix to numeric vector
get_data(m2) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num  0.151 0.151 0.45 0.217 -0.231 ...
#>  $ wt  : num  -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>  $ cyl : num  -0.105 -0.105 -1.225 -0.105 1.015 ...
#>  $ gear: num  0.424 0.424 0.424 -0.932 -0.932 ...
#>  $ disp: num  -0.571 -0.571 -0.99 0.22 1.043 ...
#> ...

# fails
get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
predict(m2, newdata = get_data(m2))
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit

Created on 2022-06-12 by the reprex package (v2.0.1)

…ta) (#671)

* fixes easystats/performance#432 (missing outlier plot when scaling data)
* Adds import stats::setNames
* Update R/get_predicted.R
  Co-authored-by: Indrajeet Patil <patilindrajeet.science@gmail.com>
* Added tests, updated NEWS, DESCRIPTION
  Co-authored-by: Indrajeet Patil <patilindrajeet.science@gmail.com>

@grantmcdermott

# insight 0.19.2

## Breaking changes

* The minimum needed R version has been bumped to `3.6`.
* `download_model()` no longer errors when a model object could not be downloaded, but instead returns `NULL`. This prevents test failures, and allows skipping tests when the return value of `download_model()` is `NULL`.

## General

* Improved support for `mclogit` models (package *mclogit*) and `mipo` objects (package *mice*) for models with ordinal or categorical response.

## New supported models

* `phylolm` and `phyloglm` (package *phylolm*), `nestedLogit` (package *nestedLogit*).

## Bug fixes

* Fixed issue in `get_variance()` for *glmmTMB* models with rank deficient coefficients.
* Fixed issues in `get_weights()` for `glm` models without weights and `na.action` not set to default in the model call.
* `clean_names()` now also removes the `relevel()` pattern.
* Fixed issue in `model_info()` for models of class `gamlss`.
* Fixed problems preventing `get_data()` from locating data defined in non-global environments.
* Fixed issue in `get_predicted()` for variables of class numeric matrix created by `scale()`, which were correctly handled only when `get_data()` failed to find the data in the appropriate environment.
* Fixed issue in `model_info()` for `gee` models from `binomial` families.

# insight 0.19.1

## New supported models

* `hglm` (package *hglm*).

## Changes to functions

* Minor improvements to `get_data()` for `t.test()`.
* `format_value()` gets a `lead_zero` argument, to keep or drop the leading zero of a formatted value, as well as arguments `style_positive` and `style_negative` to style positive or negative numbers.
* `format_table()` now also formats columns named `SGPV` (second generation p-values) as p-values.
* Functions for models of class `clm` (like `find_formula()`, `find_variables()`, `get_data()` etc.) now also include variables that were defined as `scale` or `nominal` component.

## Bug fixes

* Fixed issue in `get_data()` for results from `kruskal.test()`.
* Fixed issue in `find_weights()` for models of class `lme` and `gls`.
* Fixed issue in `get_datagrid()` for models with multiple weight variables.

# insight 0.19.0

## New supported models

* `mmrm` (package *mmrm*), `flac` and `flic` (*logistf*)

## Breaking changes

* `get_data()` was revised and now always tries to recover the data that was used to fit a model from the environment. If this fails, it falls back to recovering data from the model frame (the former default behaviour). Furthermore, the `source` argument can be used to explicitly force the old behaviour: `source = "mf"` will try to recover data from the model frame first, then possibly falling back to look in the environment.

## New functions

* `n_grouplevels()`, to return random effect groups and number of group levels for mixed models.

## Changes to functions

* `get_datagrid()` preserves all factor levels for factors that are held constant at their reference level. This is required to work together with `get_modelmatrix()` when calculating standard errors for `get_predicted()`.

## Bug fixes

* Fixed bug in `get_modelmatrix()` handling of incomplete factors which sometimes had downstream implications for numerical results in the uncertainty estimates produced by `get_predicted()`.
* Fixed minor issues for HTML tables in `export_table()` when model parameters were grouped.
* Fixed issue with incorrect back-transforming in `get_data()` for models with log-transformed variables.
* Fixes issue in `compact_list()`.
* `has_single_value()` now returns `FALSE` when the object only has `NA` and `na.rm = TRUE`.
* Fixed issue in `get_parameters()` for gam-models without smooth terms, or with only smooth terms and removed intercept.

# insight 0.18.8

## Bug fixes

* Fixed test due to changes in the _performance_ package.

# insight 0.18.7

## General

* Minor revisions to `get_predicted.glmmTMB()` due to changes in behaviour of `predict.glmmTMB()` for truncated-family models since _glmmTMB_ 1.1.5.
* New function `has_single_value()` that is equivalent to `length(unique()) == 1` (or `n_unique() == 1`) but faster.

## Changes to functions

* `ellipses_info()` now includes an attribute `$is_binomial`, which is `TRUE` for each model from binomial family.

## Bug fixes

* Fixed behaviour of the `at` argument in `get_datagrid()`.
* Fixed issue for accessing model data in `get_datagrid()` for some edge cases.

# insight 0.18.6

## New supported models

* Support the *logitr* package: `get_data()`, `find_variables()` and more.

## Bug fixes

* Better detection of unicode-support, to avoid failures when building vignettes.
* `get_predicted()` now correctly handles variables of class numeric matrix created by `scale()`, which fixes a bug in `performance::check_model()` (easystats/performance#432).
* Fixed issue with `iterations` argument in `get_predicted()` with _brms_ models.

# insight 0.18.5

## Breaking

* `get_df(type = "satterthwaite")` for `lmerMod` objects now returns degrees of freedom per parameter, and no longer per observation. Use `df_per_obs = TRUE` to return degrees of freedom per observation.

## New functions

* `safe_deparse_symbol()` to only deparse a substituted expression when possible, which increases performance in case of many calls to `deparse(substitute())`.

## Changes to functions

* `format_table()` gets a `use_symbols` argument. If `TRUE`, column names that refer to particular effectsizes (like Phi, Omega or Epsilon) include the related unicode-character instead of the written name. This only works on Windows for R >= 4.2, and on OS X or Linux for R >= 4.0.
* The `stars` argument in `format_table()` can now also be a character vector, naming the columns that should include stars for significant values. This is especially useful for Bayesian models, where we might have multiple columns with significant values, e.g. `"BF"` for the Bayes factor or `"pd"` for the probability of direction.
* `get_df()` gets more `type` options to return different types of degrees of freedom (namely, `"wald"` and `"normal"`, and for mixed models, `"ml1"`, `"betwithin"`, `"satterthwaite"` and `"kenward-roger"`).
* `standardize_names()` now recognizes more classes from package _marginaleffects_.
* Minor improvements to `find_parameters()` for models with nonlinear formula.
* Minor speed improvements.

## Bug fixes

* Fixed issue in `get_data()` for models of class `plm`, which accidentally converted factors into character vectors.
* Fixed issue with column alignment in `export_table()` when the data frame to print contained unicode-characters longer than 1 byte.
* Correctly extract predictors for `fixest::i(f1, i.f2)` interactions (#649 by @grantmcdermott).

# insight 0.18.4

## Changes to functions

* `model_info()` now includes information for `htest` objects from `shapiro.test()` and `bartlett.test()` (will return `$is_variancetest = TRUE`).

## Bug fixes

* Fixed issue in `get_data()` which did not correctly backtransform to original data when terms had log-transformations such as `log(1 + x)` or `log(x + 1)`.
* Fixed CRAN check issues.

# insight 0.18.3

## New functions

* `format_alert()`, `format_warning()` and `format_error()`, as convenient wrappers around `message()`, `warning()` or `stop()` in combination with `format_message()`. You can use these functions to format messages, warnings or errors.

## Changes to functions

* `get_predicted()` for models of class `clm` now includes confidence intervals of predictions.
* `format_message()` gets some additional formatting features. See 'Details' in `?format_message` for more information and some current limitations.
* `format_message()` gets an `indent` argument, to specify the indentation string for subsequent lines.
* `format_table()` now merges IC and IC weights columns into one column (e.g., former columns `"AIC"` and `"AIC_wt"` will now be printed as one column, named `"AIC (weights)"`). Furthermore, an `ic_digits` argument was added to control the number of significant digits for the IC values.
* `print_color()` and `color_text()` now support bright variants of colors and background colors.
* `get_datagrid()` gets more options for `at` and `range`, to provide more control over how to generate the reference grid.
* `get_data()` for models of class `geeglm` and `fixest` now more reliably retrieves the model data.

## New supported models

* Support for models of class `mblogit` and `mclogit`.

## Bug fixes

* Fixed issues with wrong attribute `adjusted_for` in `insight::get_datagrid()`.
* Fixed issue (resp. implemented workaround) in `get_data.iv_robust()`, which failed due to a bug in the _estimatr_ package.
* Fixed issue where `get_predicted()` failed when data contains factors with only one or incomplete levels.
* Fixed issue in `get_predicted()` for models of class `mlm`.
* Fixed issue where `get_predicted()` failed to compute confidence intervals of predictions when model contained matrix-alike response columns, e.g. a response variable created with `cbind()`.

# insight 0.18.2

## New functions

* `format_percent()` as short-cut for `format_value(as_percent = TRUE)`.
* `is_converged()`, to check whether a mixed model has converged or not.

## Changes to functions

* `format_table()` gains an `exact` argument, to either report exact or rounded Bayes factors.
* `get_predicted()` gets a method for models of class `gamlss` (and thereby, `get_loglikelihood()` now also works for those model classes).
* `get_predicted()` now better handles models of class `polr`, `multinom` and `rlm`.

## Bug fixes

* Fixed test failures.
* Minor fixes to address changes in other packages.

# insight 0.18.0

## Breaking changes

* The `ci` argument in `get_predicted()` now defaults to `NULL`. One reason was to make the function faster if confidence intervals are not required, which was the case for many downstream usages of that function. Please set `ci` explicitly to compute confidence intervals for predictions.
* `get_data()` no longer returns logical types for numeric variables that have been converted to logicals on-the-fly within formulas (like `y ~ as.logical(x)`). Instead, each numeric variable that was coerced to logical within a formula gets a `logical` attribute (set to `TRUE`), and the returned data frame gets a `logicals` attribute including all names of affected variables.
* `parameters_table()`, the alias for `format_table()`, was removed.

## Changes to functions

* `find_transformation()` and `get_transformation()` now also work for models where the response was transformed using `log2()` or `log10()`.

## Bug fixes

* `get_sigma()` for models from package _VGAM_ returned wrong sigma-parameter.
* `find_predictors()` for models from package _fixest_ that contained interaction terms in the endogenous formula part did not correctly return all instruments.
* Fixed formatting of HTML table footers in `export_table()`.
* Several fixes to `get_predicted()` for models from `mgcv::gam()`.
* The `component` argument in `find_parameters()` for `stanmvreg` models did not accept the `"location"` value.
* `null_model()` did not consider offset-terms if these were specified inside formulas.
* Argument `allow.new.levels` was not passed to `predict()` for `get_predicted.glmmTMB()`.
* `clean_names()` now works correctly when several variables are specified in `s()` (#573, @etiennebacher).

rempsyc changed the title ~~check_model: missing outlier plot when scaling data with dplyr::mutate~~ check_model: missing outlier plot when scaling data with dplyr::mutate Jun 11, 2022

rempsyc changed the title ~~check_model: missing outlier plot when scaling data with dplyr::mutate~~ check_model: missing outlier plot when scaling data Jun 11, 2022

IndrajeetPatil added the Bug 🐛 Something isn't working label Jun 11, 2022

strengejacke added a commit to easystats/insight that referenced this issue Jun 12, 2022

https://github.com/easystats/performance/issues/432

216d735

rempsyc added a commit to rempsyc/insight that referenced this issue Oct 17, 2022

fixes easystats/performance#432 (missing outlier plot when scaling data)

a02a6d5

rempsyc mentioned this issue Oct 17, 2022

Fixes easystats/performance#432 (missing outlier plot when scaling data) easystats/insight#671

Merged

IndrajeetPatil closed this as completed in easystats/insight#671 Oct 17, 2022

rempsyc self-assigned this Oct 17, 2022
