check_model: missing outlier plot when scaling data #432
As a quick workaround, I would always recommend using a standardization function that preserves the vector class, e.g.
I'll look into this; I'm not sure where exactly this fails, because
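A minimal base-R sketch of what a class-preserving standardization could look like. `z_score()` is a hypothetical helper written for illustration only, not a function from any package; it standardizes by hand so the result stays a plain numeric vector rather than the one-column matrix that `scale()` returns.

```r
# Hypothetical helper (not from any package): standardize a numeric vector
# while keeping it a plain vector, unlike base::scale().
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
}

# Standardize every column in place; row names and column types are kept.
mtcars_z <- mtcars
mtcars_z[] <- lapply(mtcars, z_score)

m <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars_z)

# Fitted data and new data have the same column types, so predicting
# on new data does not hit a type mismatch.
pred <- predict(m, newdata = mtcars_z)
```

The warning text quoted elsewhere in this thread points to `datawizard::standardize()` as a packaged alternative.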
The error comes from
library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))
m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
insight::get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#> 'scale()' on your data?
#> If so, and you get an error, please try 'datawizard::standardize()' to
#> standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
Created on 2022-06-12 by the reprex package (v2.0.1)
The issue is We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a
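The proposed check could be sketched roughly as follows. `find_matrix_predictors()` is a hypothetical name used only for this illustration, and the rule "one-column matrix" is an assumption chosen to match what `scale()` produces; it is not the check that was actually implemented.

```r
# Hypothetical helper: names of predictors stored as one-column matrices
# in the model frame (the shape base::scale() produces).
find_matrix_predictors <- function(model) {
  mf <- stats::model.frame(model)
  resp <- attr(stats::terms(model), "response")
  # Drop the response column, if the model has one.
  preds <- if (resp > 0) mf[-resp] else mf
  is_1col <- vapply(
    preds,
    function(x) is.matrix(x) && ncol(x) == 1L,
    logical(1)
  )
  names(preds)[is_1col]
}

d <- mtcars
d$wt <- scale(d$wt) # wt is now a one-column matrix
m <- lm(mpg ~ wt + cyl, data = d)

bad <- find_matrix_predictors(m)
if (length(bad) > 0) {
  message("Predictors stored as one-column matrices: ", toString(bad))
}
```

A check like this would also flag terms written as `scale(x)` inside the formula, because those are one-column matrices in the model frame too, so it cannot by itself tell the failing case from the working one.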
Thank you, I like the warning and bwiernik's suggestion to throw an error as well. Out of curiosity, would there be any downside to automatically checking whether any variable is a matrix and, if so, converting it to a vector, with a similar warning about the conversion? Since it seems it hasn't been a problem for any of the other panels in
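For illustration, the automatic conversion asked about here might look like the sketch below. `flatten_matrix_columns()` is a hypothetical helper, not something that exists in insight or performance, and it only handles the one-column case that `scale()` creates.

```r
# Hypothetical helper: turn one-column matrix columns of a data frame into
# plain vectors and warn about the conversion.
flatten_matrix_columns <- function(data) {
  is_1col <- vapply(
    data,
    function(x) is.matrix(x) && ncol(x) == 1L,
    logical(1)
  )
  if (any(is_1col)) {
    warning(
      "Converted one-column matrix variables to vectors: ",
      toString(names(data)[is_1col]),
      call. = FALSE
    )
    # as.vector() drops the dim attribute and the "scaled:*" attributes.
    data[is_1col] <- lapply(data[is_1col], as.vector)
  }
  data
}

d <- mtcars
d$wt <- scale(d$wt)
d$disp <- scale(d$disp)

d_flat <- flatten_matrix_columns(d)
m <- lm(mpg ~ wt + disp, data = d_flat)
pred <- predict(m, newdata = d_flat)
```

One cost of this approach is that the centering and scaling attributes are discarded, so the original units can no longer be recovered from the converted column.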
The problem is that if At this point, it's difficult to check the original input type. I try to read the See the example here, which makes what I described above a bit clearer.
library(insight)
library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))
m1 <- lm(scale(mpg) ~ scale(wt) + scale(cyl) + scale(gear) + scale(disp), data = mtcars)
# model frame contains scaled variables, including column names with "scale()"
model.frame(m1) |> str()
#> 'data.frame': 32 obs. of 5 variables:
#> $ scale(mpg) : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#> ..- attr(*, "scaled:center")= num 20.1
#> ..- attr(*, "scaled:scale")= num 6.03
#> $ scale(wt) : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#> ..- attr(*, "scaled:center")= num 3.22
#> ..- attr(*, "scaled:scale")= num 0.978
#> $ scale(cyl) : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#> ..- attr(*, "scaled:center")= num 6.19
#> ..- attr(*, "scaled:scale")= num 1.79
#> $ scale(gear): num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#> ..- attr(*, "scaled:center")= num 3.69
#> ..- attr(*, "scaled:scale")= num 0.738
#> $ scale(disp): num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#> ..- attr(*, "scaled:center")= num 231
#> ..- attr(*, "scaled:scale")= num 124
#> ...
# get_data returns original data
get_data(m1) |> str()
#> 'data.frame': 32 obs. of 5 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ disp: num 160 160 108 258 360 ...
#> ...
# get_predicted and predict work
get_predicted(m1)
#> Some of the variables were in matrix-format - probably you used
#> 'scale()' on your data?
#> If so, and you get an error, please try 'datawizard::standardize()' to
#> standardize your data.
#> Predicted values:
#>
#> [1] 0.32445260 0.16397650 1.04543816 0.14430212 -0.47187319 -0.04790433
#> [7] -0.55368454 0.54252269 0.56089726 -0.18283127 -0.18283127 -0.96536116
#> [13] -0.75139302 -0.78285892 -1.48188932 -1.60521739 -1.57854583 1.08719605
#> [19] 1.45189042 1.30814020 1.04950441 -0.57061221 -0.53325137 -0.73512269
#> [25] -0.68065788 1.25431100 1.09151240 1.45705867 -0.47507819 0.13139606
#> [31] -0.78441681 0.77093082
#>
#> NOTE: Confidence intervals, if available, are stored as attributes and can be accessed using `as.data.frame()` on this output.
predict(m1, newdata = get_data(m1))
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 0.32445260 0.16397650 1.04543816 0.14430212
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> -0.47187319 -0.04790433 -0.55368454 0.54252269
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 0.56089726 -0.18283127 -0.18283127 -0.96536116
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> -0.75139302 -0.78285892 -1.48188932 -1.60521739
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> -1.57854583 1.08719605 1.45189042 1.30814020
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 1.04950441 -0.57061221 -0.53325137 -0.73512269
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> -0.68065788 1.25431100 1.09151240 1.45705867
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> -0.47507819 0.13139606 -0.78441681 0.77093082
m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
# model frame contains scaled variables, with variable names of original data
model.frame(m2) |> str()
#> 'data.frame': 32 obs. of 5 variables:
#> $ mpg : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#> ..- attr(*, "scaled:center")= num 20.1
#> ..- attr(*, "scaled:scale")= num 6.03
#> $ wt : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#> ..- attr(*, "scaled:center")= num 3.22
#> ..- attr(*, "scaled:scale")= num 0.978
#> $ cyl : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#> ..- attr(*, "scaled:center")= num 6.19
#> ..- attr(*, "scaled:scale")= num 1.79
#> $ gear: num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#> ..- attr(*, "scaled:center")= num 3.69
#> ..- attr(*, "scaled:scale")= num 0.738
#> $ disp: num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#> ..- attr(*, "scaled:center")= num 231
#> ..- attr(*, "scaled:scale")= num 124
#> ...
# get_data returns data that was used to fit model (i.e. scaled variables),
# but coerces 1D-matrix to numeric vector
get_data(m2) |> str()
#> 'data.frame': 32 obs. of 5 variables:
#> $ mpg : num 0.151 0.151 0.45 0.217 -0.231 ...
#> $ wt : num -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#> $ cyl : num -0.105 -0.105 -1.225 -0.105 1.015 ...
#> $ gear: num 0.424 0.424 0.424 -0.932 -0.932 ...
#> $ disp: num -0.571 -0.571 -0.99 0.22 1.043 ...
#> ...
# fails
get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#> 'scale()' on your data?
#> If so, and you get an error, please try 'datawizard::standardize()' to
#> standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
predict(m2, newdata = get_data(m2))
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
Created on 2022-06-12 by the reprex package (v2.0.1)
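The failing case can be reproduced without dplyr or insight. The sketch below assumes that the mismatch is detected through the `dataClasses` attribute that `lm()` stores on the model terms; it scales a single predictor and compares the type recorded at fit time with the type supplied in `newdata`.

```r
d <- mtcars
d$wt <- scale(d$wt) # one-column numeric matrix
m <- lm(mpg ~ wt, data = d)

# Variable types recorded when the model was fitted.
fit_classes <- attr(terms(m), "dataClasses")

# New data in which wt is a plain numeric vector again.
newdata <- d
newdata$wt <- as.vector(d$wt)

# Capture the error message instead of stopping the script.
res <- tryCatch(
  predict(m, newdata = newdata),
  error = function(e) conditionMessage(e)
)
```

Printing `fit_classes` and `res` shows the recorded type for `wt` and the resulting error message. With `scale(wt)` written inside the formula instead, the transformation is applied to `newdata` at prediction time, which is consistent with `m1` above predicting without error.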
…ta) (#671)
* fixes easystats/performance#432 (missing outlier plot when scaling data)
* Adds import stats::setNames
* Update R/get_predicted.R
  Co-authored-by: Indrajeet Patil <patilindrajeet.science@gmail.com>
* Added tests, updated NEWS, DESCRIPTION
Co-authored-by: Indrajeet Patil <patilindrajeet.science@gmail.com>
# insight 0.19.2

## Breaking changes

* The minimum needed R version has been bumped to `3.6`.
* `download_model()` no longer errors when a model object could not be downloaded, but instead returns `NULL`. This prevents test failures, and allows skipping tests when the return value of `download_model()` is `NULL`.

## General

* Improved support for `mclogit` models (package *mclogit*) and `mipo` objects (package *mice*) for models with ordinal or categorical response.

## New supported models

* `phylolm` and `phyloglm` (package *phylolm*), `nestedLogit` (package *nestedLogit*).

## Bug fixes

* Fixed issue in `get_variance()` for *glmmTMB* models with rank deficient coefficients.
* Fixed issues in `get_weights()` for `glm` models without weights and `na.action` not set to default in the model call.
* `clean_names()` now also removes the `relevel()` pattern.
* Fixed issue in `model_info()` for models of class `gamlss`.
* Fixed problems preventing `get_data()` from locating data defined in non-global environments.
* Fixed issue in `get_predicted()` for variables of class numeric matrix created by `scale()`, which were correctly handled only when `get_data()` failed to find the data in the appropriate environment.
* Fixed issue in `model_info()` for `gee` models from `binomial` families.

# insight 0.19.1

## New supported models

* `hglm` (package *hglm*).

## Changes to functions

* Minor improvements to `get_data()` for `t.test()`.
* `format_value()` gets a `lead_zero` argument, to keep or drop the leading zero of a formatted value, as well as arguments `style_positive` and `style_negative` to style positive or negative numbers.
* `format_table()` now also formats columns named `SGPV` (second generation p-values) as p-values.
* Functions for models of class `clm` (like `find_formula()`, `find_variables()`, `get_data()` etc.) now also include variables that were defined as `scale` or `nominal` component.

## Bug fixes

* Fixed issue in `get_data()` for results from `kruskal.test()`.
* Fixed issue in `find_weights()` for models of class `lme` and `gls`.
* Fixed issue in `get_datagrid()` for models with multiple weight variables.

# insight 0.19.0

## New supported models

* `mmrm` (package *mmrm*), `flac` and `flic` (*logistf*)

## Breaking changes

* `get_data()` was revised and now always tries to recover the data that was used to fit a model from the environment. If this fails, it falls back to recovering data from the model frame (the former default behaviour). Furthermore, the `source` argument can be used to explicitly force the old behaviour: `source = "mf"` will try to recover data from the model frame first, then possibly falling back to look in the environment.

## New functions

* `n_grouplevels()`, to return random effect groups and number of group levels for mixed models.

## Changes to functions

* `get_datagrid()` preserves all factor levels for factors that are held constant at their reference level. This is required to work together with `get_modelmatrix()` when calculating standard errors for `get_predicted()`.

## Bug fixes

* Fixed bug in `get_modelmatrix()` handling of incomplete factors which sometimes had downstream implications for numerical results in the uncertainty estimates produced by `get_predicted()`.
* Fixed minor issues for HTML tables in `export_table()` when model parameters were grouped.
* Fixed issue with incorrect back-transforming in `get_data()` for models with log-transformed variables.
* Fixes issue in `compact_list()`.
* `has_single_value()` now returns `FALSE` when the object only has `NA` and `na.rm = TRUE`.
* Fixed issue in `get_parameters()` for gam-models without smooth terms, or with only smooth terms and removed intercept.

# insight 0.18.8

## Bug fixes

* Fixed test due to changes in the _performance_ package.

# insight 0.18.7

## General

* Minor revisions to `get_predicted.glmmTMB()` due to changes in behaviour of `predict.glmmTMB()` for truncated-family models since _glmmTMB_ 1.1.5.
* New function `has_single_value()` that is equivalent to `length(unique()) == 1` (or `n_unique() == 1`) but faster.

## Changes to functions

* `ellipses_info()` now includes an attribute `$is_binomial`, which is `TRUE` for each model from binomial family.

## Bug fixes

* Fixed behaviour of the `at` argument in `get_datagrid()`.
* Fixed issue for accessing model data in `get_datagrid()` for some edge cases.

# insight 0.18.6

## New supported models

* Support the *logitr* package: `get_data()`, `find_variables()` and more.

## Bug fixes

* Better detection of unicode-support, to avoid failures when building vignettes.
* `get_predicted()` now correctly handles variables of class numeric matrix created by `scale()`, which fixes a bug in `performance::check_model()` (easystats/performance#432).
* Fixed issue with `iterations` argument in `get_predicted()` with _brms_ models.

# insight 0.18.5

## Breaking

* `get_df(type = "satterthwaite")` for `lmerMod` objects now returns degrees of freedom per parameter, and no longer per observation. Use `df_per_obs = TRUE` to return degrees of freedom per observation.

## New functions

* `safe_deparse_symbol()` to only deparse a substituted expression when possible, which increases performance in case of many calls to `deparse(substitute())`.

## Changes to functions

* `format_table()` gets a `use_symbols` argument. If `TRUE`, column names that refer to particular effectsizes (like Phi, Omega or Epsilon) include the related unicode-character instead of the written name. This only works on Windows for R >= 4.2, and on OS X or Linux for R >= 4.0.
* The `stars` argument in `format_table()` can now also be a character vector, naming the columns that should include stars for significant values. This is especially useful for Bayesian models, where we might have multiple columns with significant values, e.g. `"BF"` for the Bayes factor or `"pd"` for the probability of direction.
* `get_df()` gets more `type` options to return different types of degrees of freedom (namely, `"wald"` and `"normal"`, and for mixed models, `"ml1"`, `"betwithin"`, `"satterthwaite"` and `"kenward-roger"`).
* `standardize_names()` now recognizes more classes from package _marginaleffects_.
* Minor improvements to `find_parameters()` for models with nonlinear formula.
* Minor speed improvements.

## Bug fixes

* Fixed issue in `get_data()` for models of class `plm`, which accidentally converted factors into character vectors.
* Fixed issue with column alignment in `export_table()` when the data frame to print contained unicode-characters longer than 1 byte.
* Correctly extract predictors for `fixest::i(f1, i.f2)` interactions (#649 by @grantmcdermott).

# insight 0.18.4

## Changes to functions

* `model_info()` now includes information for `htest` objects from `shapiro.test()` and `bartlett.test()` (will return `$is_variancetest = TRUE`).

## Bug fixes

* Fixed issue in `get_data()` which did not correctly backtransform to original data when terms had log-transformations such as `log(1 + x)` or `log(x + 1)`.
* Fixed CRAN check issues.

# insight 0.18.3

## New functions

* `format_alert()`, `format_warning()` and `format_error()`, as convenient wrappers around `message()`, `warning()` or `stop()` in combination with `format_message()`. You can use these functions to format messages, warnings or errors.

## Changes to functions

* `get_predicted()` for models of class `clm` now includes confidence intervals of predictions.
* `format_message()` gets some additional formatting features. See 'Details' in `?format_message` for more information and some current limitations.
* `format_message()` gets an `indent` argument, to specify the indentation string for subsequent lines.
* `format_table()` now merges IC and IC weights columns into one column (e.g., former columns `"AIC"` and `"AIC_wt"` will now be printed as one column, named `"AIC (weights)"`). Furthermore, an `ic_digits` argument was added to control the number of significant digits for the IC values.
* `print_color()` and `color_text()` now support bright variants of colors and background colors.
* `get_datagrid()` gets more options for `at` and `range`, to provide more control over how to generate the reference grid.
* `get_data()` for models of class `geeglm` and `fixest` now more reliably retrieves the model data.

## New supported models

* Support for models of class `mblogit` and `mclogit`.

## Bug fixes

* Fixed issues with wrong attribute `adjusted_for` in `insight::get_datagrid()`.
* Fixed issue (resp. implemented workaround) in `get_data.iv_robust()`, which failed due to a bug in the _estimatr_ package.
* Fixed issue where `get_predicted()` failed when data contains factors with only one or incomplete levels.
* Fixed issue in `get_predicted()` for models of class `mlm`.
* Fixed issue where `get_predicted()` failed to compute confidence intervals of predictions when model contained matrix-alike response columns, e.g. a response variable created with `cbind()`.

# insight 0.18.2

## New functions

* `format_percent()` as short-cut for `format_value(as_percent = TRUE)`.
* `is_converged()`, to check whether a mixed model has converged or not.

## Changes to functions

* `format_table()` gains an `exact` argument, to either report exact or rounded Bayes factors.
* `get_predicted()` gets a method for models of class `gamlss` (and thereby, `get_loglikelihood()` now also works for those model classes).
* `get_predicted()` now better handles models of class `polr`, `multinom` and `rlm`.

## Bug fixes

* Fixed test failures.
* Minor fixes to address changes in other packages.

# insight 0.18.0

## Breaking changes

* The `ci` argument in `get_predicted()` now defaults to `NULL`. One reason was to make the function faster if confidence intervals are not required, which was the case for many downstream usages of that function. Please set `ci` explicitly to compute confidence intervals for predictions.
* `get_data()` no longer returns logical types for numeric variables that have been converted to logicals on-the-fly within formulas (like `y ~ as.logical(x)`). Instead, each numeric variable that was coerced to logical within a formula gets a `logical` attribute (set to `TRUE`), and the returned data frame gets a `logicals` attribute including all names of affected variables.
* `parameters_table()`, the alias for `format_table()`, was removed.

## Changes to functions

* `find_transformation()` and `get_transformation()` now also work for models where the response was transformed using `log2()` or `log10()`.

## Bug fixes

* `get_sigma()` for models from package _VGAM_ returned wrong sigma-parameter.
* `find_predictors()` for models from package _fixest_ that contained interaction terms in the endogenous formula part did not correctly return all instruments.
* Fixed formatting of HTML table footers in `export_table()`.
* Several fixes to `get_predicted()` for models from `mgcv::gam()`.
* The `component` argument in `find_parameters()` for `stanmvreg` models did not accept the `"location"` value.
* `null_model()` did not consider offset-terms if these were specified inside formulas.
* Argument `allow.new.levels` was not passed to `predict()` for `get_predicted.glmmTMB()`.
* `clean_names()` now works correctly when several variables are specified in `s()` (#573, @etiennebacher).
Summary: check_model fails to plot the outlier panel when scaling data because the scaled variables become incompatible matrix arrays.
Reprex:
The following works:
Looks good. Let's scale the data
The outlier panel is missing. The reason is that the outlier check is failing silently.
The reason is that scaling changes the object class from numeric vector to matrix array.
The solution is to convert the scaled variables back to plain numeric vectors.
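The class change and the conversion back can be seen on a single column. This is a small base-R illustration, not the code from the original report.

```r
x <- mtcars$wt
class(x) # a plain numeric vector

z <- scale(x)
class(z) # now a matrix (also an array in R >= 4.0)
dim(z)   # one column, one row per observation

# Dropping the matrix structure gives a plain numeric vector again.
z_vec <- as.numeric(z)
class(z_vec)
```

The same conversion can be applied inside whatever step does the scaling, so that the columns are vectors before the model is fitted.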
Note that scaling through lapply instead of dplyr::mutate works:
The issue emerges also if one simply changes one variable only, suggesting the issue actually lies in the base R scale function.
Created on 2022-06-11 by the reprex package (v2.0.1)
This is confusing many students in an introductory R stats class here because they are taught to scale their variables at the beginning of their script, but then the following fails. It would be nice if check_model could automatically convert from matrix array to numeric vector, if applicable.