check_model: missing outlier plot when scaling data #432

Closed
rempsyc opened this issue Jun 11, 2022 · 5 comments · Fixed by easystats/insight#671

@rempsyc
Member

rempsyc commented Jun 11, 2022

Summary: check_model fails to plot the outlier panel when scaling data because the scaled variables become incompatible matrix arrays.

Reprex:
The following works:

library(performance)
m <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars)
check_model(m)

Looks good. Now let's scale the data:

library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
check_model(m2)

The outlier panel is missing. The reason is that the outlier check is failing silently.

check_model(m2, check = "outliers")
#> Error in unit(rep(0, TABLE_ROWS * dims[1]), "null"): 'x' and 'units' must have length > 0

The reason is that scaling changes the object class from numeric vector to matrix array.

class(mtcars2$mpg)
#> [1] "matrix" "array"

The solution is to convert the scaled result back to a plain numeric vector, with either as.numeric or as.vector:

mtcars3 <- mtcars %>%
  mutate(across(everything(), ~scale(.x) %>% as.numeric))

m3 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars3)
check_model(m3)

mtcars4 <- mtcars %>%
  mutate(across(everything(), ~scale(.x) %>% as.vector))

m4 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars4)
check_model(m4)

Note that scaling through lapply instead of dplyr::mutate works:

mtcars5 <- lapply(mtcars, scale) |> as.data.frame()

m5 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars5)
check_model(m5)

The issue emerges also if one simply changes one variable only, suggesting the issue actually lies in the base R scale function.

mtcars6 <- mtcars
mtcars6$wt <- scale(mtcars$wt)

m6 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars6)
check_model(m6)

Created on 2022-06-11 by the reprex package (v2.0.1)

This is confusing many students in an introductory R stats class here because they are taught to scale their variables at the beginning of their script, but then the following fails. It would be nice if check_model could automatically convert from matrix array to numeric vector, if applicable.

@rempsyc rempsyc changed the title check_model: missing outlier plot when scaling data with dplyr::mutate check_model: missing outlier plot when scaling data with dplyr::mutate Jun 11, 2022
@rempsyc rempsyc changed the title check_model: missing outlier plot when scaling data with dplyr::mutate check_model: missing outlier plot when scaling data Jun 11, 2022
@IndrajeetPatil IndrajeetPatil added the Bug 🐛 Something isn't working label Jun 11, 2022
@strengejacke
Member

As a quick workaround, I would always recommend using a standardization function that preserves the vector class, e.g. datawizard::standardize().

I'll look into this. I'm not sure where exactly this fails, because check_outliers() seems to work.
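For reference, a minimal sketch of that workaround. It assumes the datawizard package is installed (the guard skips the example otherwise); `mtcars_std`, `has_matrix_cols`, `m_std` and `preds` are illustrative names, not part of any package.

```r
# Workaround sketch: standardize with datawizard instead of base::scale().
# Assumes the datawizard package is installed; the block is skipped otherwise.
if (requireNamespace("datawizard", quietly = TRUE)) {
  mtcars_std <- datawizard::standardize(mtcars)

  # Check that no column was turned into a matrix
  has_matrix_cols <- any(vapply(mtcars_std, is.matrix, logical(1)))
  print(has_matrix_cols)

  m_std <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars_std)

  # predict() with the fitting data as newdata passes the type check,
  # because the column types match those recorded at fit time
  preds <- predict(m_std, newdata = mtcars_std)
}
```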

strengejacke added a commit to easystats/insight that referenced this issue Jun 12, 2022
@strengejacke
Member

The error comes from insight::get_predicted(). For now, I added a warning. Not quite sure how to best fix this issue.

library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)
insight::get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit

Created on 2022-06-12 by the reprex package (v2.0.1)

@bwiernik
Contributor

bwiernik commented Jun 12, 2022

The issue is scale()'s terrible behavior of always returning a matrix. Users should just never use scale().

We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a $?
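A rough sketch of what such a check could look like, in base R only. `warn_matrix_columns()` is a hypothetical helper written for illustration; it is not an existing function in insight or performance. Note that it would also flag `scale()` used on-the-fly inside a formula, so a real check would probably need to be more selective.

```r
# Hypothetical helper (not part of insight or performance): warn when a
# model frame contains 1-column matrix variables, e.g. created by scale().
warn_matrix_columns <- function(model) {
  mf <- stats::model.frame(model)
  is_mat <- vapply(
    mf,
    function(x) is.matrix(x) && ncol(x) == 1L,
    logical(1)
  )
  if (any(is_mat)) {
    warning(
      "Matrix columns in model frame: ",
      paste(names(mf)[is_mat], collapse = ", "),
      ". Did you use scale() on the data?",
      call. = FALSE
    )
  }
  # Return the flagged variable names invisibly
  invisible(names(mf)[is_mat])
}

d <- mtcars
d$wt <- scale(d$wt) # 1-column matrix
m <- lm(mpg ~ wt + cyl, data = d)

flagged <- suppressWarnings(warn_matrix_columns(m))
flagged
```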

@rempsyc
Member Author

rempsyc commented Jun 12, 2022

Thank you, I like the warning and bwiernik's suggestion to throw an error also. Out of curiosity, would there be any con to automatically check if any variable is a matrix, and if so, convert to vector, with a similar warning about the conversion? Since it seems it hasn't been a problem for any of the other panels in check_model.
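To make the question concrete, here is a small base R sketch of such a conversion. `flatten_matrix_columns()` is a hypothetical helper, not an existing function in any easystats package, and it emits a message rather than a warning purely for illustration.

```r
# Hypothetical helper: convert 1-column matrix columns of a data frame
# into plain vectors and report which columns were converted.
flatten_matrix_columns <- function(data) {
  is_1d <- vapply(
    data,
    function(x) is.matrix(x) && ncol(x) == 1L,
    logical(1)
  )
  if (any(is_1d)) {
    message(
      "Converting matrix columns to vectors: ",
      paste(names(data)[is_1d], collapse = ", ")
    )
    for (nm in names(data)[is_1d]) {
      # as.vector() drops the dim and the "scaled:*" attributes
      data[[nm]] <- as.vector(data[[nm]])
    }
  }
  data
}

d <- mtcars
d$wt <- scale(d$wt)
d_flat <- flatten_matrix_columns(d)
class(d_flat$wt)
```

This only helps when applied to the data before the model is fitted. Applied to new data after fitting, it would recreate the type mismatch, because the fitted model still remembers the matrix type.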

@strengejacke
Member

We could add a check to see if a predictor variable is a matrix and throw an error/warning like we do if a formula includes a $?

The problem is that if get_predicted() is called without a data argument, get_data() is called, which coerces matrix columns into vectors. scale() causes no problem when called on-the-fly in the formula. If it's called before fitting the model, the variable names in the data are the same as the original variable names, but the variable types are 1D-matrices. get_data() returns a data frame where the variable names are also the same as in the original data, but the data types are coerced to numeric. predict(), however, expects the same types as in the fit, probably because the names are identical?
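The type mismatch can be reproduced with base R alone, without insight. The sketch below uses illustrative object names (`nd_vec`, `nd_mat`) and a smaller model than the one in the issue.

```r
# Fit on data where one predictor is a 1-column matrix, as scale() returns
d <- mtcars
d$wt <- scale(d$wt)
m <- lm(mpg ~ wt + cyl, data = d)

# Same values as a plain numeric vector: predict() rejects the type change
nd_vec <- data.frame(wt = as.numeric(d$wt), cyl = d$cyl)
res_vec <- tryCatch(
  predict(m, newdata = nd_vec),
  error = function(e) e
)
inherits(res_vec, "error")

# Restoring the matrix type makes the same call succeed
nd_mat <- nd_vec
nd_mat$wt <- as.matrix(nd_mat$wt)
res_mat <- predict(m, newdata = nd_mat)
all.equal(unname(res_mat), unname(fitted(m)))
```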

At this point, it's difficult to check the original input type. I tried to read the dataClasses attribute of terms, but not all model types have a terms() method: easystats/insight@216d735
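As a rough illustration of the dataClasses idea for a plain `lm` (a sketch only, not the code from the referenced commit; `fit_classes` and `matrix_vars` are illustrative names):

```r
d <- mtcars
d$wt <- scale(d$wt)
m <- lm(mpg ~ wt + cyl, data = d)

# Classes recorded at fit time; a 1-column numeric matrix is stored
# as "nmatrix.1", a plain numeric vector as "numeric"
fit_classes <- attr(stats::terms(m), "dataClasses")
fit_classes

# Names of the variables that were matrices when the model was fitted
matrix_vars <- names(fit_classes)[grepl("^nmatrix", fit_classes)]
matrix_vars
```

For model classes without a `terms()` method, this lookup is not available, which is the limitation described above.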

The following example makes what I described above a bit clearer.

library(insight)
library(dplyr)
mtcars2 <- mtcars %>%
  mutate(across(everything(), scale))

m1 <- lm(scale(mpg) ~ scale(wt) + scale(cyl) + scale(gear) + scale(disp), data = mtcars)

# model frame contains scaled variables, including column names with "scale()"
model.frame(m1) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ scale(mpg) : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#>   ..- attr(*, "scaled:center")= num 20.1
#>   ..- attr(*, "scaled:scale")= num 6.03
#>  $ scale(wt)  : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>   ..- attr(*, "scaled:center")= num 3.22
#>   ..- attr(*, "scaled:scale")= num 0.978
#>  $ scale(cyl) : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#>   ..- attr(*, "scaled:center")= num 6.19
#>   ..- attr(*, "scaled:scale")= num 1.79
#>  $ scale(gear): num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#>   ..- attr(*, "scaled:center")= num 3.69
#>   ..- attr(*, "scaled:scale")= num 0.738
#>  $ scale(disp): num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#>   ..- attr(*, "scaled:center")= num 231
#>   ..- attr(*, "scaled:scale")= num 124
#> ...

# get_data returns original data
get_data(m1) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ disp: num  160 160 108 258 360 ...
#> ...

# get_predicted and predict work
get_predicted(m1)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Predicted values:
#> 
#>  [1]  0.32445260  0.16397650  1.04543816  0.14430212 -0.47187319 -0.04790433
#>  [7] -0.55368454  0.54252269  0.56089726 -0.18283127 -0.18283127 -0.96536116
#> [13] -0.75139302 -0.78285892 -1.48188932 -1.60521739 -1.57854583  1.08719605
#> [19]  1.45189042  1.30814020  1.04950441 -0.57061221 -0.53325137 -0.73512269
#> [25] -0.68065788  1.25431100  1.09151240  1.45705867 -0.47507819  0.13139606
#> [31] -0.78441681  0.77093082
#> 
#> NOTE: Confidence intervals, if available, are stored as attributes and can be accessed using `as.data.frame()` on this output.
predict(m1, newdata = get_data(m1))
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>          0.32445260          0.16397650          1.04543816          0.14430212 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>         -0.47187319         -0.04790433         -0.55368454          0.54252269 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>          0.56089726         -0.18283127         -0.18283127         -0.96536116 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>         -0.75139302         -0.78285892         -1.48188932         -1.60521739 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>         -1.57854583          1.08719605          1.45189042          1.30814020 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>          1.04950441         -0.57061221         -0.53325137         -0.73512269 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>         -0.68065788          1.25431100          1.09151240          1.45705867 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>         -0.47507819          0.13139606         -0.78441681          0.77093082


m2 <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars2)

#  model frame contains scaled variables, with variable names of original data
model.frame(m2) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num [1:32, 1] 0.151 0.151 0.45 0.217 -0.231 ...
#>   ..- attr(*, "scaled:center")= num 20.1
#>   ..- attr(*, "scaled:scale")= num 6.03
#>  $ wt  : num [1:32, 1] -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>   ..- attr(*, "scaled:center")= num 3.22
#>   ..- attr(*, "scaled:scale")= num 0.978
#>  $ cyl : num [1:32, 1] -0.105 -0.105 -1.225 -0.105 1.015 ...
#>   ..- attr(*, "scaled:center")= num 6.19
#>   ..- attr(*, "scaled:scale")= num 1.79
#>  $ gear: num [1:32, 1] 0.424 0.424 0.424 -0.932 -0.932 ...
#>   ..- attr(*, "scaled:center")= num 3.69
#>   ..- attr(*, "scaled:scale")= num 0.738
#>  $ disp: num [1:32, 1] -0.571 -0.571 -0.99 0.22 1.043 ...
#>   ..- attr(*, "scaled:center")= num 231
#>   ..- attr(*, "scaled:scale")= num 124
#> ...

# get_data returns data that was used to fit model (i.e. scaled variables),
# but coerces 1D-matrix to numeric vector
get_data(m2) |> str()
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg : num  0.151 0.151 0.45 0.217 -0.231 ...
#>  $ wt  : num  -0.6104 -0.3498 -0.917 -0.0023 0.2277 ...
#>  $ cyl : num  -0.105 -0.105 -1.225 -0.105 1.015 ...
#>  $ gear: num  0.424 0.424 0.424 -0.932 -0.932 ...
#>  $ disp: num  -0.571 -0.571 -0.99 0.22 1.043 ...
#> ...

# fails
get_predicted(m2)
#> Some of the variables were in matrix-format - probably you used
#>   'scale()' on your data?
#>   If so, and you get an error, please try 'datawizard::standardize()' to
#>   standardize your data.
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit
predict(m2, newdata = get_data(m2))
#> Error: variables 'wt', 'cyl', 'gear', 'disp' were specified with different types from the fit

Created on 2022-06-12 by the reprex package (v2.0.1)

IndrajeetPatil added a commit to easystats/insight that referenced this issue Oct 17, 2022
…ta) (#671)

* fixes easystats/performance#432 (missing outlier plot when scaling data)

* Adds import stats::setNames

* Update R/get_predicted.R

Co-authored-by: Indrajeet Patil <patilindrajeet.science@gmail.com>

* Added tests, updated NEWS, DESCRIPTION

Co-authored-by: Indrajeet Patil <patilindrajeet.science@gmail.com>
@rempsyc rempsyc self-assigned this Oct 17, 2022
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jun 15, 2023
# insight 0.19.2

## Breaking changes

* The minimum needed R version has been bumped to `3.6`.

* `download_model()` no longer errors when a model object could not be downloaded,
  but instead returns `NULL`. This prevents test failures, and allows skipping
  tests when the return value of `download_model()` is `NULL`.

## General

* Improved support for `mclogit` models (package *mclogit*) and `mipo` objects
  (package *mice*) for models with ordinal or categorical response.

## New supported models

* `phylolm` and `phyloglm` (package *phylolm*), `nestedLogit` (package *nestedLogit*).

## Bug fixes

* Fixed issue in `get_variance()` for *glmmTMB* models with rank deficient
  coefficients.

* Fixed issues in `get_weights()` for `glm` models without weights and `na.action`
  not set to default in the model call.

* `clean_names()` now also removes the `relevel()` pattern.

* Fixed issue in `model_info()` for models of class `gamlss`.

* Fixed problems preventing `get_data()` from locating data defined in
  non-global environments.

* Fixed issue in `get_predicted()` for variables of class numeric matrix created
  by `scale()`, which were correctly handled only when `get_data()` failed to
  find the data in the appropriate environment.

* Fixed issue in `model_info()` for `gee` models from `binomial` families.

# insight 0.19.1

## New supported models

* `hglm` (package *hglm*).

## Changes to functions

* Minor improvements to `get_data()` for `t.test()`.

* `format_value()` gets a `lead_zero` argument, to keep or drop the leading
  zero of a formatted value, as well as arguments `style_positive` and
  `style_negative` to style positive or negative numbers.

* `format_table()` now also formats columns named `SGPV` (second generation
  p-values) as p-values.

* Functions for models of class `clm` (like `find_formula()`, `find_variables()`,
  `get_data()` etc.) now also include variables that were defined as `scale` or
  `nominal` component.

## Bug fixes

* Fixed issue in `get_data()` for results from `kruskal.test()`.

* Fixed issue in `find_weights()` for models of class `lme` and `gls`.

* Fixed issue in `get_datagrid()` for models with multiple weight variables.

# insight 0.19.0

## New supported models

* `mmrm` (package *mmrm*), `flac` and `flic` (*logistf*)

## Breaking changes

* `get_data()` was revised and now always tries to recover the data that was
  used to fit a model from the environment. If this fails, it falls back to
  recovering data from the model frame (the former default behaviour).
  Furthermore, the `source` argument can be used to explicitly force the old
  behaviour: `source = "mf"` will try to recover data from the model frame first,
  then possibly falling back to look in the environment.

## New functions

* `n_grouplevels()`, to return random effect groups and number of group levels
  for mixed models.

## Changes to functions

* `get_datagrid()` preserves all factor levels for factors that are held constant
  at their reference level. This is required to work together with
  `get_modelmatrix()` when calculating standard errors for `get_predicted()`.

## Bug fixes

* Fixed bug in `get_modelmatrix()` handling of incomplete factors which
  sometimes had downstream implications for numerical results in the uncertainty
  estimates produced by `get_predicted()`.

* Fixed minor issues for HTML tables in `export_table()` when model parameters
  were grouped.

* Fixed issue with incorrect back-transforming in `get_data()` for models with
  log-transformed variables.

* Fixes issue in `compact_list()`.

* `has_single_value()` now returns `FALSE` when the object only has `NA` and
  `na.rm = TRUE`.

* Fixed issue in `get_parameters()` for gam-models without smooth terms, or with
  only smooth terms and removed intercept.

# insight 0.18.8

## Bug fixes

* Fixed test due to changes in the _performance_ package.

# insight 0.18.7

## General

* Minor revisions to `get_predicted.glmmTMB()` due to changes in behaviour
  of `predict.glmmTMB()` for truncated-family models since _glmmTMB_ 1.1.5.

* New function `has_single_value()` that is equivalent to `length(unique()) == 1`
  (or `n_unique() == 1`) but faster.

## Changes to functions

* `ellipses_info()` now includes an attribute `$is_binomial`, which is `TRUE`
  for each model from binomial family.

## Bug fixes

* Fixed behaviour of the `at` argument in `get_datagrid()`.

* Fixed issue for accessing model data in `get_datagrid()` for some edge cases.

# insight 0.18.6

## New supported models

* Support the *logitr* package: `get_data()`, `find_variables()` and more.

## Bug fixes

* Better detection of unicode-support, to avoid failures when building
  vignettes.

* `get_predicted()` now correctly handles variables of class numeric matrix
  created by `scale()`, which fixes a bug in `performance::check_model()`
  (easystats/performance#432).

* Fixed issue with `iterations` argument in `get_predicted()` with _brms_
  models.

# insight 0.18.5

## Breaking

* `get_df(type = "satterthwaite")` for `lmerMod` objects now returns degrees of
  freedom per parameter, and no longer per observation. Use `df_per_obs = TRUE`
  to return degrees of freedom per observation.

## New functions

* `safe_deparse_symbol()`, to only deparse a substituted expression when
  possible, which increases performance in case of many calls to
  `deparse(substitute())`.

## Changes to functions

* `format_table()` gets a `use_symbols` argument. If `TRUE`, column names that
  refer to particular effect sizes (like Phi, Omega or Epsilon) include the
  related unicode-character instead of the written name. This only works on
  Windows for R >= 4.2, and on OS X or Linux for R >= 4.0.

* The `stars` argument in `format_table()` can now also be a character vector,
  naming the columns that should include stars for significant values. This is
  especially useful for Bayesian models, where we might have multiple columns
  with significant values, e.g. `"BF"` for the Bayes factor or `"pd"` for the
  probability of direction.

* `get_df()` gets more `type` options to return different type of degrees of
  freedom (namely, `"wald"` and `"normal"`, and for mixed models, `"ml1"`,
  `"betwithin"`, `"satterthwaite"` and `"kenward-roger"`).

* `standardize_names()` now recognizes more classes from package _marginaleffects_.

* Minor improvements to `find_parameters()` for models with nonlinear formula.

* Minor speed improvements.

## Bug fixes

* Fixed issue in `get_data()` for models of class `plm`, which accidentally
  converted factors into character vectors.

* Fixed issue with column alignment in `export_table()` when the data frame
  to print contained unicode-characters longer than 1 byte.

* Correctly extract predictors for `fixest::i(f1, i.f2)` interactions (#649 by
  @grantmcdermott).

# insight 0.18.4

## Changes to functions

* `model_info()` now includes information for `htest` objects from
  `shapiro.test()` and `bartlett.test()` (will return `$is_variancetest = TRUE`).

## Bug fixes

* Fixed issue in `get_data()` which did not correctly backtransform to original
  data when terms had log-transformations such as `log(1 + x)` or `log(x + 1)`.

* Fixed CRAN check issues.

# insight 0.18.3

## New functions

* `format_alert()`, `format_warning()` and `format_error()`, as convenient
  wrappers around `message()`, `warning()` or `stop()` in combination with
  `format_message()`. You can use these functions to format messages, warnings
  or errors.

## Changes to functions

* `get_predicted()` for models of class `clm` now includes confidence intervals
  of predictions.

* `format_message()` gets some additional formatting features. See 'Details'
  in `?format_message` for more information and some current limitations.

* `format_message()` gets an `indent` argument, to specify the indentation string
  for subsequent lines.

* `format_table()` now merges IC and IC weights columns into one column (e.g.,
  former columns `"AIC"` and `"AIC_wt"` will now be printed as one column, named
  `"AIC (weights)"`). Furthermore, an `ic_digits` argument was added to control
  the number of significant digits for the IC values.

* `print_color()` and `color_text()` now support bright variants of colors and
  background colors.

* `get_datagrid()` gets more options for `at` and `range`, to provide more
  control how to generate the reference grid.

* `get_data()` for models of class `geeglm` and `fixest` now more reliably
  retrieves the model data.

## New supported models

* Support for models of class `mblogit` and `mclogit`.

## Bug fixes

* Fixed issues with wrong attribute `adjusted_for` in `insight::get_datagrid()`.

* Fixed issue (resp. implemented workaround) in `get_data.iv_robust()`, which
  failed due to a bug in the _estimatr_ package.

* Fixed issue where `get_predicted()` failed when data contains factors with
  only one or incomplete levels.

* Fixed issue in `get_predicted()` for models of class `mlm`.

* Fixed issue where `get_predicted()` failed to compute confidence intervals
  of predictions when model contained matrix-alike response columns, e.g. a
  response variable created with `cbind()`.

# insight 0.18.2

## New functions

* `format_percent()` as short-cut for `format_value(as_percent = TRUE)`.

* `is_converged()`, to check whether a mixed model has converged or not.

## Changes to functions

* `format_table()` gains an `exact` argument, to either report exact or rounded
  Bayes factors.

* `get_predicted()` gets a method for models of class `gamlss` (and thereby,
  `get_loglikelihood()` now also works for those model classes).

* `get_predicted()` now better handles models of class `polr`, `multinom` and
  `rlm`.

## Bug fixes

* Fixed test failures.

* Minor fixes to address changes in other packages.

# insight 0.18.0

## Breaking changes

* The `ci` argument in `get_predicted()` now defaults to `NULL`. One reason was
  to make the function faster if confidence intervals are not required, which
  was the case for many downstream usages of that function. Please set `ci`
  explicitly to compute confidence intervals for predictions.

* `get_data()` no longer returns logical types for numeric variables that have
  been converted to logicals on-the-fly within formulas (like `y ~ as.logical(x)`).
  Instead, each numeric variable that was coerced to logical within a formula
  gets a `logical` attribute (set to `TRUE`), and the returned data frame gets
  a `logicals` attribute including all names of affected variables.

* `parameters_table()`, the alias for `format_table()`, was removed.

## Changes to functions

* `find_transformation()` and `get_transformation()` now also work for models
  where the response was transformed using `log2()` or `log10()`.

## Bug fixes

* `get_sigma()` for models from package _VGAM_ returned wrong sigma-parameter.

* `find_predictors()` for models from package _fixest_ that contained
  interaction terms in the endogenous formula part did not correctly return
  all instruments.

* Fixed formatting of HTML table footers in `export_table()`.

* Several fixes to `get_predicted()` for models from `mgcv::gam()`.

* The `component` argument in `find_parameters()` for `stanmvreg` models did
  not accept the `"location"` value.

* `null_model()` did not consider offset-terms if these were specified inside
  formulas.

* Argument `allow.new.levels` was not passed to `predict()` for
  `get_predicted.glmmTMB()`.

* `clean_names()` now works correctly when several variables are specified in
  `s()` (#573, @etiennebacher).