Major update that improves support for formulas specification #582

Open
wants to merge 35 commits into
base: master

stefvanbuuren
Member

@stefvanbuuren stefvanbuuren commented Sep 11, 2023

  • reintroduces the square predictorMatrix
  • defines conversion functions p2f(), p2c(), f2p(), n2b(), b2n()
  • defines validate.blocks(), validate.predictorMatrix()
  • extends edit.setup() to formulas and blots
  • for reading ease, use ~ 1 for the empty predictor set instead of ~ 0
  • does not automatically set method = "" for variables that are not imputed (NOTE: DECISION REVERTED. SEE BELOW)
  • as far as possible, changes the leading argument to formulas (instead of blocks or predictorMatrix)
  • adds function typecodes() in sampler() to reduce multiple predictorMatrix lines to one (support for multivariate imputation methods)
  • implements new logic in sampler.univ()
  • comments out some tests that depend on hard-coded parameter estimates
  • sharpens test for equality between predictorMatrix and formulas specifications
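
As a rough illustration of the predictorMatrix-to-formulas direction listed above, here is a minimal base R sketch. `pm_to_formulas()` is a hypothetical helper written for this comment, not the PR's `p2f()`, and it only assumes a square matrix with identical row and column names.

```r
# Hypothetical helper (NOT the PR's p2f()): turn a square, named
# predictorMatrix into one formula per row.
pm_to_formulas <- function(pm) {
  stopifnot(is.matrix(pm), identical(rownames(pm), colnames(pm)))
  out <- lapply(rownames(pm), function(y) {
    preds <- colnames(pm)[pm[y, ] != 0]
    # an empty predictor set is written as "~ 1", as in the list above
    if (length(preds) == 0L) preds <- "1"
    reformulate(preds, response = y)
  })
  names(out) <- rownames(pm)
  out
}

vars <- c("age", "bmi", "chl")
pm <- matrix(0, 3, 3, dimnames = list(vars, vars))
pm["bmi", c("age", "chl")] <- 1
pm["chl", c("age", "bmi")] <- 1

forms <- pm_to_formulas(pm)
print(forms)
```

The row for `age` has no predictors, so it becomes `age ~ 1`; the other rows list their non-zero columns on the right-hand side.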

@stefvanbuuren
Member Author

Ideas for further development:

  • add new functions to the YAML so that they appear on the site
  • soft replace of blocks by nest (a character vector of length ncol(data) with block names; the default is colnames(data))
  • Provide a way for the user to see head of design matrix created in sampler.univ(). Add examples that exploit formulas to add interactions, nested variables, by-processing and other advanced models
  • Describe differences and equivalences between predictorMatrix and formulas specification
  • ...

@stefvanbuuren
Member Author

stefvanbuuren commented Sep 11, 2023

  • In preparation to tweaking documentation, converts Rd tags to roxygen2 tags.
  • Adds new functions to YAML

…block. Update make.method() so that homogeneous types and nlevels within a block get an appropriate default method.
…, ] to zero when variable j is member of a block for which no imputations are needed.
@stefvanbuuren
Member Author

stefvanbuuren commented Sep 13, 2023

Commits 5c6bee2 and 755c23a generalise the classic behaviour of the predictorMatrix to blocks.

It works as follows:

  • mice() uses the nimp() function to calculate the number of imputations needed for a given block of variables;
  • if the number of needed imputations in block j is zero, the following happens:
  1. mice() sets method[j] <- ""
  2. mice() sets predictorMatrix[v, ] <- 0 for all variables v in block j

This PR also removes the error message "mice detected constant and/or collinear variables. No predictors were left after their removal." Imputations will be generated without predictors by the intercept-only imputation model (not recommended in general).

WARNING: Setting predictorMatrix[v, ] <- 0 does not prevent imputation of variable v. To prevent imputation of v, specify the appropriate entry of method as "".
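
The two steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the missing-cell count stands in for whatever `nimp()` computes, and the toy `data`, `blocks` and `method` objects are made up for the example.

```r
# Toy inputs; all object names here are assumptions for illustration.
data <- data.frame(
  age = c(1, 2, 3, 4),          # complete
  bmi = c(20, NA, 25, NA),
  chl = c(NA, 180, 200, 190)
)
blocks <- list(a = "age", b = c("bmi", "chl"))
method <- c(a = "pmm", b = "pmm")
vars <- colnames(data)
predictorMatrix <- matrix(1, 3, 3, dimnames = list(vars, vars))
diag(predictorMatrix) <- 0

# Stand-in for nimp(): number of missing cells per block
n_missing <- vapply(
  blocks,
  function(b) sum(is.na(data[, b, drop = FALSE])),
  integer(1)
)

for (j in names(blocks)) {
  if (n_missing[[j]] == 0L) {
    method[j] <- ""                       # step 1: no imputation method
    predictorMatrix[blocks[[j]], ] <- 0   # step 2: zero the rows of the block
  }
}

print(method)
print(predictorMatrix)
```

Note that only the rows of `age` are zeroed; `age` remains available as a predictor (its column is untouched), and it is the empty `method` entry that actually switches imputation off.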

@stefvanbuuren
Member Author

Commit c2da03c cleans up the internal function edit.setup(). It returns the proper formulas of the reduced model, but it is not quite right for meth, vis and post. Added FIXME.

@stefvanbuuren
Member Author

New behaviours

  1. Prevention of NA propagation by removing incomplete predictors. This version detects when a predictor contains missing values that are not imputed. To prevent NA propagation, mice() takes the following actions: 1) removes the incomplete predictor(s) from the RHS, 2) adds the incomplete predictor(s) to the formulas (var ~ 1) and block components, sets method[var] = "", and sets the corresponding predictorMatrix column and row to zero.

  2. The predictorMatrix input can be a square submatrix of the full predictorMatrix. mice() will augment predictorMatrix to the full matrix and always return a p * p named matrix corresponding to the p columns in the data. The inactive variables will have zero columns and rows.

  3. The predictorMatrix input may be unnamed if its size is p * p. An unnamed matrix of any other size generates an error.

Changes

  • Adds support for a tiny predictorMatrix
  • Solves bug in f2p()
  • Adds new function remove.rhs.variables()
  • Adds a validate.mids() check at exit that errors if rownames(predictorMatrix) differ from colnames(data). Some more output tests need to be added.
  • Removes codes designed to work specifically with a non-square predictorMatrix
  • Generates an error if predictorMatrix has fewer rows than length of blocks
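
A minimal sketch of how a named square submatrix could be embedded in the full p * p matrix, and how an unnamed matrix of the wrong size could be rejected. `augment_pm()` is a hypothetical helper for illustration only; it is not a function from this PR.

```r
# Hypothetical helper: embed a small predictorMatrix in the full p * p matrix.
augment_pm <- function(small, data) {
  vars <- colnames(data)
  p <- length(vars)
  if (is.null(rownames(small)) || is.null(colnames(small))) {
    # an unnamed matrix is only acceptable when it is already p * p
    if (!all(dim(small) == p)) stop("an unnamed predictorMatrix must be p * p")
    dimnames(small) <- list(vars, vars)
    return(small)
  }
  stopifnot(
    identical(rownames(small), colnames(small)),
    all(rownames(small) %in% vars)
  )
  full <- matrix(0, p, p, dimnames = list(vars, vars))
  full[rownames(small), colnames(small)] <- small
  full
}

data <- data.frame(age = c(1, NA), bmi = c(NA, 22), chl = c(180, NA))
tiny <- matrix(c(0, 1, 1, 0), 2, 2,
               dimnames = list(c("bmi", "chl"), c("bmi", "chl")))
full <- augment_pm(tiny, data)
print(full)
```

Here `age` is not mentioned in `tiny`, so its row and column in `full` are zero, which matches the description of inactive variables above.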

@stefvanbuuren
Member Author

Exit checks added:

  • rownames(predictorMatrix) must match colnames(data)
  • length of formulas and blocks must be equal
  • length of formulas and method must be equal
  • length of method vector cannot exceed number of variables
  • length of imp and number of variables must be equal
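
The checks listed above could be expressed along these lines. This is a sketch over a plain list, not the PR's `validate.mids()`; the function name `check_exit()` and the toy objects are assumptions made for the example.

```r
# Hypothetical checker: returns a character vector of violated exit checks.
check_exit <- function(x) {
  p <- ncol(x$data)
  problems <- character(0)
  if (!identical(rownames(x$predictorMatrix), colnames(x$data))) {
    problems <- c(problems, "rownames(predictorMatrix) must match colnames(data)")
  }
  if (length(x$formulas) != length(x$blocks)) {
    problems <- c(problems, "length of formulas and blocks must be equal")
  }
  if (length(x$formulas) != length(x$method)) {
    problems <- c(problems, "length of formulas and method must be equal")
  }
  if (length(x$method) > p) {
    problems <- c(problems, "length of method cannot exceed number of variables")
  }
  if (length(x$imp) != p) {
    problems <- c(problems, "length of imp and number of variables must be equal")
  }
  problems
}

vars <- c("age", "bmi", "chl")
good <- list(
  data = data.frame(age = 1:2, bmi = c(NA, 22), chl = c(180, NA)),
  predictorMatrix = matrix(0, 3, 3, dimnames = list(vars, vars)),
  formulas = list(age = age ~ 1, bmi = bmi ~ age, chl = chl ~ age),
  blocks = list(age = "age", bmi = "bmi", chl = "chl"),
  method = c(age = "", bmi = "pmm", chl = "pmm"),
  imp = list(age = NULL, bmi = NULL, chl = NULL)
)
bad <- good
bad$method <- bad$method[-1]   # now shorter than formulas

print(check_exit(good))
print(check_exit(bad))
```

Collecting all violations before reporting, rather than stopping at the first one, is a design choice of this sketch; the PR may well error immediately.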

@stefvanbuuren
Member Author

stefvanbuuren commented Sep 22, 2023

New behaviours and features thus far

  1. TWO SEPARATE INTERFACES FOR MODEL SPECIFICATION: This version promotes two interfaces to specify imputation models: predictor (predictorMatrix + parcel + method) and formula (formulas + method). This version no longer accepts mixes of predictorMatrix and formulas arguments in the call to mice().

  2. NA-PROPAGATION PREVENTION: This version detects when a predictor contains missing values that are not imputed. To prevent NA propagation, mice() can follow two strategies: "Autoremove" (remove the incomplete predictor(s) from the RHS, set method to "", adapt predictorMatrix, formulas and blocks, and write to loggedEvents), or "Autoimpute" (impute the incomplete predictor and adapt method, predictorMatrix, formulas, and so on). "Autoremove" is implemented and is the current default. Use mice(..., autoremove = FALSE) to revert to the old behavior (NA propagation).

  3. SUBMODELS: The predictorMatrix input can be a square submatrix of the full predictorMatrix when its dimensions are named. mice() will augment the tiny predictorMatrix to the full matrix and always return a p * p named matrix corresponding to the p columns in the data. Unmentioned variables are not imputed, and the predictorMatrix, formulas and method are adapted accordingly.

  4. DROP NON-SQUARE PREDICTOR MATRIX: Version 3.0 introduced non-square predictor matrices, but their interpretation turned out to be complex and ambiguous. For clarity, this update works with a predictor matrix that is square, with both dimensions identically named after the variables in the data. Variable groups are now specified through the parcel argument.

  5. NEW PARCEL ARGUMENT. There is a new parcel argument that is easier to use. The print of the mids object shows parcel when it is different from the default. parcel can take over the role of blocks in specification. blocks is soft-deprecated, but still widely used within the program code.

  6. NEW DOTS ARGUMENT: The blots argument is renamed to dots.

  7. EXIT VALIDATION: Adds a new validate.mids() that checks the mids object before exit.
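
To make the "Autoremove" idea from item 2 concrete, here is a base R sketch of detecting unimputed incomplete predictors and zeroing their row and column. The toy objects and the name `uip` in this form are assumptions for illustration; the PR's own logic also adapts formulas, blocks and loggedEvents, which this sketch leaves out.

```r
data <- data.frame(
  age = c(1, 2, NA, 4),          # incomplete, but not imputed below
  bmi = c(20, NA, 25, NA),
  chl = c(NA, 180, 200, 190)
)
vars <- colnames(data)
method <- c(age = "", bmi = "pmm", chl = "pmm")
predictorMatrix <- matrix(1, 3, 3, dimnames = list(vars, vars))
diag(predictorMatrix) <- 0
predictorMatrix["age", ] <- 0

# variables with at least one missing value
incomplete <- vars[colSums(is.na(data)) > 0]
# variables that receive imputations
imputed <- names(method)[method != ""]
# variables used as predictor in at least one imputation model
used <- vars[colSums(predictorMatrix[imputed, , drop = FALSE] != 0) > 0]
# unimputed incomplete predictors: these would propagate NA
uip <- setdiff(intersect(incomplete, used), imputed)

# "autoremove": drop them from every model by zeroing column and row
cleaned <- predictorMatrix
cleaned[, uip] <- 0
cleaned[uip, ] <- 0

print(uip)
print(cleaned)
```

In this toy case `age` is flagged, because it has a missing value, is used to predict `bmi` and `chl`, and has an empty method.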

@stefvanbuuren
Member Author

stefvanbuuren commented Oct 2, 2023

Three proposed changes to new behaviour

  1. NA-PROPAGATION. It is better to use NA-PROPAGATION by default. The reason is that the user becomes aware of a potential model specification problem (e.g. not imputing a variable used as a predictor). mice() should offer two easy ways to solve the problem: "autoremove" and "autoimpute". We prefer the NA-PROPAGATION default because it alerts the user, whereas the other two options would "magically" make the problem disappear (and thereby downgrade model specification hygiene).

  2. The formula of a complete variable is now something like age ~ 1. It is better to use age ~ 0, to signal that for the dependent not even the intercept-only model is used.

  3. The formulas argument returns an environment attached to each formula. This environment does not seem to be necessary in mice(), so it is cleaner to remove the environment.
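
For points 2 and 3, the difference between `~ 1` and `~ 0` and the removal of a formula's environment can be checked in base R. This is only a sketch of the underlying R mechanics, not of how mice() would implement either proposal.

```r
f1 <- age ~ 1
f0 <- age ~ 0

# ~ 1 keeps the intercept, ~ 0 drops it; neither has predictor terms
int1 <- attr(terms(f1), "intercept")
int0 <- attr(terms(f0), "intercept")
labels0 <- attr(terms(f0), "term.labels")

# a formula carries its environment as the ".Environment" attribute;
# dropping the attribute leaves an ordinary formula object
g <- f0
attr(g, ".Environment") <- NULL

print(c(int1 = int1, int0 = int0))
print(is.null(environment(g)))
print(g)
```

One caveat worth keeping in mind: a formula without an environment can still be deparsed and inspected, but functions that look up variables through the formula's environment may behave differently.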

@@ -82,9 +83,23 @@ check.predictorMatrix <- function(predictorMatrix,
)
}

# calculate ynames (variables to impute) for use in check.method()
# NA-propagation prevention
# find all dependent (imputed) variables
hit <- apply(predictorMatrix, 1, function(x) any(x != 0))
Contributor

Can be simplified to: apply(predictorMatrix != 0, 1, any)

# find all variables in data that are not imputed
notimputed <- setdiff(colnames(data), ynames)
# select uip: unimputed incomplete predictors
completevars <- colnames(data)[!apply(is.na(data), 2, sum)]
Contributor

!apply(is.na(data), 2, any) might be more efficient

@@ -157,6 +156,16 @@ check.blocks <- function(blocks, data, calltype = "pred") {
))
}

# save ynames (variables to impute) for use in check.method()
ynames <- unique(as.vector(unname(unlist(blocks))))
Contributor

Is as.vector redundant for the return value from unlist?
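
The three review suggestions above can be checked on toy inputs. The objects below are made up for the comparison and assume `blocks` is a list of character vectors; they are not taken from the package's tests.

```r
vars <- c("a", "b", "c")
pm <- matrix(c(0, 1, 0,
               0, 0, 0,
               1, 1, 0),
             nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# 1. rows with at least one non-zero entry
hit_old <- apply(pm, 1, function(x) any(x != 0))
hit_new <- apply(pm != 0, 1, any)

# 2. complete variables: sum() counts every NA, any() only needs a yes/no
data <- data.frame(a = c(1, NA), b = c(2, 3), c = c(NA, NA))
complete_old <- colnames(data)[!apply(is.na(data), 2, sum)]
complete_new <- colnames(data)[!apply(is.na(data), 2, any)]

# 3. unlist() on a list of character vectors already returns a vector
blocks <- list(b1 = c("a", "b"), b2 = "c")
ynames_old <- unique(as.vector(unname(unlist(blocks))))
ynames_new <- unique(unname(unlist(blocks)))

print(identical(hit_old, hit_new))
print(identical(complete_old, complete_new))
print(identical(ynames_old, ynames_new))
```

On these inputs all three pairs agree, so the suggestions look like behaviour-preserving simplifications, at least for plain matrices, data frames and character blocks.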


Successfully merging this pull request may close these issues.

How should mice behave when variables are not specified in the model
2 participants