
Add regularised univariate imputation methods following Deng et al 2016 #438

Merged: 17 commits, Nov 14, 2021

Conversation

@EdoardoCostantini (Member):

I added two regularised univariate imputation methods following the proposal of Deng et al. (2016). The methods currently support continuous and dichotomous target columns. I hope they will be useful both for practical applications and for further methodological research on the performance of these methods.

References:
Deng, Y., Chang, C., Ido, M. S., & Long, Q. (2016). Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific Reports, 6(1), 1-10.
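
For illustration, a minimal usage sketch (not taken from the PR; the dataset and the m/maxit settings are illustrative, and the method keywords are the ones used at this stage of the PR, before the renaming discussed below):

library(mice)
# Direct (durr) and indirect (iurr) regularised imputation through mice()
imp_durr <- mice(nhanes, m = 2, maxit = 2, method = "durr.norm", print = FALSE)
imp_iurr <- mice(nhanes, m = 2, maxit = 2, method = "iurr.norm", print = FALSE)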

@stefvanbuuren (Member) left a comment:

Edoardo, thanks a lot.

Your PR is a very useful addition to mice. It is polished, and it's great that you added test files. I would gladly merge it, but before doing so there are a couple of things to look at:

  • Please review and respond to my comments made per file;
  • Please explain the major steps in the direct and indirect methods in the documentation, and highlight typical cases where durr and iurr could make a difference in practice;
  • Apply the styler package, with its default settings, to the source files so that they conform to the other code in mice;
  • Add yourself as a contributor at the end of the list in DESCRIPTION.

Two more general points to consider:

  • The names iurr and durr are taken from the literature, but users are generally unaware of these abbreviations. Wouldn't names like norm.lasso() and norm.lasso.pre() be easier and more appealing?
  • Do you have a position on making these methods the default over norm() and logreg()? Would they be comparable in terms of quality and computational speed?

.gitignore (outdated; resolved)
DESCRIPTION (outdated; resolved)
R/mice.impute.durr.logreg.R (outdated; resolved)
R/mice.impute.durr.logreg.R (resolved)
Comment on lines 17 to 19
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
@stefvanbuuren (Member):

See above

@EdoardoCostantini (Member Author), Oct 26, 2021:

Same idea as above. See ebf1e17.

dotyobs <- y[ry][s]

# Train imputation model
cv_lasso <- glmnet::cv.glmnet(x = dotxobs, y = dotyobs,
@stefvanbuuren (Member):

Does glmnet::cv.glmnet() add the intercept?

@EdoardoCostantini (Member Author), Oct 29, 2021:

Yes, cv.glmnet() estimates the intercept as well. However, thanks to this comment I noticed a bug in how mice.impute.iurr.norm() treated the x argument: it assumed that x was a full design matrix with a column of 1s for intercept estimation. I corrected this in 73ba929.
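
To see the behaviour being discussed here, a small sketch with simulated data (not code from the PR):

set.seed(123)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- drop(5 + x %*% rnorm(p) + rnorm(n))
# cv.glmnet() fits an intercept by default (intercept = TRUE),
# so x should not contain a column of 1s
fit <- glmnet::cv.glmnet(x = x, y = y)
coef(fit, s = "lambda.min")["(Intercept)", ]   # estimated intercept, close to 5
# A constant column in x is harmless but redundant: its coefficient is shrunk to 0
fit_const <- glmnet::cv.glmnet(x = cbind(1, x), y = y)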

Comment on lines 17 to 19
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
@stefvanbuuren (Member):

As far as I understand, the indirect method consists of three steps: 1) find variables that should remain in the model, 2) reduce the dataset with only the variables that survived step 1, and 3) find imputations given the reduced data set.

Perhaps you could describe these steps in the documentation.

Also: Could we just call mice.impute.logreg() or mice.impute.norm() to do step 3?

@EdoardoCostantini (Member Author):

I agree. Calling mice.impute.logreg() and mice.impute.norm() for step 3 is a great idea!

  • mice.impute.iurr.logreg() was already implementing the same steps as mice.impute.logreg(), so it makes sense to simply call it instead. Furthermore, in the previous version I did not use the augment() function to deal with possible perfect prediction after the selection of the active set of predictors. Calling mice.impute.logreg() directly deals with this as well.
  • With mice.impute.iurr.norm() I was trying to follow as strictly as possible what Deng et al. (2016) described. Deng et al. follow the normal-approximation draw approach to obtain proper imputations, while using mice.impute.norm() implies a more explicitly Bayesian approach. However, there should be no practical difference between the two, so it is a good idea to just call mice.impute.norm() to remain consistent with the mice package's behaviour and language.

I implemented the change in 2f4e660.

I'm working on a commit addressing the documentation more broadly.

@EdoardoCostantini (Member Author), Oct 30, 2021:

As for the description of the steps, I expanded the "Details" section of the function documentation in 033bc17. I now describe the steps of each method more clearly.
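
Schematically, the indirect (iurr) steps for a continuous target look roughly like this (a sketch only, not the code from the PR; y, ry, and x follow the usual mice.impute signature, and we assume at least one predictor is selected):

# Step 1: select the active set with cross-validated lasso on the observed part
cv_lasso <- glmnet::cv.glmnet(x = x[ry, , drop = FALSE], y = y[ry])
beta <- as.matrix(coef(cv_lasso, s = "lambda.min"))[-1, 1]
active <- which(beta != 0)
# Step 2: reduce the predictor matrix to the selected variables
x_reduced <- x[, active, drop = FALSE]
# Step 3: draw the imputations from the reduced model with the existing machinery
imps <- mice.impute.norm(y = y, ry = ry, x = x_reduced)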

Comment on lines 17 to 19
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
@stefvanbuuren (Member):

See above

@EdoardoCostantini (Member Author), Oct 26, 2021:

Same idea as above. See ebf1e17.


# Use univariate imputation model
set.seed(123)
imps <- mice.impute.durr.logreg(y, ry, x)
@stefvanbuuren (Member):

Add print = FALSE to prevent printing. Also at various other places. There should be no terminal output coming from test scripts.

@EdoardoCostantini (Member Author), Oct 30, 2021:

I added print = FALSE where needed to avoid test output in 8f8d921.

Comment on lines 48 to 50
durr_default <- mice(X, m = 2, maxit = 2, method = "durr.logreg", eps = 0)
durr_custom <- mice(X, m = 2, maxit = 2, method = "durr.logreg", eps = 0,
nfolds = 5)
@stefvanbuuren (Member):

The eps parameter seems superfluous here.

@EdoardoCostantini (Member Author):

I removed the eps parameter from the mice() calls in the test scripts in b52d243.
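
For reference, the revised calls would then look roughly like this (a sketch based on the quoted snippet, with print = FALSE added as suggested above; not a verbatim quote of the final test file):

durr_default <- mice(X, m = 2, maxit = 2, method = "durr.logreg", print = FALSE)
durr_custom <- mice(X, m = 2, maxit = 2, method = "durr.logreg", nfolds = 5, print = FALSE)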

These are created by the IDE I use. No one else needs them.
The new functions also install glmnet on demand
…tions

I programmed the functions thinking that the x argument provided to the univariate imputation functions was a design matrix with a vector of 1s as first column.
However, I noticed that sampler.univ() gets rid of the intercept in x before passing it to any mice.impute.method.

Affected:
- The behaviour of the durr method was not impacted by this as any constant is dropped by glmnet.
- The behaviour of iurr was impacted, in its normal version, by an incorrect indexing of the active set.
- With this commit I have also changed the test files to reflect the correct representation of x at the stage it is provided to the mice.impute.method() functions. For these scripts, I have also increased the size of the intercept in the data generating model to monitor clearly how the intercept is treated by cv.glmnet.
The test checks that the logreg methods return objects of the same class in a well-behaved case and in a perfect-separation case.
The two cases differ only by the exclusion/inclusion of a perfect predictor of a dichotomous DV.
This means we are testing for a discrepancy in the behaviour of the same mice.impute.method function caused by perfect prediction.
This means that:
- if the well-behaved case returns a factor (the expected outcome) while the perfect-separation case returns an error or warning, the test fails (glm returns a warning in case of perfect separation);
- if both return a factor, the test passes;
- if a code-breaking bug makes both cases return an error, the test does not fail. This is desirable because such a bug should not be reported as a perfect-prediction problem; the root of the problem is elsewhere.
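
A sketch of this testing pattern (illustrative only; the data generation and object names are assumptions, not the contents of the actual test file, and the method name is the one used at this stage of the PR):

library(mice)
library(testthat)
test_that("durr.logreg returns the same class with and without perfect prediction", {
  set.seed(123)
  n <- 100
  x_wb <- cbind(rnorm(n), rnorm(n))              # well-behaved predictors
  y <- factor(rbinom(n, 1, plogis(x_wb[, 1])))   # dichotomous DV
  x_ps <- cbind(x_wb, as.numeric(y))             # add a perfect predictor
  ry <- rbinom(n, 1, 0.8) == 1                   # response indicator
  imp_wb <- mice.impute.durr.logreg(y, ry, x_wb)
  imp_ps <- mice.impute.durr.logreg(y, ry, x_ps)
  expect_equal(class(imp_wb), class(imp_ps))     # both should return a factor
})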
…() directly

- mice.impute.iurr.logreg() was already implementing the same steps as mice.impute.logreg(), so it makes sense to simply call it instead.
Furthermore, in the previous version I did not use the augment() function to deal with possible perfect prediction after the selection of the active set of predictors. Calling mice.impute.logreg() directly deals with this as well.
- With mice.impute.iurr.norm() I was trying to follow as strictly as possible what Deng et al. (2016) described. Deng et al. follow the normal-approximation draw to obtain proper imputations, while using mice.impute.norm() implies a more explicitly Bayesian approach. However, there should be no practical difference between the two, so it is a good idea to just call mice.impute.norm() to remain consistent with the mice package's behaviour and language.
- testing scripts for .iurr.norm() were adapted to correctly check behaviour
- the description of iurr makes it clearer how the predictorMatrix object is treated
- the description of durr no longer mentions this, as that method is more straightforward in its treatment of predictorMatrix.
Old names -> new names:
- mice.impute.durr.norm()   -> mice.impute.lasso.norm()
- mice.impute.durr.logreg() -> mice.impute.lasso.logreg()
- mice.impute.iurr.norm()   -> mice.impute.lasso.select.norm()
- mice.impute.iurr.logreg() -> mice.impute.lasso.select.logreg()
@EdoardoCostantini (Member Author) commented Oct 30, 2021:

Stef, thank you for the swift feedback on this pull request.

As requested, I responded to the comments per file.
Here, I respond to the more general points you had:

  • Please explain the major steps in the direct and indirect methods in the documentation, and highlight typical cases where durr and iurr could make a difference in practice;

I explained the main steps in 033bc17

  • Apply the styler package, with its default settings, to the source files so that they conform to the other code in mice;

I applied styler to the function and testing scripts in fa82d79

  • Add yourself as a contributor at the end of the list in DESCRIPTION.

Done in 4339012

  • The names iurr and durr are taken from the literature, but users are generally unaware of these abbreviations. Wouldn't names like norm.lasso() and norm.lasso.pre() be easier and more appealing?

In 1185952, I renamed the functions with the following convention:

Old names → new names:

  • mice.impute.durr.norm() → mice.impute.lasso.norm()
  • mice.impute.durr.logreg() → mice.impute.lasso.logreg()
  • mice.impute.iurr.norm() → mice.impute.lasso.select.norm()
  • mice.impute.iurr.logreg() → mice.impute.lasso.select.logreg()

What do you think of them?
I kept the filenames as they were for now so that we don't lose the file history. I'll change the filenames when we settle on the new names.
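
With the new names, the methods are selected in mice() by the corresponding keywords, for example (a minimal sketch; the dataset and parameters are illustrative):

imp <- mice(nhanes, m = 2, maxit = 2, method = "lasso.select.norm", print = FALSE)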

  • Do you have a position on making these methods the default over norm() and logreg()? Would they be comparable in terms of quality and computational speed?

I would summarize my position on this topic with the following considerations:

  1. They are preferable to norm() and logreg() when the data are high-dimensional in nature (i.e. the imputation model has more predictors than observed cases in the variable under imputation). An ideal example among the datasets included in mice is the SE Fireworks disaster data. I think this could be a great dataset to work with for a vignette.
  2. With low-dimensional data, they can still help by reducing the burden of choosing which predictors to include in the imputation models (as they make this choice in a data-driven way with little input from the user), while achieving performance comparable to a well-specified norm() and logreg().
  3. They are considerably more computationally intensive. In a simulation set-up I ran for a study, durr and iurr took around 1 minute to impute 6 variables, while a mice call with the impute.norm method and a given "correct" predictorMatrix took around 15 seconds (maxit = 50, m = 10). In that study durr and iurr were also implemented to draw imputations from a single chain of draws, so I expect this mice implementation to take longer.
  4. There is an interplay between the predictive performance of the lasso and the treatment/nature of categorical predictors.
    When categorical predictors are present, it is possible to use the group lasso penalty (Yuan & Lin, 2006) instead of the standard lasso. This approach forces predictors that belong to the same group to be either all included or all excluded. For categorical predictors, this means that the group lasso either drops all dummy codes representing a single categorical predictor or includes them all, whereas the standard lasso keeps or drops dummy codes as if they were independent variables (see the sketch after this list). Simon et al. (2013) discuss how it is reasonable to use the standard lasso when categorical predictors have many levels and the group lasso when they have few levels; they also discuss how the sparse group lasso is better for in-between situations.
    The methods I have implemented here use the standard lasso penalty, as proposed by Deng et al. (2016). However, before recommending iurr and durr as defaults, further research should be done to understand how Simon et al.'s findings play a role in multiple imputation.
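
A small illustration of the dummy-code point (a sketch; the group lasso itself is not part of this PR, and the data are simulated):

set.seed(123)
n <- 100
grp <- factor(sample(letters[1:4], n, replace = TRUE))   # one categorical predictor
z <- matrix(rnorm(n * 3), n, 3)                          # three continuous predictors
y <- 0.5 * (grp == "b") + z[, 1] + rnorm(n)
# model.matrix() expands the factor into separate dummy columns
X <- cbind(model.matrix(~ grp)[, -1], z)
fit <- glmnet::cv.glmnet(x = X, y = y)
# The standard lasso may keep some dummies of grp and drop others;
# a group lasso penalty would keep or drop all of them together.
coef(fit, s = "lambda.min")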

Considering points 3 and 4, I think durr and iurr should still be adopted as a conscious decision rather than offered as the default.

References

Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2), 231-245.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.

EdoardoCostantini and others added 4 commits November 1, 2021 15:27
This allows the edge case in which the x argument of the methods has only one predictor to run.
The presence of a vector of 1s in the matrix provided to the cv.glmnet function does not affect its performance in any way: cv.glmnet() shrinks the regression coefficients of constant columns to 0.
@stefvanbuuren stefvanbuuren merged commit 18372fd into amices:master Nov 14, 2021
@stefvanbuuren (Member):

Thanks a million. Now merged.
