
Add regularised univariate imputation methods following Deng et al 2016 #438

Merged: 17 commits, Nov 14, 2021

Conversation

@EdoardoCostantini (Member):

I added two regularised univariate imputation methods following the proposal of Deng et al. (2016). The methods currently support continuous and dichotomous target columns. I hope they will be useful both for practical applications and for further methodological research on the performance of these methods.

References:
Deng, Y., Chang, C., Ido, M. S., & Long, Q. (2016). Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific Reports, 6(1), 1-10.
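
For illustration, a minimal usage sketch (not taken from the PR; the dataset and the m/maxit settings are illustrative, and the method keywords are the ones used at this stage of the PR, before the renaming discussed below):

library(mice)
# Direct (durr) and indirect (iurr) regularised imputation through mice()
imp_durr <- mice(nhanes, m = 2, maxit = 2, method = "durr.norm", print = FALSE)
imp_iurr <- mice(nhanes, m = 2, maxit = 2, method = "iurr.norm", print = FALSE)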

@stefvanbuuren (Member) left a comment:

Edoardo, thanks a lot.

Your PR is a very useful addition to mice. It is polished, and it's great that you added test files. I would gladly merge it, but before doing so there are a couple of things to look at:

  • Please review and respond to my comments made per file;
  • Please explain the major steps in the direct and indirect methods in the documentation, and highlight typical cases where durr and iurr could make a difference in practice;
  • Apply the styler package, with its default settings, to the source files so that they conform to the other code in mice;
  • Add yourself as a contributor at the end of the list in DESCRIPTION.

Two more general points to consider:

  • The names iurr and durr are taken from the literature, but users are generally unaware of these abbreviations. Wouldn't names like norm.lasso() and norm.lasso.pre() be easier and more appealing?
  • Do you have a position on making these methods the default over norm() and logreg()? Would they be comparable in terms of quality and computational speed?

.gitignore (outdated; resolved)
DESCRIPTION (outdated; resolved)
R/mice.impute.durr.logreg.R (outdated; resolved)
R/mice.impute.durr.logreg.R (resolved)
Comment on lines 17 to 19
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
@stefvanbuuren (Member):

See above

@EdoardoCostantini (Member Author), Oct 26, 2021:

Same idea as above. See ebf1e17.

dotyobs <- y[ry][s]

# Train imputation model
cv_lasso <- glmnet::cv.glmnet(x = dotxobs, y = dotyobs,
@stefvanbuuren (Member):

Does glmnet::cv.glmnet() add the intercept?

@EdoardoCostantini (Member Author), Oct 29, 2021:

Yes, cv.glmnet() estimates the intercept as well. However, thanks to this comment I noticed a bug in how mice.impute.iurr.norm() treated the x argument: it assumed that x was a full design matrix with a column of 1s for intercept estimation. I corrected this in 73ba929.
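
To see the behaviour being discussed here, a small sketch with simulated data (not code from the PR):

set.seed(123)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- drop(5 + x %*% rnorm(p) + rnorm(n))
# cv.glmnet() fits an intercept by default (intercept = TRUE),
# so x should not contain a column of 1s
fit <- glmnet::cv.glmnet(x = x, y = y)
coef(fit, s = "lambda.min")["(Intercept)", ]   # estimated intercept, close to 5
# A constant column in x is harmless but redundant: its coefficient is shrunk to 0
fit_const <- glmnet::cv.glmnet(x = cbind(1, x), y = y)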

Comment on lines 17 to 19
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
@stefvanbuuren (Member):

As far as I understand, the indirect method consists of three steps: 1) find variables that should remain in the model, 2) reduce the dataset with only the variables that survived step 1, and 3) find imputations given the reduced data set.

Perhaps you could describe these steps in the documentation.

Also: Could we just call mice.impute.logreg() or mice.impute.norm() to do step 3?

@EdoardoCostantini (Member Author):

I agree. Calling mice.impute.logreg() and mice.impute.norm() for step 3 is a great idea!

  • mice.impute.iurr.logreg() was already implementing the same steps as mice.impute.logreg(), so it makes sense to simply call it instead. Furthermore, in the previous version I did not use the augment() function to deal with possible perfect prediction after the selection of the active set of predictors. Calling mice.impute.logreg() directly deals with this as well.
  • With mice.impute.iurr.norm() I was trying to follow as strictly as possible what Deng et al. (2016) described. Deng et al. follow the normal-approximation draw approach to obtain proper imputations, while using mice.impute.norm() implies a more explicitly Bayesian approach. However, there should be no practical difference between the two, so it is a good idea to just call mice.impute.norm() to remain consistent with the mice package's behaviour and language.

I implemented the change in 2f4e660.

I'm working on a commit addressing the documentation more broadly.

@EdoardoCostantini (Member Author), Oct 30, 2021:

As for the description of the steps, I expanded the "Details" section of the function documentation in 033bc17. I now describe the steps of each method more clearly.
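
Schematically, the indirect (iurr) steps for a continuous target look roughly like this (a sketch only, not the code from the PR; y, ry, and x follow the usual mice.impute signature, and we assume at least one predictor is selected):

# Step 1: select the active set with cross-validated lasso on the observed part
cv_lasso <- glmnet::cv.glmnet(x = x[ry, , drop = FALSE], y = y[ry])
beta <- as.matrix(coef(cv_lasso, s = "lambda.min"))[-1, 1]
active <- which(beta != 0)
# Step 2: reduce the predictor matrix to the selected variables
x_reduced <- x[, active, drop = FALSE]
# Step 3: draw the imputations from the reduced model with the existing machinery
imps <- mice.impute.norm(y = y, ry = ry, x = x_reduced)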

Comment on lines 17 to 19
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
@stefvanbuuren (Member):

See above

@EdoardoCostantini (Member Author), Oct 26, 2021:

Same idea as above. See ebf1e17.


# Use univariate imputation model
set.seed(123)
imps <- mice.impute.durr.logreg(y, ry, x)
@stefvanbuuren (Member):

Add print = FALSE to prevent printing. Also at various other places. There should be no terminal output coming from test scripts.

@EdoardoCostantini (Member Author), Oct 30, 2021:

I added print = FALSE where needed to avoid test output in 8f8d921.

Comment on lines 48 to 50
durr_default <- mice(X, m = 2, maxit = 2, method = "durr.logreg", eps = 0)
durr_custom <- mice(X, m = 2, maxit = 2, method = "durr.logreg", eps = 0,
nfolds = 5)
@stefvanbuuren (Member):

The eps parameter seems superfluous here.

@EdoardoCostantini (Member Author):

I removed the eps parameter from the mice() calls in the test scripts in b52d243.
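
For reference, the revised calls would then look roughly like this (a sketch based on the quoted snippet, with print = FALSE added as suggested above; not a verbatim quote of the final test file):

durr_default <- mice(X, m = 2, maxit = 2, method = "durr.logreg", print = FALSE)
durr_custom <- mice(X, m = 2, maxit = 2, method = "durr.logreg", nfolds = 5, print = FALSE)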

These are created by the IDE I use. No one else needs them.
The new functions also install glmnet on demand
…tions

I programmed the functions thinking that the x argument provided to the univariate imputation functions was a design matrix with a vector of 1s as first column.
However, I noticed that sampler.univ() gets rid of the intercept in x before passing it to any mice.impute.method.

Affected:
- The behaviour of the durr method was not impacted by this as any constant is dropped by glmnet.
- The behaviour of iurr was impacted, in its normal version, by an incorrect indexing of the active set.
- With this commit I have also changed the test files to reflect the correct representation of x at the stage it is provided to the mice.impute.method() functions. For these scripts, I have also increased the size of the intercept in the data generating model to monitor clearly how the intercept is treated by cv.glmnet.
The test checks that the logreg methods return objects of the same class in a well-behaved case and in a perfect-separation case.
The two cases differ only by the exclusion/inclusion of a perfect predictor of a dichotomous DV.
This means we are testing for a discrepancy in the behaviour of the same mice.impute.method function caused by perfect prediction.
This means that:
- if the well-behaved case returns a factor (the expected outcome) while the perfect-separation case returns an error or warning, the test fails (glm returns a warning in case of perfect separation);
- if both return a factor, the test passes;
- if a code-breaking bug makes both cases return an error, the test does not fail. This is desirable because such a bug should not be reported as a perfect-prediction problem; the root of the problem is elsewhere.
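
A sketch of this testing pattern (illustrative only; the data generation and object names are assumptions, not the contents of the actual test file, and the method name is the one used at this stage of the PR):

library(mice)
library(testthat)
test_that("durr.logreg returns the same class with and without perfect prediction", {
  set.seed(123)
  n <- 100
  x_wb <- cbind(rnorm(n), rnorm(n))              # well-behaved predictors
  y <- factor(rbinom(n, 1, plogis(x_wb[, 1])))   # dichotomous DV
  x_ps <- cbind(x_wb, as.numeric(y))             # add a perfect predictor
  ry <- rbinom(n, 1, 0.8) == 1                   # response indicator
  imp_wb <- mice.impute.durr.logreg(y, ry, x_wb)
  imp_ps <- mice.impute.durr.logreg(y, ry, x_ps)
  expect_equal(class(imp_wb), class(imp_ps))     # both should return a factor
})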
…() directly

- mice.impute.iurr.logreg() was already implementing the same steps as mice.impute.logreg(), so it makes sense to simply call it instead.
Furthermore, in the previous version I did not use the augment() function to deal with possible perfect prediction after the selection of the active set of predictors. Calling mice.impute.logreg() directly deals with this as well.
- With mice.impute.iurr.norm() I was trying to follow as strictly as possible what Deng et al. (2016) described. Deng et al. follow the normal-approximation draw to obtain proper imputations, while using mice.impute.norm() implies a more explicitly Bayesian approach. However, there should be no practical difference between the two, so it is a good idea to just call mice.impute.norm() to remain consistent with the mice package's behaviour and language.
- testing scripts for .iurr.norm() were adapted to correctly check behaviour
- the description of iurr makes it clearer how the predictorMatrix object is treated
- the description of durr no longer mentions this, as that method is more straightforward in its treatment of predictorMatrix.
Old names -> new names:
- mice.impute.durr.norm()   -> mice.impute.lasso.norm()
- mice.impute.durr.logreg() -> mice.impute.lasso.logreg()
- mice.impute.iurr.norm()   -> mice.impute.lasso.select.norm()
- mice.impute.iurr.logreg() -> mice.impute.lasso.select.logreg()
@EdoardoCostantini (Member Author) commented Oct 30, 2021:

Stef, thank you for the swift feedback on this pull request.

As requested, I responded to the comments per file.
Here, I respond to the more general points you had:

  • Please explain the major steps in the direct and indirect methods in the documentation, and highlight typical cases where durr and iurr could make a difference in practice;

I explained the main steps in 033bc17

  • Apply the styler package, with its default settings, to the source files so that they conform to the other code in mice;

I applied styler to the function and testing scripts in fa82d79

  • Add yourself as a contributor at the end of the list in DESCRIPTION.

Done in 4339012

  • The names iurr and durr are taken from the literature, but users are generally unaware of these abbreviations. Wouldn't names like norm.lasso() and norm.lasso.pre() be easier and more appealing?

In 1185952, I renamed the functions with the following convention:

Old names → new names:

  • mice.impute.durr.norm() → mice.impute.lasso.norm()
  • mice.impute.durr.logreg() → mice.impute.lasso.logreg()
  • mice.impute.iurr.norm() → mice.impute.lasso.select.norm()
  • mice.impute.iurr.logreg() → mice.impute.lasso.select.logreg()

What do you think of them?
I kept the filenames as they were for now so that we don't lose the file history. I'll change the filenames when we settle on the new names.
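
With the new names, the methods are selected in mice() by the corresponding keywords, for example (a minimal sketch; the dataset and parameters are illustrative):

imp <- mice(nhanes, m = 2, maxit = 2, method = "lasso.select.norm", print = FALSE)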

  • Do you have a position on making these methods the default over norm() and logreg()? Would they be comparable in terms of quality and computational speed?

I would summarize my position on this topic with the following considerations:

  1. They are preferable to norm() and logreg() when the data are high-dimensional in nature (i.e. the imputation model has more predictors than observed cases in the variable under imputation). An ideal example among the datasets included in mice is the SE Fireworks disaster data. I think this could be a great dataset to work with for a vignette.
  2. With low-dimensional data, they can still help by reducing the burden of choosing which predictors to include in the imputation models (as they make this choice in a data-driven way with little input from the user), while achieving performance comparable to a well-specified norm() and logreg().
  3. They are considerably more computationally intensive. In a simulation set-up I ran for a study, durr and iurr took around 1 minute to impute 6 variables, while a mice call with the impute.norm method and a given "correct" predictorMatrix took around 15 seconds (maxit = 50, m = 10). In that study durr and iurr were also implemented to draw imputations from a single chain of draws, so I expect this mice implementation to take longer.
  4. There is an interplay between the predictive performance of the lasso and the treatment/nature of categorical predictors.
    When categorical predictors are present, it is possible to use the group lasso penalty (Yuan & Lin, 2006) instead of the standard lasso. This approach forces predictors that belong to the same group to be either all included or all excluded. For categorical predictors, this means that the group lasso either drops all dummy codes representing a single categorical predictor or includes them all, whereas the standard lasso keeps or drops dummy codes as if they were independent variables (see the sketch after this list). Simon et al. (2013) discuss how it is reasonable to use the standard lasso when categorical predictors have many levels and the group lasso when they have few levels; they also discuss how the sparse group lasso is better for in-between situations.
    The methods I have implemented here use the standard lasso penalty, as proposed by Deng et al. (2016). However, before recommending iurr and durr as defaults, further research should be done to understand how Simon et al.'s findings play a role in multiple imputation.
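
A small illustration of the dummy-code point (a sketch; the group lasso itself is not part of this PR, and the data are simulated):

set.seed(123)
n <- 100
grp <- factor(sample(letters[1:4], n, replace = TRUE))   # one categorical predictor
z <- matrix(rnorm(n * 3), n, 3)                          # three continuous predictors
y <- 0.5 * (grp == "b") + z[, 1] + rnorm(n)
# model.matrix() expands the factor into separate dummy columns
X <- cbind(model.matrix(~ grp)[, -1], z)
fit <- glmnet::cv.glmnet(x = X, y = y)
# The standard lasso may keep some dummies of grp and drop others;
# a group lasso penalty would keep or drop all of them together.
coef(fit, s = "lambda.min")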

Considering points 3 and 4, I think durr and iurr should still be adopted as a conscious decision rather than offered as the default.

References

Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2), 231-245.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.

EdoardoCostantini and others added 4 commits November 1, 2021 15:27
This allows the edge case in which the x argument of the methods has only one predictor to run.
The presence of a vector of 1s in the matrix provided to the cv.glmnet function does not affect its performance in any way: cv.glmnet() shrinks the regression coefficients of constant columns to 0.
@stefvanbuuren stefvanbuuren merged commit 18372fd into amices:master Nov 14, 2021
@stefvanbuuren (Member):

Thanks a million. Now merged.
