Add regularised univariate imputation methods following Deng et al 2016 #438
Conversation
Eduardo, thanks a lot. Your PR is a very useful addition to mice. It is polished, and it's great that you added test files. I would gladly merge it, but before doing so there are a couple of things to look at:
- Please review and respond to my comments made per file;
- Please explain the major steps in the direct and indirect methods in the documentation, and highlight typical cases where durr and iurr could make a difference in practice;
- Apply the styler package to style the source files using the styler defaults, to make them conform to other code in mice (see the sketch after this comment);
- Add yourself as a contributor at the end of the list in DESCRIPTION.
Two more general points to consider:
- The names iurr and durr are taken from the literature, but users are generally unaware of these abbreviations. Wouldn't names like norm.lasso() and norm.lasso.pre() be easier and more appealing?
- Do you have a position on making these methods the default over norm() and logreg()? Would they be comparable in terms of quality and computational speed?
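For the styler step, one minimal way to apply the package-wide defaults (assuming styler is installed) is:

```r
# Restyle every R source file in the package with the styler defaults
styler::style_pkg()
```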
R/mice.impute.durr.norm.R
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
See above
Same idea as above. See ebf1e17.
R/mice.impute.durr.norm.R
dotyobs <- y[ry][s]

# Train imputation model
cv_lasso <- glmnet::cv.glmnet(x = dotxobs, y = dotyobs,
Does glmnet::cv.glmnet() add the intercept?
Yes, cv.glmnet estimates the intercept as well. However, thanks to this comment I noticed a bug in how mice.impute.iurr.norm() was treating the x argument: it was treating x as a full design matrix with a column of 1s for the intercept estimation. I corrected this in 73ba929.
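As a quick illustration of this point (a hedged sketch with made-up data, not code from the PR): cv.glmnet() fits the intercept itself, so x should contain only the predictors, without a leading column of 1s.

```r
library(glmnet)

set.seed(123)
n <- 100
p <- 10
x <- matrix(rnorm(n * p), n, p)           # predictors only: no column of 1s
y <- drop(1 + x %*% rnorm(p) + rnorm(n))  # true intercept of 1

cv_fit <- glmnet::cv.glmnet(x = x, y = y)
coef(cv_fit, s = "lambda.min")            # the "(Intercept)" row is estimated by glmnet
```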
R/mice.impute.iurr.logreg.R
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
As far as I understand, the indirect method consists of three steps: 1) find the variables that should remain in the model, 2) reduce the dataset to only the variables that survived step 1, and 3) find imputations given the reduced dataset. Perhaps you could describe these steps in the documentation.
Also: could we just call mice.impute.logreg() or mice.impute.norm() to do step 3?
I agree. Calling mice.impute.logreg() and mice.impute.norm() for step 3 is a great idea!
- mice.impute.iurr.logreg() was already implementing the same steps as mice.impute.logreg(), so it makes sense to simply call it instead. Furthermore, in the previous version I did not use the augment() function to deal with possible perfect prediction after the selection of the active set of predictors. Calling mice.impute.logreg() directly deals with this as well.
- With mice.impute.iurr.norm() I was trying to follow as strictly as possible what Deng et al. 2016 described. Deng et al. follow the normal-approximation draw approach to obtain proper imputations, while using mice.impute.norm() implies a more explicitly Bayesian approach. However, there should be no practical differences between the two, so it is a good idea to just call mice.impute.norm() to remain consistent with the mice package's behaviour and language.
I implemented the change in 2f4e660.
I'm working on a commit addressing the documentation more broadly.
As for the description of the steps, I expanded the "Details" section of the function docs in 033bc17. I now describe the steps of each method more clearly.
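To make the three steps concrete, here is a minimal sketch of the indirect (iurr) idea for a continuous target. This is an illustration written for this review, not the PR code; the function name is hypothetical, and a real implementation would guard against an empty active set.

```r
mice.impute.iurr.norm.sketch <- function(y, ry, x, nfolds = 10, ...) {
  # Step 1: find the variables that should remain in the model via
  # cross-validated lasso on the observed cases
  cv_fit <- glmnet::cv.glmnet(x = x[ry, , drop = FALSE], y = y[ry],
                              nfolds = nfolds)
  betas <- as.matrix(coef(cv_fit, s = "lambda.min"))[-1, ]  # drop intercept row
  active <- which(betas != 0)

  # Step 2: reduce the dataset to the variables that survived step 1
  # (hypothetical sketch: no guard here for an empty active set)
  x_red <- x[, active, drop = FALSE]

  # Step 3: find imputations given the reduced dataset, delegating the
  # Bayesian draw to the standard mice routine
  mice::mice.impute.norm(y = y, ry = ry, x = x_red, ...)
}
```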
R/mice.impute.iurr.norm.R
#' When using only mice.impute.iurr methods, the user can provide the default
#' predictor matrix. The method will then take care of selecting which variables are
#' important for imputation.
See above
Same idea as above. See ebf1e17.
# Use univariate imputation model
set.seed(123)
imps <- mice.impute.durr.logreg(y, ry, x)
Add print = FALSE to prevent printing, also at various other places. There should be no terminal output coming from test scripts.
I added print = FALSE where needed to avoid test output, in 8f8d921.
durr_default <- mice(X, m = 2, maxit = 2, method = "durr.logreg", eps = 0)
durr_custom <- mice(X, m = 2, maxit = 2, method = "durr.logreg", eps = 0,
                    nfolds = 5)
The eps parameter seems superfluous here.
I removed the eps parameter from the mice() calls in the test scripts in b52d243.
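For reference, combining this with the earlier print = FALSE change, the test calls would look something like this (a sketch; X stands for the incomplete test data from the snippet above):

```r
library(mice)

# X: the incomplete test dataset (assumed)
durr_default <- mice(X, m = 2, maxit = 2, method = "durr.logreg", print = FALSE)
durr_custom <- mice(X, m = 2, maxit = 2, method = "durr.logreg",
                    nfolds = 5, print = FALSE)
```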
These are created by the IDE I use. No one else needs them.
The new functions also install glmnet on demand
…tions
I programmed the functions thinking that the x argument provided to the univariate imputation functions was a design matrix with a vector of 1s as its first column. However, I noticed that sampler.univ() gets rid of the intercept in x before passing it to any mice.impute.method. Affected:
- The behaviour of the durr method was not impacted by this, as any constant is dropped by glmnet.
- The behaviour of iurr was impacted, in its normal version, by an incorrect indexing of the active set.
- With this commit I have also changed the test files to reflect the correct representation of x at the stage it is provided to the mice.impute.method() functions. For these scripts, I have also increased the size of the intercept in the data-generating model to monitor clearly how the intercept is treated by cv.glmnet.
The test checks that the logreg methods return objects of the same class in a well-behaved case and in a perfect-separation case. The two cases differ only by exclusion/inclusion of a perfect predictor of a dichotomous DV. This means we are testing discrepancy in behaviour of the same mice.impute.method function due to perfect prediction. This means that:
- if the well-behaved case returns a factor (the expected outcome) while the perfect-separation case returns an error or warning, the test fails (glm returns a warning in case of perfect separation);
- if both return a factor, the test passes;
- if there is a code-breaking bug that makes both cases return an error, the test does not fail! This is desirable because it does not point to perfect prediction as a problem to solve; the root of the problem is elsewhere.
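A hedged sketch of this class-comparison idea, using the final (renamed) function name; the fixtures y, ry, x_well_behaved, and x_perfect_sep are assumed and not from the actual test file:

```r
library(testthat)
library(mice)

test_that("perfect separation returns the same class as the well-behaved case", {
  # y, ry, x_well_behaved, x_perfect_sep: assumed test fixtures
  imps_ok <- mice.impute.lasso.logreg(y, ry, x_well_behaved)
  imps_sep <- mice.impute.lasso.logreg(y, ry, x_perfect_sep)
  # An error in the separated case fails the test; two factors pass it
  expect_identical(class(imps_ok), class(imps_sep))
})
```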
- The description of iurr makes it clearer how the predictorMatrix object is treated.
- The description of durr doesn't mention this anymore, as this method is more straightforward in its treatment of predictorMatrix.
Old names → new names:
- mice.impute.durr.norm() → mice.impute.lasso.norm()
- mice.impute.durr.logreg() → mice.impute.lasso.logreg()
- mice.impute.iurr.norm() → mice.impute.lasso.select.norm()
- mice.impute.iurr.logreg() → mice.impute.lasso.select.logreg()
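For illustration, with these names the methods would be requested in mice() via the suffix after "mice.impute." (a sketch; X stands for any incomplete dataset):

```r
library(mice)

# Request the methods by the suffix after "mice.impute."
imp_durr <- mice(X, method = "lasso.norm", print = FALSE)         # direct (durr)
imp_iurr <- mice(X, method = "lasso.select.norm", print = FALSE)  # indirect (iurr)
```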
Stef, thank you for the swift feedback on this pull request. As requested, I responded to the comments per file.
- I explained the main steps in 033bc17.
- I applied the styler package to the source files.
- Done in 4339012.
- In 1185952, I renamed the functions with the following convention (old names → new names):
  - mice.impute.durr.norm() → mice.impute.lasso.norm()
  - mice.impute.durr.logreg() → mice.impute.lasso.logreg()
  - mice.impute.iurr.norm() → mice.impute.lasso.select.norm()
  - mice.impute.iurr.logreg() → mice.impute.lasso.select.logreg()

What do you think of them?
I would summarize my position on this topic with the following considerations:
I think that, considering points 3 and 4, …
This allows the edge case where the x argument of the methods has only one predictor to run. The presence of a vector of 1s in the matrix provided to the cv.glmnet function does not impact its performance in any way: cv.glmnet() will shrink the regression coefficients for constant columns to 0.
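A small sketch of what this edge case looks like (made-up data; glmnet refuses a single-column x, so a constant column is appended and its coefficient is shrunk to 0):

```r
library(glmnet)

set.seed(123)
x <- matrix(rnorm(50), ncol = 1)  # one predictor: glmnet needs >= 2 columns
y <- 2 * x[, 1] + rnorm(50)

x_pad <- cbind(x, 1)              # pad with a vector of 1s
cv_fit <- glmnet::cv.glmnet(x = x_pad, y = y, nfolds = 5)
coef(cv_fit, s = "lambda.min")    # coefficient of the constant column is 0
```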
Thanks a million. Now merged.
I added two regularised univariate imputation methods following what was proposed by Deng et al. 2016. The methods currently support continuous and dichotomous target columns. I hope they will be of interest both for practical use and for further methodological research into the performance of these methods.
References:
Deng, Y., Chang, C., Ido, M. S., & Long, Q. (2016). Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific Reports, 6(1), 1-10.