
Request: make predictions possible after using pool() #82

Open
oren0e opened this issue Apr 8, 2018 · 27 comments

@oren0e

oren0e commented Apr 8, 2018

When using mice to generate multiple imputed data sets, fitting a model (like an elastic net from the glmnet package) on each of them, and then pooling the estimates into a single set, it's not possible to use predict.glmnet() (or predict()) on the object that comes out of mice::pool().
The prediction functions expect a fitted model object, while mice::pool() returns

An object of class mipo, which stands for 'multiple imputation pooled outcome'.

Is it possible to add the ability to make predictions from the return object of mice::pool()?
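For illustration, a minimal sketch of the failing pattern, with lm() standing in for the elastic net:

library(mice)
imp    <- mice(nhanes, print = FALSE)   # multiply imputed data sets
fits   <- with(imp, lm(chl ~ bmi))      # one fit per imputed data set
pooled <- pool(fits)                    # class "mipo", not a fitted model
# predict(pooled)                       # fails: there is no predict() method for mipo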

@stefvanbuuren
Member

There are actually multiple ways to create predictions from multiply imputed data. See Obtaining Predictions from Models Fit to Multiply Imputed Data for a comparison of two methods. I tend to deviate from Miles's recommendation and would generally prefer "predict, then combine", because that corresponds to Rubin's rules and accounts for nonlinearities in the proper place.

Also see the discussion by Wood et al., who highlight the decisions to be made when accounting for model optimism.

It should be relatively straightforward to implement any of these approaches in mice.
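As a rough sketch of the contrast, on the built-in nhanes data (the model is purely illustrative):

library(mice)
imp <- mice(nhanes, m = 5, seed = 1, print = FALSE)
fit <- with(imp, lm(chl ~ bmi))

# Combine, then predict: pool the coefficients, then form predictions from them
est <- summary(pool(fit))$estimate
pred_cp <- est[1] + est[2] * complete(imp, 1)$bmi  # a choice of completed covariates is needed

# Predict, then combine: predict within each imputed set, then pool
pred_pc <- rowMeans(sapply(getfit(fit), predict))

# For a linear model the two coincide; with nonlinear transformations they
# differ, which is why the order of operations matters.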

@mnarayan

Just want to add that this feature would be quite useful. I'm a bit surprised by the recommendations in the Miles paper; it doesn't really show the regimes where combine-then-predict would break down substantially.

@SolomonMg

Any word on this feature? Would be very helpful!

@stefvanbuuren
Member

I have done some thinking about mice prediction workflows. Here are two possibilities: one involving only a train dataset, the other involving both a train and a test dataset.

A

One data set: train, $n$ rows, $m$ imputations:

  • Impute the missing data;
  • Fit the complete-data model to each imputed data set;
  • Obtain predictions per imputed data set;
  • Pool predictions and prediction errors
  • Bind the result to train.
1. Standard mice: imp <- mice(train, …)
2. Standard complete-data analysis: fit <- with(imp, mymodel, …)
3. Calculate predictions: predlist <- lapply(fit$analyses, broom::augment.xxx, se.fit = TRUE, …). We may need the data argument for some procedures. Perhaps use map().
4. Apply Rubin's rules to pool predictions: predmat <- pool(predlist, …), which results in an n × 3 matrix: .fitted, .residuals, .se.fit
5. Bind to data: ready <- bind_cols(train, predmat)
  • Supports all broom::augment.xxx() functions.
  • Steps 3 and 4 can jointly form an augment.mira() function, a kind of meta-augment.

B

Two data sets: train ($n_0$ rows) and test ($n_1$ rows), $m$ imputations.

  • Estimate the imputation model on train only;
  • Impute the missing data in both train and test;
  • Fit the complete-data model to each imputed data set using train only;
  • Obtain predictions per imputed data set in test only;
  • Pool predictions and prediction errors;
  • Bind the result to test.
1. Standard mice: imp.train <- mice(train, …)
2. Impute test data: imp.test <- mice.mids(imp.train, newdata = test)
3. fit <- with(imp.test, mymodel, …)
4. Apply steps A3 and A4
5. Bind to data: ready <- bind_cols(test, predmat)

A and B are extreme scenarios, and obviously, we can mix aspects of both.

What I would like to know is whether A or B as described would be useful for applications. Suggestions for alternatives and tweaks are welcome.
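To make B concrete, here is a rough sketch on the built-in nhanes data, assuming a mice.mids(newdata = ...) interface as proposed in step B2:

library(mice)
set.seed(1)
idx   <- sample(nrow(nhanes), 18)
train <- nhanes[idx, ]
test  <- nhanes[-idx, ]

imp.train <- mice(train, m = 5, print = FALSE)            # B1: impute train
imp.test  <- mice.mids(imp.train, newdata = test)         # B2: impute test from the train model
fit       <- with(imp.test, lm(chl ~ bmi + hyp))          # B3: complete-data model
predlist  <- lapply(getfit(fit), predict, se.fit = TRUE)  # B4: then pool as in A3-A4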

@cryanking

I came here to request exactly option B. My application is to use MICE as part of a processing pipeline for predictive models, where the standard is to have a true hold-out set. In deployment I also don't want to update the estimation, just apply it (including the imputation model) to the example at hand. It's somewhat complicated with predictive mean matching because (I think) the complete training dataset is needed at inference time to select the replacements at each step. Exporting the model to something lightweight would be huge.

@JonNDCN

JonNDCN commented May 28, 2021

I'd be very happy with A.

@angelrodriguez2020

Dear all,
I'd like to know the current status of this issue.
In fact, I only need the pooled predicted values, not their standard errors, so I suppose averaging the predictions would be enough. If mice does not provide that yet, would this script be sensible (with 13 imputations)?

fit <- with(impsppb, glm(compo ~ edad + sexo + fumarw2, family = quasibinomial))
for (i in 1:13) {
  assign(paste0("fit", i),
         data.frame(fit[["analyses"]][[i]][["data"]][["hi1"]],
                    fit[["analyses"]][[i]][["fitted.values"]]))
  assign(paste0("fit", i), `names<-`(get(paste0("fit", i)), c("hi1", "pred")))
}
fittot <- Reduce(function(x, y) merge(x = x, y = y, by = "hi1"),
                 list(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8, fit9,
                      fit10, fit11, fit12, fit13))
fittot$predef <- rowMeans(fittot[, 2:14], na.rm = TRUE)
fittot <- fittot[c(1, 15)]

Thank you.

@ASKurz

ASKurz commented Oct 20, 2021

Option A would be spectacular. It's exactly what I'd want to teach to my students.

@stjeandenise

Another vote for Option A!! I came looking for exactly that.

@ASKurz

ASKurz commented Nov 12, 2021

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

@stjeandenise

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

🤗 you are my favorite person @ASKurz

@ASKurz

ASKurz commented Nov 12, 2021

Cheers! 🍻

@JonNDCN

JonNDCN commented Nov 12, 2021

Thank you so much @ASKurz, just what I needed

@stefvanbuuren
Member

stefvanbuuren commented Nov 14, 2021

The script below uses the @ASKurz example in a more micy way. It is essentially Rubin's rules applied to model predictions instead of model parameters.

library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
                 dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}

Perhaps we could evolve this into a predict.mira() function.

@markdanese

FWIW, predm[[1]] doesn't have a df object for glm, so dfcom ends up being NULL. It probably makes sense to get it from getfit(fit)[[1]]$df.null assuming the null df is the correct one. Apologies if I am missing something obvious.

@stefvanbuuren
Member

@markdanese Thanks for flagging this. The code is an example. It works for lm(), but needs tweaking for other models.
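For glm() fits, one possible tweak (an untested sketch) is to take the residual degrees of freedom from the fitted object instead:

# hypothetical replacement for dfcom: predict.glm() returns no df element
dfcom <- df.residual(getfit(fit)[[1]])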

@RStellato

Perhaps we could evolve this into a predict.mira() function.

Would really appreciate that implementation!

@reetakankaanpaa

Hi, option A above looks tempting; I'm trying to plot predicted values from an lme model using multilevel multiply imputed longitudinal data.

However, when I run the code in the third step,
predlist <- lapply(fit$analysis, broom::augment.xxx, se.fit = TRUE)
I repeatedly get an error: 'augment.xxx' is not an exported object from 'namespace:broom', regardless of whether I use broom.mixed::augment.lme.

I tried it with the nlme Orthodont data too, using
fit <- with(imp, lm(distance~age*Sex, data = Orthodont))
predlist <- lapply(fit$analysis, broom::augment, se.fit = TRUE, data = Orthodont)
predmat <- pool(predlist)
Here, I get an error: Error in object$analyses[[1]] : subscript out of bounds

Could someone please advise how to get fitted values for plotting?

I also found the instructions in Haile, Sarah (2019), Multiple Imputation using R, and noticed the remark: "However van Buuren appears to disagree with this approach, which is perhaps why it's not part of mice yet."

@stefvanbuuren
Member

Your code says fit$analysis. Does it help to say fit$analyses?

I don't know Haile.

@ne-jos

ne-jos commented Apr 14, 2023

I am trying to calculate predicted probabilities from a multiply imputed mixed-effects logistic regression model:

df_select <- df_collapsed %>%
  select("op.day", "age.admission.nmds", "gender", "ethnicity2", "asa2",
         "C3.score.categories", "extent.nzcr", "site", "dep13.nmds.quint",
         "op.year", "mortality.90d", "dhb.nmds", "days.to.surg")
imp <- mice(df_select, m = 10)
fit <- with(imp, glmer(mortality.90d ~ op.day + age.admission.nmds + gender +
                         ethnicity2 + asa2 + C3.score.categories + extent.nzcr +
                         site + dep13.nmds.quint + op.year + days.to.surg +
                         (1 | dhb.nmds), family = binomial()))

I have tried option A but I am running into the same problem as mentioned above; switching 'analysis' to 'analyses' does not seem to help. Any suggestions would be greatly appreciated.

@thomvolker
Member

Perhaps using broom.mixed:::augment.merMod will give you the correct predictions. However, pool() will not work, because pooling currently only works on the level of the analyses (i.e., pooling the estimated parameters), not on the level of the data (i.e., pooling the predictions). If you want to pool the predictions, you will have to average the predictions and calculate the variances manually.
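For example, a hand-rolled sketch, assuming fit is the mira object from the glmer() call above:

# average the predicted probabilities across the imputed data sets by hand
predm <- sapply(getfit(fit), predict, type = "response")
pbar  <- rowMeans(predm)        # pooled point predictions
bvar  <- apply(predm, 1, var)   # between-imputation variance only; the
                                # within-imputation part would still need
                                # e.g. lme4::bootMer() or the delta method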

@fthielen

The script below uses the @ASKurz example in a more micy way. […] Perhaps we could evolve this into a predict.mira() function.

Is this the most recent solution for lm() models? predict.mira() does not seem to be implemented (yet).

@gerkovink
Member

Voilà. Will need to think about the default choices that are made in predict.lm() before we implement this further.

library(purrr)
library(magrittr)
library(dplyr)
library(mice)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

imp <- b_small %>% 
  mice(seed = 540, m = 10, method = "norm", print = FALSE)

# fit with map() workflow
fit1 <- imp %>% 
  complete("all") %>%
  map(~.x %$% lm(iqv ~ 1 + ses))

# fit with with.mids() workflow
fit2 <- with(imp, lm(iqv ~ 1 + ses))

# function to pool individual values cf. Rubin's rules
predict.mira <- function(fit, tiny = TRUE){
  # preprocess to allow for flexible workflows
  if("mira" %in% class(fit)){
    fit <- fit %>% 
      .$analyses # extract the fitted analyses
  } 
  # create predictions and intervals
  pred <- fit %>% 
    map(~.x %>%  
          predict(se.fit = TRUE) %>% # create predictions with corresponding SE
          as.data.frame() %>% 
          select(fit, se.fit, df)) # purge irrelevant
  
  # pool predictions
  # convert predictions to array and call pool.scalar over dimensions 1 and 2
  pooled <- array(unlist(pred), dim = c(dim(pred[[1]]), length(pred))) %>% 
    apply(., c(1), function(x) pool.scalar(Q = x[1,], 
                                           U = x[2,]^2, 
                                           n = x[3,1] + 1)) %>% 
    sapply(unlist) %>% 
    t() %>% 
    as_tibble() %>% 
    mutate(se.fit = sqrt(t))
  
  if(!tiny){
    return(pooled)
  } else 
    return(pooled %>% select(qbar, se.fit, df))
}

#tiny output
predict.mira(fit1)
#> # A tibble: 50 × 3
#>       qbar se.fit    df
#>      <dbl>  <dbl> <dbl>
#>  1 -1.57    0.489 27.2 
#>  2 -1.12    1.29   3.45
#>  3 -0.578   0.335 29.3 
#>  4  2.40    0.719 31.1 
#>  5  0.450   0.865  4.94
#>  6  0.239   1.10   3.36
#>  7  0.0420  0.315 32.0 
#>  8 -0.578   0.335 29.3 
#>  9  0.0420  0.315 32.0 
#> 10  1.03    0.430 32.7 
#> # ℹ 40 more rows
predict.mira(fit2)
#> # A tibble: 50 × 3
#>       qbar se.fit    df
#>      <dbl>  <dbl> <dbl>
#>  1 -1.57    0.489 27.2 
#>  2 -1.12    1.29   3.45
#>  3 -0.578   0.335 29.3 
#>  4  2.40    0.719 31.1 
#>  5  0.450   0.865  4.94
#>  6  0.239   1.10   3.36
#>  7  0.0420  0.315 32.0 
#>  8 -0.578   0.335 29.3 
#>  9  0.0420  0.315 32.0 
#> 10  1.03    0.430 32.7 
#> # ℹ 40 more rows
# full output
predict.mira(fit1, tiny = FALSE)
#> # A tibble: 50 × 29
#>        m    qhat1    qhat2  qhat3   qhat4   qhat5  qhat6  qhat7   qhat8   qhat9
#>    <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
#>  1    10 -1.34    -1.73    -1.90  -1.81   -1.58   -1.46  -1.18  -1.41   -1.47  
#>  2    10  0.135    0.0911  -0.663 -0.291  -2.49   -3.18  -2.03  -1.06   -0.122 
#>  3    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  4    10  1.97     2.54     2.42   2.79    2.05    2.61   2.49   2.14    2.12  
#>  5    10 -0.794   -0.130    1.18   0.263   0.818  -0.448  0.973  1.57    0.275 
#>  6    10 -1.25     1.64     1.14   0.552   0.0164 -1.01   0.966 -0.796   0.149 
#>  7    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#>  8    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  9    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#> 10    10  0.835    1.07     0.933  1.21    0.803   1.21   1.23   0.921   0.886 
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> #   u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> #   ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>
predict.mira(fit2, tiny = FALSE)
#> # A tibble: 50 × 29
#>        m    qhat1    qhat2  qhat3   qhat4   qhat5  qhat6  qhat7   qhat8   qhat9
#>    <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
#>  1    10 -1.34    -1.73    -1.90  -1.81   -1.58   -1.46  -1.18  -1.41   -1.47  
#>  2    10  0.135    0.0911  -0.663 -0.291  -2.49   -3.18  -2.03  -1.06   -0.122 
#>  3    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  4    10  1.97     2.54     2.42   2.79    2.05    2.61   2.49   2.14    2.12  
#>  5    10 -0.794   -0.130    1.18   0.263   0.818  -0.448  0.973  1.57    0.275 
#>  6    10 -1.25     1.64     1.14   0.552   0.0164 -1.01   0.966 -0.796   0.149 
#>  7    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#>  8    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  9    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#> 10    10  0.835    1.07     0.933  1.21    0.803   1.21   1.23   0.921   0.886 
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> #   u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> #   ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>

Created on 2024-06-10 with reprex v2.1.0

@stefvanbuuren
Member

Since predict() is pretty well adopted and implemented, I assume this might work more generally, i.e., for any object for which there is a predict method. Right?

Would there be value in using broom::augment for this feature?

@gerkovink
Member

Yes; should work universally. Don't know if broom::augment() would apply to all of those.
