Population weights in VIMP #14
Thanks for the question! You're right on that the
Thank you so much for this suggestion! I'm trying to follow it, but I seem to be running into an error. Here's a repro:

```r
library(dplyr)

df <- dplyr::tibble(id = 1:100) %>%
  mutate(
    x1 = rnorm(100),
    x2 = rnorm(100),
    y = x1 + rnorm(100),
    w = rexp(100)
  )

result <- vimp::cv_vim(
  Y = df$y,
  X = dplyr::select(df, x1, x2),
  ipc_weights = df$w,
  indx = 1,
  run_regression = TRUE,
  SL.library = c("SL.glm", "SL.gam"),
  sample_splitting = TRUE
)
```

This gives the following error:
This traces back to here: https://github.com/bdwilliamson/vimp/blob/master/R/utils.R#L114

It appears that I may need to pass something in through "Z" as well? When I try using "Y" or "X", as the documentation seems to suggest I should, I get the following errors:

Z = "Y":

Z = "X" gives the following traceback:

Using

Is this a bug, or is this just unsupported and I shouldn't expect it to work?
Thanks for the great reprex and detailed i/o, and for your patience while working with the package! I've just created a new release, v2.2.6, that fixes this error.

First, there was a problem with the documentation: you have to enter

Second, there were two bugs that were throwing you off:
Using v2.2.6, I can run the following:

```r
library("dplyr")
library("SuperLearner")

set.seed(1234)
df <- dplyr::tibble(id = 1:100) %>%
  mutate(
    x1 = rnorm(100),
    x2 = rnorm(100),
    y = x1 + rnorm(100),
    w = rexp(100)
  )

result <- vimp::cv_vim(
  Y = df$y,
  X = dplyr::select(df, x1, x2),
  ipc_weights = df$w,
  indx = 1,
  run_regression = TRUE,
  SL.library = c("SL.glm", "SL.gam"),
  sample_splitting = TRUE,
  Z = c("1", "2")
)
```

and get the following output:
Thank you so much for the quick fix! I really appreciate a commitment to good statistical software (it feels really rare, sometimes)! A clarifying question: I assume this is something that should be answered in one of your papers on VIMP, but I don't see anything about coarsening weights in the Biometrics paper nor in this one on arXiv: https://arxiv.org/abs/2004.03683
Yes, something is being estimated -- for more details, see Chapter 25.5.3 in van der Vaart (2000); for an example, see "Example 6" in Section 10.4 of https://arxiv.org/abs/2004.03683. In your case, since
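For intuition on how inverse-probability-of-coarsening weights correct for non-representative observation, here is a minimal base R sketch (my own illustration, not vimp internals) for the simplest possible functional, a population mean, when the probability of being observed depends on a covariate:

```r
# Minimal sketch (base R, not vimp code): IPC weighting for a mean.
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)          # population mean of y is 0
p_obs <- plogis(x)         # P(observed | x): larger x, more likely observed
C <- rbinom(n, 1, p_obs)   # C = 1 if fully observed, C = 0 if coarsened

naive <- mean(y[C == 1])   # biased: overrepresents large x, so noticeably positive
ipw <- sum((C / p_obs) * y) / sum(C / p_obs)  # IPC-weighted (Hajek) mean, near 0
```

The same principle underlies `ipc_weights`: each fully observed unit is upweighted by the inverse of its observation probability, so that sample-level functionals target their population-level counterparts.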
I see, this is really helpful, thank you for the pointers. Is there a way for me to configure

By the way, are you planning to submit 2.2.6 to CRAN, or will you just leave it as a development release on GitHub for now?
I regret to say that I have further problems, and this solution no longer appears to work. There are two issues:
I'll start with a repro:

```r
library(vimp)
library(purrr)
library(dplyr)
library(SuperLearner)

set.seed(100)
n <- 500
n_splits <- 10
df <- dplyr::tibble(
  id = 1:n,
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = runif(n),
  y = x1 + 0.25 * x3 + rnorm(n),
  split_id = sample(n_splits, n, replace = TRUE),
  w = rexp(n, 0.9) + 0.1
)
x_df <- select(df, x1, x2, x3)
validRows <- purrr::map(sort(unique(df$split_id)), ~which(.x == df$split_id))
cv_ctl <- SuperLearner::SuperLearner.CV.control(V = n_splits, validRows = validRows)
inner_cv_ctl <- list(list(V = n_splits / 2))
full_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
  Y = df$y,
  X = x_df,
  SL.library = c("SL.glm", "SL.mean"),
  cvControl = cv_ctl,
  innerCvControl = inner_cv_ctl,
  obsWeights = df$w
))
ss_folds <- vimp::make_folds(unique(df$split_id), V = 2)
cross_fitted_f1 <- extract_sampled_split_predictions(
  cvsl_obj = full_fit,
  sample_splitting_folds = ss_folds, full = TRUE, vector = TRUE
)
results_list <- list()
for (cov in names(x_df)) {
  idx <- which(names(x_df) == cov)
  red_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
    Y = full_fit$SL.predict,
    X = x_df[, -idx, drop = FALSE],
    SL.library = c("SL.glm", "SL.mean"),
    cvControl = cv_ctl,
    innerCvControl = inner_cv_ctl,
    obsWeights = df$w
  ))
  cross_fitted_f2 <- extract_sampled_split_predictions(
    cvsl_obj = red_fit,
    sample_splitting_folds = ss_folds, full = FALSE, vector = TRUE
  )
  result <- vimp::cv_vim(
    Y = df$y,
    type = "r_squared",
    cross_fitted_f1 = cross_fitted_f1,
    cross_fitted_f2 = cross_fitted_f2,
    SL.library = c("SL.glm", "SL.mean"),
    cross_fitting_folds = df$split_id,
    sample_splitting_folds = ss_folds,
    run_regression = FALSE,
    V = n_splits / 2,
    # ipc_weights = df$w,
    # Z = "Y"
  )
  results_list[[cov]] <- mutate(result$mat, term = cov)
}
bind_rows(!!!results_list)
```
I get similar results (but a different warning) when I run the example you have in the docs here; I get the following warning and results:
If I crank down the std of the residual with

In the first example, I commented out some case weights that I have in my setting. When I uncomment those two lines, I receive the following error:
I have put the traceback in a gist because it was very long: https://gist.github.com/ddimmery/fe8e517acbb0487ed547745431b04ad6

It's important to note, however, that there is no missingness in the data I'm passing. When we last corresponded, I was able to use population weights through the coarsening weights options to

I'm planning to release the package that uses VIMP within the next few months (it will accompany a few empirical papers which will be coming out at that time), so I'd like to make sure that upstream will be stable (for my purposes: using weights and calculating

[edit] added sessionInfo(): https://gist.github.com/ddimmery/571bced68f284d0033faebe9e915ac56
Thanks for your message, and the detailed reprex. I must not have properly updated the documentation when I made the change to use a vector, rather than a list, for passing in pre-computed CV predictions.
Since the vectors you passed in were the predictions only on the (correct) sample-split folds, a bunch of NAs were getting appended to make them the correct length (since R pads with NAs when you index past the end of a vector). This was leading the point estimates, standard errors, etc. to all be NA. And when using weights, NAs were being passed into the SuperLearner. Here's your reprex back with code that works (both with and without using the weights):
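As an aside, the NA-padding behavior of base R indexing is easy to see in isolation:

```r
# Base R: indexing past a vector's length pads the result with NAs
preds <- c(0.1, 0.5, 0.9)  # e.g., predictions for one sample-split fold only
preds[1:5]                 # subsetting with a longer index vector
#> [1] 0.1 0.5 0.9  NA  NA
```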
Aha! Great, thank you for this. I really appreciate it. It wasn't at all obvious to me that this aspect of the API had changed. I think it's a much simpler and better API, but the docs could make it more clear if

Is there a possibility of getting notifications for breaking API changes? I know it's hard at the moment, since I haven't yet submitted to CRAN (so revdep tools won't work). I'll be submitting in a month or so, but in the meantime I'd really appreciate a heads-up. Moreover, it would be really nice if you could add a unit test for this case (e.g., the basic repro above). Finally, can I get confirmation that my understanding of how to use the coarsening options to get accurately weighted results with pre-computed model predictions is correct?
I want to make sure that (2) doesn't mistakenly negate the use of the weights. I wasn't sure how to interpret your comment that closed #15. My understanding is that in S2 of Section 3.4 there, the

I believe my understanding is accurate, as the following modified example from above (correctly) shows an effect of

```r
library(vimp)
library(purrr)
library(dplyr)
library(SuperLearner)

set.seed(100)
n <- 1000
n_splits <- 10
df <- dplyr::tibble(
  id = 1:n,
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = runif(n),
  y = x1 + 5 * x3 * (x3 <= 0.95) + rnorm(n),
  split_id = sample(n_splits, n, replace = TRUE),
  w = 1 + 10 * (x3 > 0.95)
)
x_df <- select(df, x1, x2, x3)
validRows <- purrr::map(sort(unique(df$split_id)), ~which(.x == df$split_id))
cv_ctl <- SuperLearner::SuperLearner.CV.control(V = n_splits, validRows = validRows)
inner_cv_ctl <- list(list(V = n_splits / 2))
full_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
  Y = df$y,
  X = x_df,
  SL.library = c("SL.glm", "SL.mean"),
  cvControl = cv_ctl,
  innerCvControl = inner_cv_ctl,
  obsWeights = df$w
))
ss_folds <- vimp::make_folds(unique(df$split_id), V = 2)
cross_fitted_f1 <- full_fit$SL.predict
results_list <- list()
for (cov in names(x_df)) {
  idx <- which(names(x_df) == cov)
  red_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
    Y = full_fit$SL.predict,
    X = x_df[, -idx, drop = FALSE],
    SL.library = c("SL.glm", "SL.mean"),
    cvControl = cv_ctl,
    innerCvControl = inner_cv_ctl,
    obsWeights = df$w
  ))
  cross_fitted_f2 <- red_fit$SL.predict
  result <- vimp::cv_vim(
    Y = df$y,
    type = "r_squared",
    indx = idx,
    cross_fitted_f1 = cross_fitted_f1,
    cross_fitted_f2 = cross_fitted_f2,
    SL.library = c("SL.mean"),
    cross_fitting_folds = df$split_id,
    sample_splitting_folds = ss_folds,
    run_regression = FALSE,
    V = n_splits / 2,
    ipc_weights = df$w,
    Z = "Y"
  )
  results_list[[cov]] <- mutate(result$mat, term = cov)
}
bind_rows(!!!results_list)
```

I could also imagine something like this being a useful unit test to make sure this functionality stays in place.
The expected length of the vectors (or lists, for backwards compatibility) is laid out in the docs: from the man page for

NEWS.md (referenced in the Changelog on the website) has a high-level summary of when I've made breaking changes, though I suppose I could be more clear about this. I think it's definitely an interesting idea to use a tool for major CRAN releases. I hope (but inevitably fail) to harmonize the major releases with CRAN more, but it often takes more effort than I anticipate.

RE: unit tests, I do have a unit test for inputting a vector of pre-computed fitted values, but I updated the unit test to take a full-Y-length vector when I made the change from a list.

Your understanding of the coarsening weights is correct. I originally had made the (almost equivalent) choice to only use the weights if any value of
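The gating behavior being described can be sketched roughly as follows (hypothetical function name; this illustrates the idea of conditioning on whether any observation is coarsened, not the package's actual internals):

```r
# Hypothetical illustration (not vimp's actual code): engage the
# IPC-weight machinery only when some observations are coarsened (C == 0).
ipc_weights_in_use <- function(C) {
  any(C == 0)  # C = 1 means fully observed; C = 0 means coarsened
}

ipc_weights_in_use(C = rep(1, 10))        # FALSE: fully observed data
ipc_weights_in_use(C = c(rep(1, 9), 0))   # TRUE: at least one coarsened unit
```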
Hi all,
I'm using this package for a project of my own, and I really appreciate you working on it. My question is whether VIMP supports population weights.
For example, suppose I take a stratified sample from some population of units (where I know their true sampling probabilities). It would be very straightforward to estimate the variable importance in the sample based on this data. Of course, I actually care about the population-level variable importance, which this would not be fully informative for. If I were estimating a mean (or a linear model), I would simply add weights equal to the inverse of my sampling probabilities and have unbiased estimates of my population quantities. Is this supported by VIMP out of the box? Is the method even amenable to this?
I was wondering whether the ipc_weights argument to cv_vim might be used for this purpose, but it seems like this is specific to coarsening (and that the argument may not even be used if C is always equal to one).

Thank you for your time and effort on VIMP!
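For concreteness, the mean example I have in mind looks like this in base R (a self-contained sketch, independent of vimp): inverse-probability weights recover the population mean from a stratified sample that oversamples a rare stratum.

```r
set.seed(42)
# Population with a rare stratum (stratum 2) that is heavily oversampled
pop <- data.frame(stratum = rep(1:2, times = c(9000, 1000)))
pop$y <- ifelse(pop$stratum == 2, 10, 0) + rnorm(nrow(pop))
# population mean of y is about 0.9 * 0 + 0.1 * 10 = 1

p_sample <- ifelse(pop$stratum == 2, 0.50, 0.05)  # known sampling probabilities
in_sample <- runif(nrow(pop)) < p_sample
samp <- pop[in_sample, ]
samp$w <- 1 / p_sample[in_sample]                 # inverse-probability weights

mean(samp$y)                   # badly biased: the rare stratum dominates the sample
weighted.mean(samp$y, samp$w)  # close to the population mean of ~1
```

The question is whether the analogous weighting is available for VIMP's variable-importance functionals.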