
Population weights in VIMP #14

Closed · ddimmery opened this issue Sep 13, 2021 · 11 comments

@ddimmery

Hi all,

I'm using this package for a project of my own, and I really appreciate your work on it. My question is whether VIMP supports population weights.

For example, suppose I take a stratified sample from some population of units, where I know the true sampling probabilities. It would be very straightforward to estimate the variable importance in the sample based on these data. Of course, I actually care about the population-level variable importance, which the sample-level estimate would not be fully informative about. If I were estimating a mean (or a linear model), I would simply add weights equal to the inverse of my sampling probabilities and get unbiased estimates of my population quantities. Is this supported by VIMP out of the box? Is the method even amenable to it?
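
For concreteness, here's the kind of thing I mean in the mean / linear-model case (base R; the sampling probabilities are made up purely for illustration):

set.seed(1)
p <- runif(100, 0.2, 0.8)  # known sampling probabilities
x <- rnorm(100)
y <- x + rnorm(100)
w <- 1 / p                 # inverse-probability (design) weights
weighted.mean(y, w)        # Horvitz-Thompson-style estimate of the population mean
lm(y ~ x, weights = w)     # weighted least squares for population coefficients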

I was wondering whether the ipc_weights argument to cv_vim might be used for this purpose, but it seems like it is specific to coarsening (and that the argument may not even be used if C is always equal to one).

Thank you for your time and effort on VIMP!

@bdwilliamson
Owner

Thanks for the question!

You're right that the ipc_weights argument can be used for this purpose. The documentation is written in the language of coarsening (or missing data), since that's where I've primarily seen the need to re-weight; however, the weights are used exactly as you describe. The way the code is structured, C can be identically equal to one and the weights will still be applied, so long as the ipc_weights are not all equal to one.

@ddimmery
Author

Thank you so much for this suggestion!

I'm trying to follow it, but I seem to be running into an error. Here's a repro:

library(dplyr)

df <- dplyr::tibble(id = 1:100) %>%
    mutate(
        x1 = rnorm(100),
        x2 = rnorm(100),
        y = x1 + rnorm(100),
        w = rexp(100)
    )

result <- vimp::cv_vim(
    Y = df$y,
    X = dplyr::select(df, x1, x2),
    ipc_weights = df$w,
    indx = 1,
    run_regression = TRUE,
    SL.library = c("SL.glm", "SL.gam"),
    sample_splitting = TRUE
)

This gives the following error:

Error in create_z(Y, C, Z, X, ipc_weights) : 
  Please enter a character vector corresponding to the names of the fully observed data.

This traces back to here: https://github.com/bdwilliamson/vimp/blob/master/R/utils.R#L114

It appears that I may need to pass something in through "Z" as well?

When I try using "Y" or "X", as the documentation seems to suggest I should, I get the following errors:

Z = "Y":

Error in innerCvControl[[ii]] <- list() : 
  attempt to select less than one element in integerOneIndex

Z = "X" gives the following traceback:

r$> rlang::last_trace()                                                                                                             
<error/tibble_error_na_column_index>
Can't use NA as column index with `[` at position 1.
Backtrace:
    █
 1. └─vimp::cv_vim(...)
 2.   └─vimp::create_z(Y, C, Z, X, ipc_weights)
 3.     ├─base::cbind.data.frame(Z_in, X[, minus_X])
 4.     │ └─base::data.frame(..., check.names = FALSE)
 5.     ├─X[, minus_X]
 6.     └─tibble:::`[.tbl_df`(X, , minus_X)

Using Z = "x1", Z = "x2", Z = "x3", or Z = "y" does the same thing, except that it also throws a warning:

In addition: Warning message:
In create_z(Y, C, Z, X, ipc_weights) : NAs introduced by coercion

Is this a bug, or is this simply unsupported, so that I shouldn't expect it to work?

@bdwilliamson
Owner

bdwilliamson commented Sep 21, 2021

Thanks for the great reprex and detailed input/output, and for your patience while working with the package! I've just created a new release, v2.2.6, that fixes this error.

First, there was a problem with the documentation: you have to enter "Y" (case matters) if you want the outcome to be included in Z (which specifies the variables that you think are related to the population weights, in this case), or character positions for the covariates (X), e.g., "1". (You can also enter, e.g., "x1", and it will do the same thing.) I've updated the documentation to make this clearer.
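
For example, these should be equivalent ways of specifying the outcome plus the first covariate:

Z = c("Y", "1")   # outcome plus first covariate, by position
Z = c("Y", "x1")  # the same, by name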

Second, there were two bugs that were throwing you off:

  • if run_regression = TRUE was specified, there wasn't a valid default number of cross-fitting folds; V now defaults to 5 in this case;
  • if "Y" wasn't specified in Z, I was trying to cbind a NULL onto a data.frame, which isn't allowed.

Using v2.2.6, I can run the following:

library("dplyr")
library("SuperLearner")
set.seed(1234)

df <- dplyr::tibble(id = 1:100) %>%
    mutate(
        x1 = rnorm(100),
        x2 = rnorm(100),
        y = x1 + rnorm(100),
        w = rexp(100)
    )

result <- vimp::cv_vim(
    Y = df$y,
    X = dplyr::select(df, x1, x2),
    ipc_weights = df$w,
    indx = 1,
    run_regression = TRUE,
    SL.library = c("SL.glm", "SL.gam"),
    sample_splitting = TRUE,
    Z = c("1", "2")
)

and get the following output:

Variable importance estimates:
      Estimate  SE        95% CI        VIMP > 0 p-value  
s = 1 0.1973972 0.5636479 [0, 1.302127] FALSE    0.3630892

@ddimmery
Author

Thank you so much for the quick fix! I really appreciate the commitment to good statistical software (it sometimes feels really rare)!

A clarifying question:
In my particular setup, the weights are known a priori -- what exactly does the Z parameter say about the relationship between my weights and the covariates in the models? (E.g., is something about the relationship between the weights and the covariates being estimated?)

I assume this is something that should be answered in one of your papers on VIMP, but I don't see anything about coarsening weights in the Biometrics paper or in this one on arXiv: https://arxiv.org/abs/2004.03683

@bdwilliamson
Owner

Yes, something is being estimated -- for more details, see Section 25.5.3 of van der Vaart (2000); for an example, see "Example 6" in Section 10.4 of https://arxiv.org/abs/2004.03683.

In your case, since C = 1 for all observations, the estimated quantity isn't actually being used, so you just need to enter something in for Z to ensure that the algorithm runs.
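
In other words, a call along these lines should work (a sketch reusing the names from your reprex; w holds your population weights):

result <- vimp::cv_vim(
    Y = df$y,
    X = dplyr::select(df, x1, x2),
    indx = 1,
    run_regression = TRUE,
    SL.library = c("SL.glm", "SL.gam"),
    ipc_weights = df$w,
    C = rep(1, nrow(df)),  # everything is fully observed
    Z = "Y"                # a placeholder; not actually used when C is identically 1
)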

@ddimmery
Author

I see, this is really helpful, thank you for the pointers.

Is there a way for me to configure cv_vim so that this quantity isn't actually estimated (given that it isn't going to be used)? I'm currently trying to trim runtime as much as possible, so that would be a big help.

By the way, are you planning to submit 2.2.6 to CRAN, or will you just be leaving it as a development release for now on GitHub?

@bdwilliamson
Owner

bdwilliamson commented Sep 24, 2021 via email

@ddimmery
Author

ddimmery commented Apr 12, 2023

I regret to say that I have further problems, and this solution no longer appears to work.

There are two issues:

  • When using sample splitting and pre-computed CV predictions in cv_vim, standard errors are all NA.
  • Using coarsening weights (as we discussed in the above thread) leads to an error complaining about missingness (but there is no missingness).

I'll start with a repro:

library(vimp)
library(purrr)
library(dplyr)
library(SuperLearner)

set.seed(100)
n <- 500
n_splits <- 10
df <- dplyr::tibble(
    id = 1:n,
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = runif(n),
    y = x1 + 0.25 * x3 + rnorm(n),
    split_id = sample(n_splits, n, replace = TRUE),
    w = rexp(n, 0.9) + 0.1
)

x_df <- select(df, x1, x2, x3)

validRows <- purrr::map(sort(unique(df$split_id)), ~which(.x == df$split_id))
cv_ctl <- SuperLearner::SuperLearner.CV.control(V = n_splits, validRows = validRows)
inner_cv_ctl <- list(list(V = n_splits / 2))

full_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
    Y = df$y,
    X = x_df,
    SL.library = c("SL.glm", "SL.mean"),
    cvControl = cv_ctl,
    innerCvControl = inner_cv_ctl,
    obsWeights = df$w
))

ss_folds <- vimp::make_folds(unique(df$split_id), V = 2)

cross_fitted_f1 <- extract_sampled_split_predictions(
  cvsl_obj = full_fit, 
  sample_splitting_folds = ss_folds, full = TRUE, vector = TRUE
)
results_list <- list()
for (cov in names(x_df)) {
  idx <- which(names(x_df) == cov)
  red_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
    Y = full_fit$SL.predict,
    X = x_df[, -idx, drop = FALSE],
    SL.library = c("SL.glm", "SL.mean"),
    cvControl = cv_ctl,
    innerCvControl = inner_cv_ctl,
    obsWeights = df$w
  ))
  
  cross_fitted_f2 <- extract_sampled_split_predictions(
    cvsl_obj = red_fit,
    sample_splitting_folds = ss_folds, full = FALSE, vector = TRUE
  )
  
  result <- vimp::cv_vim(
    Y = df$y,
    type = "r_squared",
    cross_fitted_f1 = cross_fitted_f1,
    cross_fitted_f2 = cross_fitted_f2,
    SL.library = c("SL.glm", "SL.mean"),
    cross_fitting_folds = df$split_id,
    sample_splitting_folds = ss_folds,
    run_regression = FALSE,
    V = n_splits / 2,
    # ipc_weights = df$w,
    # Z = "Y"
  )
  results_list[[cov]] <- mutate(result$mat, term = cov)
}
bind_rows(!!!results_list)
Warning messages:
1: In vimp::cv_vim(Y = df$y, type = "r_squared", cross_fitted_f1 = cross_fitted_f1,  :
  Original estimate NA; consider using a different library of learners.
2: In vimp::cv_vim(Y = df$y, type = "r_squared", cross_fitted_f1 = cross_fitted_f1,  :
  Original estimate NA; consider using a different library of learners.
3: In vimp::cv_vim(Y = df$y, type = "r_squared", cross_fitted_f1 = cross_fitted_f1,  :
  Original estimate NA; consider using a different library of learners.
# A tibble: 3 × 8
  s       est    se   cil   ciu test  p_value term 
  <chr> <dbl> <dbl> <dbl> <dbl> <lgl>   <dbl> <chr>
1 1        NA    NA    NA    NA NA         NA x1   
2 1        NA    NA    NA    NA NA         NA x2   
3 1        NA    NA    NA    NA NA         NA x3

I get similar results (but a different warning) when I run the example you have in the docs here. Specifically, I get the following warning and results:

Warning message:
In cv_vim(Y = y, cross_fitted_f1 = full_cv_preds, cross_fitted_f2 = reduced_cv_preds,  :
  Original estimate < 0; returning zero.

# r$> est$mat
# A tibble: 1 × 7
  s       est    se   cil   ciu test  p_value
  <chr> <dbl> <dbl> <dbl> <dbl> <lgl>   <dbl>
1 2         0    NA    NA    NA NA         NA

If I crank down the standard deviation of the residual with y <- as.matrix(smooth + stats::rnorm(n, 0, 0.001)), I can get a nonzero estimate, but everything else is still NA, despite including sample splitting and leaving cross_fitted_se = TRUE. Is something broken, or is this example not expected to provide a standard error? (That would be surprising to me.)

In the first example, I commented out some case-weights that I have in my setting. When I uncomment those two lines, I receive the following error:

Error in SuperLearner::SuperLearner(Y = obs_grad, X = Z[C == 1, , drop = FALSE],  : 
  missing data is currently not supported. Check Y, X, and newX for missing values

I have put the traceback in a gist because it was very long: https://gist.github.com/ddimmery/fe8e517acbb0487ed547745431b04ad6

It's important to note, however, that there is no missingness in the data I'm passing.

When we last corresponded, I was able to use population weights through the coarsening weights options to cv_vim as we discussed, but it seems that this might no longer be possible? I'm not sure when this regressed; this project has been on the back-burner for me for a few months now.

I'm planning to release the package that uses VIMP within the next few months (it will accompany a few empirical papers which will be coming out at that time), so I'd like to make sure that upstream will be stable (for my purposes: using weights and calculating $R^2$ VIMP on pre-calculated CV results) when I do so. For what it's worth, the full implementation of VIMP I'm using is here: https://github.com/ddimmery/tidyhte/blob/main/R/variable_importance.R.

[edit] added sessionInfo(): https://gist.github.com/ddimmery/571bced68f284d0033faebe9e915ac56

@bdwilliamson
Owner

bdwilliamson commented Apr 12, 2023

Thanks for your message, and the detailed reprex. I must not have properly updated the documentation when I made the change to use a vector, rather than a list, for passing in pre-computed CV predictions.

cv_vim expects a vector of fitted values (for both regressions) equal in length to the outcome vector, as in my example below. This was fixed in the docs for version 2.3.2, which for some reason wasn't showing up on the package website; I've just updated the website to fix this. I've also added a check to ensure that the fitted values that are passed in are the correct length (for both the vector and list versions).
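
The new check is along these lines (a sketch of the logic, not the verbatim package code):

if (length(unlist(cross_fitted_f1)) != length(Y)) {
    stop("cross_fitted_f1 must contain one fitted value per observation in Y.")
}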

Since the vectors you passed in contained only the predictions on the (correct) sample-split folds, a bunch of NAs were being appended to make them the expected length (R pads with NAs when you index past the end of a vector). This led the point estimates, standard errors, etc. to all be NA. And when using weights, NAs were being passed into the SuperLearner.
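
For reference, the base-R behavior that produced the padding:

x <- 1:3
x[1:5]
# [1]  1  2  3 NA NA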

Here's your reprex back with code that works (both with and without using the weights):

library(vimp)
library(purrr)
library(dplyr)
library(SuperLearner)

set.seed(100)
n <- 500
n_splits <- 10
df <- dplyr::tibble(
  id = 1:n,
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = runif(n),
  y = x1 + 0.25 * x3 + rnorm(n),
  split_id = sample(n_splits, n, replace = TRUE),
  w = rexp(n, 0.9) + 0.1
)

x_df <- select(df, x1, x2, x3)

validRows <- purrr::map(sort(unique(df$split_id)), ~which(.x == df$split_id))
cv_ctl <- SuperLearner::SuperLearner.CV.control(V = n_splits, validRows = validRows)
inner_cv_ctl <- list(list(V = n_splits / 2))

full_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
  Y = df$y,
  X = x_df,
  SL.library = c("SL.glm", "SL.mean"),
  cvControl = cv_ctl,
  innerCvControl = inner_cv_ctl,
  obsWeights = df$w
))

ss_folds <- vimp::make_folds(unique(df$split_id), V = 2)

# cross_fitted_f1 <- extract_sampled_split_predictions(
#   cvsl_obj = full_fit, 
#   sample_splitting_folds = ss_folds, full = TRUE, vector = TRUE
# )
cross_fitted_f1 <- full_fit$SL.predict
results_list <- list()
for (cov in names(x_df)) {
  idx <- which(names(x_df) == cov)
  red_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
    Y = full_fit$SL.predict,
    X = x_df[, -idx, drop = FALSE],
    SL.library = c("SL.glm", "SL.mean"),
    cvControl = cv_ctl,
    innerCvControl = inner_cv_ctl,
    obsWeights = df$w
  ))
  
  # cross_fitted_f2 <- extract_sampled_split_predictions(
  #   cvsl_obj = red_fit,
  #   sample_splitting_folds = ss_folds, full = FALSE, vector = TRUE
  # )
  cross_fitted_f2 <- red_fit$SL.predict
  
  result <- vimp::cv_vim(
    Y = df$y,
    type = "r_squared",
    cross_fitted_f1 = cross_fitted_f1,
    cross_fitted_f2 = cross_fitted_f2,
    SL.library = c("SL.glm", "SL.mean"),
    cross_fitting_folds = df$split_id,
    sample_splitting_folds = ss_folds,
    run_regression = FALSE,
    V = n_splits / 2,
    ipc_weights = df$w,
    Z = "Y"
  )
  results_list[[cov]] <- mutate(result$mat, term = cov)
}
bind_rows(!!!results_list)

@ddimmery
Author

Aha! Great, thank you for this; I really appreciate it. It wasn't at all obvious to me that this aspect of the API had changed. I think it's a much simpler and better API, but the docs could make it clearer that cross_fitted_f1 needs to be a vector. It might also help to spell out the expected length of the vectors.

Is there a possibility of getting notifications for breaking API changes? I know it's hard at the moment since I haven't yet submitted to CRAN (so revdep tools won't work). I'll be submitting in a month or so, but in the meantime, I'd really appreciate a heads-up.

Moreover, it would be really nice if you could add a unit test for this case (e.g. the basic repro above).

Finally, can you confirm that my understanding of how to use the coarsening options to get correctly weighted results with pre-computed model predictions is right? Namely:

  1. Set Z = "Y", ipc_weights = popn_weights, and C = rep(1, n).
  2. I can safely set SL.library = "SL.mean".

I want to make sure that (2) doesn't mistakenly negate the use of the weights. I wasn't sure how to interpret your comment that closed #15. My understanding is that in S2 of Section 3.4 there, the ipc_weights correspond to $\frac{1}{g_P(x)}$, and with $C$ always equal to 1, $\phi(z) = \frac{\phi^F(z)}{g_P(x)}$. I've already passed in the CV predictions for the full data, so I assumed the SL.library models shouldn't be used here; but if $\frac{1}{g_P(x)}$ were passed through a model somehow, that could certainly cause issues.

I believe my understanding is accurate, as the following modification of the example above (correctly) shows an effect of x3 when weights are not used (all units have w = 1), but shows no effect when w = if_else(x3 > 0.95, 11, 1):

library(vimp)
library(purrr)
library(dplyr)
library(SuperLearner)

set.seed(100)
n <- 1000
n_splits <- 10
df <- dplyr::tibble(
    id = 1:n,
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = runif(n),
    y = x1 + 5 * x3 * (x3 <= 0.95) + rnorm(n),
    split_id = sample(n_splits, n, replace = TRUE),
    w = 1 + 10 * (x3 > 0.95)
)

x_df <- select(df, x1, x2, x3)

validRows <- purrr::map(sort(unique(df$split_id)), ~which(.x == df$split_id))
cv_ctl <- SuperLearner::SuperLearner.CV.control(V = n_splits, validRows = validRows)
inner_cv_ctl <- list(list(V = n_splits / 2))

full_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
    Y = df$y,
    X = x_df,
    SL.library = c("SL.glm", "SL.mean"),
    cvControl = cv_ctl,
    innerCvControl = inner_cv_ctl,
    obsWeights = df$w
))

ss_folds <- vimp::make_folds(unique(df$split_id), V = 2)

cross_fitted_f1 <- full_fit$SL.predict

results_list <- list()
for (cov in names(x_df)) {
  idx <- which(names(x_df) == cov)
  red_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
    Y = full_fit$SL.predict,
    X = x_df[, -idx, drop = FALSE],
    SL.library = c("SL.glm", "SL.mean"),
    cvControl = cv_ctl,
    innerCvControl = inner_cv_ctl,
    obsWeights = df$w
  ))
  
  cross_fitted_f2 <- red_fit$SL.predict
  
  result <- vimp::cv_vim(
    Y = df$y,
    type = "r_squared",
    indx = idx,
    cross_fitted_f1 = cross_fitted_f1,
    cross_fitted_f2 = cross_fitted_f2,
    SL.library = c("SL.mean"),
    cross_fitting_folds = df$split_id,
    sample_splitting_folds = ss_folds,
    run_regression = FALSE,
    V = n_splits / 2,
    ipc_weights = df$w,
    Z = "Y"
  )
  results_list[[cov]] <- mutate(result$mat, term = cov)
}
bind_rows(!!!results_list)

I could also imagine something like this being a useful unit test to make sure this functionality stays in place.
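
Something along these lines, perhaps (a rough sketch, not a drop-in test: I use run_regression = TRUE to keep it short, and reset the seed so the folds match and the weights are the only difference between the two calls):

library(testthat)
library(vimp)
library(SuperLearner)

test_that("non-unit ipc_weights change the cv_vim estimate when C is all 1", {
    n <- 200
    x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    y <- x$x1 + rnorm(n)
    w <- rexp(n, 0.9) + 0.1
    set.seed(1)
    unweighted <- vimp::cv_vim(
        Y = y, X = x, indx = 1, V = 2, run_regression = TRUE,
        SL.library = "SL.glm"
    )
    set.seed(1)
    weighted <- vimp::cv_vim(
        Y = y, X = x, indx = 1, V = 2, run_regression = TRUE,
        SL.library = "SL.glm",
        ipc_weights = w, C = rep(1, n), Z = "Y"
    )
    expect_false(isTRUE(all.equal(unweighted$mat$est, weighted$mat$est)))
})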

@bdwilliamson
Owner

The expected length of the vectors (or lists, for backwards compatibility) is laid out in the docs. From the man page for cv_vim, the entry for cross_fitted_f1 reads: either (a) a vector, where each element is the predicted value when that observation is part of the validation fold; or (b) a list of length V, where each element in the list is a set of predictions on the corresponding validation data fold. If sample-splitting is requested, then these must be estimated specially; see Details. However, the resulting vector should be the same length as Y; if using a list, then the lengths of its elements should sum to the length of Y (i.e., each observation is included in the predictions exactly once).
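
For instance, flattening the list form into the vector form just means placing each fold's predictions at that fold's validation indices (hypothetical names; assumes cross_fitting_folds holds the fold labels):

preds <- numeric(length(Y))
for (v in seq_len(V)) {
    preds[cross_fitting_folds == v] <- preds_list[[v]]
}
# length(preds) == length(Y): each observation is predicted exactly once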

NEWS.md (referenced in the Changelog on the website) has a high-level summary of when I've made breaking changes, though I suppose I could be clearer about this. Using a tool to flag major CRAN releases is definitely an interesting idea. I hope (but inevitably fail) to better harmonize major releases with CRAN, but it often takes more effort than I anticipate.

Re: unit tests, I do have a unit test for inputting a vector of pre-computed fitted values, but I updated it to use a full-Y-length vector when I made the change from a list.

Your understanding of the coarsening weights is correct. I originally made the (almost equivalent) choice to only use the weights if any value of C differed from 1 -- this is fine for coarsening, but not for weighting fully-observed data. In a previous version, I switched to looking at the inputted weights, which are now used whenever they are not all equal to 1. I'll look into adding this unit test.
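
Roughly, the change to the check looks like this (a sketch of the logic, not the actual package internals):

# before: weights only kicked in if some observation was coarsened
use_weights <- any(C != 1)
# now: weights kick in whenever they are not all equal to 1
use_weights <- any(ipc_weights != 1)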
