Add `outputmargin = TRUE` in `predict` of a xgboost model #7
Conversation
Thank you! Your change is correct. I very much appreciate your PR and explanation. I'll merge this in now.
@bcjaeger Thanks for your quick reply and for accepting the PR. I'm reading your excellent work about AORSF published in JCGS (2024), and some of its contents confuse me a little. Could I ask you about it here?

(the weights w are then used in Newton-Raphson scoring to solve the betas) Thanks for any reply!
@darentsai no problem! Thank you for reviewing the code and for your kind words about the paper. You've asked two very good questions.
There is no benefit apart from the improved computational speed. One of the most interesting findings from the JCGS paper is that the much faster approach of using one iteration of the Newton-Raphson procedure ends up being practically equivalent, in terms of C-statistic, to the much more thorough penalized Cox regression approach from the AOAS paper.
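For intuition, here is a minimal, self-contained sketch (my addition, not the aorsf implementation; the simulated data and all variable names are illustrative) of a single Newton-Raphson step for a one-covariate Cox partial likelihood, the kind of update being discussed:

```r
set.seed(1)

# simulated survival data with one covariate; true log hazard ratio = 0.5
n <- 200
x <- rnorm(n)
event_time <- rexp(n, rate = exp(0.5 * x))
cens_time  <- rexp(n, rate = 0.2)
time   <- pmin(event_time, cens_time)
status <- as.numeric(event_time <= cens_time)

# one Newton-Raphson step for the Cox partial likelihood (assumes no tied times)
cox_newton_step <- function(time, status, x, beta = 0) {
  score <- 0
  info  <- 0
  for (i in which(status == 1)) {
    at_risk <- which(time >= time[i])          # risk set at the i-th event time
    w <- exp(beta * x[at_risk])
    xbar <- sum(w * x[at_risk]) / sum(w)       # weighted mean of x in risk set
    score <- score + x[i] - xbar               # partial likelihood score
    info  <- info + sum(w * x[at_risk]^2) / sum(w) - xbar^2  # information
  }
  beta + score / info                          # Newton-Raphson update
}

beta_1 <- cox_newton_step(time, status, x)     # one step from beta = 0
beta_1
```

Iterating `cox_newton_step` to convergence recovers the usual Cox MLE; the point made above is that the single step already lands close to it.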
I would guess that other packages (e.g., …)

```r
# bootstrap a standard error
n_obs <- 1000
n_boots <- 10000

x <- rnorm(n_obs)

boot_stats_regular <- vector('numeric', n_boots)
boot_stats_weights <- vector('numeric', n_boots)

for (i in seq(n_boots)) {

  # regular bootstrap approach
  index <- sample(n_obs, replace = TRUE)

  # weighted bootstrap (using the same sample as the regular approach)
  weights <- vector("numeric", n_obs)
  for (w in seq_along(weights)) weights[w] <- sum(index == w)
  # (equivalently: weights <- tabulate(index, nbins = n_obs))

  # do this to match what aorsf does (i.e., no row sampled > 10 times)
  weights <- pmin(weights, 10)

  boot_stats_regular[i] <- mean(x[index])
  boot_stats_weights[i] <- weighted.mean(x, w = weights)

}

# check
sd(boot_stats_regular)
#> [1] 0.03138031
sd(boot_stats_weights)
#> [1] 0.03138031
t.test(x)$stderr
#> [1] 0.03088334
```

Created on 2024-05-12 with reprex v2.1.0
@bcjaeger I very much appreciate your kindness and detailed explanation! After testing your example code, I clearly understand that using "random integer-valued weights" indeed mimics the bootstrap procedure. Initially I had doubts about whether it is valid because of a line of pseudocode in the JCGS paper: I intuitively thought that the weights were drawn from a discrete uniform distribution from 0 to 10. I adapted your example code to include weights generated from the binomial, Poisson, and uniform distributions.

```r
n_obs <- 1000
n_boots <- 10000

x <- rnorm(n_obs)

boot_stats_weights       <- vector('numeric', n_boots)
boot_stats_weights_binom <- vector('numeric', n_boots)
boot_stats_weights_pois  <- vector('numeric', n_boots)
boot_stats_weights_unif  <- vector('numeric', n_boots)

for (i in seq(n_boots)) {

  # weighted bootstrap (multinomial counts from resampling)
  index <- sample(n_obs, replace = TRUE)
  weights <- vector("numeric", n_obs)
  for (w in seq_along(weights)) weights[w] <- sum(index == w)

  # weighted bootstrap (binomial distribution)
  w_binom <- rbinom(n_obs, size = n_obs, prob = 1/n_obs)

  # weighted bootstrap (Poisson distribution)
  w_pois <- rpois(n_obs, lambda = 1)

  # weighted bootstrap (uniform distribution)
  w_unif <- sample(0:10, size = n_obs, replace = TRUE)

  boot_stats_weights[i]       <- weighted.mean(x, w = weights)
  boot_stats_weights_binom[i] <- weighted.mean(x, w = w_binom)
  boot_stats_weights_pois[i]  <- weighted.mean(x, w = w_pois)
  boot_stats_weights_unif[i]  <- weighted.mean(x, w = w_unif)

}

# check
t.test(x)$stderr               # 0.03226258
sd(boot_stats_weights)         # 0.03236514
sd(boot_stats_weights_binom)   # 0.03218649
sd(boot_stats_weights_pois)    # 0.03287533
sd(boot_stats_weights_unif)    # 0.02058976
```
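A back-of-the-envelope way to see why the uniform weights understate the standard error (an editorial note, not from the thread's authors): for i.i.d. multiplier weights `w`, the bootstrap sd of the weighted mean is approximately `sd(x)/sqrt(n) * sqrt(Var(w))/E(w)`, so a valid weight distribution needs `sqrt(Var(w))/E(w)` close to 1 (the multinomial counts are not i.i.d., but behave almost identically here):

```r
# sd(weighted mean) ~ sd(x)/sqrt(n) * sqrt(Var(w)) / E(w) for i.i.d. weights w
sd_factor <- function(mean_w, var_w) sqrt(var_w) / mean_w

sd_factor(1, 1)                  # Poisson(1): factor 1, matches the plain SE
sd_factor(1, 1 - 1/1000)         # Binomial(n, 1/n): ~1
sd_factor(5, (11^2 - 1) / 12)    # discrete uniform on {0,...,10}: ~0.632
```

The last factor, about 0.632, times the plain standard error of roughly 0.0323 gives about 0.0204, close to the 0.0206 observed for the uniform weights above.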
Ahh, I see. This is a very helpful illustration! I can see how the pseudocode implies uniform sampling. That was not my intention and I apologize for the confusion 😞 The sampling procedure used to mimic bootstrapping is carried out mostly by the code below:

```cpp
uword i, draw, n = data->n_rows;

std::uniform_int_distribution<uword> udist_rows(0, n - 1);

if (sample_with_replacement) {
  for (i = 0; i < n; ++i) {
    draw = udist_rows(random_number_generator);
    ++w_inbag[draw];
  }
}
```

Do you think we could describe this with pseudocode? I am thinking maybe something like this:

```r
probs <- (1 / n_obs) ^ (seq(0, 10))
w <- sample(seq(0, 10), size = n_obs, replace = TRUE, prob = probs)
```
Maybe using a loop to express the binomial sampling is easy to read, but it could be somewhat lengthy for pseudocode.

Your idea is nice! But the probabilities …

I'm not sure whether putting this arithmetic into algorithm pseudocode is appropriate or not. Maybe we can just write: … Sorry, I don't have better ideas right now...
Oh right,
It looks great! My concern is that … Another option is …, corresponding to the R command:

```r
w <- rbinom(n_obs, size = n_obs, prob = 1/n_obs)
```

This way saves the line needed to define …

best regards,
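As a quick sanity check on the `rbinom` suggestion above (an editorial addition): the marginal in-bag count for a single row, when n rows are drawn with replacement, really is Binomial(n, 1/n), which is very close to Poisson(1) and quite different from geometric probabilities like `(1/n)^k`:

```r
n_obs <- 1000
k <- 0:5

p_binom <- dbinom(k, size = n_obs, prob = 1/n_obs)  # marginal in-bag counts
p_pois  <- dpois(k, lambda = 1)                     # Poisson(1) limit

round(rbind(p_binom, p_pois), 4)
```

For example, the probability that a row is never sampled is about exp(-1), roughly 0.37, under both distributions.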
I love this! I would like to re-run this analysis with the xgboost predictions working as intended and then make the update to my pseudocode. This will take a little while since the analysis requires a lot of computing time, but when it's done, may I include an acknowledgment to you in the paper for your help?
Of course, it's my pleasure!
Dear Maintainer,

I'm trying to train an xgboost model with the Cox proportional hazards loss using your code. You estimated the baseline cumulative hazard $H_0(t)$ by `gbm::basehaz.gbm`.

aorsf-bench/R/model_fit.R, lines 397 to 406 in c8a9b5b

I searched its documentation and see that the parameter `f.x` needs to be the predicted values of the regression model on the log hazard scale. However, you passed the output of `predict(fit, newdata = xmat)` into it, where the `predict` method for class `'xgb.Booster'` defaults to returning predictions on the hazard ratio scale (i.e., as $HR = exp(X\beta)$ in the proportional hazards function $h(t) = h_0(t) * HR$), instead of the log hazard scale $log(HR) = X\beta$. I think you may need to set `outputmargin = TRUE` before passing it into `basehaz.gbm`, i.e., … `outputmargin` controls whether the predictions should be returned in the form of the original, untransformed sum of predictions from the boosting iterations.

The same issue appears in the `xgb_cox_pred` function from `model_pred.R`.

aorsf-bench/R/model_pred.R, lines 209 to 213 in c8a9b5b

It seems that you use the formula $S(t) = exp(-exp(X\beta) * H_0(t))$ to evaluate survival probabilities. However, the `lin_preds` you define is already $exp(X\beta)$, and you take one more `exp()` outside of it. So I think you also need to set `outputmargin = TRUE` here.

Maybe all of the above is my misunderstanding! Thank you for any replies. This project is an outstanding work!

Best regards,
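A small numeric sketch of the issue described above (an editorial illustration with hypothetical numbers, not outputs of the real model): if `predict` already returns $exp(X\beta)$ and the code exponentiates once more, every survival probability comes out too low:

```r
# hypothetical log-hazard predictions and baseline cumulative hazard
lp <- c(-0.5, 0, 0.8)   # X %*% beta on the log hazard scale
H0 <- 0.2               # H_0(t) at some time t

surv_correct <- exp(-exp(lp) * H0)       # S(t) = exp(-exp(X beta) * H_0(t))
surv_buggy   <- exp(-exp(exp(lp)) * H0)  # accidental double exponentiation

round(rbind(surv_correct, surv_buggy), 3)
```

Since `exp(exp(lp)) > exp(lp)` for every real `lp`, the buggy version always understates survival, which is why `outputmargin = TRUE` matters here.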