Time-dependent C-index #6

Open
bblodfon opened this issue Oct 11, 2023 · 9 comments
@bblodfon

Hi @bcjaeger,

Congrats on publishing the aorsf paper! It has also come out as the best learner in my own work.

I wanted to ask your opinion on the following regarding this benchmarking work:

I was looking at the code that calculates the time-dependent C-index, and I saw you are using the riskRegression::Score() function and getting the AUC for each of the times = pred_horizon (a vector of time points). Then you take the mean of those values. And you have calculated risk scores per time point for each model here. Right?

First, do you know exactly which paper the AUC calculation comes from? In the paper you cite the timeROC paper, Blanche (2013), but is that a different method?

Secondly, is the above calculation really a general time-dependent C-index? I have read Antolini's (2005) work, which uses the predicted survival distribution to derive a C-index (so it's quite general), and there are other AUC-ROC-based C-index metrics that use linear predictors (so they require Cox models and/or some random censoring assumption to be in place). I really don't know, I just wanted your opinion on this, since the details were missing from the paper.

In mlr3proba we will soon include Antolini's C-index, and after seeing this, I was considering also adding this one, as it is already implemented in riskRegression, but we'll see!

@bcjaeger
Owner

Hello @bblodfon,

Thank you! I am thrilled to hear aorsf is doing well in your analysis. 😄

Then you take the mean of those values. And you have calculated risk scores per time point for each model here. Right?

This is correct. I have waffled back and forth on whether it makes sense to take a mean C-statistic over a range of times. My rationale for doing it in this analysis was that I wanted to make relative comparisons between different models, rather than interpret how effectively a model discriminates between high- and low-risk observations for a specific task, so taking a mean made sense. If I were trying to estimate how effective a model would be for a specific prediction task, e.g., 10-year risk of CVD, I would just use the time-dependent value of C at 10 years.

First, do you know exactly which paper the AUC calculation comes from? In the paper you cite the timeROC paper, Blanche (2013), but is that a different method?

Equation 3 in Blanche et al, 2013. This method has traversed through different software packages over the years. I include a reprex below so that you can see timeROC and riskRegression compute the same thing given certain inputs.

Secondly, is the above calculation really a general time-dependent C-index? I have read Antolini's (2005) work, which uses the predicted survival distribution to derive a C-index (so it's quite general), and there are other AUC-ROC-based C-index metrics that use linear predictors (so they require Cox models and/or some random censoring assumption to be in place). I really don't know, I just wanted your opinion on this, since the details were missing from the paper.

I would say the above calculation is just a mean of time-dependent C-statistics, and in that sense, you could say we are 'integrating out' the time dimension. One reason for taking the mean here was that otherwise, I would have to pick a specific prediction horizon for each prediction task, and I did not see a reasonably objective way to do that. Additionally, for the sake of comparing models, I thought taking the mean C-stat made the most sense (it's pretty similar to how we compare Brier scores via integration over time, right?).
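As a language-agnostic illustration of the "integrating out" idea (Python here, with entirely made-up AUC values, not numbers from the benchmark), the summary C-statistic is nothing more than the average of the time-dependent AUCs over the evaluation grid:

```python
# Hypothetical time-dependent AUC values at a grid of prediction horizons
times = [1000, 2000, 3000]
auc_t = {1000: 0.78, 2000: 0.74, 3000: 0.71}  # made-up numbers

# "Integrating out" the time dimension with a simple mean
mean_cstat = sum(auc_t[t] for t in times) / len(times)
print(round(mean_cstat, 4))  # 0.7433
```

This treats every horizon equally, which is reasonable when the grid points are roughly equidistant.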

Overall, my opinion on the matter is that time-dependent C-statistics are confusing because

(1) there are a lot of ways to compute them
(2) the same method is implemented in multiple packages
(3) it's not clear why there is an integrated Brier score and no integrated C-statistic.

I would love to see a stable R package that clears this up. Making them more accessible might just convince the majority of scientific authors to stop using Harrell's C-statistic in analyses with lots of censoring!

library(timeROC)
library(riskRegression)
#> riskRegression version 2023.03.22
library(aorsf)
library(survival)

cstat_timeroc <- timeROC(pbc_orsf$time, 
                         delta=pbc_orsf$status,
                         marker=pbc_orsf$bili, 
                         cause=1,
                         weighting="marginal", 
                         times=c(1000, 2000, 3000), 
                         iid=TRUE)

cstat_riskregression <- Score(object = list(pbc_orsf$bili), 
                              formula = Surv(time, status) ~ 1,
                              data = pbc_orsf, 
                              times = c(1000, 2000, 3000),
                              metrics = 'auc')

cstat_timeroc$AUC - cstat_riskregression$AUC$score$AUC
#> t=1000 t=2000 t=3000 
#>      0      0      0

Created on 2023-10-11 with reprex v2.0.2

@bblodfon
Author

Thanks so much for your thoughts on this!

I also exchanged a few emails with Thomas Gerds and Paul Blanche to clear this up. It's as you said. In particular, the riskRegression version is somewhat faster and has some minor bug fixes compared to the timeROC one. It's also pretty close to Uno's AUC from survAUC:

library(timeROC)
library(riskRegression)
#> riskRegression version 2023.03.22
library(aorsf)
library(survival)
library(survAUC)

cstat_timeroc <- timeROC(pbc_orsf$time,
  delta=pbc_orsf$status,
  marker=pbc_orsf$bili,
  cause=1,
  weighting="marginal",
  times=c(1000, 2000, 3000),
  iid=TRUE)

cstat_riskregression <- Score(object = list(pbc_orsf$bili),
  formula = Surv(time, status) ~ 1,
  data = pbc_orsf,
  times = c(1000, 2000, 3000),
  metrics = 'auc')

cstat_survAUC <- survAUC::AUC.uno(
  Surv.rsp = Surv(pbc_orsf$time, pbc_orsf$status),
  Surv.rsp.new = Surv(pbc_orsf$time, pbc_orsf$status), # train = test data
  times = c(1000, 2000, 3000),
  lpnew = pbc_orsf$bili
)

cstat_timeroc$AUC - cstat_riskregression$AUC$score$AUC
#> t=1000 t=2000 t=3000 
#>      0      0      0
# SAME RESULTS! USE riskRegression!
cstat_survAUC$auc - cstat_timeroc$AUC
#>        t=1000        t=2000        t=3000 
#>  2.220446e-16 -7.090504e-04 -3.869590e-03

Created on 2023-10-11 with reprex v2.0.2

My only concern is that, since the above are the same (some differences in implementation for sure, but I would call them the same) and their formulas reduce to Uno's AUC for right-censored data using the Kaplan-Meier estimator of the censoring distribution, wouldn't Uno's AUC assumptions then need to hold? e.g., a 1-1 correspondence between the predictor/marker and the expected survival times, and random censoring?

@bblodfon
Author

FYI @RaphaelS1

@bcjaeger
Owner

I'm not sure if the equality of the C-statistics implies the assumptions are true, but I do think it makes sense that the riskRegression and timeROC C-stats reduce to Uno's when we don't leverage their additional features. That is reassuring.

@bblodfon
Author

bblodfon commented Oct 12, 2023

I am still trying to work out whether the 1-1 assumption should hold or not (maybe it doesn't affect the results much in the end). My rationale comes from the following example:

Suppose we have two or more patients/observations whose survival predictions/curves (with risk(t) = 1 - surv(t), as documented in riskRegression::Score) cross (as may happen with a survival forest model), and I request the model's AUC at a time t1 before the crossing and at t2 after it. Does it even make sense to evaluate the survival prediction using AUC(t) across time in this case, or to take the mean?
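To make the crossing concrete, here is a small Python sketch with entirely made-up risk curves for two hypothetical patients; the risk ordering flips between the two horizons, so no single concordance ordering holds across time:

```python
# Two hypothetical patients whose predicted risk curves cross
def risk_a(t):
    # rises quickly, then plateaus at 0.5
    return min(0.1 + 0.0004 * t, 0.5)

def risk_b(t):
    # rises slowly but keeps climbing
    return 0.0003 * t

t1, t2 = 500, 2500  # horizons before and after the crossing
print(risk_a(t1) > risk_b(t1))  # True:  A looks higher-risk at t1
print(risk_a(t2) > risk_b(t2))  # False: B looks higher-risk at t2
```

Any AUC(t) computed from these risks would rank the two patients differently at t1 versus t2, which is exactly the situation the question is about.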

Note that for a particular model, or for comparing models (as you suggested) at a single time point, I definitely see no problem with that (and that's a time-dependent AUC; "integrated AUC" would perhaps have been a better name for what you calculated in the end). After all, the metric could be time-dependent, e.g. one model might perform better at horizon t1, while another performs better at a later horizon t2.

Paul's answer to the above was:

With AUC(t), we evaluate predicted risks at a given time t (e.g. 1 year risk of death). The predicted risks at other times s in ]0,t[ are irrelevant.

I agree also with what you said:

I would have to pick a specific prediction horizon for each prediction task, and I did not see a reasonably objective way to do that.

Yes, that would have been the better solution, but I see the difficulty in trying to find the optimal horizon per task (and there might be multiple, depending on the application).

But may I ask what vector of time points/horizons was chosen for each particular task (I guess something akin to the unique event times, different per test set)? I am asking because a weighted AUC mean (weighted by the differences between time points) would be better than a simple mean when the time points aren't equidistant.
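For what it's worth, the weighting idea can be sketched like this (Python, with hypothetical AUC values at non-equidistant horizons): each interval contributes its trapezoid area, so long stretches of the time axis count for more than the simple mean would give them:

```python
times = [100, 200, 1000, 3000]    # non-equidistant horizons
aucs  = [0.80, 0.79, 0.72, 0.70]  # made-up AUC(t) values

# Simple mean treats every horizon equally
simple_mean = sum(aucs) / len(aucs)

# Trapezoidal mean weights each interval by its width,
# so the long 1000-3000 stretch counts for more
area = sum((t2 - t1) * (a1 + a2) / 2
           for t1, t2, a1, a2 in zip(times, times[1:], aucs, aucs[1:]))
weighted_mean = area / (times[-1] - times[0])

print(round(simple_mean, 4), round(weighted_mean, 4))  # 0.7525 0.7253
```

The weighted version is pulled toward the AUC values on the widest intervals, which is the same logic the integrated Brier score uses over time.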

@bcjaeger
Owner

It depends, here is the relevant code:


# 25th and 75th percentiles of event times in the full data
event_time_bounds <- quantile(
  x = data_all$time[data_all$status == 1],
  probs = c(.25, .75)
)

# unique event times in the test set
pred_horizon <- sort(unique(.test$time[.test$status == 1]))

# restricted to the interquartile range of event times
pred_horizon <- pred_horizon[pred_horizon <= event_time_bounds['75%']]
pred_horizon <- pred_horizon[pred_horizon >= event_time_bounds['25%']]

# is_empty() comes from rlang/purrr
if (is_empty(pred_horizon)) return(NULL)

# keep at most 30 roughly evenly spaced time points
if (length(pred_horizon) > 30) {
  pred_horizon <-
    pred_horizon[floor(seq(1, length(pred_horizon), length.out = 30))]
}

A time-weighted C would certainly be more like the IBS. TL;DR: I did not use a time-weighted C.

@bblodfon
Author

OK! So you take the unique event times in the test set, keep those within the interquartile range of the event times, and thin the vector by "spreading out" the time values so that there are at most 30 of them.

Why was the value 30 chosen?

@bcjaeger
Owner

bcjaeger commented Oct 19, 2023

Arbitrary choice. Some prediction tasks had hundreds of unique event times, and that made riskRegression run very slowly. Using at most 30 times was convenient, but it might have been okay to use up to 50 or even 100; it depends on how many models you want to evaluate at all those time points.

@bblodfon
Author

Minimizing the number of time points at which the AUC is calculated is a very smart tactic, I would say! I think we can implement something similar in mlr3proba for efficiency.

Feel free to keep the post as a question/discussion, thanks for spending the time to answer all of this Byron!
