New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cvrisk for CoxPH models broken #85

Closed
hofnerb opened this Issue Aug 23, 2017 · 4 comments

Comments

Projects
None yet
2 participants
@hofnerb
Member

hofnerb commented Aug 23, 2017

Using the cross-validation approach for Cox PH models as described in Verweij and van Houwelingen (1993) (cvrisk(..., correted = TRUE)) does not seem to work:

## use CoxFlexBoost to simulate survival times
install.packages("CoxFlexBoost", repos="http://R-Forge.R-project.org")
library("CoxFlexBoost")

## load further packages
library("devtools")
install_github("boost-R/mboost")  ## use most recent mboost to allow leave-one-out CV
library("mboost")
library("survival")

### sample data
set.seed(1234)
## sample covariates first
X <- matrix(NA, nrow=400, ncol=3)
X[,1] <- runif(400, -1, 1)
X[,2] <- runif(400, -1, 1)
X[,3] <- runif(400, -1, 1)

## time-constant hazard rate (needs to be a vectororized in the first argument time)
lambda <- function(time, x){
    exp(0 * time + 0.7 * x[1] + 0.2 * x[2] - 0.2 * x[3])
}

## specify censoring function
cens_fct <- function(time, mean_cens){
    ## censoring times are independent exponentially distributed
    censor_time <- rexp(n = length(time), rate = 1/mean_cens)
    event <- (time <= censor_time)
    t_obs <- apply(cbind(time, censor_time), 1, min)
    ## return matrix of observed survival times and event indicator
    return(cbind(t_obs, event))
}

## simulate data with above hazard function and censoring function
data <- rSurvTime(lambda, X, cens_fct, mean_cens = 5)
## add another 20 non-informative variables
data <- cbind(data, as.data.frame(matrix(rnorm(400*20), ncol = 20)))

## estimate standard CoxPH model
summary(coxph(Surv(time, event) ~ ., data))
## coefficients seem ok, simulation worked

### now boosting model
fm <- as.formula(paste("Surv(time, event) ~", paste(names(data)[-(1:2)], collapse = "+")))
mod <- glmboost(fm, data, family = CoxPH())
coef(mod)

## cvrisk not working (here leave-one-out CV)
cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "kfold", B = 400), grid = 1:200)
plot(cvr) 
mstop(cvr)
## does not seem sensible (as it stops immediately)

## using negative of cvr also not useful (as it does not seem to overfit at all)
plot(-cvr)
mstop(-cvr)

grafik

Using the "uncorrected" version seems fine

cvr2 <- cvrisk(mod, corrected = FALSE)
plot(cvr2)
mstop(cvr2)

grafik

@hofnerb

This comment has been minimized.

Member

hofnerb commented Aug 23, 2017

[Update]

Leave-one-out crossvalidation also does not work with corrected = FALSE, i.e., esentially, the paper is "correct" as it only deals with leave-one-out crossvalidation.

cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "kfold", B = 400), 
                    grid = 1:200, corrected = FALSE)
plot(cvr)

grafik

@hofnerb

This comment has been minimized.

Member

hofnerb commented Aug 23, 2017

Here the relevant code:

## If family = CoxPH(), cross-validation needs to be computed as in
## Verweij, van Houwelingen (1993), Cross-validation in survival
## analysis, Statistics in Medicine, 12:2305-2314.
plloss <- environment(object$family@risk)[["plloss"]]
if (is.null(fun)) {
if (0 %in% grid) {
warning("All values in ", sQuote("grid"), " must be greater 0 if ",
'family = "CoxPH", hence 0 is dropped from grid')
grid <- grid[grid != 0]
}
dummyfct <- function(weights, oobweights) {
## <FIXME> Should the risk be computed on the inbag
## (currently done) or on the oobag observations?
mod <- fitfct(weights = weights, oobweights = oobweights,
risk = "inbag")
mstop(mod) <- max(grid)
pr <- predict(mod, aggregate = "cumsum")
## <FIXME> are the weights w really equal to 1? Shouldn't it
## be equal to the original fitting weights? Is this
## computed on ALL observations (currently done) or only on
## the OOBAG observations?
lplk <- apply(pr[, grid], 2, function(f)
sum(plloss(y = object$response, f = f, w = 1)))
## return negative "cvl"
- mod$risk()[grid] - lplk
}
}

The dummyfct is the (most) relevant function.

@hofnerb

This comment has been minimized.

Member

hofnerb commented Sep 5, 2017

We are trying to estimate

grafik

where the
grafik
(Figures are screenshots from Verweij and van Houwelingen).

Hence, we refit the model without the out-of-bag (oobag) observations (line 54) and compute the inbag-risk (see line 66), i.e. the negative log-likelihood obtained without the oobag observations. This should equal to the negative of grafik evaluated at grafik. (For this reason I think we need to compute the "inbag" risk, as we want to compute the log-likelihood based on the subsample used to fit the model.) Consequently, we need -risk(mod) in order to obtain grafik evaluated at grafik.

What is missing is the complete likelihood, i.e., for all data, evaluated again at the estimate grafik. Hence, we need the negative loss function for the whole data evaluated at the relevant estimates (or predictions relating to the estimates, see line 58). Note that predict(mod, aggregate = "cumsum") also makes predictions for observations with weight 0, i.e., for the oobag observations. The loss is computed in line 63. Here we relate all observed outcomes (object$response) to all predictions based on the model without the oobag observations (pr --> f).

In summary, I would conclude that we need to compute the sum of these two quantities with correct signs. However, no combination of signs shows an appropriate behavior. Actually, I think that line 66 should be

lplk + mod$risk()[grid]

to obtain the cvl criterion, and as we need a loss function rather than a likelihood, we should use -cvl, i.e., line 66. However, +cvl lead to a decreasing function in contrast to Table 1 in Verweij and van Houwelingen, where cvl increases with additional variables until a certain point and then starts to decrease again (maximum is highlighted in yellow):
grafik

@hofnerb

This comment has been minimized.

Member

hofnerb commented Nov 6, 2017

@mschmid25 correctly stated that it should of course read

- lplk + mod$risk()[grid]

However, this fix also does not solve the issue. The functionality thus will eventually be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment