add_cumu_hazard does not aggregate hazard values correctly if input data are not sorted correctly #227

fabian-s · 2023-03-08T14:13:16Z

library(pammtools)
#> 
#> Attaching package: 'pammtools'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(survival)
library(tidyverse)

veteran <- survival::veteran  |> filter(time < 400)  |> #reduce data for ex
  mutate(trt   = 1L * (trt == 2),
         prior = 1L * (prior == 10))
veteran_ped <- veteran |> 
  as_ped(Surv(time, status)~., id = "id")

# equivalent models:
vet_cox_strata <- coxph(Surv(time, status) ~ strata(celltype),
                        data = veteran)
vet_pam_strata <- pamm(ped_status ~ s(tend) + celltype + s(tend, by = celltype),
                       data = veteran_ped)


pam_fit_correct <- veteran_ped |> ped_info() |> slice(rep(1:n(), each = 4)) |> 
  mutate(celltype = rep(levels(celltype), times = n()/4)) |> 
  group_by(celltype) |> arrange(tstart) |>  # !!!!!!  
  add_cumu_hazard(vet_pam_strata)
#> Warning in warn_about_new_time_points.pamm(object, newdata, time_var): Time
#> points/intervals in new data not equivalent to time points/intervals during
#> model fit. Setting intervals to values not used for original fitcan invalidate
#> the PEM assumption and yield incorrect predictions.

pam_fit_wrong <- veteran_ped |> ped_info() |> slice(rep(1:n(), each = 4)) |> 
  mutate(celltype = rep(levels(celltype), times = n()/4)) |> 
  add_cumu_hazard(vet_pam_strata) 
#> Warning in warn_about_new_time_points.pamm(object, newdata, time_var): Time
#> points/intervals in new data not equivalent to time points/intervals during
#> model fit. Setting intervals to values not used for original fitcan invalidate
#> the PEM assumption and yield incorrect predictions.

cox_fit <- vet_cox_strata |> basehaz() 

ggplot(pam_fit_correct, aes(x = tend, y = cumu_hazard)) + geom_step(aes(col = celltype)) +
  geom_line(data = cox_fit, aes(x = time, y = hazard, col = strata), linetype = 2)

ggplot(pam_fit_wrong, aes(x = tend, y = cumu_hazard)) + geom_step(aes(col = celltype)) +
  geom_line(data = cox_fit, aes(x = time, y = hazard, col = strata), linetype = 2)

^{Created on 2023-03-08 with reprex v2.0.2}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 Patched (2022-11-10 r83330)
#>  os       Linux Mint 20
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_US
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-03-08
#>  pandoc   2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version    date (UTC) lib source
#>  assertthat      0.2.1      2019-03-21 [1] CRAN (R 4.2.2)
#>  backports       1.4.1      2021-12-13 [1] CRAN (R 4.2.2)
#>  broom           1.0.1      2022-08-29 [1] CRAN (R 4.2.2)
#>  cellranger      1.1.0      2016-07-27 [1] CRAN (R 4.2.2)
#>  checkmate       2.1.0      2022-04-21 [1] CRAN (R 4.2.2)
#>  cli             3.6.0      2023-01-09 [1] CRAN (R 4.2.2)
#>  codetools       0.2-18     2020-11-04 [4] CRAN (R 4.0.3)
#>  colorspace      2.1-0      2023-01-23 [1] CRAN (R 4.2.2)
#>  crayon          1.5.2      2022-09-29 [1] CRAN (R 4.2.2)
#>  curl            4.3.3      2022-10-06 [1] CRAN (R 4.2.2)
#>  DBI             1.1.3      2022-06-18 [1] CRAN (R 4.2.2)
#>  dbplyr          2.2.1      2022-06-27 [1] CRAN (R 4.2.2)
#>  digest          0.6.31     2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr         * 1.0.10     2022-09-01 [1] CRAN (R 4.2.2)
#>  ellipsis        0.3.2      2021-04-29 [1] CRAN (R 4.2.2)
#>  evaluate        0.20       2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi           1.0.3      2022-03-24 [1] CRAN (R 4.2.2)
#>  farver          2.1.1      2022-07-06 [1] CRAN (R 4.2.2)
#>  fastmap         1.1.0      2021-01-25 [1] CRAN (R 4.2.2)
#>  forcats       * 0.5.2      2022-08-19 [1] CRAN (R 4.2.2)
#>  foreach         1.5.2      2022-02-02 [1] CRAN (R 4.2.2)
#>  Formula         1.2-4      2020-10-16 [1] CRAN (R 4.2.2)
#>  fs              1.5.2      2021-12-08 [1] CRAN (R 4.2.2)
#>  future          1.31.0     2023-02-01 [1] CRAN (R 4.2.2)
#>  future.apply    1.10.0     2022-11-05 [1] CRAN (R 4.2.2)
#>  gargle          1.2.1      2022-09-08 [1] CRAN (R 4.2.2)
#>  generics        0.1.3      2022-07-05 [1] CRAN (R 4.2.2)
#>  ggplot2       * 3.4.0      2022-11-04 [1] CRAN (R 4.2.2)
#>  globals         0.16.2     2022-11-21 [1] CRAN (R 4.2.2)
#>  glue            1.6.2      2022-02-24 [1] CRAN (R 4.2.2)
#>  googledrive     2.0.0      2021-07-08 [1] CRAN (R 4.2.2)
#>  googlesheets4   1.0.1      2022-08-13 [1] CRAN (R 4.2.2)
#>  gtable          0.3.1      2022-09-01 [1] CRAN (R 4.2.2)
#>  haven           2.5.1      2022-08-22 [1] CRAN (R 4.2.2)
#>  highr           0.10       2022-12-22 [1] CRAN (R 4.2.2)
#>  hms             1.1.2      2022-08-19 [1] CRAN (R 4.2.2)
#>  htmltools       0.5.3      2022-07-18 [1] CRAN (R 4.2.2)
#>  httr            1.4.4      2022-08-17 [1] CRAN (R 4.2.2)
#>  iterators       1.0.14     2022-02-05 [1] CRAN (R 4.2.2)
#>  jsonlite        1.8.4      2022-12-06 [1] CRAN (R 4.2.2)
#>  knitr           1.42       2023-01-25 [1] CRAN (R 4.2.2)
#>  labeling        0.4.2      2020-10-20 [1] CRAN (R 4.2.2)
#>  lattice         0.20-44    2021-05-02 [4] CRAN (R 4.1.0)
#>  lava            1.7.1      2023-01-06 [1] CRAN (R 4.2.2)
#>  lazyeval        0.2.2      2019-03-15 [1] CRAN (R 4.2.2)
#>  lifecycle       1.0.3      2022-10-07 [1] CRAN (R 4.2.2)
#>  listenv         0.9.0      2022-12-16 [1] CRAN (R 4.2.2)
#>  lubridate       1.9.0      2022-11-06 [1] CRAN (R 4.2.2)
#>  magrittr        2.0.3      2022-03-30 [1] CRAN (R 4.2.2)
#>  Matrix          1.5-3      2022-11-11 [1] CRAN (R 4.2.2)
#>  mgcv            1.8-41     2022-10-21 [1] CRAN (R 4.2.2)
#>  mime            0.12       2021-09-28 [1] CRAN (R 4.2.2)
#>  modelr          0.1.10     2022-11-11 [1] CRAN (R 4.2.2)
#>  munsell         0.5.0      2018-06-12 [1] CRAN (R 4.2.2)
#>  mvtnorm         1.1-3      2021-10-08 [1] CRAN (R 4.2.2)
#>  nlme            3.1-152    2021-02-04 [4] CRAN (R 4.0.3)
#>  numDeriv        2016.8-1.1 2019-06-06 [1] CRAN (R 4.2.2)
#>  pammtools     * 0.5.8      2022-01-09 [1] CRAN (R 4.2.2)
#>  parallelly      1.34.0     2023-01-13 [1] CRAN (R 4.2.2)
#>  pec             2022.05.04 2022-05-04 [1] CRAN (R 4.2.2)
#>  pillar          1.8.1      2022-08-19 [1] CRAN (R 4.2.2)
#>  pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.2.2)
#>  prodlim         2019.11.13 2019-11-17 [1] CRAN (R 4.2.2)
#>  purrr         * 1.0.1      2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache         0.16.0     2022-07-21 [1] CRAN (R 4.2.2)
#>  R.methodsS3     1.8.2      2022-06-13 [1] CRAN (R 4.2.2)
#>  R.oo            1.25.0     2022-06-12 [1] CRAN (R 4.2.2)
#>  R.utils         2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>  R6              2.5.1      2021-08-19 [1] CRAN (R 4.2.2)
#>  Rcpp            1.0.9      2022-07-08 [1] CRAN (R 4.2.2)
#>  readr         * 2.1.3      2022-10-01 [1] CRAN (R 4.2.2)
#>  readxl          1.4.1      2022-08-17 [1] CRAN (R 4.2.2)
#>  reprex          2.0.2      2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang           1.0.6      2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown       2.18       2022-11-09 [1] CRAN (R 4.2.2)
#>  rstudioapi      0.14       2022-08-22 [1] CRAN (R 4.2.2)
#>  rvest           1.0.3      2022-08-19 [1] CRAN (R 4.2.2)
#>  scales          1.2.1      2022-08-20 [1] CRAN (R 4.2.2)
#>  sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.2.2)
#>  stringi         1.7.8      2022-07-11 [1] CRAN (R 4.2.2)
#>  stringr       * 1.5.0      2022-12-02 [1] CRAN (R 4.2.2)
#>  styler          1.8.1      2022-11-07 [1] CRAN (R 4.2.2)
#>  survival      * 3.2-11     2021-04-26 [4] CRAN (R 4.0.5)
#>  tibble        * 3.1.8      2022-07-22 [1] CRAN (R 4.2.2)
#>  tidyr         * 1.3.0      2023-01-24 [1] CRAN (R 4.2.2)
#>  tidyselect      1.2.0      2022-10-10 [1] CRAN (R 4.2.2)
#>  tidyverse     * 1.3.2      2022-07-18 [1] CRAN (R 4.2.2)
#>  timechange      0.1.1      2022-11-04 [1] CRAN (R 4.2.2)
#>  timereg         2.0.5      2023-01-17 [1] CRAN (R 4.2.2)
#>  tzdb            0.3.0      2022-03-28 [1] CRAN (R 4.2.2)
#>  utf8            1.2.2      2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs           0.5.2      2023-01-23 [1] CRAN (R 4.2.2)
#>  withr           2.5.0      2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun            0.37       2023-01-31 [1] CRAN (R 4.2.2)
#>  xml2            1.3.3      2021-11-30 [1] CRAN (R 4.2.2)
#>  yaml            2.3.7      2023-01-23 [1] CRAN (R 4.2.2)
#> 
#>  [1] /home/fabians/R/x86_64-pc-linux-gnu-library/4.2
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

adibender · 2023-03-08T14:28:17Z

bei Beispiel 2 musst du aber noch zusätzlich nach celltype gruppieren?

fabian-s · 2023-03-08T15:44:32Z

ja - das ist halt eben nicht dokumentiert dass man das machen muss und sonst grütze raus kommt

adibender · 2023-03-08T16:04:00Z

i see yes. Thought about it, not sure how to handle. No way to know that it is grouping factor rather than e.g. time-dependent covariate..
Suggestions? Group automatically within add_cumu_hazard when same timepoint occurs more than once and send message to console?

adibender · 2024-02-26T20:20:55Z

@fabian-s Ideas how to do it technically? check if nrow(unique(newdata) ) != length(unique(tend)) and if so group internally by all variables that are not "PED" variables?

fabian-s · 2024-02-27T09:27:20Z

"group internally by all variables that are not "PED""

and then do arrange(tend)

wouldn't that break if time-dependent covariates with any duplicated values are present, like you said yourself 2 comments above?

if that can be avoided that seems like a reasonable fix.

since this seems fragile/intransparent, i'd suggest to make prepping the input data the user's responsibility:

verify that the input data for add_cumu_hazard has a tend column that only contains strictly increasing sub-sequences of timepoints (exception/complication: need to allow repeats of "sequences of length one" if they only contain tend for the first interval), and if that's not the case throw an informative error with an explanation that a suitably grouped/arranged input is required.

EDIT:

s.th like:

sequence_breaks <- which(diff(newdata$tend) <= 0) + 1
stopifnot(all (newdata$tend[sequence_breaks] == min(newdata$tend))

BTW: code above should probably check against the actual minimal tend used in the model and not just the minimal tend in newdata -- this then also avoids another possible fuck-up for add_cumu_hazard if the input data does not satisfy the implicit assumption that the PED data start at $t_0$ for all groups for which $\Lambda(t)$ is to be generated....

adibender · 2024-02-27T09:53:04Z

wouldn't that break if time-dependent covariates with any duplicated values are present, like you said yourself 2 comments above?

so my thinking was, if the same tend value appears multiple times, than we should group, such that each sequence of non-unique timepoints is one group. In case of TDCs with concurrent effect we'd have multiple values of the covar, but a unique sequence of time-points, so no grouping
there might, however be a problem when we have cumulative effects, it will depend on how the data set used for prediction is structured 🤔

adibender · 2024-02-27T09:58:45Z

yap, probably need to go back to the drawing board...
user could also specify specific time-points to calculate e.g. the survival probability, but we'd need to create the sequence of all previous hazards anyway.
Right now we rely on the user to do the right thing (given the examples in vignettes)

fabian-s · 2024-02-27T10:03:01Z

there might, however be a problem when we have cumulative effects, it will depend on how the data set used for prediction is structured
user could also specify specific time-points to calculate e.g. the survival probability, but we'd need to create the sequence of all previous hazards anyway.

don't have enough detailed context to know what the PED for cumul. effects etc would look like, but this all seems to reinforce my general point above that it would be cleaner to check the minimal conditions the input data HAS to satisfy instead of trying to guess what the user may have wanted to achieve and then automagically modify the input accordingly -- these conditions includes at least:

every subsequence of tend has to start at $t_0$
every subsequence of tend includes all intervals in ascending order up to the desired endpoint (so disregard the code i posted above)

adibender · 2024-02-27T10:17:19Z

I think prediction of S(t), H(t), CIF(t) etc. at specific time-points is not an unreasonable request. For example for evaluation measures. We do have this functionality here: https://github.com/adibender/pammtools/blob/master/R/predict.R
but not for the add_* functions.

I'll think about it. Thanks for the input 🙏🏻

fabian-s · 2024-02-27T10:43:46Z

I think prediction of S(t), H(t), CIF(t) etc. at specific time-points is not an unreasonable request

agree. to enable this, it seems feasible to -- internally, automatically --

pad the input data so that it has subsequences of all intervals from $t_0$ up to the user-specified maximal time (i.e, the times at the "sequence_break" indices in my code above)
keep track of which time points/rows were added
return only the predictions for the user-supplied rows after removing the padded ones ...?

i think this strategy might even have worked with the "pam_fit_wrong" input data in the top example...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add_cumu_hazard does not aggregate hazard values correctly if input data are not sorted correctly #227

add_cumu_hazard does not aggregate hazard values correctly if input data are not sorted correctly #227

fabian-s commented Mar 8, 2023

adibender commented Mar 8, 2023

fabian-s commented Mar 8, 2023

adibender commented Mar 8, 2023 •

edited

Loading

adibender commented Feb 26, 2024

fabian-s commented Feb 27, 2024 •

edited

Loading

adibender commented Feb 27, 2024 •

edited

Loading

adibender commented Feb 27, 2024

fabian-s commented Feb 27, 2024

adibender commented Feb 27, 2024

fabian-s commented Feb 27, 2024

add_cumu_hazard does not aggregate hazard values correctly if input data are not sorted correctly #227

add_cumu_hazard does not aggregate hazard values correctly if input data are not sorted correctly #227

Comments

fabian-s commented Mar 8, 2023

adibender commented Mar 8, 2023

fabian-s commented Mar 8, 2023

adibender commented Mar 8, 2023 • edited Loading

adibender commented Feb 26, 2024

fabian-s commented Feb 27, 2024 • edited Loading

adibender commented Feb 27, 2024 • edited Loading

adibender commented Feb 27, 2024

fabian-s commented Feb 27, 2024

adibender commented Feb 27, 2024

fabian-s commented Feb 27, 2024

adibender commented Mar 8, 2023 •

edited

Loading

fabian-s commented Feb 27, 2024 •

edited

Loading

adibender commented Feb 27, 2024 •

edited

Loading