Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

predict.flexsurvreg #39

Closed
chjackson opened this issue May 30, 2017 · 12 comments
Closed

predict.flexsurvreg #39

chjackson opened this issue May 30, 2017 · 12 comments

Comments

@chjackson
Copy link
Owner

General function to return predictions, either corresponding to the observed data, or to a user-supplied "newdata". Should behave similarly to other predict methods in base R and common packages. Could return predictions as linear predictors for the location parameter, or on the scale of the response, e.g. mean or median survival.

@jrdnmdhl
Copy link
Contributor

The typical functionality of predict is the same as what summary.flexsurvreg already does, except it returns a simple vector instead of a dataframe or list of dataframes.

In order to avoid duplicate code, I think the best way to do this might be to take the existing code for summary, use it to write a predict method which returns a simple vector, then call predict from summary to add the predicted values to the dataframe(s).

@jwdink
Copy link

jwdink commented Jun 8, 2017

I've written a function for generating predictions from flexsurvreg objects for one of my packages:

https://github.com/jwdink/tidysurv/blob/master/R/flexsurv-helpers.R

It's probably lacking in some important functionality, but thought I'd share in case its helpful.

@topepo
Copy link

topepo commented Aug 16, 2018

Any thoughts/progress on this?

I'd like to make flexsurv the default parametric modeling function in the parsnip package. Using the summary method is kinda kludgy and a bona fide prediction method would be great.

@chjackson
Copy link
Owner Author

I'll happily add this to play nicely with a tidyverse package!

What would you need it to do? Would predicted means (as type="response") and medians / other quantiles (as type="quantile") be sufficient? That should be an easy wrapper around summary.flexsurvreg. Linear predictors for the location parameter as well?

And what should be the returned format? A simple vector if no SEs are requested, or a matrix for multiple quantiles? What about with SEs? Would it have to return list(fit, se.fit) by convention, even though a data frame would be nicer?

Ideologically I prefer to present CIs than SEs, but I can add SEs as well if that's what people want.

@topepo
Copy link

topepo commented Aug 17, 2018

I'll happily add this to play nicely with a tidyverse package!

Excellent!

What would you need it to do? Would predicted means (as type="response") and medians / other quantiles (as type="quantile") be sufficient? That should be an easy wrapper around summary.flexsurvreg. Linear predictors for the location parameter as well?

Use "response" for for the mean response and "percentile" for others. Is there a parameter for the percentile in the predict function? Linear predictors (maybe type = "link"?) would be good too.

And what should be the returned format? A simple vector if no SEs are requested, or a matrix for multiple quantiles? What about with SEs? Would it have to return list(fit, se.fit) by convention, even though a data frame would be nicer?

Ideologically I prefer to present CIs than SEs, but I can add SEs as well if that's what people want.

Yeah, that's not obvious (and we are coming up with a standard in tidymodels/parsnip#41). I'd say to keep everything in a data frame. For simple mean predictions, I would suggest having a column named .pred.

Put the standard error in .std_error and bounds in .pred_lower and .pred_upper. For those, I would generate them when type = "conf_int" or type = "pred_int" (as appropriate) along with a level parameter.

I'm not sure about the percentiles. Ideally, predict should return the same number of rows as was in newdata so you would need to put then percentile predictions in columns. However, I hate to encode the percentile number (e.g. 75th, etc) in the column name. I'll think about that and then make a suggestion soon.

@topepo
Copy link

topepo commented Aug 17, 2018

How about a list column? Basically, you can add a data frame column (more easily with tibbles) that contains each subject-specific data frame. It's pretty easy to adapt your current data structure to that (code below).

You can still easily get to the single, stacked data frame easily (see the code at the end).

Here's an example using one of the models in ? flexsurvreg :

data(ovarian)
fitg <- flexsurvreg(formula = Surv(futime, fustat) ~ age, data = ovarian, dist="gengamma")

preds <- summary(fitg, newdata = ovarian[1:2, "age", drop = FALSE], t = c(100, 200, 300))

# set the names

preds <- lapply(preds, setNames, c(".time", ".pred", ".pred_lower", ".pred_upper"))

library(tibble)

preds <- tibble(.pred = preds)
preds

# if someone wants a single data frame, they could just:

tidyr::unnest(preds)

@chjackson
Copy link
Owner Author

I've uploaded a work in progress on predict.flexsurvreg to https://github.com/chjackson/flexsurv-dev/blob/master/R/predict.flexsurvreg.R. Happy to keep on refining it as your spec becomes more official. This needed some additions to summary.flexsurvreg to deal with SEs and type="link".

As I understand it, your suggestion was to return a data frame with pred,std_error etc columns for scalar predictions, but for vector predictions it would return a data frame with a single list column.
What do you think of the alternative I've written where it always returns pred,std_error etc columns, but these would be numeric columns for a scalar prediction and list columns for a vector prediction? That would seem to degrade more nicely.

I'm not sure about having "prediction intervals" for this kind of model. predict.glm and predict.survreg don't have that. type="quantile" seems to cover that concept by returning quantiles of the estimated distribution of the data.

@topepo
Copy link

topepo commented Sep 19, 2018

I'm not sure about having "prediction intervals" for this kind of model.

The notion of prediction intervals would be complex here and I don't know of any censored regression packages that do that so I wouldn't worry about it.

You could do it with Bayesian models really easily but otherwise it can be difficult to compute.

As I understand it, your suggestion was to return a data frame with pred, std_error etc columns for scalar predictions, but for vector predictions it would return a data frame with a single list column. What do you think of the alternative I've written where it always returns pred , std_error etc columns, but these would be numeric columns for a scalar prediction and list columns for a vector prediction? That would seem to degrade more nicely.

Based some feedback that we've gotten, I think that the optimal structure would be a list column of data frames (or tibbles). For the quantile code, that would look like:

[[1]]
# A tibble: 2 x 2
 .quantile  .pred
     <dbl>  <dbl>
1      0.1 243.  
2      0.9   9.93

[[2]]
# A tibble: 2 x 2
 .quantile  .pred
     <dbl>  <dbl>
1      0.1 1256. 
2      0.9   51.2

Similar for probabilities, the columns would be .time and .pred. On the website, we suggest:

> tidy_preds
# A tibble: 2 x 1
  .pred           
  <list>          
1 <tibble [3 × 4]>
2 <tibble [3 × 4]>
> tidy_preds$.pred[[1]]
# A tibble: 3 x 4
  .time .pred .pred_lower .pred_upper
  <dbl> <dbl>       <dbl>       <dbl>
1   100 0.799      0.484        0.966
2   200 0.485      0.128        0.806
3   300 0.278      0.0123       0.663

Always returning .pred and .std_error as their own columns (and others that are scalars per row) is perfect.

Looking at the predict code

Would we still use the summary function to get survival and/or hazard probabilities (or will you do these via predict too)?

##' TODO remove covariates?

Yes please!

How much would you want to use our prediction specification? It's not finalized yet but I feel pretty solid about what is in there. There are generally good ideas but we aren't trying to coerce people.

@mattwarkentin
Copy link
Contributor

mattwarkentin commented May 15, 2020

Hi @chjackson,

Any interest in resurrecting this issue? I would be interested in helping out and could start working on a PR. I am an avid user of the tidyverse and tidymodels ecosystems and also flexsurv. I could probably also help with #71.

I also wanted to check whether you would be interested in a PR which would add a function for estimating cumulative incidence in the presence of competing-risks when modeling cause-specific hazards with flexible survival models (flexsurvreg() or flexsurvspline()). I wrote this function for my own project, but I think it would be at home in this package. Happy to discuss further and I can create a separate issue, if you prefer.

Let me know!

@chjackson
Copy link
Owner Author

Thanks - happy for you to work on this. I've updated the current source a bit to make the TODOs clearer. I think the main thing left to do is to create those list columns when there's multiple predictions, and write examples that demonstrate how users can deal with them. Someone fluent in tidyverse would be appreciated for that.

I'm actively working on multi-state model stuff at the moment. The package already has some functions in this area, but anything that enhances it would be welcome. Happy to discuss in another issue.

@mattwarkentin
Copy link
Contributor

Great to hear! I've got a deadline this Friday for one of my doctoral projects thats currently occupying most of my time. I am planning to spend some time this weekend to read over the relevant source code and get more familiar with what exactly needs to be done.

I can start a new issue once I've got a better grasp on what the next steps might look like and we can discuss things further, if that works for you. I look forward to helping out in any way I can.

@mattwarkentin
Copy link
Contributor

@chjackson This can be closed, I think. Up to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants