
ACTG320 exploratory analysis #3

Open
bcjaeger opened this issue Jun 15, 2022 · 2 comments

bcjaeger (Owner) commented Jun 15, 2022

aorsf works very well on ACTG320 mortality prediction, and it would be great to figure out why.

Plan:

  • Fit several models
  • Use different node sizes
  • Death outcome
  • Use the out-of-bag error estimate from aorsf to compare the models' prediction accuracy (see the sketch after this list)
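
A minimal sketch of that comparison (not the final analysis): `actg` stands in for the actg320 analysis data, `days` and `death` are placeholder endpoint columns, and reading the OOB statistic from `fit$eval_oobag` is my understanding of where orsf() stores its out-of-bag evaluation, so worth checking against the aorsf docs.

```r
library(aorsf)
library(survival)

# candidate node sizes (values are illustrative)
leaf_min_events_grid <- c(1, 5, 10, 20)

oob_cstat <- sapply(leaf_min_events_grid, function(lme) {
  fit <- orsf(
    data = actg,                      # placeholder for the actg320 analysis data
    formula = Surv(days, death) ~ .,  # in practice, exclude the AIDS endpoint columns from the predictors
    n_tree = 500,
    leaf_min_events = lme,
    oobag_pred_horizon = 350          # assumed prediction horizon
  )
  # out-of-bag evaluation statistic (Harrell's C-statistic by default)
  as.numeric(fit$eval_oobag$stat_values)
})

data.frame(leaf_min_events = leaf_min_events_grid, oob_cstat = oob_cstat)
```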

Here is a little synopsis of what I would like to check with the actg320 data:

The data have two main endpoints, death and AIDS diagnosis. For both endpoints, I want to see how well aorsf performs with a number of different hyperparameter values. In other words, I am guessing that the performance of aorsf on this dataset is going to depend on how well we tune it. The main tuning parameters for aorsf are below (copied from ?aorsf::orsf). I think we could set up a simple experiment where we make a dataset with one column for each tuning parameter, with each row holding a specific set of inputs for orsf(), and then assess the performance of each set of inputs using cross-validation, probably with just 3 folds because the event count is low. This would be a great exercise and should also provide some useful info for us, e.g., we may change the default values of orsf() for datasets with smaller event counts.
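
A rough sketch of that grid-plus-CV setup, using the same placeholder names (`actg`, `days`, `death`) as above; the grid values, the 350-day horizon, and the use of survival::concordance() to score each held-out fold are illustrative choices, not a final design.

```r
library(aorsf)
library(survival)

set.seed(320)

# one row per combination of tuning-parameter inputs for orsf()
grid <- expand.grid(
  leaf_min_events = c(1, 5, 10),
  split_min_obs   = c(10, 20, 40),
  mtry            = c(2, 4, 6)
)

# 3 folds because the event count is low
n_folds <- 3
folds <- sample(rep(seq_len(n_folds), length.out = nrow(actg)))

cv_cstat <- function(params) {
  fold_stat <- numeric(n_folds)
  for (k in seq_len(n_folds)) {
    train <- actg[folds != k, ]
    test  <- actg[folds == k, ]
    fit <- orsf(
      data = train,
      formula = Surv(days, death) ~ .,  # exclude the other endpoint's columns in practice
      leaf_min_events = params$leaf_min_events,
      split_min_obs   = params$split_min_obs,
      mtry            = params$mtry
    )
    # predicted risk at a 350-day horizon
    test$risk <- as.numeric(
      predict(fit, new_data = test, pred_horizon = 350, pred_type = "risk")
    )
    # higher predicted risk should pair with earlier events, hence reverse = TRUE
    fold_stat[k] <- concordance(
      Surv(days, death) ~ risk, data = test, reverse = TRUE
    )$concordance
  }
  mean(fold_stat)
}

grid$cstat <- vapply(seq_len(nrow(grid)), function(i) cv_cstat(grid[i, ]), numeric(1))
grid[order(-grid$cstat), ]
```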

@kristinlenoir, would you like to help me with this?

bcjaeger self-assigned this Jun 15, 2022
kristinlenoir (Contributor) commented

I would love to help with this!

bcjaeger added the question label Jun 15, 2022
kristinlenoir (Contributor) commented

First analysis: observe how performance varies across leaf_min_events.
I used 3-fold cross-validation with 10 repeats and a prediction horizon of 350 days.

This is the c-statistic for the AIDS endpoint; this trend has been consistent.
[figure: c-statistic across leaf_min_events, AIDS endpoint]

Death endpoint (fewer events): the performance was rather inconsistent before I set a seed. A line is fit, but there is really little trend (leaf_min_events = 4 may be an outlier). The c-statistic, however, is not bad across the board.
[figure: c-statistic across leaf_min_events, death endpoint]

Future plans:

  • Amend the code to make it more efficient, with a better output format
  • Add the Brier score (see the sketch after this list)
  • Vary split_min_obs
  • Perhaps try Monte Carlo cross-validation in addition to k-fold CV (increase the number of repeats?)
  • Maybe try varying mtry down the line (probably not n_retry)
  • Add a curved (smoothed) line to the performance graphs, in anticipation of a point of optimal performance
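
For the Brier score item above, one option is riskRegression::Score(), which can take a matrix of predicted risks at the evaluation times; `train`, `test`, `days`, and `death` below are the same placeholders as in the earlier sketches, and the shape of the returned tables is worth checking against the riskRegression documentation.

```r
library(aorsf)
library(riskRegression)
library(survival)

# fit on the training fold (tuning parameters omitted here for brevity)
fit <- orsf(data = train, formula = Surv(days, death) ~ .)

# predicted risk at the 350-day horizon, one column per evaluation time
risk_350 <- predict(fit, new_data = test, pred_horizon = 350, pred_type = "risk")

scores <- Score(
  object  = list(aorsf = risk_350),
  formula = Surv(days, death) ~ 1,  # censoring model used for IPCW weights
  data    = test,
  times   = 350,
  metrics = c("auc", "brier")
)

scores$Brier$score  # Brier score (with a null-model reference)
scores$AUC$score    # time-dependent AUC at 350 days
```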
