
Test model on all labels from the future #378

Open

jtwalsh0 opened this issue Feb 2, 2018 · 3 comments

Comments
jtwalsh0 (Member) commented Feb 2, 2018

Triage should have an option to assess the performance of a model on any future test set, i.e., any test set that begins after the train set ends. For an annual-prediction example, a model trained on data ending December 31, 2009, should be tested on all test sets that begin January 1, 2010; January 1, 2011; January 1, 2012; and so on.

This would help us understand how often the model should be retrained and what the partner loses as models get older. It might also help identify problems or interesting patterns, e.g., cases where model performance degrades for a while and then recovers.

The user should be able to set a parameter (perhaps called "max time difference") that would only test the model on labels within x time of the end of the train data. In the above example, the user might limit testing of a model trained on 2009 data to the test sets from 2010, 2011, and 2012, excluding 2013 onward.
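A minimal sketch of what such an option might enumerate, in plain Python (using python-dateutil for calendar arithmetic); `future_test_starts` and `max_time_difference` are hypothetical names for illustration, not Triage parameters:

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def future_test_starts(train_end, test_frequency, label_end, max_time_difference=None):
    """Enumerate test-set start dates that begin after the train set ends.

    train_end:           last date covered by the training data
    test_frequency:      spacing between test sets (a relativedelta)
    label_end:           last date for which labels exist
    max_time_difference: optional cap on how far past train_end to test
    """
    starts = []
    start = train_end + relativedelta(days=1)
    limit = label_end
    if max_time_difference is not None:
        limit = min(limit, train_end + max_time_difference)
    while start <= limit:
        starts.append(start)
        start += test_frequency
    return starts


# A model trained on data ending 2009-12-31, tested annually, with labels
# through 2013 but capped at three years past the end of training:
print(future_test_starts(
    train_end=date(2009, 12, 31),
    test_frequency=relativedelta(years=1),
    label_end=date(2013, 12, 31),
    max_time_difference=relativedelta(years=3),
))
# [datetime.date(2010, 1, 1), datetime.date(2011, 1, 1), datetime.date(2012, 1, 1)]
```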

nanounanue (Contributor) commented:
This is supported if you use the correct temporal config, isn't it? (But it is a pain.)

ecsalomon (Contributor) commented:
I propose an alternative solution:

  1. An additional key in the temporal_config for prediction_frequency. Triage in default mode will continue making test matrices and testing models every prediction_frequency until the model_update_frequency, at which point it will train a new model and use that for predictions going forward until it hits the model_update_frequency again (see the sketch after this list). For example, if prediction_frequency is one month and model_update_frequency is three months, it will generate three sets of predictions from each model. Of course, we'll need to address Better handling of multiple prediction windows #327 to make this truly useful. This would allow projects to jumpstart with a long retraining window to narrow down the learner grid more quickly while still generating predictions as frequently as the context requires.
  2. A predict-everything-after-the-model mode that is used strictly for deciding on retraining frequency. This would basically be @jtwalsh0's suggestion but, per discussion with @Rayid, is not something we want to do by default on our first experiments, as it is costly.
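A rough sketch of the scheduling that proposal 1 implies, in plain Python rather than timechop code; `prediction_dates` and its arguments are illustrative names only, using the one-month prediction_frequency and three-month model_update_frequency from the example above:

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def prediction_dates(first_train_end, last_date, model_update_frequency, prediction_frequency):
    """Map each model (keyed by its train-end date) to the dates it would
    score on under the proposed prediction_frequency key. Purely
    illustrative; not the actual timechop implementation.
    """
    schedule = {}
    train_end = first_train_end
    while train_end < last_date:
        next_update = min(train_end + model_update_frequency, last_date)
        dates = []
        as_of = train_end
        while as_of < next_update:
            dates.append(as_of)
            as_of += prediction_frequency
        schedule[train_end] = dates
        train_end = next_update
    return schedule


# A one-month prediction_frequency with a three-month model_update_frequency
# yields three sets of predictions from each model:
for train_end, dates in prediction_dates(
    first_train_end=date(2018, 1, 1),
    last_date=date(2018, 7, 1),
    model_update_frequency=relativedelta(months=3),
    prediction_frequency=relativedelta(months=1),
).items():
    print(train_end, [d.isoformat() for d in dates])
# 2018-01-01 ['2018-01-01', '2018-02-01', '2018-03-01']
# 2018-04-01 ['2018-04-01', '2018-05-01', '2018-06-01']
```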

ecsalomon (Contributor) commented Dec 11, 2018

I think the experiment implementation of option 1 is fairly trivial. It should just be a timechop change, since timechop is already written to pass lists of test matrices out to the other components (it just only puts one item in the list). If the other components anticipate lists with loops rather than just grabbing the first element, the tests would likely be the hardest part.
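For illustration only, a toy example of the list-versus-first-element distinction; the split dictionary shape and key names here are assumptions, not the actual timechop output format:

```python
# Hypothetical shape of a timechop split; names are illustrative only.
split = {
    "train_matrix": {"first_as_of_time": "2018-01-01"},
    "test_matrices": [
        {"first_as_of_time": "2018-04-01"},
        {"first_as_of_time": "2018-05-01"},
        {"first_as_of_time": "2018-06-01"},
    ],
}

# Grabbing only the first element silently drops any extra test matrices:
only_first = split["test_matrices"][0]

# Looping handles one or many test matrices per split:
for test_matrix in split["test_matrices"]:
    print(test_matrix["first_as_of_time"])
```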

@thcrock thcrock added this to To be Prioritized in 2019 Triage Planning Jan 23, 2019
@saleiro saleiro moved this from To be Prioritized to Let's Do in 2019 Triage Planning Jan 29, 2019
ecsalomon added a commit that referenced this issue Apr 24, 2019
This commit addresses #663, #378, #223 by allowing a model to be
evaluated multiple times, thereby allowing users to see whether the
performance of a single trained model degrades over the time following
training.

Users must now set a timechop parameter, `test_evaluation_frequency`, that
adds multiple test matrices to a time split. A model will be tested
once on each matrix in its list. Matrices are added until they reach the
label time limit, testing all models on the final test period (assuming
that you make model_update_frequency evenly divisible by
test_evaluation_frequency).

This initial commit only makes changes to timechop proper. Remaining
work includes:

- Write tests for the new behavior
- Make timechop plotting work with new behavior

New issues that I do not plan to address in the forthcoming PR:

- Incorporate multiple evaluation times into audition and/or
  postmodeling
- Maybe users should be able to set a maximum evaluation horizon so that
  early models are not tested for, say, 100 time periods
- Evaluation time-splitting could (or should) eventually not be done with
  pre-made matrices but on the fly at evaluation time
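As a hedged illustration of the behavior the commit message describes (not the actual timechop implementation), a short Python sketch of how test-matrix start times could be attached to a split every `test_evaluation_frequency` up to the label time limit; the function name and arguments are assumptions:

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def test_matrix_starts(train_end, label_end, test_evaluation_frequency):
    """Start dates of the test matrices attached to one time split: one
    every test_evaluation_frequency from the end of training up to the
    label time limit. Illustrative only.
    """
    starts = []
    start = train_end
    while start <= label_end:
        starts.append(start)
        start += test_evaluation_frequency
    return starts


# With a one-year model_update_frequency evenly divisible by a six-month
# test_evaluation_frequency, every model's list ends on the same final
# test period (2012-01-01 here), so all models are compared on it:
for train_end in (date(2010, 1, 1), date(2011, 1, 1), date(2012, 1, 1)):
    print(train_end, test_matrix_starts(
        train_end=train_end,
        label_end=date(2012, 1, 1),
        test_evaluation_frequency=relativedelta(months=6),
    ))
```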