
Test model on all labels from the future #378

Open

jtwalsh0 opened this issue Feb 2, 2018 · 3 comments

Comments
jtwalsh0 (Member) commented Feb 2, 2018

Triage should have an option to assess the performance of a model on any future test set, i.e., any test set that begins after the train set ends. For an annual-prediction example, a model trained on data ending December 31, 2009, should be tested on all test sets that begin January 1, 2010; January 1, 2011; January 1, 2012; and so on.

This would help us understand how often the model should be retrained and what the partner loses as models get older. It might also help identify problems or interesting patterns, e.g., cases where model performance degrades for a while and then recovers.

The user should be able to set a parameter (perhaps called "max time difference") that would only test the model on labels within x time of the end of the train data. In the above example, the user might limit testing of a model trained on 2009 data to the test sets from 2010, 2011, and 2012, excluding 2013 onward.
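A minimal sketch of what such an option might enumerate, in plain Python (using python-dateutil for calendar arithmetic); `future_test_starts` and `max_time_difference` are hypothetical names for illustration, not Triage parameters:

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def future_test_starts(train_end, test_frequency, label_end, max_time_difference=None):
    """Enumerate test-set start dates that begin after the train set ends.

    train_end:           last date covered by the training data
    test_frequency:      spacing between test sets (a relativedelta)
    label_end:           last date for which labels exist
    max_time_difference: optional cap on how far past train_end to test
    """
    starts = []
    start = train_end + relativedelta(days=1)
    limit = label_end
    if max_time_difference is not None:
        limit = min(limit, train_end + max_time_difference)
    while start <= limit:
        starts.append(start)
        start += test_frequency
    return starts


# A model trained on data ending 2009-12-31, tested annually, with labels
# through 2013 but capped at three years past the end of training:
print(future_test_starts(
    train_end=date(2009, 12, 31),
    test_frequency=relativedelta(years=1),
    label_end=date(2013, 12, 31),
    max_time_difference=relativedelta(years=3),
))
# [datetime.date(2010, 1, 1), datetime.date(2011, 1, 1), datetime.date(2012, 1, 1)]
```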

nanounanue (Contributor) commented:
This is supported if you use the correct temporal config, isn't it? (But it is a pain.)

ecsalomon (Contributor) commented:
I propose an alternative solution:

  1. An additional key in the temporal_config for prediction_frequency. Triage in default mode will continue making test matrices and testing models every prediction_frequency until the model_update_frequency, at which point it will train a new model and use that for predictions going forward until it hits the model_update_frequency again (see the sketch after this list). For example, if prediction_frequency is one month and model_update_frequency is three months, it will generate three sets of predictions from each model. Of course, we'll need to address Better handling of multiple prediction windows #327 to make this truly useful. This would allow projects to jumpstart with a long retraining window to narrow down the learner grid more quickly while still generating predictions as frequently as the context requires.
  2. A predict-everything-after-the-model mode that is used strictly for deciding on retraining frequency. This would basically be @jtwalsh0's suggestion but, per discussion with @Rayid, is not something we want to do by default on our first experiments, as it is costly.
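A rough sketch of the scheduling that proposal 1 implies, in plain Python rather than timechop code; `prediction_dates` and its arguments are illustrative names only, using the one-month prediction_frequency and three-month model_update_frequency from the example above:

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def prediction_dates(first_train_end, last_date, model_update_frequency, prediction_frequency):
    """Map each model (keyed by its train-end date) to the dates it would
    score on under the proposed prediction_frequency key. Purely
    illustrative; not the actual timechop implementation.
    """
    schedule = {}
    train_end = first_train_end
    while train_end < last_date:
        next_update = min(train_end + model_update_frequency, last_date)
        dates = []
        as_of = train_end
        while as_of < next_update:
            dates.append(as_of)
            as_of += prediction_frequency
        schedule[train_end] = dates
        train_end = next_update
    return schedule


# A one-month prediction_frequency with a three-month model_update_frequency
# yields three sets of predictions from each model:
for train_end, dates in prediction_dates(
    first_train_end=date(2018, 1, 1),
    last_date=date(2018, 7, 1),
    model_update_frequency=relativedelta(months=3),
    prediction_frequency=relativedelta(months=1),
).items():
    print(train_end, [d.isoformat() for d in dates])
# 2018-01-01 ['2018-01-01', '2018-02-01', '2018-03-01']
# 2018-04-01 ['2018-04-01', '2018-05-01', '2018-06-01']
```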

ecsalomon (Contributor) commented Dec 11, 2018

I think the experiment implementation of option 1 is fairly trivial. It should just be a timechop change, since timechop is already written to pass lists of test matrices out to the other components (it just only puts one item in the list). If the other components anticipate lists with loops rather than just grabbing the first element, the tests would likely be the hardest part.
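For illustration only, a toy example of the list-versus-first-element distinction; the split dictionary shape and key names here are assumptions, not the actual timechop output format:

```python
# Hypothetical shape of a timechop split; names are illustrative only.
split = {
    "train_matrix": {"first_as_of_time": "2018-01-01"},
    "test_matrices": [
        {"first_as_of_time": "2018-04-01"},
        {"first_as_of_time": "2018-05-01"},
        {"first_as_of_time": "2018-06-01"},
    ],
}

# Grabbing only the first element silently drops any extra test matrices:
only_first = split["test_matrices"][0]

# Looping handles one or many test matrices per split:
for test_matrix in split["test_matrices"]:
    print(test_matrix["first_as_of_time"])
```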

@thcrock thcrock added this to To be Prioritized in 2019 Triage Planning Jan 23, 2019
@saleiro saleiro moved this from To be Prioritized to Let's Do in 2019 Triage Planning Jan 29, 2019
ecsalomon added a commit that referenced this issue Apr 24, 2019
This commit addresses #663, #378, #223 by allowing a model to be
evaluated multiple times, thereby allowing users to see whether the
performance of a single trained model degrades over the time following
training.

Users must now set a timechop parameter, `test_evaluation_frequency`, that
adds multiple test matrices to a time split. A model will be tested
once on each matrix in its list. Matrices are added until they reach the
label time limit, testing all models on the final test period (assuming
that you make model_update_frequency evenly divisible by
test_evaluation_frequency).

This initial commit only makes changes to timechop proper. Remaining
work includes:

- Write tests for the new behavior
- Make timechop plotting work with new behavior

New issues that I do not plan to address in the forthcoming PR:

- Incorporate multiple evaluation times into audition and/or
  postmodeling
- Maybe users should be able to set a maximum evaluation horizon so that
  early models are not tested for, say, 100 time periods
- Evaluation time-splitting could (or should) eventually not be done with
  pre-made matrices but on the fly at evaluation time
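As a hedged illustration of the behavior the commit message describes (not the actual timechop implementation), a short Python sketch of how test-matrix start times could be attached to a split every `test_evaluation_frequency` up to the label time limit; the function name and arguments are assumptions:

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def test_matrix_starts(train_end, label_end, test_evaluation_frequency):
    """Start dates of the test matrices attached to one time split: one
    every test_evaluation_frequency from the end of training up to the
    label time limit. Illustrative only.
    """
    starts = []
    start = train_end
    while start <= label_end:
        starts.append(start)
        start += test_evaluation_frequency
    return starts


# With a one-year model_update_frequency evenly divisible by a six-month
# test_evaluation_frequency, every model's list ends on the same final
# test period (2012-01-01 here), so all models are compared on it:
for train_end in (date(2010, 1, 1), date(2011, 1, 1), date(2012, 1, 1)):
    print(train_end, test_matrix_starts(
        train_end=train_end,
        label_end=date(2012, 1, 1),
        test_evaluation_frequency=relativedelta(months=6),
    ))
```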