Risklist module for production (#631)
* listmaking WIP

* forgot migration

* WIP

* alembic add label_value to list_predictions table

* add docstrings

* move risklist a layer above

* create risklist module

* __init__.py

* fix alembic reversion and replace metta.generate_uuid with filename_friendly_hash

* Fix down revision of production schema migration

* Enable github checks on this branch too

* Closer to getting tests to run

* Add CLI for risklist

* Risklist docs stub

* Break up data gathering into experiment and matrix, use pytest fixtures to speed up subsequent tests

* Modify schema for list prediction metadata

* fix conflicts and add helper functions for getting imputed features

* Handle other imputation flag cases, fix tracking indentation error

* Add more tests, fill out doc page

* Fix exception name typo

* use timechop and planner to create matrix_metadata for production

* retrain and predict forward

* rename to retrain_definition

* reusing random seeds from existing models

* fix tests (write experiment to test db)

* unit test for reusing model random seeds

* add docstring

* only store random seed in experiment runs

* DB migration to remove random seed from experiments table

* debugging

* debug model trainer tests

* debug catwalk utils tests

* debug catwalk integration test

* use public method

* alembic merge

* reuse random seed

* use timechop for getting retrain information

* create retrain model hash in retrain level instead of model_trainer level

* move util functions to utils

* fix cli and docs

* update docs

* use reconstructed feature dict

* add RetrainModel and Retrain

* remove break point

* change experiment_runs to triage_runs

* get retrain_config

* explicitly include run_type in joins to triage_runs

* DB migration updates

* update argument name in docs

* ensure correct temporal config is used for predicting forward

* debug

* debug

Co-authored-by: Tristan Crockett <tristan.h.crockett@gmail.com>
Co-authored-by: Kit Rodolfa <shaycrk@gmail.com>
3 people committed Aug 27, 2021
1 parent a994f3e commit 537813a
Showing 31 changed files with 1,682 additions and 143 deletions.
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -120,6 +120,7 @@ nav:
- Using Postmodeling: postmodeling/index.md
- Postmodeling & Crosstabs Configuration: postmodeling/postmodeling-config.md
- Model governance: dirtyduck/ml_governance.md
- Predictlist: predictlist/index.md
- Scaling up: dirtyduck/aws_batch.md
- Database Provisioner: db.md
- API Reference:
87 changes: 87 additions & 0 deletions docs/sources/predictlist/index.md
@@ -0,0 +1,87 @@
# Retrain and Predict
Use an existing model group to retrain a new model on all the data up to the current date and then predict forward into the future.

## Examples
Both examples assume you have already run a Triage Experiment in the past, and know these two pieces of information:
1. A `model_group_id` from a Triage model group that you want to use to retrain a model and generate predictions
2. A `prediction_date` to generate your predictions on.

### CLI
`triage retrainpredict <model_group_id> <prediction_date>`

Example:
`triage retrainpredict 30 2021-04-04`

The `retrainpredict` command will assume the current path to be the 'project path' for training models and writing matrices, but this can be overridden with the `--project-path` option.

### Python
The `Retrainer` class from the `triage.predictlist` module can be used to retrain a model and predict forward.

```python
from triage.predictlist import Retrainer
from triage import create_engine

retrainer = Retrainer(
    db_engine=create_engine(<your-db-info>),
    project_path='/home/you/triage/project2',
    model_group_id=36,
)
retrainer.retrain(prediction_date='2021-04-04')
retrainer.predict(prediction_date='2021-04-04')

```

## Output
The retrained model is stored similarly to the matrices created during an Experiment:
- Raw Matrix saved to the matrices directory in project storage
- Raw Model saved to the trained_model directory in project storage
- Retrained Model info saved in a table (triage_metadata.models) where model_comment = 'retrain_2021-04-04 21:19:09.975112'
- Predictions saved in a table (triage_production.predictions)
- Prediction metadata (tiebreaking, random seed) saved in a table (triage_production.prediction_metadata)
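Once a run completes, these artifacts can be checked directly in the database. As a minimal sketch (assuming a SQLAlchemy `db_engine` connected to your Triage database; the helper functions below are illustrative and not part of the Triage API):

```python
# Illustrative helpers (not part of the Triage API) that build the SQL you
# might run to inspect the output tables of a retrain-and-predict run.

def retrained_model_query(model_comment):
    # The retrained model row is identified by its model_comment,
    # e.g. 'retrain_2021-04-04 21:19:09.975112'
    return (
        "select model_id, model_hash from triage_metadata.models "
        f"where model_comment = '{model_comment}'"
    )

def predictions_query(prediction_date):
    # Predictions generated for the requested date, highest score first
    return (
        "select entity_id, score from triage_production.predictions "
        f"where as_of_date = '{prediction_date}' order by score desc"
    )

# With a live engine this would look like:
# rows = db_engine.execute(predictions_query('2021-04-04')).fetchall()
```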


# Predictlist
If you would like to generate a list of predictions on an already-trained Triage model with new data, you can use the 'Predictlist' module.

# Predict Forward with an Existing Model
Use an existing model object to generate predictions on new data.

## Examples
Both examples assume you have already run a Triage Experiment in the past, and know these two pieces of information:
1. A `model_id` from a Triage model that you want to use to generate predictions
2. An `as_of_date` to generate your predictions on.

### CLI
`triage predictlist <model_id> <as_of_date>`

Example:
`triage predictlist 46 2019-05-06`

The `predictlist` command will assume the current path to be the 'project path' for finding models and writing matrices, but this can be overridden with the `--project-path` option.

### Python

The `predict_forward_with_existed_model` function from the `triage.predictlist` module can be used similarly to the CLI, with the addition of the database engine and project storage as inputs.
```python
from triage.predictlist import predict_forward_with_existed_model
from triage import create_engine

predict_forward_with_existed_model(
    db_engine=create_engine(<your-db-info>),
    project_path='/home/you/triage/project2',
    model_id=46,
    as_of_date='2019-05-06'
)
```

## Output
The Predictlist is stored similarly to the matrices created during an Experiment:
- Raw Matrix saved to the matrices directory in project storage
- Predictions saved in a table (triage_production.predictions)
- Prediction metadata (tiebreaking, random seed) saved in a table (triage_production.prediction_metadata)

## Notes
- The cohort and features for the Predictlist are all inferred from the Experiment that trained the given `model_id` (as defined by the experiment_models table).
- The feature list ensures that imputation flag columns are present for any columns that either needed to be imputed in the training process, or that needed to be imputed in the predictlist dataset.
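As a rough sketch of that column-naming relationship, inferred from the collate tests in this change (this is not the collate API itself, and the `_imp` suffix is an assumption for illustration): feature columns end in an aggregate-function suffix, and the imputation flag shares the column's base name.

```python
# Sketch of collate's imputation-flag naming, inferred from the tests in
# this change; the `_imp` suffix is an illustrative assumption.

KNOWN_AGGREGATES = {"sum", "count", "avg", "min", "max"}

def imputation_flag_column(feature_column):
    """Derive the imputation flag column for a collate feature column."""
    base, _, agg = feature_column.rpartition("_")
    if agg not in KNOWN_AGGREGATES:
        # Aggregates containing underscores (e.g. stddev_samp) keep the
        # full column name as the imputation flag base, per the tests above.
        base = feature_column
    return base + "_imp"
```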


11 changes: 6 additions & 5 deletions src/tests/catwalk_tests/test_model_trainers.py
@@ -60,7 +60,6 @@ def set_test_seed():
misc_db_parameters=dict(),
matrix_store=get_matrix_store(project_storage),
)

# assert
# 1. that the models and feature importances table entries are present
records = [
@@ -286,11 +285,13 @@ def test_reuse_model_random_seeds(grid_config, default_model_trainer):
def update_experiment_models(db_engine):
sql = """
INSERT INTO triage_metadata.experiment_models(experiment_hash,model_hash)
SELECT m.built_by_experiment, m.model_hash
FROM triage_metadata.models m
SELECT er.run_hash, m.model_hash
FROM triage_metadata.models m
LEFT JOIN triage_metadata.triage_runs er
ON m.built_in_triage_run = er.id
LEFT JOIN triage_metadata.experiment_models em
ON m.model_hash = em.model_hash
AND m.built_by_experiment = em.experiment_hash
ON m.model_hash = em.model_hash
AND er.run_hash = em.experiment_hash
WHERE em.experiment_hash IS NULL
"""
db_engine.execute(sql)
52 changes: 52 additions & 0 deletions src/tests/collate_tests/test_collate.py
@@ -4,6 +4,7 @@
Unit tests for `collate` module.
"""
import pytest
from triage.component.collate import Aggregate, Aggregation, Categorical

def test_aggregate():
@@ -191,3 +192,54 @@ def test_distinct():
),
)
) == ["count(distinct (x,y)) FILTER (WHERE date < '2012-01-01')"]


def test_Aggregation_colname_aggregate_lookup():
n = Aggregate("x", "sum", {})
d = Aggregate("1", "count", {})
m = Aggregate("y", "avg", {})
aggregation = Aggregation(
[n, d, m],
groups=['entity_id'],
from_obj="source",
prefix="mysource",
state_table="tbl"
)
assert aggregation.colname_aggregate_lookup == {
'mysource_entity_id_x_sum': 'sum',
'mysource_entity_id_1_count': 'count',
'mysource_entity_id_y_avg': 'avg'
}

def test_Aggregation_colname_agg_function():
n = Aggregate("x", "sum", {})
d = Aggregate("1", "count", {})
m = Aggregate("y", "stddev_samp", {})
aggregation = Aggregation(
[n, d, m],
groups=['entity_id'],
from_obj="source",
prefix="mysource",
state_table="tbl"
)

assert aggregation.colname_agg_function('mysource_entity_id_x_sum') == 'sum'
assert aggregation.colname_agg_function('mysource_entity_id_y_stddev_samp') == 'stddev_samp'


def test_Aggregation_imputation_flag_base():
n = Aggregate("x", ["sum", "count"], {})
m = Aggregate("y", "stddev_samp", {})
aggregation = Aggregation(
[n, m],
groups=['entity_id'],
from_obj="source",
prefix="mysource",
state_table="tbl"
)

assert aggregation.imputation_flag_base('mysource_entity_id_x_sum') == 'mysource_entity_id_x'
assert aggregation.imputation_flag_base('mysource_entity_id_x_count') == 'mysource_entity_id_x'
assert aggregation.imputation_flag_base('mysource_entity_id_y_stddev_samp') == 'mysource_entity_id_y_stddev_samp'
with pytest.raises(KeyError):
aggregation.imputation_flag_base('mysource_entity_id_x_stddev_samp')
2 changes: 1 addition & 1 deletion src/tests/postmodeling_tests/test_model_group_evaluator.py
@@ -11,7 +11,7 @@ def model_group_evaluator(finished_experiment):

def test_ModelGroupEvaluator_metadata(model_group_evaluator):
assert isinstance(model_group_evaluator.metadata, list)
assert len(model_group_evaluator.metadata) == 8 # 8 model groups expected from basic experiment
assert len(model_group_evaluator.metadata) == 2 # 2 models expected for a model_group from basic experiment
for row in model_group_evaluator.metadata:
assert isinstance(row, dict)

8 changes: 4 additions & 4 deletions src/tests/results_tests/factories.py
@@ -181,12 +181,12 @@ class Meta:
matrix_uuid = factory.SelfAttribute("matrix_rel.matrix_uuid")


class ExperimentRunFactory(factory.alchemy.SQLAlchemyModelFactory):
class TriageRunFactory(factory.alchemy.SQLAlchemyModelFactory):
class Meta:
model = schema.ExperimentRun
model = schema.TriageRun
sqlalchemy_session = session

experiment_rel = factory.SubFactory(ExperimentFactory)
# experiment_rel = factory.SubFactory(ExperimentFactory)

start_time = factory.fuzzy.FuzzyNaiveDateTime(datetime(2008, 1, 1))
start_method = "run"
@@ -210,7 +210,7 @@ class Meta:
models_skipped = 0
models_errored = 0
last_updated_time = factory.fuzzy.FuzzyNaiveDateTime(datetime(2008, 1, 1))
current_status = schema.ExperimentRunStatus.started
current_status = schema.TriageRunStatus.started
stacktrace = ""


20 changes: 20 additions & 0 deletions src/tests/test_cli.py
@@ -2,6 +2,7 @@
import triage.cli as cli
from unittest.mock import Mock, patch
import os
import datetime


# we do not need a real database URL but one SQLalchemy thinks looks like a real one
@@ -56,3 +57,22 @@ def test_featuretest():
try_command('featuretest', 'example/config/experiment.yaml', '2017-06-06')
featuremock.assert_called_once()
cohortmock.assert_called_once()


def test_cli_predictlist():
with patch('triage.cli.predict_forward_with_existed_model', autospec=True) as mock:
try_command('predictlist', '40', '2019-06-04')
mock.assert_called_once()
assert mock.call_args[0][0].url
assert mock.call_args[0][1]
assert mock.call_args[0][2] == 40
assert mock.call_args[0][3] == datetime.datetime(2019, 6, 4)


def test_cli_retrain_predict():
with patch('triage.cli.Retrainer', autospec=True) as mock:
try_command('retrainpredict', '3', '2021-04-04')
mock.assert_called_once()
assert mock.call_args[0][0].url
assert mock.call_args[0][1]
assert mock.call_args[0][2] == 3
110 changes: 110 additions & 0 deletions src/tests/test_predictlist.py
@@ -0,0 +1,110 @@
from triage.predictlist import Retrainer, predict_forward_with_existed_model, train_matrix_info_from_model_id, experiment_config_from_model_id
from triage.validation_primitives import table_should_have_data


def test_predict_forward_with_existed_model_should_write_predictions(finished_experiment):
# given a model id and as-of-date <= today
# and the model id is trained and is linked to an experiment with feature and cohort config
# generate records in triage_production.predictions
# the # of records should equal the size of the cohort for that date
model_id = 1
as_of_date = '2014-01-01'
predict_forward_with_existed_model(
db_engine=finished_experiment.db_engine,
project_path=finished_experiment.project_storage.project_path,
model_id=model_id,
as_of_date=as_of_date
)
table_should_have_data(
db_engine=finished_experiment.db_engine,
table_name="triage_production.predictions",
)


def test_predict_forward_with_existed_model_should_be_same_shape_as_cohort(finished_experiment):
model_id = 1
as_of_date = '2014-01-01'
predict_forward_with_existed_model(
db_engine=finished_experiment.db_engine,
project_path=finished_experiment.project_storage.project_path,
model_id=model_id,
as_of_date=as_of_date)

num_records_matching_cohort = finished_experiment.db_engine.execute(
f'''select count(*)
from triage_production.predictions
join triage_production.cohort_{finished_experiment.config['cohort_config']['name']} using (entity_id, as_of_date)
'''
).first()[0]

num_records = finished_experiment.db_engine.execute(
'select count(*) from triage_production.predictions'
).first()[0]
assert num_records_matching_cohort == num_records


def test_predict_forward_with_existed_model_matrix_record_is_populated(finished_experiment):
model_id = 1
as_of_date = '2014-01-01'
predict_forward_with_existed_model(
db_engine=finished_experiment.db_engine,
project_path=finished_experiment.project_storage.project_path,
model_id=model_id,
as_of_date=as_of_date)

matrix_records = list(finished_experiment.db_engine.execute(
"select * from triage_metadata.matrices where matrix_type = 'production'"
))
assert len(matrix_records) == 1


def test_experiment_config_from_model_id(finished_experiment):
model_id = 1
experiment_config = experiment_config_from_model_id(finished_experiment.db_engine, model_id)
assert experiment_config == finished_experiment.config


def test_train_matrix_info_from_model_id(finished_experiment):
model_id = 1
(train_matrix_uuid, matrix_metadata) = train_matrix_info_from_model_id(finished_experiment.db_engine, model_id)
assert train_matrix_uuid
assert matrix_metadata


def test_retrain_should_write_model(finished_experiment):
# given a model id and prediction_date
# and the model id is trained and is linked to an experiment with feature and cohort config
# create matrix for retraining a model
# generate records in production models
# retrain_model_hash should be the same with model_hash in triage_metadata.models
model_group_id = 1
prediction_date = '2014-03-01'

retrainer = Retrainer(
db_engine=finished_experiment.db_engine,
project_path=finished_experiment.project_storage.project_path,
model_group_id=model_group_id,
)
retrain_info = retrainer.retrain(prediction_date)
model_comment = retrain_info['retrain_model_comment']

records = [
row
for row in finished_experiment.db_engine.execute(
f"select model_hash from triage_metadata.models where model_comment = '{model_comment}'"
)
]
assert len(records) == 1
assert retrainer.retrain_model_hash == records[0][0]

retrainer.predict(prediction_date)

table_should_have_data(
db_engine=finished_experiment.db_engine,
table_name="triage_production.predictions",
)

matrix_records = list(finished_experiment.db_engine.execute(
f"select * from triage_metadata.matrices where matrix_uuid = '{retrainer.predict_matrix_uuid}'"
))
assert len(matrix_records) == 1
