Break up pipeline.run into smaller methods #93

thcrock · 2017-04-06T23:46:07Z

Codeclimate has identified some real issues with duplication and complexity, and the two pipeline.run methods are among the worst offenders. Extracting new methods from them to hide details and remove duplication should help.

https://codeclimate.com/github/dssg/triage/issues

This also allows more granular usage, and lets callers inspect what the pipeline will do before any intensive methods are called.
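As a rough illustration of the idea (not the actual triage code), here is a minimal sketch of a pipeline whose `run` is just a composition of smaller methods. `SketchPipeline`, its method names, and the config keys are all hypothetical, made up for this example.

```python
class SketchPipeline:
    """Hypothetical pipeline; names are illustrative, not triage's actual API."""

    def __init__(self, config):
        self.config = config
        self.trained = []

    def generate_split_definitions(self):
        # Cheap: derive the train/test splits from config only.
        return [
            {"train_end": year, "test_end": year + 1}
            for year in self.config["years"]
        ]

    def build_tasks(self, splits):
        # Cheap: pair every split with every model, without training anything.
        return [
            {"split": split, "model": model}
            for split in splits
            for model in self.config["models"]
        ]

    def train_and_test(self, task):
        # Stand-in for the expensive step; here it only records the task.
        self.trained.append(task)
        return "{}@{}".format(task["model"], task["split"]["train_end"])

    def run(self):
        # run() only composes the smaller steps.
        splits = self.generate_split_definitions()
        tasks = self.build_tasks(splits)
        return [self.train_and_test(task) for task in tasks]


pipeline = SketchPipeline({"years": [2014, 2015], "models": ["lr", "rf"]})

# Inspect the planned work before anything expensive runs.
tasks = pipeline.build_tasks(pipeline.generate_split_definitions())
print(len(tasks))

results = pipeline.run()
```

With this shape, a second pipeline variant could reuse `generate_split_definitions` and `build_tasks` and only swap out how the tasks are executed.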

…113, #114]

The original goal here was to implement a consistent 'replace' kwarg throughout the pipeline, so it can be restarted later more easily. A few other things made it in here. In general they made this change easier, so they aren't totally unrelated, but are different enough that they are helped by being called out. And some are simple performance updates.

Directly related [#99]:

- Create pipeline test that makes sure skip-if-present functionality works, rename existing generic pipeline test for clarity
- Add retrieve from database functionality to Predictor class that ensures order is the same as the passed-in matrix, call if replace kwarg is False [#112]
- Add replace kwarg to FeatureGenerator constructor that will skip creating table tasks if the feature table exists
- Flip ModelTrainer replace default kwarg to True for conformity
- Add replace kwarg to PipelineBase to pass through to all components

Tangentially or not related:

- Add sqlalchemy-postgres-copy, use in predictions writing for speed upgrade [#111]
- Remove model.predict() [#45]
- Move much functionality in pipelines to smaller methods, to allow both code reuse between them and inspection before running lengthy tasks [Resolves #93]
- Use new matrix_store.empty instead of matrix_store.matrix.empty to allow the empty check without loading the entire matrix into memory [#113]
- Use matrix_store internally in Predictor to reduce number of items that need to be passed around
- Create trained models directory if it isn't there
- Close DB sessions properly [#114]
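The skip-if-present behavior behind the 'replace' kwarg can be sketched roughly as follows. This is only an illustration under assumptions: `SketchStep`, `get_or_create`, and the dict used as storage are hypothetical stand-ins, not the real components or their database-backed storage.

```python
class SketchStep:
    """Hypothetical component that honors a 'replace' flag."""

    def __init__(self, storage, replace=True):
        self.storage = storage      # stand-in for a database table or model store
        self.replace = replace
        self.compute_calls = 0

    def _compute(self, key):
        # Stand-in for an expensive step such as training or predicting.
        self.compute_calls += 1
        return "result-for-" + key

    def get_or_create(self, key):
        # With replace=False, reuse whatever a previous run already stored.
        if not self.replace and key in self.storage:
            return self.storage[key]
        self.storage[key] = self._compute(key)
        return self.storage[key]


storage = {}

# First run: everything is computed and stored.
first = SketchStep(storage, replace=True)
first.get_or_create("model_1")

# Restarted run: existing results are reused, only missing ones are computed.
restarted = SketchStep(storage, replace=False)
restarted.get_or_create("model_1")
restarted.get_or_create("model_2")
```

Passing the same flag from a base pipeline down to every component, as the commit describes, would let a restarted run skip whatever earlier runs already finished.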

Workflow and performance updates [Resolves #45, #93, #99, #111, #112,…

* Update sqlalchemy from 1.1.9 to 1.1.11
* Update sphinx from 1.5.6 to 1.6.3
* Update cryptography from 1.8.1 to 1.9
* Update pytest from 3.0.7 to 3.1.3

thcrock added the refactoring label Apr 11, 2017

thcrock added this to the v0.4 milestone Apr 18, 2017

thcrock changed the title ~~Refactor based on codeclimate issues~~ Break up pipeline.run into smaller methods Apr 18, 2017

thcrock mentioned this issue Apr 18, 2017

Workflow and performance updates [Resolves #45, #93, #99, #111, #112,… #115

Merged

ecsalomon closed this as completed in #115 Apr 18, 2017

ecsalomon added a commit that referenced this issue Apr 18, 2017

Merge pull request #115 from dssg/replace_or_use

9235348

Workflow and performance updates [Resolves #45, #93, #99, #111, #112,…
