Parallelize collate feature generation #83

thcrock · 2017-03-31T20:33:27Z

As projects scale, feature generation takes longer. An easy way to parallelize it is to run each aggregation, or even each group, in its own process. Because each aggregation/group writes to its own table, there should be no table-locking problems, and the work should be able to use multiple cores at the database level.

- Add FeatureGenerator#create_group_table which creates and populates one single table
- Add FeatureGenerator#generate_all_table_tasks which generates the commands for all configured tables
- Remove generate from FeatureGenerator and add create_all_tables which wraps both of the new methods to approximate the behavior of generate (serialized creation of all tables)
- Have LocalParallelPipeline call the new FeatureGenerator interface in order to run each group table's creation in parallel
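To make the split between task generation and table creation concrete, here is a minimal sketch of how the three methods listed above might fit together. Only the method names come from the commit message. The class name FeatureGeneratorSketch, the constructor arguments, the method signatures, the shape of the task dictionary, and the use of an in-memory SQLite database are all assumptions made so the example runs on its own; the actual FeatureGenerator may look quite different.

```python
import sqlite3


class FeatureGeneratorSketch:
    """Hypothetical stand-in for the interface named in the commit message."""

    def __init__(self, conn, group_queries):
        self.conn = conn
        # Assumed shape: table name -> SELECT statement that populates it.
        self.group_queries = group_queries

    def generate_all_table_tasks(self):
        """Return {table_name: [SQL commands]} for every configured table."""
        return {
            table: [
                f"DROP TABLE IF EXISTS {table}",
                f"CREATE TABLE {table} AS {select}",
            ]
            for table, select in self.group_queries.items()
        }

    def create_group_table(self, table_name, commands):
        """Create and populate one single table by running its commands in order."""
        for command in commands:
            self.conn.execute(command)
        self.conn.commit()
        return table_name

    def create_all_tables(self):
        """Serial wrapper: generate every task, then create each table in turn."""
        tasks = self.generate_all_table_tasks()
        return [
            self.create_group_table(table, commands)
            for table, commands in tasks.items()
        ]


# Illustrative data; the table and column names are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (entity_id INTEGER, zip TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "60601", 10.0), (1, "60601", 5.0), (2, "60602", 7.0)],
)
conn.commit()

generator = FeatureGeneratorSketch(
    conn,
    {
        "events_entity_id": (
            "SELECT entity_id, SUM(amount) AS amount_sum "
            "FROM events GROUP BY entity_id"
        ),
        "events_zip": (
            "SELECT zip, COUNT(*) AS event_count FROM events GROUP BY zip"
        ),
    },
)
created = generator.create_all_tables()
```

Because generate_all_table_tasks returns independent per-table command lists, a parallel pipeline could hand each entry to a separate worker instead of looping over them serially.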

- Introduce LocalParallelPipeline#parallelize to encapsulate parallelizing a function and list of tasks
- Rename high-level FeatureGenerator interface from 'generate' to 'create_all_tables'
- Add FeatureGenerator#aggregations for direct access to all collate Aggregations
- Add FeatureGenerator#generate_all_table_tasks to produce a parallelizable list of all tables and their creation/population queries
- Add FeatureGenerator#run_commands to run a list of SQL commands in a transaction
- Switch to multiprocessing.Pool to take advantage of chunksize
- In SerialPipeline, use new FeatureGenerator#create_all_tables
- In LocalParallelPipeline, retrieve all table tasks, and run inserts in parallel batches of 25
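The parallelize helper described above could be sketched roughly as follows. This is not the project's implementation: the function signature, the pool_factory parameter, and the describe_task worker are hypothetical, and reading "batches of 25" as a chunksize of 25 is an assumption. The only library calls used are multiprocessing.Pool and its map method with the chunksize argument.

```python
from multiprocessing import Pool


def parallelize(func, tasks, n_processes=2, chunksize=25, pool_factory=Pool):
    """Apply func to every task using a pool of workers.

    chunksize controls how many tasks are handed to a worker at a time.
    pool_factory is a hypothetical hook so that a thread-based pool with
    the same interface can be swapped in, for example in tests.
    """
    tasks = list(tasks)
    if not tasks:
        return []
    with pool_factory(n_processes) as pool:
        # map blocks until all tasks finish and preserves input order.
        return pool.map(func, tasks, chunksize=chunksize)


def describe_task(task):
    """Stand-in worker; a real one would run the commands against the database."""
    table_name, commands = task
    return table_name, len(commands)
```

With the default process-based pool, the worker function has to be defined at module level so it can be pickled, and the call to parallelize should sit under an `if __name__ == "__main__":` guard on platforms that spawn rather than fork worker processes. Each worker would also need its own database connection, since connections generally cannot be shared across processes.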

Parallelize Feature Generation [Resolves #83]

Allow overriding of choice quoting [Resolves #81]

thcrock added the new-feature label Mar 31, 2017

thcrock self-assigned this Apr 3, 2017

thcrock added this to the v0.3 milestone Apr 5, 2017

k1aus closed this as completed in 93b2e0e Apr 7, 2017

k1aus added a commit that referenced this issue Apr 7, 2017

Merge pull request #91 from dssg/parallelize_features

478ac2e

Parallelize Feature Generation [Resolves #83]

thcrock added performance and removed new-feature labels Apr 11, 2017

jesteria pushed a commit that referenced this issue Nov 28, 2017

Merge pull request #83 from dssg/choice_quoting

36a93e2

Allow overriding of choice quoting [Resolves #81]
