As projects scale, feature generation takes longer. An easy place to parallelize is to run each aggregation, or even each group, in its own process. Because each aggregation or group writes to its own table, parallel runs should not contend for table locks, and the database should be able to spread the work across multiple cores.
- Add FeatureGenerator#create_group_table which creates and populates one single table
- Add FeatureGenerator#generate_all_table_tasks which generates the commands for all configured tables
- Remove generate from FeatureGenerator and add create_all_tables which wraps both of the new methods to approximate the behavior of generate (serialized creation of all tables)
- Have LocalParallelPipeline call the new FeatureGenerator interface in order to run each group table's creation in parallel
- Introduce LocalParallelPipeline#parallelize to encapsulate parallelizing a function and list of tasks
- Rename high-level FeatureGenerator interface from 'generate' to 'create_all_tables'
- Add FeatureGenerator#aggregations for direct access to all collate Aggregations
- Add FeatureGenerator#generate_all_table_tasks to produce a parallelizable list of all tables and their creation/population queries
- Add FeatureGenerator#run_commands to run a list of SQL commands in a transaction
- Switch to multiprocessing.Pool to take advantage of chunksize
- In SerialPipeline, use new FeatureGenerator#create_all_tables
- In LocalParallelPipeline, retrieve all table tasks, and run inserts in parallel batches of 25
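The approach described in the items above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `generate_all_table_tasks`, `run_commands`, and `parallelize` here are hypothetical stand-ins that borrow the method names from the list, the SQL strings are invented placeholders, and the worker only counts commands instead of executing them against a database. The `pool_factory` parameter is also an assumption, added so the pool type can be swapped.

```python
from multiprocessing import Pool


def generate_all_table_tasks(group_names):
    """Hypothetical stand-in: map each table name to its ordered SQL commands."""
    return {
        f"{name}_aggregation": [
            f"CREATE TABLE {name}_aggregation (entity_id INT)",
            f"INSERT INTO {name}_aggregation SELECT 1",
        ]
        for name in group_names
    }


def run_commands(task):
    """Hypothetical worker. A real one would open its own database
    connection and run the commands inside a single transaction;
    this sketch only counts them."""
    table_name, commands = task
    return table_name, len(commands)


def parallelize(func, tasks, n_processes=2, chunksize=25, pool_factory=Pool):
    """Apply func to every task, handing tasks to workers in batches of chunksize."""
    with pool_factory(processes=n_processes) as pool:
        # map() blocks until all tasks finish and preserves input order.
        return pool.map(func, list(tasks), chunksize=chunksize)


if __name__ == "__main__":
    table_tasks = generate_all_table_tasks(["events", "inspections", "violations"])
    for table_name, n_commands in parallelize(run_commands, table_tasks.items()):
        print(table_name, n_commands)
```

Because each task targets a different table, the workers do not need to coordinate with one another. Each worker process should create its own database connection rather than share one inherited from the parent.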