Parallelize collate feature generation #83

Closed
thcrock opened this issue Mar 31, 2017 · 0 comments
thcrock commented Mar 31, 2017

As projects scale, feature generation takes longer. An easy way to parallelize it is to run each aggregation, or even each group within an aggregation, in its own process. Because each aggregation/group writes to a separate table, table locking should not be a problem, and the database should be able to spread the work across multiple cores.
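A minimal sketch of the idea, not the project's actual code: every (aggregation, group) pair becomes one independent task, and a pool of worker processes handles the tasks. The names `table_name`, `build_table`, and `build_all`, the table naming scheme, and the use of `concurrent.futures` are all assumptions made for illustration. `build_table` only returns the table name where a real worker would open its own database connection and run the table's SQL.

```python
from concurrent.futures import ProcessPoolExecutor


def table_name(aggregation, group):
    """Hypothetical naming scheme: one output table per (aggregation, group)."""
    return f"{aggregation}_{group}"


def build_table(task):
    """Stand-in for the real work on a single table.

    A real worker would open its own database connection here and run the
    create/populate statements for this one table.
    """
    aggregation, group = task
    return table_name(aggregation, group)


def build_all(tasks, max_workers=4, executor_cls=ProcessPoolExecutor):
    """Run build_table once per task in a worker pool; results keep task order."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(build_table, tasks))
```

With the default process pool, `build_all` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Since no two tasks share a table name, the workers never write to the same table.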

thcrock self-assigned this Apr 3, 2017
thcrock added a commit that referenced this issue Apr 3, 2017
- Add FeatureGenerator#create_group_table which creates and populates one single table
- Add FeatureGenerator#generate_all_table_tasks which generates the commands for all configured tables
- Remove generate from FeatureGenerator and add create_all_tables which wraps both of the new methods to approximate the behavior of generate (serialized creation of all tables)
- Have LocalParallelPipeline call the new FeatureGenerator interface in order to run each group table's creation in parallel
thcrock added this to the v0.3 milestone Apr 5, 2017
thcrock added a commit that referenced this issue Apr 6, 2017
- Introduce LocalParallelPipeline#parallelize to encapsulate parallelizing a function and list of tasks
- Rename high-level FeatureGenerator interface from 'generate' to 'create_all_tables'
- Add FeatureGenerator#aggregations for direct access to all collate Aggregations
- Add FeatureGenerator#generate_all_table_tasks to produce a parallelizable list of all tables and their creation/population queries
- Add FeatureGenerator#run_commands to run a list of SQL commands in a transaction
- Switch to multiprocessing.Pool to take advantage of chunksize
- In SerialPipeline, use new FeatureGenerator#create_all_tables
- In LocalParallelPipeline, retrieve all table tasks, and run inserts in parallel batches of 25
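The "run a list of SQL commands in a transaction" and "batches of 25" items above could look roughly like the following. This is a sketch under stated assumptions, not the code from the commit: it uses the standard-library `sqlite3` module as a stand-in for the project's database, and the free function `run_commands(conn, commands)` and the helper `batches` are hypothetical names and signatures.

```python
import sqlite3


def batches(items, size=25):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_commands(conn, commands):
    """Run SQL commands in one transaction: either all take effect or none do.

    Assumes `conn` was opened with isolation_level=None, so that this function
    controls BEGIN/COMMIT/ROLLBACK explicitly.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        for command in commands:
            cur.execute(command)
        cur.execute("COMMIT")
    except sqlite3.Error:
        # Undo any partial work, then let the caller see the failure.
        if conn.in_transaction:
            cur.execute("ROLLBACK")
        raise
```

Each batch returned by `batches` can then be handed to a worker that calls `run_commands` on its own connection, so a failure in one batch leaves no partial rows behind.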
k1aus closed this as completed in 93b2e0e Apr 7, 2017
k1aus added a commit that referenced this issue Apr 7, 2017
Parallelize Feature Generation [Resolves #83]
jesteria pushed a commit that referenced this issue Nov 28, 2017
Allow overriding of choice quoting [Resolves #81]