Move checkpointing parallelism into TaskExecutor class, use that class for parallel union_by_name
#12957
#9999 introduced parallel checkpointing by adding a separate mechanism for managing tasks during the checkpointing process. This was necessary because the checkpointing code cannot easily use the normal parallelism loops during execution - as the checkpointing can run outside of regular query execution (e.g. when shutting down a database, or during a commit).
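As a rough illustration of what such a task-management mechanism involves, here is a standalone C++ sketch that schedules independent tasks, waits for all of them, and rethrows the first error on the calling thread. It is not DuckDB's code: `MiniExecutor`, `Schedule`, `WaitForAll`, `CountColumns` and `CountColumnsParallel` are hypothetical names invented for this example, and the sketch spawns one thread per task rather than reusing existing worker threads.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical illustration only; this is NOT DuckDB's TaskExecutor API.
class MiniExecutor {
public:
    ~MiniExecutor() { JoinAll(); }

    // Start a task. This sketch spawns one thread per task; a real engine
    // would hand the task to its existing worker threads instead.
    void Schedule(std::function<void()> task) {
        assert(task);
        threads.emplace_back([this, task = std::move(task)]() {
            try {
                task();
            } catch (...) {
                // Remember only the first error raised by any task.
                std::lock_guard<std::mutex> guard(lock);
                if (!error) {
                    error = std::current_exception();
                }
            }
        });
    }

    // Block until every scheduled task has finished, then rethrow the
    // first captured error (if any) on the calling thread.
    void WaitForAll() {
        JoinAll();
        if (error) {
            std::exception_ptr first = error;
            error = nullptr;
            std::rethrow_exception(first);
        }
    }

private:
    void JoinAll() {
        for (auto &thread : threads) {
            if (thread.joinable()) {
                thread.join();
            }
        }
        threads.clear();
    }

    std::mutex lock;
    std::exception_ptr error;
    std::vector<std::thread> threads;
};

// Hypothetical per-file work: count comma-separated columns in a header line.
std::size_t CountColumns(const std::string &header) {
    if (header.empty()) {
        throw std::invalid_argument("empty header");
    }
    std::size_t columns = 1;
    for (char c : header) {
        if (c == ',') {
            columns++;
        }
    }
    return columns;
}

// Each task writes to its own slot of `result`, so no extra locking is needed.
std::vector<std::size_t> CountColumnsParallel(const std::vector<std::string> &headers) {
    std::vector<std::size_t> result(headers.size());
    MiniExecutor executor;
    for (std::size_t i = 0; i < headers.size(); i++) {
        executor.Schedule([&result, &headers, i]() { result[i] = CountColumns(headers[i]); });
    }
    executor.WaitForAll();
    return result;
}
```

Calling `CountColumnsParallel` with a list of header lines returns one column count per line, and an empty header surfaces as a `std::invalid_argument` on the caller's thread once all tasks have finished.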
This PR extracts the logic that was added in that PR into a separate class - the `TaskExecutor`. This class can be used to easily add parallelism in places where we cannot use the regular parallelism model. The `TaskExecutor` schedules tasks using the `TaskScheduler`, which are then executed in parallel using the regular worker threads. It merely provides a number of helper functions for keeping track of how many tasks have completed, and for error handling across different threads.

In this PR we use the `TaskExecutor` to provide a parallel implementation of the `union_by_name` file scanning. Since we perform auto-detection on all files, this is trivial to parallelize, and can provide substantial speedups when running `read_csv` or `read_parquet` with `union_by_name` enabled over many small files. The `union_by_name` is also a good showcase for how easy the parallelism is to add using the `TaskExecutor`, e.g.:

Benchmarks
Below are some timings of reading 1000 small CSV files. Source: