Motivation
Brokers should be able to merge query result sets in parallel, adaptively and automatically, based on current overall utilization. The "merge/combine" of sequences constitutes the bulk of the real work that Brokers perform. It currently takes place within a single thread from the HTTP thread pool, which, while fair-ish, means that we potentially under-utilize the additional cores on the server if the majority of queries are blocked waiting for results. Using a divide-and-conquer approach to perform this combining merge of results in parallel should often dramatically reduce the time this operation takes, and should also make broker resource utilization more predictable.
Proposed changes
To achieve this, we will introduce a new opt-in mode to enable parallel merging of results by Druid brokers using a fork-join pool in 'async' mode. This proposal is the result of running with the basic idea captured in #6629 (review), building on the good work done in #5913 and #6629, creating a couple of prototype implementations, and performing a large number of experiments.
The primary change suggested by this proposal is to push some or all of the work currently done by QueryToolchest.mergeResults down into the Sequence merge currently done in CachingClusteredClient, for any QueryToolchest that implements createMergeFn. Note that in my current plans QueryToolchest.mergeResults will still be called and will not be modified; it just has a lot less work to do because some or all of the results will already be merged.
My current approach uses a two-layer hierarchy. Each task in the first layer merges a subset of the input sequences and writes its output to a blocking queue. The second layer is a single task that merges the blocking queue outputs of the first layer into a single output blocking queue. The level of parallelism for layer 1 will be chosen automatically based on current 'merge' pool utilization, and the fork-join tasks will self-tune to perform a limited number of operations per task before yielding their results and forking a new task, which continues the work once it is scheduled.
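To make the two-layer shape concrete, here is a minimal, self-contained sketch. It is not the prototype's code: the names (`TwoLayerMergeSketch`, `parallelMerge`, `mergeSorted`, `publish`, `drain`), the round-robin partitioning, the empty batch used as an end-of-stream marker, and the pool sizing are all assumptions made for illustration. It also simplifies layer 2, which here drains each layer 1 queue completely before merging instead of merging incrementally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.LinkedBlockingQueue;

final class TwoLayerMergeSketch {
  // Assumption: an empty batch marks the end of a stream in this sketch.
  private static final List<Integer> TERMINAL = List.of();

  /** Serial k-way merge of already sorted lists. */
  static List<Integer> mergeSorted(List<List<Integer>> inputs) {
    // Heap entries are {value, inputIndex, offset}, ordered by value.
    PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    for (int i = 0; i < inputs.size(); i++) {
      if (!inputs.get(i).isEmpty()) {
        heap.add(new int[] {inputs.get(i).get(0), i, 0});
      }
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      out.add(top[0]);
      List<Integer> source = inputs.get(top[1]);
      int next = top[2] + 1;
      if (next < source.size()) {
        heap.add(new int[] {source.get(next), top[1], next});
      }
    }
    return out;
  }

  /** Writes values in batches of batchSize (must be >= 1), then the terminal marker. */
  static void publish(List<Integer> values, int batchSize, BlockingQueue<List<Integer>> out) {
    for (int from = 0; from < values.size(); from += batchSize) {
      int to = Math.min(values.size(), from + batchSize);
      out.add(new ArrayList<>(values.subList(from, to)));
    }
    out.add(TERMINAL);
  }

  /** Blocks until the terminal marker arrives, collecting every batch before it. */
  static List<Integer> drain(BlockingQueue<List<Integer>> queue) {
    List<Integer> all = new ArrayList<>();
    try {
      for (List<Integer> batch = queue.take(); !batch.isEmpty(); batch = queue.take()) {
        all.addAll(batch);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return all;
  }

  static List<Integer> parallelMerge(List<List<Integer>> inputs, int layer1Tasks, int batchSize) {
    // asyncMode = true selects FIFO scheduling, suited to tasks that are never joined.
    // One worker per task, so the blocking take() in layer 2 cannot starve layer 1.
    ForkJoinPool pool = new ForkJoinPool(
        layer1Tasks + 1, ForkJoinPool.defaultForkJoinWorkerThreadFactory, null, true);
    try {
      List<BlockingQueue<List<Integer>>> layer1Outputs = new ArrayList<>();
      for (int t = 0; t < layer1Tasks; t++) {
        List<List<Integer>> group = new ArrayList<>();
        for (int i = t; i < inputs.size(); i += layer1Tasks) {
          group.add(inputs.get(i)); // round-robin partitioning
        }
        BlockingQueue<List<Integer>> out = new LinkedBlockingQueue<>();
        layer1Outputs.add(out);
        pool.execute(() -> publish(mergeSorted(group), batchSize, out)); // layer 1
      }
      BlockingQueue<List<Integer>> finalOutput = new LinkedBlockingQueue<>();
      pool.execute(() -> { // layer 2: a single task over the layer 1 outputs
        List<List<Integer>> partials = new ArrayList<>();
        for (BlockingQueue<List<Integer>> queue : layer1Outputs) {
          partials.add(drain(queue));
        }
        publish(mergeSorted(partials), batchSize, finalOutput);
      });
      return drain(finalOutput); // the calling thread consumes the final queue
    } finally {
      pool.shutdown();
    }
  }
}
```

Calling `TwoLayerMergeSketch.parallelMerge(inputs, 2, 2)` on four sorted lists runs two layer 1 tasks plus the layer 2 task, and the caller reads the merged result from the final queue, much as the HTTP thread would.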
In a nod to query vectorization, which happens at the segment level for historical processes, and more importantly to minimize the number of blocking operations within fork-join pool tasks, the results from the input sequences will be yielded in small batches, processed in batches, and added to the output blocking queues a batch at a time. While I haven't yet spent the time to find the ideal batch size, batching seems to work dramatically better than processing a single result at a time, which in some of my initial experiments was even slower in parallel than the existing serial approach due to high lock contention.
A prototype implementation based on the experiments so far (but still missing a few features) is available here: #8578. The design will be described using the terms from this branch, but I consider everything fair game and am willing to change it based on discussion in this proposal.
Result merging on the fork-join pool
A new class ResultBatch<T> will capture this notion of result batches, wrapping a Queue<T>, as well as the idea of a 'terminal' object in order to communicate to downstream fork-join tasks that a sequence is completed. To simplify the processing of results without directly dealing with these batches, a cursor pattern to allow easily processing individual results from the batches:
class BatchedResultsCursor<T> implements ForkJoinPool.ManagedBlocker, Comparable<BatchedResultsCursor<T>>
is also introduced, with implementations for Yielder<ResultBatch<T>> and BlockingQueue<ResultBatch<T>> to allow using the same types of worker tasks for both layer 1 and layer 2. The yielder cursors and blocking queue cursors operate slightly differently in that the yielder cursors are created 'pre-loaded' by virtue of converting the input sequences into accumulating yielders.
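The batch and cursor ideas can be sketched roughly as follows. This is an illustration under assumptions rather than the prototype's classes: `ResultBatchSketch` and `QueueCursorSketch` are hypothetical names, a `null` queue is used as the terminal marker, only the blocking queue flavor of the cursor is shown, and the `Comparable` part is left out. The point is that `ForkJoinPool.ManagedBlocker` lets the cursor wait for the next batch without silently tying up a pool worker.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;

/** Hypothetical stand-in for the proposed ResultBatch<T>. */
final class ResultBatchSketch<T> {
  private final Queue<T> values; // assumption: null marks the terminal batch

  ResultBatchSketch(Queue<T> values) { this.values = values; }

  static <T> ResultBatchSketch<T> terminal() { return new ResultBatchSketch<>(null); }

  boolean isTerminal() { return values == null; }

  boolean isDrained() { return values == null || values.isEmpty(); }

  T poll() { return values.poll(); }
}

/** Hypothetical blocking queue cursor that hides batch boundaries from the caller. */
final class QueueCursorSketch<T> implements ForkJoinPool.ManagedBlocker {
  private final BlockingQueue<ResultBatchSketch<T>> queue;
  private ResultBatchSketch<T> current;

  QueueCursorSketch(BlockingQueue<ResultBatchSketch<T>> queue) { this.queue = queue; }

  // True when there is a value to hand out, or the stream is known to be finished.
  private boolean ready() {
    return current != null && (current.isTerminal() || !current.isDrained());
  }

  @Override
  public boolean isReleasable() {
    if (!ready()) {
      current = queue.poll(); // non-blocking attempt first
    }
    return ready();
  }

  @Override
  public boolean block() throws InterruptedException {
    if (!ready()) {
      current = queue.take(); // the only place this cursor blocks
    }
    return ready();
  }

  /** Returns the next result, or null once the terminal batch is reached (results must be non-null). */
  T next() throws InterruptedException {
    ForkJoinPool.managedBlock(this);
    return current.isTerminal() ? null : current.poll();
  }

  /** Groups results into batches of batchSize and appends the terminal batch. */
  static <T> void publish(Iterable<T> results, int batchSize, BlockingQueue<ResultBatchSketch<T>> out) {
    Queue<T> batch = new ArrayDeque<>(batchSize);
    for (T result : results) {
      batch.add(result);
      if (batch.size() == batchSize) {
        out.add(new ResultBatchSketch<>(batch));
        batch = new ArrayDeque<>(batchSize);
      }
    }
    if (!batch.isEmpty()) {
      out.add(new ResultBatchSketch<>(batch));
    }
    out.add(ResultBatchSketch.terminal());
  }

  /** Reads individual results through a cursor until the stream is complete. */
  static <T> List<T> readAll(BlockingQueue<ResultBatchSketch<T>> queue) {
    QueueCursorSketch<T> cursor = new QueueCursorSketch<>(queue);
    List<T> out = new ArrayList<>();
    try {
      for (T value = cursor.next(); value != null; value = cursor.next()) {
        out.add(value);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return out;
  }
}
```

With a batch size of 2, five results become three batches plus one terminal batch, so the consumer performs four queue operations instead of six.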
At the outer level, parallel merging will be exposed to CachingClusteredClient through a new sequence type:
class ParallelMergeCombiningSequence<T> extends YieldingSequenceBase<T>
which will wrap all of the work done on the fork-join pool in the form of a yielding sequence, to allow easy integration through existing interfaces. The ParallelMergeCombiningSequence merges and combines the results of a List<Sequence<T>> baseSequences, using an Ordering<T> orderingFn to order the results and a BinaryOperator<T> combineFn to combine them. The HTTP thread has a 'background' combining sequence that builds a sequence from an iterator over the batched results from the output blocking queue of the layer 2 task.
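For readers less familiar with the merge/combine contract, the following serial sketch shows what "order with an ordering function, combine with a binary operator" means for plain sorted lists. It uses `java.util.Comparator` in place of Guava's `Ordering` and lists in place of sequences, and `MergeCombineSketch`, `mergeCombine`, and `example` are hypothetical names, so treat it as an illustration of the semantics only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.function.BinaryOperator;

final class MergeCombineSketch {
  /** Tracks the current head of one sorted, non-empty input. */
  private static final class Cursor<T> {
    final Iterator<T> rest;
    T head;

    Cursor(Iterator<T> input) {
      this.rest = input;
      this.head = input.next();
    }

    boolean advance() {
      if (!rest.hasNext()) {
        return false;
      }
      head = rest.next();
      return true;
    }
  }

  static <T> List<T> mergeCombine(
      List<List<T>> inputs, Comparator<T> ordering, BinaryOperator<T> combineFn) {
    // Cursors are ordered by their current head value.
    PriorityQueue<Cursor<T>> cursors =
        new PriorityQueue<>((a, b) -> ordering.compare(a.head, b.head));
    for (List<T> input : inputs) {
      if (!input.isEmpty()) {
        cursors.add(new Cursor<>(input.iterator()));
      }
    }
    List<T> out = new ArrayList<>();
    T current = null;
    boolean hasCurrent = false;
    while (!cursors.isEmpty()) {
      Cursor<T> cursor = cursors.poll();
      T next = cursor.head;
      if (cursor.advance()) {
        cursors.add(cursor); // re-insert under its new head
      }
      if (hasCurrent && ordering.compare(current, next) == 0) {
        current = combineFn.apply(current, next); // same ordering: combine
      } else {
        if (hasCurrent) {
          out.add(current);
        }
        current = next;
        hasCurrent = true;
      }
    }
    if (hasCurrent) {
      out.add(current);
    }
    return out;
  }

  /** Sums per-key counts from three inputs that are each sorted by key. */
  static List<Map.Entry<String, Long>> example() {
    Comparator<Map.Entry<String, Long>> byKey = Map.Entry.comparingByKey();
    BinaryOperator<Map.Entry<String, Long>> sum =
        (a, b) -> Map.entry(a.getKey(), a.getValue() + b.getValue());
    List<List<Map.Entry<String, Long>>> inputs = List.of(
        List.of(Map.entry("a", 1L), Map.entry("c", 3L)),
        List.of(Map.entry("a", 10L), Map.entry("b", 2L)),
        List.of(Map.entry("c", 5L)));
    return mergeCombine(inputs, byKey, sum);
  }
}
```

In `example()`, the two `a` rows and the two `c` rows are each combined into one row, while `b` passes through unchanged.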
Converting the sequence to a yielder will create a single RecursiveAction to run on the fork-join pool:
class MergeCombinePartitioningAction<T> extends RecursiveAction
which is responsible for computing the level of parallelism, partitioning the input sequences according to the chosen level of parallelism, creating sets of BatchedResultsCursor, and spawning the layer 1 and layer 2 initialization tasks:
class PrepareMergeCombineInputsAction<T> extends RecursiveAction
which block until the initial batch of results is produced and ready to process for each cursor, allowing the cursors to be placed in a PriorityQueue and sorted using the Ordering function. Once the results are available for all cursors in a PrepareMergeCombineInputsAction, the PriorityQueue will be fed into the main worker task of the ParallelMergeCombiningSequence:
class MergeCombineAction<T> extends RecursiveAction
which does the actual merging of results. Results with the same ordering are combined with the combining function, where applicable, before being added to an output ResultBatch to be pushed to an output blocking queue. MergeCombineAction will "yield" after processing n inputs, where n is initially 1024 and is subsequently set by measuring the time it takes to process n inputs and computing the ideal n to run for 10ms. The new n is used for the next MergeCombineAction that is executed, which continues the work of processing the BatchedResultsCursor from the PriorityQueue until everything is completely drained, at which point a 'terminal' result batch is added to indicate to downstream processors that the stream is complete.
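The self-tuning of n comes down to simple proportional scaling: if n inputs took t nanoseconds, then roughly n × (10ms / t) inputs should take about 10ms. The sketch below illustrates that arithmetic only; `YieldTuningSketch` and `nextYieldAfter` are hypothetical names, and the upper bound `MAX_YIELD_AFTER` and the fallback for unusable measurements are assumptions that are not part of the proposal.

```java
import java.util.concurrent.TimeUnit;

final class YieldTuningSketch {
  static final int INITIAL_YIELD_AFTER = 1024;
  static final long TARGET_NANOS = TimeUnit.MILLISECONDS.toNanos(10);
  // Assumption: cap n so that one unusually fast measurement cannot produce a huge task.
  static final int MAX_YIELD_AFTER = 1 << 16;

  /** Computes n for the next task from how long the previous task took. */
  static int nextYieldAfter(int processed, long elapsedNanos) {
    if (processed <= 0 || elapsedNanos <= 0) {
      return INITIAL_YIELD_AFTER; // no usable measurement yet
    }
    // Scale the previous count by (target time / measured time).
    long ideal = processed * TARGET_NANOS / elapsedNanos;
    return (int) Math.max(1, Math.min(ideal, MAX_YIELD_AFTER));
  }
}
```

For example, if the first 1024 inputs took 5ms, the next task would process 2048 inputs before yielding; if they took 20ms, the next task would process 512.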
The level of parallelization in the prototype is currently very greedy. It is naively chosen by picking the maximum of available processors or remaining fork-join tasks, with a test-oriented query context parameter that can limit it to fewer than the available processors. I say naively because this should probably consider not just the remaining fork-join task slots but also how many queries are currently being processed, in order to save additional slots when a small number of queries are dominating the pool. I think further experimentation and discussion might be required to pick an optimal strategy, as well as investigating the content mentioned in #8357.
Prioritization
The current prototype is lacking any sort of prioritization or gated access to the fork-join pool. Testing so far shows that unrestricted access to the fork-join pool is perhaps 'good enough' for the initial PR, or perhaps in general, and that prioritization should perhaps be handled elsewhere (or solely pushed down to the historicals, as is currently done). I think unrestricted scheduling of tasks should achieve the interleaving suggested in #8356 by the nature of the algorithm in use, where work is done in small chunks and additional tasks are continuously scheduled on the pool to complete the work.
However, if unrestricted access to the pool proves insufficient after further testing, I have considered 2 approaches we could take to handle this and account for query priority. The first is a sort of prioritized, first-come first-served blocking mechanism over a fixed number of slots, effectively a limit on the maximum number of queries merging concurrently, which blocks before spawning fork-join tasks and releases the slot when the merge is complete.
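One possible shape for such a slot mechanism is sketched below, purely as an illustration. Nothing like it exists in the prototype; `PrioritizedMergeSlotsSketch` and its methods are hypothetical names, and the choices to treat higher numbers as higher priority and to break ties in arrival order are assumptions.

```java
import java.util.PriorityQueue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class PrioritizedMergeSlotsSketch {
  private static final class Waiter {
    final int priority;
    final long ticket;

    Waiter(int priority, long ticket) {
      this.priority = priority;
      this.ticket = ticket;
    }
  }

  private final ReentrantLock lock = new ReentrantLock();
  private final Condition changed = lock.newCondition();
  // Highest priority first; equal priorities are served in arrival order.
  private final PriorityQueue<Waiter> waiters = new PriorityQueue<>((a, b) ->
      a.priority != b.priority
          ? Integer.compare(b.priority, a.priority)
          : Long.compare(a.ticket, b.ticket));
  private int freeSlots;
  private long nextTicket;

  PrioritizedMergeSlotsSketch(int slots) {
    this.freeSlots = slots;
  }

  /** Blocks until a slot is free and the caller is the best-ranked waiter. */
  void acquire(int priority) {
    lock.lock();
    try {
      Waiter me = new Waiter(priority, nextTicket++);
      waiters.add(me);
      while (freeSlots == 0 || waiters.peek() != me) {
        changed.awaitUninterruptibly();
      }
      waiters.poll();
      freeSlots--;
      changed.signalAll(); // another slot may still be free for the next waiter
    } finally {
      lock.unlock();
    }
  }

  /** Called when a query's merge is complete. */
  void release() {
    lock.lock();
    try {
      freeSlots++;
      changed.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /** Number of queries currently waiting for a slot. */
  int waiting() {
    lock.lock();
    try {
      return waiters.size();
    } finally {
      lock.unlock();
    }
  }
}
```

A query would call `acquire(priority)` on the HTTP thread before spawning its fork-join tasks and `release()` once its merge completes, so the blocking happens outside the fork-join pool.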
The second, which I haven't yet spent a significant amount of time thinking about, is a more elaborate mechanism: some sort of customized fork-join pool implementation where the 'execute' method goes through a prioritized queue, so that lower priority queries can be stalled in favor of higher priority ones.
Semi-related, with the work broken up into small chunks like this, it seems like there could be even more elaborate strategies of constricting and expanding the number of parallel merge tasks based on pool utilization by just regrouping BatchedResultsCursor, but this would require much further investigation and testing than I think should be part of the initial effort on this.
Rationale
This seems like a much more flexible approach to dividing up the work for result merging at the broker level than the previous attempt in #6629, and a bit less invasive than the changes of #5913. I don't think the concept of the broker performing magic parallel result merging is objectionable to anyone, so experimentation was necessary in order to prove the approach viable. The results so far appear very promising; testing on my 4 physical/8 hyperthreaded core laptop has yielded the following results using the benchmarks added in #8089:
Parallelism 0 is the existing caching clustered client merge strategy, parallelism 1 is doing a serial merge on the fork-join pool, and 4 is using 3 layer 1 tasks to merge sequences in parallel, which is the limit given the number of physical cores my laptop has. In many cases queries are processing 2-3x faster when done in parallel, as would be expected for the given level of parallelism. Even doing serial processing with a single fork-join task is competitive with the existing serial approach, so all merges can be done with the same approach even when there is not capacity available to run the merge in parallel. I will continue to update this proposal as I collect more experiment results.
Operational impact
No forced operational impact, since the feature will be opt-in initially and must be defined in the service configuration to be enabled. I think that this could result in more predictable broker resource utilization, but operators experimenting with this new feature will need to closely monitor broker query performance to ensure that it is producing beneficial results.
Test plan
The test plan includes live cluster testing on some small-ish clusters I have available, as well as running the benchmarks on a large core count machine to simulate larger clusters and round out the benchmarks, to ensure that the approach scales correctly. Additionally, I plan to do 'overloaded' testing to ensure that a busy broker performs no worse, within reason, than the existing merge strategy.
Future work (optional)
Beyond the initial PR, I think the most benefit would be focusing on tuning the level of parallelism, re #8357.