
Performance degradation for sketch queries with higher levels of broker merge parallelism #11133

Open
a2l007 opened this issue Apr 19, 2021 · 1 comment

Comments

@a2l007 (Contributor) commented Apr 19, 2021

Affected Version

0.17 and higher

Description

For Druid clusters with parallel merge enabled on the brokers, query performance can degrade for sketch-based queries. These queries can exhibit 30%–100% degradation in performance, depending on the value set for druid.processing.merge.pool.parallelism and the number of sequences that need to be merged and combined. With a smaller parallelism setting, however, the performance is slightly better than running the merge single-threaded.
Setting druid.processing.merge.pool.parallelism to 3 gave the best performance (~10% faster than the serial merge). However, increasing the parallelism beyond 3 is where we start to see the degradation mentioned earlier.
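For context, this is a broker runtime property; a minimal illustrative snippet, using only the values tried in this report (the useParallelMergePool toggle is included from memory of the broker configuration docs and should be treated as an assumption here):

```
# Broker runtime.properties (illustrative values from this report, not a recommendation)
druid.processing.merge.useParallelMergePool=true
druid.processing.merge.pool.parallelism=3
```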

I compared flame graphs for a query on the broker while toggling the merge parallelism, and it looks like the parallel merge performs a lot more combine operations on theta Union objects than its serial counterpart.

(Screenshot: broker flame graph comparison, captured Apr 19, 2021.)

Combining two Union objects involves compacting one of them into a sketch before feeding it into the second Union. This process is relatively slow, since a theta Union is designed to optimize merge efficiency across a large number of sketch merges.
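To make the combine step concrete, here is a minimal sketch of the pattern using datasketches-java (illustration only, not Druid's aggregator code; in older datasketches-java releases the absorbing method is Union.update(Sketch) rather than Union.union(Sketch)):

```java
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class UnionCombineSketch
{
  public static void main(String[] args)
  {
    Union left = SetOperation.builder().buildUnion();
    Union right = SetOperation.builder().buildUnion();

    // Feeding already-built sketches into a Union is the cheap, intended path.
    UpdateSketch a = UpdateSketch.builder().build();
    UpdateSketch b = UpdateSketch.builder().build();
    for (long i = 0; i < 10_000; i++) {
      a.update(i);
      b.update(i + 5_000);
    }
    left.union(a);
    right.union(b);

    // The expensive part: there is no direct Union-to-Union merge, so one Union
    // must first be compacted (copied/sorted) into a CompactSketch before the
    // other Union can absorb it.
    CompactSketch rightCompacted = right.getResult();
    left.union(rightCompacted);

    System.out.println("estimate = " + left.getResult().getEstimate());
  }
}
```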

The first layer of the parallel merge process combines partitions of sequences; for sketch aggregators this produces Union objects, which are pushed into the intermediate blocking queues for the second layer.
The MergeCombineAction in the second layer runs on a single thread and has to perform Union merge operations in proportion to the number of intermediate blocking queues it reads from. Therefore, the higher the parallelism, the more Union merge operations the MergeCombineAction has to perform, and the worse the performance.
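As a rough illustration (structure and names are mine, not the actual ParallelMergeCombiningSequence/MergeCombineAction code), the single second-layer thread performs one extra expensive combine for each additional intermediate queue:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Toy model only: counts how many "expensive" combines the single second-layer
// thread performs as the number of intermediate queues (parallelism) grows.
public class SecondLayerCombineModel
{
  public static void main(String[] args)
  {
    final long[] combineCalls = {0};
    // Stand-in for the Union-to-Union combine (compact + update) described above.
    BinaryOperator<Long> combineFn = (lhs, rhs) -> {
      combineCalls[0]++;
      return lhs + rhs;
    };

    for (int parallelism : new int[]{2, 4, 8, 16}) {
      combineCalls[0] = 0;

      // Layer 1 stand-in: one partial result per intermediate blocking queue.
      List<Long> partials = new ArrayList<>();
      for (int p = 0; p < parallelism; p++) {
        partials.add(1L);
      }

      // Layer 2 stand-in: a single thread folds all partials together.
      Long merged = partials.get(0);
      for (int p = 1; p < parallelism; p++) {
        merged = combineFn.apply(merged, partials.get(p));
      }

      System.out.println("parallelism=" + parallelism + ", expensive combines=" + combineCalls[0]);
    }
  }
}
```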

I don't have a fix for this at the moment, but I'm documenting it here for reference in case someone else runs into the same problem. If DataSketches had a way to merge uncompacted Unions, it could help alleviate this problem.

@clintropolis (Member) commented:

Thanks for the report. I wondered if there might be a threshold where the expense of potentially having to compute the merge more times outweighs the advantage of being able to process in parallel. Unfortunately I didn't explore this very deeply in my investigations, but I think I did hit on something somewhat similar, where depending on the number of input sequences and the number of cores, more parallelism is not always better: #8578 (comment).

It may be that the complexity of the merge and compare functions needs to be factored into the computation that chooses how parallel the merge should be. I don't know exactly how we would do that off the top of my head, or even whether it is the correct solution, or whether we would be better off focusing on making the sketch merges more efficient.

I think it should be possible to replicate what you are seeing here with a modified version of the benchmarks that were created, https://github.com/apache/druid/blob/4caa221d72473f8307489d814f58bcfabde4c57e/benchmarks/src/test/java/org/apache/druid/benchmark/sequences/BaseParallelMergeCombiningSequenceBenchmark.java, testing with a more complicated merge function

public static final BinaryOperator<IntPair> INT_PAIR_MERGE_FN = (lhs, rhs) -> {

that does what is going on in the theta sketch aggregators.
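For example, a heavier merge function along these lines could stand in for the theta-union combine cost (purely hypothetical: the IntPair stand-in type, the busy-work loop, and its iteration count are mine, not code from the benchmark):

```java
import java.util.function.BinaryOperator;

public class ExpensiveMergeFnSketch
{
  // Minimal stand-in for the benchmark's IntPair type.
  static class IntPair
  {
    final int lhs;
    final int rhs;

    IntPair(int lhs, int rhs)
    {
      this.lhs = lhs;
      this.rhs = rhs;
    }
  }

  // Hypothetical "expensive" merge function: burns some CPU per combine to mimic
  // compacting a Union into a sketch before updating another Union with it.
  public static final BinaryOperator<IntPair> EXPENSIVE_INT_PAIR_MERGE_FN = (lhs, rhs) -> {
    if (lhs == null) {
      return rhs;
    }
    if (rhs == null) {
      return lhs;
    }
    long acc = 1;
    for (int i = 0; i < 50_000; i++) {
      acc = acc * 31 + i;
    }
    int noise = (int) (acc & 1);  // use the loop result so it is not optimized away
    return new IntPair(lhs.lhs, lhs.rhs + rhs.rhs + noise);
  };
}
```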

I'll try to have a look into this sometime soon to see if I can reproduce it with benchmarks.
