
Conversation

@javanna (Member) commented Oct 16, 2024

With recent changes in Lucene 9.12 around not forking execution when it is not necessary (see apache/lucene#13472), we removed the search worker thread pool in #111099. The worker thread pool had an unlimited queue, and we feared that we could see much more queueing on the search thread pool if segment-level searches run on the same thread pool as shard-level searches, because every shard search may take up to one thread per slice when executing the query phase.

We then introduced an additional condition that stops parallelizing whenever there is a queue. That is perhaps a bit extreme: the decision is made when the searcher is created, and the queue may no longer be there once the search actually executes. This has caused some benchmark regressions, because a queue can be a transient state, especially when short-lived segment searches queue up, and we may end up disabling inter-segment concurrency more aggressively than we would like, penalizing requests that do benefit from concurrency. At the same time, we do want some protection against rejections of shard searches caused by excessive slicing. When the queue grows above a certain size, we can turn off slicing and effectively disable inter-segment concurrency. With this commit we set that threshold to the number of threads in the search pool.
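For illustration, a minimal Java sketch of that decision (the `sliceCount` helper and its `maxSlices` parameter are hypothetical stand-ins, not the actual Elasticsearch internals; the executor check mirrors the condition shown in the review diff below):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;

final class SlicingDecision {

    /**
     * Decide whether a shard search may parallelize across segments.
     * The previous check required an empty queue; the new one keeps
     * slicing enabled as long as the queue holds no more tasks than
     * the pool has threads.
     */
    static boolean enableInterSegmentConcurrency(Executor executor) {
        return executor instanceof ThreadPoolExecutor tpe
            && tpe.getQueue().size() <= tpe.getMaximumPoolSize();
    }

    /**
     * Hypothetical caller: pick the slice count for the query phase.
     * maxSlices stands in for whatever the segment geometry allows;
     * returning 1 effectively disables inter-segment concurrency.
     */
    static int sliceCount(Executor executor, int maxSlices) {
        return enableInterSegmentConcurrency(executor) ? maxSlices : 1;
    }
}
```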

@javanna added the >bug, auto-backport, v8.16.0, and :Search Foundations/Search labels Oct 16, 2024
@elasticsearchmachine added the v9.0.0 and Team:Search Foundations labels Oct 16, 2024
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine (Collaborator) commented:

Hi @javanna, I've created a changelog YAML for you.

@original-brownbear (Contributor) left a comment:

LGTM, just one suggestion on maybe making the calculation a little more stable; feel free to ignore it though ... as I said, I have little to no intuition about this :)

```diff
 ) {
     return executor instanceof ThreadPoolExecutor tpe
-        && tpe.getQueue().isEmpty()
+        && tpe.getQueue().size() <= tpe.getMaximumPoolSize()
```
@original-brownbear (Contributor) commented on the diff:

I wonder if we should give this a little more wiggle room. I have no intuition whatsoever about what percentage of the time a full queue will actually be "full" in the sense of this condition. You'll always have a thread randomly picking a task from the queue, and it seems like this might hardly ever come into action.
Maybe maxSize - thread_count would be more stable?
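For context, a rough sketch of the two thresholds being discussed; the second method is only one possible reading of the `maxSize - thread_count` suggestion, and `queueBound` is an assumed upper bound on the queue size, not something defined in this PR:

```java
import java.util.concurrent.ThreadPoolExecutor;

final class ThresholdVariants {

    // Condition as merged: allow slicing while the queue holds no more
    // tasks than the pool has threads.
    static boolean mergedCondition(ThreadPoolExecutor tpe) {
        return tpe.getQueue().size() <= tpe.getMaximumPoolSize();
    }

    // One possible reading of the "maxSize - thread_count" suggestion
    // (hypothetical): compare against the queue bound minus the thread
    // count, leaving headroom so the check is less sensitive to threads
    // continuously draining the queue.
    static boolean suggestedCondition(ThreadPoolExecutor tpe, int queueBound) {
        return tpe.getQueue().size() <= queueBound - tpe.getMaximumPoolSize();
    }
}
```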

@javanna (Member, Author) replied:

In my test, this has the same effect as removing the conditional. That feels good, because we can maximize the parallelism without losing the protection against search rejections. This obviously depends on what you are testing :) I think I am going to leave it this way for now, and we will eventually iterate over it as we do more testing / collect more info.

@original-brownbear (Contributor) replied:

Sounds good :) thanks!

@javanna (Member, Author) commented Oct 16, 2024

Latency for http_logs track, runtime-fields challenge, before the change:

|                                                         Metric |                                  Task |        Value   |   Unit |
|                                                 Min Throughput |  400s-in-range-from-source-using-grok |    0.3         |  ops/s |
|                                                Mean Throughput |  400s-in-range-from-source-using-grok |    0.3         |  ops/s |
|                                              Median Throughput |  400s-in-range-from-source-using-grok |    0.3         |  ops/s |
|                                                 Max Throughput |  400s-in-range-from-source-using-grok |    0.3         |  ops/s |
|                                        50th percentile latency |  400s-in-range-from-source-using-grok | 1094.78        |     ms |
|                                        90th percentile latency |  400s-in-range-from-source-using-grok | 1168.09        |     ms |
|                                        99th percentile latency |  400s-in-range-from-source-using-grok | 1211.41        |     ms |
|                                       100th percentile latency |  400s-in-range-from-source-using-grok | 1235.26        |     ms |
|                                   50th percentile service time |  400s-in-range-from-source-using-grok | 1090.98        |     ms |
|                                   90th percentile service time |  400s-in-range-from-source-using-grok | 1164.08        |     ms |
|                                   99th percentile service time |  400s-in-range-from-source-using-grok | 1207.09        |     ms |
|                                  100th percentile service time |  400s-in-range-from-source-using-grok | 1232.93        |     ms |
|                                                     error rate |  400s-in-range-from-source-using-grok |    0           |      % |
|                                                 Min Throughput | 400s-in-range-from-keyword-using-grok |    0.3         |  ops/s |
|                                                Mean Throughput | 400s-in-range-from-keyword-using-grok |    0.3         |  ops/s |
|                                              Median Throughput | 400s-in-range-from-keyword-using-grok |    0.3         |  ops/s |
|                                                 Max Throughput | 400s-in-range-from-keyword-using-grok |    0.3         |  ops/s |
|                                        50th percentile latency | 400s-in-range-from-keyword-using-grok | 2187.38        |     ms |
|                                        90th percentile latency | 400s-in-range-from-keyword-using-grok | 2278.73        |     ms |
|                                        99th percentile latency | 400s-in-range-from-keyword-using-grok | 2345.78        |     ms |
|                                       100th percentile latency | 400s-in-range-from-keyword-using-grok | 2353.05        |     ms |
|                                   50th percentile service time | 400s-in-range-from-keyword-using-grok | 2184.1         |     ms |
|                                   90th percentile service time | 400s-in-range-from-keyword-using-grok | 2275.23        |     ms |
|                                   99th percentile service time | 400s-in-range-from-keyword-using-grok | 2341.93        |     ms |
|                                  100th percentile service time | 400s-in-range-from-keyword-using-grok | 2349.96        |     ms |
|                                                     error rate | 400s-in-range-from-keyword-using-grok |    0           |      % |

and after:

|                                                         Metric |                                  Task |        Value   |   Unit |
|                                                 Min Throughput |  400s-in-range-from-source-using-grok |    0.3         |  ops/s |
|                                                Mean Throughput |  400s-in-range-from-source-using-grok |    0.3         |  ops/s |
|                                              Median Throughput |  400s-in-range-from-source-using-grok |    0.3         |  ops/s |
|                                                 Max Throughput |  400s-in-range-from-source-using-grok |    0.3         |  ops/s |
|                                        50th percentile latency |  400s-in-range-from-source-using-grok |  612.345       |     ms |
|                                        90th percentile latency |  400s-in-range-from-source-using-grok |  657.275       |     ms |
|                                        99th percentile latency |  400s-in-range-from-source-using-grok |  674.38        |     ms |
|                                       100th percentile latency |  400s-in-range-from-source-using-grok |  686.795       |     ms |
|                                   50th percentile service time |  400s-in-range-from-source-using-grok |  607.592       |     ms |
|                                   90th percentile service time |  400s-in-range-from-source-using-grok |  652.377       |     ms |
|                                   99th percentile service time |  400s-in-range-from-source-using-grok |  671.428       |     ms |
|                                  100th percentile service time |  400s-in-range-from-source-using-grok |  684.497       |     ms |
|                                                     error rate |  400s-in-range-from-source-using-grok |    0           |      % |
|                                                 Min Throughput | 400s-in-range-from-keyword-using-grok |    0.3         |  ops/s |
|                                                Mean Throughput | 400s-in-range-from-keyword-using-grok |    0.3         |  ops/s |
|                                              Median Throughput | 400s-in-range-from-keyword-using-grok |    0.3         |  ops/s |
|                                                 Max Throughput | 400s-in-range-from-keyword-using-grok |    0.3         |  ops/s |
|                                        50th percentile latency | 400s-in-range-from-keyword-using-grok | 1047.08        |     ms |
|                                        90th percentile latency | 400s-in-range-from-keyword-using-grok | 1085.22        |     ms |
|                                        99th percentile latency | 400s-in-range-from-keyword-using-grok | 1100.75        |     ms |
|                                       100th percentile latency | 400s-in-range-from-keyword-using-grok | 1143.23        |     ms |
|                                   50th percentile service time | 400s-in-range-from-keyword-using-grok | 1044.02        |     ms |
|                                   90th percentile service time | 400s-in-range-from-keyword-using-grok | 1082.76        |     ms |
|                                   99th percentile service time | 400s-in-range-from-keyword-using-grok | 1097.8         |     ms |
|                                  100th percentile service time | 400s-in-range-from-keyword-using-grok | 1140.11        |     ms |
|                                                     error rate | 400s-in-range-from-keyword-using-grok |    0           |      % |

@javanna javanna requested review from a team as code owners October 16, 2024 14:18
javanna and others added 2 commits October 16, 2024 16:31
@javanna javanna force-pushed the fix/queue_size_concurrency branch from 7d82bb1 to 9572002 October 16, 2024 14:32
@javanna javanna removed request for a team October 16, 2024 14:49
@javanna javanna changed the title Replace empty queue conditional from slicing logic Enhance empty queue conditional in slicing logic Oct 16, 2024
@javanna javanna added >non-issue and removed >bug labels Oct 16, 2024
@javanna javanna merged commit 8b87969 into elastic:main Oct 16, 2024
16 checks passed
@javanna javanna deleted the fix/queue_size_concurrency branch October 16, 2024 17:51
javanna added a commit to javanna/elasticsearch that referenced this pull request Oct 16, 2024
@elasticsearchmachine (Collaborator) commented:

💚 Backport successful

Backported to branches: 8.16 and 8.x.

javanna added a commit to javanna/elasticsearch that referenced this pull request Oct 16, 2024
@javanna (Member, Author) commented Oct 16, 2024

I marked this non-issue because it addresses an unreleased regression introduced with #111099. That is why, despite its label, it is a bugfix and needs backporting to all affected branches.

@javanna javanna added v8.16.1 and removed v8.16.0 labels Oct 16, 2024
elasticsearchmachine pushed a commit that referenced this pull request Oct 16, 2024
elasticsearchmachine pushed a commit that referenced this pull request Oct 17, 2024
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024

Labels: auto-backport, >non-issue, :Search Foundations/Search, Team:Search Foundations, v8.16.1, v8.17.0, v9.0.0
