Further optimize DrillSideways scoring #11881

gsmiller · 2022-10-26T18:29:16Z

Description

This change further optimizes the doc-at-a-time drill-sideways scoring. There are a couple small tweaks, but the two bigger ideas in here are:

Take some inspiration from "min should match" scoring (in WANDScorer) and track a "near miss" dim that is ahead of the rest of the dims (named a "runaway dim"). This is a simplification of what the head/lead/tail data structures in WANDScorer would do in the case where min-should-match == n - 1.
Add a special-case implementation for when there is exactly one "sideways" dimension.

Unfortunately, we don't have drill-sideways specific tasks in luceneutil benchmarks, but I was able to benchmark the impact of this change using some benchmarks we have on our Lucene-based application (that specifically stress our use of drill-sideways). In our tests, I observed a throughput increase of 2.4% red-line QPS and a 3.7% average latency reduction (which is mostly dominated by a p50 latency reduction of 3.9% with no noticeable impacts to long-tail latencies).

gsmiller · 2022-10-27T00:25:01Z

Converting to DRAFT temporarily to add some more testing. While the randomized testing does cover the various use-cases, I think it would be good to increase the testing given the non-trivial nature of the change. Will pull back out of DRAFT after doing so. Comments are of course welcome in the meantime if anyone is interested.

…on't need head/lead/tail tracking

gsmiller · 2022-11-03T22:44:00Z

OK, this should be ready for review now.

zhaih

Overall LGTM, but I would like to avoid using PQ if possible.

zhaih · 2022-11-04T07:06:32Z

lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java

-            int next = Math.min(dimDocID, failedDim.approximation.docID());
+      // If we carried a "runaway" over from the last iteration, see if we've "caught up" yet:
+      DocsAndCost runaway = runawayDim.top();
+      if (runaway != null && runaway.approximation.docID() <= docID) {


I'm not sure whether I understand it right, but seems the only advantage of introducing PQ here is we're able to carry over the previous "runaway" or "failed" dimension. But can't we carry that extra dim by using another variable (Or I guess we could even use only one variable, as in the previous version)? Since there might be slightly more overheads by using the PQ?

Right, we can use a local variable but the PQ lets our code be a bit simpler and cleaner (in my opinion at least). The thing to note is that, when two approximations "run away" (get ahead of docID), we want to advance to the "closer" of the two runaways. The PQ just gives us a clean way to do that using insertWithOverflow to evict the closer one (so it just takes care of the docID comparison and swapping). It's a little unorthodox maybe, but I felt it let our code in here be a bit more idiomatic/simple.

I'll upload a revision now that shows an alternative that doesn't use the PQ. I don't think there's really any significant overhead of the PQ to be honest, so I think it probably comes down to stylistic preference (but it's entirely possible I'm overlooking something). Please let me known if you have a strong preference either way. Thanks!

UPDATE: Here's the difference if we want to avoid the PQ.

I checked PQ's code more carefully and yes I think it does not introduce any significant overhead. (At first I was a bit intimidated by the complex heap's logic) Though I still prefer using a local variable for being explicit, but I don't have a strong preference, it is more of a personal one.

Either way works for me, so choose the version you like :)

Thanks @zhaih for taking a look!

zhaih · 2022-11-05T18:58:05Z

lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java

+    // We keep track of a "runaway" dimension, which is a previously "near missed" dimension that
+    // has advanced beyond the docID the rest of the dimensions are positioned on. This functions
+    // a bit like the "head" queue in WANDScorer's "min should match" implementation. We use a
+    // single-valued PQ ordered by docID to easily determine the "closest" runaway dim we'll use


If we'll go with this version, maybe change this comment. (Since not PQ)

This reverts commit db2b2f9.

gsmiller added 2 commits October 26, 2022 11:31

Bringing "WAND-like" optimizations to DrillSideways scoring.

105f77b

changes

232c515

gsmiller force-pushed the GH/drillsideways-min-should-match branch from 27e1a1e to 232c515 Compare October 26, 2022 18:32

gsmiller marked this pull request as draft October 27, 2022 00:24

gsmiller added 3 commits October 30, 2022 08:05

small tweak to not nextDoc sideways dims when a twophase misses

7415131

simplify min-should-max influence to only track a "runaway" dim; we d…

65e8eca

…on't need head/lead/tail tracking

changes update

cc122ff

gsmiller changed the title ~~Bringing "WAND-like" optimizations to DrillSideways scoring.~~ Further optimize DrillSideways scoring Nov 3, 2022

add some more test coverage

d210377

gsmiller marked this pull request as ready for review November 3, 2022 22:43

zhaih reviewed Nov 4, 2022

View reviewed changes

version that does not use PQ to show difference

db2b2f9

zhaih approved these changes Nov 5, 2022

View reviewed changes

zhaih reviewed Nov 5, 2022

View reviewed changes

gsmiller added 2 commits November 8, 2022 09:34

Revert "version that does not use PQ to show difference"

8fd8673

This reverts commit db2b2f9.

Merge branch 'main' into GH/drillsideways-min-should-match

cc4aa15

gsmiller merged commit c66a559 into apache:main Nov 8, 2022

gsmiller deleted the GH/drillsideways-min-should-match branch November 8, 2022 18:08

gsmiller added a commit that referenced this pull request Nov 8, 2022

Further optimize DrillSideways scoring (#11881)

75a6590

rmuir added this to the 9.5.0 milestone Jan 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Further optimize DrillSideways scoring #11881

Further optimize DrillSideways scoring #11881

gsmiller commented Oct 26, 2022 •

edited

Loading

gsmiller commented Oct 27, 2022

gsmiller commented Nov 3, 2022

zhaih left a comment

zhaih Nov 4, 2022

gsmiller Nov 4, 2022 •

edited

Loading

zhaih Nov 5, 2022

gsmiller Nov 8, 2022

zhaih Nov 5, 2022

Further optimize DrillSideways scoring #11881

Further optimize DrillSideways scoring #11881

Conversation

gsmiller commented Oct 26, 2022 • edited Loading

Description

gsmiller commented Oct 27, 2022

gsmiller commented Nov 3, 2022

zhaih left a comment

Choose a reason for hiding this comment

zhaih Nov 4, 2022

Choose a reason for hiding this comment

gsmiller Nov 4, 2022 • edited Loading

Choose a reason for hiding this comment

zhaih Nov 5, 2022

Choose a reason for hiding this comment

gsmiller Nov 8, 2022

Choose a reason for hiding this comment

zhaih Nov 5, 2022

Choose a reason for hiding this comment

gsmiller commented Oct 26, 2022 •

edited

Loading

gsmiller Nov 4, 2022 •

edited

Loading