Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further optimize DrillSideways scoring #11881

Merged
merged 9 commits into from
Nov 8, 2022

Conversation

gsmiller
Copy link
Contributor

@gsmiller gsmiller commented Oct 26, 2022

Description

This change further optimizes the doc-at-a-time drill-sideways scoring. There are a couple small tweaks, but the two bigger ideas in here are:

  1. Take some inspiration from "min should match" scoring (in WANDScorer) and track a "near miss" dim that is ahead of the rest of the dims (named a "runaway dim"). This is a simplification of what the head/lead/tail data structures in WANDScorer would do in the case where min-should-match == n - 1.
  2. Add a special-case implementation for when there is exactly one "sideways" dimension.

Unfortunately, we don't have drill-sideways specific tasks in luceneutil benchmarks, but I was able to benchmark the impact of this change using some benchmarks we have on our Lucene-based application (that specifically stress our use of drill-sideways). In our tests, I observed a throughput increase of 2.4% red-line QPS and a 3.7% average latency reduction (which is mostly dominated by a p50 latency reduction of 3.9% with no noticeable impacts to long-tail latencies).

@gsmiller gsmiller force-pushed the GH/drillsideways-min-should-match branch from 27e1a1e to 232c515 Compare October 26, 2022 18:32
@gsmiller gsmiller marked this pull request as draft October 27, 2022 00:24
@gsmiller
Copy link
Contributor Author

Converting to DRAFT temporarily to add some more testing. While the randomized testing does cover the various use-cases, I think it would be good to increase the testing given the non-trivial nature of the change. Will pull back out of DRAFT after doing so. Comments are of course welcome in the meantime if anyone is interested.

@gsmiller gsmiller changed the title Bringing "WAND-like" optimizations to DrillSideways scoring. Further optimize DrillSideways scoring Nov 3, 2022
@gsmiller gsmiller marked this pull request as ready for review November 3, 2022 22:43
@gsmiller
Copy link
Contributor Author

gsmiller commented Nov 3, 2022

OK, this should be ready for review now.

Copy link
Contributor

@zhaih zhaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but I would like to avoid using PQ if possible.

int next = Math.min(dimDocID, failedDim.approximation.docID());
// If we carried a "runaway" over from the last iteration, see if we've "caught up" yet:
DocsAndCost runaway = runawayDim.top();
if (runaway != null && runaway.approximation.docID() <= docID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether I understand it right, but seems the only advantage of introducing PQ here is we're able to carry over the previous "runaway" or "failed" dimension. But can't we carry that extra dim by using another variable (Or I guess we could even use only one variable, as in the previous version)? Since there might be slightly more overheads by using the PQ?

Copy link
Contributor Author

@gsmiller gsmiller Nov 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we can use a local variable but the PQ lets our code be a bit simpler and cleaner (in my opinion at least). The thing to note is that, when two approximations "run away" (get ahead of docID), we want to advance to the "closer" of the two runaways. The PQ just gives us a clean way to do that using insertWithOverflow to evict the closer one (so it just takes care of the docID comparison and swapping). It's a little unorthodox maybe, but I felt it let our code in here be a bit more idiomatic/simple.

I'll upload a revision now that shows an alternative that doesn't use the PQ. I don't think there's really any significant overhead of the PQ to be honest, so I think it probably comes down to stylistic preference (but it's entirely possible I'm overlooking something). Please let me known if you have a strong preference either way. Thanks!

UPDATE: Here's the difference if we want to avoid the PQ.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked PQ's code more carefully and yes I think it does not introduce any significant overhead. (At first I was a bit intimidated by the complex heap's logic) Though I still prefer using a local variable for being explicit, but I don't have a strong preference, it is more of a personal one.

Either way works for me, so choose the version you like :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhaih for taking a look!

// We keep track of a "runaway" dimension, which is a previously "near missed" dimension that
// has advanced beyond the docID the rest of the dimensions are positioned on. This functions
// a bit like the "head" queue in WANDScorer's "min should match" implementation. We use a
// single-valued PQ ordered by docID to easily determine the "closest" runaway dim we'll use
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we'll go with this version, maybe change this comment. (Since not PQ)

@gsmiller gsmiller merged commit c66a559 into apache:main Nov 8, 2022
@gsmiller gsmiller deleted the GH/drillsideways-min-should-match branch November 8, 2022 18:08
gsmiller added a commit that referenced this pull request Nov 8, 2022
@rmuir rmuir added this to the 9.5.0 milestone Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants