Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer #advanceExact on expression dependencies until their values are needed #12560

Merged
merged 6 commits into from Sep 15, 2023

Conversation

gsmiller
Copy link
Contributor

Description

This extends the idea in GH#11878 to avoid advancing dependencies that are never referenced because of expression branching (i.e., ternary expressions). I think we should be able to be a little more optimal in terms of not unnecessarily advancing dependent double values.

… needed

This extends the idea in GH#11878 to avoid advancing dependencies that are
never referenced because of expression branching (i.e., ternary expressions).
if (currentDoc == doc) {
return true;
}
for (DoubleValues v : functionValues) {
v.advanceExact(doc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused... Isn't this calling advanceExact more eagerly?

Copy link
Contributor Author

@gsmiller gsmiller Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this layer, yes. But all these DoubleValue instances come from ExpressionValueSource#zeroWhenUnpositioned (see the setup done in ExpressionValueSource#getValues). In this PR, I'm proposing pushing down the lazy advancing into that layer. Does this make sense? We could keep the lazy advance logic up here as well, but I didn't think it was really necessary since the delegation to the underlying advancing is now lazy...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but still a little confused about why this is supposed to be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have an expression like a == 0 ? b : c where a, b and c are all other references in your bindings, only b or c need to be evaluated for any given doc, not both. But today, we greedily call advanceExact on all three up-front, for every doc. So we're doing wasted advancing work for one of b or c for every single doc. I added a similar example in a test case.

Depending on what the advancing is doing, this could come with significant cost (e.g., if b or c is another complex expression tree that ends up propagating the advance down all of its dependencies)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thank you for explaining, makes sense to me!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a bit confused - it seems like we're pushing things around, but I know you have some compelling-looking results so it must be OK? But I think it would be great if you could distill whatever knowledge you've garnered into a comment here. The missing link for me is -- what is ExpressionFunctionValues representing? Is that used for functions only (like max()) or generally for any expression? I guess if it's only for functions then it makes sense that we are now getting lazy advance for other expressions that might not need all their subs. But then have we lost some laziness for functions? Does this work for short-circuited booleans to? Like (1 == 1) || expensive()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bit of indirection here, so maybe this will help:

  1. ExpressionFunctionValues represents a computed expression value for a given doc. This comes from any instance of Expression (e.g., from JavascriptCompiler#compile). Expression#getDoubleValuesSource provides an instance of ExpressionValueSource, which in turn provides ExpressionFunctionValues when you call #getValues for a specific segment context. ExpressionFunctionValues#advanceExact always returns true (expressions produce results for all docs), and ExpressionFunctionValues#doubleValue computes the value for the doc.
  2. In ExpressionValueSource#getValues, we take all the expression's arguments (which are all instances of DoubleValues) and wrap them with an anonymous class provided by #zeroWhenUnpositioned. The main purpose of this wrapper is to provide a value of 0 for any expression argument that isn't available for a doc (i.e., advanceExact returned false for the doc). This PR extends the functionality of this wrapper to lazily advance until the point when doubleValues is called.
  3. So without this PR, ExpressionFunctionValues#advanceExact is lazy but once ExpressionFunctionValues#doubleValues is called, it advances all of the arguments (regardless of whether-or-not that argument is ever needed for a given doc). With this change, the advanceExact also becomes lazy, so it gets deferred until a value is actually needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPDATE: I brought back the lazy ExpressionFunctionValues#advanceExact logic because it created confusion and isn't necessary to remove. It doesn't hurt to keep this lazy logic in place. I had removed it (making it greedy again) as an attempt to simplify, but since it created confusion, I figured I'd bring it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I also added another test-case to show that this laziness also works for condition short-circuiting (like you asked about @msokolov). And just to clarify, this is for all expressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, nice!

if (currentDoc == doc) {
return true;
}
for (DoubleValues v : functionValues) {
v.advanceExact(doc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thank you for explaining, makes sense to me!

public boolean advanceExact(int doc) throws IOException {
return positioned = in.advanceExact(doc);
public boolean advanceExact(int doc) {
currentDoc = doc;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any sense in checking if we are advancing to the same doc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I'll include that logic.

if (currentDoc == doc) {
return true;
}
for (DoubleValues v : functionValues) {
v.advanceExact(doc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a bit confused - it seems like we're pushing things around, but I know you have some compelling-looking results so it must be OK? But I think it would be great if you could distill whatever knowledge you've garnered into a comment here. The missing link for me is -- what is ExpressionFunctionValues representing? Is that used for functions only (like max()) or generally for any expression? I guess if it's only for functions then it makes sense that we are now getting lazy advance for other expressions that might not need all their subs. But then have we lost some laziness for functions? Does this work for short-circuited booleans to? Like (1 == 1) || expensive()?

@gsmiller
Copy link
Contributor Author

Thanks @zhaih / @msokolov ! Merging.

@gsmiller gsmiller merged commit 58a50bf into apache:main Sep 15, 2023
4 checks passed
@gsmiller gsmiller deleted the GH-delay-expression-advance branch September 15, 2023 23:26
@zhaih zhaih added this to the 9.8.0 milestone Sep 20, 2023
@gsmiller
Copy link
Contributor Author

Circling back on this: For Amazon's Product Search engine, we make fairly heavy use of these expression implementations. I pulled this change into our Lucene fork early (currently on 9.7) and ran internal benchmarks we have, and this produced a ~23% redline QPS improvement. Milage may vary of course, but the impact was significant so others may find a nice win as well (if heavy expression users).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants