Defer #advanceExact on expression dependencies until their values are needed #12560

gsmiller · 2023-09-15T14:48:26Z

Description

This extends the idea in GH#11878 to avoid advancing dependencies that are never referenced because of expression branching (i.e., ternary expressions). I think we should be able to be a little more optimal in terms of not unnecessarily advancing dependent double values.

… needed This extends the idea in GH#11878 to avoid advancing dependencies that are never referenced because of expression branching (i.e., ternary expressions).

zhaih · 2023-09-15T17:37:25Z

lucene/expressions/src/java/org/apache/lucene/expressions/ExpressionFunctionValues.java

    if (currentDoc == doc) {
      return true;
    }
+    for (DoubleValues v : functionValues) {
+      v.advanceExact(doc);


I'm confused... Isn't this calling advanceExact more eagerly?

In this layer, yes. But all these DoubleValue instances come from ExpressionValueSource#zeroWhenUnpositioned (see the setup done in ExpressionValueSource#getValues). In this PR, I'm proposing pushing down the lazy advancing into that layer. Does this make sense? We could keep the lazy advance logic up here as well, but I didn't think it was really necessary since the delegation to the underlying advancing is now lazy...

I see, but still a little confused about why this is supposed to be better?

If you have an expression like a == 0 ? b : c where a, b and c are all other references in your bindings, only b or c need to be evaluated for any given doc, not both. But today, we greedily call advanceExact on all three up-front, for every doc. So we're doing wasted advancing work for one of b or c for every single doc. I added a similar example in a test case.

Depending on what the advancing is doing, this could come with significant cost (e.g., if b or c is another complex expression tree that ends up propagating the advance down all of its dependencies)

I see, thank you for explaining, makes sense to me!

I'm still a bit confused - it seems like we're pushing things around, but I know you have some compelling-looking results so it must be OK? But I think it would be great if you could distill whatever knowledge you've garnered into a comment here. The missing link for me is -- what is ExpressionFunctionValues representing? Is that used for functions only (like max()) or generally for any expression? I guess if it's only for functions then it makes sense that we are now getting lazy advance for other expressions that might not need all their subs. But then have we lost some laziness for functions? Does this work for short-circuited booleans to? Like (1 == 1) || expensive()?

There's a bit of indirection here, so maybe this will help:

ExpressionFunctionValues represents a computed expression value for a given doc. This comes from any instance of Expression (e.g., from JavascriptCompiler#compile). Expression#getDoubleValuesSource provides an instance of ExpressionValueSource, which in turn provides ExpressionFunctionValues when you call #getValues for a specific segment context. ExpressionFunctionValues#advanceExact always returns true (expressions produce results for all docs), and ExpressionFunctionValues#doubleValue computes the value for the doc.

In ExpressionValueSource#getValues, we take all the expression's arguments (which are all instances of DoubleValues) and wrap them with an anonymous class provided by #zeroWhenUnpositioned. The main purpose of this wrapper is to provide a value of 0 for any expression argument that isn't available for a doc (i.e., advanceExact returned false for the doc). This PR extends the functionality of this wrapper to lazily advance until the point when doubleValues is called.

So without this PR, ExpressionFunctionValues#advanceExact is lazy but once ExpressionFunctionValues#doubleValues is called, it advances all of the arguments (regardless of whether-or-not that argument is ever needed for a given doc). With this change, the advanceExact also becomes lazy, so it gets deferred until a value is actually needed.

UPDATE: I brought back the lazy ExpressionFunctionValues#advanceExact logic because it created confusion and isn't necessary to remove. It doesn't hurt to keep this lazy logic in place. I had removed it (making it greedy again) as an attempt to simplify, but since it created confusion, I figured I'd bring it back.

Oh, I also added another test-case to show that this laziness also works for condition short-circuiting (like you asked about @msokolov). And just to clarify, this is for all expressions.

thank you, nice!

zhaih · 2023-09-15T20:51:13Z

lucene/expressions/src/java/org/apache/lucene/expressions/ExpressionFunctionValues.java

    if (currentDoc == doc) {
      return true;
    }
+    for (DoubleValues v : functionValues) {
+      v.advanceExact(doc);


I see, thank you for explaining, makes sense to me!

msokolov · 2023-09-15T21:10:21Z

lucene/expressions/src/java/org/apache/lucene/expressions/ExpressionValueSource.java

-      public boolean advanceExact(int doc) throws IOException {
-        return positioned = in.advanceExact(doc);
+      public boolean advanceExact(int doc) {
+        currentDoc = doc;


is there any sense in checking if we are advancing to the same doc?

Thanks for the suggestion. I'll include that logic.

msokolov · 2023-09-15T21:18:37Z

lucene/expressions/src/java/org/apache/lucene/expressions/ExpressionFunctionValues.java

    if (currentDoc == doc) {
      return true;
    }
+    for (DoubleValues v : functionValues) {
+      v.advanceExact(doc);


I'm still a bit confused - it seems like we're pushing things around, but I know you have some compelling-looking results so it must be OK? But I think it would be great if you could distill whatever knowledge you've garnered into a comment here. The missing link for me is -- what is ExpressionFunctionValues representing? Is that used for functions only (like max()) or generally for any expression? I guess if it's only for functions then it makes sense that we are now getting lazy advance for other expressions that might not need all their subs. But then have we lost some laziness for functions? Does this work for short-circuited booleans to? Like (1 == 1) || expensive()?

…n and maybe for slight additional benefit

gsmiller · 2023-09-15T23:26:08Z

Thanks @zhaih / @msokolov ! Merging.

… needed (#12560)

gsmiller · 2023-09-22T17:47:52Z

Circling back on this: For Amazon's Product Search engine, we make fairly heavy use of these expression implementations. I pulled this change into our Lucene fork early (currently on 9.7) and ran internal benchmarks we have, and this produced a ~23% redline QPS improvement. Milage may vary of course, but the impact was significant so others may find a nice win as well (if heavy expression users).

gsmiller added 2 commits September 15, 2023 07:47

Defer #advanceExact on expression dependencies until their values are…

3250a19

… needed This extends the idea in GH#11878 to avoid advancing dependencies that are never referenced because of expression branching (i.e., ternary expressions).

add test

6c206f9

zhaih reviewed Sep 15, 2023

View reviewed changes

zhaih approved these changes Sep 15, 2023

View reviewed changes

msokolov reviewed Sep 15, 2023

View reviewed changes

gsmiller added 4 commits September 15, 2023 14:55

another test example and doc

e97c325

bring back lazy advance in ExpressionFunctionValues to avoid confusio…

b9d7177

…n and maybe for slight additional benefit

fix changes number

4f72f16

pr feedback

d912f21

gsmiller merged commit 58a50bf into apache:main Sep 15, 2023
4 checks passed

gsmiller deleted the GH-delay-expression-advance branch September 15, 2023 23:26

gsmiller added a commit that referenced this pull request Sep 15, 2023

Defer #advanceExact on expression dependencies until their values are…

af978fb

… needed (#12560)

gsmiller mentioned this pull request Sep 18, 2023

Expression: add a set of duplicate variables #12445

Closed

zhaih added this to the 9.8.0 milestone Sep 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Defer #advanceExact on expression dependencies until their values are needed #12560

Defer #advanceExact on expression dependencies until their values are needed #12560

gsmiller commented Sep 15, 2023

zhaih Sep 15, 2023

gsmiller Sep 15, 2023 •

edited

zhaih Sep 15, 2023

gsmiller Sep 15, 2023

zhaih Sep 15, 2023

msokolov Sep 15, 2023

gsmiller Sep 15, 2023

gsmiller Sep 15, 2023

gsmiller Sep 15, 2023

msokolov Sep 15, 2023

zhaih Sep 15, 2023

msokolov Sep 15, 2023

gsmiller Sep 15, 2023

msokolov Sep 15, 2023

gsmiller commented Sep 15, 2023

gsmiller commented Sep 22, 2023

Defer #advanceExact on expression dependencies until their values are needed #12560

Defer #advanceExact on expression dependencies until their values are needed #12560

Conversation

gsmiller commented Sep 15, 2023

Description

Choose a reason for hiding this comment

gsmiller Sep 15, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsmiller commented Sep 15, 2023

gsmiller commented Sep 22, 2023

gsmiller Sep 15, 2023 •

edited