Option for speeding up runtime field queries #81124

nik9000 · 2021-11-29T19:45:00Z

This adds an option to runtime fields that attempts to speed up queries
against them by running a first pass "approximation" against the search
index. This should usually be faster than running the script against
every potential match which is how we execute queries against runtime
fields right now.

Take, for example, the query for the HTTP method POST against an
apache log line. Apache logs look like:

6.157.0.0 - - [1998-05-03T03:48:46-05:00] "POST /cgi-bin/trivia/Trivia.pl HTTP/1.0" 200 4807

And you can extract the method with a script like:

String m = doc["message"].value;
int start = m.indexOf('"') + 1;
int end = m.indexOf(" ", start);
emit(m.substring(start, end));

A query like {"term": {"method": "POST"}} can be approximated as a
search for the substring POST.
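
To make the idea concrete, here is a minimal Java sketch of the two passes. The class and method names are hypothetical and none of this is code from the PR; it only mirrors the Painless script above and shows that the substring check can produce false positives but never false negatives.

```java
// Hypothetical names, a sketch of the idea rather than Elasticsearch code.
final class MethodApproximationSketch {
    // Mirrors the Painless script: the text between the first '"' and the next space.
    static String extractMethod(String message) {
        int start = message.indexOf('"') + 1;
        int end = message.indexOf(" ", start);
        return message.substring(start, end);
    }

    // First pass: cheap substring check. It may over-match, but it never
    // rejects a line whose extracted method equals the term.
    static boolean approximate(String message, String term) {
        return message.contains(term);
    }

    // Second pass: the exact check, only reached by lines that survive the first pass.
    static boolean matches(String message, String term) {
        return approximate(message, term) && extractMethod(message).equals(term);
    }
}
```

A line such as `"GET /POST.html HTTP/1.0"` passes the first pass and is then rejected by the script, which is why the script still has to run on the survivors.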

Our rally track http_logs has exactly this data with the message field
mapped as a wildcard field. Without the approximation
added by this change the query above has to run the script 25 million
times. On my desktop it takes about 40 seconds. With the approximation
added by this change the query above has to run the script 400 thousand
times. On my desktop this lets the working set fit into memory and the
query takes 250 milliseconds. That's 160 times faster. Great! That's
pretty close to a best case scenario for this change, though it's not
totally uncommon.

The actual best case scenario for this change is a constant. It's
reasonable to create a runtime field to add a constant value to data
in an old index with a script like emit(0). This change will
approximate range and term queries against this runtime field as
either match_all or match_none. The latter will skip running the
script entirely which seems like it could be a substantial performance
improvement. match_all is basically how runtime fields work now.
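
As a rough illustration of the constant case, the sketch below decides between match_all and match_none for a script like emit(0). The class, enum, and method names are assumptions made for this example and are not the PR's API.

```java
// Hypothetical sketch: approximating queries against a runtime field
// whose script is emit(<constant>).
final class ConstantApproximationSketch {
    enum Approximation { MATCH_ALL, MATCH_NONE }

    // Term query: either every document matches or none does.
    static Approximation approximateTerm(long constant, long term) {
        return constant == term ? Approximation.MATCH_ALL : Approximation.MATCH_NONE;
    }

    // Range query with inclusive bounds.
    static Approximation approximateRange(long constant, long lower, long upper) {
        return lower <= constant && constant <= upper ? Approximation.MATCH_ALL : Approximation.MATCH_NONE;
    }
}
```

When the result is MATCH_NONE there is nothing left for the script to verify, which is where the savings come from.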

It's also reasonably common to write runtime fields that convert units
like emit(doc.rx.value / 1024) which converts bytes to kibibytes. This
change allows queries like {"range": {"rx.kb":{"gt": 1000}}} to be
approximated by {"range": {"rx":{"gt": 1024000}}}. If the approximation
is very selective then this is also a huge performance boost. If the
approximation isn't selective then it should amount to no change in speed.
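
The arithmetic behind reversing the division can be sketched as follows. This assumes a non-negative threshold, a positive divisor, and truncating integer division; the helper names are made up for the example and the PR may compute its bounds differently.

```java
// Hypothetical sketch of reversing "rx / divisor > threshold" into a bound on rx.
final class UnitConversionSketch {
    // Loose bound: rx / divisor > threshold implies rx > threshold * divisor.
    // This may let through values the script later rejects, which is fine
    // for an approximation.
    static long looseExclusiveLowerBound(long threshold, long divisor) {
        return Math.multiplyExact(threshold, divisor);
    }

    // Tight bound: the smallest rx for which rx / divisor > threshold holds.
    static long tightInclusiveLowerBound(long threshold, long divisor) {
        return Math.multiplyExact(threshold + 1, divisor);
    }
}
```

For a threshold of 1000 and a divisor of 1024 the loose bound is 1000 * 1024 = 1024000 and the tight bound is 1001 * 1024 = 1025024. Values between the two pass the approximation and are then filtered out by the script.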

All of this is enabled with a new parameter on the runtime field
definition:

"rx.kb": {
  "type": "long",
  "script": "emit(doc.rx.value / 1024)",
  "approximate_first": true     <----- this
}

This works by converting the contents of the emit expression into a
new thing called a QueryableExpression which can approximate a term
or range query by "reversing" the expression into queries to the
underlying field. That means there are really five parts to this change:

1. Converting bits of the Painless AST into QueryableExpressions.
2. Plumbing the QueryableExpressions out to the runtime fields.
3. Reversing the queries to the expression into approximation queries.
4. Implementing the approximation queries on MappedFieldType.
5. Plugging the approximation queries into the runtime field query.
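
The "reversing" step can be pictured with a toy expression tree. The interface below is a heavily simplified, hypothetical stand-in and not the QueryableExpression API from this PR; it renders approximations as strings and deliberately widens the range a little, since an approximation only has to match a superset of the real hits.

```java
// Hypothetical, simplified stand-in for the idea behind QueryableExpression.
interface LongExpressionSketch {
    // Describe a query on the underlying field that matches every document
    // for which this expression could fall in [lower, upper].
    String approximateRange(long lower, long upper);

    // A term query is a range with equal bounds.
    default String approximateTerm(long term) {
        return approximateRange(term, term);
    }

    // Leaf: a plain indexed field, queried directly.
    static LongExpressionSketch field(String name) {
        return (lower, upper) -> name + ":[" + lower + " TO " + upper + "]";
    }

    // inner / divisor: push the range down to the inner expression.
    // The bounds are conservative and overflow is ignored in this sketch.
    static LongExpressionSketch divide(LongExpressionSketch inner, long divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("sketch only handles positive divisors");
        }
        return (lower, upper) -> inner.approximateRange(
            lower * divisor - (divisor - 1),
            upper * divisor + (divisor - 1)
        );
    }
}
```

With this sketch, `divide(field("rx"), 1024).approximateTerm(1000)` yields a range query on rx itself, which is the shape of query the underlying field type can answer from its index.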

Co-authored-by: Lukas Wegmann <lukas.wegmann@elastic.co>
Co-authored-by: Jack Conradson <osjdconrad@gmail.com>

* Converts `QueryableExpression` and `LongQueryableExpression` into an interface in preparation for there being more subclasses.
* Moves the default implementation into package private `AbstractLongQueryableExpression`.
* Creates `LongQueryableExpression.field(queryCallbacks)` method to build fields.
* Adapts `long` flavored query callbacks into `int` flavored query callbacks.
* Plugs those `int` flavored callbacks into the `integer`, `short`, and `byte` field types in ES.

nik9000 · 2021-11-29T19:47:56Z

libs/queryable-expression/build.gradle

+ * Side Public License, v 1.
+ */
+
+apply plugin: 'elasticsearch.build'


QueryableExpression is its own lib to make it easier to test. It is also nice to test it in isolation from Elasticsearch.

nik9000 · 2021-11-29T19:52:18Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/Compiler.java

@@ -218,6 +219,9 @@ ScriptScope compile(Loader loader, String name, String source, CompilerSettings
        new DefaultStringConcatenationOptimizationPhase().visitClass(classNode, null);
        new DefaultConstantFoldingOptimizationPhase().visitClass(classNode, null);
        new DefaultStaticConstantExtractionPhase().visitClass(classNode, scriptScope);
+        if (painlessLookup.collectArgumentsTargetMethods().isEmpty() == false) {
+            new CollectArgumentsPhase().visitClass(classNode, scriptScope.getQueryableExpressionScope());
+        }


We run this phase on all long and string flavored runtime fields whether or not you've enabled approximations. I wonder if it's worth disabling it when you haven't enabled them, just for extra paranoia. I don't think that's likely to matter, but I've been wrong in the past.

Well! The docs had a failure in this phase. So maybe that's a sign we should skip it when you haven't enabled the approximations.

nik9000 · 2021-11-29T19:55:34Z

modules/lang-painless/src/test/java/org/elasticsearch/painless/CollectArgumentTests.java

+
+    public void testIntConst() {
+        assertEquals("100", qe("emit(100)").toString());
+    }


I think maybe we should think again about replacing toString here. I like that QueryableExpressions have a nice, readable toString, but I don't know that it's the best test. I think the alternative is to implement equals and hashCode on QueryableExpression but that's kind of heavy.

More readable this way

nik9000 · 2021-11-29T19:58:31Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+            }
+            return term.field() + ":*" + term.text() + "*";
+        }
+    }


This query exists to have a nice toString for the profile output. I'm a little worried about the overhead of running the automaton query against high cardinality keyword fields, but in my testing it was much much much faster than running the script a zillion times. Still nothing like as fast as wildcard fields. I was tempted not to even implement QueryableExpression here but I really want to have a test for substring in server.

nik9000 · 2021-11-29T20:03:36Z

server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java

+            @Override
+            public QueryableExpression asQueryableExpression(String field, boolean hasDocValues, SearchExecutionContext context) {
+                return INTEGER.asQueryableExpression(field, hasDocValues, context);
+            }


These fields are uncommon and are correctly, but inefficiently, approximated as integer fields.

nik9000 · 2021-11-29T20:11:15Z

server/src/main/java/org/elasticsearch/search/runtime/LongScriptFieldRangeQuery.java

@@ -49,7 +63,7 @@ public final String toString(String field) {
            b.append(fieldName()).append(':');
        }
        b.append('[').append(lowerValue).append(" TO ").append(upperValue).append(']');
-        return b.toString();
+        return b.toString() + " approximated by " + approximation();  // TODO move the toString enhancement into the superclass


I think AbstractScriptFieldQuery should handle this bit of the toString but that'd require touching every subclass and felt like it should be saved for a follow up.

nik9000 · 2021-11-29T20:11:44Z

server/src/test/java/org/elasticsearch/index/mapper/AbstractScriptFieldTypeTestCase.java

@@ -178,14 +174,6 @@ public void testFieldCaps() throws Exception {

    protected abstract Query randomTermsQuery(MappedFieldType ft, SearchExecutionContext ctx);

-    protected static SearchExecutionContext mockContext() {
-        return mockContext(true);
-    }


Moved to MapperServiceTestCase.

nik9000 · 2021-11-29T20:14:02Z

...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java

+                        if (string.length() < NGRAM_SIZE) {
+                            continue;
+                        }
+                        // We're not using addClause here because the


Hmmm - looks like I lost my train of thought. I believe this is because we only ever want the term queries. I think we've filtered out all the cases that'll cause other things, but this feels a little more readable this way.

addClause will expand a search for ab as an ab* prefix query on the ngram index because we only index 3grams in the approximation index. Not sure if that's an optimisation you want to keep, otherwise 2-character queries won't be accelerated at all? We took the decision not to try to accelerate single-character searches just because that could be a lot of ngram matching for little overall acceleration.

I think I was worried about the start and end markers. I'm not against using ab* in theory, though it was nice that queries targeting runtime fields backed by wildcards would never have to be rewritten. OTOH, we aren't likely to have that many trigrams matching any particular bigram.
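
For readers following this thread, here is a small Java sketch of the trigram split being discussed. It only reuses the NGRAM_SIZE name from the diff and the 3gram size mentioned above; it ignores the start and end markers, any normalisation, and the token encoding the real wildcard field applies, so treat it as an illustration rather than the field's actual behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a search string into the trigrams that an
// ngram index could answer with plain term queries.
final class NgramSketch {
    static final int NGRAM_SIZE = 3;

    static List<String> ngrams(String search) {
        List<String> result = new ArrayList<>();
        // Strings shorter than one ngram produce no term queries at all,
        // which is the case the prefix expansion (ab*) would otherwise cover.
        if (search.length() < NGRAM_SIZE) {
            return result;
        }
        for (int i = 0; i + NGRAM_SIZE <= search.length(); i++) {
            result.add(search.substring(i, i + NGRAM_SIZE));
        }
        return result;
    }
}
```

In this sketch a two-character search like ab yields no trigrams, so without the ab* expansion it gets no acceleration and every candidate document goes to the script.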

nik9000 · 2021-11-29T20:16:03Z

...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java

+                        return new MatchAllDocsQuery();
+                    }
+
+                    return approximation.build();


This is quite similar to termQuery but it doesn't perform the double check. And it doesn't wrap through wildcards. I'm a bit scared of wrapping through wildcards here just because it seems like an opportunity for things to go wrong without providing much value. I didn't perform the double check because the script query will do that anyway. It doesn't feel worth it to do it twice.

nik9000 · 2021-11-29T20:16:21Z

.../elasticsearch/xpack/wildcard/mapper/KeywordScriptFieldApproximatedByWildcardFieldTests.java

+                    "aaq",
+                    "[61 71 0]",
+                    "[71 0 0]"
+                );


These tokens are a trip!

compression...

It's a neat trick!

markharwood · 2021-12-06T15:31:50Z

server/src/main/java/org/elasticsearch/search/runtime/StringScriptFieldTermQuery.java

 import java.util.List;
 import java.util.Objects;
+import java.util.function.Function;

 public class StringScriptFieldTermQuery extends AbstractStringScriptFieldQuery {


Will there be prefix query support at some point? (not suggesting adding to this PR - just curious).

It's possible. We'd have to detect a substring with a leading constant of 0. But we can do that. It'd be fairly similar for a grok or dissect. Just more complex.

I understand that prefix query on runtime fields isn't currently supported but was unsure why. Running a script over doc values to prefix query dv.startsWith(qTerm) doesn't seem like it would be slower than a term query qTerm==dv (actually it would probably be faster? Fewer bytes to check).

Oh! You want to know why I only did this for the term query and not any
of the others. Yeah, I think we would do it eventually. I just didn't do it
in the initial cut. I figured I'd start simple.

markharwood · 2021-12-06T15:49:04Z

.../elasticsearch/xpack/wildcard/mapper/KeywordScriptFieldApproximatedByWildcardFieldTests.java

+import static org.hamcrest.Matchers.instanceOf;
+
+public class KeywordScriptFieldApproximatedByWildcardFieldTests extends MapperServiceTestCase {
+    public void testTermQueryApproximated() throws IOException {


Do you also need a test method for the approximateSubstringQuery functionality? E.g. one that has different expectations on approximation queries for short queries like a

foo_last_word_approximated hits it.

Not sure I've seen the equivalent test yet in this PR but the thing that revealed most bugs for me in the wildcard field was a randomised regex generator exercising varying levels of nesting and combos of clause types and lengths. I compared unaccelerated keyword field results with accelerated wildcard field results for equivalence and this revealed a lot of accelerator bugs.
Would it make sense to add something similar for randomising scripts and contrasting results with the "approximate_first" flag on vs off?
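
A much simpler version of that kind of check can be sketched in plain Java. It randomises only the data and the threshold, not the script, and every name in it is hypothetical; the point is just the shape of the test, which runs the same query with and without the first pass and requires identical hit counts.

```java
import java.util.Random;

// Hypothetical sketch of an on-vs-off equivalence test for an approximation.
final class RandomizedApproximationCheck {
    // What the script alone decides: kibibytes above the threshold.
    static boolean script(long rx, long threshold) {
        return rx / 1024 > threshold;
    }

    // The first pass on the underlying field. Assumes non-negative values.
    static boolean approximation(long rx, long threshold) {
        return rx > threshold * 1024;
    }

    static long count(long[] docs, long threshold, boolean approximateFirst) {
        long hits = 0;
        for (long rx : docs) {
            if (approximateFirst && approximation(rx, threshold) == false) {
                continue; // skipped without running the script
            }
            if (script(rx, threshold)) {
                hits++;
            }
        }
        return hits;
    }

    // True if both modes agree on every randomly generated case.
    static boolean agree(long seed, int iterations) {
        Random random = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            long[] docs = random.longs(100, 0, 10_000_000).toArray();
            long threshold = random.nextInt(10_000);
            if (count(docs, threshold, true) != count(docs, threshold, false)) {
                return false;
            }
        }
        return true;
    }
}
```

A disagreement between the two modes would mean the approximation dropped a document the script would have matched, which is exactly the class of bug this style of test is meant to surface.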

nik9000 · 2021-12-17T14:20:39Z

Mark left a comment a week or so ago asking for more randomized testing with the real classes - mostly around wildcard. I think it's a good idea. We have a fair bit of randomized testing for the expression language itself and that's nice, but no randomized testing for the bits that plug the fields into the expression language. It's worth doing that.

nik9000 and others added 30 commits November 11, 2021 14:23

Init project plaid

d3b8776

Move queryable-expression to lib

e4682ec

That forces me to write tests for it more rigorously because I can't depend on the scripting infrastructure.

first try at QueryableExpressionCollectionPhase and a few tests for it

b2e3f10

also collect int constants

17ab8f5

clean up unit tests

916ab11

Add some tests that for wired expressions

632a840

Adds some disabled tests that should pass once we have `QueryableExpression` wired into painless's compiler.

Don't allow backwards range queries

d447ee4

Short circuit range queries where `from` is greater than `to` which can never produce results. This is important because the approximation code expects well formed ranges.

Fixup magnitude

25eabb8

Add DelayedQueryableExpression to help building

456d7dd

This should make it easier to extract the shape of the `emit` expression at compile time and pick up mapping information when we build the query.

add plumbing to get qe out of painless

edb155a

Rename to builder

e8ce773

`DelayedQueryableExpression` really is a builder. Let's call it what folks expect it to be called.

Fix assertions

4ab20e7

Now that we've plumbed `QueryableExpression` out of painless into the query we can assert profile results. At this point only constants are linked, but it shows that a query on a constant runtime field that doesn't match never runs the script at all.

track emit

d3ae8e7

extract field ref

05a7b3a

support doc.<field> syntax as well and enable e2e tests

248cb94

support params

8f81f60

Merge branch 'master' into plaid

83a2f2f

Update tests

6b10019

Another test we should be able to approximate

118f548

only run QueryableExpressionCollectionPhase when it's found on the factory

016bd86

Create an annotation to signal to collect args

22bd7d8

This adds a `CollectArgumentAnnotation` to tell painless to collect the arguments to a class binding - that's what `emit` is. This replaces the hard coded name and subclass lookup with an annotation in painless's whitelist for method.

Remove unused

58f4c10

support all scripts with single call to emit

970504e

Fail to run if the factory doesn't have method

d061429

Make sure the arg collection return args

1726b20

Int constants

c009b20

More tests for ints

8a603c9

Division is not commutable

923966f

Our `long` flavored constants were trying to commute division which ended up converting `10 / doc.f.value` into `doc.f.value / 10`. That's not how division works.

Update examples

4c56f87
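
To illustrate why the "Division is not commutable" fix above matters for approximations, the Java sketch below enumerates which field values satisfy a term query under each operand order. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the same term query selects different field values
// depending on which side of the division the field is on.
final class DivisionOrderSketch {
    // Field values in [1, max] for which constant / field == term.
    static List<Long> constantOverField(long constant, long term, long max) {
        List<Long> matches = new ArrayList<>();
        for (long field = 1; field <= max; field++) {
            if (constant / field == term) {
                matches.add(field);
            }
        }
        return matches;
    }

    // Field values in [1, max] for which field / constant == term.
    static List<Long> fieldOverConstant(long constant, long term, long max) {
        List<Long> matches = new ArrayList<>();
        for (long field = 1; field <= max; field++) {
            if (field / constant == term) {
                matches.add(field);
            }
        }
        return matches;
    }
}
```

For a constant of 10 and a term of 2, the first form matches field values 4 and 5 while the second matches 20 through 29, so an approximation built from the swapped expression would skip documents the script should have seen.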

nik9000 added 2 commits November 22, 2021 18:04

Term query on keyword and substring fields

0c62505

Merge branch 'master' into plaid

a9bae04

nik9000 requested review from markharwood, jpountz, Luegg and jdconrad November 29, 2021 19:45

elasticsearchmachine added the v8.1.0 label Nov 29, 2021

nik9000 added 2 commits November 29, 2021 14:54

Merge branch 'master' into plaid

0c7a636

Javadoc

89c265b

nik9000 commented Nov 29, 2021

nik9000 changed the title ~~Plaid~~ Option for speeding up runtime field queries Nov 29, 2021

nik9000 added 9 commits November 29, 2021 16:48

Only enable compiler phase is requested

6fee1be

Better

c5127c3

And fix the phase

Spotless

9036605

Spotless

3e8e23c

Another bug

0a54fa7

Merge branch 'master' into plaid

0d399bd

update explain

c715f2b

Merge branch 'master' into plaid

4d49d39

Again?!

bce9795

markharwood reviewed Dec 6, 2021

mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022

javanna added :Search/Search Search-related issues that do not fall into other categories >enhancement and removed v8.2.0 labels Mar 1, 2022

javanna mentioned this pull request May 30, 2022

Add optional result caching for runtime mapping field. #86318

Closed

elasticsearchmachine changed the base branch from master to main July 22, 2022 23:09
