Option for speeding up runtime field queries #81124

Draft
wants to merge 49 commits into
base: main
Conversation

nik9000 (Member) commented Nov 29, 2021

This adds an option to runtime fields that attempts to speed up queries
against them by running a first pass "approximation" against the search
index. This should usually be faster than running the script against
every potential match which is how we execute queries against runtime
fields right now.

Take, for example, the query for the HTTP method POST against an
apache log line. Apache logs look like:

6.157.0.0 - - [1998-05-03T03:48:46-05:00] "POST /cgi-bin/trivia/Trivia.pl HTTP/1.0" 200 4807

And you can extract the method with a script like:

String m = doc["message"].value;
int start = m.indexOf('"') + 1;
int end = m.indexOf(" ", start);
emit(m.substring(start, end));

A query like {"term": {"method": "POST"}} can be approximated as a
search for the substring POST.
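
To make the idea concrete, here is a minimal Java sketch of the two-pass approach. It is not code from this PR: `MethodApproximationSketch`, `countMatches`, and the second and third log lines are made up for illustration. `extractMethod` mirrors the Painless script above. The substring check is only a necessary condition, so the script still runs on every candidate, just on fewer documents.

```java
import java.util.Arrays;
import java.util.List;

final class MethodApproximationSketch {
    // The first line is the sample from the description; the other two are invented.
    static final List<String> SAMPLE = Arrays.asList(
        "6.157.0.0 - - [1998-05-03T03:48:46-05:00] \"POST /cgi-bin/trivia/Trivia.pl HTTP/1.0\" 200 4807",
        "6.157.0.1 - - [1998-05-03T03:48:47-05:00] \"GET /index.html HTTP/1.0\" 200 1024",
        "6.157.0.2 - - [1998-05-03T03:48:48-05:00] \"GET /POST/help.html HTTP/1.0\" 200 512"
    );

    /** Mirrors the Painless script: the text between the first quote and the next space. */
    static String extractMethod(String message) {
        int start = message.indexOf('"') + 1;
        int end = message.indexOf(" ", start);
        return message.substring(start, end);
    }

    /** Returns {matches, scriptRuns} for a term query against the extracted method. */
    static int[] countMatches(List<String> messages, String term, boolean approximateFirst) {
        int matches = 0;
        int scriptRuns = 0;
        for (String message : messages) {
            // First pass: a cheap substring check that can only produce false positives.
            if (approximateFirst && message.contains(term) == false) {
                continue;
            }
            // Second pass: the "script" confirms or rejects each candidate.
            scriptRuns++;
            if (extractMethod(message).equals(term)) {
                matches++;
            }
        }
        return new int[] { matches, scriptRuns };
    }

    public static void main(String[] args) {
        int[] slow = countMatches(SAMPLE, "POST", false);
        int[] fast = countMatches(SAMPLE, "POST", true);
        System.out.println("without approximation: matches=" + slow[0] + " scriptRuns=" + slow[1]);
        System.out.println("with approximation: matches=" + fast[0] + " scriptRuns=" + fast[1]);
    }
}
```

The third line contains POST in the path, so it survives the first pass and is only rejected by the script. That is the kind of false positive the second pass exists to catch.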

Our rally track http_logs has exactly this data, with the message field
mapped as a wildcard field. Without the approximation
added by this change the query above has to run the script 25 million
times. On my desktop it takes about 40 seconds. With the approximation
the query above has to run the script 400 thousand
times. On my desktop this lets the working set fit into memory and the
query takes 250 milliseconds. That's 160 times faster. Great! That's
pretty close to a best case scenario for this change, though it's not
totally uncommon.

The actual best case scenario for this change is a constant. It's
reasonable to create a runtime field to add a constant value to data
in an old index with a script like emit(0). This change will
approximate range and term queries against this runtime field as
either match_all or match_none. The latter will skip running the
script entirely which seems like it could be a substantial performance
improvement. match_all is basically how runtime fields work now.
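
A rough Java sketch of the constant case, assuming inclusive range bounds. `ConstantApproximationSketch` and its `Approximation` enum are hypothetical names, not the PR's API.

```java
final class ConstantApproximationSketch {
    enum Approximation { MATCH_ALL, MATCH_NONE }

    /** A term query against emit(constant) matches every document or none. */
    static Approximation approximateTerm(long constant, long term) {
        return constant == term ? Approximation.MATCH_ALL : Approximation.MATCH_NONE;
    }

    /** A range query with inclusive bounds against emit(constant). */
    static Approximation approximateRange(long constant, long lower, long upper) {
        return lower <= constant && constant <= upper ? Approximation.MATCH_ALL : Approximation.MATCH_NONE;
    }

    public static void main(String[] args) {
        System.out.println(approximateTerm(0, 0));
        System.out.println(approximateTerm(0, 1));
        System.out.println(approximateRange(0, 5, 10));
    }
}
```

In the MATCH_NONE case there are no candidates left, so the script never has to run.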

It's also reasonably common to write runtime fields that convert units
like emit(doc.rx.value / 1024) which converts bytes to kibibytes. This
change allows queries like {"range": {"rx.kb": {"gt": 1000}}} to be
approximated by {"range": {"rx": {"gt": 1024000}}}. If the approximation
is very selective then this is also a huge performance boost. If the
approximation isn't selective then it should amount to no change in speed.
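
The arithmetic behind reversing a "greater than" bound through a division can be sketched in Java as below. This assumes truncating long division and a positive divisor; `DivisionRangeSketch` and `approximateGreaterThan` are hypothetical names. For the kibibyte example the reversed bound is 1000 * 1024 = 1024000. The smallest value that actually satisfies `rx / 1024 > 1000` is 1025024, so the approximation is slightly loose, which is acceptable because the script rechecks every candidate.

```java
final class DivisionRangeSketch {
    /**
     * For the condition (field / divisor > bound), returns an exclusive lower
     * bound on the field: every matching field value is greater than the result.
     */
    static long approximateGreaterThan(long bound, long divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("this sketch only handles positive divisors");
        }
        // multiplyExact throws instead of silently overflowing.
        return Math.multiplyExact(bound, divisor);
    }

    public static void main(String[] args) {
        System.out.println("rx.kb > 1000 is approximated by rx > " + approximateGreaterThan(1000, 1024));
    }
}
```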

All of this is enabled with a new parameter on the runtime field
definition:

"rx.kb": {
  "type": "long",
  "script": "emit(doc.rx.value / 1024)",
  "approximate_first": true     <----- this
}

This works by converting the contents of the emit expression into a
new thing called a QueryableExpression which can approximate a term
or range query by "reversing" the expression into queries to the
underlying field. That means there are really five parts to this change:

  1. Converting bits of the Painless AST into QueryableExpressions.
  2. Plumbing the QueryableExpressions out to the runtime fields.
  3. Reversing the queries to the expression into approximation queries.
  4. Implementing the approximation queries on MappedFieldType.
  5. Plugging the approximation queries into the runtime field query.
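
As a hypothetical miniature of parts 1 and 3, the Java sketch below builds a tiny expression tree and "reverses" a term query through it. None of these class or method names come from the PR; the real `QueryableExpression` API is not shown on this page, and the query is represented as a plain string only to keep the example short.

```java
final class QueryableExpressionSketch {
    /** Hypothetical stand-in for an expression over long values. */
    interface LongExpression {
        /** Describes a query on the underlying field matching at least every doc where this expression equals term. */
        String approximateTermQuery(long term);
    }

    static final class Field implements LongExpression {
        private final String name;
        Field(String name) { this.name = name; }
        @Override public String approximateTermQuery(long term) { return name + ":" + term; }
        @Override public String toString() { return name; }
    }

    static final class Constant implements LongExpression {
        private final long value;
        Constant(long value) { this.value = value; }
        @Override public String approximateTermQuery(long term) { return value == term ? "match_all" : "match_none"; }
        @Override public String toString() { return Long.toString(value); }
    }

    static final class Add implements LongExpression {
        private final LongExpression lhs;
        private final long rhs;
        Add(LongExpression lhs, long rhs) { this.lhs = lhs; this.rhs = rhs; }
        // (lhs + rhs == term) is equivalent to (lhs == term - rhs), even when long arithmetic wraps around.
        @Override public String approximateTermQuery(long term) { return lhs.approximateTermQuery(term - rhs); }
        @Override public String toString() { return lhs + " + " + rhs; }
    }

    public static void main(String[] args) {
        LongExpression expression = new Add(new Field("rx"), 5);
        System.out.println(expression + " == 12 is approximated by " + expression.approximateTermQuery(12));
    }
}
```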

Co-authored-by: Lukas Wegmann <lukas.wegmann@elastic.co>
Co-authored-by: Jack Conradson <osjdconrad@gmail.com>

nik9000 and others added 30 commits November 11, 2021 14:23
That forces me to write tests for it more rigorously because I can't
depend on the scripting infrastructure.
* Converts `QueryableExpression` and `LongQueryableExpression` into an
  interface in preparation for there being more subclasses.
* Moves the default implementation into package private
  `AbstractLongQueryableExpression`.
* Creates `LongQueryableExpression.field(queryCallbacks)` method to
  build fields.
* Adapts `long` flavored query callbacks into `int` flavored query
  callbacks.
* Plugs those `int` flavored callbacks into the `integer`, `short`, and
  `byte` field types in ES
Adds some disabled tests that should pass once we have
`QueryableExpression` wired into painless's compiler.
Short circuit range queries where `from` is greater than `to` which can
never produce results. This is important because the approximation code
expects well formed ranges.
This should make it easier to extract the shape of the `emit` expression
at compile time and pick up mapping information when we build the query.
`DelayedQueryableExpression` really is a builder. Let's call it what
folks expect it to be called.
Now that we've plumbed `QueryableExpression` out of painless into the
query we can assert profile results. At this point only constants are
linked, but it shows that a query on a constant runtime field that
doesn't match never runs the script at all.
This adds a `CollectArgumentAnnotation` to tell painless to collect the
arguments to a class binding - that's what `emit` is. This replaces the
hard coded name and subclass lookup with an annotation in painless's
whitelist for the method.
Our `long` flavored constants were trying to commute division which ended
up converting `10 / doc.f.value` into `doc.f.value / 10`. That's not how
division works.
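
A short Java illustration of why the two forms cannot be swapped. `DivisionOrderSketch` is a made-up name and the field value 5 is arbitrary.

```java
final class DivisionOrderSketch {
    /** Models `10 / doc.f.value` for a constant numerator. */
    static long constantOverField(long constant, long fieldValue) {
        return constant / fieldValue;
    }

    /** Models `doc.f.value / 10` for a constant denominator. */
    static long fieldOverConstant(long fieldValue, long constant) {
        return fieldValue / constant;
    }

    public static void main(String[] args) {
        long fieldValue = 5;
        System.out.println("10 / f = " + constantOverField(10, fieldValue));
        System.out.println("f / 10 = " + fieldOverConstant(fieldValue, 10));
    }
}
```

With a field value of 5 the first form yields 2 and the second yields 0, so a term query reversed through the wrong form would target the wrong documents.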
* Side Public License, v 1.
*/

apply plugin: 'elasticsearch.build'
Member Author

QueryableExpression is its own lib to make it easier to test. It is also nice to test it in isolation from Elasticsearch.

@@ -218,6 +219,9 @@ ScriptScope compile(Loader loader, String name, String source, CompilerSettings
         new DefaultStringConcatenationOptimizationPhase().visitClass(classNode, null);
         new DefaultConstantFoldingOptimizationPhase().visitClass(classNode, null);
         new DefaultStaticConstantExtractionPhase().visitClass(classNode, scriptScope);
+        if (painlessLookup.collectArgumentsTargetMethods().isEmpty() == false) {
+            new CollectArgumentsPhase().visitClass(classNode, scriptScope.getQueryableExpressionScope());
+        }
Member Author

We run this phase on all long and string flavored runtime fields whether or not you've enabled approximations. I wonder if it's worth disabling it when you haven't enabled them, just for extra paranoia. I don't think that's likely to matter, but I've been wrong in the past.

Member Author

Well! The docs had a failure in this phase. So maybe that's a sign we should skip it when you haven't enabled the approximations.


public void testIntConst() {
    assertEquals("100", qe("emit(100)").toString());
}
Member Author

I think maybe we should think again about replacing toString here. I like that QueryableExpressions have a nice, readable toString, but I don't know that it's the best test. I think the alternative is to implement equals and hashCode on QueryableExpression but that's kind of heavy.

Contributor

More readable this way

}
return term.field() + ":*" + term.text() + "*";
}
}
Member Author

This query exists to have a nice toString for the profile output. I'm a little worried about the overhead of running the automaton query against high cardinality keyword fields, but in my testing it was much much much faster than running the script a zillion times. Still nothing like as fast as wildcard fields. I was tempted not to even implement QueryableExpression here but I really want to have a test for substring in server.

@Override
public QueryableExpression asQueryableExpression(String field, boolean hasDocValues, SearchExecutionContext context) {
    return INTEGER.asQueryableExpression(field, hasDocValues, context);
}
Member Author

These fields are uncommon and are correctly, but inefficiently, approximated as integer fields.

@@ -49,7 +63,7 @@ public final String toString(String field) {
             b.append(fieldName()).append(':');
         }
         b.append('[').append(lowerValue).append(" TO ").append(upperValue).append(']');
-        return b.toString();
+        return b.toString() + " approximated by " + approximation(); // TODO move the toString enhancement into the superclass
Member Author

I think AbstractScriptFieldQuery should handle this bit of the toString but that'd require touching every subclass and felt like it should be saved for a follow up.

@@ -178,14 +174,6 @@ public void testFieldCaps() throws Exception {

     protected abstract Query randomTermsQuery(MappedFieldType ft, SearchExecutionContext ctx);

-    protected static SearchExecutionContext mockContext() {
-        return mockContext(true);
-    }
Member Author

Moved to MapperServiceTestCase.

if (string.length() < NGRAM_SIZE) {
continue;
}
// We're not using addClause here because the
Member Author

Hmmm - looks like I lost my train of thought. I believe this is because we only ever want the term queries. I think we've filtered out all the cases that'll cause other things, but this feels a little more readable this way.

Contributor

addClause will expand a search for ab into an ab* prefix query on the ngram index because we only index 3grams in the approximation index. Not sure if that's an optimisation you want to keep? Otherwise 2-character queries won't be accelerated at all. We took the decision not to try to accelerate single-character searches just because that could be a lot of ngram matching for little overall acceleration.

Member Author

I think I was worried about the start and end markers. I'm not against using ab* in theory, though it was nice that queries targeting runtime fields backed by wildcards would never have to be rewritten. OTOH, we aren't likely to have that many trigrams matching any particular bigram.
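
The trade-off in this thread can be sketched in Java as follows. This is a simplified illustration, not the wildcard field's real tokenization: it ignores the start and end markers and the token compression mentioned elsewhere on this page. `TrigramSketch`, `approximationClauses`, and the `expandShortTerms` flag are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class TrigramSketch {
    static final int NGRAM_SIZE = 3;

    /** Returns the clauses that would be looked up in a 3-gram index for a substring search. */
    static List<String> approximationClauses(String term, boolean expandShortTerms) {
        List<String> clauses = new ArrayList<>();
        if (term.length() >= NGRAM_SIZE) {
            // Every 3-character window of the term must appear in a matching document.
            for (int i = 0; i + NGRAM_SIZE <= term.length(); i++) {
                clauses.add(term.substring(i, i + NGRAM_SIZE));
            }
        } else if (expandShortTerms && term.length() == 2) {
            // A bigram can only be found as a prefix of some indexed trigram.
            clauses.add(term + "*");
        }
        // Single characters, and bigrams without expansion, produce no clauses and are not accelerated.
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(approximationClauses("POST", false));
        System.out.println(approximationClauses("ab", true));
        System.out.println(approximationClauses("ab", false));
    }
}
```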

return new MatchAllDocsQuery();
}

return approximation.build();
Member Author

This is quite similar to termQuery but it doesn't perform the double check. And it doesn't wrap through wildcards. I'm a bit scared of wrapping through wildcards here just because it seems like an opportunity for things to go wrong without providing much value. I didn't perform the double check because the script query will do that anyway. It doesn't feel worth it to do it twice.

"aaq",
"[61 71 0]",
"[71 0 0]"
);
Member Author

These tokens are a trip!

Contributor

compression...

Member Author

It's a neat trick!

@nik9000 nik9000 changed the title from "Plaid" to "Option for speeding up runtime field queries" Nov 29, 2021
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

public class StringScriptFieldTermQuery extends AbstractStringScriptFieldQuery {
Contributor

Will there be prefix query support at some point? (not suggesting adding to this PR - just curious).

Member Author

It's possible. We'd have to detect a substring with a leading constant of 0. But we can do that. It'd be fairly similar for a grok or dissect. Just more complex.

Contributor

I understand that prefix query on runtime fields isn't currently supported but was unsure why. Running a script over doc values to prefix query dv.startsWith(qTerm) doesn't seem like it would be slower than a term query qTerm==dv (actually it would probably be faster? Less bytes to check).

Member Author

Oh! You want to know why I only did this for the term query and not any
others. Yeah, I think we would do it eventually. I just didn't do it
in the initial cut. I figured I'd start simple.

import static org.hamcrest.Matchers.instanceOf;

public class KeywordScriptFieldApproximatedByWildcardFieldTests extends MapperServiceTestCase {
public void testTermQueryApproximated() throws IOException {
Contributor

Do you also need a test method for the approximateSubstringQuery functionality? e.g. one that has different expectations on approximation queries for short queries like a

Member Author

foo_last_word_approximated hits it.

Contributor

Not sure I've seen the equivalent test yet in this PR but the thing that revealed most bugs for me in the wildcard field was a randomised regex generator exercising varying levels of nesting and combos of clause types and lengths. I compared unaccelerated keyword field results with accelerated wildcard field results for equivalence and this revealed a lot of accelerator bugs.
Would it make sense to add something similar for randomising scripts and contrasting results with the "approximate_first" flag on vs off?
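
One way such an equivalence check could look, as a self-contained Java sketch rather than a real Elasticsearch test. It hard-codes the `rx / 1024` script from the description and compares a plain script evaluation against an "approximate first, then run the script" evaluation over random inputs. `RandomizedEquivalenceSketch` and its methods are hypothetical; a real test would randomize the script as well.

```java
import java.util.Random;

final class RandomizedEquivalenceSketch {
    /** Stand-in for the runtime field script emit(doc.rx.value / 1024). */
    static long script(long rx) {
        return rx / 1024;
    }

    /** Flag off: the script runs against every document. */
    static boolean matchesWithoutApproximation(long rx, long bound) {
        return script(rx) > bound;
    }

    /** Flag on: a range check on the indexed field runs first, then the script confirms. */
    static boolean matchesWithApproximation(long rx, long bound) {
        boolean candidate = rx > Math.multiplyExact(bound, 1024L);
        return candidate && script(rx) > bound;
    }

    /** Throws if the two evaluation strategies ever disagree; returns the number of matches seen. */
    static int check(long seed, int iterations) {
        Random random = new Random(seed);
        int matches = 0;
        for (int i = 0; i < iterations; i++) {
            long rx = random.nextInt(4_000_000) - 2_000_000;
            long bound = random.nextInt(4_000) - 2_000;
            boolean expected = matchesWithoutApproximation(rx, bound);
            if (expected != matchesWithApproximation(rx, bound)) {
                throw new AssertionError("mismatch for rx=" + rx + " bound=" + bound + " seed=" + seed);
            }
            if (expected) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println("matches seen: " + check(42L, 100_000));
    }
}
```

Including the seed in the failure message keeps a failing run reproducible.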

@nik9000
Member Author

nik9000 commented Dec 17, 2021

Mark left a comment a week or so ago asking for more randomized testing with the real classes - mostly around wildcard. I think it's a good idea. We have a fair bit of randomized testing for the expression language itself and that's nice, but no randomized testing for the bits that plug the fields into the expression language. It's worth doing that.

@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@javanna javanna added :Search/Search Search-related issues that do not fall into other categories >enhancement and removed v8.2.0 labels Mar 1, 2022
@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:09
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories
6 participants