New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option for speeding up runtime field queries #81124
base: main
Are you sure you want to change the base?
Conversation
That forces me to write tests for it more rigorously because I can't depend on the scripting infrastructure.
* Converts `QueryableExpression` and `LongQueryableExpression` into an interface in preparation for there being more subclasses. * Moves the default implementation into package private `AbstractLongQueryableExpression`. * Creates `LongQueryableExpression.field(queryCallbacks)` method to build fields. * Adapts `long` flavored query callbacks into `int` flavored query callbacks. * Plugs those `int` flavored callbacks into the `integer`, `short`, and `byte` field types in ES
Adds some disabled tests that should pass once we have `QueryableExpression` wired into painless's compiler.
Short circuit range quries where `from` is greater than `to` which can never produce results. This is important because the approximation code expects well formed ranges.
This should make it easier to extract the shape of the `emit` expression at compile time and pick up mapping information when we build the query.
`DelayedQueryableExpression` really is a builder. Let's call it what folks expect it to be called.
Now that we've plumbed `QueryableExpression` out of painless into the query we can assert profile results. At this point only constants are linked, but it shows that a query on a constant runtime field that doesn't match never runs the script at all.
This adds a `CollectArgumentAnnotation` to tell painless to collect the arguments to a class binding - that's what `emit` is. This replaces the hard coded name and subclass lookup with an annotation in painless's whitelist for method.
Our `long` favored constants were trying to commute division which ended up converting `10 / doc.f.value` into `doc.f.value / 10`. That's not how division works.
* Side Public License, v 1. | ||
*/ | ||
|
||
apply plugin: 'elasticsearch.build' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QueryableExpression
is it's own lib to make it easier to test it. It is also nice to test it in isolation from Elasticsearch.
@@ -218,6 +219,9 @@ ScriptScope compile(Loader loader, String name, String source, CompilerSettings | |||
new DefaultStringConcatenationOptimizationPhase().visitClass(classNode, null); | |||
new DefaultConstantFoldingOptimizationPhase().visitClass(classNode, null); | |||
new DefaultStaticConstantExtractionPhase().visitClass(classNode, scriptScope); | |||
if (painlessLookup.collectArgumentsTargetMethods().isEmpty() == false) { | |||
new CollectArgumentsPhase().visitClass(classNode, scriptScope.getQueryableExpressionScope()); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We run this phase on all long
and string
flavored runtime fields whether or not you've enabled approximations. I wonder if it's worth disabling it when you haven't enabled them, just for extra paranoia. I don't think that's likely to matter, but I've been wrong in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well! The docs had a failure in this phase. So maybe that's a sign we should skip it when you haven't enabled the approximations.
|
||
public void testIntConst() { | ||
assertEquals("100", qe("emit(100)").toString()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe we should think again about replacing toString
here. I like that QueryableExpression
s have a nice, readable toString
, but I don't know that its the best test. I think the alternative is to implement equals
and hashCode
on QueryableExpression
but that's kind of heavy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More readable this way
} | ||
return term.field() + ":*" + term.text() + "*"; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query exists to have a nice toString
for the profile output. I'm a little worried about the overhead of running the automaton query against high cardinality keyword
fields, but in my testing it was much much much faster than running the script a zillion times. Still nothing like as fast as wildcard
fields. I was tempted not to even implement QueryableExpression
here but I really want to have a test for substring
in server
.
@Override | ||
public QueryableExpression asQueryableExpression(String field, boolean hasDocValues, SearchExecutionContext context) { | ||
return INTEGER.asQueryableExpression(field, hasDocValues, context); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fields are uncommon and correctly, but inefficiently approximated as integer
fields.
@@ -49,7 +63,7 @@ public final String toString(String field) { | |||
b.append(fieldName()).append(':'); | |||
} | |||
b.append('[').append(lowerValue).append(" TO ").append(upperValue).append(']'); | |||
return b.toString(); | |||
return b.toString() + " approximated by " + approximation(); // TODO move the toString enhancement into the superclass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think AbstractScriptFieldQuery
should handle this bit of the toString
but that'd require touching every subclass and felt like it should be saved for a follow up.
@@ -178,14 +174,6 @@ public void testFieldCaps() throws Exception { | |||
|
|||
protected abstract Query randomTermsQuery(MappedFieldType ft, SearchExecutionContext ctx); | |||
|
|||
protected static SearchExecutionContext mockContext() { | |||
return mockContext(true); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to MapperServiceTestCase
.
if (string.length() < NGRAM_SIZE) { | ||
continue; | ||
} | ||
// We're not using addClause here because the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm - looks like I lost my train of thought. I believe this is because only even want the term queries. I think we've filtered out all the cases that'll cause other things, but this feels a little more readable this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addClause will expand a search for ab
as ab*
prefix query on the ngram index because we only index 3grams in the approximation index. Not sure if that's an optimisation you want to keep otherwise 2-character queries won't be accelerated at all? We took the decision not to try accelerate single-character searches just because that could be a lot of ngram matching for little overall acceleration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was worried about the start and end markers. I'm not against uses ab*
in theory, though it was nice that queries targeting runtime fields backed by wildcards would never have to be rewritten. OTOH, we aren't likely to have that many trigrams matching any particular bigram.
return new MatchAllDocsQuery(); | ||
} | ||
|
||
return approximation.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite similar to termQuery
but it doesn't perform the double check. And it doesn't wrap through wildcards. I'm a bit scared of wrapping through wildcards here just because it seems like an opportunity for things to go wrong without providing much value. I didn't perform the double check because the script query will do that anyway. It doesn't feel worth it to do it twice.
"aaq", | ||
"[61 71 0]", | ||
"[71 0 0]" | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tokens are a trip!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compression...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a neat trick!
import java.util.List; | ||
import java.util.Objects; | ||
import java.util.function.Function; | ||
|
||
public class StringScriptFieldTermQuery extends AbstractStringScriptFieldQuery { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be prefix query support ay some point? (not suggesting adding to this PR - just curious).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible. We'd have to detect a substring
with a leading constant of 0. But we can do that. It'd be fairly similar for a grok or dissect. Just more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that prefix query on runtime fields isn't currently supported but was unsure why. Running a script over doc values to prefix query dv.startsWith(qTerm)
doesn't seem like it would be slower than a term query qTerm==dv
(actually it would probably be faster? Less bytes to check).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Your want to you why I only did this for the term
query and not any
other others. Yeah, I think we would do it eventually. I just didn't do it
in the initial cut. I figured I'd start simple.
import static org.hamcrest.Matchers.instanceOf; | ||
|
||
public class KeywordScriptFieldApproximatedByWildcardFieldTests extends MapperServiceTestCase { | ||
public void testTermQueryApproximated() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you also need a test method for the approximateSubstringQuery
functionality? e.g. that has different expectations on approximation queries for short queries like a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foo_last_word_approximated
hits it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I've seen the equivalent test yet in this PR but the thing that revealed most bugs for me in the wildcard field was a randomised regex generator exercising varying levels of nesting and combos of clause types and lengths. I compared unaccelerated keyword
field results with accelerated wildcard
field results for equivalence and this revealed a lot of accelerator bugs.
Would it make sense to add something similar for randomising scripts and contrasting results of "optimise_first" flag on vs off?
Mark left a comment a week or so ago asking for more randomized testing with the real classes - mostly around wildcard. I think it's a good idea. We have a fair bit of randomized testing for the expression language itself and that's nice, but no randomized testing for the bit that plug the fields into the expression language. It's worth doing that. |
This adds an option to runtime fields that attempts to speed up queries
against them by running a first pass "approximation" against the search
index. This should usually be faster than running the script against
every potential match which is how we execute queries against runtime
fields right now.
Take, for example, the query for the HTTP method
POST
against anapache log line. Apache logs look like:
And you can extract the method with a script like:
A query like
{"term": {"method": "POST"}}
can be approximated as afor the substring
POST
.Our rally track
http_logs
has exactly this data with themessage
fieldas a
wildcard
field andwildcard
fields. Without the approximationadded by this change the query above has to run the script 25 million
times. On my desktop it takes about 40 seconds. With the approximation
added by this query the query above has to run the script 400 thousand
times. On my desktop this let's the working set fit into memory and the
query takes 250 milliseconds. That's 160 times faster. Great! That's
pretty close to a best case scenario for this change though. But, not
totally uncommon.
The actual best case scenario for this change is a constant. It's
reasonable to create a runtime field for to add a constant value to data
in an old index with a script like
emit(0)
. This change willapproximate
range
andterm
queries against this runtime field aseither
match_all
ormatch_none
. The latter will skip running thescript entirely which seems like it could be a substantial performance
improvement.
match_all
is basically how runtime fields work now.It's also reasonably common to write runtime fields that convert units
like
emit(doc.rx / 1024)
which converts bytes to kibibytes. Thischange allows queries like
{"range": {"rx.kb":{"gt": 1000"}}}
to beapproximated by
{"range": {"rx":{"gt": 1000024"}}}
. If the approximationis very selective then this is also a huge performance boost. If the
approximation isn't selective then it should amount to no change in speed.
All of this is enabled with a new parameter on the runtime field
definition:
This works by converting the contents of the
emit
expression into anew thing called a
QueryableExpression
which can approximate aterm
or
range
query by "reversing" the expression into queries to theunderlying field. That means there are really five parts to this change:
QueryableExpression
s.QueryableExpression
s out to the runtime fields.MappedFieldType
.Co-authored-by: Lukas Wegmann lukas.wegmann@elastic.co
Co-authored-by: Jack Conradson osjdconrad@gmail.com