Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new 'exact' DSL query #92351

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open

Conversation

romseygeek
Copy link
Contributor

For many query types, a term query is the same as an exact match query - matching
a keyword or number will always match a whole value, not a partial one. However, for
text-like fields, term queries match individual tokens and there is no way of doing an
exact match against the whole content of a field without using a runtime field that will
end up doing a table scan.

This commit adds a new 'exact' query to the DSL. For most field types, this just
delegates down to 'term'. However, for text fields and related types, we can now
build a more efficient query that uses all the terms in the input query to build a
conjunction that acts as an approximation, and then confirms a match by looking
at the source of the field.

@romseygeek romseygeek added >feature :Search/Search Search-related issues that do not fall into other categories v8.7.0 labels Dec 14, 2022
@romseygeek romseygeek self-assigned this Dec 14, 2022
@github-actions
Copy link

Documentation preview:

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Dec 14, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Copy link
Collaborator

Hi @romseygeek, I've created a changelog YAML for you.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature makes sense. The lack of exact query has been a source of pain for query languages we've built on top of Elasticsearch like our SQL implementation, so this feature would be very welcome.

I think we'll want to be a bit more specific about what exact means. One thing that might be a bit trappy with your implementation is that if I'm not mistaken, a keyword field with a lowercase normalizer would have its exact query case insensitive while a text field with a lowercase filter would still have its exact query case sensitive. I don't know what is the right behavior. @luigidellaquila I wonder if you have opinions on this one?

@@ -214,6 +214,10 @@ public Query termQueryCaseInsensitive(Object value, @Nullable SearchExecutionCon
);
}

public Query exactQuery(Object value, SearchExecutionContext context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add javadocs?

} catch (IOException e) {
throw new UncheckedIOException(e);
}
this.conjunction = bq.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For text fields, we could even build a phrase query? I don't think it's an important optimization, but maybe we can leave a comment about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an open question I think whether or not using a phrase query instead of a conjunction will help - will depend I guess on the data?

try {
ts.reset();
while (ts.incrementToken()) {
bq.add(new TermQuery(new Term(field, termAtt.toString())), BooleanClause.Occur.MUST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use FILTER clauses directly since you don't need scores

/**
* Find documents with text fields that exactly match the input
*/
public class TextFieldExactQuery extends Query {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also implement rewrite in this class to rewrite the approximation query.

@luigidellaquila
Copy link
Contributor

@jpountz indeed, this behavior is very similar to what SQL expects (exact match on equality, that is one of the most common use cases).
About case sensitivity, I agree that it could be a bit confusing. The case insensitive behavior on keyword fields with lowercase normalizer could probably be justified also for SQL users, but then they would also expect text to act the same with a lowercase filter.
Anyway, also as it is, IMHO it's a definitely cool new feature.

@romseygeek
Copy link
Contributor Author

Yes, case-sensitivity on keyword fields is going to be a bit confusing. Normalizers on keyword fields cause no end of trouble... I'll add something to the docs.

@romseygeek
Copy link
Contributor Author

Alternatively... we can use a TextFieldExactQuery for keyword fields as well if there's a normalizer implemented. I think this might be a better solution.

@jpountz
Copy link
Contributor

jpountz commented Dec 14, 2022

This is what I was wondering: should we use the same query on keywords to make sure matches are actually exact, or alternatively generalize normalizing to exact queries on text fields by calling Analyzer#normalize on the input strings and values that we retrieve from _source?

@romseygeek
Copy link
Contributor Author

I think 'exact' meaning 'exactly what is in the source' is the easiest to explain to users? I agree that normalizers on keyword fields complicate things, but if you want to find the normalized version you can still use a term query.

@jpountz
Copy link
Contributor

jpountz commented Dec 15, 2022

I agree that this sounds like a better trade-off.

@romseygeek
Copy link
Contributor Author

I've updated so that keyword fields now use the exact text match query if they have a normalizer configured, and added a note to the docs.

@jdconrad this involves adding a new FielddataOperation type of SOURCE which is currently only used by keyword fields - can you take a look and see if this makes sense to you?

Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romseygeek This LGTM for the FieldOperation enum. I added two minor suggestions.

@@ -246,6 +247,11 @@ public Query termQuery(Object value, SearchExecutionContext context) {
return new ConstantScoreQuery(super.termQuery(value, context));
}

@Override
public Query exactQuery(Object value, SearchExecutionContext context) {
return new TextFieldExactQuery(this, context.getForField(this, FielddataOperation.SCRIPT), value.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this should be SOURCE instead of SCRIPT for two reasons:

  1. To me it makes it more obvious where this field is trying to get it's field data from.
  2. Should the scripting behavior ever need to change, this could be easily missed as a side effect of that change.

I realize this would add a bit of additional logic to fielddataBuilder, but I think the clarity may be a good trade off here.

@@ -949,6 +949,11 @@ public boolean isAggregatable() {
return fielddata;
}

@Override
public Query exactQuery(Object value, SearchExecutionContext context) {
return new TextFieldExactQuery(this, context.getForField(this, FielddataOperation.SCRIPT), value.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thought here as my previous comment for MatchOnlyTextFieldMapper.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do a proper review, only had a look at a few things, but the change makes sense to me.

while (ts.incrementToken() && count++ < 1000) { // limit the size of the approximation
bq.add(new TermQuery(new Term(field, termAtt.toString())), BooleanClause.Occur.FILTER);
}
ts.end();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's legal to call end if the token stream may not be fully consumed. We may need to consume the token stream without doing anything with the tokens like LimitTokenCountTokenFilter does.

SearchExecutionContext sec = createSearchExecutionContext(mapper);
Query q = mapper.fieldType("field").exactQuery("value", sec);
assertThat(q, instanceOf(TextFieldExactQuery.class));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also test the case when no normalizer is configured, to make sure we use a TermQuery rather than a query that may read the _source?

Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left only a partial review so far, will look at the more interesting parts of the lucene query implementations soon.

@@ -109,6 +109,11 @@ public Query existsQuery(SearchExecutionContext context) {
return new TermQuery(new Term("_feature", name()));
}

@Override
public Query exactQuery(Object value, SearchExecutionContext context) {
throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() + "] doesn't support exact queries");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe use UnsupportedOperationException here and all other similar implementations?

@@ -120,6 +120,11 @@ protected Object getSampleValueForDocument() {
return "new york city";
}

@Override
protected boolean supportsExactQuery() {
return false; // TODO: support this? Needs fielddata script access
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before adding this and other TODOs that are difficult to remove without more context in the future, should we collect reasons for potential support of this and other field types in a follow up issue? To me it gives a better overview, has more space for context than a short TODO and is more visible in the backlog.
For this particular case I'm struggling to see the need for support.

@@ -980,9 +985,6 @@ public IndexFieldData.Builder fielddataBuilder(FieldDataContext fieldDataContext
);
}

if (operation != FielddataOperation.SCRIPT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't this needed with the third enum option anymore? as far as I can see this so far triggered for SEARCH, why not (operation == FielddataOperation.SEARCH) then now?

Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another round of reviews, left some minor comments and questions.

// Fielddata to be used as part of a script
SCRIPT,
// Fielddata that must be read from source
SOURCE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding: I see this constant set in FieldDataContext three times but I cannot find a spot where the value is actually used to execute - say another code path - than the existing ones. Is it solely here to mark field data usage other than SCRIPT or SEARCH?

mappings:
properties:
text:
type: text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it would also make sense to add a test with a text fields with a non-standard analyzer, e.g. something that changes case here


if (hasDocValues()) {
if (operation != FielddataOperation.SOURCE && hasDocValues()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read this correctly, for operation == FielddataOperation.SEARCH this falls through until here now, formerly it would have raised an exception. Is this intended?

ts.reset();
boolean more = ts.incrementToken();
while (more) {
if (count++ >= 1000) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if there is a way to access the indices.query.bool.max_clause_count we are using on this node and use something lower than that here instead of the fixed value, or do you think that's not necessary?


@Override
public float matchCost() {
return 9000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a high value? I looked at other implementations like e.g. BinaryDvConfirmedAutomatonQuery which also uses a TwoPhaseIterator with a more costly verification step, which uses 1000, but I guess in this high range it doesn't matter, just asking out or curiosity.

}

@Override
public void visit(QueryVisitor visitor) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read the Lucene docs but still interested what this does exactly.

@@ -68,6 +68,12 @@ protected Object getSampleValueForDocument() {
return stringEncode(1.3, 1.2);
}

@Override
protected boolean supportsExactQuery() {
// TODO can we support this?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, maybe remove all the TODOs and leave a list of unsupported queries on the issue or a follow up for discussion?

for (int outer = 0; outer < 3; outer++) {
for (int i = 0; i < docCount; i++) {
int value = i;
ParsedDocument doc = mapperService.documentMapper().parse(source(b -> b.field("text", English.intToEnglish(value))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it increase the value of the test if we had more than one token? Also maybe another doc in reverse order that shouldn't match the exact query later, just to check the verification step works as expected?

@@ -75,6 +79,12 @@ protected boolean supportsIgnoreMalformed() {
return false;
}

@Override
protected List<Object> exactQueryValues(MappedFieldType fieldType, Source source, LeafReaderContext ctx, SearchExecutionContext sec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like FlattenedFieldMapper#exactQuery() throws an IllegalArgumentException, however this test doesn't seem to overwrite supportsExactQuery?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet