Add new 'exact' DSL query #92351

romseygeek · 2022-12-14T10:41:37Z

For many field types, a term query is the same as an exact match query - matching
a keyword or number will always match a whole value, not a partial one. However, for
text-like fields, term queries match individual tokens and there is no way of doing an
exact match against the whole content of a field without using a runtime field that will
end up doing a table scan.

This commit adds a new 'exact' query to the DSL. For most field types, this just
delegates down to 'term'. However, for text fields and related types, we can now
build a more efficient query that uses all the terms in the input query to build a
conjunction that acts as an approximation, and then confirms a match by looking
at the source of the field.
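
As a rough illustration of the approach described above, here is a minimal, self-contained sketch in plain Java. It does not use the Lucene or Elasticsearch APIs: the class `ExactMatchSketch`, the toy `tokenize` analyzer, and the in-memory `index` are all hypothetical stand-ins. It only shows the two phases: a conjunction over every token of the input acts as a cheap approximation, and each candidate is then confirmed by comparing the input with the stored source value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExactMatchSketch {

    // Hypothetical stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy inverted index: token -> ids of the documents that contain it.
    public static Map<String, Set<Integer>> index(List<String> sources) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int doc = 0; doc < sources.size(); doc++) {
            for (String token : tokenize(sources.get(doc))) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(doc);
            }
        }
        return postings;
    }

    public static List<Integer> exactMatch(String input, List<String> sources, Map<String, Set<Integer>> postings) {
        // Phase 1 (approximation): documents that contain every token of the input.
        Set<Integer> candidates = null;
        for (String token : tokenize(input)) {
            Set<Integer> docs = postings.getOrDefault(token, Set.of());
            if (candidates == null) {
                candidates = new TreeSet<>(docs);
            } else {
                candidates.retainAll(docs);
            }
        }
        List<Integer> hits = new ArrayList<>();
        if (candidates == null) {
            return hits; // the input produced no tokens; this sketch simply returns no hits
        }
        // Phase 2 (confirmation): compare the raw input with the stored source value.
        for (int doc : candidates) {
            if (sources.get(doc).equals(input)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> sources = List.of("new york city", "city york new", "New York City", "new york");
        Map<String, Set<Integer>> postings = index(sources);
        // Documents 0, 1 and 2 pass the conjunction; only document 0 survives confirmation.
        System.out.println(exactMatch("new york city", sources, postings)); // prints [0]
    }
}
```

The conjunction keeps the expensive source comparison limited to documents that could plausibly match, which is the point of avoiding a full scan.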

elasticsearchmachine · 2022-12-14T10:42:02Z

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine · 2022-12-14T10:42:03Z

Hi @romseygeek, I've created a changelog YAML for you.

jpountz

This feature makes sense. The lack of an exact query has been a source of pain for query languages we've built on top of Elasticsearch, like our SQL implementation, so this feature would be very welcome.

I think we'll want to be a bit more specific about what exact means. One thing that might be a bit trappy with your implementation is that, if I'm not mistaken, a keyword field with a lowercase normalizer would have its exact query be case-insensitive, while a text field with a lowercase filter would still have its exact query be case-sensitive. I don't know what the right behavior is. @luigidellaquila I wonder if you have opinions on this one?

jpountz · 2022-12-14T13:38:24Z

server/src/main/java/org/elasticsearch/index/mapper/MappedFieldType.java

@@ -214,6 +214,10 @@ public Query termQueryCaseInsensitive(Object value, @Nullable SearchExecutionCon
        );
    }

+    public Query exactQuery(Object value, SearchExecutionContext context) {


Add javadocs?

jpountz · 2022-12-14T13:41:39Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldExactQuery.java

+        } catch (IOException e) {
+            throw new UncheckedIOException(e);
+        }
+        this.conjunction = bq.build();


For text fields, we could even build a phrase query? I don't think it's an important optimization, but maybe we can leave a comment about it.

It's an open question I think whether or not using a phrase query instead of a conjunction will help - will depend I guess on the data?

jpountz · 2022-12-14T13:42:15Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldExactQuery.java

+        try {
+            ts.reset();
+            while (ts.incrementToken()) {
+                bq.add(new TermQuery(new Term(field, termAtt.toString())), BooleanClause.Occur.MUST);


You could use FILTER clauses directly since you don't need scores.

jpountz · 2022-12-14T13:43:04Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldExactQuery.java

+/**
+ * Find documents with text fields that exactly match the input
+ */
+public class TextFieldExactQuery extends Query {


We should also implement rewrite in this class to rewrite the approximation query.

luigidellaquila · 2022-12-14T14:17:53Z

@jpountz indeed, this behavior is very similar to what SQL expects (exact match on equality, which is one of the most common use cases).
About case sensitivity, I agree that it could be a bit confusing. The case-insensitive behavior on keyword fields with a lowercase normalizer could probably be justified for SQL users too, but then they would also expect text with a lowercase filter to act the same way.
Anyway, even as it is, IMHO it's definitely a cool new feature.

romseygeek · 2022-12-14T15:31:21Z

Yes, case-sensitivity on keyword fields is going to be a bit confusing. Normalizers on keyword fields cause no end of trouble... I'll add something to the docs.

romseygeek · 2022-12-14T16:51:18Z

Alternatively... we can use a TextFieldExactQuery for keyword fields as well if there's a normalizer implemented. I think this might be a better solution.

jpountz · 2022-12-14T16:57:54Z

This is what I was wondering: should we use the same query on keywords to make sure matches are actually exact, or alternatively generalize normalizing to exact queries on text fields by calling Analyzer#normalize on the input strings and values that we retrieve from _source?

romseygeek · 2022-12-15T10:11:12Z

I think 'exact' meaning 'exactly what is in the source' is the easiest to explain to users? I agree that normalizers on keyword fields complicate things, but if you want to find the normalized version you can still use a term query.
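
To make the distinction concrete, here is a small plain-Java sketch. It is not Elasticsearch code: `NormalizedVsExact`, `normalize`, `termMatches`, and `exactMatches` are hypothetical names, and `normalize` merely stands in for a lowercase normalizer on a keyword field. A term-style comparison goes through the normalizer on both sides, while an exact comparison looks only at the original source value.

```java
import java.util.Locale;

public class NormalizedVsExact {

    // Hypothetical stand-in for a lowercase normalizer configured on a keyword field.
    public static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Term-style match: both the indexed value and the query input are normalized.
    public static boolean termMatches(String source, String input) {
        return normalize(source).equals(normalize(input));
    }

    // Exact match in the sense discussed here: the input must equal the source as written.
    public static boolean exactMatches(String source, String input) {
        return source.equals(input);
    }

    public static void main(String[] args) {
        String source = "Quick Fox";
        System.out.println(termMatches(source, "quick fox"));  // prints true
        System.out.println(exactMatches(source, "quick fox")); // prints false
        System.out.println(exactMatches(source, "Quick Fox")); // prints true
    }
}
```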

jpountz · 2022-12-15T10:15:29Z

I agree that this sounds like a better trade-off.

romseygeek · 2022-12-15T13:19:30Z

I've updated so that keyword fields now use the exact text match query if they have a normalizer configured, and added a note to the docs.

@jdconrad this involves adding a new FielddataOperation type of SOURCE which is currently only used by keyword fields - can you take a look and see if this makes sense to you?

jdconrad

@romseygeek This LGTM for the FielddataOperation enum. I added two minor suggestions.

jdconrad · 2022-12-15T18:22:34Z

...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

@@ -246,6 +247,11 @@ public Query termQuery(Object value, SearchExecutionContext context) {
            return new ConstantScoreQuery(super.termQuery(value, context));
        }

+        @Override
+        public Query exactQuery(Object value, SearchExecutionContext context) {
+            return new TextFieldExactQuery(this, context.getForField(this, FielddataOperation.SCRIPT), value.toString());


I'm wondering if this should be SOURCE instead of SCRIPT for two reasons:

To me it makes it more obvious where this field is trying to get its field data from.

Should the scripting behavior ever need to change, this could be easily missed as a side effect of that change.

I realize this would add a bit of additional logic to fielddataBuilder, but I think the clarity may be a good trade off here.

jdconrad · 2022-12-15T18:26:10Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -949,6 +949,11 @@ public boolean isAggregatable() {
            return fielddata;
        }

+        @Override
+        public Query exactQuery(Object value, SearchExecutionContext context) {
+            return new TextFieldExactQuery(this, context.getForField(this, FielddataOperation.SCRIPT), value.toString());


Same thought here as my previous comment for MatchOnlyTextFieldMapper.

jpountz

I didn't do a proper review, only had a look at a few things, but the change makes sense to me.

jpountz · 2023-01-05T16:31:09Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldExactQuery.java

+            while (ts.incrementToken() && count++ < 1000) { // limit the size of the approximation
+                bq.add(new TermQuery(new Term(field, termAtt.toString())), BooleanClause.Occur.FILTER);
+            }
+            ts.end();


I don't think it's legal to call end if the token stream may not be fully consumed. We may need to consume the token stream without doing anything with the tokens like LimitTokenCountTokenFilter does.
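
A minimal sketch of the "keep consuming, but ignore the extra tokens" idea, in plain Java. `ToyTokenStream` is a hypothetical class that only imitates the incrementToken / end life cycle mentioned above; it is not the Lucene `TokenStream` API, and its `end()` check exists purely to make the contract visible in the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class DrainingLimit {

    // Hypothetical stream whose end() may only be called once every token has been consumed.
    public static final class ToyTokenStream {
        private final Iterator<String> tokens;
        private String current;
        private boolean exhausted;

        public ToyTokenStream(List<String> tokens) {
            this.tokens = tokens.iterator();
        }

        public boolean incrementToken() {
            if (tokens.hasNext()) {
                current = tokens.next();
                return true;
            }
            exhausted = true;
            return false;
        }

        public String term() {
            return current;
        }

        public void end() {
            if (!exhausted) {
                throw new IllegalStateException("end() called before the stream was fully consumed");
            }
        }
    }

    // Keep at most `limit` terms, but drain the rest of the stream so that end() is legal.
    public static List<String> collect(ToyTokenStream ts, int limit) {
        List<String> kept = new ArrayList<>();
        while (ts.incrementToken()) {
            if (kept.size() < limit) {
                kept.add(ts.term());
            }
            // Past the limit: keep iterating and discard the token.
        }
        ts.end();
        return kept;
    }

    public static void main(String[] args) {
        ToyTokenStream ts = new ToyTokenStream(List.of("a", "b", "c", "d"));
        System.out.println(collect(ts, 2)); // prints [a, b]
    }
}
```

By contrast, a loop of the form `while (ts.incrementToken() && count++ < limit)` exits as soon as the limit is hit and leaves tokens unread, which is the situation this comment warns about.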

jpountz · 2023-01-05T16:34:09Z

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java

+        SearchExecutionContext sec = createSearchExecutionContext(mapper);
+        Query q = mapper.fieldType("field").exactQuery("value", sec);
+        assertThat(q, instanceOf(TextFieldExactQuery.class));
+    }


Maybe also test the case when no normalizer is configured, to make sure we use a TermQuery rather than a query that may read the _source?

cbuescher

I left only a partial review so far, will look at the more interesting parts of the lucene query implementations soon.

cbuescher · 2023-01-09T17:46:16Z

...apper-extras/src/main/java/org/elasticsearch/index/mapper/extras/RankFeatureFieldMapper.java

@@ -109,6 +109,11 @@ public Query existsQuery(SearchExecutionContext context) {
            return new TermQuery(new Term("_feature", name()));
        }

+        @Override
+        public Query exactQuery(Object value, SearchExecutionContext context) {
+            throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() + "] doesn't support exact queries");


nit: maybe use UnsupportedOperationException here and all other similar implementations?

cbuescher · 2023-01-09T17:50:39Z

...ras/src/test/java/org/elasticsearch/index/mapper/extras/SearchAsYouTypeFieldMapperTests.java

@@ -120,6 +120,11 @@ protected Object getSampleValueForDocument() {
        return "new york city";
    }

+    @Override
+    protected boolean supportsExactQuery() {
+        return false;   // TODO: support this? Needs fielddata script access


Before adding this and other TODOs that will be difficult to remove later without more context, should we collect the reasons for potentially supporting this and other field types in a follow-up issue? To me that gives a better overview, has more space for context than a short TODO, and is more visible in the backlog.
For this particular case I'm struggling to see the need for support.

cbuescher · 2023-01-09T18:09:07Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -980,9 +985,6 @@ public IndexFieldData.Builder fielddataBuilder(FieldDataContext fieldDataContext
                );
            }

-            if (operation != FielddataOperation.SCRIPT) {


Why isn't this needed anymore now that there is a third enum option? As far as I can see, this check has so far triggered for SEARCH, so why not use (operation == FielddataOperation.SEARCH) now?

cbuescher

I did another round of reviews, left some minor comments and questions.

cbuescher · 2023-01-10T09:41:01Z

server/src/main/java/org/elasticsearch/index/mapper/MappedFieldType.java

+        // Fielddata to be used as part of a script
+        SCRIPT,
+        // Fielddata that must be read from source
+        SOURCE


For my understanding: I see this constant set in FieldDataContext three times, but I cannot find a spot where the value is actually used to take a code path different from the existing ones. Is it here solely to mark field data usage other than SCRIPT or SEARCH?

cbuescher · 2023-01-11T11:10:40Z

rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search/510_exact_query.yml

+          mappings:
+            properties:
+              text:
+                type: text


I'm wondering if it would also make sense to add a test with a text field that uses a non-standard analyzer, e.g. something that changes case here.

cbuescher · 2023-01-11T11:14:40Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java


-            if (hasDocValues()) {
+            if (operation != FielddataOperation.SOURCE && hasDocValues()) {


If I read this correctly, for operation == FielddataOperation.SEARCH this falls through until here now, formerly it would have raised an exception. Is this intended?

cbuescher · 2023-01-11T11:26:11Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldExactQuery.java

+            ts.reset();
+            boolean more = ts.incrementToken();
+            while (more) {
+                if (count++ >= 1000) {


I'm wondering if there is a way to access the indices.query.bool.max_clause_count we are using on this node and use something lower than that here instead of the fixed value, or do you think that's not necessary?

cbuescher · 2023-01-11T11:41:16Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldExactQuery.java

+
+                    @Override
+                    public float matchCost() {
+                        return 9000;


Is this a high value? I looked at other implementations, e.g. BinaryDvConfirmedAutomatonQuery, which also uses a TwoPhaseIterator with a more costly verification step and returns 1000, but I guess in this high range it doesn't matter. Just asking out of curiosity.

cbuescher · 2023-01-11T11:46:13Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldExactQuery.java

+    }
+
+    @Override
+    public void visit(QueryVisitor visitor) {


I read the Lucene docs but am still interested in what exactly this does.

cbuescher · 2023-01-11T11:47:24Z

server/src/test/java/org/elasticsearch/index/mapper/GeoPointFieldMapperTests.java

@@ -68,6 +68,12 @@ protected Object getSampleValueForDocument() {
        return stringEncode(1.3, 1.2);
    }

+    @Override
+    protected boolean supportsExactQuery() {
+        // TODO can we support this?


Same as above, maybe remove all the TODOs and leave a list of unsupported queries on the issue or a follow up for discussion?

cbuescher · 2023-01-11T11:51:15Z

server/src/test/java/org/elasticsearch/index/mapper/TextFieldExactQueryTests.java

+            for (int outer = 0; outer < 3; outer++) {
+                for (int i = 0; i < docCount; i++) {
+                    int value = i;
+                    ParsedDocument doc = mapperService.documentMapper().parse(source(b -> b.field("text", English.intToEnglish(value))));


Would it increase the value of the test if we had more than one token? Also maybe another doc in reverse order that shouldn't match the exact query later, just to check the verification step works as expected?

cbuescher · 2023-01-11T11:54:35Z

server/src/test/java/org/elasticsearch/index/mapper/flattened/FlattenedFieldMapperTests.java

@@ -75,6 +79,12 @@ protected boolean supportsIgnoreMalformed() {
        return false;
    }

+    @Override
+    protected List<Object> exactQueryValues(MappedFieldType fieldType, Source source, LeafReaderContext ctx, SearchExecutionContext sec)


Looks like FlattenedFieldMapper#exactQuery() throws an IllegalArgumentException; however, this test doesn't seem to override supportsExactQuery?

romseygeek added 2 commits December 12, 2022 13:21

wip

9fef06e

Add exact DSL query

4486065

romseygeek added the >feature, :Search/Search, and v8.7.0 labels Dec 14, 2022

romseygeek requested review from jpountz and javanna December 14, 2022 10:41

romseygeek self-assigned this Dec 14, 2022

elasticsearchmachine added the Team:Search label Dec 14, 2022

Update docs/changelog/92351.yaml

f0c0af7

jpountz reviewed Dec 14, 2022

romseygeek added 2 commits December 14, 2022 15:26

Proper testing of DSL query

52b2931

Merge remote-tracking branch 'romseygeek/query/exact' into query/exact

0974c3d

romseygeek added 3 commits December 15, 2022 11:07

Fallback to matching against source for keyword fields with normalizers

ae22768

duh

ad7fe8b

Merge remote-tracking branch 'origin/main' into query/exact

4a3bf90

jdconrad reviewed Dec 15, 2022

romseygeek added 2 commits December 16, 2022 11:49

deef

6d0c8e0

Merge remote-tracking branch 'origin/main' into query/exact

b09b770

romseygeek requested a review from jpountz January 5, 2023 10:53

jpountz reviewed Jan 5, 2023

romseygeek added 2 commits January 9, 2023 09:38

Merge remote-tracking branch 'origin/main' into query/exact

f0b0c8c

deef

c3b6b26

romseygeek requested a review from cbuescher January 9, 2023 10:34

cbuescher reviewed Jan 9, 2023

cbuescher reviewed Jan 11, 2023

rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023

gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023

pugnascotia added v8.10.0 and removed v8.9.0 labels Jun 22, 2023

quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023

mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023

brianseeders added v8.13.0 and removed v8.12.0 labels Dec 6, 2023

elasticsearchmachine added v8.14.0 and removed v8.13.0 labels Feb 14, 2024

elasticsearchmachine added v8.15.0 and removed v8.14.0 labels Apr 17, 2024

