branch-4.0: [fix](search) Fix wildcard query on variant subcolumns returning empty results #60793 #61018
Closed
airborne12 wants to merge 7 commits into apache:branch-4.0 from
…pache#59747)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#59394

Problem Summary:

The search DSL should only recognize uppercase `AND`, `OR`, `NOT` as boolean operators in search lucene boolean mode. Previously, lowercase `and`, `or`, `not` were also treated as operators, which does not conform to the specification.

This PR makes the boolean operators case-sensitive:

- Only uppercase `AND`, `OR`, `NOT` are recognized as operators
- Lowercase `and`, `or`, `not` are now treated as regular search terms
- Using lowercase operators in DSL will result in a parse error

### Release note

Make search DSL boolean operators (AND/OR/NOT) case-sensitive in lucene boolean mode.
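The case-sensitivity rule above can be illustrated with a minimal C++ sketch. The real parser lives in the FE and is not shown in this description, so `TokenKind` and `classify_token` are hypothetical names used only for illustration; the sketch covers token classification only and does not model the parse-error behavior.

```cpp
#include <string_view>

// Hypothetical token kinds; not taken from the Doris code base.
enum class TokenKind { And, Or, Not, Term };

// Exact, case-sensitive comparison: only the uppercase spellings
// are classified as boolean operators.
inline TokenKind classify_token(std::string_view tok) {
    if (tok == "AND") return TokenKind::And;
    if (tok == "OR") return TokenKind::Or;
    if (tok == "NOT") return TokenKind::Not;
    // "and", "Or", "not" and every other spelling are ordinary terms.
    return TokenKind::Term;
}
```

Under this sketch, `classify_token("and")` yields `TokenKind::Term`, while `classify_token("AND")` yields `TokenKind::And`.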
…sing and fix ES compatibility issues (apache#60654)

Problem Summary:

The `search()` function's DSL parser had multiple ES compatibility issues and used a two-phase parsing approach (manual pre-parse + ANTLR) that was error-prone. This PR refactors the parser and fixes several bugs:

1. **SearchDslParser refactoring**: Consolidated from two-phase (manual pre-parse + ANTLR) to single-phase ANTLR parsing. The ANTLR grammar now handles all DSL syntax directly, eliminating the fragile manual pre-parse layer. This fixes issues with operator precedence, grouping, and edge cases.
2. **ANTLR grammar improvements**: Updated `SearchLexer.g4` and `SearchParser.g4` to properly handle quoted phrases, field-qualified expressions, prefix/wildcard/regexp patterns, range queries, and boolean operators with correct precedence.
3. **minimum_should_match pipeline**: Added `default_operator` and `minimum_should_match` fields to `TSearchParam` thrift, passing them from FE `SearchPredicate` through to BE `function_search`. When `minimum_should_match > 0`, uses `OccurBooleanQuery` for proper Lucene-style boolean query semantics.
4. **Wildcard/Prefix/Regexp case-sensitivity**: Wildcard and PREFIX patterns are now lowercased when the index has `parser + lower_case=true` (matching ES query_string normalizer behavior). REGEXP patterns are NOT lowercased (matching ES regex behavior where patterns bypass analysis).
5. **MATCH_ALL_DOCS support**: Added `MATCH_ALL_DOCS` clause type for standalone `*` queries and pure NOT query rewrites. Enhanced `AllQuery` with deferred `max_doc` from `context.segment_num_rows` and nullable field support via `NullableScorer`.
6. **BE fixes**:
   - `regexp_weight._max_expansions`: Changed from 50 to 0 (unlimited) to prevent PREFIX queries from missing documents
   - `occur_boolean_weight`: Fixed swap→append bug when all SHOULD clauses must match, preserving existing MUST scorers
   - Variant subcolumn `index_properties` propagation for proper analyzer selection
   - `lower_case` default handling: inverted index `lower_case` defaults to `"true"` when a parser is configured
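The pattern case-handling rule in item 4 can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `PatternKind` and `normalize_pattern` are hypothetical names, and the lowercasing is ASCII-only for brevity.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical pattern kinds; not taken from the Doris code base.
enum class PatternKind { Wildcard, Prefix, Regexp };

// Lowercase wildcard and prefix patterns only when the index has a parser
// and lower_case=true. Regexp patterns are always returned unchanged.
inline std::string normalize_pattern(PatternKind kind, std::string pattern,
                                     bool has_parser, bool lower_case) {
    const bool lowercase_applies =
            has_parser && lower_case && kind != PatternKind::Regexp;
    if (lowercase_applies) {
        // ASCII-only lowercasing (an assumption of this sketch).
        std::transform(pattern.begin(), pattern.end(), pattern.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });
    }
    return pattern;
}
```

With these assumptions, a wildcard pattern `Sm*TH` against a lowercasing index becomes `sm*th`, while the same text passed as a regexp is left untouched.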
…-based indexes (apache#60782)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#60654

Problem Summary:

Follow-up fix for apache#60654 (SearchDslParser refactoring). When FE resolves a variant subcolumn field pattern to a specific analyzer-based index and sends its `index_properties` via `TSearchFieldBinding`, the BE `FieldReaderResolver` was using `EQUAL_QUERY` for TERM clauses. This caused `select_best_reader` to pick the `STRING_TYPE` reader (untokenized index directory) instead of the `FULLTEXT` reader, so tokenized search terms would never match.

**Root cause**: For variant subcolumns with analyzer-based indexes, `EQUAL_QUERY` opens the wrong (untokenized) index directory. The query type needs to be upgraded to `MATCH_ANY_QUERY` so `select_best_reader` picks the correct FULLTEXT reader.

**Fix**: In `FieldReaderResolver::resolve()`, when the field is a variant subcolumn and the FE-provided `index_properties` indicate an analyzer-based index (`should_analyzer()` returns true), automatically upgrade `EQUAL_QUERY` to `MATCH_ANY_QUERY` before calling `select_best_reader`. Also reuse the `fb_it` iterator to avoid a redundant map lookup.
…mode (apache#60784)

### What problem does this PR solve?

Issue Number: close

Problem Summary:

When using `search('*', ...)` with multi-field options (`fields` parameter), the query fails with:

```
only inverted index queries are supported
```

The root cause is in `SearchDslParser.java`: the multi-field parsing methods (`parseDslMultiFieldMode` and `parseDslMultiFieldLuceneMode`) collect field bindings by calling `collectFieldNames()` on the expanded AST. When the query is `*` (match all), the AST node is `MATCH_ALL_DOCS`, which has no field set; by design it matches all documents regardless of field. This caused `collectFieldNames` to return an empty set, resulting in no field bindings. Without field bindings, `RewriteSearchToSlots` couldn't create slot references, so the search expression was never pushed down to the inverted index path, and BE fell back to `execute_impl()`, which returns the error.

**Fix**: After `collectFieldNames()`, if the result is empty, fall back to using the original `fields` list as field bindings. This ensures the push-down mechanism works for `MATCH_ALL_DOCS` queries.

**Reproducing queries** (from bug report):

```sql
select count(*) from wikipedia where search('*', '{"fields":["title", "content"], "type": "best_fields", "default_operator":"AND","mode":"lucene", "minimum_should_match": 0}');
select count(*) from wikipedia where search('*', '{"default_field": "title", "default_operator":"AND","mode":"lucene", "minimum_should_match": 0}');
```
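The fallback described in the fix can be sketched in a few lines. The actual change is in `SearchDslParser.java`; the C++ below is only an illustration of the logic, and `resolve_field_bindings` is a hypothetical name.

```cpp
#include <set>
#include <string>
#include <vector>

// If the AST yielded no field names (as with MATCH_ALL_DOCS), fall back to
// the caller-supplied `fields` list so that field bindings are never empty
// when fields were configured.
inline std::set<std::string> resolve_field_bindings(
        const std::set<std::string>& collected,
        const std::vector<std::string>& fields) {
    if (!collected.empty()) {
        return collected;  // normal case: use the fields found in the AST
    }
    return std::set<std::string>(fields.begin(), fields.end());
}
```

For the `search('*', ...)` case, `collected` is empty, so the bindings come from the `fields` option (`title` and `content` in the first reproducing query).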
…m2) (apache#60786)

### What problem does this PR solve?

Issue Number: close #N/A

Problem Summary:

The `search()` function did not support ES `query_string` field-grouped syntax where all terms inside parentheses inherit the field prefix:

```sql
-- Previously failed with syntax error
SELECT * FROM t WHERE search('title:(rock OR jazz)', '{"fields":["title","content"]}');
```

ES semantics:

| Input | Expansion |
|-------|-----------|
| `title:(rock OR jazz)` | `(title:rock OR title:jazz)` |
| `title:(rock jazz)` with `default_operator:AND` | `(+title:rock +title:jazz)` |
| `title:(rock OR jazz) AND music` with `fields:[title,content]` | `(title:rock OR title:jazz) AND (title:music OR content:music)` |
| `title:("rock and roll" OR jazz)` | `(title:"rock and roll" OR title:jazz)` |

### Root cause

The ANTLR grammar `SearchParser.g4` defined `fieldQuery : fieldPath COLON searchValue` where `searchValue` only accepts leaf values (TERM, QUOTED, etc.), not a parenthesized sub-clause. So `title:(` caused a syntax error.

### Solution

**Grammar** (`SearchParser.g4`):

- Add `fieldGroupQuery : fieldPath COLON LPAREN clause RPAREN` rule
- Add it as alternative in `atomClause` before `fieldQuery`

**Visitor** (`SearchDslParser.java`):

- Add `markExplicitFieldRecursive()` helper: marks all leaf nodes in a group as `explicitField=true` to prevent `MultiFieldExpander` from re-expanding them across unintended fields
- Modify `visitBareQuery()` in both `QsAstBuilder` and `QsLuceneModeAstBuilder` to use `currentFieldName` as field group context when set
- Add `visitFieldGroupQuery()` to both AST builders: sets field context, visits inner clause, marks all leaves explicit
- Update `visitAtomClause()` and `collectTermsFromNotClause()` to handle
…h() function (apache#60790)

### What problem does this PR solve?

Problem Summary:

This PR adds searcher cache reuse and a DSL result cache for the `search()` function to improve query performance on repeated search queries against the same segments.

**Key changes:**

1. **DSL result cache**: Caches the final roaring bitmap per (segment, DSL) pair so repeated identical `search()` queries skip Lucene execution entirely. Uses length-prefix key encoding to avoid hash collisions.
2. **Deep-copy bitmap semantics**: Bitmaps are deep-copied on both cache read and write to prevent `mask_out_null()` from polluting cached entries.
3. **Type-safe cache accessor**: Replaces raw `void*` return with a template `get_value<T>()` that uses `static_assert` to ensure T derives from `LRUCacheValueBase`.
4. **Session-level cache toggle**: Adds `enable_search_function_query_cache` session variable (default: true) to allow disabling the cache per query via `SET_VAR`.
5. **Const-correctness fix**: Removes unsafe `const_cast` in `build_dsl_signature` by copying TSearchParam before Thrift serialization.
6. **Defensive improvements**: Adds null check for `result_bitmap` on cache hit, logging for serialization fallback and cache bypass paths.

### Release note

Add DSL result cache for search() function to skip repeated Lucene execution on identical queries.
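To show why length-prefix encoding avoids ambiguous keys, here is a minimal sketch. The exact key layout used by the PR is not given in this description, so the `<len>:<bytes>` format, the `make_cache_key` name, and the choice of a segment identifier string are all assumptions.

```cpp
#include <string>
#include <string_view>

// Build a cache key from two variable-length parts. Each part is written as
// "<decimal length>:<bytes>", so the boundary between the parts is always
// recoverable and distinct (segment, DSL) pairs cannot produce the same key.
inline std::string make_cache_key(std::string_view segment_id,
                                  std::string_view dsl_signature) {
    std::string key;
    key.reserve(segment_id.size() + dsl_signature.size() + 24);
    key += std::to_string(segment_id.size());
    key += ':';
    key.append(segment_id.data(), segment_id.size());
    key += std::to_string(dsl_signature.size());
    key += ':';
    key.append(dsl_signature.data(), dsl_signature.size());
    return key;
}
```

Plain concatenation would map both `("ab", "c")` and `("a", "bc")` to `abc`; with the length prefixes the two pairs produce `2:ab1:c` and `1:a2:bc`.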
…y results (apache#60793)

### What problem does this PR solve?

Related PR: apache#60782

Problem Summary:

When `search()` DSL uses wildcard patterns (e.g. `*ith`, `sm*th`, `sm?th`) on variant subcolumns with analyzer-based indexes (field_pattern), the queries return empty results even though regular TERM search works correctly.

**Root cause:** In `FieldReaderResolver::resolve()`, only `EQUAL_QUERY` was upgraded to `MATCH_ANY_QUERY` for variant subcolumns with analyzer-based indexes. `WILDCARD_QUERY` was not upgraded, so `select_best_reader()` picked the `STRING_TYPE` reader instead of `FULLTEXT`. `WildcardWeight` then enumerated terms from the wrong (untokenized) index directory, finding no matches.

**Fix:** Extend the query type upgrade condition to also cover `WILDCARD_QUERY`, so wildcard patterns correctly use the FULLTEXT index on variant subcolumns. Also fix a misleading comment in `inverted_index_iterator.cpp` where `is_equal_query()` was described as handling WILDCARD/REGEXP but actually only checks `EQUAL_QUERY`.
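The extended upgrade condition can be sketched as below. This is not the code from `FieldReaderResolver::resolve()`: the enum is a hypothetical subset, `effective_query_type` is an illustrative name, and the sketch assumes the upgraded type is what gets passed to `select_best_reader`. How the real code carries the wildcard pattern forward is not described here.

```cpp
// Hypothetical subset of query types; OTHER_QUERY stands in for every type
// that the description does not mention.
enum class QueryType { EQUAL_QUERY, WILDCARD_QUERY, MATCH_ANY_QUERY, OTHER_QUERY };

// For variant subcolumns with an analyzer-based index, upgrade EQUAL_QUERY
// and WILDCARD_QUERY to MATCH_ANY_QUERY so that reader selection prefers the
// FULLTEXT reader. All other combinations keep the requested type.
inline QueryType effective_query_type(QueryType requested,
                                      bool is_variant_subcolumn,
                                      bool analyzer_based) {
    const bool upgradable = requested == QueryType::EQUAL_QUERY ||
                            requested == QueryType::WILDCARD_QUERY;
    if (is_variant_subcolumn && analyzer_based && upgradable) {
        return QueryType::MATCH_ANY_QUERY;
    }
    return requested;
}
```

Before this fix, only the `EQUAL_QUERY` branch of `upgradable` existed, which is why term search worked while wildcard patterns such as `sm*th` returned nothing.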
Superseded by squashed backport PR #61028
Summary
Merge Order
PR 7/12 in the search() pick chain. Depends on #61013 (#60654).
Check List (For Author)
Test
Behavior changed:
Does this need documentation?