branch-4.0: [fix](search) Fix implicit conjunction incorrectly modifying preceding term in lucene mode #60814 by airborne12 · Pull Request #61020 · apache/doris

airborne12 · 2026-03-03T16:44:10Z

Summary

Cherry-pick of [fix](search) Fix implicit conjunction incorrectly modifying preceding term in lucene mode #60814 to branch-4.0
Fix implicit conjunction incorrectly modifying preceding term in lucene mode

Merge Order

PR 9/12 in the search() pick chain. Depends on #61013 (#60654).

Check List (For Author)

Test
- Unit Test
- No need to test
Behavior changed:
- No.
Does this need documentation?
- No.

…pache#59747)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#59394

Problem Summary:

The search DSL should only recognize uppercase `AND`, `OR`, `NOT` as boolean operators in search lucene boolean mode. Previously, lowercase `and`, `or`, `not` were also treated as operators, which does not conform to the specification.

This PR makes the boolean operators case-sensitive:
- Only uppercase `AND`, `OR`, `NOT` are recognized as operators
- Lowercase `and`, `or`, `not` are now treated as regular search terms
- Using lowercase operators in DSL will result in a parse error

### Release note

Make search DSL boolean operators (AND/OR/NOT) case-sensitive in lucene boolean mode.
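The case-sensitivity rule above can be illustrated with a minimal Python sketch. It is not the ANTLR lexer used by Doris; `classify_token` and `classify_query` are hypothetical helpers, and whitespace splitting stands in for real tokenization. The sketch only models the operator/term distinction, not the parse-error case.

```python
# Hypothetical model of case-sensitive operator recognition.
# Only the exact uppercase spellings count as boolean operators.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def classify_token(token: str) -> str:
    """Return "OPERATOR" for exact uppercase AND/OR/NOT, otherwise "TERM"."""
    return "OPERATOR" if token in BOOLEAN_OPERATORS else "TERM"


def classify_query(query: str) -> list:
    """Classify each whitespace-separated token (a stand-in for real lexing)."""
    return [(tok, classify_token(tok)) for tok in query.split()]
```

With this model, `classify_query("rock AND roll and jazz")` marks only the uppercase `AND` as an operator, while the lowercase `and` stays a term.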

…sing and fix ES compatibility issues (apache#60654)

Problem Summary:

The `search()` function's DSL parser had multiple ES compatibility issues and used a two-phase parsing approach (manual pre-parse + ANTLR) that was error-prone. This PR refactors the parser and fixes several bugs:

1. **SearchDslParser refactoring**: Consolidated from two-phase (manual pre-parse + ANTLR) to single-phase ANTLR parsing. The ANTLR grammar now handles all DSL syntax directly, eliminating the fragile manual pre-parse layer. This fixes issues with operator precedence, grouping, and edge cases.
2. **ANTLR grammar improvements**: Updated `SearchLexer.g4` and `SearchParser.g4` to properly handle quoted phrases, field-qualified expressions, prefix/wildcard/regexp patterns, range queries, and boolean operators with correct precedence.
3. **minimum_should_match pipeline**: Added `default_operator` and `minimum_should_match` fields to `TSearchParam` thrift, passing them from FE `SearchPredicate` through to BE `function_search`. When `minimum_should_match > 0`, uses `OccurBooleanQuery` for proper Lucene-style boolean query semantics.
4. **Wildcard/Prefix/Regexp case-sensitivity**: Wildcard and PREFIX patterns are now lowercased when the index has `parser + lower_case=true` (matching ES query_string normalizer behavior). REGEXP patterns are NOT lowercased (matching ES regex behavior where patterns bypass analysis).
5. **MATCH_ALL_DOCS support**: Added `MATCH_ALL_DOCS` clause type for standalone `*` queries and pure NOT query rewrites. Enhanced `AllQuery` with deferred `max_doc` from `context.segment_num_rows` and nullable field support via `NullableScorer`.
6. **BE fixes**:
   - `regexp_weight._max_expansions`: Changed from 50 to 0 (unlimited) to prevent PREFIX queries from missing documents
   - `occur_boolean_weight`: Fixed swap→append bug when all SHOULD clauses must match, preserving existing MUST scorers
   - Variant subcolumn `index_properties` propagation for proper analyzer selection
   - `lower_case` default handling: inverted index `lower_case` defaults to `"true"` when a parser is configured
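The pattern normalization rule from item 4, combined with the `lower_case` default from the BE fixes, could be modelled roughly as follows. This is a Python sketch under stated assumptions, not the Doris code: `effective_lower_case` and `normalize_pattern` are hypothetical names, and treating `lower_case` as false when no parser is configured is an assumption the source does not state.

```python
from typing import Optional


def effective_lower_case(has_parser: bool, lower_case: Optional[bool]) -> bool:
    """An explicit setting wins; otherwise default to True only when a parser is set.

    The no-parser default of False is an assumption made for this sketch.
    """
    if lower_case is not None:
        return lower_case
    return has_parser


def normalize_pattern(kind: str, pattern: str, has_parser: bool,
                      lower_case: Optional[bool] = None) -> str:
    """Lowercase WILDCARD/PREFIX patterns on lowercasing indexes; leave REGEXP alone."""
    lowercasing_index = has_parser and effective_lower_case(has_parser, lower_case)
    if kind in ("WILDCARD", "PREFIX") and lowercasing_index:
        return pattern.lower()
    # REGEXP (and anything else) bypasses normalization in this model.
    return pattern
```

For instance, `normalize_pattern("WILDCARD", "Sm*TH", has_parser=True)` yields `sm*th`, while the same call with `"REGEXP"` returns the pattern unchanged.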

…-based indexes (apache#60782)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#60654

Problem Summary:

Follow-up fix for apache#60654 (SearchDslParser refactoring). When FE resolves a variant subcolumn field pattern to a specific analyzer-based index and sends its `index_properties` via `TSearchFieldBinding`, the BE `FieldReaderResolver` was using `EQUAL_QUERY` for TERM clauses. This caused `select_best_reader` to pick the `STRING_TYPE` reader (untokenized index directory) instead of the `FULLTEXT` reader, so tokenized search terms would never match.

**Root cause**: For variant subcolumns with analyzer-based indexes, `EQUAL_QUERY` opens the wrong (untokenized) index directory. The query type needs to be upgraded to `MATCH_ANY_QUERY` so `select_best_reader` picks the correct FULLTEXT reader.

**Fix**: In `FieldReaderResolver::resolve()`, when the field is a variant subcolumn and the FE-provided `index_properties` indicate an analyzer-based index (`should_analyzer()` returns true), automatically upgrade `EQUAL_QUERY` to `MATCH_ANY_QUERY` before calling `select_best_reader`. Also reuse the `fb_it` iterator to avoid a redundant map lookup.

…mode (apache#60784)

### What problem does this PR solve?

Issue Number: close

Problem Summary:

When using `search('*', ...)` with multi-field options (`fields` parameter), the query fails with:

```
only inverted index queries are supported
```

The root cause is in `SearchDslParser.java`: the multi-field parsing methods (`parseDslMultiFieldMode` and `parseDslMultiFieldLuceneMode`) collect field bindings by calling `collectFieldNames()` on the expanded AST. When the query is `*` (match all), the AST node is `MATCH_ALL_DOCS`, which has no field set; by design it matches all documents regardless of field. This caused `collectFieldNames` to return an empty set, resulting in no field bindings. Without field bindings, `RewriteSearchToSlots` couldn't create slot references, so the search expression was never pushed down to the inverted index path, and BE fell back to `execute_impl()` which returns the error.

**Fix**: After `collectFieldNames()`, if the result is empty, fall back to using the original `fields` list as field bindings. This ensures the push-down mechanism works for `MATCH_ALL_DOCS` queries.

**Reproducing queries** (from bug report):

```sql
select count(*) from wikipedia where search('*', '{"fields":["title", "content"], "type": "best_fields", "default_operator":"AND","mode":"lucene", "minimum_should_match": 0}');
select count(*) from wikipedia where search('*', '{"default_field": "title", "default_operator":"AND","mode":"lucene", "minimum_should_match": 0}');
```
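The fallback described in the fix can be sketched in a few lines of Python. `resolve_field_bindings` is a hypothetical stand-in for the logic that follows `collectFieldNames()`; the real code is Java in `SearchDslParser.java`, and sorting the collected names is only done here to keep the output deterministic.

```python
def resolve_field_bindings(collected_fields: set, option_fields: list) -> list:
    """Use field names found in the AST, or fall back to the `fields` option.

    A MATCH_ALL_DOCS query contributes no field names, so the collected set
    is empty and the configured fields are used instead.
    """
    if collected_fields:
        return sorted(collected_fields)
    return list(option_fields)
```

For the `*` query above, the collected set is empty, so `resolve_field_bindings(set(), ["title", "content"])` returns both configured fields and the push-down still has slots to bind.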

…m2) (apache#60786)

### What problem does this PR solve?

Issue Number: close #N/A

Problem Summary:

The `search()` function did not support ES `query_string` field-grouped syntax where all terms inside parentheses inherit the field prefix:

```sql
-- Previously failed with syntax error
SELECT * FROM t WHERE search('title:(rock OR jazz)', '{"fields":["title","content"]}');
```

ES semantics:

| Input | Expansion |
|-------|-----------|
| `title:(rock OR jazz)` | `(title:rock OR title:jazz)` |
| `title:(rock jazz)` with `default_operator:AND` | `(+title:rock +title:jazz)` |
| `title:(rock OR jazz) AND music` with `fields:[title,content]` | `(title:rock OR title:jazz) AND (title:music OR content:music)` |
| `title:("rock and roll" OR jazz)` | `(title:"rock and roll" OR title:jazz)` |

### Root cause

The ANTLR grammar `SearchParser.g4` defined `fieldQuery : fieldPath COLON searchValue` where `searchValue` only accepts leaf values (TERM, QUOTED, etc.), not a parenthesized sub-clause. So `title:(` caused a syntax error.

### Solution

**Grammar** (`SearchParser.g4`):
- Add `fieldGroupQuery : fieldPath COLON LPAREN clause RPAREN` rule
- Add it as alternative in `atomClause` before `fieldQuery`

**Visitor** (`SearchDslParser.java`):
- Add `markExplicitFieldRecursive()` helper: marks all leaf nodes in a group as `explicitField=true` to prevent `MultiFieldExpander` from re-expanding them across unintended fields
- Modify `visitBareQuery()` in both `QsAstBuilder` and `QsLuceneModeAstBuilder` to use `currentFieldName` as field group context when set
- Add `visitFieldGroupQuery()` to both AST builders: sets field context, visits inner clause, marks all leaves explicit
- Update `visitAtomClause()` and `collectTermsFromNotClause()` to handle
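The field-group expansion in the table above amounts to pushing the field prefix down to every leaf and marking each leaf as explicitly fielded. The Python sketch below illustrates that idea on a toy AST; `Node`, `apply_field_group`, and `render` are hypothetical and do not mirror the Java visitor classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                           # "TERM", "PHRASE", "AND" or "OR"
    value: str = ""
    field_name: Optional[str] = None
    explicit_field: bool = False
    children: List["Node"] = field(default_factory=list)


def apply_field_group(field_name: str, node: Node) -> Node:
    """Assign the group's field to every leaf and mark it explicit."""
    if node.children:
        for child in node.children:
            apply_field_group(field_name, child)
    else:
        node.field_name = field_name
        node.explicit_field = True      # keeps later multi-field expansion away
    return node


def render(node: Node) -> str:
    """Render the toy AST back into query_string-like text."""
    if node.kind in ("AND", "OR"):
        return "(" + f" {node.kind} ".join(render(c) for c in node.children) + ")"
    text = f'"{node.value}"' if node.kind == "PHRASE" else node.value
    return f"{node.field_name}:{text}" if node.field_name else text
```

Applying `apply_field_group("title", ...)` to an `OR` node over the phrase `rock and roll` and the term `jazz` renders as `(title:"rock and roll" OR title:jazz)`, matching the last table row.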

…h() function (apache#60790)

### What problem does this PR solve?

Problem Summary:

This PR adds searcher cache reuse and a DSL result cache for the `search()` function to improve query performance on repeated search queries against the same segments.

**Key changes:**

1. **DSL result cache**: Caches the final roaring bitmap per (segment, DSL) pair so repeated identical `search()` queries skip Lucene execution entirely. Uses length-prefix key encoding to avoid hash collisions.
2. **Deep-copy bitmap semantics**: Bitmaps are deep-copied on both cache read and write to prevent `mask_out_null()` from polluting cached entries.
3. **Type-safe cache accessor**: Replaces raw `void*` return with a template `get_value<T>()` that uses `static_assert` to ensure T derives from `LRUCacheValueBase`.
4. **Session-level cache toggle**: Adds `enable_search_function_query_cache` session variable (default: true) to allow disabling the cache per query via `SET_VAR`.
5. **Const-correctness fix**: Removes unsafe `const_cast` in `build_dsl_signature` by copying TSearchParam before Thrift serialization.
6. **Defensive improvements**: Adds null check for `result_bitmap` on cache hit, logging for serialization fallback and cache bypass paths.

### Release note

Add DSL result cache for search() function to skip repeated Lucene execution on identical queries.
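To see why length-prefix key encoding matters for a (segment, DSL) cache key, consider the Python sketch below. It is an illustration only: the actual key layout in Doris is not shown in this PR, and `encode_cache_key`, `naive_key`, and the 4-byte big-endian prefix are assumptions chosen for the example.

```python
def naive_key(segment_id: str, dsl_signature: str) -> bytes:
    """Plain concatenation: different pairs can produce the same bytes."""
    return (segment_id + dsl_signature).encode("utf-8")


def encode_cache_key(segment_id: str, dsl_signature: str) -> bytes:
    """Prefix each part with its byte length so part boundaries are unambiguous."""
    parts = []
    for part in (segment_id, dsl_signature):
        raw = part.encode("utf-8")
        parts.append(len(raw).to_bytes(4, "big") + raw)
    return b"".join(parts)
```

With plain concatenation, the pairs `("seg1", "2:a")` and `("seg12", ":a")` collapse to the same key, whereas the length-prefixed keys differ because the first part's length is recorded.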

…y results (apache#60793)

### What problem does this PR solve?

Related PR: apache#60782

Problem Summary:

When `search()` DSL uses wildcard patterns (e.g. `*ith`, `sm*th`, `sm?th`) on variant subcolumns with analyzer-based indexes (field_pattern), the queries return empty results even though regular TERM search works correctly.

**Root cause:** In `FieldReaderResolver::resolve()`, only `EQUAL_QUERY` was upgraded to `MATCH_ANY_QUERY` for variant subcolumns with analyzer-based indexes. `WILDCARD_QUERY` was not upgraded, so `select_best_reader()` picked the `STRING_TYPE` reader instead of `FULLTEXT`. `WildcardWeight` then enumerated terms from the wrong (untokenized) index directory, finding no matches.

**Fix:** Extend the query type upgrade condition to also cover `WILDCARD_QUERY`, so wildcard patterns correctly use the FULLTEXT index on variant subcolumns. Also fix a misleading comment in `inverted_index_iterator.cpp` where `is_equal_query()` was described as handling WILDCARD/REGEXP but actually only checks `EQUAL_QUERY`.

… search() (apache#60798)

### What problem does this PR solve?

Issue Number: close #DORIS-24542

Problem Summary:

When a column has multiple inverted indexes with different analyzers (e.g., one default untokenized index and one with English parser), `search()` in Lucene/scalar mode returns empty results.

**Root cause:** In `FieldReaderResolver::resolve()`, `select_best_reader()` was always called with an empty analyzer key `""`, causing it to pick the wrong (untokenized) index for tokenized queries. Additionally, the EQUAL_QUERY → MATCH_ANY_QUERY upgrade was restricted to variant subcolumns only.

**Fix:**

1. Extract `analyzer_key` from FE-provided `index_properties` before calling `select_best_reader()` and pass it through
2. Remove the `is_variant_sub` restriction on the query type upgrade so regular columns with multiple indexes also get the correct FULLTEXT reader
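Taken together, the reader-selection fixes (apache#60782, apache#60793, apache#60798) suggest an upgrade rule along the following lines. This Python sketch is a reading of the descriptions above, not the C++ in `FieldReaderResolver::resolve()`; `reader_query_type` is a hypothetical name, and mapping `WILDCARD_QUERY` to `MATCH_ANY_QUERY` for reader selection is inferred from the phrase "extend the query type upgrade condition".

```python
# Query types that, per the descriptions above, need the FULLTEXT reader
# when the FE-provided index properties indicate an analyzer-based index.
UPGRADABLE_QUERY_TYPES = ("EQUAL_QUERY", "WILDCARD_QUERY")


def reader_query_type(query_type: str, analyzer_based: bool) -> str:
    """Return the query type used for reader selection.

    After apache#60798 the upgrade no longer depends on the field being a
    variant subcolumn, so that flag is deliberately absent here.
    """
    if analyzer_based and query_type in UPGRADABLE_QUERY_TYPES:
        return "MATCH_ANY_QUERY"
    return query_type
```

Under this model, a TERM or wildcard clause on an analyzed index selects its reader as `MATCH_ANY_QUERY`, while the same clause on an untokenized index keeps its original type.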

…g term in lucene mode (apache#60814)

### What problem does this PR solve?

Issue Number: close #DORIS-24545

Problem Summary:

In `search()` function's lucene mode, queries with mixed explicit and implicit operators produce different results from Elasticsearch. For example:
- Query: `"Sumer" OR Ptolemaic\ dynasty Limonene` with `default_operator=AND`
- ES result: 1 row
- Doris result: 0 rows (before fix)

**Root cause:** In Lucene's `QueryParserBase.addClause()`, only explicit `CONJ_AND`/`CONJ_OR` modify the preceding term's occur. Implicit conjunction (`CONJ_NONE`, i.e., space-separated terms without an explicit operator) only affects the **current** term via `default_operator`, without modifying the preceding term.

The FE `SearchDslParser.hasExplicitAndBefore()` incorrectly returned `true` (based on `default_operator`) when no explicit AND token was found. This caused implicit conjunction to be treated identically to explicit AND, making it modify the preceding term's occur and diverging from Lucene/ES semantics.

**Example of the bug:** For `a OR b c` with `default_operator=AND`:
- Before fix: `SHOULD(a) MUST(b) MUST(c)`. Wrong: the implicit space before `c` incorrectly upgraded `b` from SHOULD to MUST
- After fix: `SHOULD(a) SHOULD(b) MUST(c)`. Correct: matches ES behavior. Only `c` gets MUST (from default_operator), `b` retains SHOULD (from the preceding OR)

**Fix:** `hasExplicitAndBefore()` now returns `false` when no explicit AND token is found, regardless of `default_operator`. Only explicit AND tokens trigger the "introduced by AND" logic that modifies preceding terms.
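The occur assignment described above can be reproduced with a small Python model. It is a simplified sketch of the `addClause()` semantics as summarized in this PR, not the Doris or Lucene source: `assign_occurs` is a hypothetical function, modifiers such as `+`, `-`, and `NOT` are ignored, and the `legacy_implicit_and` flag exists only to replay the pre-fix behaviour.

```python
def assign_occurs(terms, default_operator="OR", legacy_implicit_and=False):
    """Assign MUST/SHOULD to each term.

    `terms` is a list of (term, conj) pairs, where conj is "AND", "OR",
    or None for an implicit (space-separated) conjunction.
    """
    clauses = []
    for term, conj in terms:
        # Pre-fix bug: an implicit conjunction counted as AND under default AND.
        introduced_by_and = conj == "AND" or (
            conj is None and legacy_implicit_and and default_operator == "AND"
        )
        if clauses and introduced_by_and:
            # Explicit AND makes the preceding term required.
            clauses[-1] = (clauses[-1][0], "MUST")
        elif clauses and conj == "OR" and default_operator == "AND":
            # Explicit OR under default AND makes the preceding term optional.
            clauses[-1] = (clauses[-1][0], "SHOULD")
        if default_operator == "AND":
            occur = "SHOULD" if conj == "OR" else "MUST"
        else:
            occur = "MUST" if introduced_by_and else "SHOULD"
        clauses.append((term, occur))
    return clauses
```

Calling `assign_occurs([("a", None), ("b", "OR"), ("c", None)], "AND")` gives `SHOULD(a) SHOULD(b) MUST(c)`, and passing `legacy_implicit_and=True` reproduces the pre-fix `SHOULD(a) MUST(b) MUST(c)`.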

Thearas · 2026-03-03T16:44:31Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

airborne12 · 2026-03-04T02:27:58Z

Superseded by squashed backport PR #61028

airborne12 added 9 commits March 3, 2026 23:51

airborne12 requested a review from yiguolei as a code owner March 3, 2026 16:44

airborne12 closed this Mar 4, 2026

airborne12 wants to merge 9 commits into apache:branch-4.0 from airborne12:pick/branch-4.0/60814
