branch-4.0: [fix](search) Fix implicit conjunction incorrectly modifying preceding term in lucene mode #60814 #61020

Closed
airborne12 wants to merge 9 commits into apache:branch-4.0 from airborne12:pick/branch-4.0/60814
@airborne12
Member

Summary

Merge Order

PR 9/12 in the search() pick chain. Depends on #61013 (#60654).

Check List (For Author)

  • Test

    • Unit Test
    • No need to test
  • Behavior changed:

    • No.
  • Does this need documentation?

    • No.

…pache#59747)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#59394

Problem Summary:
The search DSL should only recognize uppercase `AND`, `OR`, `NOT` as
boolean operators in search lucene boolean mode. Previously, lowercase
`and`, `or`, `not` were also treated as operators, which does not
conform to the specification.

This PR makes the boolean operators case-sensitive:
- Only uppercase `AND`, `OR`, `NOT` are recognized as operators
- Lowercase `and`, `or`, `not` are now treated as regular search terms
- Using lowercase operators in DSL will result in a parse error
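
The case-sensitivity rule above can be illustrated with a minimal Python sketch. It models only token classification under the assumption that tokens are whitespace-separated; the helper names `classify_token` and `classify_query` are hypothetical, and the sketch does not try to reproduce where the real ANTLR grammar raises a parse error.

```python
# Hypothetical model of the rule: only exact uppercase spellings are operators.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def classify_token(token):
    """Return 'OPERATOR' for uppercase AND/OR/NOT, otherwise 'TERM'."""
    return "OPERATOR" if token in BOOLEAN_OPERATORS else "TERM"


def classify_query(dsl):
    """Classify every whitespace-separated token of a DSL string."""
    return [(tok, classify_token(tok)) for tok in dsl.split()]


if __name__ == "__main__":
    for tok, kind in classify_query("rock and roll OR jazz"):
        print(f"{tok!r}: {kind}")
```

Under this model, `and` in `rock and roll OR jazz` is an ordinary term while `OR` is an operator.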

### Release note

Make search DSL boolean operators (AND/OR/NOT) case-sensitive in lucene
boolean mode.
…sing and fix ES compatibility issues (apache#60654)

Problem Summary:

The `search()` function's DSL parser had multiple ES compatibility
issues and used a two-phase parsing approach (manual pre-parse + ANTLR)
that was error-prone. This PR refactors the parser and fixes several
bugs:

1. **SearchDslParser refactoring**: Consolidated from two-phase (manual
pre-parse + ANTLR) to single-phase ANTLR parsing. The ANTLR grammar now
handles all DSL syntax directly, eliminating the fragile manual
pre-parse layer. This fixes issues with operator precedence, grouping,
and edge cases.

2. **ANTLR grammar improvements**: Updated `SearchLexer.g4` and
`SearchParser.g4` to properly handle quoted phrases, field-qualified
expressions, prefix/wildcard/regexp patterns, range queries, and boolean
operators with correct precedence.

3. **minimum_should_match pipeline**: Added `default_operator` and
`minimum_should_match` fields to `TSearchParam` thrift, passing them
from FE `SearchPredicate` through to BE `function_search`. When
`minimum_should_match > 0`, uses `OccurBooleanQuery` for proper
Lucene-style boolean query semantics.

4. **Wildcard/Prefix/Regexp case-sensitivity**: Wildcard and PREFIX
patterns are now lowercased when the index has `parser +
lower_case=true` (matching ES query_string normalizer behavior). REGEXP
patterns are NOT lowercased (matching ES regex behavior where patterns
bypass analysis).

5. **MATCH_ALL_DOCS support**: Added `MATCH_ALL_DOCS` clause type for
standalone `*` queries and pure NOT query rewrites. Enhanced `AllQuery`
with deferred `max_doc` from `context.segment_num_rows` and nullable
field support via `NullableScorer`.

6. **BE fixes**:
- `regexp_weight._max_expansions`: Changed from 50 to 0 (unlimited) to
prevent PREFIX queries from missing documents
- `occur_boolean_weight`: Fixed swap→append bug when all SHOULD clauses
must match, preserving existing MUST scorers
- Variant subcolumn `index_properties` propagation for proper analyzer
selection
- `lower_case` default handling: inverted index `lower_case` defaults to
`"true"` when a parser is configured
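
As a rough illustration of item 4 and the `lower_case` default in item 6, the following Python sketch models the normalization decision. The function name `normalize_pattern` and its parameters are hypothetical; it assumes lowercasing applies only when a parser is configured and the effective `lower_case` value is `"true"`.

```python
def normalize_pattern(kind, pattern, has_parser, lower_case=None):
    """Model of the described rule, not the actual BE code.

    kind: 'WILDCARD', 'PREFIX' or 'REGEXP'.
    lower_case: the index property as a string, or None when it is unset.
    """
    # Per the description, lower_case defaults to "true" when a parser is set.
    effective = lower_case if lower_case is not None else "true"
    lowercasing = has_parser and effective == "true"
    # REGEXP patterns bypass analysis, so they are never lowercased.
    if kind in ("WILDCARD", "PREFIX") and lowercasing:
        return pattern.lower()
    return pattern


if __name__ == "__main__":
    print(normalize_pattern("WILDCARD", "Sm*TH", has_parser=True))
    print(normalize_pattern("REGEXP", "Sm.*TH", has_parser=True))
```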
…-based indexes (apache#60782)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#60654

Problem Summary:

Follow-up fix for apache#60654 (SearchDslParser refactoring).

When FE resolves a variant subcolumn field pattern to a specific
analyzer-based index and sends its `index_properties` via
`TSearchFieldBinding`, the BE `FieldReaderResolver` was using
`EQUAL_QUERY` for TERM clauses. This caused `select_best_reader` to pick
the `STRING_TYPE` reader (untokenized index directory) instead of the
`FULLTEXT` reader, so tokenized search terms would never match.

**Root cause**: For variant subcolumns with analyzer-based indexes,
`EQUAL_QUERY` opens the wrong (untokenized) index directory. The query
type needs to be upgraded to `MATCH_ANY_QUERY` so `select_best_reader`
picks the correct FULLTEXT reader.

**Fix**: In `FieldReaderResolver::resolve()`, when the field is a
variant subcolumn and the FE-provided `index_properties` indicate an
analyzer-based index (`should_analyzer()` returns true), automatically
upgrade `EQUAL_QUERY` to `MATCH_ANY_QUERY` before calling
`select_best_reader`. Also reuse the `fb_it` iterator to avoid a
redundant map lookup.
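
The upgrade described in the fix can be sketched in Python as follows. This is a simplified model, not the BE code: `FieldBinding` is a hypothetical stand-in for the FE-provided binding, the `should_analyzer` criterion (a `parser` property that is set and not `none`) is an assumption, and `select_best_reader` is reduced to a choice driven by query type alone.

```python
from dataclasses import dataclass, field


@dataclass
class FieldBinding:
    """Hypothetical stand-in for the FE-provided field binding."""
    is_variant_subcolumn: bool
    index_properties: dict = field(default_factory=dict)


def should_analyzer(index_properties):
    # Assumption: an index is analyzer-based when a parser other than "none" is set.
    return index_properties.get("parser", "none") not in ("", "none")


def resolve_query_type(query_type, binding):
    """Upgrade EQUAL_QUERY for variant subcolumns with analyzer-based indexes."""
    if (query_type == "EQUAL_QUERY"
            and binding.is_variant_subcolumn
            and should_analyzer(binding.index_properties)):
        return "MATCH_ANY_QUERY"
    return query_type


def select_best_reader(query_type):
    # Simplified: equality queries go to the untokenized reader.
    return "STRING_TYPE" if query_type == "EQUAL_QUERY" else "FULLTEXT"


if __name__ == "__main__":
    binding = FieldBinding(True, {"parser": "english"})
    print(select_best_reader(resolve_query_type("EQUAL_QUERY", binding)))
```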
…mode (apache#60784)

### What problem does this PR solve?

Issue Number: close 

Problem Summary:

When using `search('*', ...)` with multi-field options (`fields`
parameter), the query fails with:
```
only inverted index queries are supported
```

The root cause is in `SearchDslParser.java`: the multi-field parsing
methods (`parseDslMultiFieldMode` and `parseDslMultiFieldLuceneMode`)
collect field bindings by calling `collectFieldNames()` on the expanded
AST. When the query is `*` (match all), the AST node is `MATCH_ALL_DOCS`
which has no field set — by design it matches all documents regardless
of field. This caused `collectFieldNames` to return an empty set,
resulting in no field bindings. Without field bindings,
`RewriteSearchToSlots` couldn't create slot references, so the search
expression was never pushed down to the inverted index path, and BE fell
back to `execute_impl()` which returns the error.

**Fix**: After `collectFieldNames()`, if the result is empty, fall back
to using the original `fields` list as field bindings. This ensures the
push-down mechanism works for `MATCH_ALL_DOCS` queries.
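
A minimal Python sketch of the fallback, assuming a dictionary-based AST where a `MATCH_ALL_DOCS` node carries no `field` key. The names `collect_field_names` and `field_bindings` mirror the description but are illustrative, not the Java implementation.

```python
def collect_field_names(node):
    """Collect distinct field names from an AST, in first-seen order."""
    names = []

    def walk(n):
        name = n.get("field")
        if name and name not in names:
            names.append(name)
        for child in n.get("children", []):
            walk(child)

    walk(node)
    return names


def field_bindings(ast, fields):
    """Fall back to the user-supplied fields when the AST names none."""
    names = collect_field_names(ast)
    return names if names else list(fields)


if __name__ == "__main__":
    match_all = {"type": "MATCH_ALL_DOCS"}
    print(field_bindings(match_all, ["title", "content"]))
```

With the fallback in place, a `*` query still yields bindings for `title` and `content`, so the push-down path has slots to reference.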

**Reproducing queries** (from bug report):
```sql
select count(*) from wikipedia where search('*', '{"fields":["title", "content"], "type": "best_fields", "default_operator":"AND","mode":"lucene", "minimum_should_match": 0}');

select count(*) from wikipedia where search('*', '{"default_field": "title", "default_operator":"AND","mode":"lucene", "minimum_should_match": 0}');
```
…m2) (apache#60786)

### What problem does this PR solve?

Issue Number: close #N/A

Problem Summary:

The `search()` function did not support ES `query_string` field-grouped
syntax where all terms inside parentheses inherit the field prefix:

```sql
-- Previously failed with syntax error
SELECT * FROM t WHERE search('title:(rock OR jazz)', '{"fields":["title","content"]}');
```

ES semantics:
| Input | Expansion |
|-------|-----------|
| `title:(rock OR jazz)` | `(title:rock OR title:jazz)` |
| `title:(rock jazz)` with `default_operator:AND` | `(+title:rock +title:jazz)` |
| `title:(rock OR jazz) AND music` with `fields:[title,content]` | `(title:rock OR title:jazz) AND (title:music OR content:music)` |
| `title:("rock and roll" OR jazz)` | `(title:"rock and roll" OR title:jazz)` |
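
The third row of the table can be reproduced with a small Python model. It assumes a dictionary-based AST and hypothetical helpers (`term`, `boolean`, `apply_field_group`, `expand_multi_field`, `render`); the real logic lives in the ANTLR visitors and `MultiFieldExpander`, which this sketch does not replicate in detail.

```python
def term(value, field=None, explicit=False):
    return {"type": "TERM", "field": field, "value": value, "explicit": explicit}


def boolean(op, *children):
    return {"type": op, "children": list(children)}


def apply_field_group(field, node):
    """Model of field:( ... ): every leaf inherits the field and is marked explicit."""
    if node["type"] == "TERM":
        return {**node, "field": field, "explicit": True}
    return boolean(node["type"],
                   *(apply_field_group(field, c) for c in node["children"]))


def expand_multi_field(node, fields):
    """Expand bare terms across all fields; leave explicit leaves untouched."""
    if node["type"] == "TERM":
        if node["explicit"]:
            return node
        return boolean("OR", *(term(node["value"], f, True) for f in fields))
    return boolean(node["type"],
                   *(expand_multi_field(c, fields) for c in node["children"]))


def render(node):
    if node["type"] == "TERM":
        return f'{node["field"]}:{node["value"]}'
    joiner = f' {node["type"]} '
    return "(" + joiner.join(render(c) for c in node["children"]) + ")"


if __name__ == "__main__":
    # title:(rock OR jazz) AND music, with fields [title, content]
    group = apply_field_group("title", boolean("OR", term("rock"), term("jazz")))
    ast = boolean("AND", group, term("music"))
    print(render(expand_multi_field(ast, ["title", "content"])))
```

Marking the grouped leaves as explicit is what keeps `rock` and `jazz` from being re-expanded onto `content`.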

### Root cause

The ANTLR grammar `SearchParser.g4` defined `fieldQuery : fieldPath
COLON searchValue` where `searchValue` only accepts leaf values (TERM,
QUOTED, etc.), not a parenthesized sub-clause. So `title:(` caused a
syntax error.

### Solution

**Grammar** (`SearchParser.g4`):
- Add `fieldGroupQuery : fieldPath COLON LPAREN clause RPAREN` rule
- Add it as alternative in `atomClause` before `fieldQuery`

**Visitor** (`SearchDslParser.java`):
- Add `markExplicitFieldRecursive()` helper — marks all leaf nodes in a
group as `explicitField=true` to prevent `MultiFieldExpander` from
re-expanding them across unintended fields
- Modify `visitBareQuery()` in both `QsAstBuilder` and
`QsLuceneModeAstBuilder` to use `currentFieldName` as field group
context when set
- Add `visitFieldGroupQuery()` to both AST builders: sets field context,
visits inner clause, marks all leaves explicit
- Update `visitAtomClause()` and `collectTermsFromNotClause()` to handle
…h() function (apache#60790)

### What problem does this PR solve?

Problem Summary:

This PR adds searcher cache reuse and a DSL result cache for the
`search()` function to improve query performance on repeated search
queries against the same segments.

**Key changes:**

1. **DSL result cache**: Caches the final roaring bitmap per (segment,
DSL) pair so repeated identical `search()` queries skip Lucene execution
entirely. Uses length-prefix key encoding to avoid hash collisions.

2. **Deep-copy bitmap semantics**: Bitmaps are deep-copied on both cache
read and write to prevent `mask_out_null()` from polluting cached
entries.

3. **Type-safe cache accessor**: Replaces raw `void*` return with a
template `get_value<T>()` that uses `static_assert` to ensure T derives
from `LRUCacheValueBase`.

4. **Session-level cache toggle**: Adds
`enable_search_function_query_cache` session variable (default: true) to
allow disabling the cache per query via `SET_VAR`.

5. **Const-correctness fix**: Removes unsafe `const_cast` in
`build_dsl_signature` by copying TSearchParam before Thrift
serialization.

6. **Defensive improvements**: Adds null check for `result_bitmap` on
cache hit, logging for serialization fallback and cache bypass paths.
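
Items 1 and 2 can be sketched in Python under a few stated assumptions: a Python `set` stands in for the roaring bitmap, the key uses a 4-byte big-endian length prefix per component (the actual encoding is not given in the description), and `encode_key` and `DslResultCache` are hypothetical names.

```python
import copy


def encode_key(segment_id, dsl):
    """Length-prefix each component so component boundaries are unambiguous."""
    out = b""
    for part in (segment_id, dsl):
        raw = part.encode("utf-8")
        out += len(raw).to_bytes(4, "big") + raw
    return out


class DslResultCache:
    """Toy (segment, DSL) -> bitmap cache with deep copies on read and write."""

    def __init__(self):
        self._entries = {}

    def put(self, segment_id, dsl, bitmap):
        # Copy on write: later changes to the caller's bitmap stay private.
        self._entries[encode_key(segment_id, dsl)] = copy.deepcopy(bitmap)

    def get(self, segment_id, dsl):
        hit = self._entries.get(encode_key(segment_id, dsl))
        # Copy on read: callers may mutate the result (e.g. null masking).
        return None if hit is None else copy.deepcopy(hit)


if __name__ == "__main__":
    cache = DslResultCache()
    cache.put("seg0", "title:rock", {1, 2, 3})
    rows = cache.get("seg0", "title:rock")
    rows.discard(2)  # does not affect the cached entry
    print(cache.get("seg0", "title:rock"))
```

Plain concatenation would map `("ab", "c")` and `("a", "bc")` to the same key; the length prefix keeps them distinct.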

### Release note

Add DSL result cache for search() function to skip repeated Lucene
execution on identical queries.
…y results (apache#60793)

### What problem does this PR solve?

Related PR: apache#60782

Problem Summary:
When `search()` DSL uses wildcard patterns (e.g. `*ith`, `sm*th`,
`sm?th`) on variant subcolumns with analyzer-based indexes
(field_pattern), the queries return empty results even though regular
TERM search works correctly.

**Root cause:** In `FieldReaderResolver::resolve()`, only `EQUAL_QUERY`
was upgraded to `MATCH_ANY_QUERY` for variant subcolumns with
analyzer-based indexes. `WILDCARD_QUERY` was not upgraded, so
`select_best_reader()` picked the `STRING_TYPE` reader instead of
`FULLTEXT`. `WildcardWeight` then enumerated terms from the wrong
(untokenized) index directory, finding no matches.

**Fix:** Extend the query type upgrade condition to also cover
`WILDCARD_QUERY`, so wildcard patterns correctly use the FULLTEXT index
on variant subcolumns. Also fix a misleading comment in
`inverted_index_iterator.cpp` where `is_equal_query()` was described as
handling WILDCARD/REGEXP but actually only checks `EQUAL_QUERY`.
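
Why the untokenized directory yields no matches can be seen in a toy Python model. It assumes the wildcard is matched against whole entries of a term dictionary, uses `fnmatch.fnmatchcase` as a stand-in for the wildcard matcher, and uses made-up sample data; `wildcard_hits` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def wildcard_hits(pattern, term_dictionary):
    """Return dictionary terms that match the wildcard pattern in full."""
    return [t for t in term_dictionary if fnmatchcase(t, pattern)]


# Sample data (assumed): the same value as stored by each index flavor.
untokenized_terms = ["John Smith"]    # STRING_TYPE: whole value is one term
tokenized_terms = ["john", "smith"]   # FULLTEXT: lowercased tokens

if __name__ == "__main__":
    for pattern in ("sm?th", "sm*th"):
        print(pattern,
              wildcard_hits(pattern, untokenized_terms),
              wildcard_hits(pattern, tokenized_terms))
```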
… search() (apache#60798)

### What problem does this PR solve?

Issue Number: close #DORIS-24542

Problem Summary:
When a column has multiple inverted indexes with different analyzers
(e.g., one default untokenized index and one with English parser),
`search()` in Lucene/scalar mode returns empty results.

**Root cause:** In `FieldReaderResolver::resolve()`,
`select_best_reader()` was always called with an empty analyzer key
`""`, causing it to pick the wrong (untokenized) index for tokenized
queries. Additionally, the EQUAL_QUERY → MATCH_ANY_QUERY upgrade was
restricted to variant subcolumns only.

**Fix:**
1. Extract `analyzer_key` from FE-provided `index_properties` before
calling `select_best_reader()` and pass it through
2. Remove the `is_variant_sub` restriction on the query type upgrade so
regular columns with multiple indexes also get the correct FULLTEXT
reader
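
The effect of passing the analyzer key through can be modeled in Python. This is a sketch under loud assumptions: the key is taken to be the `parser` property value, readers are kept in a dictionary keyed by analyzer key with `""` for the untokenized index, and both function bodies are illustrative rather than the BE logic.

```python
def extract_analyzer_key(index_properties):
    # Assumption: the analyzer key is derived from the "parser" property.
    return index_properties.get("parser", "")


def select_best_reader(readers, analyzer_key):
    """Pick the reader for the key, falling back to the untokenized one."""
    return readers.get(analyzer_key, readers.get(""))


if __name__ == "__main__":
    # One column with two inverted indexes (sample data).
    readers = {"": "STRING_TYPE", "english": "FULLTEXT"}
    props = {"parser": "english"}
    print(select_best_reader(readers, ""))                           # before the fix
    print(select_best_reader(readers, extract_analyzer_key(props)))  # after the fix
```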
…g term in lucene mode (apache#60814)

### What problem does this PR solve?

Issue Number: close #DORIS-24545

Problem Summary:

In the `search()` function's lucene mode, queries that mix explicit and
implicit operators produce results that differ from Elasticsearch. For
example:

- Query: `"Sumer" OR Ptolemaic\ dynasty Limonene` with
`default_operator=AND`
- ES result: 1 row
- Doris result: 0 rows (before fix)

**Root cause:** In Lucene's `QueryParserBase.addClause()`, only explicit
`CONJ_AND`/`CONJ_OR` modify the preceding term's occur. Implicit
conjunction (`CONJ_NONE`, i.e., space-separated terms without an
explicit operator) only affects the **current** term via
`default_operator`, without modifying the preceding term.

The FE `SearchDslParser.hasExplicitAndBefore()` incorrectly returned
`true` (based on `default_operator`) when no explicit AND token was
found. This caused implicit conjunction to be treated identically to
explicit AND, making it modify the preceding term's occur — diverging
from Lucene/ES semantics.

**Example of the bug:**

For `a OR b c` with `default_operator=AND`:
- Before fix: `SHOULD(a) MUST(b) MUST(c)` — wrong, implicit space before
`c` incorrectly upgraded `b` from SHOULD to MUST
- After fix: `SHOULD(a) SHOULD(b) MUST(c)` — correct, matches ES
behavior. Only `c` gets MUST (from default_operator), `b` retains SHOULD
(from the preceding OR)

**Fix:** `hasExplicitAndBefore()` now returns `false` when no explicit
AND token is found, regardless of `default_operator`. Only explicit AND
tokens trigger the "introduced by AND" logic that modifies preceding
terms.
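
The `a OR b c` example can be traced with a simplified Python model of the occur assignment described above. Modifiers (`+`, `-`, `NOT`) are omitted, `assign_occurs` is a hypothetical helper, and the `implicit_as_explicit_and` flag exists only to reproduce the pre-fix behavior for comparison.

```python
CONJ_NONE, CONJ_AND, CONJ_OR = "NONE", "AND", "OR"


def assign_occurs(terms, default_operator="AND", implicit_as_explicit_and=False):
    """terms: list of (conjunction introducing the term, term text)."""
    clauses = []
    for conj, text in terms:
        effective = conj
        if (implicit_as_explicit_and and conj == CONJ_NONE
                and default_operator == "AND"):
            effective = CONJ_AND  # pre-fix: implicit space treated as explicit AND
        # Only an explicit conjunction may rewrite the preceding clause.
        if clauses and effective == CONJ_AND:
            clauses[-1] = ("MUST", clauses[-1][1])
        elif clauses and effective == CONJ_OR and default_operator == "AND":
            clauses[-1] = ("SHOULD", clauses[-1][1])
        # The current term's occur comes from its conjunction and the default.
        if default_operator == "OR":
            required = effective == CONJ_AND
        else:
            required = effective != CONJ_OR
        clauses.append(("MUST" if required else "SHOULD", text))
    return clauses


if __name__ == "__main__":
    query = [(CONJ_NONE, "a"), (CONJ_OR, "b"), (CONJ_NONE, "c")]
    print("after fix: ", assign_occurs(query))
    print("before fix:", assign_occurs(query, implicit_as_explicit_and=True))
```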

airborne12 requested a review from yiguolei as a code owner on March 3, 2026 16:44

@Thearas
Contributor

Thearas commented Mar 3, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@airborne12
Member Author

Superseded by squashed backport PR #61028

airborne12 closed this on Mar 4, 2026