branch-4.0: [refactor](search) Refactor SearchDslParser to single-phase ANTLR parsing and fix ES compatibility issues #60654 #61013

Closed
airborne12 wants to merge 2 commits into apache:branch-4.0 from airborne12:pick/branch-4.0/60654

@airborne12
Member

Summary

Merge Order

This is PR 2/12 in the search() function pick chain. Depends on #61012 (#59747).

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test
    • No need to test
  • Behavior changed:

    • No.
    • Yes. SearchDslParser refactored to single-phase ANTLR.
  • Does this need documentation?

    • No.

…pache#59747)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#59394

Problem Summary:
The search DSL should recognize only uppercase `AND`, `OR`, `NOT` as
boolean operators in Lucene boolean mode. Previously, lowercase
`and`, `or`, `not` were also treated as operators, which does not
conform to the specification.

This PR makes the boolean operators case-sensitive:
- Only uppercase `AND`, `OR`, `NOT` are recognized as operators
- Lowercase `and`, `or`, `not` are now treated as regular search terms
- Using lowercase operators in DSL will result in a parse error
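
The first two bullets can be illustrated with a minimal sketch. It is not taken from the Doris code base: the names `BOOLEAN_OPERATORS`, `classify_token`, and `classify_query` are hypothetical, and whitespace splitting stands in for the real ANTLR lexer. It models only the operator-versus-term decision, not the conditions under which the grammar reports a parse error.

```python
from __future__ import annotations

# Hypothetical illustration only; the real lexer is generated by ANTLR.
# Only these exact uppercase spellings count as boolean operators.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def classify_token(token: str) -> str:
    """Return "OPERATOR" for exact uppercase AND/OR/NOT, otherwise "TERM"."""
    return "OPERATOR" if token in BOOLEAN_OPERATORS else "TERM"


def classify_query(query: str) -> list[tuple[str, str]]:
    """Split on whitespace (a simplification) and classify each token."""
    return [(token, classify_token(token)) for token in query.split()]
```

With this sketch, `classify_query("apple AND banana or cherry")` labels `AND` as an operator and `or` as a regular term.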

### Release note

Make search DSL boolean operators (AND/OR/NOT) case-sensitive in lucene
boolean mode.

…sing and fix ES compatibility issues (apache#60654)

Problem Summary:

The `search()` function's DSL parser had multiple ES compatibility
issues and used a two-phase parsing approach (manual pre-parse + ANTLR)
that was error-prone. This PR refactors the parser and fixes several
bugs:

1. **SearchDslParser refactoring**: Consolidated from two-phase (manual
pre-parse + ANTLR) to single-phase ANTLR parsing. The ANTLR grammar now
handles all DSL syntax directly, eliminating the fragile manual
pre-parse layer. This fixes issues with operator precedence, grouping,
and edge cases.

2. **ANTLR grammar improvements**: Updated `SearchLexer.g4` and
`SearchParser.g4` to properly handle quoted phrases, field-qualified
expressions, prefix/wildcard/regexp patterns, range queries, and boolean
operators with correct precedence.

3. **minimum_should_match pipeline**: Added `default_operator` and
`minimum_should_match` fields to `TSearchParam` thrift, passing them
from FE `SearchPredicate` through to BE `function_search`. When
`minimum_should_match > 0`, uses `OccurBooleanQuery` for proper
Lucene-style boolean query semantics.

4. **Wildcard/Prefix/Regexp case-sensitivity**: WILDCARD and PREFIX
patterns are now lowercased when the index has a parser configured and
`lower_case=true` (matching ES query_string normalizer behavior). REGEXP
patterns are NOT lowercased (matching ES regex behavior where patterns
bypass analysis).

5. **MATCH_ALL_DOCS support**: Added `MATCH_ALL_DOCS` clause type for
standalone `*` queries and pure NOT query rewrites. Enhanced `AllQuery`
with deferred `max_doc` from `context.segment_num_rows` and nullable
field support via `NullableScorer`.

6. **BE fixes**:
- `regexp_weight._max_expansions`: Changed from 50 to 0 (unlimited) to
prevent PREFIX queries from missing documents
- `occur_boolean_weight`: Fixed swap→append bug when all SHOULD clauses
must match, preserving existing MUST scorers
- Variant subcolumn `index_properties` propagation for proper analyzer
selection
- `lower_case` default handling: inverted index `lower_case` defaults to
`"true"` when a parser is configured
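
The pattern normalization rule in item 4, combined with the `lower_case` default from item 6, can be sketched as follows. This is an illustration under stated assumptions, not the BE implementation: the function names and the plain-dict representation of index properties are hypothetical, and treating a missing, empty, or `"none"` parser value as "no parser configured" is an assumption.

```python
from __future__ import annotations


def effective_lower_case(index_properties: dict[str, str]) -> bool:
    """Decide whether patterns should be lowercased for this index.

    Assumption: a missing, empty, or "none" parser means no parser is set.
    """
    parser = index_properties.get("parser", "")
    if parser in ("", "none"):
        return False
    # Per the description, lower_case defaults to "true" once a parser is set.
    return index_properties.get("lower_case", "true").lower() == "true"


def normalize_pattern(kind: str, pattern: str,
                      index_properties: dict[str, str]) -> str:
    """Lowercase WILDCARD and PREFIX patterns; leave REGEXP patterns as-is."""
    if kind in ("WILDCARD", "PREFIX") and effective_lower_case(index_properties):
        return pattern.lower()
    return pattern
```

Under these assumptions, a `WILDCARD` pattern `Foo*Bar` on an index with `parser=english` becomes `foo*bar`, while a `REGEXP` pattern is returned unchanged.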

airborne12 requested a review from yiguolei as a code owner March 3, 2026 16:43

@Thearas
Contributor

Thearas commented Mar 3, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@airborne12
Member Author

Superseded by squashed backport PR #61028

@airborne12 airborne12 closed this Mar 4, 2026