
[fix](search) Fix implicit conjunction incorrectly modifying preceding term in lucene mode#60814

Merged
airborne12 merged 1 commit into apache:master from airborne12:fix-DORIS-24545-lucene-implicit-conjunction
Feb 25, 2026

@airborne12
Member

What problem does this PR solve?

Issue Number: close #DORIS-24545

Problem Summary:

In the search() function's lucene mode, queries that mix explicit and implicit operators return different results from Elasticsearch. For example:

  • Query: "Sumer" OR Ptolemaic\ dynasty Limonene with default_operator=AND
  • ES result: 1 row
  • Doris result: 0 rows (before fix)

Root cause: In Lucene's QueryParserBase.addClause(), only explicit CONJ_AND/CONJ_OR modify the preceding term's occur. Implicit conjunction (CONJ_NONE, i.e., space-separated terms without an explicit operator) only affects the current term via default_operator, without modifying the preceding term.

The FE SearchDslParser.hasExplicitAndBefore() incorrectly returned true (based on default_operator) when no explicit AND token was found. This caused implicit conjunction to be treated identically to explicit AND, making it modify the preceding term's occur — diverging from Lucene/ES semantics.

Example of the bug:

For a OR b c with default_operator=AND:

  • Before fix: SHOULD(a) MUST(b) MUST(c) — wrong, implicit space before c incorrectly upgraded b from SHOULD to MUST
  • After fix: SHOULD(a) SHOULD(b) MUST(c) — correct, matches ES behavior. Only c gets MUST (from default_operator), b retains SHOULD (from the preceding OR)
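
The two outcomes above can be reproduced with a small model of the occur assignment. This is a simplified sketch, not Lucene's or Doris's actual code: it ignores the +, - and NOT modifiers, and the class `OccurSketch`, the `Conj` and `Occur` enums, and the `treatImplicitAsAnd` flag are hypothetical names introduced only to contrast the pre-fix and post-fix behavior.

```java
import java.util.ArrayList;
import java.util.List;

class OccurSketch {
    enum Conj { NONE, AND, OR }
    enum Occur { MUST, SHOULD }

    /**
     * Assigns an occur to each term. conjs.get(i) is the conjunction written
     * before term i (NONE for the first term and for space-separated terms).
     * treatImplicitAsAnd = true models the pre-fix behavior (an assumption
     * based on the PR description, not the real parser code).
     */
    static List<Occur> assign(List<Conj> conjs, boolean defaultAnd, boolean treatImplicitAsAnd) {
        List<Occur> occurs = new ArrayList<>();
        for (Conj written : conjs) {
            Conj conj = written;
            // Pre-fix: an implicit conjunction is handled like an explicit AND.
            if (treatImplicitAsAnd && defaultAnd && conj == Conj.NONE && !occurs.isEmpty()) {
                conj = Conj.AND;
            }
            int last = occurs.size() - 1;
            // Explicit AND makes the preceding term required.
            if (last >= 0 && conj == Conj.AND) {
                occurs.set(last, Occur.MUST);
            }
            // Explicit OR under default AND makes the preceding term optional.
            if (last >= 0 && defaultAnd && conj == Conj.OR) {
                occurs.set(last, Occur.SHOULD);
            }
            // The current term: required unless introduced by OR (default AND),
            // or required only when introduced by AND (default OR).
            boolean required = defaultAnd ? conj != Conj.OR : conj == Conj.AND;
            occurs.add(required ? Occur.MUST : Occur.SHOULD);
        }
        return occurs;
    }

    public static void main(String[] args) {
        List<Conj> query = List.of(Conj.NONE, Conj.OR, Conj.NONE); // a OR b c
        System.out.println(assign(query, true, true));  // [SHOULD, MUST, MUST]
        System.out.println(assign(query, true, false)); // [SHOULD, SHOULD, MUST]
    }
}
```

With `treatImplicitAsAnd` set, the space before c rewrites b to MUST. Without it, only c picks up MUST from default_operator.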

Fix: hasExplicitAndBefore() now returns false when no explicit AND token is found, regardless of default_operator. Only explicit AND tokens trigger the "introduced by AND" logic that modifies preceding terms.
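
As a rough illustration of the fixed check, the sketch below scans a token list for an explicit AND directly before a term. The real signature of hasExplicitAndBefore() is not shown in this PR description, so the class `ExplicitAndSketch`, the `TokenType` enum, and the parameter list here are assumptions made for the example only.

```java
import java.util.List;

class ExplicitAndSketch {
    enum TokenType { TERM, AND, OR }

    /**
     * Hypothetical post-fix check: true only when an explicit AND token
     * directly precedes the term at termIndex. defaultAnd is accepted only
     * to show that it is deliberately not consulted.
     */
    static boolean hasExplicitAndBefore(List<TokenType> tokens, int termIndex, boolean defaultAnd) {
        return termIndex > 0 && tokens.get(termIndex - 1) == TokenType.AND;
    }

    public static void main(String[] args) {
        // a OR b c: the term c (index 3) is preceded by a term, not by AND
        List<TokenType> implicit =
                List.of(TokenType.TERM, TokenType.OR, TokenType.TERM, TokenType.TERM);
        System.out.println(hasExplicitAndBefore(implicit, 3, true)); // false

        // a AND b: the term b (index 2) is preceded by an explicit AND
        List<TokenType> explicit = List.of(TokenType.TERM, TokenType.AND, TokenType.TERM);
        System.out.println(hasExplicitAndBefore(explicit, 2, true)); // true
    }
}
```

In this sketch the default operator never feeds into the result, which is the property the fix describes: an implicit conjunction cannot trigger the "introduced by AND" path.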

Release note

Fix search() lucene mode producing incorrect results when queries mix explicit operators (OR/AND) with implicit conjunction (space-separated terms).

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • Yes. Implicit conjunction (space between terms) in lucene mode no longer modifies the preceding term's occur. Only explicit AND/OR operators modify preceding terms, matching Lucene/ES semantics.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…g term in lucene mode

In Lucene's QueryParserBase.addClause(), only explicit CONJ_AND/CONJ_OR
modify the preceding term's occur. Implicit conjunction (CONJ_NONE) only
affects the current term via default_operator, without modifying the
preceding term.

The FE SearchDslParser incorrectly treated implicit conjunction the same
as explicit AND when default_operator=AND, causing hasExplicitAndBefore()
to return true. This made queries like "a OR b c" with default_operator=AND
produce SHOULD(a) MUST(b) MUST(c) instead of the correct SHOULD(a)
SHOULD(b) MUST(c), diverging from ES behavior.

Fix: hasExplicitAndBefore() now returns false when no explicit AND token
is found, regardless of default_operator. Only explicit AND tokens
trigger the "introduced by AND" logic that modifies preceding terms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Thearas
Contributor

Thearas commented Feb 24, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@airborne12
Member Author

buildall

@airborne12
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 28906 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fe7356f52ae214bf5c0d6a89c537947d872528da, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17664	4473	4296	4296
q2	q3	10645	803	519	519
q4	4676	368	252	252
q5	7545	1213	1002	1002
q6	173	176	144	144
q7	780	838	677	677
q8	9817	1476	1347	1347
q9	5301	4829	4727	4727
q10	6872	1887	1646	1646
q11	472	250	244	244
q12	742	571	468	468
q13	17789	4254	3425	3425
q14	230	228	211	211
q15	948	795	787	787
q16	739	715	661	661
q17	711	855	449	449
q18	6027	5441	5210	5210
q19	1505	982	635	635
q20	519	495	402	402
q21	4685	2033	1540	1540
q22	399	345	264	264
Total cold run time: 98239 ms
Total hot run time: 28906 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4773	4717	4561	4561
q2	q3	1822	2278	1767	1767
q4	859	1195	782	782
q5	4047	4427	4414	4414
q6	194	175	138	138
q7	1772	1616	1485	1485
q8	2541	2742	2620	2620
q9	7460	7317	7466	7317
q10	2628	2796	2367	2367
q11	494	438	403	403
q12	496	596	480	480
q13	4033	4778	3574	3574
q14	280	302	278	278
q15	878	821	803	803
q16	690	790	722	722
q17	1209	1601	1326	1326
q18	7037	6950	6624	6624
q19	902	883	898	883
q20	2133	2232	1990	1990
q21	3979	3477	3366	3366
q22	486	477	441	441
Total cold run time: 48713 ms
Total hot run time: 46341 ms

@doris-robot

TPC-DS: Total hot run time: 184094 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fe7356f52ae214bf5c0d6a89c537947d872528da, data reload: false

query5	4967	654	536	536
query6	337	240	206	206
query7	4210	467	272	272
query8	355	252	241	241
query9	8796	2761	2750	2750
query10	547	373	330	330
query11	16985	17532	17133	17133
query12	214	134	127	127
query13	1340	480	378	378
query14	7343	3337	3056	3056
query14_1	2894	2967	2954	2954
query15	199	196	183	183
query16	1007	496	468	468
query17	1526	764	612	612
query18	2951	467	370	370
query19	214	217	184	184
query20	148	142	130	130
query21	217	137	119	119
query22	4740	4888	4953	4888
query23	17246	16691	16632	16632
query23_1	16775	16548	16676	16548
query24	7120	1604	1222	1222
query24_1	1218	1215	1217	1215
query25	540	465	399	399
query26	1223	258	145	145
query27	2774	464	280	280
query28	4511	1856	1856	1856
query29	804	556	501	501
query30	313	250	212	212
query31	884	730	662	662
query32	78	71	70	70
query33	521	336	285	285
query34	907	915	579	579
query35	630	664	596	596
query36	1095	1150	993	993
query37	139	101	88	88
query38	2993	2874	2852	2852
query39	882	872	871	871
query39_1	832	833	839	833
query40	230	150	137	137
query41	66	60	57	57
query42	103	101	102	101
query43	381	388	350	350
query44	
query45	200	196	183	183
query46	878	971	630	630
query47	2076	2117	2068	2068
query48	307	323	235	235
query49	625	461	375	375
query50	678	280	213	213
query51	4086	4043	4022	4022
query52	104	107	96	96
query53	289	349	275	275
query54	286	263	250	250
query55	86	86	84	84
query56	305	313	305	305
query57	1348	1341	1279	1279
query58	284	275	270	270
query59	2598	2769	2585	2585
query60	345	330	318	318
query61	140	146	152	146
query62	621	588	560	560
query63	315	285	282	282
query64	4910	1293	1063	1063
query65	
query66	1404	476	373	373
query67	16476	16462	16466	16462
query68	
query69	396	319	288	288
query70	993	927	990	927
query71	340	310	298	298
query72	2956	2812	2484	2484
query73	538	546	320	320
query74	9984	9903	9737	9737
query75	2834	2765	2474	2474
query76	2307	1036	679	679
query77	358	371	304	304
query78	11209	11304	10689	10689
query79	3150	818	611	611
query80	1786	621	556	556
query81	586	291	243	243
query82	987	148	117	117
query83	335	274	249	249
query84	264	113	99	99
query85	888	465	427	427
query86	494	323	308	308
query87	3106	3090	3025	3025
query88	3572	2668	2655	2655
query89	428	371	341	341
query90	2154	178	172	172
query91	170	159	132	132
query92	94	81	77	77
query93	1815	843	509	509
query94	651	311	275	275
query95	588	342	382	342
query96	640	515	225	225
query97	2431	2503	2400	2400
query98	236	217	220	217
query99	1018	982	924	924
Total cold run time: 258377 ms
Total hot run time: 184094 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 66.67% (2/3) 🎉
Increment coverage report
Complete coverage report

Member

@eldenmoon eldenmoon left a comment

LGTM

Contributor

@zzzxl1993 zzzxl1993 left a comment

LGTM

@github-actions github-actions bot added the approved label Feb 25, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@airborne12 airborne12 merged commit 1cde086 into apache:master Feb 25, 2026
31 of 33 checks passed
@airborne12 airborne12 deleted the fix-DORIS-24545-lucene-implicit-conjunction branch February 25, 2026 10:02

Labels

approved, reviewed

7 participants