[fix](search) inject MATCH_ALL_DOCS for multi-MUST_NOT queries in lucene mode #60891

Open
airborne12 wants to merge 1 commit into apache:master from airborne12:fix-search-multi-must-not-v2

@airborne12 (Member)

What problem does this PR solve?

Related PR: #60814

Problem Summary:
In search() lucene mode, when every term in a boolean query is MUST_NOT
(e.g., "NOT a AND NOT b", or "NOT a NOT b" with default_operator=AND),
the query incorrectly returns all documents instead of returning all
documents EXCEPT those matching the negated terms.

Root cause: Lucene's BooleanQuery with only MUST_NOT clauses matches
nothing (by design): MUST_NOT clauses only exclude documents from the set
selected by the other clauses, and a pure-negation query has no such
clause. ES handles this by injecting a MatchAllDocsQuery with SHOULD
occur. Doris handled only the single-term MUST_NOT case, not multi-term
all-MUST_NOT queries.

Fix: After applyLuceneBooleanLogic(), detect if ALL terms are MUST_NOT
and inject MATCH_ALL_DOCS(SHOULD) with minimum_should_match=1.
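
The detection-and-injection step described in the fix can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `Occur`, `Clause`, and `BoolQuery` types and the function name `inject_match_all_for_pure_negation` are hypothetical stand-ins, and it assumes the rewrite runs on the clause list produced after applyLuceneBooleanLogic().

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Occur(Enum):
    MUST = "MUST"
    SHOULD = "SHOULD"
    MUST_NOT = "MUST_NOT"


# Hypothetical sentinel standing in for a match-all-documents query.
MATCH_ALL_DOCS = "MATCH_ALL_DOCS"


@dataclass
class Clause:
    occur: Occur
    query: str  # a term, or the MATCH_ALL_DOCS sentinel


@dataclass
class BoolQuery:
    clauses: List[Clause] = field(default_factory=list)
    minimum_should_match: int = 0


def inject_match_all_for_pure_negation(q: BoolQuery) -> BoolQuery:
    """If every clause is MUST_NOT, prepend MATCH_ALL_DOCS(SHOULD).

    Queries with at least one positive clause (or no clauses at all)
    are returned unchanged.
    """
    if q.clauses and all(c.occur is Occur.MUST_NOT for c in q.clauses):
        return BoolQuery(
            clauses=[Clause(Occur.SHOULD, MATCH_ALL_DOCS)] + list(q.clauses),
            minimum_should_match=1,
        )
    return q


if __name__ == "__main__":
    # "NOT a AND NOT b" after boolean-logic resolution: two MUST_NOT clauses.
    pure_negation = BoolQuery([Clause(Occur.MUST_NOT, "a"),
                               Clause(Occur.MUST_NOT, "b")])
    rewritten = inject_match_all_for_pure_negation(pure_negation)
    print([(c.occur.value, c.query) for c in rewritten.clauses])
    print(rewritten.minimum_should_match)
```

The sketch returns a new query object instead of mutating its input, which keeps the check easy to test in isolation; whether the real implementation mutates in place is not stated in this PR description.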

Release note

Fix search() lucene mode returning incorrect results for multi-MUST_NOT queries like "NOT a AND NOT b".

Check List (For Author)

  • Test

    • Unit Test
    • Regression test
    • Manual test
    • No need to test
  • Behavior changed:

    • Yes. Pure multi-MUST_NOT queries now correctly exclude matching docs instead of returning all.
  • Does this need documentation?

    • No.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas (Contributor) commented Feb 27, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

…ene mode

When all terms in a boolean query are MUST_NOT (e.g., "NOT a AND NOT b"),
Lucene's BooleanQuery.rewrite() produces a pure-negation query that matches
nothing. ES handles this by injecting a MatchAllDocsQuery with SHOULD occur.

This fix detects the all-MUST_NOT case after applyLuceneBooleanLogic() and
injects MATCH_ALL_DOCS(SHOULD) with minimum_should_match=1, matching ES
query_string semantics for pure negation queries.

Previously only single-term MUST_NOT was handled (the existing single-term
rewrite). Multi-term all-MUST_NOT queries like "NOT a AND NOT b" or
"NOT a NOT b" (with op=and) were not covered.
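
To make the semantics in this commit message concrete, here is a toy evaluator over a four-document corpus. It is only a sketch of the boolean matching rules as described above (MUST_NOT excludes, and a query needs at least one positive clause to select anything); the `DOCS` corpus, the `evaluate` function, and the `"*"` match-all marker are all hypothetical and do not model Doris or Lucene internals.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical corpus: doc id -> set of terms in the document.
DOCS: Dict[int, Set[str]] = {
    1: {"a"},
    2: {"b"},
    3: {"c"},
    4: {"a", "c"},
}

MATCH_ALL = "*"  # hypothetical marker for a match-all-documents clause


def evaluate(clauses: List[Tuple[str, str]],
             minimum_should_match: int = 0) -> Set[int]:
    """Evaluate (occur, term) clauses with simplified boolean semantics."""
    def matches(term: str, doc_terms: Set[str]) -> bool:
        return term == MATCH_ALL or term in doc_terms

    must = [t for o, t in clauses if o == "MUST"]
    should = [t for o, t in clauses if o == "SHOULD"]
    must_not = [t for o, t in clauses if o == "MUST_NOT"]

    hits: Set[int] = set()
    for doc_id, doc_terms in DOCS.items():
        if any(matches(t, doc_terms) for t in must_not):
            continue  # excluded by a negated term
        if not all(matches(t, doc_terms) for t in must):
            continue  # a required term is missing
        should_hits = sum(matches(t, doc_terms) for t in should)
        if not must and should_hits == 0:
            continue  # nothing positive selected this document
        if should_hits < minimum_should_match:
            continue
        hits.add(doc_id)
    return hits


pure_negation = [("MUST_NOT", "a"), ("MUST_NOT", "b")]
with_match_all = [("SHOULD", MATCH_ALL)] + pure_negation

print(evaluate(pure_negation))                           # set()
print(evaluate(with_match_all, minimum_should_match=1))  # {3}
```

Only document 3 contains neither "a" nor "b", so it is the sole hit once the match-all SHOULD clause gives the query something positive to select from.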

@airborne12 force-pushed the fix-search-multi-must-not-v2 branch from 63212e0 to e61006d on February 27, 2026 15:15

@airborne12 (Member, Author)

run buildall

@doris-robot

TPC-H: Total hot run time: 28727 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e61006dbb4c0ad2f40700fdf11ff3953fda711d6, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	17658	4463	4277	4277
q2	
q3	10662	776	517	517
q4	4678	363	250	250
q5	7563	1213	1015	1015
q6	170	174	145	145
q7	768	836	664	664
q8	9285	1420	1372	1372
q9	4925	4725	4677	4677
q10	6848	1860	1630	1630
q11	465	247	234	234
q12	717	557	479	479
q13	17805	4202	3418	3418
q14	230	223	211	211
q15	955	795	792	792
q16	758	716	670	670
q17	714	869	400	400
q18	5994	5480	5240	5240
q19	1146	973	628	628
q20	517	503	397	397
q21	4560	1985	1449	1449
q22	425	313	262	262
Total cold run time: 96843 ms
Total hot run time: 28727 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4786	4582	4555	4555
q2	
q3	1821	2216	1763	1763
q4	863	1192	760	760
q5	4024	4403	4336	4336
q6	185	177	150	150
q7	1751	1705	1532	1532
q8	2428	2697	2559	2559
q9	7521	7619	7323	7323
q10	2636	2899	2442	2442
q11	589	476	425	425
q12	500	576	452	452
q13	3981	4444	3608	3608
q14	279	314	278	278
q15	892	816	827	816
q16	786	760	708	708
q17	1182	1573	1244	1244
q18	7087	6768	6628	6628
q19	1022	915	904	904
q20	2072	2171	2071	2071
q21	4051	3447	3345	3345
q22	485	442	393	393
Total cold run time: 48941 ms
Total hot run time: 46292 ms

@doris-robot

TPC-DS: Total hot run time: 184085 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e61006dbb4c0ad2f40700fdf11ff3953fda711d6, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query5	4352	630	524	524
query6	329	240	218	218
query7	4218	482	274	274
query8	342	247	235	235
query9	8706	2706	2717	2706
query10	515	416	368	368
query11	17151	17542	17631	17542
query12	203	137	142	137
query13	1410	485	433	433
query14	8264	3297	3200	3200
query14_1	2909	2944	2942	2942
query15	230	202	179	179
query16	1005	506	469	469
query17	1260	777	682	682
query18	3058	493	363	363
query19	228	211	204	204
query20	152	132	133	132
query21	222	148	122	122
query22	5607	5158	4792	4792
query23	17287	16755	16544	16544
query23_1	16814	16669	16648	16648
query24	7089	1610	1255	1255
query24_1	1240	1238	1222	1222
query25	563	475	428	428
query26	1233	264	150	150
query27	2757	463	284	284
query28	4459	1846	1883	1846
query29	785	558	472	472
query30	307	251	209	209
query31	886	729	639	639
query32	81	74	72	72
query33	510	320	284	284
query34	917	904	563	563
query35	634	679	580	580
query36	1095	1112	1009	1009
query37	126	104	90	90
query38	3014	2999	2802	2802
query39	892	877	858	858
query39_1	810	829	816	816
query40	224	152	132	132
query41	62	58	60	58
query42	105	103	102	102
query43	368	388	341	341
query44	
query45	202	190	188	188
query46	871	981	597	597
query47	2110	2118	2054	2054
query48	311	322	231	231
query49	642	451	389	389
query50	682	273	221	221
query51	4050	4155	4057	4057
query52	106	108	96	96
query53	290	333	283	283
query54	293	259	261	259
query55	125	82	83	82
query56	328	303	307	303
query57	1349	1355	1268	1268
query58	291	270	275	270
query59	2507	2629	2602	2602
query60	359	343	321	321
query61	151	151	142	142
query62	643	582	535	535
query63	300	276	274	274
query64	4831	1262	985	985
query65	
query66	1385	454	357	357
query67	16348	16335	16234	16234
query68	
query69	401	313	285	285
query70	944	978	870	870
query71	345	311	301	301
query72	2814	2649	2367	2367
query73	545	575	328	328
query74	10067	9958	9837	9837
query75	2882	2738	2463	2463
query76	2304	1016	663	663
query77	367	378	307	307
query78	11237	11465	10698	10698
query79	2853	766	592	592
query80	1798	646	520	520
query81	567	278	246	246
query82	1011	150	115	115
query83	330	262	242	242
query84	245	120	100	100
query85	888	484	435	435
query86	420	297	299	297
query87	3113	3081	2994	2994
query88	3589	2677	2665	2665
query89	425	360	340	340
query90	2025	173	170	170
query91	170	159	135	135
query92	78	86	69	69
query93	1209	853	497	497
query94	637	325	267	267
query95	596	393	315	315
query96	651	516	229	229
query97	2479	2477	2389	2389
query98	234	216	219	216
query99	1013	993	890	890
Total cold run time: 258383 ms
Total hot run time: 184085 ms
