
[fix](search) Fix slash character in search query_string terms #61599

Merged
airborne12 merged 2 commits into apache:master from airborne12:fix-search-slash-in-term
Mar 23, 2026

Conversation

@airborne12
Member

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

The ANTLR lexer in the search() DSL parser excluded / from TERM_CHAR, causing terms like AC/DC to be incorrectly tokenized. The slash was silently skipped by ANTLR's default error recovery, splitting AC/DC into two separate terms AC and DC instead of treating it as a single term.

This made search() inconsistent with Elasticsearch's query_string parsing, where AC\/DC (escaped slash) is handled as a single analyzed term.

Fix: Add / to the TERM_CHAR fragment in SearchLexer.g4. This allows / to appear within terms (e.g., AC/DC -> single term) while regex patterns like /[a-z]+/ still work correctly since / remains excluded from TERM_START_CHAR.
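
The effect of the fix can be illustrated with a small simulation. This is only a sketch in Python, not the ANTLR grammar itself: the character classes and the `REGEXP` pattern below are simplified assumptions, and the real fragments in `SearchLexer.g4` are not reproduced here. Skipping an unmatched character stands in for the lexer's error recovery.

```python
import re

# Hypothetical, simplified character classes (assumptions for illustration only).
TERM_START = r"[A-Za-z0-9_]"       # '/' is never allowed as the first character
TERM_CHAR_OLD = r"[A-Za-z0-9_]"    # before the fix: '/' not allowed inside a term
TERM_CHAR_NEW = r"[A-Za-z0-9_/]"   # after the fix: '/' allowed after the first character
REGEXP = r"/[^/\s]+/"              # a /pattern/ literal must start with '/'


def tokenize(query, term_char):
    """Split a query into (kind, text) tokens using the given TERM_CHAR class."""
    token_re = re.compile(
        rf"(?P<REGEXP>{REGEXP})|(?P<TERM>{TERM_START}{term_char}*)|(?P<WS>\s+)"
    )
    tokens, pos = [], 0
    while pos < len(query):
        match = token_re.match(query, pos)
        if match is None:
            pos += 1  # mimic error recovery: drop the unrecognized character
            continue
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("AC/DC", TERM_CHAR_OLD))     # slash dropped, two terms
print(tokenize("AC/DC", TERM_CHAR_NEW))     # one term
print(tokenize("/[a-z]+/", TERM_CHAR_NEW))  # still a regex literal
```

Because a term can only begin with a start character, an input that begins with `/` can never be claimed by the term rule, which is why the regex form is unaffected in this sketch.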

Release note

Fix search() function incorrectly handling slash (/) character within query terms (e.g., AC/DC). The slash is now treated as a regular character within terms instead of being silently dropped.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 22, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@airborne12
Member Author

run buildall

The ANTLR lexer excluded '/' from TERM_CHAR, causing terms like
'AC/DC' to be incorrectly tokenized — the slash was silently skipped,
splitting it into two separate terms. Add '/' to TERM_CHAR so it can
appear within terms while regex patterns (/pattern/) still work since
'/' remains excluded from TERM_START_CHAR.
@airborne12 force-pushed the fix-search-slash-in-term branch from 1e7f9c4 to 67a7155 on March 22, 2026 14:08
@airborne12
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26636 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 67a7155dcc6c4fa81cbe436d4fff683d45c9d9ae, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17632	4484	4279	4279
q2	q3	10637	773	517	517
q4	4677	351	249	249
q5	7556	1197	1027	1027
q6	180	172	147	147
q7	773	853	664	664
q8	9299	1447	1308	1308
q9	4829	4766	4612	4612
q10	6320	1911	1657	1657
q11	480	252	245	245
q12	745	595	467	467
q13	18028	2910	2196	2196
q14	228	236	209	209
q15	q16	742	759	671	671
q17	737	841	428	428
q18	5834	5419	5270	5270
q19	1130	981	620	620
q20	541	492	370	370
q21	4445	1801	1403	1403
q22	345	297	428	297
Total cold run time: 95158 ms
Total hot run time: 26636 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4812	4589	4550	4550
q2	q3	3983	4403	3831	3831
q4	880	1247	808	808
q5	4112	4523	4420	4420
q6	186	171	143	143
q7	1729	1597	1523	1523
q8	2493	2719	2599	2599
q9	7511	7353	7259	7259
q10	3726	4035	3709	3709
q11	525	426	420	420
q12	490	590	453	453
q13	2919	3233	2340	2340
q14	303	304	281	281
q15	q16	728	786	732	732
q17	1223	1319	1376	1319
q18	7207	6816	6615	6615
q19	872	906	977	906
q20	2055	2181	1993	1993
q21	3897	3480	3287	3287
q22	456	425	366	366
Total cold run time: 50107 ms
Total hot run time: 47554 ms
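
The report does not label its columns. The Round 1 numbers are consistent with reading each row as: cold run, two hot runs, and the faster of the two hot runs (all in ms), with the totals summing the first and last of those. That reading is an inference from the data, not something the bot states. A short Python sketch that checks it against the Round 1 rows copied from the report:

```python
# Round 1 rows copied from the TPC-H report above.
# Assumed column meaning: label, cold run, hot run 1, hot run 2, best hot run (ms).
ROUND_1 = """\
q1 17632 4484 4279 4279
q2 q3 10637 773 517 517
q4 4677 351 249 249
q5 7556 1197 1027 1027
q6 180 172 147 147
q7 773 853 664 664
q8 9299 1447 1308 1308
q9 4829 4766 4612 4612
q10 6320 1911 1657 1657
q11 480 252 245 245
q12 745 595 467 467
q13 18028 2910 2196 2196
q14 228 236 209 209
q15 q16 742 759 671 671
q17 737 841 428 428
q18 5834 5419 5270 5270
q19 1130 981 620 620
q20 541 492 370 370
q21 4445 1801 1403 1403
q22 345 297 428 297
"""


def summarize(report):
    """Return (total cold ms, total best-hot ms) for a block of report rows."""
    cold_total = hot_total = 0
    for line in report.splitlines():
        fields = line.split()
        # The last four fields are numeric; anything before them is the label.
        cold, hot1, hot2, best = map(int, fields[-4:])
        if best != min(hot1, hot2):
            raise ValueError(f"unexpected best-hot value in row: {line!r}")
        cold_total += cold
        hot_total += best
    return cold_total, hot_total


print(summarize(ROUND_1))  # (95158, 26636), matching the totals printed in the report
```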

@doris-robot

TPC-DS: Total hot run time: 168909 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 67a7155dcc6c4fa81cbe436d4fff683d45c9d9ae, data reload: false

query5	4317	621	507	507
query6	348	229	214	214
query7	4207	460	262	262
query8	348	250	234	234
query9	8714	2693	2686	2686
query10	541	390	326	326
query11	7015	5106	4869	4869
query12	191	130	124	124
query13	1271	464	356	356
query14	5727	3696	3511	3511
query14_1	2878	2853	2834	2834
query15	204	192	175	175
query16	986	466	447	447
query17	1129	750	631	631
query18	2450	453	351	351
query19	216	215	191	191
query20	135	126	132	126
query21	217	135	111	111
query22	13251	14172	14473	14172
query23	16149	15844	15663	15663
query23_1	15872	15804	15798	15798
query24	7532	1618	1227	1227
query24_1	1227	1210	1210	1210
query25	545	460	440	440
query26	1228	261	147	147
query27	2757	480	291	291
query28	4470	1820	1833	1820
query29	810	551	480	480
query30	299	222	192	192
query31	999	958	869	869
query32	78	71	72	71
query33	510	341	304	304
query34	873	870	531	531
query35	659	677	600	600
query36	1123	1179	1028	1028
query37	144	105	86	86
query38	2942	2937	2823	2823
query39	849	826	805	805
query39_1	803	788	784	784
query40	232	151	138	138
query41	61	60	59	59
query42	260	255	263	255
query43	230	246	219	219
query44	
query45	198	184	189	184
query46	890	971	622	622
query47	3217	2125	2090	2090
query48	306	309	225	225
query49	625	459	390	390
query50	675	276	208	208
query51	4039	4014	4018	4014
query52	261	270	251	251
query53	292	336	277	277
query54	294	265	268	265
query55	94	88	86	86
query56	312	328	325	325
query57	1972	1814	1820	1814
query58	288	278	264	264
query59	2833	2952	2727	2727
query60	344	344	322	322
query61	156	157	169	157
query62	632	580	548	548
query63	312	278	273	273
query64	5058	1254	983	983
query65	
query66	1443	461	357	357
query67	24178	24479	24339	24339
query68	
query69	418	318	290	290
query70	962	1011	937	937
query71	331	313	308	308
query72	2777	2673	2423	2423
query73	543	537	321	321
query74	9585	9520	9383	9383
query75	2850	2748	2434	2434
query76	2293	1022	707	707
query77	362	363	301	301
query78	10887	11016	10444	10444
query79	2626	766	583	583
query80	1738	635	539	539
query81	550	256	228	228
query82	1011	154	116	116
query83	331	259	238	238
query84	255	114	98	98
query85	915	577	525	525
query86	424	338	296	296
query87	3126	3102	2984	2984
query88	3523	2654	2649	2649
query89	423	368	347	347
query90	2009	177	174	174
query91	167	160	141	141
query92	72	73	72	72
query93	1174	830	495	495
query94	647	329	292	292
query95	590	340	373	340
query96	657	511	232	232
query97	2419	2491	2368	2368
query98	251	222	223	222
query99	1013	998	920	920
Total cold run time: 252212 ms
Total hot run time: 168909 ms

### What problem does this PR solve?

Problem Summary: The regression test `test_search_slash_in_term` was missing
its expected output file (.out), causing P0 Regression and Cloud P0 pipelines
to fail with `Missing outputFile` error.

### Release note

None

### Check List (For Author)

- Test: Regression test
- Behavior changed: No
- Does this need documentation: No
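
A missing expected-output file of this kind could be caught before CI with a small local check. The following Python sketch is hypothetical: the helper name, the assumption that suites are `.groovy` files with a sibling `.out` file at the same relative path under a separate data directory, and the assumption that only suites containing `qt_` need one are all illustrative and not taken from this PR.

```python
from pathlib import Path


def suites_missing_output(suites_root, data_root):
    """Return suites that appear to need an expected-output file but lack one."""
    suites_root, data_root = Path(suites_root), Path(data_root)
    missing = []
    for suite in sorted(suites_root.rglob("*.groovy")):
        # Assumption: only suites containing "qt_" compare against a recorded .out file.
        if "qt_" not in suite.read_text(encoding="utf-8"):
            continue
        relative = suite.relative_to(suites_root)
        expected = (data_root / relative).with_suffix(".out")
        if not expected.exists():
            missing.append(relative.as_posix())
    return missing
```

Calling `suites_missing_output("regression-test/suites", "regression-test/data")` would list the offending suites, assuming that directory layout applies to the checkout in question.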
@airborne12
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26771 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3723ecce7567f39b8cc8a366acba7b6949e56647, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17601	4425	4275	4275
q2	q3	10650	764	522	522
q4	4681	349	249	249
q5	7549	1188	1024	1024
q6	169	170	145	145
q7	774	842	675	675
q8	9291	1484	1305	1305
q9	4954	4702	4689	4689
q10	6303	1897	1625	1625
q11	479	254	237	237
q12	744	578	469	469
q13	18052	2909	2167	2167
q14	226	233	219	219
q15	q16	756	756	670	670
q17	741	866	434	434
q18	6139	5381	5374	5374
q19	1116	1004	607	607
q20	548	496	369	369
q21	4482	1826	1424	1424
q22	377	508	292	292
Total cold run time: 95632 ms
Total hot run time: 26771 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4800	4625	4743	4625
q2	q3	3878	4325	3826	3826
q4	916	1234	809	809
q5	4084	4379	4319	4319
q6	184	176	142	142
q7	1771	1641	1551	1551
q8	2501	2770	2597	2597
q9	7633	7332	7329	7329
q10	3744	4070	3595	3595
q11	514	435	419	419
q12	514	622	483	483
q13	2841	3328	2418	2418
q14	287	323	283	283
q15	q16	713	770	701	701
q17	1193	1384	1404	1384
q18	7420	6707	6539	6539
q19	873	939	921	921
q20	2059	2137	2009	2009
q21	3932	3493	3373	3373
q22	441	438	374	374
Total cold run time: 50298 ms
Total hot run time: 47697 ms

@doris-robot

TPC-DS: Total hot run time: 168699 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3723ecce7567f39b8cc8a366acba7b6949e56647, data reload: false

query5	4343	665	511	511
query6	331	239	212	212
query7	4217	470	274	274
query8	376	244	234	234
query9	8738	2693	2673	2673
query10	541	363	345	345
query11	6992	5085	4868	4868
query12	181	129	126	126
query13	1297	463	344	344
query14	5725	3706	3497	3497
query14_1	2853	2821	2830	2821
query15	196	194	181	181
query16	986	466	480	466
query17	896	727	645	645
query18	2446	458	363	363
query19	216	215	189	189
query20	139	127	129	127
query21	216	138	115	115
query22	13174	14053	14588	14053
query23	16209	16363	15689	15689
query23_1	15715	15951	15738	15738
query24	7288	1645	1239	1239
query24_1	1230	1225	1242	1225
query25	600	452	402	402
query26	1235	268	155	155
query27	2765	473	291	291
query28	4515	1849	1843	1843
query29	850	572	472	472
query30	298	223	191	191
query31	997	953	867	867
query32	81	69	70	69
query33	522	347	287	287
query34	892	869	539	539
query35	631	686	587	587
query36	1130	1174	975	975
query37	137	92	77	77
query38	2975	2972	2936	2936
query39	847	816	822	816
query39_1	803	792	800	792
query40	231	151	137	137
query41	64	59	58	58
query42	260	258	259	258
query43	245	241	215	215
query44	
query45	201	194	188	188
query46	886	1015	639	639
query47	2152	2156	2070	2070
query48	314	316	238	238
query49	711	471	381	381
query50	707	301	220	220
query51	4128	4104	4038	4038
query52	267	271	261	261
query53	295	337	290	290
query54	308	281	283	281
query55	96	91	85	85
query56	338	336	322	322
query57	1945	1800	1809	1800
query58	292	280	270	270
query59	2855	2990	2800	2800
query60	358	354	326	326
query61	156	154	156	154
query62	620	583	548	548
query63	314	274	268	268
query64	5110	1280	1001	1001
query65	
query66	1492	453	341	341
query67	24224	24293	24123	24123
query68	
query69	407	310	292	292
query70	1016	937	946	937
query71	345	303	301	301
query72	2876	2658	2492	2492
query73	533	553	317	317
query74	9605	9575	9364	9364
query75	2845	2751	2448	2448
query76	2273	1038	656	656
query77	360	379	316	316
query78	10898	11109	10458	10458
query79	2755	780	573	573
query80	1754	635	550	550
query81	550	254	227	227
query82	979	157	116	116
query83	324	266	241	241
query84	297	112	96	96
query85	902	496	438	438
query86	408	304	329	304
query87	3136	3145	2982	2982
query88	3537	2653	2647	2647
query89	425	375	355	355
query90	2002	178	170	170
query91	164	170	139	139
query92	81	75	74	74
query93	1178	818	495	495
query94	635	311	279	279
query95	576	341	380	341
query96	648	505	228	228
query97	2478	2489	2391	2391
query98	250	220	220	220
query99	1000	999	914	914
Total cold run time: 251454 ms
Total hot run time: 168699 ms

@github-actions
Contributor

PR approved by anyone and no changes requested.

Member

@eldenmoon eldenmoon left a comment


LGTM

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions bot added the approved label (Indicates a PR has been approved by one committer.) Mar 23, 2026
@airborne12 airborne12 merged commit 574b84f into apache:master Mar 23, 2026
31 of 32 checks passed
@airborne12 airborne12 deleted the fix-search-slash-in-term branch March 23, 2026 06:51
github-actions bot pushed a commit that referenced this pull request Mar 23, 2026
github-actions bot pushed a commit that referenced this pull request Mar 23, 2026
yiguolei pushed a commit that referenced this pull request Mar 24, 2026
…terms #61599 (#61618)

Cherry-picked from #61599

Co-authored-by: Jack <jiangkai@selectdb.com>
yiguolei pushed a commit that referenced this pull request Mar 24, 2026

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.0.5-merged, dev/4.1.x, reviewed


6 participants