[fix](be) Escape JSONB path member control characters#63517

Open
mrhhsg wants to merge 2 commits into apache:master from mrhhsg:fix/jsonb-path-escape

@mrhhsg
Member

@mrhhsg mrhhsg commented May 22, 2026

What problem does this PR solve?

Issue Number: None

Problem Summary: JSONB path serialization emitted raw control characters in member names. This could make paths returned by JSON functions or shown in diagnostics contain raw newlines/tabs and fail path round-trip parsing. Escape JSONB path member names with JSON-compatible sequences and decode those sequences while parsing path members.

Release note

None

Check List (For Author)

  • Test:
    • Unit Test: ./run-be-ut.sh --run --filter=JsonbDocumentTest.jsonb_path_member_to_string_escapes_control_characters -j 16
    • Format check: build-support/clang-format.sh; build-support/check-format.sh; git diff --check
    • Static Analysis: attempted build-support/run-clang-tidy.sh --base d7d516ff7b60e2ff6971e7d18cd805478595b5c2 --build-dir be/ut_build_ASAN, blocked by existing clang-tidy/toolchain diagnostics outside the change
  • Behavior changed: Yes. JSONB path member serialization now escapes control characters instead of emitting raw control bytes.
  • Does this need documentation: No
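
For readers following the fix, here is a minimal sketch of what escaping a path member with JSON-compatible sequences could look like. It is not the code from this PR: the name `escape_path_member` is hypothetical, and the sketch assumes the escapes are the standard JSON short forms plus `\u00XX` for any other byte below 0x20.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper, not the actual Doris function: escapes a path member
// name so that no raw control byte, quote, or backslash reaches the output.
std::string escape_path_member(std::string_view name) {
    std::string out;
    out.reserve(name.size());
    for (unsigned char c : name) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\b': out += "\\b"; break;
        case '\f': out += "\\f"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '\t': out += "\\t"; break;
        default:
            if (c < 0x20) {
                // Remaining control bytes use the generic \u00XX form.
                char buf[7];
                std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                out += buf;
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    return out;
}
```

With this sketch, a member name containing a raw newline between `a` and `b` would serialize as the four visible characters `a\nb` rather than spanning two lines.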

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mrhhsg
Member Author

mrhhsg commented May 22, 2026

/review

Contributor

@github-actions github-actions Bot left a comment


Review summary: I did not find blocking issues in this PR. The change is focused on JSONB path member escaping, and the serializer/parser round-trip for the newly emitted JSON-compatible control-character escapes is covered by the added unit test.

Critical checkpoint conclusions:

  • Goal and tests: The code addresses the stated goal of escaping raw JSONB path member control characters and adds a targeted unit test for newline, tab, carriage return, quote, backslash, and a generic control byte round trip.
  • Scope: The modification is small and localized to JSONB path parse/stringify plus one BE unit test.
  • Concurrency/lifecycle: No new concurrency, shared state, static initialization dependency, or lifecycle ownership change was introduced.
  • Configuration/compatibility: No new config, persistence format, FE/BE protocol, or rolling-upgrade compatibility concern was introduced.
  • Parallel paths: The changed path formatter/parser is the shared JsonbPath path used by JSONB callers; I did not find another formatter needing the same control-character escape update.
  • Error handling/data correctness: Existing boolean parse-failure behavior is preserved; decoded path legs remain length-aware for JSONB lookup and modification paths.
  • Memory/performance: The added escaping work is proportional to member length and does not introduce untracked long-lived allocations beyond existing string construction.
  • Observability: No additional observability appears necessary for this local parse/format behavior.
  • Test result validation: I attempted to run ./run-be-ut.sh --run --filter=JsonbDocumentTest.jsonb_path_member_to_string_escapes_control_characters -j 16, but the runner is missing thirdparty/installed/bin/protoc, so the BE UT configure step failed before compiling tests.

User focus: No additional user-provided review focus was present.

### What problem does this PR solve?

Issue Number: None

Problem Summary: Add regression coverage for JSON_SEARCH paths containing escaped control characters and verify a returned escaped path can be used by JSON_EXTRACT.

### Release note

None

### Check List (For Author)

- Test:
    - Regression test: doris-local-regression --offset 10000 --network 10.26.20.3/24 run -d query_p0/sql_functions/json_functions -s test_json_search -forceGenOut
    - Regression test: doris-local-regression --offset 10000 --network 10.26.20.3/24 run -d query_p0/sql_functions/json_functions -s test_json_search
    - Manual check: git diff --check
- Behavior changed: No
- Does this need documentation: No
@mrhhsg
Member Author

mrhhsg commented May 22, 2026

Added regression coverage for JSON_SEARCH path escaping and refreshed json_search.out.

Validation:

  • doris-local-regression --offset 10000 --network 10.26.20.3/24 run -d query_p0/sql_functions/json_functions -s test_json_search -forceGenOut
  • doris-local-regression --offset 10000 --network 10.26.20.3/24 run -d query_p0/sql_functions/json_functions -s test_json_search
  • git diff --check

@mrhhsg
Member Author

mrhhsg commented May 22, 2026

/review

Contributor

@github-actions github-actions Bot left a comment


Review result: no blocking issues found.

Critical checkpoint conclusions:

  • Goal and proof: The PR targets JSONB path serialization/parsing for escaped control characters. The implementation covers escaping in path serialization and decoding during parsing, with both BE unit coverage and regression coverage for JSON_SEARCH plus JSON_EXTRACT reuse.
  • Scope and clarity: The change is small and focused on JsonbPath/leg_info escaping behavior and related tests.
  • Concurrency: No shared mutable state, locks, threads, or lifecycle-sensitive global state are introduced.
  • Configuration and compatibility: No configs or storage/protocol formats are added. Behavior changes are limited to user-visible JSON path string escaping; the parser keeps existing backslash escape behavior and adds decoding for generated control escapes.
  • Parallel paths: The affected JsonbPath::to_string path is used by JSON_SEARCH and related diagnostics; the parser side was updated for round-trip behavior.
  • Conditional checks: Escape handling conditions are localized and match the generated forms; no speculative defensive continuation issue found.
  • Test coverage: Added BE unit test for round-trip escaping and regression coverage for JSON_SEARCH outputs and JSON_EXTRACT use of a returned escaped path. Results appear deterministic.
  • Observability: No new observability is needed for this local serialization/parsing fix.
  • Transactions/persistence/data writes: Not applicable.
  • FE/BE variable passing: Not applicable.
  • Performance: The added per-character escaping is linear in key length and only applies during path string serialization/parsing; no concerning hot-path regression identified.

User focus: No additional user-provided review focus was supplied.
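
As an illustration of the parser side described in the checkpoints above, the following is a rough sketch of decoding the generated escapes in a single linear pass. It is not the parser from this PR: `unescape_path_member` and `hex_digit` are hypothetical names, rejecting unknown escapes is an assumption of this sketch, and `\uXXXX` is only decoded for values up to 0x7F so the result stays a single byte.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: value of one hex digit, or -1 if the char is not hex.
int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical decoder, not the Doris parser: reverses JSON-style escapes in
// a path member. Returns std::nullopt on a malformed escape (an assumption;
// the real parser reports failure through a boolean).
std::optional<std::string> unescape_path_member(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; }
        if (++i == in.size()) return std::nullopt;  // dangling backslash
        switch (in[i]) {
        case '"':  out += '"';  break;
        case '\\': out += '\\'; break;
        case 'b':  out += '\b'; break;
        case 'f':  out += '\f'; break;
        case 'n':  out += '\n'; break;
        case 'r':  out += '\r'; break;
        case 't':  out += '\t'; break;
        case 'u': {
            if (in.size() - i < 5) return std::nullopt;  // need 4 hex digits
            unsigned value = 0;
            for (int k = 1; k <= 4; ++k) {
                int d = hex_digit(in[i + k]);
                if (d < 0) return std::nullopt;
                value = value * 16 + static_cast<unsigned>(d);
            }
            if (value > 0x7F) return std::nullopt;  // sketch: single bytes only
            out += static_cast<char>(value);
            i += 4;
            break;
        }
        default:
            return std::nullopt;  // unknown escape (assumption of this sketch)
        }
    }
    return out;
}
```

Each input character is visited once, which matches the linear cost noted in the performance checkpoint.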

@mrhhsg
Member Author

mrhhsg commented May 22, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31174 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3727adef8e63244af0c76a060090befb5fe9ffed, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17715	3895	3892	3892
q2	q3	10764	1409	838	838
q4	4696	475	349	349
q5	7583	2295	2093	2093
q6	249	174	137	137
q7	950	793	650	650
q8	9366	1698	1660	1660
q9	5171	4915	4883	4883
q10	6397	2082	1779	1779
q11	441	278	245	245
q12	641	413	289	289
q13	18121	3494	2757	2757
q14	265	257	233	233
q15	q16	825	780	710	710
q17	953	962	978	962
q18	6915	5664	5553	5553
q19	1194	1258	1082	1082
q20	549	408	254	254
q21	5580	2593	2497	2497
q22	435	362	311	311
Total cold run time: 98810 ms
Total hot run time: 31174 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4181	4115	4114	4114
q2	q3	4519	4948	4349	4349
q4	2088	2208	1398	1398
q5	4408	4287	4284	4284
q6	229	174	130	130
q7	1771	2199	1702	1702
q8	2548	2234	2139	2139
q9	7957	8146	7825	7825
q10	4525	4524	4127	4127
q11	565	412	390	390
q12	718	726	517	517
q13	3324	3552	3036	3036
q14	295	325	274	274
q15	q16	725	726	653	653
q17	1350	1321	1307	1307
q18	8238	7252	7277	7252
q19	1200	1122	1097	1097
q20	2212	2202	1922	1922
q21	5276	4622	4469	4469
q22	534	456	414	414
Total cold run time: 56663 ms
Total hot run time: 51399 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168964 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3727adef8e63244af0c76a060090befb5fe9ffed, data reload: false

query5	4319	649	546	546
query6	353	222	207	207
query7	4232	581	303	303
query8	327	247	236	236
query9	8850	4005	3970	3970
query10	465	340	303	303
query11	5779	2422	2198	2198
query12	182	132	122	122
query13	1284	598	446	446
query14	5901	5362	5077	5077
query14_1	4344	4372	4387	4372
query15	216	207	185	185
query16	1018	464	435	435
query17	964	739	606	606
query18	2471	486	371	371
query19	221	204	184	184
query20	136	133	134	133
query21	211	140	121	121
query22	13547	13640	13323	13323
query23	17266	16419	16036	16036
query23_1	16187	16183	16216	16183
query24	7374	1739	1290	1290
query24_1	1288	1262	1292	1262
query25	560	467	401	401
query26	1304	343	171	171
query27	2712	548	340	340
query28	4451	1969	1923	1923
query29	1024	624	492	492
query30	304	240	200	200
query31	1098	1061	935	935
query32	87	74	72	72
query33	547	363	319	319
query34	1171	1108	629	629
query35	757	785	672	672
query36	1321	1346	1177	1177
query37	154	103	105	103
query38	3196	3173	3061	3061
query39	931	908	894	894
query39_1	875	877	888	877
query40	235	150	131	131
query41	68	64	63	63
query42	109	110	106	106
query43	315	327	278	278
query44	
query45	211	195	196	195
query46	1077	1148	704	704
query47	2334	2277	2229	2229
query48	428	417	294	294
query49	638	498	396	396
query50	962	344	257	257
query51	4278	4271	4224	4224
query52	107	103	95	95
query53	262	276	203	203
query54	324	275	266	266
query55	92	95	88	88
query56	298	312	295	295
query57	1412	1408	1288	1288
query58	297	271	293	271
query59	1582	1618	1424	1424
query60	323	315	301	301
query61	154	153	152	152
query62	658	621	564	564
query63	247	203	209	203
query64	2404	794	631	631
query65	
query66	1741	488	355	355
query67	30003	29911	29790	29790
query68	
query69	469	354	304	304
query70	1042	992	996	992
query71	309	269	272	269
query72	2977	2702	2401	2401
query73	852	746	432	432
query74	5035	4913	4737	4737
query75	2655	2564	2235	2235
query76	2294	1131	797	797
query77	398	407	340	340
query78	12126	12048	11612	11612
query79	1367	992	739	739
query80	632	517	456	456
query81	450	275	232	232
query82	623	172	119	119
query83	351	272	250	250
query84	311	137	108	108
query85	871	636	449	449
query86	380	332	343	332
query87	3399	3361	3189	3189
query88	3477	2655	2617	2617
query89	427	380	343	343
query90	1920	189	184	184
query91	177	171	144	144
query92	83	82	78	78
query93	1473	1428	839	839
query94	544	350	310	310
query95	680	374	347	347
query96	988	840	342	342
query97	2691	2686	2607	2607
query98	238	230	224	224
query99	1107	1103	997	997
Total cold run time: 251304 ms
Total hot run time: 168964 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 68.89% (62/90) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.97% (27279/37903)
Line Coverage 55.30% (291228/526612)
Region Coverage 52.19% (241865/463424)
Branch Coverage 53.59% (104452/194895)

Contributor

@HappenLee HappenLee left a comment


LGTM

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 25, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.


Labels

approved Indicates a PR has been approved by one committer. reviewed

3 participants