[fix](be) Escape JSONB path member control characters#63517
Conversation
### What problem does this PR solve?
Issue Number: None
Problem Summary: JSONB path serialization emitted raw control characters in member names. This could make paths returned by JSON functions or shown in diagnostics contain raw newlines/tabs and fail path round-trip parsing. Escape JSONB path member names with JSON-compatible sequences and decode those sequences while parsing path members.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: ./run-be-ut.sh --run --filter=JsonbDocumentTest.jsonb_path_member_to_string_escapes_control_characters -j 16
- Format check: build-support/clang-format.sh; build-support/check-format.sh; git diff --check
- Static Analysis: attempted build-support/run-clang-tidy.sh --base d7d516f --build-dir be/ut_build_ASAN, blocked by existing clang-tidy/toolchain diagnostics outside the change
- Behavior changed: Yes. JSONB path member serialization now escapes control characters instead of emitting raw control bytes.
- Does this need documentation: No
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Review summary: I did not find blocking issues in this PR. The change is focused on JSONB path member escaping, and the serializer/parser round-trip for the newly emitted JSON-compatible control-character escapes is covered by the added unit test.
Critical checkpoint conclusions:
- Goal and tests: The code addresses the stated goal of escaping raw JSONB path member control characters and adds a targeted unit test for newline, tab, carriage return, quote, backslash, and a generic control byte round trip.
- Scope: The modification is small and localized to JSONB path parse/stringify plus one BE unit test.
- Concurrency/lifecycle: No new concurrency, shared state, static initialization dependency, or lifecycle ownership change was introduced.
- Configuration/compatibility: No new config, persistence format, FE/BE protocol, or rolling-upgrade compatibility concern was introduced.
- Parallel paths: The changed path formatter/parser is the shared JsonbPath path used by JSONB callers; I did not find another formatter needing the same control-character escape update.
- Error handling/data correctness: Existing boolean parse-failure behavior is preserved; decoded path legs remain length-aware for JSONB lookup and modification paths.
- Memory/performance: The added escaping work is proportional to member length and does not introduce untracked long-lived allocations beyond existing string construction.
- Observability: No additional observability appears necessary for this local parse/format behavior.
- Test result validation: I attempted to run Target system: Linux; Target arch: x86_64
Python 3.12.3
Check JAVA_HOME version
Apache Maven 3.9.15 (98b2cdbfdb5f1ac8781f537ea9acccaed7922349)
Maven home: /usr/share/apache-maven-3.9.15
Java version: 17.0.19, vendor: Eclipse Adoptium, runtime: /usr/lib/jvm/temurin-17-jdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "6.17.0-1013-azure", arch: "amd64", family: "unix"
cmake version 3.31.6
CMake suite maintained and supported by Kitware (kitware.com/cmake).
ninja 1.13.2
Get params:
PARALLEL -- 16
CLEAN -- 0
ENABLE_PCH -- ON
WITH_TDE_DIR --
Build Backend UT
Update apache-orc submodule ...
Update clucene submodule ...
-- Make program: /usr/local/bin/ninja
-- Use ccache: and
-- Extra cxx flags:
-- GLIBC_COMPATIBILITY is ON
-- USE_LIBCPP is OFF
-- USE_JEMALLOC is OFF
-- USE_UNWIND is ON
-- ENABLE_PCH is ON
-- USE_AVX2 is ON
-- Build type is ASAN_UT
-- Build target arch is x86_64
-- DORIS_HOME is /home/runner/work/doris/doris
-- THIRDPARTY_DIR is /home/runner/work/doris/doris/thirdparty/installed
-- make test: ON
-- make benchmark: OFF
-- build fs benchmark tool: OFF
-- build task executor simulator: OFF
-- build file cache lru tool: OFF
-- build gensrc if necessary
make: Entering directory '/home/runner/work/doris/doris/gensrc'
make -C script
make[1]: Entering directory '/home/runner/work/doris/doris/gensrc/script'
/home/runner/work/doris/doris/gensrc/script/gen_build_version.sh
make[1]: Leaving directory '/home/runner/work/doris/doris/gensrc/script'
make -C proto
make[1]: Entering directory '/home/runner/work/doris/doris/gensrc/proto'
/home/runner/work/doris/doris/thirdparty/installed/bin/protoc --proto_path=/home/runner/work/doris/doris/gensrc/proto --cpp_out=/home/runner/work/doris/doris/gensrc/proto/../build//gen_cpp /home/runner/work/doris/doris/gensrc/proto/file_cache.proto
make[1]: Leaving directory '/home/runner/work/doris/doris/gensrc/proto'
make: Leaving directory '/home/runner/work/doris/doris/gensrc'
-- Configuring incomplete, errors occurred!, but the runner is missing , so the BE UT configure step failed before compiling tests.
User focus: No additional user-provided review focus was present.
There was a problem hiding this comment.
Review summary: I did not find blocking issues in this PR. The change is focused on JSONB path member escaping, and the serializer/parser round-trip for the newly emitted JSON-compatible control-character escapes is covered by the added unit test.
Critical checkpoint conclusions:
- Goal and tests: The code addresses the stated goal of escaping raw JSONB path member control characters and adds a targeted unit test for newline, tab, carriage return, quote, backslash, and a generic control byte round trip.
- Scope: The modification is small and localized to JSONB path parse/stringify plus one BE unit test.
- Concurrency/lifecycle: No new concurrency, shared state, static initialization dependency, or lifecycle ownership change was introduced.
- Configuration/compatibility: No new config, persistence format, FE/BE protocol, or rolling-upgrade compatibility concern was introduced.
- Parallel paths: The changed path formatter/parser is the shared JsonbPath path used by JSONB callers; I did not find another formatter needing the same control-character escape update.
- Error handling/data correctness: Existing boolean parse-failure behavior is preserved; decoded path legs remain length-aware for JSONB lookup and modification paths.
- Memory/performance: The added escaping work is proportional to member length and does not introduce untracked long-lived allocations beyond existing string construction.
- Observability: No additional observability appears necessary for this local parse/format behavior.
- Test result validation: I attempted to run ./run-be-ut.sh --run --filter=JsonbDocumentTest.jsonb_path_member_to_string_escapes_control_characters -j 16, but the runner is missing thirdparty/installed/bin/protoc, so the BE UT configure step failed before compiling tests.
User focus: No additional user-provided review focus was present.
### What problem does this PR solve?
Issue Number: None
Problem Summary: Add regression coverage for JSON_SEARCH paths containing escaped control characters and verify a returned escaped path can be used by JSON_EXTRACT.
### Release note
None
### Check List (For Author)
- Test:
- Regression test: doris-local-regression --offset 10000 --network 10.26.20.3/24 run -d query_p0/sql_functions/json_functions -s test_json_search -forceGenOut
- Regression test: doris-local-regression --offset 10000 --network 10.26.20.3/24 run -d query_p0/sql_functions/json_functions -s test_json_search
- Manual check: git diff --check
- Behavior changed: No
- Does this need documentation: No
|
Added regression coverage for JSON_SEARCH path escaping and refreshed json_search.out. Validation:
|
|
/review |
There was a problem hiding this comment.
Review result: no blocking issues found.
Critical checkpoint conclusions:
- Goal and proof: The PR targets JSONB path serialization/parsing for escaped control characters. The implementation covers escaping in path serialization and decoding during parsing, with both BE unit coverage and regression coverage for JSON_SEARCH plus JSON_EXTRACT reuse.
- Scope and clarity: The change is small and focused on JsonbPath/leg_info escaping behavior and related tests.
- Concurrency: No shared mutable state, locks, threads, or lifecycle-sensitive global state are introduced.
- Configuration and compatibility: No configs or storage/protocol formats are added. Behavior changes are limited to user-visible JSON path string escaping; the parser keeps existing backslash escape behavior and adds decoding for generated control escapes.
- Parallel paths: The affected JsonbPath::to_string path is used by JSON_SEARCH and related diagnostics; the parser side was updated for round-trip behavior.
- Conditional checks: Escape handling conditions are localized and match the generated forms; no speculative defensive continuation issue found.
- Test coverage: Added BE unit test for round-trip escaping and regression coverage for JSON_SEARCH outputs and JSON_EXTRACT use of a returned escaped path. Results appear deterministic.
- Observability: No new observability is needed for this local serialization/parsing fix.
- Transactions/persistence/data writes: Not applicable.
- FE/BE variable passing: Not applicable.
- Performance: The added per-character escaping is linear in key length and only applies during path string serialization/parsing; no concerning hot-path regression identified.
User focus: No additional user-provided review focus was supplied.
|
run buildall |
TPC-H: Total hot run time: 31174 ms |
TPC-DS: Total hot run time: 168964 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
What problem does this PR solve?
Issue Number: None
Problem Summary: JSONB path serialization emitted raw control characters in member names. This could make paths returned by JSON functions or shown in diagnostics contain raw newlines/tabs and fail path round-trip parsing. Escape JSONB path member names with JSON-compatible sequences and decode those sequences while parsing path members.
Release note
None
Check List (For Author)
./run-be-ut.sh --run --filter=JsonbDocumentTest.jsonb_path_member_to_string_escapes_control_characters -j 16build-support/clang-format.sh;build-support/check-format.sh;git diff --checkbuild-support/run-clang-tidy.sh --base d7d516ff7b60e2ff6971e7d18cd805478595b5c2 --build-dir be/ut_build_ASAN, blocked by existing clang-tidy/toolchain diagnostics outside the change