branch-4.0: [fix](fe) Reject lone UTF-16 surrogates in JSONB literals (RFC 8259 §8.2) #63255 #63346

Open
github-actions[bot] wants to merge 1 commit into branch-4.0 from auto-pick-63255-branch-4.0

@github-actions
Contributor

Cherry-picked from #63255

…8.2) (#63255)

## Summary

**Problem fixed:** `JsonLiteral` (Nereids/Jackson path) and
`analysis.JsonLiteral` (legacy/Gson path) silently accepted lone UTF-16
surrogates (e.g. `'"\uD800"'::JSONB`) as valid JSONB literals. RFC 8259
§8.2 warns that unpaired surrogates in JSON strings cannot encode Unicode
characters, so they have no valid UTF-8 representation and software that
receives them behaves unpredictably.
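
As a rough illustration of why a lone surrogate causes trouble downstream, the sketch below uses only the JDK's standard charset API. The class name `LoneSurrogateUtf8Demo` and the helper `encodableAsUtf8` are hypothetical and are not part of Doris; the strings are the already-decoded values of the JSON escapes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Doris.
class LoneSurrogateUtf8Demo {
    // Returns true if the string can be encoded as UTF-8 without substitution.
    static boolean encodableAsUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String lone = "\uD800";        // unpaired high surrogate
        String pair = "\uD83D\uDE00";  // valid surrogate pair (U+1F600)

        System.out.println(encodableAsUtf8(lone)); // false
        System.out.println(encodableAsUtf8(pair)); // true

        // String.getBytes does not fail; it substitutes the charset's
        // replacement byte ('?', 0x3F) for the lone surrogate, so the
        // original value is lost without any error being raised.
        byte[] bytes = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length == 1 && bytes[0] == 0x3F);
    }
}
```

The silent substitution in `getBytes` is one plausible reason a bad value would only be noticed late, when another system reads the data back.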

**How it was fixed:** Added a recursive `validateNoLoneSurrogate`
post-parse check in both `JsonLiteral` constructors. After Jackson/Gson
parses the JSON tree, the method walks all string nodes and immediately
throws `AnalysisException` on any lone high or low surrogate.
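
The post-parse walk described above might look roughly like the following. This is a minimal sketch, not the code from the PR: it walks a plain `Map`/`List`/`String` tree as a stand-in for the Jackson or Gson node types, and `InvalidJsonbLiteralException` is a hypothetical stand-in for Doris's `AnalysisException`. Checking object keys as well as values is an assumption; only the method name `validateNoLoneSurrogate` and the error messages come from the description.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Doris implementation.
class LoneSurrogateCheck {
    // Stand-in for Doris's AnalysisException (assumption for illustration).
    static class InvalidJsonbLiteralException extends RuntimeException {
        InvalidJsonbLiteralException(String msg) {
            super("Invalid jsonb literal: " + msg);
        }
    }

    // Scans one decoded string and throws on the first unpaired surrogate.
    static void validateString(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip its low half
                } else {
                    throw new InvalidJsonbLiteralException(
                            "JSON string contains lone high surrogate");
                }
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate not consumed by a preceding high one is unpaired.
                throw new InvalidJsonbLiteralException(
                        "JSON string contains lone low surrogate");
            }
        }
    }

    // Recursively walks the parsed tree and validates every string node.
    static void validateNoLoneSurrogate(Object node) {
        if (node instanceof String) {
            validateString((String) node);
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                validateNoLoneSurrogate(child);
            }
        } else if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                validateNoLoneSurrogate(e.getKey());   // keys are strings too (assumed checked)
                validateNoLoneSurrogate(e.getValue());
            }
        }
        // Numbers, booleans, and null carry no string data to check.
    }
}
```

Running the check after parsing, rather than on the raw literal text, means it sees the decoded characters, so a `\uD83D\uDE00` escape pair and a literal emoji are treated the same way.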

## What problem does this PR solve?

**Before this fix:** Passing a lone surrogate like `'"\uD800"'::JSONB`
was silently accepted at the FE layer. The invalid value would be stored
in the BE JSONB column. The error would only surface later — during
`EXPORT`, `SELECT INTO OUTFILE`, or cross-system transfer — making it
hard to diagnose. This is a data-correctness (SEV-2) issue.

**After this fix:** Constructing a `JsonLiteral` with a lone surrogate
immediately throws `AnalysisException: Invalid jsonb literal: JSON
string contains lone high surrogate` (or `lone low surrogate`), giving
the user a clear error at write time.

## Behavior change

| Scenario | Before | After |
|---|---|---|
| `'"\uD800"'::JSONB` | Accepted silently | AnalysisException at parse time |
| `INSERT INTO t VALUES (1, '"\uD800"')` | Stored in BE, may fail on export | AnalysisException at FE |
| `'"\uD83D\uDE00"'::JSONB` (valid pair 😀) | Accepted | Still accepted (no change) |
| `'"hello"'::JSONB` (plain ASCII) | Accepted | Still accepted (no change) |

## Why both paths?

Doris has two `JsonLiteral` implementations:
- **Nereids** (`fe-core`): uses Jackson `ObjectMapper.readTree` —
Jackson accepts lone surrogates by default
- **Legacy** (`fe-catalog`, `analysis`): uses Gson `JsonParser.parse` —
Gson also accepts lone surrogates by default

Both needed the same fix to ensure consistent behavior regardless of
which query path is used.

## Release note

JSONB literal expressions now reject strings containing lone UTF-16
surrogates (e.g. `'"\uD800"'::JSONB`) with an `AnalysisException` at
parse time, conforming to RFC 8259 §8.2. Previously such literals were
silently accepted, which could cause errors during export or
cross-system data transfer.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hello-stephen
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 48.89% (22/45) 🎉
Increment coverage report
Complete coverage report
