fix(parser): sanitize lone UTF-16 surrogates before JSONL parsing (fixes #85) by delexw · Pull Request #88 · delexw/claude-code-trace

delexw · 2026-05-16T23:48:50Z

Problem

JSONL files written by Claude Code before v2.1.132 may contain lone UTF-16 surrogate code units (e.g. \uD83D without a matching \uDCxx low surrogate) inside tool_result content block strings. This happened when Claude Code's tool-error truncation logic cut a string at an offset that fell between the two code units of an emoji's surrogate pair.
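
To make the failure mode concrete, here is a minimal sketch of how a truncation that counts UTF-16 code units can leave a lone high surrogate behind. The helper names `naive_truncate_utf16` and `to_json_escapes` are hypothetical and are not taken from Claude Code or from this PR; they only illustrate the mechanism.

```rust
/// Hypothetical truncation that keeps at most `max_units` UTF-16 code units
/// without checking whether the cut falls inside a surrogate pair.
fn naive_truncate_utf16(s: &str, max_units: usize) -> Vec<u16> {
    s.encode_utf16().take(max_units).collect()
}

/// Renders every code unit as a `\uXXXX` escape, similar to what ends up
/// in the JSONL text for characters outside ASCII.
fn to_json_escapes(units: &[u16]) -> String {
    units.iter().map(|u| format!("\\u{:04X}", u)).collect()
}
```

The dog emoji U+1F436 encodes as the pair `\uD83D\uDC36`, so truncating it to one code unit leaves only `\uD83D`, which is the example used in this PR.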

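serde_json rejects lone surrogates when decoding JSON strings (RFC 8259 permits such escapes in its grammar but warns that software handling them behaves unpredictably), causing parse_entry to silently discard any JSONL line that contains one.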
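
The underlying constraint can be shown with the standard library alone: a Rust `String` must be valid UTF-8, and a surrogate code point is not a Unicode scalar value, so an unpaired surrogate has no representation as a `char` or inside a `String`. The sketch below illustrates that constraint only; it is not serde_json's internal code, and the helper names are hypothetical.

```rust
/// Returns true if `unit` falls in the UTF-16 surrogate range U+D800..=U+DFFF.
fn is_surrogate(unit: u16) -> bool {
    (0xD800..=0xDFFF).contains(&unit)
}

/// Decodes UTF-16 code units into a `String`, or returns `None` if the
/// input contains an unpaired surrogate.
fn decode_utf16_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

`decode_utf16_strict(&[0xD83D])` yields `None`, while the full pair `[0xD83D, 0xDC36]` decodes to the emoji.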

Fix

Add a sanitize_lone_surrogates() pre-pass that runs on the raw JSONL text before handing it to serde_json::from_str. It scans for \uXXXX escape sequences in the surrogate range (U+D800–U+DFFF) and:

- Lone high surrogate (\uD800–\uDBFF not immediately followed by a \uDC00–\uDFFF escape): replaced with � (U+FFFD)
- Lone low surrogate (\uDC00–\uDFFF not immediately preceded by a high surrogate): replaced with � (U+FFFD)
- Valid surrogate pair (\uD800–\uDBFF immediately followed by \uDC00–\uDFFF): preserved as-is

Allocation is deferred: strings containing no surrogates return Cow::Borrowed with zero copies.
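
The pre-pass described above could look roughly like the following. This is a sketch under stated assumptions, not the code merged in this PR: the function names match those listed under Files Changed, but the signatures, the `escape_at` helper, the choice to write the literal U+FFFD character (rather than a `\uFFFD` escape), and the handling of escaped backslashes are all assumptions.

```rust
use std::borrow::Cow;

/// Parses four ASCII hex digits starting at `bytes[at]` into a u16.
fn hex4_to_u16(bytes: &[u8], at: usize) -> Option<u16> {
    let digits = bytes.get(at..at + 4)?;
    let mut value: u16 = 0;
    for &b in digits {
        let d = (b as char).to_digit(16)? as u16;
        value = (value << 4) | d;
    }
    Some(value)
}

/// Hypothetical helper: if `bytes[at..]` starts with `\uXXXX`, returns the code unit.
fn escape_at(bytes: &[u8], at: usize) -> Option<u16> {
    if bytes.get(at) == Some(&b'\\') && bytes.get(at + 1) == Some(&b'u') {
        hex4_to_u16(bytes, at + 2)
    } else {
        None
    }
}

/// Replaces lone surrogate escapes with U+FFFD; borrows the input if nothing changes.
fn sanitize_lone_surrogates(input: &str) -> Cow<'_, str> {
    let bytes = input.as_bytes();
    let mut out: Option<String> = None; // allocated on the first replacement only
    let mut copied = 0; // input[..copied] has already been written to `out`
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] != b'\\' {
            i += 1;
            continue;
        }
        // Decide whether the escape at `i` is a lone surrogate and how far to advance.
        let (lone, len) = match escape_at(bytes, i) {
            Some(0xD800..=0xDBFF) => {
                // A high surrogate is valid only if a low surrogate escape follows directly.
                if matches!(escape_at(bytes, i + 6), Some(0xDC00..=0xDFFF)) {
                    (false, 12)
                } else {
                    (true, 6)
                }
            }
            // A low surrogate reached here was not consumed as part of a pair.
            Some(0xDC00..=0xDFFF) => (true, 6),
            Some(_) => (false, 6),
            // Any other escape (such as `\\` or `\n`): skip the backslash and next byte.
            None => (false, 2),
        };
        if lone {
            let buf = out.get_or_insert_with(String::new);
            buf.push_str(&input[copied..i]);
            buf.push('\u{FFFD}');
            copied = i + len;
        }
        i += len;
    }
    match out {
        Some(mut buf) => {
            buf.push_str(&input[copied..]);
            Cow::Owned(buf)
        }
        None => Cow::Borrowed(input),
    }
}
```

Skipping two bytes on a non-`\u` escape keeps an escaped backslash such as `\\uD83D` (a literal backslash followed by the text `uD83D`) from being mistaken for a surrogate escape. Returning `Cow` lets callers treat the borrowed and owned cases uniformly as `&str`.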

parse_entry is updated to convert the raw bytes to &str first (failing fast on non-UTF-8, which serde_json::from_slice would also reject) and then apply the sanitizer before deserialization.
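
The order of operations can be sketched as below. To keep the example free of external crates, the sanitizer and the deserializer are passed in as closures; in the PR these roles are played by sanitize_lone_surrogates and serde_json::from_str inside parse_entry. The name `parse_line` and its signature are hypothetical.

```rust
use std::borrow::Cow;

/// Hypothetical pipeline: bytes -> &str (fail fast) -> sanitize -> deserialize.
fn parse_line<T, S, D>(raw: &[u8], sanitize: S, deserialize: D) -> Option<T>
where
    S: for<'a> Fn(&'a str) -> Cow<'a, str>,
    D: Fn(&str) -> Option<T>,
{
    // Reject non-UTF-8 input before doing any other work.
    let text = std::str::from_utf8(raw).ok()?;
    // The sanitizer may borrow `text` unchanged or return a repaired copy.
    let clean = sanitize(text);
    // The deserializer only ever sees the sanitized text.
    deserialize(&*clean)
}
```

Because the sanitizer returns a `Cow`, the common case of a line without surrogates reaches the deserializer without an extra allocation.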

Files Changed

src-tauri/src/parser/entry.rs — added hex4_to_u16, sanitize_lone_surrogates, updated parse_entry

Tests Added

10 new tests in parser::entry::tests, including:

| Test | Verifies |
| --- | --- |
| `sanitize_lone_surrogates_no_surrogates_returns_borrowed` | zero-copy fast path |
| `sanitize_lone_high_surrogate_replaced_with_fffd` | lone `\uD83D` → `�` |
| `sanitize_lone_low_surrogate_replaced_with_fffd` | lone `\uDC36` → `�` |
| `sanitize_valid_surrogate_pair_unchanged` | valid pair preserved |
| `sanitize_multiple_lone_surrogates_all_replaced` | two lone surrogates, both replaced |
| `sanitize_high_surrogate_at_end_of_string_replaced` | lone surrogate at string boundary |
| `parse_entry_with_lone_high_surrogate_succeeds` | end-to-end: line parses instead of returning `None` |
| `parse_entry_with_lone_low_surrogate_succeeds` | end-to-end: lone low surrogate |
| `parse_entry_with_valid_surrogate_pair_succeeds` | end-to-end: valid pair still parses |

All 390 Rust tests and 352 frontend tests pass.

Fixes #85

JSONL files written by Claude Code before v2.1.132 may contain lone UTF-16 surrogate code units (e.g. `\uD83D` without a matching low surrogate) when the tool-error truncation logic split an emoji's surrogate pair at an offset boundary. serde_json rejects lone surrogates when decoding JSON strings (RFC 8259 permits them in its grammar but warns that handling them is unpredictable), causing parse_entry to silently discard those lines.

Add sanitize_lone_surrogates() which scans the raw JSONL string for `\uXXXX` escape sequences in the surrogate range (U+D800-U+DFFF) and replaces lone surrogates with `�` before the JSON deserializer sees them. Valid surrogate pairs (`\uD800`-`\uDBFF` followed immediately by `\uDC00`-`\uDFFF`) are preserved unchanged. Allocation is deferred: strings with no surrogates return Cow::Borrowed with zero copies.

Update parse_entry to convert the input bytes to str (failing fast on non-UTF-8) and apply the sanitizer before serde_json::from_str.

Closes #85

delexw force-pushed the fix-issue-85 branch from bd30fdb to 94a6d52 Compare May 17, 2026 01:16

delexw merged commit 833e266 into main May 17, 2026
1 check failed

delexw deleted the fix-issue-85 branch May 17, 2026 01:21

delexw mentioned this pull request May 17, 2026

fix(parser): add explicit lifetime to Cow return type (fixes clippy lint) #89

Merged

delexw merged 1 commit into main from fix-issue-85