@Maaarcocr (Contributor) commented Nov 12, 2025

Implements RFC 8259 § 7 compliant handling of UTF-16 surrogate pairs. Non-BMP characters (U+10000 to U+10FFFF) encoded as surrogate pairs like \uD834\uDD1E are now correctly decoded to their Unicode code points.

The parser now:

  • Detects high surrogates (0xD800-0xDBFF) and looks ahead for a low surrogate (0xDC00-0xDFFF)
  • Combines pairs using the formula: ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  • Rejects unpaired surrogates with descriptive error messages
  • Maintains backward compatibility for normal \uXXXX escapes
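
For readers who want to check the formula by hand, here is a minimal sketch of the combination step. `combine_surrogates` is a hypothetical helper written for illustration; it is not the code in src/string.rs and assumes both code units have already been parsed from their \uXXXX escapes.

```rust
/// Hypothetical helper for illustration only (not the PR's implementation).
/// Combines a UTF-16 high/low surrogate pair into a single Unicode scalar value.
/// Returns `None` if either code unit falls outside its surrogate range.
fn combine_surrogates(high: u32, low: u32) -> Option<char> {
    let is_high = (0xD800..=0xDBFF).contains(&high);
    let is_low = (0xDC00..=0xDFFF).contains(&low);
    if !(is_high && is_low) {
        return None;
    }
    // Worked example for \uD834\uDD1E:
    //   (0xD834 - 0xD800) * 0x400 = 0xD000
    //   (0xDD1E - 0xDC00)         = 0x011E
    //   0xD000 + 0x011E + 0x10000 = 0x1D11E
    let code_point = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    char::from_u32(code_point)
}
```

Under these assumptions, `combine_surrogates(0xD834, 0xDD1E)` yields `Some('\u{1D11E}')`, and swapping the two arguments yields `None`.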

Added comprehensive tests covering valid pairs, multiple pairs, mixed content, and all error cases for unpaired surrogates.
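
To make the valid and error cases concrete, the following is a simplified sketch of the look-ahead and validation flow. Everything here is an assumption made for illustration: `decode_unicode_escapes`, `read_hex4`, and the error strings are hypothetical and are not taken from the PR. The sketch handles only \uXXXX escapes and copies every other character through unchanged, so it is not a full JSON string decoder.

```rust
/// Hypothetical helper: reads four hex digits starting at `start`.
fn read_hex4(chars: &[char], start: usize) -> Result<u32, String> {
    let digits = chars
        .get(start..start + 4)
        .ok_or_else(|| "truncated unicode escape".to_string())?;
    let mut value = 0;
    for c in digits {
        let d = c
            .to_digit(16)
            .ok_or_else(|| format!("invalid hex digit '{}'", c))?;
        value = value * 16 + d;
    }
    Ok(value)
}

/// Hypothetical sketch (not the PR's code): decodes `\uXXXX` escapes,
/// combining surrogate pairs and rejecting unpaired surrogates.
fn decode_unicode_escapes(input: &str) -> Result<String, String> {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::new();
    let mut i = 0;
    while i < chars.len() {
        let is_escape = chars[i] == '\\' && chars.get(i + 1) == Some(&'u');
        if !is_escape {
            out.push(chars[i]);
            i += 1;
            continue;
        }
        let first = read_hex4(&chars, i + 2)?;
        i += 6; // skip the backslash, the 'u', and four hex digits
        match first {
            0xD800..=0xDBFF => {
                // High surrogate: the next escape must be a low surrogate.
                let has_next = chars.get(i) == Some(&'\\') && chars.get(i + 1) == Some(&'u');
                if !has_next {
                    return Err(format!("unpaired high surrogate U+{:04X}", first));
                }
                let second = read_hex4(&chars, i + 2)?;
                if !(0xDC00..=0xDFFF).contains(&second) {
                    return Err(format!("unpaired high surrogate U+{:04X}", first));
                }
                i += 6;
                let cp = ((first - 0xD800) * 0x400) + (second - 0xDC00) + 0x10000;
                let ch = char::from_u32(cp).ok_or_else(|| "invalid code point".to_string())?;
                out.push(ch);
            }
            0xDC00..=0xDFFF => {
                // Low surrogate with no preceding high surrogate.
                return Err(format!("unpaired low surrogate U+{:04X}", first));
            }
            _ => {
                // Ordinary BMP escape: decoded on its own, as before.
                let ch = char::from_u32(first).ok_or_else(|| "invalid code point".to_string())?;
                out.push(ch);
            }
        }
    }
    Ok(out)
}
```

In this sketch, an input holding the escapes \uD834\uDD1E decodes to the single character U+1D11E, while a lone \uD834, a lone \uDD1E, or \uD834 followed by \u0041 each return an `Err`.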

fixes #31

@dsherret requested a review from Copilot on November 12, 2025 15:13
Copilot finished reviewing on behalf of dsherret November 12, 2025 15:15
Copilot AI left a comment
Pull Request Overview

This PR implements RFC 8259 § 7 compliant UTF-16 surrogate pair handling, enabling the parser to correctly decode non-BMP Unicode characters (U+10000 to U+10FFFF) that are encoded as surrogate pairs in JSON escape sequences.

  • Adds detection and combination logic for UTF-16 surrogate pairs (high: 0xD800-0xDBFF, low: 0xDC00-0xDFFF)
  • Implements validation for all edge cases with descriptive error messages for unpaired surrogates
  • Maintains backward compatibility for standard \uXXXX Unicode escapes

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/string.rs | Core implementation of surrogate pair detection, validation, and decoding using the RFC 8259 formula |
| src/scanner.rs | Updated error message for unpaired low surrogate to reflect new validation |
| src/parse_to_value.rs | Comprehensive test suite covering valid pairs, multiple pairs, mixed content, and error cases |

@dsherret (Member) left a comment

Thanks a lot!

@dsherret merged commit d9e9430 into dprint:main on Nov 12, 2025
9 checks passed
Development

Successfully merging this pull request may close these issues.

JSONC parser fails to correctly parse non-BMP escape sequences