
Fix UTF-8 decoder rejecting valid U+0800 three-byte sequence #664

Merged
swebb2066 merged 1 commit into apache:master from metsw24-max:utf8-u0800-boundary-check
May 14, 2026

Conversation

@metsw24-max
Contributor

This PR fixes an off-by-one error in the UTF-8 three-byte decoding validation logic.

Problem

Transcoder::decode incorrectly rejected the canonical UTF-8 encoding of U+0800.

The three-byte overlong validation check used:

if (rv <= 0x800)

which treated 0x0800 itself as invalid, even though it is the smallest code point that requires a three-byte UTF-8 sequence (two-byte sequences can encode at most U+07FF).

As a result, valid UTF-8 input containing the bytes:

E0 A0 80

(the canonical encoding of U+0800) was decoded as 0xFFFF, causing the caller to substitute Transcoder::LOSSCHAR and silently corrupt the decoded output.
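The bit arithmetic behind that byte sequence can be checked with a small sketch. The helper name combineThreeBytes is hypothetical and is not part of the Transcoder API; it only combines the payload bits of a 1110xxxx 10yyyyyy 10zzzzzz sequence and performs no validation.

```cpp
#include <cstdint>

// Hypothetical helper (not the log4cxx API): combine the payload bits of a
// three-byte UTF-8 sequence 1110xxxx 10yyyyyy 10zzzzzz into a code point.
// No validation is done here; this only illustrates the arithmetic.
constexpr std::uint32_t combineThreeBytes(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2)
{
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)
         |  static_cast<std::uint32_t>(b2 & 0x3F);
}

// E0 A0 80: (0x0 << 12) | (0x20 << 6) | 0x00 == 0x800
static_assert(combineThreeBytes(0xE0, 0xA0, 0x80) == 0x0800,
              "E0 A0 80 carries the value 0x800");

// A two-byte sequence has 5 + 6 = 11 payload bits, so its maximum is 0x7FF.
// 0x800 is therefore the first value that needs three bytes.
static_assert(((0x1Fu << 6) | 0x3Fu) == 0x07FF,
              "two-byte sequences top out at 0x7FF");
```

Because 0x7FF is the largest value a two-byte sequence can carry, a three-byte sequence is overlong only when its value is strictly below 0x800.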

Fix

Change the validation condition to:

if (rv < 0x800)

This preserves rejection of true overlong encodings while correctly accepting U+0800.
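As an illustration of the corrected boundary, here is a minimal standalone sketch of a three-byte decoder. It is not the actual Transcoder::decode implementation: the function name decodeThreeByte and the constant kInvalid are assumptions, the 0xFFFF error value is taken from the description above, and surrogate-range checks are deliberately omitted.

```cpp
#include <cstddef>
#include <cstdint>

// Error value for a malformed sequence; the PR description states that the
// decoder yields 0xFFFF in this case. The constant name is hypothetical.
constexpr std::uint32_t kInvalid = 0xFFFF;

// Hypothetical standalone decoder for three-byte sequences only.
std::uint32_t decodeThreeByte(const unsigned char* p, std::size_t len)
{
    // Need a 1110xxxx lead byte followed by two 10xxxxxx continuation bytes.
    if (len < 3
        || (p[0] & 0xF0) != 0xE0
        || (p[1] & 0xC0) != 0x80
        || (p[2] & 0xC0) != 0x80)
    {
        return kInvalid;
    }

    const std::uint32_t rv = ((p[0] & 0x0Fu) << 12)
                           | ((p[1] & 0x3Fu) << 6)
                           |  (p[2] & 0x3Fu);

    // Overlong check: values below 0x800 fit in one or two bytes.
    // The pre-fix condition `rv <= 0x800` also rejected U+0800 itself.
    if (rv < 0x800)
    {
        return kInvalid;
    }
    return rv;
}
```

With this condition, E0 A0 80 decodes to 0x0800, while the overlong sequence E0 9F BF (value 0x7FF) is still rejected.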

Tests

Added a regression test, testDecodeUTF8_U0800, which:

  • decodes the literal UTF-8 byte sequence E0 A0 80
  • verifies the decoded output matches Transcoder::encode(0x0800, …)
  • asserts that no LOSSCHAR is introduced
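The shape of such a regression check might look like the sketch below. It does not use the real Transcoder API or the project's test framework: decodeUtf8, kLossChar, and decodesU0800WithoutLoss are hypothetical stand-ins, the loss character is assumed to be '?', and the decoder handles only ASCII and three-byte sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed substitution character; a stand-in for Transcoder::LOSSCHAR.
constexpr char32_t kLossChar = U'?';

// Hypothetical decoder: ASCII passes through, three-byte sequences are
// decoded, and anything else is replaced by kLossChar one byte at a time.
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80)
        {
            out.push_back(b0);
            ++i;
            continue;
        }
        if ((b0 & 0xF0) == 0xE0 && i + 2 < in.size())
        {
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(in[i + 2]);
            if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80)
            {
                const std::uint32_t cp = ((b0 & 0x0Fu) << 12)
                                       | ((b1 & 0x3Fu) << 6)
                                       |  (b2 & 0x3Fu);
                if (cp >= 0x800) // accept U+0800; reject only true overlongs
                {
                    out.push_back(static_cast<char32_t>(cp));
                    i += 3;
                    continue;
                }
            }
        }
        out.push_back(kLossChar);
        ++i;
    }
    return out;
}

// Mirrors the three checks listed above: decode E0 A0 80, compare with the
// expected code point, and confirm that no loss character was introduced.
bool decodesU0800WithoutLoss()
{
    std::string input;
    input.push_back(static_cast<char>(0xE0));
    input.push_back(static_cast<char>(0xA0));
    input.push_back(static_cast<char>(0x80));

    const std::u32string expected(1, static_cast<char32_t>(0x0800));
    const std::u32string actual = decodeUtf8(input);
    return actual == expected
        && actual.find(kLossChar) == std::u32string::npos;
}
```

Comparing against an independently built expected value, rather than only checking for the absence of the loss character, also guards against a decoder that accepts the bytes but produces the wrong code point.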

@swebb2066 merged commit bb2563c into apache:master on May 14, 2026
17 checks passed