Bug: Name objects with non-UTF-8 byte sequences (Shift-JIS, GBK, etc.) cause garbled text when reading PDF #364

@Tocchann

Description

Expected Behavior

PDF Name objects (tokens starting with /) that contain byte sequences encoded in a legacy code page (Shift-JIS / CP932, GBK / CP936, Big5 / CP950, EUC-KR / CP949, etc.) should be decoded correctly and produce the original string.
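For illustration, a minimal standalone sketch of the expected decoding (not PDFsharp code), assuming CP932 has been made available through CodePagesEncodingProvider as described in the proposed fix below; the two Shift-JIS bytes 0x83 0x65 should come back as the single character "テ":

// Standalone sketch, not PDFsharp code.
using System;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);      // resolves legacy code pages on .NET (Core)
byte[] nameBytes = { 0x83, 0x65 };                                  // Shift-JIS "テ"
Console.WriteLine(Encoding.GetEncoding(932).GetString(nameBytes));  // prints "テ"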

Actual Behavior

ScanName() in Lexer.cs and CLexer.cs assumes every non-ASCII byte sequence is UTF-8 and never falls back to a legacy code page.
This causes two problems:

Problem 1: Non-ASCII detection is too narrow

The condition below only catches bytes in the range 0xC0–0xFF:

// Lexer.cs / CLexer.cs  (current upstream code)
if ((name[idx] & 0xC0) == 0xC0)

Shift-JIS bytes that fall in the range 0x80–0xBF (which includes the lead bytes 0x81–0x9F) are NOT detected.
For example, the Shift-JIS lead byte 0x83 satisfies (0x83 & 0xC0) == 0x80, not 0xC0, so the re-decode branch is never entered and the raw bytes are returned as-is (garbled).
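
The arithmetic is easy to verify in isolation. The following standalone sketch (not upstream code) contrasts the current mask with the widened check proposed further below:

using System;

byte b = 0x83;                            // Shift-JIS lead byte of "テ" (0x83 0x65)
bool upstreamCheck = (b & 0xC0) == 0xC0;  // false: 0x83 & 0xC0 == 0x80, so no re-decode happens
bool widenedCheck  = (b & 0x80) != 0;     // true:  any byte >= 0x80 is treated as non-ASCII
Console.WriteLine($"upstream: {upstreamCheck}, widened: {widenedCheck}");  // upstream: False, widened: True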

Problem 2: Unconditional UTF-8 decode silently corrupts non-UTF-8 sequences

var decodedName = Encoding.UTF8.GetString(bytes);  // silent replacement with U+FFFD

When the byte sequence is Shift-JIS or GBK, Encoding.UTF8.GetString() replaces invalid bytes with U+FFFD (replacement character) instead of throwing, causing silent data loss.
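
The replacement behavior is easy to demonstrate outside PDFsharp. This standalone sketch shows the silent U+FFFD substitution and how a strict decoder (as proposed below) surfaces the problem instead:

using System;
using System.Text;

byte[] sjis = { 0x83, 0x65 };                 // Shift-JIS "テ", not valid UTF-8

// Default decoder: invalid bytes are silently replaced with U+FFFD.
string lossy = Encoding.UTF8.GetString(sjis);
Console.WriteLine(lossy.Contains("\uFFFD"));  // True: the original bytes are lost

// Strict decoder: throwOnInvalidBytes = true makes GetString() throw instead of replacing.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
try
{
    strictUtf8.GetString(sjis);
}
catch (DecoderFallbackException)
{
    Console.WriteLine("invalid UTF-8 detected");  // reached for the Shift-JIS bytes
}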

Steps to Reproduce the Behavior

The following xUnit test (using the same test infrastructure as the existing LexerTests.cs) fails on the current upstream code:

// Helper (already exists in LexerTests.cs)
static Lexer CreateLexerFromBytes(byte[] bytes)
{
    var stream = new MemoryStream(bytes);
    return new Lexer(stream, null);
}

[Fact]
public void ScanName_ShiftJIS_lead_byte_0x83_should_be_re_decoded()
{
    // Shift-JIS "テ" is encoded as the two bytes 0x83 0x65.
    // The high byte 0x83 is in the range 0x80-0xBF.
    // Upstream check: (0x83 & 0xC0) == 0xC0 → FALSE → no re-decode → garbled output.
    byte[] bytes = [(byte)'/', 0x83, 0x65, (byte)' '];

    var lexer = CreateLexerFromBytes(bytes);
    var symbol = lexer.ScanName();

    symbol.Should().Be(Symbol.Name);
    // On upstream: Token is "/\x83\x65" (raw bytes, not decoded) — the assertion below fails.
    lexer.Token.Should().NotContain("\x83");
}

[Fact]
public void ScanName_ShiftJIS_name_roundtrips_correctly_on_CP932_system()
{
    // "日本語" in Shift-JIS: 0x93 0xFA 0x96 0xD1 0x8C 0xEA
    // Even though 0x93 triggers (0x93 & 0xC0)==0xC0 on upstream, the subsequent
    // Encoding.UTF8.GetString() corrupts the bytes because they are not valid UTF-8.
    byte[] sjisBytes = [0x93, 0xFA, 0x96, 0xD1, 0x8C, 0xEA];
    byte[] bytes = new byte[1 + sjisBytes.Length + 1];
    bytes[0] = (byte)'/';
    sjisBytes.CopyTo(bytes, 1);
    bytes[^1] = (byte)' ';

    var lexer = CreateLexerFromBytes(bytes);
    var symbol = lexer.ScanName();

    symbol.Should().Be(Symbol.Name);
    // On upstream: Token contains U+FFFD replacement characters (data loss).
    lexer.Token.Should().NotContain("\uFFFD");
}

Root Cause

In Lexer.cs (PdfSharp.Pdf.IO) and CLexer.cs (PdfSharp.Pdf.Content), the ScanName() method:

  1. Uses (name[idx] & 0xC0) == 0xC0 to detect non-ASCII bytes; this misses the 0x80–0xBF range, where many Shift-JIS lead and trail bytes fall.
  2. When the re-decode branch is entered, passes the raw bytes straight to Encoding.UTF8.GetString(), which silently replaces invalid byte sequences with U+FFFD.

Proposed Fix

  1. Widen the non-ASCII check to (name[idx] & 0x80) != 0 so that any byte ≥ 0x80 triggers re-decoding (covers both UTF-8 lead bytes and legacy-encoding bytes).

  2. Replace the unconditional UTF-8 decode with a strict UTF-8 decoder (configured with DecoderExceptionFallback) inside a try/catch, and fall back to the ANSI code page of the current culture when decoding fails:

// Cache once; DecoderExceptionFallback causes GetString() to throw on invalid sequences.
static readonly Encoding StrictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

// In ScanName():
string decodedName;
try
{
    decodedName = StrictUtf8.GetString(bytes);
}
catch (DecoderFallbackException)
{
    // Fall back to the ANSI code page of the current culture
    // (CP932 for ja-JP, CP936 for zh-CN, CP950 for zh-TW, etc.)
    decodedName = PdfEncoders.AnsiCodepageEncoding.GetString(bytes);
}
  3. Register CodePagesEncodingProvider once at startup on non-.NET-Framework targets so that Encoding.GetEncoding(codePage) resolves legacy code pages (requires the System.Text.Encoding.CodePages NuGet package for netstandard2.0).
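
A minimal sketch of that registration step, assuming the System.Text.Encoding.CodePages package is referenced (exactly where PDFsharp would place the call is an implementation detail):

using System;
using System.Globalization;
using System.Text;

#if !NETFRAMEWORK
// One-time registration so Encoding.GetEncoding(codePage) can resolve legacy code pages.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
#endif

// ANSI code page of the current culture: 932 on ja-JP, 936 on zh-CN, 950 on zh-TW, 949 on ko-KR, ...
int ansiCodePage = CultureInfo.CurrentCulture.TextInfo.ANSICodePage;
Encoding ansiEncoding = Encoding.GetEncoding(ansiCodePage);

byte[] sjisBytes = { 0x93, 0xFA };                     // first character of the test name above
Console.WriteLine(ansiEncoding.GetString(sjisBytes));  // "日" when the current culture is ja-JP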

A working implementation with unit tests is available in the fork:
https://github.com/Tocchann/PDFsharp (branch: master)

Key changed files:

  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.IO/Lexer.cs
  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Content/CLexer.cs
  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Internal/PdfEncoders.cs
  • src/foundation/src/PDFsharp/tests/PdfSharp.Tests/IO/LexerTests.cs
