Expected Behavior
PDF Name objects (tokens starting with /) that contain byte sequences encoded in a legacy code page (Shift-JIS / CP932, GBK / CP936, Big5 / CP950, EUC-KR / CP949, etc.) should be decoded correctly and produce the original string.
Actual Behavior
ScanName() in Lexer.cs and CLexer.cs unconditionally decodes all non-ASCII byte sequences as UTF-8.
This causes two problems:
Problem 1: Non-ASCII detection is too narrow
The condition below only catches bytes in the range 0xC0–0xFF:
```csharp
// Lexer.cs / CLexer.cs (current upstream code)
if ((name[idx] & 0xC0) == 0xC0)
```
Shift-JIS lead bytes in the range 0x81–0x9F (and any byte below 0xC0) are NOT detected.
For example, the Shift-JIS lead byte 0x83 satisfies (0x83 & 0xC0) == 0x80, not 0xC0, so the re-decode branch is never entered and the raw bytes are returned as-is (garbled).
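To make the mask behavior concrete, here is a small standalone sketch (not PDFsharp code) that classifies a few probe bytes with the upstream mask and with the widened check proposed further down:

```csharp
using System;

class MaskDemo
{
    static void Main()
    {
        // 0x41 = ASCII 'A', 0x83 = Shift-JIS lead byte of "テ",
        // 0xBF = top of the missed range, 0xC3/0xE3 = UTF-8 lead bytes.
        byte[] probes = { 0x41, 0x83, 0xBF, 0xC3, 0xE3 };
        foreach (byte b in probes)
        {
            bool upstream = (b & 0xC0) == 0xC0; // catches only 0xC0-0xFF
            bool widened  = (b & 0x80) != 0;    // catches any byte >= 0x80
            Console.WriteLine($"0x{b:X2}: upstream={upstream}, widened={widened}");
        }
        // 0x83 and 0xBF: upstream=False but widened=True, i.e. the upstream
        // mask never triggers the re-decode branch for these bytes.
    }
}
```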
Problem 2: Unconditional UTF-8 decode silently corrupts non-UTF-8 sequences
```csharp
var decodedName = Encoding.UTF8.GetString(bytes); // silent replacement with U+FFFD
```
When the byte sequence is Shift-JIS or GBK, Encoding.UTF8.GetString() replaces invalid bytes with U+FFFD (replacement character) instead of throwing, causing silent data loss.
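The silent substitution is easy to demonstrate outside PDFsharp with a minimal sketch:

```csharp
using System;
using System.Text;

class ReplacementDemo
{
    static void Main()
    {
        // "テ" in Shift-JIS (CP932) is the two bytes 0x83 0x65.
        byte[] sjis = { 0x83, 0x65 };

        // The default UTF-8 decoder does not throw on the invalid lead
        // byte 0x83; it substitutes U+FFFD and keeps going (0x65 = 'e').
        string decoded = Encoding.UTF8.GetString(sjis);

        Console.WriteLine(decoded.Contains('\uFFFD')); // True: data silently lost
    }
}
```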
Steps to Reproduce the Behavior
The following xUnit tests (using the same test infrastructure as the existing LexerTests.cs) fail on the current upstream code:
```csharp
// Helper (already exists in LexerTests.cs)
static Lexer CreateLexerFromBytes(byte[] bytes)
{
    var stream = new MemoryStream(bytes);
    return new Lexer(stream, null);
}

[Fact]
public void ScanName_ShiftJIS_lead_byte_0x83_should_be_re_decoded()
{
    // Shift-JIS "テ" is encoded as the two bytes 0x83 0x65.
    // The high byte 0x83 is in the range 0x80-0xBF.
    // Upstream check: (0x83 & 0xC0) == 0xC0 → false → no re-decode → garbled output.
    byte[] bytes = [(byte)'/', 0x83, 0x65, (byte)' '];
    var lexer = CreateLexerFromBytes(bytes);

    var symbol = lexer.ScanName();

    symbol.Should().Be(Symbol.Name);
    // On upstream: Token is "/\x83\x65" (raw bytes, not decoded), so the assertion below fails.
    lexer.Token.Should().NotContain("\x83");
}
```
```csharp
[Fact]
public void ScanName_ShiftJIS_name_roundtrips_correctly_on_CP932_system()
{
    // Shift-JIS-encoded Japanese text: 0x93 0xFA 0x96 0xD1 0x8C 0xEA
    // (every byte is >= 0x80 and the sequence is not valid UTF-8).
    // Even though some bytes (e.g. 0xFA) satisfy (b & 0xC0) == 0xC0 and enter the
    // upstream re-decode branch, the subsequent Encoding.UTF8.GetString()
    // corrupts the bytes because they are not valid UTF-8.
    byte[] sjisBytes = [0x93, 0xFA, 0x96, 0xD1, 0x8C, 0xEA];
    byte[] bytes = new byte[1 + sjisBytes.Length + 1];
    bytes[0] = (byte)'/';
    sjisBytes.CopyTo(bytes, 1);
    bytes[^1] = (byte)' ';
    var lexer = CreateLexerFromBytes(bytes);

    var symbol = lexer.ScanName();

    symbol.Should().Be(Symbol.Name);
    // On upstream: Token contains U+FFFD replacement characters (data loss).
    lexer.Token.Should().NotContain("\uFFFD");
}
```
Root Cause
In Lexer.cs (PdfSharp.Pdf.IO) and CLexer.cs (PdfSharp.Pdf.Content), the ScanName() method:
- Uses (name[idx] & 0xC0) == 0xC0 to detect non-ASCII bytes; this misses the 0x80–0xBF range, which includes Shift-JIS lead bytes such as 0x83.
- Passes the raw bytes unconditionally to Encoding.UTF8.GetString(), which silently replaces invalid bytes with U+FFFD.
Proposed Fix
- Widen the non-ASCII check to (name[idx] & 0x80) != 0 so that any byte ≥ 0x80 triggers re-decoding (covers both UTF-8 lead bytes and legacy-encoding bytes).
- Replace the unconditional UTF-8 decode with a strict UTF-8 decoder (configured with DecoderExceptionFallback) inside a try/catch, and fall back to the ANSI code page of the current culture when decoding fails:
```csharp
// Cache once; DecoderExceptionFallback causes GetString() to throw on invalid sequences.
static readonly Encoding StrictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

// In ScanName():
string decodedName;
try
{
    decodedName = StrictUtf8.GetString(bytes);
}
catch (DecoderFallbackException)
{
    // Fall back to the ANSI code page of the current culture
    // (CP932 for ja-JP, CP936 for zh-CN, CP950 for zh-TW, etc.)
    decodedName = PdfEncoders.AnsiCodepageEncoding.GetString(bytes);
}
```
- Register CodePagesEncodingProvider once at startup on non-.NET-Framework targets so that Encoding.GetEncoding(codePage) resolves legacy code pages (requires the System.Text.Encoding.CodePages NuGet package for netstandard2.0).
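As a sketch of what that registration looks like (EncodingBootstrap and Init are illustrative names, not PDFsharp APIs; only the Encoding.RegisterProvider call is the actual requirement):

```csharp
using System.Text;

static class EncodingBootstrap
{
    static bool s_registered;

    public static void Init()
    {
        if (s_registered)
            return;

        // On .NET Framework the legacy code pages are built in; on .NET / netstandard2.0
        // this call (backed by the System.Text.Encoding.CodePages package) is required
        // before Encoding.GetEncoding can resolve them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        s_registered = true;

        // After registration, legacy code pages resolve, e.g.:
        // Encoding.GetEncoding(932)  → Shift-JIS / CP932
        // Encoding.GetEncoding(936)  → GBK / CP936
    }
}
```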
A working implementation with unit tests is available in the fork:
https://github.com/Tocchann/PDFsharp (branch: master)
Key changed files:
src/foundation/src/PDFsharp/src/PdfSharp/Pdf.IO/Lexer.cs
src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Content/CLexer.cs
src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Internal/PdfEncoders.cs
src/foundation/src/PDFsharp/tests/PdfSharp.Tests/IO/LexerTests.cs