Fix ScanName() to correctly decode non-UTF-8 name objects (Shift-JIS, GBK, etc.) by Tocchann · Pull Request #365 · empira/PDFsharp

Tocchann · 2026-04-29T04:26:15Z

Summary

This PR fixes a bug in ScanName() (Lexer.cs and CLexer.cs) where PDF Name objects containing byte sequences encoded in a legacy code page (Shift-JIS/CP932, GBK/CP936, Big5/CP950, EUC-KR/CP949, etc.) produced garbled text when read.

Closes #364

Root Cause

Two bugs existed in the ScanName() method:

Bug 1 – Non-ASCII detection was too narrow

The condition (name[idx] & 0xC0) == 0xC0 matches only bytes in the range 0xC0–0xFF, i.e. bytes with both high bits set.
Shift-JIS bytes such as 0x83 evaluate to (0x83 & 0xC0) == 0x80 and therefore failed the check, so the re-decode branch was never entered and the raw bytes were returned garbled.
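
To make the mask arithmetic concrete, here is a minimal sketch comparing the two conditions on individual bytes. The class and method names (`NonAsciiChecks`, `OldCheck`, `NewCheck`) are hypothetical and exist only for illustration; they are not taken from Lexer.cs.

```csharp
using System;

// Hypothetical helper, for illustration only (not part of PDFsharp).
static class NonAsciiChecks
{
    // Old condition: true only when both high bits are set (0xC0-0xFF).
    public static bool OldCheck(byte b) => (b & 0xC0) == 0xC0;

    // Widened condition: true for every byte >= 0x80.
    public static bool NewCheck(byte b) => (b & 0x80) != 0;
}
```

With these helpers, `OldCheck(0x83)` is false while `NewCheck(0x83)` is true; both are true for 0xE3 (a UTF-8 lead byte) and both are false for 0x41 (ASCII `A`).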

Bug 2 – Unconditional UTF-8 decode caused silent data loss

Encoding.UTF8.GetString(bytes) silently replaces invalid bytes with U+FFFD (replacement character), destroying the original content without any error.
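
The difference between the lenient and the strict decoder can be sketched as follows. The wrapper class `Utf8StrictnessDemo` and its methods are hypothetical; only `Encoding.UTF8`, `UTF8Encoding`, and `DecoderFallbackException` are real .NET types.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only (not part of PDFsharp).
static class Utf8StrictnessDemo
{
    // Lenient: invalid sequences are replaced with U+FFFD, no exception.
    public static string DecodeLenient(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    // Strict: invalid sequences raise DecoderFallbackException.
    public static bool TryDecodeStrict(byte[] bytes, out string text)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            text = strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = "";
            return false;
        }
    }
}
```

For the Shift-JIS bytes 0x83 0x65, `DecodeLenient` returns a string containing U+FFFD, whereas `TryDecodeStrict` returns false, which gives the caller a chance to try another encoding.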

Fix

| Change | Detail |
| --- | --- |
| Widen non-ASCII check | `(byte & 0x80) != 0`, which catches every byte ≥ 0x80 |
| Strict UTF-8 + fallback | `new UTF8Encoding(false, throwOnInvalidBytes: true)` in try/catch; on failure fall back to `PdfEncoders.AnsiCodepageEncoding` |
| `AnsiCodepageEncoding` | New property on `PdfEncoders` that resolves the ANSI code page for the current culture (CP932, CP936, CP950, CP949, …) |
| `CodePagesEncodingProvider` | Registered once at startup on non-.NET-Framework targets; `System.Text.Encoding.CodePages` NuGet added for `netstandard2.0` |
| Unit tests | Added to `LexerTests.cs`: UTF-8 normal case, non-UTF-8 no-throw, Shift-JIS fallback |
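
The overall decode path described in the table can be sketched roughly as below. This is not the PR's code: `NameDecodingSketch`, `DecodeName`, and `ShiftJis` are hypothetical names, the fallback encoding is passed in explicitly instead of coming from `PdfEncoders.AnsiCodepageEncoding`, and the sketch assumes a .NET Core / .NET 5+ target where `CodePagesEncodingProvider` is available.

```csharp
using System;
using System.Text;

// Hypothetical sketch of the strict-UTF-8-then-fallback idea (not PDFsharp code).
static class NameDecodingSketch
{
    static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    static NameDecodingSketch()
    {
        // Makes legacy code pages such as 932 resolvable via Encoding.GetEncoding.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Code page 932 (Shift-JIS), used here as an example fallback.
    public static Encoding ShiftJis => Encoding.GetEncoding(932);

    public static string DecodeName(byte[] bytes, Encoding fallback)
    {
        // Widened check: any byte >= 0x80 triggers the re-decode path.
        bool hasNonAscii = false;
        foreach (byte b in bytes)
        {
            if ((b & 0x80) != 0) { hasNonAscii = true; break; }
        }
        if (!hasNonAscii)
            return Encoding.ASCII.GetString(bytes);

        try
        {
            // Strict UTF-8 first; throws on invalid sequences.
            return StrictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: decode with the legacy code page instead.
            return fallback.GetString(bytes);
        }
    }
}
```

As a usage example, `NameDecodingSketch.DecodeName(new byte[] { 0x83, 0x65, 0x83, 0x58, 0x83, 0x67 }, NameDecodingSketch.ShiftJis)` returns "テスト", and the UTF-8 bytes of the same string decode to the same text without touching the fallback.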

Changed Files

src/Directory.Packages.props
src/foundation/src/PDFsharp/src/PdfSharp/Pdf.IO/Lexer.cs
src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Content/CLexer.cs
src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Internal/PdfEncoders.cs
src/foundation/src/PDFsharp/src/PdfSharp/PdfSharp.csproj
src/foundation/src/PDFsharp/tests/PdfSharp.Tests/IO/LexerTests.cs

… GBK, etc.)

Problem
-------
In Lexer.cs (Pdf.IO) and CLexer.cs (Pdf.Content), ScanName() had two bugs:

1. Non-ASCII byte detection used (byte & 0xC0) == 0xC0, which misses bytes in the range 0x80-0xBF. Shift-JIS lead/continuation bytes such as 0x83 satisfy (0x83 & 0xC0) == 0x80, so the re-decode branch was never entered and the raw bytes were returned garbled.
2. The unconditional Encoding.UTF8.GetString() call silently replaced invalid bytes with U+FFFD instead of throwing, causing silent data loss for Shift-JIS, GBK, Big5, EUC-KR, and similar legacy encodings.

Fix
---
- Widen non-ASCII detection from (& 0xC0)==0xC0 to (& 0x80)!=0 so every byte >= 0x80 triggers the re-decode path (covers UTF-8 lead bytes and all legacy encodings).
- Replace unconditional UTF-8 decode with a strict UTF-8 decoder (throwOnInvalidBytes: true) wrapped in try/catch; on DecoderFallbackException fall back to the ANSI code page of the current culture via PdfEncoders.AnsiCodepageEncoding.
- Add AnsiCodepageEncoding property to PdfEncoders that resolves the ANSI code page for the current culture (CP932 for ja-JP, CP936 for zh-CN, CP950 for zh-TW, etc.).
- Register CodePagesEncodingProvider once at startup on non-.NET-Framework targets so that Encoding.GetEncoding(codePage) resolves legacy code pages.
- Add System.Text.Encoding.CodePages NuGet reference for netstandard2.0 targets (net462 already includes all code pages natively).
- Add unit tests covering UTF-8 names, non-UTF-8 no-throw, and Shift-JIS fallback.

Fixes empira#364

ThomasHoevel · 2026-05-04T06:24:59Z

ANSI-encoded names should only be expected for PDF files targeting PDF 1.3 or older.
PDF files targeting PDF 1.4 or newer should use UTF-8 encoding for names.

Would you provide some PDF files "from the wild" for testing?

Tocchann · 2026-05-04T07:22:41Z

The attached file reproduces the problem.
Objects 279 and 280 are the ones in question.

Sample32.pdf

ThomasHoevel added the Cannot Reproduce https://xkcd.com/583/ label May 4, 2026

Tocchann wants to merge 1 commit into empira:master from Tocchann:fix/scanname-non-utf8-encoding
