Fix ScanName() to correctly decode non-UTF-8 name objects (Shift-JIS, GBK, etc.) #365

Open
Tocchann wants to merge 1 commit into empira:master from Tocchann:fix/scanname-non-utf8-encoding

@Tocchann

Summary

This PR fixes a bug in ScanName() (Lexer.cs and CLexer.cs) where PDF Name objects containing byte sequences encoded in a legacy code page (Shift-JIS/CP932, GBK/CP936, Big5/CP950, EUC-KR/CP949, etc.) produced garbled text when read.

Closes #364

Root Cause

Two bugs existed in the ScanName() method:

Bug 1 – Non-ASCII detection was too narrow

The condition (name[idx] & 0xC0) == 0xC0 only matches bytes in the range 0xC0–0xFF.
For a Shift-JIS byte such as 0x83, (0x83 & 0xC0) evaluates to 0x80, so the condition was false, the re-decode branch was never entered, and the raw bytes were returned as garbled text.
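
To make the mask arithmetic concrete, here is a minimal C# sketch (top-level statements, so C# 9 or later is assumed) that compares the two checks on a few sample bytes. OldCheck and NewCheck are hypothetical helper names used for illustration only; they are not identifiers from the PDFsharp source.

```csharp
using System;

// Hypothetical helpers mirroring the two conditions discussed above.
// Old check: true only for bytes in 0xC0..0xFF (both top bits set).
bool OldCheck(byte b) => (b & 0xC0) == 0xC0;
// New check: true for every byte >= 0x80 (top bit set).
bool NewCheck(byte b) => (b & 0x80) != 0;

// 0x41 is ASCII 'A'; 0x83 and 0xBF fall in 0x80..0xBF; 0xC3 and 0xE3 are in 0xC0..0xFF.
foreach (byte b in new byte[] { 0x41, 0x83, 0xBF, 0xC3, 0xE3 })
{
    Console.WriteLine($"0x{b:X2}: old={OldCheck(b)} new={NewCheck(b)}");
}
```

Only the bytes in 0x80–0xBF differ between the two checks, which is exactly the range the old condition missed.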

Bug 2 – Unconditional UTF-8 decode caused silent data loss

Encoding.UTF8.GetString(bytes) silently replaces invalid bytes with U+FFFD (replacement character), destroying the original content without any error.
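
The difference between the lenient and the strict decoder can be seen in a small sketch. It assumes C# top-level statements and uses the two-byte Shift-JIS sequence 0x83 0x65 as sample input; the variable names are illustrative and do not come from the PR.

```csharp
using System;
using System.Text;

// 0x83 0x65 is a Shift-JIS sequence; 0x83 alone is not a valid UTF-8 lead byte.
byte[] sjisBytes = { 0x83, 0x65 };

// Lenient decode: invalid input is replaced with U+FFFD, no exception is raised.
string lenient = Encoding.UTF8.GetString(sjisBytes);
bool lossy = lenient.Contains('\uFFFD');
Console.WriteLine($"lenient decode contains U+FFFD: {lossy}");

// Strict decode: the same input raises DecoderFallbackException instead.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
bool threw = false;
try
{
    strictUtf8.GetString(sjisBytes);
}
catch (DecoderFallbackException)
{
    threw = true;
}
Console.WriteLine($"strict decode threw: {threw}");
```

With the strict decoder the caller learns that the bytes are not UTF-8 and can still try another encoding on the untouched byte array.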

Fix

| Change | Detail |
| --- | --- |
| Widen non-ASCII check | (byte & 0x80) != 0, which catches every byte ≥ 0x80 |
| Strict UTF-8 + fallback | new UTF8Encoding(false, throwOnInvalidBytes: true) in try/catch; on failure fall back to PdfEncoders.AnsiCodepageEncoding |
| AnsiCodepageEncoding | New property on PdfEncoders that resolves the ANSI code page for the current culture (CP932, CP936, CP950, CP949, …) |
| CodePagesEncodingProvider | Registered once at startup on non-.NET-Framework targets; System.Text.Encoding.CodePages NuGet added for netstandard2.0 |
| Unit tests | Added to LexerTests.cs: UTF-8 normal case, non-UTF-8 no-throw, Shift-JIS fallback |
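
The strict-decode-then-fallback approach can be sketched roughly as follows. This is not the code from the PR: DecodeName is a hypothetical helper, and the fallback encoding is passed in explicitly (code page 932 here) instead of being taken from PdfEncoders.AnsiCodepageEncoding. The sketch assumes C# top-level statements on a modern .NET runtime where CodePagesEncodingProvider is available.

```csharp
using System;
using System.Text;

// On .NET Core / .NET 5+ legacy code pages must be registered before use.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Strict UTF-8: throws on invalid bytes instead of substituting U+FFFD.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

// Hypothetical helper: try strict UTF-8 first, then the supplied fallback encoding.
string DecodeName(byte[] raw, Encoding fallback)
{
    try
    {
        return strictUtf8.GetString(raw);
    }
    catch (DecoderFallbackException)
    {
        return fallback.GetString(raw);
    }
}

// Assumption for this sketch: the fallback is Shift-JIS (CP932).
Encoding cp932 = Encoding.GetEncoding(932);

byte[] utf8Name = { 0xE3, 0x83, 0x86 };                   // U+30C6 encoded as UTF-8
byte[] sjisName = { 0x83, 0x65, 0x83, 0x58, 0x83, 0x67 }; // U+30C6 U+30B9 U+30C8 in Shift-JIS

Console.WriteLine(DecodeName(utf8Name, cp932));
Console.WriteLine(DecodeName(sjisName, cp932));
```

One limitation of this kind of heuristic is that a legacy-encoded byte sequence that happens to be valid UTF-8 will be decoded as UTF-8, so the fallback only applies when strict decoding fails.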

Changed Files

  • src/Directory.Packages.props
  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.IO/Lexer.cs
  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Content/CLexer.cs
  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Internal/PdfEncoders.cs
  • src/foundation/src/PDFsharp/src/PdfSharp/PdfSharp.csproj
  • src/foundation/src/PDFsharp/tests/PdfSharp.Tests/IO/LexerTests.cs

… GBK, etc.)

Problem
-------
In Lexer.cs (Pdf.IO) and CLexer.cs (Pdf.Content), ScanName() had two bugs:

1. Non-ASCII byte detection used (byte & 0xC0) == 0xC0, which misses bytes in the
   range 0x80-0xBF.  For Shift-JIS lead/trail bytes such as 0x83,
   (0x83 & 0xC0) evaluates to 0x80, so the check failed, the re-decode branch was
   never entered, and the raw bytes were returned garbled.

2. The unconditional Encoding.UTF8.GetString() call silently replaced invalid bytes
   with U+FFFD instead of throwing, causing silent data loss for Shift-JIS, GBK,
   Big5, EUC-KR, and similar legacy encodings.

Fix
---
- Widen non-ASCII detection from (& 0xC0)==0xC0 to (& 0x80)!=0 so every byte >=
  0x80 triggers the re-decode path (covers UTF-8 lead bytes and all legacy encodings).
- Replace unconditional UTF-8 decode with a strict UTF-8 decoder (throwOnInvalidBytes:
  true) wrapped in try/catch; on DecoderFallbackException fall back to the ANSI code
  page of the current culture via PdfEncoders.AnsiCodepageEncoding.
- Add AnsiCodepageEncoding property to PdfEncoders that resolves the ANSI code page
  for the current culture (CP932 for ja-JP, CP936 for zh-CN, CP950 for zh-TW, etc.).
- Register CodePagesEncodingProvider once at startup on non-.NET-Framework targets
  so that Encoding.GetEncoding(codePage) resolves legacy code pages.
- Add System.Text.Encoding.CodePages NuGet reference for netstandard2.0 targets
  (net462 already includes all code pages natively).
- Add unit tests covering UTF-8 names, non-UTF-8 no-throw, and Shift-JIS fallback.

Fixes empira#364
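
One way a culture-dependent ANSI encoding could be resolved is sketched below. How AnsiCodepageEncoding is actually implemented is not shown in this thread, so the approach (TextInfo.ANSICodePage plus Encoding.GetEncoding) and the helper name AnsiEncodingFor are assumptions made for illustration. The sketch uses C# top-level statements.

```csharp
using System;
using System.Globalization;
using System.Text;

// Needed on .NET Core / .NET 5+ so Encoding.GetEncoding(codePage) can resolve legacy code pages.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: map a culture to the encoding of its ANSI code page.
Encoding AnsiEncodingFor(CultureInfo culture)
{
    int codePage = culture.TextInfo.ANSICodePage;
    return Encoding.GetEncoding(codePage);
}

// Report what the current culture resolves to on this machine.
Encoding ansi = AnsiEncodingFor(CultureInfo.CurrentCulture);
Console.WriteLine($"{CultureInfo.CurrentCulture.Name}: code page {ansi.CodePage} ({ansi.WebName})");
```

The result depends on the current culture and on the globalization data of the runtime, so the same PDF may decode differently on machines with different culture settings.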

ThomasHoevel added the "Cannot Reproduce https://xkcd.com/583/" label on May 4, 2026

@ThomasHoevel
Member

ANSI-encoded names should only be expected for PDF files targeting PDF 1.3 or older.
PDF files targeting PDF 1.4 or newer should use UTF-8 encoding for names.

Would you provide some PDF files "from the wild" for testing?

@Tocchann
Author

Tocchann commented May 4, 2026

This file is the reproduction file.
Objects 279 and 280 are the ones in question.

Sample32.pdf

Successfully merging this pull request may close these issues.

Bug: Name objects with non-UTF-8 byte sequences (Shift-JIS, GBK, etc.) cause garbled text when reading PDF