Fix ScanName() to correctly decode non-UTF-8 name objects (Shift-JIS, GBK, etc.) #365
Open
Tocchann wants to merge 1 commit into empira:master from
Conversation
… GBK, etc.)

Problem
-------
In Lexer.cs (Pdf.IO) and CLexer.cs (Pdf.Content), ScanName() had two bugs:

1. Non-ASCII byte detection used (byte & 0xC0) == 0xC0, which misses bytes in the range 0x80-0xBF. Shift-JIS lead/continuation bytes such as 0x83 satisfy (0x83 & 0xC0) == 0x80, so the re-decode branch was never entered and the raw bytes were returned garbled.
2. The unconditional Encoding.UTF8.GetString() call silently replaced invalid bytes with U+FFFD instead of throwing, causing silent data loss for Shift-JIS, GBK, Big5, EUC-KR, and similar legacy encodings.

Fix
---
- Widen non-ASCII detection from (& 0xC0) == 0xC0 to (& 0x80) != 0 so every byte >= 0x80 triggers the re-decode path (covers UTF-8 lead bytes and all legacy encodings).
- Replace the unconditional UTF-8 decode with a strict UTF-8 decoder (throwOnInvalidBytes: true) wrapped in try/catch; on DecoderFallbackException, fall back to the ANSI code page of the current culture via PdfEncoders.AnsiCodepageEncoding.
- Add an AnsiCodepageEncoding property to PdfEncoders that resolves the ANSI code page for the current culture (CP932 for ja-JP, CP936 for zh-CN, CP950 for zh-TW, etc.).
- Register CodePagesEncodingProvider once at startup on non-.NET-Framework targets so that Encoding.GetEncoding(codePage) resolves legacy code pages.
- Add a System.Text.Encoding.CodePages NuGet reference for netstandard2.0 targets (net462 already includes all code pages natively).
- Add unit tests covering UTF-8 names, non-UTF-8 no-throw, and Shift-JIS fallback.

Fixes empira#364
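The detect-then-fall-back decode path described above can be modeled in a few lines of Python. This is an illustrative sketch only (PDFsharp itself is C#); `ansi_codepage` stands in for the culture-dependent `PdfEncoders.AnsiCodepageEncoding`, with CP932 chosen here purely as an example default.

```python
def scan_name_decode(raw: bytes, ansi_codepage: str = "cp932") -> str:
    """Model of the fixed ScanName() decode logic (illustration, not PDFsharp's C#)."""
    # Widened check: any byte >= 0x80 triggers the re-decode path.
    if not any(b & 0x80 for b in raw):
        return raw.decode("ascii")
    try:
        # Strict UTF-8 decode: raises instead of silently emitting U+FFFD,
        # mirroring UTF8Encoding(..., throwOnInvalidBytes: true).
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Fall back to the ANSI code page of the current culture.
        return raw.decode(ansi_codepage)
```

For example, `scan_name_decode("あ".encode("shift_jis"))` returns "あ": the Shift-JIS bytes are rejected by the strict UTF-8 pass and recovered by the CP932 fallback, while valid UTF-8 input never reaches the fallback.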
Member
ANSI-encoded names should only be expected for PDF files targeting PDF 1.3 or older. Would you provide some PDF files "from the wild" for testing?
Author
This file is the reproduction file.
Summary
This PR fixes a bug in `ScanName()` (`Lexer.cs` and `CLexer.cs`) where PDF name objects containing byte sequences encoded in a legacy code page (Shift-JIS/CP932, GBK/CP936, Big5/CP950, EUC-KR/CP949, etc.) produced garbled text when read.

Closes #364
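The symptom is easy to reproduce outside PDFsharp. This Python snippet mirrors what an unconditional UTF-8 decode with replacement (the behavior of `Encoding.UTF8.GetString`) does to Shift-JIS bytes; it is an illustration of the failure mode, not the library's code:

```python
raw = "日本語".encode("shift_jis")            # bytes the old code received
garbled = raw.decode("utf-8", errors="replace")  # mirrors Encoding.UTF8.GetString
assert "\ufffd" in garbled                   # invalid bytes silently become U+FFFD
assert raw.decode("shift_jis") == "日本語"    # the correct decode was still possible
```

Because the replacement happens byte-by-byte with no error raised, the original name is unrecoverable once `GetString` returns, which is why the fix switches to a throwing decoder.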
Root Cause
Two bugs existed in the `ScanName()` method:

**Bug 1 – Non-ASCII detection was too narrow**

The condition `(name[idx] & 0xC0) == 0xC0` only matches bytes in the range 0xC0–0xFF. Shift-JIS bytes such as 0x83 satisfy `(0x83 & 0xC0) == 0x80`, so the re-decode branch was never entered and raw bytes were returned garbled.

**Bug 2 – Unconditional UTF-8 decode caused silent data loss**
`Encoding.UTF8.GetString(bytes)` silently replaces invalid bytes with U+FFFD (the replacement character), destroying the original content without any error.

Fix
- Widen the non-ASCII check to `(byte & 0x80) != 0` — catches every byte ≥ 0x80
- Decode with `new UTF8Encoding(false, throwOnInvalidBytes: true)` in try/catch; on failure fall back to `PdfEncoders.AnsiCodepageEncoding`
- New `AnsiCodepageEncoding` property on `PdfEncoders` that resolves the ANSI code page for the current culture (CP932, CP936, CP950, CP949, …)
- Register `CodePagesEncodingProvider`; `System.Text.Encoding.CodePages` NuGet reference added for `netstandard2.0`
- `LexerTests.cs`: UTF-8 normal case, non-UTF-8 no-throw, Shift-JIS fallback

Changed Files
- src/Directory.Packages.props
- src/foundation/src/PDFsharp/src/PdfSharp/Pdf.IO/Lexer.cs
- src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Content/CLexer.cs
- src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Internal/PdfEncoders.cs
- src/foundation/src/PDFsharp/src/PdfSharp/PdfSharp.csproj
- src/foundation/src/PDFsharp/tests/PdfSharp.Tests/IO/LexerTests.cs
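Why the old mask missed legacy-encoding bytes can be checked directly. A quick Python verification using the 0x83 example from the PR (the bitwise logic is language-independent):

```python
b = 0x83  # Shift-JIS lead byte cited in the PR description

# Old check: (b & 0xC0) == 0xC0 only matches 0xC0-0xFF,
# so 0x83 never entered the re-decode branch.
assert (b & 0xC0) == 0x80
assert (b & 0xC0) != 0xC0

# New check: any byte >= 0x80 counts as non-ASCII.
assert (b & 0x80) != 0

# The new mask is strictly wider: every byte the old mask matched
# is still matched, plus all of 0x80-0xBF.
assert all((x & 0x80) != 0 for x in range(0x100) if (x & 0xC0) == 0xC0)
```

The widened mask also covers UTF-8 lead bytes (0xC2–0xF4) and continuation bytes (0x80–0xBF), so valid UTF-8 names take the same re-decode path and succeed on the strict first attempt.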