Bug: Name objects with non-UTF-8 byte sequences (Shift-JIS, GBK, etc.) cause garbled text when reading PDF #364

@Tocchann

Description

Expected Behavior

PDF Name objects (tokens starting with /) that contain byte sequences encoded in a legacy code page (Shift-JIS / CP932, GBK / CP936, Big5 / CP950, EUC-KR / CP949, etc.) should be decoded correctly and produce the original string.
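For illustration, a minimal standalone sketch of the expected decoding (not PDFsharp code), assuming CP932 has been made available through CodePagesEncodingProvider as described in the proposed fix below; the two Shift-JIS bytes 0x83 0x65 should come back as the single character "テ":

// Standalone sketch, not PDFsharp code.
using System;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);      // resolves legacy code pages on .NET (Core)
byte[] nameBytes = { 0x83, 0x65 };                                  // Shift-JIS "テ"
Console.WriteLine(Encoding.GetEncoding(932).GetString(nameBytes));  // prints "テ"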

Actual Behavior

ScanName() in Lexer.cs and CLexer.cs assumes every non-ASCII byte sequence is UTF-8 and never falls back to a legacy code page.
This causes two problems:

Problem 1: Non-ASCII detection is too narrow

The condition below only catches bytes in the range 0xC0–0xFF:

// Lexer.cs / CLexer.cs  (current upstream code)
if ((name[idx] & 0xC0) == 0xC0)

Shift-JIS bytes that fall in the range 0x80–0xBF (which includes the lead bytes 0x81–0x9F) are NOT detected.
For example, the Shift-JIS lead byte 0x83 satisfies (0x83 & 0xC0) == 0x80, not 0xC0, so the re-decode branch is never entered and the raw bytes are returned as-is (garbled).
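
The arithmetic is easy to verify in isolation. The following standalone sketch (not upstream code) contrasts the current mask with the widened check proposed further below:

using System;

byte b = 0x83;                            // Shift-JIS lead byte of "テ" (0x83 0x65)
bool upstreamCheck = (b & 0xC0) == 0xC0;  // false: 0x83 & 0xC0 == 0x80, so no re-decode happens
bool widenedCheck  = (b & 0x80) != 0;     // true:  any byte >= 0x80 is treated as non-ASCII
Console.WriteLine($"upstream: {upstreamCheck}, widened: {widenedCheck}");  // upstream: False, widened: True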

Problem 2: Unconditional UTF-8 decode silently corrupts non-UTF-8 sequences

var decodedName = Encoding.UTF8.GetString(bytes);  // silent replacement with U+FFFD

When the byte sequence is Shift-JIS or GBK, Encoding.UTF8.GetString() replaces invalid bytes with U+FFFD (replacement character) instead of throwing, causing silent data loss.
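
The replacement behavior is easy to demonstrate outside PDFsharp. This standalone sketch shows the silent U+FFFD substitution and how a strict decoder (as proposed below) surfaces the problem instead:

using System;
using System.Text;

byte[] sjis = { 0x83, 0x65 };                 // Shift-JIS "テ", not valid UTF-8

// Default decoder: invalid bytes are silently replaced with U+FFFD.
string lossy = Encoding.UTF8.GetString(sjis);
Console.WriteLine(lossy.Contains("\uFFFD"));  // True: the original bytes are lost

// Strict decoder: throwOnInvalidBytes = true makes GetString() throw instead of replacing.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
try
{
    strictUtf8.GetString(sjis);
}
catch (DecoderFallbackException)
{
    Console.WriteLine("invalid UTF-8 detected");  // reached for the Shift-JIS bytes
}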

Steps to Reproduce the Behavior

The following xUnit test (using the same test infrastructure as the existing LexerTests.cs) fails on the current upstream code:

// Helper (already exists in LexerTests.cs)
static Lexer CreateLexerFromBytes(byte[] bytes)
{
    var stream = new MemoryStream(bytes);
    return new Lexer(stream, null);
}

[Fact]
public void ScanName_ShiftJIS_lead_byte_0x83_should_be_re_decoded()
{
    // Shift-JIS "テ" is encoded as the two bytes 0x83 0x65.
    // The high byte 0x83 is in the range 0x80-0xBF.
    // Upstream check: (0x83 & 0xC0) == 0xC0 → FALSE → no re-decode → garbled output.
    byte[] bytes = [(byte)'/', 0x83, 0x65, (byte)' '];

    var lexer = CreateLexerFromBytes(bytes);
    var symbol = lexer.ScanName();

    symbol.Should().Be(Symbol.Name);
    // On upstream: Token is "/\x83\x65" (raw bytes, not decoded) — the assertion below fails.
    lexer.Token.Should().NotContain("\x83");
}

[Fact]
public void ScanName_ShiftJIS_name_roundtrips_correctly_on_CP932_system()
{
    // "日本語" in Shift-JIS: 0x93 0xFA 0x96 0xD1 0x8C 0xEA
    // Even though 0x93 triggers (0x93 & 0xC0)==0xC0 on upstream, the subsequent
    // Encoding.UTF8.GetString() corrupts the bytes because they are not valid UTF-8.
    byte[] sjisBytes = [0x93, 0xFA, 0x96, 0xD1, 0x8C, 0xEA];
    byte[] bytes = new byte[1 + sjisBytes.Length + 1];
    bytes[0] = (byte)'/';
    sjisBytes.CopyTo(bytes, 1);
    bytes[^1] = (byte)' ';

    var lexer = CreateLexerFromBytes(bytes);
    var symbol = lexer.ScanName();

    symbol.Should().Be(Symbol.Name);
    // On upstream: Token contains U+FFFD replacement characters (data loss).
    lexer.Token.Should().NotContain("\uFFFD");
}

Root Cause

In Lexer.cs (PdfSharp.Pdf.IO) and CLexer.cs (PdfSharp.Pdf.Content), the ScanName() method:

  1. Uses (name[idx] & 0xC0) == 0xC0 to detect non-ASCII bytes; this misses the 0x80–0xBF range, where many Shift-JIS lead and trail bytes fall.
  2. When the re-decode branch is entered, passes the raw bytes straight to Encoding.UTF8.GetString(), which silently replaces invalid byte sequences with U+FFFD.

Proposed Fix

  1. Widen the non-ASCII check to (name[idx] & 0x80) != 0 so that any byte ≥ 0x80 triggers re-decoding (covers both UTF-8 lead bytes and legacy-encoding bytes).

  2. Replace the unconditional UTF-8 decode with a strict UTF-8 decoder (configured with DecoderExceptionFallback) inside a try/catch, and fall back to the ANSI code page of the current culture when decoding fails:

// Cache once; DecoderExceptionFallback causes GetString() to throw on invalid sequences.
static readonly Encoding StrictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

// In ScanName():
string decodedName;
try
{
    decodedName = StrictUtf8.GetString(bytes);
}
catch (DecoderFallbackException)
{
    // Fall back to the ANSI code page of the current culture
    // (CP932 for ja-JP, CP936 for zh-CN, CP950 for zh-TW, etc.)
    decodedName = PdfEncoders.AnsiCodepageEncoding.GetString(bytes);
}
  3. Register CodePagesEncodingProvider once at startup on non-.NET-Framework targets so that Encoding.GetEncoding(codePage) resolves legacy code pages (requires the System.Text.Encoding.CodePages NuGet package for netstandard2.0).
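
A minimal sketch of that registration step, assuming the System.Text.Encoding.CodePages package is referenced (exactly where PDFsharp would place the call is an implementation detail):

using System;
using System.Globalization;
using System.Text;

#if !NETFRAMEWORK
// One-time registration so Encoding.GetEncoding(codePage) can resolve legacy code pages.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
#endif

// ANSI code page of the current culture: 932 on ja-JP, 936 on zh-CN, 950 on zh-TW, 949 on ko-KR, ...
int ansiCodePage = CultureInfo.CurrentCulture.TextInfo.ANSICodePage;
Encoding ansiEncoding = Encoding.GetEncoding(ansiCodePage);

byte[] sjisBytes = { 0x93, 0xFA };                     // first character of the test name above
Console.WriteLine(ansiEncoding.GetString(sjisBytes));  // "日" when the current culture is ja-JP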

A working implementation with unit tests is available in the fork:
https://github.com/Tocchann/PDFsharp (branch: master)

Key changed files:

  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.IO/Lexer.cs
  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Content/CLexer.cs
  • src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Internal/PdfEncoders.cs
  • src/foundation/src/PDFsharp/tests/PdfSharp.Tests/IO/LexerTests.cs
