.NET Core 3.0 follows Unicode best practices when replacing ill-formed UTF-8 byte sequences #13547

Closed
GrabYourPitchforks opened this issue Jul 24, 2019 · 2 comments
Labels
breaking-change Indicates a .NET Core breaking change

Comments

@GrabYourPitchforks
Member

GrabYourPitchforks commented Jul 24, 2019

.NET Core 3.0 follows Unicode best practices when replacing ill-formed UTF-8 byte sequences

See ".NET Core 3.0 follows Unicode best practices when replacing ill-formed UTF-8 byte sequences" for the updated documentation of this change.

When the UTF8Encoding class encounters an ill-formed UTF-8 byte sequence during a bytes-to-chars transcoding operation, it replaces that sequence with '�' (U+FFFD REPLACEMENT CHARACTER) in the output string. .NET Core 3.0 differs from previous versions of .NET Core and from the .NET Framework in that it follows the Unicode best practice for performing this replacement.

Version introduced

3.0

Change description

When transcoding bytes to chars, the UTF8Encoding class now performs character substitution based on Unicode best practices. The substitution mechanism is described in The Unicode Standard, Version 12.0, Sec. 3.9 (PDF), under the heading "U+FFFD Substitution of Maximal Subparts".

This behavior applies only when the input byte sequence contains ill-formed UTF-8 data. Additionally, if the UTF8Encoding instance was constructed with throwOnInvalidBytes: true (see the constructor documentation), it continues to throw on invalid input rather than perform U+FFFD replacement.
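To make the substitution rule concrete, here is a minimal sketch of "U+FFFD Substitution of Maximal Subparts" written in Python. It is an illustration of the Unicode practice, not the .NET implementation. The names `decode_maximal_subparts`, `_trail_ranges`, and the `strict` parameter are hypothetical; `strict=True` only loosely mirrors the idea behind throwOnInvalidBytes: true. The byte ranges follow the well-formed UTF-8 byte sequence table in Sec. 3.9 of the Unicode Standard.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def _trail_ranges(lead):
    """Allowed (low, high) range for each trailing byte, or None if
    `lead` cannot start a well-formed multi-byte sequence."""
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]  # excludes overlong forms
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]  # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return [(0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xF0:
        return [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xF4:
        return [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]  # caps at U+10FFFF
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    return None


def decode_maximal_subparts(data: bytes, strict: bool = False) -> str:
    """Decode UTF-8, emitting one U+FFFD per maximal ill-formed subpart.

    With strict=True (hypothetical flag), raise ValueError instead.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:  # ASCII passes through unchanged
            out.append(chr(lead))
            i += 1
            continue
        ranges = _trail_ranges(lead)
        j = i + 1
        code_point = 0
        if ranges is not None:
            # Payload bits of the lead byte for 2-, 3-, and 4-byte forms.
            code_point = lead & (0x1F, 0x0F, 0x07)[len(ranges) - 1]
            # Consume trailing bytes only while they stay in range.
            for low, high in ranges:
                if j < len(data) and low <= data[j] <= high:
                    code_point = (code_point << 6) | (data[j] & 0x3F)
                    j += 1
                else:
                    break
        if ranges is not None and j - i == len(ranges) + 1:
            out.append(chr(code_point))
        elif strict:
            raise ValueError(f"ill-formed UTF-8 at byte offset {i}")
        else:
            # Bytes i..j-1 form one maximal subpart: a single U+FFFD.
            out.append(REPLACEMENT)
        i = j
    return "".join(out)


decoded = decode_maximal_subparts(bytes([0xED, 0xA0, 0x90]))
print([f"{ord(ch):04X}" for ch in decoded])  # → ['FFFD', 'FFFD', 'FFFD']
```

In this sketch, ED may only be followed by a byte in 80..9F, so A0 does not extend the sequence. Each of the three bytes is therefore replaced on its own, which is where the three U+FFFD chars come from.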

Old behavior

Input: 3 bytes, [ ED A0 90 ] (ill-formed)
Output: 2 chars, [ FFFD FFFD ]

New behavior

Input: 3 bytes, [ ED A0 90 ] (ill-formed)
Output: 3 chars, [ FFFD FFFD FFFD ]

(This 3-char output is the preferred output per Table 3-9 of the previously linked Unicode Standard PDF.)
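As a cross-check outside .NET, the built-in UTF-8 decoder in recent CPython 3 releases also follows this practice, so it can be used to confirm the expected output for the same input. This is only a comparison point in Python, not a statement about how UTF8Encoding is implemented.

```python
# The ill-formed 3-byte input from the example above.
ill_formed = bytes([0xED, 0xA0, 0x90])

# errors="replace" substitutes U+FFFD for each ill-formed subpart.
decoded = ill_formed.decode("utf-8", errors="replace")
print([f"{ord(ch):04X}" for ch in decoded])  # → ['FFFD', 'FFFD', 'FFFD']

# The default strict mode raises instead of substituting.
try:
    ill_formed.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```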

Reason for change

This is part of a larger effort to improve UTF-8 handling throughout .NET, including in the new System.Text.Unicode.Utf8 and System.Text.Rune types. The UTF8Encoding type was given improved error-handling mechanics so that its output is consistent with that of the newly introduced types.

Recommended action

No action is required on the part of the developer.

Category

Core

Affected APIs


Issue metadata

  • Issue type: breaking-change
@dotnet-bot added the breaking-change and ⌚ Not Triaged labels Jul 24, 2019
@rpetrusha removed the ⌚ Not Triaged label Aug 6, 2019
@rpetrusha
Contributor

I've made some minor changes to the text of the issue, @GrabYourPitchforks. Please review them.

@rpetrusha
Contributor

@GrabYourPitchforks, do we need to note the preview version in which this change was introduced?
