[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

askeksa-google · 2020-03-18T18:26:30Z

Summary

Change encoding and decoding of UTF-8 to conform to the WHATWG encoding standard. This means that it will never emit invalid UTF-8, only accept valid UTF-8 and be compatible with the TextEncoder and TextDecoder classes in JavaScript.

Related issues: #7046, #22330, #28832, #31370, #31954

What is changing:

When decoding UTF-8 data with the Utf8Codec or Utf8Decoder class, the input is considered malformed if it contains an encoded surrogate character (code point in the range U+D800-U+DFFF, encoded in UTF-8 as a 3-byte character encoding where the first byte is 0xED and the second byte is in the range 0xA0-0xBF).
When encoding a string as UTF-8 with the Utf8Codec or Utf8Encoder class, and the string contains an unpaired surrogate, that surrogate is emitted as a replacement character (U+FFFD, encoded in UTF-8 as 0xEF, 0xBF, 0xBD) instead of an encoded surrogate (which is invalid UTF-8). For chunked conversion, if a chunk ends with a high surrogate and the next chunk starts with a low surrogate, these surrogates are considered properly paired and are combined, like before.
When decoding malformed UTF-8 data with allowMalformed set to true, the number of replacement characters emitted will sometimes differ from the number currently emitted. Specifically, the decoder will emit one replacement character for each maximal sequence of input bytes that is either
1. a prefix of a valid encoding of a character, or
2. a single byte that is not a prefix of a valid encoding of a character.
When decoding malformed UTF-8 data with allowMalformed set to false, the offset in the resulting FormatException will point to the first byte from which the decoder can conclude that the sequence is malformed, rather than the first byte that was not decoded successfully. Also, the message of the FormatException will sometimes be different from what it is currently. If the input contains more than one error, the FormatException may point to a different error than before.

Why is this changing?

Dart strings (like JS and Java strings) may contain unpaired surrogates. The current strategy of allowing surrogates when encoding and decoding UTF-8 ensures that any Dart string can be encoded as UTF-8 (actually, WTF-8) and decoded back into the original string.

This strategy has a number of drawbacks:

The output of the UTF-8 encoder is sometimes not valid UTF-8, which can be problematic when this data needs to be consumed by other programs.
When UTF-8 data is read, and that data contains encoded unpaired surrogates, this may cause problems much later, when the string is processed, rather that catching the invalid encoding up front.
The Dart behavior deviates from JS, which means that when Dart code is translated to JS, UTF-8 encoding and decoding can't directly use the JS TextEncoder and TextDecoder classes. It must do some or all of the conversion in Dart code, which has a significant performance cost.
Retaining exact compatibility with the current error behavior complicates some planned optimizations to UTF-8 decoding in the Dart VM.

The purpose of the change is thus to:

ensure that Dart programs don't inadvertently produce or accept invalid UTF-8.
enable faster UTF-8 encoding and decoding for both JS and VM targets.

Expected impact

Programs manipulating strings through usual string operations are unlikely to be affected.

A program may be affected by this change if it does any of the following:

Manipulates strings in a way that may introduce unpaired surrogates, encodes these strings as UTF-8, decodes them again and expects the string contents to be preserved.
Encodes arbitrary substrings as UTF-8 without regard to surrogate pairs, decodes them again, concatenates them (before or after decoding) and expects the string contents to be preserved.
Relies on the exact offsets and/or error messages from decoding invalid UTF-8 data or which error is reported in case of multiple errors.
Relies on the number of replacement characters produced by decoding invalid UTF-8 data.

Mitigation

For the scenarios listed above:

UTF-8, being an interchange format, is unsuited for representing such broken strings. Consider a different representation.
For encoding in multiple chunks, use the chunked conversion API.
Adjust the program to detect specific errors in a different way, or adapt it to the new errors.
Same as 3.

Variations

An optional allowSurrogates parameter could be added to the encoder and decoder to support the round-trip use case. To obtain the performance benefits, it should default to false. This could introduce further breakage for programs implementing the Utf8Codec interface (unless we only put the flag on the constructors).

If the surrogate change is considered too risky, the error and replacement character changes on their own can still ease the VM optimizations and possibly improve the performance of JS when allowMalformed is set to true.

The text was updated successfully, but these errors were encountered:

franklinyow · 2020-03-26T00:43:45Z

Please send the announce
https://github.com/dart-lang/sdk/blob/master/docs/process/breaking-changes.md

The change will need to wait for the beta branch snapped before it can be submitted.

mkustermann · 2020-04-20T12:11:24Z

The breaking change has been announced on dart-announce, see announcement.

mkustermann · 2020-04-20T12:14:03Z

The change will need to wait for the beta branch snapped before it can be submitted.

The beta branch has been cut, so this is not blocked anymore.

We would like to move forward with this.

@franklinyow Can you help us getting the LGTM from all veto powers?

franklinyow · 2020-04-22T19:10:31Z

cc @Hixie @matanlurey @dgrove @vsmenon for review and approval.

vsmenon · 2020-04-22T19:15:26Z

Is there an impact on prod code size for web or native? ( @rakudrama )

askeksa-google · 2020-04-22T19:45:16Z

@rakudrama and I just discussed the web situation. There is currently a 4k size increase and a significant speed regression for small inputs, but we have ideas for fixing both issues.

For native there will probably be a small (a few k) constant size increase.

askeksa-google · 2020-04-30T10:10:48Z

One detail that we need to decide on: for JS, when allowMalformed is set to true, we can choose either to fall back to the Dart implementation like we do currently, or we can use the TextDecoder (when available).

Using TextDecoder is significantly faster, but it might result in different output between browsers, since not all browsers follow the WHATWG standard precisely. Experiments indicate that:

Firefox follows the standard.
Chrome (and Chromium-based browsers): When an unfinished sequence is interrupted by end-of-input, or when an unfinished 4-byte sequence is interrupted after exactly 2 bytes, then one replacement character is emitted per byte instead of just one.
Safari: When a valid first byte of a multi-byte sequence is followed by fewer bytes than necessary to complete the sequence and then end-of-input, a replacement character is emitted and the rest of the input is discarded, even if it contains valid sequences.

@lrhn @rakudrama @vsmenon WDYT?

lrhn · 2020-04-30T10:16:02Z

Use TextDecoder and file bugs against Chrome and Safari.
(And mark the corresponding tests as failing on those browsers).

I don't think the exact behavior on malformed inputs is that important. We don't otherwise try to work around browser bugs.

rakudrama · 2020-04-30T17:56:35Z

Using TextDecoder for short inputs is slower than the Dart implementation, so we will likely use a hybrid approach, using the backup Dart implementation for short inputs, where the threshold depends on some arguments.
@lrhn How do you feel about the bug existing for some inputs but not others?

@askeksa-google, is the following statement true: All cases of non-standard behaviour have an output containing the replacement character.
If so, we have a relatively simple remedy: in the allowMalformed case, check the output for the replacement character and if present, redo the conversion with the backup dart code.
It might be less work to do this than split the tests to keep good test coverage.

We do work around some other browser bugs - for example, fixing the unusual exponential form in IE11's Number.toString(radix), and various differences in the DOM APIs.

askeksa-google · 2020-04-30T18:29:07Z

That statement is true. A stricter statement which is also true is: All cases of non-standard behaviour have an output containing the replacement character as the last character or two replacement characters in a row.

The "when an unfinished 4-byte sequence is interrupted after exactly 2 bytes" deviation in Chromium is unfortunate, since without that (assuming only the other observed deviations) we would only have to check the last character of the output.

For mostly-ASCII inputs, it might very well be faster to always fall back than to scan the result. At least the cross-over point will be quite high.

lrhn · 2020-05-01T07:57:47Z

@lrhn How do you feel about the bug existing for some inputs but not others?

I feel fine. File bug reports against the browsers and hope they get around to fixing it.

It's still only error handling, it's not like they do something different for a well-formed input.

Hixie · 2020-05-03T06:25:36Z

Having standards-compliant error handling can be as important as handling valid input. For one, variance in error handling is often a vector in vulnerabilities. For another, most people don't care if input is valid or not. They just care that all their software interoperates.

This adjusts all UTF-8 tests to the new semantics in the breaking change described here: #41100 This has three parts: - Unpaired surrogates are encoded as replacement characters, and encoded surrogates are considered malformed input when decoding. - Decoding errors are generally reported on the position of the byte that conclusively makes the input malformed. - The number of replacement characters emitted by the decoder is generally one per unfinished sequence or undecodable byte. The code changes to implement the new semantics are placed in subsequent commits. Change-Id: I4cc8ce660e39287e734070764ab8e1f0ebb8b9e0 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/143815 Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>

This implements the encoding part of the breaking change described at #41100 Change-Id: I22f2ffc24efc783a2199f640690a85c70a85e7d2 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/143818 Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>

Two-pass decoder: the first pass scans through the input to compute the length of the resulting string and which decoder to use, and the second pass does the actual decoding. The same decoder is used for both one-shot and chunked decoding, and both with and without allowMalformed. If there is an error in the input and allowMalformed is true, it starts over with a general decoder that supports malformed input and allocates space as it goes along. JS targets go directly to the general decoder, as the two-pass approach is not beneficial here. Three pieces of the decoder are designed to be pluggable by patches to optimize the performance further: - scan, running the first pass of the conversion. - decode8, decoding Latin1 data into a OneByteString. - decode16, decoding arbitrary data into a TwoByteString. Improves decoding speed, especially for complex input (many multi-byte characters). Observed speed increases are approximately: - dart2js: up to 40% - VM JIT: up to 260% - VM AOT: up to 130% The constant overhead of calling the UTF-8 decoder is also significantly reduced for dart2js. Code size for dart2js is slightly reduced compared to the old decoder. ASCII inputs currently see a slight speed decrease for VM targets, which will be fixed in https://dart-review.googlesource.com/c/sdk/+/145460 This is part of the implementation of the breaking change described at #41100 Closes #28832 Closes #31954 Ideas for further improvements to the decoder are collected in #41734 Change-Id: I3c5bb84e8d6783231680a9d34d6c38e8a28ab112 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/142025 Reviewed-by: Stephen Adams <sra@google.com> Reviewed-by: Martin Kustermann <kustermann@google.com>

This brings JSON encoding and decoding in line with the UTF-8 changes described at #41100 The fused UTF-8 / JSON decoder for the VM now uses the new UTF-8 decoder instead of its own, separate UTF-8 decoder. The JSON encoder now escapes lone surrogates, so it can encode JSON string values containing lone surrogates while keeping its output valid UTF-8. Change-Id: Ie4d4601cf84012068849e64d4670f2dcd49ea088 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/144286 Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>

lrhn · 2020-05-04T10:57:53Z

Having consistent behavior is a good thing, but there is a limit to how much work should be spent on mitigating browser bugs. I don't think the impact here is big enough to warrant adding detection and workarounds for something which is, arguably, already an error situation.

Update package:ffi to a version which does not depend on unpaired surrogates. Breaking change in Dart: #41100. Change-Id: I2a5ba0abee7c6cccb166c234f8f620dbe0063d47 Cq-Include-Trybots: luci.dart.try:vm-ffi-android-debug-arm-try,vm-ffi-android-debug-arm64-try Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/146340 Reviewed-by: Aske Simon Christensen <askesc@google.com> Commit-Queue: Daco Harkes <dacoharkes@google.com>

askeksa-google · 2020-05-07T05:18:20Z

I benchmarked a post-scan for replacement characters. Here it adds about 6ns per byte on top of the TextDecoder time of 2-6ns per byte. In other words, it is a 4x slowdown for ASCII strings and 2x for the most complex strings. Still faster than the Dart version, though (for long strings).

With the changes in #41100 the handling of UTF-8 encoded surrogates in Dart now matches that of JS. Thus, the pre-pass that scans for the presence of surrogates before handing the data to TextDecoder is no longer needed. Removing this gives a significant speedup. On my laptop, in Chrome, on the Utf8Decode benchmark, it gives around 1ns per input byte out of previously roughly 2.5ns (ASCII) to 5ns (Russian). In principle, this also enables TextDecoder for allowMalformed: true, since the number of replacement characters produced by Dart now matches the WHATWG standard. This does result in failures in some browsers, where these no not adhere to the standard. For instance, Chrome outputs one replacement character per undecoded input byte when an unfinished sequence is interrupted by end-of-input, where the standard specifies only one replacement character. To work around the browser deviations, the output from TextDecoder is scanned for replacement characters, and if any are found, the decoding falls back to the Dart implementation. This workaround can be removed if the bugs are fixed in the browsers. Since TextDecoder has a large startup overhead, we also fall back to the Dart implementation for short strings. Change-Id: I9e95a95ce726ce0d9e9a3b46df8ee2512ab05f0a Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/144294 Commit-Queue: Aske Simon Christensen <askesc@google.com> Reviewed-by: Stephen Adams <sra@google.com>

The breaking change #41100 changed the UTF-8 encoder to encode unpaired surrogates as replacement characters. However, the VM contains its own, internal UTF-8 encoder, which is used for printing and for the Dart_StringToUTF8 function in the Dart API. Here, this encoder is changed to also encode unpaired surrogates as replacement characters. Fixes #42094 Change-Id: I9d55168f67d124dbc7987fb759696a98e7526c29 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/149292 Commit-Queue: Aske Simon Christensen <askesc@google.com> Reviewed-by: Martin Kustermann <kustermann@google.com> Reviewed-by: Daco Harkes <dacoharkes@google.com>

askeksa-google added area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. breaking-change-request This tracks requests for feedback on breaking changes labels Mar 18, 2020

askeksa-google self-assigned this Mar 18, 2020

lrhn assigned franklinyow Mar 19, 2020

askeksa-google mentioned this issue Mar 26, 2020

Skip surrogates in decode test dart-lang/co19#592

Merged

askeksa-google changed the title ~~[Breaking change request] Change UTF-8 decoder to match the WHATWG decoding standard~~ [Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard Mar 31, 2020

askeksa-google added the library-convert label Apr 30, 2020

askeksa-google mentioned this issue May 4, 2020

Further improvements to UTF-8 decoding #41734

Open

9 tasks

dart-bot closed this as completed in 3efa014 May 4, 2020

This was referenced May 4, 2020

Fix utf8_test to match breaking change to UTF-8 encoding and decoding dart-archive/ffi#42

Merged

Adjust documentation to match new UTF-8 encoder/decoder behavior dart-archive/ffi#43

Merged

askeksa-google mentioned this issue May 28, 2020

Print lone surrogates as replacement characters #42094

Closed

askeksa-google mentioned this issue Sep 30, 2020

Utf8Decoder should be compatible with TextDecoder.decode #31370

Closed

askeksa-google mentioned this issue Oct 9, 2020

Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737

Open

dart-lang deleted a comment from zerotobtc Sep 29, 2021

rakudrama mentioned this issue Sep 14, 2022

UTF-8 decoder produces incorrect number of replacement characters #49967

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

askeksa-google commented Mar 18, 2020 •

edited

Loading

franklinyow commented Mar 26, 2020

mkustermann commented Apr 20, 2020

mkustermann commented Apr 20, 2020

franklinyow commented Apr 22, 2020

vsmenon commented Apr 22, 2020

askeksa-google commented Apr 22, 2020

askeksa-google commented Apr 30, 2020

lrhn commented Apr 30, 2020

rakudrama commented Apr 30, 2020

askeksa-google commented Apr 30, 2020

lrhn commented May 1, 2020

Hixie commented May 3, 2020

lrhn commented May 4, 2020

askeksa-google commented May 7, 2020

[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

Comments

askeksa-google commented Mar 18, 2020 • edited Loading

Summary

What is changing:

Why is this changing?

Expected impact

Mitigation

Variations

franklinyow commented Mar 26, 2020

mkustermann commented Apr 20, 2020

mkustermann commented Apr 20, 2020

franklinyow commented Apr 22, 2020

vsmenon commented Apr 22, 2020

askeksa-google commented Apr 22, 2020

askeksa-google commented Apr 30, 2020

lrhn commented Apr 30, 2020

rakudrama commented Apr 30, 2020

askeksa-google commented Apr 30, 2020

lrhn commented May 1, 2020

Hixie commented May 3, 2020

lrhn commented May 4, 2020

askeksa-google commented May 7, 2020

askeksa-google commented Mar 18, 2020 •

edited

Loading