Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

Closed
askeksa-google opened this issue Mar 18, 2020 · 14 comments
Assignees
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. breaking-change-request This tracks requests for feedback on breaking changes library-convert

Comments

@askeksa-google
Copy link

askeksa-google commented Mar 18, 2020

Summary

Change encoding and decoding of UTF-8 to conform to the WHATWG encoding standard. This means that it will never emit invalid UTF-8, only accept valid UTF-8 and be compatible with the TextEncoder and TextDecoder classes in JavaScript.

Related issues: #7046, #22330, #28832, #31370, #31954

What is changing:

  • When decoding UTF-8 data with the Utf8Codec or Utf8Decoder class, the input is considered malformed if it contains an encoded surrogate character (code point in the range U+D800-U+DFFF, encoded in UTF-8 as a 3-byte character encoding where the first byte is 0xED and the second byte is in the range 0xA0-0xBF).
  • When encoding a string as UTF-8 with the Utf8Codec or Utf8Encoder class, and the string contains an unpaired surrogate, that surrogate is emitted as a replacement character (U+FFFD, encoded in UTF-8 as 0xEF, 0xBF, 0xBD) instead of an encoded surrogate (which is invalid UTF-8). For chunked conversion, if a chunk ends with a high surrogate and the next chunk starts with a low surrogate, these surrogates are considered properly paired and are combined, like before.
  • When decoding malformed UTF-8 data with allowMalformed set to true, the number of replacement characters emitted will sometimes differ from the number currently emitted. Specifically, the decoder will emit one replacement character for each maximal sequence of input bytes that is either
    1. a prefix of a valid encoding of a character, or
    2. a single byte that is not a prefix of a valid encoding of a character.
  • When decoding malformed UTF-8 data with allowMalformed set to false, the offset in the resulting FormatException will point to the first byte from which the decoder can conclude that the sequence is malformed, rather than the first byte that was not decoded successfully. Also, the message of the FormatException will sometimes be different from what it is currently. If the input contains more than one error, the FormatException may point to a different error than before.

Why is this changing?

Dart strings (like JS and Java strings) may contain unpaired surrogates. The current strategy of allowing surrogates when encoding and decoding UTF-8 ensures that any Dart string can be encoded as UTF-8 (actually, WTF-8) and decoded back into the original string.

This strategy has a number of drawbacks:

  • The output of the UTF-8 encoder is sometimes not valid UTF-8, which can be problematic when this data needs to be consumed by other programs.
  • When UTF-8 data is read, and that data contains encoded unpaired surrogates, this may cause problems much later, when the string is processed, rather that catching the invalid encoding up front.
  • The Dart behavior deviates from JS, which means that when Dart code is translated to JS, UTF-8 encoding and decoding can't directly use the JS TextEncoder and TextDecoder classes. It must do some or all of the conversion in Dart code, which has a significant performance cost.
  • Retaining exact compatibility with the current error behavior complicates some planned optimizations to UTF-8 decoding in the Dart VM.

The purpose of the change is thus to:

  • ensure that Dart programs don't inadvertently produce or accept invalid UTF-8.
  • enable faster UTF-8 encoding and decoding for both JS and VM targets.

Expected impact

Programs manipulating strings through usual string operations are unlikely to be affected.

A program may be affected by this change if it does any of the following:

  1. Manipulates strings in a way that may introduce unpaired surrogates, encodes these strings as UTF-8, decodes them again and expects the string contents to be preserved.
  2. Encodes arbitrary substrings as UTF-8 without regard to surrogate pairs, decodes them again, concatenates them (before or after decoding) and expects the string contents to be preserved.
  3. Relies on the exact offsets and/or error messages from decoding invalid UTF-8 data or which error is reported in case of multiple errors.
  4. Relies on the number of replacement characters produced by decoding invalid UTF-8 data.

Mitigation

For the scenarios listed above:

  1. UTF-8, being an interchange format, is unsuited for representing such broken strings. Consider a different representation.
  2. For encoding in multiple chunks, use the chunked conversion API.
  3. Adjust the program to detect specific errors in a different way, or adapt it to the new errors.
  4. Same as 3.

Variations

An optional allowSurrogates parameter could be added to the encoder and decoder to support the round-trip use case. To obtain the performance benefits, it should default to false. This could introduce further breakage for programs implementing the Utf8Codec interface (unless we only put the flag on the constructors).

If the surrogate change is considered too risky, the error and replacement character changes on their own can still ease the VM optimizations and possibly improve the performance of JS when allowMalformed is set to true.

@askeksa-google askeksa-google added area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. breaking-change-request This tracks requests for feedback on breaking changes labels Mar 18, 2020
@askeksa-google askeksa-google self-assigned this Mar 18, 2020
@franklinyow
Copy link
Contributor

Please send the announce
https://github.com/dart-lang/sdk/blob/master/docs/process/breaking-changes.md

The change will need to wait for the beta branch snapped before it can be submitted.

@askeksa-google askeksa-google changed the title [Breaking change request] Disallow surrogates in encoding and decoding of UTF-8 and change number of replacement characters generated from invalid UTF-8 to one per un-decoded input byte. [Breaking change request] Change UTF-8 decoder to match the WHATWG decoding standard Mar 30, 2020
@askeksa-google askeksa-google changed the title [Breaking change request] Change UTF-8 decoder to match the WHATWG decoding standard [Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard Mar 31, 2020
@mkustermann
Copy link
Member

The breaking change has been announced on dart-announce, see announcement.

@mkustermann
Copy link
Member

The change will need to wait for the beta branch snapped before it can be submitted.

The beta branch has been cut, so this is not blocked anymore.

We would like to move forward with this.

@franklinyow Can you help us getting the LGTM from all veto powers?

@franklinyow
Copy link
Contributor

cc @Hixie @matanlurey @dgrove @vsmenon for review and approval.

@vsmenon
Copy link
Member

vsmenon commented Apr 22, 2020

Is there an impact on prod code size for web or native? ( @rakudrama )

@askeksa-google
Copy link
Author

@rakudrama and I just discussed the web situation. There is currently a 4k size increase and a significant speed regression for small inputs, but we have ideas for fixing both issues.

For native there will probably be a small (a few k) constant size increase.

@askeksa-google
Copy link
Author

One detail that we need to decide on: for JS, when allowMalformed is set to true, we can choose either to fall back to the Dart implementation like we do currently, or we can use the TextDecoder (when available).

Using TextDecoder is significantly faster, but it might result in different output between browsers, since not all browsers follow the WHATWG standard precisely. Experiments indicate that:

  • Firefox follows the standard.
  • Chrome (and Chromium-based browsers): When an unfinished sequence is interrupted by end-of-input, or when an unfinished 4-byte sequence is interrupted after exactly 2 bytes, then one replacement character is emitted per byte instead of just one.
  • Safari: When a valid first byte of a multi-byte sequence is followed by fewer bytes than necessary to complete the sequence and then end-of-input, a replacement character is emitted and the rest of the input is discarded, even if it contains valid sequences.

@lrhn @rakudrama @vsmenon WDYT?

@lrhn
Copy link
Member

lrhn commented Apr 30, 2020

Use TextDecoder and file bugs against Chrome and Safari.
(And mark the corresponding tests as failing on those browsers).

I don't think the exact behavior on malformed inputs is that important. We don't otherwise try to work around browser bugs.

@rakudrama
Copy link
Member

Using TextDecoder for short inputs is slower than the Dart implementation, so we will likely use a hybrid approach, using the backup Dart implementation for short inputs, where the threshold depends on some arguments.
@lrhn How do you feel about the bug existing for some inputs but not others?

@askeksa-google, is the following statement true: All cases of non-standard behaviour have an output containing the replacement character.
If so, we have a relatively simple remedy: in the allowMalformed case, check the output for the replacement character and if present, redo the conversion with the backup dart code.
It might be less work to do this than split the tests to keep good test coverage.

We do work around some other browser bugs - for example, fixing the unusual exponential form in IE11's Number.toString(radix), and various differences in the DOM APIs.

@askeksa-google
Copy link
Author

That statement is true. A stricter statement which is also true is: All cases of non-standard behaviour have an output containing the replacement character as the last character or two replacement characters in a row.

The "when an unfinished 4-byte sequence is interrupted after exactly 2 bytes" deviation in Chromium is unfortunate, since without that (assuming only the other observed deviations) we would only have to check the last character of the output.

For mostly-ASCII inputs, it might very well be faster to always fall back than to scan the result. At least the cross-over point will be quite high.

@lrhn
Copy link
Member

lrhn commented May 1, 2020

@lrhn How do you feel about the bug existing for some inputs but not others?

I feel fine. File bug reports against the browsers and hope they get around to fixing it.

It's still only error handling, it's not like they do something different for a well-formed input.

@Hixie
Copy link
Contributor

Hixie commented May 3, 2020

Having standards-compliant error handling can be as important as handling valid input. For one, variance in error handling is often a vector in vulnerabilities. For another, most people don't care if input is valid or not. They just care that all their software interoperates.

dart-bot pushed a commit that referenced this issue May 4, 2020
This adjusts all UTF-8 tests to the new semantics in the breaking
change described here: #41100

This has three parts:
- Unpaired surrogates are encoded as replacement characters, and
  encoded surrogates are considered malformed input when decoding.
- Decoding errors are generally reported on the position of the byte
  that conclusively makes the input malformed.
- The number of replacement characters emitted by the decoder is
  generally one per unfinished sequence or undecodable byte.

The code changes to implement the new semantics are placed in subsequent
commits.

Change-Id: I4cc8ce660e39287e734070764ab8e1f0ebb8b9e0
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/143815
Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>
dart-bot pushed a commit that referenced this issue May 4, 2020
This implements the encoding part of the breaking change described at
#41100

Change-Id: I22f2ffc24efc783a2199f640690a85c70a85e7d2
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/143818
Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>
dart-bot pushed a commit that referenced this issue May 4, 2020
Two-pass decoder: the first pass scans through the input to compute
the length of the resulting string and which decoder to use, and the
second pass does the actual decoding.

The same decoder is used for both one-shot and chunked decoding, and
both with and without allowMalformed. If there is an error in the input
and allowMalformed is true, it starts over with a general decoder that
supports malformed input and allocates space as it goes along.

JS targets go directly to the general decoder, as the two-pass approach
is not beneficial here.

Three pieces of the decoder are designed to be pluggable by patches to
optimize the performance further:
- scan, running the first pass of the conversion.
- decode8, decoding Latin1 data into a OneByteString.
- decode16, decoding arbitrary data into a TwoByteString.

Improves decoding speed, especially for complex input (many multi-byte
characters). Observed speed increases are approximately:
 - dart2js: up to 40%
 - VM JIT:  up to 260%
 - VM AOT:  up to 130%

The constant overhead of calling the UTF-8 decoder is also significantly
reduced for dart2js.

Code size for dart2js is slightly reduced compared to the old decoder.

ASCII inputs currently see a slight speed decrease for VM targets, which
will be fixed in https://dart-review.googlesource.com/c/sdk/+/145460

This is part of the implementation of the breaking change described at
#41100

Closes #28832
Closes #31954

Ideas for further improvements to the decoder are collected in
#41734

Change-Id: I3c5bb84e8d6783231680a9d34d6c38e8a28ab112
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/142025
Reviewed-by: Stephen Adams <sra@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
dart-bot pushed a commit that referenced this issue May 4, 2020
This brings JSON encoding and decoding in line with the UTF-8 changes
described at #41100

The fused UTF-8 / JSON decoder for the VM now uses the new UTF-8 decoder
instead of its own, separate UTF-8 decoder.

The JSON encoder now escapes lone surrogates, so it can encode JSON
string values containing lone surrogates while keeping its output valid
UTF-8.

Change-Id: Ie4d4601cf84012068849e64d4670f2dcd49ea088
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/144286
Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>
@lrhn
Copy link
Member

lrhn commented May 4, 2020

Having consistent behavior is a good thing, but there is a limit to how much work should be spent on mitigating browser bugs. I don't think the impact here is big enough to warrant adding detection and workarounds for something which is, arguably, already an error situation.

dart-bot pushed a commit that referenced this issue May 4, 2020
Update package:ffi to a version which does not depend on unpaired
surrogates.

Breaking change in Dart: #41100.

Change-Id: I2a5ba0abee7c6cccb166c234f8f620dbe0063d47
Cq-Include-Trybots: luci.dart.try:vm-ffi-android-debug-arm-try,vm-ffi-android-debug-arm64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/146340
Reviewed-by: Aske Simon Christensen <askesc@google.com>
Commit-Queue: Daco Harkes <dacoharkes@google.com>
@askeksa-google
Copy link
Author

I benchmarked a post-scan for replacement characters. Here it adds about 6ns per byte on top of the TextDecoder time of 2-6ns per byte. In other words, it is a 4x slowdown for ASCII strings and 2x for the most complex strings. Still faster than the Dart version, though (for long strings).

dart-bot pushed a commit that referenced this issue May 13, 2020
With the changes in #41100 the
handling of UTF-8 encoded surrogates in Dart now matches that of JS.

Thus, the pre-pass that scans for the presence of surrogates before
handing the data to TextDecoder is no longer needed. Removing this
gives a significant speedup. On my laptop, in Chrome, on the Utf8Decode
benchmark, it gives around 1ns per input byte out of previously roughly
2.5ns (ASCII) to 5ns (Russian).

In principle, this also enables TextDecoder for allowMalformed: true,
since the number of replacement characters produced by Dart now matches
the WHATWG standard. This does result in failures in some browsers,
where these no not adhere to the standard. For instance, Chrome outputs
one replacement character per undecoded input byte when an unfinished
sequence is interrupted by end-of-input, where the standard specifies
only one replacement character.

To work around the browser deviations, the output from TextDecoder is
scanned for replacement characters, and if any are found, the decoding
falls back to the Dart implementation. This workaround can be removed
if the bugs are fixed in the browsers.

Since TextDecoder has a large startup overhead, we also fall back to the
Dart implementation for short strings.

Change-Id: I9e95a95ce726ce0d9e9a3b46df8ee2512ab05f0a
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/144294
Commit-Queue: Aske Simon Christensen <askesc@google.com>
Reviewed-by: Stephen Adams <sra@google.com>
dart-bot pushed a commit that referenced this issue Jun 29, 2020
The breaking change #41100
changed the UTF-8 encoder to encode unpaired surrogates as replacement
characters.

However, the VM contains its own, internal UTF-8 encoder, which is used
for printing and for the Dart_StringToUTF8 function in the Dart API.
Here, this encoder is changed to also encode unpaired surrogates as
replacement characters.

Fixes #42094

Change-Id: I9d55168f67d124dbc7987fb759696a98e7526c29
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/149292
Commit-Queue: Aske Simon Christensen <askesc@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Reviewed-by: Daco Harkes <dacoharkes@google.com>
@dart-lang dart-lang deleted a comment from zerotobtc Sep 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. breaking-change-request This tracks requests for feedback on breaking changes library-convert
Projects
None yet
Development

No branches or pull requests

7 participants