Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737

askeksa-google · 2020-10-09T14:24:48Z

Our UTF-8 decoder can use the TextDecoder JS class where available.

Some browsers have bugs in their TextDecoder implementation around the exact behavior of malformed input when fatal=false (corresponding to allowMalformed=true in Dart). In particular:

Chrome (and Chromium-based browsers): When an unfinished sequence is interrupted by end-of-input, or when an unfinished 4-byte sequence is interrupted after exactly 2 bytes, then one replacement character is emitted per byte instead of just one.
Safari: When a valid first byte of a multi-byte sequence is followed by fewer bytes than necessary to complete the sequence and then end-of-input, a replacement character is emitted and the rest of the input is discarded, even if it contains valid sequences.

As discussed in #41100 and #31370 we currently have a workaround for these browser bugs in place to ensure compliant behavior in all cases. This workaround takes the form of a scan for replacement characters in the converted output and constitutes a significant performance overhead, also for well-formed input.

To get rid of the workaround, we could file bugs for these browsers, wait for the bugs to be fixed and then remove the workaround.

The one in Chromium about a 4-byte sequence being interrupted after exactly 2 bytes is particularly nasty, since this is what necessitates the scan through the whole output. A workaround for just the other two bugs could just check whether the last character in the output is a replacement character.

@lrhn @rakudrama

The text was updated successfully, but these errors were encountered:

lrhn · 2020-10-12T09:16:16Z

Definitely do file bugs against the browsers.

I'm also willing to not do the workaround and forward any complaints to the browser bug. It's already an error condition It's annoying that you won't get exactly the error recovery that you expect, but I don't expect it to be a significant issue in practice.

sigmundch · 2021-11-08T18:34:53Z

I'm curious about whether there are any updates. Have such bugs been filed? Did the browser behavior change in the past year?

askeksa-google · 2021-11-11T12:04:01Z

I have not filed any browser bugs. This issue was meant as an encouragement to do so. Admittedly, the title does not reflect that clearly.

askeksa-google · 2021-11-11T12:09:16Z

Chrome 95 still has both bugs, as far as I can see.

askeksa-google · 2023-01-03T16:58:19Z

Both bugs appear to be fixed in Chrome 108. 👍

This means we can remove the scanning workaround and only check if the last emitted character is a replacement character.

We should check Safari as well. If the Safari bug is also gone, we can completely remove the workaround.

askeksa-google · 2023-01-04T10:51:18Z

Safari is still buggy. Removed the Chrome workaround in https://dart-review.googlesource.com/c/sdk/+/278181

rakudrama · 2023-01-04T19:06:45Z

We can't remove the work-around yet.

rakudrama · 2023-11-12T02:27:08Z

Update: One of our 1P customers supports really old browsers like Safari 12.1 and Chrome 80. We can't remove the workaround until their supported browser set is compliant.

lrhn · 2023-11-12T09:23:41Z

It's also an option to remove the workarounds and document that UTF-8 encoding uses the platform's implementation of available, bugs and all. We can call it a breaking change of that's necessary.

Don't like that, don't allow malformed inputs!
Allowing invalid inputs is a bad idea anyway, if you're going to interpret the string in any way, and if you're only storing a user string, they might want to be notified that their input is invalid, so they don't choke when you send it back to them with replacement characters inside.

Deciding with replacement characters should really only be used for cases where they immediately become visible to the user, like an implicit error report.
Fx deciding Dart source files with allow-invalid is mostly fine, because the parser will report the error anyway. Except inside string literals, where we probably should just require that any replacement character, BOM or unpaired surrogate must be escaped. And comments, where it also might matter to DartDoc, which can itself warn about seeing replacement characters.

Accepting WTF-8 might be something we'd want to do independently of other invalid code unit sequences, so you don't need to allow all invalid encoding just to get those. Or overlong zero terminators, if people are into that, as long as it's a single, well-defined exception with a clear meaning, that you can ask for.

askeksa · 2023-11-14T07:57:44Z

For perspective, the above fix (which confers a significant performance boost) only makes a difference in cases where a 4-byte sequence is interrupted by more input (not by end-of-input) after exactly 2 bytes. In that specific case, old Chromium browsers will emit two replacement characters instead of one.

rakudrama · 2023-11-14T20:57:38Z

For perspective, the above fix (which confers a significant performance boost) only makes a difference in cases where a 4-byte sequence is interrupted by more input (not by end-of-input) after exactly 2 bytes. In that specific case, old Chromium browsers will emit two replacement characters instead of one.

Interesting that the variation with respect to the spec is so narrow. I don't follow the 'not by end of input' part - the intercepted path is used only on a non-chunked version.

Another approach might be to always use the fatal TextDecoder and use the Error as a signal of needing to run our code to insert a consistent number of replacement characters. I don't recall if there is a reason we don't do that.

The code is being refactored here. It still contains the old indexOf workaround.

lrhn · 2023-11-15T10:41:47Z

As I understand the explanation, if the input is [0xF2, 0x80, 0x7f], the initial 0xF2, 0x80 is the first two code units of a four-byte UTF-8 encoding. Since it's followed by a non-continuation byte (not in range 0x80-0xBF), the four-byte sequence is uncomplete.

That should produce a single U+FFFD in the output, it's a single incomplete sequence, but in those versions of Chrome it emitted two U+FFFD characters, one per invalid code unit.

If the input had been just [0xF2, 0x80], ending at the incomplete sequence, it is emitted as a single U+FFFD character. It's only if the incomplete sequence is interrupted by another code unit that it was treated incorrectly.

If it had been the other way around, the discrepancy would only have happened at the end of input, and would be easy to check for.

As it is, it can happen in the middle, and that's why we'd have to scan the entire input to recognize it. (I guess we can scan the input and output in tandem, only stopping if seeing two U+FFFDs in a row, and then checking if it comes from a single incomplete sequence - which means checking that the number of consecutive U+FFFDs match what the input should have generated, which is very close to parsing the input and converting it again.)

I'd be perfectly happy to allow discrepancies in how incorrect inputs are decoded, if they re introduced by the underlying system. Update your browers if it matters (which it shouldn't, because you should only ever be getting valid input, right?).

askeksa · 2023-11-15T14:27:24Z

It actually also happens similarly at end of input. It's just that the optimized workaround in https://dart-review.googlesource.com/c/sdk/+/278181 (which works around the Safari bug that still exists) catches the end-of-input case, so with that workaround, only middle-of-input situations will result in non-compliant output.

askeksa-google added the area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. label Oct 9, 2020

sigmundch added the web-dart2js label Nov 8, 2021

sigmundch added P3 A lower priority bug or feature request and removed area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. labels Nov 8, 2021

mraleph added the area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. label Nov 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737

Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737

askeksa-google commented Oct 9, 2020

lrhn commented Oct 12, 2020

sigmundch commented Nov 8, 2021

askeksa-google commented Nov 11, 2021

askeksa-google commented Nov 11, 2021

askeksa-google commented Jan 3, 2023 •

edited

Loading

askeksa-google commented Jan 4, 2023

rakudrama commented Jan 4, 2023

rakudrama commented Nov 12, 2023

lrhn commented Nov 12, 2023

askeksa commented Nov 14, 2023 •

edited

Loading

rakudrama commented Nov 14, 2023 •

edited

Loading

lrhn commented Nov 15, 2023

askeksa commented Nov 15, 2023

Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737

Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737

Comments

askeksa-google commented Oct 9, 2020

lrhn commented Oct 12, 2020

sigmundch commented Nov 8, 2021

askeksa-google commented Nov 11, 2021

askeksa-google commented Nov 11, 2021

askeksa-google commented Jan 3, 2023 • edited Loading

askeksa-google commented Jan 4, 2023

rakudrama commented Jan 4, 2023

rakudrama commented Nov 12, 2023

lrhn commented Nov 12, 2023

askeksa commented Nov 14, 2023 • edited Loading

rakudrama commented Nov 14, 2023 • edited Loading

lrhn commented Nov 15, 2023

askeksa commented Nov 15, 2023

askeksa-google commented Jan 3, 2023 •

edited

Loading

askeksa commented Nov 14, 2023 •

edited

Loading

rakudrama commented Nov 14, 2023 •

edited

Loading