Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737

Open
askeksa-google opened this issue Oct 9, 2020 · 13 comments
Open
Labels
area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. P3 A lower priority bug or feature request web-dart2js

Comments

@askeksa-google
Copy link

Our UTF-8 decoder can use the TextDecoder JS class where available.

Some browsers have bugs in their TextDecoder implementation around the exact behavior of malformed input when fatal=false (corresponding to allowMalformed=true in Dart). In particular:

  • Chrome (and Chromium-based browsers): When an unfinished sequence is interrupted by end-of-input, or when an unfinished 4-byte sequence is interrupted after exactly 2 bytes, then one replacement character is emitted per byte instead of just one.
  • Safari: When a valid first byte of a multi-byte sequence is followed by fewer bytes than necessary to complete the sequence and then end-of-input, a replacement character is emitted and the rest of the input is discarded, even if it contains valid sequences.

As discussed in #41100 and #31370 we currently have a workaround for these browser bugs in place to ensure compliant behavior in all cases. This workaround takes the form of a scan for replacement characters in the converted output and constitutes a significant performance overhead, also for well-formed input.

To get rid of the workaround, we could file bugs for these browsers, wait for the bugs to be fixed and then remove the workaround.

The one in Chromium about a 4-byte sequence being interrupted after exactly 2 bytes is particularly nasty, since this is what necessitates the scan through the whole output. A workaround for just the other two bugs could just check whether the last character in the output is a replacement character.

@lrhn @rakudrama

@askeksa-google askeksa-google added the area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. label Oct 9, 2020
@lrhn
Copy link
Member

lrhn commented Oct 12, 2020

Definitely do file bugs against the browsers.

I'm also willing to not do the workaround and forward any complaints to the browser bug. It's already an error condition It's annoying that you won't get exactly the error recovery that you expect, but I don't expect it to be a significant issue in practice.

@sigmundch
Copy link
Member

I'm curious about whether there are any updates. Have such bugs been filed? Did the browser behavior change in the past year?

@sigmundch sigmundch added P3 A lower priority bug or feature request and removed area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. labels Nov 8, 2021
@mraleph mraleph added the area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. label Nov 9, 2021
@askeksa-google
Copy link
Author

I have not filed any browser bugs. This issue was meant as an encouragement to do so. Admittedly, the title does not reflect that clearly.

@askeksa-google
Copy link
Author

Chrome 95 still has both bugs, as far as I can see.

@askeksa-google
Copy link
Author

askeksa-google commented Jan 3, 2023

Both bugs appear to be fixed in Chrome 108. 👍

This means we can remove the scanning workaround and only check if the last emitted character is a replacement character.

We should check Safari as well. If the Safari bug is also gone, we can completely remove the workaround.

@askeksa-google
Copy link
Author

Safari is still buggy. Removed the Chrome workaround in https://dart-review.googlesource.com/c/sdk/+/278181

@rakudrama
Copy link
Member

We can't remove the work-around yet.

@rakudrama
Copy link
Member

Update: One of our 1P customers supports really old browsers like Safari 12.1 and Chrome 80. We can't remove the workaround until their supported browser set is compliant.

@lrhn
Copy link
Member

lrhn commented Nov 12, 2023

It's also an option to remove the workarounds and document that UTF-8 encoding uses the platform's implementation of available, bugs and all. We can call it a breaking change of that's necessary.

Don't like that, don't allow malformed inputs!
Allowing invalid inputs is a bad idea anyway, if you're going to interpret the string in any way, and if you're only storing a user string, they might want to be notified that their input is invalid, so they don't choke when you send it back to them with replacement characters inside.

Deciding with replacement characters should really only be used for cases where they immediately become visible to the user, like an implicit error report.
Fx deciding Dart source files with allow-invalid is mostly fine, because the parser will report the error anyway. Except inside string literals, where we probably should just require that any replacement character, BOM or unpaired surrogate must be escaped. And comments, where it also might matter to DartDoc, which can itself warn about seeing replacement characters.

Accepting WTF-8 might be something we'd want to do independently of other invalid code unit sequences, so you don't need to allow all invalid encoding just to get those. Or overlong zero terminators, if people are into that, as long as it's a single, well-defined exception with a clear meaning, that you can ask for.

@askeksa
Copy link
Contributor

askeksa commented Nov 14, 2023

For perspective, the above fix (which confers a significant performance boost) only makes a difference in cases where a 4-byte sequence is interrupted by more input (not by end-of-input) after exactly 2 bytes. In that specific case, old Chromium browsers will emit two replacement characters instead of one.

@rakudrama
Copy link
Member

rakudrama commented Nov 14, 2023

For perspective, the above fix (which confers a significant performance boost) only makes a difference in cases where a 4-byte sequence is interrupted by more input (not by end-of-input) after exactly 2 bytes. In that specific case, old Chromium browsers will emit two replacement characters instead of one.

Interesting that the variation with respect to the spec is so narrow. I don't follow the 'not by end of input' part - the intercepted path is used only on a non-chunked version.

Another approach might be to always use the fatal TextDecoder and use the Error as a signal of needing to run our code to insert a consistent number of replacement characters. I don't recall if there is a reason we don't do that.

The code is being refactored here. It still contains the old indexOf workaround.

@lrhn
Copy link
Member

lrhn commented Nov 15, 2023

As I understand the explanation, if the input is [0xF2, 0x80, 0x7f], the initial 0xF2, 0x80 is the first two code units of a four-byte UTF-8 encoding. Since it's followed by a non-continuation byte (not in range 0x80-0xBF), the four-byte sequence is uncomplete.

That should produce a single U+FFFD in the output, it's a single incomplete sequence, but in those versions of Chrome it emitted two U+FFFD characters, one per invalid code unit.

If the input had been just [0xF2, 0x80], ending at the incomplete sequence, it is emitted as a single U+FFFD character. It's only if the incomplete sequence is interrupted by another code unit that it was treated incorrectly.

If it had been the other way around, the discrepancy would only have happened at the end of input, and would be easy to check for.

As it is, it can happen in the middle, and that's why we'd have to scan the entire input to recognize it. (I guess we can scan the input and output in tandem, only stopping if seeing two U+FFFDs in a row, and then checking if it comes from a single incomplete sequence - which means checking that the number of consecutive U+FFFDs match what the input should have generated, which is very close to parsing the input and converting it again.)

I'd be perfectly happy to allow discrepancies in how incorrect inputs are decoded, if they re introduced by the underlying system. Update your browers if it matters (which it shouldn't, because you should only ever be getting valid input, right?).

@askeksa
Copy link
Contributor

askeksa commented Nov 15, 2023

It actually also happens similarly at end of input. It's just that the optimized workaround in https://dart-review.googlesource.com/c/sdk/+/278181 (which works around the Safari bug that still exists) catches the end-of-input case, so with that workaround, only middle-of-input situations will result in non-compliant output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. P3 A lower priority bug or feature request web-dart2js
Projects
None yet
Development

No branches or pull requests

6 participants