-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737
Comments
Definitely do file bugs against the browsers. I'm also willing to not do the workaround and forward any complaints to the browser bug. It's already an error condition It's annoying that you won't get exactly the error recovery that you expect, but I don't expect it to be a significant issue in practice. |
I'm curious about whether there are any updates. Have such bugs been filed? Did the browser behavior change in the past year? |
I have not filed any browser bugs. This issue was meant as an encouragement to do so. Admittedly, the title does not reflect that clearly. |
Chrome 95 still has both bugs, as far as I can see. |
Both bugs appear to be fixed in Chrome 108. 👍 This means we can remove the scanning workaround and only check if the last emitted character is a replacement character. We should check Safari as well. If the Safari bug is also gone, we can completely remove the workaround. |
Safari is still buggy. Removed the Chrome workaround in https://dart-review.googlesource.com/c/sdk/+/278181 |
We can't remove the work-around yet. |
Update: One of our 1P customers supports really old browsers like Safari 12.1 and Chrome 80. We can't remove the workaround until their supported browser set is compliant. |
It's also an option to remove the workarounds and document that UTF-8 encoding uses the platform's implementation of available, bugs and all. We can call it a breaking change of that's necessary. Don't like that, don't allow malformed inputs! Deciding with replacement characters should really only be used for cases where they immediately become visible to the user, like an implicit error report. Accepting WTF-8 might be something we'd want to do independently of other invalid code unit sequences, so you don't need to allow all invalid encoding just to get those. Or overlong zero terminators, if people are into that, as long as it's a single, well-defined exception with a clear meaning, that you can ask for. |
For perspective, the above fix (which confers a significant performance boost) only makes a difference in cases where a 4-byte sequence is interrupted by more input (not by end-of-input) after exactly 2 bytes. In that specific case, old Chromium browsers will emit two replacement characters instead of one. |
Interesting that the variation with respect to the spec is so narrow. I don't follow the 'not by end of input' part - the intercepted path is used only on a non-chunked version. Another approach might be to always use the fatal TextDecoder and use the Error as a signal of needing to run our code to insert a consistent number of replacement characters. I don't recall if there is a reason we don't do that. The code is being refactored here. It still contains the old |
As I understand the explanation, if the input is That should produce a single U+FFFD in the output, it's a single incomplete sequence, but in those versions of Chrome it emitted two U+FFFD characters, one per invalid code unit. If the input had been just If it had been the other way around, the discrepancy would only have happened at the end of input, and would be easy to check for. As it is, it can happen in the middle, and that's why we'd have to scan the entire input to recognize it. (I guess we can scan the input and output in tandem, only stopping if seeing two U+FFFDs in a row, and then checking if it comes from a single incomplete sequence - which means checking that the number of consecutive U+FFFDs match what the input should have generated, which is very close to parsing the input and converting it again.) I'd be perfectly happy to allow discrepancies in how incorrect inputs are decoded, if they re introduced by the underlying system. Update your browers if it matters (which it shouldn't, because you should only ever be getting valid input, right?). |
It actually also happens similarly at end of input. It's just that the optimized workaround in https://dart-review.googlesource.com/c/sdk/+/278181 (which works around the Safari bug that still exists) catches the end-of-input case, so with that workaround, only middle-of-input situations will result in non-compliant output. |
Our UTF-8 decoder can use the
TextDecoder
JS class where available.Some browsers have bugs in their
TextDecoder
implementation around the exact behavior of malformed input whenfatal=false
(corresponding toallowMalformed=true
in Dart). In particular:As discussed in #41100 and #31370 we currently have a workaround for these browser bugs in place to ensure compliant behavior in all cases. This workaround takes the form of a scan for replacement characters in the converted output and constitutes a significant performance overhead, also for well-formed input.
To get rid of the workaround, we could file bugs for these browsers, wait for the bugs to be fixed and then remove the workaround.
The one in Chromium about a 4-byte sequence being interrupted after exactly 2 bytes is particularly nasty, since this is what necessitates the scan through the whole output. A workaround for just the other two bugs could just check whether the last character in the output is a replacement character.
@lrhn @rakudrama
The text was updated successfully, but these errors were encountered: