Utf8Decoder should be compatible with TextDecoder.decode #31370

rakudrama · 2017-11-14T20:18:22Z

The browser's TextDecoder.prototype.decode treats surrogates (U+D800 through U+DFFF) differently from Utf8Decoder.
This makes it difficult to use TextDecoder to accelerate conversion.
Acceleration is highly desirable - it improves one binary protobuf benchmark by 8x.

The main difference is that Utf8Decoder converts surrogates into a code point, but TextDecoder considers a surrogate to be an error and, depending on the fatal option, either throws an error, or decodes the surrogate to U+FFFD REPLACEMENT CHARACTER.
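For illustration, here is a minimal sketch of the TextDecoder side of that difference. The byte sequence ED A0 80 is my own choice of input: it is the three-byte pattern that would encode U+D800 if UTF-8 permitted surrogates. The variable names are made up for this example, and the exact number of U+FFFD characters produced may depend on the engine, so the sketch only checks that at least one appears.

```typescript
// ED A0 80 is the three-byte pattern that would encode the surrogate
// U+D800 if UTF-8 allowed surrogates (assumed sample input).
const loneSurrogateBytes = new Uint8Array([0xed, 0xa0, 0x80]);

// Default mode (fatal: false): malformed input is replaced with U+FFFD.
const lenient: string = new TextDecoder("utf-8").decode(loneSurrogateBytes);
const containsReplacement: boolean = lenient.includes("\uFFFD");
const containsSurrogate: boolean = lenient.includes("\uD800");

// Strict mode (fatal: true): the same input makes decode() throw a TypeError.
let strictThrew = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(loneSurrogateBytes);
} catch (e) {
  strictThrew = e instanceof TypeError;
}
```

Neither mode yields the code unit U+D800 itself, which is what Utf8Decoder produced at the time of this report.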

It is not possible to get acceptable performance for allowMalformed: true by first trying {fatal: true}, catching the Error, and re-decoding with the slow code: throwing the error is ~1000x more expensive.
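The try-then-fall-back approach described above could be sketched roughly as follows. This is only an illustration of the pattern being ruled out, not code from the SDK: `decodeWithFallback` and the `SlowDecoder` callback type are hypothetical names, and the slow decoder is passed in as a parameter to stand in for the existing non-accelerated path.

```typescript
// Hypothetical stand-in for the existing slow, malformed-tolerant decoder.
type SlowDecoder = (bytes: Uint8Array) => string;

// A strict decoder can be reused across calls when not streaming.
const strictDecoder = new TextDecoder("utf-8", { fatal: true });

function decodeWithFallback(bytes: Uint8Array, slowDecode: SlowDecoder): string {
  try {
    // Fast path: succeeds only when the input is well-formed UTF-8.
    return strictDecoder.decode(bytes);
  } catch (e) {
    // TextDecoder signals malformed input with a TypeError;
    // anything else is unexpected and is rethrown.
    if (!(e instanceof TypeError)) throw e;
    // Slow path: every malformed input pays for the throw and a second pass.
    return slowDecode(bytes);
  }
}
```

The fast path is cheap, but each malformed input pays for the exception plus a full second pass, which is the cost the issue objects to.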

Everything would be simpler if Utf8Decoder were completely aligned with TextDecoder.decode.

I have also verified that for other malformed inputs, TextDecoder and Utf8Decoder disagree on the number of U+FFFD replacements generated.

lrhn · 2020-09-30T20:27:00Z

@askeksa-google did this.

askeksa-google · 2020-09-30T20:35:46Z

In principle, yes. We currently have a workaround for some browser bugs.

We could have an issue for reporting the bugs, waiting for them to be fixed, and then removing the workaround.

rakudrama added the library-convert label Nov 14, 2017

vsmenon added the area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. label Nov 16, 2017

rakudrama mentioned this issue Jan 10, 2018

Utf8Codec doesn't throw FormatException for single and paired UTF-16 surrogates #28832

Closed

rakudrama mentioned this issue Jan 21, 2018

UTF8.decode is slow (performance) #31954

Closed

askeksa-google mentioned this issue Mar 18, 2020

[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

Closed

lrhn closed this as completed Sep 30, 2020

askeksa-google mentioned this issue Oct 9, 2020

Remove workaround for TextDecoder browser bugs when the bugs are fixed #43737

Open