
Add TextDecoder benchmark #3180

Merged: 6 commits merged into denoland:master from ry:text_decoder_perf on Oct 22, 2019

Conversation

ry (Collaborator) commented Oct 22, 2019

ref #3163
cc @qwerasd205

merge on approval

ry added 2 commits Oct 22, 2019
@ry ry requested a review from piscisaureus Oct 22, 2019
qwerasd205 (Contributor) commented Oct 22, 2019

Since the benchmark utilizes random values, it should probably be run with a seed to make it deterministic (and thereby more consistent)
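One way to get that determinism would be to swap Math.random for a small seeded PRNG. A sketch using mulberry32 (illustrative only, not part of the PR):

```typescript
// Minimal seeded PRNG (mulberry32): a hypothetical drop-in for Math.random()
// so the benchmark input is reproducible across runs.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return (): number => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Result is a float in [0, 1), like Math.random().
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same seed yields the same sequence, so two benchmark runs see identical data.
const randA = mulberry32(42);
const randB = mulberry32(42);
const a = [randA(), randA(), randA()];
const b = [randB(), randB(), randB()];
```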

ry (Collaborator, Author) commented Oct 22, 2019

@qwerasd205 I think seeing the distribution is useful

qwerasd205 (Contributor) commented Oct 22, 2019

Oh, does it do multiple runs?

ry added 2 commits Oct 22, 2019
ry (Collaborator, Author) commented Oct 22, 2019

    return Math.floor(Math.random() * 0xffffffff);
}

function generateRandomUTF8(size) {

piscisaureus (Collaborator) commented on the diff Oct 22, 2019

This function is wrong in countless ways.
Care to look again after a good night of sleep?

piscisaureus (Collaborator) left a comment

I can't even

qwerasd205 (Contributor) commented Oct 22, 2019

Blame me for the terrible code; I'm the one who wrote it, hastily, while quite tired. I recognized at the time that it was poor code and suggested replacing the awful function for generating random UTF-8 data with a file read.

qwerasd205 (Contributor) commented Oct 22, 2019

Here's some slightly better code:

function generateRandUtf8(bytes: number): Uint8Array {
    let randString = '';
    const buffer = new Uint16Array(1024);
    let bufferIndex = 0;
    for (var i = 0; i <= Math.round(bytes / 3); i++) {
        buffer[bufferIndex++] = Math.floor(Math.random() * 0xFFFF) & 0xFC00;
        if (bufferIndex === 1024) {
            randString += String.fromCharCode.apply(null, buffer);
            bufferIndex = 0;
        }
    }
    randString += String.fromCharCode.apply(null, buffer.subarray(0,bufferIndex));
    return new TextEncoder().encode(randString).subarray(0, bytes);
}

const rData = generateRandUtf8(10 * 1024 * 1024);
new TextDecoder().decode(rData);

Generates and then decodes 10 MB of random, well-formed (except possibly a bad byte or two at the end due to truncation) UTF-8 data consisting entirely of 3- and 4-byte sequences. A similar benchmark should probably be made for all-ASCII text, and one for just generating the data, to give a reference line for the fixed overhead. Also, somewhat unrelated but important: there should be unit tests for TextDecoder and TextEncoder.

ry (Collaborator, Author) commented Oct 22, 2019

^-- @piscisaureus does that work for you? Note this code is just for benchmarking TextDecoder - it's not part of the release.

piscisaureus (Collaborator) commented Oct 22, 2019

  • With the above code, I would bet money that generating the random text takes an order of magnitude longer than decoding it. So this becomes a benchmark of the random text generator, not TextDecoder.
  • It does not produce well-formed UTF-8, since decoding it would yield code points that are not valid Unicode, in particular in the range 0xD800-0xDFFF. You can simply test for well-formedness by verifying that encode(decode(buf)) equals buf.
  • As a baseline for comparison and general sanity, we should indeed add a benchmark that decodes an all-ASCII buffer, e.g. (new Uint8Array(size).fill(97))
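The round-trip test described in the second bullet can be sketched like this (illustrative helper name, not part of the PR):

```typescript
// A buffer is well-formed UTF-8 iff decoding it and re-encoding the result
// reproduces it byte for byte: a lossy TextDecoder maps bad sequences to
// U+FFFD, which re-encodes to different bytes than the original.
function isWellFormedUtf8(buf: Uint8Array): boolean {
  const roundTripped = new TextEncoder().encode(new TextDecoder().decode(buf));
  if (roundTripped.length !== buf.length) return false;
  for (let i = 0; i < buf.length; i++) {
    if (roundTripped[i] !== buf[i]) return false;
  }
  return true;
}

// Valid UTF-8 survives the round trip; a stray 0xFF byte does not.
const valid = isWellFormedUtf8(new TextEncoder().encode("😀"));
const invalid = isWellFormedUtf8(new Uint8Array([0xff]));
```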
qwerasd205 (Contributor) commented Oct 22, 2019

> With the above code, I would bet money that generating the random text takes an order of magnitude longer than decoding it. So this becomes a benchmark of the random text generator, not TextDecoder.

Yes, this is why I said

> and one for just generating the data, to give a reference line for the fixed overhead

But with the current (very slow) TextDecoder, decoding the data takes just about as long as generating it.
Though, I still think it would be better to use a few static test files rather than trying to generate the data to decode.

> It does not produce well-formed UTF-8, since decoding it would yield code points that are not valid Unicode, in particular in the range 0xD800-0xDFFF. You can simply test for well-formedness by verifying that encode(decode(buf)) equals buf.

Does TextEncoder ever produce non-well-formed UTF-8?

piscisaureus (Collaborator) commented Oct 22, 2019

> Does TextEncoder ever produce non-well-formed utf-8?

Probably not; you'd end up with a string that contains a lot of replacement characters instead.
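A small illustration of that behavior (not part of the PR): decoding the bytes 0xED 0xA0 0x80, which would encode the lone surrogate U+D800 if surrogates were allowed in UTF-8.

```typescript
// These three bytes are not valid UTF-8 (they map into the surrogate range
// 0xD800-0xDFFF), so a lossy TextDecoder substitutes U+FFFD replacement
// characters rather than throwing.
const loneSurrogateBytes = new Uint8Array([0xed, 0xa0, 0x80]);
const out = new TextDecoder().decode(loneSurrogateBytes);
// `out` consists entirely of U+FFFD replacement characters.
```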

qwerasd205 (Contributor) commented Oct 22, 2019

Yeah, I just looked, and maybe ~10% of the characters in that function's output are replacement characters. Would you object to benchmarking the decoder on a string with each size of UTF-8 character repeated to fill the required size?

qwerasd205 (Contributor) commented Oct 22, 2019

@piscisaureus
Would these benchmarks be acceptable?
All 4-byte characters

const fourBytes = new TextEncoder().encode('😀');

function generateFourBytes (bytes: number): Uint8Array {
    const result = new Uint8Array(bytes);
    for (var i = 0; i < bytes; i++) {
        result[i] = fourBytes[i % 4];
    }
    return result;
}

new TextDecoder().decode(generateFourBytes(10 * 1024 * 1024));

Mixed

const mixed = new TextEncoder().encode('@Ā๐😀');

function generateMixed (bytes: number): Uint8Array {
    const result = new Uint8Array(bytes);
    for (var i = 0; i < bytes; i++) {
        result[i] = mixed[i % 10];
    }
    return result;
}

new TextDecoder().decode(generateMixed(10 * 1024 * 1024));

All ASCII

function generateASCII (bytes: number): Uint8Array {
    return new Uint8Array(bytes).fill(64);
}

new TextDecoder().decode(generateASCII(10 * 1024 * 1024));
piscisaureus (Collaborator) commented Oct 22, 2019

> But, it takes just about as long with the current (very slow) TextDecoder to decode it as it does to generate it, currently.

If this is true then TextDecoder is indeed comically bad.
But any improvements won't (or would just barely) move the needle because the benchmark overhead would become dominant.

> @piscisaureus
> Would these benchmarks be acceptable?

Yes, much better.

Other suggestion: shorter random buffer and decode that 100 times.

qwerasd205 (Contributor) commented Oct 22, 2019

> Other suggestion: shorter random buffer and decode that 100 times.

Yes, I had that idea too: any decoder has some per-call overhead, and decoding lots of small strings is a common use, so representing it would be good. Any suggestion on the length? Maybe 1k?
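A minimal harness for that small-buffer case could look like the sketch below (the 1 KiB size and iteration count are assumptions for illustration, not the PR's final numbers):

```typescript
// Sketch: decode a small 1 KiB buffer many times and report elapsed time,
// so per-call decoder overhead dominates rather than raw throughput.
function benchSmallDecode(iterations: number): number {
  const data = new Uint8Array(1024).fill(97); // all-ASCII baseline ('a')
  const decoder = new TextDecoder();
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    decoder.decode(data);
  }
  return Date.now() - start; // elapsed milliseconds
}

const elapsedMs = benchSmallDecode(10_000);
```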

qwerasd205 (Contributor) commented Oct 22, 2019

@piscisaureus
Here's a benchmark that generates 1024 bytes of UTF-8 as a random mix of 4 characters (1, 2, 3, and 4 bytes long), then decodes it ten thousand times.

const mixed = new TextEncoder().encode('@Ā๐😀');

function generateRandom (bytes: number): Uint8Array {
    const result = new Uint8Array(bytes);
    let i = 0;
    while (i < bytes) {
        const toAdd = Math.floor(Math.random()*Math.min(4,bytes-i));
        switch (toAdd) {
            case 0:
                result[i] = mixed[0];
                i++;
                break;
            case 1:
                result[i] = mixed[1];
                result[i+1] = mixed[2];
                i += 2;
                break;
            case 2:
                result[i] = mixed[3];
                result[i+1] = mixed[4];
                result[i+2] = mixed[5];
                i += 3;
                break;
            case 3:
                result[i] = mixed[6];
                result[i+1] = mixed[7];
                result[i+2] = mixed[8];
                result[i+3] = mixed[9];
                i += 4;
                break;
        }
    }
    return result;
}

const randomData = generateRandom(1024);
const decoder = new TextDecoder();
for (var i = 0; i < 10_000; i++) decoder.decode(randomData);
piscisaureus (Collaborator) commented Oct 22, 2019

@ry ^-- that works for me

qwerasd205 (Contributor) commented Oct 22, 2019

It may also be wise to benchmark TextDecoder memory usage, as that's also an issue with the current TextDecoder implementation.

ry (Collaborator, Author) commented Oct 22, 2019

@piscisaureus updated

@qwerasd205 Memory will be checked by the benchmarks; watch for it here after it lands: https://deno.land/benchmarks.html#max-memory

piscisaureus (Collaborator) left a comment

👍

fix
qwerasd205 (Contributor) commented Oct 22, 2019

Lint failed because of for (var…; it can just be modified to use let. (My use of for (var… is a habit I built as a generic performance consideration, since it used to be the case that for (let… loops were significantly slower than for (var…. I just checked, and it seems v8 at least has started properly optimizing it, so it's equivalent or sometimes faster. I'll have to start phasing that habit out of my code.)

@ry ry merged commit dc80dd2 into denoland:master Oct 22, 2019
10 checks passed:
  • test macOS-10.14
  • test_std macOS-10.14
  • test windows-2016
  • test_std windows-2016
  • test ubuntu-16.04
  • test_debug ubuntu-16.04
  • test_std ubuntu-16.04
  • bench ubuntu-16.04
  • lint ubuntu-16.04
  • license/cla: Contributor License Agreement is signed
@ry ry deleted the ry:text_decoder_perf branch Oct 22, 2019