New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of WebUtility.Decode for non-escaped strings #7671

Merged
merged 4 commits into from Apr 15, 2016

Conversation

Projects
None yet
8 participants
@hughbe
Collaborator

hughbe commented Apr 12, 2016

  • Cuts allocations down to 1/3 of the original
  • Doubles performance (in time)
  • We still allocate a little bit, as removing all allocations would harm the performance of strings that actually need escaping

Benchmark - escaping needed

  • No performance regressions

benchmark escaping

Benchmark - no escaping needed

benchmark no escaping

Benchmark Code

Click here

static void Main(string[] args)
{
    // Escaping
    TimeAction("Old: ", () => Old.UrlDecode("%ABabc"));
    TimeAction("New: ", () => New.UrlDecode("%ABabc"));

    // No escaping
    TimeAction("Old: ", () => Old.UrlDecode("abc"));
    TimeAction("New: ", () => New.UrlDecode("abc"));

    Console.ReadLine();
}

public static void TimeAction(string prefix, Action action)
{
    var sw = new Stopwatch();
    for (int iter = 0; iter < 5; iter++)
    {
        int gen0 = GC.CollectionCount(0);
        sw.Restart();
        for (int i = 0; i < 10000000; i++)
        {
            action();
        }
        sw.Stop();
        Console.WriteLine($"{prefix}Time: {sw.Elapsed.TotalSeconds}\tGC0: {GC.CollectionCount(0) - gen0}");
    }
}

/cc @stephentoub @jamesqo @davidsh
Fixes #6542 together with #7546

Improve performance of WebUtility.Decode for non-escaped strings
- Cuts allocations down to 1/3 of the original
- Doubles performance (in time)

- We still allocate a little bit, as removing all allocations would harm
the performance of strings that actually need escaping

Fixes #6542

@hughbe hughbe changed the title from Improve performance of WebUtility.Decode for non-escaped string to Improve performance of WebUtility.Decode for non-escaped strings Apr 12, 2016

_bufferSize = bufferSize;
_encoding = encoding;
_charBuffer = new char[bufferSize];
_charBuffer = null; // char buffer created on demand

This comment has been minimized.

@stephentoub

stephentoub Apr 12, 2016

Member

Is this really useful? Empty string is already special cased to avoid decoding. So isn't this only valuable in the case where all of the inputs are bytes rather than chars, and to enable that you're then doing an extra null check branch on every AddChar? And even then all of the bytes except for those at the end will end up getting dumped into the char[], forcing it to be allocated. Doesn't seem like a good tradeoff.

@stephentoub

stephentoub Apr 12, 2016

Member

Is this really useful? Empty string is already special cased to avoid decoding. So isn't this only valuable in the case where all of the inputs are bytes rather than chars, and to enable that you're then doing an extra null check branch on every AddChar? And even then all of the bytes except for those at the end will end up getting dumped into the char[], forcing it to be allocated. Doesn't seem like a good tradeoff.

This comment has been minimized.

@hughbe

hughbe Apr 12, 2016

Collaborator

Well I found that this halved allocations for the UrlDecode("noescaping") case, as it seems that ASCII chars are added to the decoder as AddByte not as AddChar. I found this behaviour weird, but lazilly instantiating _charBuffer does help in the no escaping case.

@hughbe

hughbe Apr 12, 2016

Collaborator

Well I found that this halved allocations for the UrlDecode("noescaping") case, as it seems that ASCII chars are added to the decoder as AddByte not as AddChar. I found this behaviour weird, but lazilly instantiating _charBuffer does help in the no escaping case.

This comment has been minimized.

@hughbe

hughbe Apr 12, 2016

Collaborator

Just a follow up: its due to this line of code.

I don't really understand why "// 7 bit have to go as bytes because of Unicode" but hey

@hughbe

hughbe Apr 12, 2016

Collaborator

Just a follow up: its due to this line of code.

I don't really understand why "// 7 bit have to go as bytes because of Unicode" but hey

This comment has been minimized.

@stephentoub

stephentoub Apr 12, 2016

Member

I see, ok. (Seems strange then that we're lazily allocating the byte[] array; doing so suggests we don't expect any ASCII chars in common input, but let's not change it.)

@stephentoub

stephentoub Apr 12, 2016

Member

I see, ok. (Seems strange then that we're lazily allocating the byte[] array; doing so suggests we don't expect any ASCII chars in common input, but let's not change it.)

This comment has been minimized.

@hughbe

hughbe Apr 12, 2016

Collaborator

understood - obviously I'm not planning to change any of that behaviour in this PR, but do you have a theory as to why we add ascii chars as bytes not chars?

@hughbe

hughbe Apr 12, 2016

Collaborator

understood - obviously I'm not planning to change any of that behaviour in this PR, but do you have a theory as to why we add ascii chars as bytes not chars?

This comment has been minimized.

@stephentoub

stephentoub Apr 12, 2016

Member

right but they allocate the array only when needed

No. In the code before this PR, the char[] is always allocated (unless the string is empty).

@stephentoub

stephentoub Apr 12, 2016

Member

right but they allocate the array only when needed

No. In the code before this PR, the char[] is always allocated (unless the string is empty).

This comment has been minimized.

@tarekgh

tarekgh Apr 12, 2016

Member

right. looking more, it looks when having ascii characters, it needs to be encoded by the encoder. bu tif the character is not ascii will be stored as it is without encoding. so I believe what they are doing is collecting ascii charcaters in the array, and everytime need to flush it, will run the encoder just once on the array before the flush. so it is just optimizing to not calling the encoder with every ascii character.

@tarekgh

tarekgh Apr 12, 2016

Member

right. looking more, it looks when having ascii characters, it needs to be encoded by the encoder. bu tif the character is not ascii will be stored as it is without encoding. so I believe what they are doing is collecting ascii charcaters in the array, and everytime need to flush it, will run the encoder just once on the array before the flush. so it is just optimizing to not calling the encoder with every ascii character.

This comment has been minimized.

@hughbe

hughbe Apr 12, 2016

Collaborator

so it is just optimizing to not calling the encoder with every ascii character.

but doesn't the implementation check if there are any bytes in the buffer, and if there are none in the buffer it doesn't flush any bytes
So that means that each ascii character wouldn't flush any bytes, only the first in a string of ascii chars

@hughbe

hughbe Apr 12, 2016

Collaborator

so it is just optimizing to not calling the encoder with every ascii character.

but doesn't the implementation check if there are any bytes in the buffer, and if there are none in the buffer it doesn't flush any bytes
So that means that each ascii character wouldn't flush any bytes, only the first in a string of ascii chars

This comment has been minimized.

@tarekgh

tarekgh Apr 12, 2016

Member

UrlDecoder is collecting the ascii bytes till either AddChar or GetString get called. at that time it will flush the bytes after encoding it. and then will start collecting the new encountered bytes again and so on.
imagine you have string like aaaaaNaaaaNaaaaNaaa
where a is ascii character and N is not ascii, this will have the byte array get filled 4 times and encoded 4 times. if you didn't collect the ascii in byte array as the code is doing, this mean you'll need to call the encoder the number of 'a' characters in the string.

@tarekgh

tarekgh Apr 12, 2016

Member

UrlDecoder is collecting the ascii bytes till either AddChar or GetString get called. at that time it will flush the bytes after encoding it. and then will start collecting the new encountered bytes again and so on.
imagine you have string like aaaaaNaaaaNaaaaNaaa
where a is ascii character and N is not ascii, this will have the byte array get filled 4 times and encoded 4 times. if you didn't collect the ascii in byte array as the code is doing, this mean you'll need to call the encoder the number of 'a' characters in the string.

This comment has been minimized.

@hughbe

hughbe Apr 12, 2016

Collaborator

understood, thanks

@hughbe

hughbe Apr 12, 2016

Collaborator

understood, thanks

@hughbe

This comment has been minimized.

Show comment
Hide comment
@hughbe

hughbe Apr 12, 2016

Collaborator

Test Innerloop CentOS7.1 Release Build and Test
Test Innerloop Windows_NT Debug Build and Test

Collaborator

hughbe commented Apr 12, 2016

Test Innerloop CentOS7.1 Release Build and Test
Test Innerloop Windows_NT Debug Build and Test

@stephentoub

This comment has been minimized.

Show comment
Hide comment
@stephentoub

stephentoub Apr 12, 2016

Member

@dotnet-bot test code coverage please

Member

stephentoub commented Apr 12, 2016

@dotnet-bot test code coverage please

@hughbe

This comment has been minimized.

Show comment
Hide comment
@hughbe

hughbe Apr 12, 2016

Collaborator

@stephentoub is this new/can non-Microsoft people use that code coverage feature

Collaborator

hughbe commented Apr 12, 2016

@stephentoub is this new/can non-Microsoft people use that code coverage feature

@stephentoub

This comment has been minimized.

Show comment
Hide comment
@stephentoub

stephentoub Apr 12, 2016

Member

is this new

nope

can non-Microsoft people use that code coverage feature

yup

Member

stephentoub commented Apr 12, 2016

is this new

nope

can non-Microsoft people use that code coverage feature

yup

@hughbe

This comment has been minimized.

Show comment
Hide comment
@hughbe

hughbe Apr 12, 2016

Collaborator

nope

That's cool, I've never noticed that before. Is the idea that it fails if there is an overall decrease in coverage or does it just publish a report? (I guess we'll find out!)

Collaborator

hughbe commented Apr 12, 2016

nope

That's cool, I've never noticed that before. Is the idea that it fails if there is an overall decrease in coverage or does it just publish a report? (I guess we'll find out!)

@stephentoub

This comment has been minimized.

Show comment
Hide comment
@stephentoub

stephentoub Apr 12, 2016

Member

Is the idea that it fails if there is an overall decrease in coverage or does it just publish a report? (I guess we'll find out!)

It doesn't perform any comparison, but it archives the report so you can click through to view it.

Member

stephentoub commented Apr 12, 2016

Is the idea that it fails if there is an overall decrease in coverage or does it just publish a report? (I guess we'll find out!)

It doesn't perform any comparison, but it archives the report so you can click through to view it.

@hughbe

This comment has been minimized.

Show comment
Hide comment
@hughbe

hughbe Apr 13, 2016

Collaborator

Thanks, I agree its a good idea to add some more tests to cover all the branches I've changed. I also added some more tests for invalid percent encoding, lowercase hex, space escaping and non-ASCII chars.
This brings coverage up to 100% for UrlDecodeInternal and almost 100% for UrlDecoder - I think the missing coverage may be dead code (line 683)

Collaborator

hughbe commented Apr 13, 2016

Thanks, I agree its a good idea to add some more tests to cover all the branches I've changed. I also added some more tests for invalid percent encoding, lowercase hex, space escaping and non-ASCII chars.
This brings coverage up to 100% for UrlDecodeInternal and almost 100% for UrlDecoder - I think the missing coverage may be dead code (line 683)

@stephentoub

This comment has been minimized.

Show comment
Hide comment
@stephentoub
Member

stephentoub commented Apr 15, 2016

LGTM

@davidsh?

@davidsh

This comment has been minimized.

Show comment
Hide comment
@davidsh

davidsh Apr 15, 2016

Member

LGTM.

Member

davidsh commented Apr 15, 2016

LGTM.

@davidsh

This comment has been minimized.

Show comment
Hide comment
@davidsh

This comment has been minimized.

Show comment
Hide comment
@davidsh

davidsh Apr 15, 2016

Member

@dotnet-bot Test Innerloop Windows_NT Debug Build and Test

Member

davidsh commented Apr 15, 2016

@dotnet-bot Test Innerloop Windows_NT Debug Build and Test

@davidsh

This comment has been minimized.

Show comment
Hide comment
@davidsh

davidsh Apr 15, 2016

Member

@dotnet-bot Test Innerloop CentOS7.1 Release Build and Test

Member

davidsh commented Apr 15, 2016

@dotnet-bot Test Innerloop CentOS7.1 Release Build and Test

@davidsh davidsh merged commit be2360b into dotnet:master Apr 15, 2016

8 checks passed

Innerloop CentOS7.1 Debug Build and Test Build finished. No test results found.
Details
Innerloop CentOS7.1 Release Build and Test Build finished. No test results found.
Details
Innerloop OSX Debug Build and Test Build finished. No test results found.
Details
Innerloop OSX Release Build and Test Build finished. No test results found.
Details
Innerloop Ubuntu Debug Build and Test Build finished. No test results found.
Details
Innerloop Ubuntu Release Build and Test Build finished. No test results found.
Details
Innerloop Windows_NT Debug Build and Test Build finished. 145191 tests run, 46 skipped, 0 failed.
Details
Innerloop Windows_NT Release Build and Test Build finished. 144362 tests run, 46 skipped, 0 failed.
Details
@hughbe

This comment has been minimized.

Show comment
Hide comment
@hughbe

hughbe Apr 16, 2016

Collaborator

Glad to see this merged. Should I send over a similar PR to coreclr, or will this be automatically handled by the netfx-port-consider label

Collaborator

hughbe commented Apr 16, 2016

Glad to see this merged. Should I send over a similar PR to coreclr, or will this be automatically handled by the netfx-port-consider label

@hughbe hughbe deleted the hughbe:webutility-decode branch Apr 16, 2016

@karelz karelz modified the milestone: 1.0.0-rtm Dec 3, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment