
GZipStream.ReadByte 10x slower than on Desktop #39233

Closed
KirillOsenkov opened this issue Jul 13, 2020 · 38 comments

@KirillOsenkov
Member

Description

Reading a binlog file takes 4 seconds on .NET Desktop, but 40 seconds on Core 3.1.

Originally reported by @vatsan-madhavan in KirillOsenkov/MSBuildStructuredLog#376

Repro:

  1. git clone https://github.com/vatsan-madhavan/MSBuildStructuredLogSummary
  2. msbuild /r
  3. Download the binlog from https://1drv.ms/u/s!Ah63gfGkKca4lsl2sy71gpkpM9TCog?e=ifgsMj and save it anywhere, e.g. C:\temp\vatsan\vatsan.binlog
  4. MSBuildBinLogSummary\bin\Debug\net472\MSBuildBinLogSummary.exe --log-file C:\temp\Vatsan\vatsan.binlog - takes 4 seconds
  5. MSBuildBinLogSummary\bin\Debug\netcoreapp3.1\MSBuildBinLogSummary.exe --log-file C:\temp\Vatsan\vatsan.binlog - takes 40 seconds

The stack that was taking all the time is ReadByte on a GZipStream; it's called millions of times while reading the binlog:

ZLibNative.ZLibStreamHandle.Inflate Line 300
Inflater.Inflate Line 281
Inflater.ReadInflateOutput Line 266
Inflater.ReadOutput Line 122
Inflater.InflateVerified Line 92
Inflater.Inflate Line 55
DeflateStream.ReadByte Line 229
GZipStream.ReadByte Line 72
BinaryReader.ReadBoolean Line 236
BuildEventArgsReader.ReadTargetFinishedEventArgs Line 359
BuildEventArgsReader.Read Line 71
BinLogReader.Replay Line 78
BinLogReader.Replay Line 37
StructuredLoggerUtil..ctor Line 30
Program.ProcessBinaryLog Line 29
Program.Main Line 24

Configuration

  • 3.1.301
  • Windows 10
  • x64

Regression?

.NET Framework 4.7.2 takes 4 seconds; .NET Core 3.1 takes 40 seconds.

@KirillOsenkov
Member Author

@carlossanlop

@danmoseley
Member

Would be interesting to know 5.0 numbers

@vatsan-madhavan
Member

About the same as 3.1.

> Measure-Command { .\bin\Release\netcoreapp3.1\win-x64\publish\MSBuildBinLogSummary.exe --log-file D:\src\repos\pinvoke\msbuild.binlog }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 51
Milliseconds      : 868
Ticks             : 518685936
TotalDays         : 0.000600330944444444
TotalHours        : 0.0144079426666667
TotalMinutes      : 0.86447656
TotalSeconds      : 51.8685936
TotalMilliseconds : 51868.5936


> Measure-Command { .\bin\Release\net5.0\win-x64\publish\MSBuildBinLogSummary.exe --log-file D:\src\repos\pinvoke\msbuild.binlog }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 49
Milliseconds      : 313
Ticks             : 493133924
TotalDays         : 0.000570756856481482
TotalHours        : 0.0136981645555556
TotalMinutes      : 0.821889873333333
TotalSeconds      : 49.3133924
TotalMilliseconds : 49313.3924


> Measure-Command { .\bin\Release\net472\MSBuildBinLogSummary.exe --log-file D:\src\repos\pinvoke\msbuild.binlog }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 4
Milliseconds      : 746
Ticks             : 47464341
TotalDays         : 5.49355798611111E-05
TotalHours        : 0.00131845391666667
TotalMinutes      : 0.079107235
TotalSeconds      : 4.7464341
TotalMilliseconds : 4746.4341


> type ..\global.json
{
  "sdk": {
    "version": "5.0.100-preview.8.20363.2",
    "rollForward": "latestFeature"
  }
}

@Wraith2
Contributor

Wraith2 commented Jul 15, 2020

I had a look into this to see if it was anything I could diagnose and/or fix. The answer seems to be no, but what I found may be of use to someone more knowledgeable. With managed profiling tools it just looks like it's spending a lot of time reading and spinning, which isn't helpful. With VTune I think it's clear that it's spending a lot of time in window_output_flush, which is memcpying. Here's what I see:

[VTune screenshot]

@danmoseley
Member

It would be interesting to know whether we are calling zlib in the same pattern. Maybe we are asking it to do more work. Also I wonder whether it would be difficult to swap in the older zlib library to remove that variable.

@Wraith2
Contributor

Wraith2 commented Jul 15, 2020

netcore 3.1:
[VTune screenshot: netcore 3.1]

netfx 472:
[VTune screenshot: netfx 472]

Focused on the part that stands out as taking more time between the two. It looks like the netcore version is reading single bytes from the stream, and I'd guess the native decompression then shifts the remaining contents back one byte and reads another byte to keep its window full, which would explain the memcpy. Netfx is doing array-based reads. I'd suggest that the msbuild library might want to add a buffered reader around the binary reader.

@KirillOsenkov
Member Author

I've tried wrapping the GZipStream in a BufferedStream, but contrary to expectations it didn't change the picture much at all. There must be some real regression in the native ZLib library.

For easier investigation we should take the binlogs and the BinLogReader out of the picture and just write a simple console app that uses ReadByte() to scan a large gzip stream (~30 MB compressed). I'm sure we'll see a similar ~10x perf regression there as well.

@Wraith2
Contributor

Wraith2 commented Jul 19, 2020

A repro with direct code rather than using a library would make things easier to track down. The use of the channel for events also complicates the traces a lot because you see a lot of waiting. I suspect you'll see this behaviour anytime you read individual bytes from the netcore stream.

@KirillOsenkov
Member Author

@stephentoub FYI reading binlogs on Core is 10x slower than on Desktop

@ericstj
Member

ericstj commented Dec 9, 2020

I've tried wrapping the GZipStream in a BufferedStream, but contrary to expectations it didn't change the picture much at all.

Are you sure about that? I made a smaller repro. Download this and add your msbuild.binlog to it. Run it on .NETFramework and Core and observe the difference. Then remove the buffered stream and observe.

Not meant to be a proper benchmark, but illustrates the point.

using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

var bytes = File.ReadAllBytes("msbuild.binlog");
using (var ms = new MemoryStream(bytes))
using (var gzs = new GZipStream(ms, CompressionMode.Decompress))
using (var bs = new BufferedStream(gzs))
{
    long position = 0, last = 0;
    const long mb = 1024 * 1024;

    var sw = Stopwatch.StartNew();
    while (bs.ReadByte() != -1)
    {
        if (++position - last > 100 * mb)
            Console.WriteLine($"{(last = position) / mb} MB");
    }

    Console.WriteLine($"Time: {sw.Elapsed}");
}

testGzipStream.zip

Here's what I'm seeing:

Framework       Duration (s)
net5.0          2.1970965
netcoreapp3.1   2.0621879
net48           4.0997598

Without the BufferedStream I do see significant differences, but both are much slower than when the stream is buffered. There still seems to be some difference between .NET Core and .NET Framework, but it's less clear-cut than "zlib is 10x slower".

@KirillOsenkov
Member Author

Thanks for looking into this Eric!

First of all, adding a BufferedStream does indeed completely fix it in Vatsan's standalone command-line repro. It brings the time down from 31 seconds to 2 seconds for .NET 5:
[screenshot of timing output]

and from 4 down to 2 seconds for net472:
[screenshot of timing output]

So for practical purposes we can consider this solved - this teaches me to always wrap a GZipStream in a BufferedStream when reading in small increments, for huge savings. The reason I said it didn't work when I tried it is that I tried it in the binlog viewer, where it didn't visibly reduce the loading time (on net472 it only shaves off a couple of seconds, 4 -> 2).

However now I'm just curious about the discrepancy without the BufferedStream. I tried your repro without the buffered stream on net472 and got 38 seconds (vs. 4 seconds).

But the most interesting bit is that for net5.0 the time without a BufferedStream is 13:18 (thirteen minutes), which is wildly disproportionate. This is in line with the ratio we saw on the binlog vs. your simple repro: net5 is 10x-20x slower than net472:
[screenshot of timing output]

I'm not sure it's worth digging in further to understand that (you decide), but the easy workaround is to use a BufferedStream, which makes net5 twice as fast as net472 and indeed lives up to my high expectations.

@KirillOsenkov
Member Author

Wild thought: looking at the traces here, could it be an interop issue? #39233 (comment)

@jkoritzinsky

@KirillOsenkov
Member Author

Hmm now that I'm re-reading this paragraph from @Wraith2 earlier in the thread it starts making sense to me:

It looks like the netcore version is reading single bytes from the stream, and I'd guess the native decompression then shifts the remaining contents back one byte and reads another byte to keep its window full, which would explain the memcpy. Netfx is doing array-based reads.

@Wraith2
Contributor

Wraith2 commented Dec 15, 2020

You could test whether peak throughput occurs when the buffered reader's buffer size equals the native buffer size; in that situation no memcpy is needed to refill the buffer, only a full read on the native side.

@KirillOsenkov
Member Author

Good idea! I tried various buffer sizes (all .NET 5) and it bottoms out at 1.3 seconds after 16K buffer size:
[chart: decompression time by buffer size]
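
A minimal sketch of that kind of buffer-size sweep (not the exact code used for these measurements; the input file name and the set of sizes are placeholders):

using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

// Decompress the same payload repeatedly, timing a full ReadByte() scan
// for each BufferedStream buffer size.
var bytes = File.ReadAllBytes("msbuild.binlog"); // placeholder input

foreach (var bufferSize in new[] { 1024, 4096, 16384, 65536, 262144 })
{
    using var ms = new MemoryStream(bytes);
    using var gzs = new GZipStream(ms, CompressionMode.Decompress);
    using var bs = new BufferedStream(gzs, bufferSize);

    var sw = Stopwatch.StartNew();
    while (bs.ReadByte() != -1) { }
    Console.WriteLine($"{bufferSize,7}-byte buffer: {sw.Elapsed}");
}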

KirillOsenkov added a commit to KirillOsenkov/MSBuildStructuredLog that referenced this issue Dec 15, 2020
@Wraith2
Contributor

Wraith2 commented Dec 15, 2020

My assumption when I investigated originally was that the native implementation had been optimized for throughput, and that small reads for headers followed by mostly large, full-buffer reads were the intended use case. As always, optimizing for all use cases would be problematic.

On the other hand, since the small reads degraded so badly and that was accepted, it doesn't seem that any particular perf is being measured for the native side. Perhaps @adamsitnik the perf guru may have some insight on that and whether it might be useful.

@ericstj
Member

ericstj commented Dec 15, 2020

This looks like it might be the culprit. This is called with every ReadByte. It does appear to be shifting the state buffer in zlib.

memmove(state->window + woff, state->window + roff, copysz);

@stephentoub
Member

@jtkukunas, any thoughts on this issue? It seems the memmove may have been introduced as part of intel/zlib@62b7616, and presumably it's a tradeoff that optimizes for larger / "normal"-size buffers at the expense of small ones.

@stephentoub
Member

@jtkukunas, any thoughts?

@akoeplinger
Member

Should we add this to the 7.0 milestone? It seems important enough to me that it shouldn't get lost in the Future milestone.

@Wraith2
Contributor

Wraith2 commented Sep 3, 2021

It would be good to identify the desired behaviour and then find a way to make the native lib support that. At the moment the optimization is for peak large-read throughput, but that causes small reads to suffer high buffer-consolidation overhead on the native side.

@danmoseley
Member

Probably we should open an issue in zlib directly?

@Wraith2
Contributor

Wraith2 commented Sep 3, 2021

Possibly, but I don't know if, or by how much, a standard version of zlib would regress performance compared to @jtkukunas's fork. It looks like the specific change for this is only in his fork.

@danmoseley
Member

I wasn't clear, I meant https://github.com/jtkukunas/zlib. Yes, we only use https://github.com/madler/zlib for Arm/Arm64.

@Wraith2
Contributor

Wraith2 commented Sep 3, 2021

In which case yes, I think opening an issue with a simple repro would probably be the next action.

@Austin-Lamb

I have a minimal repro of this issue in case it helps. With code like this:

using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Net;

var localPath = Path.GetTempFileName();
new WebClient().DownloadFile(@"https://github.com/dotnet/runtime/archive/refs/tags/v6.0.5.zip", localPath);

var numBytes = 0;

var watch = Stopwatch.StartNew();

using (var archive = new ZipArchive(File.OpenRead(localPath), ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        using (var stream = entry.Open())
        {
            // Read the decompressed contents one byte at a time, up to 100 MB total.
            int b = 0;
            while ((b = stream.ReadByte()) != -1 && numBytes < 100_000_000)
                numBytes++;
        }
    }
}

watch.Stop();

Console.WriteLine($"Read {numBytes} bytes");
Console.WriteLine($"Time elapsed: {watch.Elapsed}");

When running this in .NET Framework 4.6.2, I get numbers like this:

Running in .NET Framework 4.6.2
Read 100000000 bytes
Time elapsed: 00:00:05.6521488

When running in .NET 6, I get numbers like this:

Running in .NET 6.0.5
Read 100000000 bytes
Time elapsed: 00:00:55.3112673

So it's ~10-11x slower here. In some cases in a larger codebase I've seen it be 13-14x slower. @stephentoub any thoughts about fixing this in .NET 7?

@stephentoub
Member

any thoughts about fixing this in .NET 7?

What would the fix be?

(We still haven't heard back from @jtkukunas.)

@Austin-Lamb

Could we insert a buffer into the stream? When I tested putting a BufferedStream atop the DeflateStream, it substantially improved perf (by 50x) - and it looks like .NET Framework essentially had a buffering mechanism in this codepath.

@stephentoub
Member

Could we insert a buffer into the stream?

If someone wants a buffer, they can add a BufferedStream themselves; BufferedStream exists because most streams don't implicitly buffer themselves, and it provides the ability to opt in to doing so when it's desirable. Implicitly adding a buffer impacts all use cases (e.g. additional allocation), whereas this issue is localized to ReadByte.

and it looks like .NET Framework essentially had a buffering mechanism in this codepath.

Can you elaborate? What buffer are you referring to?

@Austin-Lamb

I didn't read the code closely, but in Inflater there is a thing called OutputWindow in .NET Framework - and a cursory read suggests this is some kind of buffer.

I'm planning to use a BufferedStream as my workaround, but it feels like DeflateStream and GZipStream performance should be equivalent when upgrading from .NET Framework to .NET Core, so codebases don't have to manually buffer these streams to avoid a performance regression, right?

@stephentoub
Member

but in Inflater there is a thing called OutputWindow in .NET Framework - and a cursory read suggests this is some kind of buffer.

That's part of the old managed implementation of decompression, which is only used if you explicitly opt out of using zlib via config, and is way slower, existing only for legacy compatibility.

but it feels like DeflateStream and GZipStream performance should be equivalent on upgrading to .NET Core from .NET Framework so codebases don't have to go manually buffer these streams to upgrade without performance regression, right?

DeflateStream and GZipStream in .NET Core are generally much faster than they are in .NET Framework when you read/write in reasonably-sized chunks (though many of the improvements made in .NET Core around compression were in fact ported back to .NET Framework, a rarity, but done due to their significant impact). This issue is about reading a single byte at a time; that is not the general case, nor is it a recommended way to consume data if you care about performance, since it's way faster to read many bytes at a time even when ReadByte performs well. And this issue remains open because we recognize there's something to be improved here. That doesn't mean forcing everyone to incur the cost of an extra buffer, in order to improve a case that's not the high-performance way of extracting the data, is the right tradeoff.

@Austin-Lamb

The reason I found this is that we have a BinaryReader which ends up reading strings, and strings first read a 7-bit encoded length with Read7BitEncodedInt, which reads one byte at a time, so this got super expensive quickly. ReadByte can be hit even without a caller directly trying to read bytes one at a time, through mechanisms like that.
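
For context, the 7-bit encoded length is decoded roughly like the sketch below (illustrative only, not the BCL source), which is why every string read begins with one or more single-byte reads against the stream:

using System.IO;

static class VarIntSketch
{
    // Each encoded byte is pulled from the stream with ReadByte until a byte
    // with the high bit clear ends the value.
    public static int Read7BitEncodedIntSketch(Stream stream)
    {
        int value = 0, shift = 0;
        while (true)
        {
            int b = stream.ReadByte();            // one ReadByte per encoded byte
            if (b == -1) throw new EndOfStreamException();
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;    // high bit clear: last byte
            shift += 7;
        }
    }
}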

How do you opt out of zlib via config, so I can try that out?

@stephentoub
Member

How do you opt out of zlib via config, so I can try that out?

On .NET Framework, you use this switch:
https://docs.microsoft.com/en-us/dotnet/framework/migration-guide/retargeting/4.5.2-4.8#suggestion
zlib is the only mechanism for .NET Core; it's the primary mechanism for .NET Framework, which just kept the old managed implementations around for compatibility.

The reason I found this is that we have a BinaryReader which ends up reading strings, and strings first read a 7-bit encoded length with Read7BitEncodedInt, which reads one byte at a time, so this got super expensive quickly.

Makes sense. Using a BufferedStream for now is a good way to go. Even if/when this issue is addressed, BufferedStream would still be likely to help your use case.
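
A minimal sketch of that workaround, assuming a gzip-compressed file consumed through a BinaryReader (the path and buffer size here are placeholders):

using System.IO;
using System.IO.Compression;
using System.Text;

using var file = File.OpenRead("msbuild.binlog");                  // placeholder path
using var gzip = new GZipStream(file, CompressionMode.Decompress);
using var buffered = new BufferedStream(gzip, 32 * 1024);          // batches the tiny reads
using var reader = new BinaryReader(buffered, Encoding.UTF8);

// BinaryReader.ReadString starts with a 7-bit encoded length, i.e. single-byte
// reads; with the BufferedStream those come from the managed buffer instead of
// a native inflate call per byte.
string s = reader.ReadString();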

@jtkukunas

Not sure how I missed this. Will take a look.

@stephentoub
Member

Thanks, @jtkukunas.

@jtkukunas

@stephentoub Yes, this behavior is a tradeoff for normal-sized operations. One byte at a time is ... suboptimal. I'd recommend buffering the data.

@danmoseley
Member

Thanks @jtkukunas

@KirillOsenkov I think this issue can be closed now?

@KirillOsenkov
Member Author

Sure, thanks

System.IO - Compression automation moved this from To do to Done Jun 10, 2022
@ghost locked as resolved and limited conversation to collaborators Jul 10, 2022