DeflateStream output differs between .NET Framework and .NET Core #40170

Open
tmat opened this issue Aug 8, 2019 · 15 comments

Comments

@tmat
Member

commented Aug 8, 2019

The compression result differs between .NET Framework and .NET Core. Is it expected? Are there any knobs that can be used to remove the differences?

The following program prints different output:

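using System;
using System.IO;
using System.IO.Compression;
using System.Text;

var text = "\r\nusing System;\r\n\r\nclass C\r\n{\r\n    public static void Main()\r\n    {\r\n        Console.WriteLine();\r\n    }\r\n}\r\n";
var compressedStream = new MemoryStream();

// Disposing the writer and then the deflater flushes all pending data into compressedStream.
using (var deflater = new DeflateStream(compressedStream, CompressionLevel.Optimal, leaveOpen: true))
using (var writer = new StreamWriter(deflater, Encoding.UTF8, bufferSize: 1024, leaveOpen: true))
{
    writer.Write(text);
}

// Dump the compressed bytes as hex so the output can be compared across runtimes.
compressedStream.Position = 0;
Console.WriteLine(BitConverter.ToString(compressedStream.ToArray()));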

Full Framework:

7B-BF-7B-FF-FB-DD-FB-79-B9-4A-8B-33-F3-D2-15-82-2B-8B-4B-52-73-AD-79-B9-78-B9-92-73-12-
8B-8B-15-9C-79-B9-AA-79-B9-14-80-A0-A0-34-29-27-33-59-A1-B8-24-B1-04-48-95-E5-67-A6-28-
F8-26-66-E6-69-68-42-A4-A1-AA-40-C0-39-3F-AF-38-3F-27-55-2F-BC-28-B3-24-D5-27-33-2F-55-
43-D3-1A-22-5B-CB-CB-05-44-00

Core:

7A-BF-7B-FF-FB-DD-FB-79-B9-4A-8B-33-F3-D2-15-82-2B-8B-4B-52-73-AD-79-B9-78-B9-92-73-12-
8B-8B-15-9C-79-B9-AA-79-B9-14-80-A0-A0-34-29-27-33-59-A1-B8-24-B1-04-48-95-E5-67-A6-28-
F8-26-66-E6-69-68-42-A4-A1-AA-40-C0-39-3F-AF-38-3F-27-55-2F-BC-28-B3-24-D5-27-33-2F-55-
43-13-68-22-48-A6-96-97-0B-88-00-00-00-00-FF-FF-03-00

Both blobs decompress back to the same value. But their compressed lengths are different.
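
For anyone who wants to check the round-trip property on their own runtime, here is a minimal sketch. The `Compress` and `Decompress` helpers are names I made up for illustration; they only wrap the same `DeflateStream` calls as the repro above. The point is that the decompressed text is the thing to compare, not the compressed bytes.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper: compresses text with the same settings as the repro.
static byte[] Compress(string text)
{
    var compressed = new MemoryStream();
    using (var deflater = new DeflateStream(compressed, CompressionLevel.Optimal, leaveOpen: true))
    using (var writer = new StreamWriter(deflater, Encoding.UTF8, bufferSize: 1024, leaveOpen: true))
    {
        writer.Write(text);
    }
    return compressed.ToArray();
}

// Hypothetical helper: inflates a raw deflate blob back into text.
static string Decompress(byte[] data)
{
    using var inflater = new DeflateStream(new MemoryStream(data), CompressionMode.Decompress);
    using var reader = new StreamReader(inflater, Encoding.UTF8);
    return reader.ReadToEnd();
}

var sample = "\r\nusing System;\r\n\r\nclass C\r\n{\r\n}\r\n";
byte[] blob = Compress(sample);

// The compressed length may vary by runtime and zlib version; the round trip should not.
Console.WriteLine($"compressed length: {blob.Length}");
Console.WriteLine($"round-trips: {Decompress(blob) == sample}");
```

The compressed length printed here is exactly the part that is allowed to differ between implementations.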

@stephentoub

Member

commented Aug 8, 2019

> Is it expected?

Yes. The implementations are almost entirely different. .NET Framework had a managed implementation, .NET Core uses zlib, and any changes/improvements in the underlying algorithms/settings can influence the compression ratios/outputs/etc.

> Are there any knobs that can be used to remove the differences?

No.

Can you elaborate on the concern?

@tmat

Member Author

commented Aug 9, 2019

Also it appears there is a difference between Windows Core CLR and Linux Core CLR :(.

The concern is that the compiler output is different for the same inputs.

@stephentoub

Member

commented Aug 9, 2019

> Also it appears there is a difference between Windows Core CLR and Linux Core CLR :(.

Quite possibly. On Linux and macOS we use whatever zlib is installed on the box. If it changes its compression from version to version, that will show up in use of DeflateStream.

@tmat

Member Author

commented Aug 9, 2019

Now this is more of a concern. You're basically saying that the behavior of the compression algorithm is non-deterministic/unspecified? I'd expect it to produce the same output based on the same input parameters.

@tmat

Member Author

commented Aug 9, 2019

(regardless of the implementation - different implementations might be faster or slower but they should produce the same output).

@stephentoub

Member

commented Aug 9, 2019

> You're basically saying that the behavior of the compression algorithm is non-deterministic/unspecified?

No, it should be deterministic, on the same OS on the same version. If it's not we'd want to understand why.

> but they should produce the same output

That's generally not how compression libraries work. There is no guarantee of the exact same output across bug fixes, improvements to throughput, improvements to compression ratios, etc. What you're suggesting is that a library would be prohibited from getting better at what it does from version to version.

@stephentoub stephentoub added the question label Aug 9, 2019

@stephentoub stephentoub added this to the Future milestone Aug 9, 2019

@tmat

Member Author

commented Aug 9, 2019

I'd expect it to be controlled by a parameter.

> No, it should be deterministic, on the same OS on the same version.

Now that seems problematic. Consider a cloud build that runs on a hundred machines. If some of the machines have been updated to a newer version and others have not, then the result of the build will depend on where a particular dll is built. That is not good.

@stephentoub

Member

commented Aug 9, 2019

> I'd expect it to be controlled by a parameter.

So every time Roslyn improves the IL it generates for the same input, it exposes a knob so that it's opt-in for every individual IL change? I don't think so.

@tmat

Member Author

commented Aug 9, 2019

That's different, since I would expect the whole build to use the same compiler. I guess now we need to consider the version of zlib as well. Do we redistribute zlib with Core CLR, or do we use the one that's installed on the OS?

@stephentoub

Member

commented Aug 9, 2019

> Do we redistribute zlib with Core CLR, or do we use the one that's installed on the OS?

On Linux and macOS where zlib is available everywhere, we use the one installed on the OS. On Windows where it's not, we have to distribute it with .NET Core, in the clrcompression.dll, and since we're distributing it anyway, we then use Intel's optimized version of zlib.

@tmat

Member Author

commented Aug 9, 2019

Hmm, can we always redistribute zlib with Core CLR, so that at least with the same version of the runtime we get the same outputs?

@stephentoub

Member

commented Aug 9, 2019

We try to redist as little as possible, a) so that we're not on the hook for servicing (especially security issues, zero-day vulnerabilities, etc.), b) to keep distribution sizes as small as possible, c) to allow admins to control the versions of such libraries the way they do elsewhere (lots of apps and frameworks use zlib), etc.

@tmat

Member Author

commented Aug 9, 2019

What that means, though, is that a deterministic cloud build must run the same version of the OS on all machines (e.g. in a container); using the same version of Core CLR is not enough.

FYI @jaredpar

@danmosemsft

Member

commented Aug 9, 2019

If the build has non-managed compression steps such as tar, it could have the same issue; it is not specific to .NET.

@jaredpar

Member

commented Aug 9, 2019

Sounds like the build is still deterministic here. That is, given the same inputs, the compilation will produce the same outputs (bit for bit). The interesting bit here is that there are more inputs than we expected.

The set of inputs is most important when we're calculating a key for compilation in a caching environment. The idea is that we could reuse the output of a compilation from one machine on a separate machine, assuming all of the inputs are the same. Sounds like OS version, or maybe zlib version, needs to be one of the keys.

That's unfortunate but not a big deviation from today, as the operating system itself is already a part of the key. Different operating systems have different line endings, and depending on how your gitattributes are set up, this can change the line endings of source files. Line endings affect the end output of the build, and hence if they're potentially different they need to generate different keys.

Another way to think about this: a git SHA is insufficient for uniquely identifying the source code of a product. The operating system needs to be factored in as well.

Given that the operating system already needs to be a part of any key we generate, adding the version doesn't seem like a deal breaker. It's unfortunate but not too far removed from where we are today.
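
To make the cache-key idea concrete, here is a rough sketch, not anything Roslyn actually does. `ComputeCacheKey` and its parameters are hypothetical names. As far as I know there is no managed API that reports the zlib version, so the sketch assumes the caller obtains it some other way and passes it in as a string.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: folds every input that can affect the output bits into one key.
static string ComputeCacheKey(string sourceHash, string osDescription, string runtimeDescription, string zlibVersion)
{
    // Join with a separator so adjacent fields cannot run together.
    string material = string.Join("\n", sourceHash, osDescription, runtimeDescription, zlibVersion);
    using var sha = SHA256.Create();
    byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(material));
    return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
}

string key = ComputeCacheKey(
    sourceHash: "<git SHA or content hash>",                      // placeholder
    osDescription: RuntimeInformation.OSDescription,              // OS name and version
    runtimeDescription: RuntimeInformation.FrameworkDescription,  // runtime name and version
    zlibVersion: "<zlib version, obtained out of band>");         // assumption: supplied by the caller

Console.WriteLine(key);
```

Two machines would share cached outputs only when all four fields match, so a machine with an updated zlib gets its own key.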
