
GH-32240: [C#] Support decompression of IPC format buffers #33603

Merged · 9 commits · Jan 26, 2023

Conversation

@adamreeve (Contributor)

@adamreeve adamreeve commented Jan 11, 2023

Which issue does this PR close?

Closes #32240 / https://issues.apache.org/jira/browse/ARROW-16921

What changes are included in this PR?

This PR implements decompression support for Arrow IPC format files and streams in the dotnet/C# library.

This PR originally provided a reflection-based implementation to avoid pulling in new dependencies, but it now only adds a new ICompressionCodecFactory interface that users must implement to enable decompression; I intend to add a new package that implements this interface in #33893. The original description is below:

The main concern raised in the above Jira issue was that we don't want to add new NuGet package dependencies to support decompression formats that most users won't need. A default CompressionProvider implementation has therefore been added that uses reflection to call the ZstdNet package for ZSTD decompression, and K4os.Compression.LZ4.Streams together with CommunityToolkit.HighPerformance for LZ4 Frame support, if they are available. The netstandard1.3 target has decompression support disabled because some required reflection functionality is missing there, and neither ZstdNet nor K4os.Compression.LZ4.Streams supports netstandard1.3.
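The reflection approach described above can be illustrated with a minimal, self-contained sketch (this is not the PR's actual code, and the `OptionalCodecProbe` helper is hypothetical): probe for an optional dependency's type at run time and fall back gracefully when the package isn't referenced.

```csharp
using System;

// Probe for ZstdNet without referencing it; the package is absent here,
// so the probe reports the codec as unavailable rather than throwing.
bool zstdAvailable = OptionalCodecProbe.TryLoadType("ZstdNet.Decompressor, ZstdNet", out _);
Console.WriteLine(zstdAvailable ? "ZSTD support available" : "ZSTD support unavailable");

static class OptionalCodecProbe
{
    public static bool TryLoadType(string assemblyQualifiedName, out Type type)
    {
        // Type.GetType returns null rather than throwing when the assembly
        // or type cannot be resolved, so a missing optional package is
        // detected without an exception.
        type = Type.GetType(assemblyQualifiedName, throwOnError: false);
        return type != null;
    }
}
```

This is the core trick that lets the main package avoid hard NuGet dependencies: the codec packages are only touched if they can actually be resolved at run time.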

The ArrowFileReader and ArrowStreamReader constructors accept an ICompressionProvider parameter to allow users to provide their own compression provider if they want to use different dependencies.
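To make the pluggable-codec idea concrete, here is a rough sketch of what a user-supplied codec factory might look like. The interface shapes and the `CompressionCodecType` enum below are simplified assumptions for illustration, not the shipped definitions, and GZip from the BCL stands in for LZ4 Frame/ZSTD (the only codecs Arrow IPC actually defines) purely so the example is self-contained.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Demo: compress a payload with GZip, then recover it through the factory.
byte[] payload = Encoding.UTF8.GetBytes("compressed record batch body");
using var compressedStream = new MemoryStream();
using (var gzip = new GZipStream(compressedStream, CompressionMode.Compress, leaveOpen: true))
{
    gzip.Write(payload, 0, payload.Length);
}

ICompressionCodecFactory factory = new GzipCodecFactory();
using (ICompressionCodec codec = factory.CreateCodec(CompressionCodecType.Zstd))
{
    byte[] destination = new byte[payload.Length];
    int written = codec.Decompress(compressedStream.ToArray(), destination);
    Console.WriteLine($"recovered {written} of {payload.Length} bytes");
}

// Simplified stand-ins for the public API added by this PR; the exact
// shipped shapes may differ.
enum CompressionCodecType { Lz4Frame, Zstd }

interface ICompressionCodec : IDisposable
{
    // Decompresses source into destination, returning the bytes written.
    int Decompress(ReadOnlyMemory<byte> source, Memory<byte> destination);
}

interface ICompressionCodecFactory
{
    ICompressionCodec CreateCodec(CompressionCodecType codecType);
}

// Arrow IPC only defines LZ4_FRAME and ZSTD, which need external packages,
// so this sketch substitutes GZip purely to keep the example runnable.
sealed class GzipCodecFactory : ICompressionCodecFactory
{
    public ICompressionCodec CreateCodec(CompressionCodecType codecType) => new GzipCodec();

    sealed class GzipCodec : ICompressionCodec
    {
        public int Decompress(ReadOnlyMemory<byte> source, Memory<byte> destination)
        {
            using var input = new MemoryStream(source.ToArray());
            using var gzip = new GZipStream(input, CompressionMode.Decompress);
            using var output = new MemoryStream();
            gzip.CopyTo(output);
            byte[] decompressed = output.ToArray();
            decompressed.AsSpan().CopyTo(destination.Span);
            return decompressed.Length;
        }

        public void Dispose() { }
    }
}
```

With the real API, an instance of such a factory would be passed to the ArrowFileReader or ArrowStreamReader constructor so the reader can decompress record batch buffers.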

Alternatives to consider

An alternative approach that could be considered instead of reflection is to use these extra dependencies as build-time dependencies without making them dependencies of the NuGet package. I tested this in adamreeve@4544afd and it seems to work reasonably well too, but it required bumping the version of System.Runtime.CompilerServices.Unsafe under the netstandard2.0 and netcoreapp3.1 targets. This avoids all the reflection boilerplate, but it seems fairly hacky: Apache.Arrow.dll then depends on these extra DLLs, and we rely on the dotnet runtime's behaviour of not loading them until they're used. So I think the reflection approach is better.

Another alternative would be to move decompression support into a separate NuGet package (e.g. Apache.Arrow.Compression) that depends on Apache.Arrow and provides an implementation of ICompressionProvider that users can pass to the ArrowFileReader constructor, or perhaps a way to register itself with the Apache.Arrow package so it only needs to be configured once. That would seem cleaner to me, but I'm not sure how much work it would be to set up a whole new package.

Are these changes tested?

Yes, new unit tests have been added. Because this PR adds only decompression support (not compression), the test files were generated with a Python script that is included in the PR.

Are there any user-facing changes?

Yes, this implements a new feature but in a backwards compatible way.


@lidavidm (Member)

FWIW, the Java implementation does take the approach of separating out the compression dependencies into a new package, with only the interface (and a no-op implementation) in the core package. You do have to know to pass the compression codec factory explicitly in that case, unfortunately.

@assignUser assignUser changed the title ARROW-16921: [C#] Support decompression of IPC format buffers GH-32240: [C#] Support decompression of IPC format buffers Jan 11, 2023
@github-actions
⚠️ GitHub issue #32240 has been automatically assigned in GitHub to PR creator.

@adamreeve (Contributor, Author)

Thanks @lidavidm, yes I think having a separate package would make sense for dotnet too, and maybe that can be added in a separate PR. For now I've stripped this back to remove the reflection-based implementation and just have an implementation in the test project.

I've also renamed things to be a bit more consistent with the Java API. However, I've kept the ICompressionCodec API more narrowly focused: it only deals with decompression of dotnet memory buffers, while Arrow buffer creation is handled by the internal IBufferCreator implementation, whereas in the Java API this is part of the CompressionCodec interface.

One question I have is whether it makes sense for the public compression API (ICompressionCodec and ICompressionCodecFactory) to live in the Apache.Arrow.Ipc namespace, or whether it should be moved down to Apache.Arrow, given the possibility it might be used outside of the IPC format?

@lidavidm (Member)

Java also puts the compression interface in its IPC package. C++ treats it as a more general utility (since it's also used for Parquet, etc.). Strictly in terms of Arrow I think it's fine to be in Ipc.

@wjones127 wjones127 self-requested a review January 25, 2023 22:09
@wjones127 (Member) left a comment

This all looks good! I just had a minor question.

Comment on lines +43 to +44
// First 8 bytes give the uncompressed data length
var uncompressedLength = BinaryPrimitives.ReadInt64LittleEndian(source.Span.Slice(0, 8));
Member

This seems sensible, but I couldn't find where in the spec this comes from?

Contributor Author

The buffer compression strategy is described at https://github.com/apache/arrow/blob/apache-arrow-11.0.0/format/Message.fbs#L59; I'll add a reference to this in a comment.

Member

Ah it was right there. Thanks!
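The framing discussed in that thread can be sketched in a few lines: per Message.fbs, each compressed body buffer is prefixed with a little-endian int64 holding the uncompressed length, and a prefix of -1 signals that the body is stored uncompressed. This is a standalone illustration, not the PR's code.

```csharp
using System;
using System.Buffers.Binary;

// Build a 16-byte mock buffer: an 8-byte length prefix followed by payload.
Span<byte> buffer = stackalloc byte[16];
BinaryPrimitives.WriteInt64LittleEndian(buffer, 1024);

// First 8 bytes give the uncompressed data length; -1 means "not compressed".
long uncompressedLength = BinaryPrimitives.ReadInt64LittleEndian(buffer.Slice(0, 8));
bool isCompressed = uncompressedLength != -1;

// buffer.Slice(8) would hold the (possibly compressed) payload bytes.
Console.WriteLine($"uncompressed length: {uncompressedLength}, compressed: {isCompressed}");
```

A reader uses the prefix to size the destination buffer before handing the remaining bytes to the codec.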

@wjones127 wjones127 merged commit b7fd793 into apache:master Jan 26, 2023
@westonpace (Member)

I see a script to generate some .arrow files and then also those .arrow files checked in. Did you mean to commit the generated arrow files?

@wjones127 (Member)

> I see a script to generate some .arrow files and then also those .arrow files checked in. Did you mean to commit the generated arrow files?

I'd imagine this is only temporary until we allow writing compressed IPC files in C#. Then we can remove those files and simply round trip within the implementation. Does that seem fair?

@adamreeve (Contributor, Author)

> I see a script to generate some .arrow files and then also those .arrow files checked in. Did you mean to commit the generated arrow files?

Yes. I didn't think we'd want to depend on being able to run Python with PyArrow for the dotnet tests to run, but I wanted to keep the script there, partly as documentation of what should be in the files and so that people can regenerate them if needed. Do you think there's a better way to handle this?

@ursabot commented Jan 27, 2023

Benchmark runs are scheduled for baseline = 8f537ca and contender = b7fd793. b7fd793 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.25% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.09% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] b7fd7939 ec2-t3-xlarge-us-east-2
[Failed] b7fd7939 test-mac-arm
[Failed] b7fd7939 ursa-i9-9960x
[Finished] b7fd7939 ursa-thinkcentre-m75q
[Finished] 8f537ca9 ec2-t3-xlarge-us-east-2
[Failed] 8f537ca9 test-mac-arm
[Failed] 8f537ca9 ursa-i9-9960x
[Finished] 8f537ca9 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@westonpace (Member)

> Yes, I didn't think we'd want to depend on being able to run Python with PyArrow for the dotnet tests to run

I agree it is nice not to need this.

> Do you think there's a better way to handle this?

For a few small files I don't think it's something we need to worry about too much. There is, however, a separate repository, https://github.com/apache/arrow-testing, which stores various test binaries and is already included as a submodule in apache/arrow. If you find you need to generate more files in the future, you could consider storing them there.

@adamreeve adamreeve deleted the dotnet_decompression_2 branch January 30, 2023 00:43
{
    if (stream == null)
        throw new ArgumentNullException(nameof(stream));

    _implementation = new ArrowStreamReaderImplementation(stream, allocator, compressionCodecFactory, leaveOpen);
}

public ArrowStreamReader(ReadOnlyMemory<byte> buffer)
@Tommo56700 commented Feb 9, 2023

Why not add compression support for ArrowMemoryReaderImplementation?

Contributor Author

That wasn't intentional, I'd missed that there was a separate implementation for reading from a memory buffer. I'll make a PR to fix this.

Contributor Author

PR opened at #34108

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Feb 10, 2023 (commit message duplicates the PR description above)
westonpace pushed a commit that referenced this pull request Feb 10, 2023
…ReadOnlyMemory (#34108)

This is a small follow-up to #33603 to support reading a compressed IPC stream from a `ReadOnlyMemory<byte>`, as I missed that this has a separate reader implementation.
* Closes: #32240

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023 (commit message duplicates the PR description above)
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023 (commit message duplicates the #34108 follow-up above)
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023 (commit message duplicates the #34108 follow-up above)
eerhardt added a commit that referenced this pull request Feb 28, 2023
…IPC decompression (#33893)

### Rationale for this change

This further addresses #32240 and is a follow-up to PR #33603, providing an implementation of the `ICompressionCodecFactory` interface in a new `Apache.Arrow.Compression` package. Making this a separate package means users who don't need IPC decompression support don't need to pull in extra dependencies.

### What changes are included in this PR?

Adds a new `Apache.Arrow.Compression` package and moves the existing compression codec implementations used for testing into this package.

### Are these changes tested?

There are unit tests verifying the decompression support, but this also affects the release scripts and I'm not sure how to fully test these.

### Are there any user-facing changes?

Yes, this adds a new package users can install for IPC decompression support, so documentation has been updated.
* Closes: #32240

Lead-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Eric Erhardt <eric.erhardt@microsoft.com>
Signed-off-by: Eric Erhardt <eric.erhardt@microsoft.com>