ARROW-6795: [C#] Fix for reading large (2GB+) files #5412

Closed
wants to merge 2 commits from abbotware:Fix-For-Large-Files

Conversation

abbotware
Contributor

It seems that trying to read files larger than 2 GB will blow up.

As long as each record batch is less than 2 GB (the maximum size of a Span), there should be no problem reading a large file.
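
For context, a minimal sketch of the overflow being fixed (hypothetical values, not the actual reader code): the footer of an Arrow file sits near the end of the stream, so for files larger than 2 GB its position no longer fits in an int, and the old cast produced a negative offset.

```csharp
using System;

// Minimal, self-contained illustration (runnable as a C# top-level program).
long footerLengthPosition = 3_000_000_000L; // footer metadata near the end of a ~3 GB file
int footerLength = 256;

// Before the fix: the 64-bit stream position was cast to int, which silently
// overflows (unchecked by default) and yields a negative offset past 2 GB.
int brokenStart = (int)footerLengthPosition - footerLength;

// After the fix: keep the arithmetic in long, matching Stream.Position.
long footerStart = footerLengthPosition - footerLength;

Console.WriteLine(brokenStart); // -1294967552 (garbage)
Console.WriteLine(footerStart); // 2999999744
```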

@bkietz
Member

bkietz commented Sep 17, 2019

Could you open a JIRA for this?

@wesm
Member

wesm commented Sep 18, 2019

Probably needs a unit test. @eerhardt @chutchinson thoughts on how to test such issues with large files?

@eerhardt
Contributor

Do we have similar tests in any other language implementation?

The only approach I can imagine is the test writing out a massive amount of data to a file, then trying to read it back in, and then ensuring the file is deleted at the end of the test.

The C# unit tests currently take under a second to run. If anyone wants to skip this in their local development, they can disable the test locally.
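
A rough sketch of what such a test could look like, assuming xUnit and the Apache.Arrow ArrowFileWriter / ArrowFileReader APIs; the LargeFileTests class name, the MakeInt64Batch helper, the column name, and the batch sizes are illustrative only, and the exact writer finalization call may differ between library versions:

```csharp
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Types;
using Xunit;

public class LargeFileTests
{
    [Fact]
    public async Task CanReadFileLargerThan2GB()
    {
        string path = Path.GetTempFileName();
        try
        {
            Schema schema = new Schema.Builder()
                .Field(f => f.Name("x").DataType(Int64Type.Default))
                .Build();

            // ~100 MB of Int64 values per batch; 25 batches pushes the file past 2 GB.
            const int rowsPerBatch = 12_500_000;
            const int batchCount = 25;

            using (FileStream stream = File.Create(path))
            using (var writer = new ArrowFileWriter(stream, schema))
            {
                RecordBatch batch = MakeInt64Batch(schema, rowsPerBatch);
                for (int i = 0; i < batchCount; i++)
                {
                    await writer.WriteRecordBatchAsync(batch);
                }
                await writer.WriteEndAsync();
            }

            using (FileStream stream = File.OpenRead(path))
            using (var reader = new ArrowFileReader(stream))
            {
                int batchesRead = 0;
                RecordBatch batch;
                while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
                {
                    Assert.Equal(rowsPerBatch, batch.Length);
                    batchesRead++;
                }
                Assert.Equal(batchCount, batchesRead);
            }
        }
        finally
        {
            File.Delete(path); // make sure the 2GB+ temp file is always removed
        }
    }

    // Hypothetical helper: builds a single-column Int64 batch with `rows` values.
    private static RecordBatch MakeInt64Batch(Schema schema, int rows)
    {
        var builder = new Int64Array.Builder();
        for (int i = 0; i < rows; i++)
        {
            builder.Append(i);
        }
        return new RecordBatch(schema, new IArrowArray[] { builder.Build() }, rows);
    }
}
```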

@@ -36,7 +36,7 @@ internal sealed class ArrowFileReaderImplementation : ArrowStreamReaderImplementation
/// <summary>
/// Notes what byte position where the footer data is in the stream
/// </summary>
private int _footerStartPostion;
private long _footerStartPostion;
eerhardt
Contributor

We actually don't use the field for anything. We can remove the instance field and just make it a local variable in the functions that use it today.

@@ -110,7 +110,7 @@ protected override void ReadSchema()

ArrayPool<byte>.Shared.RentReturn(footerLength, (buffer) =>
{
_footerStartPostion = (int)GetFooterLengthPosition() - footerLength;
_footerStartPostion = GetFooterLengthPosition() - footerLength;
eerhardt
Contributor

We also need to fix the same code in ReadSchemaAsync(). It has a similar cast.

@wesm
Member

wesm commented Sep 19, 2019

I see. Not having a unit test is not a big deal -- we do have tests in Python that generate fairly large payloads, but I'm not certain they exercise this exact case.

@abbotware
Contributor Author

@eerhardt - Wouldn't it make sense to just have some large files already created that are used for verification / integration tests? I noticed many of the unit tests just write/read from memory streams.

@eerhardt
Contributor

Wouldn't it make sense to just have some large files already created that are used for verification / integration tests? I noticed many of the unit tests just write/read from memory streams.

Where do you propose putting a 2GB file? I wouldn't want to have to download the file.

@abbotware
Contributor Author

It doesn't have to be part of the source tree. Also keep in mind the files compress down very well, since Arrow is not a compressed format: my 10 GB data files compressed down to 600 MB, so I'm sure 2 GB would compress down to ~120 MB. Part of the integration test would have to decompress it.

In either case, it does make sense to have existing files that can be used for backwards / inter-library compatibility testing, since I had problems with C# <-> R: they created binary-different files even when the structure is the 'same'.

@eerhardt
Contributor

I had problems with C# <-> R: they created binary-different files even when the structure is the 'same'.

Can you open one or more JIRA issues for this? https://issues.apache.org/jira/projects/ARROW/issues

@pitrou pitrou changed the title fix for reading large (2GB+) files ARROW-6681: [C#] Fix for reading large (2GB+) files Sep 25, 2019
@wesm
Member

wesm commented Oct 3, 2019

Rather than checking in large files, I would recommend generating them on the fly at test time using e.g. pyarrow. This could be dockerized also, something like docker-compose run csharp-gen-test-data.

@wesm
Member

wesm commented Oct 3, 2019

Seems like this patch is incomplete. How would you all like to proceed?

@eerhardt
Contributor

eerhardt commented Oct 4, 2019

@wesm - the following would be my preferred way forward with this patch. Let me know any thoughts/feedback you have.

  1. The JIRA associated with this PR is not correct. ARROW-6681 is about the order in which RecordBatches are written to the file, not about reading large files. So I've logged https://issues.apache.org/jira/browse/ARROW-6795 for this specific issue. Can you update this PR's title to associate it with the correct issue?
  2. To resolve my above PR comments, I have created a commit that can be cherry-picked into this patch: eerhardt@00fa066. If this commit (or some other resolution to the above comments) is brought into this change, I can sign off.
  3. As for testing this scenario: while it goes against my belief that we should have a good set of tests, in this case I think the amount of effort is going to outweigh the value. If given the choice between merging this change without an explicit CI test for the scenario and not taking this change at all, I would choose to take the change without an explicit CI test. But that is just my opinion.

@emkornfield emkornfield changed the title ARROW-6681: [C#] Fix for reading large (2GB+) files ARROW-6795: [C#] Fix for reading large (2GB+) files Oct 17, 2019
Remove unused field and fix async API as well.
@emkornfield
Contributor

@eerhardt I just cherry-picked your change; can you verify?

@wesm
Member

wesm commented Oct 17, 2019

Should we try to include this in 0.15.1?

@eerhardt
Contributor

eerhardt left a comment

LGTM. Thanks @abbotware and @emkornfield.

@eerhardt
Contributor

eerhardt commented Oct 17, 2019

Should we try to include this in 0.15.1?

I think it would be great if we could get this fix and #5413 in 0.15.1. They are both small, targeted changes that it sounds like @abbotware wants to take advantage of.

kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019
It seems that trying to read larger than 2GB+ files will blow up.

As long as the record batches are less than 2GB (the max size of span) there should be no problem reading a large file

Closes apache#5412 from abbotware/Fix-For-Large-Files and squashes the following commits:

898d556 <Eric Erhardt> Respond to PR feedback.
a556ac2 <Anthony Abate> fix for reading large (2GB+) files

Lead-authored-by: Anthony Abate <anthony.abate@gmail.com>
Co-authored-by: Eric Erhardt <eric.erhardt@microsoft.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
@abbotware abbotware deleted the Fix-For-Large-Files branch November 4, 2019 14:24