ARROW-6795: [C#] Fix for reading large (2GB+) files #5412

Closed
wants to merge 2 commits from abbotware:Fix-For-Large-Files

Conversation

abbotware
Contributor

It seems that trying to read files larger than 2 GB will blow up.

As long as each record batch is less than 2 GB (the maximum size of a Span), there should be no problem reading a large file.
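
For context, a minimal sketch of the overflow being fixed (hypothetical values, not the actual reader code): the footer of an Arrow file sits near the end of the stream, so for files larger than 2 GB its position no longer fits in an int, and the old cast produced a negative offset.

```csharp
using System;

// Minimal, self-contained illustration (runnable as a C# top-level program).
long footerLengthPosition = 3_000_000_000L; // footer metadata near the end of a ~3 GB file
int footerLength = 256;

// Before the fix: the 64-bit stream position was cast to int, which silently
// overflows (unchecked by default) and yields a negative offset past 2 GB.
int brokenStart = (int)footerLengthPosition - footerLength;

// After the fix: keep the arithmetic in long, matching Stream.Position.
long footerStart = footerLengthPosition - footerLength;

Console.WriteLine(brokenStart); // -1294967552 (garbage)
Console.WriteLine(footerStart); // 2999999744
```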

@bkietz
Member

bkietz commented Sep 17, 2019

Could you open a JIRA for this?

@wesm
Member

wesm commented Sep 18, 2019

Probably needs a unit test. @eerhardt @chutchinson thoughts on how to test such issues with large files?

@eerhardt
Contributor

Do we have similar tests in any other language implementation?

The only approach I can imagine is the test writing out a massive amount of data to a file, then trying to read it back in, and then ensuring the file is deleted at the end of the test.

The C# unit tests currently take under a second to run. If anyone wants to skip this in their local development, they can disable the test locally.
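
A rough sketch of what such a test could look like, assuming xUnit and the Apache.Arrow ArrowFileWriter / ArrowFileReader APIs; the LargeFileTests class name, the MakeInt64Batch helper, the column name, and the batch sizes are illustrative only, and the exact writer finalization call may differ between library versions:

```csharp
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Types;
using Xunit;

public class LargeFileTests
{
    [Fact]
    public async Task CanReadFileLargerThan2GB()
    {
        string path = Path.GetTempFileName();
        try
        {
            Schema schema = new Schema.Builder()
                .Field(f => f.Name("x").DataType(Int64Type.Default))
                .Build();

            // ~100 MB of Int64 values per batch; 25 batches pushes the file past 2 GB.
            const int rowsPerBatch = 12_500_000;
            const int batchCount = 25;

            using (FileStream stream = File.Create(path))
            using (var writer = new ArrowFileWriter(stream, schema))
            {
                RecordBatch batch = MakeInt64Batch(schema, rowsPerBatch);
                for (int i = 0; i < batchCount; i++)
                {
                    await writer.WriteRecordBatchAsync(batch);
                }
                await writer.WriteEndAsync();
            }

            using (FileStream stream = File.OpenRead(path))
            using (var reader = new ArrowFileReader(stream))
            {
                int batchesRead = 0;
                RecordBatch batch;
                while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
                {
                    Assert.Equal(rowsPerBatch, batch.Length);
                    batchesRead++;
                }
                Assert.Equal(batchCount, batchesRead);
            }
        }
        finally
        {
            File.Delete(path); // make sure the 2GB+ temp file is always removed
        }
    }

    // Hypothetical helper: builds a single-column Int64 batch with `rows` values.
    private static RecordBatch MakeInt64Batch(Schema schema, int rows)
    {
        var builder = new Int64Array.Builder();
        for (int i = 0; i < rows; i++)
        {
            builder.Append(i);
        }
        return new RecordBatch(schema, new IArrowArray[] { builder.Build() }, rows);
    }
}
```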

@@ -36,7 +36,7 @@ internal sealed class ArrowFileReaderImplementation : ArrowStreamReaderImplementation
/// <summary>
/// Notes what byte position where the footer data is in the stream
/// </summary>
private int _footerStartPostion;
private long _footerStartPostion;
eerhardt
Contributor

We actually don't use the field for anything. We can remove the instance field and just make it a local variable in the functions that use it today.

@@ -110,7 +110,7 @@ protected override void ReadSchema()

ArrayPool<byte>.Shared.RentReturn(footerLength, (buffer) =>
{
_footerStartPostion = (int)GetFooterLengthPosition() - footerLength;
_footerStartPostion = GetFooterLengthPosition() - footerLength;
eerhardt
Contributor

We also need to fix the same code in ReadSchemaAsync(). It has a similar cast.

@wesm
Member

wesm commented Sep 19, 2019

I see. Not having a unit test is not a big deal -- we do have tests in Python that generate fairly large payloads, but I'm not certain they exercise this exact case.

@abbotware
Contributor Author

@eerhardt - Wouldn't it make sense to just have some large files already created that are used for verification / integration tests? I noticed many of the unit tests just write/read from memory streams.

@eerhardt
Contributor

Wouldn't it make sense to just have some large files already created that are used for verification / integration tests? I noticed many of the unit tests just write/read from memory streams.

Where do you propose putting a 2GB file? I wouldn't want to have to download the file.

@abbotware
Contributor Author

It doesn't have to be part of the source tree. Also keep in mind the files compress down very well, since Arrow is not a compressed format: my 10 GB data files compressed down to 600 MB, so I'm sure 2 GB would compress down to ~120 MB. Part of the integration test would have to decompress it.

In either case, it does make sense to have existing files that can be used for backwards / inter-library compatibility testing, since I had problems with C# <-> R: they created binary-different files even when the structure is the 'same'.

@eerhardt
Contributor

I had problems with C# <-> R: they created binary-different files even when the structure is the 'same'.

Can you open one or more JIRA issues for this? https://issues.apache.org/jira/projects/ARROW/issues

@pitrou pitrou changed the title fix for reading large (2GB+) files ARROW-6681: [C#] Fix for reading large (2GB+) files Sep 25, 2019
@wesm
Member

wesm commented Oct 3, 2019

Rather than checking in large files, I would recommend generating them on the fly at test time using e.g. pyarrow. This could be dockerized also, something like docker-compose run csharp-gen-test-data.

@wesm
Member

wesm commented Oct 3, 2019

Seems like this patch is incomplete. How would you all like to proceed?

@eerhardt
Contributor

eerhardt commented Oct 4, 2019

@wesm - the following would be my preferred way forward with this patch. Let me know any thoughts/feedback you have.

  1. The JIRA associated with this PR is not correct. ARROW-6681 is about the order in which RecordBatches are written to the file, not about reading large files. So I've logged https://issues.apache.org/jira/browse/ARROW-6795 for this specific issue. Can you update this PR's title to associate it with the correct issue?
  2. To resolve my above PR comments, I have created a commit that can be cherry-picked into this patch: eerhardt@00fa066. If this commit (or some other resolution to the above comments) is brought into this change, I can sign off.
  3. As for testing this scenario: while it goes against my belief that we should have a good set of tests, in this case I think the amount of effort is going to outweigh the value. If given the choice between merging this change without an explicit CI test for the scenario and not taking this change at all, I would choose to take the change without an explicit CI test. But that is just my opinion.

@emkornfield emkornfield changed the title ARROW-6681: [C#] Fix for reading large (2GB+) files ARROW-6795: [C#] Fix for reading large (2GB+) files Oct 17, 2019
Remove unused field and fix async API as well.
@emkornfield
Contributor

@eerhardt I just cherry-picked your change; can you verify?

@wesm
Member

wesm commented Oct 17, 2019

Should we try to include this in 0.15.1?

@eerhardt
Contributor

eerhardt left a comment

LGTM. Thanks @abbotware and @emkornfield.

@eerhardt
Contributor

eerhardt commented Oct 17, 2019

Should we try to include this in 0.15.1?

I think it would be great if we could get this fix and #5413 in 0.15.1. They are both small, targeted changes that it sounds like @abbotware wants to take advantage of.

kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019
It seems that trying to read larger than 2GB+ files will blow up.

As long as the record batches are less than 2GB (the max size of span) there should be no problem reading a large file

Closes apache#5412 from abbotware/Fix-For-Large-Files and squashes the following commits:

898d556 <Eric Erhardt> Respond to PR feedback.
a556ac2 <Anthony Abate> fix for reading large (2GB+) files

Lead-authored-by: Anthony Abate <anthony.abate@gmail.com>
Co-authored-by: Eric Erhardt <eric.erhardt@microsoft.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
@abbotware abbotware deleted the Fix-For-Large-Files branch November 4, 2019 14:24