ARROW-6795: [C#] Fix for reading large (2GB+) files #5412
Conversation
Could you open a JIRA for this?
Probably needs a unit test. @eerhardt @chutchinson thoughts on how to test such issues with large files?
Do we have similar tests in any other language implementation? The only approach I can imagine is the test writing out a massive amount of data to a file, then trying to read it back in, and ensuring the file is deleted at the end of the test. The C# unit tests currently take under a second to run. If anyone wants to skip this in their local development, they can disable the test locally.
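For illustration, here is a minimal sketch of that round-trip test, assuming a recent Apache.Arrow package and xUnit; the class, test name, and sizing constants are hypothetical, not code from this PR:

using System;
using System.IO;
using System.Linq;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Types;
using Xunit;

public class LargeFileTests
{
    [Fact]
    public void WriteLargeFileThenReadBack()
    {
        const int BatchCount = 40;           // 40 batches x ~64 MB each => ~2.5 GB file
        const int RowsPerBatch = 8_000_000;  // 8M Int64 values => ~64 MB per batch

        var schema = new Schema.Builder()
            .Field(new Field("x", Int64Type.Default, nullable: false))
            .Build();

        string path = Path.GetTempFileName();
        try
        {
            using (var stream = File.Create(path))
            using (var writer = new ArrowFileWriter(stream, schema))
            {
                var array = new Int64Array.Builder()
                    .AppendRange(Enumerable.Range(0, RowsPerBatch).Select(i => (long)i))
                    .Build();
                var batch = new RecordBatch(schema, new IArrowArray[] { array }, RowsPerBatch);

                for (int i = 0; i < BatchCount; i++)
                {
                    writer.WriteRecordBatch(batch);  // each batch stays well under 2 GB
                }
                writer.WriteEnd();  // write the file footer
            }

            using (var stream = File.OpenRead(path))
            using (var reader = new ArrowFileReader(stream))
            {
                long totalRows = 0;
                RecordBatch batch;
                while ((batch = reader.ReadNextRecordBatch()) != null)
                {
                    totalRows += batch.Length;
                }
                Assert.Equal((long)BatchCount * RowsPerBatch, totalRows);
            }
        }
        finally
        {
            File.Delete(path);  // don't leave a multi-gigabyte file behind
        }
    }
}

Writing roughly 2.5 GB at test time keeps large binaries out of the repository, at the cost of a slow test that developers can disable locally, as noted above.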
@@ -36,7 +36,7 @@ internal sealed class ArrowFileReaderImplementation : ArrowStreamReaderImplement
 /// <summary>
 /// Notes what byte position where the footer data is in the stream
 /// </summary>
-private int _footerStartPostion;
+private long _footerStartPostion;
We actually don't use the field for anything. We can remove the instance field and just make it a local variable in the functions that use it today.
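A rough sketch of that suggestion, reusing the RentReturn call and GetFooterLengthPosition() from the diffs in this PR (the local's name is hypothetical):

ArrayPool<byte>.Shared.RentReturn(footerLength, (buffer) =>
{
    // Local instead of the _footerStartPostion instance field; keep the
    // arithmetic in 64-bit so positions past 2 GB don't overflow.
    long footerStartPosition = GetFooterLengthPosition() - footerLength;
    // ... seek to footerStartPosition and read the footer into buffer ...
});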
@@ -110,7 +110,7 @@ protected override void ReadSchema()
 ArrayPool<byte>.Shared.RentReturn(footerLength, (buffer) =>
 {
-    _footerStartPostion = (int)GetFooterLengthPosition() - footerLength;
+    _footerStartPostion = GetFooterLengthPosition() - footerLength;
We also need to fix the same code in ReadSchemaAsync(). It has a similar cast.
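For context, a small standalone sketch (illustration only, not library code) showing why the narrowing cast breaks once the footer sits past the 2 GB mark:

using System;

class Overflow2GbDemo
{
    static void Main()
    {
        // A footer-length position past the 2 GB mark, as in a large Arrow file.
        long footerLengthPosition = 3_000_000_000;
        int footerLength = 256;

        // Before the fix: casting the 64-bit position to int wraps past
        // int.MaxValue (2,147,483,647) and yields a negative offset.
        int broken = (int)footerLengthPosition - footerLength;
        Console.WriteLine(broken);   // -1294967552

        // After the fix: the arithmetic stays in 64-bit.
        long correct = footerLengthPosition - footerLength;
        Console.WriteLine(correct);  // 2999999744
    }
}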
I see. Not having a unit test is not a big deal -- we do have tests in Python that generate fairly large payloads, but I'm not certain they exercise this exact case.
@eerhardt - Wouldn't it make sense to just have some large files already created that are used for verifying / integration tests? I noticed many of the unit tests just write/read from memory streams.
Where do you propose putting a 2GB file? I wouldn't want to have to download the file.
It doesn't have to be part of the source tree. Also keep in mind the files compress down very well, since Arrow is not a compressed format: my 10 GB data files compressed down to 600 MB, so I'm sure 2 GB would compress down to ~120 MB. Part of the integration test would have to decompress it. In any case, it does make sense to have existing files that can be used for backwards / inter-library compatibility tests, since I had problems with C# <-> R: they created binary-different files even when the structure is the 'same'.
Can you open one or more JIRA issues for this? https://issues.apache.org/jira/projects/ARROW/issues
Rather than checking in large files, I would recommend generating them on the fly at test time using e.g. pyarrow. This could be dockerized also, something like ...
Seems like this patch is incomplete. How would you all like to proceed?
@wesm - this would be my preferred way forward with this patch. Let me know any thoughts/feedback you have.
Force-pushed from 393be8c to a556ac2.
Remove unused field and fix async API as well.
@eerhardt I just cherry-picked your change; can you verify?
Should we try to include this in 0.15.1?
LGTM. Thanks @abbotware and @emkornfield.
I think it would be great if we could get this fix and #5413 in.
It seems that trying to read files larger than 2 GB will blow up. As long as the record batches are less than 2 GB (the maximum size of a span), there should be no problem reading a large file.

Closes apache#5412 from abbotware/Fix-For-Large-Files and squashes the following commits:

898d556 <Eric Erhardt> Respond to PR feedback.
a556ac2 <Anthony Abate> fix for reading large (2GB+) files

Lead-authored-by: Anthony Abate <anthony.abate@gmail.com>
Co-authored-by: Eric Erhardt <eric.erhardt@microsoft.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>