feat: support block gzip for streams #9175
Conversation
alamb
left a comment
Thanks @tshauck -- this code looks great to me.
I do think this PR needs a test otherwise we could potentially end up breaking this feature by accident
Here is some of the existing coverage:
https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20aggregate_test_100.csv.gz&type=code
Which I think is referencing this file https://github.com/apache/arrow-testing/blob/master/data/csv/aggregate_test_100.csv.gz
Perhaps you can check in a small .bgzip'd file in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/data and then add a small test to https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/csv_files.slt ?
Thanks @alamb -- certainly down to add a test. Before I go the route you mentioned, I wanted to ask if you knew whether that would follow the stream decoding path? Looking at https://github.com/apache/arrow-datafusion/blob/ae882356171513c9d6c22b3bd966898fb4e8cac0/datafusion/core/src/datasource/physical_plan/csv.rs#L393-L441 in the CsvOpener, I think local files would go through the `Read` decoding path instead. If not, I'll code that up; otherwise, perhaps adding a specific unit test that calls the stream conversion would work.
I think a specific unit test for convert stream would be great and easiest (and you can probably just generate some data directly).
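As a sketch of generating that data directly (shown in stdlib Python for brevity rather than the project's Rust; the equivalent bytes would feed the stream decoder in a unit test): a block-gzip file is just several complete gzip members concatenated back to back, so a fixture can be built by compressing each chunk separately and joining the results.

```python
import gzip

# Compress each chunk as its own gzip member, then concatenate.
rows = [b"c1,c2\n", b"1,2\n", b"3,4\n"]
multi_member = b"".join(gzip.compress(row) for row in rows)

# Sanity check: a multi-member-aware decoder (stdlib gzip reads
# across member boundaries) recovers the full payload.
assert gzip.decompress(multi_member) == b"".join(rows)
```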
Thanks!
Which issue does this PR close?
Closes #9156
Rationale for this change
Updates the stream gzip decoding to check for multiple members within a file.
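To illustrate why the multiple-member check matters (a minimal sketch in stdlib Python rather than the Rust stream decoder, since the member semantics are the same): a decoder that stops at the end of the first gzip member silently drops the rest of a block-gzipped file.

```python
import gzip
import zlib

member1 = gzip.compress(b"a,b\n1,2\n")
member2 = gzip.compress(b"3,4\n")
data = member1 + member2  # two gzip members in one file

# A single-member decode finishes at the first member's trailer;
# the rest of the file is left behind in `unused_data`.
d = zlib.decompressobj(wbits=31)  # 16 + 15: expect a gzip wrapper
assert d.decompress(data) == b"a,b\n1,2\n"
assert d.unused_data == member2  # this data would be silently lost

# A multi-member-aware decode returns everything.
assert gzip.decompress(data) == b"a,b\n1,2\n3,4\n"
```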
What changes are included in this PR?
Doesn't update the `Read` decoder because it already supports multiple members in a file.
Are these changes tested?
The gzip decoding still seems to work, and I checked a manually bgzipped file, but I'm not sure whether this code path is already tested.
Are there any user-facing changes?
No