ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected #7534
Conversation
Looks good to me. Is the interaction with batch size tested?
The python dataset tests are crashing on Mac: https://github.com/apache/arrow/runs/806974457
@jorisvandenbossche that build comes from the first of two
Hmm, it failed on the last commit as well, but I restarted that one, and now it does indeed appear to be green.
That crash is ARROW-8999. I'm fairly confident it's a real bug, given that it happens about 1-5% of the time.
There's a segfault in the Python tests on OSX and Windows.
I think #7704 needs to be merged first
There's a rebase issue (it looks like it's pulling in a Rust change), but otherwise this looks fine, with the caveat that we need a plan to constrain memory use when using the batch readers.
… columns are projected
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
cpp/src/parquet/arrow/reader.cc
  if (chunk == nullptr) {
    return ::arrow::IterationTraits<RecordBatchIterator>::End();
  }
} while (chunk->length() == 0);
IMHO this check should be internal to the ColumnReaders. I tried embedding it in TransferColumnData, but discovered that other locations in the codebase rely on empty chunks. Is this expected, or should it be fixed?
+1
This bug is inherited from parquet::arrow::RowGroupRecordBatchReader, which yielded empty record batches when no columns were projected, because no field readers were available from which to derive a batch size. I've added logic to get usable batch sizes from file metadata in the empty-columns case.