Description
Given most columns, I can run a loop like:
    std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
    while (nRowsRemaining > 0) {
      int n = std::min(100, nRowsRemaining);
      std::shared_ptr<arrow::ChunkedArray> chunkedArray;
      auto status = columnReader->NextBatch(n, &chunkedArray);
      // ... and then use `chunkedArray`
      nRowsRemaining -= n;
    }
(The context is: "convert Parquet to CSV/JSON, with small memory footprint." Used in https://github.com/CJWorkbench/parquet-to-arrow)
Normally, the first NextBatch() return value looks like val0...val99; the second return value looks like val100...val199; and so on.
... but with a ByteArrayDictionaryRecordReader, that isn't the case. The first NextBatch() return value looks like val0...val99; the second return value looks like val0...val99, val100...val199 (ChunkedArray with two arrays); the third return value looks like val0...val99, val100...val199, val200...val299 (ChunkedArray with three arrays); and so on. The returned arrays are never cleared.
In sum: NextBatch() on a dictionary column reader returns the wrong values.
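To make the symptom concrete, here is a toy model of the behavior described above. It is only a sketch: `ToyDictionaryReader`, `Chunks`, and `TotalValues` are hypothetical names invented for illustration, and nothing here is Arrow's actual ByteArrayDictionaryRecordReader code. It assumes the problem is simply that chunks handed out by one call are still held when the next call returns.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy stand-in for a ChunkedArray: a list of chunks, each a list of row indices.
using Chunks = std::vector<std::vector<int>>;

// Hypothetical toy reader (NOT Arrow's implementation). It hands out row
// indices 0..n_rows-1 in batches and keeps the chunks it has built so far.
class ToyDictionaryReader {
 public:
  ToyDictionaryReader(int n_rows, bool clear_between_batches)
      : n_rows_(n_rows), clear_between_batches_(clear_between_batches) {}

  // Reads up to `n` rows and returns every chunk the reader currently holds.
  Chunks NextBatch(int n) {
    if (clear_between_batches_) {
      chunks_.clear();  // expected behavior: drop chunks already returned
    }
    const int end = std::min(next_row_ + n, n_rows_);
    std::vector<int> chunk;
    for (int i = next_row_; i < end; ++i) {
      chunk.push_back(i);
    }
    next_row_ = end;
    chunks_.push_back(chunk);
    // Without the clear above, earlier chunks are returned again here.
    return chunks_;
  }

 private:
  int n_rows_;
  bool clear_between_batches_;
  int next_row_ = 0;
  Chunks chunks_;
};

// Counts the values across all chunks of one NextBatch() result.
std::size_t TotalValues(const Chunks& chunks) {
  std::size_t total = 0;
  for (const auto& chunk : chunks) {
    total += chunk.size();
  }
  return total;
}
```

With `clear_between_batches` set to `false`, successive `NextBatch(100)` calls in this model return one, two, then three chunks, which matches the reported output. With it set to `true`, each call returns a single chunk holding only the next 100 rows.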
I've attached a minimal Parquet file that presents this problem with the above code; and I've written a patch that fixes this one case, to illustrate where things are wrong. I don't think I understand enough edge cases to decree that my patch is a correct fix.
Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
Reporter: Adam Hooper / @adamhooper
Assignee: Adam Hooper / @adamhooper
Related issues:
- [C++] [Dataset] Scanning dataset with dictionary type hangs (is duplicated by)
Note: This issue was originally created as ARROW-6895. Please see the migration documentation for further details.