[C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling NextBatch() #23222

@asfimport

Description

Given most columns, I can run a loop like:

std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
while (nRowsRemaining > 0) {
    // Read at most 100 rows per call; NextBatch() takes an int64_t batch size.
    int64_t n = std::min<int64_t>(100, nRowsRemaining);
    std::shared_ptr<arrow::ChunkedArray> chunkedArray;
    auto status = columnReader->NextBatch(n, &chunkedArray);
    if (!status.ok()) break;  // handle the error as appropriate
    // ... and then use `chunkedArray`
    nRowsRemaining -= n;
}

(The context is: "convert Parquet to CSV/JSON, with small memory footprint." Used in https://github.com/CJWorkbench/parquet-to-arrow)

Normally, the first NextBatch() return value looks like val0...val99; the second return value looks like val100...val199; and so on.

... but with a ByteArrayDictionaryRecordReader, that isn't the case. The first NextBatch() return value looks like val0...val99; the second looks like val0...val99, val100...val199 (a ChunkedArray with two arrays); the third looks like val0...val99, val100...val199, val200...val299 (a ChunkedArray with three arrays); and so on. The arrays returned by earlier calls are never cleared, so each call returns every value read so far.

In sum: NextBatch() on a dictionary column reader returns the wrong values.
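
To make the symptom concrete, here is a minimal toy model of the behavior described above. It is not Arrow code: `ToyChunkedReader`, `Chunk`, and `TotalLength` are hypothetical names invented for this sketch, and row values are plain integers standing in for val0, val1, and so on. The only assumption it encodes is the one stated in this report, namely that a reader which appends each new chunk to an internal list and never clears that list will return a growing result on every call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One "array" inside a chunked result; ints stand in for val0, val1, ...
using Chunk = std::vector<int>;

// Hypothetical toy reader (NOT the Arrow implementation). It keeps an
// internal list of chunks, like a record reader that builds up result chunks.
class ToyChunkedReader {
 public:
  ToyChunkedReader(int total_rows, bool clear_between_batches)
      : total_rows_(total_rows), clear_(clear_between_batches) {}

  // Returns the chunks for the next `n` rows (fewer at end of column).
  std::vector<Chunk> NextBatch(int n) {
    assert(n >= 0);
    // The reported behavior corresponds to skipping this reset.
    if (clear_) chunks_.clear();

    const int end = std::min(next_row_ + n, total_rows_);
    Chunk chunk;
    for (int i = next_row_; i < end; ++i) chunk.push_back(i);
    next_row_ = end;

    chunks_.push_back(std::move(chunk));
    return chunks_;  // copy of everything accumulated so far
  }

 private:
  int total_rows_;
  bool clear_;
  int next_row_ = 0;
  std::vector<Chunk> chunks_;
};

// Total number of values across all chunks of one result.
std::size_t TotalLength(const std::vector<Chunk>& chunks) {
  std::size_t total = 0;
  for (const Chunk& c : chunks) total += c.size();
  return total;
}
```

With `clear_between_batches` set to false, the second `NextBatch(100)` call returns two chunks holding 200 values in total, which matches the pattern above. With it set to true, each call returns a single chunk of the requested rows. A caller-side check along the same lines (compare the total length of the result against the requested batch size) is one way to detect the problem.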

I've attached a minimal Parquet file that presents this problem with the above code; and I've written a patch that fixes this one case, to illustrate where things are wrong. I don't think I understand enough edge cases to decree that my patch is a correct fix.

Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
Reporter: Adam Hooper / @adamhooper
Assignee: Adam Hooper / @adamhooper

Related issues:

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-6895. Please see the migration documentation for further details.
