[C++] Improve performance of parquet readahead #31683

@asfimport

Description

The 7.0.0 readahead for parquet would read up to 256 row groups at once, which meant that, if the consumer was too slow, we would almost certainly run out of memory.

ARROW-15410 improved readahead as a whole and, in the process, changed parquet so it's always reading 1 row group in advance.

This is not always ideal in S3 scenarios. If the row groups are small, we may want to read many of them in advance. To fix this, we should continue issuing reads in parallel until at least batch_size * batch_readahead rows are being fetched.
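
The proposed stopping rule could be sketched roughly as follows. This is only an illustration, not the actual Arrow implementation: `RowGroupsToFetch` and its parameters are hypothetical names, and the per-row-group row counts are assumed to be known up front (for example from the file metadata).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the Arrow API): returns how many row
// groups, starting at `next_row_group`, should be fetched in parallel so that
// at least batch_size * batch_readahead rows are in flight.
std::size_t RowGroupsToFetch(const std::vector<int64_t>& row_group_num_rows,
                             std::size_t next_row_group, int64_t batch_size,
                             int64_t batch_readahead) {
  assert(batch_size > 0 && batch_readahead >= 0);
  const int64_t target_rows = batch_size * batch_readahead;
  int64_t rows_in_flight = 0;
  std::size_t count = 0;
  // Always fetch at least one row group (if any remain) so the scan makes
  // progress, then keep adding row groups until the row target is reached.
  while (next_row_group + count < row_group_num_rows.size() &&
         (count == 0 || rows_in_flight < target_rows)) {
    rows_in_flight += row_group_num_rows[next_row_group + count];
    ++count;
  }
  return count;
}
```

Under this rule, a file with large row groups behaves like the current one-row-group readahead, while a file with many small row groups gets several requests in flight at once. Memory use stays bounded by roughly the row target plus one row group.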

Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-16294. Please see the migration documentation for further details.
