
[C++][Datasets] Improve memory usage of datasets API when scanning parquet #30892

@asfimport

Description


This is a more targeted fix to improve memory usage when scanning parquet files. It is related to broader issues like ARROW-14648, but those will likely take longer to fix. The goal here is to make it possible to scan large parquet datasets with many files, where each file has reasonably sized row groups (e.g. 1 million rows). Currently we run out of memory scanning a configuration as simple as:

21 parquet files
Each parquet file has 10 million rows split into row groups of size 1 million
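To see why a configuration like this can exhaust memory, here is a rough back-of-the-envelope estimate. Every constant below (decoded row width, readahead counts) is an illustrative assumption for this sketch, not Arrow's actual defaults:

```python
# Back-of-the-envelope estimate of peak decoded memory while scanning.
# All constants are illustrative assumptions, not Arrow defaults.
ROW_GROUP_ROWS = 1_000_000        # row-group size from the report above
BYTES_PER_ROW = 100               # assumed average decoded row width
ROW_GROUP_BYTES = ROW_GROUP_ROWS * BYTES_PER_ROW   # ~100 MB per row group

FILES_IN_FLIGHT = 8               # assumed file (fragment) readahead
ROW_GROUPS_IN_FLIGHT = 4          # assumed row-group readahead per file

# If the scanner keeps this many decoded row groups buffered at once,
# peak memory is already in the gigabytes:
peak_bytes = FILES_IN_FLIGHT * ROW_GROUPS_IN_FLIGHT * ROW_GROUP_BYTES
print(f"~{peak_bytes / 2**30:.1f} GiB buffered")
```

Under these assumptions the scanner would hold roughly 3 GiB of decoded data at once; shrinking the readahead-times-row-group-size product is the kind of knob a targeted fix can tune.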

Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-15410. Please see the migration documentation for further details.
