No obvious mechanism for partitioning groups of record batches #46432

@apjoseph

Description

Describe the enhancement requested

I've got a group of large fixed-width text files that need to be treated as a single dataset in pyarrow, without the onerous requirement of reading them all into memory at once.

Because Arrow has no support for fixed-width files, and `open_csv` has no mechanism for supplying a function to transform lines, I created a `RecordBatchReader` for each file and attempted to pass the sequence of readers to `pyarrow.dataset.dataset`, as the docs state should be possible (see the sketch below).
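Roughly, the approach looked like the following. The column layout, file names, and batch size are placeholders; the point is the final `ds.dataset(readers, ...)` call, which is the step that fails in practice.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical fixed-width layout: (column name, start offset, end offset)
COLSPECS = [("id", 0, 8), ("name", 8, 28), ("value", 28, 40)]
SCHEMA = pa.schema([(name, pa.string()) for name, _, _ in COLSPECS])

def fixed_width_batches(path, batch_size=64_000):
    """Lazily yield RecordBatches by slicing each line at fixed character offsets."""
    rows = []
    with open(path, "r") as f:
        for line in f:
            rows.append([line[start:stop].strip() for _, start, stop in COLSPECS])
            if len(rows) >= batch_size:
                yield pa.RecordBatch.from_arrays(
                    [pa.array(col) for col in zip(*rows)], schema=SCHEMA
                )
                rows = []
    if rows:
        yield pa.RecordBatch.from_arrays(
            [pa.array(col) for col in zip(*rows)], schema=SCHEMA
        )

paths = ["part-0000.txt", "part-0001.txt"]  # placeholder file names
readers = [
    pa.RecordBatchReader.from_batches(SCHEMA, fixed_width_batches(p)) for p in paths
]

# The docs suggest a sequence of readers should be accepted here, but this is
# the call that currently breaks (see #38012).
dataset = ds.dataset(readers, schema=SCHEMA)
```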

However, due to #38012 this is currently broken.

As such, there doesn't appear to be any supported mechanism for lazily loading partitions of record batches.

As far as I can tell from the docs, the only possible workaround for large data sources that don't fit the narrow set of supported file types would be to write an entire fsspec filesystem and customize the returned `TextIOBase` so that, as each file is read, its lines are converted to a standard CSV format that `open_csv` can deal with (roughly as sketched below).
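Concretely, that workaround would have to look something like the sketch below. The class, protocol name, and column offsets are made up for illustration, and the conversion is done eagerly just to keep the example short; a real version would need a streaming file-like object.

```python
import io
from fsspec.spec import AbstractFileSystem

class FixedWidthAsCSVFileSystem(AbstractFileSystem):
    """Illustrative fsspec filesystem that rewrites fixed-width files as CSV."""

    protocol = "fwf2csv"  # hypothetical protocol name

    def __init__(self, colspecs, **kwargs):
        super().__init__(**kwargs)
        self.colspecs = colspecs  # list of (start, stop) offsets per field

    def _open(self, path, mode="rb", **kwargs):
        # Translate every fixed-width line into a comma-separated line.
        # Done eagerly here for brevity, which already gives up the laziness
        # this whole exercise is supposed to preserve.
        out = []
        with open(path, "r") as f:
            for line in f:
                out.append(",".join(line[a:b].strip() for a, b in self.colspecs))
        return io.BytesIO(("\n".join(out) + "\n").encode())
```

And even this ignores the other filesystem methods (`info`, `ls`, etc.) that dataset discovery would need, plus the buffering required to make the conversion actually streaming.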

Needless to say, such a solution would be extremely convoluted and error-prone, and would eliminate much of the utility of Arrow/pyarrow.

There really should be some straightforward mechanism for lazily loading record batches; otherwise, you need yet another intermediary dataframe library just to operate quickly on data stored in text files without overwhelming system memory.

Component(s)

Python
