While we can convert a pandas.DataFrame to a single (arbitrarily large) arrow::RecordBatch, it is not easy to create multiple small record batches – we could do so in a streaming fashion and immediately write them into an arrow::io::OutputStream.
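For concreteness, here is a minimal sketch of that workflow using present-day pyarrow names (pa.Table.from_pandas, Table.to_batches, and pa.ipc.new_stream post-date this issue, so treat this as an illustration rather than the interface being discussed): convert the DataFrame once, split it into batches of bounded size, and write each batch into an output stream as it is produced.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": range(1_000_000)})

# Convert once, then split into record batches of bounded row count.
table = pa.Table.from_pandas(df)
batches = table.to_batches(max_chunksize=64 * 1024)  # rows per batch

# Stream each small batch into an output sink as it is produced.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    for batch in batches:
        writer.write_batch(batch)

buf = sink.getvalue()  # pyarrow Buffer holding the stream-format bytes
```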
Matthew Rocklin / @mrocklin:
At the moment I don't have any active use cases for this. We tend to handle pandas dataframes as atomic blocks of data.
However, generally I agree that streaming chunks in a more granular way is probably a better way to go. Non-blocking IO quickly becomes blocking IO if data starts overflowing local buffers. This is the sort of technology that might influence future design decisions.
From a pure Dask perspective, my ideal serialization interface is Python object -> iterator of memoryview objects.
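As an illustration of that interface, a hypothetical adapter (the name to_memoryviews and the single-batch strategy are assumptions, not anything Dask or Arrow ships) could serialize a DataFrame through Arrow's streaming format and expose the result via the buffer protocol:

```python
from typing import Iterator

import pandas as pd
import pyarrow as pa

def to_memoryviews(df: pd.DataFrame) -> Iterator[memoryview]:
    """Hypothetical adapter: Python object -> iterator of memoryviews."""
    batch = pa.RecordBatch.from_pandas(df)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    # pyarrow Buffers implement the buffer protocol, so the serialized
    # bytes can be handed to non-blocking IO without copying.
    yield memoryview(sink.getvalue())
```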
Wes McKinney / @wesm:

> Non-blocking IO quickly becomes blocking IO if data starts overflowing local buffers. This is the sort of technology that might influence future design decisions.

Indeed – it would be fairly easy to implement spill-to-disk tools using either the streaming or random access format.
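A sketch of what spill-to-disk could look like with today's pyarrow, assuming the IPC file (random access) format and a memory-mapped read path; the helper names spill/unspill, the path handling, and the chunk size are illustrative:

```python
import pyarrow as pa

def spill(table: pa.Table, path: str) -> None:
    # Write batches to disk in the IPC file (random access) format.
    with pa.OSFile(path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            for batch in table.to_batches(max_chunksize=64 * 1024):
                writer.write_batch(batch)

def unspill(path: str) -> pa.Table:
    # Memory-map the file so the spilled data is not eagerly copied
    # back into RAM on open.
    with pa.memory_map(path, "rb") as source:
        return pa.ipc.open_file(source).read_all()
```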
Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-504. Please see the migration documentation for further details.