[Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format #16145

Closed
asfimport opened this issue Jan 21, 2017 · 5 comments

asfimport commented Jan 21, 2017

While we can convert a pandas.DataFrame to a single (arbitrarily large) arrow::RecordBatch, it is not easy to create multiple small record batches. We could convert them in a streaming fashion and immediately write each one into an arrow::io::OutputStream.

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-504. Please see the migration documentation for further details.


Wes McKinney / @wesm:
cc @mrocklin – you may have some use cases for converting DataFrames to streams in dask.dataframe


Matthew Rocklin / @mrocklin:
At the moment I don't have any active use cases for this. We tend to handle pandas dataframes as atomic blocks of data.

However, in general I agree that streaming chunks at a finer granularity is probably a better way to go. Non-blocking IO quickly becomes blocking IO if data starts overflowing local buffers. This is the sort of technology that might influence future design decisions.

From a pure Dask perspective my ideal serialization interface is Python object -> iterator of memoryview objects.
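
As a rough sketch of that interface shape, assuming the serializer has already produced one contiguous bytes-like buffer, the "iterator of memoryview objects" half can be written with the standard library alone. The name `iter_memoryviews` and the `chunk_bytes` parameter are hypothetical and are not Dask or Arrow API.

```python
from typing import Iterator


def iter_memoryviews(data, chunk_bytes: int) -> Iterator[memoryview]:
    """Yield memoryview slices of `data`, each at most `chunk_bytes` long.

    `data` must be a contiguous object supporting the buffer protocol
    (bytes, bytearray, ...). Slicing a memoryview does not copy the data.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be a positive integer")

    # Present the buffer as a flat sequence of unsigned bytes.
    view = memoryview(data).cast("B")
    for start in range(0, view.nbytes, chunk_bytes):
        yield view[start:start + chunk_bytes]
```

Each yielded view can be handed to a socket or file write call one at a time, so the consumer controls how much data is in flight.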


Wes McKinney / @wesm:

> Non-blocking IO quickly becomes blocking IO if data starts overflowing local buffers. This is the sort of technology that might influence future design decisions.

Indeed – it would be fairly easy to implement spill-to-disk tools using either the streaming or the random access format.


Wes McKinney / @wesm:
Part of #1364


Uwe Korn / @xhochy:
Issue resolved by pull request #1364.
