While we can convert a pandas.DataFrame to a single (arbitrarily large) arrow::RecordBatch, it is not easy to create multiple small record batches – we could do so in a streaming fashion and immediately write them into an arrow::io::OutputStream.
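For concreteness, here is a minimal sketch of that workflow using present-day pyarrow names (pa.Table.from_pandas, Table.to_batches, and pa.ipc.new_stream post-date this issue, so treat this as an illustration rather than the interface being discussed): convert the DataFrame once, split it into batches of bounded size, and write each batch into an output stream as it is produced.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": range(1_000_000)})

# Convert once, then split into record batches of bounded row count.
table = pa.Table.from_pandas(df)
batches = table.to_batches(max_chunksize=64 * 1024)  # rows per batch

# Stream each small batch into an output sink as it is produced.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    for batch in batches:
        writer.write_batch(batch)

buf = sink.getvalue()  # pyarrow Buffer holding the stream-format bytes
```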
Matthew Rocklin / @mrocklin:
At the moment I don't have any active use cases for this. We tend to handle pandas dataframes as atomic blocks of data.
However, generally I agree that streaming chunks in a more granular way is probably a better way to go. Non-blocking IO quickly becomes blocking IO if data starts overflowing local buffers. This is the sort of technology that might influence future design decisions.
From a pure Dask perspective, my ideal serialization interface is Python object -> iterator of memoryview objects.
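As an illustration of that interface, a hypothetical adapter (the name to_memoryviews and the single-batch strategy are assumptions, not anything Dask or Arrow ships) could serialize a DataFrame through Arrow's streaming format and expose the result via the buffer protocol:

```python
from typing import Iterator

import pandas as pd
import pyarrow as pa

def to_memoryviews(df: pd.DataFrame) -> Iterator[memoryview]:
    """Hypothetical adapter: Python object -> iterator of memoryviews."""
    batch = pa.RecordBatch.from_pandas(df)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    # pyarrow Buffers implement the buffer protocol, so the serialized
    # bytes can be handed to non-blocking IO without copying.
    yield memoryview(sink.getvalue())
```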
Wes McKinney / @wesm:

> Non-blocking IO quickly becomes blocking IO if data starts overflowing local buffers. This is the sort of technology that might influence future design decisions.

Indeed – it would be fairly easy to implement spill-to-disk tools using either the streaming or random access format.
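A sketch of what spill-to-disk could look like with today's pyarrow, assuming the IPC file (random access) format and a memory-mapped read path; the helper names spill/unspill, the path handling, and the chunk size are illustrative:

```python
import pyarrow as pa

def spill(table: pa.Table, path: str) -> None:
    # Write batches to disk in the IPC file (random access) format.
    with pa.OSFile(path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            for batch in table.to_batches(max_chunksize=64 * 1024):
                writer.write_batch(batch)

def unspill(path: str) -> pa.Table:
    # Memory-map the file so the spilled data is not eagerly copied
    # back into RAM on open.
    with pa.memory_map(path, "rb") as source:
        return pa.ipc.open_file(source).read_all()
```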
Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-504. Please see the migration documentation for further details.