Memory usage RecordBatchStreamWriter

Hi.

I have a Monte Carlo calculator that yields a couple of hundred Nx1 NumPy arrays. I need to develop further functionality on it, and since that can't be done easily without access to the full set, I'm pursuing the route of exporting the arrays. I found PyArrow and got excited. The first wall I hit was that the IPC writer could not write "columns". A Stack Overflow post and two weeks later, I'm writing each array to its own single-column file with a stream writer, using write_table with a chunk size (write_batch has no such parameter). I then combine all the files into a single file by opening a reader for every file, reading the corresponding "part" batches, combining them into a single RecordBatch, and writing that. The whole idea is that I can later pull in parts of the complete set, with all columns (which would fit in memory), and process further.

Now, everything works, but following along in Task Manager, I see that memory simply skyrockets when I write. I would expect memory consumption to stay around the size of my group batches and then some. The whole point of this exercise is having stuff fit in memory, and I cannot see how to achieve this. It makes me wonder if I'm a complete idiot when I read [efficiently-writing-and-reading-arrow-data](https://arrow.apache.org/docs/python/ipc.html#efficiently-writing-and-reading-arrow-data): have I done something wrong, or am I looking at it wrong? I have attached a Python file with a simple attempt. I have tried the file writers, using Tables instead of batches, and refactoring in every way I could think of.

 

A snip:
 
```python
import os

import pyarrow as pa

# One stream reader per temporary single-column file.
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])

out_path = os.path.join(self.path, self.outfile_name)
with pa.ipc.new_stream(out_path, schema=combined_schema) as writer:
    # zip yields the i-th batch of every reader as one group.
    for group in zip(*readers):
        combined_batch = pa.RecordBatch.from_arrays(
            [g.column(0) for g in group], names=combined_schema.names)
        writer.write_batch(combined_batch)
```
 

**Environment**: Windows 11, Python 3.9.2
**Reporter**: [Stig Korsnes](https://issues.apache.org/jira/browse/ARROW-15920)
#### Original Issue Attachments:
- [demo.py](https://issues.apache.org/jira/secure/attachment/13041008/demo.py)
- [mem.png](https://issues.apache.org/jira/secure/attachment/13041009/mem.png)

<sub>**Note**: *This issue was originally created as [ARROW-15920](https://issues.apache.org/jira/browse/ARROW-15920). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

Memory usage RecordBatchStreamWriter #31349