Hi.
I have a monte-carlo calcuator that yields a couple of hundred Nx1 numpy arrays. I need to develop further functionality on it, and since it cant be solved easily without having access to the full set Im pursuing the route of exporting them. Found PyArrow and got exited. First wall I hit, was that the writer could not write "columns" (IPC). A stackoverflow post, and two weeks later, Im writing my arrays to single file-single column with a stream writer ,using write_table and chunksize (write_batch has no such parameter) .Im then combining all files to a single file by using a reader for every file and reading the corresponding "part"-batches. I then combine them to a single recordbatch and write. The whole idea is that I can later pull in parts of the complete set/all columns (which would fit in memory) and process further. Now, everything works, but following along on my task manager, I see that memory simply skyrockets when I write. I would expect memory consumption to stay around the size of my group batches and then some. The whole point of this exercise is having stuff fit in memory, and I can not see how I can achieve this. It makes me wonder if I`m a complete idiot when I read efficiently-writing-and-reading-arrow-data, have I done something wrong or am I looking at it wrong? I have attached a python file with a simple attempt. I have tried the filewriters, doing Tables instead of batches and refactoring in all thinkable ways.
A snip:
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])
with pa.ipc.new_stream(os.path.join(self.path, self.outfile_name ), schema=combined_schema,) as writer:
for group in zip(*readers):
combined_batch = pa.RecordBatch.from_arrays(
[g.column(0) for g in group], names=combined_schema.names)
writer.write_batch(combined_batch)
Environment: Windows 11 , Python 3.9.2
Reporter: Stig Korsnes
Original Issue Attachments:
Note: This issue was originally created as ARROW-15920. Please see the migration documentation for further details.
Hi.
I have a monte-carlo calcuator that yields a couple of hundred Nx1 numpy arrays. I need to develop further functionality on it, and since it can
t be solved easily without having access to the full set Im pursuing the route of exporting them. Found PyArrow and got exited. First wall I hit, was that the writer could not write "columns" (IPC). A stackoverflow post, and two weeks later, Im writing my arrays to single file-single column with a stream writer ,using write_table and chunksize (write_batch has no such parameter) .Im then combining all files to a single file by using a reader for every file and reading the corresponding "part"-batches. I then combine them to a single recordbatch and write. The whole idea is that I can later pull in parts of the complete set/all columns (which would fit in memory) and process further. Now, everything works, but following along on my task manager, I see that memory simply skyrockets when I write. I would expect memory consumption to stay around the size of my group batches and then some. The whole point of this exercise is having stuff fit in memory, and I can not see how I can achieve this. It makes me wonder if I`m a complete idiot when I read efficiently-writing-and-reading-arrow-data, have I done something wrong or am I looking at it wrong? I have attached a python file with a simple attempt. I have tried the filewriters, doing Tables instead of batches and refactoring in all thinkable ways.A snip:
Environment: Windows 11 , Python 3.9.2
Reporter: Stig Korsnes
Original Issue Attachments:
Note: This issue was originally created as ARROW-15920. Please see the migration documentation for further details.