Pre-allocate buffer #422

JoaoAparicio · 2023-04-11T01:32:03Z

If we let transcode do its own allocation, it allocates a small vector, starts filling it, resizes the vector, fills it some more, resizes it again, and so on.

Instead, in this commit we pre-allocate a vector of the correct size and pass it to transcode().
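
To illustrate the difference, here is a minimal Base-only sketch. It is not the code from this PR: `fill_growing` and `fill_preallocated!` are hypothetical stand-ins for a decoder that writes its output in chunks, and the chunk size and doubling policy are assumptions made for illustration.

```julia
# Hypothetical stand-in for a decoder that starts with a small output
# vector and grows it whenever it runs out of room.
function fill_growing(src::Vector{UInt8}; chunk::Int = 16)
    out = Vector{UInt8}(undef, chunk)      # small initial allocation
    written = 0
    resizes = 0
    while written < length(src)
        n = min(chunk, length(src) - written)
        if written + n > length(out)
            resize!(out, 2 * length(out))  # grow; may reallocate and copy
            resizes += 1
        end
        copyto!(out, written + 1, src, written + 1, n)
        written += n
    end
    resize!(out, written)                  # trim to the final size
    return out, resizes
end

# Hypothetical stand-in for the pre-allocated path: the caller already
# knows the final size and hands in a buffer of exactly that length.
function fill_preallocated!(out::Vector{UInt8}, src::Vector{UInt8}; chunk::Int = 16)
    length(out) == length(src) || throw(ArgumentError("output buffer has the wrong size"))
    written = 0
    while written < length(src)
        n = min(chunk, length(src) - written)
        copyto!(out, written + 1, src, written + 1, n)
        written += n
    end
    return out
end

src = rand(UInt8, 1_000)
grown, resizes = fill_growing(src)
pre = fill_preallocated!(Vector{UInt8}(undef, length(src)), src)
```

Both variants produce the same bytes, but the growing one pays for several intermediate resizes, while the pre-allocated one allocates once. Pre-allocating is possible here because the uncompressed length is known before decompression starts.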

Inspired by #399

JoaoAparicio · 2023-04-11T01:35:09Z

Waiting on release of
JuliaIO/TranscodingStreams.jl#136

Moelf · 2023-04-11T02:39:58Z

any thoughts on this?

Feather file with compression and larger than RAM #340

I know you were aware of this issue, and thanks a lot for looking into quite a few issues here and in TranscodingStreams

JoaoAparicio · 2023-04-11T03:17:38Z

I have some thoughts. One solution to "my dataset is larger than memory" is partitioning. If your dataset is partitioned in such a way that each partition fits in memory, you can iterate it with

stream = Arrow.Stream(path)
for tbl in stream
    ...
end

You can do this right now without requiring any additional features from this package.

In contrast, what is discussed in #340 (don't decompress if you don't have to) is a different approach, but it doesn't exist yet.

Currently I have some commits that add multi-threaded decompression at the buffer level. I will be trying to upstream what I have so far. The difficulty is that these commits touch a lot of code, so this won't happen overnight. I imagine a couple of weeks? On top of that, it should be straightforward to implement what is discussed in #340.

Moelf · 2023-04-11T03:27:23Z

stream = Arrow.Stream(path)
for tbl in stream

The problem with this approach is that it decompresses every column. Consider examples such as:

[Discussion] Need for early-returning friendly iteration interface #417

Decompressing every column would be much slower if I'm only using a small % of columns

baumgold

LGTM

baumgold · 2023-04-11T14:48:50Z

@quinnj / @ericphanson - any comments before we merge?

ericphanson · 2023-04-11T14:56:15Z

nope, LGTM!

JoaoAparicio marked this pull request as ready for review April 11, 2023 01:32

JoaoAparicio mentioned this pull request Apr 11, 2023

#132 allow users to optionally provide an output buffer when calling transcode JuliaIO/TranscodingStreams.jl#136

Merged

JoaoAparicio force-pushed the preallocatebuffer branch from 697df28 to 383d0fb Compare April 11, 2023 03:14

Bugfix compat

5a69758

baumgold approved these changes Apr 11, 2023

baumgold merged commit f8f8d8e into apache:main Apr 11, 2023

ericphanson mentioned this pull request Oct 15, 2023

bump #488

Merged
