get_buffer_memory_size() returns an incorrect larger size #5352
Comments
I think this PR might help clear up confusion, as it enhances the docstring (arrow-rs/arrow-array/src/array/byte_array.rs, lines 464 to 471 in 31cf5ce):

```rust
/// Note that this does not always correspond to the exact memory usage of an array,
/// since multiple arrays can share the same buffers or slices thereof.
```

This should help explain the discrepancy.
To further expand on the above: the IPC reader avoids copying by slicing buffers, and this is a common source of arrays with such shared buffers.
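To make the double counting concrete, here is a hedged plain-Rust sketch. The `SlicedBuffer` type is invented for illustration (it is not arrow-rs's actual `Buffer`): two views slice the same reference-counted allocation, so a per-view sum of backing sizes, roughly what summing a `get_buffer_memory_size`-style figure per array does, reports far more than the logical bytes.

```rust
use std::sync::Arc;

// A toy model of an Arrow buffer: reference-counted backing storage plus
// an (offset, len) view into it. These are invented types, not arrow-rs.
struct SlicedBuffer {
    data: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SlicedBuffer {
    // The bytes this view logically refers to.
    fn bytes(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
    // Naive accounting: charge the whole backing allocation to every view.
    fn backing_len(&self) -> usize {
        self.data.len()
    }
}

fn main() {
    let backing = Arc::new(vec![0u8; 1000]);
    // Two views into the same 1000-byte allocation, as an IPC-style reader
    // produces when it slices instead of copying.
    let a = SlicedBuffer { data: Arc::clone(&backing), offset: 0, len: 100 };
    let b = SlicedBuffer { data: Arc::clone(&backing), offset: 100, len: 200 };

    let logical = a.bytes().len() + b.bytes().len(); // 300 bytes
    let charged = a.backing_len() + b.backing_len(); // 2000 bytes
    println!("logical = {logical}, charged = {charged}");
}
```

The shared 1000-byte allocation is counted once per view, which is how a sliced IPC batch can report several times its logical data size.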
Thank you, this is super helpful! If I'm only interested in the cumulative size of each element, without considering the underlying implementation (whether it's copying or slicing), which method should I use?

The context is that we want to serialize the Arrow data to JSON, and before serialization we want a rough estimate of the serialized size based on Arrow's data size. In this case, we are interested in the logical size of the data itself.
There is https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size, but if you're only interested in strings you could do something like summing the string values' byte lengths directly.

I should highlight that this will ignore the overhead from encoding field names in JSON, though, and will be a pretty crude approximation.
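As a hedged illustration of such a crude approximation, here is a plain-Rust sketch. The `estimate_json_size` helper and the field names are made up for illustration; real code would iterate arrow-rs string columns rather than `Vec<&str>`.

```rust
// Crude estimate of the serialized JSON size of rows of string columns.
// Plain-Rust sketch with invented names, not arrow-rs.
fn estimate_json_size(fields: &[(&str, Vec<&str>)]) -> usize {
    let rows = fields.first().map_or(0, |(_, col)| col.len());
    let mut total = 0;
    for (name, col) in fields {
        for value in col {
            // "name":"value" -> name + value + two quote pairs + colon
            total += name.len() + value.len() + 5;
        }
    }
    // Per-row braces and separators: very rough, and escaping is ignored.
    total += rows * 3;
    total
}

fn main() {
    let fields = vec![
        ("id", vec!["a1", "b2"]),
        ("city", vec!["Oslo", "Lima"]),
    ];
    println!("estimated bytes: {}", estimate_json_size(&fields));
}
```

As the comment above notes, this ignores JSON escaping and field-name overhead subtleties, so treat the result as a ballpark figure only.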
We are interested in other types in addition to strings.
You can call `get_slice_memory_size` for other array types as well.
But it seems it will do a clone.
Clone is cheap on buffers, as the underlying storage is reference counted. |
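As a plain-Rust illustration of why such a clone is cheap (this models the reference counting with `Arc`, not arrow-rs's actual `Buffer` type): cloning a reference-counted buffer copies a pointer and bumps a counter, while the bytes themselves stay shared.

```rust
use std::sync::Arc;

fn main() {
    // A large "buffer" behind a reference count.
    let buffer: Arc<Vec<u8>> = Arc::new(vec![0u8; 10_000_000]);

    // Cloning the Arc does not copy the 10 MB of data;
    // it only increments the reference count.
    let clone = Arc::clone(&buffer);

    // Both handles point at the exact same allocation.
    assert!(Arc::ptr_eq(&buffer, &clone));
    assert_eq!(Arc::strong_count(&buffer), 2);

    println!("shared: {}", Arc::ptr_eq(&buffer, &clone));
}
```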
Which part is this question about

The `get_buffer_memory_size()` method.

Describe your question

We have an Arrow chunk which consists of two columns, both of a generic string type.

If we use the `value()` method to pull out each string value and add the lengths up, the total size is 108684 bytes. If we use `value_data().len()`, it also gives us 108684 bytes. `value_offsets().len()` gives us 5430. But `get_buffer_memory_size()` gives us 521664 bytes. The null buffer is 0.

I noticed that the implementation of `get_buffer_memory_size()` basically adds up the `value_offsets`, `value_data`, and null buffers. So why is there such a significant difference? Here, 108684 is what we expect.