get_buffer_memory_size() returns an incorrect larger size #5352

Closed
Yifeng-Sigma opened this issue Jan 31, 2024 · 8 comments
Labels
question Further information is requested

Comments

@Yifeng-Sigma

Yifeng-Sigma commented Jan 31, 2024

Which part is this question about

Question about get_buffer_memory_size() method

Describe your question

We have an Arrow chunk consisting of two columns, both of generic string type.
If we use the value() method to pull out each string value and add up the lengths, the total size is 108684 bytes.
value_data().len() gives us 108684 bytes.
value_offsets().len() gives us 5430 (the number of offsets).
get_buffer_memory_size(), however, gives us 521664 bytes.
The NullBuffer is 0.

I noticed that get_buffer_memory_size() basically adds up the value offsets, the value data, and the null buffer. Why is there such a significant difference? 108684 is what we expect.

Additional context

@Yifeng-Sigma Yifeng-Sigma added the question Further information is requested label Jan 31, 2024
@Jefffrey
Contributor

I think this PR might help clear up confusion, as it enhances the docstring of get_buffer_memory_size(): #5347

    /// Note that this does not always correspond to the exact memory usage of an array,
    /// since multiple arrays can share the same buffers or slices thereof.

Note that value_data().len() gives the len, i.e. how many bytes are used in the buffer, whilst get_buffer_memory_size() calculates using the capacity of the buffer (how much memory is allocated to it), which might be larger than the len:

fn get_buffer_memory_size(&self) -> usize {
    let mut sum = self.value_offsets.inner().inner().capacity();
    sum += self.value_data.capacity();
    if let Some(x) = &self.nulls {
        sum += x.buffer().capacity()
    }
    sum
}

This should help explain the discrepancy.
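
To make the len-versus-capacity distinction concrete, here is a minimal sketch that uses a plain `Vec<u8>` as a stand-in for an Arrow buffer. It does not call arrow-rs; the helper names `build_buffer`, `used_bytes`, and `allocated_bytes` are hypothetical and exist only for illustration.

```rust
/// Builds a buffer that reserves at least `reserve` bytes but stores only `data`.
/// (Hypothetical helper; a `Vec<u8>` stands in for an Arrow buffer.)
pub fn build_buffer(reserve: usize, data: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(reserve);
    buf.extend_from_slice(data);
    buf
}

/// Bytes actually written, analogous to `value_data().len()`.
pub fn used_bytes(buf: &Vec<u8>) -> usize {
    buf.len()
}

/// Bytes reserved by the allocation, analogous to summing `capacity()`.
pub fn allocated_bytes(buf: &Vec<u8>) -> usize {
    buf.capacity()
}
```

For example, `build_buffer(1024, b"hello")` yields a buffer whose `used_bytes` is 5 while `allocated_bytes` is at least 1024.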

@tustvold
Contributor

tustvold commented Jan 31, 2024

To further expand on the above, the IPC reader avoids copying by slicing buffers, and this is a common source of arrays with such shared buffers
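
A rough model of why sliced, shared buffers inflate the reported number: the sketch below assumes each array holds a view (offset and length) into one reference-counted allocation and reports the capacity of the whole allocation. The `View` type and its methods are hypothetical, written with the standard library only, and are not the arrow-rs implementation.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an array's view into a shared allocation.
pub struct View {
    backing: Arc<Vec<u8>>, // shared, reference-counted storage
    offset: usize,         // where this view starts in the allocation
    len: usize,            // how many bytes this view covers
}

impl View {
    pub fn new(backing: &Arc<Vec<u8>>, offset: usize, len: usize) -> Self {
        assert!(offset + len <= backing.len(), "view out of bounds");
        View { backing: Arc::clone(backing), offset, len }
    }

    /// The bytes this view logically contains.
    pub fn bytes(&self) -> &[u8] {
        &self.backing[self.offset..self.offset + self.len]
    }

    /// Size of the data the view covers.
    pub fn logical_size(&self) -> usize {
        self.len
    }

    /// Capacity of the entire shared allocation, counted once per view.
    pub fn reported_size(&self) -> usize {
        self.backing.capacity()
    }
}
```

With a 100-byte backing allocation and two 10-byte views, the logical sizes add up to 20 while the reported sizes add up to at least 200, which is the same kind of gap the issue describes.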

@Yifeng-Sigma
Author

Yifeng-Sigma commented Jan 31, 2024

Thank you, this is super helpful!

If I'm only interested in the cumulative size of the elements, regardless of the underlying implementation (whether it copies or slices), should I use value_data().len()?

The context is that we want to serialize the Arrow data to JSON, and before serialization we want a rough estimate of the serialized size based on the size of the Arrow data.

In this case, the size we are interested in is the one obtained by pulling out each string value with the value() method and adding up the lengths, which equals value_data().len().

@tustvold
Contributor

tustvold commented Jan 31, 2024

There is https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size but if you're only interested in strings you could do something like

let offsets = array.offsets();
let total_length = offsets.last().unwrap().as_usize() - offsets.first().unwrap().as_usize();

I should highlight that this will ignore the overheads from encoding field names in JSON though, and will be a pretty crude approximation.
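
The offset arithmetic above can be sketched without arrow-rs by treating the offsets as a plain `&[i32]` slice (the 32-bit case; large string arrays use 64-bit offsets). The function name `total_value_bytes` is hypothetical. Because it subtracts the first offset from the last, it stays correct for a sliced array whose first offset is not zero.

```rust
/// Total number of value bytes covered by a run of string offsets.
/// Assumes the offsets are non-negative and non-decreasing, as in a
/// valid string array. Returns 0 for an empty slice.
pub fn total_value_bytes(offsets: &[i32]) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(first), Some(last)) => (last - first) as usize,
        _ => 0,
    }
}
```

For the strings "a", "bc", "def", "gh" the offsets are `[0, 1, 3, 6, 8]`, so the full array covers 8 bytes, while a slice holding only "bc" and "def" has offsets `[1, 3, 6]` and covers 5 bytes.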

@Yifeng-Sigma
Author

Yifeng-Sigma commented Feb 1, 2024

> There is https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size but if you're only interested in strings you could do something like
>
> let offsets = array.offsets();
> let total_length = offsets.last().unwrap().as_usize() - offsets.first().unwrap().as_usize();
>
> I should highlight that this will ignore the overheads from encoding field names in JSON though, and will be a pretty crude approximation.

We are interested in types other than string as well, so get_slice_memory_size() is exactly what I'm looking for!
It seems that only get_buffer_memory_size() and get_array_memory_size() are implemented for Array, while get_slice_memory_size() is implemented only for ArrayData. What is the recommendation for Array here?

@tustvold
Contributor

tustvold commented Feb 1, 2024

You can call Array::data to convert to the ArrayData representation

@Yifeng-Sigma
Author

> You can call Array::data to convert to the ArrayData representation

But it seems this does a clone() under the hood, which is heavy if I only want to get the size. Would it be possible to implement get_slice_memory_size() for Array as well?

@tustvold
Contributor

tustvold commented Feb 1, 2024

Clone is cheap on buffers, as the underlying storage is reference counted.
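
As an illustration of why cloning reference-counted storage is cheap, the sketch below uses `std::sync::Arc` as a stand-in for the buffer's shared storage. It is a standard-library analogy, not arrow-rs code, and the helper name `share` is hypothetical.

```rust
use std::sync::Arc;

/// Returns a second handle to the same allocation. Only the reference
/// count is incremented; the underlying bytes are not copied.
pub fn share(buf: &Arc<Vec<u8>>) -> Arc<Vec<u8>> {
    Arc::clone(buf)
}
```

After `let copy = share(&original);`, `Arc::ptr_eq(&original, &copy)` is true and `Arc::strong_count(&original)` is 2, regardless of how large the shared vector is.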
