get_buffer_memory_size() returns an incorrect larger size #5352

Yifeng-Sigma · 2024-01-31T02:11:27Z

Which part is this question about

Question about get_buffer_memory_size() method

Describe your question

We have an Arrow chunk that consists of two columns, both of generic string type.
If we use the value() method to pull out each string value and add up the lengths, the total size is 108684 bytes.
value_data().len() also gives us 108684 bytes.
value_offsets().len() gives us 5430 bytes.
But get_buffer_memory_size() gives us 521664 bytes.
The NullBuffer is 0.

I noticed that the implementation of get_buffer_memory_size() basically adds up value_offsets, value_data and the null buffer. So why is there such a significant difference? 108684 is what we expect here.


Jefffrey · 2024-01-31T10:17:14Z

I think this PR might help clear up confusion, as it enhances the docstring of get_buffer_memory_size(): #5347

    /// Note that this does not always correspond to the exact memory usage of an array,
    /// since multiple arrays can share the same buffers or slices thereof.

Note that value_data().len() gives the len, i.e. how many bytes are used in the buffer, whilst get_buffer_memory_size() calculates using the capacity of the buffer (how much memory is allocated to it), which might be larger than the len:

arrow-rs/arrow-array/src/array/byte_array.rs, lines 464 to 471 in 31cf5ce:

    fn get_buffer_memory_size(&self) -> usize {
        let mut sum = self.value_offsets.inner().inner().capacity();
        sum += self.value_data.capacity();
        if let Some(x) = &self.nulls {
            sum += x.buffer().capacity()
        }
        sum
    }

This should help explain the discrepancy.
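To make the len versus capacity distinction concrete, here is a minimal sketch that uses a plain Vec<u8> as a stand-in for an Arrow buffer. It is not arrow-rs code: the function name len_and_capacity and its parameters are made up for illustration, and the only assumption is that the buffer was allocated with more room than the data needs.

```rust
// Illustration only: a Vec<u8> stands in for an Arrow buffer.
// Like a buffer, it tracks bytes in use (len) separately from
// bytes allocated (capacity).

/// Returns (len, capacity) for a byte buffer that reserved at least
/// `reserved` bytes up front and was then filled with `payload`.
fn len_and_capacity(payload: &[u8], reserved: usize) -> (usize, usize) {
    let mut buf: Vec<u8> = Vec::with_capacity(reserved);
    buf.extend_from_slice(payload);
    // len counts the payload; capacity counts the whole allocation.
    (buf.len(), buf.capacity())
}
```

Calling len_and_capacity(b"hello", 1024) reports a len of 5 and a capacity of at least 1024, which is the same kind of gap as between value_data().len() and get_buffer_memory_size().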

tustvold · 2024-01-31T10:22:32Z

To further expand on the above, the IPC reader avoids copying by slicing buffers, and this is a common source of arrays that share buffers in this way.
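A rough sketch of why sliced, shared buffers inflate a capacity-based measure is below. SharedSlice and split are hypothetical names, not arrow-rs types; the sketch only assumes that each slice keeps a reference to the whole allocation and reports that allocation's capacity.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a sliced buffer: one shared allocation
/// plus the byte range that this particular slice looks at.
struct SharedSlice {
    backing: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SharedSlice {
    /// Bytes this slice actually refers to.
    fn len(&self) -> usize {
        self.len
    }

    /// The bytes visible through this slice, without copying.
    fn as_bytes(&self) -> &[u8] {
        &self.backing[self.offset..self.offset + self.len]
    }

    /// Size of the whole shared allocation, which is what a
    /// capacity-based measure would report for every slice.
    fn backing_capacity(&self) -> usize {
        self.backing.capacity()
    }
}

/// Splits one allocation into `parts` equal zero-copy slices.
/// Any remainder bytes are ignored; `parts` must be non-zero.
fn split(backing: Vec<u8>, parts: usize) -> Vec<SharedSlice> {
    let backing = Arc::new(backing);
    let chunk = backing.len() / parts;
    (0..parts)
        .map(|i| SharedSlice {
            backing: Arc::clone(&backing),
            offset: i * chunk,
            len: chunk,
        })
        .collect()
}
```

Splitting a 1000-byte allocation into 4 slices gives slices whose len() values sum to 1000, while summing backing_capacity() over the same slices counts the allocation four times.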

Yifeng-Sigma · 2024-01-31T18:23:47Z

Thank you, this is super helpful!

If I'm only interested in the cumulative size of each element, without considering the underlying implementation (whether it's copying or slicing), should I use value_data().len()?

The context is that we want to serialize the Arrow data to JSON, and before serialization we want a rough estimate of the size of the serialized data based on the size of the Arrow data.

In this case, we are interested in the size we get by using the value() method to pull out each string value and adding up the lengths, which equals value_data().len().

tustvold · 2024-01-31T21:16:58Z

There is https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size but if you're only interested in strings you could do something like

    let offsets = array.offsets();
    let total_length = offsets.last().unwrap().as_usize() - offsets.first().unwrap().as_usize();

I should highlight that this will ignore the overheads from encoding field names in JSON though, and will be a pretty crude approximation.
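The offset arithmetic can be sketched without arrow-rs by treating the offsets as a plain slice of i32. The function name total_value_bytes is hypothetical, and the sketch assumes the offsets are non-decreasing, as they are for a valid string array. It also shows why subtracting the first offset matters: for a sliced array the first offset need not be zero.

```rust
/// Total number of value bytes covered by `offsets`, computed as
/// last offset minus first offset. Returns 0 for an empty slice.
/// Assumes the offsets are non-decreasing.
fn total_value_bytes(offsets: &[i32]) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(first), Some(last)) => (last - first) as usize,
        _ => 0,
    }
}
```

For the offsets [0, 5, 5, 11] (strings of 5, 0 and 6 bytes) this returns 11. For the same array with its first element sliced away, the offsets are [5, 5, 11] and the result is 6.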

Yifeng-Sigma · 2024-02-01T01:36:24Z

We are interested in types other than string as well, so get_slice_memory_size() is exactly what I'm looking for!
It seems that only get_buffer_memory_size() and get_array_memory_size() are implemented for arrays, while get_slice_memory_size() is implemented for ArrayData only. What's the recommendation for arrays here?

tustvold · 2024-02-01T07:51:21Z

You can call Array::to_data to convert to the ArrayData representation

Yifeng-Sigma · 2024-02-01T08:09:33Z

But it seems this does a clone() under the hood, which is heavy if I only want to get the size? Would it be possible to implement get_slice_memory_size() for arrays as well?

tustvold · 2024-02-01T08:35:36Z

Clone is cheap on buffers, as the underlying storage is reference counted.
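As an illustration of why cloning reference-counted storage is cheap, here is a small sketch using std::sync::Arc as a stand-in. It does not use arrow-rs, and clone_shares_storage is a made-up name; the point is only that the clone bumps a counter and points at the same allocation instead of copying the bytes.

```rust
use std::sync::Arc;

/// Clones a reference-counted byte buffer and reports whether the
/// clone points at the same allocation, plus the resulting strong
/// reference count while both handles are alive.
fn clone_shares_storage(data: Vec<u8>) -> (bool, usize) {
    let original: Arc<[u8]> = Arc::from(data);
    // Arc::clone copies the pointer and increments the count;
    // the bytes themselves are not duplicated.
    let copy = Arc::clone(&original);
    (Arc::ptr_eq(&original, &copy), Arc::strong_count(&original))
}
```

For any input, for example a 1 MiB vector, this returns (true, 2): one allocation, two handles.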

Yifeng-Sigma added the question label Jan 31, 2024

Yifeng-Sigma closed this as completed Feb 2, 2024
