Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct array memory usage calculation for dictionary arrays #505

Merged
merged 7 commits into from Jun 29, 2021

Conversation

jhorstmann
Copy link
Contributor

@jhorstmann jhorstmann commented Jun 28, 2021

Which issue does this PR close?

Closes #503.

Rationale for this change

Some Array implementations have fields for easier access to nested arrays. These point to the same data already contained in ArrayData and so should not be considered to increase the memory usage. Since all the buffers are contained in ArrayData we can instead delegate the whole calculation and remove the specialized implementations.

What changes are included in this PR?

Fixes overreporting of memory usage in DictionaryArray, FixedSizeListArray and UnionArray and simplified the code.

Are there any user-facing changes?

get_array_memory_size returns slightly different but more correct results

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 28, 2021
@alamb
Copy link
Contributor

alamb commented Jun 28, 2021

There appears to be some failures in the test suite

https://github.com/apache/arrow-rs/pull/505/checks?check_run_id=2934625537

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks like a great change @jhorstmann -- and the tests are 👨‍🍳 ❤️
Fixing bugs by deleting code is the mark of a great engineer in my opinion.

While reviewing the code, the only thing that stands out to me is the calculation of array memory size may still be inaccurate (we can fix as a follow on issue / PR too)

https://github.com/jhorstmann/arrow-rs/blob/dictionary-array-memory-size/arrow/src/array/data.rs#L360

        // Calculate size of the fields that don't have [get_array_memory_size] method internally.
        size += mem::size_of_val(self)
            - mem::size_of_val(&self.buffers) // <--- this is a Vec so it doesn't have an `array_memory_size` so it probably should not be included
            - mem::size_of_val(&self.null_bitmap)
            - mem::size_of_val(&self.child_data);

arrow/src/array/array.rs Outdated Show resolved Hide resolved
arrow/src/array/array.rs Show resolved Hide resolved

// substract empty array to avoid magic numbers for the size of additional fields
assert_eq!(
arr.get_array_memory_size() - empty.get_array_memory_size(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a cool calculation 👍

@jhorstmann
Copy link
Contributor Author

There appears to be some failures in the test suite

That is what I get when not running the full test suite after a seemingly small change :)

While looking into it I also noticed the problem you mentioned with fields that don't have [get_array_memory_size] themselves, the overhead of these fields (ptr, len, capacity) does not seem to be accounted for.

jhorstmann and others added 4 commits June 28, 2021 23:51
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
let mut size = 0;
// Calculate size of the fields that don't have [get_array_memory_size] method internally.
size += mem::size_of_val(self)
- mem::size_of_val(&self.buffers)
Copy link
Contributor

@alamb alamb Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since child_data and null_bitmap include the size of self in the results of bitmap.get_array_memory_size() and child.get_array_memory_size() I think we still need to subtract them off.

Perhaps a pattern such as edited

        if let Some(bitmap) = &self.null_bitmap {
            size += bitmap.get_array_memory_size()
            size -= mem::size_of_val(&bitmap);
        }

would make the intent clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think this applies only to the null_bitmap since it is embedded directly without a heap allocation in between, but have to think about this a bit more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fixed now, I added one more assertion for this case:

// expected size is the size of the PrimitiveArray struct,
// which includes the optional validity buffer
// plus one buffer on the heap
assert_eq!(
    std::mem::size_of::<PrimitiveArray<Int64Type>>() + std::mem::size_of::<Buffer>(),
    empty_with_bitmap.get_array_memory_size()
);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

embedded directly without a heap allocation in between.

I double checked too and I agree it now looks good. This whole notion is confusing -- I think the tests you have added will really help in the future too

@codecov-commenter
Copy link

codecov-commenter commented Jun 28, 2021

Codecov Report

Merging #505 (848dd40) into master (de62168) will increase coverage by 0.09%.
The diff coverage is 100.00%.

❗ Current head 848dd40 differs from pull request most recent head 57bbe92. Consider uploading reports for the commit 57bbe92 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #505      +/-   ##
==========================================
+ Coverage   82.64%   82.74%   +0.09%     
==========================================
  Files         165      165              
  Lines       45703    45686      -17     
==========================================
+ Hits        37773    37804      +31     
+ Misses       7930     7882      -48     
Impacted Files Coverage Δ
arrow/src/array/array_binary.rs 92.23% <ø> (+2.10%) ⬆️
arrow/src/array/array_boolean.rs 94.01% <ø> (+3.10%) ⬆️
arrow/src/array/array_dictionary.rs 88.38% <ø> (+3.81%) ⬆️
arrow/src/array/array_list.rs 94.88% <ø> (+2.06%) ⬆️
arrow/src/array/array_primitive.rs 94.60% <ø> (-0.10%) ⬇️
arrow/src/array/array_string.rs 97.76% <ø> (+1.71%) ⬆️
arrow/src/array/array_struct.rs 89.24% <ø> (+1.39%) ⬆️
arrow/src/array/array_union.rs 89.26% <ø> (+2.33%) ⬆️
arrow/src/array/null.rs 83.78% <ø> (-2.89%) ⬇️
arrow/src/array/array.rs 80.90% <100.00%> (+4.04%) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de62168...57bbe92. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thank you @jhorstmann

@jorgecarleitao / @nevi-me / @Dandandan any thoughts?

let mut size = 0;
// Calculate size of the fields that don't have [get_array_memory_size] method internally.
size += mem::size_of_val(self)
- mem::size_of_val(&self.buffers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

embedded directly without a heap allocation in between.

I double checked too and I agree it now looks good. This whole notion is confusing -- I think the tests you have added will really help in the future too

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Great simplification.

@alamb alamb merged commit f1a831f into apache:master Jun 29, 2021
alamb added a commit that referenced this pull request Jun 30, 2021
* Correct array memory usage calculation for dictionary arrays

* update comments

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* update comments

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Adjust memory usage calculation and move related tests to the same file

* Comment about bitmap size

* Clippy fixes

* Adjust calculation for validity bitmap

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb alamb added the cherry-picked PR that was backported to active release (will be included in maintenance release) label Jun 30, 2021
alamb added a commit that referenced this pull request Jul 1, 2021
…515)

* Correct array memory usage calculation for dictionary arrays

* update comments

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* update comments

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Adjust memory usage calculation and move related tests to the same file

* Comment about bitmap size

* Clippy fixes

* Adjust calculation for validity bitmap

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
@alamb alamb added enhancement Any new improvement worthy of a entry in the changelog bug and removed enhancement Any new improvement worthy of a entry in the changelog labels Jul 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug cherry-picked PR that was backported to active release (will be included in maintenance release)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect memory usage calculation for dictionary arrays
4 participants