Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Rust] [Experiment] [WIP]: Use SmallVec in ArrayData to optimize the common usecase of single-buffer arrays #9330

Closed
wants to merge 4 commits into from

Conversation

jhorstmann
Copy link
Contributor

Another experiement for speeding up ArrayData. Most array types only contain a single buffer (except for the validity buffer) and also at most one child data object. I think the only types this does not apply to are struct and union arrays. Using a SmallVec trades one comparison against an allocation and the resulting indirection in those cases.

The ArrayData::new method is left in place unchanged for compatibility.

TODO:

  • Use the optimized new_smallvec method in kernels whenever applicable
  • Run all benchmarks on a non-throttling machine

@jorgecarleitao @Dandandan this is also related to the discussion on #9271. I tried this initially some month ago and the benchmarks were not conclusive, but might be worth trying again or in combination with removing the Arc indirection.

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@@ -225,11 +226,11 @@ pub struct ArrayData {
/// The buffers for this array data. Note that depending on the array types, this
/// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
/// at different positions.
buffers: Vec<Buffer>,
buffers: SmallVec<[Buffer; 1]>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if using something like a specific enum might also work (and be more memory efficient) as well:

pub enum  ArrayBuffer {
  One(ArrayDataRef),
  Two(ArrayDataRef, ArrayDataRef),
  Many(Vec<ArrayDataRef>)
}

I honestly don't know what the maximum number of buffers any specific ArrayData can have,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi, max is 3 including the null buffer, so max is 2 in this context. See MutableDataArray, where we use two buffers on the stack to avoid the overhead. When unused, a buffer takes something like a usize + a pointer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I haven't checked the memory layout of both, but in the common case of 1 buffer this could inline better (in the absence of LTO). Since we already have the validity bitmap in a separate Option, the variants are Zero, One and Many.

@alamb
Copy link
Contributor

alamb commented Mar 3, 2021

@jhorstmann I am closing this PR for the time being to clean up the Rust/Arrow PR backlog. Please let me know if this is a mistake

@alamb alamb closed this Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants