Box ScalarValue::List, reduce size by half #788

Merged: alamb merged 2 commits into apache:master from alamb/reduce_size_of_scalar on Jul 28, 2021

@alamb alamb commented Jul 27, 2021

Which issue does this PR close?

Re #786

Changes:

  1. Reduce the size of ScalarValue from 64 bytes to 32 bytes by boxing the internal parts of ScalarValue::List

Rationale for this change

  1. A smaller ScalarValue makes switching to it in hash aggregations and hash joins less expensive memory-wise, since one is instantiated for each distinct grouping value
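
To put the rationale in numbers, here is a rough sketch of the inline memory held by one value per distinct group. The group count of one million is a hypothetical workload, `inline_bytes` is an illustrative helper and not part of DataFusion, and heap allocations owned by the values (strings, nested lists) are deliberately not counted.

```rust
/// Estimate the inline bytes needed to hold one value per distinct group.
/// Heap data owned by each value is not included in this estimate.
fn inline_bytes(distinct_groups: usize, bytes_per_value: usize) -> usize {
    distinct_groups * bytes_per_value
}

fn main() {
    // Hypothetical workload: one million distinct grouping values.
    let groups = 1_000_000;
    let before = inline_bytes(groups, 64); // size_of::<ScalarValue>() before this PR
    let after = inline_bytes(groups, 32); // size_of::<ScalarValue>() after this PR
    println!(
        "before: {} bytes, after: {} bytes, saved: {} bytes",
        before,
        after,
        before - after
    );
}
```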

What changes are included in this PR?

  1. Reduce the size of ScalarValue from 64 bytes to 32 bytes by boxing the internal parts of ScalarValue::List

Are there any user-facing changes?

No

@alamb alamb added the api change label Jul 27, 2021
@github-actions github-actions bot added the ballista and datafusion labels Jul 27, 2021
// Since ScalarValues are used in a non-trivial number of places,
// making it larger means significantly more memory consumption
// per distinct value.
assert_eq!(std::mem::size_of::<ScalarValue>(), 32);
Contributor Author

here is the test showing the size decrease

-    List(Option<Vec<ScalarValue>>, DataType),
+    /// list of nested ScalarValue (boxed to reduce size_of(ScalarValue))
+    #[allow(clippy::box_vec)]
+    List(Option<Box<Vec<ScalarValue>>>, Box<DataType>),
Contributor Author

Here is the change -- the rest of the PR is just follow on work from this
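
For readers who want to see the effect in isolation, the following is a minimal sketch using simplified stand-in types. `TypeTag`, `Unboxed`, and `Boxed` are hypothetical and much smaller than the real `DataType` and `ScalarValue`, so the exact byte counts will differ from the 64 and 32 quoted in this PR. The point is only that an enum is at least as large as its largest variant, so moving a large payload behind a `Box` shrinks every value of the enum.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a type descriptor such as DataType.
#[allow(dead_code)]
enum TypeTag {
    Int64,
    Utf8,
    Timestamp(Option<String>),
}

/// The list variant stores its Vec and type descriptor inline.
#[allow(dead_code)]
enum Unboxed {
    Int64(Option<i64>),
    Utf8(Option<String>),
    List(Option<Vec<Unboxed>>, TypeTag),
}

/// The list variant stores only two thin pointers inline.
#[allow(dead_code)]
enum Boxed {
    Int64(Option<i64>),
    Utf8(Option<String>),
    List(Option<Box<Vec<Boxed>>>, Box<TypeTag>),
}

fn main() {
    // Exact sizes depend on the target and compiler version, so they are
    // printed rather than hard-coded.
    println!("unboxed: {} bytes", size_of::<Unboxed>());
    println!("boxed:   {} bytes", size_of::<Boxed>());

    // Option<Box<T>> uses the null pointer as None, so it stays pointer-sized.
    assert_eq!(size_of::<Option<Box<Vec<Boxed>>>>(), size_of::<usize>());
    assert!(size_of::<Boxed>() < size_of::<Unboxed>());
}
```

The cost is an extra heap allocation and pointer hop whenever a list value is created or read, which is a reasonable trade when list scalars are rare compared with primitive ones.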

Member

This is an interesting optimization 👍

Contributor

Box<[ScalarValue]> via Vec::into_boxed_slice would also be an option and would remove one pointer indirection, with the downside that the data would need to be copied if the Vec has excess capacity. @alamb do you think this would be worth exploring? I could prepare a PR since I have already started looking into the usages.

Contributor Author

@jhorstmann I think using Vec::into_boxed_slice would be just fine. I don't think ScalarValues are often (ever?) updated after creation so using a boxed slice seems like a good idea
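
A small sketch of the boxed-slice alternative discussed above, using `i64` elements instead of `ScalarValue` to keep it self-contained. `freeze` is a hypothetical helper name. It shows the two sides of the trade-off: `Box<[T]>` points directly at the elements (one hop fewer than `Box<Vec<T>>`), but it is a fat pointer of two words rather than one, and the conversion drops any excess capacity, which may reallocate.

```rust
use std::mem::size_of;

/// Convert a Vec into a boxed slice, dropping any excess capacity.
fn freeze(values: Vec<i64>) -> Box<[i64]> {
    values.into_boxed_slice()
}

fn main() {
    // Build a Vec with more capacity than it needs.
    let mut values: Vec<i64> = Vec::with_capacity(16);
    values.extend_from_slice(&[1, 2, 3]);
    assert!(values.capacity() >= 16);

    // After conversion only the three elements remain allocated.
    let frozen = freeze(values);
    assert_eq!(frozen.len(), 3);
    assert_eq!(frozen.to_vec(), vec![1, 2, 3]);

    // Box<Vec<T>> is a thin pointer to a heap-allocated Vec header,
    // which in turn points at the elements.
    assert_eq!(size_of::<Box<Vec<i64>>>(), size_of::<usize>());
    // Box<[T]> is a fat pointer (address + length) straight to the elements.
    assert_eq!(size_of::<Box<[i64]>>(), 2 * size_of::<usize>());
}
```

Whether the enum as a whole stays at 32 bytes with the wider pointer is not shown here and would need to be checked against the real `ScalarValue` definition.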

Contributor

@Dandandan Dandandan left a comment

Looks good to me.

alamb commented Jul 28, 2021

I think keeping ScalarValue smaller will help in various places even if we choose to go with something other than the implementation in #786

@alamb alamb merged commit 4929590 into apache:master Jul 28, 2021
@alamb alamb deleted the alamb/reduce_size_of_scalar branch July 28, 2021 18:44