Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support sorting dictionary encoded primitive integer arrays #2680

Merged
merged 6 commits into from Sep 10, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Sep 8, 2022

Which issue does this PR close?

Closes #2679.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 8, 2022
DataType::Dictionary(_, _) => {
downcast_dictionary_array!(
values => match values.values().data_type() {
DataType::Int8 => sort_primitive_dictionary::<_, Int8Type, _>(values, v, n, &options, limit, cmp),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other primitive types, we just need to add other patterns and call sort_primitive_dictionary with different parameters.

I will add them in separate PRs.

)));
}
}
DataType::Dictionary(_, _) => {
Copy link
Contributor

@tustvold tustvold Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NGL I worry a bit that this has the combinatorial fanout that absolutely tanks compile times... Perhaps we could compute the sort order of the dictionary values and then use this to compare the keys? This might even be faster

Copy link
Member Author

@viirya viirya Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I agree that sorting the dictionary values and using it to compare the keys might be faster. But does it help on combinatorial fanout from dictionary? For example, in order to sort on the dictionary values, we still need get the value array from dictionary array so a downcast_dictionary_array! is also necessary.

Copy link
Contributor

@tustvold tustvold Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, you could call sort_to_indices on the values array, which is only typed on the values type, and then to compare dictionary values you compare the indices you just computed, which is only typed on the dictionary key type. You therefore avoid the fanout?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So downcast_dictionary_array is not the one you concern but downcast_dictionary_array + sort_primitive_dictionary which is typed on both key and value types?

Then it makes sense to me. I can refactor this and split sorting dictionary to sorting dictionary values and sorting keys based on the computed indices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is the combinatorial fanout that is especially painful, and given it is avoidable I think it makes sense to do so. It's a case of every little helps 😅

values => match values.values().data_type() {
DataType::Int8 => {
let dict_values = values.values();
let sorted_value_indices = sort_to_indices(dict_values, Some(SortOptions::default()), None)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let sorted_value_indices = sort_to_indices(dict_values, Some(SortOptions::default()), None)?;
let sorted_value_indices = sort_to_indices(dict_values, options, None)?;

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have fixed sort options (default) when sorting the values of dictionary. As we will compare keys based on the sorted indices. The sorted indices represent the order of values which is fixed.

For example, a dictionary (keys = [1, 2, 0], values = [0, 1, 2]). No matter we sort in ascending order or in descending order, when we compare key = 1 and key = 2, their relation is always (key = 1) < (key = 2) according to the values behind the keys.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current test can catch an error if we use the same option to sort the values of the dictionary.

Copy link
Contributor

@tustvold tustvold Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah, you raise a good point. I think you may still need to propagate the null ordering though?

Edit: I think you need to pass the full options here, and then drop the descending for the second sort. But I could be wrong, multi-level nullability is just 🤯

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, we need to keep the given null ordering when sorting the values of the dictionary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagating null ordering now and added one test.

let sorted_value_indices = sort_to_indices(dict_values, Some(SortOptions::default()), None)?;
sort_primitive_dictionary::<_, _>(values, &sorted_value_indices, v, n, &options, limit, cmp)
},
DataType::Utf8 => sort_string_dictionary::<_>(values, v, n, &options, limit),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sort_string_dictionary also can be optimized by this trick to sort on values first. I will leave it to a follow-up PR.

sorted_value_indices: &UInt32Array,
value_indices: Vec<u32>,
null_indices: Vec<u32>,
options: &SortOptions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW &SortOptions is actually bigger than SortOptions

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor nits

@viirya
Copy link
Member Author

viirya commented Sep 10, 2022

The C++ building error is known issue. I will merge this once other CI tasks pass.

@viirya viirya merged commit 04bd395 into apache:master Sep 10, 2022
@viirya
Copy link
Member Author

viirya commented Sep 10, 2022

Thanks.

@ursabot
Copy link

ursabot commented Sep 10, 2022

Benchmark runs are scheduled for baseline = 5ccf73e and contender = 04bd395. 04bd395 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support sorting dictionary encoded primitive integer arrays
3 participants