New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support sorting dictionary encoded primitive integer arrays #2680
Conversation
arrow/src/compute/kernels/sort.rs
Outdated
DataType::Dictionary(_, _) => { | ||
downcast_dictionary_array!( | ||
values => match values.values().data_type() { | ||
DataType::Int8 => sort_primitive_dictionary::<_, Int8Type, _>(values, v, n, &options, limit, cmp), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other primitive types, we just need to add other patterns and call sort_primitive_dictionary
with different parameters.
I will add them in separate PRs.
))); | ||
} | ||
} | ||
DataType::Dictionary(_, _) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NGL I worry a bit that this has the combinatorial fanout that absolutely tanks compile times... Perhaps we could compute the sort order of the dictionary values and then use this to compare the keys? This might even be faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I agree that sorting the dictionary values and using it to compare the keys might be faster. But does it help on combinatorial fanout from dictionary? For example, in order to sort on the dictionary values, we still need get the value array from dictionary array so a downcast_dictionary_array!
is also necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, you could call sort_to_indices on the values array, which is only typed on the values type, and then to compare dictionary values you compare the indices you just computed, which is only typed on the dictionary key type. You therefore avoid the fanout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So downcast_dictionary_array
is not the one you concern but downcast_dictionary_array
+ sort_primitive_dictionary
which is typed on both key and value types?
Then it makes sense to me. I can refactor this and split sorting dictionary to sorting dictionary values and sorting keys based on the computed indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is the combinatorial fanout that is especially painful, and given it is avoidable I think it makes sense to do so. It's a case of every little helps 😅
35cb87d
to
3697abc
Compare
arrow/src/compute/kernels/sort.rs
Outdated
values => match values.values().data_type() { | ||
DataType::Int8 => { | ||
let dict_values = values.values(); | ||
let sorted_value_indices = sort_to_indices(dict_values, Some(SortOptions::default()), None)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let sorted_value_indices = sort_to_indices(dict_values, Some(SortOptions::default()), None)?; | |
let sorted_value_indices = sort_to_indices(dict_values, options, None)?; |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to have fixed sort options (default) when sorting the values of dictionary. As we will compare keys based on the sorted indices. The sorted indices represent the order of values which is fixed.
For example, a dictionary (keys = [1, 2, 0], values = [0, 1, 2]). No matter we sort in ascending order or in descending order, when we compare key = 1 and key = 2, their relation is always (key = 1) < (key = 2) according to the values behind the keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current test can catch an error if we use the same option to sort the values of the dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah, you raise a good point. I think you may still need to propagate the null ordering though?
Edit: I think you need to pass the full options here, and then drop the descending for the second sort. But I could be wrong, multi-level nullability is just 🤯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, we need to keep the given null ordering when sorting the values of the dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Propagating null ordering now and added one test.
let sorted_value_indices = sort_to_indices(dict_values, Some(SortOptions::default()), None)?; | ||
sort_primitive_dictionary::<_, _>(values, &sorted_value_indices, v, n, &options, limit, cmp) | ||
}, | ||
DataType::Utf8 => sort_string_dictionary::<_>(values, v, n, &options, limit), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sort_string_dictionary
also can be optimized by this trick to sort on values first. I will leave it to a follow-up PR.
arrow/src/compute/kernels/sort.rs
Outdated
sorted_value_indices: &UInt32Array, | ||
value_indices: Vec<u32>, | ||
null_indices: Vec<u32>, | ||
options: &SortOptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW &SortOptions
is actually bigger than SortOptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor nits
The C++ building error is known issue. I will merge this once other CI tasks pass. |
Thanks. |
Benchmark runs are scheduled for baseline = 5ccf73e and contender = 04bd395. 04bd395 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2679.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?