-
Notifications
You must be signed in to change notification settings - Fork 786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DictionaryArray::keys_iter
, and take_iter
for other array types
#1296
Conversation
arrow/src/array/iterator.rs
Outdated
pub struct DictionaryIter<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> { | ||
array: &'a DictionaryArray<K>, | ||
values: &'a PrimitiveArray<T>, | ||
current: usize, | ||
current_end: usize, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may add iterators for others, e.g. Utf8, binary..., if this direction is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 as you note, the way this is implemented, it will only work for DictionaryArrays whose values are PrimitiveArray<>
such as Int64Array
If we take this approach, we'll probably end up with different iterator types for different value types like
pub struct DictionaryIterPrimitive<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> {
...
}
impl<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> std::iter::Iterator
for DictionaryIter<'a, K, T>
{
type Item = Option<T::Native>;
...
}
pub struct DictionaryIterBool<'a, K: ArrowPrimitiveType> {
...
}
impl <'a, K: ArrowPrimitiveType> std::iter::Iterator for DictionaryIterBool<'a, K>
{
type Item = Option<bool>;
...
}
...
And then functions like the following
impl DictionaryArray {
pub fn iter_primitive<T>::(&self) -> DictionaryIterPrimitive
...
}
pub fn iter_bool<T>::(&self) -> DictionaryIterBool
...
}
}
Which would effectively require the user to know what type of values their dictionary held
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking some more we could potentially avoid parameterizing on the key type by always upcasting the indexes to usize
(which has to happen anyways) before creating the iterator
cc @alamb |
Codecov Report
@@ Coverage Diff @@
## master #1296 +/- ##
=======================================
Coverage 83.05% 83.06%
=======================================
Files 180 180
Lines 52824 52870 +46
=======================================
+ Hits 43874 43916 +42
- Misses 8950 8954 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @viirya -- I think this could get messy quickly due to the number of types, but I don't have any other great ideas to be honest
maybe @tustvold or @jorgecarleitao have ideas 🤔
arrow/src/array/iterator.rs
Outdated
pub struct DictionaryIter<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> { | ||
array: &'a DictionaryArray<K>, | ||
values: &'a PrimitiveArray<T>, | ||
current: usize, | ||
current_end: usize, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 as you note, the way this is implemented, it will only work for DictionaryArrays whose values are PrimitiveArray<>
such as Int64Array
If we take this approach, we'll probably end up with different iterator types for different value types like
pub struct DictionaryIterPrimitive<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> {
...
}
impl<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> std::iter::Iterator
for DictionaryIter<'a, K, T>
{
type Item = Option<T::Native>;
...
}
pub struct DictionaryIterBool<'a, K: ArrowPrimitiveType> {
...
}
impl <'a, K: ArrowPrimitiveType> std::iter::Iterator for DictionaryIterBool<'a, K>
{
type Item = Option<bool>;
...
}
...
And then functions like the following
impl DictionaryArray {
pub fn iter_primitive<T>::(&self) -> DictionaryIterPrimitive
...
}
pub fn iter_bool<T>::(&self) -> DictionaryIterBool
...
}
}
Which would effectively require the user to know what type of values their dictionary held
🤔
arrow/src/array/iterator.rs
Outdated
pub struct DictionaryIter<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> { | ||
array: &'a DictionaryArray<K>, | ||
values: &'a PrimitiveArray<T>, | ||
current: usize, | ||
current_end: usize, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking some more we could potentially avoid parameterizing on the key type by always upcasting the indexes to usize
(which has to happen anyways) before creating the iterator
Thanks @alamb. Yea, that is also I have concern with it. But I don't have better idea for |
I thought a bit more about this. What about adding something slightly more general such as: impl DictionaryArray {
/// return an iterator over the keys (indexes into the dictionary)
fn keys_iter() -> impl Iterator<Item = usize>
} Then add an iterator to the various other types of array: impl PrimitiveArray<T> {
...
/// return an iterator that returns the values of `array.value(i)` for an iterator with each element `i`
fn take_iter(&self, indexes: impl Iterator<item = usize>) -> impl Iterator<Item = T::NativeType> {
..
}
} So then to iterate over a dictionary by value the user would do something like: let keys = dict_array.keys_iter();
let values = dict_array.values().downcast_ref::<UInt64Array>().unwrap().take_iter(keys);
// values is an interator over u64! |
@alamb Looks like a good idea. Thanks. Let me revise this. |
b4ecfbf
to
7f59d2e
Compare
@alamb Revised this based on your suggestion. |
arrow/src/array/array_primitive.rs
Outdated
indexes: impl Iterator<Item = Option<usize>> + 'a, | ||
) -> impl Iterator<Item = Option<T::Native>> + 'a { | ||
indexes.map(|opt_index| { | ||
opt_index.map(|index| unsafe { self.value_unchecked(index) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
Reading this, however, if someone passed in an interator that was out of bounds it will lead to undefined behavior; So I guess we would probably need to either validate every index or else mark the take_iter
method as unsafe
itself 🤔 which seems less than ideal.
We can only avoid the bounds check if we knew the iterator came from a dictionary that had been previously validated.
685273b
to
0bec1a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks good. Any thoughts @Dandandan or @liukun4515 or @tustvold ?
Thank you @alamb ! |
Thanks again @viirya |
Thank you @alamb ! |
DictionaryArray::keys_iter
, and take_iter
for other array types
Which issue does this PR close?
Closes #1295.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?