ARROW-10403: [C++] Implement unique kernel for non-uniform chunked dictionary arrays #9683
@nealrichardson what do you think about this approach? It introduces overhead because it transposes dictionary indices, but it gives us |
I'm not familiar with this C++ code so I'll let others comment (cc @pitrou @bkietz @michalursa). It looks like the issue is only with ChunkedArrays where the chunks have different dictionaries? My instinct is that, rather than unifying first and then determining unique values/counting/hashing, what if we could do the aggregation on each chunk first and then unify the results? That would be a smaller amount of data to manipulate. |
Indeed unifying over all chunks first and then transposing individual chunk indices would be a better idea! I'm still a bit unfamiliar with kernel mechanics but I'm thinking implementing a new kernel for chunked DictionaryArrays with different dictionaries will be the best way to go for this. |
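To make the "unify, then transpose" idea concrete, here is a minimal sketch in plain Python. It is not the Arrow C++ implementation and uses no Arrow API: each chunk is modelled as a hypothetical `(dictionary, indices)` pair of lists, nulls are assumed absent, and the function names are made up for illustration.

```python
def unify_dictionaries(chunks):
    """Build one dictionary covering all chunks and rewrite each chunk's indices.

    chunks: list of (dictionary, indices) pairs, where indices point into
    that chunk's own dictionary. Assumes no nulls.
    """
    unified = []            # unified dictionary values, in first-seen order
    position = {}           # value -> index in the unified dictionary
    transposed_chunks = []
    for dictionary, indices in chunks:
        # transpose_map[i] is the unified index of this chunk's dictionary[i]
        transpose_map = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            transpose_map.append(position[value])
        transposed_chunks.append([transpose_map[i] for i in indices])
    return unified, transposed_chunks


def unique_after_unification(chunks):
    """Unique values of the chunked data, computed on the transposed indices."""
    unified, transposed_chunks = unify_dictionaries(chunks)
    seen = set()
    ordered = []
    for indices in transposed_chunks:
        for i in indices:
            if i not in seen:
                seen.add(i)
                ordered.append(i)
    return [unified[i] for i in ordered]
```

The overhead mentioned earlier in the thread shows up here as the extra pass that rewrites every index of every chunk before the unique step runs.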
There are indeed two possible approaches:
The second approach could be faster in the (unusual?) cases where only a small subset of dictionary values actually appear in the data. If most dictionary values are used, both cases should have similar performance, though. Since we don't have a generic hash-aggregate yet, the first approach sounds good enough. cc @bkietz for opinions |
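The alternative raised in this thread, aggregating each chunk first and unifying only the results, can be sketched as follows. This is an illustrative plain-Python model under stated assumptions, not Arrow code: chunks are hypothetical `(dictionary, indices)` list pairs, nulls are assumed absent, and `unique_per_chunk_then_merge` is a made-up name.

```python
def unique_per_chunk_then_merge(chunks):
    """Find the unique indices inside each chunk, then merge the values.

    chunks: list of (dictionary, indices) pairs, where indices point into
    that chunk's own dictionary. Assumes no nulls.
    """
    merged = []      # unique values across all chunks, in first-seen order
    seen = set()
    for dictionary, indices in chunks:
        # Step 1: unique indices within this chunk (integer-only work).
        used = []
        used_set = set()
        for i in indices:
            if i not in used_set:
                used_set.add(i)
                used.append(i)
        # Step 2: only dictionary values that actually occur are looked up
        # and merged; unused dictionary entries are never touched.
        for i in used:
            value = dictionary[i]
            if value not in seen:
                seen.add(value)
                merged.append(value)
    return merged
```

In this model, dictionary entries that never appear in the data are skipped entirely, which is the case where this approach is expected to pay off. When most entries are used, the amount of value-level work is about the same either way.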
Shall I then fix CI issues and we proceed with the first approach? |
You can fix CI issues IMHO. |
This is ready for review - the Java issue appears to be a flaky upload. |
ping :) |
+1. Sorry for the delay, will merge if CI is green.
Thanks @pitrou! |
Yes, please do! |
…ctionary arrays See [ARROW-10403](https://issues.apache.org/jira/browse/ARROW-10403) and [ARROW-9132](https://issues.apache.org/jira/browse/ARROW-9132). Closes apache#9683 from rok/ARROW-10403 Lead-authored-by: Rok <rok@mihevc.org> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
See ARROW-10403 and ARROW-9132.