Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10403: [C++] Implement unique kernel for non-uniform chunked dictionary arrays #9683

Closed
wants to merge 2 commits into from

Conversation

rok
Copy link
Member

@rok rok commented Mar 12, 2021

@github-actions
Copy link

@rok
Copy link
Member Author

rok commented Mar 12, 2021

@nealrichardson what do you think about this approach? It introduces overhead to because it transposes dictionary indices but it gives us value_counts.

@nealrichardson
Copy link
Member

I'm not familiar with this C++ code so I'll let others comment (cc @pitrou @bkietz @michalursa). It looks like the issue is only with ChunkedArrays where the chunks have different dictionaries? My instinct is that, rather than unifying first and then determining unique values/counting/hashing, what if we could do the aggregation on each chunk first and then unify the results? That would be a smaller amount of data to manipulate.

@rok rok closed this Mar 12, 2021
@rok rok reopened this Mar 12, 2021
@rok
Copy link
Member Author

rok commented Mar 12, 2021

My instinct is that, rather than unifying first and then determining unique values/counting/hashing, what if we could do the aggregation on each chunk first and then unify the results? That would be a smaller amount of data to manipulate.

Indeed unifying over all chunks first and then transposing individual chunk indices would be a better idea!

I'm still a bit unfamiliar with kernel mechanics but I'm thinking implementing a new kernel for chunked DictionaryArrays with different dictionaries will be the best way to go for this.

@pitrou
Copy link
Member

pitrou commented Mar 16, 2021

There are indeed two possible approaches:

  • unify all chunks first, and then run the unique kernel over the transposed incides (as proposed by @rok)
  • run the unique kernel over the original chunks, and then hash-aggregate the unique results of the different chunks (in effect SELECT sum(counts) GROUP BY values)

The second approach could be faster in the (unusual?) cases where only a small subset of dictionary values actually appear in the data. If most dictionary values are used, both cases should have similar performance, though.

Since we don't have a generic hash-aggregate yet, the first approach sounds good enough.
(also note that unique is in itself a special case of hash-aggregation)

cc @bkietz for opinions

@rok
Copy link
Member Author

rok commented Mar 22, 2021

Since we don't have a generic hash-aggregate yet, the first approach sounds good enough.
(also note that unique is in itself a special case of hash-aggregation)

Shall I then fix CI issues and we proceed with the first approach?
Or do we rather put effort into generic hash-aggregate?

@pitrou
Copy link
Member

pitrou commented Mar 22, 2021

You can fix CI issus IMHO.

@rok rok marked this pull request as ready for review March 25, 2021 16:19
@rok
Copy link
Member Author

rok commented Mar 25, 2021

This is ready for review - the java issue appears to be a flaky upload.

@rok
Copy link
Member Author

rok commented Mar 29, 2021

ping :)

@rok rok force-pushed the ARROW-10403 branch 3 times, most recently from 293a6fc to 4d00295 Compare April 7, 2021 22:29
@pitrou pitrou changed the title ARROW-10403: [C++] Implement unique kernel for dictionary type ARROW-10403: [C++] Implement unique kernel for non-uniform chunked dictionary arrays Apr 8, 2021
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Sorry for the delay, will merge if CI is green.

@pitrou pitrou closed this in b24cff9 Apr 8, 2021
@rok
Copy link
Member Author

rok commented Apr 8, 2021

Thanks @pitrou!
Should we make a jira for the generic hash aggregate?

@pitrou
Copy link
Member

pitrou commented Apr 8, 2021

Yes, please do!

@rok
Copy link
Member Author

rok commented Apr 8, 2021

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…ctionary arrays

See [ARROW-10403](https://issues.apache.org/jira/browse/ARROW-10403) and [ARROW-9132](https://issues.apache.org/jira/browse/ARROW-9132).

Closes apache#9683 from rok/ARROW-10403

Lead-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
…ctionary arrays

See [ARROW-10403](https://issues.apache.org/jira/browse/ARROW-10403) and [ARROW-9132](https://issues.apache.org/jira/browse/ARROW-9132).

Closes apache#9683 from rok/ARROW-10403

Lead-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…ctionary arrays

See [ARROW-10403](https://issues.apache.org/jira/browse/ARROW-10403) and [ARROW-9132](https://issues.apache.org/jira/browse/ARROW-9132).

Closes apache#9683 from rok/ARROW-10403

Lead-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants