[SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long #24019
Hi @cloud-fan and @rxin, based on the blame, it seems like you've looked at the hashing here relatively recently (or, at least, more recently than anyone else); could you take a look at this patch? Thanks!
ok to test
 * @since 2.4.1
 */
@scala.annotation.varargs
def xxhash64(cols: Column*): Column = withExpr {
We don't need `seed` in arguments?
The `hash` function doesn't currently have a `seed` argument either.
In any case, I asked about this on dev@spark.apache.org ("[SQL] hash: 64-bits and seeding"), but didn't get any response to that part of my proposal (just the xxhash bit). I think if there were a seed argument, it would have to come first, because the varargs have to come last; something like the following?
def hash(seed: Int, cols: Column*): Column
// or, maybe, don't perpetuate the "bad"/non-specific name:
def murmur3(seed: Int, cols: Column*): Column
def xxhash64(seed: Long, cols: Column*): Column
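The ordering constraint behind these signatures can be shown with a minimal, self-contained Scala sketch. Everything in it is illustrative: `Column` is a hypothetical stand-in rather than Spark's class, the mixing step is a toy FNV-1a-style fold rather than xxHash64 or MurmurHash3, and the default seed of 42 is an assumption made for the example.

```scala
// Hypothetical stand-in for Spark's Column; only a name is needed here.
final case class Column(name: String)

object SeededHashSketch {
  // Toy mixing step (FNV-1a style). NOT xxHash64 or MurmurHash3.
  private def mix(acc: Long, s: String): Long =
    s.foldLeft(acc)((h, c) => (h ^ c.toLong) * 0x100000001b3L)

  // A repeated (varargs) parameter must be the last one, so the seed leads.
  def xxhash64(seed: Long, cols: Column*): Long =
    cols.foldLeft(seed)((h, col) => mix(h, col.name))

  // Seedless overload that delegates to a fixed default seed.
  // The value 42 is an assumption for this sketch.
  def xxhash64(cols: Column*): Long = xxhash64(42L, cols: _*)
}
```

Because a `Long` is never a `Column`, the compiler can tell the two overloads apart, so existing seedless call sites would keep compiling if a seeded variant were added later.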
Ah, I see. It's OK as it is.
Test build #103436 has finished for PR 24019 at commit
This is designed to exactly mimic the 32-bit `hash`, which uses MurmurHash3. The name is designed to be more future-proof than the 'hash', by indicating the exact algorithm used, similar to md5 and the sha hashes.
Test build #103480 has finished for PR 24019 at commit
retest this please
Test build #103487 has finished for PR 24019 at commit
Test build #103483 has finished for PR 24019 at commit
The `hash` function isn't the only thing required to mimic.
retest this please
Thanks for the review/testing help. (I apologise for repeatedly failing tests online; I'm struggling to find the best way to run tests locally, since it seems to take so long/consume my machine.) I think this now more closely matches the existing
Test build #103500 has finished for PR 24019 at commit
pls add a test for R in test_sparkSQL
Test build #103526 has finished for PR 24019 at commit
Test build #103539 has finished for PR 24019 at commit
R looks good, but perhaps the AppVeyor R test is not triggering?
Hm, I'm not sure I understand; is there something I should do? (Also, I see
I'd love to see this progress; is there anything I should do?
R test passes, so that part is good. Someone else should review?
thanks, merging to master!
What changes were proposed in this pull request?
This introduces a new SQL function 'xxhash64' for getting a 64-bit hash of an arbitrary number of columns.
This is designed to exactly mimic the 32-bit `hash`, which uses MurmurHash3. The name is designed to be more future-proof than the 'hash', by indicating the exact algorithm used, similar to md5 and the sha hashes.
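As a rough usage sketch, the new function can be called alongside the existing `hash`. This assumes a Spark build that includes this patch is on the classpath; the object name, helper name, and column names below are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hash, xxhash64}

object XxHash64Usage {
  // Adds a 32-bit hash column and a 64-bit xxhash64 column over all input columns.
  // The output column names "hash32" and "hash64" are illustrative.
  def withHashes(df: DataFrame): DataFrame = {
    val all = df.columns.map(col)
    df.withColumn("hash32", hash(all: _*))
      .withColumn("hash64", xxhash64(all: _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("xxhash64-usage")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")
    val hashed = withHashes(df)
    hashed.printSchema() // hash32 is an integer column, hash64 a long column
    hashed.show(truncate = false)
    spark.stop()
  }
}
```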
How was this patch tested?
The tests for the existing `hash` function were duplicated to run with `xxhash64`.