Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long #24019

Closed
wants to merge 6 commits into from

Conversation

huonw
Copy link
Contributor

@huonw huonw commented Mar 8, 2019

What changes were proposed in this pull request?

This introduces a new SQL function 'xxhash64' for getting a 64-bit hash of an arbitrary number of columns.

This is designed to exactly mimic the 32-bit hash, which uses
MurmurHash3. The name is designed to be more future-proof than the
'hash', by indicating the exact algorithm used, similar to md5 and the
sha hashes.

How was this patch tested?

The tests for the existing hash function were duplicated to run with xxhash64.

@huonw
Copy link
Contributor Author

huonw commented Mar 11, 2019

Hi @cloud-fan and @rxin, based on the blame, it seems like you've looked at the hashing here relatively recently (or, at least, more recently than anyone else); could you take a look at this patch? Thanks!

@cloud-fan
Copy link
Contributor

ok to test

* @since 2.4.1
*/
@scala.annotation.varargs
def xxhash64(cols: Column*): Column = withExpr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need seed in arguments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hash function doesn't currently have a seed argument either.

In any case, I asked about this on dev@spark.apache.org ("[SQL] hash: 64-bits and seeding"), but didn't get any response to that part of my proposal (just the xxhash bit). I think if there was one it would have to come first, because the var args have to come last, something like the following?

def hash(seed: Int, cols: Column*): Column
// or, maybe, don't perpetuate the "bad"/non-specific name:
def murmur3(seed: Int, cols: Columns*): Column
def xxhash64(seed: Long, cols: Column*): Column

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Its ok as it it.

@SparkQA
Copy link

SparkQA commented Mar 13, 2019

Test build #103436 has finished for PR 24019 at commit c73b303.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This is designed to exactly mimic the 32-bit `hash`, which uses
MurmurHash3. The name is designed to be more future-proof than the
'hash', by indicating the exact algorithm used, similar to md5 and the
sha hashes.
@SparkQA
Copy link

SparkQA commented Mar 14, 2019

Test build #103480 has finished for PR 24019 at commit 6fb8c39.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Mar 14, 2019

Test build #103487 has finished for PR 24019 at commit 6fb8c39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 14, 2019

Test build #103483 has finished for PR 24019 at commit 6fb8c39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The `hash` function isn't the only thing required to mimic.
@cloud-fan
Copy link
Contributor

retest this please

@huonw
Copy link
Contributor Author

huonw commented Mar 14, 2019

Thanks for the review/testing help. (I apologise for repeatedly failing tests online, however, I'm struggling to find the best way to run tests locally, since it seems to take so long/consume my machine.)

I think this now more closely matches the existing hash/Murmur3Hash in terms of uses/tests. For instance, my new commits expose this to Python and R, and duplicated some tests that call to the classes not the SQL functions. (Please let me know if you'd like the commits squashed to keep the history cleaner! I'm happy to oblige.)

@SparkQA
Copy link

SparkQA commented Mar 14, 2019

Test build #103500 has finished for PR 24019 at commit 7d95431.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a test for R in test_sparkSQL

R/pkg/NAMESPACE Outdated Show resolved Hide resolved
R/pkg/R/generics.R Outdated Show resolved Hide resolved
@SparkQA
Copy link

SparkQA commented Mar 15, 2019

Test build #103526 has finished for PR 24019 at commit 2af3224.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 15, 2019

Test build #103539 has finished for PR 24019 at commit bde21bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

R looks good, but perhaps appveyer R test is not triggering?

@huonw
Copy link
Contributor Author

huonw commented Mar 17, 2019

Hm, I'm not sure I understand; is there something I should do? (Also, I see SparkSQL functions: ...... in the appveyor log?)

@huonw
Copy link
Contributor Author

huonw commented Mar 19, 2019

I'd love to see this progress; is there anything I should do?

@felixcheung
Copy link
Member

R test passes, so that part is good. someone else should review?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in b67d369 Mar 20, 2019
@huonw huonw deleted the hash64 branch March 20, 2019 10:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants