[SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long #24019

huonw · 2019-03-08T03:11:09Z

What changes were proposed in this pull request?

This introduces a new SQL function 'xxhash64' for getting a 64-bit hash of an arbitrary number of columns.

This is designed to exactly mimic the existing 32-bit 'hash' function,
which uses MurmurHash3. The name is intended to be more future-proof
than 'hash' because it indicates the exact algorithm used, similar to
md5 and the sha hashes.
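
As a rough illustration of the difference in result width, here is a minimal sketch that calls both functions on the same columns. It assumes a Spark build that includes this patch (3.0.0 or later) on the classpath; the object name `XxHash64Example`, the helper `withHashes`, and the sample data are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hash, xxhash64}

// Hypothetical example object; not part of the patch.
object XxHash64Example {
  // Adds a 32-bit Murmur3 hash and a 64-bit xxHash of the same two columns.
  def withHashes(df: DataFrame): DataFrame =
    df.select(
      col("key"),
      col("value"),
      hash(col("key"), col("value")).as("hash32"),     // IntegerType
      xxhash64(col("key"), col("value")).as("hash64")) // LongType

  def main(args: Array[String]): Unit = {
    // Assumed local session, only for trying the functions out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("xxhash64-example")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    val hashed = withHashes(df)

    hashed.printSchema()
    hashed.show(truncate = false)

    spark.stop()
  }
}
```

Both functions take a variable number of columns, so the same call shape works for one column or many.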

How was this patch tested?

The tests for the existing hash function were duplicated to run with xxhash64.

huonw · 2019-03-11T22:18:51Z

Hi @cloud-fan and @rxin, based on the blame, it seems like you've looked at the hashing here relatively recently (or, at least, more recently than anyone else); could you take a look at this patch? Thanks!

cloud-fan · 2019-03-13T12:46:42Z

ok to test

maropu · 2019-03-13T13:50:07Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   * @since 2.4.1
+   */
+  @scala.annotation.varargs
+  def xxhash64(cols: Column*): Column = withExpr {


We don't need seed in arguments?

The hash function doesn't currently have a seed argument either.

In any case, I asked about this on dev@spark.apache.org ("[SQL] hash: 64-bits and seeding"), but didn't get any response to that part of my proposal (just the xxhash bit). I think if there were a seed argument, it would have to come first, because the varargs have to come last; something like the following?

def hash(seed: Int, cols: Column*): Column
// or, maybe, don't perpetuate the "bad"/non-specific name:
def murmur3(seed: Int, cols: Column*): Column

def xxhash64(seed: Long, cols: Column*): Column

Ah, I see. It's OK as it is.
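
The ordering constraint discussed in this exchange (a Scala varargs parameter must be the last one in its parameter list) can be sketched without Spark. Everything here is hypothetical: `Column` is a stand-in case class, the methods return a descriptive string rather than a hash, and the default seed of 42 is an arbitrary value chosen for the sketch.

```scala
// Hypothetical stand-in for Spark's Column, used only for this sketch.
final case class Column(name: String)

object SeededHashSketch {
  // A repeated parameter (`Column*`) must be the last parameter in its list,
  // so a seed has to be declared before `cols`.
  def xxhash64(seed: Long, cols: Column*): String =
    s"xxhash64(seed=$seed, ${cols.map(_.name).mkString(", ")})"

  // An unseeded overload can delegate to the seeded one.
  // The default of 42L is an assumption made for this sketch.
  def xxhash64(cols: Column*): String =
    xxhash64(42L, cols: _*)

  def main(args: Array[String]): Unit = {
    println(xxhash64(7L, Column("a"), Column("b")))
    println(xxhash64(Column("a")))
  }
}
```

Declaring the seed after the columns, as in `def xxhash64(cols: Column*, seed: Long)`, would be rejected by the compiler, which is why the proposal puts it first.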

SparkQA · 2019-03-13T16:18:19Z

Test build #103436 has finished for PR 24019 at commit c73b303.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-14T07:05:02Z

Test build #103480 has finished for PR 24019 at commit 6fb8c39.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2019-03-14T07:09:07Z

retest this please

SparkQA · 2019-03-14T09:20:38Z

Test build #103487 has finished for PR 24019 at commit 6fb8c39.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-14T09:27:41Z

Test build #103483 has finished for PR 24019 at commit 6fb8c39.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-03-14T11:58:15Z

retest this please

huonw · 2019-03-14T12:01:08Z

Thanks for the review/testing help. (I apologise for repeatedly failing tests online; I'm struggling to find the best way to run tests locally, since they seem to take so long and consume my whole machine.)

I think this now more closely matches the existing hash/Murmur3Hash in terms of uses/tests. For instance, my new commits expose this to Python and R, and duplicate some tests that call the classes rather than the SQL functions. (Please let me know if you'd like the commits squashed to keep the history cleaner! I'm happy to oblige.)

SparkQA · 2019-03-14T19:07:12Z

Test build #103500 has finished for PR 24019 at commit 7d95431.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung

pls add a test for R in test_sparkSQL

R/pkg/NAMESPACE

R/pkg/R/generics.R

SparkQA · 2019-03-15T05:59:01Z

Test build #103526 has finished for PR 24019 at commit 2af3224.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-15T21:25:16Z

Test build #103539 has finished for PR 24019 at commit bde21bb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2019-03-16T04:11:52Z

R looks good, but perhaps the AppVeyor R test is not triggering?

huonw · 2019-03-17T02:16:38Z

Hm, I'm not sure I understand; is there something I should do? (Also, I see SparkSQL functions: ...... in the appveyor log?)

huonw · 2019-03-19T21:17:11Z

I'd love to see this progress; is there anything I should do?

felixcheung · 2019-03-20T03:26:07Z

R test passes, so that part is good. someone else should review?

cloud-fan · 2019-03-20T08:34:56Z

thanks, merging to master!

cloud-fan reviewed Mar 13, 2019

maropu reviewed Mar 13, 2019

huonw added 2 commits March 14, 2019 10:52

[SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long

b015395

This is designed to exactly mimic the 32-bit `hash`, which uses MurmurHash3. The name is designed to be more future-proof than the 'hash', by indicating the exact algorithm used, similar to md5 and the sha hashes.

Update @since from 2.4.1 to 3.0.0

ff2d704

huonw force-pushed the hash64 branch from c73b303 to 6fb8c39 Compare March 14, 2019 06:45

huonw added 2 commits March 14, 2019 22:55

Mirror Murmur3Hash treatment for XxHash64

49ff80e

The `hash` function isn't the only thing required to mimic.

Expose xxhash64 to Python and R

7d95431

huonw force-pushed the hash64 branch from 6fb8c39 to 7d95431 Compare March 14, 2019 11:57

Use the correct @aliases for the R function

2af3224

felixcheung reviewed Mar 15, 2019

Sort R files, and add test to test_sparkSQL.R

bde21bb

cloud-fan closed this in b67d369 Mar 20, 2019

huonw deleted the hash64 branch March 20, 2019 10:46
