Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29124][CORE] Use MurmurHash3 bytesHash(data, seed) instead of bytesHash(data) #25821

Closed
wants to merge 1 commit into from
Closed

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Sep 17, 2019

What changes were proposed in this pull request?

This PR changes bytesHash(data) API invocation with the underlaying byteHash(data, arraySeed) invocation.

def bytesHash(data: Array[Byte]): Int = bytesHash(data, arraySeed)

Why are the changes needed?

The original API is changed between Scala versions by the following commit. From Scala 2.12.9, the semantic of the function is changed. If we use the underlying form, we are safe during Scala version migration.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This is a kind of refactoring.

Pass the Jenkins with the existing tests.

@dongjoon-hyun dongjoon-hyun changed the title Use the underlying function [SPARK-29124][CORE] Use bytesHash(data, seed) instead of bytesHash(data) Sep 17, 2019
@dongjoon-hyun
Copy link
Member Author

cc @kiszk , @wangyum , @srowen

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29124][CORE] Use bytesHash(data, seed) instead of bytesHash(data) [SPARK-29124][CORE] Use MurmurHash3 bytesHash(data, seed) instead of bytesHash(data) Sep 17, 2019
@srowen
Copy link
Member

srowen commented Sep 17, 2019

See #25404 (comment) -- is that the equivalent change?

But I agree that we can and should test such a change against the current build to see if it changes behavior.

@dongjoon-hyun
Copy link
Member Author

This is an opposite approach. This PR makes Apache Spark independent from that Scala patch instead of matching to it. In that PR, I also suggest this approach.

to match scala/scala@846ee2b#diff-ac889f851e109fc4387cd738d52ce177 don't we need both calls to change bytesHash to arrayHash

@srowen
Copy link
Member

srowen commented Sep 17, 2019

Oh, OK. But wouldn't that probably change the behavior of Spark? I guess we'll see here in the tests.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Sep 17, 2019

For me, the PR itself looks solid and safe, but I also believe that Jenkins will make it sure for us soon. 😄

@SparkQA
Copy link

SparkQA commented Sep 17, 2019

Test build #110825 has finished for PR 25821 at commit a8187fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Sep 17, 2019

Hi, @srowen , @wangyum , @kiszk .
Could you review this once more? I want to merge this first and retrigger Scala-2.12.10 PR on top of this. As I mentioned there, this helps many cases.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure why it works, but tests pass, so it seems OK to me, if you see a reason this will work for Scala 2.12.10

Copy link
Member

@kiszk kiszk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it looks simpler than I thought in the bed :) Thank you.

@HyukjinKwon
Copy link
Member

Merged to master.

@dongjoon-hyun
Copy link
Member Author

Thank you all!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-SCALA-HASH branch September 18, 2019 01:42
@srowen
Copy link
Member

srowen commented Sep 18, 2019

See comment on other PR - yep I get the idea now, makes sense.

dongjoon-hyun added a commit that referenced this pull request Sep 18, 2019
…f `bytesHash(data)`

This PR changes `bytesHash(data)` API invocation with the underlaying `byteHash(data, arraySeed)` invocation.
```scala
def bytesHash(data: Array[Byte]): Int = bytesHash(data, arraySeed)
```

The original API is changed between Scala versions by the following commit. From Scala 2.12.9, the semantic of the function is changed. If we use the underlying form, we are safe during Scala version migration.
- scala/scala@846ee2b#diff-ac889f851e109fc4387cd738d52ce177

No.

This is a kind of refactoring.

Pass the Jenkins with the existing tests.

Closes #25821 from dongjoon-hyun/SPARK-SCALA-HASH.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 3ece8ee)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member Author

Hi, All.
Although this is a kind of improvement, I'll land this to branch-2.4.
We need this to upgrade branch-2.4 to Scala-2.12.10 and this is very safe patch.

@srowen
Copy link
Member

srowen commented Sep 18, 2019

Agree.

rshkv pushed a commit to palantir/spark that referenced this pull request Jun 20, 2020
…f `bytesHash(data)`

This PR changes `bytesHash(data)` API invocation with the underlaying `byteHash(data, arraySeed)` invocation.
```scala
def bytesHash(data: Array[Byte]): Int = bytesHash(data, arraySeed)
```

The original API is changed between Scala versions by the following commit. From Scala 2.12.9, the semantic of the function is changed. If we use the underlying form, we are safe during Scala version migration.
- scala/scala@846ee2b#diff-ac889f851e109fc4387cd738d52ce177

No.

This is a kind of refactoring.

Pass the Jenkins with the existing tests.

Closes apache#25821 from dongjoon-hyun/SPARK-SCALA-HASH.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request Jul 16, 2020
…f `bytesHash(data)`

This PR changes `bytesHash(data)` API invocation with the underlaying `byteHash(data, arraySeed)` invocation.
```scala
def bytesHash(data: Array[Byte]): Int = bytesHash(data, arraySeed)
```

The original API is changed between Scala versions by the following commit. From Scala 2.12.9, the semantic of the function is changed. If we use the underlying form, we are safe during Scala version migration.
- scala/scala@846ee2b#diff-ac889f851e109fc4387cd738d52ce177

No.

This is a kind of refactoring.

Pass the Jenkins with the existing tests.

Closes apache#25821 from dongjoon-hyun/SPARK-SCALA-HASH.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants