[SPARK-2494] [PySpark] make hash of None consistant cross machines #1371

davies · 2014-07-11T06:42:44Z

In CPython, the hash of None differs across machines, which causes wrong results during a shuffle. This PR fixes that.
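To make the failure mode concrete, here is a minimal sketch. It assumes a default partitioner of the form `hash(key) % numPartitions`, and it uses two made-up hash values for None to stand in for two machines. `make_machine_hash` and `partition_of` are hypothetical helpers for illustration, not PySpark APIs.

```python
def make_machine_hash(none_hash):
    """Return a hash function that mimics one machine's value for hash(None).

    `none_hash` is an invented number; it only stands in for whatever value
    a given machine would report.
    """
    def machine_hash(key):
        return none_hash if key is None else hash(key)
    return machine_hash


def partition_of(key, num_partitions, hash_func):
    """Pick the target partition the way a hash partitioner would."""
    return hash_func(key) % num_partitions


# Two "machines" that disagree about hash(None) (illustrative values only).
machine_a = make_machine_hash(none_hash=1234567)
machine_b = make_machine_hash(none_hash=7654321)

part_a = partition_of(None, 10, machine_a)
part_b = partition_of(None, 10, machine_b)

# The same key is routed to different partitions, so records that share the
# key None would be split into more than one group after the shuffle.
print(part_a, part_b)
```

If both machines agreed on the hash of None, `part_a` and `part_b` would be equal and all records with that key would meet in one partition.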

AmplabJenkins · 2014-07-11T06:46:17Z

Can one of the admins verify this patch?

mateiz · 2014-07-11T20:56:17Z

This was already fixed before in another way -- see the code at line 1061 of rdd.py:

        if partitionFunc is None:
            partitionFunc = lambda x: 0 if x is None else hash(x)

I much prefer this to replacing the global "hash" function. Just do the same fix elsewhere if there are places where it's a problem.

mateiz · 2014-07-11T20:58:53Z

Actually I see you even deleted this code. What was the reason for this change?

davies · 2014-07-11T21:54:13Z

@mateiz If there is a None inside a tuple, such as (None, 3), its hash will also differ across machines.

If the user provides a partitionFunc that calls hash(), it will have the same problem.

This hack does not look good to me. Maybe we should just use this portable hash as the default one?
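A minimal sketch of what such a portable hash could look like follows. It is an assumption for illustration, not necessarily the code in this patch: the name `portable_hash` and the mixing constants (borrowed from the classic CPython tuple hash) are choices made here. The idea is to give None a fixed value and to recurse into tuples so that a nested None never reaches the built-in hash().

```python
import sys


def portable_hash(x):
    """Hash that treats None as 0, including None nested inside tuples.

    Illustrative sketch only; everything that is not None or a tuple
    still goes through the built-in hash().
    """
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678  # seed constant, assumed here for illustration
        for item in x:
            h ^= portable_hash(item)  # recurse so nested None is handled
            h *= 1000003
            h &= sys.maxsize          # keep the value in the native int range
        h ^= len(x)
        return h
    return hash(x)


# (None, 3) no longer depends on the built-in hash of None.
print(portable_hash((None, 3)) == portable_hash((None, 3)))
```

Because non-tuple, non-None values still fall back to hash(), this sketch is only as portable as the built-in hash is for those element types.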

mateiz · 2014-07-11T21:57:56Z

Ah, I see, that makes sense. In that case let's give it our own global name instead of replacing the built-in hash. We can just put it in the "pyspark" package.

davies · 2014-07-11T22:00:02Z

The original motivation for doing it this way was the hope that it would automatically fix the problem for structures containing None, but in fact it does not work in some cases, such as tuples. It will still help in some cases, such as user-defined objects.

mateiz · 2014-07-12T05:21:46Z

Sure, but is it okay to not replace the global hash()? Just call it something like pyspark.hash and set the default to pyspark.hash.

mateiz · 2014-07-12T17:12:48Z

Jenkins, add to whitelist and test this please

mateiz · 2014-07-12T17:13:15Z

python/pyspark/tests.py

Won't this fail now? (Or if it passes it's because this runs on one machine somehow)

SparkQA · 2014-07-12T17:17:33Z

QA tests have started for PR 1371. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16587/consoleFull

SparkQA · 2014-07-12T18:51:01Z

QA results for PR 1371:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16587/consoleFull

davies · 2014-07-14T18:10:48Z

Jenkins, test this please

mateiz · 2014-07-15T02:49:38Z

BTW for us to merge this you should also open a JIRA for it so we can track it. Just add one on https://issues.apache.org/jira/browse/SPARK.

davies · 2014-07-15T19:20:34Z

Created issue SPARK-2494 to track this.

mattf · 2014-07-18T14:17:23Z

i've confirmed that this patch addresses the reported issue...

 (
  len(sc.parallelize([((None, 1), 1),] * 100, 100).groupByKey(10).collect()) == 1,
  len(sc.parallelize([(((None, 1), 1), 1),] * 100, 100).groupByKey(10).collect()) == 1,
  len(sc.parallelize([((1, None), 1),] * 100, 100).groupByKey(10).collect()) == 1,
  len(sc.parallelize([(((None, 1), None), 1),] * 100, 100).groupByKey(10).collect()) == 1,
 ) => (True, True, True, True)

davies · 2014-07-18T16:32:47Z

@mattf, Thanks!

mateiz · 2014-07-20T08:25:07Z

python/pyspark/rdd.py

My comment from before was deleted, but please add a link to where the implementation is from, or a reference to the Python source code for this

Also explain what "consistent hash code" means; this comment doesn't say anything about the hash code of None being different across machines by default.

mateiz · 2014-07-20T08:26:36Z

Hey @davies apart from the small comments above, please add a test in tests.py. Jobs similar to the ones Matt posted would be great. Otherwise this might break again in the future.

davies · 2014-07-20T23:11:47Z

@matei, our tests only run in local mode, but this issue can only be reproduced on a multi-node cluster. Do we still need it?

mateiz · 2014-07-20T23:33:20Z

Even in local mode, we launch multiple Python processes, one per core. Just set the master to local[4] or something like that. Some of our other tests do that.

davies · 2014-07-21T02:19:07Z

Even with multiple processes, the hash of None is the same in each of them, because they are all forked from the same process.

mateiz · 2014-07-21T06:27:04Z

Are you sure about that? They're forked from Java, not from the Python process.

If this is the case, please suggest another way to test this. We can't add a bug fix without a test.

mateiz · 2014-07-21T07:29:53Z

Jenkins, test this please

mateiz · 2014-07-21T07:30:32Z

Actually I see there are some doctests that I missed earlier, maybe that's okay. Though last time it failed Jenkins...

SparkQA · 2014-07-21T07:33:36Z

QA tests have started for PR 1371. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16906/consoleFull

SparkQA · 2014-07-21T09:16:11Z

QA results for PR 1371:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16906/consoleFull

mateiz · 2014-07-21T16:33:20Z

Jenkins, test this please

SparkQA · 2014-07-21T16:38:08Z

QA tests have started for PR 1371. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16914/consoleFull

SparkQA · 2014-07-21T18:15:36Z

QA results for PR 1371:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16914/consoleFull

davies · 2014-07-21T18:39:52Z

The JVM forks one Python daemon (daemon.py), and then the daemon forks all the workers.
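The effect of forking on address-based hashes can be sketched as follows. This is an illustration under assumptions, not a reproduction of the PySpark daemon: it uses a plain `object()` as a stand-in, relying on CPython deriving the default object hash from the object's memory address, and it needs a POSIX system where `os.fork()` is available. `hash_seen_by_forked_child` is a hypothetical helper.

```python
import os


def hash_seen_by_forked_child(obj):
    """Fork a child, compute hash(obj) there, and return it to the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: it has a copy of the parent's address space,
        # so obj lives at the same address as in the parent.
        os.close(read_fd)
        os.write(write_fd, str(hash(obj)).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent process: read the child's answer until EOF, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe_reader:
        data = pipe_reader.read()
    os.waitpid(pid, 0)
    return int(data)


sentinel = object()  # default hash is address-based in CPython
print(hash_seen_by_forked_child(sentinel) == hash(sentinel))
```

Workers forked from one daemon therefore agree with each other, which is why a single-machine test cannot expose a hash that differs only between independently started interpreters.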

In CPython, hash of None is different cross machines, it will cause wrong result during shuffle. This PR will fix this.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1371 from davies/hash_of_none and squashes the following commits:

d01745f [Davies Liu] add comments, remove outdated unit tests
5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
b7118aa [Davies Liu] use __builtin__ instead of __builtins__
839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines

(cherry picked from commit 872538c)
Signed-off-by: Matei Zaharia <matei@databricks.com>

mateiz · 2014-07-21T19:02:58Z

Ah right, that makes sense. I've merged this in now.

hijack hash to make hash of None consistant cross machines

839e417

use __builtin__ instead of __builtins__

b7118aa

davies changed the title ~~hijack hash to make hash of None consistant cross machines~~ [PySpark] hijack hash to make hash of None consistant cross machines Jul 11, 2014

disable hijack of hash, use it only for partitionBy()

5467141

add comments, remove outdated unit tests

d01745f

davies changed the title ~~[PySpark] hijack hash to make hash of None consistant cross machines~~ [SPARK-2494] [PySpark] hijack hash to make hash of None consistant cross machines Jul 15, 2014

davies changed the title ~~[SPARK-2494] [PySpark] hijack hash to make hash of None consistant cross machines~~ [SPARK-2494] [PySpark] make hash of None consistant cross machines Jul 16, 2014

asfgit closed this in 872538c Jul 21, 2014

davies deleted the hash_of_none branch September 15, 2014 22:18
