Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Jul 11, 2014

In CPython, hash of None is different cross machines, it will cause wrong result during shuffle. This PR will fix this.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@davies davies changed the title hijack hash to make hash of None consistant cross machines [PySpark] hijack hash to make hash of None consistant cross machines Jul 11, 2014
@mateiz
Copy link
Contributor

mateiz commented Jul 11, 2014

This was already fixed before in another way -- see the code at line 1061 of rdd.py:

        if partitionFunc is None:
            partitionFunc = lambda x: 0 if x is None else hash(x)

I much prefer this to replacing the global "hash" function. Just do the same fix elsewhere if there are places where it's a problem.

@mateiz
Copy link
Contributor

mateiz commented Jul 11, 2014

Actually I see you even deleted this code. What was the reason for this change?

@davies
Copy link
Contributor Author

davies commented Jul 11, 2014

@mateiz If there is None in Tuple, such as (None, 3), the hash of it will be different cross machines.

If user provide a partitionFunc, which uses hash() in it, then it will have problem.

This hack does not look good for me. Maybe we just use this portable hash as default one?

@mateiz
Copy link
Contributor

mateiz commented Jul 11, 2014

Ah, I see, that makes sense. In that case let's give it our own global name instead of replacing the built-in hash. We can just put it in the "pyspark" package.

@davies
Copy link
Contributor Author

davies commented Jul 11, 2014

The original motivation to do in this way is that we hope it can fix the problem of structure with None in it automatically, but in fact, it does not work in some cases, such as tuple. It's also will help in some cases, such as user defined objects.

@mateiz
Copy link
Contributor

mateiz commented Jul 12, 2014

Sure, but is it okay to not replace the global hash()? Just call it something like pyspark.hash and set the default to pyspark.hash.

@mateiz
Copy link
Contributor

mateiz commented Jul 12, 2014

Jenkins, add to whitelist and test this please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this fail now? (Or if it passes it's because this runs on one machine somehow)

@SparkQA
Copy link

SparkQA commented Jul 12, 2014

QA tests have started for PR 1371. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16587/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 12, 2014

QA results for PR 1371:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16587/consoleFull

@davies
Copy link
Contributor Author

davies commented Jul 14, 2014

Jenkins, test this please

@mateiz
Copy link
Contributor

mateiz commented Jul 15, 2014

BTW for us to merge this you should also open a JIRA for it so we can track it. Just add one on https://issues.apache.org/jira/browse/SPARK.

@davies davies changed the title [PySpark] hijack hash to make hash of None consistant cross machines [SPARK-2494] [PySpark] hijack hash to make hash of None consistant cross machines Jul 15, 2014
@davies
Copy link
Contributor Author

davies commented Jul 15, 2014

Create issue #SPARK-2494 to track this.

@davies davies changed the title [SPARK-2494] [PySpark] hijack hash to make hash of None consistant cross machines [SPARK-2494] [PySpark] make hash of None consistant cross machines Jul 16, 2014
@mattf
Copy link

mattf commented Jul 18, 2014

i've confirmed that this patch addresses the reported issue...

 (
  len(sc.parallelize([((None, 1), 1),] * 100, 100).groupByKey(10).collect()) == 1,
  len(sc.parallelize([(((None, 1), 1), 1),] * 100, 100).groupByKey(10).collect()) == 1,
  len(sc.parallelize([((1, None), 1),] * 100, 100).groupByKey(10).collect()) == 1,
  len(sc.parallelize([(((None, 1), None), 1),] * 100, 100).groupByKey(10).collect()) == 1,
 ) => (True, True, True, True)

@davies
Copy link
Contributor Author

davies commented Jul 18, 2014

@mattf, Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment from before was deleted, but please add a link to where the implementation is from, or a reference to the Python source code for this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also explain what "consistent hash code means", this comment doesn't say anything about the hash code of None being different across machines by default

@mateiz
Copy link
Contributor

mateiz commented Jul 20, 2014

Hey @davies apart from the small comments above, please add a test in tests.py. Jobs similar to the ones Matt posted would be great. Otherwise this might break again in the future.

@davies
Copy link
Contributor Author

davies commented Jul 20, 2014

@matei, our tests only run in local mode, but this issue can only be
reproduced in multi-node cluster. Do we still need it ?

On Sun, Jul 20, 2014 at 1:26 AM, Matei Zaharia notifications@github.com
wrote:

Hey @davies https://github.com/davies apart from the small comments
above, please add a test in tests.py. Jobs similar to the ones Matt
posted would be great. Otherwise this might break again in the future.

Reply to this email directly or view it on GitHub
#1371 (comment).

  • Davies

@mateiz
Copy link
Contributor

mateiz commented Jul 20, 2014

Even in local mode, we launch multiple Python processes, one per core. Just set the master to local[4] or something like that. Some of our other tests do that.

@davies
Copy link
Contributor Author

davies commented Jul 21, 2014

Even with multiprocess, the hash of None are the same, because they are
forked from the same one process.

On Sun, Jul 20, 2014 at 4:33 PM, Matei Zaharia notifications@github.com
wrote:

Even in local mode, we launch multiple Python processes, one per core.
Just set the master to local[4] or something like that. Some of our other
tests do that.

Reply to this email directly or view it on GitHub
#1371 (comment).

  • Davies

@mateiz
Copy link
Contributor

mateiz commented Jul 21, 2014

Are you sure about that? They're forked from Java, not from the Python process.

If this is the case, please suggest another way to test this. We can't add a bug fix without a test.

@mateiz
Copy link
Contributor

mateiz commented Jul 21, 2014

Jenkins, test this please

@mateiz
Copy link
Contributor

mateiz commented Jul 21, 2014

Actually I see there are some doctests that I missed earlier, maybe that's okay. Though last time it failed Jenkins...

@SparkQA
Copy link

SparkQA commented Jul 21, 2014

QA tests have started for PR 1371. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16906/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 21, 2014

QA results for PR 1371:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16906/consoleFull

@mateiz
Copy link
Contributor

mateiz commented Jul 21, 2014

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Jul 21, 2014

QA tests have started for PR 1371. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16914/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 21, 2014

QA results for PR 1371:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16914/consoleFull

@davies
Copy link
Contributor Author

davies commented Jul 21, 2014

The JVM fork one python daemon(daemon.py), then the daemon fork all the workers.

asfgit pushed a commit that referenced this pull request Jul 21, 2014
In CPython, hash of None is different cross machines, it will cause wrong result during shuffle. This PR will fix this.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1371 from davies/hash_of_none and squashes the following commits:

d01745f [Davies Liu] add comments, remove outdated unit tests
5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
b7118aa [Davies Liu] use __builtin__ instead of __builtins__
839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines

(cherry picked from commit 872538c)
Signed-off-by: Matei Zaharia <matei@databricks.com>
@asfgit asfgit closed this in 872538c Jul 21, 2014
asfgit pushed a commit that referenced this pull request Jul 21, 2014
In CPython, hash of None is different cross machines, it will cause wrong result during shuffle. This PR will fix this.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1371 from davies/hash_of_none and squashes the following commits:

d01745f [Davies Liu] add comments, remove outdated unit tests
5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
b7118aa [Davies Liu] use __builtin__ instead of __builtins__
839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines

(cherry picked from commit 872538c)
Signed-off-by: Matei Zaharia <matei@databricks.com>
asfgit pushed a commit that referenced this pull request Jul 21, 2014
In CPython, hash of None is different cross machines, it will cause wrong result during shuffle. This PR will fix this.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1371 from davies/hash_of_none and squashes the following commits:

d01745f [Davies Liu] add comments, remove outdated unit tests
5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
b7118aa [Davies Liu] use __builtin__ instead of __builtins__
839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines

(cherry picked from commit 872538c)
Signed-off-by: Matei Zaharia <matei@databricks.com>
@mateiz
Copy link
Contributor

mateiz commented Jul 21, 2014

Ah right, that makes sense. I've merged this in now.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
In CPython, hash of None is different cross machines, it will cause wrong result during shuffle. This PR will fix this.

Author: Davies Liu <davies.liu@gmail.com>

Closes apache#1371 from davies/hash_of_none and squashes the following commits:

d01745f [Davies Liu] add comments, remove outdated unit tests
5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
b7118aa [Davies Liu] use __builtin__ instead of __builtins__
839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines
@davies davies deleted the hash_of_none branch September 15, 2014 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants