[SPARK-1687] [PySpark] pickable namedtuple #1623

Closed
wants to merge 9 commits into apache:master from davies:namedtuple

Conversation

davies
Contributor

@davies davies commented Jul 28, 2014

Add a hook to replace the original namedtuple with a picklable one, so that namedtuples can be used in RDDs.

PS: pyspark should be imported BEFORE "from collections import namedtuple"
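
For context, here is a minimal sketch of the kind of hook described above. It illustrates the general technique (patching each generated class with a __reduce__ so plain pickle/cPickle can rebuild it from its name and fields), not the exact code in this PR:

# Sketch only: make namedtuple classes picklable by attaching a __reduce__.
from collections import namedtuple as _old_namedtuple

def _restore(name, fields, values):
    # Recreate the namedtuple class from its name/fields, then the instance.
    cls = _old_namedtuple(name, fields)
    return cls(*values)

def _hack_namedtuple(cls):
    # Attach a __reduce__ so pickle/cPickle knows how to serialize instances.
    name, fields = cls.__name__, cls._fields
    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))
    cls.__reduce__ = __reduce__
    return cls

def namedtuple(*args, **kwargs):
    # Replacement hook: every class created through it is patched on creation.
    return _hack_namedtuple(_old_namedtuple(*args, **kwargs))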

@SparkQA

SparkQA commented Jul 28, 2014

QA tests have started for PR 1623. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17315/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1623:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17315/consoleFull

@JoshRosen
Contributor

Is there a way to do this that doesn't require PySpark to be imported before the namedtuples are created? Can you directly replace the __reduce__ method on the namedtuple class? Alternatively, maybe you can register a new __reduce__ method using copy_reg. Cloudpickle doesn't use copy_reg.pickle() directly, so the actual fix might be lower-level.
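
For illustration, a hedged sketch of the copy_reg route mentioned above (Python 2 module name; it is called copyreg in Python 3). This is not what the PR ended up doing, and each namedtuple class would need its own registration:

import copy_reg
from collections import namedtuple

Person = namedtuple("Person", "id firstName lastName")

def _rebuild_person(values):
    return Person(*values)

def _reduce_person(obj):
    # Tell pickle/cPickle how to serialize Person instances.
    return (_rebuild_person, (tuple(obj),))

copy_reg.pickle(Person, _reduce_person)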

Do not need import pyspark before using namedtuple
@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1623. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17375/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1623:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AutoSerializer(PickleSerializer):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17375/consoleFull

hack_namedtuple(o)

def dump_stream(self, iterator, stream):
    self._hack_namedtuple()
Contributor

I was going to suggest that maybe we should have a boolean flag that tests whether we've already hacked namedtuple, but maybe we don't need it: _hack_namedtuple() is idempotent and that might be premature optimization, since here we only pay the hack cost once per stream.

@JoshRosen
Contributor

I came up with a contrived example that doesn't work. Try running the following with ./bin/pyspark:

from collections import namedtuple
Person = namedtuple("Person", 'id firstName lastName')
jon = Person(1, "Jon", "Doe")
from pyspark import SparkContext
sc = SparkContext("local")
sc.textFile("/usr/share/dict/words").map(lambda x: jon).first()

This results in a pickling error.

The problem here is that _hack_namedtuple() is registered too late. What if you made it into a classmethod in PickleSerializer and called it from SparkContext.__init__()?

@davies
Contributor Author

davies commented Aug 2, 2014

It works on my Mac, have you applied the patch? It should be registered before dumps.

@JoshRosen
Contributor

Did you run that exact file with PySpark? The important bits are that namedtuple is imported and an instance is created before any PySpark imports, that we launch a job which tries to serialize a namedtuple in its function closure, and that this serialization takes place before the hack is registered (hence my use of textFile instead of parallelize).

On Aug 1, 2014, at 8:45 PM, Davies Liu notifications@github.com wrote:

It works on my Mac, have you applied the patch? It should be registered before dumps.



@davies
Contributor Author

davies commented Aug 2, 2014

I see, CloudPickle also needs this hack.

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1623. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17746/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1623. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17747/consoleFull

@JoshRosen
Contributor

Your latest commit improves things, but I still think the static method approach would be better, since that way we wouldn't wind up calling _hack_namedtuple() so often. Is there a reason why that doesn't work?

@davies
Contributor Author

davies commented Aug 2, 2014

Users may call namedtuple to create classes at any time, so this hack has to be delayed until pickling, which means we have to check many times.

@JoshRosen
Contributor

Calling _hack_namedtuple() should set up pickling for any namedtuple subclasses defined up to that point. It looks like we re-assign to collections.namedtuple, but by then it's already too late since the user might have a reference to the original, non-wrapped namedtuple class. Is there any easy way to patch the namedtuple object itself so that it injects the hack? I think that would solve the problem, since any new namedtuple classes would automatically receive the hack and any old ones would be handled by our search through __main__.
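
A hedged sketch of the "search through __main__" part of that idea, reusing the per-class patcher from the earlier sketch (again, an illustration rather than the PR's exact code):

import sys
from collections import namedtuple as _old_namedtuple

def _restore(name, fields, values):
    return _old_namedtuple(name, fields)(*values)

def _hack_namedtuple(cls):
    name, fields = cls.__name__, cls._fields
    cls.__reduce__ = lambda self: (_restore, (name, fields, tuple(self)))
    return cls

def _hijack_existing_namedtuples():
    # Patch namedtuple classes that were already defined in __main__ before
    # the hook was installed.
    main = sys.modules.get("__main__")
    if main is None:
        return
    for obj in vars(main).values():
        # Heuristic: namedtuple classes are tuple subclasses with _fields.
        if isinstance(obj, type) and issubclass(obj, tuple) and hasattr(obj, "_fields"):
            _hack_namedtuple(obj)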

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1623:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AutoSerializer(PickleSerializer):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17746/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1623:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AutoSerializer(PickleSerializer):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17747/consoleFull

@davies
Contributor Author

davies commented Aug 2, 2014

@JoshRosen Good point, I have managed to replace all references to namedtuple with the new one, so this hijack only needs to happen once.

Because it's only related to pickle serialization, it is called at module level.

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1623. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17753/consoleFull

@JoshRosen
Contributor

Just to cover all possible cases, are there any thread-safety issues here? Will we be in trouble if a user creates a new namedtuple instance while _hack_namedtuple() is running? That seems like an extremely unlikely scenario, though.

@davies
Contributor Author

davies commented Aug 2, 2014

Because of the GIL, Python threads will not run concurrently in most cases. Also, this patch replaces namedtuple first and then patches the existing classes, so the process can be interrupted without problems.

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1623:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AutoSerializer(PickleSerializer):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17753/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1623. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17759/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1623:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AutoSerializer(PickleSerializer):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17759/consoleFull

@JoshRosen
Contributor

This looks okay, but I still wonder whether there's a simpler approach. Have you looked at how dill handles namedtuples?

@davies
Contributor Author

davies commented Aug 4, 2014

It's easy to extend pickle to support namedtuple; cloudpickle and dill have done it this way, but they are slow. We want to use cPickle for datasets, so it should be fast by default. I haven't found a way to extend cPickle, do you have any ideas?
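
For what it's worth, cPickle does pick up a __reduce__ defined on the class itself, which is what the class-patching approach relies on; no pickle subclass is needed. A quick, hedged check (Python 2 style, matching the other snippets in this thread):

import cPickle
from collections import namedtuple as _old_namedtuple

def _restore(name, fields, values):
    return _old_namedtuple(name, fields)(*values)

Person = _old_namedtuple("Person", "id firstName lastName")
# Patch the class so cPickle serializes instances via _restore.
Person.__reduce__ = lambda self: (_restore, ("Person", Person._fields, tuple(self)))

data = cPickle.dumps(Person(1, "Jon", "Doe"), 2)
print cPickle.loads(data)   # Person(id=1, firstName='Jon', lastName='Doe')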

@JoshRosen
Contributor

Here's another (contrived) example that breaks:

from collections import namedtuple as nt
from pyspark import SparkContext
from pyspark.serializers import PickleSerializer

sc = SparkContext("local")
p = PickleSerializer()

Person = nt("Person", 'id firstName lastName')
jon = Person(1, "Jon", "Doe")
sc.textFile("/usr/share/dict/words").map(lambda x: jon).first()

It looks like the problem here is that line 306 assumes that old references will be named namedtuple, which isn't true if I import it under a different name.

@davies
Contributor Author

davies commented Aug 4, 2014

Yes, it's easy to break it.

Having a solution that works in 99% of cases is better than no solution, or a much slower solution that works in 100% of cases.

@davies
Contributor Author

davies commented Aug 4, 2014

This feature is not a blocker, because we prefer to use Row() instead of namedtuple for inferSchema().

If users really want to use namedtuple or customized classes in __main__, they could use cloudpickle (see the sketch below).
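
For reference, a hedged sketch of that workaround: PySpark's SparkContext takes a serializer argument, so data serialization can be routed through cloudpickle instead of cPickle (treat the exact call as an assumption; the trade-off, as noted above, is that cloudpickle is slower):

from pyspark import SparkContext
from pyspark.serializers import CloudPickleSerializer

# Use cloudpickle-based serialization for data instead of the default cPickle.
sc = SparkContext("local", "namedtuple-demo", serializer=CloudPickleSerializer())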

@JoshRosen
Contributor

I found another technique that may be more robust to namedtuple being accessible under different names. We can replace namedtuple's code object at runtime in order to interpose on calls to it:

import types
def copy_func(f, name=None):  # See http://stackoverflow.com/a/6528148/590203
    return types.FunctionType(f.func_code, f.func_globals, name or f.func_name,
            f.func_defaults, f.func_closure)

from collections import namedtuple
namedtuple._old_namedtuple = copy_func(namedtuple)
def wrapped(*args, **kwargs):
    print "Called the wrapped function!"
    return namedtuple._old_namedtuple(*args, **kwargs)
namedtuple.func_code = wrapped.func_code

print namedtuple("Person", "name age")

This prints

Called the wrapped function!
<class 'collections.Person'>

@SparkQA

SparkQA commented Aug 4, 2014

QA tests have started for PR 1623. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17862/consoleFull

@SparkQA

SparkQA commented Aug 4, 2014

QA tests have started for PR 1623. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17867/consoleFull

@JoshRosen
Contributor

I've merged this into master and branch-1.1. Thanks!

(I tested this locally)

@SparkQA

SparkQA commented Aug 4, 2014

QA results for PR 1623:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AutoSerializer(PickleSerializer):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17862/consoleFull

@SparkQA

SparkQA commented Aug 4, 2014

QA results for PR 1623:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17867/consoleFull

asfgit pushed a commit that referenced this pull request Aug 4, 2014
Add an hook to replace original namedtuple with an pickable one, then namedtuple could be used in RDDs.

PS: pyspark should be import BEFORE "from collections import namedtuple"

Author: Davies Liu <davies.liu@gmail.com>

Closes #1623 from davies/namedtuple and squashes the following commits:

045dad8 [Davies Liu] remove unrelated code changes
4132f32 [Davies Liu] address comment
55b1c1a [Davies Liu] fix tests
61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
f7b1bde [Davies Liu] add hack for CloudPickleSerializer
0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
93b03b8 [Davies Liu] pickable namedtuple

(cherry picked from commit 59f84a9)
Signed-off-by: Josh Rosen <joshrosen@apache.org>
@asfgit asfgit closed this in 59f84a9 Aug 4, 2014
@JoshRosen
Contributor

Whoops, I broke the build by merging this! I should have just waited for Jenkins to finish. Sorry if this inconvenienced anyone; I won't make this mistake again. Davies has a fix in #1771 that I'll get merged.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014