[SPARK-1687] [PySpark] pickable namedtuple #1623
QA tests have started for PR 1623. This patch merges cleanly. |
QA results for PR 1623: |
Is there a way to do this that doesn't require PySpark to be imported before the namedtuples are created? Can you directly replace the |
No need to import pyspark before using namedtuple
Conflicts: python/pyspark/tests.py
        hack_namedtuple(o)

    def dump_stream(self, iterator, stream):
        self._hack_namedtuple()
I was going to suggest that maybe we should have a boolean flag that tests whether we've already hacked namedtuple, but maybe we don't need it: _hack_namedtuple() is idempotent and that might be premature optimization, since here we only pay the hack cost once per stream.
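To make the idempotence point concrete, here is a minimal sketch of a patch pass that can safely run many times. It is not the code from this PR: the helper names `is_namedtuple_class` and `patch_namespace` and the marker attribute `_patched_` are assumptions made for illustration.

```python
from collections import namedtuple


def is_namedtuple_class(obj):
    """Heuristic check: a tuple subclass that carries the namedtuple API."""
    return (isinstance(obj, type) and issubclass(obj, tuple)
            and hasattr(obj, "_fields") and hasattr(obj, "_make"))


def patch_namespace(namespace, patch):
    """Apply `patch` once to every namedtuple class found in `namespace`.

    Classes are marked with a hypothetical `_patched_` attribute, so
    repeated calls skip them and the whole pass is idempotent.
    """
    patched = 0
    for obj in list(namespace.values()):
        if is_namedtuple_class(obj) and not obj.__dict__.get("_patched_", False):
            patch(obj)
            obj._patched_ = True
            patched += 1
    return patched


Person = namedtuple("Person", "id firstName lastName")
namespace = {"Person": Person, "answer": 42, "tuple": tuple}
seen = []
first_pass = patch_namespace(namespace, seen.append)
second_pass = patch_namespace(namespace, seen.append)
```

The second pass finds nothing left to do, which is why an extra "already hacked" flag at the call site buys little.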
I came up with a contrived example that doesn't work. Try running the following:

    from collections import namedtuple
    Person = namedtuple("Person", 'id firstName lastName')
    jon = Person(1, "Jon", "Doe")
    from pyspark import SparkContext
    sc = SparkContext("local")
    sc.textFile("/usr/share/dict/words").map(lambda x: jon).first()

This results in a pickling error. The problem here is that |
It works on my Mac; have you applied the patch? The hack should be registered before dumps.
Did you run that exact file with PySpark? The important bits are that namedtuple is imported and an instance is created before any PySpark imports, and we launch a job that tries to serialize a namedtuple in its function closure, and this serialization takes place before the hack is registered (hence my use of text file instead of parallelize).
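For readers wondering why a namedtuple defined in the driver script is a problem at all: the standard pickler stores classes by reference (module name plus class name), not by value. The sketch below simulates this without Spark. The module name `driver_only` is hypothetical and stands in for a driver-side module that workers cannot import; it assumes Python 3.6+ for the `module=` argument of `namedtuple`.

```python
import pickle
import sys
import types
from collections import namedtuple

# Hypothetical module that exists only in the "driver" process.
driver_module = types.ModuleType("driver_only")
sys.modules["driver_only"] = driver_module

Person = namedtuple("Person", "id firstName lastName", module="driver_only")
driver_module.Person = Person

# Pickling succeeds on the driver because driver_only.Person can be looked up.
payload = pickle.dumps(Person(1, "Jon", "Doe"))

# Simulate a worker process that has never seen the module.
del sys.modules["driver_only"]
try:
    pickle.loads(payload)
    load_error = None
except ImportError as exc:
    load_error = exc
```

The payload names the class but carries none of its field definitions, so the receiving side has nothing to rebuild it from.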
I see, CloudPickle also needs this hack.
Your latest commit improves things, but I still think the static method approach would be better, since that way we wouldn't wind up calling |
Users may call namedtuple to create a class at any time, so this hack has to be delayed until pickle is called, which means we have to check many times.
Calling |
@JoshRosen Good point. I have managed to replace all references to namedtuple with the new one, so this hijack only needs to happen once. Because it is only related to pickle serialization, it is now called at module level.
Just to cover all possible cases, are there any thread-safety issues here? Will we be in trouble if a user creates a new |
Because of the GIL, Python threads do not run concurrently in most cases. This patch replaces namedtuple first and then patches the existing classes, so the process can be interrupted without problems.
This looks okay, but I still wonder whether there's a simpler approach. Have you looked at how dill handles namedtuples? |
It's easy to extend pickle to support namedtuple; cloudpickle and dill do it that way, but they are slow. We want to use cPickle for datasets, so it should be fast by default. I have not found a way to extend cPickle. Do you have any ideas?
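The alternative to extending the pickler is to teach the class itself how to be pickled, so the stock (C-accelerated) pickler can be used unchanged. The following is a minimal sketch of that idea under stated assumptions, not the PR's actual implementation: `make_picklable`, `_restore`, and `_class_cache` are illustrative names, and `_restore` must live in a module that the receiving process can import.

```python
import pickle
from collections import namedtuple

_class_cache = {}  # hypothetical cache of rebuilt classes, keyed by (name, fields)


def _restore(name, fields, values):
    """Rebuild a namedtuple instance from its name, field names, and values."""
    key = (name, fields)
    cls = _class_cache.get(key)
    if cls is None:
        cls = _class_cache[key] = make_picklable(namedtuple(name, fields))
    return cls(*values)


def make_picklable(cls):
    """Attach a __reduce__ that pickles instances by value, not by class reference."""
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls


Person = make_picklable(namedtuple("Person", "id firstName lastName"))
jon = Person(1, "Jon", "Doe")
payload = pickle.dumps(jon)
restored = pickle.loads(payload)
```

The payload now carries the field names, so the receiving side only needs to import `_restore`, not the user's class. The trade-off is that the restored object is an instance of a rebuilt class, not of the original `Person`.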
Here's another (contrived) example that breaks:

    from collections import namedtuple as nt
    from pyspark import SparkContext
    from pyspark.serializers import PickleSerializer
    sc = SparkContext("local")
    p = PickleSerializer()
    Person = nt("Person", 'id firstName lastName')
    jon = Person(1, "Jon", "Doe")
    sc.textFile("/usr/share/dict/words").map(lambda x: jon).first()

It looks like the problem here is that line 306 assumes that old references will be named |
Yes, it's easy to break. Having a solution that works in 99% of cases is better than having no solution, or a much slower solution that works in 100% of cases.
This feature is not a blocker, because we prefer to use Row() instead of namedtuple for inferSchema(). If users really want to use a namedtuple or a custom class defined in __main__, they can use cloudpickle.
I found another technique that may be more robust:

    import types
    def copy_func(f, name=None):  # See http://stackoverflow.com/a/6528148/590203
        return types.FunctionType(f.func_code, f.func_globals, name or f.func_name,
                                  f.func_defaults, f.func_closure)
    from collections import namedtuple
    namedtuple._old_namedtuple = copy_func(namedtuple)
    def wrapped(*args, **kwargs):
        print "Called the wrapped function!"
        return namedtuple._old_namedtuple(*args, **kwargs)
    namedtuple.func_code = wrapped.func_code
    print namedtuple("Person", "name age")

This prints
|
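The snippet above is Python 2. As a hedged illustration of why swapping the code object is robust to aliased imports, here is a Python 3 sketch of the same trick applied to a toy function instead of `collections.namedtuple`. `make_record` is a hypothetical stand-in, and the sketch assumes it runs at module level so the wrapper can find `make_record` and `calls` as globals.

```python
import types


def copy_func(f, name=None):
    """Return an independent copy of function f (Python 3 attribute names)."""
    fn = types.FunctionType(f.__code__, f.__globals__, name or f.__name__,
                            f.__defaults__, f.__closure__)
    fn.__kwdefaults__ = f.__kwdefaults__  # not set by the constructor
    return fn


def make_record(name, fields):  # hypothetical stand-in for namedtuple
    return (name, tuple(fields.split()))


alias = make_record  # a reference taken BEFORE patching, like "import ... as nt"
calls = []

# Keep the original behaviour reachable, then swap the code object in place.
make_record._original = copy_func(make_record)


def wrapped(*args, **kwargs):
    calls.append(args)
    return make_record._original(*args, **kwargs)


make_record.__code__ = wrapped.__code__
result = alias("Person", "name age")
```

Because the function object itself is mutated and never rebound, every existing reference to it, under any name, now runs the wrapper.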
I've merged this into (I tested this locally) |
Add an hook to replace original namedtuple with an pickable one, then namedtuple could be used in RDDs.

PS: pyspark should be import BEFORE "from collections import namedtuple"

Author: Davies Liu <davies.liu@gmail.com>

Closes #1623 from davies/namedtuple and squashes the following commits:

045dad8 [Davies Liu] remove unrelated code changes
4132f32 [Davies Liu] address comment
55b1c1a [Davies Liu] fix tests
61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
f7b1bde [Davies Liu] add hack for CloudPickleSerializer
0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
93b03b8 [Davies Liu] pickable namedtuple

(cherry picked from commit 59f84a9)
Signed-off-by: Josh Rosen <joshrosen@apache.org>
Whoops, I broke the build by merging this! I should have just waited for Jenkins to finish. Sorry if this inconvenienced anyone; I won't make this mistake again. Davies has a fix in #1771 that I'll get merged. |
Add a hook to replace the original namedtuple with a picklable one, so that namedtuple can be used in RDDs.
PS: pyspark should be imported BEFORE "from collections import namedtuple"