New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allows the server/client to use pickle #1102

Merged
merged 4 commits into from Jun 2, 2015

Conversation

Projects
None yet
4 participants
@llllllllll
Member

llllllllll commented May 23, 2015

When trying to use the blaze server, you are heavily restricted in what kind of data you can work with because of the json protocol. The main thing that I needed was a clean way to send datetimes across the wire. To generalize the solution, I have just allowed (not forced) the client/server to use pickle. I realize most people would probably be afraid to unpickle data from the internet; however, blaze server is basically remote code execution by definition. The server will only respond to 'compute.pickle' if the allow_pickle flag is set, this is by default false.

try:
import cPickle as pickle
except ImportError:
import pickle

This comment has been minimized.

@mrocklin

mrocklin May 23, 2015

Member

Should we add pickle to compatibility?

This comment has been minimized.

@llllllllll

llllllllll May 23, 2015

Member

It is stdlib

This comment has been minimized.

@mrocklin

mrocklin May 23, 2015

Member

I know, I've just found myself doing this exact same try-except block. Just a thought that we could do this in compatibility.py and then import pickle from there. Fine either way though.

This comment has been minimized.

@llllllllll

llllllllll May 26, 2015

Member

sorry, I misunderstood and thought you meant some compatibility requirements file.

)
@pickle_extension_api.route('/compute.pickle', methods=['POST', 'PUT', 'GET'])

This comment has been minimized.

@mrocklin

mrocklin May 23, 2015

Member

I like the idea of specifying the serialization format by the route.

url = url.strip('/')
if not url[:4] == 'http':
url = 'http://' + url
self.url = url
if mode not in ('json', 'pickle'):
raise ValueError('mode must be either pickle or json')

This comment has been minimized.

@mrocklin

mrocklin May 23, 2015

Member

I wonder if this could be cheaply generalized to any serialization format by accepting dumps/loads functions in the Client/Server. This would let us extend out to other formats like msgpack.

This comment has been minimized.

@llllllllll

llllllllll May 26, 2015

Member

I just pushed a commit that generalizes this into a format where the server accepts an iterable of formats, and the client specifies which format it will use. Maybe work can be done to add a series of formats to fall back to.

headers={'Content-Type': 'application/json'})
serial = ec.serial
r = requests.get('%s/compute.%s' % (ec.url, ec.mode),
data=serial.dumps({'expr': tree}))

This comment has been minimized.

@mrocklin

mrocklin May 23, 2015

Member

Note that for pickle, if the client and server are close and if you have a decent amount of numerical data, and if you're in Python2, then you really want to use the following:

dumps = partial(pickle.dumps, protocol=pickle.HIGHEST_PROTOCOL)

This comment has been minimized.

@llllllllll

llllllllll May 26, 2015

Member

Thanks, does python2 support the python3 pickle format, I haven't tried, and to remove the issue with qdb I just used json.

@llllllllll

This comment has been minimized.

Member

llllllllll commented May 26, 2015

travis just failed 2.6 because of a dict comprehension, I will replace with a map or something.

@cpcloud cpcloud self-assigned this May 26, 2015

@cpcloud cpcloud added this to the 0.8.1 milestone May 26, 2015

@cpcloud

This comment has been minimized.

Member

cpcloud commented May 26, 2015

you can construct a generator expression inside of the dict constructor to simulate dict comprehensions:

dict((k, v) for k, v in list_of_pairs)
@DavidMertz

This comment has been minimized.

DavidMertz commented May 26, 2015

The comprehension in the initializer seems superfluous:

Python 2.6.9 |Anaconda 2.2.0 (x86_64)| (unknown, Jan 10 2014, 13:33:57)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> list_of_pairs = [('foo',1),('bar',2)]
>>> dict(list_of_pairs)
{'foo': 1, 'bar': 2}
@llllllllll

This comment has been minimized.

Member

llllllllll commented May 26, 2015

I think that was just an example to point me in the correct direction, the actual code being discussed was:

_get_format.cache[app] = dict(
    (f.name, f) for f in _get_option('formats', options)
)
@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 1, 2015

@llllllllll can you add a test for this?

@llllllllll

This comment has been minimized.

Member

llllllllll commented Jun 2, 2015

I now run the server tests with all the serialization formats

An iterable of supported serialization formats. By default, the
server will support JSON.
A serialization format is an object that supports:
name, loads, and dumps.

This comment has been minimized.

@mrocklin

mrocklin Jun 2, 2015

Member

If I were doing the work I would probably skip making the new type and use a dict

formats = {'json': (json.dumps, json.loads),
           'pickle': (partial(pickle.dumps, protocol=pickle.HIGHEST_PROTOCOL), pickle.loads),
           'msgpack': (msgpack.packb, msgpack.unpackb)
           }

But I'm quite grateful at not having to do this work and the chosen strategy currently in this PR seems sane to me; so I say go for it, particularly since this PR has been going on for a while.

This comment has been minimized.

@cpcloud

cpcloud Jun 2, 2015

Member

@llllllllll if we decide to go with the dict solution that @mrocklin showed above at some point in the future, would that break anything you're working on?

This comment has been minimized.

@llllllllll

llllllllll Jun 2, 2015

Member

No, I am not using this yet. One benefit of the namedtuple approach is that I can pass the same serialization format object to the server and the client and it will correctly generate and query the proper compute.{name} endpoint. If I passed a dict, I would need to manually keep the name in sync.

This comment has been minimized.

@cpcloud

cpcloud Jun 2, 2015

Member

Cool. I'm happy as is.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 2, 2015

Merging

cpcloud added a commit that referenced this pull request Jun 2, 2015

Merge pull request #1102 from quantopian/server-serialization
ENH: Allows the server/client to use pickle

@cpcloud cpcloud merged commit 3d81c8b into blaze:master Jun 2, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 2, 2015

@llllllllll Thanks!

@llllllllll llllllllll referenced this pull request Jun 4, 2015

Open

Some Blaze server improvements #982

1 of 5 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment