Fix function serialization #8572

Merged
merged 6 commits into from
Nov 25, 2017
Conversation

evhub
Contributor

@evhub evhub commented Nov 23, 2017

This fixes an error I'd been struggling with for a while. Previously, func_dump decoded the output of marshal with the raw_unicode_escape codec for portability, and func_load encoded it again to get back the original. This is the exact opposite of what you want: you want to encode the bytes returned by marshal into some portable format, then decode that back into the original bytes.

As a result of this confusion, instead of (as I presume was intended) encoding the bytes as Unicode escapes, the code was scanning the bytes for anything that looked like a Unicode escape and parsing it as such. This is seriously problematic, because marshaled output can contain invalid Unicode escapes, which is the problem I was having. Specifically, my marshaled output contained a path which, since I'm on Windows, included c:\users\. The \u there is not followed by four hex digits, so it is an invalid Unicode escape and the decoding failed.
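The failure mode can be reproduced without Keras. The snippet below is a minimal sketch, assuming Python 3; the `payload` bytes are a hypothetical stand-in for marshaled output that embeds a Windows path, not the actual bytes from my model.

```python
# Hypothetical stand-in for marshaled bytes that embed a Windows path.
payload = b'c:\\users\\me\\model.py'

try:
    payload.decode('raw_unicode_escape')
    decode_failed = False
except UnicodeDecodeError:
    # "\u" must be followed by four hex digits; "sers" is not.
    decode_failed = True

# Even when decoding succeeds, bytes that merely look like an escape are
# rewritten, so the round trip does not return the original bytes.
lookalike = b'\\u0041'  # six bytes: backslash, 'u', '0', '0', '4', '1'
round_tripped = lookalike.decode('raw_unicode_escape').encode('raw_unicode_escape')
```

The second half shows the quieter problem: a byte sequence that happens to form a valid escape is collapsed into a single character, so the bytes handed back to marshal differ from the ones it produced.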

Presumably the mistake was made because marshal returns bytes, which needed to be converted to a string, and decode is the method that does that, even though semantically decoding is the opposite of the operation needed here. In Python, encoding turns a string into bytes and decoding turns bytes into a string. Since here you want the opposite (encode bytes into a string, then decode that string back into bytes), the default methods fight against you.

My fix is fairly simple. I explicitly use codecs to force the right kind of encoding/decoding, and for portability I use the base64 codec. The code takes the marshaled output, converts it to base64, then converts it back into bytes when it needs to be loaded. This does the semantically correct thing and solves the issue described above.
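As an illustration of the round trip described above, here is a minimal sketch assuming Python 3. It is not the exact code in this PR: `sample` is a hypothetical function, and the rebuild step uses `types.FunctionType` directly rather than func_load.

```python
import codecs
import marshal
import types


def sample(x, y=2):
    return x + y


# Dump side: marshal the code object, then encode the bytes as base64 text.
raw = marshal.dumps(sample.__code__)
encoded = codecs.encode(raw, 'base64').decode('ascii')

# Load side: turn the base64 text back into the original bytes.
restored = codecs.decode(encoded.encode('ascii'), 'base64')
code = marshal.loads(restored)

# Rebuild a callable from the restored code object and the known defaults.
rebuilt = types.FunctionType(code, {}, 'sample', (2,))
```

Because base64 output only uses ASCII characters, nothing in the marshaled bytes can be mistaken for an escape sequence, whatever paths or constants the code object contains.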

Member

@fchollet fchollet left a comment

Please add a test for this change.

@@ -197,7 +199,7 @@ def func_load(code, defaults=None, closure=None, globs=None):
         code, defaults, closure = code
         if isinstance(defaults, list):
             defaults = tuple(defaults)
-    code = marshal.loads(code.encode('raw_unicode_escape'))
+    code = marshal.loads(codecs.decode(six.binary_type(code), "base64"))
Member

For consistency, use ' as string delimiter, here and above.

@evhub
Contributor Author

evhub commented Nov 23, 2017

@fchollet Added a test and fixed the string delimiters.

@fchollet
Member

It looks like the tests are failing: https://travis-ci.org/fchollet/keras/builds/306493889

(binary strings are not JSON-serializable).
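For context, the kind of failure meant here can be sketched as follows. This is an assumption about the cause under Python 3, not output taken from the build log; the compiled expression and the `config` dictionary are hypothetical stand-ins for a serialized function.

```python
import codecs
import json
import marshal

# Stand-in for the marshaled code of a serialized function.
raw = marshal.dumps(compile('1 + 1', '<sketch>', 'eval'))
as_bytes = codecs.encode(raw, 'base64')  # the base64 codec returns bytes

try:
    json.dumps({'code': as_bytes})
    bytes_accepted = True
except TypeError:
    # On Python 3, json refuses bytes values.
    bytes_accepted = False

# Decoding the base64 bytes to ASCII text gives a JSON-friendly string.
as_text = as_bytes.decode('ascii')
config = json.dumps({'code': as_text})
recovered = codecs.decode(json.loads(config)['code'].encode('ascii'), 'base64')
```

Since base64 output is pure ASCII, the extra `.decode('ascii')` step cannot fail and does not lose information.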

Member

@fchollet fchollet left a comment

LGTM, thanks

@fchollet fchollet merged commit 45e781c into keras-team:master Nov 25, 2017
@rh314
Contributor

rh314 commented Nov 26, 2017

For a look at a test that covers functions with closure values, have a look at #8592
(file tests/keras/utils/generic_utils_test.py)

@bbabenko
Contributor

bbabenko commented Dec 8, 2017

@rh314 @fchollet: This PR seems to have broken backwards compatibility w.r.t. models serialized before this change. I tried to boil this down to the following simple example:

In [22]: import codecs

In [23]: import marshal

In [24]: from keras.activations import softmax  # just an example function

In [25]: code = marshal.dumps(softmax.__code__).decode('raw_unicode_escape')  # old serialization

In [26]: codecs.decode(code.encode('ascii'), 'base64')  # new deserialization
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-26-83df9bd56477> in <module>()
----> 1 codecs.decode(code.encode('ascii'), 'base64')

UnicodeEncodeError: 'ascii' codec can't encode character u'\x8f' in position 18: ordinal not in range(128)
> <ipython-input-26-83df9bd56477>(1)<module>()
----> 1 codecs.decode(code.encode('ascii'), 'base64')
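One possible way to accept both formats is to try base64 first and fall back to the old codec when that fails. The following is only a sketch, assuming Python 3; `decode_code_string` is a hypothetical helper, not something that exists in Keras as of this PR, and `raw` is a stand-in for real marshaled bytes.

```python
import binascii
import codecs


def decode_code_string(code):
    """Hypothetical helper: return marshaled bytes from either string format."""
    try:
        # New format: base64 text produced after this PR.
        return codecs.decode(code.encode('ascii'), 'base64')
    except (UnicodeEncodeError, binascii.Error):
        # Old format: bytes that were decoded with raw_unicode_escape.
        return code.encode('raw_unicode_escape')


# Stand-in bytes with non-ASCII values, like the 0x8f in the traceback above.
raw = b'\xe3\x01\x00\x8f\x00payload'
old_style = raw.decode('raw_unicode_escape')
new_style = codecs.encode(raw, 'base64').decode('ascii')
```

With these inputs, `decode_code_string(old_style)` and `decode_code_string(new_style)` both return `raw`. The fallback relies on old-style strings containing a non-ASCII character or invalid base64 padding; an old-style string that happened to be valid base64 would be misread, so this is a heuristic rather than a guarantee.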


4 participants