-
Notifications
You must be signed in to change notification settings - Fork 390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow map/apply in client/server when explicitly enabled #1497
Allow map/apply in client/server when explicitly enabled #1497
Conversation
- added json_dumps and object_hook for python builtin functions - data_loads is not consistent for all serializations. For fastmsgpack, it uses pandas to decode to a Series; while other methods decode to a list.
try: | ||
import builtins | ||
except ImportError: | ||
import __builtin__ as builtins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use from blaze.compatibility import builtins
here.
@@ -187,6 +192,11 @@ def json_dumps(ds): | |||
return {'__!datashape': str(ds)} | |||
|
|||
|
|||
@dispatch(types.BuiltinFunctionType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dispatch is going to get hit for any function defined in C. This means that thinks like np.sum
will get deserialized as builtins.sum
. I think we might need some stronger checks here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to add another check as follows:
if f.__module__ in ('__builtin__', 'builtins'):
return {'__!builtin_function': f.__name__}
However, wouldn't the dispatch only work for BuiltinFunctionType in case below? I don't believe we need another check in here.
@dispatch(types.BuiltinFunctionType)
def json_dumps(f):
....
Testing against a numpy function gives the NotImplementedError because the dispatch is only using this function for types.BuiltinFunctionType
NotImplementedError: Could not find signature for json_dumps: <function>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More from the docs of types module:
This module defines names for some object types that are used by the standard Python interpreter, but not for the types defined by various extension modules.
I now understand your comment better 👍
But I believe this will be safe against functions compiled in extensions as per the docs and per some quick testing against np.sum
One issue I have with this idea is that users can create a string and then apply exec or eval to it and send that to the server to execute. I think sending functions might only be valid for the pickle backend where the server maintainers have decided security is not a concern. |
- add most_formats frozen set to Serialization which includes all Serialization types except fastmsgpack. 'fastmsgpack' serializes/loads to a Series object as opposed to a list. - break apart test_map_client_server() so all Serialization types that return a list are tested in one function. 'fastmsgpack' Serialzation is tested in test_map_client_server_fastmsgpack() since its assertion has to be for all() elements.
g for _, g in globals().items() if isinstance(g, SerializationFormat) and | ||
g is not fastmsgpack | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added most_formats so I can test fastmsgpack separately from all other formats.
- checks the function given to map is pandas or numpy function. Raises NotImplementedError otherwise - added simple tests for mapping numpy and pandas functions
|
||
object_hook.register('numpy_pandas_function', numpy_pandas_function_from_str) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added json_dumps
and object_hook
for pandas/numpy functions used in map
. Added simple tests but need more interesting examples.
As a general rule for the blaze server I would prefer if we assumed limited trust for any client that is connected. I think this is important because the blaze server box will have the credentials to access any of your data or have the data available directly. There is also the concern that the box is a shared resource and people will misuse the box because they do not know better. For example, a non-malicious denial of service could come from a data scientist using |
@llllllllll Could you please label this PR as Work In Progress? |
Providing access to the builtin namespace is providing In [7]: bz.data(["print('hello from blaze!')"]).map(eval, '?string')
Out[7]: hello from blaze!
0 None |
I understand - thanks for the example. Will add a check for it. |
mod = pd | ||
|
||
else: | ||
raise NotImplementedError("accepts numpy/pandas/builtin funcs only") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the server should raise an exception -- what is the best practice here? Example below:
In [11]: expr = t.species.map(os.path.exists, 'str')
In [12]: response = test.post('/compute', data=serial.dumps(query), headers=mimetype(serial))
In [13]: response
Out[13]: <Response streamed [400 BAD REQUEST]>
In [14]: response.status
Out[14]: '400 BAD REQUEST'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a flask NotImplemented exception: (501 Not Implemented)
- raise http exceptions if incoming json contains 'eval' or 'exec' which we don't support executing on server (ie 503 Forbidden) - similarly if json contains a function not in pandas/numpy namespace, raise HTTP 501 Not Implemented exception - don't limit what json_dumps can do since that is used for testing - added tests but marked as xfail since pickle serialization is currently not raising these exceptions - need to ask/update
@@ -292,7 +300,7 @@ def numpy_pandas_function_from_str(f): | |||
mod = pd | |||
|
|||
else: | |||
raise NotImplementedError("accepts numpy/pandas/builtin funcs only") | |||
raise wz_ex.NotImplemented("accepts numpy/pandas/builtin funcs only") | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raising http exceptions from werkzeug.exceptions
, trying to be more specific; however, upon POST, the server will invoke check_request
to see if the serialized data loads correctly. The check_request
function expects ValueError and traps it
So, perhaps utils.py should only raise ValueError
Thanks for the input @llllllllll, it's good to have your usecase factored in as part of this work. What about the following, to more clearly separate the locked-down server config from the trusted-client server config (all names are changeable):
@llllllllll overall, our target user is someone in an environment where using IPython notebook kernels with IPython notebook is acceptable, i.e., within a trusted environment, behind a firewall, etc. |
That seems to satisfy both of our use cases and protects users who don't know the difference. On a somewhat related note, which serialization formats have you been using for production? |
Currently, |
I would highly reccomend the |
Closing -- this PR was absorbed into #1504. |
Add serialization for builtins. Addresses the tests in PR #1493