Allow map/apply in client/server when explicitly enabled #1497

sandhujasmine

Add serialization for builtins. Addresses the tests in PR #1493

  • added json_dumps and object_hook for python builtin functions
  • data_loads is not consistent across serialization formats: for fastmsgpack it uses pandas to decode to a Series, while the other formats decode to a list.

kwmsmith and others added 3 commits April 27, 2016 13:17
- added json_dumps and object_hook for python builtin functions
- data_loads is not consistent for all serializations. For fastmsgpack,
  it uses pandas to decode to a Series; while other methods decode to a
  list.
try:
    import builtins
except ImportError:
    import __builtin__ as builtins
Member

You can use from blaze.compatibility import builtins here.

@@ -187,6 +192,11 @@ def json_dumps(ds):
    return {'__!datashape': str(ds)}


@dispatch(types.BuiltinFunctionType)
Member

This dispatch is going to get hit for any function defined in C. This means that things like np.sum will get deserialized as builtins.sum. I think we might need some stronger checks here.

Author

@sandhujasmine Apr 29, 2016

@llllllllll

I was going to add another check as follows:

if f.__module__ in ('__builtin__', 'builtins'):
    return {'__!builtin_function': f.__name__}

However, wouldn't the dispatch only match BuiltinFunctionType in the case below? I don't believe we need another check here.

@dispatch(types.BuiltinFunctionType)
def json_dumps(f):
     ....

Testing against a numpy function raises NotImplementedError, because the dispatch only selects this function for types.BuiltinFunctionType:

NotImplementedError: Could not find signature for json_dumps: <function>
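
The error message above reports `<function>`, which suggests np.sum is an ordinary Python function and so never reaches the BuiltinFunctionType dispatch. A C-implemented function from another module behaves differently. The sketch below uses the standard library's math.sqrt as a stand-in for such a function and combines the type check with the `__module__` check proposed above. The helper names `is_python_builtin` and `dump_builtin` are hypothetical and not part of blaze.

```python
import builtins
import math
import types


def is_python_builtin(f):
    """True only for C-implemented functions that live in the builtins module."""
    return (
        isinstance(f, types.BuiltinFunctionType)
        and getattr(f, "__module__", None) in ("builtins", "__builtin__")
        and getattr(builtins, f.__name__, None) is f
    )


def dump_builtin(f):
    """Hypothetical encoder that mirrors the '__!builtin_function' marker."""
    if not is_python_builtin(f):
        raise NotImplementedError("not a Python builtin: %r" % (f,))
    return {"__!builtin_function": f.__name__}


# math.sqrt is implemented in C, so the type check alone would accept it.
print(isinstance(math.sqrt, types.BuiltinFunctionType))
# The module check is what separates it from a real builtin such as len.
print(is_python_builtin(math.sqrt), is_python_builtin(len))
```

Note that a check like this still accepts eval and exec, since both are genuine builtins.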

Author

@sandhujasmine May 2, 2016

More from the docs of the types module:

This module defines names for some object types that are used by the standard Python interpreter, but not for the types defined by various extension modules.

I now understand your comment better 👍
But I believe this will be safe against functions compiled in extensions, as per the docs and some quick testing against np.sum.

@llllllllll

@llllllllll
Member

One issue I have with this idea is that users can create a string and then apply exec or eval to it and send that to the server to execute. I think sending functions might only be valid for the pickle backend where the server maintainers have decided security is not a concern.

Jasmine Sandhu added 2 commits May 2, 2016 11:25
- add most_formats frozen set to Serialization which includes all
  Serialization types except fastmsgpack. 'fastmsgpack' serializes/loads
  to a Series object as opposed to a list.

- break apart test_map_client_server() so all Serialization types that
  return a list are tested in one function. 'fastmsgpack' Serialization
  is tested in test_map_client_server_fastmsgpack() since its assertion
  has to be for all() elements.
g for _, g in globals().items() if isinstance(g, SerializationFormat) and
g is not fastmsgpack
)

Author

@sandhujasmine May 2, 2016

Added most_formats so I can test fastmsgpack separately from all other formats.

@kwmsmith

- checks that the function given to map is a pandas or numpy function;
  raises NotImplementedError otherwise

- added simple tests for mapping numpy and pandas functions

object_hook.register('numpy_pandas_function', numpy_pandas_function_from_str)


Author

Added json_dumps and object_hook for pandas/numpy functions used in map. Added simple tests but need more interesting examples.

@kwmsmith

@llllllllll
Member

As a general rule for the blaze server I would prefer if we assumed limited trust for any client that is connected. I think this is important because the blaze server box will have the credentials to access any of your data or have the data available directly. There is also the concern that the box is a shared resource and people will misuse the box because they do not know better. For example, a non-malicious denial of service could come from a data scientist using np.savez on a very large array. I think that this security concern is enough to warrant implementing this with a whitelist. This would allow server administrators to decide which functions are allowed or which are disallowed. I think that we could possibly have some sentinel value that says, "all are allowed" but I would strongly suggest not making this the default. I also don't think that we should manage a whitelist ourselves because then blaze is in the business of doing security audits.
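
An administrator-controlled whitelist of the kind described above might look roughly like the following sketch. It is not blaze code: `qualified_name` and `make_resolver` are hypothetical names, and the "all are allowed" sentinel mentioned above is left out on purpose. The key property is that the default whitelist is empty, so nothing resolves unless the server administrator lists it.

```python
import math


def qualified_name(f):
    """Return 'module.name' for a function, e.g. 'math.sqrt'."""
    return "{}.{}".format(f.__module__, f.__name__)


def make_resolver(allowed=()):
    """Build a name -> function lookup limited to the given functions.

    The default is an empty whitelist, so every lookup fails unless the
    administrator opts in explicitly.
    """
    table = {qualified_name(f): f for f in allowed}

    def resolve(name):
        try:
            return table[name]
        except KeyError:
            raise ValueError("function %r is not whitelisted" % name)

    return resolve


# Example: an administrator allows exactly two functions.
resolve = make_resolver([math.sqrt, len])
print(resolve("math.sqrt")(9.0))
```

Because lookups go through a table built from function objects, a client can only name functions the administrator has already imported and listed.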

@sandhujasmine
Author

One issue I have with this idea is that users can create a string and then apply exec or eval to it and send that to the server to execute. I think sending functions might only be valid for the pickle backend where the server maintainers have decided security is not a concern.

@llllllllll
It's not implemented with an eval. It'll only look for the functions in the pandas, numpy or builtin namespace. But we might still need to limit this further as you suggest - will investigate further.

Could you please label this PR as Work In Progress?

@llllllllll
Member

Providing access to the builtin namespace is providing exec and eval. For example:

In [7]: bz.data(["print('hello from blaze!')"]).map(eval, '?string')
Out[7]: hello from blaze!


0  None

@sandhujasmine
Author

Providing access to the builtin namespace is providing exec and eval. For example:

I understand - thanks for the example. Will add a check for it.
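
A name-based check of the kind mentioned here could be sketched as follows. This is an illustration only: `FORBIDDEN_BUILTINS` and `builtin_from_name` are hypothetical names, and the contents of the forbidden set are an assumption based on the two functions named in this thread.

```python
import builtins

# Names singled out in the thread; the set itself is an illustrative assumption.
FORBIDDEN_BUILTINS = frozenset({"eval", "exec"})


def builtin_from_name(name):
    """Resolve a builtin by name, refusing the names listed above."""
    if name in FORBIDDEN_BUILTINS:
        raise ValueError("refusing to resolve builtin %r" % name)
    try:
        return getattr(builtins, name)
    except AttributeError:
        raise ValueError("no builtin named %r" % name)


print(builtin_from_name("sum")([1, 2, 3]))
```

A denylist like this still resolves names such as compile or __import__, which is one reason the whitelist suggested earlier in the thread is the stricter option.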

        mod = pd

    else:
        raise NotImplementedError("accepts numpy/pandas/builtin funcs only")
Author

@sandhujasmine May 4, 2016

Not sure if the server should raise an exception -- what is the best practice here? Example below:

In [11]: expr = t.species.map(os.path.exists, 'str')
In [12]: response = test.post('/compute', data=serial.dumps(query), headers=mimetype(serial))
In [13]: response
Out[13]: <Response streamed [400 BAD REQUEST]>
In [14]: response.status
Out[14]: '400 BAD REQUEST'

@kwmsmith @llllllllll

Jasmine Sandhu added 2 commits May 4, 2016 16:19
- raise HTTP exceptions if the incoming json contains 'eval' or 'exec', which
  we don't support executing on the server (i.e. 403 Forbidden)
- similarly, if the json contains a function not in the pandas/numpy namespace,
  raise an HTTP 501 Not Implemented exception

- don't limit what json_dumps can do since that is used for testing
- added tests but marked as xfail since pickle serialization is
  currently not raising these exceptions - need to ask/update
@@ -292,7 +300,7 @@ def numpy_pandas_function_from_str(f):
         mod = pd
 
     else:
-        raise NotImplementedError("accepts numpy/pandas/builtin funcs only")
+        raise wz_ex.NotImplemented("accepts numpy/pandas/builtin funcs only")

Author

Raising HTTP exceptions from werkzeug.exceptions, trying to be more specific. However, upon POST, the server invokes check_request to see whether the serialized data loads correctly, and check_request expects ValueError and traps it.

So perhaps utils.py should only raise ValueError.
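
The trade-off can be illustrated with a small standalone sketch. The `check_request` and `object_hook` functions below are simplified stand-ins written for this example, not the blaze implementations, and the rejected names are an assumption. The point is that a deserializer which raises ValueError fits a caller that already traps ValueError, and in Python 3 json.JSONDecodeError is itself a subclass of ValueError, so malformed payloads and rejected functions take the same path.

```python
import json

REJECTED = ("eval", "exec")  # illustrative assumption


def object_hook(obj):
    """Reject payloads that name a disallowed builtin; pass others through."""
    name = obj.get("__!builtin_function")
    if name in REJECTED:
        raise ValueError("function %r is not allowed" % name)
    return obj


def check_request(body):
    """Stand-in for the server-side check: returns (result, status code)."""
    try:
        return json.loads(body, object_hook=object_hook), 200
    except ValueError as exc:
        # Covers both malformed JSON and functions rejected by object_hook.
        return str(exc), 400


print(check_request('{"__!builtin_function": "eval"}'))
```

If the hook raised a werkzeug HTTP exception instead, a caller that only traps ValueError would not catch it, which is the mismatch described above.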

@kwmsmith @llllllllll

@kwmsmith
Member

Thanks for the input @llllllllll, it's good to have your usecase factored in as part of this work.

What about the following, to more clearly separate the locked-down server config from the trusted-client server config (all names are changeable):

  • We create a separate json_dumps_trusted multipledispatch that is a superset of json_dumps with implementations that include whitelisted builtins, numpy, and pandas functions.
  • We create an object_hook_trusted multipledispatch to go with json_dumps_trusted.
  • We create json_trusted and msgpack_trusted SerializationFormat instances that use these *_trusted serializers / deserializers.
  • We ensure that these *_trusted serialization formats are not included in the default formats argument to Server.
  • The only way for a user to have *_trusted serialization formats is via explicitly instantiating Server() for now -- we don't provide commandline arguments to enable them.
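
The separation proposed in the list above can be sketched with the standard library alone. Everything here is a simplified assumption made for illustration: the `SerializationFormat` tuple, the `TRUSTED_BUILTINS` whitelist, the helper functions, and the toy `Server` class are stand-ins rather than the blaze implementations, and plain json hooks are used in place of multipledispatch.

```python
import json
from collections import namedtuple

SerializationFormat = namedtuple("SerializationFormat", "name dumps loads")

# Hypothetical whitelist; a real deployment would choose its own.
TRUSTED_BUILTINS = {"len": len, "sum": sum, "abs": abs}


def _default_trusted(obj):
    """Encode whitelisted builtins by name; refuse everything else."""
    name = getattr(obj, "__name__", None)
    if TRUSTED_BUILTINS.get(name) is obj:
        return {"__!builtin_function": name}
    raise TypeError("cannot serialize %r" % (obj,))


def _object_hook_trusted(d):
    """Decode the marker written by _default_trusted."""
    if set(d) == {"__!builtin_function"}:
        name = d["__!builtin_function"]
        if name not in TRUSTED_BUILTINS:
            raise ValueError("builtin %r is not whitelisted" % name)
        return TRUSTED_BUILTINS[name]
    return d


json_fmt = SerializationFormat("json", json.dumps, json.loads)
json_trusted = SerializationFormat(
    "json_trusted",
    lambda obj: json.dumps(obj, default=_default_trusted),
    lambda s: json.loads(s, object_hook=_object_hook_trusted),
)

DEFAULT_FORMATS = (json_fmt,)  # trusted formats deliberately left out


class Server:
    """Toy stand-in for the real server; it only tracks enabled formats."""

    def __init__(self, formats=DEFAULT_FORMATS):
        self.formats = {f.name: f for f in formats}


locked_down = Server()
trusting = Server(formats=(json_fmt, json_trusted))
print(sorted(locked_down.formats), sorted(trusting.formats))
```

With this layout the plain json format never turns a payload into a callable, and the trusted format is available only to someone who constructs the server with it explicitly.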

@llllllllll overall, our target user is someone in an environment where using IPython notebook kernels with IPython notebook is acceptable, i.e., within a trusted environment, behind a firewall, etc.

@llllllllll
Member

That seems to satisfy both of our use cases and protects users who don't know the difference. On a somewhat related note, which serialization formats have you been using for production?

@kwmsmith
Member

which serialization formats have you been using for production?

Currently, json, but we might use msgpack in the future.

@kwmsmith kwmsmith added this to the 0.11 milestone May 11, 2016
@llllllllll
Member

I would highly recommend the fast_msgpack format for use with numpy and pandas; this was a really big performance improvement for us.

@kwmsmith kwmsmith changed the title from "Fix map/apply by adding serialization for builtins" to "Allow map/apply in client/server when explicitly enabled" May 16, 2016
@kwmsmith
Member

Closing -- this PR was absorbed into #1504.

@kwmsmith kwmsmith closed this May 18, 2016
@kwmsmith kwmsmith modified the milestones: 0.11, 0.10.2 Jun 8, 2016