Allow map/apply in client/server when explicitly enabled #1497

Closed
wants to merge 11 commits into base: master

Conversation

@sandhujasmine (Contributor) commented Apr 29, 2016

Add serialization for builtins. Addresses the tests in PR #1493

  • added json_dumps and object_hook for Python builtin functions
  • data_loads is not consistent across serializations: for fastmsgpack, it uses pandas to decode to a Series, while the other methods decode to a list.

kwmsmith and others added some commits Apr 27, 2016

Add serialization for builtins. Tests in PR 1493 pass
- added json_dumps and object_hook for python builtin functions
- data_loads is not consistent for all serializations. For fastmsgpack,
  it uses pandas to decode to a Series; while other methods decode to a
  list.
try:
    import builtins
except ImportError:
    import __builtin__ as builtins

@kwmsmith (Member) commented Apr 29, 2016

You can use from blaze.compatibility import builtins here.
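A minimal sketch of that suggestion, replacing the guarded import shown above:

# reuse blaze's existing compatibility shim instead of the try/except
from blaze.compatibility import builtins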

@@ -187,6 +192,11 @@ def json_dumps(ds):
    return {'__!datashape': str(ds)}
@dispatch(types.BuiltinFunctionType)

@llllllllll (Member) commented Apr 29, 2016

This dispatch is going to get hit for any function defined in C. This means that things like np.sum will get deserialized as builtins.sum. I think we might need some stronger checks here.
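For example (np.concatenate stands in here for any numpy callable implemented in C; the exact types can vary between numpy versions):

import types
import numpy as np

# A genuine builtin and a C-implemented numpy function can share the same type,
# so dispatching on types.BuiltinFunctionType alone cannot tell them apart.
isinstance(sum, types.BuiltinFunctionType)             # True
isinstance(np.concatenate, types.BuiltinFunctionType)  # True in many numpy builds

# Checking the defining module is one way to narrow the match:
sum.__module__             # 'builtins' ('__builtin__' on Python 2)
np.concatenate.__module__  # 'numpy' -- not the builtin namespace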

@sandhujasmine (Contributor) commented Apr 29, 2016

@llllllllll

I was going to add another check as follows:

if f.__module__ in ('__builtin__', 'builtins'):
    return {'__!builtin_function': f.__name__}

However, wouldn't the dispatch only work for BuiltinFunctionType in the case below? I don't believe we need another check here.

@dispatch(types.BuiltinFunctionType)
def json_dumps(f):
     ....

Testing against a numpy function gives the NotImplementedError because the dispatch is only using this function for types.BuiltinFunctionType

NotImplementedError: Could not find signature for json_dumps: <function>

@sandhujasmine (Contributor) commented May 2, 2016

More from the docs of the types module:

> This module defines names for some object types that are used by the standard Python interpreter, but not for the types defined by various extension modules.

I now understand your comment better 👍
But I believe this will be safe against functions compiled in extensions, as per the docs and per some quick testing against np.sum.

@llllllllll

istrue = result == compute(expr, {t: iris}, return_type=list)
try:
    assert istrue
except:

@llllllllll (Member) commented Apr 29, 2016

could we catch only the error we are expecting?

@sandhujasmine (Contributor) commented Apr 29, 2016

The question, though: for fastmsgpack, pandas loads the data into a Series, which is inconsistent with the other serialization types, which load a list.

fastmsgpack_object_hook is returning Series and DataFrames. Should it return lists to be consistent with the other serialization formats? If it doesn't matter, then I can break up the test so it does the correct assertion for the different serializations.

It seems like all serializations should return consistent results, but I don't know what this affects.

@llllllllll (Member) commented Apr 29, 2016

The fastmsgpack format was added to dramatically speed up the performance of the blaze server. The bottleneck in our production blaze servers and clients was actually serializing and deserializing the data. The general problem was that the dataframe was getting serialized as a list of records, and then when rehydrating the data we would call object_hook tens of millions of times. object_hook was consistently in the top 5 ncalls when profiling, so even with a percall that rounded to 0s, it was a large part of the runtime. The fastmsgpack format is designed to send each block of the dataframe as a single buffer which is then compressed with blosc. This is why we have a separate data_(dumps|loads) and dumps|loads. Basically, the data field of the response record is actually a separately serialized dataframe which has been compressed. This allows us to send a smaller response which is then deserialized much more quickly.

There is also the potential to reduce copying. If we deserialize as a list, then we need to take the bytes off the wire and turn them into Python integers, which is going to trigger thousands of allocations. This is really painful because we often just want to push that right back into an ndarray. I backported some code I added to pandas to prevent the copy and still give us a mutable buffer, which can be found here: https://github.com/blaze/blaze/blob/master/blaze/server/serialization.py#L117 (note, I actually just found a bug there, I will push a fix now: 62a2911). In the blaze version, we only slightly break the abstraction that bytes are immutable; however, the pandas version does this in a much more sane way: https://github.com/pydata/pandas/blob/master/pandas/io/packers.py#L325.
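Not the actual fastmsgpack wire format, just a rough sketch of the block-per-buffer idea (the column handling and names here are illustrative):

import blosc
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1000000, dtype='int64'),
                   'b': np.random.rand(1000000)})

# Record-oriented: object_hook would run once per record when rehydrating,
# which is what dominated the server profiles.
records = df.to_dict(orient='records')

# Block-oriented: each homogeneous column travels as one blosc-compressed
# buffer, so decoding is a single decompress + frombuffer per column.
payload = {col: blosc.compress(np.ascontiguousarray(df[col].values).tobytes(),
                               typesize=df[col].dtype.itemsize)
           for col in df.columns}
dtypes = {col: str(df[col].dtype) for col in df.columns}

restored = pd.DataFrame({col: np.frombuffer(blosc.decompress(payload[col]),
                                            dtype=dtypes[col])
                         for col in df.columns})
# np.frombuffer gives a read-only view of the decompressed bytes; the blaze and
# pandas code linked above goes further and avoids the copy while still ending
# up with a mutable buffer.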

@llllllllll (Member) commented Apr 29, 2016

One issue I have with this idea is that users can create a string and then apply exec or eval to it and send that to the server to execute. I think sending functions might only be valid for the pickle backend where the server maintainers have decided security is not a concern.

sandhujasmine added some commits May 2, 2016

TST: Separate 'map' test for fastmsgpack since assertion is different
- add most_formats frozen set to Serialization which includes all
  Serialization types except fastmsgpack. 'fastmsgpack' serializes/loads
  to a Series object as opposed to a list.

- break apart test_map_client_server() so all Serialization types that
  return a list are tested in one function. 'fastmsgpack' Serialization
  is tested in test_map_client_server_fastmsgpack() since its assertion
  has to be for all() elements.
g for _, g in globals().items() if isinstance(g, SerializationFormat) and
g is not fastmsgpack
)

@sandhujasmine (Contributor) commented May 2, 2016

Added most_formats so I can test fastmsgpack separately from all other formats.

@kwmsmith

fastmsgpack)
assert all(result == exp_res)

@sandhujasmine (Contributor) commented May 2, 2016

> could we catch only the error we are expecting?

Break apart the 'fastmsgpack' test from test_map_client_server() into test_map_client_server_fastmsgpack(). The result of fastmsgpack is a Series object as opposed to a list, so this makes the assertions cleaner.

@@ -190,8 +190,7 @@ def json_dumps(ds):
@dispatch(types.BuiltinFunctionType)
def json_dumps(f):
    if f.__module__ in ('__builtin__', 'builtins'):

@sandhujasmine (Contributor) commented May 2, 2016

This check is not required - I accidentally added it - removing it for now.

Serialize pandas/numpy functions for map operation
- checks that the function given to map is a pandas or numpy function; raises
  NotImplementedError otherwise

- added simple tests for mapping numpy and pandas functions
object_hook.register('numpy_pandas_function', numpy_pandas_function_from_str)

@sandhujasmine (Contributor) commented May 2, 2016

Added json_dumps and object_hook for pandas/numpy functions used in map. Added simple tests but need more interesting examples.

@kwmsmith

    else:
        raise NotImplementedError("Only supports numpy or pandas functions")
    return {'__!numpy_pandas_function': fcn}

@sandhujasmine (Contributor) commented May 2, 2016

Added json_dumps for pandas/numpy functions used in map. Only accepts pandas/numpy functions - the builtin functions dispatch on a separate function.

Is this a good way to ensure we only handle pandas/numpy functions? The builtin functions are handled by a different json_dumps that triggers for types.BuiltinFunctionType.

@kwmsmith
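A sketch of the check being discussed above (not the PR's exact code; the helper name and the token format are illustrative):

import numpy as np
import pandas as pd

def function_to_token(f):
    # accept only callables that live in the numpy or pandas namespaces
    mod = getattr(f, '__module__', '') or ''
    if mod.startswith('numpy'):
        fcn = 'numpy:' + f.__name__
    elif mod.startswith('pandas'):
        fcn = 'pandas:' + f.__name__
    else:
        raise NotImplementedError("Only supports numpy or pandas functions")
    return {'__!numpy_pandas_function': fcn}

function_to_token(np.mean)     # {'__!numpy_pandas_function': 'numpy:mean'}
function_to_token(pd.isnull)   # {'__!numpy_pandas_function': 'pandas:isnull'}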

@@ -193,6 +193,17 @@ def json_dumps(f):
    return {'__!builtin_function': f.__name__}
# conscturct a list of all numpy functions and their types

@llllllllll (Member) commented May 2, 2016

construct*

@sandhujasmine (Contributor) commented May 2, 2016

thanks - removed and replaced with a more relevant comment.

@llllllllll (Member) commented May 3, 2016

As a general rule for the blaze server, I would prefer if we assumed limited trust for any client that is connected. I think this is important because the blaze server box will have the credentials to access any of your data, or have the data available directly. There is also the concern that the box is a shared resource and people will misuse the box because they do not know better. For example, a non-malicious denial of service could come from a data scientist using np.savez on a very large array.

I think that this security concern is enough to warrant implementing this with a whitelist. This would allow server administrators to decide which functions are allowed or which are disallowed. I think that we could possibly have some sentinel value that says, "all are allowed", but I would strongly suggest not making this the default. I also don't think that we should manage a whitelist ourselves, because then blaze is in the business of doing security audits.
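A minimal sketch of the whitelist idea (the names below are illustrative, not a proposed blaze API):

import numpy as np

# The server administrator enumerates exactly which callables clients may
# reference; deserialization refuses everything else.
ALLOWED_FUNCTIONS = {
    'numpy.sin': np.sin,
    'numpy.log': np.log,
}

def lookup_whitelisted(token):
    try:
        return ALLOWED_FUNCTIONS[token]
    except KeyError:
        raise ValueError('function %r is not allowed on this server' % token)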

@sandhujasmine (Contributor) commented May 3, 2016

> One issue I have with this idea is that users can create a string and then apply exec or eval to it and send that to the server to execute. I think sending functions might only be valid for the pickle backend where the server maintainers have decided security is not a concern.

@llllllllll
It's not implemented with an eval. It'll only look for the functions in the pandas, numpy or builtin namespace. But we might still need to limit this further as you suggest - will investigate further.

Could you please label this PR as Work In Progress?

@llllllllll (Member) commented May 3, 2016

Providing access to the builtin namespace is providing exec and eval. For example:

In [7]: bz.data(["print('hello from blaze!')"]).map(eval, '?string')
hello from blaze!
Out[7]:
0  None
@sandhujasmine (Contributor) commented May 3, 2016

> Providing access to the builtin namespace is providing exec and eval. For example:

I understand - thanks for the example. Will add a check for it.

        mod = pd
    else:
        raise NotImplementedError("accepts numpy/pandas/builtin funcs only")

@sandhujasmine (Contributor) commented May 4, 2016

Not sure if the server should raise an exception -- what is the best practice here? Example below:

In [11]: expr = t.species.map(os.path.exists, 'str')
In [12]: response = test.post('/compute', data=serial.dumps(query), headers=mimetype(serial))
In [13]: response
Out[13]: <Response streamed [400 BAD REQUEST]>
In [14]: response.status
Out[14]: '400 BAD REQUEST'

@kwmsmith @llllllllll

sandhujasmine added some commits May 4, 2016

Raise standard HTTP (403, 501) exceptions
- raise HTTP exceptions if the incoming json contains 'eval' or 'exec', which
  we don't support executing on the server (i.e. 403 Forbidden)
- similarly if json contains a function not in pandas/numpy namespace,
  raise HTTP 501 Not Implemented exception

- don't limit what json_dumps can do since that is used for testing
- added tests but marked as xfail since pickle serialization is
  currently not raising these exceptions - need to ask/update
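A rough sketch of the behaviour this commit describes, using werkzeug's HTTP exception classes (the helper name and the token format are illustrative, not the PR's exact code):

import werkzeug.exceptions as wz_ex

def resolve_function(fcn):
    if fcn in ('eval', 'exec'):
        # never execute arbitrary strings on the server: 403 Forbidden
        raise wz_ex.Forbidden("'eval'/'exec' are not supported on the server")
    if not fcn.startswith(('numpy.', 'pandas.')):
        # anything outside the numpy/pandas namespaces: 501 Not Implemented
        raise wz_ex.NotImplemented("accepts numpy/pandas/builtin funcs only")
    # ... otherwise look the function up in the numpy/pandas namespace ...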
@@ -292,7 +300,7 @@ def numpy_pandas_function_from_str(f):
        mod = pd
    else:
-       raise NotImplementedError("accepts numpy/pandas/builtin funcs only")
+       raise wz_ex.NotImplemented("accepts numpy/pandas/builtin funcs only")

@sandhujasmine (Contributor) commented May 5, 2016

Raising HTTP exceptions from werkzeug.exceptions, trying to be more specific; however, upon POST, the server invokes check_request to see whether the serialized data loads correctly. The check_request function expects ValueError and traps it.

So perhaps utils.py should only raise ValueError.

@kwmsmith @llllllllll

@kwmsmith (Member) commented May 11, 2016

Thanks for the input @llllllllll; it's good to have your use case factored in as part of this work.

What about the following, to more clearly separate the locked-down server config from the trusted-client server config (all names are changeable):

  • We create a separate json_dumps_trusted multipledispatch that is a superset of json_dumps, with implementations that include whitelisted builtins, numpy, and pandas functions.
  • We create an object_hook_trusted multipledispatch to go with json_dumps_trusted.
  • We create json_trusted and msgpack_trusted SerializationFormat instances that use these *_trusted serializers / deserializers.
  • We ensure that these *_trusted serialization formats are not included in the default formats argument to Server.
  • The only way for a user to have *_trusted serialization formats is via explicitly instantiating Server() for now (sketched below) -- we don't provide command-line arguments to enable them.
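A sketch of how that opt-in could look from the administrator's side (every *_trusted name and the all_formats tuple are part of the proposal or assumed here, not existing blaze code; Server and its formats argument are what exist today):

from blaze.server import Server

data = {'iris': 'iris.csv'}   # placeholder resources, purely for illustration

# Default construction: only the locked-down formats are offered, so payloads
# that try to ship functions for map/apply are rejected.
server = Server(data)

# Opt-in: the administrator explicitly passes the trusted formats when
# constructing the Server; there is deliberately no command-line flag for it.
# server = Server(data, formats=all_formats + (json_trusted, msgpack_trusted))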

@llllllllll Overall, our target user is someone in an environment where running IPython notebook kernels is acceptable, i.e., within a trusted environment, behind a firewall, etc.

@llllllllll (Member) commented May 11, 2016

That seems to satisfy both of our use cases and protects users who don't know the difference. On a somewhat related note, which serialization formats have you been using for production?

@kwmsmith (Member) commented May 11, 2016

> which serialization formats have you been using for production?

Currently, json, but we might use msgpack in the future.

@kwmsmith added this to the 0.11 milestone May 11, 2016

@llllllllll (Member) commented May 11, 2016

I would highly recommend the fastmsgpack format for use with numpy and pandas; this was a really big performance improvement for us.

@kwmsmith changed the title from "Fix map/apply by adding serialization for builtins" to "Allow map/apply in client/server when explicitly enabled" May 16, 2016

@kwmsmith (Member) commented May 18, 2016

Closing -- this PR was absorbed into #1504.

@kwmsmith closed this May 18, 2016

@kwmsmith modified the milestones: 0.11, 0.10.2 Jun 8, 2016
