Speed up network transfer for small buffers #8318
Conversation
Unit Test Results: see the test report for an extended history of previous test failures (useful for diagnosing flaky tests). 27 files ±0, 27 suites ±0, 14h 0m 0s ⏱️ -51m 6s. For more details on these failures, see this check. Results for commit 50c7390, compared against base commit c91a735. This comment has been updated with latest results.
FWIW, we're not using parquet but an arrow IPC format.
I don't understand how you arrive at this conclusion based on your tests. My above statement was wrong.
For comparison, if you didn't pass to the network stack [...] (distributed/protocol/serialize.py, lines 759 to 764 at 76dd800; not to be confused with `serialize_bytes`, which is completely unrelated), you would get rid of the deep-copy at step 4, although not the one at step 6 (distributed/protocol/serialize.py, lines 775 to 780 at 76dd800). Here `frame` is a memoryview of a `numpy.empty`; this deep-copy is also avoidable, as arrow is capable of ingesting it directly without a conversion to bytes. This, however, would actually cause a slowdown for shards smaller than ~64 kiB, as measured above. To avoid the deep copies at steps 1 and 7, you'd need to use pickle instead of arrow as your serialization engine. This would mean the additional cost of pickling and unpickling extra times, which may outweigh the benefit for very small buffers (I have not benchmarked the speed of pickle5 vs. arrow).
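To make that last point concrete, here is a minimal sketch (an illustration, not code from this PR) of what pickle protocol 5 does with a numpy shard when a buffer_callback is supplied: the array's buffer is handed over out-of-band instead of being copied into the pickle stream.

```python
import pickle
import numpy as np

shard = np.random.random(256)            # one 2 kiB shard
buffers = []
payload = pickle.dumps(shard, protocol=5, buffer_callback=buffers.append)

assert len(buffers) == 1                 # the shard's buffer, extracted zero-copy
assert len(payload) < shard.nbytes       # only metadata ends up in the pickle stream
roundtripped = pickle.loads(payload, buffers=buffers)
assert (roundtripped == shard).all()
```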
I can't tell whether your logic is true or false, but empirically I find that if I just throw a list subclass of numpy arrays at `serialize`, the extracted memoryviews point directly at the original arrays:

```python
import numpy as np
from distributed.protocol.serialize import Serialize, serialize

class _List(list):
    pass

buffer = _List([
    np.random.random((3, )),
    np.random.random((4, )),
    np.random.random((5, )),
])

serialize(buffer)
```

```
({'serializer': 'pickle', 'writeable': (True, True, True)},
[b'\x80\x05\x95\x90\x01\x00\x00\x00\x00\x00\x00\x8c\x17cloudpickle.cloudpickle\x94\x8c\x14_make_skeleton_class\x94\x93\x94(\x8c\x08builtins\x94\x8c\x04type\x94\x93\x94\x8c\x05_List\x94h\x03\x8c\x04list\x94\x93\x94\x85\x94}\x94\x8c\n__module__\x94\x8c\x08__main__\x94s\x8c da114ba5d9554d56a5fff20b05203256\x94Nt\x94R\x94\x8c\x1ccloudpickle.cloudpickle_fast\x94\x8c\x0f_class_setstate\x94\x93\x94h\x0f}\x94(h\x0bh\x0c\x8c\x07__doc__\x94N\x8c\r__slotnames__\x94]\x94u}\x94\x86\x94\x86R0)\x81\x94(\x8c\x12numpy.core.numeric\x94\x8c\x0b_frombuffer\x94\x93\x94(\x97\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02f8\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94bK\x03\x85\x94\x8c\x01C\x94t\x94R\x94h\x1c(\x97h"K\x04\x85\x94h&t\x94R\x94h\x1c(\x97h"K\x05\x85\x94h&t\x94R\x94e.',
<memory at 0x10ee80d00>,
<memory at 0x10ee80c40>,
 <memory at 0x10ee80dc0>])  # Memoryview is indeed pointing to the np array
```

```python
for mv, arr in zip(serialize(buffer)[1][1:], buffer):
    assert mv == arr.data
    assert mv.obj is arr
```

The same is true for pyarrow tables, which is just a little more involved to show since there is no direct attribute exposing the underlying buffer:

```python
import pyarrow as pa
import pandas as pd
tab = pa.Table.from_pandas(pd.DataFrame({"a": range(10)}))
arr = tab['a']
int_arr = arr.chunks[0]
# This indexing scheme is picking out the `pyarrow.Buffer` object that is being wrapped in a `PickleBuffer`
# here https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/python/pyarrow/io.pxi#L1315-L1326
pickle_buffer = int_arr.__reduce_ex__(5)[1][0][4][1].__reduce_ex__(5)[1][0]

buffer = _List([tab])
serialize(buffer)
```

```
({'serializer': 'pickle', 'writeable': (True,)},
[b'\x80\x05\x95\xad\x03\x00\x00\x00\x00\x00\x00\x8c\x17cloudpickle.cloudpickle\x94\x8c\x14_make_skeleton_class\x94\x93\x94(\x8c\x08builtins\x94\x8c\x04type\x94\x93\x94\x8c\x05_List\x94h\x03\x8c\x04list\x94\x93\x94\x85\x94}\x94\x8c\n__module__\x94\x8c\x08__main__\x94s\x8c da114ba5d9554d56a5fff20b05203256\x94Nt\x94R\x94\x8c\x1ccloudpickle.cloudpickle_fast\x94\x8c\x0f_class_setstate\x94\x93\x94h\x0f}\x94(h\x0bh\x0c\x8c\x07__doc__\x94N\x8c\r__slotnames__\x94]\x94u}\x94\x86\x94\x86R0)\x81\x94\x8c\x0bpyarrow.lib\x94\x8c\x12_reconstruct_table\x94\x93\x94]\x94h\x1a\x8c\rchunked_array\x94\x93\x94]\x94h\x1a\x8c\x0e_restore_array\x94\x93\x94(h\x1a\x8c\x0etype_for_alias\x94\x93\x94\x8c\x05int64\x94\x85\x94R\x94K\nK\x00K\x00]\x94(Nh\x1a\x8c\tpy_buffer\x94\x93\x94\x97\x85\x94R\x94e]\x94Nt\x94\x85\x94R\x94ah$\x8c\x05int64\x94\x85\x94R\x94\x86\x94R\x94ah\x1a\x8c\x06schema\x94\x93\x94]\x94h\x1a\x8c\x05field\x94\x93\x94(\x8c\x01a\x94h$\x8c\x05int64\x94\x85\x94R\x94\x88Nt\x94R\x94a}\x94C\x06pandas\x94B\xa7\x01\x00\x00{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 10, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "13.0.0"}, "pandas_version": "2.1.2"}\x94s\x86\x94R\x94\x86\x94R\x94a.',
 <memory at 0x13b57fc40>])
```

```python
assert serialize(buffer)[1][-1] == pickle_buffer.raw()
```
Yes. To my understanding, however, you're sending plain `bytes` into the network stack.
Regardless of the debate on PyArrow (de-)serialization, this PR seems to be a general improvement. Do you see anything blocking this from getting merged, @fjetter?
PyArrow does not perform a memory copy when deserializing an IPC stream.
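For illustration, a minimal sketch (assuming stock pyarrow IPC APIs; not code from this PR) that checks this zero-copy behaviour: the deserialized column's data buffer points back into the buffer holding the IPC stream.

```python
import pyarrow as pa

# Write a small table to an in-memory IPC stream
tab = pa.table({"a": list(range(10))})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, tab.schema) as writer:
    writer.write_table(tab)
stream_buf = sink.getvalue()                        # pa.Buffer holding the whole stream

# Read it back; the resulting arrays reference stream_buf instead of copying it
out = pa.ipc.open_stream(pa.BufferReader(stream_buf)).read_all()
data_buf = out.column("a").chunks[0].buffers()[1]   # the int64 data buffer
assert stream_buf.address <= data_buf.address < stream_buf.address + stream_buf.size
```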
While working on #8282, I realised that sending a lot of small zero-copy pickle5 buffers over the network is substantially slower than sending bytes blobs. So I set up a benchmark, using just tcp.py, that sends back and forth a single message containing a lot of small numpy arrays (a.k.a. shards, in shuffle lingo), encoded in different ways.
In all cases, the message size is 2 MiB or the shard size, whichever is larger.
The round-trip was performed on localhost, with sender and receiver running in the same event loop.
All tests are downstream of #8308
The full benchmark suite is available here: https://gist.github.com/crusaderky/3e11fd4be8b61d06109a01781dda9c83
Note: 8 kiB shards size matches the p2p rechunk tests in coiled/benchmarks for 8 MiB chunk size.
128 kiB shard size matches those with 128 MiB chunk size.
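For orientation, here is a much-reduced, hypothetical sketch of the round-trip pattern being benchmarked (the real suite, with all encodings and timings, is in the gist above; the `roundtrip` helper and the message layout are illustrative, not taken from the gist):

```python
import asyncio
import numpy as np
from distributed.comm import connect, listen
from distributed.protocol import to_serialize

async def roundtrip(shards):
    """Send one message full of small shards over a TCP comm and wait for an ack."""
    async def handle_comm(comm):
        msg = await comm.read()                      # deserializes the shards server-side
        await comm.write({"op": "ack", "n": len(msg["data"])})
        await comm.close()

    async with listen("tcp://127.0.0.1", handle_comm) as listener:
        comm = await connect(listener.contact_address)
        await comm.write({"op": "shards", "data": to_serialize(shards)})
        reply = await comm.read()
        await comm.close()
        return reply

# ~2 MiB message made of 2 kiB numpy shards
shards = [np.random.random(256) for _ in range(1024)]
print(asyncio.run(roundtrip(shards)))
```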
Legend
`list[ndarray]`
The numpy arrays are passed to the network stack unserialized, in a plain list.
The list is traversed by dask's serialization stack, and the numpy arrays are encoded by `serialize_numpy_ndarray`. Their buffers are extracted and are individually sent over the network; on the other side they are received into individual `numpy.empty` buffers (#8308) and are finally deserialized by `deserialize_numpy_array`.
deep-copies: 0
distributed.protocol.serialize: heavy usage
number of buffers: 802 (for 2 kiB shards)
Note: this is what's used by `gather_dep`.
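A minimal sketch of this path (an illustration, not the benchmark code): a bare numpy array goes through the registered dask serializer and its buffer is forwarded as a frame.

```python
import numpy as np
from distributed.protocol.serialize import serialize, deserialize

arr = np.random.random(256)                     # one 2 kiB shard
header, frames = serialize(arr)                 # dispatched to the numpy dask serializer
assert header["serializer"] == "dask"
assert frames[0].nbytes == arr.nbytes           # the array's own buffer, not a copy
assert (deserialize(header, frames) == arr).all()
```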
`opaque list[ndarray]`
As above, but wrapped by the dummy `class _List(list)` defined by the shuffle module. This opaque object is serialized/deserialized by pickle5, which extracts the buffers.
deep-copies: 0
distributed.protocol.serialize: trivial usage
number of buffers: 802 (for 2 kiB shards)
Note: this is what's used for p2p rechunk downstream of #8282.
`bytes`
The list of numpy arrays is manually serialized by pickle, without buffer_callback, into a single monolithic bytes object, ahead of being sent to the network stack.
The dask serialization stack passes on the bytes object verbatim as a buffer, without a further deep copy, thanks to `serialize_bytes`.
deep-copies: 2 (pickle.dumps; pickle.loads)
distributed.protocol.serialize: trivial usage
number of buffers: 2
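A minimal sketch of this encoding (again an illustration, not the benchmark code itself):

```python
import pickle
import numpy as np
from distributed.protocol.serialize import serialize

shards = [np.random.random(256) for _ in range(100)]

blob = pickle.dumps(shards)           # deep copy #1: all shards copied into one bytes object
header, frames = serialize(blob)      # the dask serializer forwards the blob as a single frame
assert len(frames) == 1
restored = pickle.loads(frames[0])    # deep copy #2 on the receiving side
assert all((a == b).all() for a, b in zip(restored, shards))
```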
`list[bytes]`
The numpy arrays are manually serialized with pickle, without buffer_callback, and are sent to the network stack as a list of individually small-ish bytes blobs.
The network stack traverses the list and extracts the individual bytes objects into buffers, which are sent individually.
Upon reception, the buffers have been transformed into memoryviews of `numpy.empty`'s, and are unnecessarily deep-copied back into bytes.
After reception, we manually unpickle the `list[bytes]` into a `list[ndarray]`.
deep-copies: 3 (pickle.dumps; _deserialize_bytes; pickle.loads)
distributed.protocol.serialize: heavy usage
number of buffers: 802 (for 2 kiB shards)
`opaque list[bytes]`
The numpy arrays are manually serialized with pickle, without buffer_callback, and are sent to the network stack wrapped into an opaque container.
The network stack calls pickle again, and since bytes serialization with vanilla pickle doesn't extract the buffers even when buffer_callback is specified, they are deep-copied into a single monolithic bytes object, which is then sent in one go over the network. On the other side, the monolithic bytes blob is deep-copied and deserialized into `_List[bytes]`, and then again we call pickle.loads to convert that into a `_List[ndarray]`.
deep-copies: 4 (pickle.dumps; pickle.dumps (dask_serialize of unknown object); pickle.loads (dask_deserialize of unknown object); pickle.loads)
distributed.protocol.serialize: trivial usage
number of buffers: 2
Note: this is what's used for p2p rechunk upstream of #8282, and is what's still used for dataframe shuffle (with the difference that we use parquet instead of pickle for serialization into bytes blobs).
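The reason the opaque container forces the extra deep-copies is that vanilla pickle keeps bytes objects in-band even when a buffer_callback is supplied; a minimal sketch (an illustration, not code from this PR):

```python
import pickle

blob = b"x" * 8192                        # a manually pickled shard, already a bytes object
buffers = []
payload = pickle.dumps([blob], protocol=5, buffer_callback=buffers.append)

assert buffers == []                      # nothing is extracted out-of-band
assert len(payload) > len(blob)           # the blob was deep-copied into the pickle stream
```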
Observations
This PR
This PR performs a simple tweak when sending many small buffers; it mitigates the regression introduced by #8282 but doesn't entirely fix it. A shuffle-specific hack is coming in a separate PR.
Benchmark of this PR: