
The buffer of embedded numpy variables is deep-copied in client->scheduler comms #8608

Open · crusaderky opened this issue Apr 3, 2024 · 2 comments
Labels: p3 (Affects a small number of users or is largely cosmetic)

@crusaderky (Collaborator)

import distributed
import numpy as np

if __name__ == "__main__":
    # Two in-process workers, but force real (de)serialization over TCP comms
    with distributed.Client(n_workers=2, processes=False, protocol="tcp") as client:
        a, b = client.has_what()  # the two workers' addresses
        # Create x on worker a, then pass it through an identity task on worker b
        x = client.submit(np.random.random, 1024, key="x", workers=[a])
        y = client.submit(lambda x: x, x, key="y", workers=[b])
        y.result()

When the payload reaches distributed.protocol.serialize.pickle_loads, buffers[0] is a bytes object, i.e. read-only. Since the header records that the original buffer was writeable, this causes pickle_loads to deep-copy the buffer in order to honour the writeable flag of the original.

To verify, add at the top of pickle_loads:

print(header["writeable"], [ensure_memoryview(b).readonly for b in buffers])
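For context, here's a minimal standalone sketch (plain pickle, no distributed involved) of the protocol-5 out-of-band mechanism that puts pickle_loads in this position: once the comms layer delivers the out-of-band frames as immutable bytes, the only way to restore writeable=True is a deep copy:

import pickle
import numpy as np

x = np.random.random(1024)  # writeable array on the sending side
buffers = []
# Protocol 5 ships the array's buffer out-of-band instead of inline
payload = pickle.dumps(x, protocol=5, buffer_callback=buffers.append)
# Comms that hand the frames back as immutable bytes lose the mutability
frames = [bytes(b) for b in buffers]
y = pickle.loads(payload, buffers=frames)
print(y.flags.writeable)  # False: restoring writeability requires a copy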

What's causing me a migraine is:

  • if you replace

        x = client.submit(np.random.random, 1024, key="x", workers=[a])

    with

        x = client.submit(lambda: np.random.random(1024), key="x", workers=[a])

    then the numpy object is no longer deserialized by distributed.protocol.serialize.pickle_loads; instead it is processed by distributed.protocol.numpy.deserialize_numpy_array, which receives a writeable buffer

  • if you replace the submit API with dask.array:

        import dask.array as da

        x = da.random.random(1024).persist(workers=[a])
        y = x.map_blocks(lambda x: x).persist(workers=[b])
        y.compute()

    then we are again using distributed.protocol.serialize.pickle_loads, which receives a read-only buffer; but this time the writeable flag is False, so no deep copy happens.
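To summarise the cases above as I read them, only one of the flag/frame combinations should force a deep copy (a sketch; restore is a hypothetical helper, not the actual pickle_loads code):

def restore(writeable: bool, frame):
    # Per out-of-band buffer: reconcile the recorded flag with the frame
    mv = memoryview(frame)
    if writeable and mv.readonly:
        return bytearray(frame)  # deep copy: the np.random.random case
    if not writeable and not mv.readonly:
        return mv.toreadonly()   # zero-copy read-only view
    return frame                 # flags already agree: zero-copy

restore(True, bytes(8))   # copies
restore(False, bytes(8))  # zero-copy: the dask.array case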

@crusaderky (Collaborator, Author)

OK, found the difference: the deep copy is NOT triggered by the transfer of the task output from a to b; it's the random seed, sent from the client to the scheduler, that gets deep-copied. 🤦
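A quick way to see the seed travelling (a standalone sketch; I believe the key array goes out-of-band because np.random.random is a bound method of the module-level RandomState):

import pickle
import numpy as np

# Pickling the *function* also pickles the RandomState it is bound to,
# including the MT19937 key (624 uint32 words, ~2.5 kB)
buffers = []
pickle.dumps(np.random.random, protocol=5, buffer_callback=buffers.append)
print([memoryview(b).nbytes for b in buffers])  # the seed buffer(s)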

@crusaderky changed the title from "Buffer is deep-copied in weird edge case" to "random seed is deep-copied in edge case" Apr 3, 2024
@crusaderky added the p3 label and removed the needs triage label Apr 3, 2024
@crusaderky changed the title from "random seed is deep-copied in edge case" to "np.random.RandomState buffer is deep-copied in edge case" Apr 3, 2024
@crusaderky changed the title from "np.random.RandomState buffer is deep-copied in edge case" to "The buffer of embedded numpy variables is deep-copied from client to scheduler" Apr 3, 2024
@crusaderky (Collaborator, Author) commented Apr 3, 2024

This issue applies to all embedded variables that are sent from client to scheduler, e.g.

x = client.submit(lambda x: x + 1, np.random.random(1024), key="x", workers=[a])
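If the analysis above is right, a possible client-side mitigation (untested sketch, reusing client and a from the reproducer) is to mark the embedded array read-only before submitting, so that writeable=False is recorded and the receiving side stays zero-copy:

arg = np.random.random(1024)
arg.flags.writeable = False  # recorded in the header; honoured without a copy
x = client.submit(lambda v: v + 1, arg, key="x", workers=[a])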

@crusaderky changed the title from "The buffer of embedded numpy variables is deep-copied from client to scheduler" to "The buffer of embedded numpy variables is deep-copied in client->scheduler comms" Apr 4, 2024