I suspect there's an issue unpickling Clients that use Security. Client.__setstate__ calls get_client when there is no global Client:
distributed/client.py, lines 362 to 367 in 7e10875:

def __setstate__(self, state):
    key, address = state
    try:
        c = Client.current(allow_global=False)
    except ValueError:
        c = get_client(address)
This ends up falling through to constructing a Client here (distributed/worker.py, lines 3354 to 3355 in 7e10875):

elif address:
    return Client(address, timeout=timeout)
If the address is tls:// (or anything that requires Security), then constructing that Client will fail, since we're not passing in a security= kwarg.
Including security state in Client.__getstate__ may not be desirable. But at the least, it could be nice to check for this case in __getstate__, and raise a more informative error.
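As a rough sketch of that second option, a __getstate__ check could refuse to pickle a client whose security configuration would be dropped. The classes below are stubs, not the real distributed API, and the require_encryption attribute name is assumed for illustration:

```python
import pickle

class Security:
    """Stand-in for distributed.security.Security (attribute name assumed)."""
    def __init__(self, require_encryption=False):
        self.require_encryption = require_encryption

class Client:
    """Minimal stand-in sketching the proposed __getstate__ check."""
    def __init__(self, address, security=None):
        self.address = address
        self.security = security if security is not None else Security()

    def __getstate__(self):
        # Proposed: fail loudly instead of silently dropping security state,
        # which would leave the unpickled copy unable to reconnect over TLS.
        if self.security.require_encryption:
            raise RuntimeError(
                "Cannot pickle this Client: its Security configuration would "
                f"be lost, so it could not reconnect to {self.address}."
            )
        return (self.address,)

secure = Client("tls://10.7.0.60:8786", Security(require_encryption=True))
try:
    pickle.dumps(secure)
except RuntimeError as exc:
    print("informative error:", exc)
```

The error surfaces at pickling time on the client, where the user can act on it, rather than as a confusing SSL failure inside a worker.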
Minimal Complete Verifiable Example:
This makes a cluster with Coiled, both because it's a convenient way to get a TLS cluster, and because it's actually necessary to use a cluster where the internal scheduler address differs from the public one to reproduce this error.
import coiled
import dask
import distributed
cluster = coiled.Cluster(n_workers=1)
client = distributed.Client(cluster)
ts = dask.datasets.timeseries(freq="1d")
tsp = ts.persist()
client.run(lambda: tsp.mean().compute())
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
...
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/core.py in handle_comm()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/scheduler.py in broadcast()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils.py in All()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/scheduler.py in send_message()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/core.py in send_recv()
Exception: TLS expects a `ssl_context` argument of type ssl.SSLContext (perhaps check your TLS configuration?) Instead got None
This is obviously a bad and contrived thing to do. However, this comes up more commonly in cases where people misuse map_partitions or map_blocks, and (accidentally) trigger a compute within their mapper function:
import numpy as np

ts.map_partitions(
    lambda part: part.assign(x=part.x + np.array(tsp.x).mean()),
    # ^ `np.array(tsp.x)` implicitly computes the series
    meta=ts,
).compute()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
...
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/protocol/pickle.py in loads()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in __setstate__()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py in get_client()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in __init__()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in start()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils.py in sync()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils.py in f()
/opt/conda/envs/coiled/lib/python3.8/site-packages/tornado/gen.py in run()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in _start()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in _ensure_connected()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/comm/core.py in connect()
/opt/conda/envs/coiled/lib/python3.8/asyncio/tasks.py in wait_for()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/comm/tcp.py in connect()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/comm/tcp.py in _get_connect_args()
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/comm/tcp.py in _expect_tls_context()
TypeError: TLS expects a `ssl_context` argument of type ssl.SSLContext (perhaps check your TLS configuration?) Instead got None
This example is also rather contrived, but the point is that the actual mistake (triggering a compute within a compute) is masked by this misleading error about SSL.
Anything else we need to know?:
This error case can probably only be triggered on a networked cluster behind a proxy—it depends on the workers connecting to a different scheduler address than the client does:
distributed/worker.py, lines 3338 to 3344 in 7e10875:

try:
    worker = get_worker()
except ValueError:  # could not find worker
    pass
else:
    if not address or worker.scheduler.address == address:
        return worker._get_client(timeout=timeout)
In [20]: client.run(lambda: distributed.get_worker().scheduler.address)
Out[20]: {'tls://10.7.1.134:45125': 'tls://10.7.0.60:8786'}
In [21]: client.scheduler.address
Out[21]: 'tls://18.237.105.241:8786'
Notice how the client and worker are connecting to different IPs for the scheduler. So the worker.scheduler.address == address check fails, causing us to try to construct a new Client connecting out over the Internet back to the scheduler, for which we don't have the necessary TLS info.
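The fallthrough can be sketched with plain stubs (this mimics only the address comparison in get_client, not the real distributed API):

```python
# Stub objects only; the addresses mirror the In[20]/In[21] output above.
class Scheduler:
    def __init__(self, address):
        self.address = address

class Worker:
    def __init__(self, scheduler_address):
        self.scheduler = Scheduler(scheduler_address)

def get_client(worker, address):
    if not address or worker.scheduler.address == address:
        # Fast path: reuse the worker's existing (TLS-configured) client.
        return "reuse worker client"
    # Fallthrough: a brand-new Client(address) with no security= kwarg.
    return f"Client({address!r})"

worker = Worker("tls://10.7.0.60:8786")       # internal scheduler address
public = "tls://18.237.105.241:8786"          # address pickled by the client

print(get_client(worker, worker.scheduler.address))  # reuse worker client
print(get_client(worker, public))                    # fresh, insecure Client
```

On a cluster behind a proxy the pickled address is the public one, so the comparison always fails and the insecure branch is taken.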
Possible solutions:
- Should we make Client.__getstate__ include the client's Security, so secure connections can be unpickled and reconstructed?
- Should we not include Security in Client.__getstate__, and instead raise a clearer error when attempting to pickle a secure client?
- Should we not care about this at all, because calling compute within compute is bad?
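For illustration, the first option might look something like this sketch. The Security stub stands in for the real object, assuming it carries only picklable configuration (cert paths, PEM text) rather than a live ssl.SSLContext, which cannot be pickled:

```python
import pickle

class Security:
    """Stand-in holding picklable TLS config (attribute name assumed)."""
    def __init__(self, tls_ca_file=None):
        self.tls_ca_file = tls_ca_file

class Client:
    """Minimal stand-in for a Client that round-trips its security config."""
    def __init__(self, address, security=None):
        self.address = address
        self.security = security

    def __getstate__(self):
        # Option 1: carry the security config through the pickle round trip...
        return (self.address, self.security)

    def __setstate__(self, state):
        address, security = state
        # ...so the reconstructed client can pass security= when connecting.
        self.__init__(address, security=security)

c = Client("tls://10.7.0.60:8786", Security(tls_ca_file="ca.pem"))
c2 = pickle.loads(pickle.dumps(c))
print(c2.address, c2.security.tls_ca_file)
```

The real __setstate__ would also need to forward the restored security when it falls through to constructing a new Client in get_client.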
Environment:
- Dask version: 2021.5.0
- Python version: 3.8.7
- Operating System: macOS
- Install method (conda, pip, source): pip