TLS expects a ssl_context argument during nested compute #4854

@gjoseph92

Description

I suspect there's an issue unpickling Clients that use Security. Client.__setstate__ calls get_client when there is no global Client:

def __setstate__(self, state):
    key, address = state
    try:
        c = Client.current(allow_global=False)
    except ValueError:
        c = get_client(address)

This ends up falling through to constructing a Client here:

elif address:
    return Client(address, timeout=timeout)

If the address is tls:// (or anything that requires Security), then constructing that Client will fail, since we're not passing in a security= kwarg.

Including security state in Client.__getstate__ may not be desirable. But at the least, it could be nice to check for this case in __getstate__, and raise a more informative error.
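The state-loss mechanism can be seen with a toy class that follows the same pattern (ToyClient is a hypothetical stand-in, not distributed's actual code): pickle only (key, address), and on unpickle reconstruct from the address alone. Any constructor-only state, such as a security object, silently disappears across the round-trip.

```python
# Toy sketch (hypothetical names) of the pickling pattern described above:
# __getstate__ keeps only (id, address), so constructor-only state such as
# a Security object is silently lost across a pickle round-trip.
import pickle


class ToyClient:
    def __init__(self, address, security=None):
        self.address = address
        self.security = security  # dropped by __getstate__ below

    def __getstate__(self):
        # Mirrors Client.__getstate__: only identity and address survive.
        return (id(self), self.address)

    def __setstate__(self, state):
        # Mirrors the get_client fallthrough: reconstruct from the
        # address alone, with no security= kwarg.
        _key, address = state
        self.__init__(address)


original = ToyClient("tls://10.0.0.1:8786", security="ssl-context-here")
restored = pickle.loads(pickle.dumps(original))

print(restored.address)   # the address survives
print(restored.security)  # the security object does not: None
```

For a plain tcp:// address this is harmless, which is presumably why it hasn't bitten earlier; it only fails once the reconstructed connection actually requires the dropped state.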

Minimal Complete Verifiable Example:
This makes a cluster with Coiled, both because it's a convenient way to get a TLS cluster, and because it's actually necessary to use a cluster where the internal scheduler address differs from the public one to reproduce this error.

import coiled
import dask
import distributed

cluster = coiled.Cluster(n_workers=1)
client = distributed.Client(cluster)

ts = dask.datasets.timeseries(freq="1d")
tsp = ts.persist()

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
client.run(lambda: tsp.mean().compute())
...
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/core.py in handle_comm()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/scheduler.py in broadcast()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils.py in All()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/scheduler.py in send_message()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/core.py in send_recv()

Exception: TLS expects a `ssl_context` argument of type ssl.SSLContext (perhaps check your TLS configuration?)  Instead got None

This is obviously a bad and contrived thing to do. However, this comes up more commonly in cases where people misuse map_partitions or map_blocks, and (accidentally) trigger a compute within their mapper function:

ts.map_partitions(
    lambda part: part.assign(x=part.x + np.array(tsp.x).mean()),
    # ^ `np.array(tsp.x)` implicitly computes the series
    meta=ts,
).compute()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/protocol/pickle.py in loads()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in __setstate__()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py in get_client()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in __init__()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in start()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils.py in sync()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils.py in f()

/opt/conda/envs/coiled/lib/python3.8/site-packages/tornado/gen.py in run()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in _start()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/client.py in _ensure_connected()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/comm/core.py in connect()

/opt/conda/envs/coiled/lib/python3.8/asyncio/tasks.py in wait_for()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/comm/tcp.py in connect()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/comm/tcp.py in _get_connect_args()

/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/comm/tcp.py in _expect_tls_context()

TypeError: TLS expects a `ssl_context` argument of type ssl.SSLContext (perhaps check your TLS configuration?)  Instead got None

This example is also rather contrived, but the point is that the actual mistake (triggering a compute within a compute) is masked by this misleading error about SSL.

Anything else we need to know?:

This error case can probably only be triggered on a networked cluster behind a proxy—it depends on the workers connecting to a different scheduler address than the client does:

try:
    worker = get_worker()
except ValueError:  # could not find worker
    pass
else:
    if not address or worker.scheduler.address == address:
        return worker._get_client(timeout=timeout)

In [20]: client.run(lambda: distributed.get_worker().scheduler.address)
Out[20]: {'tls://10.7.1.134:45125': 'tls://10.7.0.60:8786'}

In [21]: client.scheduler.address
Out[21]: 'tls://18.237.105.241:8786'

Notice how the client and worker connect to different IPs for the scheduler. So the worker.scheduler.address == address check fails, and we fall through to constructing a new Client that connects back out over the Internet to the scheduler, without the TLS information it needs.
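The failing branch reduces to a plain string comparison. Using the addresses from the session output above (the worker's internal scheduler address vs. the public one the client pickled), the fast worker._get_client path is skipped:

```python
# Plain-string trace of the get_client branch above, using the addresses
# from the session output. Because the worker's internal scheduler address
# and the client's public address differ, the equality check fails and we
# fall through to Client(address, timeout=timeout).
worker_scheduler_address = "tls://10.7.0.60:8786"     # internal, behind proxy
pickled_client_address = "tls://18.237.105.241:8786"  # public

use_worker_client = worker_scheduler_address == pickled_client_address
print(use_worker_client)  # False: a new Client must be constructed
```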

Possible solutions:

  1. Should we make Client.__getstate__ include the client's Security, so secure connections can be unpickled and reconstructed?
  2. Should we not make Client.__getstate__ include Security, and instead raise a clearer error when attempting to pickle a secure client?
  3. Should we not care about this at all, because calling compute within compute is bad?
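A minimal sketch of option 1, again on a hypothetical toy class rather than distributed itself: carry the security object through __getstate__ so __setstate__ can pass it back to the constructor. Whether Security objects (which may reference certificates and keys) should be picklable at all is exactly the open question.

```python
# Sketch of option 1 on a hypothetical ToyClient: include the security
# object in the pickled state so reconnection can restore it.
import pickle


class ToyClient:
    def __init__(self, address, security=None):
        self.address = address
        self.security = security

    def __getstate__(self):
        # Option 1: include security in the pickled state...
        return (id(self), self.address, self.security)

    def __setstate__(self, state):
        # ...so the reconstructed client gets it back.
        _key, address, security = state
        self.__init__(address, security=security)


original = ToyClient("tls://10.0.0.1:8786", security="ssl-context-here")
restored = pickle.loads(pickle.dumps(original))
print(restored.security)  # security survives the round-trip
```

Option 2 would instead keep __getstate__ as-is but raise a descriptive error (e.g. "cannot unpickle a Client that requires TLS without a global client") so the real mistake surfaces instead of the misleading ssl_context message.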

Environment:

  • Dask version: 2021.5.0
  • Python version: 3.8.7
  • Operating System: macOS
  • Install method (conda, pip, source): pip
