Client errors if client=2.12.0 and scheduler=2.13.0 instead of mismatch warning #3659

grantgustafson · 2020-03-29T21:01:24Z

If the client and scheduler versions are mismatched such that client=2.12.0 and scheduler=2.13.0, the client errors out on instantiation with assert msg[0]["op"] == "stream-start" instead of showing dask's helpful version mismatch warning. The scheduler's logs show an error at versions.py, line 125: TypeError: tuple indices must be integers or slices, not str. Aligning the versions fixes the problem, of course, but the usual warning never appears. This looks related to https://github.com/dask/distributed/pull/3567/files

quasiben · 2020-03-30T02:42:45Z

@grantgustafson, thanks for the report. We saw this in rapidsai/ucx-py#459 as well. Any interest in fixing the underlying issue? If not, no pressure.

consideRatio · 2020-03-30T10:35:56Z

I ran into this issue as well when switching from 2.12.0 to 2.13.0. I noticed it when I tried to connect to my scheduler. In my case I'm quite confident it's not a version mismatch, because my user environment, dask scheduler environment, and dask worker environment all use the same Docker image. Also, this configuration worked in 2.12.0.

I get the error when I connect to my scheduler.

import os

from dask.distributed import Client

# DASK_SCHEDULER holds the scheduler's host:port
client = Client("tcp://" + os.environ["DASK_SCHEDULER"])
client

The error on the scheduler, which keeps running after logging it, is:

distributed.scheduler - INFO - Receive client connection: Client-ab77ff64-726e-11ea-806a-ea277dde72e2
distributed.scheduler - INFO - Close client connection: Client-ab77ff64-726e-11ea-806a-ea277dde72e2
distributed.core - ERROR - tuple indices must be integers or slices, not str
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 411, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 2505, in add_client
    versions,
  File "/opt/conda/lib/python3.6/site-packages/distributed/versions.py", line 125, in error_message
    node_packages[node]["python"] = info["host"]["python"]
TypeError: tuple indices must be integers or slices, not str

I think this is directly related to #3567, which introduced the line node_packages[node]["python"] = info["host"]["python"]. I assume it wrongly presumes that one of info, info["host"], node_packages, or node_packages[node] is a dictionary, when it is actually a tuple.

In [2]: ("this", "is", "a", "tuple")["used_as_a_dict"]                                                                                               
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-436b71a0a741> in <module>
----> 1 ("this", "is", "a", "tuple")["used_as_a_dict"]

TypeError: tuple indices must be integers or slices, not str

I'll look into this, and I'm happy to submit a PR.

consideRatio · 2020-03-30T10:57:00Z

def error_message(scheduler, workers, client, client_name="client"):
    from .utils import asciitable
    nodes = {**{client_name: client}, **{"scheduler": scheduler}, **workers}
    # Hold all versions, e.g. versions["scheduler"]["distributed"] = 2.9.3
    node_packages = defaultdict(dict)

    # Collect all package versions
    packages = set()

    for node, info in nodes.items():
        if info is None or not (isinstance(info, dict)) or "packages" not in info:
            node_packages[node] = defaultdict(lambda: "UNKNOWN")
        else:
            node_packages[node] = defaultdict(lambda: "MISSING")
            for pkg, version in info["packages"].items():
                node_packages[node][pkg] = version
                packages.add(pkg)
            # Collect Python version for each node
            node_packages[node]["python"] = info["host"]["python"]
            packages.add("python")

Since node_packages = defaultdict(dict) and node_packages[node] = defaultdict(lambda: "MISSING"), I conclude it is either info or info["host"] that returned an unexpected tuple.

And since info["packages"] was used in the for loop above without issues, I assume info["host"] must be the unexpected tuple, which we can't index with ["python"] as if it were a dict.

So, what is info["host"]?

What is info? It takes on every value in the nodes dictionary. And since the error occurs inside the else clause, we know that...

info isn't None
info is an instance of a dictionary
info has a key named packages in it

Since info is a value in the nodes dictionary, and nodes is a merge of the client, scheduler, and workers entries, the key question is: which of these values is a dictionary with keys named packages and host, where the value under host is a tuple?
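
One way to answer that question is to scan the same nodes mapping and report every entry that would reach the else branch but whose host value is not a dict. This is only a sketch: find_non_dict_hosts is a hypothetical helper, not part of distributed, and the sample data below (addresses, version strings, and the shape of the client's host value) is made up for illustration.

```python
def find_non_dict_hosts(nodes):
    """Return {node: type name of info["host"]} for suspicious nodes.

    Hypothetical debugging helper. It applies the same guard as
    error_message, so only nodes that reach the else branch are checked.
    """
    suspects = {}
    for node, info in nodes.items():
        if info is None or not isinstance(info, dict) or "packages" not in info:
            continue  # error_message would mark these as UNKNOWN
        host = info.get("host")
        if not isinstance(host, dict):
            suspects[node] = type(host).__name__
    return suspects


# Made-up sample data: the client's host is a tuple of (key, value) pairs.
sample_nodes = {
    "client": {
        "packages": {"distributed": "2.12.0"},
        "host": (("python", "3.6.7"),),
    },
    "scheduler": {
        "packages": {"distributed": "2.13.0"},
        "host": {"python": "3.6.7"},
    },
    "tcp://10.0.0.1:1234": None,
}

print(find_non_dict_hosts(sample_nodes))  # {'client': 'tuple'}
```

Calling a helper like this just before the failing line, or from a debugger on the scheduler, would name the node whose host cannot be indexed with ["python"].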

consideRatio · 2020-03-30T11:22:45Z

I learned that I can avoid triggering this when connecting to the scheduler by scaling my workers down to zero. And when I scaled my workers back up, everything was fine again.

Perhaps this was caused by the scheduler assuming that no old worker was registered? Or the issue is still there, but I didn't trigger the error_message call that in turn triggers this error?

Hmmm...

consideRatio · 2020-03-30T13:26:27Z

I've gotten stuck in my debugging efforts and hope that someone can continue from here. @jrbourbeau, can you tell what caused this issue?

It may be that we got into a bad state by upgrading our scheduler and workers, and that we only need to document that when upgrading to 2.13.0, the workers must be fully shut down, then the scheduler updated, then the workers added back.

jrbourbeau · 2020-03-30T15:33:35Z

Thanks @grantgustafson for raising an issue and @consideRatio for investigating. I was able to reproduce the TypeError using client=2.12.0 and scheduler=2.13.0. Changing the value returned from get_system_info from a list of tuples to a dictionary in #3567 seems to have introduced the error. This was an oversight on my part. Reverting the change (i.e. get_system_info returns a list of tuples instead of a dictionary) resolves the issue moving forward.
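
To make the shape mismatch concrete, here is a minimal sketch. It assumes the host info from one side arrives as a sequence of (key, value) pairs (a tuple by the time it is indexed, judging by the traceback) while the code indexing it expects a dictionary. host_python is a hypothetical helper and the version string is made up. This is not the change made in #3660, just an illustration of reading the value defensively from either shape.

```python
# Made-up host info in the two shapes discussed above.
pairs_style_host = (("python", "3.6.7"), ("OS", "Linux"))
dict_style_host = {"python": "3.6.7", "OS": "Linux"}

# Indexing the pairs-style value with a string reproduces the TypeError.
try:
    pairs_style_host["python"]
except TypeError as exc:
    error_text = str(exc)


def host_python(host):
    """Return the Python version from either shape, else "UNKNOWN".

    Hypothetical helper for illustration only.
    """
    if isinstance(host, dict):
        return host.get("python", "UNKNOWN")
    try:
        # A sequence of (key, value) pairs converts cleanly to a dict.
        return dict(host).get("python", "UNKNOWN")
    except (TypeError, ValueError):
        return "UNKNOWN"


print(error_text)
print(host_python(pairs_style_host), host_python(dict_style_host))
```

Falling back to "UNKNOWN" mirrors how error_message already treats nodes without usable info, so a version table could still be rendered rather than raising.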

jrbourbeau mentioned this issue Mar 30, 2020

Update Python version checking #3660

Merged

quasiben closed this as completed in #3660 Mar 30, 2020
