Client errors if client=2.12.0 and scheduler=2.13.0 instead of mismatch warning #3659

Closed
grantgustafson opened this issue Mar 29, 2020 · 6 comments · Fixed by #3660


@grantgustafson

grantgustafson commented Mar 29, 2020

If the client and scheduler versions are mismatched such that client=2.12.0 and scheduler=2.13.0, the client errors out on instantiation with assert msg[0]["op"] == "stream-start" instead of showing dask's helpful version mismatch warning. The scheduler's logs show an error at versions.py, line 125: TypeError: tuple indices must be integers or slices, not str. Aligning the versions fixes the problem, but the usual warning never appears. This looks related to https://github.com/dask/distributed/pull/3567/files

@quasiben
Member

@grantgustafson, thanks for the report. We saw this in rapidsai/ucx-py#459 as well. Any interest in fixing the underlying issue? If not, no pressure.

@consideRatio
Contributor

consideRatio commented Mar 30, 2020

I ran into this issue as well when switching from 2.12.0 to 2.13.0. I noticed it when I tried to connect to my scheduler. I'm quite confident it's not a version mismatch in my case, because my user environment, dask scheduler environment, and dask worker environment all use the same Docker image. Also, this configuration worked in 2.12.0.

I get the error when I connect to my scheduler:

import os

from dask.distributed import Client

# DASK_SCHEDULER holds the scheduler's host:port
client = Client("tcp://" + os.environ["DASK_SCHEDULER"])
client

The scheduler, which keeps running after logging it, reports this error:

distributed.scheduler - INFO - Receive client connection: Client-ab77ff64-726e-11ea-806a-ea277dde72e2
distributed.scheduler - INFO - Close client connection: Client-ab77ff64-726e-11ea-806a-ea277dde72e2
distributed.core - ERROR - tuple indices must be integers or slices, not str
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 411, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 2505, in add_client
    versions,
  File "/opt/conda/lib/python3.6/site-packages/distributed/versions.py", line 125, in error_message
    node_packages[node]["python"] = info["host"]["python"]
TypeError: tuple indices must be integers or slices, not str

I think this is directly related to #3567, which introduced the line node_packages[node]["python"] = info["host"]["python"]. I assume it wrongly expects one of info, info["host"], node_packages, or node_packages[node] to be a dictionary when it is actually a tuple.

In [2]: ("this", "is", "a", "tuple")["used_as_a_dict"]                                                                                               
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-436b71a0a741> in <module>
----> 1 ("this", "is", "a", "tuple")["used_as_a_dict"]

TypeError: tuple indices must be integers or slices, not str

I'll look into this, and am happy to submit a PR.

@consideRatio
Contributor

consideRatio commented Mar 30, 2020

def error_message(scheduler, workers, client, client_name="client"):
    from .utils import asciitable
    nodes = {**{client_name: client}, **{"scheduler": scheduler}, **workers}
    # Hold all versions, e.g. versions["scheduler"]["distributed"] = 2.9.3
    node_packages = defaultdict(dict)

    # Collect all package versions
    packages = set()

    for node, info in nodes.items():
        if info is None or not (isinstance(info, dict)) or "packages" not in info:
            node_packages[node] = defaultdict(lambda: "UNKNOWN")
        else:
            node_packages[node] = defaultdict(lambda: "MISSING")
            for pkg, version in info["packages"].items():
                node_packages[node][pkg] = version
                packages.add(pkg)
            # Collect Python version for each node
            node_packages[node]["python"] = info["host"]["python"]
            packages.add("python")

Since node_packages = defaultdict(dict) and node_packages[node] = defaultdict(lambda: "MISSING"), I conclude it is either info or info["host"] that returned an unexpected tuple.

And, since info["packages"] was used in the for loop above without issues, I assume it must be info["host"] that is the unexpected tuple, which we can't index with ["python"] as if it were a dict.

So, what is info["host"]?

What is info? It takes on every value in the nodes dictionary. When the error occurs we are in the else clause, so we know that...

  • info isn't None
  • info is an instance of a dictionary
  • info has a key named packages in it

Since info is a value in the nodes dictionary, and nodes is the merged dictionary of client, scheduler, and workers, the key is to figure out which of these values is a dictionary with keys named packages and host, where the value under host is a tuple.
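One way to narrow this down is to check the type of each node's host entry before it is indexed. This is a minimal sketch, assuming the same nodes layout as in error_message above. find_bad_hosts and the sample data are hypothetical and not part of distributed.

```python
def find_bad_hosts(nodes):
    """Return {node_name: type_name} for nodes whose "host" entry is not a dict."""
    bad = {}
    for node, info in nodes.items():
        # Mirror the guard in error_message: skip nodes without usable info.
        if not isinstance(info, dict) or "packages" not in info:
            continue
        host = info.get("host")
        if not isinstance(host, dict):
            bad[node] = type(host).__name__
    return bad


# Made-up sample data: the client reports "host" as a tuple of pairs,
# the scheduler reports it as a dict.
nodes = {
    "client": {"packages": {"distributed": "2.12.0"}, "host": (("python", "3.6.7"),)},
    "scheduler": {"packages": {"distributed": "2.13.0"}, "host": {"python": "3.6.7"}},
}
print(find_bad_hosts(nodes))
```

Running something like this inside the scheduler, just before the failing line, would show which of client, scheduler, or workers carries the tuple.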

@consideRatio
Contributor

I learned that I can avoid triggering this error when connecting to the scheduler by scaling my workers down to zero. When I scaled my workers back up, everything was fine again.

Perhaps this was caused by the scheduler assuming that no old workers were registered? Or the issue is still there, but I didn't trigger the error_message call that in turn raises this error?

Hmmm...

@consideRatio
Contributor

I'm stuck in my debugging efforts and hope that someone can continue from here. @jrbourbeau, can you tell what caused this issue?

It may be that we got into a bad state by upgrading our scheduler and workers, and that we only need to document that when upgrading to 2.13.0 the workers must be fully shut down first, then the scheduler updated, then the workers added back.

@jrbourbeau
Member

Thanks @grantgustafson for raising an issue and @consideRatio for investigating. I was able to reproduce the TypeError using client=2.12.0 and scheduler=2.13.0. Changing the value returned from get_system_info from a list of tuples to a dictionary in #3567 seems to have introduced the error. This was an oversight on my part. Reverting the change (i.e. get_system_info returns a list of tuples instead of a dictionary) resolves the issue moving forward.
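To illustrate the shape change, here is a minimal sketch of reading the Python version from either format. python_version_from_host is a hypothetical helper and the sample values are made up. It is not the fix described above, which reverts the return type instead.

```python
def python_version_from_host(host, default="UNKNOWN"):
    """Read the "python" entry whether host is a dict or a sequence of pairs."""
    if isinstance(host, dict):
        return host.get("python", default)
    try:
        # A sequence of (key, value) pairs; the traceback above shows it
        # reaching the scheduler as a tuple.
        return dict(host).get("python", default)
    except (TypeError, ValueError):
        # Not convertible to a mapping (e.g. None): fall back to the default.
        return default


old_style = (("python", "3.6.7.final.0"), ("OS", "Linux"))  # sequence of pairs
new_style = {"python": "3.6.7.final.0", "OS": "Linux"}      # dict, as after #3567

print(python_version_from_host(old_style))
print(python_version_from_host(new_style))
```

Indexing old_style["python"] directly raises the same TypeError seen in the scheduler log, which is why a 2.13.0 scheduler fails on the info sent by a 2.12.0 client.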
