
bokeh and tornado: KeyError: 'transfer' #7133

@arunoruto


Describe the issue:
When I open the scheduler dashboard, I cannot get to the Status tab: it always returns 500: Internal Server Error. The other tabs (Workers, Profile, etc.) work. The interface also feels somewhat sluggish.

Accessing the status tab results in the following log:

2022-10-12 14:36:27,136 - bokeh.core.property.validation - ERROR - 'transfer'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils.py", line 748, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/distributed/dashboard/components/scheduler.py", line 713, in update
    transfer_incoming_bytes = [
  File "/opt/conda/lib/python3.9/site-packages/distributed/dashboard/components/scheduler.py", line 714, in <listcomp>
    ws.metrics["transfer"]["incoming_bytes"] for ws in wss
KeyError: 'transfer'
2022-10-12 14:36:27,136 - bokeh.application.handlers.function - ERROR - 'transfer'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils.py", line 748, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/distributed/dashboard/components/scheduler.py", line 4252, in status_doc
    workers_transfer_bytes.update()
  File "/opt/conda/lib/python3.9/site-packages/bokeh/core/property/validation.py", line 95, in func
    return input_function(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils.py", line 748, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/distributed/dashboard/components/scheduler.py", line 713, in update
    transfer_incoming_bytes = [
  File "/opt/conda/lib/python3.9/site-packages/distributed/dashboard/components/scheduler.py", line 714, in <listcomp>
    ws.metrics["transfer"]["incoming_bytes"] for ws in wss
KeyError: 'transfer'
2022-10-12 14:36:27,137 - tornado.application - ERROR - Uncaught exception GET /status (172.18.0.5)
HTTPServerRequest(protocol='http', host='<my url>', method='GET', uri='/status', version='HTTP/1.1', remote_ip='172.18.0.5')
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/tornado/web.py", line 1704, in _execute
    result = await result
  File "/opt/conda/lib/python3.9/site-packages/bokeh/server/views/doc_handler.py", line 54, in get
    h.modify_document(doc)
  File "/opt/conda/lib/python3.9/site-packages/bokeh/application/handlers/function.py", line 143, in modify_document
    self._func(doc)
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils.py", line 748, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/distributed/dashboard/components/scheduler.py", line 4252, in status_doc
    workers_transfer_bytes.update()
  File "/opt/conda/lib/python3.9/site-packages/bokeh/core/property/validation.py", line 95, in func
    return input_function(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils.py", line 748, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/distributed/dashboard/components/scheduler.py", line 713, in update
    transfer_incoming_bytes = [
  File "/opt/conda/lib/python3.9/site-packages/distributed/dashboard/components/scheduler.py", line 714, in <listcomp>
    ws.metrics["transfer"]["incoming_bytes"] for ws in wss
KeyError: 'transfer'
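The traceback shows the dashboard's `update` method evaluating `ws.metrics["transfer"]["incoming_bytes"]` for every worker, so a single worker whose metrics dict has no `"transfer"` entry is enough to raise `KeyError` and fail the whole Status page. One plausible but unconfirmed cause is that some workers run a different `distributed` version than the scheduler and report a differently shaped metrics dict. The sketch below reproduces the pattern with plain dictionaries standing in for `ws.metrics`; the dictionaries and both helper names are hypothetical, and the tolerant variant only illustrates a defensive lookup. It is not the fix adopted upstream.

```python
# Hypothetical stand-ins for ws.metrics: the first mimics a worker that
# reports a "transfer" section, the second one that does not (assumption).
worker_metrics = [
    {"transfer": {"incoming_bytes": 1024, "outgoing_bytes": 2048}},
    {"cpu": 12.5, "memory": 1_000_000},
]


def incoming_bytes_strict(metrics_list):
    """Mirrors the failing expression: indexes "transfer" directly."""
    return [m["transfer"]["incoming_bytes"] for m in metrics_list]


def incoming_bytes_tolerant(metrics_list):
    """Falls back to 0 when a worker does not report transfer metrics."""
    return [m.get("transfer", {}).get("incoming_bytes", 0) for m in metrics_list]


try:
    incoming_bytes_strict(worker_metrics)
except KeyError as exc:
    print("KeyError:", exc)  # prints: KeyError: 'transfer'

print(incoming_bytes_tolerant(worker_metrics))  # prints: [1024, 0]
```

Masking the missing key hides the symptom, so checking why a worker lacks the entry is probably more useful than the fallback itself.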

The sluggishness may be caused by the following error:

2022-10-12 14:43:24,068 - tornado.application - ERROR - Uncaught exception GET /workers/ws (172.18.0.5)
HTTPServerRequest(protocol='http', host='<my url>', method='GET', uri='/workers/ws', version='HTTP/1.1', remote_ip='172.18.0.5')
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/tornado/web.py", line 3173, in wrapper
    return method(self, *args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/bokeh/server/views/ws.py", line 149, in open
    raise ProtocolError("Token is expired.")
bokeh.protocol.exceptions.ProtocolError: Token is expired.

Minimal Complete Verifiable Example:

Anything else we need to know?:
I am running RAPIDS (rapidsai) Docker containers to expose GPUs to the cluster.

Environment:

  • Dask version: 2022.9.2
  • Python version: 3.9
  • Operating System: docker
  • Install method (conda, pip, source): docker
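Since each container bundles its own `distributed`, it may be worth confirming that the scheduler and all workers run the same version. On a live cluster, `Client.get_versions(check=True)` performs this check and raises on a mismatch. The sketch below shows only the comparison step in isolation; the helper name, worker addresses, and version strings are hypothetical.

```python
from typing import Dict, List


def mismatched_workers(scheduler_version: str, worker_versions: Dict[str, str]) -> List[str]:
    """Return addresses of workers whose distributed version differs from the scheduler's."""
    return sorted(
        addr for addr, ver in worker_versions.items() if ver != scheduler_version
    )


# Hypothetical addresses and versions, for illustration only.
workers = {
    "tcp://10.0.0.2:40000": "2022.9.2",
    "tcp://10.0.0.3:40000": "2022.5.0",
}
print(mismatched_workers("2022.9.2", workers))  # prints: ['tcp://10.0.0.3:40000']
```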
