Promote LIFO scaling log to warning level (#812)
* Promote LIFO scaling log to warning level

* Add section on this to troubleshooting docs
jacobtomlinson committed Sep 8, 2023
1 parent ef94c94 commit bd1bb7e
Showing 3 changed files with 23 additions and 2 deletions.
5 changes: 3 additions & 2 deletions dask_kubernetes/operator/controller/controller.py
@@ -410,8 +410,9 @@ async def retire_workers(
         return workers_to_close

     # Finally fall back to last-in-first-out scaling
-    logger.debug(
-        f"Scaling {worker_group_name} failed via the Dask RPC, falling back to LIFO scaling"
+    logger.warning(
+        f"Scaling {worker_group_name} failed via the HTTP API and the Dask RPC, falling back to LIFO scaling. "
+        "This can result in lost data, see https://kubernetes.dask.org/en/latest/operator_troubleshooting.html."
     )
     workers = await kr8s.asyncio.get(
         "pods",
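For context, the hunk above is the last step of a retirement cascade: the controller first asks the scheduler which workers are safe to retire, and only falls back to pod age when both scheduler paths fail. Below is a minimal sketch of that pattern, not the controller's actual code: the signatures are simplified, the HTTP and RPC helpers are placeholders, and the dask.org/workergroup-name label selector is an assumption.

import logging

import kr8s.asyncio

logger = logging.getLogger(__name__)


async def retire_workers_via_http_api(n_workers):
    """Placeholder for asking the scheduler over its HTTP API."""
    raise NotImplementedError


async def retire_workers_via_rpc(n_workers):
    """Placeholder for asking the scheduler over the Dask RPC."""
    raise NotImplementedError


async def retire_workers(namespace, worker_group_name, n_workers):
    # Preferred paths: the scheduler knows which workers are idle and can
    # migrate their data off before they are removed.
    for strategy in (retire_workers_via_http_api, retire_workers_via_rpc):
        try:
            return await strategy(n_workers)
        except Exception:
            continue

    # Last resort: last-in-first-out by pod creation time. The newest pods
    # may still be holding task results, hence the warning about lost data.
    logger.warning(
        f"Scaling {worker_group_name} failed via the HTTP API and the Dask RPC, "
        "falling back to LIFO scaling. This can result in lost data, see "
        "https://kubernetes.dask.org/en/latest/operator_troubleshooting.html."
    )
    workers = await kr8s.asyncio.get(
        "pods",
        namespace=namespace,
        label_selector={"dask.org/workergroup-name": worker_group_name},  # assumed label
    )
    workers.sort(key=lambda pod: pod.metadata["creationTimestamp"], reverse=True)
    return [pod.name for pod in workers[:n_workers]]

LIFO is the conventional heuristic here because the newest workers have had the least time to accumulate state, but as the promoted warning notes, it is only a heuristic.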
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -85,6 +85,7 @@ and have the cluster running. You can then use it to manage scaling and retrieve
    operator_installation
    operator_resources
    operator_extending
+   operator_troubleshooting

 .. toctree::
    :maxdepth: 2
19 changes: 19 additions & 0 deletions doc/source/operator_troubleshooting.rst
@@ -0,0 +1,19 @@
+Troubleshooting
+===============
+
+This page contains common problems and resolutions.
+
+Why am I losing data during scale down?
+---------------------------------------
+
+When scaling down a cluster the controller will attempt to coordinate with the Dask scheduler to
+decide which workers to remove. If the controller cannot communicate with the scheduler it will fall
+back to last-in-first-out scaling and will remove the worker with the lowest uptime, even if that worker
+is actively processing data. This can result in loss of data and recomputation of the graph.
+
+This commonly happens if the version of Dask on the scheduler is very different to the version on the controller.
+
+To mitigate this Dask has an optional HTTP API which is more decoupled than the RPC and allows for better
+compatibility between versions.
+
+See `https://github.com/dask/dask-kubernetes/issues/807 <https://github.com/dask/dask-kubernetes/issues/807>`_
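The HTTP API referred to above exposes a /api/v1/retire_workers route on the scheduler. A hedged sketch of calling it from inside the cluster follows, assuming the API modules are enabled on the scheduler (e.g. by adding distributed.http.scheduler.api to the distributed.scheduler.http.routes config), that the running distributed version accepts an "n" parameter, and a hypothetical Service name; the default scheduler HTTP port is 8787.

import json
import urllib.request

# Hypothetical in-cluster address; substitute your scheduler Service and namespace.
scheduler = "http://mycluster-scheduler.default.svc.cluster.local:8787"

# Ask the scheduler to gracefully retire one worker. Unlike the LIFO fallback,
# the scheduler picks an idle worker and migrates its data before removal.
request = urllib.request.Request(
    f"{scheduler}/api/v1/retire_workers",
    data=json.dumps({"n": 1}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))  # details of the retired worker(s)

If the route returns a 404 the API is not enabled, and the controller will fall through to the Dask RPC and, failing that, to LIFO scaling, as the warning above describes.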
