Promote LIFO scaling log to warning level (#812)
* Promote LIFO scaling log to warning level

* Add section on this to troubleshooting docs
jacobtomlinson committed Sep 8, 2023
1 parent ef94c94 commit bd1bb7e
Showing 3 changed files with 23 additions and 2 deletions.
5 changes: 3 additions & 2 deletions dask_kubernetes/operator/controller/controller.py
@@ -410,8 +410,9 @@ async def retire_workers(
         return workers_to_close

     # Finally fall back to last-in-first-out scaling
-    logger.debug(
-        f"Scaling {worker_group_name} failed via the Dask RPC, falling back to LIFO scaling"
+    logger.warning(
+        f"Scaling {worker_group_name} failed via the HTTP API and the Dask RPC, falling back to LIFO scaling. "
+        "This can result in lost data, see https://kubernetes.dask.org/en/latest/operator_troubleshooting.html."
     )
     workers = await kr8s.asyncio.get(
         "pods",
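For context, the hunk above is the last step of a retirement cascade: the controller first asks the scheduler which workers are safe to retire, and only falls back to pod age when both scheduler paths fail. Below is a minimal sketch of that pattern, not the controller's actual code: the signatures are simplified, the HTTP and RPC helpers are placeholders, and the dask.org/workergroup-name label selector is an assumption.

import logging

import kr8s.asyncio

logger = logging.getLogger(__name__)


async def retire_workers_via_http_api(n_workers):
    """Placeholder for asking the scheduler over its HTTP API."""
    raise NotImplementedError


async def retire_workers_via_rpc(n_workers):
    """Placeholder for asking the scheduler over the Dask RPC."""
    raise NotImplementedError


async def retire_workers(namespace, worker_group_name, n_workers):
    # Preferred paths: the scheduler knows which workers are idle and can
    # migrate their data off before they are removed.
    for strategy in (retire_workers_via_http_api, retire_workers_via_rpc):
        try:
            return await strategy(n_workers)
        except Exception:
            continue

    # Last resort: last-in-first-out by pod creation time. The newest pods
    # may still be holding task results, hence the warning about lost data.
    logger.warning(
        f"Scaling {worker_group_name} failed via the HTTP API and the Dask RPC, "
        "falling back to LIFO scaling. This can result in lost data, see "
        "https://kubernetes.dask.org/en/latest/operator_troubleshooting.html."
    )
    workers = await kr8s.asyncio.get(
        "pods",
        namespace=namespace,
        label_selector={"dask.org/workergroup-name": worker_group_name},  # assumed label
    )
    workers.sort(key=lambda pod: pod.metadata["creationTimestamp"], reverse=True)
    return [pod.name for pod in workers[:n_workers]]

LIFO is the conventional heuristic here because the newest workers have had the least time to accumulate state, but as the promoted warning notes, it is only a heuristic.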
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -85,6 +85,7 @@ and have the cluster running. You can then use it to manage scaling and retrieve
    operator_installation
    operator_resources
    operator_extending
+   operator_troubleshooting

 .. toctree::
    :maxdepth: 2
19 changes: 19 additions & 0 deletions doc/source/operator_troubleshooting.rst
@@ -0,0 +1,19 @@
+Troubleshooting
+===============
+
+This page contains common problems and resolutions.
+
+Why am I losing data during scale down?
+---------------------------------------
+
+When scaling down a cluster the controller will attempt to coordinate with the Dask scheduler to
+decide which workers to remove. If the controller cannot communicate with the scheduler it will fall
+back to last-in-first-out scaling and will remove the worker with the lowest uptime, even if that worker
+is actively processing data. This can result in loss of data and recomputation of the graph.
+
+This commonly happens if the version of Dask on the scheduler is very different to the version on the controller.
+
+To mitigate this Dask has an optional HTTP API which is more decoupled than the RPC and allows for better
+compatibility between versions.
+
+See `https://github.com/dask/dask-kubernetes/issues/807 <https://github.com/dask/dask-kubernetes/issues/807>`_
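The HTTP API referred to above exposes a /api/v1/retire_workers route on the scheduler. A hedged sketch of calling it from inside the cluster follows, assuming the API modules are enabled on the scheduler (e.g. by adding distributed.http.scheduler.api to the distributed.scheduler.http.routes config), that the running distributed version accepts an "n" parameter, and a hypothetical Service name; the default scheduler HTTP port is 8787.

import json
import urllib.request

# Hypothetical in-cluster address; substitute your scheduler Service and namespace.
scheduler = "http://mycluster-scheduler.default.svc.cluster.local:8787"

# Ask the scheduler to gracefully retire one worker. Unlike the LIFO fallback,
# the scheduler picks an idle worker and migrates its data before removal.
request = urllib.request.Request(
    f"{scheduler}/api/v1/retire_workers",
    data=json.dumps({"n": 1}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))  # details of the retired worker(s)

If the route returns a 404 the API is not enabled, and the controller will fall through to the Dask RPC and, failing that, to LIFO scaling, as the warning above describes.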
