[ML] fix autoscaling bug where many jobs take a long time to open (#72423) (#72481)

This commit fixes two bugs:

First: it causes autoscaling to respect the `max_open_jobs` setting. This setting should probably be set to its maximum value of 512 when autoscaling is turned on, to minimize its impact on decision making
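
To illustrate the first fix, here is a minimal Python sketch (hypothetical names, not the actual Java implementation) of why a per-node capacity estimate has to be capped by `max_open_jobs` rather than derived from free memory alone:

```python
# Minimal sketch, assuming hypothetical names; the real logic lives in the
# ML autoscaling decider (Java), not here.

def node_job_capacity(node_free_bytes: int, avg_job_bytes: int,
                      open_jobs_on_node: int, max_open_jobs: int = 512) -> int:
    """How many more jobs a single node can accept."""
    by_memory = node_free_bytes // avg_job_bytes
    by_setting = max(0, max_open_jobs - open_jobs_on_node)
    # Respect whichever limit is reached first; ignoring max_open_jobs can make
    # the estimate of remaining room on existing nodes too optimistic.
    return min(by_memory, by_setting)


if __name__ == "__main__":
    # A node with plenty of free memory but already at the job-count limit
    # contributes no extra capacity, so autoscaling must not assume it does.
    print(node_job_capacity(node_free_bytes=64 * 2**30, avg_job_bytes=2**30,
                            open_jobs_on_node=512))  # -> 0
```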

Second: it fixes the following scenario:

 - Many jobs are sitting in the `opening` state. There is plenty of room in the cluster, but they are still waiting for assignment. This can be caused by very few, but large, machine learning nodes in conjunction with a low `xpack.ml.node_concurrent_job_allocations` (default is 2).
 - If autoscaling requests a size decision, the machine learning service sees that jobs are not assigned (even though they could be) and assumes there is not enough room in the cluster
 - ML requests a scale up
 - Rinse/repeat

^ This can occur until, at last, all jobs are assigned; at that point, a massive scale_down will occur. This can potentially cause the same pattern again because, in certain architectures (ESS), scaling down means adding a new node to the cluster and removing the old one (so jobs must be re-assigned, starting the cycle all over again).
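
To make the feedback loop concrete, below is a simplified Python sketch (hypothetical names again, not the actual Java service) contrasting the buggy signal, where any unassigned job is taken as proof the cluster is too small, with a corrected signal that ignores jobs that are only queued behind `xpack.ml.node_concurrent_job_allocations` and would fit on the existing nodes:

```python
# Simplified sketch, assuming hypothetical types and a first-fit placement
# check; the real decider also weighs job memory estimates and node roles.
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    memory_bytes: int
    assigned: bool


def needs_scale_up_buggy(jobs: List[Job]) -> bool:
    # Any waiting job triggers a scale-up request, even if it would fit.
    return any(not j.assigned for j in jobs)


def needs_scale_up_fixed(jobs: List[Job], free_bytes_per_node: List[int]) -> bool:
    # Only jobs that cannot be placed on any existing node count as unsatisfied.
    free = sorted(free_bytes_per_node)
    for job in sorted((j for j in jobs if not j.assigned),
                      key=lambda j: j.memory_bytes):
        for i, node_free in enumerate(free):
            if node_free >= job.memory_bytes:
                free[i] -= job.memory_bytes  # job fits; no new capacity needed
                break
        else:
            return True  # genuinely no room anywhere
    return False


if __name__ == "__main__":
    # Two opening jobs throttled by node_concurrent_job_allocations, but the
    # single large node has room: the buggy check scales up, the fixed one does not.
    jobs = [Job(2**30, False), Job(2**30, False)]
    print(needs_scale_up_buggy(jobs))               # True
    print(needs_scale_up_fixed(jobs, [8 * 2**30]))  # False
```

This first-fit check is only an illustration; the point is that "waiting" alone is not evidence of insufficient capacity when assignment is throttled by `xpack.ml.node_concurrent_job_allocations`.
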
benwtrent committed Apr 29, 2021
1 parent cf1dc2e commit 72af925
Showing 3 changed files with 373 additions and 134 deletions.
