Commit
[ML] fix autoscaling bug where many jobs take a long time to open (#72423) (#72481)

This commit fixes two bugs.

First, it causes autoscaling to respect the `max_open_jobs` setting. If autoscaling is turned on, this setting should probably be set to its maximum value of 512 to minimize its impact on decision making.

Second, it fixes the following scenario:

- Many jobs are sitting in the `opening` state. There is plenty of room in the cluster, but they are still waiting for assignment. This could be caused by very few, but large, machine learning nodes in conjunction with a low `xpack.ml.node_concurrent_job_allocations` (default is 2).
- If autoscaling requests a size decision, the machine learning service sees that jobs are not assigned (even though they could be) and assumes there is not enough room in the cluster.
- ML requests a scale up.
- Rinse and repeat.

This can continue until at last all jobs are assigned; at that point, a massive scale_down will occur. That may trigger the same pattern again, because in certain architectures (ESS) scaling down means adding a new node to the cluster and removing the old one, so jobs must be re-assigned, starting the cycle all over again.
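
The distinction at the heart of the second scenario can be illustrated with a small sketch: a job that is still waiting for assignment counts toward a scale up only if it could not be placed on any existing node. This is not the Elasticsearch implementation; `Node`, `needs_scale_up`, and the first-fit placement are hypothetical names and simplifications chosen for illustration, and memory is the only resource modelled besides the open-job limit.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical, simplified view of an ML node.
    free_memory_mb: int
    open_jobs: int


def needs_scale_up(nodes, waiting_job_memory_mb, max_open_jobs=512):
    """Return True only if some waiting job cannot be placed on a current node.

    Illustrative assumption: a simple first-fit pass over the largest jobs
    first, which may report True in rare cases where a cleverer packing exists.
    """
    # Track remaining capacity locally so the caller's nodes are not mutated.
    free = [n.free_memory_mb for n in nodes]
    slots = [max_open_jobs - n.open_jobs for n in nodes]

    for mem in sorted(waiting_job_memory_mb, reverse=True):
        for i in range(len(nodes)):
            if free[i] >= mem and slots[i] > 0:
                free[i] -= mem
                slots[i] -= 1
                break
        else:
            # No existing node has both the memory and a free job slot.
            return True
    return False
```

With one large node and five small waiting jobs, `needs_scale_up([Node(10000, 0)], [1000] * 5)` returns `False`: the jobs are merely waiting on assignment, so no scale up is warranted. Lowering `max_open_jobs` to 2 in the same call returns `True`, which is the kind of effect the first fix makes the autoscaling decision account for.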