[ML] fix autoscaling bug where many jobs take a long time to open (#72423) (#72481)

This commit fixes two bugs:

First: it causes autoscaling to respect the `max_open_jobs` setting. This setting should probably be set to its maximum value of 512 when autoscaling is turned on, to minimize its impact on decision making
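
To illustrate the first fix, here is a minimal Python sketch (hypothetical names, not the actual Java implementation) of why a per-node capacity estimate has to be capped by `max_open_jobs` rather than derived from free memory alone:

```python
# Minimal sketch, assuming hypothetical names; the real logic lives in the
# ML autoscaling decider (Java), not here.

def node_job_capacity(node_free_bytes: int, avg_job_bytes: int,
                      open_jobs_on_node: int, max_open_jobs: int = 512) -> int:
    """How many more jobs a single node can accept."""
    by_memory = node_free_bytes // avg_job_bytes
    by_setting = max(0, max_open_jobs - open_jobs_on_node)
    # Respect whichever limit is reached first; ignoring max_open_jobs can make
    # the estimate of remaining room on existing nodes too optimistic.
    return min(by_memory, by_setting)


if __name__ == "__main__":
    # A node with plenty of free memory but already at the job-count limit
    # contributes no extra capacity, so autoscaling must not assume it does.
    print(node_job_capacity(node_free_bytes=64 * 2**30, avg_job_bytes=2**30,
                            open_jobs_on_node=512))  # -> 0
```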

Second: it fixes the following scenario:

 - Many jobs are sitting in the `opening` state. There is plenty of room in the cluster, but they are still waiting for assignment. This can be caused by very few, but large, machine learning nodes in conjunction with a low `xpack.ml.node_concurrent_job_allocations` (default is 2).
 - If autoscaling requests a size decision, the machine learning service sees that jobs are not assigned (even though they could be) and assumes there is not enough room in the cluster
 - ML requests a scale up
 - Rinse/repeat

^ This can occur until, at last, all jobs are assigned; at that point, a massive scale_down will occur. This can potentially cause the same pattern again because, in certain architectures (ESS), scaling down means adding a new node to the cluster and removing the old one (so jobs must be re-assigned, starting the cycle all over again).
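
To make the feedback loop concrete, below is a simplified Python sketch (hypothetical names again, not the actual Java service) contrasting the buggy signal, where any unassigned job is taken as proof the cluster is too small, with a corrected signal that ignores jobs that are only queued behind `xpack.ml.node_concurrent_job_allocations` and would fit on the existing nodes:

```python
# Simplified sketch, assuming hypothetical types and a first-fit placement
# check; the real decider also weighs job memory estimates and node roles.
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    memory_bytes: int
    assigned: bool


def needs_scale_up_buggy(jobs: List[Job]) -> bool:
    # Any waiting job triggers a scale-up request, even if it would fit.
    return any(not j.assigned for j in jobs)


def needs_scale_up_fixed(jobs: List[Job], free_bytes_per_node: List[int]) -> bool:
    # Only jobs that cannot be placed on any existing node count as unsatisfied.
    free = sorted(free_bytes_per_node)
    for job in sorted((j for j in jobs if not j.assigned),
                      key=lambda j: j.memory_bytes):
        for i, node_free in enumerate(free):
            if node_free >= job.memory_bytes:
                free[i] -= job.memory_bytes  # job fits; no new capacity needed
                break
        else:
            return True  # genuinely no room anywhere
    return False


if __name__ == "__main__":
    # Two opening jobs throttled by node_concurrent_job_allocations, but the
    # single large node has room: the buggy check scales up, the fixed one does not.
    jobs = [Job(2**30, False), Job(2**30, False)]
    print(needs_scale_up_buggy(jobs))               # True
    print(needs_scale_up_fixed(jobs, [8 * 2**30]))  # False
```

This first-fit check is only an illustration; the point is that "waiting" alone is not evidence of insufficient capacity when assignment is throttled by `xpack.ml.node_concurrent_job_allocations`.
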
benwtrent committed Apr 29, 2021
1 parent cf1dc2e commit 72af925
Showing 3 changed files with 373 additions and 134 deletions.
