
[ML] Wait to gracefully stop deployments until alternative allocation exists #99107

Conversation

jonathan-buttner (Contributor) commented Aug 31, 2023

This PR is a follow-on from #98406.

These changes allow a deployment that is routed to a shutting-down node to continue servicing requests until a new started route exists. If the assignment is routed to only a single node, it will shut down gracefully; an assignment routed to a single node indicates that no additional nodes exist that could accommodate the deployment during a rebalance.

Relates to https://github.com/elastic/ml-team/issues/994

Depends on: https://github.com/elastic/cloud/issues/117365
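
To make the intended decision concrete, here is a minimal sketch using simplified, hypothetical stand-ins for the routing types (the route states mirror the ML plugin's RoutingState, but the record shapes and method name below are invented for illustration):

import java.util.Map;

// Hypothetical, simplified stand-ins for the real routing types; the actual
// classes in the ML plugin carry far more state than shown here.
enum RoutingState { STARTING, STARTED, STOPPING, STOPPED, FAILED }

record RoutingInfo(RoutingState state) {}

record TrainedModelAssignment(Map<String, RoutingInfo> nodeRoutingTable) {

    // Sketch of the PR's core rule: shut down immediately only if this
    // shutting-down node holds the sole route for the assignment; otherwise
    // keep servicing requests until an alternative route reaches STARTED.
    boolean shouldGracefullyShutdownNow(String shuttingDownNodeId) {
        var otherRoutes = nodeRoutingTable.entrySet().stream()
            .filter(e -> e.getKey().equals(shuttingDownNodeId) == false)
            .map(Map.Entry::getValue)
            .toList();
        if (otherRoutes.isEmpty()) {
            // No other node can take the traffic, so there is nothing to wait for.
            return true;
        }
        // Wait until at least one alternative allocation is ready to serve requests.
        return otherRoutes.stream().anyMatch(r -> r.state() == RoutingState.STARTED);
    }
}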

@jonathan-buttner jonathan-buttner added >bug :ml Machine learning Team:ML Meta label for the ML team cloud-deploy Publish cloud docker image for Cloud-First-Testing v8.11.0 v8.10.1 labels Aug 31, 2023
@elasticsearchmachine (Collaborator)

Hi @jonathan-buttner, I've created a changelog YAML for you.

@jonathan-buttner jonathan-buttner marked this pull request as ready for review September 1, 2023 14:30
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)


// We couldn't find any nodes in the started state so let's look for ones that
// are stopping in case we're shutting down some nodes
if (nodes.isEmpty()) {
    nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments(), RoutingState.STOPPING);
}
Contributor

This is likely to be happening on a different node to the one where the selected deployment is running.

What would happen if by the time the message sent by this node to the node running the stopping model arrives, the queue on that node has been fully drained?

(Maybe it's OK but I haven't been through the code in great detail so just wanted to check what you think first.)

Contributor Author

Ah that's a good point, I overlooked that.

In this code path it first checks whether there are routes in the started state. If none exist, it'll try to find ones in the stopping state.

A couple scenarios:

User API request stopping the deployment

The caller here will return an error if the user stopped the deployment via an API request, because the first thing that endpoint does is set the assignment to stopping.

I think this would be OK because the user is intentionally stopping the deployment.

Node shutdown stopping deployment

When a rebalance happens, routes to shutting-down nodes are set to stopping. The TrainedModelAssignmentNodeService then checks whether the deployment meets the criteria to be gracefully shut down (the route is stopping, the node is shutting down, and the deployment hasn't already been stopped by the user or a previous cluster state change).

One of the code changes in this PR adds logic to also consider whether any other allocations can service requests. If other routes exist but they're still in the starting state (not yet ready to handle requests), we won't begin the graceful shutdown yet, so requests can still be queued up and serviced.

There is a race condition like you mentioned, though. Suppose selectRandomStartedNodesWeighedOnAllocationsForNRequests picked a stopping route because no started routes existed, and, while the request was in flight, the cluster state changed such that a different node became ready to handle requests. The shutting-down node would receive that cluster state change and begin its graceful shutdown, so if the inference request arrived after the cluster state change it could be rejected because the deployment worker had already set the shutdown flag.

Ideas

Continue to accept requests while the queue is draining

One idea would be to continue to return false for isShutdown here until we've completely drained the queue (i.e. queue.poll returns null).

This way, when a new node is ready to service requests, it'll be chosen over any stopping deployments. Any requests in flight to a stopping deployment will still be serviced.

I guess the danger here is that the worker might never shut down if the queue never becomes empty. The ML plugin does cap the shutdown time frame at 10 minutes, so we have that safety net, and the DeploymentManager also waits 3 minutes for the worker to drain the queue and terminate, so we'd get a failure from that too. If that happens, the DeploymentManager will force-set the shutdown flag and kill the process. I doubt we'd be flooded with that many requests, because we should be prioritizing started routes over stopping ones.
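
A minimal sketch of that first idea, with entirely hypothetical names (this is not the real DeploymentManager worker): the shutdown flag only reports true once the queue has genuinely drained, so work keeps being accepted and processed in the meantime.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical worker illustrating "isShutdown stays false until poll() returns null".
class InferenceWorkerSketch implements Runnable {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final AtomicBoolean shutdownRequested = new AtomicBoolean();
    private final AtomicBoolean drained = new AtomicBoolean();

    boolean offer(Runnable inferenceTask) {
        // Keep accepting work while draining; refuse only once shutdown was
        // requested and the queue has actually emptied.
        return isShutdown() == false && queue.offer(inferenceTask);
    }

    void requestGracefulShutdown() {
        shutdownRequested.set(true);
    }

    boolean isShutdown() {
        return shutdownRequested.get() && drained.get();
    }

    @Override
    public void run() {
        try {
            while (true) {
                Runnable task = queue.poll(100, TimeUnit.MILLISECONDS);
                if (task == null) {
                    if (shutdownRequested.get()) {
                        drained.set(true); // nothing left: now we really are shut down
                        return;
                    }
                    continue;
                }
                task.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}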

Retry the inference request on rejection error

Another idea would be to add some retry logic in the inference route to pick a new route if the previous route returned an error stating it was shutting down. Maybe we could optimize this to only retry if a different route existed (whether in started or stopping).
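
And a hedged sketch of the second idea; the selector, exception type, and retry bound below are all hypothetical rather than existing APIs:

import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-ins for the real inference plumbing.
record InferenceResponse(String result) {}

class NodeShuttingDownException extends RuntimeException {}

class RetryingInferenceRouter {
    private static final int MAX_ATTEMPTS = 2;

    private final Supplier<List<String>> routeSelector;            // prefers STARTED, falls back to STOPPING
    private final Function<String, InferenceResponse> sendToNode;  // dispatches the request to a node

    RetryingInferenceRouter(Supplier<List<String>> routeSelector,
                            Function<String, InferenceResponse> sendToNode) {
        this.routeSelector = routeSelector;
        this.sendToNode = sendToNode;
    }

    InferenceResponse infer() {
        NodeShuttingDownException lastFailure = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            // Re-select on every attempt so a newly STARTED route wins over a stopping one.
            List<String> candidates = routeSelector.get();
            if (candidates.isEmpty()) {
                break;
            }
            try {
                return sendToNode.apply(candidates.get(0));
            } catch (NodeShuttingDownException e) {
                lastFailure = e; // chosen node began its graceful shutdown; try another route
            }
        }
        throw lastFailure != null ? lastFailure : new NodeShuttingDownException();
    }
}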

Do either of those solutions sound reasonable?

davidkyle (Member) commented Sep 4, 2023

  • TrainedModelAssignmentClusterService monitors for node shutdown events; when it sees one, it sets all routes to that node to STOPPING.
  • When TrainedModelAssignmentNodeService notices the route update in a cluster change event, it asynchronously tells the deployment manager to stop gracefully.
  • The first thing the deployment manager does on graceful stop is remove the process context from the map, which means all calls to infer() will now fail because they can't find the process context. It also shuts down the worker queue, so new requests will be rejected.

For this to work the deployment manager code must continue to accept requests, but that is tricky for the reasons above.

During a scale event, is the replacement node present in the cluster before the shutdown event is received? Is it an option, when we get the shutdown event, to rebalance the assignments and, if the new plan can be satisfied, wait until the model is started on the new node and then shut down gracefully?
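
For reference, a heavily simplified sketch of the sequence in the bullets above; the classes and signatures are stubs, and only the ordering (process context removed first, then the queue shut down) comes from the comment:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stub only: the real deployment manager runs asynchronously and holds much more state.
class DeploymentManagerStub {
    private final Map<String, Object> processContexts = new ConcurrentHashMap<>();

    void stopGracefully(String deploymentId) {
        // The context is removed up front, so infer() calls arriving afterwards
        // can no longer find it and fail...
        processContexts.remove(deploymentId);
        // ...and the worker queue is then shut down, so new requests are rejected.
    }
}

// Stub of the node-side reaction to a cluster change that flipped our route to STOPPING.
class NodeServiceStub {
    private final DeploymentManagerStub deploymentManager = new DeploymentManagerStub();

    void onRouteStopping(String deploymentId, boolean nodeIsShuttingDown) {
        if (nodeIsShuttingDown) {
            deploymentManager.stopGracefully(deploymentId);
        }
    }
}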

Contributor Author

The first thing the deployment manager does on graceful stop is remove the process context from the map, which means all calls to infer() will now fail because they can't find the process context. It also shuts down the worker queue, so new requests will be rejected.

Good point, I forgot about that

During a scale event, is the replacement node present in the cluster before the shutdown event is received?

Yes, I believe the new node will be up and running, ready to process requests (and have an allocation added to it), before the shutdown signal is sent to the other node. At least that's what I've seen in my testing.

Is it an option, when we get the shutdown event, to rebalance the assignments and, if the new plan can be satisfied, wait until the model is started on the new node and then shut down gracefully?

Yeah, this is what I was trying to achieve with this PR, but I think it still isn't able to handle the scenario Dave R brought up.

Rebalancing does occur when a node is marked for shutdown, and these code changes (or adding another field to the cluster state) can achieve waiting for the model to be started on the new node before beginning the graceful shutdown.

I think the issue is that even if we wait for the model to be in the started state on the new node, there's still a window of time in which inference requests that were in flight and routed to the node marked for shutdown could arrive after the graceful shutdown has occurred.

As I think about this more, I don't think my first suggestion would actually solve it either: even if we allowed requests to be queued up while draining the queues, very slow requests could still arrive after the graceful shutdown finished.

I wonder if the best way to solve this is to use the changes I've added in this PR (or some new field in the cluster state, or temporary invisible assignments, to indicate that we're waiting for a new allocation to be ready to service requests) and add retry logic within the inference code path. That way, if an inference request failed because it was routed to a node in the middle of gracefully shutting down, it would simply retry and be routed to the new node.

Otherwise how do we guarantee that no requests were already in flight to a node that is gracefully shutting down even if another node is now available to service inference requests?

If we don't add retry logic we'll have a similar problem if ML nodes crash and unexpectedly disappear right?

@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
Comment on lines 445 to 447
if (routingInfo == null) {
    return false;
}
Contributor

With the current calling point for shouldGracefullyShutdownDeployment it's impossible for this test to pass and return false, because there is a check for routingInfo != null above where it's called.

So I guess this check is just for future-proofing. In that case I think it's better that it returns true, because if a node isn't mentioned in a model's routing table then in general it shouldn't be running there.

Contributor Author

Yeah, good point. In the changes here I decided to start passing in TrainedModelAssignment since I needed it for a few other things, and figured I should do a null check. I'll switch it to true 👍
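
A tiny sketch of the agreed change, reusing the hypothetical routing stubs from the sketch near the top of this page (the method name comes from the review; the enclosing class and signature are assumed):

class GracefulShutdownChecksSketch {
    static boolean shouldGracefullyShutdownDeployment(TrainedModelAssignment assignment, String nodeId) {
        RoutingInfo routingInfo = assignment.nodeRoutingTable().get(nodeId);
        if (routingInfo == null) {
            // A node that isn't in the model's routing table shouldn't be running
            // the deployment at all, so err on the side of shutting down.
            return true;
        }
        // ... remaining checks: route is STOPPING, node is shutting down, and no
        // alternative allocation is ready to take the traffic yet ...
        return false;
    }
}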

- var nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments());
+ var nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments(), RoutingState.STARTED);

  // We couldn't find any nodes in the started state so let's look for ones that are stopping in case we're shutting down some nodes
Contributor

I think what's done in this PR is an improvement on what we have today in terms of not failing requests during rolling upgrades.

For a follow-up that might help even more, I'm thinking we could route requests to nodes where the routing state is STARTING. We could build up a queue of items to send to the C++ process before it's ready, then feed them through once it's running. We'd have to decide whether to favour STOPPING or STARTING after failing to find a STARTED route, and there would probably be work elsewhere to accept requests into the queue before the C++ process is ready. But it could be another layer of improvement, even if we still haven't reached perfection.
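
A speculative sketch of that follow-up, with entirely hypothetical names: requests aimed at a STARTING route are parked in a buffer and replayed once the process reports ready. In practice the buffer would of course need a size cap or timeout.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical buffer for routes that are still STARTING.
class StartingRouteBuffer<R> {
    private final Queue<R> pending = new ArrayDeque<>();
    private final Consumer<R> dispatcher;   // sends a request to the now-running process
    private boolean processReady = false;

    StartingRouteBuffer(Consumer<R> dispatcher) {
        this.dispatcher = dispatcher;
    }

    synchronized void submit(R request) {
        if (processReady) {
            dispatcher.accept(request);     // process is up: send straight through
        } else {
            pending.add(request);           // park it until the process is ready
        }
    }

    synchronized void onProcessReady() {
        processReady = true;
        R request;
        while ((request = pending.poll()) != null) {
            dispatcher.accept(request);     // replay everything accepted while starting
        }
    }
}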

Contributor Author

Sounds good!

droberts195 (Contributor) left a comment

LGTM

I'm happy to merge this on the grounds that it makes the situation better, if not perfect.

When time permits we should investigate whether the idea of #99107 (comment) is feasible.

@jonathan-buttner jonathan-buttner added v8.11.0 auto-backport-and-merge Automatically create backport pull requests and merge when ready v8.11.1 labels Oct 12, 2023
@jonathan-buttner jonathan-buttner merged commit 19fb25f into elastic:main Oct 12, 2023
14 checks passed
jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request Oct 12, 2023
… exists (elastic#99107)

* Allowing stopping deployments to handle requests

* Adding some logging

* Update docs/changelog/99107.yaml

* Adding test

* Addressing feedback

* Fixing test
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.11

@jonathan-buttner jonathan-buttner deleted the ml-wait-for-new-node-before-stopping branch October 12, 2023 13:13
elasticsearchmachine pushed a commit that referenced this pull request Oct 12, 2023
… exists (#99107) (#100765)

* Allowing stopping deployments to handle requests

* Adding some logging

* Update docs/changelog/99107.yaml

* Adding test

* Addressing feedback

* Fixing test