
[ML] Wait to gracefully stop deployments until alternative allocation exists #99107

Conversation

jonathan-buttner (Contributor) commented Aug 31, 2023

This PR is a follow-on from #98406.

These changes allow a deployment that is routed to a shutting-down node to continue servicing requests until a new started route exists. If the assignment is routed to only a single node, it will shut down gracefully; an assignment routed to a single node indicates that no additional nodes exist that could accommodate the deployment during a rebalance.

Relates to https://github.com/elastic/ml-team/issues/994

Depends on: https://github.com/elastic/cloud/issues/117365
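
To make the intended decision concrete, here is a minimal sketch using simplified, hypothetical stand-ins for the routing types (the route states mirror the ML plugin's RoutingState, but the record shapes and method name below are invented for illustration):

import java.util.Map;

// Hypothetical, simplified stand-ins for the real routing types; the actual
// classes in the ML plugin carry far more state than shown here.
enum RoutingState { STARTING, STARTED, STOPPING, STOPPED, FAILED }

record RoutingInfo(RoutingState state) {}

record TrainedModelAssignment(Map<String, RoutingInfo> nodeRoutingTable) {

    // Sketch of the PR's core rule: shut down immediately only if this
    // shutting-down node holds the sole route for the assignment; otherwise
    // keep servicing requests until an alternative route reaches STARTED.
    boolean shouldGracefullyShutdownNow(String shuttingDownNodeId) {
        var otherRoutes = nodeRoutingTable.entrySet().stream()
            .filter(e -> e.getKey().equals(shuttingDownNodeId) == false)
            .map(Map.Entry::getValue)
            .toList();
        if (otherRoutes.isEmpty()) {
            // No other node can take the traffic, so there is nothing to wait for.
            return true;
        }
        // Wait until at least one alternative allocation is ready to serve requests.
        return otherRoutes.stream().anyMatch(r -> r.state() == RoutingState.STARTED);
    }
}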

@jonathan-buttner jonathan-buttner added >bug :ml Machine learning Team:ML Meta label for the ML team cloud-deploy Publish cloud docker image for Cloud-First-Testing v8.11.0 v8.10.1 labels Aug 31, 2023
@elasticsearchmachine (Collaborator)

Hi @jonathan-buttner, I've created a changelog YAML for you.

@jonathan-buttner jonathan-buttner marked this pull request as ready for review September 1, 2023 14:30
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)


// We couldn't find any nodes in the started state so let's look for ones that
// are stopping in case we're shutting down some nodes
if (nodes.isEmpty()) {
    nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments(), RoutingState.STOPPING);
}
Contributor

This is likely to be happening on a different node to the one where the selected deployment is running.

What would happen if by the time the message sent by this node to the node running the stopping model arrives, the queue on that node has been fully drained?

(Maybe it's OK but I haven't been through the code in great detail so just wanted to check what you think first.)

Contributor Author

Ah that's a good point, I overlooked that.

In this code path it first checks whether there are routes in the started state. If none exist, it'll try to find ones in the stopping state.

A couple scenarios:

User API request stopping the deployment

The caller here will return an error if the user stopped the deployment via an API request, because the first thing that endpoint does is set the assignment to stopping.

I think this would be OK because the user is intentionally stopping the deployment.

Node shutdown stopping deployment

When a rebalance happens, routes to shutting-down nodes are set to stopping. The TrainedModelAssignmentNodeService then checks whether the deployment meets the criteria to be gracefully shut down (the route is stopping, the node is shutting down, and the deployment hasn't already been stopped by the user or a previous cluster state change).

One of the code changes in this PR adds logic to also consider whether any other allocations can service requests. If other routes exist but they're still in the starting state (not yet ready to handle requests), we won't begin the graceful shutdown yet, so requests can still be queued up and serviced.

There is a race condition like you mentioned, though. Suppose selectRandomStartedNodesWeighedOnAllocationsForNRequests picked a stopping route because no started routes existed, and, while the request was in flight, the cluster state changed such that a different node became ready to handle requests. The shutting-down node would receive that cluster state change and begin its graceful shutdown, so if the inference request arrived after the cluster state change it could be rejected because the deployment worker had already set the shutdown flag.

Ideas

Continue to accept requests while the queue is draining

One idea would be to continue to return false for isShutdown here until we've completely drained the queue (i.e. queue.poll returns null).

This way, when a new node is ready to service requests, it'll be chosen over any stopping deployments. Any requests in flight to a stopping deployment will still be serviced.

I guess the danger here is that the worker might never shut down if the queue never becomes empty. The ML plugin does cap the shutdown time frame at 10 minutes, so we have that safety net, and the DeploymentManager also waits 3 minutes for the worker to drain the queue and terminate, so we'd get a failure from that too. If that happens, the DeploymentManager will force-set the shutdown flag and kill the process. I doubt we'd be flooded with that many requests, because we should be prioritizing started routes over stopping ones.
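
A minimal sketch of that first idea, with entirely hypothetical names (this is not the real DeploymentManager worker): the shutdown flag only reports true once the queue has genuinely drained, so work keeps being accepted and processed in the meantime.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical worker illustrating "isShutdown stays false until poll() returns null".
class InferenceWorkerSketch implements Runnable {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final AtomicBoolean shutdownRequested = new AtomicBoolean();
    private final AtomicBoolean drained = new AtomicBoolean();

    boolean offer(Runnable inferenceTask) {
        // Keep accepting work while draining; refuse only once shutdown was
        // requested and the queue has actually emptied.
        return isShutdown() == false && queue.offer(inferenceTask);
    }

    void requestGracefulShutdown() {
        shutdownRequested.set(true);
    }

    boolean isShutdown() {
        return shutdownRequested.get() && drained.get();
    }

    @Override
    public void run() {
        try {
            while (true) {
                Runnable task = queue.poll(100, TimeUnit.MILLISECONDS);
                if (task == null) {
                    if (shutdownRequested.get()) {
                        drained.set(true); // nothing left: now we really are shut down
                        return;
                    }
                    continue;
                }
                task.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}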

Retry the inference request on rejection error

Another idea would be to add some retry logic in the inference route to pick a new route if the previous route returned an error stating it was shutting down. Maybe we could optimize this to only retry if a different route existed (whether in started or stopping).
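
And a hedged sketch of the second idea; the selector, exception type, and retry bound below are all hypothetical rather than existing APIs:

import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-ins for the real inference plumbing.
record InferenceResponse(String result) {}

class NodeShuttingDownException extends RuntimeException {}

class RetryingInferenceRouter {
    private static final int MAX_ATTEMPTS = 2;

    private final Supplier<List<String>> routeSelector;            // prefers STARTED, falls back to STOPPING
    private final Function<String, InferenceResponse> sendToNode;  // dispatches the request to a node

    RetryingInferenceRouter(Supplier<List<String>> routeSelector,
                            Function<String, InferenceResponse> sendToNode) {
        this.routeSelector = routeSelector;
        this.sendToNode = sendToNode;
    }

    InferenceResponse infer() {
        NodeShuttingDownException lastFailure = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            // Re-select on every attempt so a newly STARTED route wins over a stopping one.
            List<String> candidates = routeSelector.get();
            if (candidates.isEmpty()) {
                break;
            }
            try {
                return sendToNode.apply(candidates.get(0));
            } catch (NodeShuttingDownException e) {
                lastFailure = e; // chosen node began its graceful shutdown; try another route
            }
        }
        throw lastFailure != null ? lastFailure : new NodeShuttingDownException();
    }
}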

Do either of those solutions sound reasonable?

davidkyle (Member) commented Sep 4, 2023

  • TrainedModelAssignmentClusterService monitors for node shutdown events; when it sees one, it sets all routes to that node to STOPPING.
  • When TrainedModelAssignmentNodeService notices the route update in a cluster change event, it asynchronously tells the deployment manager to stop gracefully.
  • The first thing the deployment manager does on graceful stop is remove the process context from the map, which means all calls to infer() will now fail because they can't find the process context. It also shuts down the worker queue, so new requests will be rejected.

For this to work the deployment manager code must continue to accept requests, but that is tricky for the reasons above.

During a scale event, is the replacement node present in the cluster before the shutdown event is received? Is it an option, when we get the shutdown event, to rebalance the assignments and, if the new plan can be satisfied, wait until the model is started on the new node and then shut down gracefully?
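
For reference, a heavily simplified sketch of the sequence in the bullets above; the classes and signatures are stubs, and only the ordering (process context removed first, then the queue shut down) comes from the comment:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stub only: the real deployment manager runs asynchronously and holds much more state.
class DeploymentManagerStub {
    private final Map<String, Object> processContexts = new ConcurrentHashMap<>();

    void stopGracefully(String deploymentId) {
        // The context is removed up front, so infer() calls arriving afterwards
        // can no longer find it and fail...
        processContexts.remove(deploymentId);
        // ...and the worker queue is then shut down, so new requests are rejected.
    }
}

// Stub of the node-side reaction to a cluster change that flipped our route to STOPPING.
class NodeServiceStub {
    private final DeploymentManagerStub deploymentManager = new DeploymentManagerStub();

    void onRouteStopping(String deploymentId, boolean nodeIsShuttingDown) {
        if (nodeIsShuttingDown) {
            deploymentManager.stopGracefully(deploymentId);
        }
    }
}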

Contributor Author

The first thing the deployment manager does on graceful stop is remove the process context from the map, which means all calls to infer() will now fail because they can't find the process context. It also shuts down the worker queue, so new requests will be rejected.

Good point, I forgot about that

During a scale event, is the replacement node present in the cluster before the shutdown event is received?

Yes, I believe the new node will be up and running, ready to process requests (and have an allocation added to it), before the shutdown signal is sent to the other node. At least that's what I've seen in my testing.

Is it an option, when we get the shutdown event, to rebalance the assignments and, if the new plan can be satisfied, wait until the model is started on the new node and then shut down gracefully?

Yeah, this is what I was trying to achieve with this PR, but I think it still isn't able to handle the scenario Dave R brought up.

Rebalancing does occur when a node is marked for shutdown, and these code changes (or adding another field to the cluster state) can achieve waiting for the model to be started on the new node before beginning the graceful shutdown.

I think the issue is that even if we wait for the model to be in the started state on the new node, there's still a window of time in which inference requests that were in flight and routed to the node marked for shutdown could arrive after the graceful shutdown has occurred.

As I think about this more, I don't think my first suggestion would actually solve it either: even if we allowed requests to be queued up while draining the queues, very slow requests could still arrive after the graceful shutdown finished.

I wonder if the best way to solve this is to use the changes I've added in this PR (or some new field in the cluster state, or temporary invisible assignments, to indicate that we're waiting for a new allocation to be ready to service requests) and add retry logic within the inference code path. That way, if an inference request failed because it was routed to a node in the middle of gracefully shutting down, it would simply retry and be routed to the new node.

Otherwise how do we guarantee that no requests were already in flight to a node that is gracefully shutting down even if another node is now available to service inference requests?

If we don't add retry logic we'll have a similar problem if ML nodes crash and unexpectedly disappear right?

@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
Comment on lines 445 to 447
if (routingInfo == null) {
    return false;
}
Contributor

With the current calling point for shouldGracefullyShutdownDeployment it's impossible for this test to pass and return false, because there is a check for routingInfo != null above where it's called.

So I guess this check is just for future-proofing. In that case I think it's better that it returns true, because if a node isn't mentioned in a model's routing table then in general it shouldn't be running there.

Contributor Author

Yeah, good point. In the changes here I decided to start passing in TrainedModelAssignment since I needed it for a few other things, and figured I should do a null check. I'll switch it to true 👍
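
A tiny sketch of the agreed change, reusing the hypothetical routing stubs from the sketch near the top of this page (the method name comes from the review; the enclosing class and signature are assumed):

class GracefulShutdownChecksSketch {
    static boolean shouldGracefullyShutdownDeployment(TrainedModelAssignment assignment, String nodeId) {
        RoutingInfo routingInfo = assignment.nodeRoutingTable().get(nodeId);
        if (routingInfo == null) {
            // A node that isn't in the model's routing table shouldn't be running
            // the deployment at all, so err on the side of shutting down.
            return true;
        }
        // ... remaining checks: route is STOPPING, node is shutting down, and no
        // alternative allocation is ready to take the traffic yet ...
        return false;
    }
}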

- var nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments());
+ var nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments(), RoutingState.STARTED);

  // We couldn't find any nodes in the started state so let's look for ones that are stopping in case we're shutting down some nodes
Contributor

I think what's done in this PR is an improvement on what we have today in terms of not failing requests during rolling upgrades.

For a follow-up that might help even more, I'm thinking we could route requests to nodes where the routing state is STARTING. We could build up a queue of items to send to the C++ process before it's ready, then feed them through once it's running. We'd have to decide whether to favour STOPPING or STARTING after failing to find a STARTED route, and there would probably be work elsewhere to accept requests into the queue before the C++ process is ready. But it could be another layer of improvement, even if we still haven't reached perfection.
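
A speculative sketch of that follow-up, with entirely hypothetical names: requests aimed at a STARTING route are parked in a buffer and replayed once the process reports ready. In practice the buffer would of course need a size cap or timeout.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical buffer for routes that are still STARTING.
class StartingRouteBuffer<R> {
    private final Queue<R> pending = new ArrayDeque<>();
    private final Consumer<R> dispatcher;   // sends a request to the now-running process
    private boolean processReady = false;

    StartingRouteBuffer(Consumer<R> dispatcher) {
        this.dispatcher = dispatcher;
    }

    synchronized void submit(R request) {
        if (processReady) {
            dispatcher.accept(request);     // process is up: send straight through
        } else {
            pending.add(request);           // park it until the process is ready
        }
    }

    synchronized void onProcessReady() {
        processReady = true;
        R request;
        while ((request = pending.poll()) != null) {
            dispatcher.accept(request);     // replay everything accepted while starting
        }
    }
}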

Contributor Author

Sounds good!

droberts195 (Contributor) left a comment

LGTM

I'm happy to merge this on the grounds that it makes the situation better, if not perfect.

When time permits we should investigate whether the idea of #99107 (comment) is feasible.

@jonathan-buttner jonathan-buttner added v8.11.0 auto-backport-and-merge Automatically create backport pull requests and merge when ready v8.11.1 labels Oct 12, 2023
@jonathan-buttner jonathan-buttner merged commit 19fb25f into elastic:main Oct 12, 2023
14 checks passed
jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request Oct 12, 2023
… exists (elastic#99107)

* Allowing stopping deployments to handle requests

* Adding some logging

* Update docs/changelog/99107.yaml

* Adding test

* Addressing feedback

* Fixing test
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.11

@jonathan-buttner jonathan-buttner deleted the ml-wait-for-new-node-before-stopping branch October 12, 2023 13:13
elasticsearchmachine pushed a commit that referenced this pull request Oct 12, 2023
… exists (#99107) (#100765)

* Allowing stopping deployments to handle requests

* Adding some logging

* Update docs/changelog/99107.yaml

* Adding test

* Addressing feedback

* Fixing test