[ML] Wait to gracefully stop deployments until alternative allocation exists #99107
Conversation
Hi @jonathan-buttner, I've created a changelog YAML for you.
…athan-buttner/elasticsearch into ml-wait-for-new-node-before-stopping
Pinging @elastic/ml-core (Team:ML)
// We couldn't find any nodes in the started state so let's look for ones that are stopping in case we're shutting down some nodes
if (nodes.isEmpty()) {
    nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments(), RoutingState.STOPPING);
}
This is likely to be happening on a different node to the one where the selected deployment is running.
What would happen if by the time the message sent by this node to the node running the stopping model arrives, the queue on that node has been fully drained?
(Maybe it's OK but I haven't been through the code in great detail so just wanted to check what you think first.)
Ah that's a good point, I overlooked that.
In this code path it is first checking to see if there are routes in the started state. If none exist it'll try to find ones in the stopping state.
A couple of scenarios:

User API request stopping the deployment

The caller here will return an error if the user stopped the deployment via an API request, because the first thing that endpoint does is set the assignment to stopping.
I think this would be OK because the user is intentionally stopping the deployment.
Node shutdown stopping deployment
When a rebalance happens, routes to shutting-down nodes are set to stopping. Then the TrainedModelAssignmentNodeService will check if the deployment meets the criteria to be gracefully shut down (the route is stopping, the node is shutting down, and it hasn't already been stopped by the user or a previous cluster state change).
One of the code changes in this PR adds logic to also consider whether any other allocations can service requests. If other routes exist but are still in the starting state (not yet ready to handle requests), we won't begin the graceful shutdown yet, and requests can still be queued up and serviced.
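To make the criteria concrete, here's a minimal, self-contained sketch of the decision described above. The names and shapes (`GracefulShutdownCheck`, `Route`, the method signature) are hypothetical stand-ins, not the actual TrainedModelAssignmentNodeService code:

```java
import java.util.List;

// Simplified, hypothetical model of the graceful-shutdown criteria described
// above; RoutingState/Route are stand-ins for the real Elasticsearch classes.
class GracefulShutdownCheck {

    enum RoutingState { STARTING, STARTED, STOPPING }

    record Route(String nodeId, RoutingState state) {}

    // Begin a graceful shutdown only when this node's route is STOPPING, the
    // node itself is shutting down, and either no other route exists (nowhere
    // else to go) or some other route is already STARTED and can take over.
    static boolean shouldGracefullyShutdown(List<Route> routes, String thisNode, boolean nodeShuttingDown) {
        Route mine = routes.stream()
            .filter(r -> r.nodeId().equals(thisNode))
            .findFirst()
            .orElse(null);
        if (mine == null || mine.state() != RoutingState.STOPPING || nodeShuttingDown == false) {
            return false;
        }
        boolean onlyRoute = routes.stream().allMatch(r -> r.nodeId().equals(thisNode));
        // Routes that are merely STARTING can't serve requests yet, so we wait.
        boolean alternativeStarted = routes.stream()
            .anyMatch(r -> r.nodeId().equals(thisNode) == false && r.state() == RoutingState.STARTED);
        return onlyRoute || alternativeStarted;
    }
}
```

The key point is the middle case: a STOPPING route with only STARTING alternatives keeps serving, matching the "wait for an alternative allocation" behaviour this PR adds.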
There is a race condition like you mentioned, though. Suppose selectRandomStartedNodesWeighedOnAllocationsForNRequests picked a stopping route because no started routes existed. While the request was in flight to that node, the cluster state could change such that a different node became ready to handle requests, and once the shutting-down node received that cluster state change it would begin the graceful shutdown. If the inference request arrived after the cluster state change, it could be rejected because the deployment worker had set the shutdown flag.
Ideas
Continue to accept requests while the queue is draining
One idea would be to continue to return false for isShutdown here until we've completely drained the queue (i.e. queue.poll returns null).
This way, when a new node is ready to service requests, it'll be chosen over any stopping deployments. Any requests in flight to a stopping deployment will still be serviced.
I guess the danger here is that the worker might never shut down if the queue never becomes empty. The ML plugin does cap the shutdown time frame at 10 minutes, so we have that safety net. The DeploymentManager also waits 3 minutes for the worker to drain the queue and terminate, so we'd get a failure from that too. If that happens, the DeploymentManager will force set the shutdown flag and kill the process. I doubt we'd get flooded with that many requests because we should be prioritizing started routes over stopping ones.
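As a rough illustration of that first idea, here's a sketch of a worker whose isShutdown only flips once the queue has fully drained. This is a toy (the real worker/DeploymentManager interaction is more involved, and this ignores the 10-minute cap and has a small synchronization gap noted in the comments):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy sketch of "keep accepting requests while the queue drains" — the names
// and structure are illustrative, not the actual deployment worker code.
class DrainingWorker {
    private final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean stopRequested = new AtomicBoolean();
    private volatile boolean drained = false;

    // Only report shutdown once a graceful stop was requested AND queue.poll()
    // has returned null — so in-flight requests can still be accepted/serviced.
    boolean isShutdown() {
        return stopRequested.get() && drained;
    }

    boolean offer(Runnable task) {
        if (isShutdown()) {
            return false; // rejected: nothing will drain the queue any more
        }
        return queue.offer(task);
    }

    void gracefulStop() {
        stopRequested.set(true);
        Runnable task;
        while ((task = queue.poll()) != null) {
            task.run(); // service whatever is still queued
        }
        // Note: a task enqueued between the final poll and this flag would be
        // stranded; a real implementation needs tighter synchronization.
        drained = true;
    }
}
```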
Retry the inference request on rejection error
Another idea would be to add some retry logic in the inference route to pick a new route if the previous route returned an error stating it was shutting down. Maybe we could optimize this to only retry if a different route existed (whether in started or stopping).
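The retry idea could look something like this. The shapes here are hypothetical (the real inference transport action and its exception types differ); it just shows the "fall through to the next route on a shutdown rejection" behaviour:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of retry-on-shutdown-rejection for inference requests.
class InferenceRetry {

    // Stand-in for the error a gracefully-shutting-down node would return.
    static class ShuttingDownException extends RuntimeException {}

    // Try each candidate route in order; if a route rejects the request
    // because its node is shutting down, move on to the next route.
    static <R> R inferWithRetry(List<String> routes, Function<String, R> sendToNode) {
        ShuttingDownException last = null;
        for (String node : routes) {
            try {
                return sendToNode.apply(node);
            } catch (ShuttingDownException e) {
                last = e; // node was mid graceful shutdown; try another route
            }
        }
        throw last != null ? last : new ShuttingDownException();
    }
}
```

Per the optimization mentioned above, a real version would presumably only retry when a different route (started or stopping) actually exists rather than looping over the same node.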
Do either of those solutions sound reasonable?
- TrainedModelAssignmentClusterService monitors for node shutdown events; when it sees one it sets all routes to that node to STOPPING.
- When TrainedModelAssignmentNodeService notices the route update in a cluster change event, it asynchronously tells the deployment manager to stop gracefully.
- The first thing the deployment manager does on graceful stop is remove the process context from the map, which means all calls to infer() will now fail because they can't find the process context. It also shuts down the worker queue, so new requests will be rejected.
For this to work the deployment manager code must continue to accept requests but that is tricky for the reasons above.
During a scale event, is the replacement node present in the cluster before the shutdown event is received? Is it an option, when we get the shutdown event, to rebalance the assignments and, if the new plan can be satisfied, wait until the model is started on the new node and then shut down gracefully?
The first thing the deployment manager does on graceful stop is remove the process context from the map, which means all calls to infer() will now fail because they can't find the process context. It also shuts down the worker queue, so new requests will be rejected.
Good point, I forgot about that
During a scale event is the replacement node present in the cluster before the shutdown event is received?
Yes, I believe the new node will be up and running ready to process requests (and have an allocation added to it) prior to the shutdown signal being sent to the other node. At least that's what I've seen based on my testing.
Is it an option when we get the shutdown event to rebalance the assignments and if the new plan can be satisfied wait until the model is started on the new node then shut down gracefully?
Yeah this is what I was trying to achieve with this PR. But I think it still isn't able to handle the scenario Dave R brought up.
Rebalancing does occur when a node is marked for shutdown, and these code changes (or adding another field to the cluster state) can achieve waiting for the model to be started on the new node before beginning the graceful shutdown.
I think the issue is that even if we wait for the model to be in the started state on the new node there's still a window of time where any inference requests that were in flight and routed to the node marked for shutdown could arrive after the graceful shutdown occurs.
As I think about this more I don't think my first suggestion would actually solve it because if we had super slow requests they could still arrive after the graceful shutdown finished even if we allowed requests to be queued up while draining the queues.
I wonder if the best way to solve this is to use the changes I've added in this PR (or some new field in the cluster state/temporary invisible assignments to indicate that we're waiting for a new allocation to be ready to service requests) and add retry logic within the inference code path. That way, if an inference request failed because it was routed to a node in the middle of gracefully shutting down, it would simply retry and be routed to the new node.
Otherwise how do we guarantee that no requests were already in flight to a node that is gracefully shutting down even if another node is now available to service inference requests?
If we don't add retry logic, we'll have a similar problem if ML nodes crash and unexpectedly disappear, right?
if (routingInfo == null) {
    return false;
}
With the current calling point for shouldGracefullyShutdownDeployment it's impossible for this test to pass and return false, because there's a check for routingInfo != null above where it's called.

So I guess this check is just for future-proofing. In that case I think it's better that it returns true, because if a node isn't mentioned in a model's routing table then in general the model shouldn't be running there.
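In other words (a toy sketch with a stand-in RoutingInfo type, not the actual method), the defensive branch would flip to true:

```java
// Hypothetical miniature of the null-check being discussed. If this node has
// no entry in the model's routing table, the model shouldn't be running here,
// so err on the side of shutting the deployment down.
class ShutdownNullCheck {

    record RoutingInfo(String state) {} // stand-in for the real class

    static boolean shouldGracefullyShutdownDeployment(RoutingInfo routingInfo) {
        if (routingInfo == null) {
            return true; // not in the routing table => shouldn't be running here
        }
        // ...the real criteria follow; simplified here to the route state...
        return "STOPPING".equals(routingInfo.state());
    }
}
```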
Yeah, good point. In the changes here I decided to start passing in TrainedModelAssignment since I needed it for a few other things, and figured I should do a null check. I'll switch it to true 👍
- var nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments());
+ var nodes = assignment.selectRandomStartedNodesWeighedOnAllocationsForNRequests(request.numberOfDocuments(), RoutingState.STARTED);

// We couldn't find any nodes in the started state so let's look for ones that are stopping in case we're shutting down some nodes
I think what's done in this PR is an improvement on what we have today in terms of not failing requests during rolling upgrades.
For a followup that might help even more, I'm thinking we could route requests to nodes where the routing state is STARTING. We could build up a queue of items to send to the C++ process before it's ready, then feed them through once it's running. We'd have to decide whether to favour STOPPING or STARTING after failing to find a STARTED route, and there would probably be work elsewhere to accept requests into the queue before the C++ process is ready. But it could be another layer of improvement even if we still haven't reached perfection.
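One way that preference order could be sketched (hypothetical helper names; the STOPPING-before-STARTING ordering shown here is just one of the two options being weighed above):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative fallback ordering for route selection: prefer STARTED routes,
// then fall back to STOPPING, then (as a possible future improvement) STARTING.
class RouteFallback {

    enum RoutingState { STARTING, STARTED, STOPPING }

    record Route(String nodeId, RoutingState state) {}

    static List<Route> selectRoutes(List<Route> all) {
        // Try each state in preference order; return the first non-empty match.
        for (RoutingState preferred : List.of(RoutingState.STARTED, RoutingState.STOPPING, RoutingState.STARTING)) {
            List<Route> matching = all.stream()
                .filter(r -> r.state() == preferred)
                .collect(Collectors.toList());
            if (matching.isEmpty() == false) {
                return matching;
            }
        }
        return List.of();
    }
}
```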
Sounds good!
…for-new-node-before-stopping
LGTM
I'm happy to merge this on the grounds that it makes the situation better, if not perfect.
When time permits we should investigate whether the idea of #99107 (comment) is feasible.
… exists (elastic#99107)

* Allowing stopping deployments to handle requests
* Adding some logging
* Update docs/changelog/99107.yaml
* Adding test
* Addressing feedback
* Fixing test
💚 Backport successful
This PR is a follow on from #98406
These changes allow a deployment that is routed to a shutting-down node to continue to service requests until a new started route exists. If the assignment is only routed to a single node, it will gracefully shut down. Assignments routed to a single node indicate that no additional nodes exist that could accommodate the deployment during a rebalance.
Relates to https://github.com/elastic/ml-team/issues/994
Depends on: https://github.com/elastic/cloud/issues/117365