
Consider failed machine as terminating #118

Closed
himanshu-kun opened this issue Apr 22, 2022 · 14 comments · Fixed by #291
Labels
  • exp/beginner: Issue that requires only basic skills
  • kind/bug: Bug
  • needs/planning: Needs (more) planning with other MCM maintainers
  • priority/1: Priority (lower number equals higher priority)
  • status/closed: Issue is closed (either delivered or triaged)

Comments

@himanshu-kun

What happened:
A race condition can occur between MCM and CA when a node does not join within 20 min (the default for both creationTimeout and maxNodeProvisionTimeout).
This can lead to MCM deleting the Failed machine and creating a new one, and then CA deleting the new one while thinking it is deleting the Failed machine.

What you expected to happen:
CA should not take action if the machine it wants to terminate is already terminating.

How to reproduce it (as minimally and precisely as possible):

  • Make CA scale up a node group.
  • Prevent the machine that got created from joining within 20 min. This can be achieved by deleting the VM as soon as it is created or by tweaking the router rules. The important bit is that the node must not get registered.
  • Keep the default configurations for both MCM and CA.

Anything else we need to know:
This happens because of the delay with which the machineSet controller marks a Failed machine with a deletionTimestamp, so the machine is not considered as terminating.

Environment:

@himanshu-kun himanshu-kun added kind/bug Bug priority/1 Priority (lower number equals higher priority) labels Apr 22, 2022
@alexkyurtev

Hello, can we have an update on the status?

@himanshu-kun
Author

This would be available from release 1.22 of CA

@bbochev

bbochev commented Jun 1, 2022

Hello @himanshu-kun, do you know when 1.22 will be rolled out?

@himanshu-kun
Author

Hi @bbochev, I am planning to make a release next week, after the fix for this issue. It could then take one more week to reach the Canary landscape.

@bbochev

bbochev commented Jun 14, 2022

Hello @himanshu-kun,
Could you please update us here: was the new release of CA rolled out successfully on canary?

@himanshu-kun
Author

Hi @bbochev
Apologies, I was not able to work on it this week due to some other major fixes. I'll start working on it now. Then I'll make the release.

@himanshu-kun
Author

himanshu-kun commented Jun 22, 2022

After looking more into the issue, we realized that it needs a bigger change, in which the state transitions of machines and the communication between CA and MCM have to be redesigned.
I see that in the live ticket https://github.tools.sap/kubernetes-live/issues-live/issues/1607 the question was more about why two scale-up attempts happened. The system came back to a normal state on its own within an hour.
To deal with this for now, customers can keep the maxNodeProvisionTime different (e.g. 19 min) from the machineCreationTimeout, which is 20 min by default. The trick is not to keep them the same; an illustrative flag example is given below.
cc @unmarshall
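
As an illustration only, this is roughly what the workaround above could look like in terms of flags. The flag names are the CA and MCM flags mentioned later in this thread; the concrete values are assumptions, not an official recommendation:

```
# Cluster Autoscaler: give up on an unregistered node slightly before MCM's creation timeout
--max-node-provision-time=19m

# machine-controller-manager: keep the default machine creation timeout
--machine-creation-timeout=20m
```

The point is only that the two timeouts differ, so MCM and CA do not act on the same Failed machine at the same moment.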

@gardener-robot

@himanshu-kun You have mentioned internal references in the public. Please check.

1 similar comment

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 22, 2022
@himanshu-kun himanshu-kun added needs/planning Needs (more) planning with other MCM maintainers and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Feb 28, 2023
@elankath

elankath commented Feb 28, 2023

Background

  • The cluster autoscaler, as part of its regular RunOnce loop, calls deleteCreatedNodesWithErrors, which in turn calls nodeGroup.DeleteNodes(nodesToBeDeleted).
  • The MCM implementation of NodeGroup.DeleteNodes(nodesToBeDeleted) obtains the machine objects associated with nodesToBeDeleted and then calls mcmManager.DeleteMachines(machines).
  • DeleteMachines(machines) iterates through each machine object, obtains its machine deployment, and uses the following logic to set the replicas on the machine deployment:
```go
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingMachines))
if expectedReplicas == mdclone.Spec.Replicas {
	klog.Infof("MachineDeployment %q is already set to %d, skipping the update", mdclone.Name, expectedReplicas)
	break
}
mdclone.Spec.Replicas = expectedReplicas
```
  • As can be seen above, we have a problem: expectedReplicas is lowered because Failed machines are never considered in the above logic.

Grooming Decisions

  • For solving the problem specifically mentioned in this issue, we believe that expectedReplicas should be computed as follows (see the sketch after this list):
```go
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingOrFailedMachines))
```
  • We understand and acknowledge that this proposed fix (which was also submitted as a PR earlier in this thread) only addresses one problem, not other edge cases where scaling is not optimal. (For example, machines in the Pending phase cause the deployment replica count to be lowered even though the machine could have become Running a minute later; by then the autoscaler would already be trying a new node group and waiting for the node to join there.) A separate issue has been raised for these generic autoscaler<->MCM edge cases: Ensure that there is a single actor which reduces the machine deployment replicas #181
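
A minimal, self-contained sketch of the proposed computation, assuming hypothetical phase strings and simplified stand-ins for the MCM types (the real types in machine-controller-manager and the actual logic in the autoscaler's MCM cloud provider may differ):

```go
package main

import "fmt"

// Simplified stand-ins for the real MCM objects; the field names here are assumptions for illustration.
type Machine struct {
	Name  string
	Phase string // e.g. "Running", "Pending", "Terminating", "Failed"
}

type MachineDeployment struct {
	Name     string
	Replicas int32
}

// terminatingOrFailed counts machines that are already on their way out, so that
// scaling the MachineDeployment down by len(toDelete) does not remove an extra replica for them.
func terminatingOrFailed(toDelete []Machine) int32 {
	var n int32
	for _, m := range toDelete {
		if m.Phase == "Terminating" || m.Phase == "Failed" {
			n++
		}
	}
	return n
}

// expectedReplicas mirrors the formula proposed in the grooming decision above.
func expectedReplicas(md MachineDeployment, toDelete []Machine) int32 {
	return md.Replicas - int32(len(toDelete)) + terminatingOrFailed(toDelete)
}

func main() {
	md := MachineDeployment{Name: "worker-md", Replicas: 3}
	// CA asks to delete two machines, one of which has already Failed.
	toDelete := []Machine{
		{Name: "machine-a", Phase: "Running"},
		{Name: "machine-b", Phase: "Failed"},
	}
	// Counting only Terminating machines would yield 1; counting Failed machines
	// as well yields 2, so the Failed machine does not trigger an extra scale-down.
	fmt.Printf("MachineDeployment %q: set replicas to %d\n", md.Name, expectedReplicas(md, toDelete))
}
```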

@mattburgess

> To deal with this for now, customers can keep the maxNodeProvisionTime different (like 19 min) from the machineCreationTimeout, which is by default 20 min.

We're not using Gardener, but are using MCM and hence Gardener's CA fork too. We have these settings:

  • CA - --max-node-provision-time=10m0s
  • MCM - --machine-drain-timeout=10m & --machine-health-timeout=10m (note, specifically, that we don't override MCM's --machine-creation-timeout)

In our specific case, we run into this bug when AWS can't provision nodes due to capacity issues. CA's unregistered node handler then kicks in after 10 minutes and triggers this bug. We also only appear to hit it during a rolling update of a MachineDeployment; simply causing CA to scale up doesn't appear to hit it.

Is the right workaround here to adjust MCM's --machine-creation-timeout to be less than CA's --max-node-provision-time, so that MCM can clean up the failed machine before CA's unregistered node handler has a chance to kick in? Your guidance above has CA set to 19m and MCM to its default of 20m, but in our case we're on 10m and 20m respectively, so even though they're different we're still hitting this bug.

@mattburgess

I don't know whether it helps at all, but my testing has shown that even the band-aid fix isn't helpful in our specific case.

When AWS is under capacity pressure (particularly prevalent in eu-central-1a), we get an error back from the AWS API. This causes the Machine to enter a CrashLoopBackOff state, which is why the patch doesn't help us. That also explains why our adjustment of --machine-creation-timeout doesn't appear to help avoid the race either: that flag only appears to help when the Machine can reach a Pending status, i.e. the cloud provider can provide a backing instance but something else (networking, bootstrapping, etc.) prevents the Machine from reaching a Running state.

In short, whatever approach is used to address this bug will need to cater for both of those scenarios.

@himanshu-kun
Author

> in our case we're on 10m and 20m respectively, so even though they're different we're still hitting this bug.

The explanation still holds in your scenario above.

> that flag only appears to help when the Machine can reach a Pending status

No, this flag works for the CrashLoopBackOff phase as well. There are some improvements we have to make there, but that shouldn't be causing this issue; this issue is only caused by the race. Let me know if you feel otherwise.

@sssash18

/assign

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jan 30, 2024