Refactor EIP handling #230
Conversation
for _, a := range node.Status.Addresses {
	// Find the non-EIP external address for the node to use for the health check
	if a.Type == v1.NodeExternalIP && a.Address != controlPlaneEndpoint.Address {
		controlPlaneHealthURL = fmt.Sprintf("https://%s:%d", a.Address, m.nodeAPIServerPort)
	}
}
With ClusterAPI, we are currently setting the EIP address as a loopback address on all control plane nodes to work around the bootstrapping issue of the Service/Endpoints not yet being set, so kubeadm init doesn't fail.
This was causing an issue with the health checks not actually checking the proper endpoints in some circumstances...
Force-pushed from ca82a3d to f39e42e.
Heh, a lot smaller now. I will hold off until we resolve #232 and then we can address this one.
The merger of #232 should make this even smaller post-rebase?
@deitch rebased on latest.
@detiber I get most of it - looks good - except for the core. The existing one depends on [...]. In this proposed one, it appears to change that. It only checks the health in [...]. How does this regularly check the status of the EIP and update it if necessary? I get that it would work when [...]
Also, it doesn't look like it checks the EIP address, only the node's?
I was just running through another test of my PR that adds [...]. So I believe there are 3 routes that could be taken here. [...]
Edit: [...]
It only does this if you pass it the right tag for the EIP. If you do not, CCM ignores it. That is the flag that says, "you should control the EIP" or "you should ignore it". If you are using kube-vip in its mode of managing the control plane EIP, then the CCM should not have the tag set.
@deitch Thanks for the explanation, I just found that as well and updated my PR to reflect the change.
Because the resync interval on the informer is set, every time a resync happens [...]
Yeah, that was the bit that I suspected would get some push back... In the case of Cluster API, we are configuring the EIP as a local loopback address: https://github.com/kubernetes-sigs/cluster-api-provider-packet/blob/main/templates/cluster-template.yaml#L30-L36. This was needed to work around how [...].

This resulted in issues during upgrade, where the local etcd instance is removed from the cluster prior to the machine being deleted, which would result in the local apiserver being unresponsive. That's not necessarily an issue if CPEM is running on the machine being deleted, but it becomes an issue when the node currently assigned the EIP is a control plane node that isn't running CPEM. CPEM attempts to health check the EIP, which resolves to localhost and passes, so it doesn't attempt to re-assign the EIP.
Ah, I didn't realize that. So basically, every resync interval it will automatically call the health check? Then that makes sense.
Oh yes, I have suffered through that. I do wish kubeadm was a bit more configurable in that regard.
So the reason you did it is so that we won't have a situation where the health check in CPEM thinks everything is healthy - because it is checking the EIP, which is on local loopback - but really it isn't OK when coming from the Internet?
Exactly
@deitch Do you think it would make sense to make using the EIP URL or not a configurable option? I'm not sure how others may be using CPEM outside of Cluster API, so I don't know if it's a broader issue or just a localized one.
I don't understand. |
@deitch Rather than just avoiding the use of the EIP for health checks automatically (like this PR is doing), would it make sense to make that behavior opt-in, with the default continuing to use the EIP URL for health checks?
Would it make sense to split the conversation? This PR is changing two things simultaneously: how we invoke health checks, and what health checks test. Does this PR have any value doing just the first, and then having a separate one focused just on the second? Or must it go together?
Now that I get it, yes, I think so. My concern is lack of consistent EIP behaviour by CPEM, but that can be acceptable I think.
Both are really needed in order to unblock the CAPI v1 support for CAPP, so I'm not sure there is much value in breaking it up. That said, more than happy to break it up if that is your preference.
This should be ready to go.
Reviewing now. @detiber can you add the new option env var to the README?
FilterFunc: func(obj interface{}) bool {
	n, _ := obj.(*v1.Node)
	return m.nodeFilter(n)
},
Given that the apiServerPort might not be set the first time a node is added, will this work the next time? Is the FilterFunc called on every sync?
This is rather nicely done, and I think simplifies a lot.
I mostly have some questions, README requests, and some small changes.
I still need to address the README updates, but did quite a bit of cleanup based on the other comments. Working on validating that the Server Side Apply changes didn't break anything, though 😬
Server Side Apply changes seem to be working well; I will pick back up on the README changes tomorrow.
This all looks solid. Just needs the README updated and that [...]
@deitch README is updated and we are no longer filtering anything. I also squashed down the commits and cleaned up the commit message.
This all looks good, other than the single bug I think I might have found.
Comment here when that is patched, and I can run another series of tests on it.
- Reconcile Service/endpoint updates separately
- Provide option to perform health check against non-EIP External Address for the assigned control plane node
- Attempt to pro-actively migrate EIP in the case of node deletion or the node becoming unschedulable

Signed-off-by: Jason DeTiberus <detiber@users.noreply.github.com>
LGTM. Let's let CI go green. If you are comfortable with it @detiber, then once green, we can merge in.
@deitch sgtm, I've been testing this code pretty regularly from an EIP perspective with the automated testing I've been doing for CAPP.
Really nice on this one, thank you.
These changes have been running as part of an automated test suite against my fork of cluster-api-provider-packet. Prior to these changes, the flake frequency of the tests involving upgrade workflows or control plane node remediation was much higher.
A test deployment with these changes can be found at https://raw.githubusercontent.com/detiber/packet-ccm/test/deploy/template/deployment.yaml, which uses a pre-built image with these changes: quay.io/detiber/cloud-provider-equinix-metal:dev
I suspect there are still some improvements to be made to EIP handling, since there still seem to be a few edge cases where the EIP takes a while to be migrated, but overall these changes seem to provide quite an improvement over the previous handling.