Implement Instances v2 #185
Conversation
The diffs were too big here; I think I need to just look at what it does. Do you want to squash the commits?
As a general rule, aren't these backwards and forwards compatible one minor version? If this means ccm won't work with k8s <1.20 (i.e. 1.19, 1.18), that might be an issue. At the very least, we would need to cut a release with a clear warning that this won't work with those earlier ones. But that still might lead to some user pushback.
My oh my, yes!!
They couldn't before. It was restricted to one facility. It discovered the facility from the metadata where the CCM is running, or via an env var override. I wouldn't worry about users who managed somehow to bypass that. But this does mean we need to rename the env var and how it is used.
Yes. I wouldn't worry about it.
My opinion is that it does, but I would like a weigh-in from you and maybe @detiber as well. I am not as confident as I should be about it.
It already was disabled by returning a
I have two structural comments.

What actually calls the provider to register? In the existing one, you call metal.InitializeProvider(config), which, in turn, eventually calls cloudprovider.RegisterCloudProvider. In the change, that functionality is moved to metal.init(), which calls the same cloudprovider.RegisterCloudProvider. Is this a better structure, having it called by default? Why is that better than having the single entry point in the program know and call it? Or is it required by the new cloud-provider structure from k8s?

Creating the config: I am ambivalent about moving all of the logic to create the finalized config from [...]. Is this something semi-official?
The commits are logical, let me know if they make more sense (are smaller) viewed individually. The commit that updates to 1.21.2 is separate from the one that introduces the
I think main.go's job is to say, "I have a plugin - here it is", the plugin is not really activated. When the plugin is activated, it is handed a config buffer, perhaps there is some file permission juggling going on? (/etc/kubernetes/cloud.cfg is needed, but the ccm shouldn't have direct filesystem access to it?). That's the sense I get from the structure, but I couldn't find text to back this up.
This is the pattern I found in the other providers I compared. The cloud-provider sample / example in the k8s repo sets this precedent.
I see what it is doing now (took too much digging into the cloud-provider reference code). I don't love their approach, but I think what you are doing here is correct. I still am not madly in love with the env var overrides in the actual
So does [...]? If so, why do we need the [...]?
Was provider-config ever standard? Perhaps it was conventional.
This PR removes --provider-config support. If we want to continue to support it, and deprecate it, I'll have to revisit the pflags code. After spending some time in
No idea. We add it explicitly to
I don't think we want commits at any point that are not rebuildable on their own.
Did you have a chance to address some of the comments, e.g. instances and instances v2 and what needs to be returned? Also, if we are replacing [...]

I have some other work I want to do on CPEM, so once this is in a good state and we get it in, I will attack those.
Force-pushed from e25f97f to 90523d4
I've rebased this against the latest changes. We may still want to preserve the legacy
I am good to go with most of this. We can make --provider-config no longer function, that is fine. It is all packaged up inside an image anyway, which is launched from the deployment manifest or the helm chart. The very few people who run it manually read the README and will get the latest. The even fewer people who run it as a standalone binary (a grand total of somewhere between 1 and 3) know of the change.
I had a few small changes:
- To enable instancesv2, do we not need to return the actual *instances here along with true, rather than nil, false?
- As we aren't going to support the old one, can we get rid of this section?
It also needs a rebase. Once the above is in, and rebased, let me know and I can run some tests.
@displague can we get the versioning in
@deitch The versioning is intended to be 0.21.2, per https://github.com/kubernetes/client-go#versioning. Notably, the removal of this line is what allowed for the
Right you are. I had gotten confused. No problem.
Force-pushed from 213beed to 7edc468
Rebased and removed the dead code and comments related to
That does seem to be important. Captured in
test vigorously
Force-pushed from 9675047 to e360a26
Backward compat support for client-go should be fine; depending on how far we are going back, we might need to be mindful of added/removed fields and backward compatible support for upstream annotations. What is our targeted minimum supported version? We could probably start building out some automated testing using controller-runtime's envtest, or even using cluster-api-provider-packet to do some verification against the minimum version.
We really should have a supported versions matrix. With each new versioned release, we can add to the table. @detiber do you know how to build that from the specific client-go versions used in each release? I am happy to build a table of CPEM version->client-go version, but I don't know how to fill in the supported versions. Can you do that if I put in the skeleton?
I rather like that idea. Needs a new PR, of course.
LGTM. I am going to run some more tests, and then we can merge it in.
@@ -78,22 +79,28 @@ func newCloud(metalConfig Config, client *packngo.Client) (cloudprovider.Interfa
}, nil
It doesn't look like cloud.facility is used anywhere, and metalConfig.Facility is only passed through to newLoadBalancers(). It might make sense to remove facility from the cloud struct to avoid confusion.
Yes, I believe you are right. We should remove it as part of a general PR to address the facility/metadata issue. I want to get that in once this PR is done and in.
Do you mind putting it in the comments on that issue?
We are trying to get rid of local metadata entirely from CPEM. One cannot fairly depend on the CPEM running in the same facility as the worker nodes (or even on EQXM at all).
// InstanceExists returns true if the node exists in cloudprovider
func (i *instances) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	return i.InstanceExistsByProviderID(ctx, node.Spec.ProviderID)
}
I really would like to have this be more flexible, using your new deviceByNode(), so it can handle nodes with or without a provider ID. But as InstancesV2 says that InstanceExists and InstanceShutdown could use either, let's leave this. I will open a new PR to do that once this is in.
Retesting.
So, it mostly works, but the bgp does not. The reason for that is that there are other places that rely on the node having a provider ID set. Yet, even after instance metadata is set correctly during initialization, it still does not set the provider ID. Logs from setting annotations:
Notice that it called the [...]:

apiVersion: v1
kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2021-11-30T09:20:20Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: c3.medium.x86
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: da
    failure-domain.beta.kubernetes.io/zone: da11
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: k8s-master
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: ""
    node-role.kubernetes.io/master: ""
    node.kubernetes.io/exclude-from-external-load-balancers: ""
    node.kubernetes.io/instance-type: c3.medium.x86
    topology.kubernetes.io/region: da
    topology.kubernetes.io/zone: da11
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master

I left out the irrelevant status and managedFields. I am going to dig further, but does InstancesV2 not set the provider ID?
One missing piece that I found, which is causing the rest to fail.
metal/devices.go (Outdated)
return &cloudprovider.InstanceMetadata{
	ProviderID: node.Spec.ProviderID,
Aha, here! This is where you are expected to set the provider ID, but it is just passing back what it received.
I believe this should be fmt.Sprintf("%s://%s", ProviderName, device.ID)
Confirmed that the above works.
haha! Only as portable as your laptop. 😁
Make the changes, I can approve and merge this in. Already fully tested.
Signed-off-by: Marques Johansson <mjohansson@equinix.com>
Force-pushed from 8e23b1f to fe2ed00
Once CI is clean, we can merge it in.
CI was passing and then I raced your approval with a rebase. I'll click merge when this last test finishes. Thanks for the diligent review and exceptional patience!
I misspoke - the 'build' action was still processing when I rebased. Yipes, that job takes some time doing package installs.
Yeah, it is really slow. Long story. I will take a look at it once this is in.
Thank you for all the great work on this @displague !
Breaking Changes:
- --provider-config replaced with standard --cloud-config (defaults to /etc/kubernetes/cloud.cfg)
- refactors main.go, moving much to metal/config.go

This was done because app.NewCloudControllerManagerCommand has a new signature which ultimately passes a config file along. I compared notes with the GCP, DigitalOcean, vSphere, and OpenStack cloud providers.

Changes:
- replace lines and GOMODULES.md (see "unknown revision v0.0.0" errors, seemingly due to "require k8s.io/foo v0.0.0", kubernetes/kubernetes#79384 (comment))
- routes controller: do we need to disable others? Do we use the cloud-node-lifecycle controller?
- Implements Instances v2 https://github.com/kubernetes/cloud-provider/blob/v0.21.2/cloud.go#L196-L212
- Zone / Region mapped to Facility / Metro respectively, https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesioregion