
Implement Instances v2 #185

Merged: 13 commits into kubernetes-sigs:master, Dec 1, 2021

Conversation

@displague (Member) commented Jun 30, 2021

Breaking Changes:

  • --provider-config replaced with standard --cloud-config (defaults to /etc/kubernetes/cloud.cfg)
    • refactors main.go, moving much to metal/config.go

      This was done because app.NewCloudControllerManagerCommand has a new signature which ultimately passes a config file along. I compared notes with the GCP, DigitalOcean, vSphere, and OpenStack cloud providers. (A rough sketch of the resulting config handling follows below.)
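For illustration, a rough sketch of what the metal/config.go refactor implies: the provider receives the --cloud-config contents as an io.Reader and environment variables fill in or override values. The Config fields, JSON keys, and env var names below are assumptions, not the PR's exact code.

```go
package metal

// Illustrative sketch only; field, key, and env var names are assumptions.

import (
	"encoding/json"
	"io"
	"os"
)

type Config struct {
	AuthToken string `json:"apiKey"`
	ProjectID string `json:"projectId"`
	Metro     string `json:"metro,omitempty"`
	Facility  string `json:"facility,omitempty"`
}

// getMetalConfig reads the provider config from the io.Reader handed to the
// cloud-provider factory (the contents of --cloud-config), then lets
// environment variables fill in or override values.
func getMetalConfig(providerConfig io.Reader) (Config, error) {
	var cfg Config
	if providerConfig != nil {
		if err := json.NewDecoder(providerConfig).Decode(&cfg); err != nil {
			return cfg, err
		}
	}
	if v := os.Getenv("METAL_AUTH_TOKEN"); v != "" {
		cfg.AuthToken = v
	}
	if v := os.Getenv("METAL_PROJECT_ID"); v != "" {
		cfg.ProjectID = v
	}
	if v := os.Getenv("METAL_METRO_NAME"); v != "" {
		cfg.Metro = v
	}
	return cfg, nil
}
```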

Changes:

@displague displague requested a review from deitch June 30, 2021 02:07
@deitch (Contributor) commented Jun 30, 2021

The diffs were too big here; I think I need to just look at what it does.

Do you want to squash the commits?

@deitch (Contributor) commented Jun 30, 2021

Updates k8s.io dependencies to 1.21.2

As a general rule, aren't these backwards and forwards compatible within one minor version? If this means the CCM won't work with k8s <1.20 (i.e. 1.19, 1.18), that might be an issue. At the very least, we would need to cut a release with a clear warning that this won't work with those earlier versions. But that still might lead to some user pushback.

Removed the complex replace lines and GOMODULES.

My oh my, yes!!

If nodes running across sjc1 and sv15 before the introduction of Metros were running the CCM, would/could they have been in the same cluster? Now they can

They couldn't before. It was restricted to one facility, discovered from the metadata of the host where the CCM is running, or overridden via env var. I wouldn't worry about users who somehow managed to bypass that.

But this does mean we need to rename the env var and how it is used.

Can we ignore VLANs created before Metros were introduced because they were not supported by the CCM?

Yes. I wouldn't worry about it.

Zone / Region mapped to Facility / Metro respectively. Does this fit the Kubernetes expectations around topology? Nodes in the same metro (region) will be able to rebalance pods and the EM networking will allow it.

My opinion is that it does, but I would like a weigh-in from you and maybe @detiber as well. I am not as confident about it as I should be.

disables the routes controller, do we need to disable others? Do we use the cloud-node-lifecycle controller?

It was already disabled by returning nil?

@deitch (Contributor) commented Jun 30, 2021

To enable InstancesV2(), don't you need to return the &instances here instead of nil?

For that matter, we probably should return nil, false to the previous Instances() here
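For reference, the shape of the change being asked for, sketched against cloudprovider.Interface; the cloud struct and its instances field here are assumptions standing in for the real ones.

```go
package metal

import (
	cloudprovider "k8s.io/cloud-provider"
)

// Sketch only: the real cloud struct has more fields; instances is assumed to
// implement cloudprovider.InstancesV2.
type cloud struct {
	instances cloudprovider.InstancesV2
}

// Instances is the legacy interface; with InstancesV2 in use it opts out.
func (c *cloud) Instances() (cloudprovider.Instances, bool) {
	return nil, false
}

// InstancesV2 must return the actual implementation plus true; returning nil
// means the cloud-node controller never calls InstanceMetadata at all.
func (c *cloud) InstancesV2() (cloudprovider.InstancesV2, bool) {
	return c.instances, true
}
```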

@deitch (Contributor) commented Jun 30, 2021

I have two structural comments.

What actually calls the provider to register?

In the existing one, you call metal.InitializeProvider(config), which, in turn, eventually calls cloudprovider.RegisterCloudProvider.

In the change, that functionality is moved to metal.init() which calls the same cloudprovider.RegisterCloudProvider.

Is this a better structure, having it called by default? Why is that better than having the single entry point in the program know and call it? Or is it required by the new cloud-provider structure from k8s?
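For context, the pattern in question looks roughly like this; providerName is grounded in the existing code, while buildCloud is a hypothetical stand-in for the real constructor chain (config parsing plus newCloud with a packngo client).

```go
package metal

import (
	"errors"
	"io"

	cloudprovider "k8s.io/cloud-provider"
)

const providerName = "equinixmetal"

// init registers the provider with the k8s cloud-provider plugin registry as a
// side effect of importing this package. main.go then only needs to reference
// the package and name the provider on the command line; the registered
// factory is invoked later with the opened --cloud-config file.
func init() {
	cloudprovider.RegisterCloudProvider(providerName, func(config io.Reader) (cloudprovider.Interface, error) {
		// config is the contents of --cloud-config, handed over as an io.Reader.
		return buildCloud(config)
	})
}

// buildCloud is a hypothetical stand-in for the real constructor chain.
func buildCloud(config io.Reader) (cloudprovider.Interface, error) {
	_ = config // parsing elided in this sketch
	return nil, errors.New("sketch: not implemented")
}
```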

creating the config

I am ambivalent about moving all of the logic to create the finalized config from package main to package metal, and further having it then read env vars and such. I like the clean idea that all interaction with how this program was started lives in main (or perhaps a dedicated package), and that by the time we call init() (or InitializeProvider), or its callback routine, we have figured all of that out.

Is this something semi-official?

main.go (resolved review thread)
@displague (Member, Author) commented:

The diffs were too big here; I think I need to just look at what it does.
Do you want to squash the commits?

The commits are logical, let me know if they make more sense (are smaller) viewed individually.

The commit that updates to 1.21.2 is separate from the one that introduces the --cloud-config handling. It is worth noting that the 1.21.2 change by itself will not compile because of the app.NewCloudControllerManagerCommand changes needed and added in the following commit.

--cloud-config handling was mostly a move of the existing config functions from main.go to metal/config.go, changing metalConfig to read from a buffer instead of a filename. I got lost in the pflag changes; I don't know what the original code was trying to accomplish. I see similar pflag lines in the DO, OpenStack, and vSphere providers, but I don't see evidence of those providers setting the config file via pflags.

@displague (Member, Author) commented:

Is this a better structure, having it called by default? Why is that better than having the single entry point in the program know and call it? Or is it required by the new cloud-provider structure from k8s?

I think main.go's job is to say, "I have a plugin - here it is"; the plugin is not really activated at that point.

When the plugin is activated, it is handed a config buffer; perhaps there is some file-permission juggling going on? (/etc/kubernetes/cloud.cfg is needed, but the CCM shouldn't have direct filesystem access to it?) That's the sense I get from the structure, but I couldn't find text to back this up.

https://github.com/equinix/cloud-provider-equinix-metal/pull/185/files#diff-d677710ef260f87ecc6eedb2d9bf090758eaa9aeb4fd917444ff83ade963ca60R84

Is this something semi-official?

This is the pattern I found in the other providers I compared. The cloud-provider sample / example in the k8s repo sets this precedent.

@deitch (Contributor) commented Jun 30, 2021

I think main.go's job is to say, "I have a plugin - here it is", the plugin is not really activated.

I see what it is doing now (it took too much digging into the cloud-provider reference code). I don't love their approach, but I think what you are doing here is correct. I am still not madly in love with the env var overrides in the actual pkg/metal, but they don't seem to consider the possibility of overrides. We will live with it.

cloud-config

So does --cloud-config replace --provider-config? And it just creates an io.Reader to the file that is passed to the register function? Not sure why they changed that. Does it have any other impacts? Looking at the k8s.io/cloud-provider source code, it looks like it just passes it in as is.

If so, why do we need the --provider-config at all? We can deprecate it.
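Conceptually, the generic CCM side does little more than open that file and hand the reader to whichever factory was registered. A sketch of that flow (not the exact upstream code, which lives in k8s.io/cloud-provider's InitCloudProvider; the package and function names here are illustrative):

```go
package cloudconfigsketch // hypothetical package name for this sketch

import (
	"io"
	"os"

	cloudprovider "k8s.io/cloud-provider"
)

// initCloud sketches what --cloud-config turns into: an io.Reader passed to
// the factory registered by the provider package.
func initCloud(name, cloudConfigPath string) (cloudprovider.Interface, error) {
	var reader io.Reader
	if cloudConfigPath != "" {
		f, err := os.Open(cloudConfigPath)
		if err != nil {
			return nil, err
		}
		defer f.Close()
		reader = f
	}
	// GetCloudProvider looks up the factory added via RegisterCloudProvider and
	// calls it with the (possibly nil) reader.
	return cloudprovider.GetCloudProvider(name, reader)
}
```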

@displague (Member, Author) commented Jul 12, 2021

So does --cloud-config replace --provider-config? And it just creates an io.Reader to the file that is passed to the register function? Not sure why they changed that.

Was provider-config ever standard? Perhaps it was conventional.

Does it have any other impacts? Looking at the k8s.io/cloud-provider source code, it looks like it just passes it in as-is.
If so, why do we need the --provider-config at all? We can deprecate it.

This PR removes --provider-config support. If we want to continue to support it and deprecate it, I'll have to revisit the pflags code. After spending some time in packet-cli / metal-cli, I might be prepared for that.

@deitch (Contributor) commented Jul 14, 2021

Was provider-config ever standard? Perhaps it was conventional.

No idea. We add it explicitly to main.go, so if this is what "all the cool kids" are doing, I am fine just replacing it.

The commit that updates to 1.21.2 is separate from the one that introduces the --cloud-config handling. It is worth noting that the 1.21.2 change by itself will not compile because of the app.NewCloudControllerManagerCommand changes needed and added in the following commit.

I don't think we want commits at any point in the history that are not buildable on their own.

@deitch (Contributor) commented Jul 14, 2021

Did you have a chance to address some of the comments, e.g. instances and instances v2 and what needs to be returned?

Also, if we are replacing --provider-config with --cloud-config, then we need to update the deployment yaml and helm chart.

I have some other work I want to do on CPEM, so once this is in a good state and we get it in, I will attack those.

@displague (Member, Author) commented:

I've rebased this against the latest changes. We may still want to preserve the legacy --provider-config; for now I've updated the charts to use --cloud-config.

@displague displague marked this pull request as ready for review October 26, 2021 03:43
@deitch (Contributor) left a review:

I am good to go with most of this. We can make --provider-config no longer function; that is fine. It is all packaged up inside an image anyway, which is launched from the deployment manifest or the helm chart. The very few people who run it manually read the README and will get the latest. The even fewer people who run it as a standalone binary (grand total of somewhere between 1 and 3) will know of the change.

I had a few small changes:

  • To enable instancesv2, do we not need to return the actual *instances here along with true, rather than nil, false?
  • As we aren't going to support the old one, can we just get rid of this section?

@deitch (Contributor) commented Oct 26, 2021

It also needs a rebase. Once the above is in, and rebased, let me know and I can run some tests.

@deitch (Contributor) commented Oct 26, 2021

@displague can we get the versioning in go.mod/go.sum correct, i.e. 1.21.2 instead of 0.21.2?

@displague (Member, Author) commented Oct 26, 2021

@deitch The versioning is intended to be 0.21.2, per https://github.com/kubernetes/client-go#versioning.

Notably, the removal of this line is what allowed for the replace block in go.mod to be removed:

k8s.io/kubernetes v1.19.11

@deitch (Contributor) commented Oct 26, 2021

Right you are. I had gotten confused. No problem.

@displague (Member, Author) commented Oct 26, 2021

Rebased and removed the dead code and comments related to --provider-config.

To enable instancesv2, do we not need to return the actual *instances here along with true, rather than nil, false?

That does seem to be important. Captured in 20e5c56.

It also needs a rebase. Once the above is in and rebased, let me know and I can run some tests.

test vigorously

@detiber (Member) commented Oct 26, 2021

Backward-compat support for client-go should be fine. Depending on how far back we are going, we might need to be mindful of added/removed fields and backward-compatible support for upstream annotations.

What is our targeted minimum supported version? We could probably start building out some automated testing using controller-runtime's envtest, or even using cluster-api-provider-packet to do some verification against the minimum version.
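A minimal sketch of what that could look like with controller-runtime's envtest (this assumes setup-envtest has made a kube-apiserver/etcd pair available locally; nothing here is in this PR, and the test body is a placeholder):

```go
package metal_test

import (
	"testing"

	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// TestAgainstAPIServer starts a real kube-apiserver/etcd with envtest so the
// provider can be exercised against a specific Kubernetes minor version.
func TestAgainstAPIServer(t *testing.T) {
	testEnv := &envtest.Environment{}

	cfg, err := testEnv.Start()
	if err != nil {
		t.Fatalf("starting envtest: %v", err)
	}
	defer testEnv.Stop()

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		t.Fatalf("building clientset: %v", err)
	}
	// Hypothetical next step: create Node objects and drive the CCM's node
	// controller against them, asserting on labels and provider IDs.
	_ = clientset
}
```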

@deitch (Contributor) commented Oct 26, 2021

Backward compat support for client-go should be fine, depending on how far we are going back, we might need to be mindful of added/removed fields and backward compatible support for upstream annotations

We really should have a supported versions matrix. With each new versioned release, we can add to the table. @detiber do you know how to build that from the specific client-go versions used in each release? I am happy to build a table of CPEM version->client-go version, but I don't know how to fill in the supported versions. Can you do that if I put in the skeleton?

We could probably start building out some automated testing using controller-runtime's envtest

I rather like that idea. Needs a new PR, of course.

deitch previously approved these changes Oct 26, 2021

@deitch (Contributor) left a review:

LGTM. I am going to run some more tests, and then we can merge it in.

deploy/chart/templates/deployment.yaml (resolved review thread)
deploy/template/deployment.yaml (resolved review thread)
metal/cloud.go (resolved review thread)
README.md (outdated, resolved review thread)
metal/cloud.go (outdated, resolved review thread)
@@ -78,22 +79,28 @@ func newCloud(metalConfig Config, client *packngo.Client) (cloudprovider.Interfa
}, nil
(Member) commented:

It doesn't look like cloud.facility is used anywhere, and metalConfig.Facility is only passed through to newLoadBalancers(). It might make sense to remove facility from the cloud struct to avoid confusion.

(Contributor) replied:

Yes, I believe you are right. We should remove it as part of a general PR to address the facility/metadata issue. I want to get that in once this PR is done and in.

(Contributor) replied:

Do you mind putting it in the comments on that issue?

metal/cloud.go (resolved review thread)
@deitch (Contributor) commented Nov 30, 2021

local metadata test (metadata.platformequinix.com) to deviceByNode

We are trying to get rid of local metadata entirely from CPEM. One cannot fairly depend on the CPEM running in the same facility as the worker nodes (or even on EQXM at all).


// InstanceExists returns true if the node exists in cloudprovider
func (i *instances) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	return i.InstanceExistsByProviderID(ctx, node.Spec.ProviderID)
}
(Contributor) commented:

I really would like this to be more flexible, using your new deviceByNode(), so it can handle nodes with or without a provider ID. But as InstancesV2 says that InstanceExists and InstanceShutdown could use either, let's leave this. I will open a new PR to do that once this is in.
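Something like the following fragment is what that follow-up might look like; the deviceByNode signature and its not-found behaviour are assumptions, not code from this PR:

```go
// Sketch of a more flexible InstanceExists; deviceByNode's signature here is
// an assumption about the helper this PR introduces.
func (i *instances) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	// Prefer the provider ID when the node already carries one.
	if node.Spec.ProviderID != "" {
		return i.InstanceExistsByProviderID(ctx, node.Spec.ProviderID)
	}
	// Otherwise fall back to matching the device by node name/metadata.
	device, err := i.deviceByNode(ctx, node)
	if err != nil {
		return false, err
	}
	return device != nil, nil
}
```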

@deitch (Contributor) commented Nov 30, 2021

Retesting.

@deitch (Contributor) commented Nov 30, 2021

So, it mostly works, but the BGP does not. The reason is that there are other places that rely on the node having a provider ID set. Yet, even though the instance metadata is set correctly during initialization, it still does not set the provider ID.

Logs from setting annotations:

I1130 17:35:57.728210   17585 node_controller.go:492] Adding node label from cloud provider: beta.kubernetes.io/instance-type=c3.medium.x86
I1130 17:35:57.728239   17585 node_controller.go:493] Adding node label from cloud provider: node.kubernetes.io/instance-type=c3.medium.x86
I1130 17:35:57.728260   17585 node_controller.go:504] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=da11
I1130 17:35:57.728273   17585 node_controller.go:505] Adding node label from cloud provider: topology.kubernetes.io/zone=da11
I1130 17:35:57.728308   17585 node_controller.go:515] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=da
I1130 17:35:57.728324   17585 node_controller.go:516] Adding node label from cloud provider: topology.kubernetes.io/region=da
I1130 17:35:57.728334   17585 cloud.go:190] called ProviderName, returning equinixmetal

Notice that it called the ProviderName(). And here is the node:

apiVersion: v1
kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2021-11-30T09:20:20Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: c3.medium.x86
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: da
    failure-domain.beta.kubernetes.io/zone: da11
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: k8s-master
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: ""
    node-role.kubernetes.io/master: ""
    node.kubernetes.io/exclude-from-external-load-balancers: ""
    node.kubernetes.io/instance-type: c3.medium.x86
    topology.kubernetes.io/region: da
    topology.kubernetes.io/zone: da11
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master

I left out the irrelevant status and managedFields. I am going to dig further, but does InstancesV2 not set the provider ID?

@deitch (Contributor) left a review:

One missing piece that I found, which is causing the rest to fail.

metal/devices.go (outdated):

	return &cloudprovider.InstanceMetadata{
		ProviderID: node.Spec.ProviderID,
(Contributor) commented:

Aha, here! This is where you are expected to set the provider ID, but it is just passing back what it received.

I believe this should be fmt.Sprintf("%s://%s", ProviderName, device.ID)
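In context, that would look roughly like the following; only the ProviderID expression is the actual suggestion, the other fields and locals are illustrative:

```go
// Sketch of the suggested fix inside InstanceMetadata; only the ProviderID
// line comes from the review, the rest is illustrative.
return &cloudprovider.InstanceMetadata{
	ProviderID:    fmt.Sprintf("%s://%s", ProviderName, device.ID),
	InstanceType:  device.Plan.Slug, // assumed mapping from the packngo device
	NodeAddresses: nodeAddresses,    // assumed to be assembled earlier in the function
	Zone:          zone,             // facility code, per the Zone/Facility mapping
	Region:        region,           // metro code, per the Region/Metro mapping
}, nil
```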

(Contributor) replied:

Confirmed that the above works.

(Member, Author) replied:

@deitch I just want to say, VSCode Copilot is amazing..

[image: VSCode Copilot suggestion]

It's like a portable @deitch.

(Contributor) replied:

haha! Only as portable as your laptop. 😁

Make the changes and I can approve and merge this in. It is already fully tested.

metal/devices.go (outdated, resolved review thread)
@displague displague requested a review from deitch December 1, 2021 13:29
Commits pushed, each Signed-off-by: Marques Johansson <mjohansson@equinix.com> (commit titles truncated in this view: "…-config)", "…erID references").
@deitch (Contributor) commented Dec 1, 2021

Once CI is clean, we can merge it in

@displague (Member, Author) commented Dec 1, 2021

CI was passing, and then I raced your approval with a rebase.
I squashed out the commit that introduced packngo 0.17. That commit was appearing after the packngo 0.20 commit and was only removing commented-out code. (No code changes from what was approved: https://github.com/equinix/cloud-provider-equinix-metal/compare/8e23b1f75979db6765bebb358d2119dfd879e31e..fe2ed0073c683a38701f0d6dfb657e013cc2d102, only commit messages.)

I'll click merge when this last test finishes, thanks for the diligent review and exceptional patience!

@displague (Member, Author) commented:

I misspoke - the 'build' action was still processing when I rebased. Yipes, that job takes some time doing package installs.

@deitch (Contributor) commented Dec 1, 2021

Yeah, it is really slow. Long story. I will take a look at it once this is in.

@deitch deitch merged commit 033d052 into kubernetes-sigs:master Dec 1, 2021
@deitch (Contributor) commented Dec 1, 2021

Thank you for all the great work on this @displague !
