operator: Add cilium node garbage collector #19576
Conversation
Force-pushed from 76a8681 to 1cb3631.
Changes look good, just a few minor comments
Force-pushed from 1cb3631 to 5203cb0.
Seeing that we are selectively adding RBAC permissions and passing the nodes-gc-interval flag through the ConfigMap, I'm assuming there is a use case for disabling this feature. In what context would users want to disable it?

Follow-up question: how are users expected to effectively disable this feature through Helm? I think the combination of the flag being a duration, the default value, and the different ways of checking it (hasKey in the ConfigMap, "emptiness" in the ClusterRole) might not work as expected:
% helm template cilium ./install/kubernetes/cilium --namespace kube-system | grep -E "nodes-gc-interval|# To perform CiliumNode garbage collector"
nodes-gc-interval: "5m0s"
# To perform CiliumNode garbage collector
% helm template cilium ./install/kubernetes/cilium --namespace kube-system --set operator.nodeGCInterval=1s | grep -E "nodes-gc-interval|# To perform CiliumNode garbage collector"
nodes-gc-interval: "1s"
# To perform CiliumNode garbage collector
Here the first grep expression is for the flag value, the second for RBAC. Using the default or an explicit override correctly sets the flag and adds the RBAC permission, happy day™.
% helm template cilium ./install/kubernetes/cilium --namespace kube-system --set operator.nodeGCInterval=null | grep -E "nodes-gc-interval|# To perform CiliumNode garbage collector"
% helm template cilium ./install/kubernetes/cilium --namespace kube-system --set operator.nodeGCInterval= | grep -E "nodes-gc-interval|# To perform CiliumNode garbage collector"
nodes-gc-interval: ""
% helm template cilium ./install/kubernetes/cilium --namespace kube-system --set operator.nodeGCInterval=0s | grep -E "nodes-gc-interval|# To perform CiliumNode garbage collector"
nodes-gc-interval: "0s"
# To perform CiliumNode garbage collector
Setting it to null shows nothing, meaning we don't pass the flag and we don't add the RBAC permission. The issue here is that the operator will then use the non-zero CLI default (5m) and will fail because it lacks the permission.
Setting it to empty actually passes the empty string as-is to the flag, which will fail duration parsing.
Setting it to 0s correctly disables the feature in the ConfigMap, but won't add the RBAC permission.
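For reference, a minimal Go sketch of why those last two values behave differently. This is just the standard library's duration parsing, not the operator's actual flag handling, and whether the operator treats a zero interval as "disabled" is an assumption on my part:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// An empty string cannot be parsed as a duration, so passing
	// nodes-gc-interval: "" through the ConfigMap makes flag parsing fail.
	if _, err := time.ParseDuration(""); err != nil {
		fmt.Println("empty value:", err)
	}

	// "0s" parses fine and yields a zero duration, which is the natural
	// "disabled" sentinel (assuming the operator checks for zero).
	if d, err := time.ParseDuration("0s"); err == nil && d == 0 {
		fmt.Println(`"0s" parses to a zero duration -> GC disabled`)
	}
}
```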
Thanks for your input; indeed there are a few cases I didn't consider.
Normally we don't want to disable this GC; it's just an option for users in case of any issue later on.
Agreed on the above points, I have made the changes below to rectify this.
Force-pushed from 5203cb0 to 8f4d9b0.
Force-pushed from 39bf622 to 71d0e3e.
/test-1.16-4.9
/test-1.22-4.19
/test-1.23-net-next
/test-1.21-5.4
All the tests passed now, marking this ready for merge.
Description
In the normal scenario, a CiliumNode is created by the agent with owner
references attached at all times, as of the PR below [0]. However, there can
be cases where a CiliumNode is created by the IPAM module [1], which
does not set any ownerReferences at all. In that case, if the
corresponding node is terminated and never comes back with the same
name, the CiliumNode resource is left dangling and needs to be
garbage collected.
This commit adds a garbage collector for CiliumNode with the following logic (sketched in code below):
- Run the GC periodically (interval configured by the flag --nodes-gc-interval).
- For each CiliumNode, check whether a corresponding Node resource exists. If it does, also remove this node from the GC candidate list if required.
- If no corresponding Node exists, mark the CiliumNode as a GC candidate. On the next run, if the node is still in the GC candidate list, remove it.
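A minimal sketch of that two-pass candidate logic, purely illustrative: the type, field, and function names here (ciliumNodeGC, runOnce, the lookup and delete callbacks) are hypothetical and not the actual operator code:

```go
package main

import "time"

// ciliumNodeGC is a hypothetical illustration of the two-pass GC described above.
type ciliumNodeGC struct {
	interval   time.Duration       // value of --nodes-gc-interval
	candidates map[string]struct{} // CiliumNodes marked on the previous pass
}

// runOnce performs a single GC pass over the known CiliumNode names.
func (gc *ciliumNodeGC) runOnce(
	ciliumNodes []string,
	k8sNodeExists func(string) bool,
	deleteCiliumNode func(string),
) {
	for _, name := range ciliumNodes {
		if k8sNodeExists(name) {
			// A matching Node exists: make sure this CiliumNode is not
			// (or no longer) a GC candidate.
			delete(gc.candidates, name)
			continue
		}
		if _, marked := gc.candidates[name]; marked {
			// Still missing a Node since the previous pass: collect it.
			deleteCiliumNode(name)
			delete(gc.candidates, name)
			continue
		}
		// First pass without a matching Node: only mark it as a candidate.
		gc.candidates[name] = struct{}{}
	}
}

func main() {
	gc := &ciliumNodeGC{interval: 5 * time.Minute, candidates: map[string]struct{}{}}
	nodeExists := func(name string) bool { return name != "terminated-node" }
	remove := func(name string) { println("garbage collecting CiliumNode", name) }

	// The first pass only marks "terminated-node"; the second pass removes it.
	gc.runOnce([]string{"healthy-node", "terminated-node"}, nodeExists, remove)
	gc.runOnce([]string{"healthy-node", "terminated-node"}, nodeExists, remove)
}
```

In the operator the pass would run on a timer driven by --nodes-gc-interval; the point of the candidate set is that a CiliumNode is only deleted after its Node has been missing for two consecutive passes.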
Testing
Testing was done locally with a kind cluster:
Create one dummy CiliumNode and check that GC removes it.