Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

operator: fix crash that occurs when cep CRD is disabled and CES enabled with kvstore #25798

Merged

Conversation

doniacld
Copy link
Contributor

@doniacld doniacld commented May 31, 2023

Description

Make the operator crash early if CiliumEndpointSlice is enabled and CiliumEndpoint CRDs is disabled since it does not make sense to have CES without cep running.

Add a check at helm installation level.

Fixes #24396

Test

Test at runtime

Adding the following configuration in kind-values.yaml and deploy a kind cluster

enableCiliumEndpointSlice: true
disableEndpointCRD: true
identityAllocationMode: "kvstore"

NB: Tested with and without identityAllocationMode: "kvstore"

Check the logs from the operator

$ ks get pod                                                                                    
NAME                                         READY   STATUS             RESTARTS     AGE
cilium-24plg                                 0/1     Init:3/6           0            3s
cilium-fblnk                                 0/1     Init:3/6           0            3s
cilium-operator-65fb77cf85-jv8xh             0/1     CrashLoopBackOff   1 (2s ago)   3s
$ ks logs cilium-operator-65fb77cf85-jv8xh -f                                                                  
level=debug msg="Skipped reading configuration file" error="Config File \"cilium-operators\" Not Found in \"[/]\"" subsys=config
level=fatal msg="Running Cilium with enable-cilium-endpoint-slice=true requires disable-endpoint-crd set to false to enable CRDs." subsys=config

Test during deployment

The error is properly raised.

$ helm install  --namespace kube-system    -f ./contrib/testing/kind-values.yaml cilium ./install/kubernetes/cilium   
Error: INSTALLATION FAILED: execution error at (cilium/templates/validate.yaml:76:7): Cilium operator requires .Values.enableCiliumEndpointSlice=true and .Values.disableEndpointCRD=false
Change enableEndpointCRD helm option type from string to boolean
Fix operator panic that occurs when Endpoint CRD is disabled and CiliumEndpointSlice is enabled

@doniacld doniacld requested review from a team as code owners May 31, 2023 13:48
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 31, 2023
@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch 2 times, most recently from 86fe7e2 to e5fa687 Compare May 31, 2023 14:30
@tommyp1ckles tommyp1ckles self-requested a review May 31, 2023 16:18
Copy link
Member

@mhofstetter mhofstetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @doniacld

thanks for the PR - looks good so far! only some proposals regarding the failing checks and missing labels from my side.

Checks

looks like the check BPF checks / checkpatch is failing due to the long commit message subject.

Error: ERROR:CUSTOM: Please avoid long commit subjects (max: 75, found: 85)

Not sure what's going on with the other failing CIlium Runtime checks that have been introduced lately. Might be worth to rebase to main to fetch the latest version while renaming the commit message.

Labels

you should also add a relevant release note label to this PR (https://docs.cilium.io/en/latest/contributing/development/contributing_guide/#submitting-a-pull-request - 11.) - IMO release-note/misc should be appropriate in this case (which doesn't require an explicit release note).

it's also helpful to add some labels to bring the PR into context - e.g. area/operator, kind/bug

@doniacld doniacld added area/operator Impacts the cilium-operator component kind/bug This is a bug in the Cilium logic. release-note/misc This PR makes changes that have no direct user impact. labels Jun 2, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 2, 2023
@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch 2 times, most recently from eb3407c to cd20d17 Compare June 2, 2023 08:44
@doniacld
Copy link
Contributor Author

doniacld commented Jun 2, 2023

Thanks @mhofstetter for your feedback! I added the labels you mentioned and reduced the char size of my commit subject.

@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch from cd20d17 to 8fdf21a Compare June 2, 2023 08:48
Copy link
Member

@mhofstetter mhofstetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@doniacld thanks for adding the labels and fixing the commit subject.

Check in the Go-Code when starting the operator looks good to me.

But it might be worth to add this check to the Helm Chart too - to prevent an installation with these incompatible values in an earlier stage. This would also cover an installation via Cilium CLI (with helm mode).

The two config values are configured in the configmap (disableEndpointCRD & enableCiliumEndpointSlice).

I'd recommend to place a check based on the two helm values .Values.enableCiliumEndpointSlice & .Values.disableEndpointCRD in the helm validation file (https://github.com/cilium/cilium/blob/main/install/kubernetes/cilium/templates/validate.yaml)

@doniacld
Copy link
Contributor Author

doniacld commented Jun 5, 2023

Thanks for your comment regarding the helm check, we had this conversation with @gandro who gave me some history regarding the validate.yaml strategy (@gandro being the first author of this idea). Our first thought was that we do not want really to maintain this file and keep it but since it is here, I do think it will not hurt anyone to have a supplementary check for the moment.

@doniacld doniacld requested review from a team as code owners June 5, 2023 13:36
@doniacld doniacld requested a review from kaworu June 5, 2023 13:36
@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch 2 times, most recently from 0e82161 to a13ff4f Compare June 5, 2023 13:40
Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @doniacld patch LGTM!

install/kubernetes/cilium/templates/validate.yaml Outdated Show resolved Hide resolved
@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch 2 times, most recently from d26bed2 to 6b8fec5 Compare June 5, 2023 15:30
Copy link
Member

@mhofstetter mhofstetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@doniacld thanks for adding the helm validation!

looks like the docker build failed due to some infra issues (and therefore all dependent checks).

i'd recommend to rebase to main to re-trigger all basic checks. In addition you can trigger the e2e-tests by commenting /test (in a single comment)

@tommyp1ckles your review became mandatory (with sig-k8s). please chime in :)

@mhofstetter
Copy link
Member

mhofstetter commented Jun 7, 2023

I added a new commit to change disableEndpointCRD type from string to boolean. The default value was a string "false", with a boolean type we avoid transformation in helm from string to boolean. Note that in Go code, disableCiliumEndpointCRD is a boolean.

@doniacld thanks for the heads-up.

in general i prefer having explicit types. it's also nice that helm automatically converts the strings to booleans.

it's just that setting disableEndpointCRD="false" (explicitly set as string in the customers values.yaml or via --set-string - which might be the case at customer side (because it was a string until now)) will result in disable-endpoint-crd: "true" in the configmap ❗ this is kind of how the Go template if works.

helm template ./install/kubernetes/cilium --set-string disableEndpointCRD="false" | grep "disable-endpoint-crd"
  disable-endpoint-crd: "true"

i don't know whether we want that and whether the note in the upgrade guide is enough 🤷‍♂️ maybe a question for more seasoned cilium teammembers? @qmonnet @kaworu

@qmonnet
Copy link
Member

qmonnet commented Jun 7, 2023

it's just that setting disableEndpointCRD="false" (explicitly set as string in the customers values.yaml or via --set-string - which might be the case at customer side (because it was a string until now)) will result in disable-endpoint-crd: "true" in the configmap

Right, I'm no Helm expert but this sounds like the ideal thing to introduce bugs, good catch. Is there a way to add some validation on the type (or value), so that users get an error if they pass "false"?

@mhofstetter
Copy link
Member

it's just that setting disableEndpointCRD="false" (explicitly set as string in the customers values.yaml or via --set-string - which might be the case at customer side (because it was a string until now)) will result in disable-endpoint-crd: "true" in the configmap

Right, I'm no Helm expert but this sounds like the ideal thing to introduce bugs, good catch. Is there a way to add some validation on the type (or value), so that users get an error if they pass "false"?

@qmonnet Schema validation is possible with Helm - but we don't use it yet 🤔 https://helm.sh/docs/topics/charts/#schema-files

otherwise we have to enforce an explicit type-check in the if itself

@@ -131,9 +131,9 @@ data:
skip-cnp-status-startup-clean: "{{ .Values.operator.skipCNPStatusStartupClean }}"
{{- end }}

{{- if hasKey .Values "disableEndpointCRD" }}
{{- if .Values.disableEndpointCRD }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to enforce an explicit type check - we have to change this to {{- if eq .Values.disableEndpointCRD true }}

this will result in an error if trying to pass an incompatible type.

error calling eq: incompatible types for comparison

Copy link
Contributor Author

@doniacld doniacld Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated and tested.

Configuration with the wrong type

enableCiliumEndpointSlice: false
disableEndpointCRD: "true"

Error is as expected

Error: template: cilium/templates/cilium-configmap.yaml:134:7: executing "cilium/templates/cilium-configmap.yaml" at <eq .Values.disableEndpointCRD true>: error calling eq: incompatible types for comparison

Configuration with the wrong type BUT enableCiliumEndpointSlice: true

enableCiliumEndpointSlice: true
disableEndpointCRD: "true"

The error type is not thrown since I guess validate.yaml is the first test to be executed.

Error: execution error at (cilium/templates/validate.yaml:76:7): if Cilium Endpoint Slice is enabled (.Values.enableCiliumEndpointSlice=true), it requires .Values.disableEndpointCRD=false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@doniacld great

we're missing one case:

enableCiliumEndpointSlice: true
disableEndpointCRD: "false"

the validation error gets thrown even though the endpointCRD isn't actually disabled

❯ helm template ./install/kubernetes/cilium -f ./contrib/testing/kind-values.yaml
Error: execution error at (cilium/templates/validate.yaml:76:7): if Cilium Endpoint Slice is enabled (.Values.enableCiliumEndpointSlice=true), it requires .Values.disableEndpointCRD=false

-> we should change the validation to enforce type checks too

{{- if and (eq .Values.enableCiliumEndpointSlice true ) (eq .Values.disableEndpointCRD true ) }}
  {{ fail "if Cilium Endpoint Slice is enabled (.Values.enableCiliumEndpointSlice=true), it requires .Values.disableEndpointCRD=false" }}
{{- end }}

Copy link
Contributor Author

@doniacld doniacld Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little update:

Config enableCiliumEndpointSlice:false & disableEndpointCRD: "true"

enableCiliumEndpointSlice: false
disableEndpointCRD: "true"

Error

Error: template: cilium/templates/cilium-configmap.yaml:134:7: executing "cilium/templates/cilium-configmap.yaml" at <eq .Values.disableEndpointCRD true>: error calling eq: incompatible types for comparison

Config enableCiliumEndpointSlice:true & disableEndpointCRD: "true"

enableCiliumEndpointSlice: true
disableEndpointCRD: "true"

Error

Error: template: cilium/templates/validate.yaml:75:9: executing "cilium/templates/validate.yaml" at <eq .Values.disableEndpointCRD true>: error calling eq: incompatible types for comparison

Helm command enableCiliumEndpointSlice=true & disableEndpointCRD="true"

 --set enableCiliumEndpointSlice=true --set-string disableEndpointCRD="true"

Error

Error: template: cilium/templates/validate.yaml:75:9: executing "cilium/templates/validate.yaml" at <eq .Values.disableEndpointCRD true>: error calling eq: incompatible types for comparison

Now we should be good!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @doniacld

@qmonnet
Copy link
Member

qmonnet commented Jun 7, 2023

I was thinking more of something along what we have in install/kubernetes/cilium/templates/validate.yaml. A simple check could be to error out is the value is neither true or false (or empty). Maybe a bit hacky, but could work?

@qmonnet
Copy link
Member

qmonnet commented Jun 7, 2023

Ah, just seeing your suggestion now on eq .Value... true. This looks better.

@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch 2 times, most recently from 405a25b to eb9a751 Compare June 7, 2023 16:06
@doniacld
Copy link
Contributor Author

doniacld commented Jun 7, 2023

i don't know whether we want that and whether the note in the upgrade guide is enough 🤷‍♂️ maybe a question for more seasoned cilium teammembers? @qmonnet @kaworu

I tried to be more explicit in the documentation, please let me know if that's okay.

@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch 2 times, most recently from 4e0a8d4 to 81fc102 Compare June 7, 2023 16:32
@tommyp1ckles
Copy link
Contributor

Nice work!

The helm validation makes me wonder if we should somehow unify checking/exiting config incompatibilities and doing helm based validation one day (failing to install is a lot better than crashing).

@doniacld
Copy link
Contributor Author

doniacld commented Jun 8, 2023

/test

@maintainer-s-little-helper
Copy link

Commit 8764de570cb04cbf5709beb91feaef869bfa4170 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 8, 2023
@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch from 8764de5 to da6fb22 Compare June 8, 2023 08:19
@maintainer-s-little-helper
Copy link

Commit 8764de570cb04cbf5709beb91feaef869bfa4170 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@doniacld doniacld removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 8, 2023
kvstore

Make the operator crash early in config.go if CiliumEndpointSlice is enabled
and CiliumEndpointCRD is disabled. Note that CiliumEndpointSlice feature needs
CiliumEndpoint CRDs to run.  In operator/cmd/root.go, remove the condition on
CiliumEndpointCRD since at this point of the lifecycle, the operator should
have crash if we end up in this case. This fix is only relevant for kvstore
mode, because in identity-allocation-mode CRD, we forcefully enable the
CiliumEndpoint CRD.

Fixes cilium#24396

Signed-off-by: Donia Chaiehloudj <donia.cld@isovalent.com>
Signed-off-by: Donia Chaiehloudj <donia.cld@isovalent.com>
…d too

This commit adds a supplementary check during the deployment to prevent
installing Cilium with incompatibles flags enableCiliumEndpointSlice=true
disableEndpointCRD=true.

Signed-off-by: Donia Chaiehloudj <donia.cld@isovalent.com>
@doniacld doniacld force-pushed the pr/doniacld/fix-operator-crash branch from da6fb22 to 37a5d8f Compare June 8, 2023 09:22
@doniacld
Copy link
Contributor Author

doniacld commented Jun 8, 2023

/test

1 similar comment
@doniacld
Copy link
Contributor Author

doniacld commented Jun 8, 2023

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 8, 2023
@dylandreimerink dylandreimerink merged commit 9fab23d into cilium:main Jun 8, 2023
63 of 64 checks passed
@doniacld doniacld deleted the pr/doniacld/fix-operator-crash branch June 8, 2023 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operator Impacts the cilium-operator component kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cilium Operator crashes when CiliumEndpoint CRD is disabled
7 participants