helm: cilium kvstore clustermesh failing without clustermesh-apiserver-remote-cert in 1.15 #32122

Open
2 of 3 tasks
taraspos opened this issue Apr 22, 2024 · 4 comments
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. sig/agent Cilium agent related.

Comments

@taraspos

taraspos commented Apr 22, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

After upgrading from 1.14.8 to 1.15.3, Cilium fails to connect to the remote etcd with:

level=info msg="Waiting for all etcd configuration files to be available" error="open /var/lib/cilium/clustermesh/common-etcd-client-ca.crt: no such file or directory" subsys=kvstore

Possible cause:

  1. common-etcd-client-ca.crt is supposed to be included via the clustermesh-secrets volume, from the clustermesh-apiserver-remote-cert secret:

        - name: clustermesh-secrets
          projected:
            # note: the leading zero means this number is in octal representation: do not remove it
            defaultMode: 0400
            sources:
            - secret:
                name: cilium-clustermesh
                optional: true
                # note: items are not explicitly listed here, since the entries of this secret
                # depend on the peers configured, and that would cause a restart of all agents
                # at every addition/removal. Leaving the field empty makes each secret entry
                # to be automatically projected into the volume as a file whose name is the key.
            - secret:
                name: clustermesh-apiserver-remote-cert
                optional: true
                items:
                - key: tls.key
                  path: common-etcd-client.key
                - key: tls.crt
                  path: common-etcd-client.crt
                {{- if not .Values.tls.caBundle.enabled }}
                - key: ca.crt
                  path: common-etcd-client-ca.crt
                {{- else }}
            - {{ .Values.tls.caBundle.useSecret | ternary "secret" "configMap" }}:
                name: {{ .Values.tls.caBundle.name }}
                optional: true
                items:
                - key: {{ .Values.tls.caBundle.key }}
                  path: common-etcd-client-ca.crt
                {{- end }}

  2. clustermesh-apiserver-remote-cert is created only when clustermesh.useAPIServer is set to true.
  3. Since 1.15.x, kvstore mode can no longer be combined with clustermesh.useAPIServer; if both are set, the Helm chart fails with:

    Helm upgrade failed for release kube-system/cilium with chart cilium@1.15.3: execution error at (cilium/templates/validate.yaml:85:7): The clustermesh-apiserver cannot be enabled in combination with .Values.identityAllocationMode=kvstore. To establish a Cluster Mesh, directly configure the parameters to access the remote kvstore through .Values.clustermesh.config

  4. As a result, the Cilium agent can't join the remote etcd because the TLS certificates are missing.
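
For reference, the combination that the chart rejects can be sketched as a Helm values fragment. It uses only the two settings named in the error message and in the steps above; the rest of a real values file is omitted, so treat it as an illustration rather than a complete configuration.

```yaml
# Illustrative fragment only: the two values whose combination the 1.15.3
# chart refuses (see the validate.yaml error quoted above).
identityAllocationMode: kvstore
clustermesh:
  # With this set to false the chart renders, but the
  # clustermesh-apiserver-remote-cert secret is no longer created.
  useAPIServer: true
```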

Cilium Version

1.15.3

Kernel Version

NA

Kubernetes Version

N/A

Regression

1.14.8

Sysdump

No response

Relevant log output

level=info msg="Creating etcd client" ConfigPath=/var/lib/cilium/clustermesh/cluster-2 KeepAliveHeartbeat=15s KeepAliveTimeout=25s ListLimit=256 MaxInflight=100 RateLimit=100 subsys=kvstore
level=info msg="Creating etcd client" ConfigPath=/var/lib/etcd-config/etcd.config KeepAliveHeartbeat=15s KeepAliveTimeout=25s ListLimit=256 MaxInflight=100 RateLimit=100 subsys=kvstore

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@taraspos taraspos added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Apr 22, 2024
@giorio94
Member

Hi @taraspos,

You need to manually specify through the dedicated section the TLS key/certificate used to connect to the kvstore in remote clusters when running Cilium in kvstore mode, as those cannot be automatically managed by Cilium.
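
A minimal sketch of what such a manual configuration might look like in Helm values. The key names under `clustermesh.config.clusters[].tls` (`cert`, `key`, `caCert`) are taken from a reading of the chart's values.yaml and should be checked against the chart version in use. The cluster name mirrors the `cluster-2` path in the log output; the address, port, and certificate values are hypothetical placeholders.

```yaml
# Sketch under assumptions: verify every key against the values.yaml of the
# chart version you deploy. Address, port and certificate data are placeholders.
identityAllocationMode: kvstore
clustermesh:
  useAPIServer: false          # must stay disabled in kvstore mode on 1.15
  config:
    enabled: true
    clusters:
    - name: cluster-2          # name seen in the ConfigPath log line
      address: etcd.cluster-2.example.com   # hypothetical remote kvstore address
      port: 2379                            # hypothetical port
      tls:
        # Base64-encoded PEM data for the client certificate, its private key,
        # and the CA that signed the remote kvstore's server certificate.
        cert: "<base64 client certificate>"
        key: "<base64 client key>"
        caCert: "<base64 CA certificate>"
```

The certificates supplied here have to be ones the remote kvstore already trusts, which is why the chart cannot generate them on its own.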

@giorio94 giorio94 added the area/clustermesh Relates to multi-cluster routing functionality in Cilium. label Apr 23, 2024
@ti-mo ti-mo added sig/agent Cilium agent related. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Apr 25, 2024
@ti-mo
Contributor

ti-mo commented Apr 25, 2024

@giorio94 What's the takeaway? Does the upgrade documentation need to be improved? Sounds like this is a bit of a breaking change.

@taraspos
Author

Hey @giorio94, thanks for the reply!

Our configuration works in 1.14.8, but we had useAPIServer set to true with the replicas scaled down; that was a workaround because some resources were not created otherwise.

After upgrading to 1.15, useAPIServer=true can't be used together with kvstore, so I had to disable it, which led to the current issue. I will take a look at the TLS configuration section you pointed to when I attempt the 1.15 upgrade again later.

@giorio94
Member

giorio94 commented Apr 26, 2024

> What's the takeaway? Does the upgrade documentation need to be improved? Sounds like this is a bit of a breaking change.

Hmm, specifying the clustermesh configuration through helm in combination with Cilium running in kvstore mode had never been supported properly, and required using the trick of enabling the clustermesh-apiserver, but then scaling the replicas to 0.

That was addressed by #28763, which also added an explicit validation to prevent running the clustermesh-apiserver when Cilium is running in kvstore mode. Indeed, this combination does not work correctly and is not supported by design, and allowing them to be enabled at the same time caused confusion to several users in the past.

@taraspos Could you please share some more details about your clustermesh setup, and in particular about how the TLS certificates are configured for the external kvstores? The reason for not automatically creating the TLS certificates when the clustermesh-apiserver is disabled is that they wouldn't be trusted by the external kvstore, except in the very specific case in which the same CA is configured there as well (which I guess could be the case for you).


3 participants