
High-Scale IPcache: Chapter 1 #25148

Merged
merged 9 commits into from May 12, 2023

Conversation

pchaigno
Member

@pchaigno pchaigno commented Apr 26, 2023

This new feature has a CFP at cilium/design-cfps#7. The goal of this first PR is to define the new flag and remove all pod entries (see exceptions in CFP) from the ipcache when that flag is set.

Updates: #25243.

New high-scale ipcache mode to support clustermeshes with millions of pods.

@pchaigno pchaigno added the release-note/major This PR introduces major new functionality to Cilium. label Apr 26, 2023
@pchaigno pchaigno marked this pull request as ready for review April 27, 2023 07:33
@pchaigno pchaigno requested review from a team as code owners April 27, 2023 07:33
Contributor

@zacharysarah zacharysarah left a comment

@pchaigno Excellent code comments. ✨ LGTM from a docs perspective

Member

@kaworu kaworu left a comment

Helm changes LGTM

Member

@asauber asauber left a comment

CLI changes LGTM

@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Apr 28, 2023
pkg/k8s/watchers/cilium_endpoint.go (resolved review thread)
logger.Debug("Pod is using host networking")
return nil
}

if option.Config.EnableHighScaleIPcache && nodeTypes.GetName() != newPod.Spec.NodeName {
logger.Debug("Pod is not local; skipping ipcache upsert")
Member

nit: Pod watcher should only be listening for local pods anyway right? 🤔

Member

If you look at the callers of (k *K8sWatcher) createPodController, there are 2.

  1. Only listens for pods on the local node
  2. Listens for all pods on all nodes

(2) occurs when disable-endpoint-crd: true and presumably with kvstore mode. So I think we need this code (conceptually at least) to cover for (2).
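For illustration, a minimal client-go sketch of the difference between the two callers, using a generic shared-informer setup rather than Cilium's actual createPodController signature: case (1) scopes the watch to the local node with a spec.nodeName field selector, case (2) watches every pod.

```go
// Sketch only: scoping a pod informer to the local node with a field
// selector (case 1) vs. watching all pods (case 2). Generic client-go
// usage, not Cilium's createPodController.
package watchersketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

func newPodInformerFactory(cs kubernetes.Interface, localNodeName string, localOnly bool) informers.SharedInformerFactory {
	if !localOnly {
		// Case (2): pods on all nodes, e.g. with disable-endpoint-crd: true.
		return informers.NewSharedInformerFactory(cs, 30*time.Second)
	}
	// Case (1): only pods scheduled on this node.
	return informers.NewSharedInformerFactoryWithOptions(cs, 30*time.Second,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", localNodeName).String()
		}),
	)
}
```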

Member

On the other hand, we could declare kvstore mode (and disable-endpoint-crd: true) an unsupported configuration when high-scale ipcache mode is enabled, and drop this change here entirely, so that only case (1) is relevant.

daemon/cmd/state.go (resolved review threads)
@pchaigno pchaigno requested a review from christarazi May 2, 2023 10:06
Member

@christarazi christarazi left a comment

LGTM, followups are being tracked at: #25370

pchaigno and others added 9 commits May 11, 2023 14:57
This agent flag will be used in follow-up commits.

This new mode doesn't support the egress gateway for now because the
egress gateway relies on identities from the ipcache to know if a
destination is outside the cluster and if egress policies should
therefore apply. IPsec is also not supported as the key SPIs are
currently stored in the ipcache.

This new mode requires IPv6 to be disabled (our tunnels are only on
IPv4).

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
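As a rough illustration of what this commit describes, the sketch below registers such a flag and rejects the incompatible options. Apart from EnableHighScaleIPcache (visible in the diff excerpt above), the flag string, package, and field names are assumptions, not Cilium's actual option code.

```go
// Illustrative only: registering the flag and rejecting the incompatible
// options listed above.
package optionsketch

import (
	"fmt"

	"github.com/spf13/pflag"
)

// Assumed flag name for the sketch.
const EnableHighScaleIPcacheName = "enable-high-scale-ipcache"

type DaemonConfig struct {
	EnableHighScaleIPcache bool
	EnableIPv6             bool
	EnableIPSec            bool
	EnableEgressGateway    bool
}

func RegisterFlags(flags *pflag.FlagSet) {
	flags.Bool(EnableHighScaleIPcacheName, false,
		"Enable the high-scale ipcache mode for very large clustermeshes")
}

// Validate enforces the restrictions from the commit message: no egress
// gateway, no IPsec, and IPv6 must be disabled.
func (c *DaemonConfig) Validate() error {
	if !c.EnableHighScaleIPcache {
		return nil
	}
	switch {
	case c.EnableEgressGateway:
		return fmt.Errorf("%s does not support the egress gateway", EnableHighScaleIPcacheName)
	case c.EnableIPSec:
		return fmt.Errorf("%s does not support IPsec", EnableHighScaleIPcacheName)
	case c.EnableIPv6:
		return fmt.Errorf("%s requires IPv6 to be disabled", EnableHighScaleIPcacheName)
	}
	return nil
}
```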
This helper will be used in a subsequent commit to add identities of
remote pods in the ipcache only if they are well-known identities (e.g.,
kube-dns backend pods). This is necessary to allow FQDN policies to be
resolved.

Signed-off-by: Paul Chaignon <paul@cilium.io>
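A hypothetical sketch of the kind of helper this describes: given a numeric identity, decide whether it is well-known (e.g. the kube-dns backends) and therefore still worth inserting into the ipcache. The type, example values, and function name are illustrative only, not Cilium's identity package.

```go
// Hypothetical helper of the kind described above.
package identitysketch

type NumericIdentity uint32

// wellKnown maps identities of cluster-critical workloads (e.g. the
// kube-dns backends) that must stay in the ipcache so FQDN policies can
// still be resolved. Values here are placeholders.
var wellKnown = map[NumericIdentity]string{
	1000: "kube-dns", // placeholder numeric value
	1001: "core-dns", // placeholder numeric value
}

// IsWellKnown reports whether id belongs to a well-known workload.
func IsWellKnown(id NumericIdentity) bool {
	_, ok := wellKnown[id]
	return ok
}
```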
When using the high-scale IPcache mode, only reserved and well-known
identities are inserted in the ipcache. This is to avoid overflowing
the ipcache in very large clusters. We only really care about well-known
identities being in the ipcache so that we allow connectivity to
well-known pods (e.g., DNS pods).

To enforce ingress policies, identities will instead be carried over
tunnel metadata. On egress, policies will be enforced through FQDN rules.
These two changes will be implemented in subsequent commits.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
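Continuing the sketch above (same illustrative package), the upsert gate this commit describes could look roughly like this; the reserved-identity check is a placeholder, not Cilium's actual implementation.

```go
// shouldUpsertIntoIPCache gates ipcache insertion in high-scale mode.
func shouldUpsertIntoIPCache(highScaleMode bool, id NumericIdentity) bool {
	if !highScaleMode {
		return true // normal mode: every pod entry is mirrored into the ipcache
	}
	// High-scale mode: only reserved identities (host, world, remote-node, ...)
	// and well-known ones (e.g. the DNS pods) are admitted, to keep the
	// ipcache small in very large clustermeshes.
	return isReserved(id) || IsWellKnown(id)
}

func isReserved(id NumericIdentity) bool {
	// Placeholder bound; illustrative only.
	return id < 256
}
```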
Non-functional commit. Rename this function so that the name is clearer
in intent.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
endpointMetadataFetcher knows how to fetch Kubernetes metadata for
endpoints. Currently, there are two implementations:

1) Fetching from the K8s watcher store (cached)
2) Fetching directly from K8s itself (uncached)

(1) was the default method before this commit. With this commit, (2) is
possible when high-scale ipcache mode is enabled. This is because with
the aforementioned option enabled, the pod watcher is disabled, meaning
the store will be empty. In other words, there's no pod metadata that's
been cached, so we must fetch it directly during endpoint restoration.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
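An illustrative Go sketch of the two-implementation pattern described here: one fetcher backed by the watcher's store, one calling the kube-apiserver directly. The interface and type names are assumptions, not the actual endpointMetadataFetcher API.

```go
// Illustrative sketch of the cached vs. uncached fetcher pattern.
package metadatasketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

type podMetadataFetcher interface {
	Fetch(ctx context.Context, namespace, name string) (*corev1.Pod, error)
}

// cachedFetcher reads from the pod watcher's store; only usable when the
// pod watcher is running, i.e. not in high-scale ipcache mode.
type cachedFetcher struct{ store cache.Store }

func (f *cachedFetcher) Fetch(_ context.Context, namespace, name string) (*corev1.Pod, error) {
	obj, exists, err := f.store.GetByKey(namespace + "/" + name)
	if err != nil {
		return nil, err
	}
	if !exists {
		return nil, fmt.Errorf("pod %s/%s not found in store", namespace, name)
	}
	return obj.(*corev1.Pod), nil
}

// uncachedFetcher asks the kube-apiserver directly, which is what endpoint
// restoration must do when the pod watcher (and hence the store) is disabled.
type uncachedFetcher struct{ client kubernetes.Interface }

func (f *uncachedFetcher) Fetch(ctx context.Context, namespace, name string) (*corev1.Pod, error) {
	return f.client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
}
```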
This is a preparatory commit for the following commit to add another
method for fetching pods.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
When high-scale ipcache mode is enabled, we cannot rely on the pod store
because the pod watcher is disabled. Instead, validate endpoints during
restoration by fetching the pod metadata directly from the
kube-apiserver.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
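A minimal sketch of that direct fetch with plain client-go: a NotFound from the kube-apiserver marks the restored endpoint as stale. The surrounding restoration logic and the function name are assumed.

```go
// Minimal sketch: validate a restored endpoint by looking up its pod.
package restoresketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// validateRestoredEndpoint returns false when the backing pod no longer
// exists, so the caller can drop the stale endpoint during restoration.
func validateRestoredEndpoint(ctx context.Context, cs kubernetes.Interface, namespace, podName string) (bool, error) {
	_, err := cs.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil // pod is gone: endpoint should be cleaned up
	}
	if err != nil {
		return false, fmt.Errorf("fetching pod %s/%s: %w", namespace, podName, err)
	}
	return true, nil
}
```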
We don't need the pod watcher at all, because we don't need to insert
any pod IPs when in high-scale ipcache mode. We are already inserting
the necessary IPs into the ipcache from the CiliumEndpoint watcher.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno
Member Author

pchaigno commented May 11, 2023

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Unexpected missed tail call

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.25-kernel-4.19/2123/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

@pchaigno
Member Author

Due to the GitHub outage today, 3 CI jobs passed but couldn't set their status to success: ConformanceEKS, ConformanceAWS-CNI, and runtime. k8s-1.25-kernel-4.19 failed with known flake #24514.

The changes were reviewed. There are some followups we need to address (cf. #25370), but they are not so critical that we need to block this. This PR is also blocking 5 other PRs so let's address the followups as followups. Marking ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 11, 2023
@aanm aanm merged commit 37d5264 into cilium:main May 12, 2023
55 of 57 checks passed
@pchaigno pchaigno deleted the high-scale-ipcache branch May 12, 2023 13:29
@pchaigno pchaigno added the feature/high-scale-ipcache Relates to the high-scale ipcache feature. label May 16, 2023