Pod identity propagation delay to remote IP cache when using egress gateways #23976

Closed
Champ-Goblem opened this issue Feb 23, 2023 · 8 comments · Fixed by #26741
Labels
feature/egress-gateway, kind/performance, sig/datapath, sig/policy, stale

Comments

@Champ-Goblem

We seem to be experiencing problems with CiliumEndpoint propagation on a GKE cluster with 2600 pods, 17 nodes and 80 cronjobs. Short-lived pods, created either manually or via a cronjob, sometimes experience connectivity problems when connecting to other internal services protected by network policies.

The failures often happen when there is high pod churn on the cluster, for example when a number of cronjobs trigger at once or pods are deleted/created around the time the affected pod spawns.

The large number of pod updates seems to cause propagation delays for new pod IPs, which means the Cilium IP cache on remote nodes is not updated in time. If the delay is long enough, the cache is still stale when the initial network connection is attempted, so Cilium classifies the identity of the cronjob pod as reserved:world and the connection is blocked by the network policy in that namespace. For example, from cilium monitor logs during a failure event:

Policy verdict log: flow 0x0 local EP ID 902, remote ID world, proto 6, ingress, action deny, match none, 10.4.36.247:52942 -> 10.4.58.66:3306 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 902, file bpf_lxc.c line 2003, , identity world->48169: 10.4.36.247:52942 -> 10.4.58.66:3306 tcp SYN

vs when the cronjob connects as expected:

Policy verdict log: flow 0x0 local EP ID 902, remote ID 35916, proto 6, ingress, action allow, match L3-Only, 10.4.31.130:43190 -> 10.4.58.66:3306 tcp SYN
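
For reference, the corresponding IP cache entry can also be inspected directly from the Cilium agent pod on the destination node; the IP and identity below are simply the values taken from the logs above:

# run inside the cilium agent pod on the node hosting 10.4.58.66
cilium bpf ipcache get 10.4.36.247   # missing (or resolving to world) until the new pod's entry propagates
cilium identity get 35916            # shows the labels behind the numeric identity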

The network policy in that namespace looks like so:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ns-block
  namespace: ns-a
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          netpolicylabel: ns-a
  - from:
    - namespaceSelector:
        matchLabels:
          name: istio
  - from:
    - namespaceSelector:
        matchLabels:
          name: forwarding
  - from:
    - namespaceSelector:
        matchLabels:
          name: logging
  podSelector: {}
  policyTypes:
  - Ingress
status: {}

I wrote a quick tool to watch the Kubernetes event stream for CiliumEndpoint resource changes and then run cilium bpf ipcache get <ip-addr> on each node in the cluster to measure how long propagation takes. Most of the time the delay is below 1 second, but during bursts of 30/50/100 CiliumEndpoint updates the propagation delay can exceed 10s, which leads to the issue.
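
A minimal sketch of that kind of check (not the exact tool used for the measurements above; it assumes kubectl access to the cilium DaemonSet in kube-system and GNU date for millisecond timestamps):

#!/usr/bin/env bash
# Poll every cilium agent until a given pod IP appears in its BPF ipcache
# and report roughly how long propagation took. The exact output and exit
# behaviour of "cilium bpf ipcache get" may differ between Cilium versions.
POD_IP="$1"
START=$(date +%s%3N)
for agent in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  until kubectl -n kube-system exec "$agent" -c cilium-agent -- \
      cilium bpf ipcache get "$POD_IP" >/dev/null 2>&1; do
    sleep 0.2
  done
  echo "$agent: entry visible after $(( $(date +%s%3N) - START ))ms"
done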

We recently upgraded Cilium to 1.13.0 and reduced the identity labels at the same time. This has improved the situation and the problem now occurs less frequently, but it is still something we would prefer to avoid entirely.

Since the upgrade I have noticed that the failure timestamps for the jobs correspond with a large increase of:

2023-02-21T12:30:06.348176806Z stderr F level=info msg="Trace[1237947675]: \"DeltaFIFO Pop Process\" ID:ns-a/test-db-swzfx,Depth:15,Reason:slow event handlers blocking the queue (21-Feb-2023 12:30:05.768) (total time: 579ms):" subsys=klog

log lines, as shown in the image below:

[Image: rate of "slow event handlers blocking the queue" log lines (green) plotted against creation/deletion logs of recently failed jobs (yellow)]

The green line is the rate of the "slow event handlers blocking the queue" logs and the yellow line corresponds to the creation/deletion logs of some of the recently failed jobs.

I believe the slow event handlers are down to the time it takes to process these updates in the Cilium egress gateway code. At the time of the second set of failures (the rightmost red highlight on the graph), a flamegraph of the cilium-agent process shows a lot of time spent in the egressgateway.(*Manager).OnUpdateEndpoint and egressgateway.(*Manager).OnDeleteEndpoint functions.

[Image: flamegraph of the cilium-agent process during the second failure window]

Here is the left side of the graph following a cilium endpoint OnAdd:

[Image: left side of the graph, CiliumEndpoint OnAdd path]

and the right side following a cilium endpoint OnDelete:

[Image: right side of the graph, CiliumEndpoint OnDelete path]

Regarding the egress gateway setup, we have two Cilium egress gateway policies on the cluster, responsible for SNATing the connections of ~40 pods, with one node acting as the designated gateway. The designated node runs no workloads apart from the daemonsets on the cluster, the Cilium daemonset and any Kubernetes components.
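
For context, these are standard CiliumEgressGatewayPolicy resources; the example below is only illustrative (selectors, labels and the egress IP are placeholders rather than our actual configuration):

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: snat-example
spec:
  selectors:
  - podSelector:
      matchLabels:
        app: needs-static-egress-ip
  destinationCIDRs:
  - 0.0.0.0/0
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-gateway: "true"
    egressIP: 192.0.2.10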

@github-actions

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale label Apr 25, 2023
@Champ-Goblem
Author

This is still relevant

@github-actions github-actions bot removed the stale label Apr 26, 2023
@ti-mo ti-mo added the kind/performance, sig/policy, feature/egress-gateway and sig/agent labels Apr 27, 2023
@ti-mo
Contributor

ti-mo commented Apr 27, 2023

@Champ-Goblem Thanks for the report and the profiles, that helps! I've applied some labels and will try to get some eyes on this.

@pchaigno
Member

Did you try with the latest v1.13 release? We've improved this egress gateway logic recently. Some of the fixes made it into v1.13.2 and some will be in the v1.14.0 release.

@Champ-Goblem
Author

We are currently on 1.13.0 but are happy to try things out again with the 1.13.2 release.

@aanm aanm added the sig/datapath label and removed the sig/agent label Apr 27, 2023
@marseel
Contributor

marseel commented Apr 27, 2023

~~Could you take a look at CPU usage for cilium-agent and also the node that it's running on, please? This is quite a high pod-density setup and I am wondering if it might also have some impact on the performance of cilium-agent (for example, the node's CPU being saturated).~~
I just noticed:

The designated node runs no workloads apart from the daemonsets on the clusters, the cilium daemonset and any kubernetes components.

Still, it might be worth taking a look at CPU usage on the node.
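
For example (assuming metrics-server is available; the node name is a placeholder):

kubectl -n kube-system top pod -l k8s-app=cilium --containers
kubectl top node <egress-gateway-node-name>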

@github-actions

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale label Jun 27, 2023
@github-actions

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned Jul 12, 2023