Pod identity propagation delay to remote IP cache when using egress gateways #23976
This issue has been automatically marked as stale because it has not had recent activity.
This is still relevant.
@Champ-Goblem Thanks for the report and the profiles, that helps! I've applied some labels and will try to get some eyes on this.
Did you try with the latest v1.13 release? We've improved this egress gateway logic recently. Some of the fixes made it into v1.13.2 and some will be in the v1.14.0 release.
We are currently on 1.13.0 but are happy to try things out again on the 1.13.2 release.
~~Could you take a look at CPU usage for cilium-agent and also the node that it's running on, please? This is quite a high pod-density setup and I am wondering if it might also have some impact on performance of cilium-agent (for example, the node's CPU being saturated)~~
Still, it might be worth taking a look at CPU usage on the node.
This issue has been automatically marked as stale because it has not had recent activity.
This issue has not seen any activity since it was marked stale.
We seem to be experiencing some problems with Cilium Endpoint propagation on a GKE cluster with 2,600 pods, 17 nodes, and 80 cronjobs. Short-lived pods, created either manually or via a cronjob, sometimes experience connectivity problems when connecting to other internal services protected by network policies.
The failures often happen when there is high pod churn on the cluster; this might be due to a number of cronjobs triggering at once, or to pods being deleted and created at the time the affected pod spawns.
The large number of pod updates seems to cause propagation delays for new pod IPs, which leads to the cilium ipcache not being updated in time on the remote nodes. If the delay is long enough, the cache is still stale when the initial network connection is attempted; cilium then classifies the identity of the cronjob pod as `reserved:world`, which causes the connection to be blocked by the network policy in that namespace. For example, compare the cilium monitor logs during a failure event with those from a run where the cronjob connects as expected: [monitor output omitted]
The network policy in that namespace looks like so: [policy manifest omitted]
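A minimal sketch of what such a policy could look like (the actual manifest was not captured above; the policy name is hypothetical, and the namespace is taken from the log line later in this report):

```yaml
# Illustrative only, not the actual policy from the cluster: a default
# ingress policy that admits traffic from pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace   # hypothetical name
  namespace: ns-a
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # only pods in this namespace may connect
```

Under a policy of this shape, a connection whose source identity resolves to `reserved:world` (because the remote node has not yet learned the new pod IP) matches no allow rule and is dropped.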
I wrote a quick tool to watch the kubernetes event stream for cilium endpoint resource changes and then run `cilium bpf ipcache get <ip-addr>` on each node in the cluster to determine how long the propagation takes; a rough sketch of the approach is shown below.
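The sketch below illustrates the idea rather than reproducing the actual tool. It assumes the cilium agents run in `kube-system` with the label `k8s-app=cilium`, and that `cilium bpf ipcache get` exits non-zero while the IP is not yet in that node's cache:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the measurement described above, not the actual
# tool. Usage: ./ipcache-delay.sh <new-pod-ip>
set -eu
POD_IP="$1"
START=$(date +%s%N)

for agent in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  # Poll this node's BPF ipcache until the new pod IP appears. Agents are
  # polled sequentially, so each figure is a rough upper bound per node.
  until kubectl -n kube-system exec "$agent" -- \
      cilium bpf ipcache get "$POD_IP" >/dev/null 2>&1; do
    sleep 0.1
  done
  echo "$agent: $POD_IP visible after $(( ($(date +%s%N) - START) / 1000000 )) ms"
done
```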
For the majority of the time the delay is below 1 second, but in the event of 30/50/100 cilium endpoint updates the propagation delays can exceed 10 seconds, leading to the issue.

We recently upgraded cilium to 1.13.0 and also reduced the identity labels at the same time as this upgrade. This has improved the situation and the problem now occurs less frequently, but it is still something we would prefer to avoid entirely.
Since the upgrade I have noticed that the failure timestamps for the jobs correspond with a large increase in the rate of log lines like the following:

```
2023-02-21T12:30:06.348176806Z stderr F level=info msg="Trace[1237947675]: \"DeltaFIFO Pop Process\" ID:ns-a/test-db-swzfx,Depth:15,Reason:slow event handlers blocking the queue (21-Feb-2023 12:30:05.768) (total time: 579ms):" subsys=klog
```

This is demonstrated in the graph below: [graph image omitted]
The green line is the rate of the `slow event handlers blocking the queue` logs, and the yellow line corresponds to the creation/deletion logs of some of the recently failed jobs.

I believe the slow event handlers are down to the time it takes to process the updates in the cilium egress gateway code. At the time of the second set of failures (the rightmost red highlight on the graph), a flamegraph of the cilium-agent process shows a lot of time spent in the `egressgateway.(*Manager).OnUpdateEndpoint` and `egressgateway.(*Manager).OnDeleteEndpoint` functions. Here is the left side of the graph following a cilium endpoint `OnAdd`: [flamegraph omitted] and the right side following a cilium endpoint `OnDelete`: [flamegraph omitted]

Regarding the egress gateway setup, we have two cilium egress gateway entries on the cluster, responsible for SNATing the connections of ~40 pods, with one node acting as the designated gateway. The designated node runs no workloads apart from the daemonsets on the cluster, the cilium daemonset, and any kubernetes components.
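For reference, an egress gateway entry of this kind is expressed as a `CiliumEgressGatewayPolicy` in cilium v1.13; the manifest below is an illustrative sketch rather than the actual configuration from this cluster (the pod label, node label, and egress IP are made up):

```yaml
# Illustrative sketch of an egress gateway policy; all selectors and the
# egress IP here are hypothetical.
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  selectors:
    - podSelector:
        matchLabels:
          app: needs-static-egress    # hypothetical label for the ~40 pods
  destinationCIDRs:
    - 0.0.0.0/0
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-gateway: "true"        # hypothetical label on the gateway node
    egressIP: 203.0.113.10            # example IP from a documentation range
```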