Pod identity propagation delay to remote IP cache when using egress gateways #23976
This issue has been automatically marked as stale because it has not had recent activity.
This is still relevant.
@Champ-Goblem Thanks for the report and the profiles, that helps! I've applied some labels and will try to get some eyes on this.
Did you try with the latest v1.13 release? We've improved this egress gateway logic recently. Some of the fixes made it into v1.13.2 and some will be in the v1.14.0 release.
We are currently on 1.13.0 but are happy to try things out again on the 1.13.2 release.
~~Could you take a look at CPU usage for cilium-agent and also the node that it's running on, please? This is quite a high pod-density setup and I am wondering if it might also have some impact on performance of cilium-agent (for example, the node's CPU being saturated)~~
Still, it might be worth taking a look at CPU usage on the node.
This issue has been automatically marked as stale because it has not had recent activity.
This issue has not seen any activity since it was marked stale.
We seem to be experiencing some problems with Cilium Endpoint propagation on a GKE cluster with 2,600 pods, 17 nodes, and 80 cronjobs. Short-lived pods, created either manually or via a cronjob, sometimes experience connectivity problems when connecting to other internal services protected by network policies.
The failures often happen when there is high pod churn on the cluster; this might be due to a number of cronjobs triggering at once, or to pods being deleted and created at the time the affected pod spawns.
The large number of pod updates seems to cause propagation delays for new pod IPs, which leads to the cilium ipcache not being updated in time on the remote nodes. If the delay is long enough, the cache is still stale when the initial network connection is attempted; cilium then classifies the identity of the cronjob pod as `reserved:world`, which causes the connection to be blocked by the network policy in that namespace. For example, compare the cilium monitor logs during a failure event with those from a run where the cronjob connects as expected: [monitor output omitted]
The network policy in that namespace looks like so: [policy manifest omitted]
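A minimal sketch of what such a policy could look like (the actual manifest was not captured above; the policy name is hypothetical, and the namespace is taken from the log line later in this report):

```yaml
# Illustrative only, not the actual policy from the cluster: a default
# ingress policy that admits traffic from pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace   # hypothetical name
  namespace: ns-a
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # only pods in this namespace may connect
```

Under a policy of this shape, a connection whose source identity resolves to `reserved:world` (because the remote node has not yet learned the new pod IP) matches no allow rule and is dropped.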
I wrote a quick tool to watch the kubernetes event stream for cilium endpoint resource changes and then run `cilium bpf ipcache get <ip-addr>` on each node in the cluster to determine how long the propagation takes; a rough sketch of the approach is shown below.
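The sketch below illustrates the idea rather than reproducing the actual tool. It assumes the cilium agents run in `kube-system` with the label `k8s-app=cilium`, and that `cilium bpf ipcache get` exits non-zero while the IP is not yet in that node's cache:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the measurement described above, not the actual
# tool. Usage: ./ipcache-delay.sh <new-pod-ip>
set -eu
POD_IP="$1"
START=$(date +%s%N)

for agent in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  # Poll this node's BPF ipcache until the new pod IP appears. Agents are
  # polled sequentially, so each figure is a rough upper bound per node.
  until kubectl -n kube-system exec "$agent" -- \
      cilium bpf ipcache get "$POD_IP" >/dev/null 2>&1; do
    sleep 0.1
  done
  echo "$agent: $POD_IP visible after $(( ($(date +%s%N) - START) / 1000000 )) ms"
done
```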
For the majority of the time the delay is below 1 second, but in the event of 30/50/100 cilium endpoint updates the propagation delays can exceed 10 seconds, leading to the issue.

We recently upgraded cilium to 1.13.0 and also reduced the identity labels at the same time as this upgrade. This has improved the situation and the problem now occurs less frequently, but it is still something we would prefer to avoid entirely.
Since the upgrade I have noticed that the failure timestamps for the jobs correspond with a large increase in the rate of log lines like the following:

```
2023-02-21T12:30:06.348176806Z stderr F level=info msg="Trace[1237947675]: \"DeltaFIFO Pop Process\" ID:ns-a/test-db-swzfx,Depth:15,Reason:slow event handlers blocking the queue (21-Feb-2023 12:30:05.768) (total time: 579ms):" subsys=klog
```

This is demonstrated in the graph below: [graph image omitted]
The green line is the rate of the `slow event handlers blocking the queue` logs, and the yellow line corresponds to the creation/deletion logs of some of the recently failed jobs.

I believe the slow event handlers are down to the time it takes to process the updates in the cilium egress gateway code. At the time of the second set of failures (the rightmost red highlight on the graph), a flamegraph of the cilium-agent process shows a lot of time spent in the `egressgateway.(*Manager).OnUpdateEndpoint` and `egressgateway.(*Manager).OnDeleteEndpoint` functions. Here is the left side of the graph following a cilium endpoint `OnAdd`: [flamegraph omitted] and the right side following a cilium endpoint `OnDelete`: [flamegraph omitted]

Regarding the egress gateway setup, we have two cilium egress gateway entries on the cluster, responsible for SNATing the connections of ~40 pods, with one node acting as the designated gateway. The designated node runs no workloads apart from the daemonsets on the cluster, the cilium daemonset, and any kubernetes components.
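For reference, an egress gateway entry of this kind is expressed as a `CiliumEgressGatewayPolicy` in cilium v1.13; the manifest below is an illustrative sketch rather than the actual configuration from this cluster (the pod label, node label, and egress IP are made up):

```yaml
# Illustrative sketch of an egress gateway policy; all selectors and the
# egress IP here are hypothetical.
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  selectors:
    - podSelector:
        matchLabels:
          app: needs-static-egress    # hypothetical label for the ~40 pods
  destinationCIDRs:
    - 0.0.0.0/0
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-gateway: "true"        # hypothetical label on the gateway node
    egressIP: 203.0.113.10            # example IP from a documentation range
```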