CI: bump default FQDN datapath timeout from 100 to 250ms #31866

squeed · 2024-04-09T14:30:19Z

This timeout can be CPU sensitive, and the CI environments can be CPU constrained.

Bumping this timeout to 250ms cuts down on CI flakes due to noise. Performance regressions will still be caught, as those tend to cause delays of 1+ seconds.

Ref: #29846

Signed-off-by: Casey Callendrello <cdc@isovalent.com>

joestringer · 2024-04-09T18:04:15Z

I worry a little that we're going to paper over some serious performance issues with a setting like this. From the perspective of getting a reliable CI, I like it; unreliable CI doesn't help anyone. However, if we're experiencing this behaviour in CI, then I'm sure our users are experiencing it too. As I understand it, the repercussion of releasing packets prematurely based on this setting is that the datapath is likely to drop the first packet transmitted on the TCP connection, which bumps the initial connection establishment latency up to the ~1s range due to TCP retries. 100ms is already an incredibly long time to receive a packet, calculate policy impact based on precomputed data, and retransmit the packet. Do you think we need to spend more effort in investigating and/or mitigating the causes of high tail latency for DNS handling?

squeed · 2024-04-09T19:26:59Z

> Do you think we need to spend more effort in investigating and/or mitigating the causes of high tail latency for DNS handling?

We have a pretty good idea of the cause of the tail latency spikes:

- GC pauses, caused by
  - cidr identity memory usage
  - hubble logging enrichment
- Envoy delays

The plan to fix these is primarily by inverting the label selection, which has the nice side-effect that the vast majority of FQDN responses will no longer require a policy update, just an ip->identity update.

Separately, we would like to refactor the Envoy API to support incremental updates. This is a broader task, and is not currently formally on any list. We'd like to see the results of the FQDN refactor first, since that is needed for other purposes (e.g. the S3 problem).

joestringer

I'm fine with this and if we're actively pursuing improvements then great.

My only remaining query is: Do you see a path to removing / reducing this at some point? Or do you think this is something we'll set and forget? Because with such a high timeout it becomes far less likely that this setting will be a thorn in our side during testing, so unless we focus on it, we are likely to lose visibility on it.

squeed · 2024-04-15T09:07:34Z

> My only remaining query is: Do you see a path to removing / reducing this at some point?

If we can get a handle on tail latencies, I'd love to reduce this to something quite strict in CI tests.

squeed · 2024-04-15T14:48:56Z

/test

squeed added the area/CI, release-note/misc, and area/fqdn labels Apr 9, 2024

squeed requested review from a team as code owners April 9, 2024 14:30

squeed requested a review from viktor-kurchenko April 9, 2024 14:30

squeed mentioned this pull request Apr 9, 2024

CI: Cilium E2E Upgrade: Timed out waiting for datapath updates of FQDN IP information after upgrade #29846

Open

joestringer approved these changes Apr 11, 2024

viktor-kurchenko approved these changes Apr 15, 2024

maintainer-s-little-helper bot added the ready-to-merge label Apr 15, 2024

lmb added this pull request to the merge queue Apr 17, 2024

Merged via the queue into cilium:main with commit 34caeb2 Apr 17, 2024
64 checks passed

squeed added the needs-backport/1.14 and needs-backport/1.15 labels Apr 23, 2024

gandro mentioned this pull request Apr 29, 2024

v1.15 Backports 2024-04-29 #32230

Merged

18 tasks

gandro added the backport-pending/1.15 label and removed the needs-backport/1.15 label Apr 29, 2024

gandro mentioned this pull request Apr 30, 2024

v1.14 Backports 2024-04-30 #32251

Merged

13 tasks

gandro added the backport-pending/1.14 label and removed the needs-backport/1.14 label Apr 30, 2024

aanm mentioned this pull request May 2, 2024

Prepare for release v1.14.11 aanm/cilium#643

Merged

This was referenced May 10, 2024

Prepare for release v1.14.11 #32460

Merged

Prepare for release v1.15.5 #32470

Merged
