
CI: bump default FQDN datapath timeout from 100 to 250ms #31866

Merged: 1 commit merged into cilium:main on Apr 17, 2024

Conversation

@squeed (Contributor) commented Apr 9, 2024

This timeout can be CPU-sensitive, and the CI environments can be CPU-constrained.

Bumping this timeout ensures that performance regressions will still be caught, as those tend to cause delays of 1+ seconds. This will, however, cut down on CI flakes due to noise.

Ref: #29846

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
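For context, here is a minimal Go sketch of the arithmetic behind the change, using purely illustrative names (checkDatapathLatency and the sample latencies are not from Cilium): a 250ms budget still fails on the 1+ second delays that genuine regressions cause, while absorbing the jitter a CPU-constrained runner adds.

```go
package main

import (
	"fmt"
	"time"
)

// checkDatapathLatency is a hypothetical CI-style assertion: it fails when
// the observed FQDN datapath plumbing latency exceeds the configured budget.
func checkDatapathLatency(observed, budget time.Duration) error {
	if observed > budget {
		return fmt.Errorf("FQDN datapath update took %v, budget is %v", observed, budget)
	}
	return nil
}

func main() {
	budget := 250 * time.Millisecond // bumped from 100ms for CI

	// CPU-induced jitter on a constrained runner: passes with the new budget.
	fmt.Println(checkDatapathLatency(180*time.Millisecond, budget))

	// A genuine performance regression (1+ seconds): still caught.
	fmt.Println(checkDatapathLatency(1200*time.Millisecond, budget))
}
```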
@squeed added labels on Apr 9, 2024: area/CI (Continuous Integration testing issue or flake), release-note/misc (This PR makes changes that have no direct user impact), area/fqdn (Affects the FQDN policies feature)
@squeed requested review from a team as code owners on April 9, 2024 at 14:30
@joestringer (Member) commented:

I worry a little bit that we're going to paper over some serious performance issues with a setting like this. From the perspective of getting a reliable CI, I like it; unreliable CI doesn't help anyone. However, if we're experiencing this behaviour in CI then I'm sure our users are also experiencing it. As I understand it, the repercussion of releasing packets prematurely based on this setting is that the datapath is likely to drop the first packet transmitted on the TCP connection, which bumps the initial connection-establishment latency up to the ~1s range due to TCP retries. 100ms is already an incredibly long time to receive a packet, calculate policy impact based on precomputed data, and retransmit the packet. Do you think we need to spend more effort investigating and/or mitigating the causes of high tail latency for DNS handling?
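To make the failure mode concrete, here is a minimal Go sketch of the pattern under discussion, not Cilium's actual code (waitForDatapath and the channel plumbing are illustrative): the proxy waits up to a configurable delay for the datapath to confirm that the resolved IPs are plumbed, then releases the DNS response either way; releasing early means the first packet of the new connection can be dropped, leaving TCP's initial retransmission timeout (~1s) to dominate connection setup.

```go
package main

import (
	"fmt"
	"time"
)

// waitForDatapath blocks until plumbed is closed or maxDelay elapses.
// It returns true if the datapath confirmed the update in time.
func waitForDatapath(plumbed <-chan struct{}, maxDelay time.Duration) bool {
	select {
	case <-plumbed:
		return true
	case <-time.After(maxDelay):
		return false
	}
}

func main() {
	plumbed := make(chan struct{})

	// Simulate a slow policy update (e.g. a GC pause or an Envoy delay).
	go func() {
		time.Sleep(150 * time.Millisecond)
		close(plumbed)
	}()

	// With a 100ms budget the response is released before the datapath is
	// ready; with 250ms it is not. Either way the response is released.
	if waitForDatapath(plumbed, 100*time.Millisecond) {
		fmt.Println("datapath ready: first packet will be forwarded")
	} else {
		fmt.Println("timed out: releasing DNS response anyway; first packet may be dropped")
	}
}
```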

@squeed (Contributor, Author) commented Apr 9, 2024

Do you think we need to spend more effort in investigating and/or mitigating the causes of high tail latency for DNS handling?

We have a pretty good idea of the cause of the tail latency spikes:

  • GC pauses, caused by
    • cidr identity memory usage
    • hubble logging enrichment
  • Envoy delays

The plan is to fix these primarily by inverting the label selection, which has the nice side effect that the vast majority of FQDN responses will no longer require a policy update, just an ip->identity update.

Separately, we would like to refactor the Envoy API to support incremental updates. This is a broader task, and is not currently formally on any list. We'd like to see the results of the FQDN refactor first, since that is needed for other purposes (e.g. the S3 problem).
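As a rough illustration of the inversion described above (types and function names are hypothetical, not Cilium's): today a newly learned IP for a toFQDNs selector can force a policy recomputation, whereas with inverted selection most DNS responses would only need an ip->identity cache update.

```go
package main

import "fmt"

type Identity uint32

// Current-style path: every new IP triggers selector/policy recomputation.
func onNewIPRecompute(ip string, policies map[string][]string) {
	for name, selectors := range policies {
		_ = selectors // walk selectors, regenerate endpoint policy, push to datapath...
		fmt.Printf("recomputing policy %q for new IP %s\n", name, ip)
	}
}

// Inverted-style path: the selector already maps to a stable identity, so a
// new IP only needs an entry in the ip->identity cache.
func onNewIPMapOnly(ip string, ident Identity, ipcache map[string]Identity) {
	ipcache[ip] = ident
	fmt.Printf("added %s -> identity %d (no policy recomputation)\n", ip, ident)
}

func main() {
	policies := map[string][]string{"allow-s3": {"toFQDNs: *.s3.amazonaws.com"}}
	ipcache := map[string]Identity{}

	onNewIPRecompute("52.216.0.1", policies)
	onNewIPMapOnly("52.216.0.2", 16777217, ipcache)
}
```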

@joestringer (Member) left a review comment:

I'm fine with this and if we're actively pursuing improvements then great.

My only remaining query is: Do you see a path to removing or reducing this at some point? Or do you think this is something we'll set and forget? With such a high timeout it becomes far less likely that this setting will be a thorn in our side during testing, so unless we focus on it we are likely to lose visibility into it.

@squeed (Contributor, Author) commented Apr 15, 2024

My only remaining query is: Do you see a path to removing / reducing this at some point?

If we can get a handle on tail latencies, I'd love to reduce this to something quite strict in CI tests.

@squeed (Contributor, Author) commented Apr 15, 2024

/test

@maintainer-s-little-helper bot added the ready-to-merge label on Apr 15, 2024
@lmb added this pull request to the merge queue on Apr 17, 2024
Merged via the queue into cilium:main with commit 34caeb2 on Apr 17, 2024 (64 checks passed)
@squeed added the needs-backport/1.14 and needs-backport/1.15 labels on Apr 23, 2024
@gandro mentioned this pull request on Apr 29, 2024
@gandro added the backport-pending/1.15 label and removed needs-backport/1.15 on Apr 29, 2024
@gandro mentioned this pull request on Apr 30, 2024
@gandro added the backport-pending/1.14 label and removed needs-backport/1.14 on Apr 30, 2024
@github-actions bot added the backport-done/1.14 and backport-done/1.15 labels and removed backport-pending/1.14 and backport-pending/1.15 on May 2, 2024