CI: bump default FQDN datapath timeout from 100 to 250ms #31866
This timeout can be CPU sensitive, and the CI environments can be CPU constrained. Bumping this timeout ensures that performance regressions will still be caught, as those tend to cause delays of 1+ seconds. This will, however, cut down on CI flakes due to noise.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
I worry a little bit that we're going to paper over some serious performance issues with a setting like this. From the perspective of getting a reliable CI, I like it; unreliable CI doesn't help anyone. However, if we're experiencing this behaviour in CI, then I'm sure our users are experiencing it too.

As I understand it, the repercussion of releasing packets prematurely based on this setting is that the datapath is likely to drop the first packet transmitted on the TCP connection, which automatically bumps the initial connection establishment latency up to the ~1s range due to TCP retries. 100ms is already an incredibly long time to receive a packet, calculate policy impact based on precomputed data, and retransmit the packet.

Do you think we need to spend more effort on investigating and/or mitigating the causes of high tail latency for DNS handling?
We have a pretty good idea of the cause of the tail latency spikes:
The plan to fix these is primarily to invert the label selection, which has the nice side effect that the vast majority of FQDN responses will no longer require a policy update, just an ip->identity update. Separately, we would like to refactor the Envoy API to support incremental updates. This is a broader task, and is not currently formally on any list. We'd like to see the results of the FQDN refactor first, since that is needed for other purposes (e.g. the S3 problem).
I'm fine with this and if we're actively pursuing improvements then great.
My only remaining query is: Do you see a path to removing / reducing this at some point? Or do you think this is something we'll set and forget? Because with such a high timeout it becomes far less likely that this setting will be a thorn in our side during testing, so unless we focus on it, we are likely to lose visibility on it.
If we can get a handle on tail latencies, I'd love to reduce this to something quite strict in CI tests.
/test
Ref: #29846