Timeout while waiting for initial conntrack scan #32680
Comments
Seems like this has been present since 1.6.0, so not a regression, but definitely a scalability issue. Are you willing to contribute a fix? We have docs at https://docs.cilium.io/en/stable/contributing/development/contributing_guide/; jumping on the Slack is probably a good first thing to do.
Thanks for the information. I will try.
Maybe the fixed 30s timeout is unreasonable. When there is a large number of CT records, this timeout should be longer.
Better to make it a parameter, because it can depend on host resources, as I understand it.
How many entries are we talking about syncing here? 30s is a very long time, unless Cilium is also starved for CPU. I'm fine in principle with making this tunable, but I don't know whether there are other parts of the code that may also make their own assumptions based on when this will complete.
@joestringer we discovered this problem in the following scenario: the number of Pods on a single node reaches 100+, and the CT map entries reach 200w+. Completing the initial connection tracking scan during the first 30 seconds of startup seems impossible.
Does this mean 200'0000 (2,000,000)?
yes
To gather information, I modified the map batch lookup privileged test to use maps of size 2_000_000, then ran the benchmark:
My understanding of the above is that it takes 998_695_373 nanoseconds to dump a full map of 2_000_000 entries, so around one second total. This is on my dev laptop without power plugged in, so probably not a very tuned environment for benchmarking. I'm not sure whether the GC is using the batch dump operations yet; if not, we should improve it to use them, as this is known to be a lot faster. Options like this to improve the code generally for everyone could potentially help solve your issue as well, without the extra flag.
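For context, a batched dump with the cilium/ebpf library looks roughly like the sketch below. This assumes the cursor-based BatchLookup API from recent cilium/ebpf releases, and the key/value types are placeholders; it is not the actual benchmark or GC code.

```go
package sketch

import (
	"errors"
	"fmt"

	"github.com/cilium/ebpf"
)

// dumpBatched walks an entire BPF hash map using BPF_MAP_LOOKUP_BATCH,
// which needs far fewer syscalls than per-key GetNextKey/Lookup iteration.
// Key/value types here are placeholders; the real CT map layouts differ.
func dumpBatched(m *ebpf.Map, chunk int) (int, error) {
	keys := make([]uint32, chunk) // assumes 4-byte keys for the sketch
	vals := make([]uint64, chunk) // assumes 8-byte values for the sketch
	var cursor ebpf.MapBatchCursor
	total := 0
	for {
		n, err := m.BatchLookup(&cursor, keys, vals, nil)
		total += n
		if errors.Is(err, ebpf.ErrKeyNotExist) {
			return total, nil // cursor reached the end of the map
		}
		if err != nil {
			return total, fmt.Errorf("batch lookup: %w", err)
		}
	}
}
```

Each BatchLookup call covers chunk entries in a single syscall, which is where the speedup over per-key iteration comes from.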
Is this test representative of the GC logic? After all, during real GC, all table entries need to be traversed and some operations performed on each.
I see that DumpReliablyWithCallback is used.
@WeeNews @joestringer This GC also waits for a NAT map GC pass to complete; the NAT GC also requires a per-entry lookup into the CT map, so that might contribute to the extra slowness. The 13th gen Intel that Joe ran the test on might also not be the best test case 😉
We now have a batch dumping API for both the ctmap and NAT map types, though both are currently only used for metrics/stats. The ctmap GC is just doing expiry checks, so doing that batched makes a lot of sense. It would also be nice to batch the NAT pass, but we'd still be bottlenecked on the ctmap key lookups, and there might be a slight increase in the possibility of some kind of race condition there.
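To make the batched expiry idea concrete, here is a rough sketch under the same cilium/ebpf assumptions as above; CtKey, CtEntry, and the Lifetime comparison are hypothetical stand-ins for the real ctmap logic:

```go
package sketch

import (
	"errors"

	"github.com/cilium/ebpf"
)

// CtKey and CtEntry are hypothetical stand-ins for the real ctmap
// layouts; only the Lifetime field matters for this sketch.
type CtKey struct{ Raw [40]byte }
type CtEntry struct {
	Lifetime uint64
	// remaining fields elided
}

// sweepExpired does one batched expiry pass: dump a chunk of entries,
// collect the keys whose lifetime has passed, and delete them with a
// single batched syscall per chunk.
func sweepExpired(m *ebpf.Map, now uint64, chunk int) error {
	keys := make([]CtKey, chunk)
	vals := make([]CtEntry, chunk)
	var cursor ebpf.MapBatchCursor
	for {
		n, lerr := m.BatchLookup(&cursor, keys, vals, nil)
		expired := make([]CtKey, 0, n)
		for i := 0; i < n; i++ {
			if vals[i].Lifetime < now {
				expired = append(expired, keys[i])
			}
		}
		if len(expired) > 0 {
			if _, derr := m.BatchDelete(expired, nil); derr != nil {
				return derr
			}
		}
		if errors.Is(lerr, ebpf.ErrKeyNotExist) {
			return nil // end of map
		}
		if lerr != nil {
			return lerr
		}
	}
}
```

Note that the window between the batch lookup and the batch delete is not atomic (an entry could be refreshed in between), which is the kind of race condition concern mentioned above.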
@MadMax240484 did you mention you were looking at fixing this one? I hate to step on any toes, but this issue has been causing other problems recently, so I need to get a fix merged sooner rather than later. I've been working on a fix here: #34070
If the agent has not completed its initial ctmap/nat GC scan within a hardcoded 30 second timeout, it terminates with level=fatal. This can cause issues when the agent runs in an environment with high resource contention, making it possible for the initial GC to time out and put Cilium into a crash loop. The timeout only affects the ctmap pressure metrics controller, which isn't critical, so instead of terminating, the agent now waits for the initial scan asynchronously and emits a warning log after 30 seconds if the initial GC scan has not completed. This still lets us catch issues in CI/environments where such timeouts are unlikely, while avoiding unnecessary termination. Fixes: #32680 Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
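The pattern this commit message describes, sketched in isolation (not the actual Cilium code; initialScanDone is a hypothetical channel):

```go
package sketch

import (
	"log"
	"time"
)

// waitForInitialScan replaces a fatal timeout with an asynchronous wait:
// instead of terminating the agent, it logs a warning if the initial GC
// scan is still running after warnAfter, then keeps waiting.
// initialScanDone is a hypothetical channel closed by the GC goroutine
// once the first full ctmap/nat pass finishes.
func waitForInitialScan(initialScanDone <-chan struct{}, warnAfter time.Duration) {
	go func() {
		select {
		case <-initialScanDone:
			return
		case <-time.After(warnAfter):
			log.Printf("warning: initial conntrack GC scan still running after %v", warnAfter)
			<-initialScanDone // wait it out rather than terminating
		}
	}()
}
```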
ctmap-initial-gc-timeout will be used to set the ctmap initial GC pass timeout, at which point the cilium agent will terminate. This is needed because under some circumstances the previously hardcoded 30 second timeout could be insufficient in resource-constrained environments. Fixes: cilium#32680 Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
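For the tunable-timeout variant, the wiring might look like the sketch below, assuming spf13/pflag; the flag name is taken from the commit message and everything else is illustrative:

```go
package sketch

import (
	"time"

	"github.com/spf13/pflag"
)

// registerGCTimeoutFlag wires up the timeout as a flag instead of a
// hardcoded constant. The flag name comes from the commit message above;
// the default mirrors the previously hardcoded 30s value.
func registerGCTimeoutFlag(flags *pflag.FlagSet) *time.Duration {
	return flags.Duration("ctmap-initial-gc-timeout", 30*time.Second,
		"Timeout for the initial ctmap GC pass, after which the agent terminates")
}
```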
@tommyp1ckles Unfortunately, I haven't started on this fix yet because I have some other issues to resolve first.
Is there an existing issue for this?
What happened?
We have some nodes with a huge number of CT records. Each time the agent restarts, it fails with the message
level=fatal msg="Timeout while waiting for initial conntrack scan" subsys=ct-gc
because the timeout for finishing the GC scan is not enough. I found that this timeout is hardcoded. Could you add a tunable parameter like gc_scan_multiplicator to make tuning possible?
Cilium Version
1.13
Kernel Version
5.14.0-168.el9.x86_64
Kubernetes Version
1.27
Regression
No response
Sysdump
No response
Relevant log output
No response
Anything else?
logs-from-cilium-agent-in-cilium-h9fxl.log
Cilium Users Document
Code of Conduct