
Timeout while waiting for initial conntrack scan #32680

Closed
MadMax240484 opened this issue May 23, 2024 · 15 comments · Fixed by #34070
Assignees: tommyp1ckles
Labels: feature/conntrack · kind/bug (This is a bug in the Cilium logic.) · kind/community-report (This was reported by a user in the Cilium community, e.g. via Slack.) · sig/agent (Cilium agent related.)

Comments

@MadMax240484

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We have some nodes with a huge number of CT records. Each time the agent restarts, it fails with the message level=fatal msg="Timeout while waiting for initial conntrack scan" subsys=ct-gc because the timeout for finishing the GC scan is not enough. I found that this timeout is hardcoded:

select {
case <-initialScanComplete:
	gc.logger.Info("Initial scan of connection tracking completed")
case <-time.After(30 * time.Second):
	gc.logger.Fatal("Timeout while waiting for initial conntrack scan")
}

Could you add a configurable parameter, something like gc_scan_multiplicator, so this can be tuned?
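For illustration, a minimal sketch of what that could look like, reusing the identifiers from the snippet above and assuming a hypothetical gcScanMultiplier option (none of these names exist in Cilium today):

// Hypothetical: scale the default 30s timeout by a configurable multiplier.
baseTimeout := 30 * time.Second
scanTimeout := time.Duration(float64(baseTimeout) * gcScanMultiplier)

select {
case <-initialScanComplete:
	gc.logger.Info("Initial scan of connection tracking completed")
case <-time.After(scanTimeout):
	gc.logger.Fatal("Timeout while waiting for initial conntrack scan")
}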

Cilium Version

1.13

Kernel Version

5.14.0-168.el9.x86_64

Kubernetes Version

1.27

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

logs-from-cilium-agent-in-cilium-h9fxl.log

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@MadMax240484 MadMax240484 added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 23, 2024
@lmb lmb added feature/conntrack sig/agent Cilium agent related. labels May 23, 2024
@lmb
Contributor

lmb commented May 23, 2024

Seems like this has been present since 1.6.0, so it's not a regression, but it is definitely a scalability issue. Are you willing to contribute a fix? We have docs at https://docs.cilium.io/en/stable/contributing/development/contributing_guide/; jumping on the Slack is probably a good first step.

@MadMax240484
Author

Thanks for the information. I will try.

@WeeNews

WeeNews commented Jun 4, 2024

Maybe the fixed 30s timeout is unreasonable. When there is a large number of CT records, this time should be longer.

@MadMax240484
Author

MadMax240484 commented Jun 6, 2024

Maybe the fixed 30s timeout is unreasonable. When there is a large number of CT records, this time should be longer.

Better to make it a parameter, because as I understand it, it can depend on host resources.

@ti-mo ti-mo removed the needs/triage This issue requires triaging to establish severity and next steps. label Jun 20, 2024
@joestringer
Member

How many entries are we talking about syncing here? 30s is a very long time, unless Cilium is also starved for CPU.

I'm fine in principle with making this tunable, but I don't know whether there are other parts of the code that may also make their own assumptions based on when this will complete.

@WeeNews

WeeNews commented Jul 11, 2024

@joestringer We discovered this problem in the following scenario: the number of Pods on a single node reaches 100+, and the ct map entries reach 200w+. Completing the initial scan of connection tracking within the first 30 seconds of startup seems impossible.

@joestringer
Member

ct map entries reach 200w+

Does this mean 200'0000 (2,000,000)?

@WeeNews

WeeNews commented Jul 12, 2024

ct map entries reach 200w+

Does this mean 200'0000 (2,000,000)?

yes

@joestringer
Member

To gather information, I changed the privileged map batch lookup test to use maps of size 2_000_000, then ran the benchmark:

$ git diff
diff --git a/pkg/maps/ctmap/ctmap_privileged_test.go b/pkg/maps/ctmap/ctmap_privileged_test.go
index 99250af9ce88..e63ce82db804 100644
--- a/pkg/maps/ctmap/ctmap_privileged_test.go
+++ b/pkg/maps/ctmap/ctmap_privileged_test.go
@@ -41,7 +41,7 @@ func BenchmarkMapBatchLookup(b *testing.B) {
        assert.NoError(b, m.Map.Unpin())
        assert.NoError(b, err)
 
-       _ = populateFakeDataCTMap4(b, m, option.CTMapEntriesGlobalTCPDefault)
+       _ = populateFakeDataCTMap4(b, m, 2_000_000)
 
        b.ReportAllocs()
        b.ResetTimer()
diff --git a/pkg/maps/ctmap/types.go b/pkg/maps/ctmap/types.go
index 81dad1f8e80c..35d560137004 100644
--- a/pkg/maps/ctmap/types.go
+++ b/pkg/maps/ctmap/types.go
@@ -132,6 +132,10 @@ func (m mapType) value() bpf.MapValue {
 }
 
 func (m mapType) maxEntries() int {
+       if true {
+               return 2_000_000
+       }
+
        switch m {
        case mapTypeIPv4TCPGlobal, mapTypeIPv6TCPGlobal:
                if option.Config.CTMapEntriesGlobalTCP != 0 {
$ sudo -E make bench-privileged BENCH=BenchmarkMapBatchLookup TESTPKGS=./pkg/maps/ctmap
PRIVILEGED_TESTS=true CGO_ENABLED=0 go test -mod=vendor -vet=all -tags=osusergo  -timeout 600s -bench=BenchmarkMapBatchLookup -run=^$ -benchtime=10s ./pkg/maps/ctmap
goos: linux
goarch: amd64
pkg: github.com/cilium/cilium/pkg/maps/ctmap
cpu: 13th Gen Intel(R) Core(TM) i7-1365U
BenchmarkMapBatchLookup-12            12         998695373 ns/op          392369 B/op      11251 allocs/op
PASS
ok      github.com/cilium/cilium/pkg/maps/ctmap 33.648s

My understanding of the above is that it is taking 998_695_373 nanoseconds to dump a full map of 2_000_000 entries, so around one second total. This is on my dev laptop without power plugged in, so this is probably not a very tuned environment for benchmarking.

I'm not sure whether the GC is using the batch dump operations yet; if not, we should improve it to use them, since batch dumping is known to be a lot faster. Improvements like this that benefit the code generally could potentially solve your issue as well, without the extra flag.
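For reference, a rough sketch of what a batched dump looks like when driving the cilium/ebpf library directly (recent versions expose a cursor-based BatchLookup; the ctKey/ctValue layouts below are placeholders, not the real ctmap types, and this is not the actual GC code):

package example

import (
	"errors"
	"fmt"

	"github.com/cilium/ebpf"
)

// Placeholder key/value layouts; the sizes must match the map's
// actual key/value sizes.
type ctKey struct{ Raw [56]byte }
type ctValue struct{ Raw [56]byte }

// batchDump walks an entire map with BPF_MAP_LOOKUP_BATCH in chunks
// and returns the number of entries seen.
func batchDump(m *ebpf.Map) (int, error) {
	const chunk = 4096
	keys := make([]ctKey, chunk)
	values := make([]ctValue, chunk)

	var cursor ebpf.MapBatchCursor
	total := 0
	for {
		n, err := m.BatchLookup(&cursor, keys, values, nil)
		total += n
		if errors.Is(err, ebpf.ErrKeyNotExist) {
			return total, nil // reached the end of the map
		}
		if err != nil {
			return total, fmt.Errorf("batch lookup: %w", err)
		}
	}
}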

@WeeNews

WeeNews commented Jul 31, 2024

Is this test representative enough of the GC logic? After all, during real GC, all table entries need to be traversed and some operations performed on them.

@WeeNews

WeeNews commented Jul 31, 2024

I see that DumpReliablyWithCallback is used.

@tommyp1ckles tommyp1ckles self-assigned this Aug 1, 2024
@tommyp1ckles
Contributor

Is this test representative enough of the GC logic? After all, during real GC, all table entries need to be traversed and some operations performed on them.

@WeeNews @joestringer This GC also waits for a NAT map GC pass to complete, and the NAT GC requires a per-entry lookup into the CT map, so that might contribute to the extra slowness.

The 13th-gen Intel that Joe ran the test on might also not be the most representative test environment 😉

@tommyp1ckles
Contributor

I'm not sure whether the GC is using the batch dump operations yet; if not, we should improve it to use them, since batch dumping is known to be a lot faster. Improvements like this that benefit the code generally could potentially solve your issue as well, without the extra flag.

We now have a batch dumping API for both the ctmap and NAT map types, but both are currently only used for metrics/stats. The ctmap GC is just doing expiry checks, so doing that batched makes a lot of sense.

It would also be nice to do the NAT pass batched, but we'd still be bottlenecked on the ctmap key lookups, and there might be a slight increase in the possibility of some kind of race condition there.

@tommyp1ckles
Contributor

tommyp1ckles commented Aug 1, 2024

@MadMax240484 Did you mention you were looking at fixing this one? I hate to step on any toes, but this issue has been causing other problems recently, so I need to get a fix merged sooner rather than later.

I've been working on a fix here: #34070

tommyp1ckles added a commit that referenced this issue Aug 1, 2024
If the agent has not completed its initial ctmap/NAT GC scan within a
hardcoded 30 second timeout, the agent terminates with level=fatal.

This can cause issues when the agent is running in an environment with
high resource contention, which makes it possible for the initial GC to
time out and put Cilium into a crash loop.

This only affects the ctmap pressure metrics controller, which isn't
critical, so instead of terminating, the agent will now wait for the
initial scan asynchronously and emit a warning log after 30 seconds if
the initial GC scan has not completed.

This still allows us to catch issues in CI and in environments where we
don't expect to see these timeouts, while avoiding unnecessary
termination.

Fixes: #32680

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
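For context, a minimal self-contained sketch of the behaviour the commit message describes, with illustrative names only (this is not the actual patch):

package example

import (
	"log"
	"time"
)

// waitForInitialScan waits for the initial CT/NAT GC pass asynchronously,
// logging a warning (instead of exiting) if it takes longer than timeout.
func waitForInitialScan(initialScanComplete <-chan struct{}, timeout time.Duration) {
	go func() {
		select {
		case <-initialScanComplete:
			log.Print("Initial scan of connection tracking completed")
		case <-time.After(timeout):
			log.Print("warning: timeout while waiting for initial conntrack scan")
			// Keep waiting: only the ctmap pressure metrics controller
			// depends on the initial scan completing.
			<-initialScanComplete
			log.Print("Initial scan of connection tracking completed")
		}
	}()
}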
tommyp1ckles added a commit that referenced this issue Aug 1, 2024
tommyp1ckles added a commit to tommyp1ckles/cilium that referenced this issue Aug 2, 2024
ctmap-initial-gc-timeout will be used to set the ctmap initial GC pass
timeout, at which point the cilium agent will terminate.

This is needed because under some circumstances the previously
hardcoded 30 second timeout could be insufficient in
resource-constrained environments.

Fixes: cilium#32680

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
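As an illustration of the kind of option this commit describes, a sketch of wiring a duration flag (the flag name comes from the commit message; the function name and default value are assumptions, not the actual Cilium code):

package example

import (
	"time"

	"github.com/spf13/pflag"
)

// registerCTGCTimeout registers a tunable timeout for the initial
// conntrack/NAT GC pass. Illustrative only.
func registerCTGCTimeout(flags *pflag.FlagSet) *time.Duration {
	return flags.Duration("ctmap-initial-gc-timeout", 30*time.Second,
		"Timeout for the initial conntrack/NAT GC scan before the agent terminates")
}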
@MadMax240484
Author

@tommyp1ckles Unfortunately, I haven't started on this fix yet because I have some other issues to resolve first.
If you are working on #34070 and can address this issue there, that would be very good.
Thanks for the help.

tommyp1ckles added a commit that referenced this issue Aug 9, 2024
tommyp1ckles added a commit that referenced this issue Aug 21, 2024
tommyp1ckles added a commit that referenced this issue Sep 15, 2024
github-merge-queue bot pushed a commit that referenced this issue Sep 18, 2024