
Timeout while waiting for initial conntrack scan #32680

Closed
MadMax240484 opened this issue May 23, 2024 · 15 comments · Fixed by #34070
Assignees: tommyp1ckles
Labels: feature/conntrack · kind/bug (This is a bug in the Cilium logic.) · kind/community-report (This was reported by a user in the Cilium community, e.g. via Slack.) · sig/agent (Cilium agent related.)

Comments

@MadMax240484

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We have some nodes with a huge number of CT records. Each time the agent restarts, it fails with the message level=fatal msg="Timeout while waiting for initial conntrack scan" subsys=ct-gc because the timeout for finishing the GC scan is not enough. I found that this timeout is hardcoded:

select {
case <-initialScanComplete:
	gc.logger.Info("Initial scan of connection tracking completed")
case <-time.After(30 * time.Second):
	gc.logger.Fatal("Timeout while waiting for initial conntrack scan")
}

Could you add a configurable parameter, something like gc_scan_multiplicator, so this can be tuned?
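For illustration, a minimal sketch of what that could look like, reusing the identifiers from the snippet above and assuming a hypothetical gcScanMultiplier option (none of these names exist in Cilium today):

// Hypothetical: scale the default 30s timeout by a configurable multiplier.
baseTimeout := 30 * time.Second
scanTimeout := time.Duration(float64(baseTimeout) * gcScanMultiplier)

select {
case <-initialScanComplete:
	gc.logger.Info("Initial scan of connection tracking completed")
case <-time.After(scanTimeout):
	gc.logger.Fatal("Timeout while waiting for initial conntrack scan")
}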

Cilium Version

1.13

Kernel Version

5.14.0-168.el9.x86_64

Kubernetes Version

1.27

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

logs-from-cilium-agent-in-cilium-h9fxl.log

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@MadMax240484 MadMax240484 added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 23, 2024
@lmb lmb added feature/conntrack sig/agent Cilium agent related. labels May 23, 2024
@lmb
Contributor

lmb commented May 23, 2024

Seems like this has been present since 1.6.0, so it's not a regression, but it is definitely a scalability issue. Are you willing to contribute a fix? We have docs at https://docs.cilium.io/en/stable/contributing/development/contributing_guide/; jumping on the Slack is probably a good first step.

@MadMax240484
Author

Thanks for the information. I will try.

@WeeNews

WeeNews commented Jun 4, 2024

Maybe the fixed 30s timeout is unreasonable. When there is a large number of CT records, this time should be longer.

@MadMax240484
Author

MadMax240484 commented Jun 6, 2024

Maybe the fixed 30s timeout is unreasonable. When there is a large number of CT records, this time should be longer.

Better to make it a parameter, because as I understand it, it can depend on host resources.

@ti-mo ti-mo removed the needs/triage This issue requires triaging to establish severity and next steps. label Jun 20, 2024
@joestringer
Member

How many entries are we talking about syncing here? 30s is a very long time, unless Cilium is also starved for CPU.

I'm fine in principle with making this tunable, but I don't know whether there are other parts of the code that may also make their own assumptions based on when this will complete.

@WeeNews

WeeNews commented Jul 11, 2024

@joestringer We discovered this problem in the following scenario: the number of Pods on a single node reaches 100+, and the ct map entries reach 200w+. Completing the initial scan of connection tracking within the first 30 seconds of startup seems impossible.

@joestringer
Member

ct map entries reach 200w+

Does this mean 200'0000 (2,000,000)?

@WeeNews

WeeNews commented Jul 12, 2024

ct map entries reach 200w+

Does this mean 200'0000 (2,000,000)?

yes

@joestringer
Member

To gather information, I changed the privileged map batch lookup test to use maps of size 2_000_000, then ran the benchmark:

$ git diff
diff --git a/pkg/maps/ctmap/ctmap_privileged_test.go b/pkg/maps/ctmap/ctmap_privileged_test.go
index 99250af9ce88..e63ce82db804 100644
--- a/pkg/maps/ctmap/ctmap_privileged_test.go
+++ b/pkg/maps/ctmap/ctmap_privileged_test.go
@@ -41,7 +41,7 @@ func BenchmarkMapBatchLookup(b *testing.B) {
        assert.NoError(b, m.Map.Unpin())
        assert.NoError(b, err)
 
-       _ = populateFakeDataCTMap4(b, m, option.CTMapEntriesGlobalTCPDefault)
+       _ = populateFakeDataCTMap4(b, m, 2_000_000)
 
        b.ReportAllocs()
        b.ResetTimer()
diff --git a/pkg/maps/ctmap/types.go b/pkg/maps/ctmap/types.go
index 81dad1f8e80c..35d560137004 100644
--- a/pkg/maps/ctmap/types.go
+++ b/pkg/maps/ctmap/types.go
@@ -132,6 +132,10 @@ func (m mapType) value() bpf.MapValue {
 }
 
 func (m mapType) maxEntries() int {
+       if true {
+               return 2_000_000
+       }
+
        switch m {
        case mapTypeIPv4TCPGlobal, mapTypeIPv6TCPGlobal:
                if option.Config.CTMapEntriesGlobalTCP != 0 {
$ sudo -E make bench-privileged BENCH=BenchmarkMapBatchLookup TESTPKGS=./pkg/maps/ctmap
PRIVILEGED_TESTS=true CGO_ENABLED=0 go test -mod=vendor -vet=all -tags=osusergo  -timeout 600s -bench=BenchmarkMapBatchLookup -run=^$ -benchtime=10s ./pkg/maps/ctmap
goos: linux
goarch: amd64
pkg: github.com/cilium/cilium/pkg/maps/ctmap
cpu: 13th Gen Intel(R) Core(TM) i7-1365U
BenchmarkMapBatchLookup-12            12         998695373 ns/op          392369 B/op      11251 allocs/op
PASS
ok      github.com/cilium/cilium/pkg/maps/ctmap 33.648s

My understanding of the above is that it is taking 998_695_373 nanoseconds to dump a full map of 2_000_000 entries, so around one second total. This is on my dev laptop without power plugged in, so this is probably not a very tuned environment for benchmarking.

I'm not sure whether the GC is using the batch dump operations yet; if not, we should improve it to use them, since batch dumping is known to be a lot faster. Improvements like this that benefit the code generally could potentially solve your issue as well, without the extra flag.
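For reference, a rough sketch of what a batched dump looks like when driving the cilium/ebpf library directly (recent versions expose a cursor-based BatchLookup; the ctKey/ctValue layouts below are placeholders, not the real ctmap types, and this is not the actual GC code):

package example

import (
	"errors"
	"fmt"

	"github.com/cilium/ebpf"
)

// Placeholder key/value layouts; the sizes must match the map's
// actual key/value sizes.
type ctKey struct{ Raw [56]byte }
type ctValue struct{ Raw [56]byte }

// batchDump walks an entire map with BPF_MAP_LOOKUP_BATCH in chunks
// and returns the number of entries seen.
func batchDump(m *ebpf.Map) (int, error) {
	const chunk = 4096
	keys := make([]ctKey, chunk)
	values := make([]ctValue, chunk)

	var cursor ebpf.MapBatchCursor
	total := 0
	for {
		n, err := m.BatchLookup(&cursor, keys, values, nil)
		total += n
		if errors.Is(err, ebpf.ErrKeyNotExist) {
			return total, nil // reached the end of the map
		}
		if err != nil {
			return total, fmt.Errorf("batch lookup: %w", err)
		}
	}
}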

@WeeNews

WeeNews commented Jul 31, 2024

Is this test representative enough of the GC logic? After all, during real GC, all table entries need to be traversed and some operations performed on them.

@WeeNews

WeeNews commented Jul 31, 2024

I see that DumpReliablyWithCallback is used.

@tommyp1ckles tommyp1ckles self-assigned this Aug 1, 2024
@tommyp1ckles
Contributor

Is this test representative enough of the GC logic? After all, during real GC, all table entries need to be traversed and some operations performed on them.

@WeeNews @joestringer This GC also waits for a NAT map GC pass to complete, and the NAT GC requires a per-entry lookup into the CT map, so that might contribute to the extra slowness.

The 13th-gen Intel that Joe ran the test on might also not be the most representative test environment 😉

@tommyp1ckles
Contributor

I'm not sure whether the GC is using the batch dump operations yet; if not, we should improve it to use them, since batch dumping is known to be a lot faster. Improvements like this that benefit the code generally could potentially solve your issue as well, without the extra flag.

We now have a batch dumping API for both the ctmap and NAT map types, but both are currently only used for metrics/stats. The ctmap GC is just doing expiry checks, so doing that batched makes a lot of sense.

It would also be nice to do the NAT pass batched, but we'd still be bottlenecked on the ctmap key lookups, and there might be a slight increase in the possibility of some kind of race condition there.

@tommyp1ckles
Contributor

tommyp1ckles commented Aug 1, 2024

@MadMax240484 Did you mention you were looking at fixing this one? I hate to step on any toes, but this issue has been causing other problems recently, so I need to get a fix merged sooner rather than later.

I've been working on a fix here: #34070

tommyp1ckles added a commit that referenced this issue Aug 1, 2024
If the agent has not completed its initial ctmap/NAT GC scan within a
hardcoded 30 second timeout, the agent terminates with level=fatal.

This can cause issues when the agent is running in an environment with
high resource contention, which makes it possible for the initial GC to
time out and put Cilium into a crash loop.

This only affects the ctmap pressure metrics controller, which isn't
critical, so instead of terminating, the agent will now wait for the
initial scan asynchronously and emit a warning log after 30 seconds if
the initial GC scan has not completed.

This still allows us to catch issues in CI and in environments where we
don't expect to see these timeouts, while avoiding unnecessary
termination.

Fixes: #32680

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
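For context, a minimal self-contained sketch of the behaviour the commit message describes, with illustrative names only (this is not the actual patch):

package example

import (
	"log"
	"time"
)

// waitForInitialScan waits for the initial CT/NAT GC pass asynchronously,
// logging a warning (instead of exiting) if it takes longer than timeout.
func waitForInitialScan(initialScanComplete <-chan struct{}, timeout time.Duration) {
	go func() {
		select {
		case <-initialScanComplete:
			log.Print("Initial scan of connection tracking completed")
		case <-time.After(timeout):
			log.Print("warning: timeout while waiting for initial conntrack scan")
			// Keep waiting: only the ctmap pressure metrics controller
			// depends on the initial scan completing.
			<-initialScanComplete
			log.Print("Initial scan of connection tracking completed")
		}
	}()
}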
tommyp1ckles added a commit that referenced this issue Aug 1, 2024
tommyp1ckles added a commit to tommyp1ckles/cilium that referenced this issue Aug 2, 2024
ctmap-initial-gc-timeout will be used to set the ctmap initial GC pass
timeout, at which point the cilium agent will terminate.

This is needed because under some circumstances the previously
hardcoded 30 second timeout could be insufficient in
resource-constrained environments.

Fixes: cilium#32680

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
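As an illustration of the kind of option this commit describes, a sketch of wiring a duration flag (the flag name comes from the commit message; the function name and default value are assumptions, not the actual Cilium code):

package example

import (
	"time"

	"github.com/spf13/pflag"
)

// registerCTGCTimeout registers a tunable timeout for the initial
// conntrack/NAT GC pass. Illustrative only.
func registerCTGCTimeout(flags *pflag.FlagSet) *time.Duration {
	return flags.Duration("ctmap-initial-gc-timeout", 30*time.Second,
		"Timeout for the initial conntrack/NAT GC scan before the agent terminates")
}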
@MadMax240484
Author

@tommyp1ckles Unfortunately, I haven't started on this fix yet because I have some other issues to resolve first.
If you are working on #34070 and can address this issue there, that would be very good.
Thanks for the help.

tommyp1ckles added a commit that referenced this issue Aug 9, 2024
tommyp1ckles added a commit that referenced this issue Aug 21, 2024
tommyp1ckles added a commit that referenced this issue Sep 15, 2024
github-merge-queue bot pushed a commit that referenced this issue Sep 18, 2024