
Deadlock in Cilium Agent #23707

Closed
mantoine96 opened this issue Feb 13, 2023 · 15 comments
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. sig/agent Cilium agent related. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@mantoine96
Contributor

mantoine96 commented Feb 13, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

After running for a while (typically several days for us), a Cilium agent becomes unresponsive and starts failing its liveness checks, which causes Kubernetes to restart it. Since we leverage FQDN rules heavily, this ends up causing network disruption on the affected host.

Cilium Version

1.12.5 820a308 2022-12-22T16:16:56+00:00 go version go1.18.9 linux/amd64

This is essentially the released 1.12.5 + this PR backported: #22252

Kernel Version

5.4.228-131.415.amzn2.x86_64

Kubernetes Version

v1.23.14-eks-ffeb93d

Sysdump

No response

Relevant log output

No response

Anything else?

The entire thread dump can be found here: cilium-wh7xn.log

This was captured by setting a preStop lifecycle hook to run kill -s ABRT 1.

This issue started occurring with v1.12.2 for us.

cc @joaoubaldo @carloscastrojumo

From our dashboards, this is what we can observe:
[dashboard screenshot]
The agent restarting / going unhealthy happens shortly after a large number of endpoints become not ready.

We have other metrics and logs if needed so let us know if you need more information for troubleshooting.

Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@mantoine96 mantoine96 added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Feb 13, 2023
@gandro
Member

gandro commented Feb 13, 2023

Thanks for the report! The log you shared contains the following error:

fatal error: concurrent map read and map write

goroutine 377016966 [running]:
runtime.throw({0x2f8732a?, 0xc020efa810?})
	/usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc0018a9b78 sp=0xc0018a9b48 pc=0x4387b1
runtime.mapaccess2_faststr(0xc0018a9c70?, 0xc020efa810?, {0x2f18737, 0x4})
	/usr/local/go/src/runtime/map_faststr.go:117 +0x3d4 fp=0xc0018a9be0 sp=0xc0018a9b78 pc=0x413214
github.com/cilium/cilium/pkg/ipam.(*crdAllocator).totalPoolSize(...)
	/go/src/github.com/cilium/cilium/pkg/ipam/crd.go:848
github.com/cilium/cilium/pkg/ipam.(*crdAllocator).Dump(0xc005de0eb0)
	/go/src/github.com/cilium/cilium/pkg/ipam/crd.go:864 +0x189 fp=0xc0018a9ce8 sp=0xc0018a9be0 pc=0x1ffc1a9
github.com/cilium/cilium/pkg/ipam.(*IPAM).Dump(0xc005f70000)
	/go/src/github.com/cilium/cilium/pkg/ipam/allocator.go:328 +0xe6 fp=0xc0018a9e40 sp=0xc0018a9ce8 pc=0x1fed1e6
github.com/cilium/cilium/daemon/cmd.(*Daemon).DumpIPAM(0xc02a6ac900?)
	/go/src/github.com/cilium/cilium/daemon/cmd/ipam.go:134 +0x35 fp=0xc0018a9f90 sp=0xc0018a9e40 pc=0x2626a95
github.com/cilium/cilium/daemon/cmd.(*Daemon).startStatusCollector.func8({0xc02a6ac900?, 0xc01fded7d0?})
	/go/src/github.com/cilium/cilium/daemon/cmd/status.go:800 +0x1d fp=0xc0018a9fa8 sp=0xc0018a9f90 pc=0x2644f1d
github.com/cilium/cilium/pkg/status.(*Collector).runProbe.func1()
	/go/src/github.com/cilium/cilium/pkg/status/status.go:172 +0x48 fp=0xc0018a9fe0 sp=0xc0018a9fa8 pc=0x2376d28
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0018a9fe8 sp=0xc0018a9fe0 pc=0x46bba1
created by github.com/cilium/cilium/pkg/status.(*Collector).runProbe
	/go/src/github.com/cilium/cilium/pkg/status/status.go:171 +0x25a

This could indicate that the liveness probe failed to respond because the agent crashed due to a concurrent map write.
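
To illustrate why this takes down the whole agent rather than just one goroutine, here is a minimal, hypothetical Go program (not Cilium code) that reproduces the same class of failure; Go's runtime detects unsynchronized map access and aborts the entire process with exactly this kind of fatal error:

package main

import "time"

func main() {
	pool := map[string]int{"ipv4": 0}

	// Writer: updates the map under no lock shared with the reader
	// (analogous to nodeStore.updateLocalNodeResource in the report above).
	go func() {
		for i := 0; ; i++ {
			pool["ipv4"] = i
		}
	}()

	// Reader: reads the same map concurrently (analogous to
	// crdAllocator.totalPoolSize). The runtime typically aborts within
	// milliseconds with "fatal error: concurrent map read and map write",
	// which is unrecoverable and kills the whole process.
	go func() {
		for {
			_ = pool["ipv4"]
		}
	}()

	time.Sleep(time.Second)
}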

There is, however, a recent FQDN-related deadlock we fixed as well, which might be of interest to you.

That deadlock results in the following log line (which subsequently also causes the liveness check to fail):

2023-01-31T11:07:46.392482213Z level=warning msg="Timeout while waiting probe" probe=l7-proxy startTime="2023-01-31 11:06:46.391301457 +0000 UTC m=+1901.287024906" subsys=status

@gandro
Member

gandro commented Feb 13, 2023

Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.

I checked back with my colleagues and, unfortunately, we do not recommend running lockdebug in production. Its overhead for each locking operation is too high. Fetching a thread dump as you did is probably the better approach.
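
As a rough, hypothetical illustration of where that overhead comes from (this is not Cilium's actual lockdebug implementation), a deadlock-detecting mutex has to add bookkeeping around every single lock acquisition:

package lockdebugsketch

import (
	"log"
	"sync"
	"time"
)

// CheckedMutex is an invented drop-in wrapper around sync.Mutex, shown only
// to illustrate why per-lock instrumentation is expensive.
type CheckedMutex struct {
	mu sync.Mutex
}

// Lock adds work on every acquisition: it watches how long the caller waits
// and logs a warning if the wait exceeds a threshold. A real detector also
// records lock-ordering information, which is even more expensive.
func (c *CheckedMutex) Lock() {
	const warnAfter = 30 * time.Second

	acquired := make(chan struct{})
	go func() {
		select {
		case <-acquired:
			// Lock obtained in time, nothing to report.
		case <-time.After(warnAfter):
			log.Printf("possible deadlock: waited more than %v for lock", warnAfter)
		}
	}()

	c.mu.Lock()
	close(acquired)
}

func (c *CheckedMutex) Unlock() {
	c.mu.Unlock()
}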

gandro added a commit to gandro/cilium that referenced this issue Feb 13, 2023
This fixes a panic in the `totalPoolSize` function. Previously,
`totalPoolSize` required that the `crdAllocator` mutex was held. This
however is not sufficient to block concurrent writes to the
`allocationPoolSize` map, since that map is written to by
`nodeStore.updateLocalNodeResource`, which only holds the `nodeStore`
mutex.

This commit fixes the issue by moving the `totalPoolSize` function to
the `nodeStore` and having it explicitly take the `nodeStore` mutex
(instead of requiring the `crdAllocator` mutex to be held). This ensures
that all access to `allocationPoolSize` is now protected by the
`nodeStore` mutex.

The lock ordering is also preserved: The `crdAllocator` calls into
`nodeStore`, but not vice versa. Thus, the lock ordering is always that
the `crdAllocator` lock is held first, and the `nodeStore` lock second.

Related to: cilium#23707

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
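
For readers who want to see the shape of the fix, here is a hedged sketch of the locking pattern described in the commit message above; the types and fields are simplified stand-ins, not the actual definitions in pkg/ipam:

package ipamsketch

import "sync"

type nodeStore struct {
	mutex              sync.RWMutex
	allocationPoolSize map[string]int // only ever written while holding nodeStore.mutex
}

// totalPoolSize now lives on nodeStore and takes the nodeStore mutex itself,
// so reads of allocationPoolSize are serialized with the writer
// (nodeStore.updateLocalNodeResource in the real code).
func (n *nodeStore) totalPoolSize(family string) int {
	n.mutex.RLock()
	defer n.mutex.RUnlock()
	return n.allocationPoolSize[family]
}

type crdAllocator struct {
	mutex sync.Mutex
	store *nodeStore
}

// Dump acquires the crdAllocator lock first and only then calls into the
// nodeStore, preserving the crdAllocator -> nodeStore lock ordering mentioned
// above, so no ordering cycle (and hence no deadlock) is introduced.
func (a *crdAllocator) Dump() map[string]int {
	a.mutex.Lock()
	defer a.mutex.Unlock()
	return map[string]int{"ipv4": a.store.totalPoolSize("ipv4")}
}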
@gandro
Member

gandro commented Feb 13, 2023

I've opened a PR to fix the "fatal error: concurrent map read and map write" issue here: #23713

It's still possible, however, that you also observed a deadlock. If you have additional thread dumps from other runs, those would be helpful to confirm that there are indeed no other unknown problems.

@mantoine96
Contributor Author

Hey @gandro

Thank you so much for looking into this issue. We did indeed see the "fatal error: concurrent map read and map write", but we were debating whether it might actually have occurred because of the kill -s ABRT.

I've checked #23377 and we have indeed seen some occurrences of this log message followed by agent restarts (3 in Dev, 1 in Prod, in the last ~15 days). So we will definitely get this into our image.

We haven't gotten any other thread dumps yet, but we will add them here when we do.

Thanks again!

@mantoine96
Contributor Author

Could the error you pointed out in #23713 have resulted in a deadlock / agent becoming unresponsive / healthcheck failing?

@gandro
Member

gandro commented Feb 13, 2023

Could the error you pointed out in #23713 have resulted in a deadlock / agent becoming unresponsive / healthcheck failing?

Yes, at least in theory. The panic stack dump you provided above has its root in created by github.com/cilium/cilium/pkg/status.(*Collector).runProbe. That probe is used to determine the status of each Cilium subsystem (such as IPAM, in the case of the failing probe).

If the probe never returns a status (such as when it panics), probeReturned is never triggered, and thus the probe is marked as stale after 60 seconds:

c.updateProbeStatus(p, nil, true, staleErr)

If there are any stale probes, then Cilium as a whole will not be marked as "ready"/"live":

cilium/daemon/cmd/status.go

Lines 652 to 657 in 428c6d2

case len(sr.Stale) > 0:
	msg := "Stale status data"
	sr.Cilium = &models.Status{
		State: models.StatusStateWarning,
		Msg:   fmt.Sprintf("%s %s", ciliumVer, msg),
	}

The only issue with that theory is that we should see the following warning emitted 15 seconds after the panic (which I do not see in the logs):

Warnf("No response from probe within %v seconds",
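
As a rough illustration of that mechanism (a hypothetical sketch, not the actual pkg/status code): each probe runs in its own goroutine, and if it never reports back within the timeout, the collector marks it stale, which in turn degrades the aggregated agent status and makes the liveness check fail:

package statussketch

import (
	"errors"
	"fmt"
	"time"
)

// runProbeWithStaleCheck launches probe in its own goroutine and marks it
// stale if it does not return within staleAfter. Names and signatures here
// are invented for this sketch.
func runProbeWithStaleCheck(name string, probe func() (any, error), staleAfter time.Duration) {
	done := make(chan struct{})

	go func() {
		status, err := probe() // if this blocks forever (e.g. on a deadlocked mutex), done is never closed
		fmt.Printf("probe %s returned: %v %v\n", name, status, err)
		close(done)
	}()

	select {
	case <-done:
		// Probe answered in time; its status would be recorded normally.
	case <-time.After(staleAfter):
		// Probe never answered: record it as stale. Any stale probe causes
		// the aggregated Cilium status to be reported as a warning, so the
		// agent's liveness/readiness endpoints start failing.
		staleErr := errors.New("probe is stale")
		fmt.Printf("probe %s: %v\n", name, staleErr)
	}
}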

nbusseneau pushed a commit that referenced this issue Feb 14, 2023
@christarazi christarazi added the sig/agent Cilium agent related. label Feb 15, 2023
pchaigno pushed a commit to pchaigno/cilium that referenced this issue Feb 15, 2023
[ upstream commit e3a78b0 ]
pchaigno pushed a commit that referenced this issue Feb 16, 2023
[ upstream commit e3a78b0 ]
pchaigno pushed a commit to pchaigno/cilium that referenced this issue Feb 16, 2023
[ upstream commit e3a78b0 ]
pchaigno pushed a commit that referenced this issue Feb 20, 2023
[ upstream commit e3a78b0 ]
sayboras pushed a commit to sayboras/cilium that referenced this issue Feb 23, 2023
[ upstream commit e3a78b0 ]
sayboras pushed a commit that referenced this issue Feb 27, 2023
[ upstream commit e3a78b0 ]
@christarazi
Member

christarazi commented Apr 4, 2023

Are you still experiencing this in v1.12.8?

@gandro
Member

gandro commented Apr 12, 2023

We've merged a deadlock fix around IPCache and FQDN (#24672), which sounds like it could be the issue here. Please test again with v1.12.9, which should be released next week.

@mantoine96
Contributor Author

Hello @christarazi

We did have one more crash occurrence with the patches you mentioned (we had backported them to our image).

@gandro Great I'll test it out then!

To be fair, this is really rare; we last encountered the issue 20 days ago. Hopefully this issue is the right one! :)

Thanks again!

@github-actions

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 12, 2023
@github-actions

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 26, 2023
@lukaselmer

lukaselmer commented Jul 17, 2023

Hopefully this issue is the right one! :)

Hi @mantoine96,

It seems we are facing the same issue (however, we only observe the issue in the AWS Zurich region, and not in the AWS Frankfurt region).

Did you upgrade to the latest version of Cilium? Did it resolve the issue?

Thx!

@gandro
Member

gandro commented Jul 18, 2023

Note that we fixed another related issue, #26242, which is still in the process of being backported to v1.13.

@mantoine96
Contributor Author

mantoine96 commented Jul 18, 2023

Hello @gandro

Just confirming that in the last ~74 days, we have not had a single occurrence of the crash using our custom image based on v1.12.9.

#26242 also looks similar. We are in the process of upgrading to v1.13 so I will look out for it!

Thanks for your work on this!

@lukaselmer I would therefore recommend you try out 1.12.9 if you're experiencing similar issues. Like I mentioned, we haven't had an issue in > 74 days, which is pretty good going.

@lukaselmer

Thanks @mantoine96 @gandro 🙏 In that case we'll upgrade asap 🚀🚀🚀
