
Deadlock in Cilium Agent #23707

Closed
mantoine96 opened this issue Feb 13, 2023 · 15 comments
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. sig/agent Cilium agent related. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@mantoine96
Contributor

mantoine96 commented Feb 13, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

After running for a while (typically several days for us), a Cilium agent becomes unresponsive and starts failing its liveness checks, which causes Kubernetes to restart it. Since we leverage FQDN rules heavily, this ends up causing network disruption on the affected host.

Cilium Version

1.12.5 820a308 2022-12-22T16:16:56+00:00 go version go1.18.9 linux/amd64

This is essentially the released 1.12.5 + this PR backported: #22252

Kernel Version

5.4.228-131.415.amzn2.x86_64

Kubernetes Version

v1.23.14-eks-ffeb93d

Sysdump

No response

Relevant log output

No response

Anything else?

The entire thread dump can be found here: cilium-wh7xn.log

This was captured by setting a preStop lifecycle hook to run kill -s ABRT 1.

This issue started occurring with v1.12.2 for us.

cc @joaoubaldo @carloscastrojumo

From our dashboards, this is what we can observe:
[dashboard screenshot]
The agent restarting / going unhealthy happens shortly after a large number of endpoints become not ready.

We have other metrics and logs if needed so let us know if you need more information for troubleshooting.

Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@mantoine96 mantoine96 added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Feb 13, 2023
@gandro
Member

gandro commented Feb 13, 2023

Thanks for the report! The log you shared contains the following error:

fatal error: concurrent map read and map write

goroutine 377016966 [running]:
runtime.throw({0x2f8732a?, 0xc020efa810?})
	/usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc0018a9b78 sp=0xc0018a9b48 pc=0x4387b1
runtime.mapaccess2_faststr(0xc0018a9c70?, 0xc020efa810?, {0x2f18737, 0x4})
	/usr/local/go/src/runtime/map_faststr.go:117 +0x3d4 fp=0xc0018a9be0 sp=0xc0018a9b78 pc=0x413214
github.com/cilium/cilium/pkg/ipam.(*crdAllocator).totalPoolSize(...)
	/go/src/github.com/cilium/cilium/pkg/ipam/crd.go:848
github.com/cilium/cilium/pkg/ipam.(*crdAllocator).Dump(0xc005de0eb0)
	/go/src/github.com/cilium/cilium/pkg/ipam/crd.go:864 +0x189 fp=0xc0018a9ce8 sp=0xc0018a9be0 pc=0x1ffc1a9
github.com/cilium/cilium/pkg/ipam.(*IPAM).Dump(0xc005f70000)
	/go/src/github.com/cilium/cilium/pkg/ipam/allocator.go:328 +0xe6 fp=0xc0018a9e40 sp=0xc0018a9ce8 pc=0x1fed1e6
github.com/cilium/cilium/daemon/cmd.(*Daemon).DumpIPAM(0xc02a6ac900?)
	/go/src/github.com/cilium/cilium/daemon/cmd/ipam.go:134 +0x35 fp=0xc0018a9f90 sp=0xc0018a9e40 pc=0x2626a95
github.com/cilium/cilium/daemon/cmd.(*Daemon).startStatusCollector.func8({0xc02a6ac900?, 0xc01fded7d0?})
	/go/src/github.com/cilium/cilium/daemon/cmd/status.go:800 +0x1d fp=0xc0018a9fa8 sp=0xc0018a9f90 pc=0x2644f1d
github.com/cilium/cilium/pkg/status.(*Collector).runProbe.func1()
	/go/src/github.com/cilium/cilium/pkg/status/status.go:172 +0x48 fp=0xc0018a9fe0 sp=0xc0018a9fa8 pc=0x2376d28
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0018a9fe8 sp=0xc0018a9fe0 pc=0x46bba1
created by github.com/cilium/cilium/pkg/status.(*Collector).runProbe
	/go/src/github.com/cilium/cilium/pkg/status/status.go:171 +0x25a

This could indicate that the liveness probe failed to respond because the agent crashed due to a concurrent map write.
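
To illustrate why this takes down the whole agent rather than just one goroutine, here is a minimal, hypothetical Go program (not Cilium code) that reproduces the same class of failure; Go's runtime detects unsynchronized map access and aborts the entire process with exactly this kind of fatal error:

package main

import "time"

func main() {
	pool := map[string]int{"ipv4": 0}

	// Writer: updates the map under no lock shared with the reader
	// (analogous to nodeStore.updateLocalNodeResource in the report above).
	go func() {
		for i := 0; ; i++ {
			pool["ipv4"] = i
		}
	}()

	// Reader: reads the same map concurrently (analogous to
	// crdAllocator.totalPoolSize). The runtime typically aborts within
	// milliseconds with "fatal error: concurrent map read and map write",
	// which is unrecoverable and kills the whole process.
	go func() {
		for {
			_ = pool["ipv4"]
		}
	}()

	time.Sleep(time.Second)
}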

There is, however, a recent FQDN-related deadlock we fixed as well, which might be of interest to you.

That deadlock results in the following log line (which subsequently also causes the liveness check to fail):

2023-01-31T11:07:46.392482213Z level=warning msg="Timeout while waiting probe" probe=l7-proxy startTime="2023-01-31 11:06:46.391301457 +0000 UTC m=+1901.287024906" subsys=status

@gandro
Member

gandro commented Feb 13, 2023

Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.

I checked back with my colleagues and, unfortunately, we do not recommend running lockdebug in production. Its overhead for each locking operation is too high. Fetching a thread dump as you did is probably the better approach.
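
As a rough, hypothetical illustration of where that overhead comes from (this is not Cilium's actual lockdebug implementation), a deadlock-detecting mutex has to add bookkeeping around every single lock acquisition:

package lockdebugsketch

import (
	"log"
	"sync"
	"time"
)

// CheckedMutex is an invented drop-in wrapper around sync.Mutex, shown only
// to illustrate why per-lock instrumentation is expensive.
type CheckedMutex struct {
	mu sync.Mutex
}

// Lock adds work on every acquisition: it watches how long the caller waits
// and logs a warning if the wait exceeds a threshold. A real detector also
// records lock-ordering information, which is even more expensive.
func (c *CheckedMutex) Lock() {
	const warnAfter = 30 * time.Second

	acquired := make(chan struct{})
	go func() {
		select {
		case <-acquired:
			// Lock obtained in time, nothing to report.
		case <-time.After(warnAfter):
			log.Printf("possible deadlock: waited more than %v for lock", warnAfter)
		}
	}()

	c.mu.Lock()
	close(acquired)
}

func (c *CheckedMutex) Unlock() {
	c.mu.Unlock()
}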

gandro added a commit to gandro/cilium that referenced this issue Feb 13, 2023
This fixes a panic in the `totalPoolSize` function. Previously,
`totalPoolSize` required that the `crdAllocator` mutex was held. This
however is not sufficient to block concurrent writes to the
`allocationPoolSize` map, since that map is written to by
`nodeStore.updateLocalNodeResource`, which only holds the `nodeStore`
mutex.

This commit fixes the issue by moving the `totalPoolSize` function to
the `nodeStore` and having it explicitly take the `nodeStore` mutex
(instead of requiring the `crdAllocator` mutex to be held). This ensures
that all access to `allocationPoolSize` is now protected by the
`nodeStore` mutex.

The lock ordering is also preserved: The `crdAllocator` calls into
`nodeStore`, but not vice versa. Thus, the lock ordering is always that
the `crdAllocator` lock is held first, and the `nodeStore` lock second.

Related to: cilium#23707

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
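
For readers who want to see the shape of the fix, here is a hedged sketch of the locking pattern described in the commit message above; the types and fields are simplified stand-ins, not the actual definitions in pkg/ipam:

package ipamsketch

import "sync"

type nodeStore struct {
	mutex              sync.RWMutex
	allocationPoolSize map[string]int // only ever written while holding nodeStore.mutex
}

// totalPoolSize now lives on nodeStore and takes the nodeStore mutex itself,
// so reads of allocationPoolSize are serialized with the writer
// (nodeStore.updateLocalNodeResource in the real code).
func (n *nodeStore) totalPoolSize(family string) int {
	n.mutex.RLock()
	defer n.mutex.RUnlock()
	return n.allocationPoolSize[family]
}

type crdAllocator struct {
	mutex sync.Mutex
	store *nodeStore
}

// Dump acquires the crdAllocator lock first and only then calls into the
// nodeStore, preserving the crdAllocator -> nodeStore lock ordering mentioned
// above, so no ordering cycle (and hence no deadlock) is introduced.
func (a *crdAllocator) Dump() map[string]int {
	a.mutex.Lock()
	defer a.mutex.Unlock()
	return map[string]int{"ipv4": a.store.totalPoolSize("ipv4")}
}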
@gandro
Member

gandro commented Feb 13, 2023

I've opened a PR to fix the "fatal error: concurrent map read and map write" issue here: #23713

It's still possible, however, that you also observed a deadlock. If you have additional thread dumps from other runs, those would be helpful to confirm that there are indeed no other unknown problems.

@mantoine96
Contributor Author

Hey @gandro

Thank you so much for looking into this issue. We did indeed see the "fatal error: concurrent map read and map write", but we were debating whether it might actually have occurred because of the kill -s ABRT.

I've checked #23377 and we have indeed seen some occurrences of this log message followed by agent restarts (3 in Dev, 1 in Prod, in the last ~15 days). So we will definitely get this into our image.

We haven't gotten any other thread dumps yet, but we will add them here when we do.

Thanks again!

@mantoine96
Contributor Author

Could the error you pointed out in #23713 have resulted in a deadlock / agent becoming unresponsive / healthcheck failing?

@gandro
Member

gandro commented Feb 13, 2023

Could the error you pointed out in #23713 have resulted in a deadlock / agent becoming unresponsive / healthcheck failing?

Yes, at least in theory. The panic stack dump you provided above has its root in created by github.com/cilium/cilium/pkg/status.(*Collector).runProbe. That probe is used to determine the status of each Cilium subsystem (such as IPAM, in the case of the failing probe).

If the probe never returns a status (such as when it panics), probeReturned is never triggered, and thus the probe is marked as stale after 60 seconds:

c.updateProbeStatus(p, nil, true, staleErr)

If there are any stale probes, then Cilium as a whole will not be marked as "ready"/"live":

cilium/daemon/cmd/status.go

Lines 652 to 657 in 428c6d2

case len(sr.Stale) > 0:
	msg := "Stale status data"
	sr.Cilium = &models.Status{
		State: models.StatusStateWarning,
		Msg:   fmt.Sprintf("%s %s", ciliumVer, msg),
	}

The only issue with that theory is that we should see the following warning emitted 15 seconds after the panic (which I do not see in the logs):

Warnf("No response from probe within %v seconds",
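
As a rough illustration of that mechanism (a hypothetical sketch, not the actual pkg/status code): each probe runs in its own goroutine, and if it never reports back within the timeout, the collector marks it stale, which in turn degrades the aggregated agent status and makes the liveness check fail:

package statussketch

import (
	"errors"
	"fmt"
	"time"
)

// runProbeWithStaleCheck launches probe in its own goroutine and marks it
// stale if it does not return within staleAfter. Names and signatures here
// are invented for this sketch.
func runProbeWithStaleCheck(name string, probe func() (any, error), staleAfter time.Duration) {
	done := make(chan struct{})

	go func() {
		status, err := probe() // if this blocks forever (e.g. on a deadlocked mutex), done is never closed
		fmt.Printf("probe %s returned: %v %v\n", name, status, err)
		close(done)
	}()

	select {
	case <-done:
		// Probe answered in time; its status would be recorded normally.
	case <-time.After(staleAfter):
		// Probe never answered: record it as stale. Any stale probe causes
		// the aggregated Cilium status to be reported as a warning, so the
		// agent's liveness/readiness endpoints start failing.
		staleErr := errors.New("probe is stale")
		fmt.Printf("probe %s: %v\n", name, staleErr)
	}
}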

nbusseneau pushed a commit that referenced this issue Feb 14, 2023
@christarazi christarazi added the sig/agent Cilium agent related. label Feb 15, 2023
pchaigno pushed a commit to pchaigno/cilium that referenced this issue Feb 15, 2023
[ upstream commit e3a78b0 ]
pchaigno pushed a commit that referenced this issue Feb 16, 2023
[ upstream commit e3a78b0 ]
pchaigno pushed a commit to pchaigno/cilium that referenced this issue Feb 16, 2023
[ upstream commit e3a78b0 ]
pchaigno pushed a commit that referenced this issue Feb 20, 2023
[ upstream commit e3a78b0 ]
sayboras pushed a commit to sayboras/cilium that referenced this issue Feb 23, 2023
[ upstream commit e3a78b0 ]
sayboras pushed a commit that referenced this issue Feb 27, 2023
[ upstream commit e3a78b0 ]
@christarazi
Member

christarazi commented Apr 4, 2023

Are you still experiencing this in v1.12.8?

@gandro
Member

gandro commented Apr 12, 2023

We've merged a deadlock fix around IPCache and FQDN (#24672), which sounds like it could be the issue here. Please test again with v1.12.9, which should be released next week.

@mantoine96
Contributor Author

Hello @christarazi

We did have one more crash occurrence with the patches you mentioned (we had backported them to our image).

@gandro Great I'll test it out then!

To be fair, this is really rare; we last encountered the issue 20 days ago. Hopefully this issue is the right one! :)

Thanks again!

@github-actions

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 12, 2023
@github-actions

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 26, 2023
@lukaselmer

lukaselmer commented Jul 17, 2023

Hopefully this issue is the right one! :)

Hi @mantoine96,

It seems we are facing the same issue (however, we only observe the issue in the AWS Zurich region, and not in the AWS Frankfurt region).

Did you upgrade to the latest version of Cilium? Did it resolve the issue?

Thx!

@gandro
Member

gandro commented Jul 18, 2023

Note that we fixed another related issue, #26242, which is still in the process of being backported to v1.13.

@mantoine96
Contributor Author

mantoine96 commented Jul 18, 2023

Hello @gandro

Just confirming that in the last ~74 days, we have not had a single occurrence of the crash using our custom image based on v1.12.9.

#26242 also looks similar. We are in the process of upgrading to v1.13 so I will look out for it!

Thanks for your work on this!

@lukaselmer I would therefore recommend you try out 1.12.9 if you're experiencing similar issues. Like I mentioned, we haven't had an issue in > 74 days, which is pretty good going.

@lukaselmer

Thanks @mantoine96 @gandro 🙏 In that case we'll upgrade asap 🚀🚀🚀
