Unenrollment timeout monitor can crash with concurrent access panic #1738

Closed
joshdover opened this issue Aug 12, 2022 · 3 comments · Fixed by #1739
Labels: bug

@joshdover
Member

joshdover commented Aug 12, 2022

  • Version: 8.3.2, likely earlier releases too
  • Operating System: Linux
  • Steps to Reproduce: Not sure yet

There are cases where the unenrollment scheduler can cause a concurrent map write or a concurrent map read/write error, causing Fleet Server to panic and crash with logs like:

fatal error: concurrent map writes

goroutine 222 [running]:
runtime.throw({0x5645d6b87588, 0x10})
    /var/lib/jenkins/workspace/ger_fleet-server-package-mbp_8.3/.gvm/versions/go1.17.9.linux.amd64/src/runtime/panic.go:1198 +0x71 fp=0xc000d83660 sp=0xc000d83630 pc=0x5645d5ff8111
runtime.mapassign_faststr(0x5645d6fb7b58, 0xc0000cc380, {0xc000533b30, 0x24})
    /var/lib/jenkins/workspace/ger_fleet-server-package-mbp_8.3/.gvm/versions/go1.17.9.linux.amd64/src/runtime/map_faststr.go:211 +0x39c fp=0xc000d836c8 sp=0xc000d83660 pc=0x5645d5fd569c
github.com/elastic/fleet-server/v7/internal/pkg/coordinator.(*monitorT).rescheduleUnenroller(0xc00000ed20, {0x5645d6fb7b58, 0xc0000cc380}, 0xc000d83fa0, 0xc000d83f28)
    /var/lib/jenkins/workspace/ger_fleet-server-package-mbp_8.3/src/github.com/elastic/fleet-server/internal/pkg/coordinator/monitor.go:413 +0x4c9 fp=0xc000d83b08 sp=0xc000d836c8 pc=0x5645d6b32e69
github.com/elastic/fleet-server/v7/internal/pkg/coordinator.(*monitorT).ensureLeadership.func1({{{0xc00046f3e0, 0x14}, 0x0, 0x0}, 0x1, {0xc000a05500, 0x28cf, 0x2a80}, 0x0, {0xc000533b30, ...}, ...}, ...)
    /var/lib/jenkins/workspace/ger_fleet-server-package-mbp_8.3/src/github.com/elastic/fleet-server/internal/pkg/coordinator/monitor.go
...
(remaining goroutine dumps, covering elastic-agent-client actionRoundTrip, gRPC streams, TLS, and network I/O, are truncated here; see the attached full log)

See the full logs in the attached file: fleet-server stderr.txt

This may explain some of the recent issues we've seen in Cloud where Fleet Server mysteriously crashes. We finally have logs that explain it.

There are several reads and writes to the policies and policiesCanceller maps in internal/pkg/coordinator/monitor.go:

policies map[string]policyT
policiesCanceller map[string]context.CancelFunc

These accesses happen in several interacting goroutines within this file. We likely need to coordinate across these goroutines using channels rather than shared memory to avoid this problem. The race detector may be helpful here.
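For illustration, here is a minimal, self-contained Go sketch of the racy pattern described above. It is not the actual fleet-server code; monitorT is simplified and the goroutine bodies are invented, but two goroutines writing the same maps without synchronization is exactly the kind of access that produces a "fatal error: concurrent map writes" panic and that the race detector will flag.

package main

import (
	"context"
	"sync"
)

// Simplified stand-in for the coordinator monitor: two maps shared across
// goroutines with no synchronization at all.
type monitorT struct {
	policies          map[string]int
	policiesCanceller map[string]context.CancelFunc
}

func main() {
	m := &monitorT{
		policies:          map[string]int{},
		policiesCanceller: map[string]context.CancelFunc{},
	}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				_, cancel := context.WithCancel(context.Background())
				// Unsynchronized writes from two goroutines: the runtime may
				// abort the whole process with "fatal error: concurrent map writes".
				m.policies["policy-1"] = j
				m.policiesCanceller["policy-1"] = cancel
				cancel()
			}
		}()
	}
	wg.Wait()
}

Running a sketch like this with go run -race reports the data race immediately, which is why enabling the race detector in tests that actually exercise these code paths should surface the bug.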

Though #1605 made some changes in this area, we had been seeing Fleet Server crashes before that change landed, so it's unclear whether that change caused the problem.

joshdover added the bug label on Aug 12, 2022
@joshdover
Member Author

cc @pierrehilbert, we likely need to spend some time fixing this relatively soon. I think many of the recent "Fleet Server is down" urgent support cases are related to this.

@ph
Contributor

ph commented Aug 12, 2022

@joshdover I don't think we need to add the complexity of channels; a simple mutex to control access to these critical sections should be enough. The race detector is enabled in our test suite, but that doesn't mean it will catch every race, especially if the code isn't exercised in a way that triggers the race condition.
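For reference, a minimal sketch of the mutex approach suggested above, assuming a single lock guards both maps. The field and method names here are hypothetical and are not the actual change that landed in #1739.

package coordinator

import (
	"context"
	"sync"
)

// policyT stands in for the real policy type; its fields don't matter for the sketch.
type policyT struct{}

// monitorT sketch: one mutex serializes every access to both maps.
type monitorT struct {
	mu                sync.Mutex
	policies          map[string]policyT
	policiesCanceller map[string]context.CancelFunc
}

// setPolicy is a hypothetical helper: all writes to the maps happen while
// holding the mutex, so concurrent schedulers serialize instead of racing.
func (m *monitorT) setPolicy(id string, p policyT, cancel context.CancelFunc) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.policies[id] = p
	m.policiesCanceller[id] = cancel
}

// cancelPolicy is a hypothetical helper: the lookup, the cancel call, and the
// deletes are likewise done under the same lock.
func (m *monitorT) cancelPolicy(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cancel, ok := m.policiesCanceller[id]; ok {
		cancel()
		delete(m.policiesCanceller, id)
	}
	delete(m.policies, id)
}

Whether cancel() should be invoked while holding the lock is a judgment call; holding a mutex across callbacks can invite deadlocks, so the real fix may structure this differently.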

ph self-assigned this Aug 12, 2022
@ph
Contributor

ph commented Aug 12, 2022

OK, I have a fix that seems to work; I will try to trigger the panic on my side.
