Unenrollment timeout monitor can crash with concurrent access panic #1738

Closed
joshdover opened this issue Aug 12, 2022 · 3 comments · Fixed by #1739
Labels: bug

@joshdover
Member

joshdover commented Aug 12, 2022

  • Version: 8.3.2, likely earlier releases too
  • Operating System: Linux
  • Steps to Reproduce: Not sure yet

There are cases where the unenrollment scheduler can cause a concurrent map write or a concurrent map read/write error, causing Fleet Server to panic and crash with logs like:

fatal error: concurrent map writes

goroutine 222 [running]:
runtime.throw({0x5645d6b87588, 0x10})
    /var/lib/jenkins/workspace/ger_fleet-server-package-mbp_8.3/.gvm/versions/go1.17.9.linux.amd64/src/runtime/panic.go:1198 +0x71 fp=0xc000d83660 sp=0xc000d83630 pc=0x5645d5ff8111
runtime.mapassign_faststr(0x5645d6fb7b58, 0xc0000cc380, {0xc000533b30, 0x24})
    /var/lib/jenkins/workspace/ger_fleet-server-package-mbp_8.3/.gvm/versions/go1.17.9.linux.amd64/src/runtime/map_faststr.go:211 +0x39c fp=0xc000d836c8 sp=0xc000d83660 pc=0x5645d5fd569c
github.com/elastic/fleet-server/v7/internal/pkg/coordinator.(*monitorT).rescheduleUnenroller(0xc00000ed20, {0x5645d6fb7b58, 0xc0000cc380}, 0xc000d83fa0, 0xc000d83f28)
    /var/lib/jenkins/workspace/ger_fleet-server-package-mbp_8.3/src/github.com/elastic/fleet-server/internal/pkg/coordinator/monitor.go:413 +0x4c9 fp=0xc000d83b08 sp=0xc000d836c8 pc=0x5645d6b32e69
github.com/elastic/fleet-server/v7/internal/pkg/coordinator.(*monitorT).ensureLeadership.func1({{{0xc00046f3e0, 0x14}, 0x0, 0x0}, 0x1, {0xc000a05500, 0x28cf, 0x2a80}, 0x0, {0xc000533b30, ...}, ...}, ...)
    /var/lib/jenkins/workspace/ger_fleet-server-package-mbp_8.3/src/github.com/elastic/fleet-server/internal/pkg/coordinator/monitor.go
...
(remaining goroutine dumps, covering elastic-agent-client actionRoundTrip, gRPC streams, TLS, and network I/O, are truncated here; see the attached full log)

See the full logs in the attached file: fleet-server stderr.txt

This may explain some of the recent issues we've seen in Cloud where Fleet Server mysteriously crashes. We finally have logs that explain it.

There are several reads and writes to the policies and policiesCanceller maps in internal/pkg/coordinator/monitor.go:

policies map[string]policyT
policiesCanceller map[string]context.CancelFunc

These accesses happen in several interacting goroutines within this file. We likely need to coordinate across these goroutines using channels rather than shared memory to avoid this problem. The race detector may be helpful here.
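For illustration, here is a minimal, self-contained Go sketch of the racy pattern described above. It is not the actual fleet-server code; monitorT is simplified and the goroutine bodies are invented, but two goroutines writing the same maps without synchronization is exactly the kind of access that produces a "fatal error: concurrent map writes" panic and that the race detector will flag.

package main

import (
	"context"
	"sync"
)

// Simplified stand-in for the coordinator monitor: two maps shared across
// goroutines with no synchronization at all.
type monitorT struct {
	policies          map[string]int
	policiesCanceller map[string]context.CancelFunc
}

func main() {
	m := &monitorT{
		policies:          map[string]int{},
		policiesCanceller: map[string]context.CancelFunc{},
	}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				_, cancel := context.WithCancel(context.Background())
				// Unsynchronized writes from two goroutines: the runtime may
				// abort the whole process with "fatal error: concurrent map writes".
				m.policies["policy-1"] = j
				m.policiesCanceller["policy-1"] = cancel
				cancel()
			}
		}()
	}
	wg.Wait()
}

Running a sketch like this with go run -race reports the data race immediately, which is why enabling the race detector in tests that actually exercise these code paths should surface the bug.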

Though #1605 made some changes in this area, we had been seeing Fleet Server crashes before that change landed, so it's unclear whether that change caused the problem.

joshdover added the bug label on Aug 12, 2022
@joshdover
Member Author

cc @pierrehilbert, we likely need to spend some time fixing this relatively soon. I think many of the recent "Fleet Server is down" urgent support cases are related to this.

@ph
Contributor

ph commented Aug 12, 2022

@joshdover I don't think we need to add the complexity of channels; a simple mutex to control access to these critical sections should be enough. The race detector is enabled in our test suite, but that doesn't mean it will catch every race, especially if the code isn't exercised in a way that triggers the race condition.
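For reference, a minimal sketch of the mutex approach suggested above, assuming a single lock guards both maps. The field and method names here are hypothetical and are not the actual change that landed in #1739.

package coordinator

import (
	"context"
	"sync"
)

// policyT stands in for the real policy type; its fields don't matter for the sketch.
type policyT struct{}

// monitorT sketch: one mutex serializes every access to both maps.
type monitorT struct {
	mu                sync.Mutex
	policies          map[string]policyT
	policiesCanceller map[string]context.CancelFunc
}

// setPolicy is a hypothetical helper: all writes to the maps happen while
// holding the mutex, so concurrent schedulers serialize instead of racing.
func (m *monitorT) setPolicy(id string, p policyT, cancel context.CancelFunc) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.policies[id] = p
	m.policiesCanceller[id] = cancel
}

// cancelPolicy is a hypothetical helper: the lookup, the cancel call, and the
// deletes are likewise done under the same lock.
func (m *monitorT) cancelPolicy(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cancel, ok := m.policiesCanceller[id]; ok {
		cancel()
		delete(m.policiesCanceller, id)
	}
	delete(m.policies, id)
}

Whether cancel() should be invoked while holding the lock is a judgment call; holding a mutex across callbacks can invite deadlocks, so the real fix may structure this differently.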

ph self-assigned this Aug 12, 2022
@ph
Contributor

ph commented Aug 12, 2022

OK, I have a fix that seems to work; I will try to trigger the panic on my side.
