cc @pierrehilbert we likely need to spend some time fixing this relatively soon. I think many of the recent "Fleet Server is down" urgent support cases are related to this.
@joshdover I don't think we need to add the complexity of channels; a simple mutex to control access to these critical sections should be enough. The race detector is enabled in our test suite, but that doesn't mean it will catch every race, especially if the code is not tested in a way that triggers the race condition.
There are cases where the unenrollment scheduler can cause a concurrent write or concurrent read and write error, causing Fleet Server to panic and crash with logs like:
See full logs here fleet-server stderr.txt
This may explain some of the recent issues we've seen in Cloud with Fleet Server mysteriously crashing. We just now finally have logs to explain this.
There are several reads and writes to the `policies` and `policiesCanceller` maps in fleet-server/internal/pkg/coordinator/monitor.go (lines 77 to 78 in 7a1372a).
These reads and writes happen in several interacting goroutines inside this file. We likely need to coordinate across these goroutines using channels rather than shared memory to avoid this problem. The race detector may be helpful here.
Though #1605 made some changes in this area, we were already seeing Fleet Server crashes before that change. It's unclear whether that change caused the problem.