fix(daemon): turn off net/http behaviour of recovering from panics #350

Open
wants to merge 1 commit into base: master

Conversation

@benhoyt (Contributor) commented Feb 5, 2024

TODO ben: 1) add tests for this, 2) use something more like debug.Stack to avoid always allocating 1MB when a panic occurs.

By default, Go's net/http recovers panics in the ServeHTTP goroutine and keeps serving. However, if a lock is held when the panic happens, this will mean a complete deadlock and Pebble will go unresponsive. This is almost certainly what's happening in #314.

There's no direct way of opting out of this net/http behaviour, so you have to recover() from panics yourself, print the goroutine dump, and exit.

I've tested this locally, and it prints the goroutine dump and exits with code 1 on a panic as expected, instead of locking up. It handles any panic in the handler or middleware (e.g. in logit).
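For reference, here is a minimal standalone sketch of the recover-dump-exit approach described above. The exitOnPanic middleware name, the /boom handler, and the server wiring are illustrative assumptions, not the PR's actual code:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"runtime"
)

// exitOnPanic is a hypothetical middleware (not the PR's actual code):
// recover the panic ourselves, dump all goroutine stacks, and exit with
// a non-zero status instead of letting net/http swallow the panic.
func exitOnPanic(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				buf := make([]byte, 1024*1024)
				n := runtime.Stack(buf, true) // true = include every goroutine
				fmt.Fprintf(os.Stderr, "panic: %v\n\n%s", err, buf[:n])
				os.Exit(1)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
		panic("simulated handler panic")
	})
	log.Fatal(http.ListenAndServe(":8080", exitOnPanic(mux)))
}
```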

The diff excerpt under discussion:

	defer func() {
		err := recover()
		if err != nil {
			buf := make([]byte, 1024*1024)
benhoyt (Contributor Author) commented:
1MB should be plenty. We could pass runtime.Stack ever-increasing buffer sizes until it fits, as the implementation of debug.Stack does, but that seems like overkill. For reference, the large MAAS goroutine stack dump in this comment is about 100KB.
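If we ever did want to avoid the fixed 1MB allocation, the grow-until-it-fits loop mentioned above might look roughly like this. The allGoroutineStacks helper and its package placement are hypothetical, not code from this PR:

```go
package daemon // placement is hypothetical

import "runtime"

// allGoroutineStacks mirrors what debug.Stack does for a single goroutine,
// but passes all=true: keep doubling the buffer until the full dump fits.
func allGoroutineStacks() []byte {
	buf := make([]byte, 64*1024) // start smaller than 1MB; grow on demand
	for {
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}
```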

hpidcock (Member) commented:
Why not just use debug.Stack?

benhoyt (Contributor Author) commented:
Because debug.Stack only includes the current goroutine, whereas runtime.Stack takes an all parameter which we're setting to true to get the stack trace from all goroutines.

By default, when Go exits due to a panic it just prints the current goroutine's stack trace (à la debug.Stack), but that's not particularly helpful for debugging deadlock issues like this.
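To illustrate the distinction, here is a small standalone sketch (not part of this PR) comparing the two calls:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// debug.Stack: formatted stack of the current goroutine only.
	current := debug.Stack()

	// runtime.Stack with all=true: stacks of every goroutine, which is what
	// you want when working out which goroutine holds a contended lock.
	buf := make([]byte, 1024*1024)
	n := runtime.Stack(buf, true)

	fmt.Printf("current goroutine dump: %d bytes\nall goroutines dump: %d bytes\n",
		len(current), n)
}
```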

@benhoyt requested a review from hpidcock February 5, 2024 03:58
@hpidcock (Member) left a comment

I think overall this is an improvement, as we will now see panics as they happen, but it may lead to some unexpected panics in production uses of pebble until we identify the cause of the deadlock that spawned this line of inquiry.

@troyanov mentioned this pull request Feb 5, 2024
@benhoyt (Contributor Author) commented Feb 7, 2024

I think overall this is an improvement, as we will now see panics as they happen, but it may lead to some unexpected panics in production uses of pebble until we identify the cause of the deadlock that spawned this line of inquiry.

@hpidcock The thing is, those "unexpected panics in production uses" have a decent chance of deadlocking the entire Pebble process (if a lock is held when the panic happens), which is much worse than exiting (in which case the calling process will likely restart it). In any case, we'll chat further this afternoon about the best way forward.

@benhoyt (Contributor Author) commented Feb 12, 2024

Given we now know panic-recovery isn't the cause of the issue at hand, I'm going to come back to this after reproducing and fixing #314, but I'll leave this PR open -- I think it's probably a good idea in any case.

@benhoyt added the "24.10 Planned for the 24.10 cycle" label Mar 13, 2024