colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123277
Benchmark results from my laptop are here: https://gist.github.com/michae2/4406203dbafc5749ad6a02f8b0ec268e
Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @yuzefovich)
pkg/sql/colexecerror/error.go
line 74 at r4 (raw file):
whence
Nice :)
pkg/sql/colexecerror/error.go
line 130 at r4 (raw file):
return } retErr = err
Do you think it would be worth it to wrap the error here in an alreadyCaughtErr struct or something, so that we only have to inspect the stack once in a set of nested catchers?
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @yuzefovich)
pkg/sql/colexecerror/error.go
line 111 at r4 (raw file):
)) } if panicEmittedFrom == "" {
Is this check and the one above for !panicLineFound necessary? If they were omitted we'd call shouldCatchPanic(""), which would return false, and we'd re-throw panicObj, which should ultimately print the stack anyways. Just wondering what the value of emitting errors.AssertionFailedf is instead. Do we even have test coverage of this code?
}

// InternalError simply panics with the provided object. It will always be
// caught and returned as internal error to the client with the corresponding
// stack trace. This method should be called to propagate errors that resulted
// in the vectorized engine being in an *unexpected* state.
func InternalError(err error) {
	panic(err)
super nit: the comment should be updated to mention the error wrapping instead of "simply panicking."
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)
pkg/sql/colexecerror/error.go
line 113 at r4 (raw file):
if panicEmittedFrom == "" { stackTrace := string(debug.Stack()) panic(errors.AssertionFailedf(
i thought errors.AssertionFailedf already would include the stack trace: https://github.com/cockroachdb/errors/blob/c1cc1919cf999fb018fcd038852e969e3d5631cc/errutil/assertions.go#L33-L35
(though i see this was the behavior from before your PR. we could check if we have been seeing duplicated stack traces in any error reports.)
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)
-- commits
line 2 at r4:
could you include before/after results of this benchmark in the PR description?
helpful incantation:
N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/colexecerror BENCHES=BenchmarkCatchVectorizedRuntimeError ./scripts/bench 'old-sha' 'new-sha'
Nice work and speed up! I have some comments.
Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go
line 109 at r1 (raw file):
sqlRowPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/row" sqlSemPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/sem" testSqlColPackagesPrefix = "pkg/sql/col"
Why do we need this addition (it kinda duplicates sqlColPackagesPrefix)? Because we strip the prefix when running tests via bazel? Consider leaving a comment.
pkg/sql/colexecerror/error.go
line 78 at r3 (raw file):
// engine. We treat a panic from lower in the stack as unrecoverable. //Find where the panic came from and only proceed if it
nit: missing spaces after the slashes in the third commit.
pkg/sql/colexecerror/error.go
line 243 at r3 (raw file):
func init() { errors.RegisterWrapperDecoder(errors.GetTypeKey((*internalError)(nil)), decodeInternalError)
I think we need to register the decoder for both internalError and notInternalError.
pkg/sql/colexecerror/error.go
line 246 at r3 (raw file):
} // InternalError simply panics with the provided object. It will always be
nit: this comment needs a minor adjustment.
pkg/sql/colexecerror/error.go
line 111 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Is this check and the one above for !panicLineFound necessary? If they were omitted we'd call shouldCatchPanic(""), which would return false, and we'd re-throw panicObj, which should ultimately print the stack anyways. Just wondering what the value of emitting errors.AssertionFailedf is instead. Do we even have test coverage of this code?
These two checks were added in case the Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.
pkg/sql/colexecerror/error.go
line 113 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i thought errors.AssertionFailedf already would include the stack trace: https://github.com/cockroachdb/errors/blob/c1cc1919cf999fb018fcd038852e969e3d5631cc/errutil/assertions.go#L33-L35
(though i see this was the behavior from before your PR. we could check if we have been seeing duplicated stack traces in any error reports.)
Yes, good point. AssertionFailedf includes the stack trace (that's why below we only call errors.NewAssertionErrorWithWrappedErrf when we do want to create an error that would include the stack trace), so I think we can remove the two calls to debug.Stack() and rely on the assertion's behavior. (Given my comment above, I don't think it's actually possible to hit this code path right now, so there is no way to check for stack trace duplication.)
pkg/sql/colexecerror/error.go
line 130 at r4 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Do you think it would be worth it to wrap the error here in an alreadyCaughtErr struct or something, so that we only have to inspect the stack once in a set of nested catchers?
+1 - this seems like an easy extension of the current improvement. IIUC multiple nested catchers significantly exacerbated the problem we saw in the customer environment, and although we now have fast paths for the majority of errors, it'd be great to only inspect the stack once, regardless of the number of catchers in it.
pkg/sql/sem/builtins/builtins.go
line 5645 at r1 (raw file):
), "crdb_internal.force_vectorized_assertion_error": makeBuiltin(
nit: rather than introducing a new builtin, should we introduce an overload to the existing crdb_internal.force_panic builtin where an optional second boolean argument would indicate whether the panic should be catchable by the vectorized engine or not?
pkg/sql/colexecerror/main_test.go
line 42 at r1 (raw file):
var ( // testAllocator is an Allocator with an unlimited budget for use in tests. testAllocator *colmem.Allocator
nit: we don't need most of the initialization in this file. I think it can be as short as
func TestMain(m *testing.M) {
securityassets.SetLoader(securitytest.EmbeddedAssets)
randutil.SeedForTests()
serverutils.InitTestServerFactory(server.TestServerFactory)
os.Exit(m.Run())
}
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @rafiss, and @yuzefovich)
pkg/sql/colexecerror/error.go
line 111 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
These two checks were added in case the Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.
Ah, got it. I'd suggest a small refactor to this code. Pull the extraction of panicEmittedFrom into a function, and call that from a test and assert that it always finds the location. Right now you essentially have the testing in the regular code path which feels a bit strange.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go
line 111 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
These two checks were added in case the Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.
Thinking a bit more about this, I agree that these two checks don't add that much value, so we could remove them. If the call site for panics in the Go runtime ever changes, we'll easily catch the change via CI. So I'd be in favor of removing these two ifs and simply re-panicking whenever we don't find the panic line in the stack trace.
Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go
line 48 at r2 (raw file):
// without a stacktrace, sentry report, or "internal error" designation. var nie *notInternalError if errors.As(err, &se) || errors.As(err, &nie) {
nit: Why use errors.As here instead of errors.Is or errors.IsAny?
pkg/sql/colexecerror/error_test.go
Outdated
b.Run(tc.name, func(b *testing.B) {
	// Create as many warm connections as we will need for the benchmark.
	conns := make(chan *gosql.DB, numConns)
	for range numConns {
Note to self: this will make backport difficult.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, @michae2, and @petermattis)
pkg/sql/colexecerror/error.go
line 48 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: Why use errors.As here instead of errors.Is or errors.IsAny?
errors.Is[Any] requires the error (or any error in the cause chain) to exactly equal a reference error.
errors.As checks if the error (or any error in the cause chain) is assignable to the value pointed at by the target.
in this case, since there is no "singleton" notInternalError, we need to use As.
Previously, rafiss (Rafi Shamim) wrote…
Thanks for the explanation Rafi!
Thanks for all the comments! I will push an update tonight.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go
line 130 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
+1 - this seems like an easy extension of the current improvement. IIUC multiple nested catchers significantly exacerbated the problem we saw in the customer environment, and although we now have fast-paths for majority of errors, it'd be great to only inspect the stack once, regardless of the number of catches in it.
Nice idea! @yuzefovich I think I might reuse notInternalError for this, do you see any problems with that?
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go
line 130 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Nice idea! @yuzefovich I think I might reuse notInternalError for this, do you see any problems with that?
Reusing notInternalError would lead to a behavior change. Namely, we would no longer be able to tell the difference between an expected error within the vectorized engine (that should be propagated as an error, without a stack trace) and an unexpected error outside of the vectorized engine (which shouldn't be caught and should be propagated via panic further up). I'd introduce a new error type like Drew suggested.
TFTRs!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @mgartner, @petermattis, @rafiss, and @yuzefovich)
Previously, rafiss (Rafi Shamim) wrote…
could you include before/after results of this benchmark in the PR description?
helpful incantation:
N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/colexecerror BENCHES=BenchmarkCatchVectorizedRuntimeError ./scripts/bench 'old-sha' 'new-sha'
I'll kick off a run on a gceworker and share tomorrow.
pkg/sql/colexecerror/error.go
line 109 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Why do we need this addition (it kinda duplicates sqlColPackagesPrefix)? Because we strip the prefix when running tests via bazel? Consider leaving a comment.
Yes, exactly. Added a comment and I asked about it in slack.
pkg/sql/colexecerror/error.go
line 78 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: missing spaces after the slashes in the third commit.
Done.
pkg/sql/colexecerror/error.go
line 243 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think we need to register the decoder for both internalError and notInternalError.
Oh, good catch! Done.
pkg/sql/colexecerror/error.go
line 246 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this comment needs a minor adjustment.
Done.
pkg/sql/colexecerror/error.go
line 0 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
(Reviewable was unable to map this GitHub inline comment thread to the right spot — sorry!)
super nit: the comment should be updated so mention the error wrapping instead of "simply panicking."
Done.
pkg/sql/colexecerror/error.go
line 111 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Thinking a bit more about this, I agree that these two checks don't add that much value, so we could remove them. If the call site for panics in the Go runtime ever changes, we'll easily catch the change via CI. So I'd be in favor of removing these two ifs and simply re-panicking whenever we don't find the panic line in the stack trace.
I removed these two checks.
pkg/sql/colexecerror/error.go
line 113 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Yes, good point. AssertionFailedf includes the stack trace (that's why below we only call errors.NewAssertionErrorWithWrappedErrf when we do want to create an error that would include the stack trace), so I think we can remove the two calls to debug.Stack() and rely on the assertion's behavior. (Given my comment above, I don't think it's actually possible to hit this code path right now, so there is no way to check for stack trace duplication.)
Removed these two checks.
pkg/sql/colexecerror/error.go
line 130 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Reusing notInternalError would lead to a behavior change. Namely, we would no longer be able to tell the difference between an expected error within the vectorized engine (that should be propagated as an error, without a stack trace) and an unexpected error outside of the vectorized engine (which shouldn't be caught and should be propagated via panic further up). I'd introduce a new error type like Drew suggested.
I tried this out, but strangely it seemed to make things slower. My guess is that we're mostly re-wrapping with materializers and columnarizers, and it looks like we already wrap with notInternalError in columnarizer:
cockroach/pkg/sql/colexec/columnarizer.go
Line 243 in af3173a
colexecerror.ExpectedError(meta.Err)
pkg/sql/sem/builtins/builtins.go
line 5645 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: rather than introducing a new builtin, should we introduce an overload to the existing crdb_internal.force_panic builtin where an optional second boolean argument would indicate whether the panic should be catchable by the vectorized engine or not?
Good call. I added an overload with a couple more options.
pkg/sql/colexecerror/error_test.go
line 203 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Note to self: this will make backport difficult.
Done.
pkg/sql/colexecerror/main_test.go
line 42 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: we don't need most of the initialization in this file. I think it can be as short as
func TestMain(m *testing.M) { securityassets.SetLoader(securitytest.EmbeddedAssets) randutil.SeedForTests() serverutils.InitTestServerFactory(server.TestServerFactory) os.Exit(m.Run()) }
Thank you! Done.
Reviewed 6 of 6 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go
line 130 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I tried this out, but strangely it seemed to make things slower. My guess is that we're mostly re-wrapping with materializers and columnarizers, and it looks like we already wrap with notInternalError in columnarizer:
cockroach/pkg/sql/colexec/columnarizer.go
Line 243 in af3173a
colexecerror.ExpectedError(meta.Err)
Hm, we might be thinking about this differently. The idea is that in the fallback case, when we had to look at the stack via runtime.CallersFrames (because the panic should be caught by the vec engine but wasn't produced via one of the colexecerror.*Error calls), we will wrap the error with a special marker alreadyCaughtError so that the next catcher up the stack doesn't have to inspect the stack (i.e. we would add another special error type to the hot path at the top of the method). This shouldn't have any influence on the columnarizer-materializer pair since they already use colexecerror methods that wrap errors with different markers. Does this match your thinking?
That said, this would be an improvement to an edge case, so I'd be ok with leaving a TODO for it.
Third time's the charm. bors r=DrewKimball,petermattis,mgartner,yuzefovich
blathers backport release-24.1.0-rc
Earlier this year we made the vectorized panic-catcher much more efficient (in cockroachdb#123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is a slight difference in behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught and converted into an internal CRDB error, and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be a more deeply nested panic there, we'll still crash the CRDB server. Release note: None
133620: colexecerror: improve the catcher due to a recent regression r=yuzefovich a=yuzefovich Earlier this year we made the vectorized panic-catcher much more efficient (in #123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is slight difference in the behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught to be converted into an internal CRDB error and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be more deeply-nested panic there, we'll still crash the CRDB server. Fixes: #133617. Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
See individual commits for details.
Benchmarks before and after the change:
Fixes: #123235
Release note (performance improvement): Make error handling in the vectorized execution engine much cheaper. This should help avoid bad metastable regimes perpetuated by statement timeout handling consuming all CPU time, leading to more statement timeouts.
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>