Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123277

Merged
merged 4 commits into from
May 2, 2024

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Apr 30, 2024

See individual commits for details.

Benchmarks before and after the change:

                                                                    │ /tmp/tmp.PrzUglo6GH/bench.catch~3 │    /tmp/tmp.PrzUglo6GH/bench.catch     │
                                                                    │              sec/op               │    sec/op      vs base                 │
CatchVectorizedRuntimeError/noError-12                                                    0.4033n ± ∞ ¹   0.4224n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/expected-12                                                 40831.00n ± ∞ ¹    98.55n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/storage-12                                                  41302.00n ± ∞ ¹    99.54n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/contextCanceled-12                                          40947.00n ± ∞ ¹    99.78n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/internalWithCode-12                                          41088.0n ± ∞ ¹    474.6n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/internal-12                                                  41032.0n ± ∞ ¹    943.0n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/runtime-12                                                    37.094µ ± ∞ ¹    1.091µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/noError-12                                         43.83µ ± ∞ ¹    43.63µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/expectedWithCode-12                              1234.24µ ± ∞ ¹    48.27µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/expectedAssertion-12                             1182.11µ ± ∞ ¹    88.12µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/internalAssertion-12                             1211.83µ ± ∞ ¹    77.58µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/internalIndexOutOfRange-12                       1227.76µ ± ∞ ¹    82.40µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/internalDivideByZero-12                          1224.45µ ± ∞ ¹    81.01µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/contextCanceled-12                               1265.30µ ± ∞ ¹    44.83µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/noError-12                                        43.78µ ± ∞ ¹    42.83µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/expectedWithCode-12                             1210.98µ ± ∞ ¹    48.41µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/expectedAssertion-12                            1187.21µ ± ∞ ¹    84.70µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/internalAssertion-12                            1224.62µ ± ∞ ¹    73.85µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/internalIndexOutOfRange-12                      1224.56µ ± ∞ ¹    81.39µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/internalDivideByZero-12                         1259.05µ ± ∞ ¹    80.63µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/contextCanceled-12                              1262.44µ ± ∞ ¹    47.13µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/noError-12                                        44.98µ ± ∞ ¹    43.53µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/expectedWithCode-12                             1231.37µ ± ∞ ¹    48.65µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/expectedAssertion-12                            1201.74µ ± ∞ ¹    97.23µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/internalAssertion-12                            1249.63µ ± ∞ ¹    80.46µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/internalIndexOutOfRange-12                      1257.86µ ± ∞ ¹    77.83µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/internalDivideByZero-12                         1251.79µ ± ∞ ¹    82.22µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/contextCanceled-12                              1271.82µ ± ∞ ¹    51.19µ ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                                                                    243.2µ          13.09µ        -94.62%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                                                    │ /tmp/tmp.PrzUglo6GH/bench.catch~3 │    /tmp/tmp.PrzUglo6GH/bench.catch     │
                                                                    │               B/op                │     B/op       vs base                 │
CatchVectorizedRuntimeError/noError-12                                                      0.000 ± ∞ ¹     0.000 ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/expected-12                                                   9834.00 ± ∞ ¹     32.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/storage-12                                                    8880.00 ± ∞ ¹     32.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/contextCanceled-12                                            8840.00 ± ∞ ¹     40.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/internalWithCode-12                                           9.586Ki ± ∞ ¹   1.511Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/internal-12                                                  10.479Ki ± ∞ ¹   2.961Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/runtime-12                                                   10.964Ki ± ∞ ¹   2.656Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/noError-12                                        81.75Ki ± ∞ ¹   81.71Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/expectedWithCode-12                              143.04Ki ± ∞ ¹   85.28Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/expectedAssertion-12                              362.7Ki ± ∞ ¹   304.5Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalAssertion-12                              304.2Ki ± ∞ ¹   245.3Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalIndexOutOfRange-12                        313.0Ki ± ∞ ¹   267.2Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalDivideByZero-12                           308.3Ki ± ∞ ¹   260.7Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/contextCanceled-12                               128.96Ki ± ∞ ¹   80.60Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/noError-12                                       81.21Ki ± ∞ ¹   81.25Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/expectedWithCode-12                             136.02Ki ± ∞ ¹   84.99Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/expectedAssertion-12                             362.0Ki ± ∞ ¹   302.9Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalAssertion-12                             291.4Ki ± ∞ ¹   244.5Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalIndexOutOfRange-12                       317.1Ki ± ∞ ¹   267.0Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalDivideByZero-12                          308.4Ki ± ∞ ¹   260.4Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/contextCanceled-12                              145.34Ki ± ∞ ¹   81.73Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/noError-12                                       81.33Ki ± ∞ ¹   81.24Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/expectedWithCode-12                             137.49Ki ± ∞ ¹   85.05Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/expectedAssertion-12                             364.7Ki ± ∞ ¹   305.8Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalAssertion-12                             318.4Ki ± ∞ ¹   244.5Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalIndexOutOfRange-12                       315.8Ki ± ∞ ¹   265.4Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalDivideByZero-12                          310.9Ki ± ∞ ¹   260.2Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/contextCanceled-12                              133.20Ki ± ∞ ¹   80.53Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
geomean                                                                                               ⁴                  -61.19%               ⁴
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ need >= 4 samples to detect a difference at alpha level 0.05
⁴ summaries must be >0 to compute geomean

                                                                    │ /tmp/tmp.PrzUglo6GH/bench.catch~3 │    /tmp/tmp.PrzUglo6GH/bench.catch    │
                                                                    │             allocs/op             │  allocs/op    vs base                 │
CatchVectorizedRuntimeError/noError-12                                                      0.000 ± ∞ ¹    0.000 ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/expected-12                                                    36.000 ± ∞ ¹    3.000 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/storage-12                                                     11.000 ± ∞ ¹    3.000 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/contextCanceled-12                                             11.000 ± ∞ ¹    4.000 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/internalWithCode-12                                             35.00 ± ∞ ¹    37.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/internal-12                                                     46.00 ± ∞ ¹    59.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/runtime-12                                                      58.00 ± ∞ ¹    51.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/noError-12                                          838.0 ± ∞ ¹    838.0 ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/expectedWithCode-12                                1110.0 ± ∞ ¹    950.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/expectedAssertion-12                               1.736k ± ∞ ¹   1.581k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalAssertion-12                               1.285k ± ∞ ¹   1.228k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalIndexOutOfRange-12                         1.418k ± ∞ ¹   1.425k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalDivideByZero-12                            1.413k ± ∞ ¹   1.411k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/contextCanceled-12                                  880.0 ± ∞ ¹    865.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/noError-12                                         836.0 ± ∞ ¹    836.0 ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/expectedWithCode-12                               1012.0 ± ∞ ¹    949.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/expectedAssertion-12                              1.744k ± ∞ ¹   1.572k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalAssertion-12                              1.201k ± ∞ ¹   1.228k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalIndexOutOfRange-12                        1.518k ± ∞ ¹   1.424k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalDivideByZero-12                           1.429k ± ∞ ¹   1.411k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/contextCanceled-12                                 993.0 ± ∞ ¹    865.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/noError-12                                         838.0 ± ∞ ¹    837.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/expectedWithCode-12                               1038.0 ± ∞ ¹    951.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/expectedAssertion-12                              1.782k ± ∞ ¹   1.594k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalAssertion-12                              1.471k ± ∞ ¹   1.231k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalIndexOutOfRange-12                        1.476k ± ∞ ¹   1.417k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalDivideByZero-12                           1.471k ± ∞ ¹   1.411k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/contextCanceled-12                                 960.0 ± ∞ ¹    867.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
geomean                                                                                               ⁴                 -18.86%               ⁴
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ need >= 4 samples to detect a difference at alpha level 0.05
⁴ summaries must be >0 to compute geomean

Fixes: #123235

Release note (performance improvement): Make error handling in the vectorized execution engine much cheaper. This should help avoid bad metastable regimes perpetuated by statement timeout handling consuming all CPU time, leading to more statement timeouts.

Co-authored-by: Drew Kimball drewk@cockroachlabs.com

@michae2 michae2 requested review from a team as code owners April 30, 2024 06:59
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@michae2
Copy link
Collaborator Author

michae2 commented Apr 30, 2024

Benchmark results from my laptop are here: https://gist.github.com/michae2/4406203dbafc5749ad6a02f8b0ec268e

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! :lgtm:

Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @yuzefovich)


pkg/sql/colexecerror/error.go line 74 at r4 (raw file):

whence

Nice :)


pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

			return
		}
		retErr = err

Do you think it would be worth it to wrap the error here in an alreadyCaughtErr struct or something, so that we only have to inspect the stack once in a set of nested catchers?

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @yuzefovich)


pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

			))
		}
		if panicEmittedFrom == "" {

Is this check and the one above for !panicLineFound necessary? If they were omitted we'd call shouldCatchPanic("") which would return false and we'd re-throw panicObj which should ultimately print the stack anyways. Just wondering what the value of emitting errors.AssertFailedf is instead. Do we even have test coverage of this code?

}

// InternalError simply panics with the provided object. It will always be
// caught and returned as internal error to the client with the corresponding
// stack trace. This method should be called to propagate errors that resulted
// in the vectorized engine being in an *unexpected* state.
func InternalError(err error) {
panic(err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: the comment should be updated so mention the error wrapping instead of "simply panicking."

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/colexecerror/error.go line 113 at r4 (raw file):

		if panicEmittedFrom == "" {
			stackTrace := string(debug.Stack())
			panic(errors.AssertionFailedf(

i thought errors.AssertionFailedf already would include the stack trace: https://github.com/cockroachdb/errors/blob/c1cc1919cf999fb018fcd038852e969e3d5631cc/errutil/assertions.go#L33-L35

(though i see this was the behavior from before your PR. we could check if we have been seeing duplicated stack traces in any error reports.)

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


-- commits line 2 at r4:
could you include before/after results of this benchmark in the PR description?

helpful incantation:

N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/colexecerror BENCHES=BenchmarkCatchVectorizedRuntimeError ./scripts/bench 'old-sha' 'new-sha'

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work and speed up! I have some comments.

Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)


pkg/sql/colexecerror/error.go line 109 at r1 (raw file):

	sqlRowPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/row"
	sqlSemPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/sem"
	testSqlColPackagesPrefix = "pkg/sql/col"

Why do we need this addition (it kinda duplicates sqlColPackagesPrefix)? Because we strip the prefix when running tests via bazel? Consider leaving a comment.


pkg/sql/colexecerror/error.go line 78 at r3 (raw file):

		// engine. We treat a panic from lower in the stack as unrecoverable.

		//Find where the panic came from and only proceed if it

nit: missing spaces after the slashes in the third commit.


pkg/sql/colexecerror/error.go line 243 at r3 (raw file):

func init() {
	errors.RegisterWrapperDecoder(errors.GetTypeKey((*internalError)(nil)), decodeInternalError)

I think we need to register the decoder for both internalError and notInternalError.


pkg/sql/colexecerror/error.go line 246 at r3 (raw file):

}

// InternalError simply panics with the provided object. It will always be

nit: this comment needs a minor adjustment.


pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is this check and the one above for !panicLineFound necessary? If they were omitted we'd call shouldCatchPanic("") which would return false and we'd re-throw panicObj which should ultimately print the stack anyways. Just wondering what the value of emitting errors.AssertFailedf is instead. Do we even have test coverage of this code?

These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.


pkg/sql/colexecerror/error.go line 113 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i thought errors.AssertionFailedf already would include the stack trace: https://github.com/cockroachdb/errors/blob/c1cc1919cf999fb018fcd038852e969e3d5631cc/errutil/assertions.go#L33-L35

(though i see this was the behavior from before your PR. we could check if we have been seeing duplicated stack traces in any error reports.)

Yes, good point.AssertionFailedf includes the stack trace (that's why below we only call errors.NewAssertionErrorWithWrappedErrf when we do want to create an error that would include the stack trace), so I think we can remove two calls to debug.Stack() and rely on the assertion's behavior. (Given my comment above, I don't think it's actually possible to hit this code path right now, so there is no way to check for stack trace duplication.)


pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Do you think it would be worth it to wrap the error here in an alreadyCaughtErr struct or something, so that we only have to inspect the stack once in a set of nested catchers?

+1 - this seems like an easy extension of the current improvement. IIUC multiple nested catchers significantly exacerbated the problem we saw in the customer environment, and although we now have fast-paths for majority of errors, it'd be great to only inspect the stack once, regardless of the number of catches in it.


pkg/sql/sem/builtins/builtins.go line 5645 at r1 (raw file):

	),

	"crdb_internal.force_vectorized_assertion_error": makeBuiltin(

nit: rather than introducing a new builtin, should we introduce an overload to existing crdb_internal.force_panic builtin where an optional second boolean argument would indicate whether the panic should be catchable by vectorized engine or not?


pkg/sql/colexecerror/main_test.go line 42 at r1 (raw file):

var (
	// testAllocator is an Allocator with an unlimited budget for use in tests.
	testAllocator     *colmem.Allocator

nit: we don't need most of the initialization in this file. I think it can be as short as

func TestMain(m *testing.M) {
	securityassets.SetLoader(securitytest.EmbeddedAssets)
	randutil.SeedForTests()
	serverutils.InitTestServerFactory(server.TestServerFactory)

	os.Exit(m.Run())
}

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @rafiss, and @yuzefovich)


pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.

Ah, got it. I'd suggest a small refactor to this code. Pull the extraction of panicEmittedFrom into a function, and call that from a test and assert that it always finds the location. Right now you essentially have the testing in the regular code path which feels a bit strange.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)


pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.

Thinking a bit more about this, I agree that these two checks don't add that much value, so we could remove them. If callsite for panics in Go runtime ever changes, we'll easily catch the change via CI. So I'd be in favor of removing these two ifs and simply re-panicking whenever we don't find the panic line in the stack trace.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)


pkg/sql/colexecerror/error.go line 48 at r2 (raw file):

			// without a stacktrace, sentry report, or "internal error" designation.
			var nie *notInternalError
			if errors.As(err, &se) || errors.As(err, &nie) {

nit: Why use errors.As here instead of errors.Is or errors.IsAny?

b.Run(tc.name, func(b *testing.B) {
// Create as many warm connections as we will need for the benchmark.
conns := make(chan *gosql.DB, numConns)
for range numConns {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: this will make backport difficult.

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, @michae2, and @petermattis)


pkg/sql/colexecerror/error.go line 48 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: Why use errors.As here instead of errors.Is or errors.IsAny?

errors.Is[Any] requires the error (or any error in the cause chain) to exactly equal a reference error.

errors.As checks if the error (or any error in the cause chain) is assignable to the value pointed at by the target.

in this case, since there is no "singleton" notInternalError, we need to use As.

@mgartner
Copy link
Collaborator

pkg/sql/colexecerror/error.go line 48 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

errors.Is[Any] requires the error (or any error in the cause chain) to exactly equal a reference error.

errors.As checks if the error (or any error in the cause chain) is assignable to the value pointed at by the target.

in this case, since there is no "singleton" notInternalError, we need to use As.

Thanks for the explanation Rafi!

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the comments! I will push an update tonight.

Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @petermattis, and @rafiss)


pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

+1 - this seems like an easy extension of the current improvement. IIUC multiple nested catchers significantly exacerbated the problem we saw in the customer environment, and although we now have fast-paths for majority of errors, it'd be great to only inspect the stack once, regardless of the number of catches in it.

Nice idea! @yuzefovich I think I might reuse notInternalError for this, do you see any problems with that?

@yuzefovich yuzefovich requested a review from rafiss May 1, 2024 00:28
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)


pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Nice idea! @yuzefovich I think I might reuse notInternalError for this, do you see any problems with that?

Reusing notInternalError would lead to a behavior change. Namely, we now won't be able to tell the difference between an expected error within vectorized engine (that should propagated as an error, without stack trace) and an unexpected error outside of the vectorized engine (which shouldn't be caught and should be propagated via panic further up). I'd introduce a new error type like Drew suggested.

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @mgartner, @petermattis, @rafiss, and @yuzefovich)


-- commits line 2 at r4:

Previously, rafiss (Rafi Shamim) wrote…

could you include before/after results of this benchmark in the PR description?

helpful incantation:

N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/colexecerror BENCHES=BenchmarkCatchVectorizedRuntimeError ./scripts/bench 'old-sha' 'new-sha'

I'll kick off a run on a gceworker and share tomorrow.


pkg/sql/colexecerror/error.go line 109 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why do we need this addition (it kinda duplicates sqlColPackagesPrefix)? Because we strip the prefix when running tests via bazel? Consider leaving a comment.

Yes, exactly. Added a comment and I asked about it in slack.


pkg/sql/colexecerror/error.go line 78 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: missing spaces after the slashes in the third commit.

Done.


pkg/sql/colexecerror/error.go line 243 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I think we need to register the decoder for both internalError and notInternalError.

Oh, good catch! Done.


pkg/sql/colexecerror/error.go line 246 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this comment needs a minor adjustment.

Done.


pkg/sql/colexecerror/error.go line 0 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

(Reviewable was unable to map this GitHub inline comment thread to the right spot — sorry!)

super nit: the comment should be updated so mention the error wrapping instead of "simply panicking."

Done.


pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Thinking a bit more about this, I agree that these two checks don't add that much value, so we could remove them. If callsite for panics in Go runtime ever changes, we'll easily catch the change via CI. So I'd be in favor of removing these two ifs and simply re-panicking whenever we don't find the panic line in the stack trace.

I removed these two checks.


pkg/sql/colexecerror/error.go line 113 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Yes, good point.AssertionFailedf includes the stack trace (that's why below we only call errors.NewAssertionErrorWithWrappedErrf when we do want to create an error that would include the stack trace), so I think we can remove two calls to debug.Stack() and rely on the assertion's behavior. (Given my comment above, I don't think it's actually possible to hit this code path right now, so there is no way to check for stack trace duplication.)

Removed these two checks.


pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Reusing notInternalError would lead to a behavior change. Namely, we now won't be able to tell the difference between an expected error within vectorized engine (that should propagated as an error, without stack trace) and an unexpected error outside of the vectorized engine (which shouldn't be caught and should be propagated via panic further up). I'd introduce a new error type like Drew suggested.

I tried this out, but strangely it seemed to make things slower. My guess is that we're mostly re-wrapping with materializers and columnarizers, and it looks like we already wrap with notInternalError in columnarizer:

colexecerror.ExpectedError(meta.Err)


pkg/sql/sem/builtins/builtins.go line 5645 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: rather than introducing a new builtin, should we introduce an overload to existing crdb_internal.force_panic builtin where an optional second boolean argument would indicate whether the panic should be catchable by vectorized engine or not?

Good call. I added an override with a couple more options.


pkg/sql/colexecerror/error_test.go line 203 at r4 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Note to self: this will make backport difficult.

Done.


pkg/sql/colexecerror/main_test.go line 42 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: we don't need most of the initialization in this file. I think it can be as short as

func TestMain(m *testing.M) {
	securityassets.SetLoader(securitytest.EmbeddedAssets)
	randutil.SeedForTests()
	serverutils.InitTestServerFactory(server.TestServerFactory)

	os.Exit(m.Run())
}

Thank you! Done.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)


pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, michae2 (Michael Erickson) wrote…

I tried this out, but strangely it seemed to make things slower. My guess is that we're mostly re-wrapping with materializers and columnarizers, and it looks like we already wrap with notInternalError in columnarizer:

colexecerror.ExpectedError(meta.Err)

Hm, we might be thinking about this differently. The idea is that in the fall back case, when we had to look at the stack via runtime.CallersFrames (because the panic should be caught by vec engine but wasn't produced via one of colexecerror.*Error calls), we will wrap the error with a special marker alreadyCaughtError so that the next catcher up the stack didn't have to inspect the stack (i.e. we would add another special error type to the hot path at the top of the method). This shouldn't have any influence for columnarizer-materializer pair since they already use colexecerror methods that wrap errors with different markers. Does this match your thinking?

That said, this would be an improvement to an edge case, so I'd be ok with leaving a TODO for it.

@michae2
Copy link
Collaborator Author

michae2 commented May 2, 2024

Third time's the charm.

bors r=DrewKimball,petermattis,mgartner,yuzefovich

@craig craig bot merged commit 592e709 into cockroachdb:master May 2, 2024
22 checks passed

This comment was marked as resolved.

@michae2
Copy link
Collaborator Author

michae2 commented May 2, 2024

blathers backport release-24.1.0-rc

@michae2 michae2 deleted the catch branch May 2, 2024 19:21
yuzefovich added a commit to yuzefovich/cockroach that referenced this pull request Oct 28, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in cockroachdb#123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
yuzefovich added a commit to yuzefovich/cockroach that referenced this pull request Oct 28, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in cockroachdb#123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
craig bot pushed a commit that referenced this pull request Oct 28, 2024
133620: colexecerror: improve the catcher due to a recent regression r=yuzefovich a=yuzefovich

Earlier this year we made the vectorized panic-catcher much more efficient (in #123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is slight difference in the behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught to be converted into an internal CRDB error and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be more deeply-nested panic there, we'll still crash the CRDB server.

Fixes: #133617.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Oct 29, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in #123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
blathers-crl bot pushed a commit that referenced this pull request Oct 29, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in #123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
blathers-crl bot pushed a commit that referenced this pull request Oct 29, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in #123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
blathers-crl bot pushed a commit that referenced this pull request Oct 29, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in #123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
blathers-crl bot pushed a commit that referenced this pull request Oct 29, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in #123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
yuzefovich added a commit that referenced this pull request Oct 29, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in #123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
yuzefovich added a commit that referenced this pull request Oct 29, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in #123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
yuzefovich added a commit that referenced this pull request Oct 29, 2024
Earlier this year we made the vectorized panic-catcher much more
efficient (in #123277) by switching from using `debug.Stack()` to
`runtime.CallersFrames`. It appears that there is slight difference in
the behavior between the two: the former omits frames from within the
runtime (only a single frame for the panic itself is included) whereas
the latter keeps the additional runtime frames. As a result, if a panic
occurs due to a Go runtime internal violation (e.g. invalid interface
assertion) it is no longer caught to be converted into an internal CRDB
error and now crashes the server. This commit fixes this regression by
skipping over the frames that belong to the Go runtime. Note that we
will do so only for up to 5 frames within the runtime, so if there
happens to be more deeply-nested panic there, we'll still crash the CRDB
server.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql/colexecerror: CatchVectorizedRuntimeError uses an expensive debug.Stack call
7 participants