opt: fix panic recovery for error handling #38570

Merged
merged 1 commit into from Jul 9, 2019

RaduBerinde commented Jun 29, 2019

The major entry points in the optimizer catch all panics that throw an
error and convert them to errors. Unfortunately, this also catches
runtime errors (in which case we convert them to errors and lose the
stack trace).

This change adds a ShouldCatch helper which determines if we should
return a thrown object as an error. If the object is a
runtime.Error, it gets wrapped by an AssertionFailed error which
will cause correct error handling (stack trace, sentry reporting, etc).

As part of this change, we are also removing wrappers like
builderError, which are no longer useful. We fix the opt tester to
fail with the full error information (using %+v) for assertion
errors.

Release note: None

@RaduBerinde RaduBerinde requested review from justinj, knz, rytaft and andy-kimball Jun 29, 2019
@RaduBerinde RaduBerinde requested a review from cockroachdb/sql-opt-prs as a code owner Jun 29, 2019
@RaduBerinde RaduBerinde force-pushed the RaduBerinde:opt-err-fix branch from c88ec28 to 3ef833f Jun 29, 2019

knz left a comment

So I think this PR is misguided. When I wrote the code, I intended to catch runtime.Error panics and let them flow through. The reason is that runtime.Error panics are recoverable, and there is no reason to let a cluster go down when they occur.

FYI I even went through the go source code to validate the following:

  • runtime.Error is only emitted for "soft" errors like out-of-bound accesses, assertion failures, etc
  • for "serious" internal errors e.g. in the scheduler, bad goroutine state, allocator problem etc, the runtime throws a string which does not implement error and thus will not be captured here.

So, can you explain a little better why you thought this PR was a good idea?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @knz, and @rytaft)

RaduBerinde commented Jun 29, 2019

Today if you are working on a change that results in a nil dereference or out-of-bound access, you get a one line error with no stack trace. Good luck debugging that. IMO that is not acceptable, both for development workflow and customer support (what will we do when we get a report from a customer which just says "out of bounds" with no other context?)

When we agreed to catch assertion errors thrown by the optimizer, it was with the condition that we will still always get stack traces for them. The discussion was mostly focused on assertions generated by our code, I don't think we specifically discussed catching runtime errors (at least not to my knowledge). I am ok catching them but only if we don't lose the stack trace.

knz commented Jun 29, 2019

RaduBerinde commented Jul 1, 2019

It doesn't work. The stack trace isn't shown in important cases:

In cockroach demo:

root@127.68.126.34:45519/defaultdb> select 1 as lolomg;
pq: runtime error: index out of range
root@127.68.126.34:45519/defaultdb> 

In an opt test:

--- FAIL: TestBuilder (0.00s)
    --- FAIL: TestBuilder/select (0.00s)
        builder_test.go:60: 
            testdata/select:25: SELECT 1 AS lolomg
            expected:
            
            found:
            error: runtime error: index out of range
FAIL

RaduBerinde commented Jul 1, 2019

RaduBerinde commented Jul 1, 2019

Maybe I should try NewAssertionErrorWithWrappedErrf?

knz commented Jul 1, 2019

oh yes, absolutely. I hadn't thought of that but indeed it's the best way to ensure we get telemetry, etc.

RaduBerinde commented Jul 2, 2019

Just leaving a note with the status of this PR - converting to AssertionFailed didn't quite work because it still doesn't print the stack trace in tests (with %+v); @knz is going to fix that in the error library first.

craig bot pushed a commit that referenced this pull request Jul 8, 2019
38557: exec: protect against unset syncFlowConsumer r=jordanlewis a=asubiotto

This should never happen since it implies that the receiver isn't
connected correctly. These happen when a node sends a SetupFlow request
to a remote node where the spec specifies that the response is on that
remote node. We don't see panics in the row execution engine due to
wrapping the syncFlowConsumer with a copyingRowReceiver, but this state
can cause setupVectorized to panic.

This commit protects against this state pending further investigation.

Release note: None

38654: exec: Handle NULLS in TopK sorter r=rohany a=rohany

This commit fixes NULLs in the TopK sorter by avoiding use
of the vec copy method, which has a bug. Instead, we add
a set method to the vec comparator, and use the templatized
comparator to perform the sets that the TopK sorter needs.

To facilitate this, we add an UnsetNull method to the Nulls
object. However, using this method can cause HasNull() to
return true even if the vector doesn't have nulls.
This behavior already occurs when selection vectors are used.
Based on discussions with @solongordon and @asubiotto, this behavior
is OK; future PRs will attempt to improve it and address
the bugs in the Vec Copy method.

38710: errors: fix the formatting with %+v r=knz a=knz

(found by @RaduBerinde; needed to complete #38570)

The new library `github.com/cockroachdb/errors` was not implementing
`%+v` formatting properly for assertion and unimplemented errors.
The wrong implementation was hiding the details of the cause
of these two error types from the formatting logic.

Fixing this bug comprehensively required completing the investigation
of the Go 2 / `xerrors` error proposal. This revealed that the
implementation of `fmt.Formatter` for wrapper errors (a `Format()`
method) is required in all cases, at least until Go's stdlib
learns about `errors.Formatter`. More details at
golang/go#29934 and this commit message: cockroachdb/errors@78b6caa.

This patch bumps the dependency `github.com/cockroachdb/errors` to
pick up the fixes to assertion failures and unimplemented errors.

The new definition of `errors.FormatError()` subsequently required
re-implementing `Format()` for `pgerror.withCandidateCode`, which is
also done here.

Finally, this patch also picks up `errors.As()` and the new
streamlined `fmt.Formatter` / `errors.Formatter` interaction, so this
patch also simplifies a few custom error types in CockroachDB
accordingly.

Release note: None

Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@RaduBerinde RaduBerinde force-pushed the RaduBerinde:opt-err-fix branch from 3ef833f to 1caf74a Jul 8, 2019

RaduBerinde commented Jul 8, 2019

Updated, using NewAssertionErrorWithWrappedErrf now.

@RaduBerinde RaduBerinde force-pushed the RaduBerinde:opt-err-fix branch 2 times, most recently from 54b69fb to 525b9ff Jul 8, 2019
knz approved these changes Jul 9, 2019

knz left a comment

Reviewed 19 of 19 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, and @rytaft)

knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @RaduBerinde, and @rytaft)


pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

			// Convert runtime errors to internal errors, which display the stack and
			// get reported to Sentry.
			err = errors.NewAssertionErrorWithWrappedErrf(err, "")

That's what's creating the surprising result.
Until I fix this you can make the surprising errors with safe detail disappear (and also introduce a clarification about where the runtime error comes from) as follows:

err = errors.HandledWithMessage(err, "Go runtime error")
err = errors.WithAssertionFailure(err)
err = errors.WithStack(err)
knz approved these changes Jul 9, 2019

knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @RaduBerinde, and @rytaft)


pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

Previously, knz (kena) wrote…

That's what's creating the surprising result.
Until I fix this you can make the surprising errors with safe detail disappear (and also introduce a clarification about where the runtime error comes from) as follows:

err = errors.HandledWithMessage(err, "Go runtime error")
err = errors.WithAssertionFailure(err)
err = errors.WithStack(err)

see https://github.com/cockroachdb/errors/pull/3

knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @RaduBerinde, and @rytaft)


pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

Previously, knz (kena) wrote…

see cockroachdb/errors#3

Then you can use err = errors.HandleAsAssertionFailure(err) instead of the 3 lines I listed above.

@RaduBerinde RaduBerinde force-pushed the RaduBerinde:opt-err-fix branch from 525b9ff to 5ab44a9 Jul 9, 2019
@RaduBerinde RaduBerinde requested a review from Jul 9, 2019

RaduBerinde commented Jul 9, 2019

Bumped the dep and switched to HandleAsAssertionFailure.

knz left a comment

Reviewed 3 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @RaduBerinde, and @rytaft)

RaduBerinde commented Jul 9, 2019

TFTR!

bors r+

craig bot pushed a commit that referenced this pull request Jul 9, 2019
38570: opt: fix panic recovery for error handling r=RaduBerinde a=RaduBerinde

The major entry points in the optimizer catch all panics that throw an
error and convert them to errors. Unfortunately, this also catches
runtime errors (in which case we convert them to errors and lose the
stack trace).

This change adds a `ShouldCatch` helper which determines if we should
return a thrown object as an error. If the object is a
`runtime.Error`, it gets wrapped by an AssertionFailed error which
will cause correct error handling (stack trace, sentry reporting, etc).

As part of this change, we are also removing wrappers like
`builderError`, which are no longer useful. We fix the opt tester to
fail with the full error information (using `%+v`) for assertion
errors.

Release note: None

38660: opt: push limit into offset r=ridwanmsharif a=ridwanmsharif

This change pushes the limit into an offset whenever possible.
This shouldn't worsen any plan but does allow the `GetLimitedScans`
rule to fire in more scenarios.

Fixes #30416.
~~This is currently blocked on #38659.~~

Release note: None

38743: roachtest: skip jepsen/multi-register r=god a=nvanbenschoten

There's no use running this every night until #36431 is fixed.

Release note: None

38746: roachtest: don't reuse clusters after test failure r=andreimatei a=andreimatei

We've had a case where a cluster got messed up somehow and then a bunch
of tests that tried to reuse it failed. This patch employs a big hammer
and makes it so that we don't reuse a cluster after a test failure
(whether or not the failure is cluster-related).

Release note: None

38766: scripts/release-notes.py: help the user with --from/--until r=lhirata a=knz

Requested by @lhirata

Release note: None

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>

craig bot commented Jul 9, 2019

Build succeeded

@craig craig bot merged commit 5ab44a9 into cockroachdb:master Jul 9, 2019
3 checks passed
GitHub CI (Cockroach) TeamCity build finished
bors Build succeeded
license/cla Contributor License Agreement is signed.
@RaduBerinde RaduBerinde deleted the RaduBerinde:opt-err-fix branch Jul 10, 2019