Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upsql: capture and report internal/assertion errors #31261
Conversation
knz
requested review from
dt,
jordanlewis and
BramGruneir
Oct 11, 2018
knz
requested review from
cockroachdb/core-prs
as
code owners
Oct 11, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
@dt please check that I didn't mess up the stats reporting. |
knz
changed the title from
sql: capture internal/assertion errors
to
sql: capture and report internal/assertion errors
Oct 11, 2018
pkg/sql/pgwire/pgerror/errors.go Outdated
pkg/sql/pgwire/pgerror/errors.go Outdated
pkg/sql/pgwire/pgerror/errors.go Outdated
pkg/sql/pgwire/pgerror/errors.go Outdated
| // NewAssertionErrorf creates an internal error. | ||
| func NewAssertionErrorf(format string, args ...interface{}) error { | ||
| err := NewErrorWithDepthf(1, CodeInternalError, "programming error: "+format, args...) | ||
| err.InternalCommand = captureTrace() |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
dt
Oct 11, 2018
Member
Drawback of using the stack trace as the stable identifier since it changes with refactorings. I'd have a slight preference for adding an argument at the callsite so that the error is created with a unique, human-chosen identifying string like "unexpected-window-value". I guess coming up with names like that adds some friction to using the helper. If we do go with stack-based, I'd have a slight pref for func names over file names so that moving code between files doesn't matter?
dt
Oct 11, 2018
Member
Drawback of using the stack trace as the stable identifier since it changes with refactorings. I'd have a slight preference for adding an argument at the callsite so that the error is created with a unique, human-chosen identifying string like "unexpected-window-value". I guess coming up with names like that adds some friction to using the helper. If we do go with stack-based, I'd have a slight pref for func names over file names so that moving code between files doesn't matter?
| "Don't worry! Your data is likely safe.\n" + | ||
| "Please report this error with details at:\n\t https://github.com/cockroachdb/cockroach/issues/new/choose\nor\n\t support@cockroachlabs.com.\n\n" + | ||
| "Thank you!" | ||
| return err |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
dt
Oct 11, 2018
Member
we have a ton more info right now (the full stack trace) than we will in a string->int map at reporting time. A sentry event has much, much more useful information for debugging. I'd almost like to send a sentry event right here, the way we would if this were a crash, since that has the best chance of being actionable for fixing the issue.
BUT I don't want to send a sentry every 1000 times per second if this ends up in a hot path. Maybe we could have a "when first hit" event?
dt
Oct 11, 2018
Member
we have a ton more info right now (the full stack trace) than we will in a string->int map at reporting time. A sentry event has much, much more useful information for debugging. I'd almost like to send a sentry event right here, the way we would if this were a crash, since that has the best chance of being actionable for fixing the issue.
BUT I don't want to send a sentry every 1000 times per second if this ends up in a hot path. Maybe we could have a "when first hit" event?
jordanlewis
reviewed
Oct 11, 2018
Great idea! on all-but-the-stats.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/pgwire/pgerror/errors.go, line 144 at r1 (raw file):
"Don't worry! Your data is likely safe.\n" + "Please report this error with details at:\n\t https://github.com/cockroachdb/cockroach/issues/new/choose\nor\n\t support@cockroachlabs.com.\n\n" + "Thank you!"
message bikeshed: I'd get rid of the "likely safe" sentence and the "Thank you" sentence - it feels like sugarcoating.
knz
reviewed
Oct 11, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/pgwire/pgerror/errors.go, line 138 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I'd say "internal error" here or -- "programming" isn't very specific on its own, e.g. most of the things interacting with a DB are also "programming", so it could be interpreted as a usage error.
Done.
pkg/sql/pgwire/pgerror/errors.go, line 139 at r1 (raw file):
Previously, dt (David Taylor) wrote…
Drawback of using the stack trace as the stable identifier since it changes with refactorings. I'd have a slight preference for adding an argument at the callsite so that the error is created with a unique, human-chosen identifying string like "unexpected-window-value". I guess coming up with names like that adds some friction to using the helper. If we do go with stack-based, I'd have a slight pref for func names over file names so that moving code between files doesn't matter?
I have thought about all you're saying here and I'd like to stick to the call trace for now. Two reasons:
- there are usually multiple assertions per functions. The function name is not sufficient.
- deciding labels is indeed too much friction. I am certainly not going to hand-annotate all of them here before the release.
pkg/sql/pgwire/pgerror/errors.go, line 141 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I might add "unexpected".
Done.
pkg/sql/pgwire/pgerror/errors.go, line 142 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I might skip this line/ If we do keep it, I'd avoid the contraction.
Done.
pkg/sql/pgwire/pgerror/errors.go, line 143 at r1 (raw file):
Previously, dt (David Taylor) wrote…
Could mention searching for the unique error identifier on https://github.com/cockroachdb/cockroach/issues to see if it has known workarounds, etc.
Done.
pkg/sql/pgwire/pgerror/errors.go, line 144 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
message bikeshed: I'd get rid of the "likely safe" sentence and the "Thank you" sentence - it feels like sugarcoating.
Done.
pkg/sql/pgwire/pgerror/errors.go, line 145 at r1 (raw file):
Previously, dt (David Taylor) wrote…
we have a ton more info right now (the full stack trace) than we will in a string->int map at reporting time. A sentry event has much, much more useful information for debugging. I'd almost like to send a sentry event right here, the way we would if this were a crash, since that has the best chance of being actionable for fixing the issue.
BUT I don't want to send a sentry every 1000 times per second if this ends up in a hot path. Maybe we could have a "when first hit" event?
Can we keep the implementation as-is, and add the sentry report in a later PR? (e.g. in a point release, afterwards).
The reporting data will give us the exact version number and the stack trace will give us the exact context. While I agree Sentry provides the better UX, this PR is simpler, adequate for a backport to 2.1 today and will not lose information.
pkg/server/diagnosticspb/diagnostics.proto, line 35 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I think we'd discussed folding these kinds of things into the feature usage counts in the future, using a namespace like
internalerrors.<key>instead of adding a new, separate<key>-> int map, collection pipeline, etc.
Good idea. I've fold them in the error counts.
pkg/sql/conn_executor.go Outdated
knz
reviewed
Oct 11, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/conn_executor.go, line 368 at r2 (raw file):
Previously, dt (David Taylor) wrote…
This isn't quite the where I was thinking: I was thinking we'd just use the general-purpose
telemetrycounters directly above instead of introducing thetracesmap at all?The
codesmap has error-count-by-pg-code -- adding to it will result in eventually being recorded in collected telemetry info aserrorcodes.internalerror-<k>. That isn't terrible, but so far, for all values oferrorcodes.<x>,xwas a valid pgerror code. Adding to it twould mean the sum oferrorcodes.*would double count these since we still count them as the internalerror pg code.
Done. PTAL
knz
referenced this pull request
Oct 11, 2018
Merged
release-2.1: sql: capture and report internal/assertion errors #31272
knz
dismissed
jordanlewis’s
stale review
Oct 11, 2018
both parts of the PR have been approved
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
Thanks! bors r+ |
bot
pushed a commit
that referenced
this pull request
Oct 11, 2018
This was referenced Oct 11, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
craig
bot
commented
Oct 11, 2018
Build succeeded |
knz commentedOct 11, 2018
•
edited
Requested by @petermattis. Desired by @awoods187. Inspired by @tschottdorf.
Prior to this patch, assertion/internal errors were "merely" reported
to the client with pg error code CodeInternalError and nothing would
happen further. This was causing 2 problems:
track bugs automatically.
This patch improves the situation as follows:
a new constructor
pgerror.NewAssertionErrorfis introduced, which:details.
the new construct is used instead of
pgerror.NewErrorf(pgerror.CodeInternalErrorthroughout the SQLlayer.
the reporting mechanism now detects CodeInternalError errors,
collects then reports them with other node statistics.
Before:
After:
Release note (sql change): CockroachDB will now hint that internal
errors should be reported as bugs by users. Additionally, internal
errors are now collected internally and submitted (anonymized) with
other node statistics when statistic collection is enabled.