New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: capture and report internal/assertion errors #31261

Merged
merged 1 commit into from Oct 11, 2018

Conversation

Projects
None yet
4 participants
@knz
Member

knz commented Oct 11, 2018

Requested by @petermattis. Desired by @awoods187. Inspired by @tschottdorf.

Prior to this patch, assertion/internal errors were "merely" reported
to the client with pg error code CodeInternalError and nothing would
happen further. This was causing 2 problems:

  • it was unclear for the user what to do with this information.
  • the occurrences were not further collected and could not be used to
    track bugs automatically.

This patch improves the situation as follows:

  • a new constructor pgerror.NewAssertionErrorf is introduced, which:

    • gives a hint to the user about submitting the error as a bug with
      details.
    • populates the InternalCommand with a shortened stack trace.
  • the new construct is used instead of
    pgerror.NewErrorf(pgerror.CodeInternalError throughout the SQL
    layer.

  • the reporting mechanism now detects CodeInternalError errors,
    collects then reports them with other node statistics.

Before:

root@127.0.0.1:56399/defaultdb> select crdb_internal.force_assertion_error('woo');
pq: crdb_internal.force_assertion_error(): programming error: woo

After:

root@127.0.0.1:63733/defaultdb> select crdb_internal.force_Assertion_error('woo');
pq: crdb_internal.force_assertion_error(): programming error: woo
DETAIL: builtins.go:2805,eval.go:3547,expr.go:194,processors.go:378,processors.go:745,values.go:125,base.go:170,processors.go:768,flow.go:596,distsql_running.go:261,distsql_running.go:768,conn_executor_exec.go:983,conn_executor_exec.go:825,conn_executor_exec.go:403,conn_executor_exec.go:95,conn_executor.go:1189,conn_executor.go:466,conn.go:309
HINT: You have encountered an error inside CockroachDB. 

Please report this error with details at:
    https://github.com/cockroachdb/cockroach/issues/new/choose
or
    support@cockroachlabs.com

Release note (sql change): CockroachDB will now hint that internal
errors should be reported as bugs by users. Additionally, internal
errors are now collected internally and submitted (anonymized) with
other node statistics when statistic collection is enabled.

@knz knz requested review from dt, jordanlewis and BramGruneir Oct 11, 2018

@knz knz requested review from cockroachdb/core-prs as code owners Oct 11, 2018

@cockroach-teamcity

This comment has been minimized.

Show comment
Hide comment
@cockroach-teamcity

cockroach-teamcity Oct 11, 2018

Member

This change is Reviewable

Member

cockroach-teamcity commented Oct 11, 2018

This change is Reviewable

@knz

This comment has been minimized.

Show comment
Hide comment
@knz

knz Oct 11, 2018

Member

@dt please check that I didn't mess up the stats reporting.

Member

knz commented Oct 11, 2018

@dt please check that I didn't mess up the stats reporting.

@knz knz changed the title from sql: capture internal/assertion errors to sql: capture and report internal/assertion errors Oct 11, 2018

Show outdated Hide outdated pkg/sql/pgwire/pgerror/errors.go Outdated
Show outdated Hide outdated pkg/sql/pgwire/pgerror/errors.go Outdated
Show outdated Hide outdated pkg/server/diagnosticspb/diagnostics.proto Outdated
Show outdated Hide outdated pkg/sql/pgwire/pgerror/errors.go Outdated
Show outdated Hide outdated pkg/sql/pgwire/pgerror/errors.go Outdated
// NewAssertionErrorf creates an internal error.
func NewAssertionErrorf(format string, args ...interface{}) error {
err := NewErrorWithDepthf(1, CodeInternalError, "programming error: "+format, args...)
err.InternalCommand = captureTrace()

This comment has been minimized.

@dt

dt Oct 11, 2018

Member

Drawback of using the stack trace as the stable identifier since it changes with refactorings. I'd have a slight preference for adding an argument at the callsite so that the error is created with a unique, human-chosen identifying string like "unexpected-window-value". I guess coming up with names like that adds some friction to using the helper. If we do go with stack-based, I'd have a slight pref for func names over file names so that moving code between files doesn't matter?

@dt

dt Oct 11, 2018

Member

Drawback of using the stack trace as the stable identifier since it changes with refactorings. I'd have a slight preference for adding an argument at the callsite so that the error is created with a unique, human-chosen identifying string like "unexpected-window-value". I guess coming up with names like that adds some friction to using the helper. If we do go with stack-based, I'd have a slight pref for func names over file names so that moving code between files doesn't matter?

"Don't worry! Your data is likely safe.\n" +
"Please report this error with details at:\n\t https://github.com/cockroachdb/cockroach/issues/new/choose\nor\n\t support@cockroachlabs.com.\n\n" +
"Thank you!"
return err

This comment has been minimized.

@dt

dt Oct 11, 2018

Member

we have a ton more info right now (the full stack trace) than we will in a string->int map at reporting time. A sentry event has much, much more useful information for debugging. I'd almost like to send a sentry event right here, the way we would if this were a crash, since that has the best chance of being actionable for fixing the issue.

BUT I don't want to send a sentry every 1000 times per second if this ends up in a hot path. Maybe we could have a "when first hit" event?

@dt

dt Oct 11, 2018

Member

we have a ton more info right now (the full stack trace) than we will in a string->int map at reporting time. A sentry event has much, much more useful information for debugging. I'd almost like to send a sentry event right here, the way we would if this were a crash, since that has the best chance of being actionable for fixing the issue.

BUT I don't want to send a sentry every 1000 times per second if this ends up in a hot path. Maybe we could have a "when first hit" event?

@jordanlewis

Great idea! :lgtm: on all-but-the-stats.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/pgwire/pgerror/errors.go, line 144 at r1 (raw file):

		"Don't worry! Your data is likely safe.\n" +
		"Please report this error with details at:\n\t https://github.com/cockroachdb/cockroach/issues/new/choose\nor\n\t support@cockroachlabs.com.\n\n" +
		"Thank you!"

message bikeshed: I'd get rid of the "likely safe" sentence and the "Thank you" sentence - it feels like sugarcoating.

@knz

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/pgwire/pgerror/errors.go, line 138 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'd say "internal error" here or -- "programming" isn't very specific on its own, e.g. most of the things interacting with a DB are also "programming", so it could be interpreted as a usage error.

Done.


pkg/sql/pgwire/pgerror/errors.go, line 139 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Drawback of using the stack trace as the stable identifier since it changes with refactorings. I'd have a slight preference for adding an argument at the callsite so that the error is created with a unique, human-chosen identifying string like "unexpected-window-value". I guess coming up with names like that adds some friction to using the helper. If we do go with stack-based, I'd have a slight pref for func names over file names so that moving code between files doesn't matter?

I have thought about all you're saying here and I'd like to stick to the call trace for now. Two reasons:

  • there are usually multiple assertions per functions. The function name is not sufficient.
  • deciding labels is indeed too much friction. I am certainly not going to hand-annotate all of them here before the release.

pkg/sql/pgwire/pgerror/errors.go, line 141 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I might add "unexpected".

Done.


pkg/sql/pgwire/pgerror/errors.go, line 142 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I might skip this line/ If we do keep it, I'd avoid the contraction.

Done.


pkg/sql/pgwire/pgerror/errors.go, line 143 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Could mention searching for the unique error identifier on https://github.com/cockroachdb/cockroach/issues to see if it has known workarounds, etc.

Done.


pkg/sql/pgwire/pgerror/errors.go, line 144 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

message bikeshed: I'd get rid of the "likely safe" sentence and the "Thank you" sentence - it feels like sugarcoating.

Done.


pkg/sql/pgwire/pgerror/errors.go, line 145 at r1 (raw file):

Previously, dt (David Taylor) wrote…

we have a ton more info right now (the full stack trace) than we will in a string->int map at reporting time. A sentry event has much, much more useful information for debugging. I'd almost like to send a sentry event right here, the way we would if this were a crash, since that has the best chance of being actionable for fixing the issue.

BUT I don't want to send a sentry every 1000 times per second if this ends up in a hot path. Maybe we could have a "when first hit" event?

Can we keep the implementation as-is, and add the sentry report in a later PR? (e.g. in a point release, afterwards).

The reporting data will give us the exact version number and the stack trace will give us the exact context. While I agree Sentry provides the better UX, this PR is simpler, adequate for a backport to 2.1 today and will not lose information.


pkg/server/diagnosticspb/diagnostics.proto, line 35 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I think we'd discussed folding these kinds of things into the feature usage counts in the future, using a namespace like internalerrors.<key> instead of adding a new, separate <key> -> int map, collection pipeline, etc.

Good idea. I've fold them in the error counts.

Show outdated Hide outdated pkg/sql/conn_executor.go Outdated
sql: capture internal/assertion errors
Prior to this patch, assertion/internal errors were "merely" reported
to the client with pg error code CodeInternalError and nothing would
happen further. This was causing 2 problems:

- it was unclear for the user what to do with this information.
- the occurrences were not further collected and could not be used to
  track bugs automatically.

This patch improves the situation as follows:

- a new constructor `pgerror.NewAssertionErrorf` is introduced, which:
  - gives a hint to the user about submitting the error as a bug with
    details.
  - populates the InternalCommand with a shortened stack trace.

- the new constructor is used instead of
  `pgerror.NewErrorf(pgerror.CodeInternalError` throughout the SQL
  layer.

- the reporting mechanism now detects CodeInternalError errors,
  collects then reports them with other node statistics.

Before:

```
root@127.0.0.1:56399/defaultdb> select crdb_internal.force_assertion_error('woo');
pq: crdb_internal.force_assertion_error(): programming error: woo
```

After:

```
root@127.0.0.1:63733/defaultdb> select crdb_internal.force_Assertion_error('woo');
pq: crdb_internal.force_assertion_error(): programming error: woo
DETAIL: builtins.go:2805,eval.go:3547,expr.go:194,processors.go:378,processors.go:745,values.go:125,base.go:170,processors.go:768,flow.go:596,distsql_running.go:261,distsql_running.go:768,conn_executor_exec.go:983,conn_executor_exec.go:825,conn_executor_exec.go:403,conn_executor_exec.go:95,conn_executor.go:1189,conn_executor.go:466,conn.go:309
HINT: You have encountered an error inside CockroachDB. Don't worry! Your data is likely safe.

Please report this error with details at:
    https://github.com/cockroachdb/cockroach/issues/new/choose
or
    support@cockroachlabs.com

Thank you!
```

Release note (sql change): CockroachDB will now hint that internal
errors should be reported as bugs by users. Additionally, internal
errors are now collected internally and submitted (anonymized) with
other node statistics when statistic collection is enabled.
@knz

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/conn_executor.go, line 368 at r2 (raw file):

Previously, dt (David Taylor) wrote…

This isn't quite the where I was thinking: I was thinking we'd just use the general-purpose telemetry counters directly above instead of introducing the traces map at all?

The codes map has error-count-by-pg-code -- adding to it will result in eventually being recorded in collected telemetry info as errorcodes.internalerror-<k>. That isn't terrible, but so far, for all values of errorcodes.<x>, x was a valid pgerror code. Adding to it twould mean the sum of errorcodes.* would double count these since we still count them as the internalerror pg code.

Done. PTAL

@dt

dt approved these changes Oct 11, 2018

both parts of the PR have been approved

@knz

This comment has been minimized.

Show comment
Hide comment
@knz

knz Oct 11, 2018

Member

Thanks!

bors r+

Member

knz commented Oct 11, 2018

Thanks!

bors r+

craig bot pushed a commit that referenced this pull request Oct 11, 2018

Merge #31261
31261: sql: capture and report internal/assertion errors r=knz a=knz

Requested by @petermattis. Desired by @awoods187. Inspired by @tschottdorf.

Prior to this patch, assertion/internal errors were "merely" reported
to the client with pg error code CodeInternalError and nothing would
happen further. This was causing 2 problems:

- it was unclear for the user what to do with this information.
- the occurrences were not further collected and could not be used to
  track bugs automatically.

This patch improves the situation as follows:

- a new constructor `pgerror.NewAssertionErrorf` is introduced, which:
  - gives a hint to the user about submitting the error as a bug with
    details.
  - populates the InternalCommand with a shortened stack trace.

- the new construct is used instead of
  `pgerror.NewErrorf(pgerror.CodeInternalError` throughout the SQL
  layer.

- the reporting mechanism now detects CodeInternalError errors,
  collects then reports them with other node statistics.

Before:

```
root@127.0.0.1:56399/defaultdb> select crdb_internal.force_assertion_error('woo');
pq: crdb_internal.force_assertion_error(): programming error: woo
```

After:

```
root@127.0.0.1:63733/defaultdb> select crdb_internal.force_Assertion_error('woo');
pq: crdb_internal.force_assertion_error(): programming error: woo
DETAIL: builtins.go:2805,eval.go:3547,expr.go:194,processors.go:378,processors.go:745,values.go:125,base.go:170,processors.go:768,flow.go:596,distsql_running.go:261,distsql_running.go:768,conn_executor_exec.go:983,conn_executor_exec.go:825,conn_executor_exec.go:403,conn_executor_exec.go:95,conn_executor.go:1189,conn_executor.go:466,conn.go:309
HINT: You have encountered an error inside CockroachDB. Don't worry! Your data is likely safe.

Please report this error with details at:
    https://github.com/cockroachdb/cockroach/issues/new/choose
or
    support@cockroachlabs.com

Thank you!
```

Release note (sql change): CockroachDB will now hint that internal
errors should be reported as bugs by users. Additionally, internal
errors are now collected internally and submitted (anonymized) with
other node statistics when statistic collection is enabled.

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@craig

This comment has been minimized.

Show comment
Hide comment
@craig

craig bot commented Oct 11, 2018

Build succeeded

@craig craig bot merged commit 5d0e1c6 into cockroachdb:master Oct 11, 2018

3 checks passed

GitHub CI (Cockroach) TeamCity build finished
Details
bors Build succeeded
Details
license/cla Contributor License Agreement is signed.
Details

@knz knz deleted the knz:20181011-fatals branch Oct 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment