New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: prevent a crash and add telemetry for an assertion #32713

Merged
merged 1 commit into from Nov 30, 2018

Conversation

Projects
None yet
3 participants
@knz
Copy link
Member

knz commented Nov 29, 2018

Informs #32517.

We got reports in the wild that CockroachDB crashes upon SHOW
SESSIONS/QUERIES due to an invalid session ID in a ListSessions
response.

An invalid session ID should never happen; also conceptually it makes
no sense because if a session exists (insofar that it gets reported by
Listsessions) it must hav ean ID.

However I was not able to find how an invalid session ID could be
created, if at all. So the root cause of the problem remains
uncovered.

In the meantime, this patch prevents a crash when encountering an
invalid session ID. nil entries are transformed to SQL NULLs, and
non-nil entries with a non-16 size are reported as <invalid>. Also
the values are recorded and reported in telemetry.

Release note (bug fix): CockroachDB does not longer crash when
encountering an internal error related to invalid entries in the
output of SHOW SESSIONs.

@knz knz requested review from jordanlewis and andreimatei Nov 29, 2018

@knz knz added this to Triage in SQL Front-end, Lang & Semantics via automation Nov 29, 2018

@knz knz requested review from cockroachdb/sql-execution-prs as code owners Nov 29, 2018

@cockroach-teamcity

This comment has been minimized.

Copy link
Member

cockroach-teamcity commented Nov 29, 2018

This change is Reviewable

@andreimatei
Copy link
Member

andreimatei left a comment

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/crdb_internal.go, line 900 at r1 (raw file):

		var sessionID tree.Datum
		if session.ID == nil {
			telemetry.RecordError(pgerror.NewInternalTrackingError(32517, "null"))

nit: /* issue */


pkg/sql/pgwire/pgerror/errors.go, line 166 at r1 (raw file):

	prefix := fmt.Sprintf("#%d.%s", issue, detail)
	err := NewErrorWithDepthf(1, CodeInternalError, "internal error: %s", prefix)
	err.InternalCommand = prefix + " " + captureTrace()

I guess you've done this to make the counting work, but it seems like an abuse.
Do we really need this method at all? Won't we get counting just as good if we just call telemetry.RecordError(errors.NewErrorf(...))?

sql: prevent a crash and add telemetry for an assertion
We got reports in the wild that CockroachDB crashes upon SHOW
SESSIONS/QUERIES due to an invalid session ID in a ListSessions
response.

An invalid session ID should never happen; also conceptually it makes
no sense because if a session exists (insofar that it gets reported by
Listsessions) it must hav ean ID.

However I was not able to find how an invalid session ID could be
created, if at all. So the root cause of the problem remains
uncovered.

In the meantime, this patch prevents a crash when encountering an
invalid session ID. nil entries are transformed to SQL NULLs, and
non-nil entries with a non-16 size are reported as `<invalid>`. Also
the values are recorded and reported in telemetry.

Release note (bug fix): CockroachDB does not longer crash when
encountering an internal error related to invalid entries in the
output of SHOW SESSIONs.
@knz
Copy link
Member

knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/crdb_internal.go, line 900 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: /* issue */

Done.


pkg/sql/pgwire/pgerror/errors.go, line 166 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I guess you've done this to make the counting work, but it seems like an abuse.
Do we really need this method at all? Won't we get counting just as good if we just call telemetry.RecordError(errors.NewErrorf(...))?

No. The telemetry is only done for special errors - InternalError and FeatureNotSupported. Also the telemetry cannot report arbitrary error messages due to the risk of information leakage. Only the InternalCommand field gets reported.

@knz knz force-pushed the knz:20181129-sessions-crash branch from 31b6f4e to d9a99db Nov 30, 2018

@knz

This comment has been minimized.

Copy link
Member

knz commented Nov 30, 2018

TFYR!

@knz

This comment has been minimized.

Copy link
Member

knz commented Nov 30, 2018

(I said I changed hte things but my push is not going through due to some github badness.)

@knz

This comment has been minimized.

Copy link
Member

knz commented Nov 30, 2018

oh now it's done.

bors r+

craig bot pushed a commit that referenced this pull request Nov 30, 2018

Merge #32713
32713: sql: prevent a crash and add telemetry for an assertion r=knz a=knz

Informs  #32517.

We got reports in the wild that CockroachDB crashes upon SHOW
SESSIONS/QUERIES due to an invalid session ID in a ListSessions
response.

An invalid session ID should never happen; also conceptually it makes
no sense because if a session exists (insofar that it gets reported by
Listsessions) it must hav ean ID.

However I was not able to find how an invalid session ID could be
created, if at all. So the root cause of the problem remains
uncovered.

In the meantime, this patch prevents a crash when encountering an
invalid session ID. nil entries are transformed to SQL NULLs, and
non-nil entries with a non-16 size are reported as `<invalid>`. Also
the values are recorded and reported in telemetry.

Release note (bug fix): CockroachDB does not longer crash when
encountering an internal error related to invalid entries in the
output of SHOW SESSIONs.

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@craig

This comment has been minimized.

Copy link

craig bot commented Nov 30, 2018

Build succeeded

@craig craig bot merged commit d9a99db into cockroachdb:master Nov 30, 2018

3 checks passed

GitHub CI (Cockroach) TeamCity build finished
Details
bors Build succeeded
Details
license/cla Contributor License Agreement is signed.
Details

@knz knz deleted the knz:20181129-sessions-crash branch Nov 30, 2018

@andreimatei
Copy link
Member

andreimatei left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/pgwire/pgerror/errors.go, line 166 at r1 (raw file):

Previously, knz (kena) wrote…

No. The telemetry is only done for special errors - InternalError and FeatureNotSupported. Also the telemetry cannot report arbitrary error messages due to the risk of information leakage. Only the InternalCommand field gets reported.

What about this code, though? Doesn't that do a count for any error you pass in?

typ := log.ErrorSource(err)

I'm asking cause I think it'd be good to have a simple to use API for this telemetry cause I can see myself using it, but the current one seems a bit awkward. Is it OK to just call telemetry.Count("othererror.invalid session id") directly and not go through the error?

Also, what's the point of the hint we're setting below? Does anyone get to see that?

@knz
Copy link
Member

knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/pgwire/pgerror/errors.go, line 166 at r1 (raw file):

What about this code, though? Doesn't that do a count for any error you pass in?

It does but not with a custom string.

Is it OK to just call telemetry.Count("othererror.invalid session id") directly and not go through the error?

Absolutely, yes. However that does not include a stack trace with the error.

I'm asking cause I think it'd be good to have a simple to use API for this telemetry

Please explain to DavidT and myself what you'd like to have. We are convening next week together to lay out the various use cases and document the API to the larger engineering team.

@andreimatei

This comment has been minimized.

Copy link
Member

andreimatei commented Dec 3, 2018

Please explain to DavidT and myself what you'd like to have. We are convening next week together to lay out the various use cases and document the API to the larger engineering team.

What I think I'd like telemetry.Count() to do whatever is needed in a case like this (counting an assertion failure or other unexpected things), without having to go through an error, let alone through a pgerror. So, I guess, an option to also record a stack trace (although I don't think that's necessary in this case).

@knz knz moved this from Triage to Finished (m2.2-2) in SQL Front-end, Lang & Semantics Dec 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment