
sql: implement a background task for temp object cleanup #45669

Merged

Conversation


@otan otan commented Mar 3, 2020

Resolves #45666

This PR introduces a TempObjectCleaner, a background task that runs
periodically on every node but is only executed by the node that holds the
meta1 lease. The task cleans up dangling temporary schemas that could not
be cleaned up by the connection that created them. The job comes with
logging and metrics, and a simple test mocking the ticking mechanism is
included.

Also refactored the existing temporary schema deletion to use ExecEx,
which overrides the SearchPath / User executing the deletion at a lower
layer rather than using SetSessionExecutor.

Release note (sql change): Introduced a temporary object cleanup job that
runs periodically per cluster and removes any temporary schemas and their
related objects that did not get removed cleanly when the connection
closed. The period can be changed via the cluster setting
sql.temp_object_cleaner.cleanup_interval, which defaults to 30 minutes.

@otan otan requested review from jordanlewis and a team March 3, 2020 21:34
@otan otan added this to Open PRs in SQL Features (Deprecated - use SQL Experience board) via automation Mar 3, 2020
@cockroach-teamcity

This change is Reviewable

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch from d6a8b20 to 84b98c2 Compare March 3, 2020 23:04
@jordanlewis jordanlewis mentioned this pull request Mar 4, 2020
83 tasks
@otan otan force-pushed the temp_schema_cleanup_boogaloo branch 6 times, most recently from 8c5e449 to 15cbf16 Compare March 4, 2020 19:56

otan commented Mar 6, 2020

@jordanlewis RFAL if you have time :)


@jordanlewis jordanlewis left a comment


Mostly :lgtm: though having a stamp from someone who proposed the meta1 thing would be good!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)


pkg/sql/temporary_schema.go, line 409 at r1 (raw file):

	// We only want to perform the cleanup if we are holding the meta1 lease.
	// This ensures only one server can perform the job at a time.

No idea if this is legit or not. I think it would be nice to get a review from whoever suggested this idea as well :)


pkg/sql/temporary_schema.go, line 483 at r1 (raw file):

	for sessionID := range sessionIDs {
		if _, ok := activeSessions[sessionID.Uint128]; !ok {
			log.Infof(ctx, "cleaning up temporary object for session %q", sessionID)

I'd recommend just getting rid of this one, as in the happy case of cleaning up, say 1000 sessions, we'll get 1000 pointless and unactionable log lines. I'd keep the "starting" and "finishing" ones, and the error ones, and add "cleaned up %d/%d" sessions to the "finishing" log.


pkg/sql/temporary_schema.go, line 484 at r1 (raw file):

		if _, ok := activeSessions[sessionID.Uint128]; !ok {
			log.Infof(ctx, "cleaning up temporary object for session %q", sessionID)
			c.metrics.schemasToDelete.Inc(1)

If you never decrement it, won't this look kind of scary on the timeseries graphs? The name makes it sound like it's a queue. Maybe it should be a gauge.


pkg/sql/sqlbase/internal.go, line 24 at r1 (raw file):

	// ApplicationName represents the application that the query runs under.
	ApplicationName string
	// SearchPath represents the namespsaces to search in.

nit: typo in namespaces


otan commented Mar 6, 2020 via email


@ajwerner ajwerner left a comment


Reviewed 4 of 10 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)


pkg/sql/temporary_schema.go, line 45 at r1 (raw file):

// defaultTempObjectCleanupInterval is the interval of time after which a background job
// runs to cleanup temporary tables that weren't deleted at session exit.
var defaultTempObjectCleanupInterval = 30 * time.Minute

smells like a reserved cluster setting.


pkg/sql/temporary_schema.go, line 334 at r1 (raw file):

	metrics                          *temporaryObjectCleanerMetrics

	sleepFunc func(time.Duration)

Is this ever used?


pkg/sql/temporary_schema.go, line 362 at r1 (raw file):

		isMeta1LeaseholderFunc:           isMeta1LeaseholderFunc,
		testingKnobs:                     testingKnobs,
		metrics: &temporaryObjectCleanerMetrics{

I don't think you've got these metrics hooked up. Once you do, I think you'll be dismayed by the linter strong-arming you into describing these metrics in https://github.com/cockroachdb/cockroach/blob/master/pkg/ts/catalog/chart_catalog.go.

In order to get it hooked up you need to:

  1. Implement https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/metric#Struct on temporaryObjectCleanerMetrics
  2. Provide a method to get the metrics:
     func (c *TemporaryObjectCleaner) Metrics() metric.Struct { return &c.metrics }
  3. Export the fields in the metrics struct
  4. Register the metric struct in the registry in server.go

nit: there's a pattern generally throughout the codebase to:

  • have the metrics struct as a non-pointer field in its parent
  • have a make...Metrics function that constructs the metrics struct
  • store the meta in a var block // I'm not so attached to this but if you look around, it's the pattern

pkg/sql/temporary_schema.go, line 363 at r1 (raw file):

		testingKnobs:                     testingKnobs,
		metrics: &temporaryObjectCleanerMetrics{
			jobRunning: metric.NewGauge(metric.Metadata{

This feels like a thing I'd want to keep as a counter that just increases monotonically. I can see being torn about it. If anything, though, I'd make the one below a gauge before this one. We have good rate/delta computation.

It's moments like these where I wish our timeseries database were either more sophisticated or didn't exist. Always-incrementing counters, which you can subtract from each other and which are known to be sampled at the same time, are a really nice metric primitive. We lack this operation in our timeseries database. It's probably pretty easy to add. Anyway, Prometheus offers it.

The nice thing about having counters is that they also give you rates. Not that this is a thing that's going to have any interesting pattern in its rate.


pkg/sql/temporary_schema.go, line 409 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

No idea if this is legit or not. I think it would be nice to get a review from whoever suggested this idea as well :)

This is legit.


pkg/sql/temporary_schema.go, line 411 at r1 (raw file):

	// This ensures only one server can perform the job at a time.
	var isLeaseholder bool
	if err := retryFunc(ctx, func() error {

Did you find the retryFunc handy here for tests or something? In general I don't think you need to be retrying this. A lot of the errors will be due to things like shutting down, and leases don't move around very often. I think I'd rather increase the interval than increase the retries here.


pkg/sql/temporary_schema.go, line 483 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I'd recommend just getting rid of this one, as in the happy case of cleaning up, say 1000 sessions, we'll get 1000 pointless and unactionable log lines. I'd keep the "starting" and "finishing" ones, and the error ones, and add "cleaned up %d/%d" sessions to the "finishing" log.

VEventf is cool. Andrei has been pushing a testing methodology whereby you can examine what happens in the execution of calls by inspecting what got passed to their traces. I like it because it's less verbose than if log.V(...) { log.Info(...) }, but I'm interested in seeing how sprinkling tracing throughout the code will come in handy later.


pkg/sql/temporary_schema.go, line 484 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

IF you never decrement it, wont this look kinda scary on the timeseries graphs? The name makes it sound like it's a queue. Maybe it should be a gauge.

This one does feel weird as a counter, given what you've made a gauge in terms of counting the job runs.


pkg/sql/temporary_schema.go, line 489 at r1 (raw file):

			// the given schema correctly.
			if err := retryFunc(ctx, func() error {
				return cleanupSessionTempObjects(

This code doesn't seem to log or record a metric in the error path. I get that you're retrying in the face of some errors, but any retries are interesting. Perhaps a metric that you pass to your retryFunc to count retries, plus a log.Every on the cleaner, and then an if err != nil && c.logEvery.ShouldLog() { log.InfofDepth(ctx, 1, ...) }.

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch from 15cbf16 to dceef05 Compare March 6, 2020 19:11

@otan otan left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @jordanlewis)


pkg/sql/temporary_schema.go, line 45 at r1 (raw file):

Previously, ajwerner wrote…

smells like a reserved cluster setting.

Done.


pkg/sql/temporary_schema.go, line 334 at r1 (raw file):

Previously, ajwerner wrote…

Is this ever used?

Nope - relic from previous retry library. Fixed!


pkg/sql/temporary_schema.go, line 362 at r1 (raw file):

Previously, ajwerner wrote…

I don't think you've got these metrics hooked up. Once you do, I think you'll be dismayed by the linter strong-arming you into describing these metrics in https://github.com/cockroachdb/cockroach/blob/master/pkg/ts/catalog/chart_catalog.go.

In order to get it hooked up you need to:

  1. Implement https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/metric#Struct on temporaryObjectCleanerMetrics
  2. Provide a method to get the metrics:
     func (c *TemporaryObjectCleaner) Metrics() metric.Struct { return &c.metrics }
  3. Export the fields in the metrics struct
  4. Register the metric struct in the registry in server.go

nit: there's a pattern generally throughout the codebase to:

  • have the metrics struct as a non-pointer field in its parent
  • have a make...Metrics function that constructs the metrics struct
  • store the meta in a var block // I'm not so attached to this but if you look around, it's the pattern

Done.


pkg/sql/temporary_schema.go, line 363 at r1 (raw file):

Previously, ajwerner wrote…

This feels like a thing I'd want to keep as a counter that just increases monotonically. I can see being torn about it. If anything, though, I'd make the one below a gauge before this one. We have good rate/delta computation.

It's moments like these where I wish our timeseries database were either more sophisticated or didn't exist. Always-incrementing counters, which you can subtract from each other and which are known to be sampled at the same time, are a really nice metric primitive. We lack this operation in our timeseries database. It's probably pretty easy to add. Anyway, Prometheus offers it.

The nice thing about having counters is that they also give you rates. Not that this is a thing that's going to have any interesting pattern in its rate.

I wanted to keep this as a measure of when the cleaner is running and when it is not (i.e. 1=active, 0=not active) -- it's a practice from my old workplace which was nice to read.


pkg/sql/temporary_schema.go, line 409 at r1 (raw file):

Previously, ajwerner wrote…

This is legit.

Done.


pkg/sql/temporary_schema.go, line 411 at r1 (raw file):

Previously, ajwerner wrote…

Did you find the retryFunc handy here for tests or something? In general I don't think you need to be retrying this. A lot of the errors will be due to things like shutting down, and leases don't move around very often. I think I'd rather increase the interval than increase the retries here.

Nope, just added retries to everything. I can remove this one. Done.


pkg/sql/temporary_schema.go, line 483 at r1 (raw file):

Previously, ajwerner wrote…

VEventf is cool. Andrei has been pushing a testing methodology whereby you can examine what happens in the execution of calls by inspecting what got passed to their traces. I like it because it's less verbose than if log.V(...) { log.Info(...) }, but I'm interested in seeing how sprinkling tracing throughout the code will come in handy later.

Done.


pkg/sql/temporary_schema.go, line 484 at r1 (raw file):

Previously, ajwerner wrote…

This one does feel weird as a counter, given what you've made a gauge in terms of counting the job runs.

I was thinking this should actively count how many have been deleted. More accurately, I care about (delta)/time period so I can see things like "deletes per second". Isn't that what a counter is for? Sorry, I haven't used this metric system much and that's how I've interpreted what it can do.


pkg/sql/temporary_schema.go, line 489 at r1 (raw file):

Previously, ajwerner wrote…

This code doesn't seem to log or record a metric in the error path. I get that you're retrying in the face of some errors, but any retries are interesting. Perhaps a metric that you pass to your retryFunc to count retries, plus a log.Every on the cleaner, and then an if err != nil && c.logEvery.ShouldLog() { log.InfofDepth(ctx, 1, ...) }.

I've decided to just log every error during a retry in the retryFunc.

@jordanlewis

I think something is going wrong here. Seen in one of the test failure outputs:

I200306 19:56:58.975808 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.976165 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.976944 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.977016 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.977571 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.978346 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.978482 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.979114 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.980437 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.980556 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.981144 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.982078 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.982134 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.982741 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.983904 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.983971 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch 3 times, most recently from 7f28db7 to 04b1121 Compare March 7, 2020 07:12
@otan

otan commented Mar 7, 2020

Whoops, fixed!

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch 2 times, most recently from e1091e8 to 83e4b29 Compare March 9, 2020 16:51
@otan

otan commented Mar 9, 2020

There was a data race error; I have PR #45896 for it for now.

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch 3 times, most recently from 9b9caad to 7587b6e Compare March 9, 2020 18:29
@otan

otan commented Mar 9, 2020

The data race error seems blocking... I will wait for that to resolve.

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch from 7587b6e to 34b4740 Compare March 9, 2020 20:37
@otan otan merged commit c28acc2 into cockroachdb:master Mar 9, 2020
SQL Features (Deprecated - use SQL Experience board) automation moved this from Open PRs to Done Mar 9, 2020
@otan otan deleted the temp_schema_cleanup_boogaloo branch March 9, 2020 21:32
Successfully merging this pull request may close these issues.

cleanup loose hanging temporary tables