sqlstats: fix insights memleak when disabling `sql.metrics.transaction_details.enabled` #117709

abarganier · 2024-01-11T23:13:20Z

Reviewer note: best reviewed commit-wise

When we record statements to the insights system, they are cached in a
map within the lockingRegistry, keyed by session ID. The gathered
statements remain in the map until a transaction for the same session is
recorded to the insights system, which triggers a flush of the
statements to the underlying insights store.

However, if the sql.metrics.transaction_details.enabled cluster
setting is disabled while the sql.metrics.statement_details.enabled is
left enabled, a memory leak occurs. This is because disabling the
setting prevents transactions being sent to the insights system while
statements continue to be. Therefore, the cache flush trigger mechanism
is removed from the equation, meaning that statements begin to pile up
in the cache indefinitely.

This patch addresses this issue by preventing statements from being
written to the insights system if sql.metrics.transaction_details.enabled
is disabled. When the setting is disabled, we also clear the underlying caches
of any existing data.

Release note (bug fix): A memory leak within the insights system was
found to occur when sql.metrics.transaction_details.enabled was
disabled, while leaving sql.metrics.statement_details.enabled enabled.
This patch fixes the memory leak by preventing the collection of
further statement & transaction insights when
sql.metrics.transaction_details.enabled is disabled.

Epic: CRDB-20791

Workload benchmarks:

YCSB

	Type	ops(total)	ops/sec(cum)	avg(ms)	p50(ms)	p95(ms)	p99(ms)	pMax(ms)
Before	Result	1488316	24372.7	0.8	0.3	5.5	9.4	62.9
After	Result	1478367	24452.9	0.8	0.3	5.5	9.4	58.7

Insights

	Type	ops(total)	ops/sec(cum)	avg(ms)	p50(ms)	p95(ms)	p99(ms)	pMax(ms)
Before	Result	658	10.6	1834.5	1409.3	5100.3	6174.0	7247.8
After	Result	644	10.7	1850.7	1476.4	4831.8	6173.9	6979.3

Future patches are going to add new controls around when we record statements and transactions into the insights system. To make our lives easier for testing once we add these new controls, add test knobs for interceptor functions that allow tests to consume data sent to the insights system. Release note: none

cockroach-teamcity · 2024-01-11T23:13:33Z

This change is

pkg/sql/sqlstats/insights/ingester.go

When we record statements to the insights system, they are cached in a map within the `lockingRegistry`, keyed by session ID. The gathered statements remain in the map until a transaction for the same session is recorded to the insights system, which triggers a flush of the statements to the underlying insights store. However, if the `sql.metrics.transaction_details.enabled` cluster setting is disabled while the `sql.metrics.statement_details.enabled` is left enabled, a memory leak occurs. This is because disabling the setting prevents transactions being sent to the insights system while statements continue to be. Therefore, the cache flush trigger mechanism is removed from the equation, meaning that statements begin to pile up in the cache indefinitely. This patch addresses this issue by preventing statements from being written to the insights system if `sql.metrics.transaction_details.enabled` is disabled. A follow up patch will handle flushing existing data from the cache when the value of `sql.metrics.transaction_details.enabled` is changed. Release note (bug fix): A memory leak within the insights system was found to occur when `sql.metrics.transaction_details.enabled` was disabled, while leaving `sql.metrics.statement_details.enabled` enabled. This patch fixes the memory leak by preventing the collection of further statement & transaction insights when `sql.metrics.transaction_details.enabled` is disabled.

ConcurrentBufferGuard provides utilities to optimize concurrent writes to a buffer. Callers use AtomicWrite() to grab an index to the underlying buffer, and a user-defined flush function is atomically triggered when a size limit is reached. Additionally, clients can trigger the flush function directly via ForceSync(). Calling ForceSync() for special cases can be difficult though, because the API doesn't provide any means to atomically specify a particular state for that particular run of ForceSync(). For example, if I want my flush function to optionally flush an underlying buffer, I'd have to do something like: ``` atomic.StoreUint32(&myBuf.flushUnderlying, 1) myBuf.guard.ForceSync() ``` Where the flush function would read the value of `flushUnderlying`. However, give the concurrent nature of things, this provides no guarantees that **our own ForceSync() call** will be the one that picks up the value of `myBuf.flushUnderlying`. If a ForceSync() is already underway ahead of ours, but has yet to read the value of `flushUnderlying`, it could pick up the behavior gated by that condition, instead of our own `ForceSync()`. This makes it impossible to have any guarantees that such special behavior will apply to our `ForceSync()` call. To address this, we need a way to atomically set state such as `flushUnderlying` immediately prior to our `ForceSync()` call while the ConcurrentBufferGuard's write lock is being held. This guarantees that our specific ForceSync() call **only** will be the one to act on that state. To achieve this, we introduce `ForceSyncExec(fn func())`. It simply executes the provided callback prior to calling the flush function, while the ConcurrentBufferGuard's write lock is held. Then, we are able to set bits of state specific to our own `ForceSync` call. For example: ``` myBuf.guard.ForceSyncExec(func() { atomic.StoreUint32(&myBuf.flushUnderlying, 1) }) // The flush function can be responsible for setting // the value of `flushUnderlying` back to 0. ``` Release note: none

petermattis

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @koorosh, @maryliag, and @xinhaoz)

pkg/sql/sqlstats/insights/insights.go line 165 at r6 (raw file):

	// Clear clears the cache of its contents, with no guarantees around flush behavior.
	// Data may simply be erased depending on the implementation.
	Clear(ctx context.Context)

Did you consider placing the Clear method directly on Writer? I don't see the value of this separate interface, though I could be missing something justifying it.

pkg/sql/sqlstats/insights/registry.go line 31 at r6 (raw file):

type lockingRegistry struct {
	mu struct {
		syncutil.Mutex

What necessitated the addition of a mutex here? When I was looking at this code it seems like it is only called from a single goroutine (concurrentBufferIngester.ingest()). You seem to have taken pains to make sure Clear() is called from that same single goroutine.

pkg/sql/sqlstats/insights/ingester_test.go line 152 at r6 (raw file):

		// with some data. This sets us up to call Clear() with guaranteed data
		// in both, so we can assert afterward that both have been cleared.
		ingester.testKnobs.noTimedFlush = true

This feels a bit hacky to me. Perhaps ingester.Start() could take an option to not start the timed flush goroutine. Or an option to configure the timed flush period so that it can be set very high. In additional to a feeling of hackiness, this code is currently a data race since you're potentially reading noTimedFlush on a different goroutine while it is being set here. So if you keep this you need to set this variable before calling ingester.Start().

abarganier

TFTR! I agree with your comments. I've been on-call this week, but will get back to this PR once my shift has ended.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @koorosh, @maryliag, and @xinhaoz)

abarganier

Thanks for the feedback, apologies for the delay with the turnaround. I addressed your comments, RFAL!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @koorosh, @maryliag, @petermattis, and @xinhaoz)

pkg/sql/sqlstats/insights/insights.go line 165 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you consider placing the Clear method directly on Writer? I don't see the value of this separate interface, though I could be missing something justifying it.

My impression was that a Writer doesn't necessarily indicate an underlying cache exists to be cleared, hence the separate interface. However, in reality, the only implementations at play here have underlying caches backing them, so I guess it's not really necessary. I went ahead and just made it part of the Writer interface.

If some use case in the future arises where it no longer makes sense, folks can split the interface then if necessary.

pkg/sql/sqlstats/insights/registry.go line 31 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What necessitated the addition of a mutex here? When I was looking at this code it seems like it is only called from a single goroutine (concurrentBufferIngester.ingest()). You seem to have taken pains to make sure Clear() is called from that same single goroutine.

Ah, this was a result of a data race being detected within TestIngester_Clear. I was making the test reach into the underlying registry that the concurrentBufferIngester was wrapping to ensure that it was cleared from the test goroutine, which was triggering the race detector. In hindsight, we shouldn't be adding a mutex to code solely for the purposes of a test.

Perhaps the correct thing to do is to avoid making those assertions within TestIngester_Clear, and instead rely on TestInsightsRegistry_Clear to test this.

I went ahead and updated the test accordingly, and removed the mutex.

pkg/sql/sqlstats/insights/ingester_test.go line 152 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This feels a bit hacky to me. Perhaps ingester.Start() could take an option to not start the timed flush goroutine. Or an option to configure the timed flush period so that it can be set very high. In additional to a feeling of hackiness, this code is currently a data race since you're potentially reading noTimedFlush on a different goroutine while it is being set here. So if you keep this you need to set this variable before calling ingester.Start().

Thanks for the callout, I think the start options is the right idea. I went ahead and added a WithoutTimedFlush option which we can use in tests when calling .Start(). Lmk what you think.

petermattis

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @koorosh, @maryliag, and @xinhaoz)

pkg/sql/sqlstats/insights/insights.go line 165 at r6 (raw file):

Previously, abarganier (Alex Barganier) wrote…

My impression was that a Writer doesn't necessarily indicate an underlying cache exists to be cleared, hence the separate interface. However, in reality, the only implementations at play here have underlying caches backing them, so I guess it's not really necessary. I went ahead and just made it part of the Writer interface.

If some use case in the future arises where it no longer makes sense, folks can split the interface then if necessary.

Yeah, we can always refactor it in the future. I've noticed that many of the interfaces in the sqlstats area have only a single interface. Java and C++ encourage that behavior because retrofitting an interface can be awkward. But in Go retrofitting an interface is usually very straightforward and so my general advice is to only introduce an interface when there is more than one implementation. (Note, don't do anything in this PR about this).

pkg/sql/sqlstats/insights/registry.go line 218 at r7 (raw file):

		sink:     sink,
	}
	lr.statements = make(map[clusterunique.ID]*statementBuf)

You can move the creation of the statements map back into the lockingRegistry struct initialization.

A memory leak was found where disabling the cluster setting `sql.metrics.transaction_details.enabled` prevented the insights cache from flushing statements. A previous patch fixed this partially by preventing further statement & transactions from being recorded to the insights system when the cluster setting was disabled. This patch implements the other piece of the memory leak fix, which is flushing the insights caches when `sql.metrics.transaction_details.enabled` is disabled. It adds an onChange listener to the setting, which triggers a new `CacheClearer` interface implementation that handles the cache clearing. Fixes: cockroachdb#117300 Release note: none

abarganier

TFTR!

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian, @koorosh, @maryliag, @petermattis, and @xinhaoz)

pkg/sql/sqlstats/insights/insights.go line 165 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yeah, we can always refactor it in the future. I've noticed that many of the interfaces in the sqlstats area have only a single interface. Java and C++ encourage that behavior because retrofitting an interface can be awkward. But in Go retrofitting an interface is usually very straightforward and so my general advice is to only introduce an interface when there is more than one implementation. (Note, don't do anything in this PR about this).

Appreciate the insight, thank you!

pkg/sql/sqlstats/insights/registry.go line 218 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You can move the creation of the statements map back into the lockingRegistry struct initialization.

Thanks for the catch. Done.

petermattis

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian, @koorosh, @maryliag, and @xinhaoz)

abarganier · 2024-01-31T21:23:03Z

bors r=petermattis

craig · 2024-01-31T23:03:34Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2024-02-01T01:16:02Z

Build succeeded:

Bazel Essential CI (Cockroach)

abarganier requested review from dhartunian, maryliag, koorosh, xinhaoz and a team January 11, 2024 23:13

abarganier commented Jan 12, 2024

View reviewed changes

pkg/sql/sqlstats/insights/ingester.go Outdated Show resolved Hide resolved

abarganier added 2 commits January 12, 2024 15:31

abarganier force-pushed the mem-leak-cache-clear branch 3 times, most recently from aa2252b to 298e9b1 Compare January 12, 2024 20:37

petermattis reviewed Jan 17, 2024

View reviewed changes

abarganier commented Jan 19, 2024

View reviewed changes

abarganier force-pushed the mem-leak-cache-clear branch from 298e9b1 to 493f44f Compare January 30, 2024 19:55

abarganier requested a review from petermattis January 30, 2024 19:57

abarganier commented Jan 30, 2024

View reviewed changes

petermattis approved these changes Jan 31, 2024

View reviewed changes

abarganier force-pushed the mem-leak-cache-clear branch from 493f44f to 8d1b8d0 Compare January 31, 2024 15:27

abarganier requested a review from petermattis January 31, 2024 15:28

abarganier commented Jan 31, 2024

View reviewed changes

petermattis approved these changes Jan 31, 2024

View reviewed changes

craig bot merged commit 9eff166 into cockroachdb:master Feb 1, 2024
8 of 9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sqlstats: fix insights memleak when disabling `sql.metrics.transaction_details.enabled` #117709

sqlstats: fix insights memleak when disabling `sql.metrics.transaction_details.enabled` #117709

abarganier commented Jan 11, 2024 •

edited

Loading

cockroach-teamcity commented Jan 11, 2024

petermattis left a comment

abarganier left a comment

abarganier left a comment

petermattis left a comment

abarganier left a comment

petermattis left a comment

abarganier commented Jan 31, 2024

craig bot commented Jan 31, 2024

craig bot commented Feb 1, 2024

sqlstats: fix insights memleak when disabling sql.metrics.transaction_details.enabled #117709

sqlstats: fix insights memleak when disabling sql.metrics.transaction_details.enabled #117709

Conversation

abarganier commented Jan 11, 2024 • edited Loading

Reviewer note: best reviewed commit-wise

cockroach-teamcity commented Jan 11, 2024

petermattis left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

abarganier commented Jan 31, 2024

craig bot commented Jan 31, 2024

craig bot commented Feb 1, 2024

sqlstats: fix insights memleak when disabling `sql.metrics.transaction_details.enabled` #117709

sqlstats: fix insights memleak when disabling `sql.metrics.transaction_details.enabled` #117709

abarganier commented Jan 11, 2024 •

edited

Loading