sqlstats: coordinated cluster-wide SQL stats flush #169074
Draft
kyle-a-wong wants to merge 3 commits into cockroachdb:master
Force-pushed from be9087b to 957daae
Switch DrainSqlStats from a unary RPC returning a single DrainStatsResponse to a server-streaming RPC delivering DrainStatsChunk messages. This eliminates the gRPC max-message-size cliff: a node with many statement / transaction fingerprints can otherwise produce a response that exceeds the 4 MiB default limit, causing the entire RPC to fail and that node's stats to be silently lost.

The rewrite preserves the three modes of the existing RPC:
- Targeted local drain (drainSqlStatsLocal): drain in-memory stats and emit them in chunks of sql.stats.drain.chunk_size. Stats are removed from memory at drain time; chunks that fail to send are lost — same behavior as the legacy unary form, but the loss window per failure shrinks from "whole node's stats" to "one chunk's stats".
- Targeted remote drain (drainSqlStatsRemote): forward to the named instance and proxy chunks back to the caller unchanged.
- Cluster-fanout drain (drainSqlStatsClusterFanout): open a stream to every live SQL instance, merge per-source chunks at the proxy via the legacy DrainSqlStatsRespBuilder semantics (dedup by (Key, AggregatedTs) for statements and (TxnFingerprintID, AggregatedTs) for transactions), and emit the merged result as chunks. Memory profile on the proxy matches the previous unary form; the benefit delivered is the per-message size cap.

The SQLStatusServer interface adopts a callback-shaped DrainSqlStats so in-process callers (ssremote, persistedsqlstats coordinator) don't need to construct a wire-stream value. The auto-generated gRPC and DRPC server interfaces use stream types, so RegisterService now wraps statusServer with thin grpcStatusServer / drpcStatusServer adapters (plus grpcSystemStatusServer / drpcSystemStatusServer for the system tenant) that translate Send into the callback. apiV2ServerOpts loses its serverpb.StatusServer field — *statusServer no longer satisfies it because of the streaming method — and accepts the concrete *statusServer / *systemStatusServer types directly.

Other changes:
- ssremote.SQLStats.DrainStats consumes the streaming RPC and materializes it back into the legacy SSDrainer-shaped slices.
- A pre-existing nogo shadow warning in ss_mem_iterator.go is fixed by renaming a local variable that was hiding the stmtKey type.
- New cluster setting sql.stats.drain.chunk_size (default 1000) controls per-chunk batch size.
- TestEnsureSQLStatsAreFlushedForTelemetry casts to a one-method interface for Diagnostics now that *statusServer no longer satisfies serverpb.StatusServer directly.

Epic: none

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
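For concreteness, here is a minimal sketch of the callback-shaped interface and the Send-to-callback adapter described above. Only DrainSqlStats, DrainStatsChunk, and grpcStatusServer are named in the commit message; every field and exact signature below is an illustrative assumption, not the PR's actual code.

```go
// Sketch only: request/chunk fields and signatures are assumptions.
package sketch

import "context"

type DrainStatsRequest struct{}
type DrainStatsChunk struct{} // carries one batch of stmt/txn fingerprints

// In-process callers receive chunks through a callback, so they never
// have to construct a wire-stream value.
type SQLStatusServer interface {
	DrainSqlStats(ctx context.Context, req *DrainStatsRequest, emit func(*DrainStatsChunk) error) error
}

// The generated gRPC server interface hands us a stream; the thin
// adapter translates the stream's Send into the callback.
type drainStatsStream interface {
	Context() context.Context
	Send(*DrainStatsChunk) error
}

type grpcStatusServer struct{ inner SQLStatusServer }

func (s *grpcStatusServer) DrainSqlStats(req *DrainStatsRequest, stream drainStatsStream) error {
	return s.inner.DrainSqlStats(stream.Context(), req, stream.Send)
}
```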
The result slices in Container.DrainStats and SQLStats.DrainStats are known-length up front (the source maps are detached before iteration begins; the per-app drains sum to TotalFingerprintCount). Pre-sizing removes the slice-grow chain — modest but unconditional win.

Bench (5000 fingerprints):
  before: 1410 ns/op   7161 B/op   6543 allocs/op
  after:  1342 ns/op   7030 B/op   6517 allocs/op
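The pattern, reduced to a toy example (the element type is a placeholder; the real code drains collected statement/transaction statistics):

```go
// drainAll illustrates the pre-sizing: the source maps are detached
// before iteration and their sizes sum to total, so the destination can
// be allocated once instead of growing through append's doubling chain.
func drainAll[T any](detached []map[string]T, total int) []T {
	out := make([]T, 0, total) // single allocation, no regrowth
	for _, m := range detached {
		for _, v := range m {
			out = append(out, v)
		}
	}
	return out
}
```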
Adds a singleton coordinator job that drains every SQL instance's
in-memory stats via the streaming DrainSqlStats RPC, merges the chunks
in Go, and writes the result to system.statement_statistics /
system.transaction_statistics. The new sql.stats.flush.coordinated.enabled
cluster setting opts a cluster into this mode; when on, the per-node
flush loops idle and the coordinator owns the write path.
Wiring (the resumer loop is sketched in code after this list):
- persistedsqlstats.RunCoordinatedFlush is the per-tick entry point.
It calls instanceCollector.Collect (drains every live SQL instance
via DrainSqlStats, merging per-source chunks under one lock with
cluster-size-aware map sizing) and then flushWriter.Write
(parallelizes UPSERTs across an 8-way bounded worker pool).
- sqlActivityFlushJob.Resume in pkg/sql/sqlstats/sqlactivityjob is
the long-running loop. It wakes every sql.stats.flush.interval and,
if CoordinatedFlushEnabled returns true, invokes
RunCoordinatedFlush. The job is bootstrapped as adoptable by
pkg/upgrade/upgrades/permanent_create_sql_activity_flush_job.go;
only the node holding the sqlliveness lease runs Resume.
- flush.MaybeFlush bails out early when CoordinatedFlushEnabled
returns true, so per-node flushes don't double-write.
- CoordinatedFlushEnabled is the effective predicate. It gates
sql.stats.flush.coordinated.enabled on
sqlstats.GatewayNodeEnabled: with per-node attribution on, the
coordinator's collapse-into-nodeID=0 model would destroy the
attribution operators want, so per-node flushes stay in charge.
- InternalDB.SQLStatsProvider exposes the persisted stats provider
so the resumer (which lives outside pkg/sql) can reach it without
poking into the unexported sql.Server pointer.
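A self-contained sketch of that resumer loop, with configuration flattened into plain parameters. The real job reads cluster settings and reaches the provider through InternalDB.SQLStatsProvider; the names below are stand-ins except RunCoordinatedFlush and CoordinatedFlushEnabled.

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// runCoordinatorLoop stands in for sqlActivityFlushJob.Resume: wake every
// sql.stats.flush.interval and, if the cluster has opted in, run one
// collect-merge-write cycle. Only the node holding the sqlliveness lease
// executes this loop.
func runCoordinatorLoop(
	ctx context.Context,
	flushInterval time.Duration, // sql.stats.flush.interval
	coordinatedFlushEnabled func() bool, // CoordinatedFlushEnabled
	runCoordinatedFlush func(context.Context) error,
) error {
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if !coordinatedFlushEnabled() {
				continue // per-node flush loops own the write path
			}
			if err := runCoordinatedFlush(ctx); err != nil {
				log.Printf("coordinated sql stats flush failed: %v", err)
			}
		}
	}
}
```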
Performance considerations baked in:
- Merge: lock acquired once per source (not per chunk); merge maps
pre-sized based on cluster size and attribution mode (default
collapse mode sizes to ~per-instance fingerprint count, per-node
mode scales linearly); flatten destination in default mode pre-
sized to the merged map's size since everything lands in the
single nodeID=0 bucket.
- Write: each batch is a disjoint slice of fingerprints (concurrent
UPSERTs hit different primary keys and don't conflict), so all
batches across both stmt and txn tables are dispatched to a single
bounded worker pool sized by flushWriteParallelism. Workers are
scheduled via stopper.RunAsyncTaskEx + quotapool — the canonical
CockroachDB pattern for bounded async work — so they participate
in graceful shutdown and surface in goroutine dumps under
recognizable task names. With sql.stats.flush.batch_size defaulting
to 10, a 30K-row flush would otherwise be ~3000 sequential UPSERTs;
8-way parallelism turns this into a ~3000/8 problem (the dispatch pattern is sketched below).
BenchmarkRunCoordinatedFlush_E2E (parallel vs sequential):
3node_1000fp: 1.61s -> 0.52s (3.1x)
6node_1000fp: 3.29s -> 2.15s (1.5x)
6node_5000fp: 17.34s -> 10.47s (1.7x)
Includes correctness tests in coordinated_flush_test.go and benchmarks
(microbench for the merge path, e2e for the full flush) in
coordinated_flush_bench_test.go.
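The dispatch pattern referenced above, sketched with hypothetical types: fingerprint, upsertBatch, and the pool/task names are stand-ins, while stop.TaskOpts with Sem/WaitForSem is the real stopper+quotapool idiom the commit message names.

```go
// Sketch only: error aggregation from failed UPSERTs is elided.
package sketch

import (
	"context"
	"sync"

	"github.com/cockroachdb/cockroach/pkg/util/quotapool"
	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

type fingerprint struct{} // stand-in for one stmt/txn stats row

const flushWriteParallelism = 8

func upsertBatch(ctx context.Context, batch []fingerprint) { /* one UPSERT per batch */ }

// writeBatches dispatches every batch (statement and transaction alike)
// onto a single bounded pool: at most flushWriteParallelism UPSERTs are
// in flight at once, and each task shows up under its task name in
// goroutine dumps and participates in graceful shutdown.
func writeBatches(ctx context.Context, stopper *stop.Stopper, batches [][]fingerprint) error {
	sem := quotapool.NewIntPool("sqlstats-flush-writer", flushWriteParallelism)
	var wg sync.WaitGroup
	for _, batch := range batches {
		batch := batch
		wg.Add(1)
		err := stopper.RunAsyncTaskEx(ctx, stop.TaskOpts{
			TaskName:   "sqlstats.coordinated-flush-write",
			Sem:        sem,
			WaitForSem: true, // block until a worker slot frees
		}, func(ctx context.Context) {
			defer wg.Done()
			upsertBatch(ctx, batch) // disjoint fingerprints: no PK conflicts
		})
		if err != nil { // stopper quiescing: the task never ran
			wg.Done()
			return err
		}
	}
	wg.Wait()
	return nil
}
```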
Force-pushed from 957daae to 83acc10
Summary
Adds a singleton coordinator job that drives cluster-wide SQL stats flushes instead of having each node flush its in-memory stats independently. The new sql.stats.flush.coordinated.enabled cluster setting opts a cluster into this mode; per-node flush loops idle and the coordinator owns the write path.

The coordinator job (sqlActivityFlushJob) drains every SQL instance via the streaming DrainSqlStats RPC, merges chunks in memory, and writes the result to system.statement_statistics / system.transaction_statistics. The collapse-into-one-row design is gated off when sql.metrics.statement_details.gateway_node.enabled is on (per-node attribution would be lost otherwise).
Approach
The first two commits are mechanical preconditions:

- sqlstats: convert DrainSqlStats to streaming RPC turns a unary RPC into a server-streaming one so a node with many fingerprints doesn't trip the gRPC max-message-size cliff.
- sqlstats: pre-size result slices in DrainStats is a small allocator win on the drain path the coordinator now hits N times per cycle.

The third commit, sqlstats: add coordinated SQL stats flush, is the substantive change — coordinator job, collector, writer, opt-in setting, correctness tests, microbench + e2e benchmarks.
Known limitations / TODOs
- CoordinatedFlushEnabled currently gates on clusterversion.Latest as a conservative proxy. Once a dedicated version key is minted alongside the streaming DrainSqlStats RPC, that should replace Latest. (The predicate is sketched below.)
- The sql activity update job is triggered by a signal from the local node's flush loop; that signal does not cross node boundaries, so when the activity updater's sqlliveness lease lands on a different node from the coordinator, activity tables won't update. A follow-up needs either a cross-node signalling mechanism or for the coordinator to drive the activity update inline. Documented in code as a TODO.
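A sketch of what that predicate plausibly looks like. The setting variable name and registration are assumptions; GatewayNodeEnabled, the collapse rationale, and the clusterversion.Latest placeholder come from the description above.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/settings"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	"github.com/cockroachdb/cockroach/pkg/sql/sqlstats"
)

// coordinatedFlushEnabled is a hypothetical name for the variable
// backing sql.stats.flush.coordinated.enabled.
var coordinatedFlushEnabled = settings.RegisterBoolSetting(
	settings.ApplicationLevel,
	"sql.stats.flush.coordinated.enabled",
	"if set, a singleton coordinator job owns the sql stats write path",
	false,
)

func CoordinatedFlushEnabled(ctx context.Context, st *cluster.Settings) bool {
	if !coordinatedFlushEnabled.Get(&st.SV) {
		return false
	}
	if sqlstats.GatewayNodeEnabled.Get(&st.SV) {
		// Per-node attribution is on; collapsing into nodeID=0 would
		// destroy it, so per-node flushes stay in charge.
		return false
	}
	// TODO: replace Latest with a dedicated version key minted
	// alongside the streaming DrainSqlStats RPC.
	return st.Version.IsActive(ctx, clusterversion.Latest)
}
```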
Epic: none