
sqlstats: coordinated cluster-wide SQL stats flush#169074

Draft
kyle-a-wong wants to merge 3 commits into cockroachdb:master from kyle-a-wong:sql_stats_flush_job

Conversation

@kyle-a-wong
Contributor

Summary

Adds a singleton coordinator job that drives cluster-wide SQL stats flushes
instead of having each node flush its in-memory stats independently. The
new sql.stats.flush.coordinated.enabled cluster setting opts a cluster
into this mode; per-node flush loops idle and the coordinator owns the
write path.

The coordinator job (sqlActivityFlushJob) drains every SQL instance via
the streaming DrainSqlStats RPC, merges chunks in memory, and writes the
result to system.statement_statistics / system.transaction_statistics.
The collapse-into-one-row design is gated off when
sql.metrics.statement_details.gateway_node.enabled is on (per-node
attribution would be lost otherwise).

Approach

The first two commits are mechanical preconditions:

  • sqlstats: convert DrainSqlStats to streaming RPC turns a unary
    RPC into a server-streaming one so a node with many fingerprints
    doesn't trip the gRPC max-message-size cliff.
  • sqlstats: pre-size result slices in DrainStats is a small
    allocator win on the drain path the coordinator now hits N times per
    cycle.

The third commit, sqlstats: add coordinated SQL stats flush, is the
substantive change — coordinator job, collector, writer, opt-in setting,
correctness tests, microbench + e2e benchmarks.
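The chunking idea behind the first commit can be sketched in a few lines. This is a hypothetical illustration, not the actual CockroachDB API: `stmtStat` and `chunkStats` are invented names, and the chunk size stands in for `sql.stats.drain.chunk_size` (default 1000).

```go
package main

import "fmt"

// stmtStat is a stand-in for a drained statement-stats entry.
type stmtStat struct {
	fingerprintID uint64
	count         int64
}

// chunkStats splits drained stats into batches of at most chunkSize, so each
// streamed message stays well under the gRPC max-message-size limit.
func chunkStats(stats []stmtStat, chunkSize int) [][]stmtStat {
	var chunks [][]stmtStat
	for len(stats) > 0 {
		n := chunkSize
		if n > len(stats) {
			n = len(stats)
		}
		chunks = append(chunks, stats[:n])
		stats = stats[n:]
	}
	return chunks
}

func main() {
	chunks := chunkStats(make([]stmtStat, 2500), 1000)
	fmt.Println(len(chunks))             // 3
	fmt.Println(len(chunks[len(chunks)-1])) // 500
}
```

A failed `Send` under this scheme loses only one chunk's worth of stats rather than the whole node's, which is the loss-window shrink the commit message describes.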

Known limitations / TODOs

  • CoordinatedFlushEnabled currently gates on clusterversion.Latest as
    a conservative proxy. Once a dedicated version key is minted alongside
    the streaming DrainSqlStats RPC, that should replace Latest.
  • The activity-update job is signalled in-process by the per-node flush
    loop; that signal does not cross node boundaries, so when the activity
    updater's sqlliveness lease lands on a different node from the
    coordinator, activity tables won't update. A follow-up needs either a
    cross-node signalling mechanism or for the coordinator to drive the
    activity update inline. Documented in code as a TODO.

Epic: none


@blathers-crl

blathers-crl Bot commented Apr 24, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.



@kyle-a-wong force-pushed the sql_stats_flush_job branch 2 times, most recently from be9087b to 957daae on April 25, 2026 at 16:13
kyle-a-wong and others added 3 commits April 26, 2026 01:19
Switch DrainSqlStats from a unary RPC returning a single
DrainStatsResponse to a server-streaming RPC delivering DrainStatsChunk
messages. This eliminates the gRPC max-message-size cliff: a node with
many statement / transaction fingerprints can otherwise produce a
response that exceeds the 4 MiB default limit, causing the entire RPC
to fail and that node's stats to be silently lost.

The rewrite preserves the three modes of the existing RPC:

- Targeted local drain (drainSqlStatsLocal): drain in-memory stats and
  emit them in chunks of sql.stats.drain.chunk_size. Stats are removed
  from memory at drain time; chunks that fail to send are lost — same
  behavior as the legacy unary form, but the loss window per failure
  shrinks from "whole node's stats" to "one chunk's stats".

- Targeted remote drain (drainSqlStatsRemote): forward to the named
  instance and proxy chunks back to the caller unchanged.

- Cluster-fanout drain (drainSqlStatsClusterFanout): open a stream to
  every live SQL instance, merge per-source chunks at the proxy via
  the legacy DrainSqlStatsRespBuilder semantics (dedup by
  (Key, AggregatedTs) for statements and (TxnFingerprintID,
  AggregatedTs) for transactions), and emit the merged result as
  chunks. Memory profile on the proxy matches the previous unary form;
  the benefit delivered is the per-message size cap.
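The fanout dedup described above can be sketched as a map keyed by the composite key. This is an illustrative stand-in (the types and `mergeChunks` are invented), showing only the shape of the "merge per-source chunks, dedup by (Key, AggregatedTs)" step:

```go
package main

import "fmt"

// mergeKey mirrors the dedup key for statements: fingerprint key plus
// aggregation-window timestamp.
type mergeKey struct {
	key          string // statement fingerprint key
	aggregatedTs int64  // aggregation window start
}

type stat struct {
	k     mergeKey
	count int64
}

// mergeChunks folds per-source chunks into one deduplicated map, combining
// entries that share the same (Key, AggregatedTs).
func mergeChunks(chunks [][]stat) map[mergeKey]int64 {
	merged := make(map[mergeKey]int64)
	for _, chunk := range chunks {
		for _, s := range chunk {
			merged[s.k] += s.count // same key from two nodes combines
		}
	}
	return merged
}

func main() {
	nodeA := []stat{{mergeKey{"SELECT _", 100}, 3}}
	nodeB := []stat{{mergeKey{"SELECT _", 100}, 2}, {mergeKey{"INSERT _", 100}, 1}}
	m := mergeChunks([][]stat{nodeA, nodeB})
	fmt.Println(len(m), m[mergeKey{"SELECT _", 100}]) // 2 5
}
```

In the real RPC the merged values are full statistics payloads combined field-by-field, not simple counters, but the keying and collision behavior are the same.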

The SQLStatusServer interface adopts a callback-shaped DrainSqlStats so
in-process callers (ssremote, persistedsqlstats coordinator) don't need
to construct a wire-stream value. The auto-generated gRPC and DRPC
server interfaces use stream types, so RegisterService now wraps
statusServer with thin grpcStatusServer / drpcStatusServer adapters
(plus grpcSystemStatusServer / drpcSystemStatusServer for the system
tenant) that translate Send into the callback. apiV2ServerOpts loses
its serverpb.StatusServer field — *statusServer no longer satisfies it
because of the streaming method — and accepts the concrete
*statusServer / *systemStatusServer types directly.
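The adapter shape described above can be sketched as follows. Everything here is a hypothetical miniature (the real interfaces are generated from protobufs and carry contexts and request types); it shows only how a thin wrapper translates the stream's `Send` into the callback the in-process server expects:

```go
package main

import "fmt"

type chunk struct{ n int }

// statusServer is the in-process implementation: callers hand it a callback
// and it emits chunks through that callback.
type statusServer struct{}

func (s *statusServer) DrainSqlStats(emit func(chunk) error) error {
	for i := 0; i < 3; i++ {
		if err := emit(chunk{i}); err != nil {
			return err
		}
	}
	return nil
}

// stream mimics a generated server-stream interface with a Send method.
type stream interface{ Send(chunk) error }

// grpcStatusServer is the thin adapter: the stream's Send method value *is*
// the callback, so adaptation is a one-liner.
type grpcStatusServer struct{ inner *statusServer }

func (g *grpcStatusServer) DrainSqlStats(st stream) error {
	return g.inner.DrainSqlStats(st.Send)
}

// recordingStream collects whatever the adapter sends.
type recordingStream struct{ sent []chunk }

func (r *recordingStream) Send(c chunk) error { r.sent = append(r.sent, c); return nil }

func main() {
	rs := &recordingStream{}
	_ = (&grpcStatusServer{&statusServer{}}).DrainSqlStats(rs)
	fmt.Println(len(rs.sent)) // 3
}
```

This is why in-process callers like the coordinator never need to fabricate a wire-stream value: they just pass a closure.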

Other changes:

- ssremote.SQLStats.DrainStats consumes the streaming RPC and
  materializes it back into the legacy SSDrainer-shaped slices.
- A pre-existing nogo shadow warning in ss_mem_iterator.go is fixed
  by renaming a local variable that was hiding the stmtKey type.
- New cluster setting sql.stats.drain.chunk_size (default 1000)
  controls per-chunk batch size.
- TestEnsureSQLStatsAreFlushedForTelemetry casts to a one-method
  interface for Diagnostics now that *statusServer no longer
  satisfies serverpb.StatusServer directly.

Epic: none

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
The result slices in Container.DrainStats and SQLStats.DrainStats are
known-length up front (the source maps are detached before iteration
begins; the per-app drains sum to TotalFingerprintCount). Pre-sizing
removes the slice-grow chain — modest but unconditional win.

Bench (5000 fingerprints):
  before: 1410 ns/op  7161 B/op  6543 allocs/op
  after:  1342 ns/op  7030 B/op  6517 allocs/op
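The pre-sizing win above is the standard Go pattern of allocating the backing array once when the final length is known. A minimal sketch (the function name is illustrative):

```go
package main

import "fmt"

// drainPreSized copies a detached map's values into a slice. Because len(src)
// is fixed before iteration, make(..., 0, n) performs a single allocation and
// append never triggers the grow-and-copy chain.
func drainPreSized(src map[uint64]int64) []int64 {
	out := make([]int64, 0, len(src))
	for _, v := range src {
		out = append(out, v)
	}
	return out
}

func main() {
	src := map[uint64]int64{1: 10, 2: 20, 3: 30}
	out := drainPreSized(src)
	fmt.Println(len(out), cap(out)) // 3 3
}
```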
Adds a singleton coordinator job that drains every SQL instance's
in-memory stats via the streaming DrainSqlStats RPC, merges the chunks
in Go, and writes the result to system.statement_statistics /
system.transaction_statistics. The new sql.stats.flush.coordinated.enabled
cluster setting opts a cluster into this mode; when on, the per-node
flush loops idle and the coordinator owns the write path.

Wiring:

  - persistedsqlstats.RunCoordinatedFlush is the per-tick entry point.
    It calls instanceCollector.Collect (drains every live SQL instance
    via DrainSqlStats, merging per-source chunks under one lock with
    cluster-size-aware map sizing) and then flushWriter.Write
    (parallelizes UPSERTs across an 8-way bounded worker pool).

  - sqlActivityFlushJob.Resume in pkg/sql/sqlstats/sqlactivityjob is
    the long-running loop. It wakes every sql.stats.flush.interval and,
    if CoordinatedFlushEnabled returns true, invokes
    RunCoordinatedFlush. The job is bootstrapped as adoptable by
    pkg/upgrade/upgrades/permanent_create_sql_activity_flush_job.go;
    only the node holding the sqlliveness lease runs Resume.

  - flush.MaybeFlush bails out early when CoordinatedFlushEnabled
    returns true, so per-node flushes don't double-write.

  - CoordinatedFlushEnabled is the effective predicate. It gates
    sql.stats.flush.coordinated.enabled on
    sqlstats.GatewayNodeEnabled: with per-node attribution on, the
    coordinator's collapse-into-nodeID=0 model would destroy the
    attribution operators want, so per-node flushes stay in charge.

  - InternalDB.SQLStatsProvider exposes the persisted stats provider
    so the resumer (which lives outside pkg/sql) can reach it without
    poking into the unexported sql.Server pointer.
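The gating logic in the wiring above reduces to a small predicate plus an early-out in the per-node flush path. This sketch uses invented types (`settings`, `maybeFlush`) but the setting semantics match the PR description:

```go
package main

import "fmt"

type settings struct {
	coordinatedEnabled bool // sql.stats.flush.coordinated.enabled
	gatewayNodeEnabled bool // sql.metrics.statement_details.gateway_node.enabled
}

// coordinatedFlushEnabled mirrors the effective predicate: the coordinator
// only takes over when the cluster opted in AND per-node gateway attribution
// is off (the collapse-into-nodeID=0 model would destroy attribution).
func coordinatedFlushEnabled(s settings) bool {
	return s.coordinatedEnabled && !s.gatewayNodeEnabled
}

// maybeFlush is the per-node flush tick: bail out when the coordinator owns
// the write path so nothing double-writes.
func maybeFlush(s settings, flushLocal func()) {
	if coordinatedFlushEnabled(s) {
		return // coordinator job will drain this node via DrainSqlStats
	}
	flushLocal()
}

func main() {
	flushed := false
	maybeFlush(settings{coordinatedEnabled: true}, func() { flushed = true })
	fmt.Println(flushed) // false: coordinator owns the write path
	maybeFlush(settings{coordinatedEnabled: true, gatewayNodeEnabled: true}, func() { flushed = true })
	fmt.Println(flushed) // true: gateway attribution keeps per-node flush
}
```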

Performance considerations baked in:

  - Merge: lock acquired once per source (not per chunk); merge maps
    pre-sized based on cluster size and attribution mode (default
    collapse mode sizes to ~per-instance fingerprint count, per-node
    mode scales linearly); flatten destination in default mode pre-
    sized to the merged map's size since everything lands in the
    single nodeID=0 bucket.

  - Write: each batch is a disjoint slice of fingerprints (concurrent
    UPSERTs hit different primary keys and don't conflict), so all
    batches across both stmt and txn tables are dispatched to a single
    bounded worker pool sized by flushWriteParallelism. Workers are
    scheduled via stopper.RunAsyncTaskEx + quotapool — the canonical
    CockroachDB pattern for bounded async work — so they participate
    in graceful shutdown and surface in goroutine dumps under
    recognizable task names. With sql.stats.flush.batch_size defaulting
    to 10, a 30K-row flush would otherwise be ~3000 sequential UPSERTs;
    8-way parallelism turns this into a ~3000/8 problem.

    BenchmarkRunCoordinatedFlush_E2E (parallel vs sequential):
      3node_1000fp:  1.61s -> 0.52s   (3.1x)
      6node_1000fp:  3.29s -> 2.15s   (1.5x)
      6node_5000fp: 17.34s -> 10.47s  (1.7x)

Includes correctness tests in coordinated_flush_test.go and benchmarks
(microbench for the merge path, e2e for the full flush) in
coordinated_flush_bench_test.go.
@kyle-a-wong force-pushed the sql_stats_flush_job branch from 957daae to 83acc10 on April 26, 2026 at 06:19