
sqlstats: coordinated cluster-wide SQL stats flush#169074

Draft
kyle-a-wong wants to merge 3 commits into cockroachdb:master from kyle-a-wong:sql_stats_flush_job

Conversation

@kyle-a-wong
Contributor

Summary

Adds a singleton coordinator job that drives cluster-wide SQL stats flushes
instead of having each node flush its in-memory stats independently. The
new sql.stats.flush.coordinated.enabled cluster setting opts a cluster
into this mode; per-node flush loops idle and the coordinator owns the
write path.

The coordinator job (sqlActivityFlushJob) drains every SQL instance via
the streaming DrainSqlStats RPC, merges chunks in memory, and writes the
result to system.statement_statistics / system.transaction_statistics.
The collapse-into-one-row design is gated off when
sql.metrics.statement_details.gateway_node.enabled is on (per-node
attribution would be lost otherwise).

Approach

The first two commits are mechanical preconditions:

  • sqlstats: convert DrainSqlStats to streaming RPC turns a unary
    RPC into a server-streaming one so a node with many fingerprints
    doesn't trip the gRPC max-message-size cliff.
  • sqlstats: pre-size result slices in DrainStats is a small
    allocator win on the drain path the coordinator now hits N times per
    cycle.

The third commit, sqlstats: add coordinated SQL stats flush, is the
substantive change — coordinator job, collector, writer, opt-in setting,
correctness tests, microbench + e2e benchmarks.
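The chunking idea behind the first commit can be sketched in a few lines. This is a hypothetical illustration, not the actual CockroachDB API: `stmtStat` and `chunkStats` are invented names, and the chunk size stands in for `sql.stats.drain.chunk_size` (default 1000).

```go
package main

import "fmt"

// stmtStat is a stand-in for a drained statement-stats entry.
type stmtStat struct {
	fingerprintID uint64
	count         int64
}

// chunkStats splits drained stats into batches of at most chunkSize, so each
// streamed message stays well under the gRPC max-message-size limit.
func chunkStats(stats []stmtStat, chunkSize int) [][]stmtStat {
	var chunks [][]stmtStat
	for len(stats) > 0 {
		n := chunkSize
		if n > len(stats) {
			n = len(stats)
		}
		chunks = append(chunks, stats[:n])
		stats = stats[n:]
	}
	return chunks
}

func main() {
	chunks := chunkStats(make([]stmtStat, 2500), 1000)
	fmt.Println(len(chunks))             // 3
	fmt.Println(len(chunks[len(chunks)-1])) // 500
}
```

A failed `Send` under this scheme loses only one chunk's worth of stats rather than the whole node's, which is the loss-window shrink the commit message describes.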

Known limitations / TODOs

  • CoordinatedFlushEnabled currently gates on clusterversion.Latest as
    a conservative proxy. Once a dedicated version key is minted alongside
    the streaming DrainSqlStats RPC, that should replace Latest.
  • The activity-update job is signalled in-process by the per-node flush
    loop; that signal does not cross node boundaries, so when the activity
    updater's sqlliveness lease lands on a different node from the
    coordinator, activity tables won't update. A follow-up needs either a
    cross-node signalling mechanism or for the coordinator to drive the
    activity update inline. Documented in code as a TODO.

Epic: none


@blathers-crl

blathers-crl Bot commented Apr 24, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.



@kyle-a-wong force-pushed the sql_stats_flush_job branch 2 times, most recently from be9087b to 957daae on April 25, 2026 at 16:13
kyle-a-wong and others added 3 commits April 26, 2026 01:19
Switch DrainSqlStats from a unary RPC returning a single
DrainStatsResponse to a server-streaming RPC delivering DrainStatsChunk
messages. This eliminates the gRPC max-message-size cliff: a node with
many statement / transaction fingerprints can otherwise produce a
response that exceeds the 4 MiB default limit, causing the entire RPC
to fail and that node's stats to be silently lost.

The rewrite preserves the three modes of the existing RPC:

- Targeted local drain (drainSqlStatsLocal): drain in-memory stats and
  emit them in chunks of sql.stats.drain.chunk_size. Stats are removed
  from memory at drain time; chunks that fail to send are lost — same
  behavior as the legacy unary form, but the loss window per failure
  shrinks from "whole node's stats" to "one chunk's stats".

- Targeted remote drain (drainSqlStatsRemote): forward to the named
  instance and proxy chunks back to the caller unchanged.

- Cluster-fanout drain (drainSqlStatsClusterFanout): open a stream to
  every live SQL instance, merge per-source chunks at the proxy via
  the legacy DrainSqlStatsRespBuilder semantics (dedup by
  (Key, AggregatedTs) for statements and (TxnFingerprintID,
  AggregatedTs) for transactions), and emit the merged result as
  chunks. Memory profile on the proxy matches the previous unary form;
  the benefit delivered is the per-message size cap.
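The fanout dedup described above can be sketched as a map keyed by the composite key. This is an illustrative stand-in (the types and `mergeChunks` are invented), showing only the shape of the "merge per-source chunks, dedup by (Key, AggregatedTs)" step:

```go
package main

import "fmt"

// mergeKey mirrors the dedup key for statements: fingerprint key plus
// aggregation-window timestamp.
type mergeKey struct {
	key          string // statement fingerprint key
	aggregatedTs int64  // aggregation window start
}

type stat struct {
	k     mergeKey
	count int64
}

// mergeChunks folds per-source chunks into one deduplicated map, combining
// entries that share the same (Key, AggregatedTs).
func mergeChunks(chunks [][]stat) map[mergeKey]int64 {
	merged := make(map[mergeKey]int64)
	for _, chunk := range chunks {
		for _, s := range chunk {
			merged[s.k] += s.count // same key from two nodes combines
		}
	}
	return merged
}

func main() {
	nodeA := []stat{{mergeKey{"SELECT _", 100}, 3}}
	nodeB := []stat{{mergeKey{"SELECT _", 100}, 2}, {mergeKey{"INSERT _", 100}, 1}}
	m := mergeChunks([][]stat{nodeA, nodeB})
	fmt.Println(len(m), m[mergeKey{"SELECT _", 100}]) // 2 5
}
```

In the real RPC the merged values are full statistics payloads combined field-by-field, not simple counters, but the keying and collision behavior are the same.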

The SQLStatusServer interface adopts a callback-shaped DrainSqlStats so
in-process callers (ssremote, persistedsqlstats coordinator) don't need
to construct a wire-stream value. The auto-generated gRPC and DRPC
server interfaces use stream types, so RegisterService now wraps
statusServer with thin grpcStatusServer / drpcStatusServer adapters
(plus grpcSystemStatusServer / drpcSystemStatusServer for the system
tenant) that translate Send into the callback. apiV2ServerOpts loses
its serverpb.StatusServer field — *statusServer no longer satisfies it
because of the streaming method — and accepts the concrete
*statusServer / *systemStatusServer types directly.
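The adapter shape described above can be sketched as follows. Everything here is a hypothetical miniature (the real interfaces are generated from protobufs and carry contexts and request types); it shows only how a thin wrapper translates the stream's `Send` into the callback the in-process server expects:

```go
package main

import "fmt"

type chunk struct{ n int }

// statusServer is the in-process implementation: callers hand it a callback
// and it emits chunks through that callback.
type statusServer struct{}

func (s *statusServer) DrainSqlStats(emit func(chunk) error) error {
	for i := 0; i < 3; i++ {
		if err := emit(chunk{i}); err != nil {
			return err
		}
	}
	return nil
}

// stream mimics a generated server-stream interface with a Send method.
type stream interface{ Send(chunk) error }

// grpcStatusServer is the thin adapter: the stream's Send method value *is*
// the callback, so adaptation is a one-liner.
type grpcStatusServer struct{ inner *statusServer }

func (g *grpcStatusServer) DrainSqlStats(st stream) error {
	return g.inner.DrainSqlStats(st.Send)
}

// recordingStream collects whatever the adapter sends.
type recordingStream struct{ sent []chunk }

func (r *recordingStream) Send(c chunk) error { r.sent = append(r.sent, c); return nil }

func main() {
	rs := &recordingStream{}
	_ = (&grpcStatusServer{&statusServer{}}).DrainSqlStats(rs)
	fmt.Println(len(rs.sent)) // 3
}
```

This is why in-process callers like the coordinator never need to fabricate a wire-stream value: they just pass a closure.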

Other changes:

- ssremote.SQLStats.DrainStats consumes the streaming RPC and
  materializes it back into the legacy SSDrainer-shaped slices.
- A pre-existing nogo shadow warning in ss_mem_iterator.go is fixed
  by renaming a local variable that was hiding the stmtKey type.
- New cluster setting sql.stats.drain.chunk_size (default 1000)
  controls per-chunk batch size.
- TestEnsureSQLStatsAreFlushedForTelemetry casts to a one-method
  interface for Diagnostics now that *statusServer no longer
  satisfies serverpb.StatusServer directly.

Epic: none

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
The result slices in Container.DrainStats and SQLStats.DrainStats are
known-length up front (the source maps are detached before iteration
begins; the per-app drains sum to TotalFingerprintCount). Pre-sizing
removes the slice-grow chain — modest but unconditional win.

Bench (5000 fingerprints):
  before: 1410 ns/op  7161 B/op  6543 allocs/op
  after:  1342 ns/op  7030 B/op  6517 allocs/op
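The pre-sizing win above is the standard Go pattern of allocating the backing array once when the final length is known. A minimal sketch (the function name is illustrative):

```go
package main

import "fmt"

// drainPreSized copies a detached map's values into a slice. Because len(src)
// is fixed before iteration, make(..., 0, n) performs a single allocation and
// append never triggers the grow-and-copy chain.
func drainPreSized(src map[uint64]int64) []int64 {
	out := make([]int64, 0, len(src))
	for _, v := range src {
		out = append(out, v)
	}
	return out
}

func main() {
	src := map[uint64]int64{1: 10, 2: 20, 3: 30}
	out := drainPreSized(src)
	fmt.Println(len(out), cap(out)) // 3 3
}
```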
Adds a singleton coordinator job that drains every SQL instance's
in-memory stats via the streaming DrainSqlStats RPC, merges the chunks
in Go, and writes the result to system.statement_statistics /
system.transaction_statistics. The new sql.stats.flush.coordinated.enabled
cluster setting opts a cluster into this mode; when on, the per-node
flush loops idle and the coordinator owns the write path.

Wiring:

  - persistedsqlstats.RunCoordinatedFlush is the per-tick entry point.
    It calls instanceCollector.Collect (drains every live SQL instance
    via DrainSqlStats, merging per-source chunks under one lock with
    cluster-size-aware map sizing) and then flushWriter.Write
    (parallelizes UPSERTs across an 8-way bounded worker pool).

  - sqlActivityFlushJob.Resume in pkg/sql/sqlstats/sqlactivityjob is
    the long-running loop. It wakes every sql.stats.flush.interval and,
    if CoordinatedFlushEnabled returns true, invokes
    RunCoordinatedFlush. The job is bootstrapped as adoptable by
    pkg/upgrade/upgrades/permanent_create_sql_activity_flush_job.go;
    only the node holding the sqlliveness lease runs Resume.

  - flush.MaybeFlush bails out early when CoordinatedFlushEnabled
    returns true, so per-node flushes don't double-write.

  - CoordinatedFlushEnabled is the effective predicate. It gates
    sql.stats.flush.coordinated.enabled on
    sqlstats.GatewayNodeEnabled: with per-node attribution on, the
    coordinator's collapse-into-nodeID=0 model would destroy the
    attribution operators want, so per-node flushes stay in charge.

  - InternalDB.SQLStatsProvider exposes the persisted stats provider
    so the resumer (which lives outside pkg/sql) can reach it without
    poking into the unexported sql.Server pointer.
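The gating logic in the wiring above reduces to a small predicate plus an early-out in the per-node flush path. This sketch uses invented types (`settings`, `maybeFlush`) but the setting semantics match the PR description:

```go
package main

import "fmt"

type settings struct {
	coordinatedEnabled bool // sql.stats.flush.coordinated.enabled
	gatewayNodeEnabled bool // sql.metrics.statement_details.gateway_node.enabled
}

// coordinatedFlushEnabled mirrors the effective predicate: the coordinator
// only takes over when the cluster opted in AND per-node gateway attribution
// is off (the collapse-into-nodeID=0 model would destroy attribution).
func coordinatedFlushEnabled(s settings) bool {
	return s.coordinatedEnabled && !s.gatewayNodeEnabled
}

// maybeFlush is the per-node flush tick: bail out when the coordinator owns
// the write path so nothing double-writes.
func maybeFlush(s settings, flushLocal func()) {
	if coordinatedFlushEnabled(s) {
		return // coordinator job will drain this node via DrainSqlStats
	}
	flushLocal()
}

func main() {
	flushed := false
	maybeFlush(settings{coordinatedEnabled: true}, func() { flushed = true })
	fmt.Println(flushed) // false: coordinator owns the write path
	maybeFlush(settings{coordinatedEnabled: true, gatewayNodeEnabled: true}, func() { flushed = true })
	fmt.Println(flushed) // true: gateway attribution keeps per-node flush
}
```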

Performance considerations baked in:

  - Merge: lock acquired once per source (not per chunk); merge maps
    pre-sized based on cluster size and attribution mode (default
    collapse mode sizes to ~per-instance fingerprint count, per-node
    mode scales linearly); flatten destination in default mode pre-
    sized to the merged map's size since everything lands in the
    single nodeID=0 bucket.

  - Write: each batch is a disjoint slice of fingerprints (concurrent
    UPSERTs hit different primary keys and don't conflict), so all
    batches across both stmt and txn tables are dispatched to a single
    bounded worker pool sized by flushWriteParallelism. Workers are
    scheduled via stopper.RunAsyncTaskEx + quotapool — the canonical
    CockroachDB pattern for bounded async work — so they participate
    in graceful shutdown and surface in goroutine dumps under
    recognizable task names. With sql.stats.flush.batch_size defaulting
    to 10, a 30K-row flush would otherwise be ~3000 sequential UPSERTs;
    8-way parallelism turns this into a ~3000/8 problem.

    BenchmarkRunCoordinatedFlush_E2E (parallel vs sequential):
      3node_1000fp:  1.61s -> 0.52s   (3.1x)
      6node_1000fp:  3.29s -> 2.15s   (1.5x)
      6node_5000fp: 17.34s -> 10.47s  (1.7x)

Includes correctness tests in coordinated_flush_test.go and benchmarks
(microbench for the merge path, e2e for the full flush) in
coordinated_flush_bench_test.go.
@kyle-a-wong force-pushed the sql_stats_flush_job branch from 957daae to 83acc10 on April 26, 2026 at 06:19