obs/ash: add periodic top-N workload summary logging #165093

trunk-io[bot] merged 1 commit into cockroachdb:master
Conversation
😎 Merged successfully.
@@ -59,6 +62,16 @@ var BufferSize = settings.RegisterIntSetting(
	settings.PositiveInt,
)
Do we want cluster settings for the two constants below?

I could see an argument for logSummaryTopN being a setting, but I think a 60-second periodic log wouldn't need to be modified... would it?

Yeah, I think cluster settings for both are desirable. In general, a big structured log per minute feels too high. Maybe set it to 10m by default and let customers adjust if they want more.

What if we had a smaller structured log per minute? @dhartunian

Or rather, 10 (log lines per message) as defined feels rather small. Is this okay?

Since it's easily configurable, I don't feel super strongly. We can do 10 lines per minute if we feel it would help with diagnosing issues.
pkg/obs/ash/sampler.go
Outdated
fmt.Fprintf(&buf, "\n count=%-5d type=%-9s event=%-20s workload=%s",
	e.count, e.key.WorkEventType, e.key.WorkEvent, e.key.WorkloadID)
}
log.Ops.Infof(ctx, "%s", buf.String())
This should be a structured log with a protobuf in eventpb so that it's output as JSON and easily machine-parsed.
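For context, the shape of such a machine-parseable line can be sketched with a plain Go struct marshaled to JSON. The type and field names below are illustrative stand-ins, not the actual eventpb message added in this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ashWorkloadSummary is a hypothetical stand-in for the eventpb
// message; the real definition lives in the PR's eventpb package.
type ashWorkloadSummary struct {
	EventType     string `json:"EventType"`
	WorkEventType string `json:"WorkEventType"`
	WorkEvent     string `json:"WorkEvent"`
	WorkloadID    string `json:"WorkloadID"`
	SampleCount   int    `json:"SampleCount"`
}

// encodeSummary renders the event as the JSON payload a structured
// log line would carry, so downstream tooling can parse it without
// scraping printf-formatted text.
func encodeSummary(ev ashWorkloadSummary) (string, error) {
	b, err := json.Marshal(ev)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encodeSummary(ashWorkloadSummary{
		EventType:     "ash_workload_summary",
		WorkEventType: "IO",
		WorkEvent:     "BatchEval",
		WorkloadID:    "000000000000002a",
		SampleCount:   200,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

The key difference from the printf-style code above is that fields become named JSON keys instead of positionally formatted text.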
pkg/obs/ash/sampler.go
Outdated
pendingSamples []pendingSample
// tickSamples tracks per-workload sample counts since the last
// periodic log summary was emitted.
tickSamples map[workloadKey]int
Why do you need this? Why not loop through the ring buffer with a timestamp cutoff?
It's a bit more efficient to keep the summary data updated while doing the ticks, instead of constructing it when the log summary interval happens. But I think that efficiency gain is too small to warrant adding more state to the sampler and making the code a bit harder to follow (as @angles-n-daemons pointed out with the setting/resetting of these fields in different functions).

Changed to loop through the ring buffer with a timestamp cutoff.
Can you also show example output in the commit message? That would be helpful to see as context.
angles-n-daemons
left a comment
couple small comments, nothing blocking
pkg/obs/ash/sampler.go
Outdated
log.Ops.Infof(ctx, "%s", buf.String())

// Reset counters.
for k := range s.tickSamples {
What's the reason for maybeLogSamples resetting tickSamples and totalSamples? It feels like these should be initialized each time takeSample is called, because that's where they're going to be set.

The current code should work, but it feels a little fragile, and prone to correctness bugs if the ordering or frequency of sample vs. log operations were changed.
Ack, I made this code simpler per this and David's suggestion to just scan the ring buffer when doing the log summary.
Nice, that's great, thanks!
Force-pushed 8a730d5 to 87e7250
Force-pushed 87e7250 to 5b404b6
Force-pushed 5b404b6 to 16bbaec
angles-n-daemons
left a comment
looks good, one last comment
// maybeLogSummary emits a top-N workload summary as structured events
// to the OPS log if enough time has elapsed since the last report.
// It scans the ring buffer for samples newer than lastLogTime and
// aggregates them by workload key.
Can we add a comment stating this is coupled to table sampling? It wasn't obvious to me at first, but has subtle implications on how the various settings interact.
Force-pushed 16bbaec to d9c30a0
pkg/obs/ash/sampler.go
Outdated
var LogTopN = settings.RegisterIntSetting(
	settings.SystemVisible,
	"obs.ash.log_top_n",
	"maximum number of entries in periodic ASH workload summary",
It would be helpful to document here what it means for stuff to appear "at the top".

Ack, added.
docs/generated/eventlog.md
Outdated

| Field | Description | Sensitive |
|--|--|--|
| `WindowDurationNanos` | The duration of the reporting window in nanoseconds. | no |
I would make this millis. Do we allow sampling at intervals below a second? This is just adding zeros to the string for no huge value. Plus, millis are more human-readable. I'd also be fine with seconds if that's our lowest supported interval.

The sampler can run reliably at 100ms and 500ms from what I've seen with my benchmarks. Changed to milliseconds.
The ASH sampler holds samples only in memory, which are lost on node
restart. Add periodic structured log summaries so operators have durable
evidence of workload patterns.
Two new cluster settings control this behavior:
obs.ash.log_interval (default 10m) — how often the summary is emitted.
obs.ash.log_top_n (default 10) — max entries per summary.
Every log_interval, the sampler scans the ring buffer for samples
collected since the last report, groups them by (WorkEventType,
WorkEvent, WorkloadID), and emits one structured event per top-N entry
on the OPS channel. If there are zero samples in the window, no log is
emitted.
Example log output (one line per top-N entry, sorted by SampleCount descending):
I260309 19:25:00.123456 42 1@obs/ash/sampler.go:308 [n1] 500
={"Timestamp":1773084300123456000,"EventType":"ash_workload_summary",
"WindowDurationNanos":600000000000,"WorkEventType":"IO",
"WorkEvent":"BatchEval","WorkloadID":"000000000000002a",
"SampleCount":200}
Resolves: cockroachdb#164382
Release note (ops change): Added periodic ASH workload summary logging
to the OPS channel. Two new cluster settings, `obs.ash.log_interval`
(default 10m) and `obs.ash.log_top_n` (default 10), control how often
and how many entries are emitted. Each summary reports the most
frequently sampled workloads grouped by event type, event name, and
workload ID, providing durable visibility into workload patterns that
previously existed only in memory.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
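The grouping and top-N selection step described in the commit message above can be sketched as follows. The key and entry types are illustrative stand-ins for the PR's actual structs:

```go
package main

import (
	"fmt"
	"sort"
)

// workloadKey mirrors the (WorkEventType, WorkEvent, WorkloadID)
// grouping key from the commit message; names are illustrative.
type workloadKey struct {
	WorkEventType, WorkEvent, WorkloadID string
}

type summaryEntry struct {
	key   workloadKey
	count int
}

// topN returns the n most frequently sampled workload keys, sorted by
// count descending, matching the "one structured event per top-N
// entry, sorted by SampleCount descending" behavior described above.
func topN(counts map[workloadKey]int, n int) []summaryEntry {
	entries := make([]summaryEntry, 0, len(counts))
	for k, c := range counts {
		entries = append(entries, summaryEntry{k, c})
	}
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].count > entries[j].count
	})
	if len(entries) > n {
		entries = entries[:n]
	}
	return entries
}

func main() {
	counts := map[workloadKey]int{
		{"IO", "BatchEval", "2a"}: 200,
		{"CPU", "SQLExec", "2b"}:  120,
		{"IO", "RangeFeed", "2c"}: 15,
	}
	// Prints count=200 event=BatchEval, then count=120 event=SQLExec.
	for _, e := range topN(counts, 2) {
		fmt.Printf("count=%d event=%s\n", e.count, e.key.WorkEvent)
	}
}
```

If there are zero samples in the window, counts is empty and topN returns an empty slice, which matches the "no log is emitted" behavior.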
Force-pushed d9c30a0 to 55f1256
TFTRs!

/trunk merge