
bulk,backupccl: process and persist aggregator stats #107994

Merged: 2 commits into cockroachdb:master from plumb-tracing-events on Aug 19, 2023

Conversation

adityamaru (Contributor) commented on Aug 1, 2023:

Please refer to individual commit messages.

Informs: #100126

blathers-crl (bot) commented on Aug 1, 2023:

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

cockroach-teamcity (Member) commented:

This change is Reviewable

Resolved review threads: pkg/jobs/execution_detail_utils.go, pkg/util/bulk/aggregator_stats.go
stevendanna (Collaborator) left a comment:

Nice to see this coming together. Left some drive-by comments; nothing major though, so take them or leave them.

Resolved review threads: pkg/ccl/backupccl/backup_job.go, pkg/ccl/backupccl/backup_processor.go (two threads), pkg/ccl/backupccl/backuppb/backup.go, pkg/sql/execinfrapb/data.proto, pkg/util/bulk/aggregator_stats.go
Comment on lines 96 to 101:

```go
flushTracingAggregatorFrequency = settings.RegisterDurationSetting(
	settings.TenantWritable,
	"bulkio.backup.tracing_aggregator_flush_after",
	"frequency at which backup tracing aggregator stats are flushed",
	time.Second*1,
)
```
stevendanna (Collaborator):

This seems pretty aggressive combined with the 5-second hard-coded timer in the aggregator. In the text we might want to make it clear that this is the rate at which they are flushed from the processors, not the rate at which they are flushed to storage.

adityamaru (Contributor, Author):

Whoops, this was only for proving to myself that it worked. Neither of these values should be in the seconds range, in my opinion. I'll fix this up!

adityamaru (Contributor, Author):

I'm not convinced that what I have is ideal right now. Both the per-processor and coordinator flush intervals are governed by the same cluster setting, which defaults to 1 minute. I did consider splitting it out into two cluster settings, but that feels like overkill; if we are pushing more aggregator stats through the flow, we should also be flushing them to disk more frequently.

In an ideal world I'd have this configurable per job, but cluster settings will have to do for now. Let me know what you think.

Member:

I'm tempted to just say every processor should send every 15s, without a setting, and then have the setting only apply to the flushing in the coordinator (which, as I mentioned in Slack, I'm tempted to remove later in favor of an `isProfiling` bool on the job record).

adityamaru (Contributor, Author):

Makes sense. I think the setting being at 1 minute might still be too aggressive: for a backup/restore that runs for an hour, that's 180 files showing up in that list on the Advanced Debugging page. I'll bump the default to 10 minutes, which can then be lowered on demand.
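For reference, a sketch of where this discussion points is below: a coordinator-side flush setting with a 10-minute default and a description that makes the flush-to-storage semantics explicit. This only illustrates the direction of the thread; the exact wording and default in the merged change may differ.

```go
// Sketch only, reflecting the direction of the review discussion above; the
// description and default in the merged code may differ.
flushTracingAggregatorFrequency = settings.RegisterDurationSetting(
	settings.TenantWritable,
	"bulkio.backup.tracing_aggregator_flush_after",
	"frequency at which the job coordinator flushes aggregator stats to the job_info table",
	10*time.Minute,
)
```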

@adityamaru adityamaru force-pushed the plumb-tracing-events branch 3 times, most recently from fc09a21 to 150ddab on August 3, 2023 20:00
@adityamaru adityamaru marked this pull request as ready for review August 3, 2023 20:06
@adityamaru adityamaru requested review from a team as code owners August 3, 2023 20:06
@adityamaru adityamaru requested review from a team August 3, 2023 20:06
@adityamaru adityamaru requested a review from a team as a code owner August 3, 2023 20:06
@adityamaru adityamaru requested review from rhu713, Santamaura, msirek, dt and stevendanna and removed request for a team, rhu713, Santamaura and msirek August 3, 2023 20:06
adityamaru (Contributor, Author) commented:

friendly ping for a final look @dt / @stevendanna

adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 8, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 9, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 9, 2023
This change builds on top of cockroachdb#107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
stevendanna (Collaborator) left a comment:

Thanks for addressing my comments. The additional comments are just things I noticed while re-reading and are not blocking.

Resolved review thread: pkg/ccl/backupccl/backup_job.go
Comment on lines +135 to +142:

```go
select {
case <-flushTimer.C:
```
stevendanna (Collaborator):

I wonder if something like `if time.Since(lastFlushTime) > flushTracingAggregatorFrequency.Get(&st.SV)` would be clearer. But this has grown on me on a second reading.
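For illustration, the elapsed-time variant being suggested could look roughly like the sketch below (standard library only; every identifier apart from `time.Since` is a placeholder rather than code from the PR):

```go
import "time"

// Hypothetical sketch: flush when the configured interval has elapsed,
// checked whenever an event arrives, instead of driving the loop with a timer.
func consumeAggregatorEvents(events <-chan any, flushEvery time.Duration, flush func()) {
	lastFlush := time.Now()
	for ev := range events {
		_ = ev // merge ev into the in-memory per-node, per-tag state here
		if time.Since(lastFlush) > flushEvery {
			flush()
			lastFlush = time.Now()
		}
	}
}
```

One trade-off of this form is that a flush only happens when a new event arrives, which is presumably acceptable given that processors push stats on their own cadence.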

Comment on lines +200 to +202:

```go
if meta.AggregatorEvents != nil {
	tracingAggCh <- meta.AggregatorEvents
}
```
stevendanna (Collaborator):

I wonder if we want a layer of safety here against tracing ever blocking. Namely, I'm wondering if we want some reasonably sized buffer. If the buffer is full, perhaps we then send with some timeout. If we hit the timeout, we drop the event on the floor with a warning. If you agree, perhaps we can do that in a follow-up PR so as not to block this chain of PRs you have built up.

adityamaru (Contributor, Author):

I think this is a good idea; I will file an issue and follow up with this.
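As a concrete illustration of the safety layer proposed above (a buffered channel plus a bounded wait, dropping the event with a warning if the receiver cannot keep up), a standalone sketch might look like the following; the names and buffer size are hypothetical and nothing here is taken from the eventual follow-up:

```go
import (
	"context"
	"log"
	"time"
)

// tracingAggBuf is a hypothetical buffered channel; a modest buffer absorbs
// short stalls in the coordinator without blocking the processor goroutine.
var tracingAggBuf = make(chan any, 128)

// sendAggregatorEvent tries a non-blocking send first, then waits up to
// maxWait before dropping the event with a warning.
func sendAggregatorEvent(ctx context.Context, ev any, maxWait time.Duration) {
	select {
	case tracingAggBuf <- ev: // fast path: buffer has room
		return
	default:
	}
	t := time.NewTimer(maxWait)
	defer t.Stop()
	select {
	case tracingAggBuf <- ev:
	case <-t.C:
		log.Printf("dropping tracing aggregator event after waiting %s", maxWait)
	case <-ctx.Done():
	}
}
```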

@adityamaru adityamaru requested a review from a team as a code owner August 18, 2023 13:59
@adityamaru adityamaru requested review from rharding6373 and removed request for a team and rharding6373 August 18, 2023 13:59
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
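As a loose illustration of the processor-side behavior this commit describes (periodic emission of locally aggregated stats toward the coordinator), consider the sketch below; the helper names and types are placeholders and are not the actual execinfrapb or backup processor API:

```go
import (
	"context"
	"time"
)

// pushAggregatorStatsPeriodically snapshots the locally aggregated stats on a
// ticker and hands them to push, which in the real processor would wrap them
// in a DistSQL metadata record. All identifiers here are illustrative.
func pushAggregatorStatsPeriodically(
	ctx context.Context,
	interval time.Duration,
	snapshot func() map[string][]byte, // tag -> serialized TracingAggregatorEvent
	push func(map[string][]byte),
) {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		select {
		case <-tick.C:
			push(snapshot())
		case <-ctx.Done():
			return
		}
	}
}
```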
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. The files will help in understanding the
execution state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

We are not equipped to handle special characters
in the path of a status/admin server URL. To bypass
this problem in the face of filenames with special
characters, we move the filename from the path component
of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
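To make the coordinator-side bookkeeping described above concrete, here is an illustrative sketch of retaining the latest event per (node, tag); the types are placeholders and not the actual CockroachDB implementation:

```go
// nodeStats maps an event tag (e.g. "ExportStats") to the most recently
// received serialized TracingAggregatorEvent from one node.
type nodeStats map[string][]byte

// perNode maps a SQL instance ID to that node's per-tag state. Periodically,
// the coordinator would render this map into the per-node proto files and the
// cluster-wide text summary described above.
var perNode = map[int]nodeStats{}

func recordEvent(instanceID int, tag string, event []byte) {
	stats, ok := perNode[instanceID]
	if !ok {
		stats = nodeStats{}
		perNode[instanceID] = stats
	}
	stats[tag] = event // keep only the latest event for this tag on this node
}
```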
adityamaru (Contributor, Author) commented:

TFTR!

bors r=stevendanna

craig (bot) commented on Aug 19, 2023:

Build succeeded:

@craig craig bot merged commit 39b3141 into cockroachdb:master Aug 19, 2023
2 checks passed
@adityamaru adityamaru deleted the plumb-tracing-events branch August 19, 2023 18:22
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 24, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this pull request Aug 25, 2023
108359: backupccl: hookup tracing aggregator events from the restore job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: #100126
Release note: None

109291: streamingccl: stream span config checkpoints r=stevendanna a=msbutler

This patch modifies the span config event stream to emit a checkpoint event
containing the rangefeed frontier after the event stream processes each
rangefeed cache flush.

The span config client can then use this information while processing updates.
Specifically, the subscription.Next() call may return a
checkpoint which indicates that all updates up to a given frontier have been
emitted by the rangefeed.

This patch also fixes two bugs:
- prevents sending an empty batch of updates
- prevents sending system target span config updates

Informs #106823

Release note: None

Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 25, 2023
This change builds on top of cockroachdb#107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this pull request Aug 28, 2023
108458: streamingccl: hookup tracing aggregtor events for the C2C job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Informs: #108374
Fixes: #100126
Release note: None

109529: concurrency: correctly establish joint claims when a lock is released r=nvanbenschoten a=arulajmani

This patch closes the loop on joint claims. In particular, it correctly
handles which locking requests are allowed to proceed when a lock is
released. We also handle the case where a request that holds a claim
(but not the lock) drops out without acquiring the lock. The handling
itself is simple -- the head of the locking requests wait queue that is
compatible with each other is allowed to proceed. The compatible
request(s) are said to have established a (possibly joint) claim.

Most of this patch is beefing up testing. Some of the testing additions
here weren't strictly related to the code change.

Closes #102272

Epic: none

Release note: None

Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>