
bulk,backupccl: process and persist aggregator stats #107994

Merged: 2 commits into cockroachdb:master from plumb-tracing-events on Aug 19, 2023

Conversation

adityamaru (Contributor) commented on Aug 1, 2023:

Please refer to individual commit messages.

Informs: #100126

blathers-crl (bot) commented on Aug 1, 2023:

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

cockroach-teamcity (Member) commented:

This change is Reviewable

Resolved review threads: pkg/jobs/execution_detail_utils.go, pkg/util/bulk/aggregator_stats.go
stevendanna (Collaborator) left a comment:

Nice to see this coming together. Left some drive-by comments; nothing major though, so take them or leave them.

Resolved review threads: pkg/ccl/backupccl/backup_job.go, pkg/ccl/backupccl/backup_processor.go (two threads), pkg/ccl/backupccl/backuppb/backup.go, pkg/sql/execinfrapb/data.proto, pkg/util/bulk/aggregator_stats.go
Comment on lines 96 to 101:

```go
flushTracingAggregatorFrequency = settings.RegisterDurationSetting(
	settings.TenantWritable,
	"bulkio.backup.tracing_aggregator_flush_after",
	"frequency at which backup tracing aggregator stats are flushed",
	time.Second*1,
)
```
stevendanna (Collaborator):

This seems pretty aggressive combined with the 5-second hard-coded timer in the aggregator. In the text we might want to make it clear that this is the rate at which they are flushed from the processors, not the rate at which they are flushed to storage.

adityamaru (Contributor, Author):

Whoops, this was only for proving to myself that it worked. Neither of these values should be in the seconds range, in my opinion. I'll fix this up!

adityamaru (Contributor, Author):

I'm not convinced that what I have is ideal right now. Both the per-processor and coordinator flush intervals are governed by the same cluster setting, which defaults to 1 minute. I did consider splitting it out into two cluster settings, but that feels like overkill; if we are pushing more aggregator stats through the flow, we should also be flushing them to disk more frequently.

In an ideal world I'd have this configurable per job, but cluster settings will have to do for now. Let me know what you think.

Member:

I'm tempted to just say every processor should send every 15s, without a setting, and then have the setting only apply to the flushing in the coordinator (which, as I mentioned in Slack, I'm tempted to remove later in favor of an `isProfiling` bool on the job record).

adityamaru (Contributor, Author):

Makes sense. I think the setting being at 1 minute might still be too aggressive: for a backup/restore that runs for an hour, that's 180 files showing up in that list on the Advanced Debugging page. I'll bump the default to 10 minutes, which can then be lowered on demand.
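For reference, a sketch of where this discussion points is below: a coordinator-side flush setting with a 10-minute default and a description that makes the flush-to-storage semantics explicit. This only illustrates the direction of the thread; the exact wording and default in the merged change may differ.

```go
// Sketch only, reflecting the direction of the review discussion above; the
// description and default in the merged code may differ.
flushTracingAggregatorFrequency = settings.RegisterDurationSetting(
	settings.TenantWritable,
	"bulkio.backup.tracing_aggregator_flush_after",
	"frequency at which the job coordinator flushes aggregator stats to the job_info table",
	10*time.Minute,
)
```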

@adityamaru adityamaru force-pushed the plumb-tracing-events branch 3 times, most recently from fc09a21 to 150ddab on August 3, 2023 20:00
@adityamaru adityamaru marked this pull request as ready for review August 3, 2023 20:06
@adityamaru adityamaru requested review from a team as code owners August 3, 2023 20:06
@adityamaru adityamaru requested review from a team August 3, 2023 20:06
@adityamaru adityamaru requested a review from a team as a code owner August 3, 2023 20:06
@adityamaru adityamaru requested review from rhu713, Santamaura, msirek, dt and stevendanna and removed request for a team, rhu713, Santamaura and msirek August 3, 2023 20:06
adityamaru (Contributor, Author) commented:

friendly ping for a final look @dt / @stevendanna

adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 8, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 9, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 9, 2023
This change builds on top of cockroachdb#107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
stevendanna (Collaborator) left a comment:

Thanks for addressing my comments. The additional comments are just things I noticed while re-reading and are not blocking.

Resolved review thread: pkg/ccl/backupccl/backup_job.go
Comment on lines +135 to +142:

```go
select {
case <-flushTimer.C:
```
stevendanna (Collaborator):

I wonder if something like `if time.Since(lastFlushTime) > flushTracingAggregatorFrequency.Get(&st.SV)` would be clearer. But this has grown on me on a second reading.
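For illustration, the elapsed-time variant being suggested could look roughly like the sketch below (standard library only; every identifier apart from `time.Since` is a placeholder rather than code from the PR):

```go
import "time"

// Hypothetical sketch: flush when the configured interval has elapsed,
// checked whenever an event arrives, instead of driving the loop with a timer.
func consumeAggregatorEvents(events <-chan any, flushEvery time.Duration, flush func()) {
	lastFlush := time.Now()
	for ev := range events {
		_ = ev // merge ev into the in-memory per-node, per-tag state here
		if time.Since(lastFlush) > flushEvery {
			flush()
			lastFlush = time.Now()
		}
	}
}
```

One trade-off of this form is that a flush only happens when a new event arrives, which is presumably acceptable given that processors push stats on their own cadence.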

Comment on lines +200 to +202:

```go
if meta.AggregatorEvents != nil {
	tracingAggCh <- meta.AggregatorEvents
}
```
stevendanna (Collaborator):

I wonder if we want a layer of safety here against tracing ever blocking. Namely, I'm wondering if we want some reasonably sized buffer. If the buffer is full, perhaps we then send with some timeout. If we hit the timeout, we drop the event on the floor with a warning. If you agree, perhaps we can do that in a follow-up PR so as not to block this chain of PRs you have built up.

adityamaru (Contributor, Author):

I think this is a good idea; I will file an issue and follow up with this.
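As a concrete illustration of the safety layer proposed above (a buffered channel plus a bounded wait, dropping the event with a warning if the receiver cannot keep up), a standalone sketch might look like the following; the names and buffer size are hypothetical and nothing here is taken from the eventual follow-up:

```go
import (
	"context"
	"log"
	"time"
)

// tracingAggBuf is a hypothetical buffered channel; a modest buffer absorbs
// short stalls in the coordinator without blocking the processor goroutine.
var tracingAggBuf = make(chan any, 128)

// sendAggregatorEvent tries a non-blocking send first, then waits up to
// maxWait before dropping the event with a warning.
func sendAggregatorEvent(ctx context.Context, ev any, maxWait time.Duration) {
	select {
	case tracingAggBuf <- ev: // fast path: buffer has room
		return
	default:
	}
	t := time.NewTimer(maxWait)
	defer t.Stop()
	select {
	case tracingAggBuf <- ev:
	case <-t.C:
		log.Printf("dropping tracing aggregator event after waiting %s", maxWait)
	case <-ctx.Done():
	}
}
```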

@adityamaru adityamaru requested a review from a team as a code owner August 18, 2023 13:59
@adityamaru adityamaru requested review from rharding6373 and removed request for a team and rharding6373 August 18, 2023 13:59
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
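As a loose illustration of the processor-side behavior this commit describes (periodic emission of locally aggregated stats toward the coordinator), consider the sketch below; the helper names and types are placeholders and are not the actual execinfrapb or backup processor API:

```go
import (
	"context"
	"time"
)

// pushAggregatorStatsPeriodically snapshots the locally aggregated stats on a
// ticker and hands them to push, which in the real processor would wrap them
// in a DistSQL metadata record. All identifiers here are illustrative.
func pushAggregatorStatsPeriodically(
	ctx context.Context,
	interval time.Duration,
	snapshot func() map[string][]byte, // tag -> serialized TracingAggregatorEvent
	push func(map[string][]byte),
) {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		select {
		case <-tick.C:
			push(snapshot())
		case <-ctx.Done():
			return
		}
	}
}
```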
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. The files will help in understanding the
execution state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

We are not equipped to handle special characters
in the path of a status/admin server URL. To bypass
this problem in the face of filenames with special
characters, we move the filename from the path component
of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
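To make the coordinator-side bookkeeping described above concrete, here is an illustrative sketch of retaining the latest event per (node, tag); the types are placeholders and not the actual CockroachDB implementation:

```go
// nodeStats maps an event tag (e.g. "ExportStats") to the most recently
// received serialized TracingAggregatorEvent from one node.
type nodeStats map[string][]byte

// perNode maps a SQL instance ID to that node's per-tag state. Periodically,
// the coordinator would render this map into the per-node proto files and the
// cluster-wide text summary described above.
var perNode = map[int]nodeStats{}

func recordEvent(instanceID int, tag string, event []byte) {
	stats, ok := perNode[instanceID]
	if !ok {
		stats = nodeStats{}
		perNode[instanceID] = stats
	}
	stats[tag] = event // keep only the latest event for this tag on this node
}
```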
adityamaru (Contributor, Author) commented:

TFTR!

bors r=stevendanna

craig (bot) commented on Aug 19, 2023:

Build succeeded:

@craig craig bot merged commit 39b3141 into cockroachdb:master Aug 19, 2023
2 checks passed
@adityamaru adityamaru deleted the plumb-tracing-events branch August 19, 2023 18:22
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 24, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this pull request Aug 25, 2023
108359: backupccl: hookup tracing aggregator events from the restore job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: #100126
Release note: None

109291: streamingccl: stream span config checkpoints r=stevendanna a=msbutler

This patch modifies the span config event stream to emit a checkpoint event
containing the rangefeed frontier after the event stream processes each
rangefeed cache flush.

The span config client can then use this information while processing updates.
Specifically, the subscription.Next() call may return a
checkpoint which indicates that all updates up to a given frontier have been
emitted by the rangefeed.

This patch also fixes two bugs:
- prevents sending an empty batch of updates
- prevents sending system target span config updates

Informs #106823

Release note: None

Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 25, 2023
This change builds on top of cockroachdb#107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this pull request Aug 28, 2023
108458: streamingccl: hookup tracing aggregtor events for the C2C job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Informs: #108374
Fixes: #100126
Release note: None

109529: concurrency: correctly establish joint claims when a lock is released r=nvanbenschoten a=arulajmani

This patch closes the loop on joint claims. In particular, it correctly
handles which locking requests are allowed to proceed when a lock is
released. We also handle the case where a request that holds a claim
(but not the lock) drops out without acquiring the lock. The handling
itself is simple -- the head of the locking requests wait queue that is
compatible with each other is allowed to proceed. The compatible
request(s) are said to have established a (possibly joint) claim.

Most of this patch is beefing up testing. Some of the testing additions
here weren't strictly related to the code change.

Closes #102272

Epic: none

Release note: None

Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>