
streamingccl: hook up tracing aggregator events for the C2C job #108458

Merged

Conversation


@adityamaru adityamaru commented Aug 9, 2023

This change builds on top of #107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the job_info
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Informs: #108374
Fixes: #100126
Release note: None
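
As a rough sketch of the coordinator-side flush described above, assuming a `WriteChunkedFileToJobInfo`-style helper in `pkg/jobs` (the filename scheme and marshaling here are illustrative assumptions, not the exact code in this PR):

```go
import (
	"context"
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/jobs"
	"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
	"github.com/cockroachdb/cockroach/pkg/sql/isql"
	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// flushAggregatorStats persists one serialized snapshot of the collected
// TracingAggregatorEvents as a timestamped file in the job_info table, where
// the DBConsole Job Details page can later list and download it.
func flushAggregatorStats(
	ctx context.Context, db isql.DB, jobID jobspb.JobID, data []byte,
) error {
	return db.Txn(ctx, func(ctx context.Context, txn isql.Txn) error {
		filename := fmt.Sprintf("aggregatorstats.%d.binpb", timeutil.Now().UnixNano())
		return jobs.WriteChunkedFileToJobInfo(ctx, filename, data, txn, jobID)
	})
}
```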

@adityamaru adityamaru requested review from a team as code owners August 9, 2023 16:45
@adityamaru adityamaru requested review from a team August 9, 2023 16:45
@adityamaru adityamaru requested a review from a team as a code owner August 9, 2023 16:45
@adityamaru adityamaru requested review from abarganier and yuzefovich and removed request for a team August 9, 2023 16:45
@cockroach-teamcity
Member

This change is Reviewable

@adityamaru adityamaru removed request for a team, abarganier and yuzefovich August 9, 2023 16:46
@adityamaru
Contributor Author

The first three commits are from #107994 and #108359. Only the last commit is relevant.

Collaborator

@stevendanna stevendanna left a comment


This looks good, but it prompted another thought about some of the PRs it builds on. Once those PRs are in, I think this should be good to go.

@@ -313,7 +314,9 @@ func canWrap(mode sessiondatapb.VectorizeExecMode, core *execinfrapb.ProcessorCo
case core.RestoreData != nil:
case core.Filterer != nil:
case core.StreamIngestionData != nil:
return errStreamIngestionWrap
Collaborator

Interesting. I read the linked issue and understand why we need this. I'm a little surprised we don't need this for the restore case though.

Contributor Author

I'm surprised too; let me dig some more.

Contributor Author

I think this has something to do with the fact that in c2c the stream ingestion processors push the meta downstream to another processor, the frontier processor. In restore, however, the restore data processor is the leaf processor and pushes metas to the row result writer. I'll add some logging to sanity-check that the restore data processors aren't doing any unwanted buffering of progress metas.
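
For context, the diff under review opts the stream ingestion core out of vectorized wrapping. A rough sketch of that pattern, with most cases elided (my reading of the linked issue is that a wrapped processor's periodic metas would be buffered rather than forwarded eagerly; treat that rationale as an assumption):

```go
import (
	"github.com/cockroachdb/cockroach/pkg/sql/execinfrapb"
	"github.com/cockroachdb/errors"
)

// errStreamIngestionWrap marks the stream ingestion core as unwrappable, so
// planning falls back to the row-based engine for this processor and its
// periodic aggregator metas reach the consumer as they are produced.
var errStreamIngestionWrap = errors.New(
	"core.StreamIngestionData is not supported in the vectorized engine")

// canWrap (heavily simplified) reports whether a row-based processor core
// may be wrapped and run inside a vectorized flow.
func canWrap(core *execinfrapb.ProcessorCoreUnion) error {
	switch {
	case core.StreamIngestionData != nil:
		return errStreamIngestionWrap
	default:
		return nil
	}
}
```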

@adityamaru adityamaru force-pushed the hookup-stream-ingestion-aggregator branch from 6370e46 to 53f69c9 on August 24, 2023 23:48
@adityamaru adityamaru requested a review from a team as a code owner August 24, 2023 23:48
@adityamaru adityamaru requested review from michae2 and removed request for a team August 24, 2023 23:48

blathers-crl bot commented Aug 24, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@adityamaru adityamaru force-pushed the hookup-stream-ingestion-aggregator branch from 53f69c9 to 21f4527 on August 25, 2023 13:35
@adityamaru adityamaru force-pushed the hookup-stream-ingestion-aggregator branch from 21f4527 to f0cac6f on August 25, 2023 15:27
@adityamaru adityamaru force-pushed the hookup-stream-ingestion-aggregator branch from f0cac6f to da18c77 on August 25, 2023 15:39
@adityamaru
Contributor Author

bors r=stevendanna

@craig
Contributor

craig bot commented Aug 28, 2023

Build succeeded.

@craig craig bot merged commit 9eb73a8 into cockroachdb:master Aug 28, 2023
7 of 8 checks passed
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 30, 2023
The motivation for this change was to fix the leaked goroutine in cockroachdb#109658, but it led to a larger cleanup, as explained below.

In cockroachdb#108458 and other PRs we taught the processors of backup, restore and c2c to periodically send metas containing the TracingAggregatorEvents, for the coordinator to then process and persist for improved observability. In that logic we were defensive against a scenario in which the processor received a root context with no tracing span, in which case we never initialized the tracing aggregator on the processor. This should never be possible in production code, but we want to avoid failing the replication job if it were to happen. As part of this defense we also returned `nil` instead of the meta in `Next()`. What we didn't realize was that returning a nil row and a nil meta indicates that the processor has been completely drained, and causes it to pre-emptively drain and shut down.

With this change, if the root context does not have a tracing span and we are unable to init a tracing aggregator, we simply do not allow the timer controlling the return of the TracingAggregatorEvents to fire. By doing this, we avoid a situation where a nil row and nil meta are returned too early.

As part of this cleanup, this change also simplifies the tracing aggregator. The aggregator is no longer responsible for creating a child span and registering a listener; instead it simply returns an object that can be registered as an event listener by the caller. The former approach was necessary when we wanted to decorate the span with LazyTags, but nobody uses LazyTags and we have a much better workflow for consuming aggregator events with the `Advanced Debugging` job details page.

This simplification was motivated by the fact that `processorbase.StartInternal` now takes event listeners that can be registered with the tracing span of the context that hangs off the processor and is accessible via the `processor.Ctx()` method.

Fixes: cockroachdb#109658
Release note: None
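
For readers unfamiliar with the execinfra contract, a minimal sketch of the fixed behavior (field names like `aggTimer` and `rowCh` are hypothetical, not the real processor's state machine):

```go
import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/sql/execinfrapb"
	"github.com/cockroachdb/cockroach/pkg/sql/rowenc"
)

// frontierProcessor is a stand-in for the real processor; only the fields
// needed for the sketch are shown.
type frontierProcessor struct {
	// aggTimer is armed only after the tracing aggregator is successfully
	// initialized; with no tracing span it simply never fires.
	aggTimer *time.Timer
	rowCh    chan rowenc.EncDatumRow
}

func (p *frontierProcessor) makeAggregatorEventsMeta() *execinfrapb.ProducerMetadata {
	return &execinfrapb.ProducerMetadata{} // payload elided in this sketch
}

// Next sketches the fixed contract: returning a nil row and a nil meta tells
// the consumer the processor is completely drained, so that return must not
// be reachable via the timer path before real work has finished.
func (p *frontierProcessor) Next() (rowenc.EncDatumRow, *execinfrapb.ProducerMetadata) {
	select {
	case <-p.aggTimer.C:
		// Reachable only when the aggregator was initialized: emit the
		// buffered TracingAggregatorEvents as a metadata record.
		return nil, p.makeAggregatorEventsMeta()
	case row, ok := <-p.rowCh:
		if ok {
			return row, nil
		}
	}
	return nil, nil // fully drained
}
```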
@adityamaru adityamaru deleted the hookup-stream-ingestion-aggregator branch August 30, 2023 18:57
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Aug 31, 2023
craig bot pushed a commit that referenced this pull request Aug 31, 2023
109378: backupccl: avoid splitting if the split point might be unsafe r=lidorcarmel a=lidorcarmel

Restore may use unsafe keys as split points, which can cause splits between the column families of a single row; that, in turn, can cause SQL to fail when reading the row or, worse, return wrong results.

This commit avoids splitting on keys that might be unsafe.

See the issue for more info.

Epic: none
Informs: #109483

Release note: None.
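
A sketch of the guard this implies, built on the existing `keys.EnsureSafeSplitKey` helper (the actual call site and fallback behavior in the restore code may differ):

```go
import (
	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// safeSplitPoint truncates a candidate split key to a row boundary so a
// split can never land between the column families of a single SQL row;
// if the key cannot be made safe, the caller skips the split entirely.
func safeSplitPoint(candidate roachpb.Key) (roachpb.Key, bool) {
	safe, err := keys.EnsureSafeSplitKey(candidate)
	if err != nil || len(safe) == 0 {
		return nil, false // not a decodable SQL row key: don't split here
	}
	return safe, true
}
```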

109631: sql: serialize under stress and unskip TestQueryCache r=cucaroach a=cucaroach

This test would time out under stress+race because it uses t.Parallel to run 15 tests concurrently, each of which fires up a TestServer. Fix the timeout by running the tests serially under stress.

Fixes: #105174
Epic: none
Release note: none
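
A minimal sketch of the pattern, with an `underStress` flag standing in for CockroachDB's stress detection:

```go
import "testing"

// runSubtests runs each case in parallel only when not under stress; under
// stress each TestServer is expensive enough that 15 concurrent instances
// time out, so the cases run serially instead.
func runSubtests(t *testing.T, cases []string, underStress bool) {
	for _, name := range cases {
		name := name // capture for the closure (pre-Go 1.22 idiom)
		t.Run(name, func(t *testing.T) {
			if !underStress {
				// Safe to overlap: each subtest gets its own TestServer.
				t.Parallel()
			}
			_ = name // ... start a TestServer and exercise the query cache ...
		})
	}
}
```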


109683: spanconfig: skip protected timestamps on non-table data  r=arulajmani a=aadityasondhi

Previously, when targeting a cluster backup we placed a protected timestamp using the `EverythingSpan`, which covered the entire keyspace. This was non-ideal because not all of the keyspace is included in the backup. It is especially problematic for high-churn ranges, such as node liveness and timeseries, which can accumulate lots of MVCC garbage very quickly. Placing a protected timestamp on these ranges, and thus preventing MVCC GC from running, can cause problems.

This patch introduces a new span that covers the keyspace excluded from
backup. When we encounter a span that is within those bounds, we skip
placing a protected timestamp on it.

Fixes: #102338

Release note: None
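
A rough sketch of the skip logic, with a hypothetical `excludeFromBackup` span standing in for the new keyspace bounds the patch introduces:

```go
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// shouldProtect reports whether a protected timestamp record should cover sp:
// spans that fall entirely inside the keyspace excluded from backup (e.g.
// node liveness and timeseries ranges) are skipped so MVCC GC keeps running.
func shouldProtect(sp, excludeFromBackup roachpb.Span) bool {
	return !excludeFromBackup.Contains(sp)
}
```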

109720: cluster-ui: break circular import of `rootActions` r=xinhaoz a=xinhaoz

Importing `rootActions` from `reducers.ts` in cluster-ui created a circular import, which prevented one of the redux fields involved in the cycle from having its reducer populated and omitted that field from the store altogether. Currently, this field is `uiConfig`, which affects features that rely on checking a user's SQL role, such as displaying the `Reset SQL Stats` button.

This commit extracts `rootActions` into its own file to remove the cyclic dependency.

Fixes: #97996

Release note (bug fix): On CC, the `Reset SQL Stats` button is now visible if the user has the admin role.

Using a production build of the 23.1 cluster-ui version: (screenshot: https://github.com/cockroachdb/cockroach/assets/20136951/348a941a-3be5-42ad-8b16-47cbf48f3f19)

109734: bulk,ccl: refactor tracing aggregator integration r=stevendanna a=adityamaru

The motivation for this change was to fix the leaked goroutine in #109658, but it led to a larger cleanup, as explained below.

In #108458 and other PRs we taught the processors of backup, restore and c2c to periodically send metas containing the TracingAggregatorEvents, for the coordinator to then process and persist for improved observability. In that logic we were defensive against a scenario in which the processor received a root context with no tracing span, in which case we never initialized the tracing aggregator on the processor. This should never be possible in production code, but we want to avoid failing the replication job if it were to happen. As part of this defense we also returned `nil` instead of the meta in `Next()`. What we didn't realize was that returning a nil row and a nil meta indicates that the processor has been completely drained, and causes it to pre-emptively drain and shut down.

With this change, if the root context does not have a tracing span and we are unable to init a tracing aggregator, we simply do not allow the timer controlling the return of the TracingAggregatorEvents to fire. By doing this, we avoid a situation where a nil row and nil meta are returned too early.

As part of this cleanup, this change also simplifies the tracing aggregator. The aggregator is no longer responsible for creating a child span and registering a listener; instead it simply returns an object that can be registered as an event listener by the caller. The former approach was necessary when we wanted to decorate the span with LazyTags, but nobody uses LazyTags and we have a much better workflow for consuming aggregator events with the `Advanced Debugging` job details page.

This simplification was motivated by the fact that `processorbase.StartInternal` now takes event listeners that can be registered with the tracing span of the context that hangs off the processor and is accessible via the `processor.Ctx()` method.

Fixes: #109658
Release note: None

Co-authored-by: Lidor Carmel <lidor@cockroachlabs.com>
Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Development

Successfully merging this pull request may close these issues.

jobsprofiler: periodically persist aggregated stats during job execution