Skip to content

Conversation

@katmayb
Copy link
Contributor

@katmayb katmayb commented Jan 10, 2025

Fixes DOC-11998

This PR adds a guide for monitoring changefeeds, particularly as a pipeline. Includes recommended metrics, information on potential impact of high values, and suggested dashboards.

Preview

https://deploy-preview-19296--cockroachdb-docs.netlify.app/docs/v25.1/changefeed-monitoring-guide.html

@github-actions
Copy link

github-actions bot commented Jan 10, 2025

@netlify
Copy link

netlify bot commented Jan 10, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit c81d7f0
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/67d88082a7255c000829f50f

@netlify
Copy link

netlify bot commented Jan 10, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit c81d7f0
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-api-docs/deploys/67d880824ef423000899efb4

@katmayb katmayb force-pushed the cdc-monitoring-guide branch from f41db60 to 8db2635 Compare January 10, 2025 21:21
@netlify
Copy link

netlify bot commented Jan 10, 2025

Netlify Preview

Name Link
🔨 Latest commit f41db60
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/67818e9fe862350008649ae2
😎 Deploy Preview https://deploy-preview-19296--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@netlify
Copy link

netlify bot commented Jan 10, 2025

Netlify Preview

Name Link
🔨 Latest commit c81d7f0
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/67d880824c780b00087ef5f4
😎 Deploy Preview https://deploy-preview-19296--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@katmayb katmayb force-pushed the cdc-monitoring-guide branch 2 times, most recently from edad6e5 to 1cd9186 Compare January 16, 2025 20:39
Copy link

@rohan-joshi rohan-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some comments - really appreciate all of this!

- [Sink errors over time](#sink-errors)
- [Retry counts](#downstream-delivery)

## Common troubleshooting scenarios

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after some thinking - let's remove the common troubleshooting scenario section, including all the subsections (high end to end latency, rangefeed pressure, sink perf issues).

- Scoped by `changefeed_job_id`
- Supported Versions: v23.2.13+, v24.1.6+, v24.2.4+, v24.3.0+

## Suggested dashboards

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this section entirely

- Resource usage during catch-up after restarts.
- Supported Versions: v23.2.3+, v24.1.0+

### End-to-end performance metrics

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bc we want to limit the verbosity of metrics to readers, let's remove the entire end to end performance metrics section

@katmayb
Copy link
Contributor Author

katmayb commented Jan 31, 2025

@rohan-joshi I've removed the requested sections. When we chatted, you mentioned keeping the diagram as is — is that still the case? I suppose the end-to-end component of that could be a source of confusion if we're not calling it out anywhere. Happy to update the diagram, if you think best. Let me know!

(p.s. I do think we actually mention most of the end-to-end metrics elsewhere in the cdc docs)

@katmayb katmayb marked this pull request as ready for review February 5, 2025 15:27
@katmayb
Copy link
Contributor Author

katmayb commented Feb 5, 2025

@rohan-joshi Just a friendly ping on the question above!

@katmayb katmayb requested a review from rohan-joshi February 5, 2025 15:28
@katmayb katmayb changed the title WIP changefeed monitoring guide Changefeed monitoring guide Feb 7, 2025
@rohan-joshi
Copy link

@kathancox yikes. Sorry for not responding.
I think it's okay to keep the diagram as is even if we don't reference all of the metrics in the write-up.

I'd rather not hide the subcomponents

@katmayb
Copy link
Contributor Author

katmayb commented Feb 19, 2025

OK thanks! @rohan-joshi, @asg0451 any other changes you want me to make, before I push to a docs writer for a final review?

Copy link

@rohan-joshi rohan-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@katmayb katmayb requested a review from rmloveland March 4, 2025 15:24

### High-level performance metrics

- Metric: `(now() - changefeed.checkpoint_progress)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace this with changefeed.max_behind_nanos, which is now scoped in the next backport releases


#### Batch latency

- Metric: `changefeed.sink_batch_hist_nanos`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may need to note that some names differ in the datadog integration. eg losing the _nanos suffix, and in prometheus they replace dots with underscores, etc. applicable for all metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@katmayb
Copy link
Contributor Author

katmayb commented Mar 4, 2025

I also just realized the "Supported versions" bullet for each metric doesn't make sense in versioned docs. I'm going to backport the page to the relevant versions and then have that bullet point list the correct version.

@rmloveland rmloveland removed their request for review March 4, 2025 19:41
@katmayb
Copy link
Contributor Author

katmayb commented Mar 5, 2025

@asg0451 I've worked on some of your feedback in the latest commit. However, it looks like your feedback preceded the edits I made for Rohan, which removed a lot of content. Can you PTAL and see if the change to max_behind_nanos is still what we want?

@katmayb
Copy link
Contributor Author

katmayb commented Mar 13, 2025

In a Slack conversation, I suggested + confirmed the following changes, which would resolve any discrepancy in feedback @rohan-joshi / @asg0451:

  • Keep the End-to-end performance section (all of these metrics are in the docs anyway and this will also match the diagram).
  • As per this commit ^, remove the Supported dashboards and Common troubleshooting scenarios sections.
  • As per Miles' suggestion, move up changefeed.max_behind_nanos to the High-level performance metrics section because this metric is now scoped + alter the diagram to include this in the correct place.

@asg0451 @rohan-joshi These are now implemented. Kept both checkpoint_progress + max_behind_nanos in both the high-level section and progress tracking section. (My thought was that it maybe better to show these as useful in both context, but please correct me if I'm wrong.) I've adapted the diagram to represent this.

I'll push to a docs review later this afternoon (3/13) Thanks both!

@katmayb katmayb requested a review from rmloveland March 13, 2025 18:41
Copy link
Contributor

@rmloveland rmloveland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, non-blocking comment is i don't think v25.1 docs are the right place to litigate all this per-metric old supported versions info but i'm guessing there are $reasons so .. LGTM

@katmayb
Copy link
Contributor Author

katmayb commented Mar 17, 2025

LGTM, non-blocking comment is i don't think v25.1 docs are the right place to litigate all this per-metric old supported versions info but i'm guessing there are $reasons so .. LGTM

Yes, you're right! My mistake, I removed from the v24.3 docs and then forgot to remove from the v25.1 docs.

@katmayb katmayb force-pushed the cdc-monitoring-guide branch from 849c82e to c81d7f0 Compare March 17, 2025 20:05
@katmayb katmayb merged commit 3399c10 into main Mar 17, 2025
6 checks passed
@katmayb katmayb deleted the cdc-monitoring-guide branch March 17, 2025 20:12
@rohan-joshi
Copy link

yay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants