
streamingccl: add even more LDR metrics to DB console #125529

Open · 7 tasks

msbutler opened this issue Jun 11, 2024 · 4 comments
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

@msbutler (Collaborator)

msbutler commented Jun 11, 2024

After #125320, @dt and @ajstorm wished for more metrics. We should add these when we have dev time, or when we find ourselves wishing for them during a debugging session:

Source side

  • Row Updates Sent, i.e. the number of KVs that went into batches sent by eventStream (single plot of the sum across nodes)
  • Megabytes Sent, i.e. the total size of KVs that went out of eventStream (one plot per node)
  • Emission Queue Delay, i.e. how long eventStream waits for Next() calls (two plots, p50 and p99 per node)

Destination side

  • Application Conflicts: placeholder / TODO.
  • Application Failures / Updates Sent to DLQ: placeholder / TODO.
  • Batches Applied, i.e. the number of update/insert/delete queries run by the consumer (one plot per node)
  • Replanning Events, i.e. the number of times we had to replan the job
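The destination-side items above are all counters. As a hypothetical sketch (these names are made up; in CockroachDB the counters would be registered with the metric registry so they surface in DB Console), the consumer would just bump atomics at the relevant points:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ldrDestMetrics is an illustrative sketch of the destination-side
// counters proposed in this issue; field and method names are assumptions.
type ldrDestMetrics struct {
	batchesApplied atomic.Int64 // update/insert/delete queries run by the consumer
	updatesDLQed   atomic.Int64 // application failures routed to the DLQ
	replanEvents   atomic.Int64 // times the job had to be replanned
}

func (m *ldrDestMetrics) onBatchApplied() { m.batchesApplied.Add(1) }
func (m *ldrDestMetrics) onDLQ(n int64)   { m.updatesDLQed.Add(n) }
func (m *ldrDestMetrics) onReplan()       { m.replanEvents.Add(1) }

func main() {
	var m ldrDestMetrics
	m.onBatchApplied()
	m.onBatchApplied()
	m.onDLQ(3)
	m.onReplan()
	// prints: 2 3 1
	fmt.Println(m.batchesApplied.Load(), m.updatesDLQed.Load(), m.replanEvents.Load())
}
```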

Jira issue: CRDB-39500

@msbutler msbutler added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery labels Jun 11, 2024

blathers-crl bot commented Jun 11, 2024

cc @cockroachdb/disaster-recovery

@msbutler (Collaborator, Author)

msbutler commented Jun 11, 2024

@ajstorm You also asked for the following, but I don't quite follow what you're asking for. Could you provide more context on what you were looking into and what you wanted to see?

- [ ] buffer consumption, rangefeed source: the amount of rangefeed buffer we're consuming at the source
- [ ] buffer consumption, producer: the amount of data buffered on the producer side

@ajstorm (Collaborator)

ajstorm commented Jun 18, 2024

> @ajstorm You also asked for the following, but I don't quite follow what you're asking for. Could you provide more context on what you were looking into and what you wanted to see?
>
> - [ ] buffer consumption, rangefeed source: the amount of rangefeed buffer we're consuming at the source
> - [ ] buffer consumption, producer: the amount of data buffered on the producer side

The latter might not be necessary anymore, because @dt seems to have pulled out the buffering on the producer side last night. What I was referring to on the source side is the buffer that backs this error message. IIUC, when that buffer fills, we get a REASON_SLOW_CONSUMER error and have to perform a catch-up scan. If that's the case, it would be good to have metrics showing how close we're getting to that limit.
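The metric described above is essentially a utilization gauge: buffered bytes over the limit that triggers REASON_SLOW_CONSUMER. A minimal sketch, assuming a hypothetical capacity value (the real limit lives in the rangefeed buffer configuration, not shown here):

```go
package main

import "fmt"

// bufferUtilization reports how full the source-side rangefeed buffer is,
// as a fraction of the capacity whose exhaustion would trigger a
// REASON_SLOW_CONSUMER error. Illustrative only.
func bufferUtilization(bufferedBytes, capacityBytes int64) float64 {
	if capacityBytes <= 0 {
		return 0
	}
	return float64(bufferedBytes) / float64(capacityBytes)
}

func main() {
	const capacity = 1 << 20 // hypothetical 1 MiB buffer limit
	util := bufferUtilization(900<<10, capacity)
	fmt.Printf("%.2f\n", util) // prints 0.88
}
```

Alerting when this gauge approaches 1.0 would give warning before the catch-up scan is forced.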

@ajstorm (Collaborator)

ajstorm commented Jun 21, 2024

Adding a request for a metric that tracks how far behind we are when performing an initial backfill (as mentioned here).
