
streamingccl: add even more LDR metrics to DB console #125529

Open · 7 tasks

msbutler opened this issue Jun 11, 2024 · 4 comments
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

@msbutler (Collaborator)

msbutler commented Jun 11, 2024

After #125320, @dt and @ajstorm wished for more metrics. We should add these when we have dev time, or when we find ourselves wishing for them during a debugging session:

Source side

  • Row Updates Sent, i.e. the number of KVs that went into batches sent by eventStream (single plot of the sum across nodes)
  • Megabytes Sent, i.e. the total size of KVs that went out of eventStream (one plot per node)
  • Emission Queue Delay, i.e. how long eventStream waits for Next() calls (two plots, p50 and p99 per node)

Destination side

  • Application Conflicts: placeholder / TODO.
  • Application Failures / Updates Sent to DLQ: placeholder / TODO.
  • Batches Applied, i.e. the number of update/insert/delete queries run by the consumer (one plot per node)
  • Replanning Events, i.e. the number of times we had to replan the job
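The destination-side items above are all counters. As a hypothetical sketch (these names are made up; in CockroachDB the counters would be registered with the metric registry so they surface in DB Console), the consumer would just bump atomics at the relevant points:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ldrDestMetrics is an illustrative sketch of the destination-side
// counters proposed in this issue; field and method names are assumptions.
type ldrDestMetrics struct {
	batchesApplied atomic.Int64 // update/insert/delete queries run by the consumer
	updatesDLQed   atomic.Int64 // application failures routed to the DLQ
	replanEvents   atomic.Int64 // times the job had to be replanned
}

func (m *ldrDestMetrics) onBatchApplied() { m.batchesApplied.Add(1) }
func (m *ldrDestMetrics) onDLQ(n int64)   { m.updatesDLQed.Add(n) }
func (m *ldrDestMetrics) onReplan()       { m.replanEvents.Add(1) }

func main() {
	var m ldrDestMetrics
	m.onBatchApplied()
	m.onBatchApplied()
	m.onDLQ(3)
	m.onReplan()
	// prints: 2 3 1
	fmt.Println(m.batchesApplied.Load(), m.updatesDLQed.Load(), m.replanEvents.Load())
}
```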

Jira issue: CRDB-39500

@msbutler msbutler added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery labels Jun 11, 2024

blathers-crl bot commented Jun 11, 2024

cc @cockroachdb/disaster-recovery

@msbutler (Collaborator, Author)

msbutler commented Jun 11, 2024

@ajstorm You also asked for the following, but I don't quite follow what you're asking for. Could you provide more context on what you were looking into and what you wanted to see?

- [ ] buffer consumption, rangefeed source: the amount of rangefeed buffer we're consuming at the source
- [ ] buffer consumption, producer: the amount of data buffered on the producer side

@ajstorm (Collaborator)

ajstorm commented Jun 18, 2024

> @ajstorm You also asked for the following, but I don't quite follow what you're asking for. Could you provide more context on what you were looking into and what you wanted to see?
>
> - [ ] buffer consumption, rangefeed source: the amount of rangefeed buffer we're consuming at the source
> - [ ] buffer consumption, producer: the amount of data buffered on the producer side

The latter might not be necessary anymore, because @dt seems to have pulled out the buffering on the producer side last night. What I was referring to on the source side is the buffer that backs this error message. IIUC, when that buffer fills, we get a REASON_SLOW_CONSUMER error and have to perform a catch-up scan. If that's the case, it would be good to have metrics showing how close we're getting to that limit.
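The metric described above is essentially a utilization gauge: buffered bytes over the limit that triggers REASON_SLOW_CONSUMER. A minimal sketch, assuming a hypothetical capacity value (the real limit lives in the rangefeed buffer configuration, not shown here):

```go
package main

import "fmt"

// bufferUtilization reports how full the source-side rangefeed buffer is,
// as a fraction of the capacity whose exhaustion would trigger a
// REASON_SLOW_CONSUMER error. Illustrative only.
func bufferUtilization(bufferedBytes, capacityBytes int64) float64 {
	if capacityBytes <= 0 {
		return 0
	}
	return float64(bufferedBytes) / float64(capacityBytes)
}

func main() {
	const capacity = 1 << 20 // hypothetical 1 MiB buffer limit
	util := bufferUtilization(900<<10, capacity)
	fmt.Printf("%.2f\n", util) // prints 0.88
}
```

Alerting when this gauge approaches 1.0 would give warning before the catch-up scan is forced.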

@ajstorm (Collaborator)

ajstorm commented Jun 21, 2024

Adding a request for a metric that tracks how far behind we are when performing an initial backfill (as mentioned here).
