streamingccl: partition should resume from its own checkpoint #82697

gh-casper · 2022-06-09T22:37:49Z

Currently, when we resume an ingestion job, we use a single resume timestamp for all partitions. This is wasteful when one partition is lagging and most of the other partitions have already caught up. We already track each partition's progress, so each partition should be able to resume from its own checkpoint to minimize the catch-up work.

This is what changefeed currently does: #77763
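A minimal sketch of the idea, assuming checkpoints are kept as a map from a partition identifier to its last checkpointed timestamp. The function name, the map layout, and the use of plain `int64` timestamps are hypothetical and not taken from the streamingccl code:

```go
package main

import "fmt"

// resumeTimestamps returns, for each partition, the timestamp to resume from:
// the partition's own checkpoint if one was recorded, otherwise the job-wide
// start time. All names here are illustrative assumptions.
func resumeTimestamps(partitions []string, checkpoints map[string]int64, startTime int64) map[string]int64 {
	out := make(map[string]int64, len(partitions))
	for _, p := range partitions {
		ts, ok := checkpoints[p]
		// Never resume earlier than the job-wide start time.
		if !ok || ts < startTime {
			ts = startTime
		}
		out[p] = ts
	}
	return out
}

func main() {
	partitions := []string{"p1", "p2", "p3"}
	// p3 is the lagging partition; p1 and p2 have caught up.
	checkpoints := map[string]int64{"p1": 100, "p2": 100, "p3": 40}

	resume := resumeTimestamps(partitions, checkpoints, 10)
	for _, p := range partitions {
		fmt.Printf("%s resumes from %d\n", p, resume[p])
	}
}
```

With a single shared resume timestamp, p1 and p2 would have to replay everything from the lagging partition's position; here only p3 does the catch-up work.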

Jira issue: CRDB-16554

Epic CRDB-10146

blathers-crl · 2022-06-09T22:37:53Z

cc @cockroachdb/cdc

stevendanna · 2022-06-24T16:07:50Z

> We already tracked each partition progress, each partition should be able to resume from its own checkpoint to minimize the catch-up work.

IIRC, currently, our per-partition checkpoints are "per-source-node" and not related to the span we are receiving from that node. Thus, we can't use them anytime we get a new plan. So, to do this we'll need to track the actual spans.

We should do this, but I don't think this is table-stakes.

82172: ccl/backupccl: upgrade by-name sequence reference to by-ID during restore r=Xiang-Gu a=Xiang-Gu

In 20.2 and prior, sequences are referenced by-name. It was later changed to reference-by-ID to enable things like `ALTER SEQUENCE ... RENAME ...`. But if a backup is taken in 20.2 and prior, and then the backup is restored in a newer binary version (where sequence references should be by-ID), we will need to also be able to upgrade those sequence references from by-name to by-ID.

fixes: #60942

Release note: None

83813: streamingccl: span-level checkpointing for streaming ingestion r=samiskin a=samiskin

Resolves #82697

Currently the ingestion job checkpoints timestamps per-partition and resumes from the overall startTime. This could potentially result in a large amount of extra work done as a certain partition may lag significantly behind the others. This checkpoint information is also invalidated if we receive a new plan.

This change moves from partition-based checkpointing to ResolvedSpan-based checkpointing. Ingestion processors forward progress for each of their partition’s spans as individual resolvedspans, and the complete frontier on the Frontier process is then persisted in the jobs table.

Release note (bug fix): replication stream checkpoints now persist across changing plans due to storing span-based checkpoints rather than partition-based checkpoints.

84201: cloud: bump orchestrator to v22.1.3 r=e-mbrown a=e-mbrown

Release note: None

Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Co-authored-by: e-mbrown <ebsonari@gmail.com>
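To illustrate why span-based checkpoints survive a new plan while per-node checkpoints do not, here is a rough sketch. It assumes the persisted frontier is a list of non-overlapping resolved spans sorted by start key. The `Span` and `ResolvedSpan` types and the `resumeFor` helper are simplified stand-ins written for this example, not the types or functions used in #83813:

```go
package main

import (
	"fmt"
	"math"
)

// Span is a half-open key range [Start, End). Hypothetical, simplified type.
type Span struct{ Start, End string }

// ResolvedSpan records that every key in Span was ingested up to Timestamp.
type ResolvedSpan struct {
	Span      Span
	Timestamp int64
}

func overlaps(a, b Span) bool { return a.Start < b.End && b.Start < a.End }

// resumeFor returns the timestamp from which span s can safely resume.
// It is the minimum timestamp over all checkpointed spans overlapping s,
// or startTime if some part of s is not covered by any checkpoint.
// Assumes checkpoints are sorted by Start and do not overlap each other.
func resumeFor(s Span, checkpoints []ResolvedSpan, startTime int64) int64 {
	resume := int64(math.MaxInt64)
	cursor := s.Start // keys in [s.Start, cursor) are known to be covered
	for _, c := range checkpoints {
		if !overlaps(c.Span, s) {
			continue
		}
		if c.Span.Start > cursor {
			return startTime // gap in coverage
		}
		if c.Timestamp < resume {
			resume = c.Timestamp
		}
		if c.Span.End > cursor {
			cursor = c.Span.End
		}
	}
	if cursor < s.End || resume < startTime {
		return startTime
	}
	return resume
}

func main() {
	// Frontier persisted under the old plan; [c, f) is lagging.
	frontier := []ResolvedSpan{
		{Span{"a", "c"}, 100},
		{Span{"c", "f"}, 40},
		{Span{"f", "k"}, 100},
	}
	// A new plan splits the keyspace differently.
	for _, s := range []Span{{"a", "c"}, {"c", "h"}, {"h", "z"}} {
		fmt.Printf("[%s, %s) resumes from %d\n", s.Start, s.End, resumeFor(s, frontier, 10))
	}
}
```

Because the checkpoints are keyed by span rather than by source node, the lookup still works after the plan changes: each new partition span resumes from the slowest checkpointed span it overlaps, and only uncovered ranges fall back to the job-wide start time.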

blathers-crl · 2022-07-13T13:46:08Z

cc @cockroachdb/cdc

blathers-crl · 2022-07-13T13:56:11Z

cc @cockroachdb/cdc

gh-casper added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) and A-tenant-streaming (Including cluster streaming) labels Jun 9, 2022

gh-casper added this to Triage in [DEPRECATED] CDC via automation Jun 9, 2022

blathers-crl bot added the T-cdc label Jun 9, 2022

gh-casper moved this from Triage to Cluster Streaming in [DEPRECATED] CDC Jun 9, 2022

jlinder added the sync-me-8 label Jun 10, 2022

shermanCRL added this to the 22.2 milestone Jun 28, 2022

samiskin self-assigned this Jun 30, 2022

samiskin mentioned this issue Jul 5, 2022

streamingccl: span-level checkpointing for streaming ingestion #83813

Merged

craig bot closed this as completed in 47a03f0 Jul 12, 2022

[DEPRECATED] CDC automation moved this from Cluster Streaming to Closed Jul 12, 2022

shermanCRL removed this from Closed in [DEPRECATED] CDC Jul 13, 2022

shermanCRL removed the C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) and T-cdc labels Jul 13, 2022

exalate-issue-sync bot added the T-cdc label Jul 13, 2022

blathers-crl bot added this to Triage in [DEPRECATED] CDC Jul 13, 2022

blathers-crl bot added the A-cdc Change Data Capture label Jul 13, 2022

shermanCRL removed the A-cdc (Change Data Capture) and T-cdc labels Jul 13, 2022

shermanCRL removed this from Triage in [DEPRECATED] CDC Jul 13, 2022

exalate-issue-sync bot added the T-cdc label Jul 13, 2022

blathers-crl bot added this to Triage in [DEPRECATED] CDC Jul 13, 2022

blathers-crl bot added the A-cdc Change Data Capture label Jul 13, 2022

amruss removed this from Triage in [DEPRECATED] CDC Jul 13, 2022
