Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

streamingccl: partition should resume from its own checkpoint #82697

Closed
gh-casper opened this issue Jun 9, 2022 · 4 comments · Fixed by #83813
Closed

streamingccl: partition should resume from its own checkpoint #82697

gh-casper opened this issue Jun 9, 2022 · 4 comments · Fixed by #83813
Assignees
Labels
A-cdc Change Data Capture A-tenant-streaming Including cluster streaming sync-me-8 T-cdc
Milestone

Comments

@gh-casper
Copy link
Contributor

gh-casper commented Jun 9, 2022

Currently when we resume an ingestion job, we use a resume timestamp for all partitions. This can be troublesome when we only have a lagging partition and most of other partitions have caught up to up-to-date. We already tracked each partition progress, each partition should be able to resume from its own checkpoint to minimize the catch-up work.

This is what changefeed currently does: #77763

Jira issue: CRDB-16554

Epic CRDB-10146

@gh-casper gh-casper added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-tenant-streaming Including cluster streaming labels Jun 9, 2022
@gh-casper gh-casper added this to Triage in [DEPRECATED] CDC via automation Jun 9, 2022
@blathers-crl blathers-crl bot added the T-cdc label Jun 9, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jun 9, 2022

cc @cockroachdb/cdc

@gh-casper gh-casper moved this from Triage to Cluster Streaming in [DEPRECATED] CDC Jun 9, 2022
@stevendanna
Copy link
Collaborator

We already tracked each partition progress, each partition should be able to resume from its own checkpoint to minimize the catch-up work.

IIRC, currently, our per-partition checkpoints are "per-source-node" and not related to the span we are receiving from that node. Thus, we can't use them anytime we get a new plan. So, to do this we'll need to track the actual spans.

We should do this, but I don't think this is table-stakes.

@shermanCRL shermanCRL added this to the 22.2 milestone Jun 28, 2022
@samiskin samiskin self-assigned this Jun 30, 2022
craig bot pushed a commit that referenced this issue Jul 12, 2022
82172: ccl/backupccl: upgrade by-name sequence reference to by-ID during restore r=Xiang-Gu a=Xiang-Gu

In 20.2 and prior, sequences are referenced by-name. It was later
changed to reference-by-ID to enable things like
`ALTER SEQUENCE ... RENAME ...`.

But if a backup is taken in 20.2 and prior, and then the backup is
restored in a newer binary version (where sequence references should
be by-ID), we will need to also be able to upgrade those sequence
references from by-name to by-ID.

fixes: #60942

Release note: None

83813: streamingccl: span-level checkpointing for streaming ingestion r=samiskin a=samiskin

Resolves #82697 

Currently the ingestion job checkpoints timestamps per-partition and resumes
from the overall startTime.  This could potentially result in a large amount of 
extra work done as a certain partition may lag significantly behind the others.
This checkpoint information is also invalidated if we receive a new plan.

This change moves from partition-based checkpointing to ResolvedSpan-based
checkpointing.  Ingestion processors forward progress for each of their
partition’s spans as individual resolvedspans, and the complete frontier on the
Frontier process is then persisted in the jobs table.

Release note (bug fix): replication stream checkpoints now persist across
changing plans due to storing span-based checkpoints rather than partition-based
checkpoints.

84201: cloud: bump orchestrator to v22.1.3 r=e-mbrown a=e-mbrown

Release note: None

Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Co-authored-by: e-mbrown <ebsonari@gmail.com>
@craig craig bot closed this as completed in 47a03f0 Jul 12, 2022
[DEPRECATED] CDC automation moved this from Cluster Streaming to Closed Jul 12, 2022
@shermanCRL shermanCRL removed this from Closed in [DEPRECATED] CDC Jul 13, 2022
@shermanCRL shermanCRL removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cdc labels Jul 13, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 13, 2022

cc @cockroachdb/cdc

@blathers-crl blathers-crl bot added this to Triage in [DEPRECATED] CDC Jul 13, 2022
@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Jul 13, 2022
@shermanCRL shermanCRL removed A-cdc Change Data Capture T-cdc labels Jul 13, 2022
@shermanCRL shermanCRL removed this from Triage in [DEPRECATED] CDC Jul 13, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 13, 2022

cc @cockroachdb/cdc

@blathers-crl blathers-crl bot added this to Triage in [DEPRECATED] CDC Jul 13, 2022
@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Jul 13, 2022
@amruss amruss removed this from Triage in [DEPRECATED] CDC Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture A-tenant-streaming Including cluster streaming sync-me-8 T-cdc
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants