New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
streamingccl: partition should resume from its own checkpoint #82697
Comments
cc @cockroachdb/cdc |
IIRC, currently, our per-partition checkpoints are "per-source-node" and not related to the span we are receiving from that node. Thus, we can't use them anytime we get a new plan. So, to do this we'll need to track the actual spans. We should do this, but I don't think this is table-stakes. |
82172: ccl/backupccl: upgrade by-name sequence reference to by-ID during restore r=Xiang-Gu a=Xiang-Gu In 20.2 and prior, sequences are referenced by-name. It was later changed to reference-by-ID to enable things like `ALTER SEQUENCE ... RENAME ...`. But if a backup is taken in 20.2 and prior, and then the backup is restored in a newer binary version (where sequence references should be by-ID), we will need to also be able to upgrade those sequence references from by-name to by-ID. fixes: #60942 Release note: None 83813: streamingccl: span-level checkpointing for streaming ingestion r=samiskin a=samiskin Resolves #82697 Currently the ingestion job checkpoints timestamps per-partition and resumes from the overall startTime. This could potentially result in a large amount of extra work done as a certain partition may lag significantly behind the others. This checkpoint information is also invalidated if we receive a new plan. This change moves from partition-based checkpointing to ResolvedSpan-based checkpointing. Ingestion processors forward progress for each of their partition’s spans as individual resolvedspans, and the complete frontier on the Frontier process is then persisted in the jobs table. Release note (bug fix): replication stream checkpoints now persist across changing plans due to storing span-based checkpoints rather than partition-based checkpoints. 84201: cloud: bump orchestrator to v22.1.3 r=e-mbrown a=e-mbrown Release note: None Co-authored-by: Xiang Gu <xiang@cockroachlabs.com> Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com> Co-authored-by: e-mbrown <ebsonari@gmail.com>
cc @cockroachdb/cdc |
cc @cockroachdb/cdc |
Currently when we resume an ingestion job, we use a resume timestamp for all partitions. This can be troublesome when we only have a lagging partition and most of other partitions have caught up to up-to-date. We already tracked each partition progress, each partition should be able to resume from its own checkpoint to minimize the catch-up work.
This is what changefeed currently does: #77763
Jira issue: CRDB-16554
Epic CRDB-10146
The text was updated successfully, but these errors were encountered: