
release-23.1: changefeedccl: fix initial scan checkpointing #123970

Merged
merged 1 commit into cockroachdb:release-23.1 on May 10, 2024

Conversation

wenyihu6
Contributor

Backport 1/1 commits from #123625.

/cc @cockroachdb/release


Previously, every span's initial resolved timestamp was left at zero when a job resumed, because initial resolved timestamps were set from the initial high water, which remains zero until the initial scan completes. However, since
0eda540,
we began reloading checkpointed timestamps instead of setting them all to zero at
the start. In PR #102717, we introduced a mechanism to reduce message duplicates
by re-loading job progress upon resuming, which greatly increased the likelihood
of hitting this bug. These errors could lead to an incorrect frontier and missing
events during initial scans. This patch changes how the initial high water and
frontier are initialized: if any span's initial resolved timestamp is zero, the
initial high water and frontier are initialized to zero as well.
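
As a rough illustration of the fix, here is a minimal Go sketch of the corrected initialization decision. It is not the actual CockroachDB code: `Timestamp`, `initialHighWater`, and the span keys below are hypothetical stand-ins for `hlc.Timestamp` and the changefeed frontier setup.

```go
package main

import "fmt"

// Timestamp is a stand-in for hlc.Timestamp in this sketch.
type Timestamp struct {
	WallTime int64
}

// IsEmpty reports whether the timestamp is the zero value.
func (t Timestamp) IsEmpty() bool { return t.WallTime == 0 }

// initialHighWater picks the starting point for a resumed changefeed's
// frontier. If any span's checkpointed resolved timestamp is still zero,
// the initial scan has not completed for that span, so the job must start
// from a zero high water rather than from the reloaded checkpoint.
func initialHighWater(spanResolved map[string]Timestamp, checkpointed Timestamp) Timestamp {
	for _, ts := range spanResolved {
		if ts.IsEmpty() {
			return Timestamp{} // initial scan still in progress somewhere
		}
	}
	return checkpointed
}

func main() {
	resolved := map[string]Timestamp{
		"/Table/101": {WallTime: 0},   // initial scan not finished for this span
		"/Table/102": {WallTime: 100}, // checkpointed mid-scan
	}
	hw := initialHighWater(resolved, Timestamp{WallTime: 100})
	fmt.Printf("initial high water: %+v\n", hw) // zero, so the frontier restarts from the beginning
}
```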

Fixes: #123371

Release note (enterprise change): Fixed a bug, present since v22.2, where
long-running initial scans could incorrectly restore checkpointed job progress
and drop events during a node or changefeed restart. This issue was most likely
to occur in clusters with: 1) changefeed.shutdown_checkpoint.enabled (v23.2)
set, 2) multiple table targets in a changefeed, or 3) a low
changefeed.frontier_checkpoint_frequency or a low
changefeed.frontier_highwater_lag_checkpoint_threshold.


Release justification: Fixes a severe bug, present since v22.2, where
long-running initial scans could incorrectly restore checkpointed job progress
and drop events during a node or changefeed restart.


blathers-crl bot commented May 10, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL and one additional TL. For more information as to how that review should be conducted, please consult the backport policy.
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport label (PRs that are backports to older release branches) on May 10, 2024
@cockroach-teamcity
Member

This change is Reviewable

@wenyihu6 wenyihu6 marked this pull request as ready for review May 10, 2024 20:41
@wenyihu6 wenyihu6 requested a review from a team as a code owner May 10, 2024 20:41
@wenyihu6 wenyihu6 requested review from rharding6373, stevendanna and andyyang890 and removed request for a team May 10, 2024 20:41
@wenyihu6 wenyihu6 merged commit ed045dc into cockroachdb:release-23.1 May 10, 2024
5 of 6 checks passed
@wenyihu6
Contributor Author

blathers backport release-23.1.22-rc
