streamingest: support reversing replication direction #117656

dt · 2024-01-11T02:25:08Z

After promoting a standby that was replicating from some primary to be
its own active cluster, turning it into the new primary, it is often
desirable to reverse the replication direction, so that changes made to
this now-primary cluster are replicated back to the former primary,
now operating as a standby.

Turning a formerly active, primary cluster into a replicating standby
cluster is particularly common during "failback" flows, where the once
primary cluster is returned to primary status after the standby had
temporarily been made the active cluster.

Re-promoting the primary in such cases requires that it have a virtual cluster
that is fully caught up with the promoted standby cluster that is serving
traffic, and then performing a cutover from that standby back to the primary.

This could be performed by creating a completely new virtual cluster
in the primary cluster from a replication stream of the temporarily active
standby; just like the creation of a normal secondary replicating cluster,
this would start by backfilling all data from the source -- the promoted
standby -- and then continuously applying changes as they are streamed
to it.

However, in cases where this is being done on a cluster that previously
was the primary cluster, the cluster may still have a nearly up-to-date
copy of the virtual cluster, missing only those writes that have been applied
by the promoted standby after cutover. In such cases,
backfilling a completely new virtual cluster from the promoted standby
involves copying far more data than needed; most of that data is already
on the primary.

Instead, the new syntax ALTER VIRTUAL CLUSTER a START REPLICATION FROM a ON x
can be used to indicate the virtual cluster 'a' should be rewound back to
the time at which virtual cluster 'a' on physical cluster 'x' -- the promoted
standby -- diverged from it. The command checks with cluster x to confirm that
its virtual cluster a was indeed replicated from the cluster running the
command, and obtains from it the time at which the two diverged, that is,
when cluster x was made active and started accepting new writes. The cluster
then rewinds its own virtual cluster a back to that timestamp and starts
replicating from cluster x at that timestamp.
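
To make the check / rewind / resume sequence above easier to follow, here is a
minimal, self-contained sketch of it in Go. It is a toy model only, not the
code in this PR: the types PriorReplication and VirtualCluster, the function
reverseReplication, and the integer timestamps are all hypothetical names
invented for illustration, and the real rewind operates on the virtual
cluster's stored data rather than on an in-memory map.

```go
package main

import (
	"errors"
	"fmt"
)

// PriorReplication is a hypothetical summary, reported by the promoted
// standby, of where its virtual cluster was replicated from and when it
// was made active (the point at which the two copies diverged).
type PriorReplication struct {
	SourceClusterID string
	ActivatedAt     int64
}

// VirtualCluster is a toy stand-in for the former primary's local copy.
type VirtualCluster struct {
	Name            string
	Writes          map[int64]string // timestamp -> write (illustrative only)
	ReplicatingFrom string
	ResumeFrom      int64
}

var errNotReplicatedFromHere = errors.New(
	"remote virtual cluster was not replicated from this cluster")

// reverseReplication models the three steps in the description: confirm the
// lineage, rewind the local copy to the divergence time, then start
// replicating from the promoted standby at that time.
func reverseReplication(
	localClusterID string, local *VirtualCluster, remoteCluster string, prior PriorReplication,
) error {
	// Step 1: the remote copy must have been replicated from this cluster.
	if prior.SourceClusterID != localClusterID {
		return errNotReplicatedFromHere
	}
	// Step 2: rewind, discarding local writes newer than the divergence time.
	for ts := range local.Writes {
		if ts > prior.ActivatedAt {
			delete(local.Writes, ts)
		}
	}
	// Step 3: resume replication from the promoted standby at that time.
	local.ReplicatingFrom = remoteCluster
	local.ResumeFrom = prior.ActivatedAt
	return nil
}

func main() {
	a := &VirtualCluster{
		Name:   "a",
		Writes: map[int64]string{100: "w1", 200: "w2", 300: "w3"},
	}
	prior := PriorReplication{SourceClusterID: "primary", ActivatedAt: 200}
	if err := reverseReplication("primary", a, "x", prior); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("writes kept:", len(a.Writes))
	fmt.Println("replicating from:", a.ReplicatingFrom, "starting at:", a.ResumeFrom)
}
```

The point of the sketch is that only data written after the divergence time
needs to move; everything at or before that time is reused as-is.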

Release note (enterprise change): A virtual cluster that was previously the
source for physical cluster replication into a standby in another cluster,
where that standby has since been activated, can now be reconfigured to become
a standby of the now-promoted cluster. This reverses the direction of the
replication stream and reuses the existing data as much as possible.

Epic: CRDB-34233.

dt · 2024-01-11T02:26:30Z

First couple commits are #117636 so I'd review that PR first.

dt · 2024-01-11T14:14:21Z

rebased on #117636 and RFAL

pkg/ccl/streamingccl/streamingest/alter_replication_job.go

pkg/ccl/streamingccl/streamingest/stream_ingestion_planning.go

pkg/ccl/streamingccl/streamingest/stream_ingestion_job_test.go

dt · 2024-01-12T23:37:13Z

TFTR!

bors r=msbutler

craig · 2024-01-13T01:18:32Z

Build succeeded:

Bazel Essential CI (Cockroach)

dt requested a review from stevendanna January 11, 2024 02:25

dt requested review from a team as code owners January 11, 2024 02:25

dt force-pushed the pcr/failback branch 4 times, most recently from ca711ec to 92993b4 Compare January 11, 2024 14:14

dt requested a review from msbutler January 11, 2024 21:36

msbutler approved these changes Jan 12, 2024

View reviewed changes

dt force-pushed the pcr/failback branch 2 times, most recently from 7f5a862 to 8e9da4e Compare January 12, 2024 21:18

dt added 5 commits January 12, 2024 22:25

sql/parser: add ALTER V CLUSTER START REP FROM syntax

6f3ec8f

Release note: none. Epic: none.

streamingccl/streamclient: add PriorReplicationDetails to client

35c07bf

Release note: none. Epic: none.

jobs,streamingest: remove unused DestinationTenantName

81ff5de

Release note: none. Epic: none.

streamingest: pure refactor to pull createReplicationJob out

d4b8ff7

Release note: none. Epic: none.

dt force-pushed the pcr/failback branch from 8e9da4e to 23566b2 Compare January 12, 2024 22:25

craig bot merged commit e2ad8d7 into cockroachdb:master Jan 13, 2024
9 checks passed

dt deleted the pcr/failback branch January 13, 2024 01:30
