streamingest: support reversing replication direction #117656

Merged
merged 5 commits into from Jan 13, 2024

Conversation

dt
Member

@dt dt commented Jan 11, 2024

After promoting a standby that was replicating from some primary to be
its own active cluster, turning it into the new primary, it is often
desirable to reverse the replication direction, so that changes made to
this now-primary cluster are replicated back to the former primary,
now operating as a standby.

Turning a formerly active, primary cluster into a replicating standby
cluster is particularly common during "failback" flows, where the once
primary cluster is returned to primary status after the standby had
temporarily been made the active cluster.

Re-promoting the primary in such cases requires that it have a virtual cluster
that is fully caught up with the promoted standby cluster that is serving
traffic, followed by a cut-over from that standby back to the primary.

This could be performed by creating a completely new virtual cluster
in the primary cluster from a replication stream of the temporarily active
standby; just like the creation of a normal secondary replicating cluster
this would start by backfilling all data from the source -- the promoted
standby -- and then continuously applying changes as they are streamed
to it.

However, in cases where this is being done on a cluster that previously
was the primary cluster, the cluster may still have a nearly up-to-date
copy of the virtual cluster, with only those writes that have been applied
by the promoted standby after cutover missing from it. In such cases,
backfilling a completely new virtual cluster from the promoted standby
involves copying far more data than needed; most of that data is already
on the primary.

Instead, the new syntax `ALTER VIRTUAL CLUSTER a START REPLICATION FROM a ON x`
can be used to indicate that virtual cluster 'a' should be rewound back to
the time at which virtual cluster 'a' on physical cluster 'x' -- the promoted
standby -- diverged from it. This checks with cluster x to confirm that
its virtual cluster a was indeed replicated from the cluster running the
command, and obtains the time after which the two diverged, i.e. when
cluster x was made active and started accepting new writes. The local cluster
then rewinds its virtual cluster a back to that timestamp and starts
replicating from cluster x as of that timestamp.
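
To make the rewind-then-replicate idea concrete, here is a toy sketch in Python. It is not the implementation in this PR: the `VirtualCluster` class, the `replicated_from` and `activated_at` fields, and `start_reverse_replication` are all hypothetical names invented for illustration, and timestamps are plain integers. It only models the three steps described above: verify provenance, rewind to the divergence time, then apply the writes made after it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Write = Tuple[int, str, str]  # (timestamp, key, value); toy stand-in for MVCC history


@dataclass
class VirtualCluster:
    """Hypothetical in-memory model of a virtual cluster's write history."""
    name: str
    history: List[Write] = field(default_factory=list)
    replicated_from: Optional[str] = None  # assumed provenance record
    activated_at: Optional[int] = None     # assumed cutover (divergence) time

    def write(self, ts: int, key: str, value: str) -> None:
        self.history.append((ts, key, value))

    def rewind_to(self, ts: int) -> int:
        """Drop every write newer than ts; return how many were dropped."""
        before = len(self.history)
        self.history = [w for w in self.history if w[0] <= ts]
        return before - len(self.history)

    def snapshot(self) -> dict:
        """Latest value per key, applying writes in timestamp order."""
        state = {}
        for _, key, value in sorted(self.history, key=lambda w: w[0]):
            state[key] = value
        return state


def start_reverse_replication(local, local_cluster_id, remote):
    """Rewind `local` to the divergence time, then apply remote's later writes."""
    # Step 1: confirm the remote copy really was replicated from this cluster.
    if remote.replicated_from != local_cluster_id or remote.activated_at is None:
        raise ValueError("remote virtual cluster was not replicated from this cluster")
    # Step 2: rewind local state to the time at which the two diverged.
    cutover = remote.activated_at
    dropped = local.rewind_to(cutover)
    # Step 3: only writes after the cutover need to be copied, not a full backfill.
    delta = [w for w in remote.history if w[0] > cutover]
    local.history.extend(delta)
    return cutover, dropped, len(delta)


# Former primary: three writes, the last of which never reached the standby.
primary = VirtualCluster("a")
for ts, k, v in [(1, "k1", "a"), (2, "k2", "b"), (3, "k3", "c")]:
    primary.write(ts, k, v)

# Standby replicated up to ts=2, was activated there, then took new writes.
standby = VirtualCluster("a", history=[(1, "k1", "a"), (2, "k2", "b")],
                         replicated_from="primary", activated_at=2)
standby.write(4, "k2", "b2")
standby.write(5, "k4", "d")

cutover, dropped, applied = start_reverse_replication(primary, "primary", standby)
print(cutover, dropped, applied)                 # 2 1 2
print(primary.snapshot() == standby.snapshot())  # True
```

In this toy run the former primary discards one write and copies two, rather than re-copying the standby's entire history, which is the saving the description is after.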

Release note (enterprise change): A virtual cluster that previously served as
the source for physical cluster replication into a standby in another cluster,
where that standby has since been activated, can now be reconfigured to become
a standby of the now-promoted cluster. This reverses the direction of the
replication stream and reuses the existing data as much as possible.

Epic: CRDB-34233.

@dt dt requested a review from stevendanna January 11, 2024 02:25
@dt dt requested review from a team as code owners January 11, 2024 02:25
@cockroach-teamcity
Member

This change is Reviewable

@dt
Member Author

dt commented Jan 11, 2024

First couple commits are #117636 so I'd review that PR first.

@dt dt force-pushed the pcr/failback branch 4 times, most recently from ca711ec to 92993b4 Compare January 11, 2024 14:14
@dt
Member Author

dt commented Jan 11, 2024

rebased on #117636 and RFAL

@dt dt requested a review from msbutler January 11, 2024 21:36
@dt dt force-pushed the pcr/failback branch 2 times, most recently from 7f5a862 to 8e9da4e Compare January 12, 2024 21:18
@dt
Member Author

dt commented Jan 12, 2024

TFTR!

bors r=msbutler

@craig
Contributor

craig bot commented Jan 13, 2024

Build succeeded:

@craig craig bot merged commit e2ad8d7 into cockroachdb:master Jan 13, 2024
9 checks passed
@dt dt deleted the pcr/failback branch January 13, 2024 01:30