
@katmayb katmayb commented Feb 28, 2025

Fixes DOC-12358, DOC-12356

This PR is to clarify the failover/failback (formerly cutover/cutback) docs for PCR.

Changes include:

  • Clearly defined _failover_ and _failback_ in the intro of the failover page. (Note: this is slightly different in v23.2, because "cutback" was just starting a PCR stream again.)
  • Added detail that fast failback is not possible when the original PCR stream used non-UA clusters.
  • Communicated the non-UA to UA cluster workflow more clearly.
  • Updated the pages to check versioning requirements.
  • Streamlined wording in some descriptions and, where possible, moved away from "replication stream" to "PCR" now that we have LDR in the docs as well.
  • Added "Fast Failback" to the features list on the Overview page in the applicable versions.
  • In the prerequisites on the Setup PCR page and on the Failback page, changed the versioning requirement to "The standby cluster should be the same version or one version ahead of the primary cluster" rather than a specific version.
  • Removed the Enterprise license procedural content and prerequisite now that CRDB licenses have changed.

Note

This is PR 2 in a series of work to improve the PCR docs, especially the UX around the non-UA to UA workflow and failover/failback plus upgrades and versioning.

Follow-up PR will include improved descriptions on cluster versions and upgrade process.

Preview

https://deploy-preview-19405--cockroachdb-docs.netlify.app/docs/v25.1/failover-replication.html

netlify bot commented Feb 28, 2025

Deploy Preview for cockroachdb-api-docs canceled.

🔨 Latest commit: 0227470
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-api-docs/deploys/67dd6c387efc7500089ee3fe

netlify bot commented Feb 28, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

🔨 Latest commit: 0227470
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/67dd6c38fb7bcb0008fcbd82

netlify bot commented Feb 28, 2025

Netlify Preview

🔨 Latest commit: 0227470
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-docs/deploys/67dd6c389023580008224828
😎 Deploy Preview: https://deploy-preview-19405--cockroachdb-docs.netlify.app

@katmayb katmayb force-pushed the failover-failback-improvements branch 6 times, most recently from 01c9087 to 8f82f5d on March 6, 2025 19:52
## Before you begin

- Two separate CockroachDB clusters (primary and standby) with a minimum of three nodes each, and each using the same CockroachDB {{page.version.version}} version.
- Two separate CockroachDB clusters (primary and standby) with a minimum of three nodes each, and each using the same CockroachDB {{page.version.version}} version. The standby cluster should be the same version or one version ahead of the primary cluster.
Added this sentence to match the current description in the "Cluster versions and upgrades" section.

- The [Deploy CockroachDB on Premises]({% link {{ page.version.version }}/deploy-cockroachdb-on-premises.md %}) tutorial creates a self-signed certificate for each {{ site.data.products.core }} cluster. To create certificates signed by an external certificate authority, refer to [Create Security Certificates using OpenSSL]({% link {{ page.version.version }}/create-security-certificates-openssl.md %}).
- All nodes in each cluster will need access to the Certificate Authority for the other cluster. Refer to [Copy certificates](#step-3-copy-certificates).
- An [{{ site.data.products.enterprise }} license]({% link {{ page.version.version }}/licensing-faqs.md %}#types-of-licenses) on the primary **and** standby clusters. You must use the system virtual cluster on the primary and standby clusters to enable your {{ site.data.products.enterprise }} license.
- The primary and standby clusters **must have the same [region topology]({% link {{ page.version.version }}/topology-patterns.md %})**. For example, replicating a multi-region primary cluster to a single-region standby cluster is not supported. Mismatching regions between a multi-region primary and standby cluster is also not supported.
Enterprise licensing prereq + steps gone as a result of CRDB license updates.

- [Create a new virtual cluster]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-4-start-replication) (`main`) on cluster **A** from the replication of cluster **B**. Cluster **A** is now virtualized. This will start an initial scan because the PCR stream will ignore the former workload tables in the system virtual cluster that were [originally replicated to **B**]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#set-up-pcr-from-an-existing-cluster). You can [drop the tables]({% link {{ page.version.version }}/drop-table.md %}) that were in the system virtual cluster, because the new virtual cluster will now hold the workload replicating from cluster **B**.
- [Start an entirely new cluster]({% link {{ page.version.version }}/cockroach-start.md %}) **C** and [create]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-4-start-replication) a `main` virtual cluster on it from the replication of cluster **B**. This will start an initial scan because cluster **C** is empty.
## Job management
This isn't new content; I just moved Job management to the bottom of the page, after failback/cutback.

@katmayb katmayb marked this pull request as ready for review March 6, 2025 20:10
@katmayb katmayb requested a review from alicia-l2 March 6, 2025 20:11

@alicia-l2 alicia-l2 left a comment


LGTM, left some comments - thank you!

{{site.data.alerts.end}}

The failover is a two-step process on the standby cluster:
_Failover_ in [**physical cluster replication (PCR)**]({% link {{ page.version.version }}/physical-cluster-replication-overview.md %}) allows you to switch from the active primary cluster to the passive standby cluster that has ingested replicated data. When you complete the replication stream to initiate a failover, the job stops the stream of new data, resets the standby [virtual cluster]({% link {{ page.version.version }}/physical-cluster-replication-technical-overview.md %}) to a point in time where all ingested data is consistent, and then marks the standby virtual cluster as ready to accept traffic.


stops the stream

Should we just say stops replicating data from the primary?

resets

should we say sets?

marks the standby virtual cluster as ready to accept traffic.

should we say 'makes' instead of marks? Because the standby cluster is ready to accept traffic at that time. or like enables? Something just more explicit that the standby is online


Oh also - I don't know how long you wanted to make this section, but it could be cool to mention how we let you failover to a time in the past or future. Defer to you though (we could also put that in the overview)!


Implemented all of these suggestions — thank you!


1. [Initiating the failover](#step-1-initiate-the-failover).
1. [Completing the failover](#step-2-complete-the-failover).
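For reference, the two steps above correspond to SQL run against the standby cluster's system virtual cluster. This is a sketch only, assuming the documented PCR syntax; the virtual cluster name `main` is a placeholder, and exact statements may differ by version:

~~~ sql
-- Step 1: initiate the failover to the latest consistent timestamp.
ALTER VIRTUAL CLUSTER main COMPLETE REPLICATION TO LATEST;

-- Monitor the replication status until the failover completes.
SHOW VIRTUAL CLUSTER main WITH REPLICATION STATUS;

-- Step 2: complete the failover by starting SQL service on the promoted virtual cluster.
ALTER VIRTUAL CLUSTER main START SERVICE SHARED;
~~~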
_Failback_ in PCR switches operations back to the original primary cluster (or a different cluster) after a failover event. When you initiate a failback, the job ensures the original primary is up to date with writes from the standby that happened after failover. The original primary cluster is then set as ready to accept application traffic once again.


Hm, what do you mean by a different cluster? Do you mean like a net new cluster?


yes, I did mean that. Do you want me to take that out or change to "new cluster"?


I think maybe "new cluster" would be more explicit? Only calling it out because I've recently had some folks confused about it when we were talking through PCR.

The `failover_time` is the timestamp at which the replicated data is consistent. The cluster will revert any data above this timestamp:
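To illustrate, a failover can also target an explicit point in time rather than the latest consistent timestamp. A sketch, assuming the documented `TO SYSTEM TIME` syntax and a virtual cluster named `main` (the timestamp is a placeholder):

~~~ sql
-- Fail over to a specific timestamp; replicated data above the resolved
-- failover_time is reverted so the standby is consistent at that time.
ALTER VIRTUAL CLUSTER main COMPLETE REPLICATION TO SYSTEM TIME '2025-03-01 10:00:00';
~~~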


Can we add, 'the cluster will revert any replicated data'... 'this timestamp to ensure that the standby is consistent with the primary at that timestamp'


Yes, done!

## Failback

During a replication stream, jobs running on the primary cluster will replicate to the standby cluster. Once you have [completed a failover](#step-2-complete-the-failover) (or a [failback](#fail-back-to-the-primary-cluster)), refer to the following sections for details on resuming jobs on the promoted cluster.
After failing over to the standby cluster, you may need to fail back to the original primary cluster to serve your application. Depending on the state of the primary cluster in the original PCR stream, use one of the following workflows:


Should we say, 'you may want to revert to your original primary-standby cluster setup.' Because they technically can serve their application using the standby cluster.

Should we say 'Depending on the configuration of the primary cluster?'


Nice, yes, done!

- [Create a new virtual cluster]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-4-start-replication) (`main`) on cluster **A** from the replication of cluster **B**. Cluster **A** is now virtualized. This will start an initial scan because the PCR stream will ignore the former workload tables in the system virtual cluster that were [originally replicated to **B**]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#set-up-pcr-from-an-existing-cluster). You can [drop the tables]({% link {{ page.version.version }}/drop-table.md %}) that were in the system virtual cluster, because the new virtual cluster will now hold the workload replicating from cluster **B**.
- [Start an entirely new cluster]({% link {{ page.version.version }}/cockroach-start.md %}) **C** and [create]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-4-start-replication) a `main` virtual cluster on it from the replication of cluster **B**. This will start an initial scan because cluster **C** is empty.
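Either workflow restarts replication in the opposite direction. As a hedged sketch of the first option (assuming the `START REPLICATION OF` failback syntax; the virtual cluster name `main` and the connection string are placeholders):

~~~ sql
-- On cluster A's system virtual cluster: create the main virtual cluster
-- from a replication stream of cluster B. The connection string is a placeholder.
ALTER VIRTUAL CLUSTER main START REPLICATION OF main ON 'postgresql://{user}@{cluster B node}:26257?options=-ccluster%3Dsystem';
~~~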
## Job management


This section is awesome - thanks so much!

### Changefeeds
[Changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}) will fail on the promoted cluster immediately after failover to avoid two clusters running the same changefeed to one sink. We recommend that you recreate changefeeds on the promoted cluster.


these two sentences kind of confuse me -- are we saying that changefeed schedules will pause and then currently running changefeeds will fail?


Yeah, I added "currently running" to the first paragraph. There was a difference in behavior for changefeeds (fail) vs. scheduled changefeeds (pause) when this was written, which I believe is still true.


ooh, I didn't even know that - thanks!

@katmayb katmayb requested a review from alicia-l2 March 13, 2025 17:13
Physical cluster replication is supported in CockroachDB {{ site.data.products.core }} clusters.
{{site.data.alerts.end}}

{% include_cached new-in.html version="v23.2" %} _Cutover_ in [**physical cluster replication (PCR)**]({% link {{ page.version.version }}/physical-cluster-replication-overview.md %}) allows you to switch from the active primary cluster to the passive standby cluster that has ingested replicated data. When you complete the replication stream to initiate a cutover, the job stops replicating data from the primary, sets the standby [virtual cluster]({% link {{ page.version.version }}/physical-cluster-replication-technical-overview.md %}) to a point in time (in the past or future) where all ingested data is consistent, and then makes the standby virtual cluster ready to accept traffic.

@alicia-l2 Updated here!

@katmayb katmayb requested a review from rmloveland March 18, 2025 14:00
katmayb commented Mar 18, 2025

@alicia-l2 Approved via slack!

@rmloveland rmloveland left a comment


LGTM!

### Step 1. Initiate the failover

To initiate a failover to the standby cluster, there are different ways of specifying the point in time for the standby's promotion. That is, the standby cluster's live data at the point of failover. Refer to the following sections for steps:
To initiate a failover to the standby cluster, you can specify the point in time for the standby's promotion in different ways. That is, the standby cluster's live data at the point of failover. Refer to the following sections for steps:
"in the following ways" ???


Much better! Thanks!

@katmayb katmayb force-pushed the failover-failback-improvements branch from 4e06874 to 0227470 on March 21, 2025 13:40
@katmayb katmayb merged commit 9bbbc27 into main Mar 21, 2025
6 checks passed
@katmayb katmayb deleted the failover-failback-improvements branch March 21, 2025 13:48