
@katmayb katmayb commented Feb 28, 2025

Fixes DOC-12358, DOC-12356

This PR is to clarify the failover/failback (formerly cutover/cutback) docs for PCR.

Changes include:

  • Clearly defined _failover_ and _failback_ in the intro of the failover page. (Note: this is slightly different in v23.2, because "cutback" was just starting a PCR stream again.)
  • Added detail that fast failback is not possible when the original PCR stream used non-UA clusters.
  • Communicated the non-UA to UA cluster workflow more clearly.
  • Updated the pages to check versioning requirements.
  • Streamlined wording in some descriptions and, where possible, moved away from "replication stream" to "PCR" now that we have LDR in the docs as well.
  • Added "Fast Failback" to the features list on the Overview page in the applicable versions.
  • In the prerequisites on the Setup PCR page and on the Failback page, changed the versioning requirement to "The standby cluster should be the same version or one version ahead of the primary cluster" rather than a specific version.
  • Removed the Enterprise license procedural content and prerequisite now that CRDB licenses have changed.

Note

This is PR 2 in a series of work to improve the PCR docs, especially the UX around the non-UA to UA workflow and failover/failback plus upgrades and versioning.

Follow-up PR will include improved descriptions on cluster versions and upgrade process.

Preview

https://deploy-preview-19405--cockroachdb-docs.netlify.app/docs/v25.1/failover-replication.html

netlify bot commented Feb 28, 2025

Deploy Preview for cockroachdb-api-docs canceled.

🔨 Latest commit: 0227470
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-api-docs/deploys/67dd6c387efc7500089ee3fe

netlify bot commented Feb 28, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

🔨 Latest commit: 0227470
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/67dd6c38fb7bcb0008fcbd82

netlify bot commented Feb 28, 2025

Netlify Preview

🔨 Latest commit: 0227470
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-docs/deploys/67dd6c389023580008224828
😎 Deploy Preview: https://deploy-preview-19405--cockroachdb-docs.netlify.app

@katmayb katmayb force-pushed the failover-failback-improvements branch 6 times, most recently from 01c9087 to 8f82f5d on March 6, 2025 19:52
## Before you begin

- Two separate CockroachDB clusters (primary and standby) with a minimum of three nodes each, and each using the same CockroachDB {{page.version.version}} version.
- Two separate CockroachDB clusters (primary and standby) with a minimum of three nodes each, and each using the same CockroachDB {{page.version.version}} version. The standby cluster should be the same version or one version ahead of the primary cluster.
Added this sentence to match the current description in the "Cluster versions and upgrades" section.

- The [Deploy CockroachDB on Premises]({% link {{ page.version.version }}/deploy-cockroachdb-on-premises.md %}) tutorial creates a self-signed certificate for each {{ site.data.products.core }} cluster. To create certificates signed by an external certificate authority, refer to [Create Security Certificates using OpenSSL]({% link {{ page.version.version }}/create-security-certificates-openssl.md %}).
- All nodes in each cluster will need access to the Certificate Authority for the other cluster. Refer to [Copy certificates](#step-3-copy-certificates).
- An [{{ site.data.products.enterprise }} license]({% link {{ page.version.version }}/licensing-faqs.md %}#types-of-licenses) on the primary **and** standby clusters. You must use the system virtual cluster on the primary and standby clusters to enable your {{ site.data.products.enterprise }} license.
- The primary and standby clusters **must have the same [region topology]({% link {{ page.version.version }}/topology-patterns.md %})**. For example, replicating a multi-region primary cluster to a single-region standby cluster is not supported. Mismatching regions between a multi-region primary and standby cluster is also not supported.
Enterprise licensing prereq + steps gone as a result of CRDB license updates.

- [Create a new virtual cluster]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-4-start-replication) (`main`) on cluster **A** from the replication of cluster **B**. Cluster **A** is now virtualized. This will start an initial scan because the PCR stream will ignore the former workload tables in the system virtual cluster that were [originally replicated to **B**]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#set-up-pcr-from-an-existing-cluster). You can [drop the tables]({% link {{ page.version.version }}/drop-table.md %}) that were in the system virtual cluster, because the new virtual cluster will now hold the workload replicating from cluster **B**.
- [Start an entirely new cluster]({% link {{ page.version.version }}/cockroach-start.md %}) **C** and [create]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-4-start-replication) a `main` virtual cluster on it from the replication of cluster **B**. This will start an initial scan because cluster **C** is empty.
## Job management
This isn't new content; I just moved Job management to the bottom of the page, after failback/cutback.

@katmayb katmayb marked this pull request as ready for review March 6, 2025 20:10
@katmayb katmayb requested a review from alicia-l2 March 6, 2025 20:11

@alicia-l2 alicia-l2 left a comment


LGTM, left some comments - thank you!

{{site.data.alerts.end}}

The failover is a two-step process on the standby cluster:
_Failover_ in [**physical cluster replication (PCR)**]({% link {{ page.version.version }}/physical-cluster-replication-overview.md %}) allows you to switch from the active primary cluster to the passive standby cluster that has ingested replicated data. When you complete the replication stream to initiate a failover, the job stops the stream of new data, resets the standby [virtual cluster]({% link {{ page.version.version }}/physical-cluster-replication-technical-overview.md %}) to a point in time where all ingested data is consistent, and then marks the standby virtual cluster as ready to accept traffic.


stops the stream

Should we just say stops replicating data from the primary?

resets

should we say sets?

marks the standby virtual cluster as ready to accept traffic.

should we say 'makes' instead of marks? Because the standby cluster is ready to accept traffic at that time. or like enables? Something just more explicit that the standby is online


Oh also - I don't know how long you wanted to make this section, but it could be cool to mention how we let you failover to a time in the past or future. Defer to you though (we could also put that in the overview)!


Implemented all of these suggestions — thank you!


1. [Initiating the failover](#step-1-initiate-the-failover).
1. [Completing the failover](#step-2-complete-the-failover).
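For reference, the two steps above correspond to SQL run against the standby cluster's system virtual cluster. This is a sketch only, assuming the documented PCR syntax; the virtual cluster name `main` is a placeholder, and exact statements may differ by version:

~~~ sql
-- Step 1: initiate the failover to the latest consistent timestamp.
ALTER VIRTUAL CLUSTER main COMPLETE REPLICATION TO LATEST;

-- Monitor the replication status until the failover completes.
SHOW VIRTUAL CLUSTER main WITH REPLICATION STATUS;

-- Step 2: complete the failover by starting SQL service on the promoted virtual cluster.
ALTER VIRTUAL CLUSTER main START SERVICE SHARED;
~~~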
_Failback_ in PCR switches operations back to the original primary cluster (or a different cluster) after a failover event. When you initiate a failback, the job ensures the original primary is up to date with writes from the standby that happened after failover. The original primary cluster is then set as ready to accept application traffic once again.


Hm, what do you mean by a different cluster? Do you mean like a net new cluster?


yes, I did mean that. Do you want me to take that out or change to "new cluster"?


I think maybe "new cluster" would be more explicit? Only calling it out because I've recently had some folks confused about it when we were talking through PCR.

The `failover_time` is the timestamp at which the replicated data is consistent. The cluster will revert any data above this timestamp:
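To illustrate, a failover can also target an explicit point in time rather than the latest consistent timestamp. A sketch, assuming the documented `TO SYSTEM TIME` syntax and a virtual cluster named `main` (the timestamp is a placeholder):

~~~ sql
-- Fail over to a specific timestamp; replicated data above the resolved
-- failover_time is reverted so the standby is consistent at that time.
ALTER VIRTUAL CLUSTER main COMPLETE REPLICATION TO SYSTEM TIME '2025-03-01 10:00:00';
~~~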


Can we add, 'the cluster will revert any replicated data'... 'this timestamp to ensure that the standby is consistent with the primary at that timestamp'


Yes, done!

## Failback

During a replication stream, jobs running on the primary cluster will replicate to the standby cluster. Once you have [completed a failover](#step-2-complete-the-failover) (or a [failback](#fail-back-to-the-primary-cluster)), refer to the following sections for details on resuming jobs on the promoted cluster.
After failing over to the standby cluster, you may need to fail back to the original primary cluster to serve your application. Depending on the state of the primary cluster in the original PCR stream, use one of the following workflows:


Should we say, 'you may want to revert to your original primary-standby cluster setup.' Because they technically can serve their application using the standby cluster.

Should we say 'Depending on the configuration of the primary cluster?'


Nice, yes, done!

- [Create a new virtual cluster]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-4-start-replication) (`main`) on cluster **A** from the replication of cluster **B**. Cluster **A** is now virtualized. This will start an initial scan because the PCR stream will ignore the former workload tables in the system virtual cluster that were [originally replicated to **B**]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#set-up-pcr-from-an-existing-cluster). You can [drop the tables]({% link {{ page.version.version }}/drop-table.md %}) that were in the system virtual cluster, because the new virtual cluster will now hold the workload replicating from cluster **B**.
- [Start an entirely new cluster]({% link {{ page.version.version }}/cockroach-start.md %}) **C** and [create]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-4-start-replication) a `main` virtual cluster on it from the replication of cluster **B**. This will start an initial scan because cluster **C** is empty.
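Either workflow restarts replication in the opposite direction. As a hedged sketch of the first option (assuming the `START REPLICATION OF` failback syntax; the virtual cluster name `main` and the connection string are placeholders):

~~~ sql
-- On cluster A's system virtual cluster: create the main virtual cluster
-- from a replication stream of cluster B. The connection string is a placeholder.
ALTER VIRTUAL CLUSTER main START REPLICATION OF main ON 'postgresql://{user}@{cluster B node}:26257?options=-ccluster%3Dsystem';
~~~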
## Job management


This section is awesome - thanks so much!

### Changefeeds
[Changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}) will fail on the promoted cluster immediately after failover to avoid two clusters running the same changefeed to one sink. We recommend that you recreate changefeeds on the promoted cluster.


these two sentences kind of confuse me -- are we saying that changefeed schedules will pause and then currently running changefeeds will fail?


Yeah, I added "currently running" to the first paragraph. There was a difference in behavior for changefeeds (fail) vs. scheduled changefeeds (pause) when this was written, which I believe is still true.


ooh, I didn't even know that - thanks!

@katmayb katmayb requested a review from alicia-l2 March 13, 2025 17:13
Physical cluster replication is supported in CockroachDB {{ site.data.products.core }} clusters.
{{site.data.alerts.end}}

{% include_cached new-in.html version="v23.2" %} _Cutover_ in [**physical cluster replication (PCR)**]({% link {{ page.version.version }}/physical-cluster-replication-overview.md %}) allows you to switch from the active primary cluster to the passive standby cluster that has ingested replicated data. When you complete the replication stream to initiate a cutover, the job stops replicating data from the primary, sets the standby [virtual cluster]({% link {{ page.version.version }}/physical-cluster-replication-technical-overview.md %}) to a point in time (in the past or future) where all ingested data is consistent, and then makes the standby virtual cluster ready to accept traffic.

@alicia-l2 Updated here!

@katmayb katmayb requested a review from rmloveland March 18, 2025 14:00
katmayb commented Mar 18, 2025

@alicia-l2 Approved via slack!

@rmloveland rmloveland left a comment


LGTM!

### Step 1. Initiate the failover

To initiate a failover to the standby cluster, there are different ways of specifying the point in time for the standby's promotion. That is, the standby cluster's live data at the point of failover. Refer to the following sections for steps:
To initiate a failover to the standby cluster, you can specify the point in time for the standby's promotion in different ways. That is, the standby cluster's live data at the point of failover. Refer to the following sections for steps:
"in the following ways" ???


Much better! Thanks!

@katmayb katmayb force-pushed the failover-failback-improvements branch from 4e06874 to 0227470 on March 21, 2025 13:40
@katmayb katmayb merged commit 9bbbc27 into main Mar 21, 2025
6 checks passed
@katmayb katmayb deleted the failover-failback-improvements branch March 21, 2025 13:48