Skip to content

Conversation

@peachdawnleach
Copy link
Contributor

Addresses: DOC-14757

Working on a minor overhaul of the PCR docs- these are the smaller/less involved changes. Expecting to have separate PRs for individual larger changes soon.

@netlify
Copy link

netlify bot commented Sep 8, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit c15ab3c
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/68c9b7776e7b4600084f68b8

@netlify
Copy link

netlify bot commented Sep 8, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit c15ab3c
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/68c9b777ef3be60008637079

@github-actions
Copy link

github-actions bot commented Sep 8, 2025

@netlify
Copy link

netlify bot commented Sep 8, 2025

Netlify Preview

Name Link
🔨 Latest commit c15ab3c
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/68c9b77736ae8c0007aeda03
😎 Deploy Preview https://deploy-preview-20304--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link

@alicia-l2 alicia-l2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm other than some nits! Could you request a review from @msbutler as well?

{% include_cached copy-clipboard.html %}
~~~ shell
cockroach encode-uri {replication user}:{password}@{node IP or hostname}:26257 --ca-cert certs/ca.crt --inline
cockroach encode-uri {replication user}:{password}@{node IP or hostname}:26257 --ca-cert {path to certs directory}/certs/ca.crt --inline

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to check, are you planning to add the CREATE EXTERAL CONNECTION stuff after this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes- hoping to talk to you more about this soon

@msbutler msbutler self-requested a review September 9, 2025 17:37
@msbutler
Copy link

@peachdawnleach @alicia-l2 is the plan to update the docs for older major versions as well?

@alicia-l2
Copy link

@peachdawnleach @alicia-l2 is the plan to update the docs for older major versions as well?

Yes, backport to 24.1

@msbutler
Copy link

hey @peachdawnleach , could you coax the docs review automation to create a list of preview pages i can review? It's hard to tell for the markdown files what the changes will actually reflect.

Also, is standard for this patch to contain updates to previous release versions. We'd like these changes to be backported through 24.1

Added a number of small changes from PCR overhaul doc and fixed broken links and typos

More various changes

More various changes - will elaborate as needed in comments

Moved info

Moved info to a more relevant section

Fixed broken link

Accidentally broke a link - fixed typo

Fixed more broken links

Lots of broken links from these updates- hopefully fixed the last ones

Minor fixes from review

A few minor changes based on Alicia's review
@peachdawnleach peachdawnleach force-pushed the 20250902-DOC-14757-small-pcr-changes branch from 2337283 to e4f1917 Compare September 10, 2025 18:33
@peachdawnleach
Copy link
Contributor Author

Hi @msbutler - the build isn't working right now while I'm working on backporting, but this preview is an accurate representation of the changes made.

Copy link

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left too many comments :D. Perhaps @alicia-l2 should take a first pass at my hot takes.

{{site.data.alerts.callout_danger}}
The standby cluster must be at the same version as, or one version ahead of, the primary's virtual cluster.
{{site.data.alerts.callout_info}}
The entire standby cluster must be at the same version as, or one version ahead of, the primary's virtual cluster at the time of [failover]({% link {{ page.version.version }}/failover-replication.md %}).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by "entire"? Imho this sentence would be clearer if it were: "the system virtual cluster on the standby must always be at the same or one non skippable version ahead of the system virtual cluster on the primary".

Some other thoughts:

  • Recall that the replicating tenant on the standby is a pure reflection (including its version) of the app tenant on the primary.
  • innovations releases make this "one major version" language hard to talk about. we should use the same language used in our backup/restore docs.
  • I don't think these constraints are specific failover time. they apply always. See the magic doc.

cc @alicia-l2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, let's remove 'at the time of failover.' I think we need to add another page to describe upgrades, let's keep as is for now (after removing 'at the time of failover').

{% include {{ page.version.version }}/physical-replication/interface-virtual-cluster.md %}

This separation of concerns means that the replication stream can operate without affecting work happening in a virtual cluster.
If you utilize the read from standby feature in PCR, the standby cluster has an additional reader virtual cluster which is a copy of the application virtual cluster.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think "copy" is the right term here as it implies that it's copy of all user data. We could instead write: "additional reader virtual cluster which safely serves read requests on the replicating virtual cluster" You could also provide some analogy with symlinks if you wanted.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah let's do that, eventually we're going to have another page for this btw

This separation of concerns means that the replication stream can operate without affecting work happening in a virtual cluster.
If you utilize the read from standby feature in PCR, the standby cluster has an additional reader virtual cluster which is a copy of the application virtual cluster.

This separation of controls and data means that the replication stream can operate without affecting work happening in a virtual cluster.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"without affecting work happening in a virtual cluster." i don't understand what this means. is this referring specifically to the primary host cluster?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this line, this line is basically trying to describe cluster virtualization which we already have a section of docs dedicated to

The stream initialization proceeds as follows:

1. The standby's consumer job connects via its system virtual cluster to the primary cluster and starts the primary cluster's physical stream producer job.
1. The standby's consumer job connects to the primary cluster via the standby's system virtual cluster and starts the primary cluster's physical stream producer job.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove "physical". fwiw, the actual job type is "REPLICATION STREAM PRODUCER"

### Failover and promotion process

The tracked replicated time and the advancing protected timestamp allows the replication stream to also track _retained time_, which is a timestamp in the past indicating the lower bound that the replication stream could fail over to. Therefore, the _failover window_ for a replication job falls between the retained time and the replicated time.
The tracked replicated time and the advancing protected timestamp allow the replication stream to also track _retained time_, which is a timestamp in the past indicating the lower bound that the replication stream could fail over to. The retained time can be up to 4 hours in the past, due to the protected timestamp. Therefore, the _failover window_ for a replication job falls between the retained time and the replicated time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i don't see what the "tracked" and "advancing" adjectives are doing here. both the replicated time and the protected timestamp are "tracked" and "advancing". I'd remove them.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep them, the fact that they're tracked/advancing isn't necessarily obvious to users who aren't already familiar with PCR.

~~~

### Create a user for the standby cluster
### Create a replication user and password

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Create a user with replication privileges"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this wasn't changed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right- this was in two places and only got changed in one. Thanks!

You can replicate data from an existing CockroachDB cluster that does not have [cluster virtualization]({% link {{ page.version.version }}/cluster-virtualization-overview.md %}) enabled to a standby cluster with cluster virtualization enabled. For instructions on setting up a PCR in this way, refer to [Set up PCR from an existing cluster]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#set-up-pcr-from-an-existing-cluster).
After a [failover](#failover) to the standby cluster, you may want to then set up PCR from the original standby cluster, which is now the primary, to another cluster, which will become the standby. There are couple of ways to set up a new standby, and some considerations.
After a [failover](#failover) to the standby cluster, you must set up PCR from the original standby cluster, which is now the primary, to another cluster, which will become the standby. There are multiple ways to set up a new standby, and some considerations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imho: i think this tutorial could more careful about using always using "original primary" and "original standby" when describing failback.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the change from "you may want to" to "you must"? What if PCR is used for a one time migration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alicia-l2 I can't really speak to why this change was made, but it was in the google doc I worked from.

@msbutler I agree that this could be more careful and specific- I'll add that to my list of more substantial changes that I'll be working on in upcoming PRs.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep it as 'you may want'

@peachdawnleach
Copy link
Contributor Author

Moving the backport to a separate branch/PR as it's fairly complicated with replacing old terminology

@peachdawnleach
Copy link
Contributor Author

Hi @alicia-l2 and @msbutler - all requested changes have been addressed, am I good to proceed to docs review on this? Thanks!

~~~

### Create a user for the standby cluster
### Create a replication user and password

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this wasn't changed

@peachdawnleach
Copy link
Contributor Author

Changes ready for docs review: preview here

@florence-crl florence-crl self-requested a review September 15, 2025 20:42
Copy link
Contributor

@florence-crl florence-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm pending suggestions


- [From the original standby cluster (after it was promoted during failover) to the original primary cluster](#fail-back-to-the-original-primary-cluster).
- [After the PCR stream used an existing cluster as the primary cluster](#fail-back-after-pcr-from-an-existing-cluster).
- [From the original standby cluster (after it was promoted during failover) to the original primary cluster](#fail-back-to-the-original-primary-cluster). If this failback is initiated within 24 hours of the failover, PCR replicates the net-new changes from the standby cluster to the primary cluster, so you do not need to re-seed the primary cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is meant by "re-seed the primary cluster"? Please clarify.

Also should "re-seed" not have a hyphen and instead be "reseed"? Below "reseeding" is used in src/current/v25.3/physical-cluster-replication-overview.md on line 34.

- **Improved RPO and RTO**: Depending on workload and deployment configuration, [replication lag]({% link {{ page.version.version }}/physical-cluster-replication-technical-overview.md %}) between the primary and standby is generally in the tens-of-seconds range. The failover process from the primary cluster to the standby should typically happen within five minutes when completing a failover to the latest replicated time using [`LATEST`]({% link {{ page.version.version }}/alter-virtual-cluster.md %}#synopsis).
- **Failover to a timestamp in the past or the future**: In the case of logical disasters or mistakes, you can [fail over]({% link {{ page.version.version }}/failover-replication.md %}) from the primary to the standby cluster to a timestamp in the past. This means that you can return the standby to a timestamp before the mistake was replicated to the standby. Furthermore, you can plan a failover by specifying a timestamp in the future.
- **Fast failback**: Switch back from the promoted standby cluster to the original primary cluster after a failover event without an initial scan.
- **Fast failback**: Switch back from the promoted standby cluster to the original primary cluster after a failover event without reseeding data for an initial scan.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to above comment: maybe clarify "reseeding data".

peachdawnleach and others added 2 commits September 16, 2025 14:37
Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>
Changes from docs review
@peachdawnleach peachdawnleach enabled auto-merge (squash) September 16, 2025 19:16
@peachdawnleach peachdawnleach merged commit 268762c into main Sep 16, 2025
6 checks passed
@peachdawnleach peachdawnleach deleted the 20250902-DOC-14757-small-pcr-changes branch September 16, 2025 19:40
peachdawnleach added a commit that referenced this pull request Sep 16, 2025
* Various changes

Added a number of small changes from PCR overhaul doc and fixed broken links and typos

More various changes

More various changes - will elaborate as needed in comments

Moved info

Moved info to a more relevant section

Fixed broken link

Accidentally broke a link - fixed typo

Fixed more broken links

Lots of broken links from these updates- hopefully fixed the last ones

Minor fixes from review

A few minor changes based on Alicia's review

* Adjustments from review

Some changes based on a review from Michael Butler

* changed 'must' to 'may want to'

one more change from review

* Small change from review

Small reword from review

* Fixed broken links

Fixed broken links

* Apply suggestions from code review

Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>

* Changes from docs review

Changes from docs review

---------

Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>
peachdawnleach added a commit that referenced this pull request Sep 18, 2025
* Various changes

Added a number of small changes from PCR overhaul doc and fixed broken links and typos

More various changes

More various changes - will elaborate as needed in comments

Moved info

Moved info to a more relevant section

Fixed broken link

Accidentally broke a link - fixed typo

Fixed more broken links

Lots of broken links from these updates- hopefully fixed the last ones

Minor fixes from review

A few minor changes based on Alicia's review

* Full backport

Fully backported DOC-14757 changes as far as v24.1.

* Add small changes from PCR overhaul (#20304)

* Various changes

Added a number of small changes from PCR overhaul doc and fixed broken links and typos

More various changes

More various changes - will elaborate as needed in comments

Moved info

Moved info to a more relevant section

Fixed broken link

Accidentally broke a link - fixed typo

Fixed more broken links

Lots of broken links from these updates- hopefully fixed the last ones

Minor fixes from review

A few minor changes based on Alicia's review

* Adjustments from review

Some changes based on a review from Michael Butler

* changed 'must' to 'may want to'

one more change from review

* Small change from review

Small reword from review

* Fixed broken links

Fixed broken links

* Apply suggestions from code review

Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>

* Changes from docs review

Changes from docs review

---------

Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>

* Backported changes from review

Backported changes from review

* Fixed broken links

* Fixed redirects

* Removed accidental plus signs

* Apply suggestions from code review

Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>

* Update src/current/v24.1/failover-replication.md

Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>

---------

Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants