-
Notifications
You must be signed in to change notification settings - Fork 477
Add small changes from PCR overhaul #20304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for cockroachdb-interactivetutorials-docs canceled.
|
✅ Deploy Preview for cockroachdb-api-docs canceled.
|
✅ Netlify Preview
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm other than some nits! Could you request a review from @msbutler as well?
| {% include_cached copy-clipboard.html %} | ||
| ~~~ shell | ||
| cockroach encode-uri {replication user}:{password}@{node IP or hostname}:26257 --ca-cert certs/ca.crt --inline | ||
| cockroach encode-uri {replication user}:{password}@{node IP or hostname}:26257 --ca-cert {path to certs directory}/certs/ca.crt --inline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to check, are you planning to add the CREATE EXTERAL CONNECTION stuff after this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes- hoping to talk to you more about this soon
|
@peachdawnleach @alicia-l2 is the plan to update the docs for older major versions as well? |
Yes, backport to 24.1 |
|
hey @peachdawnleach , could you coax the docs review automation to create a list of preview pages i can review? It's hard to tell for the markdown files what the changes will actually reflect. Also, is standard for this patch to contain updates to previous release versions. We'd like these changes to be backported through 24.1 |
Added a number of small changes from PCR overhaul doc and fixed broken links and typos More various changes More various changes - will elaborate as needed in comments Moved info Moved info to a more relevant section Fixed broken link Accidentally broke a link - fixed typo Fixed more broken links Lots of broken links from these updates- hopefully fixed the last ones Minor fixes from review A few minor changes based on Alicia's review
2337283 to
e4f1917
Compare
|
Hi @msbutler - the build isn't working right now while I'm working on backporting, but this preview is an accurate representation of the changes made. |
msbutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left too many comments :D. Perhaps @alicia-l2 should take a first pass at my hot takes.
| {{site.data.alerts.callout_danger}} | ||
| The standby cluster must be at the same version as, or one version ahead of, the primary's virtual cluster. | ||
| {{site.data.alerts.callout_info}} | ||
| The entire standby cluster must be at the same version as, or one version ahead of, the primary's virtual cluster at the time of [failover]({% link {{ page.version.version }}/failover-replication.md %}). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean by "entire"? Imho this sentence would be clearer if it were: "the system virtual cluster on the standby must always be at the same or one non skippable version ahead of the system virtual cluster on the primary".
Some other thoughts:
- Recall that the replicating tenant on the standby is a pure reflection (including its version) of the app tenant on the primary.
- innovations releases make this "one major version" language hard to talk about. we should use the same language used in our backup/restore docs.
- I don't think these constraints are specific failover time. they apply always. See the magic doc.
cc @alicia-l2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, let's remove 'at the time of failover.' I think we need to add another page to describe upgrades, let's keep as is for now (after removing 'at the time of failover').
| {% include {{ page.version.version }}/physical-replication/interface-virtual-cluster.md %} | ||
|
|
||
| This separation of concerns means that the replication stream can operate without affecting work happening in a virtual cluster. | ||
| If you utilize the read from standby feature in PCR, the standby cluster has an additional reader virtual cluster which is a copy of the application virtual cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think "copy" is the right term here as it implies that it's copy of all user data. We could instead write: "additional reader virtual cluster which safely serves read requests on the replicating virtual cluster" You could also provide some analogy with symlinks if you wanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah let's do that, eventually we're going to have another page for this btw
| This separation of concerns means that the replication stream can operate without affecting work happening in a virtual cluster. | ||
| If you utilize the read from standby feature in PCR, the standby cluster has an additional reader virtual cluster which is a copy of the application virtual cluster. | ||
|
|
||
| This separation of controls and data means that the replication stream can operate without affecting work happening in a virtual cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"without affecting work happening in a virtual cluster." i don't understand what this means. is this referring specifically to the primary host cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this line, this line is basically trying to describe cluster virtualization which we already have a section of docs dedicated to
| The stream initialization proceeds as follows: | ||
|
|
||
| 1. The standby's consumer job connects via its system virtual cluster to the primary cluster and starts the primary cluster's physical stream producer job. | ||
| 1. The standby's consumer job connects to the primary cluster via the standby's system virtual cluster and starts the primary cluster's physical stream producer job. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove "physical". fwiw, the actual job type is "REPLICATION STREAM PRODUCER"
| ### Failover and promotion process | ||
|
|
||
| The tracked replicated time and the advancing protected timestamp allows the replication stream to also track _retained time_, which is a timestamp in the past indicating the lower bound that the replication stream could fail over to. Therefore, the _failover window_ for a replication job falls between the retained time and the replicated time. | ||
| The tracked replicated time and the advancing protected timestamp allow the replication stream to also track _retained time_, which is a timestamp in the past indicating the lower bound that the replication stream could fail over to. The retained time can be up to 4 hours in the past, due to the protected timestamp. Therefore, the _failover window_ for a replication job falls between the retained time and the replicated time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: i don't see what the "tracked" and "advancing" adjectives are doing here. both the replicated time and the protected timestamp are "tracked" and "advancing". I'd remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep them, the fact that they're tracked/advancing isn't necessarily obvious to users who aren't already familiar with PCR.
| ~~~ | ||
|
|
||
| ### Create a user for the standby cluster | ||
| ### Create a replication user and password |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "Create a user with replication privileges"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this wasn't changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right- this was in two places and only got changed in one. Thanks!
| You can replicate data from an existing CockroachDB cluster that does not have [cluster virtualization]({% link {{ page.version.version }}/cluster-virtualization-overview.md %}) enabled to a standby cluster with cluster virtualization enabled. For instructions on setting up a PCR in this way, refer to [Set up PCR from an existing cluster]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#set-up-pcr-from-an-existing-cluster). | ||
| After a [failover](#failover) to the standby cluster, you may want to then set up PCR from the original standby cluster, which is now the primary, to another cluster, which will become the standby. There are couple of ways to set up a new standby, and some considerations. | ||
| After a [failover](#failover) to the standby cluster, you must set up PCR from the original standby cluster, which is now the primary, to another cluster, which will become the standby. There are multiple ways to set up a new standby, and some considerations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imho: i think this tutorial could more careful about using always using "original primary" and "original standby" when describing failback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the change from "you may want to" to "you must"? What if PCR is used for a one time migration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alicia-l2 I can't really speak to why this change was made, but it was in the google doc I worked from.
@msbutler I agree that this could be more careful and specific- I'll add that to my list of more substantial changes that I'll be working on in upcoming PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep it as 'you may want'
Some changes based on a review from Michael Butler
|
Moving the backport to a separate branch/PR as it's fairly complicated with replacing old terminology |
one more change from review
|
Hi @alicia-l2 and @msbutler - all requested changes have been addressed, am I good to proceed to docs review on this? Thanks! |
| ~~~ | ||
|
|
||
| ### Create a user for the standby cluster | ||
| ### Create a replication user and password |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this wasn't changed
Small reword from review
Fixed broken links
…kroachdb/docs into 20250902-DOC-14757-small-pcr-changes
|
Changes ready for docs review: preview here |
florence-crl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm pending suggestions
|
|
||
| - [From the original standby cluster (after it was promoted during failover) to the original primary cluster](#fail-back-to-the-original-primary-cluster). | ||
| - [After the PCR stream used an existing cluster as the primary cluster](#fail-back-after-pcr-from-an-existing-cluster). | ||
| - [From the original standby cluster (after it was promoted during failover) to the original primary cluster](#fail-back-to-the-original-primary-cluster). If this failback is initiated within 24 hours of the failover, PCR replicates the net-new changes from the standby cluster to the primary cluster, so you do not need to re-seed the primary cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is meant by "re-seed the primary cluster"? Please clarify.
Also should "re-seed" not have a hyphen and instead be "reseed"? Below "reseeding" is used in src/current/v25.3/physical-cluster-replication-overview.md on line 34.
| - **Improved RPO and RTO**: Depending on workload and deployment configuration, [replication lag]({% link {{ page.version.version }}/physical-cluster-replication-technical-overview.md %}) between the primary and standby is generally in the tens-of-seconds range. The failover process from the primary cluster to the standby should typically happen within five minutes when completing a failover to the latest replicated time using [`LATEST`]({% link {{ page.version.version }}/alter-virtual-cluster.md %}#synopsis). | ||
| - **Failover to a timestamp in the past or the future**: In the case of logical disasters or mistakes, you can [fail over]({% link {{ page.version.version }}/failover-replication.md %}) from the primary to the standby cluster to a timestamp in the past. This means that you can return the standby to a timestamp before the mistake was replicated to the standby. Furthermore, you can plan a failover by specifying a timestamp in the future. | ||
| - **Fast failback**: Switch back from the promoted standby cluster to the original primary cluster after a failover event without an initial scan. | ||
| - **Fast failback**: Switch back from the promoted standby cluster to the original primary cluster after a failover event without reseeding data for an initial scan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to above comment: maybe clarify "reseeding data".
Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>
Changes from docs review
* Various changes Added a number of small changes from PCR overhaul doc and fixed broken links and typos More various changes More various changes - will elaborate as needed in comments Moved info Moved info to a more relevant section Fixed broken link Accidentally broke a link - fixed typo Fixed more broken links Lots of broken links from these updates- hopefully fixed the last ones Minor fixes from review A few minor changes based on Alicia's review * Adjustments from review Some changes based on a review from Michael Butler * changed 'must' to 'may want to' one more change from review * Small change from review Small reword from review * Fixed broken links Fixed broken links * Apply suggestions from code review Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com> * Changes from docs review Changes from docs review --------- Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>
* Various changes Added a number of small changes from PCR overhaul doc and fixed broken links and typos More various changes More various changes - will elaborate as needed in comments Moved info Moved info to a more relevant section Fixed broken link Accidentally broke a link - fixed typo Fixed more broken links Lots of broken links from these updates- hopefully fixed the last ones Minor fixes from review A few minor changes based on Alicia's review * Full backport Fully backported DOC-14757 changes as far as v24.1. * Add small changes from PCR overhaul (#20304) * Various changes Added a number of small changes from PCR overhaul doc and fixed broken links and typos More various changes More various changes - will elaborate as needed in comments Moved info Moved info to a more relevant section Fixed broken link Accidentally broke a link - fixed typo Fixed more broken links Lots of broken links from these updates- hopefully fixed the last ones Minor fixes from review A few minor changes based on Alicia's review * Adjustments from review Some changes based on a review from Michael Butler * changed 'must' to 'may want to' one more change from review * Small change from review Small reword from review * Fixed broken links Fixed broken links * Apply suggestions from code review Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com> * Changes from docs review Changes from docs review --------- Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com> * Backported changes from review Backported changes from review * Fixed broken links * Fixed redirects * Removed accidental plus signs * Apply suggestions from code review Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com> * Update src/current/v24.1/failover-replication.md Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com> --------- Co-authored-by: Florence Morris <58752716+florence-crl@users.noreply.github.com>
Addresses: DOC-14757
Working on a minor overhaul of the PCR docs- these are the smaller/less involved changes. Expecting to have separate PRs for individual larger changes soon.