@jseldess jseldess commented Mar 7, 2018

  • Add FAQ on handling planned node maintenance
  • Add guidance on using server.time_until_store_dead for rolling upgrades
  • Minor improvements to cockroach quit doc
  • Retitle Remove Nodes to Decommission Nodes

Fixes #2038

@cockroach-teamcity
This change is Reviewable

@jseldess jseldess requested a review from dianasaur323 March 7, 2018 05:30
@jseldess jseldess force-pushed the maintenance-events branch from 7a042fc to af982d5 Compare March 7, 2018 05:34
@dianasaur323 dianasaur323 left a comment

lgtm!

@jseldess jseldess force-pushed the maintenance-events branch from af982d5 to 870d0e9 Compare March 7, 2018 20:17

jseldess commented Mar 7, 2018

TFTR, @dianasaur323!

bdarnell commented Mar 7, 2018

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


v2.0/upgrade-cockroach-version.md, line 34 at r5 (raw file):

5. By default, if a node stays offline for more than 5 minutes, the cluster will consider it dead and will rebalance its data to other nodes. If you expect any nodes to be offline for longer than 5 minutes, you can prevent the cluster from unnecessarily rebalancing data off the nodes by increasing the `server.time_until_store_dead` [cluster setting](cluster-settings.html) to match the estimated maintenance window.

    For example, let's say you expect nodes to be offline for up to 10 minutes during the rolling upgrade. Before starting, you would change the `server.time_until_store_dead` cluster setting as follows:

Why would you expect this much downtime during a rolling upgrade? You should easily be able to complete your cockroach upgrade within the default 5m window (as long as you're automating things instead of typing them by hand during the upgrade).

In my experience the only thing that might take long enough to warrant an increase in server.time_until_store_dead is a kernel upgrade.


Comments from Reviewable

@jseldess jseldess force-pushed the maintenance-events branch from 870d0e9 to 90efa5a Compare March 7, 2018 20:34

jseldess commented Mar 7, 2018

v2.0/upgrade-cockroach-version.md, line 34 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why would you expect this much downtime during a rolling upgrade? You should easily be able to complete your cockroach upgrade within the default 5m window (as long as you're automating things instead of typing them by hand during the upgrade).

In my experience the only thing that might take long enough to warrant an increase in server.time_until_store_dead is a kernel upgrade.

Downgraded this to a note that states 5 min should be more than enough. Calls out the setting just in case. Is that ok, @bdarnell?


bdarnell commented Mar 7, 2018

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


v2.0/upgrade-cockroach-version.md, line 34 at r5 (raw file):

Previously, jseldess wrote…

Downgraded this to a note that states 5 min should be more than enough. Calls out the setting just in case. Is that ok, @bdarnell?

I'd probably leave it out completely, but since we're writing these instructions as if you're running commands by hand instead of automating them, it may be worth leaving in a mention.


jseldess commented Mar 8, 2018

v2.0/upgrade-cockroach-version.md, line 34 at r5 (raw file):
OK, removed.

At the beginning of the "Perform the rolling upgrade" section, we have a tip:

We recommend creating scripts to perform these steps instead of performing them by hand.

But I'd like to get more information from you about how this page could be more focused on automation. Seems to me that manual steps are necessary given that users will use differing tools. But I'd love to create and link to an example script or config for a specific system as well.


Jesse Seldess added 2 commits March 8, 2018 09:50
- Add an operational FAQ
- Add note to rolling upgrades doc
- Minor improvements to cockroach quit doc
@jseldess jseldess force-pushed the maintenance-events branch from d762d8d to c2b4566 Compare March 8, 2018 14:50
@jseldess jseldess merged commit 9989908 into master Mar 8, 2018
@jseldess jseldess deleted the maintenance-events branch March 8, 2018 17:02

bdarnell commented Mar 8, 2018

Review status: 0 of 6 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


v2.0/upgrade-cockroach-version.md, line 34 at r5 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

OK, removed.

At the beginning of the "Perform the rolling upgrade" section, we have a tip:

We recommend creating scripts to perform these steps instead of performing them by hand.

But I'd like to get more information from you about how this page could be more focused on automation. Seems to me that manual steps are necessary given that users will use differing tools. But I'd love to create and link to an example script or config for a specific system as well.

My view is that we have to walk through the manual process here so that people will understand what they have to write in their scripts. Over time we can develop a library of scripts/configs (like we have for Kubernetes, etc.), and those automated versions will displace these manual steps.
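
As a rough illustration of the kind of helper such a script might contain, here is a minimal Python sketch of the decision discussed in this thread: only raise `server.time_until_store_dead` when the expected per-node downtime exceeds the 5-minute default. The function and constant names (`store_dead_statement`, `format_duration`, `DEFAULT_STORE_DEAD`) are hypothetical and not part of any CockroachDB tooling; the sketch only builds the `SET CLUSTER SETTING` statement as a string and does not connect to a cluster.

```python
from datetime import timedelta
from typing import Optional

# Default cited in this thread: a node offline for more than 5 minutes
# is considered dead and its data is rebalanced to other nodes.
DEFAULT_STORE_DEAD = timedelta(minutes=5)


def format_duration(d: timedelta) -> str:
    """Render a timedelta as a Go-style duration string such as '10m0s'."""
    minutes, seconds = divmod(int(d.total_seconds()), 60)
    return f"{minutes}m{seconds}s"


def store_dead_statement(expected_downtime: timedelta) -> Optional[str]:
    """Return the SQL statement to run before maintenance, or None.

    None means the default window already covers the expected downtime,
    which is the common case for an automated rolling upgrade.
    """
    if expected_downtime <= DEFAULT_STORE_DEAD:
        return None
    return (
        "SET CLUSTER SETTING server.time_until_store_dead = "
        f"'{format_duration(expected_downtime)}';"
    )


if __name__ == "__main__":
    # Hypothetical case: a kernel upgrade expected to take about 10 minutes.
    print(store_dead_statement(timedelta(minutes=10)))
    # A quick automated binary swap needs no change.
    print(store_dead_statement(timedelta(minutes=3)))
```

For a 10-minute window this prints `SET CLUSTER SETTING server.time_until_store_dead = '10m0s';`, and for a 3-minute window it prints `None`. A real script would also need to restore the setting to its previous value once maintenance is complete.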

