Conversation

@a-robinson (Contributor)

It was broken for a few main reasons:

  1. There was no guarantee that important system data wouldn't end up on
    the two nodes that we're bringing down, since we were only using two
    localities.
  2. Even if we used 3 localities, the pattern of starting up a 3-node
    cluster first and then adding more nodes means that the system ranges
    may not be properly spread across the localities, since in v1.1 we
    don't proactively move data around to improve diversity.
  3. A read-only query isn't guaranteed to hang even if a range is
    unavailable. If we only kill the 2 non-leaseholders, the leaseholder
    will still be able to keep extending its lease (via node liveness)
    and serve reads.

To fix #1, I've modified this to spin up a 9-node cluster across 3
localities.
To fix #2, I've spun up all the nodes before running `cockroach init`.
We can go back to the old way of doing this once the labs use v2.0.
To fix #3, I've switched from demoing a SELECT to using an INSERT.
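Taken together, fixes #1 and #2 amount to a particular startup ordering. A minimal sketch of what that looks like, assuming an insecure local cluster — the store paths, port arithmetic, and loop structure here are illustrative, not the lab's exact commands:

```shell
# Start all 9 nodes (3 per datacenter locality) in the background
# *before* initializing the cluster, so that the system ranges are
# created with all 3 localities already known.
for i in 1 2 3 4 5 6 7 8 9; do
  dc=$(( (i - 1) / 3 + 1 ))   # nodes 1-3 -> dc 1, 4-6 -> dc 2, 7-9 -> dc 3
  cockroach start \
    --insecure \
    --store=node${i} \
    --locality=datacenter=us-east-${dc} \
    --port=$(( 26256 + i )) \
    --http-port=$(( 8079 + i )) \
    --join=localhost:26257 &
done

# Only now, with every process already running, initialize the cluster:
cockroach init --insecure --host=localhost --port=26257
```

Once the labs move to v2.0 this ordering stops mattering, since the allocator then rebalances replicas proactively to improve diversity.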

@cockroach-teamcity (Member)

This change is Reviewable

@a-robinson (Contributor, Author)

Looks like I need to regenerate a couple screenshots. Will have those up in a few minutes.

@a-robinson (Contributor, Author)

Screenshots updated.

@petermattis (Contributor)

:lgtm:


Review status: 0 of 7 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

    | Start Key | End Key | Replicas | Lease Holder |
    +-----------+---------+----------+--------------+
    | NULL      | NULL    | {3,6,9}  |            9 |

Shouldn't this be 7,8,9?


training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

    --insecure \
    --port=26264
    ~~~

In other modules, we've told the trainees to ctrl-c the appropriate nodes. Any reason to do this differently here?


Comments from Reviewable

@a-robinson (Contributor, Author)


training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't this be 7,8,9?

That might be clearer, but it wouldn't reflect reality given that we're starting all the nodes before running init. And we have to do that due to point #2 in the commit message.


training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

In other modules, we've told the trainees to ctrl-c the appropriate nodes. Any reason to do this differently here?

Because the instructions in step one start all the processes in the background using `&`, to avoid needing 10 terminals for this lab (9 for the nodes, 1 for other commands). If you think using 10 terminals would be less confusing, I can change it back. Using `--background` isn't an option because it doesn't work until after `init` has been run.
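Concretely, with everything running as background jobs of one shell, there is no foreground process to ctrl-c. A hypothetical sketch of stopping one specific node in that setup — the port matches the snippet quoted above, but the exact `cockroach quit` invocation is an assumption about the lab's commands:

```shell
# Each node was started with a trailing "&", so ctrl-c would not reach it.
# Stop one specific node by addressing it over its port instead:
cockroach quit --insecure --host=localhost --port=26264

# Or use the shell's job control from the terminal that launched the nodes:
jobs        # list the background node processes
kill %3     # send SIGTERM to the third background job
```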



@petermattis (Contributor)


training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

That might be clearer, but it wouldn't reflect reality given that we're starting all the nodes before running init. And we have to do that due to point #2 in the commit message.

Ah, the nodes here won't be deterministic. You'll have to call that out in the presentation. I was wondering why you recommend visiting the nodes report. Now it makes sense.


training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Because the instructions in step one start up all processes in the background using & to avoid needing 10 terminals for this lab (9 for the nodes, 1 for other commands). If you think using 10 terminals would be less confusing, I can change it back. Using --background isn't an option because it doesn't work until after init has been run.

Ah, I missed the &. Carry on.



@a-robinson a-robinson merged commit 5a5b629 into cockroachdb:master Feb 13, 2018

@jseldess (Contributor) left a comment


LGTM as well, retroactively, with one nit that I'll fix.

    ~~~
    ## Step 2. Simulate the problem
    4. Note that the node IDs above may not match the order in which we started the nodes, because node IDs only get allocated after `cockroach init` is run. We can verify that the nodes listed by `SHOW TESTING_RANGES`are all in the `datacenter=us-east-3` locality by opening the Node Diagnostics debug page at <a href="http://localhost:8080/#/reports/nodes" data-proofer-ignore>http://localhost:8080/#/reports/nodes</a> and checking the locality for each of the 3 node IDs.

nit: need a space after SHOW TESTING_RANGES. I'll fix in a follow-up PR.

Successfully merging this pull request may close these issues.

install cockroachdb