Conversation

@a-robinson (Contributor)

It was broken for a few main reasons:

  1. There was no guarantee that important system data wouldn't end up on
    the two nodes that we're bringing down, since we were only using two
    localities.
  2. Even if we used 3 localities, the pattern of starting up a 3-node
    cluster first and then adding more nodes means that the system ranges
    may not be properly spread across the localities, since in v1.1 we
    don't proactively move data around to improve diversity.
  3. A read-only query isn't guaranteed to hang even if a range is
    unavailable. If we only kill the 2 non-leaseholders, the leaseholder
    will still be able to keep extending its lease (via node liveness)
    and serve reads.

To fix #1, I've modified this to spin up a 9-node cluster across 3
localities.
To fix #2, I've spun up all the nodes before running `cockroach init`.
We can go back to the old way of doing this once the labs use v2.0.
To fix #3, I've switched from demoing a SELECT to using an INSERT.
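Taken together, fixes #1 and #2 amount to a particular startup ordering. A minimal sketch of what that looks like, assuming an insecure local cluster — the store paths, port arithmetic, and loop structure here are illustrative, not the lab's exact commands:

```shell
# Start all 9 nodes (3 per datacenter locality) in the background
# *before* initializing the cluster, so that the system ranges are
# created with all 3 localities already known.
for i in 1 2 3 4 5 6 7 8 9; do
  dc=$(( (i - 1) / 3 + 1 ))   # nodes 1-3 -> dc 1, 4-6 -> dc 2, 7-9 -> dc 3
  cockroach start \
    --insecure \
    --store=node${i} \
    --locality=datacenter=us-east-${dc} \
    --port=$(( 26256 + i )) \
    --http-port=$(( 8079 + i )) \
    --join=localhost:26257 &
done

# Only now, with every process already running, initialize the cluster:
cockroach init --insecure --host=localhost --port=26257
```

Once the labs move to v2.0 this ordering stops mattering, since the allocator then rebalances replicas proactively to improve diversity.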

@cockroach-teamcity (Member)

This change is Reviewable

@a-robinson (Contributor, Author)

Looks like I need to regenerate a couple screenshots. Will have those up in a few minutes.

@a-robinson (Contributor, Author)

Screenshots updated.

@petermattis (Contributor)

:lgtm:


Review status: 0 of 7 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

    | Start Key | End Key | Replicas | Lease Holder |
    +-----------+---------+----------+--------------+
    | NULL      | NULL    | {3,6,9}  |            9 |

Shouldn't this be 7,8,9?


training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

    --insecure \
    --port=26264
    ~~~

In other modules, we've told the trainees to ctrl-c the appropriate nodes. Any reason to do this differently here?


Comments from Reviewable

@a-robinson (Contributor, Author)


training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't this be 7,8,9?

That might be clearer, but it wouldn't reflect reality given that we're starting all the nodes before running init. And we have to do that due to point #2 in the commit message.


training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

In other modules, we've told the trainees to ctrl-c the appropriate nodes. Any reason to do this differently here?

Because the instructions in step one start all the processes in the background using `&`, to avoid needing 10 terminals for this lab (9 for the nodes, 1 for other commands). If you think using 10 terminals would be less confusing, I can change it back. Using `--background` isn't an option because it doesn't work until after `init` has been run.
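Concretely, with everything running as background jobs of one shell, there is no foreground process to ctrl-c. A hypothetical sketch of stopping one specific node in that setup — the port matches the snippet quoted above, but the exact `cockroach quit` invocation is an assumption about the lab's commands:

```shell
# Each node was started with a trailing "&", so ctrl-c would not reach it.
# Stop one specific node by addressing it over its port instead:
cockroach quit --insecure --host=localhost --port=26264

# Or use the shell's job control from the terminal that launched the nodes:
jobs        # list the background node processes
kill %3     # send SIGTERM to the third background job
```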



@petermattis (Contributor)


training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

That might be clearer, but it wouldn't reflect reality given that we're starting all the nodes before running init. And we have to do that due to point #2 in the commit message.

Ah, the nodes here won't be deterministic. You'll have to call that out in the presentation. I was wondering why you recommend visiting the nodes report. Now it makes sense.


training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Because the instructions in step one start up all processes in the background using & to avoid needing 10 terminals for this lab (9 for the nodes, 1 for other commands). If you think using 10 terminals would be less confusing, I can change it back. Using --background isn't an option because it doesn't work until after init has been run.

Ah, I missed the &. Carry on.



@a-robinson a-robinson merged commit 5a5b629 into cockroachdb:master Feb 13, 2018

@jseldess (Contributor) left a comment


LGTM as well, retroactively, with one nit that I'll fix.

    ~~~
    ## Step 2. Simulate the problem
    4. Note that the node IDs above may not match the order in which we started the nodes, because node IDs only get allocated after `cockroach init` is run. We can verify that the nodes listed by `SHOW TESTING_RANGES`are all in the `datacenter=us-east-3` locality by opening the Node Diagnostics debug page at <a href="http://localhost:8080/#/reports/nodes" data-proofer-ignore>http://localhost:8080/#/reports/nodes</a> and checking the locality for each of the 3 node IDs.

nit: need a space after SHOW TESTING_RANGES. I'll fix in a follow-up PR.

Successfully merging this pull request may close these issues.

install cockroachdb