Fix data unavailability tutorial #2485
Conversation
Force-pushed from 2321ff8 to 8e8d48f.
Looks like I need to regenerate a couple of screenshots. Will have those up in a few minutes.
Force-pushed from 8e8d48f to 3783791.
Screenshots updated.
Force-pushed from 3783791 to 87eca7c.
Review status: 0 of 7 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

Shouldn't this be

training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

In other modules, we've told the trainees to ctrl-c the appropriate nodes. Any reason to do this differently here?
Review status: 0 of 7 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

That might be clearer, but it wouldn't reflect reality given that we're starting all the nodes before running

training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Because the instructions in step one start up all processes in the background using
Review status: 0 of 7 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

training/data-unavailability-troubleshooting.md, line 194 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Ah, the nodes here won't be deterministic. You'll have to call that out in the presentation. I was wondering why you recommend visiting the

training/data-unavailability-troubleshooting.md, line 214 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Ah, I missed the
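Since the nodes in this module are not running in a foreground terminal, there is nothing for a trainee to ctrl-c; the target node has to be stopped some other way. A minimal sketch of how that might look, assuming `cockroach quit` is available in this version; the port number and process-matching pattern below are illustrative, not the lab's actual values:

~~~ shell
# The node was started in the background, so there is no foreground process
# to ctrl-c. One option is to ask the node to shut down gracefully over its
# port (port number is illustrative):
cockroach quit --insecure --host=localhost --port=26265

# Another option is to find the background process and signal it directly:
pkill -f 'cockroach start.*--port=26265'
~~~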
Force-pushed from 16ad166 to 3d38817.
It was broken for a few main reasons:

1. There was no guarantee that important system data wouldn't end up on the two nodes that we're bringing down, since we were only using two localities.
2. Even if we used 3 localities, the pattern of starting up a 3-node cluster first and then adding more nodes means that the system ranges may not be properly spread across the localities, since in v1.1 we don't proactively move data around to improve diversity.
3. A read-only query isn't guaranteed to hang even if a range is unavailable. If we only kill the 2 non-leaseholders, the leaseholder will still be able to keep extending its lease (via node liveness) and serve reads.

To fix #1, I've modified this to spin up a 9-node cluster across 3 localities. To fix #2, I've spun up all the nodes before running cockroach init. We can go back to the old way of doing this once the labs use v2.0. To fix #3, I've switched from demoing a SELECT to using an INSERT.
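As a rough illustration of fixes #1 and #2, here is a minimal sketch of that startup pattern on a single machine. The flags shown (`--insecure`, `--store`, `--locality`, `--port`, `--http-port`, `--join`) are the v1.x-era `cockroach start` flags, but the ports, store paths, and locality names are illustrative, and the nodes are backgrounded with the shell here rather than by whatever mechanism the lab itself uses:

~~~ shell
# Sketch: start all 9 nodes (3 per locality) before initializing the cluster.
for i in $(seq 0 8); do
  dc=$(( i / 3 + 1 ))   # nodes 1-3 -> us-east-1, 4-6 -> us-east-2, 7-9 -> us-east-3
  cockroach start --insecure \
    --store=node$(( i + 1 )) \
    --locality=datacenter=us-east-$dc \
    --port=$(( 26257 + i )) \
    --http-port=$(( 8080 + i )) \
    --join=localhost:26257,localhost:26258,localhost:26259 \
    > node$(( i + 1 )).log 2>&1 &
done

# Give the nodes a moment to start listening before initializing.
sleep 5

# Initialize only once all 9 nodes are up, so the initial system ranges can
# be spread across all 3 localities rather than packed onto the first 3 nodes.
cockroach init --insecure --host=localhost --port=26257
~~~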
Force-pushed from 3d38817 to fb1ecc8.
LGTM as well, retroactively, with one nit that I'll fix.
## Step 2. Simulate the problem
4. Note that the node IDs above may not match the order in which we started the nodes, because node IDs only get allocated after `cockroach init` is run. We can verify that the nodes listed by `SHOW TESTING_RANGES`are all in the `datacenter=us-east-3` locality by opening the Node Diagnostics debug page at <a href="http://localhost:8080/#/reports/nodes" data-proofer-ignore>http://localhost:8080/#/reports/nodes</a> and checking the locality for each of the 3 node IDs.
nit: need a space after `SHOW TESTING_RANGES`. I'll fix in a follow-up PR.
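For context, a minimal sketch of the kind of check that step describes, assuming the v1.1-era `SHOW TESTING_RANGES FROM TABLE ...` form of the statement; the table name and port are illustrative, not the lab's actual ones:

~~~ shell
# List which node IDs hold the replicas and the lease for the table's ranges.
# (Table name and port are illustrative.)
cockroach sql --insecure --host=localhost --port=26257 \
  -e "SHOW TESTING_RANGES FROM TABLE test.kv;"

# Then confirm that each reported node ID is in the datacenter=us-east-3
# locality on the Node Diagnostics debug page:
#   http://localhost:8080/#/reports/nodes
~~~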