
[CRDB-12226] server, ui: display circuit breakers in problem ranges and range status #75809

Merged: 1 commit merged into cockroachdb:master from circuit-breaker-ranges on Feb 15, 2022

Conversation

@Santamaura (Contributor) commented Feb 1, 2022

This PR adds changes to the reports/problemranges and reports/range pages.
Ranges with replicas that have a circuit breaker will show up as problem ranges,
and the circuit breaker error will show up as a row on the range status page.

Release note (ui change): display circuit breakers in problem ranges and range status

Problem Ranges page:
[screenshot: Screen Shot 2022-02-08 at 4 57 51 PM]

Range status page:
[screenshot: Screen Shot 2022-02-08 at 4 57 34 PM]


@Santamaura force-pushed the circuit-breaker-ranges branch 2 times, most recently from 1e4f82e to b17b74e on February 3, 2022 at 19:26
@Santamaura (Contributor, Author):

@tbg for testing this, it seems like the ProblemRangesResponse will return with "initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26259: connect: connection refused"" in the error_message when 2 nodes are killed, instead of with a circuit_breaker_error. Is there another way to cause a circuit breaker error?

@Santamaura Santamaura changed the title ui: dispay circuit breakers in problem ranges and range status [CRDB-12226] ui: dispay circuit breakers in problem ranges and range status Feb 3, 2022
@koorosh (Collaborator) commented Feb 3, 2022

> @tbg for testing this, it seems like the ProblemRangesResponse will return with "initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26259: connect: connection refused"" in the error_message when 2 nodes are killed, instead of with a circuit_breaker_error. Is there another way to cause a circuit breaker error?

I emulated the circuit breaker with the following steps:

  • enable the circuit breaker with a very low timeout:
SET CLUSTER SETTING kv.replica_circuit_breaker.slow_replication_threshold = '100ms';
  • then run some workload on your cluster:
./cockroach workload init movr 'postgresql://root@localhost:26257?sslmode=disable'
./cockroach workload run movr 'postgresql://root@localhost:26257?sslmode=disable'

and wait for a minute :)
Then don't forget to set the setting back so the cluster operates properly:

SET CLUSTER SETTING kv.replica_circuit_breaker.slow_replication_threshold = '0ms'; -- disable circuit breaker

Never mind: it will cause a circuit breaker event, but not a circuit breaker error.

@tbg (Member) commented Feb 4, 2022

> @tbg for testing this, it seems like the ProblemRangesResponse will return with "initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26259: connect: connection refused"" in the error_message when 2 nodes are killed, instead of with a circuit_breaker_error. Is there another way to cause a circuit breaker error?

I think you might have to wait a bit longer. The error you're seeing is likely caused by the problem ranges endpoint trying to reach the down nodes. After ~10s, the nodes should be marked as suspect (not live), so the problem ranges endpoint should ignore them. If this doesn't work, then there is a problem with the problem ranges endpoint that would need to be addressed separately. But I'm relatively confident that this will work.

@Santamaura (Contributor, Author):

Thanks guys, I'll test both methods.

@Santamaura (Contributor, Author) left a comment:

@tbg hmm, I killed the 2 nodes and waited ~30 seconds, and the problem ranges endpoint still returns with an error message. After waiting some more time (a few mins) it seems like the entire dashboard doesn't even load anymore.


@tbg (Member) commented Feb 7, 2022

Wait five minutes first, then kill two nodes. I want to rule out that you're accidentally losing quorum on the system ranges, in which case the cluster is pretty dysfunctional.

@Santamaura (Contributor, Author):

Yeah, waiting 5 mins and then killing the 2 nodes seems to result in the same behavior mentioned above.

@tbg (Member) commented Feb 7, 2022

Let me make sure I understand correctly. I set up a local 3-node cluster, waited for full replication, and brought down n1. Now the problem ranges endpoint gives me this, which is fine. It doesn't have any data for n1, but how could it? That node is down. It does report data for n2 and n3. (This is on master, not your PR.)

In your experiment, you would look for open circuit breakers reported by nodes that are not down (the circuit breakers are tripped since replicas on live nodes are unable to make progress, as they would have to rely on the down nodes to do so at the replication layer). I have a demo script at 16m30s here that you could copy (it's in the right half of my terminal). When the workload returns all circuit breaker errors, the problem ranges endpoint (if the circuit_breaker_error field was plumbed through correctly) should be populated. The _status/ranges/local endpoint should similarly show it (and definitely already has the circuit_breaker_error field).
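For example, assuming an insecure local cluster with the default admin UI port (both assumptions about your setup), you could hit that endpoint directly:

curl http://localhost:8080/_status/ranges/local

On a secure cluster you'd need to authenticate first, so treat this as a sketch rather than a drop-in command.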

Let me know if it still doesn't work, and I'll take a look at the PR tomorrow and try to repro this for myself. (I'm unable to meet synchronously this week, or I would've offered that; if this drags out until next week, then we can go through it together.)

@Santamaura (Contributor, Author) commented Feb 7, 2022

Thanks for the followup! I noticed I was missing an addition in status.go for CircuitBreakerError. So after these changes I started up a 3-node cluster, waited a bit, then killed n1. I'm retrieving the CircuitBreakerError from this:

state := rep.State(ctx)

Perhaps it is not the right place to get it from? It never seemed to return an error. If this is the correct way to check for the problem ranges endpoint, then most likely there's some issue with my config.
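For reference, a minimal sketch of the check being described; the variable and field names here are assumptions drawn from this thread, not the final code:

// Hypothetical sketch: surface a tripped circuit breaker as a per-range
// problem flag. rep.State(ctx) returns the replica's range info, which
// this PR extends with a CircuitBreakerError string (empty when the
// breaker is not tripped).
state := rep.State(ctx)
info.State = state
info.Problems.CircuitBreakerError = state.CircuitBreakerError != ""

If the flag never flips, the error string is presumably not populated in the replica state at the time it's read.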

@Santamaura Santamaura changed the title [CRDB-12226] ui: dispay circuit breakers in problem ranges and range status [CRDB-12226] server, ui: dispay circuit breakers in problem ranges and range status Feb 7, 2022
@Santamaura (Contributor, Author) commented Feb 8, 2022

So on a whim I followed Andrii's steps for his test (as mentioned above), and upon the workload encountering an error (Error: pq: result is ambiguous (raft status: {"id":"1","term":7,"vote":"1","commit":173,"lead":"1","raftState":"StateLeader","applied":173,"progress":{"1":{"match":174,"next":175,"state":"StateReplicate"},"2":{"match":173,"next":175,"state":"StateReplicate"},"3":{"match":173,"next":175,"state":"StateReplicate"}},"leadtransferee":"0"}: replica (n1,s1):1 unable to serve request to r90:/Table/6{0-1} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=10]), which I believe is the circuit breaker error), the circuit breaker actually showed up on the problem ranges page! However, by the time I click on the range that has the circuit breaker error or refresh the problem ranges page, the error is gone.

If I increase slow_replication_threshold to a larger value, the workload does not encounter an error and I do not see the circuit breaker error, even when killing a node.

@tbg (Member) commented Feb 8, 2022

> I have a demo script at 16m30s here that you could copy

I just tried the above with this PR and it "just worked"; not sure what went differently when you tried: https://www.loom.com/share/f560561bae6a42e58a34951226ed54da

I created the cluster via:

roachprod create local -n 5
roachprod put local ./cockroach
roachprod start local
roachprod run local:1 -- ./cockroach workload init kv --splits 100

open UI and wait for no under-replicated ranges

roachprod stop local:2-3

root@:26257/defaultdb> set cluster setting kv.replica_circuit_breaker.slow_replication_threshold = '5s';

roachprod run local:1 -- ./cockroach workload run kv --max-rate 50 --tolerate-errors 2> /dev/null
# wait a bit until you see errors/sec increase

open problem ranges

@tbg tbg changed the title [CRDB-12226] server, ui: dispay circuit breakers in problem ranges and range status [CRDB-12226] server, ui: display circuit breakers in problem ranges and range status Feb 8, 2022
@Santamaura (Contributor, Author) commented Feb 8, 2022

Ah, I was starting my cluster the non-roachprod way because I wanted to see hot reloads of the UI. I tried the roachprod way and can confirm it works 👍 thanks Tobias! Strange that it doesn't work when I start the cluster directly from the binary.

@Santamaura Santamaura marked this pull request as ready for review February 8, 2022 21:59
@Santamaura Santamaura requested a review from a team as a code owner February 8, 2022 21:59
@Santamaura force-pushed the circuit-breaker-ranges branch 2 times, most recently from 718391a to 0b35df8 on February 10, 2022 at 15:54
@koorosh (Collaborator) left a comment:



pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

  // When the raft log is too large, it can be a symptom of other issues.
  bool raft_log_too_large = 7;
  bool circuit_breaker_error = 9;

@Santamaura, what is the main purpose of this field? If it is used as a flag to indicate the presence of circuit breaker errors in the response, would it be possible to check circuit_breaker_error_range_ids (defined below) instead?

@Santamaura (Contributor, Author) left a comment:


pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

> @Santamaura, what is the main purpose of this field? If it is used as a flag to indicate the presence of circuit breaker errors in the response, would it be possible to check circuit_breaker_error_range_ids (defined below) instead?

It is used to check whether there is a circuit breaker error on a per-range basis. This flag actually determines whether we add a range ID to circuit_breaker_error_range_ids.
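For illustration, a rough sketch of that server-side aggregation, following the existing raft_log_too_large pattern; the names here are assumptions based on this thread rather than the final code:

// Hypothetical sketch: the per-range boolean gates which range IDs end
// up in the problem ranges response sent to the frontend.
for _, info := range nodeResp.Ranges {
    if info.Problems.CircuitBreakerError {
        problems.CircuitBreakerErrorRangeIDs = append(
            problems.CircuitBreakerErrorRangeIDs, info.State.Desc.RangeID)
    }
}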

@koorosh (Collaborator) left a comment:


pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

> It is used to check whether there is a circuit breaker error on a per-range basis. This flag actually determines whether we add a range ID to circuit_breaker_error_range_ids.

Is it possible to do the same validation on the front-end side and remove this field? Also, the naming of this field suggests an error rather than a bool.

@Santamaura (Contributor, Author) left a comment:


pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

> Is it possible to do the same validation on the front-end side and remove this field? Also, the naming of this field suggests an error rather than a bool.

Oh, sorry, to clarify: the boolean isn't used on the FE; it's used to determine whether to add the range ID in the problemRanges function, which just returns the list of IDs to the FE.

@koorosh (Collaborator) left a comment:

a discussion (no related file):
:lgtm:



pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

> Oh, sorry, to clarify: the boolean isn't used on the FE; it's used to determine whether to add the range ID in the problemRanges function, which just returns the list of IDs to the FE.

Got it! Thanks for making it clear!

@Santamaura (Contributor, Author):

bors r+

@craig bot commented Feb 15, 2022

Build failed (retrying...):

@craig craig bot merged commit 3cb7eb0 into cockroachdb:master Feb 15, 2022
@craig bot commented Feb 15, 2022

Build succeeded:
