
[CRDB-12226] server, ui: display circuit breakers in problem ranges and range status #75809

Merged: 1 commit merged into cockroachdb:master from circuit-breaker-ranges on Feb 15, 2022

Conversation

@Santamaura (Contributor) commented Feb 1, 2022

This PR adds changes to the reports/problemranges and reports/range pages.
Ranges with replicas that have a circuit breaker will show up as problem ranges,
and the circuit breaker error will show up as a row on the range status page.

Release note (ui change): display circuit breakers in problem ranges and range status

Problem Ranges page:
[screenshot: Screen Shot 2022-02-08 at 4 57 51 PM]

Range status page:
[screenshot: Screen Shot 2022-02-08 at 4 57 34 PM]


@Santamaura force-pushed the circuit-breaker-ranges branch 2 times, most recently from 1e4f82e to b17b74e on February 3, 2022 at 19:26
@Santamaura (Contributor, Author):

@tbg for testing this, it seems like the ProblemRangesResponse will return with "initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26259: connect: connection refused"" in the error_message when 2 nodes are killed, instead of with a circuit_breaker_error. Is there another way to cause a circuit breaker error?

@Santamaura Santamaura changed the title ui: dispay circuit breakers in problem ranges and range status [CRDB-12226] ui: dispay circuit breakers in problem ranges and range status Feb 3, 2022
@koorosh (Collaborator) commented Feb 3, 2022

> @tbg for testing this, it seems like the ProblemRangesResponse will return with "initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26259: connect: connection refused"" in the error_message when 2 nodes are killed, instead of with a circuit_breaker_error. Is there another way to cause a circuit breaker error?

I emulated the circuit breaker with the following steps:

  • enable the circuit breaker with a very low timeout:
SET CLUSTER SETTING kv.replica_circuit_breaker.slow_replication_threshold = '100ms';
  • then run some workload on your cluster:
./cockroach workload init movr 'postgresql://root@localhost:26257?sslmode=disable'
./cockroach workload run movr 'postgresql://root@localhost:26257?sslmode=disable'

and wait for a minute :)
Then don't forget to set the setting back so the cluster operates properly:

SET CLUSTER SETTING kv.replica_circuit_breaker.slow_replication_threshold = '0ms'; -- disable circuit breaker

Never mind: it will cause a circuit breaker event, but not a circuit breaker error.

@tbg (Member) commented Feb 4, 2022

> @tbg for testing this, it seems like the ProblemRangesResponse will return with "initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26259: connect: connection refused"" in the error_message when 2 nodes are killed, instead of with a circuit_breaker_error. Is there another way to cause a circuit breaker error?

I think you might have to wait a bit longer. The error you're seeing is likely caused by the problem ranges endpoint trying to reach the down nodes. After ~10s, the nodes should be marked as suspect (not live), so the problem ranges endpoint should ignore them. If this doesn't work, then there is a problem with the problem ranges endpoint that would need to be addressed separately. But I'm relatively confident that this will work.

@Santamaura (Contributor, Author):

Thanks guys, I'll test both methods.

@Santamaura (Contributor, Author) left a comment:

@tbg hmm, I killed the 2 nodes and waited ~30 seconds, and the problem ranges endpoint still returns with an error message. After waiting some more time (a few mins) it seems like the entire dashboard doesn't even load anymore.


@tbg (Member) commented Feb 7, 2022

Wait five minutes first, then kill two nodes. I want to rule out that you're accidentally losing quorum on the system ranges, in which case the cluster is pretty dysfunctional.

@Santamaura (Contributor, Author):

Yeah, waiting 5 mins and then killing the 2 nodes seems to result in the same behavior mentioned above.

@tbg (Member) commented Feb 7, 2022

Let me make sure I understand correctly. I set up a local 3-node cluster, waited for full replication, and brought down n1. Now the problem ranges endpoint gives me this, which is fine. It doesn't have any data for n1, but how could it? That node is down. It does report data for n2 and n3. (This is on master, not your PR.)

In your experiment, you would look for open circuit breakers reported by nodes that are not down (the circuit breakers are tripped since replicas on live nodes are unable to make progress, as they would have to rely on the down nodes to do so at the replication layer). I have a demo script at 16m30s here that you could copy (it's in the right half of my terminal). When the workload returns all circuit breaker errors, the problem ranges endpoint (if the circuit_breaker_error field was plumbed through correctly) should be populated. The _status/ranges/local endpoint should similarly show it (and definitely already has the circuit_breaker_error field).
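For example, assuming an insecure local cluster with the default admin UI port (both assumptions about your setup), you could hit that endpoint directly:

curl http://localhost:8080/_status/ranges/local

On a secure cluster you'd need to authenticate first, so treat this as a sketch rather than a drop-in command.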

Let me know if it still doesn't work, and I'll take a look at the PR tomorrow and try to repro this for myself. (I'm unable to meet synchronously this week, or I would've offered that; if this drags out until next week, then we can go through it together.)

@Santamaura (Contributor, Author) commented Feb 7, 2022

Thanks for the followup! I noticed I was missing an addition in status.go for CircuitBreakerError. So after these changes I started up a 3-node cluster, waited a bit, then killed n1. I'm retrieving the CircuitBreakerError from this:

state := rep.State(ctx)

Perhaps it is not the right place to get it from? It never seemed to return an error. If this is the correct way to check for the problem ranges endpoint, then most likely there's some issue with my config.
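For reference, a minimal sketch of the check being described; the variable and field names here are assumptions drawn from this thread, not the final code:

// Hypothetical sketch: surface a tripped circuit breaker as a per-range
// problem flag. rep.State(ctx) returns the replica's range info, which
// this PR extends with a CircuitBreakerError string (empty when the
// breaker is not tripped).
state := rep.State(ctx)
info.State = state
info.Problems.CircuitBreakerError = state.CircuitBreakerError != ""

If the flag never flips, the error string is presumably not populated in the replica state at the time it's read.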

@Santamaura Santamaura changed the title [CRDB-12226] ui: dispay circuit breakers in problem ranges and range status [CRDB-12226] server, ui: dispay circuit breakers in problem ranges and range status Feb 7, 2022
@Santamaura (Contributor, Author) commented Feb 8, 2022

So on a whim I followed Andrii's steps for his test (as mentioned above), and upon the workload encountering an error (Error: pq: result is ambiguous (raft status: {"id":"1","term":7,"vote":"1","commit":173,"lead":"1","raftState":"StateLeader","applied":173,"progress":{"1":{"match":174,"next":175,"state":"StateReplicate"},"2":{"match":173,"next":175,"state":"StateReplicate"},"3":{"match":173,"next":175,"state":"StateReplicate"}},"leadtransferee":"0"}: replica (n1,s1):1 unable to serve request to r90:/Table/6{0-1} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=10]), which I believe is the circuit breaker error), the circuit breaker actually showed up on the problem ranges page! However, by the time I click on the range that has the circuit breaker error or refresh the problem ranges page, the error is gone.

If I increase slow_replication_threshold to a larger value, the workload does not encounter an error and I do not see the circuit breaker error, even when killing a node.

@tbg (Member) commented Feb 8, 2022

> I have a demo script at 16m30s here that you could copy

I just tried the above with this PR and it "just worked"; not sure what went differently when you tried: https://www.loom.com/share/f560561bae6a42e58a34951226ed54da

I created the cluster via:

roachprod create local -n 5
roachprod put local ./cockroach
roachprod start local
roachprod run local:1 -- ./cockroach workload init kv --splits 100

open UI and wait for no under-replicated ranges

roachprod stop local:2-3

root@:26257/defaultdb> set cluster setting kv.replica_circuit_breaker.slow_replication_threshold = '5s';

roachprod run local:1 -- ./cockroach workload run kv --max-rate 50 --tolerate-errors 2> /dev/null
# wait a bit until you see errors/sec increase

open problem ranges

@tbg tbg changed the title [CRDB-12226] server, ui: dispay circuit breakers in problem ranges and range status [CRDB-12226] server, ui: display circuit breakers in problem ranges and range status Feb 8, 2022
@Santamaura (Contributor, Author) commented Feb 8, 2022

Ah, I was starting my cluster the non-roachprod way because I wanted to see hot reloads of the UI. I tried the roachprod way and can confirm it works 👍 thanks Tobias! Strange that it doesn't work when I start the cluster directly from the binary.

@Santamaura Santamaura marked this pull request as ready for review February 8, 2022 21:59
@Santamaura Santamaura requested a review from a team as a code owner February 8, 2022 21:59
@Santamaura force-pushed the circuit-breaker-ranges branch 2 times, most recently from 718391a to 0b35df8 on February 10, 2022 at 15:54
@koorosh (Collaborator) left a comment:



pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

  // When the raft log is too large, it can be a symptom of other issues.
  bool raft_log_too_large = 7;
  bool circuit_breaker_error = 9;

@Santamaura, what is the main purpose of this field? If it is used as a flag to indicate the presence of circuit breaker errors in the response, would it be possible to check circuit_breaker_error_range_ids (defined below) instead?

@Santamaura (Contributor, Author) left a comment:


pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

> @Santamaura, what is the main purpose of this field? If it is used as a flag to indicate the presence of circuit breaker errors in the response, would it be possible to check circuit_breaker_error_range_ids (defined below) instead?

It is used to check whether there is a circuit breaker error on a per-range basis. This flag actually determines whether we add a range ID to circuit_breaker_error_range_ids.
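For illustration, a rough sketch of that server-side aggregation, following the existing raft_log_too_large pattern; the names here are assumptions based on this thread rather than the final code:

// Hypothetical sketch: the per-range boolean gates which range IDs end
// up in the problem ranges response sent to the frontend.
for _, info := range nodeResp.Ranges {
    if info.Problems.CircuitBreakerError {
        problems.CircuitBreakerErrorRangeIDs = append(
            problems.CircuitBreakerErrorRangeIDs, info.State.Desc.RangeID)
    }
}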

@koorosh (Collaborator) left a comment:


pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

> It is used to check whether there is a circuit breaker error on a per-range basis. This flag actually determines whether we add a range ID to circuit_breaker_error_range_ids.

Is it possible to do the same validation on the front-end side and remove this field? Also, the naming of this field suggests an error rather than a bool.

@Santamaura (Contributor, Author) left a comment:


pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

> Is it possible to do the same validation on the front-end side and remove this field? Also, the naming of this field suggests an error rather than a bool.

Oh, sorry, to clarify: the boolean isn't used on the FE; it's used to determine whether to add the range ID in the problemRanges function, which just returns the list of IDs to the FE.

@koorosh (Collaborator) left a comment:

a discussion (no related file):
:lgtm:



pkg/server/serverpb/status.proto, line 392 at r4 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

> Oh, sorry, to clarify: the boolean isn't used on the FE; it's used to determine whether to add the range ID in the problemRanges function, which just returns the list of IDs to the FE.

Got it! Thanks for making it clear!

@Santamaura (Contributor, Author):

bors r+

@craig bot commented Feb 15, 2022

Build failed (retrying...):

@craig craig bot merged commit 3cb7eb0 into cockroachdb:master Feb 15, 2022
@craig bot commented Feb 15, 2022

Build succeeded:
