High latency & node unavailability following increase in writes #36426

Closed
tim-o opened this issue Apr 2, 2019 · 10 comments
Labels
C-investigation (Further steps needed to qualify. C-label will change.)
S-2-temp-unavailability (Temp crashes or other availability problems. Can be worked around or resolved by restarting.)

Comments

@tim-o
Contributor

tim-o commented Apr 2, 2019

Debug zip available here: https://drive.google.com/drive/u/1/folders/1Vib6JWu2Aziwd63hrghW5SkqoQjAbqa7

Timeline of errors and warnings across nodes available here: https://docs.google.com/spreadsheets/d/1LFgMdlnvoI4NAPvqAIt0lvmpLk04JR1kbrLYSZsbUxU/edit#gid=338006228

Original screenshots (attached): Storage, SQL Dashboard 2, SQL Dashboard 1, Runtime, KeysWrittenPerSecond

tl;dr: a spike in latency and node unavailability was observed after kicking off a job to clean up a significant amount of data. The badness seems to start at 2019-03-28 5:10:43 on n1:

I190328 05:09:59.653574 502 gossip/gossip.go:560  [n1] gossip status (ok, 5 nodes)
gossip client (1/3 cur/max conns)
  2: fg-cockroachdb-1.fg-cockroachdb.fguru-p.svc.cluster.local:26257 (18h22m1s: infos 136900/313733 sent/received, bytes 32608058B/61864214B sent/received)
gossip server (0/3 cur/max conns, infos 136900/313733 sent/received, bytes 32608058B/61864214B sent/received)
I190328 05:10:04.584840 506 server/status/runtime.go:465  [n1] runtime stats: 5.7 GiB RSS, 600 goroutines, 221 MiB/164 MiB/524 MiB GO alloc/idle/total, 4.1 GiB/5.1 GiB CGO alloc/total, 9558.0 CGO/sec, 87.4/23.9 %(u/s)time, 0.0 %gc (6x), 17 MiB/18 MiB (r/w)net
I190328 05:10:14.587153 506 server/status/runtime.go:465  [n1] runtime stats: 5.7 GiB RSS, 608 goroutines, 195 MiB/170 MiB/524 MiB GO alloc/idle/total, 4.1 GiB/5.1 GiB CGO alloc/total, 10167.8 CGO/sec, 92.8/25.7 %(u/s)time, 0.0 %gc (7x), 18 MiB/18 MiB (r/w)net
I190328 05:10:24.589455 506 server/status/runtime.go:465  [n1] runtime stats: 5.7 GiB RSS, 594 goroutines, 237 MiB/158 MiB/524 MiB GO alloc/idle/total, 4.1 GiB/5.1 GiB CGO alloc/total, 10247.2 CGO/sec, 101.8/27.6 %(u/s)time, 0.0 %gc (7x), 20 MiB/21 MiB (r/w)net
I190328 05:10:34.591664 506 server/status/runtime.go:465  [n1] runtime stats: 5.6 GiB RSS, 604 goroutines, 237 MiB/135 MiB/500 MiB GO alloc/idle/total, 4.1 GiB/5.0 GiB CGO alloc/total, 9670.7 CGO/sec, 89.3/25.6 %(u/s)time, 0.0 %gc (7x), 17 MiB/18 MiB (r/w)net
I190328 05:10:43.154113 513 storage/node_liveness.go:817  [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context deadline exceeded)
W190328 05:10:43.154137 513 storage/node_liveness.go:558  [n1,hb] slow heartbeat took 4.5s
W190328 05:10:43.154143 513 storage/node_liveness.go:494  [n1,hb] failed node liveness heartbeat: context deadline exceeded

And it gets worse from there. Roughly 10 seconds later we see errors cascade to the other nodes: handle raft ready warnings, slow heartbeats, and heartbeat failures on epoch increments. The issues continue for roughly 10 minutes. At 5:19 the problem seems to peak with this summary on n2:
[n2,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:65} {StoreID:2 Category:METRICS Description:queue.raftlog.process.failure Value:11} {StoreID:2 Category:METRICS Description:requests.slow.lease Value:48} {StoreID:2 Category:METRICS Description:requests.slow.raft Value:1} {StoreID:2 Category:METRICS Description:queue.replicagc.process.failure Value:5}]}

By 5:20 latency drops and nodes become available again.

As far as setup is concerned, the only thing to be aware of is that they're running their stores on Ceph. I confirmed that there were plenty of available IOPS, and there was no detected spike in I/O latency on Ceph's side.

Original customer report:

This morning we experienced four outages of our application, seemingly caused by CockroachDB. We updated to 2.1.6 yesterday; the update went fine, and the cluster also ran fine afterwards.
Additionally, we added some cleanup cron jobs to K8s to delete unused data from certain tables; that data unfortunately keeps being persisted until several data model changes are made.

CRDB is deployed with a resource limit of 16 GB of memory per node; however, this does not seem to be the issue.
I attached screenshots of the metrics we collected for the main outage, which happened from 5:11 to 5:21 UTC.

The application's behaviour is that it loses its connection to CockroachDB and then goes into an unready state, as it relies heavily on the database.
@tim-o tim-o added the C-investigation and S-2-temp-unavailability labels Apr 2, 2019
@tim-o tim-o added this to Triage in Core Performance via automation Apr 2, 2019
@awoods187 awoods187 added this to Incoming in KV via automation Apr 2, 2019
@awoods187 awoods187 removed this from Triage in Core Performance Apr 2, 2019
@tim-o
Contributor Author

tim-o commented Apr 3, 2019

Zendesk ticket #3224 has been linked to this issue.

@tbg
Member

tbg commented Apr 3, 2019

Let's focus on the workload. What SQL queries were they running? From a high level it looks like something is severely overloading the cluster, and the newly added cleanup jobs are a likely candidate.

@tim-o could you (or have someone from SQL) dig into what they were running and what that would translate to? I don't think I can get that from the debug zip. (We have the statement statistics on master, though not the query plan; see #36482. But this is 2.x, so it doesn't have anything I know of to get that info.)

Generally when a user reports problems after starting a new workload, what that workload is (and what queries/plans it translates to) is the most interesting bit of information.
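
(Illustration only: one way to see what a statement translates to is to run it under EXPLAIN in a SQL shell; the table and predicate below are placeholders, not the customer's actual queries.)

-- Hypothetical placeholder query; prefixing a statement with EXPLAIN prints the
-- plan CockroachDB chooses, e.g. whether it has to do a full table scan.
EXPLAIN SELECT t.id FROM some_table AS t WHERE t.expires_at < now() LIMIT 1000;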

@tbg tbg assigned tim-o and unassigned tbg Apr 3, 2019
@tbg
Member

tbg commented Apr 3, 2019

Oh - and in the short term - definitely tell them to stop the deletion jobs. If that doesn't fix the problem, at least we know there's something else going on. But hopefully it will prevent the crash from recurring.

@christianhuening

@tbg we moved the deletion jobs to midnight and reduced the chunk size to 8, which stopped the problems from appearing.

@tbg
Member

tbg commented Apr 3, 2019

@christianhuening great, that's good to know. Could you share details on the deletion jobs? I have the schemas in the debug zip, so the queries that are run and some idea of how much they ought to be deleting would be a good start.

@christianhuening

@gigatexal could you elaborate on the nature of these deletion jobs?

@danhhz danhhz self-assigned this Apr 3, 2019
@gigatexal

There's a job to clean up authorizations:
SQLAlchemy generates this (pulled from MySQL, but it should be very similar for CRDB):

SELECT account_authorizations.id
FROM account_authorizations
WHERE account_authorizations.token_type = %(token_type_1)s AND NOT (EXISTS (SELECT *
FROM refresh_tokens
WHERE refresh_tokens.uid = account_authorizations.token_uid))
 LIMIT %(param_1)s

But I don't have the actual delete statement.

The access token cleanup runs this:

SELECT access_tokens.uid
FROM access_tokens
WHERE access_tokens.expire_timestamp < %(expire_timestamp_1)s AND access_tokens.client_id != %(client_id_1)s
 LIMIT %(param_1)s
DELETE FROM access_tokens WHERE access_tokens.uid IN (%(uid_1)s, %(uid_2)s, ...the rest of the uids from the access tokens select query)

The refresh token cleanup runs this:

SELECT refresh_tokens.uid
FROM refresh_tokens
WHERE refresh_tokens.is_revoked IS true
 LIMIT %(param_1)s

and then issues a delete over that constructed list, chunking the deletes according to a configured parameter.

The vault key cleanup job that removes keys for deleted users does this:

SELECT vault_keys.id
FROM vault_keys
WHERE NOT (EXISTS (SELECT *
FROM user
WHERE user.id = vault_keys.user_id))
 LIMIT %(param_1)s

and then issues a delete over that constructed list, chunked as requested.

The vault key cleanup job to clear up tokens runs this:

SELECT vault_keys.id
FROM vault_keys
WHERE vault_keys.token_uid NOT IN (%(token_uid_1)s, %(token_uid_2)s) AND NOT (EXISTS (SELECT *
FROM access_tokens
WHERE access_tokens.uid = vault_keys.token_uid)) AND NOT (EXISTS (SELECT *
FROM autosync_profile
WHERE autosync_profile.uid = vault_keys.token_uid)) AND NOT (EXISTS (SELECT *
FROM refresh_tokens
WHERE refresh_tokens.uid = vault_keys.token_uid))
 LIMIT %(param_1)s

DELETE FROM vault_keys WHERE vault_keys.id IN (%(id_1)s, %(id_2)s,...result set from previous select query)
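
To summarize the pattern: each job selects a batch of ids with a LIMIT, then issues DELETEs over that list in chunks. A minimal sketch of one round of the refresh token cleanup (I don't have the exact SQL that SQLAlchemy generates, so the batch size of 1000 and chunk size of 3 below are just placeholders):

-- Hypothetical sketch; the real statements are generated by SQLAlchemy, and the
-- batch/chunk sizes here are placeholders, not the job's actual configuration.
SELECT refresh_tokens.uid
FROM refresh_tokens
WHERE refresh_tokens.is_revoked IS true
 LIMIT 1000;

-- The client splits the returned uids into chunks and issues one DELETE per chunk:
DELETE FROM refresh_tokens WHERE refresh_tokens.uid IN ('uid-1', 'uid-2', 'uid-3');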

@roncrdb

roncrdb commented Apr 12, 2019

@jordanlewis here's the schema.

@danhhz danhhz removed their assignment Apr 16, 2019
@cretz

cretz commented Dec 6, 2019

While I don't have a repro, a schema, or many details, I witnessed something similar in 19.2.1 (using start-single-node here). The logs have warnings:

W191206 18:19:53.716461 124 storage/store_raft.go:514  [n1,s1,r28/1:/Table/5{6-7}] handle raft ready: 1.5s [applied=1, batches=1, state_assertions=0]
W191206 18:19:58.116201 107 storage/store_raft.go:514  [n1,s1,r2/1:/System/NodeLiveness{-Max}] handle raft ready: 5.2s [applied=1, batches=1, state_assertions=0]
W191206 18:19:58.152412 109 storage/store_raft.go:514  [n1,s1,r27/1:/Table/5{5-6}] handle raft ready: 4.4s [applied=1, batches=1, state_assertions=0]
W191206 18:19:58.153005 124 storage/store_raft.go:514  [n1,s1,r28/1:/Table/5{6-7}] handle raft ready: 4.4s [applied=2, batches=2, state_assertions=0]
W191206 18:19:58.153215 110 storage/store_raft.go:514  [n1,s1,r1/1:/{Min-System/NodeL…}] handle raft ready: 0.7s [applied=1, batches=1, state_assertions=1]
I191206 18:19:58.188326 169 storage/node_liveness.go:846  [n1,liveness-hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context deadline exceeded)
W191206 18:19:58.188346 169 storage/node_liveness.go:559  [n1,liveness-hb] slow heartbeat took 5.2s
W191206 18:19:58.188386 169 storage/node_liveness.go:484  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
W191206 18:20:01.326630 115 storage/store_raft.go:514  [n1,s1,r27/1:/Table/5{5-6}] handle raft ready: 0.5s [applied=1, batches=1, state_assertions=0]

I have something continually polling a table that is getting frequent inserts, and at one point I just started getting connection refused. Pardon my lack of detail on the problem; I just thought I'd share. I'll update if I have any more info.

EDIT: I should mention that this is with the CPU maxed out, similar to what #31893 (comment) is seeing.

@irfansharif
Contributor

The relevant Zendesk issue was closed internally. I don't know that we were able to definitively say what the cause was here, but from my reading we suspected it to be along the lines of "bad things happened when the cluster is overprovisioned". Going to close this for now.

@cretz: Sorry for the complete silence here; I encourage you to file a separate issue if you're still running into problems or have more details to share. Those logs don't say much on their own; they usually point to some form of unhealthiness somewhere (bad disk, bad network conditions, etc.), so they should be thought of as a symptom of something else. It's not what we typically expect under stable operation.

KV automation moved this from Incoming to Closed Jun 10, 2020