High latency & node unavailability following increase in writes #36426
Zendesk ticket #3224 has been linked to this issue.
Let's focus on the workload. What SQL queries were they running? From a high level this seems like something is severely overloading the cluster, and the newly added cleanup jobs are a likely candidate. @tim-o could you (or have someone from SQL) dig into what they were running and what that would translate to? I don't think I can get that from the debug zip: on master we have the statement statistics (though not the query plan, see #36482), but this is 2.x, which has nothing I know of that would recover that info. Generally, when a user reports problems after starting a new workload, what that workload is (and what queries/plans it translates to) is the most interesting piece of information.
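As an aside for anyone reading along on a newer version (this doesn't help on the 2.x cluster here): the per-node statement statistics mentioned above can be queried from `crdb_internal`. A minimal sketch, assuming a version that ships this virtual table (the exact column set varies by version):

```sql
-- Sketch: surface the hottest statement fingerprints on the local node.
-- Not an option on the 2.x cluster discussed in this issue.
SELECT application_name, key, count, service_lat_avg
FROM crdb_internal.node_statement_statistics
ORDER BY count DESC
LIMIT 20;
```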
Oh - and in the short term - definitely tell them to stop the deletion jobs. If that doesn't fix the problem, at least we know there's something else going on. But hopefully it will prevent the crash from recurring.
@tbg we moved the deletion jobs to midnight and reduced the chunk size to 8, which stopped the problems from appearing.
@christianhuening great, that's good to know. Could you share details on the deletion jobs? I have the schemas in the debug zip, so the queries that are run and some idea of how much they ought to be deleting would be a good start.
@gigatexal could you elaborate on the nature of these deletion jobs?
There's a job to clean up authorizations:
But I don't have the actual delete statement. The access token cleanup runs this:
The refresh token cleanup runs this:
The vault key cleanup job to remove users does this:
The vault key cleanup job to clean up tokens runs this:
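The actual statements didn't survive in this thread, but for a rough illustration of the shape such jobs usually take, here is a minimal sketch of a chunked cleanup. All table and column names are hypothetical, not taken from the customer's schema; CockroachDB allows LIMIT on DELETE, which is what the "chunk size" knob above maps to:

```sql
-- Hypothetical sketch: names are illustrative, not the customer's schema.
-- Each run deletes one small chunk; repeat until no rows are affected.
DELETE FROM access_tokens
WHERE expires_at < now()
LIMIT 8; -- the reduced chunk size mentioned above
```

Keeping each DELETE small bounds the size of every transaction, which is the usual reason chunking (and scheduling the jobs off-peak) relieves pressure on the cluster.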
@jordanlewis here's the schema. |
While I don't have a repro, a schema, or many details, I witnessed something similar in 19.2.1 (using
I have something continually polling a table that is receiving frequent inserts, and at one point I just started getting connection refused. Pardon my lack of detail on the problem; I just thought I'd share. I'll update if I have any more info. EDIT: I should mention that this is with the CPU maxed out, similar to what #31893 (comment) is seeing.
The relevant Zendesk issue was closed internally. I don't know that we were able to definitively say what the cause was here, but from my reading we suspected it to be along the lines of "bad things happen when the cluster is overprovisioned". Going to close this for now. @cretz: sorry for the complete silence here; I encourage you to file separately if you're still running into issues or have more details to share. Those logs don't say much by themselves; they usually point to some form of unhealthiness somewhere (bad disk, bad network conditions, etc.), so they should be thought of as a symptom of some other problem. It's not what we typically expect under stable operation.
Debug zip available here: https://drive.google.com/drive/u/1/folders/1Vib6JWu2Aziwd63hrghW5SkqoQjAbqa7
Timeline of errors and warnings across nodes available here: https://docs.google.com/spreadsheets/d/1LFgMdlnvoI4NAPvqAIt0lvmpLk04JR1kbrLYSZsbUxU/edit#gid=338006228
Original screenshots:
tl;dr: a spike in latency and node unavailability was observed after kicking off a job to clean up a significant amount of data. The badness seems to start at 2019-03-28 5:10:43 on n1:
And it gets worse from there. Roughly 10 seconds later we see errors cascade to other nodes - `handle raft ready`, slow heartbeats, and heartbeat failures on epoch increments. The issues continue for roughly 10 minutes. At 5:19 the issue seems to peak with this summary on n2:

```
[n2,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:65} {StoreID:2 Category:METRICS Description:queue.raftlog.process.failure Value:11} {StoreID:2 Category:METRICS Description:requests.slow.lease Value:48} {StoreID:2 Category:METRICS Description:requests.slow.raft Value:1} {StoreID:2 Category:METRICS Description:queue.replicagc.process.failure Value:5}]}
```
By 5:20 latency drops and nodes become available again.
As far as setup is concerned, the only thing to be aware of is that they're running their stores on Ceph. I confirmed that there were plenty of IOPS available, and there was no detected spike in I/O latency on Ceph's side.
Original customer report: