Server crash causing EOF on clients #30227
Hi @alanhamlett, thanks for submitting this issue. It looks like the node whose logs you shared is unable to join the cluster. This section on troubleshooting networking issues might help: https://www.cockroachlabs.com/docs/v2.0/cluster-setup-troubleshooting.html#networking-troubleshooting Is the address specified in … reachable from the other nodes?
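If it's a connectivity problem, a quick reachability check from the affected node can rule out the network path (host and port below are placeholders for the values passed to `--join`):

```sh
# Succeeds if a TCP connection to the peer's RPC port can be opened.
nc -vz roach-node-1 26257
```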
That's just a temporary issue; looking further down in the logs shows it does join the cluster. I don't think it's related to the crash, which happens several hours afterwards.
Got it. Are there any cockroach logs of the actual crash, or are the lines you shared the last ones you see? If you don't see a stack trace caused by a panic, you are most likely running out of memory. How does memory usage look before the crash?
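If the process is being OOM-killed, the kernel normally logs it; a quick way to check on Linux (a sketch, assuming `dmesg`/journald access):

```sh
# Look for OOM-killer activity in the kernel log around the crash time.
dmesg -T | grep -i 'out of memory'
# Equivalent check via journald's kernel messages.
journalctl -k | grep -i 'killed process'
```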
@alanhamlett, these logs look like the server is being shut down gracefully. Do you know of anything that could be sending a `SIGTERM` to the process?
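For instance, if the node runs under systemd, a `systemctl stop` or `restart` delivers a `SIGTERM` by default; the journal for a hypothetical `cockroach` unit would show it:

```sh
# Unit name is hypothetical; lists recent stop/start events that
# would have sent SIGTERM to the cockroach process.
journalctl -u cockroach | grep -Ei 'stopping|stopped|started'
```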
Oops, that was me applying a config change to decrease memory usage, because I also suspected running out of RAM as the cause. I'll look for crash logs today, but last time I checked, the log just ended abruptly without any error.
Found the log line we've been looking for. Here's the part of the cockroach log containing the fatal error; I truncated the end because it goes on for a while.
It hasn't crashed since decreasing the two memory settings from 40% to 30% each. Before, they added up to 40% + 40% = 80% of RAM. Why would capping memory at 80% cause a crash? Are there other memory settings for cockroachdb that could put it over 100%? On a side note: after decreasing the memory settings, cockroachdb is taking 100% CPU and write latency has skyrocketed.
@alanhamlett unfortunately CockroachDB will need some slack as far as memory goes. The `--cache` and `--max-sql-memory` budgets don't cover everything the process allocates, so some headroom has to be left for the rest. I'm glad to see you're able to run successfully with a total of 60%, and I agree that it'd be better if more of the memory could be "assigned" to CockroachDB via the flags. We're also aware that the performance degradation when you max out the CPU is not graceful enough. I know we have an issue filed for it, but was unable to look it up (@arjunravinarayan, @nvanbenschoten?). Could you post a snippet of regular log traffic with your new settings?
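For reference, a start invocation with both knobs at 30% might look like this (store path and join addresses are placeholders, and an insecure cluster is assumed; `--cache` and `--max-sql-memory` are the documented flags):

```sh
# Each flag takes a fraction (or percentage) of total system RAM.
# 30% + 30% leaves roughly 40% of RAM as slack for everything the
# flags don't cover (Go runtime, RocksDB overhead, OS caches, ...).
cockroach start \
  --insecure \
  --store=/mnt/data1 \
  --cache=.30 \
  --max-sql-memory=.30 \
  --join=roach-node-1:26257,roach-node-2:26257,roach-node-3:26257
```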
Sure, I've decreased replication to 2 now and here's the tail of the current log file on that same machine:
Since we know it was just a memory issue, I'll close this issue. I'd like to follow the perf degradation issue, since that's now a blocker for my long-term use case.
These logs look better. Just to make sure I understand, did you decrease the replication factor to two? I'd stick with an odd number, i.e. three. A replication factor of two loses availability when either replica dies, which is likely not what you want (i.e. it's as bad as a single node in a sense). I'm going to reopen this so that @arjunravinarayan or @nvanbenschoten can point you at the issue tracking the performance degradation.
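For completeness, on v2.0 the replication factor lives in zone configs; a sketch for setting the default zone back to three replicas (assumes an insecure cluster):

```sh
# v2.0 CLI syntax: pipe YAML into `cockroach zone set` for the default
# zone, which covers all tables without a more specific zone config.
echo 'num_replicas: 3' | cockroach zone set .default --insecure -f -
```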
To slightly amend @tschottdorf's comment above: a replication factor of two is as bad as a replication factor of one as far as availability goes. You still get the durability benefit if, say, one machine is destroyed (i.e. you can't recover the drive). I'll let @nvanbenschoten link to the appropriate issue; I'm unsure which one specifically you're talking about.
Yes and no. Yes, the data will be there, but there's no straightforward way to spin up a cluster from it (unless …).
This shows the performance degradation due to high CPU load, with queries per second (QPS) dropping significantly: The trigger is reading and then deleting rows for an archive pipeline, but cockroach shouldn't face-plant like this under heavy load. This chart shows CPU usage on one of the cockroach machines on the same time scale as the charts above:
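One pattern that usually softens this kind of spike, offered as a sketch with hypothetical table and column names, is bounding each delete so the archive pass runs in small batches instead of one large scan-and-delete:

```sh
# Delete at most 1000 archived rows per statement and loop until a
# pass deletes nothing; assumes the CLI echoes a "DELETE <count>" tag.
while :; do
  out=$(cockroach sql --insecure -e \
    "DELETE FROM events WHERE id IN
       (SELECT id FROM events WHERE archived = true LIMIT 1000);")
  echo "$out"
  case "$out" in *"DELETE 0"*) break ;; esac
  sleep 1  # give the cluster breathing room between batches
done
```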
Thanks for the graphs @alanhamlett, they pretty clearly demonstrate a shortcoming in CockroachDB. We've been aware for some time that performance doesn't degrade gracefully under heavy load, but I actually couldn't find a tracking issue on GitHub for it, so I just opened #30489 to track this going forward. Please feel free to add any more information to that issue; details about the workload and cluster setup would be especially useful.
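One artifact that's easy to capture and attach to #30489: a CPU profile from a loaded node. CockroachDB serves Go's pprof endpoints on its HTTP port (8080 by default; the host below is a placeholder, and an insecure cluster is assumed):

```sh
# Grab a 30-second CPU profile while the archive job is running...
curl -o cpu.pprof 'http://roach-node-1:8080/debug/pprof/profile?seconds=30'
# ...and inspect it with the Go toolchain.
go tool pprof cpu.pprof
```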
BUG REPORT
Nodes in a 3-node cluster are periodically crashing. Around that time, the nodes report elevated CPU usage, but I'm not sure whether it's a cause or a symptom.

Version: v2.0.5

Journal crash log
Cockroach log