[v1.1.4] Wild logging when disk full #21756

Closed
gpaul opened this issue Jan 24, 2018 · 2 comments · Fixed by #21768

gpaul commented Jan 24, 2018

BUG REPORT

  1. Please supply the header (i.e. the first few lines) of your most recent
    log file for each node in your cluster. On most unix-based systems
    running with defaults, this boils down to the output of

    grep -F '[config]' cockroach-data/logs/cockroach.log

    When log files are not available, supply the output of cockroach version
    and all flags/environment variables passed to cockroach start instead.

# /opt/mesosphere/active/cockroach/bin/cockroach version
Build Tag:    v1.1.4
Build Time:   2018/01/22 14:27:22
Distribution: CCL
Platform:     linux amd64
Go Version:   go1.9.2
C Compiler:   gcc 4.9.3
Build SHA-1:  b794b52cbfffa2340cdaabf1c33be716ebde1db4
Build Type:   development
/opt/mesosphere/active/cockroach/bin/cockroach start --logtostderr --cache=100MiB --store=/var/lib/dcos/cockroach --certs-dir=/run/dcos/pki/cockroach --advertise-host=10.0.4.57 --host=10.0.4.57 --port=26257 --http-host=127.0.0.1 --http-port=8090 --extra-1.0-compatibility --pid-file=/run/dcos/cockroach/cockroach.pid --join=10.0.5.98,10.0.4.146,10.0.4.57
  2. Please describe the issue you observed:
  • What did you do?
    I have a 3-node cluster.
    I filled up the 80 GB partition that holds the CockroachDB store directory on the 1st node.
    Requests were succeeding.
    I then started filling up the same partition on the 2nd node, but when I got halfway I tried another request on node 1 and found that it took a very long time to respond.
    I checked the logs and found that they appeared to be logging in a tight loop.
    After a few seconds the process crashed with the attached logs and stacktrace.
    It looks like it tries to log an error if it fails to write a log.

  • What did you expect to see?
    I expect CockroachDB to exit on ENOSPC.

  • What did you see instead?
    Wild logging followed by a crash.
    logs.txt


mrtracy commented Jan 24, 2018

That "no space left on device" is the error message associated with ENOSPC. I agree with your assessment: it appears that the process got into a recursive loop trying to write the "no space left on device" log message and then hit a stack overflow.

@a-robinson is this the kind of thing you've been looking at, or a new behavior?
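
For readers who want to check for this condition in their own Go code, here is a minimal sketch of detecting ENOSPC behind a wrapped write error. The helper names `isDiskFull` and `simulatedFullDiskError` are hypothetical and are not CockroachDB functions. `errors.Is` requires Go 1.13 or later, so it would not have been available in the go1.9.2 build reported above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isDiskFull reports whether err, or any error it wraps, is ENOSPC.
// Hypothetical helper for illustration only.
func isDiskFull(err error) bool {
	return errors.Is(err, syscall.ENOSPC)
}

// simulatedFullDiskError builds the kind of error a failed write to a
// full disk typically surfaces as: an *os.PathError wrapping ENOSPC.
// The path is made up for the example.
func simulatedFullDiskError() error {
	return &os.PathError{Op: "write", Path: "cockroach.log", Err: syscall.ENOSPC}
}

func main() {
	fmt.Println(isDiskFull(simulatedFullDiskError())) // true
	fmt.Println(isDiskFull(os.ErrNotExist))           // false
}
```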

a-robinson commented

This is not what I've been looking at. I haven't seen this before. It should be easy to track down, though -- from the logs it looks like the logging exit function is trying to log something, which is calling the logging exit function, and so on.
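
The cycle described here can be illustrated with a small, simplified model. This is only a sketch under stated assumptions: `cyclicLogger`, its methods, and `maxDepth` are hypothetical names, not CockroachDB's logging code, and the recursion is capped so the example terminates instead of overflowing the stack.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSpace stands in for the ENOSPC error returned by a write to a full disk.
var errNoSpace = errors.New("no space left on device")

// maxDepth caps the recursion for this demo. Without a cap, the cycle
// below continues until the goroutine stack overflows.
const maxDepth = 5

// cyclicLogger is a hypothetical, simplified logger.
type cyclicLogger struct {
	depth int // number of times exit has been entered
}

// write simulates a log write that always fails because the disk is full.
func (l *cyclicLogger) write(msg string) error { return errNoSpace }

// output tries to write a message and calls exit if the write fails.
func (l *cyclicLogger) output(msg string) {
	if err := l.write(msg); err != nil {
		l.exit(err)
	}
}

// exit tries to log why the process is exiting, which calls output again.
func (l *cyclicLogger) exit(err error) {
	l.depth++
	if l.depth >= maxDepth {
		return // demo-only cap
	}
	l.output("log: exiting because of error: " + err.Error())
}

// runCycle performs one log call and returns how deep the cycle went.
func runCycle() int {
	l := &cyclicLogger{}
	l.output("hello")
	return l.depth
}

func main() {
	fmt.Println(runCycle()) // prints 5, i.e. the cycle only stopped at the cap
}
```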

a-robinson self-assigned this Jan 24, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Jan 24, 2018
Trying to write to a file when we're out of disk will trigger
exitLocked, but exitLocked tries to write to its file one last time in
order to help users understand why the process is exiting. This is very
valuable most of the time, when the problem isn't that the machine is
out of disk, but shouldn't cause a stack overflow when the machine is
out of space.

Fixes cockroachdb#21756

Release note (bug fix): fix a stack overflow in the code for shutting
down a server when out of disk space
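
One common way to break this kind of cycle is a re-entrancy guard: attempt the final write once, and if that write fails too, skip further logging and proceed to exit. The sketch below illustrates that general technique only; it is not the actual change from the commit, and `guardedLogger`, its fields, and `runGuarded` are hypothetical names. It is single-goroutine for brevity, and the exit function is injected so the example can run without terminating the process.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSpace stands in for the ENOSPC error returned by a write to a full disk.
var errNoSpace = errors.New("no space left on device")

// guardedLogger is a hypothetical, simplified logger with an exit guard.
type guardedLogger struct {
	exiting  bool           // set once exit has started
	writes   int            // number of attempted writes
	exitFunc func(code int) // os.Exit in a real program; injected for the demo
}

// write simulates a log write that always fails because the disk is full.
func (l *guardedLogger) write(msg string) error {
	l.writes++
	return errNoSpace
}

// output tries to write a message and calls exit if the write fails.
func (l *guardedLogger) output(msg string) {
	if err := l.write(msg); err != nil {
		l.exit(err)
	}
}

// exit makes one last attempt to record why the process is exiting.
// If that attempt fails and re-enters exit, the guard returns early.
func (l *guardedLogger) exit(err error) {
	if l.exiting {
		return // already exiting; do not recurse
	}
	l.exiting = true
	l.output("log: exiting because of error: " + err.Error())
	l.exitFunc(2)
}

// runGuarded performs one log call on a full disk and returns the number
// of attempted writes and the exit codes that were requested.
func runGuarded() (int, []int) {
	var codes []int
	l := &guardedLogger{exitFunc: func(code int) { codes = append(codes, code) }}
	l.output("hello")
	return l.writes, codes
}

func main() {
	writes, codes := runGuarded()
	fmt.Println(writes, codes) // 2 [2]
}
```

The two writes are the original message and the single final attempt; the exit function is called exactly once.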
a-robinson added a commit to a-robinson/cockroach that referenced this issue Jan 25, 2018