[v1.1.4] Wild logging when disk full #21756

gpaul · 2018-01-24T17:34:15Z

BUG REPORT

Please supply the header (i.e. the first few lines) of your most recent
log file for each node in your cluster. On most unix-based systems
running with defaults, this boils down to the output of

grep -F '[config]' cockroach-data/logs/cockroach.log

When log files are not available, supply the output of cockroach version
and all flags/environment variables passed to cockroach start instead.

# /opt/mesosphere/active/cockroach/bin/cockroach version
Build Tag:    v1.1.4
Build Time:   2018/01/22 14:27:22
Distribution: CCL
Platform:     linux amd64
Go Version:   go1.9.2
C Compiler:   gcc 4.9.3
Build SHA-1:  b794b52cbfffa2340cdaabf1c33be716ebde1db4
Build Type:   development

/opt/mesosphere/active/cockroach/bin/cockroach start --logtostderr --cache=100MiB --store=/var/lib/dcos/cockroach --certs-dir=/run/dcos/pki/cockroach --advertise-host=10.0.4.57 --host=10.0.4.57 --port=26257 --http-host=127.0.0.1 --http-port=8090 --extra-1.0-compatibility --pid-file=/run/dcos/cockroach/cockroach.pid --join=10.0.5.98,10.0.4.146,10.0.4.57

Please describe the issue you observed:

What did you do?

I have a 3-node cluster.
I filled up the 80GB partition on the 1st node, the partition on which the CockroachDB store directory is located.
Requests were still succeeding.
I then started filling up the same partition on the 2nd node. When I got halfway, I tried another request on node 1 and found that it took a very long time to respond.
I checked the logs and found that the process appeared to be logging in a tight loop.
After a few seconds the process crashed with the attached logs and stacktrace.
It looks like it tries to log an error whenever it fails to write a log entry.

What did you expect to see?

I expect CockroachDB to exit on ENOSPC.

What did you see instead?

Wild logging followed by a crash.
logs.txt


mrtracy · 2018-01-24T19:15:05Z

That "no space left on device" is the error message associated with ENOSPC. I agree with your assessment: it appears that the process got into a recursive loop trying to write the "no space left on device" log message and then hit a stack overflow.

@a-robinson is this the kind of thing you've been looking at, or a new behavior?

a-robinson · 2018-01-24T19:18:17Z

This is not what I've been looking at. I haven't seen this before. It should be easy to track down, though -- from the logs it looks like the logging exit function is trying to log something, which is calling the logging exit function, and so on.

Trying to write to a file when we're out of disk will trigger exitLocked, but exitLocked tries to write to its file one last time in order to help users understand why the process is exiting. This is very valuable most of the time, when the problem isn't that the machine is out of disk, but shouldn't cause a stack overflow when the machine is out of space.

Fixes cockroachdb#21756

Release note (bug fix): fix a stack overflow in the code for shutting down a server when out of disk space
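The recursion described here, and one possible way to break it, can be sketched in a few lines of Go. This is not the actual change from the fix; it is a simplified illustration under stated assumptions. Only the name `exitLocked` comes from the description above. The `logger` and `fullDisk` types, the `exiting` flag, and the injected `exit` callback (standing in for a real process exit) are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// errNoSpace stands in for the ENOSPC error a full disk would produce.
var errNoSpace = errors.New("no space left on device")

// fullDisk is a hypothetical writer that fails every write, like a log file
// on a full partition. It counts how often a write was attempted.
type fullDisk struct{ attempts int }

func (d *fullDisk) Write(p []byte) (int, error) {
	d.attempts++
	return 0, errNoSpace
}

// logger is a hypothetical, heavily simplified stand-in for a file logger.
type logger struct {
	out     io.Writer
	exiting bool        // set once exit handling has started
	exit    func(error) // replaces a real process exit so the sketch is testable
}

// output writes one entry and enters the exit path if the write fails.
func (l *logger) output(msg string) {
	if _, err := l.out.Write([]byte(msg + "\n")); err != nil {
		l.exitLocked(err)
	}
}

// exitLocked tries to record why the process is exiting, then exits.
// Without the exiting guard, the failed final write would call exitLocked
// again, which would write again, and so on until the stack overflows.
func (l *logger) exitLocked(err error) {
	if l.exiting {
		return // the final write failed too; do not recurse
	}
	l.exiting = true
	l.output(fmt.Sprintf("logging error: %v", err))
	l.exit(err)
}

func main() {
	disk := &fullDisk{}
	l := &logger{out: disk, exit: func(err error) {
		fmt.Fprintln(os.Stderr, "exiting:", err)
	}}
	l.output("hello")
	// One attempt for the entry itself, one for the final message.
	fmt.Println("write attempts:", disk.attempts)
}
```

With the guard in place, the logger makes one last attempt to explain the exit and then gives up, so the useful final message is kept in the common case while the out-of-disk case terminates cleanly.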

a-robinson self-assigned this Jan 24, 2018

a-robinson mentioned this issue Jan 24, 2018

util/log: Avoid infinite recursion out of disk errors cause an exit #21768

Merged

a-robinson closed this as completed in #21768 Jan 25, 2018

a-robinson mentioned this issue Jan 25, 2018

cherrypick-1.1: util/log: Avoid infinite recursion out of disk errors cause an exit #21804

Merged
