release-22.2: cli: close listeners and all open connections on disk stall #96145
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens)
pkg/util/log/exit_override.go
line 32 at r1 (raw file):
```go
// connections in the event of a disk stall that may prevent the process from
// exiting.
func EmergencyStop() {
```
We'll want to call this from `util/log.flushAndMaybeSyncLocked`, though that can wait until we verify this approach works and get buy-in on the general design from someone more familiar with the server startup/shutdown sequence.
pkg/util/log/exit_override.go
Outdated
```go
// SetEmergencyStopFunc, if any. EmergencyStop is a hack to close network
// connections in the event of a disk stall that may prevent the process from
// exiting.
func EmergencyStop() {
```
I realize we shouldn't really "log", but I'm wondering if there's some way of leaving a paper trail that we got in here and did the emergency stop?
Force-pushed fc20ec2 to 1df2124
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @nicktrav)
pkg/util/log/exit_override.go
line 170 at r2 (raw file):
```go
// Ignore errors so that if we can't close all sockets, we close as many
// as we can. When finished, return the first error observed.
if fdErr := forceCloseFD(fd); err == nil {
```
Between listing the file descriptor and attempting to close it, the socket may have been closed and re-opened as an on-disk file which is blocked on disk IO. Is there anything to be done about this race? We could close the FDs in separate goroutines. Perhaps we should experiment with using `unix.Shutdown` and see what the behavior of that system call is on sockets vs files.
pkg/util/log/exit_override.go
line 196 at r2 (raw file):
```go
fd, err := strconv.Atoi(name)
if err != nil {
	return nil, err
}
```
Should we stumble forward here rather than returning an error?
Force-pushed fbb85b4 to 74a5589
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @petermattis)
pkg/util/log/exit_override.go
line 32 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
We'll want to call this from `util/log.flushAndMaybeSyncLocked`, though that can wait until we verify this approach works and get buy-in on the general design from someone more familiar with the server startup/shutdown sequence.
Done.
pkg/util/log/exit_override.go
line 170 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Between listing the file descriptor and attempting to close it the socket may have been closed and re-opened as an on-disk file which is blocked on disk IO. Is there anything to be done about this race? We could close the FDs in separate goroutines. Perhaps we should experiment with using `unix.Shutdown` and see what the behavior of that system call is on sockets vs files.
Good catch. I'm trying `shutdown` now. If it works, it seems better than `close`-ing in separate goroutines. The separate goroutines can introduce a different race between the first pass through `closeAllSockets` and the second pass, with the second pass listing the file descriptors before the first pass successfully closes a LISTEN socket.
pkg/util/log/exit_override.go
line 196 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Should we stumble forward here rather than returning an error?
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @petermattis)
pkg/util/log/exit_override.go
line 32 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I realize we shouldn't really "log", but I'm wondering if there's some way of leaving a paper trail that we got in here and did the emergency stop?
Yeah, I'm not really sure how though.
Force-pushed 2cfc5e2 to 7eb0058
My main requests for this:
- move the main socket shutdown code to the `cli` package. Just leave a function reference in the `log` package that the `cli` package will inject its stuff into with a callback (dep injection)
- move the main code (that reads /proc) into the `_unix.go` file; it's not going to work on non-unix platforms.
- try `/dev/fd` first. It's more portable than `/proc/self/fd` and is equivalent on linux.
Reviewed 1 of 3 files at r1, 4 of 8 files at r2, 2 of 3 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @petermattis)
Force-pushed e3bebda to 68199a4
Force-pushed 68199a4 to d39dd66
Thanks @knz, this is ready for another look
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nicktrav, @petermattis, and @RaduBerinde)
Reviewed 13 of 13 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jbowens, @nicktrav, @petermattis, and @RaduBerinde)
-- commits
line 10 at r6:
nit: grammar: was detected, would fatal -> would enter, would be unable (tense concordance)
pkg/cli/start_unix.go
line 199 at r6 (raw file):
```go
// Ignore errors so that if we can't close all sockets, we close as many
// as we can. When finished, return the first error observed.
if fdErr := unix.Shutdown(fd, unix.SHUT_RDWR); err == nil {
```
Easier on the eye:

```go
for _, fd := range fds {
	// ...
	fdErr := unix.Shutdown(...)
	err = errors.CombineErrors(err, fdErr)
}
return err
```
pkg/cli/start_unix.go
line 220 at r6 (raw file):
```go
dst, readLinkErr := os.Readlink(filepath.Join("/dev/fd", name))
if readLinkErr != nil {
	// Stumble forward.
```
Consider errors.CombineErrors() here and below.
pkg/util/log/exit_override.go
line 32 at r6 (raw file):
```go
// connections in the event of a disk stall that may prevent the process from
// exiting.
func EmergencyStop() {
```
I think this function is not well named. As-is, its name suggests that it stops the process. But that is not the case. Maybe this should be called "MakeProcessUnavailable" or "ShutdownNetworkSockets" instead.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jbowens, @nicktrav, and @RaduBerinde)
pkg/cli/start_unix.go
line 179 at r6 (raw file):
```go
func closeAllSockets() {
	// Close all sockets twice. A LISTEN socket may open a new socket after we
	// list all FDs. If that's the case, be closed by the second call.
```
Nit: s/be closed/the socket will be closed/g
pkg/cli/start_unix.go
line 179 at r6 (raw file):
```go
func closeAllSockets() {
	// Close all sockets twice. A LISTEN socket may open a new socket after we
	// list all FDs. If that's the case, be closed by the second call.
```
Should we add a TODO (or file an issue) that retry handling might open a new socket when an outgoing connection is closed. We could add some sort of poison marker to the `rpc.Context` to prevent that.
Force-pushed d39dd66 to 0c84f1a
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nicktrav, @petermattis, and @RaduBerinde)
Previously, knz (Raphael 'kena' Poss) wrote…
nit: grammar: was detected, would fatal -> would enter, would be unable (tense concordance)
Done.
pkg/cli/start_unix.go
line 179 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: s/be closed/the socket will be closed/g
Done.
pkg/cli/start_unix.go
line 179 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Should we add a TODO (or file an issue) that retry handling might open a new socket when an outgoing connection is closed. We could add some sort of poison marker to the `rpc.Context` to prevent that.
Done. Filed #96342 too.
pkg/cli/start_unix.go
line 199 at r6 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Easier on the eye:

```go
for _, fd := range fds {
	// ...
	fdErr := unix.Shutdown(...)
	err = errors.CombineErrors(err, fdErr)
}
return err
```
Done.
pkg/cli/start_unix.go
line 220 at r6 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Consider errors.CombineErrors() here and below.
Done.
pkg/util/log/exit_override.go
line 32 at r6 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I think this function is not well named. As-is, its name suggests that it stops the process. But that is not the case. Maybe this should be called "MakeProcessUnavailable" or "ShutdownNetworkSockets" instead.
Done.
Force-pushed 0c84f1a to 0dca33f
TFTRs!
96193: roachtest/awsdms: add large data replication test r=otan a=Jeremyyang920

This commit adds a test that will create a DMS task to full-load 200M rows into a CRDB cluster.
Fixes: #95329
Release note: None

96292: rpc: Make it explicit that we only use Version from Settings r=knz a=andrewbaptist

Cleaning up some of the code in this package: the entire Settings object was stored in the HeartbeatService, but only the Version was required. This PR removes the unnecessary struct.
Epic: none
Release note: None

96313: backupccl: fix key rewriter race in generative split and scatter processor r=rhu713 a=rhu713

The generative split and scatter processor is currently causing tests to fail under race because many goroutines operate on the same splitAndScatterer, which cannot be used concurrently as the underlying key rewriter cannot be used concurrently. Modify the processor so that every worker that uses the splitAndScatterer now uses its own instance.
Fixes: #95808
Release note: None

96325: ci: update bazel builder image r=rickystewart a=cockroach-teamcity

Release note: None
Epic: None

96346: randgen: do not generate REGPROC and REGPROCEDURE columns r=mgartner a=mgartner

Informs #95641
Epic: None
Release note: None

96371: cli: close listeners and all open connections on disk stall r=nicktrav a=jbowens

Forwardport of #96145.
Epic: None
Release note (bug fix): Fix a bug where a node with a disk stall would continue to accept new connections and preserve existing connections until the disk stall abated.

Co-authored-by: Jeremy Yang <jyang@cockroachlabs.com>
Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: cockroach-teamcity <teamcity@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Disk stalls prevent a node from making progress. Any ranges for which the
stalled node is leaseholder may also be prevented from making progress while
the stalled node remains online but incapacitated. CockroachDB nodes detect
stalls within their stores through timing all write filesystem operations.
Previously, when a stall was detected, Cockroach would simply fatal the
process. However, a process blocked on disk IO cannot be terminated. The
process will enter the zombie state, but will be unable to be reaped.
This commit adds a new step to disk stall handling, closing all open sockets.
Epic: None
Release note (bug fix): Fix a bug where a node with a disk stall would continue
to accept new connections and preserve existing connections until the disk
stall abated.
Release justification: Fixes high severity issue of a Cockroach process hanging
with open network connections in the event of a disk stall.