Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.2: cli: close listeners and all open connections on disk stall #96145

Merged
merged 1 commit into from
Feb 1, 2023

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Jan 28, 2023

Disk stalls prevent a node from making progress. Any ranges for which the
stalled node is leaseholder may also be prevented from making progress while
the stalled node remains online but incapacitated. CockroachDB nodes detect
stalls within their stores through timing all write filesystem operations.
Previously, when a stall was detected, Cockroach would simply fatal the
process. However, a process blocked on disk IO cannot be terminated. The
process will enter the zombie state, but will be unable to be reaped.

This commit adds a new step to disk stall handling, closing all open sockets.

Epic: None
Release note (bug fix): Fix a bug where a node with a disk stall would continue
to accept new connections and preserve existing connections until the disk
stall abated.
Release justification: Fixes high severity issue of a Cockroach process hanging
with open network connections in the event of a disk stall.

@blathers-crl
Copy link

blathers-crl bot commented Jan 28, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens)


pkg/util/log/exit_override.go line 32 at r1 (raw file):

// connections in the event of a disk stall that may prevent the process from
// exiting.
func EmergencyStop() {

We'll want to call this from util/log.flushAndMaybeSyncLocked, though that can wait until we verify this approach works and to get buy in on the general design from someone more familiar with the server startup/shutdown sequence.

// SetEmergencyStopFunc, if any. EmergencyStop is a hack to close network
// connections in the event of a disk stall that may prevent the process from
// exiting.
func EmergencyStop() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize we shouldn't really "log", but I'm wondering if there's some way of leaving a paper trail that we got in here and did the emergency stop?

@jbowens jbowens force-pushed the 22.2-emergency-stop branch 3 times, most recently from fc20ec2 to 1df2124 Compare January 30, 2023 16:16
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @nicktrav)


pkg/util/log/exit_override.go line 170 at r2 (raw file):

		// Ignore errors so that if we can't close all sockets, we close as many
		// as we can. When finished, return the first error observed.
		if fdErr := forceCloseFD(fd); err == nil {

Between listing the file descriptor and attempting to close it the socket may have been closed and re-opened as an on-disk file which is blocked on disk IO. Is there anything to be done about this race? We could close the FDs in separate goroutines. Perhaps we should experiment with using unix.Shutdown and see what the behavior of that system call is on sockets vs files.


pkg/util/log/exit_override.go line 196 at r2 (raw file):

			fd, err := strconv.Atoi(name)
			if err != nil {
				return nil, err

Should we stumble forward here rather than returning an error?

@jbowens jbowens force-pushed the 22.2-emergency-stop branch 2 times, most recently from fbb85b4 to 74a5589 Compare January 30, 2023 16:47
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @petermattis)


pkg/util/log/exit_override.go line 32 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We'll want to call this from util/log.flushAndMaybeSyncLocked, though that can wait until we verify this approach works and to get buy in on the general design from someone more familiar with the server startup/shutdown sequence.

Done.


pkg/util/log/exit_override.go line 170 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Between listing the file descriptor and attempting to close it the socket may have been closed and re-opened as an on-disk file which is blocked on disk IO. Is there anything to be done about this race? We could close the FDs in separate goroutines. Perhaps we should experiment with using unix.Shutdown and see what the behavior of that system call is on sockets vs files.

Good catch. I'm trying shutdown now. If it works, it seems better than close-ing in separate goroutines. The separate goroutines can introduce a different race between the first pass through closeAllSockets and the second pass, with the second pass listing the file descriptors before the first pass successfully closes a LISTEN socket.


pkg/util/log/exit_override.go line 196 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should we stumble forward here rather than returning an error?

Done.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @petermattis)


pkg/util/log/exit_override.go line 32 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I realize we shouldn't really "log", but I'm wondering if there's some way of leaving a paper trail that we got in here and did the emergency stop?

Yeah, I'm not really sure how though.

@jbowens jbowens force-pushed the 22.2-emergency-stop branch 2 times, most recently from 2cfc5e2 to 7eb0058 Compare January 30, 2023 19:26
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main requests for this:

  • move the main socket shutdown code to the cli package. Just leave a function reference in the log package that the cli package will inject its stuff into with a callback (dep injection)

  • move the main code (that reads /proc) in the _unix.go file, it's not going to work on non-unix platforms.

  • try /dev/fd first. It's more portable than /proc/self/fdand is equivalent on linux.

Reviewed 1 of 3 files at r1, 4 of 8 files at r2, 2 of 3 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nicktrav and @petermattis)

@jbowens jbowens force-pushed the 22.2-emergency-stop branch 3 times, most recently from e3bebda to 68199a4 Compare January 31, 2023 19:40
@jbowens jbowens changed the title crl-release-22.2: server: close listeners and all open connections on disk stall crl-release-22.2: cli: close listeners and all open connections on disk stall Jan 31, 2023
@jbowens jbowens marked this pull request as ready for review January 31, 2023 20:00
@jbowens jbowens requested a review from a team January 31, 2023 20:00
@jbowens jbowens requested review from a team as code owners January 31, 2023 20:00
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @knz, this is ready for another look

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nicktrav, @petermattis, and @RaduBerinde)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: with nits

Reviewed 13 of 13 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens, @nicktrav, @petermattis, and @RaduBerinde)


-- commits line 10 at r6:
nit: grammar: was detected, would fatal -> would enter, would be unable (tense concordance)


pkg/cli/start_unix.go line 199 at r6 (raw file):

		// Ignore errors so that if we can't close all sockets, we close as many
		// as we can. When finished, return the first error observed.
		if fdErr := unix.Shutdown(fd, unix.SHUT_RDWR); err == nil {

More easy on the eye:

for _, fd := range fds {
   // ... 
   fdErr := unix.Shutdown(...)
   err = errors.CombineErrors(err, fdErr)
}
return err

pkg/cli/start_unix.go line 220 at r6 (raw file):

		dst, readLinkErr := os.Readlink(filepath.Join("/dev/fd", name))
		if readLinkErr != nil {
			// Stumble forward.

Consider errors.CombineErrors() here and below.


pkg/util/log/exit_override.go line 32 at r6 (raw file):

// connections in the event of a disk stall that may prevent the process from
// exiting.
func EmergencyStop() {

I think this function is not well named. As-is, its name suggests that it stops the process. But that is not the case. Maybe this should be called "MakeProcessUnavailable" or "ShutdownNetworkSockets" instead.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens, @nicktrav, and @RaduBerinde)


pkg/cli/start_unix.go line 179 at r6 (raw file):

func closeAllSockets() {
	// Close all sockets twice. A LISTEN socket may open a new socket after we
	// list all FDs. If that's the case, be closed by the second call.

Nit: s/be closed/the socket will be closed/g


pkg/cli/start_unix.go line 179 at r6 (raw file):

func closeAllSockets() {
	// Close all sockets twice. A LISTEN socket may open a new socket after we
	// list all FDs. If that's the case, be closed by the second call.

Should we add a TODO (or file an issue) that retry handling might open a new socket when an outgoing connection is closed. We could add some sort of poison marker to the rpc.Context to prevent that.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nicktrav, @petermattis, and @RaduBerinde)


-- commits line 10 at r6:

Previously, knz (Raphael 'kena' Poss) wrote…

nit: grammar: was detected, would fatal -> would enter, would be unable (tense concordance)

Done.


pkg/cli/start_unix.go line 179 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: s/be closed/the socket will be closed/g

Done.


pkg/cli/start_unix.go line 179 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should we add a TODO (or file an issue) that retry handling might open a new socket when an outgoing connection is closed. We could add some sort of poison marker to the rpc.Context to prevent that.

Done. Filed #96342 too.


pkg/cli/start_unix.go line 199 at r6 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

More easy on the eye:

for _, fd := range fds {
   // ... 
   fdErr := unix.Shutdown(...)
   err = errors.CombineErrors(err, fdErr)
}
return err

Done.


pkg/cli/start_unix.go line 220 at r6 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Consider errors.CombineErrors() here and below.

Done.


pkg/util/log/exit_override.go line 32 at r6 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I think this function is not well named. As-is, its name suggests that it stops the process. But that is not the case. Maybe this should be called "MakeProcessUnavailable" or "ShutdownNetworkSockets" instead.

Done.

Disk stalls prevent a node from making progress. Any ranges for which the
stalled node is leaseholder may also be prevented from making progress while
the stalled node remains online but incapacitated. CockroachDB nodes detect
stalls within their stores through timing all write filesystem operations.
Previously, when a stall was detected, Cockroach would simply fatal the
process. However, a process blocked on disk IO cannot be terminated. The
process would enter the zombie state, but would be unable to be reaped.

This commit adds a new step to disk stall handling, closing all open sockets.

Epic: None
Release note (bug fix): Fix a bug where a node with a disk stall would continue
to accept new connections and preserve existing connections until the disk
stall abated.
@dhartunian dhartunian removed the request for review from a team February 1, 2023 15:47
@jbowens
Copy link
Collaborator Author

jbowens commented Feb 1, 2023

TFTRs!

@jbowens jbowens merged commit f2079be into cockroachdb:release-22.2 Feb 1, 2023
@jbowens jbowens deleted the 22.2-emergency-stop branch February 1, 2023 19:05
@jbowens jbowens changed the title crl-release-22.2: cli: close listeners and all open connections on disk stall release-22.2: cli: close listeners and all open connections on disk stall Feb 1, 2023
craig bot pushed a commit that referenced this pull request Feb 1, 2023
96193: roachtest/awsdms: add large data replication test r=otan a=Jeremyyang920

This commit adds a test that will create a dms task to full load 200M rows into a crdb cluster.

Fixes: #95329
Release note: None

96292: rpc: Make it explicit that we only Version from Settings r=knz a=andrewbaptist

Cleaning up some of the code in this package, the entire Settings object was stored in the HeartbeatService. Only the Version was required.

This PR removes the unnecessary struct.

Epic: none
Release note: None

96313: backupccl: fix key rewriter race in generative split and scatter processor r=rhu713 a=rhu713

The generative split and scatter processor is currently causing tests to fail under race because there are many goroutines that are operating with the same splitAndScatterer, which cannot be used concurrently as the underlying key rewriter cannot be used concurrently. Modify the processor so that every worker that uses the splitAndScatterer now uses its own instance.

Fixes: #95808

Release note: None

96325: ci: update bazel builder image r=rickystewart a=cockroach-teamcity

Release note: None
Epic: None


96346: randgen: do not generate REGPROC and REGPROCEDURE columns r=mgartner a=mgartner

Informs #95641

Epic: None

Release note: None

96371: cli: close listeners and all open connections on disk stall r=nicktrav a=jbowens

Forwardport of #96145.

----

Disk stalls prevent a node from making progress. Any ranges for which the stalled node is leaseholder may also be prevented from making progress while the stalled node remains online but incapacitated. CockroachDB nodes detect stalls within their stores through timing all write filesystem operations. Previously, when a stall was detected, Cockroach would simply fatal the process. However, a process blocked on disk IO cannot be terminated. The process would enter the zombie state, but would be unable to be reaped.

This commit adds a new step to disk stall handling, closing all open sockets.

Epic: None
Release note (bug fix): Fix a bug where a node with a disk stall would continue to accept new connections and preserve existing connections until the disk stall abated.

Co-authored-by: Jeremy Yang <jyang@cockroachlabs.com>
Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: cockroach-teamcity <teamcity@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants