
server: make SQL connection keep alive easily configurable #115833

Merged
merged 2 commits into cockroachdb:master from setKeepAliveProperly on Jan 17, 2024

Conversation

fqazi
Collaborator

@fqazi fqazi commented Dec 7, 2023

Previously, the only way to control the keep-alive time for CRDB was through the environment variable COCKROACH_SQL_TCP_KEEP_ALIVE, which set only the keep-alive idle and probe times. This was difficult to use, and the default meant that CRDB kept idle connections alive for 10 minutes. To make this more configurable, we will introduce two cluster settings: server.sql_tcp_keep_alive_probe_interval and server.sql_tcp_keep_alive_timeout. These settings will control the keep-alive probe and idle intervals, and the probe count (derived from the timeout) after which an unresponsive connection is dropped.

Fixes: #115422
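
As a rough sketch, the proposed settings could be registered like other duration settings in pkg/settings. The names below follow this description (the final patch renames them to server.sql_tcp_keep_alive.interval and server.sql_tcp_keep_alive.count during review), and the defaults shown are illustrative:

```go
package server // sketch only: registrations as proposed in this description

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

var KeepAliveProbeFrequency = settings.RegisterDurationSetting(
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive_probe_interval",
	"time between TCP keep-alive probes sent on SQL connections",
	10*time.Second,
)

var KeepAliveTimeout = settings.RegisterDurationSetting(
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive_timeout",
	"maximum time a TCP connection can remain unresponsive before being dropped",
	2*time.Minute,
)
```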

@cockroach-teamcity
Member

This change is Reviewable

@fqazi fqazi force-pushed the setKeepAliveProperly branch 11 times, most recently from eb76445 to d22e3be Compare December 8, 2023 19:10
@fqazi fqazi marked this pull request as ready for review December 8, 2023 19:10
@fqazi fqazi requested review from a team as code owners December 8, 2023 19:10
@fqazi fqazi requested review from abarganier and removed request for a team December 8, 2023 19:10
@fqazi fqazi force-pushed the setKeepAliveProperly branch 2 times, most recently from b398efe to c753ad9 Compare December 9, 2023 02:59
Contributor

@erikgrinaker erikgrinaker left a comment


This change may be too large to backport. Are we planning a more targeted fix for backports?

Reviewed 5 of 5 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @fqazi)


-- commits line 27 at r2:
This also changes the default values, and removes the env var -- we should call that out as well.


pkg/server/tcp_keepalive_manager.go line 28 at r2 (raw file):

var KeepAliveTimeout = settings.RegisterDurationSetting(
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.timeout",

Did you consider simply exposing the parameters directly, like Postgres does?
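
(For reference, Postgres exposes these directly as the tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count server parameters, where 0 means "use the operating system default".)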


pkg/server/tcp_keepalive_manager.go line 29 at r2 (raw file):

	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.timeout",
	"maximum time for which an idle TCP connection to the server will be kept alive",

Consider "unresponsive" instead of "idle". It's important to distinguish this from the case of a connected live client that's simply not doing anything.


pkg/server/tcp_keepalive_manager.go line 30 at r2 (raw file):

	"server.sql_tcp_keep_alive.timeout",
	"maximum time for which an idle TCP connection to the server will be kept alive",
	time.Minute*2,

2 minutes seems very high here. I think the only benefit of a high timeout is to tolerate high network latency and transient disconnects, while the downsides can be quite severe (workload stall). Is it important to handle a 2 minute network latency or disconnect?


pkg/server/tcp_keepalive_manager.go line 79 at r2 (raw file):

	// probe count.
	maximumConnectionTime := KeepAliveTimeout.Get(&k.settings.SV)
	probeFrequency := KeepAliveProbeFrequency.Get(&k.settings.SV)

We don't guard against either of these being set to 0 (KeepAliveTimeout = 0 will result in division by 0 below).

We should either take 0 to mean no probes are sent (like in Postgres), or disallow it.
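
A minimal sketch of the guard as a standalone helper; the name and the zero-means-disabled semantics (the Postgres-style option above) are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// probeCount derives the keep-alive probe count from the configured timeout
// and probe interval. Zero (or negative) for either is treated as "probing
// disabled", which also guards the division below against a zero interval.
func probeCount(timeout, interval time.Duration) (count int, enabled bool) {
	if timeout <= 0 || interval <= 0 {
		return 0, false
	}
	count = int((timeout + interval - 1) / interval) // round up
	if count < 1 {
		count = 1
	}
	return count, true
}

func main() {
	fmt.Println(probeCount(2*time.Minute, 10*time.Second)) // 12 true
	fmt.Println(probeCount(0, 10*time.Second))             // 0 false
}
```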


pkg/server/tcp_keepalive_manager_test.go line 31 at r2 (raw file):

)

func TestKeepAliveManager(t *testing.T) {

We should have an end-to-end test as well, which tests that an unresponsive SQL client connection is in fact disconnected and its session torn down within the expected time.

@abarganier abarganier removed their request for review December 12, 2023 15:27
@fqazi fqazi requested a review from a team as a code owner January 10, 2024 18:42
@fqazi fqazi requested review from DarrylWong and renatolabs and removed request for a team January 10, 2024 18:42
Collaborator Author

@fqazi fqazi left a comment


I'm tempted to say we just narrow down the bits for Linux and make them an environment variable enabled by default. What do you think?

@rafiss @erikgrinaker RFAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @erikgrinaker, and @renatolabs)


pkg/server/tcp_keepalive_manager.go line 28 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Did you consider simply exposing the parameters directly, like Postgres does?

Yeah, let's keep this simple. I named it:
server.sql_tcp_keep_alive.count


pkg/server/tcp_keepalive_manager.go line 29 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Consider "unresponsive" instead of "idle". It's important to distinguish this from the case of a connected live client that's simply not doing anything.

Good point, changed this to:

maximum number of probes that will be sent out before a connection dropped because its unresponsive (Linux and Darwin only)


pkg/server/tcp_keepalive_manager.go line 30 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

2 minutes seems very high here. I think the only benefit of a high timeout is to tolerate high network latency and transient disconnects, while the downsides can be quite severe (workload stall). Is it important to handle a 2 minute network latency or disconnect?

I reduced this to a minute, let me know what your gut feel is for this setting.

Contributor

@erikgrinaker erikgrinaker left a comment


Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @fqazi, and @renatolabs)


pkg/cmd/roachtest/tests/network.go line 333 at r3 (raw file):

			return err
		}
		// We expect 2 since the query will itself

nit: comment seems stale (count != 1)


pkg/cmd/roachtest/tests/network.go line 349 at r3 (raw file):

# Drop any client traffic to CRDB.
sudo iptables -A INPUT -p tcp --dport 26257 -j DROP;

This should be --sport 26257, not --dport. The traffic is coming from the CRDB process on the remote node (so its source port is 26257), to a random client port (dport) on this node.

As is, the client will receive packets from the server, but can't send packets to it. That still works, but we may as well sever the connection completely.


pkg/cmd/roachtest/tests/network.go line 352 at r3 (raw file):

sudo iptables -A OUTPUT -p tcp --dport 26257 -j DROP;

sudo iptables-save

nit: does this have any purpose? It just saves the ruleset to disk in case the host restarts; the rules are already applied.


pkg/cmd/roachtest/tests/network.go line 363 at r3 (raw file):

		const restoreNet = `
set -e;
sudo iptables -D INPUT -p tcp --dport 26257 -j DROP;

nit: can also just do iptables -F INPUT to remove all rules.


pkg/server/tcp_keepalive_manager.go line 30 at r2 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

I reduced this to a minute, let me know what your gut feel is for this setting.

I think that's still towards the higher end. For comparison, CRDB RPC connections use a keepalive timeout of 10 seconds, and that's only because gRPC doesn't allow lower values for various reasons.

// 10 seconds is the minimum keepalive interval permitted by gRPC.
// Setting it to a value lower than this will lead to gRPC adjusting to this
// value and annoyingly logging "Adjusting keepalive ping interval to minimum
// period of 10s". See grpc/grpc-go#2642.
const minimumClientKeepaliveInterval = 10 * time.Second

This basically says "how long do we want to allow a client to disconnect from the network and then reconnect without closing its session"? Since we'll be holding onto the transaction's locks in the meanwhile, I'm inclined to make this more aggressive -- it seems better for a client to retry instead if this should happen. Maybe 20-30 seconds? That still feels pretty lenient -- I don't think it's unreasonable to disconnect a client who falls off the network for that long. Then again, it's probably going to break something for someone somewhere, and 1 minute would be the conservative option -- it's still way better than 10 minutes.

I guess there's also a question of how chatty we want to be on idle client connections.

I'll leave this to SQL Foundations' discretion, since you probably know more about client environments and such -- personally I'd lean towards something like 30 seconds, but 1 minute also seems reasonable. Hard to say what the optimal value is.


pkg/server/tcp_keepalive_manager.go line 29 at r3 (raw file):

	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.count",
	"maximum number of probes that will be sent out before a connection dropped because "+

nit: "a connection is dropped" and "it's unresponsive"

Collaborator Author

@fqazi fqazi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @erikgrinaker, and @renatolabs)


pkg/cmd/roachtest/tests/network.go line 352 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: does this have any purpose? This just saves the ruleset to disk in case the host restarts, the rules are already applied.

Done.

Good point, cleaned this up.


pkg/server/tcp_keepalive_manager.go line 30 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I think that's still towards the higher end. For comparison, CRDB RPC connections use a keepalive timeout of 10 seconds, and that's only because gRPC doesn't allow lower values for various reasons.

// 10 seconds is the minimum keepalive interval permitted by gRPC.
// Setting it to a value lower than this will lead to gRPC adjusting to this
// value and annoyingly logging "Adjusting keepalive ping interval to minimum
// period of 10s". See grpc/grpc-go#2642.
const minimumClientKeepaliveInterval = 10 * time.Second

This basically says "how long do we want to allow a client to disconnect from the network and then reconnect without closing its session"? Since we'll be holding onto the transaction's locks in the meanwhile, I'm inclined to make this more aggressive -- it seems better for a client to retry instead if this should happen. Maybe 20-30 seconds? That still feels pretty lenient -- I don't think it's unreasonable to disconnect a client who falls off the network for that long. Then again, it's probably going to break something for someone somewhere, and 1 minute would be the conservative option -- it's still way better than 10 minutes.

I guess there's also a question of how chatty we want to be on idle client connections.

I'll leave this to SQL Foundations' discretion, since you probably know more about client environments and such -- personally I'd lean towards something like 30 seconds, but 1 minute also seems reasonable. Hard to say what the optimal value is.

Done.

Yeah, I dropped this down to ~30 seconds with a probe every 10 seconds. Also, I was probably being too cautious before, since I wouldn't expect that kind of heavy packet loss to a database server.
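
(For context: a 10-second interval with a ~30-second overall budget works out to three missed probes before the kernel drops the connection.)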

@fqazi
Collaborator Author

fqazi commented Jan 16, 2024

@erikgrinaker TFTR!

bors r+

@craig
Contributor

craig bot commented Jan 16, 2024

Build failed (retrying...):

@renatolabs
Collaborator

This PR has drifted and the roachtest changes here no longer compile on master because of #117320.

TLDR: The list of nodes to run a command on should be passed to RunE using option.WithNodes(nodes).

bors r-

@craig
Contributor

craig bot commented Jan 16, 2024

Canceled.

Previously, we relied solely on setting the keep-alive probe interval
and keep-alive idle time to control how long a connection stayed
active. Unfortunately, this did not have the desired effect, since the
number of probes was never set; as a result, unresponsive connections
could be kept around for up to 10 minutes. To allow better
configuration of these settings, we need an API to set the keep-alive
probe count on a socket. This patch adds a sysutil for that.

Release note: None
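
The first commit's sysutil might look roughly like the following on Linux; this is a sketch assuming golang.org/x/sys/unix, and the helper name is illustrative rather than the actual API added by the patch:

```go
package sysutil // illustrative package; not the actual patch

import (
	"syscall"

	"golang.org/x/sys/unix"
)

// SetKeepAliveCount sets TCP_KEEPCNT on the underlying socket: the number
// of unacknowledged keep-alive probes after which the kernel drops the
// connection.
func SetKeepAliveCount(conn syscall.RawConn, count int) error {
	var sockErr error
	if err := conn.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPCNT, count)
	}); err != nil {
		return err
	}
	return sockErr
}
```

The syscall.RawConn would typically come from (*net.TCPConn).SyscallConn() during connection setup.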
Previously, we only had an environment variable for controlling the
keep-alive probe interval and keep-alive idle time for pgwire
connections. Unfortunately, because the probe count was not
configurable either, connections defaulted to staying alive for around
10 minutes on Linux. To address this, this patch adds the settings
server.sql_tcp_keep_alive.count and server.sql_tcp_keep_alive.interval,
which allow much finer control: together they determine the probe
interval and the overall timeout after which an unresponsive
connection is dropped.

Fixes: cockroachdb#115422
Release note (sql change): Add configurable settings for the total
number of TCP keep-alive probes (server.sql_tcp_keep_alive.count) and
the TCP probe interval (server.sql_tcp_keep_alive.interval) for SQL
connections. These settings subsume the COCKROACH_SQL_TCP_KEEP_ALIVE
environment variable, which has been removed.
@fqazi
Collaborator Author

fqazi commented Jan 16, 2024

bors r+

@craig
Contributor

craig bot commented Jan 17, 2024

Build succeeded:

@craig craig bot merged commit b743201 into cockroachdb:master Jan 17, 2024
8 of 9 checks passed