Detect client disconnection while running query and immediately interrupt its execution#198
Stolb27 merged 6 commits into 6.16.3_arenadata22
Conversation
Fixes the case when the server is executing a lengthy query and the client breaks the connection. The operating system becomes aware that the connection is gone, but the postgres backend doesn't notice, because it doesn't try to read from or write to the socket while running the query. So we get a zombie connection. In theory, the query could be one that runs for a million years, continues to chew up CPU and I/O, and occupies a connection slot - that's sad. Worse still, the query might modify data and return nothing, and then a disconnected client might be surprised that its previously sent modification is still applied at some later point - at completion of execution. For these reasons, the query has to be interrupted as early as possible. The patch provides a new GUC client_connection_check_interval that can be used to periodically check, via CLIENT_CONNECTION_CHECK_TIMEOUT interrupts, whether the client connection has gone away while running very long queries. It is disabled by default. For a non-blocking check of the socket state the patch uses a non-standard Linux extension (also adopted by at least one other OS) - the POLLRDHUP option, which is not defined by POSIX. Backport from PostgreSQL commits:
- https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c30f54ad732ca5c8762bb68bbe0f51de9137dd72
- https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=22f6f2c1ccb56e0d6a159d4562418587e4b10e01
darthunix
left a comment
Maybe we should also import postgres/postgres@f8e5f15 and postgres/postgres@be42015 to make the code base fully equivalent to upstream postgres? It doesn't seem to be a big backport from my point of view (but it is up to you, @maksm90)
/* Start timeout for checking if the client has gone away if necessary. */
if (client_connection_check_interval > 0 &&
	Gp_role != GP_ROLE_EXECUTE &&
What is the reason to prevent the executor from checking its TCP connection status with CLIENT_CONNECTION_CHECK_TIMEOUT (maybe you have confused it with STATEMENT_TIMEOUT)? At first glance it doesn't seem bad if the executor stops its slice processing on connection teardown with the coordinator.
It makes sense. I'll try to exhaustively test the cases of possible executor interrupts to check the correctness of our logic for the executor role.
I have tested different scenarios of disconnecting the executor from the coordinator, and our logic holds up well. Pushed a fix.
Restored this guard condition, as handling of QD<->QE failures will be implemented in the future - not within this PR
All successive improvements around
#if !(defined(POLLRDHUP) || defined(__darwin__))
	/* Linux and OSX only, for now. See pq_check_connection(). */
	if (*newval != 0)
	{
-		GUC_check_errdetail("client_connection_check_interval must be set to 0 on platforms that lack POLLRDHUP.");
+		GUC_check_errdetail("client_connection_check_interval must be set to 0 on platforms that lack POLLRDHUP and are not OSX.");
Note this check is necessary for the other parts of this commit to be safe on systems that lack POLLRDHUP and are not OSX.
darthunix
left a comment
I can see that the current PR doesn't help when we isolate a QE port with a firewall. In this case the query hangs, as the master backend doesn't detect the network problem, even though keepalive messages caused an RST. I believe we should fix the GPDB cluster dispatching code as well in the current PR.
src/backend/libpq/pqcomm.c
Outdated
int			rc;

pollfd.fd = MyProcPort->sock;
#ifdef POLLRDHUP
I don't like the nested preprocessor conditionals, as such code is not easy to read. Maybe we should unroll it to a single level: #if defined(POLLRDHUP) ... #elif defined(__darwin__) ... #endif?
I agree, but I don't like the fixed version much because of the copy-paste.
Would it be better to move the polling logic to a separate function with two parameters?
static bool poll_fd(short init_events, short tgt_revents)
Or add additional logic, which costs some extra processing time but is still acceptable:
bool
pq_check_connection(void)
{
	short		rdhup_ev;

#if defined(POLLRDHUP)
	rdhup_ev = POLLRDHUP;
#elif defined(__darwin__)
	rdhup_ev = 0;
#else
	return true;
#endif
	/* ... */
	pollfd.events = POLLOUT | POLLIN | rdhup_ev;
	/* ... */
	else if (rc == 1 && (pollfd.revents & (POLLHUP | rdhup_ev)))
	/* ... */
}
Or any variation of this.
I have decided to implement the coordinator disconnection check logic for QEs in a separate PR
Stolb27
left a comment
I've tested the changes with the provided cases. At first glance, everything works as expected.
Scenarios to test functionality
There are at least three test cases that have to pass to make sure the patch is correct:
1. The client process dies abruptly. The client's OS sends a FIN message, and the server has to handle the changed socket state on the next iteration of the CLIENT_CONNECTION_CHECK_TIMEOUT interrupt.
2. The connection is lost at the network level. This case requires setting the keepalive-related settings (tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count). The server then terminates the "zombie" backend after the connection reset produced by the keepalive mechanism.
3. The client closes the connection gracefully. After asynchronously sending a long query, the client closes the connection by sending an 'X' message (calling the PQfinish libpq function). Hereupon, the connection is gracefully closed by the client, and the backend process has to cancel query execution on the next iteration of the CLIENT_CONNECTION_CHECK_TIMEOUT interrupt.
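For scenario 2 in particular, a postgresql.conf fragment along these lines could be used. The values are illustrative only, not recommendations: client_connection_check_interval takes milliseconds (0 keeps the check disabled), and the keepalive GUCs take seconds.

```
# enable the periodic client-liveness check (milliseconds; 0 = disabled)
client_connection_check_interval = 1000

# aggressive keepalive settings so a lost connection is reset quickly
tcp_keepalives_idle = 5        # seconds of idle before probing starts
tcp_keepalives_interval = 1    # seconds between probes
tcp_keepalives_count = 3       # failed probes before the OS resets the socket
```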