Skip to content

Detect client disconnection while running query and immediately interrupt its execution#198

Merged
Stolb27 merged 6 commits into6.16.3_arenadata22from
ADBDEV-1532
Jul 9, 2021
Merged

Detect client disconnection while running query and immediately interrupt its execution#198
Stolb27 merged 6 commits into6.16.3_arenadata22from
ADBDEV-1532

Conversation

@maksm90
Copy link

@maksm90 maksm90 commented May 21, 2021

Motivation

This PR fixes the case when the server is executing a lengthy query and the client breaks the connection. The operating system will be aware that the connection is no more, but postgres node doesn't notice this, because it doesn't try to read from or write to the socket while running query. So we'll get a zombie connection. In theory, the query could be one that runs for a million years, continues to chew up CPU and I/O and occupies a connection slot that's sad. Worse still, a sent query might be modifiable and not return any data, then it might be surprising for disconnected client that his previously sent modification will be accepted at some point later - at completion of execution. For these reasons, the query have to be interrupted as much earlier as possible.

Implementation details

The patch provides a new GUC check_client_connection_interval that can be used to periodically check via CLIENT_CHECK_CONNECTION_TIMEOUT interrupts whether the client connection has gone away, while running very long queries. It is disabled by default.
For non-locking check of socket state the patch uses a non-standard Linux extension (also adopted by at least one other OS) - POLLRDHUP option that is not defined by POSIX.

Backport from PostgreSQL commits:

Scenarios to test functionality

There are at least three test cases that have to be passed to make sure of patch correction:

  1. Killing of client process.
    In this case client's OS sends FIN message and server have to handle changed socket state on the next iteration of CLIENT_CONNECTION_CHECK_TIMEOUT interrupt triggering.
  2. Emulation cable breakdown.
    This case requires to set settings related with keepalive behavior (tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count). And server terminates "zombie" backend after connection reset ensured by keepalive procedure.
  3. Graceful closing connection by client asynchronously executing long query.
    After asynchronous sending long query client have to close connection by sending 'X' message (calling PQfinish libpq function). Hereon, connection is gracefully closed by client and backend process have to cancel query execution on the next iteration of CLIENT_CONNECTION_CHECK_TIMEOUT interrupt.

Fixes the case when the server is executing a lengthy query and the
client breaks the connection. The operating system will be aware that
the connection is no more, but postgres node doesn't notice this,
because it doesn't try to read from or write to the socket while running
query. So we'll get a zombie connection. In theory, the query could be
one that runs for a million years, continues to chew up CPU and I/O and
occupies a connection slot that's sad. Worse still, a sent query might
be modifiable and not return any data, then it might be surprising for
disconnected client that his previously sent modification will be
accepted at some point later - at completion of execution. For these
reasons, the query have to be interrupted as much earlier as possible.

The patch provides a new GUC check_client_connection_interval that can
be used to periodically check via CLIENT_CHECK_CONNECTION_TIMEOUT
interrupts whether the client connection has gone away, while running
very long queries. It is disabled by default.
For non-locking check of socket state the patch uses a non-standard
Linux extension (also adopted by at least one other OS) - POLLRDHUP
option that is not defined by POSIX.

Backport from PostgreSQL commits:
 - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c30f54ad732ca5c8762bb68bbe0f51de9137dd72
 - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=22f6f2c1ccb56e0d6a159d4562418587e4b10e01
leskin-in
leskin-in previously approved these changes May 21, 2021
Copy link

@darthunix darthunix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we should also import postgres/postgres@f8e5f15 and postgres/postgres@be42015 commits as well to make the code base fully equivalent to the upstream postgres? It doesn't seem to be a big back-port from my point of view (but it is up to you, @maksm90 )


/* Start timeout for checking if the client has gone away if necessary. */
if (client_connection_check_interval > 0 &&
Gp_role != GP_ROLE_EXECUTE &&

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to avoid executor to check its TCP connection status with CLIENT_CONNECTION_CHECK_TIMEOUT (may we you have confused it with STATEMENT_TIMEOUT)? At first glance it doesn't seems to be bad if executor stops its slice processing on connection teardown with coordinator.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense. I'll try to exhaustively test the cases of possible executor interrupts to check the correctness of our logic for executor role.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested different scenarios of disconnection executor with coordinator and our logic comes up big. Pushed fix

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored this guard condition, as handle of QD<->QEs failure will be implemented in future - not within this PR

@maksm90
Copy link
Author

maksm90 commented May 24, 2021

May be we should also import postgres/postgres@f8e5f15 and postgres/postgres@be42015 commits as well to make the code base fully equivalent to the upstream postgres? It doesn't seem to be a big back-port from my point of view (but it is up to you, @maksm90 )

All successive improvements around STATEMENT_TIMEOUT are not straight relevant to the purpose of our PR. I think we'll have to import those works if their need arises.

Comment on lines +10093 to +10097
#if !(defined(POLLRDHUP) || defined(__darwin__))
/* Linux and OSX only, for now. See pq_check_connection(). */
if (*newval != 0)
{
GUC_check_errdetail("client_connection_check_interval must be set to 0 on platforms that lack POLLRDHUP.");
GUC_check_errdetail("client_connection_check_interval must be set to 0 on platforms that lack POLLRDHUP and not OSX.";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this check is necessary for other parts of this commit to be safe on systems other than having POLLRDHUP and OSX.

Copy link

@darthunix darthunix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that current PR doesn't help when we isolate QE port with firewall. In this case query hangs as master backend doesn't detect network problem, though keepalive messages caused RST. I believe we should fix GPDB cluster dispatching code as well in the current PR.

int rc;

pollfd.fd = MyProcPort->sock;
#ifdef POLLRDHUP

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the nested preprocessing as such code is not easy to read. May be we should unroll it to a single level #if defined(POLLRDHUP) ... #elif defined(__darwin__) ... #end?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but don't like fixed version much, because of copypaste.
Will it be better to move polling logic to separate function with two parameters?
static bool poll_fd(short init_events, short tgt_revents)

Or add additional logic which can costs additional processing time, but still acceptable.

bool
pq_check_connection(void){
	short rdhup_ev;
#if defined(POLLRDHUP)
	rdhup_ev = POLLRDHUP;
#elif defined(__darwin__)
	rdhup_ev = 0;
#else
	return true;
#endif
	/*...*/
	pollfd.events = POLLOUT | POLLIN | rdhup_ev;
	/*...*/
	else if (rc == 1 && (pollfd.revents & (POLLHUP | rdhup_ev)))
	/*...*/
}

Or any variation of this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten, thanks

@maksm90
Copy link
Author

maksm90 commented May 31, 2021

I can see that current PR doesn't help when we isolate QE port with firewall. In this case query hangs as master backend doesn't detect network problem, though keepalive messages caused RST. I believe we should fix GPDB cluster dispatching code as well in the current PR.

I have decided to implement coordinator disconnection check logic for QEs in separate PR

@maksm90 maksm90 requested a review from InnerLife0 June 1, 2021 06:22
leskin-in
leskin-in previously approved these changes Jun 1, 2021
Copy link

@leskin-in leskin-in left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall.

Copy link
Collaborator

@Stolb27 Stolb27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested changes on provided cases. On the first look, all work as expected.

@Stolb27 Stolb27 changed the base branch from adb-6.x to 6.16.3_arenadata22 July 9, 2021 08:27
@Stolb27 Stolb27 merged commit 9e0f877 into 6.16.3_arenadata22 Jul 9, 2021
@Stolb27 Stolb27 deleted the ADBDEV-1532 branch July 9, 2021 08:28
Stolb27 added a commit that referenced this pull request Apr 27, 2022
…ly interrupt its execution (#198)"

in favor upstream version of this patch from 0bb081e

This reverts commit 9e0f877.
Stolb27 pushed a commit that referenced this pull request Mar 10, 2026
This patch is a result of using futurize -w -n --stage1 {python file} on the
python code base.
Stolb27 pushed a commit that referenced this pull request Mar 10, 2026
This patch is a result of using futurize -w -n --stage2 {python file} on the
python code base.
Stolb27 pushed a commit that referenced this pull request Mar 10, 2026
1. Add install of future python package to Dockerfile
2. Add gpMgmt/test/behave/mgmt_utils/steps directory to PYTHONPATH in behave
test to fix import
3. Partly ported changes to sql_isolation_testcase.py from
4. Use string type variables from `six` in a number of places to correctly
determine string type.
5. Use a lambda function instead of str.strip in map to not depend on actual
string types in the array.
6. Cast argument of io.StringIO to unicode
7. Specify encoding when converting str to bytes in gpconfig_helper.py
8. Fix expected test output for parallel_retrieve_cursor/corner
9. Fix output in sql_isolation_testcase.py
10. Add installation of future package in ABi tests.
11. Do not use str and char types from the future package since it brings
problems with type comparisons.
12. Do not use standard library aliases since it brings problems with unit tests.
13. In unit tests compare json strings as dict because order of elements in json
is not stable.
14. Use future version 0.16 for distributions where we use Python 2, as this
version is used in these distributions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants