
Fix stuck query after cancel or termination when segment is not responding #948

Merged — 10 commits merged into adb-6.x-dev from ADBDEV-5517 on May 27, 2024

Conversation


@whitehawk whitehawk commented May 5, 2024

Fix stuck query after cancel or termination when segment is not responding

Problem:
The following scenario left a coordinator backend process stuck:

  1. At some moment, one of the segments stopped processing incoming requests (the
    exact reason is not important: bad things can happen to any segment, and the
    system should be able to recover).
  2. A query that dispatches requests to segments was executed (for example, a
    select from 'gp_toolkit.gp_resgroup_status_per_segment'). As one of the segments
    was not responding, the coordinator hung in 'checkDispatchResult' (this is
    expected and can be handled by the FTS).
  3. The stuck query was canceled or terminated (via 'pg_cancel_backend' or
    'pg_terminate_backend'). This did not get it out of the stuck state; worse,
    after this step the query became completely unrecoverable. Even after the FTS
    had detected the malfunctioning segment and promoted its mirror, the query
    kept hanging forever.

The expected behavior is that, after the FTS mirror promotion, all stuck queries
are unblocked and canceled successfully.

Note: if the FTS mirror promotion happened before step 3, the FTS canceled the
query successfully.

Root cause:
During cancel or termination, the coordinator tried to abort the current
transaction and hung in the function 'internal_cancel' (called from PQcancel)
on the 'poll' system call. That 'poll' has a timeout of 660 seconds, and,
moreover, after the timeout expired, the code looped forever, calling 'poll'
again. So, if the socket was opened on the segment side but nobody replied
(because the segment process had become unavailable for some reason),
'internal_cancel' had no way to return.
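
For illustration, a minimal sketch of the shape of that loop (not the actual
libpq source; 'wait_for_cancel_reply' and the timeout constant are made-up
names standing in for the code inside 'internal_cancel'):

```c
#include <errno.h>
#include <poll.h>

#define CANCEL_POLL_TIMEOUT_MS (660 * 1000)	/* stand-in for the 660-second timeout */

/* Sketch only: poll() is retried indefinitely, so if the segment never
 * answers on the cancel socket, the coordinator backend never gets out. */
static int
wait_for_cancel_reply(int sock)
{
	struct pollfd pfd;

	pfd.fd = sock;
	pfd.events = POLLIN;

	for (;;)
	{
		int		rc = poll(&pfd, 1, CANCEL_POLL_TIMEOUT_MS);

		if (rc > 0)
			return 0;			/* a reply (or EOF) is available, proceed to recv() */
		if (rc < 0 && errno != EINTR)
			return -1;			/* a real error */
		/* timeout or EINTR: loop and call poll() again -- the stuck state */
	}
}
```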

Fix:
Once the FTS promotes a mirror, it sends a signal to the coordinator
postmaster (with the PMSIGNAL_FTS_PROMOTED_MIRROR reason). On receiving the
signal, the coordinator postmaster sends SIGUSR1 (with the
PROCSIG_FTS_PROMOTED_MIRROR reason) to all of its regular backends. When a
backend receives the signal while it is canceling or terminating a query, it
sets a flag in libpq. 'internal_cancel' checks this flag before calling the
'poll' system call; if the flag is set, it returns with an error.
Thus:
a. if the FTS promotion happens before the cancel/terminate, the query is
canceled by the old logic;
b. if the FTS promotion happens after the cancel/terminate but before
'internal_cancel' calls 'poll', 'internal_cancel' returns an error without
calling 'poll';
c. if the FTS promotion happens when 'poll' has already been called, 'poll'
returns EINTR (as SIGUSR1 was received), and a new 'poll' is not called,
because the flag is set.
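
A minimal sketch of the idea, extending the loop shown under "Root cause"
(names like 'fts_promotion_received' are illustrative, not the exact symbols
from the patch):

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>

/* set from the backend's SIGUSR1/PROCSIG_FTS_PROMOTED_MIRROR handling
 * while a cancel/terminate is in progress (illustrative name) */
static volatile sig_atomic_t fts_promotion_received = 0;

static int
wait_for_cancel_reply(int sock)
{
	struct pollfd pfd;

	pfd.fd = sock;
	pfd.events = POLLIN;

	for (;;)
	{
		/* cases (b) and (c): bail out instead of (re)entering poll()
		 * once the FTS has promoted a mirror */
		if (fts_promotion_received)
			return -1;

		int		rc = poll(&pfd, 1, 660 * 1000);

		if (rc > 0)
			return 0;			/* a reply (or EOF) is available */
		if (rc < 0 && errno != EINTR)
			return -1;			/* a real error */
		/* timeout, or EINTR from SIGUSR1: loop back to the flag check */
	}
}
```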

@BenderArenadata

Allure report https://allure.adsw.io/launch/70286

@whitehawk whitehawk changed the title (DRAFT for CI) Fix (initial version) ADBDEV-5517 Fix stuck query after cancel or termination when segment is not responding May 6, 2024
@whitehawk
Author

Some additional motivation for the patch. Once we are in 'internal_cancel', there are several ways to get out of the poll:

  1. timeout
  2. long jump from a signal handler
  3. interruption by a signal

Option 1: there is already a timeout (660 sec), which was introduced a long time ago to fix another issue.
After this timeout expires, the code loops back into poll in internal_cancel. Changing this timeout may break that old fix, and removing the internal loop would not help much either: the backend would unblock only after 11 minutes, which is quite long.

Option 2: jumping back to the sigsetjmp in PostgresMain could work for pg_cancel_backend, but it breaks pg_terminate_backend.

So, option 3 is the way to go. However, we should not leave internal_cancel on just any signal, as that would interrupt internal_cancel in cases where it is not actually needed. Hence the patch modifies the libpq code and introduces an additional check in order to filter only the 'target' signal that is intended to break the infinite poll.
It is correct to make the FTS the source of that signal, as the FTS is the root of trust when judging whether a segment is alive. So, once the FTS detects that one of the segments is stuck, it should signal the postmaster that a mirror has been promoted (direct communication from the FTS to backends doesn't seem a proper thing to do), and the postmaster in turn should propagate the signal to the backends.
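
To make the two-hop propagation concrete, here is a small self-contained toy
(an analogy only, not GPDB code): the "postmaster" process receives a signal
and fans SIGUSR1 out to every "backend" child, whose handler sets a flag that
breaks it out of its wait.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_BACKENDS 3

static volatile sig_atomic_t promoted = 0;		/* per-"backend" flag */
static pid_t backends[NUM_BACKENDS];

static void on_promotion(int signo) { (void) signo; promoted = 1; }

static void
fan_out(int signo)								/* "postmaster" handler */
{
	(void) signo;
	for (int i = 0; i < NUM_BACKENDS; i++)
		kill(backends[i], SIGUSR1);				/* forward to every "backend" */
}

int
main(void)
{
	sigset_t	block, wait_mask;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);

	for (int i = 0; i < NUM_BACKENDS; i++)
	{
		pid_t	pid = fork();

		if (pid == 0)
		{
			/* "backend": block SIGUSR1, install handler, wait for the flag */
			sigprocmask(SIG_BLOCK, &block, &wait_mask);
			signal(SIGUSR1, on_promotion);
			while (!promoted)
				sigsuspend(&wait_mask);			/* race-free wait for the signal */
			printf("backend %d unblocked\n", (int) getpid());
			_exit(0);
		}
		backends[i] = pid;
	}

	signal(SIGUSR1, fan_out);					/* "postmaster" side */
	sleep(1);									/* let the children get ready */
	raise(SIGUSR1);								/* stand-in for the FTS notification */

	for (int i = 0; i < NUM_BACKENDS; i++)
		waitpid(backends[i], NULL, 0);
	return 0;
}
```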

@BenderArenadata

Allure report https://allure.adsw.io/launch/70294

@BenderArenadata

Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1386247

@BenderArenadata

Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1390048

@BenderArenadata

Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1391002

@BenderArenadata

Allure report https://allure.adsw.io/launch/70746

@whitehawk whitehawk marked this pull request as ready for review May 15, 2024 10:42
@RekGRpth

This comment was marked as resolved.

@whitehawk
Author

> How reliable is it to transmit information through signals like this, especially twice?

I consider the signals reliable, unless somebody masks/unmasks them frequently for long periods of time, and I didn't find evidence of such a case in the code.
If IPC via signals were unreliable, it would mean we have problems an order of magnitude more severe, because signals are used widely in the FTS, the recovery mechanism, etc.

And regarding the two-step signalling: a similar approach is already used by the FTS (but in the other direction): a backend can call FtsNotifyProber, which sends PMSIGNAL_WAKEN_FTS to the postmaster, which in turn sends SIGINT to the FTS probe process.

@BenderArenadata

Allure report https://allure.adsw.io/launch/71045

@BenderArenadata

Allure report https://allure.adsw.io/launch/71051

@RekGRpth

This comment was marked as resolved.

@whitehawk
Author

> Can an FTS promotion (or a signal about it) occur immediately after calling recv in the internal_cancel function?

I don't think such a situation is possible.

  1. If internal_cancel has reached the point of recv, it means that poll returned success with 'there is data to read' (as it was called with the POLLIN flag), which in turn means that the segment closed the connection.
    So recv should not block, as it is waiting for only 1 byte and there is at least one byte in the socket's buffer (even if the segment process becomes unavailable at the moment of recv); see the sketch after this list.

  2. If the segment process hangs after its postmaster has closed the connection, but before internal_cancel completes the recv, the signal from the FTS will not come immediately: the FTS needs some time to complete the probe process (making several attempts to talk to the segment). Thus, the FTS promotion signal should not interfere with the recv call.
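
A small sketch of point 1 (illustrative, not the libpq source): once poll() has reported POLLIN on the cancel socket, a single-byte recv() cannot block, because either a byte is already buffered in the kernel or the peer has closed the connection and recv() returns 0 immediately.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Called only after poll() reported POLLIN on `sock` (sketch only). */
static ssize_t
read_cancel_reply_byte(int sock)
{
	char	c;

	/* returns 1 if a byte was buffered, 0 if the peer closed the connection;
	 * in neither case does it block */
	return recv(sock, &c, 1, 0);
}
```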

RekGRpth previously approved these changes May 20, 2024
@KnightMurloc

Can you explain why connect to the failed segment doesn't fail? I also think it would be nice to add a comment for the new code in internal_cancel.

@whitehawk
Author

> Can you explain why connect to the failed segment doesn't fail?

My understanding is the following: before being stopped, the segment's postmaster had successfully created a socket and done bind and listen. That is enough for the TCP stack of the OS to process an incoming connect from the client. The fact that the postmaster process is stopped at this moment doesn't prevent the OS from establishing the connection and receiving data into a socket buffer in the kernel.
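
A self-contained illustration of this point (an experiment sketch, not GPDB code): the "server" socket below only does socket/bind/listen and never accepts or reads anything, yet connect() and a one-byte send() from the "client" still succeed, because the kernel's TCP stack completes the handshake and buffers the data on the listener's behalf.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	int			lsock = socket(AF_INET, SOCK_STREAM, 0);	/* "segment postmaster" */
	int			csock = socket(AF_INET, SOCK_STREAM, 0);	/* "coordinator" */
	struct sockaddr_in addr;
	socklen_t	alen = sizeof(addr);

	/* listener: socket/bind/listen and then nothing at all (no accept, no read) */
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;							/* let the kernel pick a free port */
	bind(lsock, (struct sockaddr *) &addr, sizeof(addr));
	listen(lsock, 5);
	getsockname(lsock, (struct sockaddr *) &addr, &alen);

	/* client: connect and send a byte anyway -- both succeed, the data just
	 * sits in a kernel socket buffer waiting for an accept() that never comes */
	int			rc = connect(csock, (struct sockaddr *) &addr, sizeof(addr));
	ssize_t		sent = send(csock, "X", 1, 0);

	printf("connect=%d sent=%zd\n", rc, sent);	/* expected: connect=0 sent=1 */

	close(csock);
	close(lsock);
	return 0;
}
```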

> I also think it would be nice to add a comment for the new code in internal_cancel.

Ok.

@BenderArenadata

Allure report https://allure.adsw.io/launch/71212

@KnightMurloc

Everything seems to be fine, but maybe it's worth moving the tests to isolation2 instead of the resource group tests? The problem itself does not relate to resource groups in any way, and the fts tests are located in isolation2.

@whitehawk
Author

> Everything seems to be fine, but maybe it's worth moving the tests to isolation2 instead of the resource group tests? The problem itself does not relate to resource groups in any way, and the fts tests are located in isolation2.

I think it is better to leave them at the current location, because:

  1. The test is pretty long, and moving it to the isolation2 schedule would increase the time of the CI testing round (I guess the regression testing is the longest job). When this test is in the resgroup schedule, it is executed in parallel with the regression job, so the overall CI testing time does not increase.
  2. The current testing scenario (and the original customer's issue) relies on gp_toolkit.gp_resgroup_status_per_segment, which works only with resource groups enabled. Moving it to isolation2 would mean reworking the test scenario.

Reason (2) itself is not a blocker, of course, but I consider reason (1) important enough to leave the test in the resgroup schedule.

@KnightMurloc

I easily reproduced the problem with just a select, without resource groups. Regarding the time, it really doesn't sound very good; maybe it is indeed worth leaving it as it is.

@KnightMurloc

What is the difference between pg_cancel_backend and pg_terminate_backend that makes us want to test both?

@RekGRpth
Member

> What is the difference between pg_cancel_backend and pg_terminate_backend that makes us want to test both?

if (QueryCancelCleanup || TermSignalReceived)

@BenderArenadata

Allure report https://allure.adsw.io/launch/71435

@BenderArenadata

Failed job Regression tests with Postgres on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1430645

@RekGRpth
Member

> Failed job Regression tests with Postgres on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1430645

A new flaky test?

DIFF FILE: ../gpdb_src/src/test/regress/regression.diffs
----------------------------------------------------------------------
--- /home/gpadmin/gpdb_src/src/test/regress/expected/alter_db_set_tablespace.out	2024-05-24 11:15:47.525072044 +0000
+++ /home/gpadmin/gpdb_src/src/test/regress/results/alter_db_set_tablespace.out	2024-05-24 11:15:47.673062295 +0000
@@ -1261,14 +1270,8 @@
 
 -- Ensure that the mirrors including the standby master have removed the dboid dir under the target tablespace
 SELECT gp_wait_until_triggered_fault('after_drop_database_directories', 1, dbid) FROM gp_segment_configuration WHERE role='m';
- gp_wait_until_triggered_fault 
--------------------------------
- Success:
- Success:
- Success:
- Success:
-(4 rows)
-
+ERROR:  failed to inject fault: ERROR:  fault not triggered, fault name:'after_drop_database_directories' fault type:'wait_until_triggered'
+DETAIL:  Timed-out as 10 minutes max wait happens until triggered. (gp_inject_fault.c:132)
 -- Then all the files of the database should remain in the dboid directory in the source tablespace directory for all database instances.
 -- Note: Sometimes the pg_internal.init is not yet formed on the recovering primary. It is not important for our test.
 CREATE TEMPORARY TABLE after_alter AS SELECT * FROM stat_db_objects('alter_db', 'adst_source_tablespace');

@BenderArenadata

Allure report https://allure.adsw.io/launch/71528

@andr-sokolov andr-sokolov merged commit 96ed1ad into adb-6.x-dev May 27, 2024
5 checks passed
@andr-sokolov andr-sokolov deleted the ADBDEV-5517 branch May 27, 2024 09:37