
Fix stuck query after cancel or termination when segment is not responding #948

Merged — 10 commits merged into adb-6.x-dev from ADBDEV-5517 on May 27, 2024

Conversation


@whitehawk whitehawk commented May 5, 2024

Fix stuck query after cancel or termination when segment is not responding

Problem:
The following scenario left a coordinator backend process stuck:

  1. At some moment, one of the segments stopped processing incoming requests (the
    exact reason is not important: bad things can happen to any segment, and the
    system should be able to recover).
  2. A query that dispatches requests to segments was executed (for example, a
    select from 'gp_toolkit.gp_resgroup_status_per_segment'). As one of the segments
    was not responding, the coordinator hung in 'checkDispatchResult' (this is
    expected and can be handled by the FTS).
  3. The stuck query was canceled or terminated (via 'pg_cancel_backend' or
    'pg_terminate_backend'). This did not get it out of the stuck state; worse,
    after this step the query became completely unrecoverable. Even after the FTS
    had detected the malfunctioning segment and promoted its mirror, the query
    kept hanging forever.

The expected behavior is that, after the FTS mirror promotion, all stuck queries
are unblocked and canceled successfully.

Note: if the FTS mirror promotion happened before step 3, the FTS canceled the
query successfully.

Root cause:
During cancel or termination, the coordinator tried to abort the current
transaction and hung in the function 'internal_cancel' (called from PQcancel)
on the 'poll' system call. That 'poll' has a timeout of 660 seconds, and,
moreover, after the timeout expired, the code looped forever, calling 'poll'
again. So, if the socket was opened on the segment side but nobody replied
(because the segment process had become unavailable for some reason),
'internal_cancel' had no way to return.
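
For illustration, a minimal sketch of the shape of that loop (not the actual
libpq source; 'wait_for_cancel_reply' and the timeout constant are made-up
names standing in for the code inside 'internal_cancel'):

```c
#include <errno.h>
#include <poll.h>

#define CANCEL_POLL_TIMEOUT_MS (660 * 1000)	/* stand-in for the 660-second timeout */

/* Sketch only: poll() is retried indefinitely, so if the segment never
 * answers on the cancel socket, the coordinator backend never gets out. */
static int
wait_for_cancel_reply(int sock)
{
	struct pollfd pfd;

	pfd.fd = sock;
	pfd.events = POLLIN;

	for (;;)
	{
		int		rc = poll(&pfd, 1, CANCEL_POLL_TIMEOUT_MS);

		if (rc > 0)
			return 0;			/* a reply (or EOF) is available, proceed to recv() */
		if (rc < 0 && errno != EINTR)
			return -1;			/* a real error */
		/* timeout or EINTR: loop and call poll() again -- the stuck state */
	}
}
```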

Fix:
Once the FTS promotes a mirror, it sends a signal to the coordinator
postmaster (with the PMSIGNAL_FTS_PROMOTED_MIRROR reason). On receiving the
signal, the coordinator postmaster sends SIGUSR1 (with the
PROCSIG_FTS_PROMOTED_MIRROR reason) to all of its regular backends. When a
backend receives the signal while it is canceling or terminating a query, it
sets a flag in libpq. 'internal_cancel' checks this flag before calling the
'poll' system call; if the flag is set, it returns with an error.
Thus:
a. if the FTS promotion happens before the cancel/terminate, the query is
canceled by the old logic;
b. if the FTS promotion happens after the cancel/terminate but before
'internal_cancel' calls 'poll', 'internal_cancel' returns an error without
calling 'poll';
c. if the FTS promotion happens when 'poll' has already been called, 'poll'
returns EINTR (as SIGUSR1 was received), and a new 'poll' is not called,
because the flag is set.
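
A minimal sketch of the idea, extending the loop shown under "Root cause"
(names like 'fts_promotion_received' are illustrative, not the exact symbols
from the patch):

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>

/* set from the backend's SIGUSR1/PROCSIG_FTS_PROMOTED_MIRROR handling
 * while a cancel/terminate is in progress (illustrative name) */
static volatile sig_atomic_t fts_promotion_received = 0;

static int
wait_for_cancel_reply(int sock)
{
	struct pollfd pfd;

	pfd.fd = sock;
	pfd.events = POLLIN;

	for (;;)
	{
		/* cases (b) and (c): bail out instead of (re)entering poll()
		 * once the FTS has promoted a mirror */
		if (fts_promotion_received)
			return -1;

		int		rc = poll(&pfd, 1, 660 * 1000);

		if (rc > 0)
			return 0;			/* a reply (or EOF) is available */
		if (rc < 0 && errno != EINTR)
			return -1;			/* a real error */
		/* timeout, or EINTR from SIGUSR1: loop back to the flag check */
	}
}
```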

@BenderArenadata

Allure report https://allure.adsw.io/launch/70286

@whitehawk whitehawk changed the title (DRAFT for CI) Fix (initial version) ADBDEV-5517 Fix stuck query after cancel or termination when segment is not responding May 6, 2024
@whitehawk
Author

Some additional motivation for the patch. Once we are in 'internal_cancel', there are several ways to get out of the poll:

  1. timeout
  2. long jump from a signal handler
  3. interruption by a signal

Option 1: there is already a timeout (660 sec), which was introduced a long time ago to fix another issue.
After this timeout expires, the code loops back into poll in internal_cancel. Changing this timeout may break that old fix, and removing the internal loop would not help much either: the backend would unblock only after 11 minutes, which is quite long.

Option 2: jumping back to the sigsetjmp in PostgresMain could work for pg_cancel_backend, but it breaks pg_terminate_backend.

So, option 3 is the way to go. However, we should not leave internal_cancel on just any signal, as that would interrupt internal_cancel in cases where it is not actually needed. Hence the patch modifies the libpq code and introduces an additional check in order to filter only the 'target' signal that is intended to break the infinite poll.
It is correct to make the FTS the source of that signal, as the FTS is the root of trust when judging whether a segment is alive. So, once the FTS detects that one of the segments is stuck, it should signal the postmaster that a mirror has been promoted (direct communication from the FTS to backends doesn't seem a proper thing to do), and the postmaster in turn should propagate the signal to the backends.
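
To make the two-hop propagation concrete, here is a small self-contained toy
(an analogy only, not GPDB code): the "postmaster" process receives a signal
and fans SIGUSR1 out to every "backend" child, whose handler sets a flag that
breaks it out of its wait.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_BACKENDS 3

static volatile sig_atomic_t promoted = 0;		/* per-"backend" flag */
static pid_t backends[NUM_BACKENDS];

static void on_promotion(int signo) { (void) signo; promoted = 1; }

static void
fan_out(int signo)								/* "postmaster" handler */
{
	(void) signo;
	for (int i = 0; i < NUM_BACKENDS; i++)
		kill(backends[i], SIGUSR1);				/* forward to every "backend" */
}

int
main(void)
{
	sigset_t	block, wait_mask;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);

	for (int i = 0; i < NUM_BACKENDS; i++)
	{
		pid_t	pid = fork();

		if (pid == 0)
		{
			/* "backend": block SIGUSR1, install handler, wait for the flag */
			sigprocmask(SIG_BLOCK, &block, &wait_mask);
			signal(SIGUSR1, on_promotion);
			while (!promoted)
				sigsuspend(&wait_mask);			/* race-free wait for the signal */
			printf("backend %d unblocked\n", (int) getpid());
			_exit(0);
		}
		backends[i] = pid;
	}

	signal(SIGUSR1, fan_out);					/* "postmaster" side */
	sleep(1);									/* let the children get ready */
	raise(SIGUSR1);								/* stand-in for the FTS notification */

	for (int i = 0; i < NUM_BACKENDS; i++)
		waitpid(backends[i], NULL, 0);
	return 0;
}
```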

@BenderArenadata

Allure report https://allure.adsw.io/launch/70294

@BenderArenadata

Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1386247

@BenderArenadata

Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1390048

@BenderArenadata

Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1391002

@BenderArenadata

Allure report https://allure.adsw.io/launch/70746

@whitehawk whitehawk marked this pull request as ready for review May 15, 2024 10:42
@RekGRpth

This comment was marked as resolved.

@whitehawk
Author

> How reliable is it to transmit information through signals like this, especially twice?

I consider the signals reliable, unless somebody masks/unmasks them frequently for long periods of time, and I didn't find evidence of such a case in the code.
If IPC via signals were unreliable, it would mean we have problems an order of magnitude more severe, because signals are used widely in the FTS, the recovery mechanism, etc.

And regarding the two-step signalling: a similar approach is already used by the FTS (but in the other direction): a backend can call FtsNotifyProber, which sends PMSIGNAL_WAKEN_FTS to the postmaster, which in turn sends SIGINT to the FTS probe process.

@BenderArenadata

Allure report https://allure.adsw.io/launch/71045

@BenderArenadata

Allure report https://allure.adsw.io/launch/71051

@RekGRpth

This comment was marked as resolved.

@whitehawk
Author

> Can an FTS promotion (or a signal about it) occur immediately after calling recv in the internal_cancel function?

I don't think such a situation is possible.

  1. If internal_cancel has reached the point of recv, it means that poll returned success with 'there is data to read' (as it was called with the POLLIN flag), which in turn means that the segment closed the connection.
    So recv should not block, as it is waiting for only 1 byte and there is at least one byte in the socket's buffer (even if the segment process becomes unavailable at the moment of recv); see the sketch after this list.

  2. If the segment process hangs after its postmaster has closed the connection, but before internal_cancel completes the recv, the signal from the FTS will not come immediately: the FTS needs some time to complete the probe process (making several attempts to talk to the segment). Thus, the FTS promotion signal should not interfere with the recv call.
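
A small sketch of point 1 (illustrative, not the libpq source): once poll() has reported POLLIN on the cancel socket, a single-byte recv() cannot block, because either a byte is already buffered in the kernel or the peer has closed the connection and recv() returns 0 immediately.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Called only after poll() reported POLLIN on `sock` (sketch only). */
static ssize_t
read_cancel_reply_byte(int sock)
{
	char	c;

	/* returns 1 if a byte was buffered, 0 if the peer closed the connection;
	 * in neither case does it block */
	return recv(sock, &c, 1, 0);
}
```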

RekGRpth previously approved these changes May 20, 2024
@KnightMurloc

Can you explain why connect to the failed segment doesn't fail? I also think it would be nice to add a comment for the new code in internal_cancel.

@whitehawk
Author

> Can you explain why connect to the failed segment doesn't fail?

My understanding is the following: before being stopped, the segment's postmaster had successfully created a socket and done bind and listen. That is enough for the TCP stack of the OS to process an incoming connect from the client. The fact that the postmaster process is stopped at this moment doesn't prevent the OS from establishing the connection and receiving data into a socket buffer in the kernel.
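
A self-contained illustration of this point (an experiment sketch, not GPDB code): the "server" socket below only does socket/bind/listen and never accepts or reads anything, yet connect() and a one-byte send() from the "client" still succeed, because the kernel's TCP stack completes the handshake and buffers the data on the listener's behalf.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	int			lsock = socket(AF_INET, SOCK_STREAM, 0);	/* "segment postmaster" */
	int			csock = socket(AF_INET, SOCK_STREAM, 0);	/* "coordinator" */
	struct sockaddr_in addr;
	socklen_t	alen = sizeof(addr);

	/* listener: socket/bind/listen and then nothing at all (no accept, no read) */
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;							/* let the kernel pick a free port */
	bind(lsock, (struct sockaddr *) &addr, sizeof(addr));
	listen(lsock, 5);
	getsockname(lsock, (struct sockaddr *) &addr, &alen);

	/* client: connect and send a byte anyway -- both succeed, the data just
	 * sits in a kernel socket buffer waiting for an accept() that never comes */
	int			rc = connect(csock, (struct sockaddr *) &addr, sizeof(addr));
	ssize_t		sent = send(csock, "X", 1, 0);

	printf("connect=%d sent=%zd\n", rc, sent);	/* expected: connect=0 sent=1 */

	close(csock);
	close(lsock);
	return 0;
}
```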

> I also think it would be nice to add a comment for the new code in internal_cancel.

Ok.

@BenderArenadata

Allure report https://allure.adsw.io/launch/71212

@KnightMurloc

Everything seems to be fine, but maybe it's worth moving the tests to isolation2 instead of the resource group tests? The problem itself does not relate to resource groups in any way, and the fts tests are located in isolation2.

@whitehawk
Author

> Everything seems to be fine, but maybe it's worth moving the tests to isolation2 instead of the resource group tests? The problem itself does not relate to resource groups in any way, and the fts tests are located in isolation2.

I think it is better to leave them at the current location, because:

  1. The test is pretty long, and moving it to the isolation2 schedule would increase the time of the CI testing round (I guess the regression testing is the longest job). When this test is in the resgroup schedule, it is executed in parallel with the regression job, so the overall CI testing time does not increase.
  2. The current testing scenario (and the original customer's issue) relies on gp_toolkit.gp_resgroup_status_per_segment, which works only with resource groups enabled. Moving it to isolation2 would mean reworking the test scenario.

Reason (2) itself is not a blocker, of course, but I consider reason (1) important enough to leave the test in the resgroup schedule.

@KnightMurloc

I easily reproduced the problem with just a select, without resource groups. Regarding the time, it really doesn't sound very good; maybe it is indeed worth leaving it as it is.

@KnightMurloc

What is the difference between pg_cancel_backend and pg_terminate_backend that makes us want to test both?

@RekGRpth
Member

> What is the difference between pg_cancel_backend and pg_terminate_backend that makes us want to test both?

if (QueryCancelCleanup || TermSignalReceived)

@BenderArenadata

Allure report https://allure.adsw.io/launch/71435

@BenderArenadata

Failed job Regression tests with Postgres on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1430645

@RekGRpth
Member

> Failed job Regression tests with Postgres on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1430645

A new flaky test?

DIFF FILE: ../gpdb_src/src/test/regress/regression.diffs
----------------------------------------------------------------------
--- /home/gpadmin/gpdb_src/src/test/regress/expected/alter_db_set_tablespace.out	2024-05-24 11:15:47.525072044 +0000
+++ /home/gpadmin/gpdb_src/src/test/regress/results/alter_db_set_tablespace.out	2024-05-24 11:15:47.673062295 +0000
@@ -1261,14 +1270,8 @@
 
 -- Ensure that the mirrors including the standby master have removed the dboid dir under the target tablespace
 SELECT gp_wait_until_triggered_fault('after_drop_database_directories', 1, dbid) FROM gp_segment_configuration WHERE role='m';
- gp_wait_until_triggered_fault 
--------------------------------
- Success:
- Success:
- Success:
- Success:
-(4 rows)
-
+ERROR:  failed to inject fault: ERROR:  fault not triggered, fault name:'after_drop_database_directories' fault type:'wait_until_triggered'
+DETAIL:  Timed-out as 10 minutes max wait happens until triggered. (gp_inject_fault.c:132)
 -- Then all the files of the database should remain in the dboid directory in the source tablespace directory for all database instances.
 -- Note: Sometimes the pg_internal.init is not yet formed on the recovering primary. It is not important for our test.
 CREATE TEMPORARY TABLE after_alter AS SELECT * FROM stat_db_objects('alter_db', 'adst_source_tablespace');

@BenderArenadata

Allure report https://allure.adsw.io/launch/71528

@andr-sokolov andr-sokolov merged commit 96ed1ad into adb-6.x-dev May 27, 2024
5 checks passed
@andr-sokolov andr-sokolov deleted the ADBDEV-5517 branch May 27, 2024 09:37