Fix stuck query after cancel or termination when segment is not responding #948
Conversation
Allure report https://allure.adsw.io/launch/70286
Additional motivation for the patch. Once we are in 'internal_cancel', there are several ways to get out of the poll:
Option 1 - there is already a timeout (660 sec), which was introduced a long time ago to fix another issue.
Option 2 - jumping back to
So, option 3 is the way to go. But we should not get out of the
Allure report https://allure.adsw.io/launch/70294
Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1386247
Failed job Build for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1386246
Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1390048
Failed job Build for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1390049
Failed job Build for ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1391002
Failed job Build for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1391003
Allure report https://allure.adsw.io/launch/70746
I consider the signals reliable, unless somebody masks/unmasks them frequently for long periods of time, but I didn't find evidence of such a case in the code. And regarding signalling twice: a similar approach is already used in the FTS (but in the other direction): a backend can call
Allure report https://allure.adsw.io/launch/71045
Allure report https://allure.adsw.io/launch/71051
I don't think that such a situation is possible.
Can you explain why
My understanding is the following: the segment's postmaster, before being stopped, had successfully created a socket, did
Ok.
Allure report https://allure.adsw.io/launch/71212
Everything seems to be fine, but maybe it's worth moving the tests to isolation2 instead of keeping them with the resource group tests? The problem itself does not relate to resource groups in any way, and the fts tests are located in isolation2.
I think it is better to leave them at the current location, because:
Reason (2) itself is not a blocker of course, but I consider reason (1) important enough to leave the test in the resgroup schedule.
I easily reproduced the problem with just a select, without resource groups. As for the time, it really doesn't sound very good; maybe it's really worth leaving it that way.
what is the difference between
if (QueryCancelCleanup || TermSignalReceived)
Allure report https://allure.adsw.io/launch/71435
Failed job Regression tests with Postgres on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1430645
new flaky test? DIFF FILE: ../gpdb_src/src/test/regress/regression.diffs
----------------------------------------------------------------------
--- /home/gpadmin/gpdb_src/src/test/regress/expected/alter_db_set_tablespace.out 2024-05-24 11:15:47.525072044 +0000
+++ /home/gpadmin/gpdb_src/src/test/regress/results/alter_db_set_tablespace.out 2024-05-24 11:15:47.673062295 +0000
@@ -1261,14 +1270,8 @@
-- Ensure that the mirrors including the standby master have removed the dboid dir under the target tablespace
SELECT gp_wait_until_triggered_fault('after_drop_database_directories', 1, dbid) FROM gp_segment_configuration WHERE role='m';
- gp_wait_until_triggered_fault
--------------------------------
- Success:
- Success:
- Success:
- Success:
-(4 rows)
-
+ERROR: failed to inject fault: ERROR: fault not triggered, fault name:'after_drop_database_directories' fault type:'wait_until_triggered'
+DETAIL: Timed-out as 10 minutes max wait happens until triggered. (gp_inject_fault.c:132)
-- Then all the files of the database should remain in the dboid directory in the source tablespace directory for all database instances.
-- Note: Sometimes the pg_internal.init is not yet formed on the recovering primary. It is not important for our test.
CREATE TEMPORARY TABLE after_alter AS SELECT * FROM stat_db_objects('alter_db', 'adst_source_tablespace');
Allure report https://allure.adsw.io/launch/71528
Fix stuck query after cancel or termination when segment is not responding
Problem:
The following scenario caused a stuck state of a coordinator backend process:
1. One of the segments stopped responding (the reason itself is not important, as
sometimes bad things may happen with any segment, and the system should be able
to recover).
2. A query was run (e.g., a select from 'gp_toolkit.gp_resgroup_status_per_segment').
As one of the segments was not responding, the coordinator hanged in
'checkDispatchResult' (it is expected, and can be handled by the FTS).
3. The query was canceled or terminated (e.g., with 'pg_terminate_backend'). It
didn't help to return from the stuck state, but after this step the query became
completely unrecoverable. Even after FTS had detected the malfunctioning segment
and had promoted the mirror, the query was still hanging forever.
Expected behavior is that after FTS mirror promotion, all stuck queries are
unblocked and canceled successfully.
Note: if FTS mirror promotion happened before step 3, the FTS canceled the
query successfully.
Root cause:
During cancel or termination, the coordinator tried to abort the current
transaction and hanged in the function 'internal_cancel' (called from PQcancel)
on the 'poll' system call. The poll has a timeout of 660 seconds and, moreover,
after the timeout expired, it looped forever trying to do 'poll' again. So, if
the socket was opened on the segment side, but nobody replied (as the segment
process became unavailable for some reason), 'internal_cancel' had no way to
return.
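For illustration only, a simplified sketch of that loop (the helper name 'wait_for_cancel_reply' is invented here; only the 660-second timeout and the endless retry come from the description above):

```c
/*
 * Simplified sketch of the failure mode, NOT the actual libpq/GPDB source:
 * the helper name is invented, only the 660-second timeout and the
 * "retry forever" behavior come from the root-cause description.
 */
#include <poll.h>
#include <errno.h>
#include <stdbool.h>

#define CANCEL_POLL_TIMEOUT_MS (660 * 1000)     /* the pre-existing timeout */

static bool
wait_for_cancel_reply(int sock)
{
    struct pollfd pfd = { .fd = sock, .events = POLLIN };

    for (;;)
    {
        int rc = poll(&pfd, 1, CANCEL_POLL_TIMEOUT_MS);

        if (rc > 0)
            return true;        /* the segment replied (or closed the socket) */
        if (rc < 0 && errno != EINTR)
            return false;       /* a real error */

        /*
         * rc == 0 (timeout) or EINTR: the code simply polls again, which is
         * why the backend could never return once the segment accepted the
         * connection but stopped answering.
         */
    }
}
```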
Fix:
Once FTS promotes a mirror, it sends a signal to the coordinator
postmaster (with the PMSIGNAL_FTS_PROMOTED_MIRROR reason). On receiving the
signal, the coordinator postmaster sends the SIGUSR1 signal (with the
PROCSIG_FTS_PROMOTED_MIRROR reason) to all of its usual backends. Once a
backend receives the signal, if it is in the process of cancelling or terminating
the query, it sets a flag in libpq. 'internal_cancel' checks this flag before
calling the 'poll' system call. If the flag is set, it returns with an error.
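As an illustration (not the literal patch), a minimal sketch of the backend side; the function names below are hypothetical, while the combined flag check follows the condition quoted earlier in this conversation:

```c
/*
 * Minimal sketch of the backend-side handling, assuming hypothetical names:
 * HandleFtsPromotedMirror() for the procsignal callback and
 * PQrequestCancelAbort() for the libpq setter; the actual patch may name
 * them differently.
 */
#include <stdbool.h>

/* Stand-ins for the backend's existing flags (declared in backend headers). */
extern volatile bool QueryCancelCleanup;
extern volatile bool TermSignalReceived;

/* Hypothetical libpq entry point that sets the "abort the cancel" flag. */
extern void PQrequestCancelAbort(void);

/* Invoked from the SIGUSR1/procsignal handler for PROCSIG_FTS_PROMOTED_MIRROR. */
static void
HandleFtsPromotedMirror(void)
{
    /*
     * Only react if this backend is already cancelling or terminating a query;
     * otherwise the pre-existing cancel logic (case "a" below) is enough.
     */
    if (QueryCancelCleanup || TermSignalReceived)
        PQrequestCancelAbort();
}
```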
Thus:
a. if the FTS promotion happens before cancel/terminate, the query will be
canceled by the old logic;
b. if the FTS promotion happens after cancel/terminate, but before the
'internal_cancel' calls the 'poll', 'internal_cancel' will return an error
without calling the 'poll';
c. if the FTS promotion happens when the 'poll' is already called, the 'poll'
will return EINTR (as the SIGUSR1 was received), and a new 'poll' will not be
called, as the flag is set.
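A matching sketch of the libpq side (again illustrative: the flag name 'cancel_abort_requested' and the setter are not the patch's real identifiers), showing how cases (b) and (c) fall out of a single check before 'poll':

```c
/*
 * Illustrative sketch continuing the "Root cause" sketch above. The flag is
 * set from a signal handler, hence volatile sig_atomic_t.
 */
#include <poll.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

static volatile sig_atomic_t cancel_abort_requested = 0;

/* Called by the backend once PROCSIG_FTS_PROMOTED_MIRROR is received. */
void
PQrequestCancelAbort(void)
{
    cancel_abort_requested = 1;
}

static bool
wait_for_cancel_reply(int sock)
{
    struct pollfd pfd = { .fd = sock, .events = POLLIN };

    for (;;)
    {
        /* Case (b): the promotion happened before poll() was ever called. */
        if (cancel_abort_requested)
            return false;           /* give up: the mirror has been promoted */

        int rc = poll(&pfd, 1, 660 * 1000);

        if (rc > 0)
            return true;            /* the segment replied */
        if (rc < 0 && errno == EINTR)
            continue;               /* case (c): SIGUSR1 interrupted poll();
                                     * the flag check at the top ends the loop */
        if (rc < 0)
            return false;           /* a real error */
        /* rc == 0: timeout; re-check the flag, then poll again */
    }
}
```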