forked from greenplum-db/gpdb-archive
Fix stuck query after cancel or termination when segment is not responding (#948)

Problem:
The following scenario left a coordinator backend process stuck:
1. At some moment one of the segments stopped processing incoming requests (the reason itself is not important: any segment can fail, and the system should be able to recover).
2. A query that dispatches requests to segments was executed (for example, a select from 'gp_toolkit.gp_resgroup_status_per_segment'). Because one of the segments was not responding, the coordinator hung in 'checkDispatchResult' (this is expected and can be handled by the FTS).
3. The stuck query was canceled or terminated (by 'pg_cancel_backend' or 'pg_terminate_backend'). This did not release the query from the stuck state; instead, after this step the query became completely unrecoverable. Even after the FTS had detected the malfunctioning segment and promoted the mirror, the query kept hanging forever.

Expected behavior: after FTS mirror promotion, all stuck queries are unblocked and canceled successfully.

Note: if FTS mirror promotion happened before step 3, the FTS canceled the query successfully.

Root cause:
During cancel or termination, the coordinator tried to abort the current transaction and hung in the function 'internal_cancel' (called from PQcancel) on the 'poll' system call. The call has a timeout of 660 seconds and, moreover, after the timeout expired, 'internal_cancel' looped forever, calling 'poll' again. So, if the socket was open on the segment side but nobody replied (because the segment process had become unavailable for some reason), 'internal_cancel' had no way to return.

Fix:
Once the FTS promotes a mirror, it sends a signal to the coordinator postmaster (with the PMSIGNAL_FTS_PROMOTED_MIRROR reason). On receiving the signal, the coordinator postmaster sends the SIGUSR1 signal (with the PROCSIG_FTS_PROMOTED_MIRROR reason) to all of its usual backends.
Once a backend receives the signal, if it is in the process of canceling or terminating a query, it sets a flag in libpq. 'internal_cancel' checks this flag before calling the 'poll' system call. If the flag is set, 'internal_cancel' returns with an error. Thus:
a. if the FTS promotion happens before cancel/terminate, the query is canceled by the old logic;
b. if the FTS promotion happens after cancel/terminate, but before 'internal_cancel' calls 'poll', 'internal_cancel' returns an error without calling 'poll';
c. if the FTS promotion happens when 'poll' has already been called, 'poll' returns EINTR (because SIGUSR1 was received), and a new 'poll' is not called, because the flag is set.
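
Cases b and c above can be sketched in C as follows. This is a minimal, self-contained illustration of the flag-before-'poll' pattern, not the code from the commit: the names 'mirror_promoted', 'on_sigusr1', 'wait_for_cancel_reply' and the 'demo_*' helpers are hypothetical, a pipe stands in for the segment socket, and a forked child stands in for the postmaster that delivers SIGUSR1.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for the flag that the backend sets in libpq. */
static volatile sig_atomic_t mirror_promoted = 0;

/* Hypothetical handler standing in for the PROCSIG_FTS_PROMOTED_MIRROR path. */
static void on_sigusr1(int signo) { (void) signo; mirror_promoted = 1; }

/*
 * Wait for a reply on fd. Returns 0 when fd becomes readable, -1 when the
 * wait is abandoned because the flag is set, -2 on a hard poll() error.
 * A finite timeout is required so that the flag is re-checked periodically.
 */
int wait_for_cancel_reply(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    assert(timeout_ms > 0);
    for (;;)
    {
        if (mirror_promoted)            /* case b, and case c after EINTR */
            return -1;
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc > 0)
            return 0;                   /* the peer replied (or hung up) */
        if (rc < 0 && errno != EINTR)
            return -2;
        /* timeout or EINTR: go around and check the flag again */
    }
}

/* The peer replies: the wait ends normally. */
int demo_reply_arrives(void)
{
    int fds[2], rc = -3;
    if (pipe(fds) != 0) return -3;
    mirror_promoted = 0;
    if (write(fds[1], "x", 1) == 1)
        rc = wait_for_cancel_reply(fds[0], 5000);
    close(fds[0]); close(fds[1]);
    return rc;
}

/* Case b: the flag is already set, so poll() is never called. */
int demo_flag_set_before_poll(void)
{
    int fds[2];
    if (pipe(fds) != 0) return -3;
    mirror_promoted = 1;
    int rc = wait_for_cancel_reply(fds[0], 5000);
    close(fds[0]); close(fds[1]);
    return rc;
}

/* Case c: SIGUSR1 arrives while poll() is blocked on a silent peer. */
int demo_signal_during_poll(void)
{
    int fds[2], rc = -3;
    struct sigaction sa;
    struct timespec delay = { 0, 100 * 1000 * 1000 };   /* 100 ms */

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    if (pipe(fds) != 0) return -3;
    mirror_promoted = 0;
    if (sigaction(SIGUSR1, &sa, NULL) == 0)
    {
        pid_t child = fork();
        if (child == 0)
        {
            nanosleep(&delay, NULL);    /* let the parent enter poll() */
            kill(getppid(), SIGUSR1);
            _exit(0);
        }
        if (child > 0)
        {
            rc = wait_for_cancel_reply(fds[0], 5000);
            waitpid(child, NULL, 0);
        }
    }
    close(fds[0]); close(fds[1]);
    return rc;
}
```

In this sketch a signal that lands in the small window between the flag check and the 'poll' call is not lost: 'poll' runs for at most one timeout period, and the next loop iteration sees the flag. Calling the three 'demo_*' helpers from a 'main' function exercises the normal reply, case b, and case c in turn.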
Showing 12 changed files with 553 additions and 0 deletions.