
Give more specific messages if a job was killed due to SIGTERM or SIGKILL signals #12435

Merged · 12 commits merged into ansible:devel on Jun 30, 2022

Conversation

@AlanCoding (Member) commented on Jun 28, 2022

SUMMARY

The current behavior is that a dispatcher worker with a running job will not exit on SIGTERM. It therefore runs out the clock on its supervisor grace period and then gets a SIGKILL. The job still gets reaped eventually, but potentially with missing events and a vague message, because state is no longer self-consistent after a SIGKILL and information is lost.

The intent of this change is that the dispatcher exits promptly on SIGTERM, writes new messages for the different cases, and does some other minor things to make debugging easier.
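To illustrate the general idea, here is a minimal, hypothetical sketch (not AWX code; names like `current_job` and `on_sigterm` are made up) of a worker that handles SIGTERM by recording a descriptive explanation and exiting cleanly, rather than ignoring the signal until the supervisor escalates to SIGKILL:

```python
# Hypothetical sketch, not AWX code: react to SIGTERM by recording why the
# job stopped and exiting promptly, instead of waiting for a SIGKILL.
import signal
import sys

# Stand-in for the job this worker is currently running.
current_job = {"id": 42, "job_explanation": ""}

def on_sigterm(signum, frame):
    # Record a descriptive message while we still can, then exit cleanly.
    current_job["job_explanation"] = "Job was canceled because the dispatcher received SIGTERM."
    print(f"job {current_job['id']}: {current_job['job_explanation']}")
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)

if __name__ == "__main__":
    # Wait until a signal arrives; `kill -TERM <pid>` triggers the handler above.
    signal.pause()
```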

ISSUE TYPE
  • Bug or Docs Fix
COMPONENT NAME
  • API
ADDITIONAL INFORMATION

This pulls in particular parts of #11745, but does not make the architectural change to our canceling mechanism. It processes the SIGTERM and SIGINT signals and cancels jobs with a descriptive message. Those signals can come from a supervisorctl stop type of command, or from something else on the machine sending them on its own.

I've tried to apply everything I've learned here in an appropriate manner. PR #11745 has some awkward code because it passes around the objects related to signal handling. Fundamentally, the Python signal library works with process-wide globals, so we should do the same, but take action to limit the context in which they apply. We also have a highly specific and opinionated use case with project syncs running before jobs. The this_is_outermost_caller logic is additional work to ensure that nested applications of the decorator work with the same state (see the sketch below).
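For readers unfamiliar with the pattern, here is a hypothetical sketch of what that could look like. Only this_is_outermost_caller is a name from the PR; everything else, including `with_signal_handling` and `signal_callback_installed`, is illustrative and not the actual AWX implementation:

```python
# Hypothetical sketch of the pattern described above, not the actual AWX code.
# Signal state lives in module-level ("python global") variables, and nested
# applications of the decorator let only the outermost caller install and
# restore the SIGTERM handler, so inner calls share the same state.
import functools
import signal

signal_callback_installed = False  # module-level state, mirroring how the signal library works
sigterm_flag = False               # set when SIGTERM arrives; callers poll this

def _set_flag(signum, frame):
    global sigterm_flag
    sigterm_flag = True

def with_signal_handling(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global signal_callback_installed
        this_is_outermost_caller = not signal_callback_installed
        if this_is_outermost_caller:
            original_handler = signal.signal(signal.SIGTERM, _set_flag)
            signal_callback_installed = True
        try:
            return func(*args, **kwargs)
        finally:
            # Only the outermost caller restores the previous handler, which
            # limits the context in which the global handler applies.
            if this_is_outermost_caller:
                signal.signal(signal.SIGTERM, original_handler)
                signal_callback_installed = False
    return wrapper

@with_signal_handling
def project_sync():
    ...  # would check sigterm_flag periodically and cancel with a descriptive message

@with_signal_handling
def run_job():
    project_sync()  # nested use: the handler is installed exactly once
```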

awx/main/dispatch/reaper.py (outdated, resolved review thread)
job_ids.append(j.id)
j.status = 'failed'
j.start_args = ''
j.job_explanation += 'Task was marked as running at system start up. The system must have not shut down properly, so it has been marked as failed.'
Reviewer comment: 🍰

logger.warning(f"Postgres event consumer has not recovered in {self.pg_max_wait} s, exiting")
current_downtime = time.time() - self.pg_down_time
if current_downtime > self.pg_max_wait:
logger.exception(f"Postgres event consumer has not recovered in {current_downtime} s, exiting")
Reviewer comment: 👍

@AlanCoding merged commit fd671ec into ansible:devel on Jun 30, 2022
shanemcd pushed a commit to shanemcd/awx that referenced this pull request Jul 14, 2022
…nsible#12435)

* Reap jobs on dispatcher startup to increase clarity, replace existing reaping logic

* Exit jobs if receiving SIGTERM signal

* Fix unwanted reaping on shutdown, let subprocess close out

* Add some sanity tests for signal module

* Add a log for an unhandled dispatcher error

* Refine wording of error messages

Co-authored-by: Elijah DeLee <kdelee@redhat.com>
AlanCoding added a commit to AlanCoding/awx that referenced this pull request Jul 25, 2022
Make logs from database outage more manageable

Raise exception if update_model never recovers from problem

Remove unused current_user cookie

Register system again if deleted by another pod

Avoid cases where a missing instance
  would throw an error on startup;
  this gives time for the heartbeat to register it

Give specific messages if job was killed due to SIGTERM or SIGKILL (ansible#12435)

* Reap jobs on dispatcher startup to increase clarity, replace existing reaping logic

* Exit jobs if receiving SIGTERM signal

* Fix unwanted reaping on shutdown, let subprocess close out

* Add some sanity tests for signal module

* Add a log for an unhandled dispatcher error

* Refine wording of error messages

Co-authored-by: Elijah DeLee <kdelee@redhat.com>

Split reaper for running and waiting jobs

Avoid running jobs that have already been reaped

Co-authored-by: Elijah DeLee <kdelee@redhat.com>
Co-authored-by: Shane McDonald <me@shanemcd.com>

Remove unnecessary extra actions

Fix waiting jobs in other cases of reaping

Add logs to debug waiting bottlenecking

Add logs about heartbeat skew

Co-authored-by: Shane McDonald <me@shanemcd.com>

Replace git shallow clone with shutil.copytree

Introduce build_project_dir method
  the base method will create an empty project dir for workdir

Share code between job and inventory tasks with new mixin
  combine rest of pre_run_hook logic
  structure to hold lock for entire sync process

Force sync to run for inventory updates due to UI issues

Remove reference to removed scm_last_revision field

Fix consuming control capacity on container groups

Somehow we lost the critical line where we consume control impact on container groups

This needs to be forward-ported to devel

Remove debug method that calls cleanup

- It's unclear why this was here.
- Removing it doesn't appear to cause any problems.
- It still gets called during heartbeats.

Log chosen control node for container group tasks

Add grace period settings for task manager timeout, and pod / job waiting reapers

Co-authored-by: Alan Rominger <arominge@redhat.com>

Add setting for missed heartbeats before marking node offline

Allow for passing custom job_explanation to reaper methods

Co-authored-by: Alan Rominger <arominge@redhat.com>

Add extra workers if computing based on memory

Co-authored-by: Elijah DeLee <kdelee@redhat.com>

Apply a failed status if cancel_flag is not set
AlanCoding added a commit to AlanCoding/awx that referenced this pull request Aug 18, 2022
…nsible#12435)

* Reap jobs on dispatcher startup to increase clarity, replace existing reaping logic

* Exit jobs if receiving SIGTERM signal

* Fix unwanted reaping on shutdown, let subprocess close out

* Add some sanity tests for signal module

* Add a log for an unhandled dispatcher error

* Refine wording of error messages

Co-authored-by: Elijah DeLee <kdelee@redhat.com>