
Give more specific messages if a job was killed due to SIGTERM or SIGKILL signals #12435

Merged · 12 commits merged into ansible:devel on Jun 30, 2022

Conversation

@AlanCoding (Member) commented on Jun 28, 2022

SUMMARY

The current behavior is that a dispatcher worker with a running job will not exit on SIGTERM. It therefore runs out the clock on its supervisor grace period and then gets a SIGKILL. The job still gets reaped eventually, but potentially with missing events and a vague message, because state is no longer self-consistent after a SIGKILL and information is lost.

The intent of this change is that the dispatcher exits promptly on SIGTERM, writes new messages for the different cases, and does some other minor things to make debugging easier.
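To illustrate the general idea, here is a minimal, hypothetical sketch (not AWX code; names like `current_job` and `on_sigterm` are made up) of a worker that handles SIGTERM by recording a descriptive explanation and exiting cleanly, rather than ignoring the signal until the supervisor escalates to SIGKILL:

```python
# Hypothetical sketch, not AWX code: react to SIGTERM by recording why the
# job stopped and exiting promptly, instead of waiting for a SIGKILL.
import signal
import sys

# Stand-in for the job this worker is currently running.
current_job = {"id": 42, "job_explanation": ""}

def on_sigterm(signum, frame):
    # Record a descriptive message while we still can, then exit cleanly.
    current_job["job_explanation"] = "Job was canceled because the dispatcher received SIGTERM."
    print(f"job {current_job['id']}: {current_job['job_explanation']}")
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)

if __name__ == "__main__":
    # Wait until a signal arrives; `kill -TERM <pid>` triggers the handler above.
    signal.pause()
```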

ISSUE TYPE
  • Bug or Docs Fix
COMPONENT NAME
  • API
ADDITIONAL INFORMATION

This pulls in particular parts of #11745, but does not make the architectural change to our canceling mechanism. It processes the SIGTERM and SIGINT signals and cancels jobs with a descriptive message. Those signals can come from a supervisorctl stop type of command, or from something else on the machine sending them on its own.

I've tried to apply everything I've learned here in an appropriate manner. PR #11745 has some awkward code because it passes around the objects related to signal handling. Fundamentally, the Python signal library works with process-wide globals, so we should do the same, but take action to limit the context in which they apply. We also have a highly specific and opinionated use case with project syncs running before jobs. The this_is_outermost_caller logic is additional work to ensure that nested applications of the decorator work with the same state (see the sketch below).
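For readers unfamiliar with the pattern, here is a hypothetical sketch of what that could look like. Only this_is_outermost_caller is a name from the PR; everything else, including `with_signal_handling` and `signal_callback_installed`, is illustrative and not the actual AWX implementation:

```python
# Hypothetical sketch of the pattern described above, not the actual AWX code.
# Signal state lives in module-level ("python global") variables, and nested
# applications of the decorator let only the outermost caller install and
# restore the SIGTERM handler, so inner calls share the same state.
import functools
import signal

signal_callback_installed = False  # module-level state, mirroring how the signal library works
sigterm_flag = False               # set when SIGTERM arrives; callers poll this

def _set_flag(signum, frame):
    global sigterm_flag
    sigterm_flag = True

def with_signal_handling(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global signal_callback_installed
        this_is_outermost_caller = not signal_callback_installed
        if this_is_outermost_caller:
            original_handler = signal.signal(signal.SIGTERM, _set_flag)
            signal_callback_installed = True
        try:
            return func(*args, **kwargs)
        finally:
            # Only the outermost caller restores the previous handler, which
            # limits the context in which the global handler applies.
            if this_is_outermost_caller:
                signal.signal(signal.SIGTERM, original_handler)
                signal_callback_installed = False
    return wrapper

@with_signal_handling
def project_sync():
    ...  # would check sigterm_flag periodically and cancel with a descriptive message

@with_signal_handling
def run_job():
    project_sync()  # nested use: the handler is installed exactly once
```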

awx/main/dispatch/reaper.py (outdated, resolved review thread)
job_ids.append(j.id)
j.status = 'failed'
j.start_args = ''
j.job_explanation += 'Task was marked as running at system start up. The system must have not shut down properly, so it has been marked as failed.'
Reviewer comment: 🍰

logger.warning(f"Postgres event consumer has not recovered in {self.pg_max_wait} s, exiting")
current_downtime = time.time() - self.pg_down_time
if current_downtime > self.pg_max_wait:
logger.exception(f"Postgres event consumer has not recovered in {current_downtime} s, exiting")
Reviewer comment: 👍

@AlanCoding merged commit fd671ec into ansible:devel on Jun 30, 2022
shanemcd pushed a commit to shanemcd/awx that referenced this pull request Jul 14, 2022
…nsible#12435)

* Reap jobs on dispatcher startup to increase clarity, replace existing reaping logic

* Exit jobs if receiving SIGTERM signal

* Fix unwanted reaping on shutdown, let subprocess close out

* Add some sanity tests for signal module

* Add a log for an unhandled dispatcher error

* Refine wording of error messages

Co-authored-by: Elijah DeLee <kdelee@redhat.com>
AlanCoding added a commit to AlanCoding/awx that referenced this pull request Jul 25, 2022
Make logs from database outage more manageable

Raise exception if update_model never recovers from problem

Remove unused current_user cookie

Register system again if deleted by another pod

Avoid cases where a missing instance
  would throw an error on startup;
  this gives time for the heartbeat to register it

Give specific messages if job was killed due to SIGTERM or SIGKILL (ansible#12435)

* Reap jobs on dispatcher startup to increase clarity, replace existing reaping logic

* Exit jobs if receiving SIGTERM signal

* Fix unwanted reaping on shutdown, let subprocess close out

* Add some sanity tests for signal module

* Add a log for an unhandled dispatcher error

* Refine wording of error messages

Co-authored-by: Elijah DeLee <kdelee@redhat.com>

Split reaper for running and waiting jobs

Avoid running jobs that have already been reaped

Co-authored-by: Elijah DeLee <kdelee@redhat.com>
Co-authored-by: Shane McDonald <me@shanemcd.com>

Remove unnecessary extra actions

Fix waiting jobs in other cases of reaping

Add logs to debug waiting bottlenecking

Add logs about heartbeat skew

Co-authored-by: Shane McDonald <me@shanemcd.com>

Replace git shallow clone with shutil.copytree

Introduce build_project_dir method
  the base method will create an empty project dir for workdir

Share code between job and inventory tasks with new mixin
  combine rest of pre_run_hook logic
  structure to hold lock for entire sync process

Force sync to run for inventory updates due to UI issues

Remove reference to removed scm_last_revision field

Fix consuming control capacity on container groups

Somehow we lost the critical line where we consume control impact on container groups

This needs to be forward-ported to devel

Remove debug method that calls cleanup

- It's unclear why this was here.
- Removing it doesn't appear to cause any problems.
- It still gets called during heartbeats.

Log chosen control node for container group tasks

Add grace period settings for task manager timeout, and pod / job waiting reapers

Co-authored-by: Alan Rominger <arominge@redhat.com>

Add setting for missed heartbeats before marking node offline

Allow for passing custom job_explanation to reaper methods

Co-authored-by: Alan Rominger <arominge@redhat.com>

Add extra workers if computing based on memory

Co-authored-by: Elijah DeLee <kdelee@redhat.com>

Apply a failed status if cancel_flag is not set
AlanCoding added a commit to AlanCoding/awx that referenced this pull request Aug 18, 2022
…nsible#12435)

* Reap jobs on dispatcher startup to increase clarity, replace existing reaping logic

* Exit jobs if receiving SIGTERM signal

* Fix unwanted reaping on shutdown, let subprocess close out

* Add some sanity tests for signal module

* Add a log for an unhandled dispatcher error

* Refine wording of error messages

Co-authored-by: Elijah DeLee <kdelee@redhat.com>