
Close database connections while processing job output #11745

Merged: 2 commits into devel from cancel_rework_no_close on Sep 6, 2022

Conversation

@AlanCoding (Member) commented Feb 15, 2022

SUMMARY

Replaces #11464

Most of the change here (taken from that prior PR) is to replace the cancel callback method so that it no longer needs to poll the database. Instead, it processes a SIGTERM signal to know when to close.

This closes database connections before we start the ansible-runner processing work that consumes output from a remote job. During processing, we need to pull from receptor and push to redis. After the job is finished, it re-connects to the database to update status and trigger other actions.

ISSUE TYPE
  • New or Enhanced Feature
COMPONENT NAME
  • API
ADDITIONAL INFORMATION

The challenging part of this is ordering of actions, race conditions, when to close stuff, and that sort of thing. This involved a lot of experimentation.

An important detail is that when you do supervisorctl service restart, it cancels jobs by this mechanism, because supervisor itself sends a SIGTERM signal to the process group.
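
A minimal sketch of the flow described in the SUMMARY, assuming a configured Django environment; consume_output and the job argument here are hypothetical stand-ins for the real receptor-to-redis processing, not the PR's literal code:

import signal

from django.db import connections


def process_remote_job_output(job, consume_output):
    sigterm_received = []

    def on_sigterm(signum, frame):
        # Cancel handling is signal-driven now; no database polling in the callback.
        sigterm_received.append(True)

    signal.signal(signal.SIGTERM, on_sigterm)

    # Close DB connections before the long-running output processing starts.
    connections.close_all()

    # Pull events from receptor and push them to redis, stopping early on SIGTERM.
    consume_output(should_stop=lambda: bool(sigterm_received))

    # Django reconnects lazily on the next ORM query, once processing is done.
    job.refresh_from_db()
    job.save(update_fields=['status'])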

@matburt (Member) commented May 3, 2022

This looks good to me... what's the hold up?

@AlanCoding (Member Author) commented

Through doing #12435, I've learned more, and I should apply those lessons correctly before reconsidering this.

Really quick run-through:

The signal callback method will only be run in the main process. This means the code in the awx/main/tasks/receptor.py module.

If you set the callback method to raise an exception, that may or may not be respected, depending on what the code is doing. File reads will ignore it, which means we can't 100% get rid of reads in the processing step; this is something I had already discovered the last time I was playing with this. However, the line first_future = concurrent.futures.wait(futures, return_when=concurrent.futures.FIRST_COMPLETED) will properly get a traceback raised from the signal-connected method. This is useful information, as we could reduce the thread pool workers from 2 to just 1, and we may be able to do this with no polling at all. This is very appealing, and I hope I can return to it.
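
A small illustration of that behavior, relying on CPython running signal handlers in the main thread; the exception class and the sleeping worker are purely illustrative:

import concurrent.futures
import signal
import time


class SigtermRequested(Exception):
    pass


def raise_on_sigterm(signum, frame):
    raise SigtermRequested()


signal.signal(signal.SIGTERM, raise_on_sigterm)

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
futures = [executor.submit(time.sleep, 3600)]

try:
    # A SIGTERM delivered while waiting here surfaces as SigtermRequested
    # in the main thread; a blocking file read would not be interrupted.
    done, pending = concurrent.futures.wait(futures, return_when=concurrent.futures.FIRST_COMPLETED)
except SigtermRequested:
    executor.shutdown(wait=False)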

@AlanCoding removed the blocked label Jul 6, 2022
@AlanCoding force-pushed the cancel_rework_no_close branch 3 times, most recently from e636da9 to 7018416 on July 6, 2022 01:06
logger.info(f'Job id {unified_job_id} has been deleted, cancel aborted')
return
time.sleep(1)
logger.info(f'{unified_job.log_format} stopped before obtaining a task id, sigterm not needed')
@AlanCoding (Member Author) commented on the diff above:

I'm increasingly thinking this method cancel_unified_job shouldn't exist. The race condition it exists to address (jobs in pending/waiting status) shouldn't actually be a problem, as the cancel_flag can be relied on as the signaling method in the cases that matter.

There are much narrower race conditions to worry about, like when the job is "running" but has not yet been submitted to receptor. Again, this would be the wrong place to handle that. After we set raise_exception, we should check whether the flag has been set, and treat it as a cancel if it has. By itself, that should be a complete solution to the potential race conditions.

@AlanCoding (Member Author) commented

I couldn't make myself happy with how I was breaking with existing code patterns in the dispatcher; now I think I can be happy with this.

The gist of the last change is that I stopped using a pseudo-task to tell the dispatcher parent process to SIGTERM a task's worker. This is replaced by the dispatcher control-and-reply mechanism. We were already using this mechanism in the actually_running property, but because it was both aggressive and slow, I believe it was causing a good number of race conditions. Instead of querying the dispatcher for all running tasks (for fairly minor reasons), this PR takes the more direct approach of telling the dispatcher "try to cancel task ____, and then tell me what you really did".

This gets us an acknowledgement that the quit signal was sent to the job's control process within the request-response cycle of a POST to the /cancel/ endpoint. After that, we still wait for the job's signal processing to close sockets, do a receptor cancel, and save any halfway progress the job has made.
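
A minimal sketch of that control-and-reply call, using the names visible in the diff hunk reviewed later in this thread; the import path and the surrounding method body are assumptions, not the PR's exact code:

import socket

from awx.main.dispatch.control import ControlDispatcher  # assumed import path


def cancel_dispatcher_process(self):
    """Ask the dispatcher on the controller node to SIGTERM this job's worker,
    and report back whether it actually did so."""
    if not self.celery_task_id:
        return False
    try:
        # Blocks until the dispatcher replies (or the reply times out).
        canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
    except socket.timeout:
        canceled = []  # no reply in time; callers fall back to cancel_flag handling
    return self.celery_task_id in canceled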

@AlanCoding force-pushed the cancel_rework_no_close branch 2 times, most recently from 5bc6146 to 3143449 on July 14, 2022 13:59
@AlanCoding (Member Author) commented

At last I've gotten around to putting up #12586

[Screenshot from 2022-07-28 13-15-00]

See the linked PR for a reference.

@matburt (Member) commented Aug 12, 2022

looking good now?

@AlanCoding (Member Author) commented

#12653 now creates very heavy conflicts with this. It's not solving the same problem, but the same code structures evolve in two different directions in these two PRs, so we need some discussion around that.

@AlanCoding force-pushed the cancel_rework_no_close branch 2 times, most recently from 4af0973 to a4ea8fa on August 18, 2022 18:13
@AlanCoding (Member Author) commented

I had to squash commits again in order to rebase, because some intermediate commits had conflicts with devel.

@jainnikhil30 (Contributor) commented Aug 25, 2022

Okay, I went ahead and tested this PR and it looks really good. The results follow:

  • Without the PR (with 3 concurrent long running jobs):
awx=> SELECT pid, age(clock_timestamp(), query_start), usename, state, application_name, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query = 'COMMIT' AND query NOT ILIKE '%pg_stat_activity%' AND datname='awx'
;
   pid   |       age       | usename | state |                     application_name                     | query  
---------+-----------------+---------+-------+----------------------------------------------------------+--------
 2172572 | 00:20:59.366217 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172900 | 00:20:40.453337 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172700 | 00:20:44.746103 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172502 | 00:00:00.277442 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172368 | 00:21:03.665407 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172194 | 00:21:21.029097 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172193 | 00:00:00.797692 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172170 | 00:21:23.316118 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172830 | 00:00:00.443355 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT

As you can see, we had 9 idle connections with query COMMIT, most of them with long ages (20 minutes and counting).

  • With this PR (with 3 concurrent long running jobs):
awx=> SELECT pid, age(clock_timestamp(), query_start), usename, state, application_name, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query = 'COMMIT' AND query NOT ILIKE '%pg_stat_activity%' AND datname='awx'
;
 pid | age | usename | state | application_name | query 
-----+-----+---------+-------+------------------+-------
(0 rows)

Now we have no idle connections with query COMMIT and with long age.

This solves a lot of connection problems. We have seen in the past that you can't run more than a certain number of jobs on the controller because we run out of DB connections. We have also seen that those idle connections consume resources (memory and CPU) and can cause issues. This PR addresses both of these issues. I think we should let it go through formal testing and merge it.

@shanemcd (Member) commented

Currently playing around with this PR and am seeing this in my logs when trying to cancel jobs:

tools_awx_1     | 2022-08-25 15:33:05,375 ERROR    [0d2b5791] awx.main.models.unified_jobs could not reach dispatcher on awx_1 within 5s

# Use control and reply mechanism to cancel and obtain confirmation
timeout = 5
canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
except (socket.timeout, RuntimeError):
Review comment from a Member on the diff above:

It's pretty dangerous to include both of these exception types here. We should catch RuntimeError separately and use logger.exception. Right now this is eating:

tools_awx_1     | 2022-08-25 16:12:37,974 ERROR    [bf11a9d4] awx.main.models.unified_jobs could not reach dispatcher on awx_1 within 5s
tools_awx_1     | Traceback (most recent call last):
tools_awx_1     |   File "/awx_devel/./awx/main/models/unified_jobs.py", line 1420, in cancel_dispatcher_process
tools_awx_1     |     canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/control.py", line 41, in cancel
tools_awx_1     |     return self.control_with_reply('cancel', *args, extra_data={'task_ids': task_ids}, **kwargs)
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/control.py", line 59, in control_with_reply
tools_awx_1     |     for reply in conn.events(select_timeout=timeout, yield_timeouts=True):
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/__init__.py", line 35, in events
tools_awx_1     |     raise RuntimeError('Listening for events can only be done in autocommit mode')
tools_awx_1     | RuntimeError: Listening for events can only be done in autocommit mode
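
One possible shape for the suggested split, assuming the surrounding names from the diff hunk above (logger, timeout, ControlDispatcher, self); a sketch, not the PR's final code:

try:
    canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
except socket.timeout:
    # Expected failure mode: the dispatcher simply did not answer in time.
    logger.error(f'could not reach dispatcher on {self.controller_node} within {timeout}s')
    canceled = []
except RuntimeError:
    # Unexpected failure mode: log the full traceback instead of swallowing it.
    logger.exception('error encountered while canceling task via the dispatcher')
    canceled = []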

Follow-up (Member):

Never mind. This is also happening on devel. Yikes.

Reply from @AlanCoding (Member Author):

Yep, I'm trying to pick up the error-handling changes faithfully with a rebase, and will push in a second.

@AlanCoding force-pushed the cancel_rework_no_close branch 2 times, most recently from c26139d to bc257e7 on August 30, 2022 00:49
If cancel was attempted before, still allow attempting another cancel;
in this case, attempt to send the SIGTERM signal again.
Keep clicking, you might help!

Replace other cancel_callbacks with sigterm watcher
  adapt special inventory mechanism for this too

Get rid of the cancel_watcher method with exception in main thread

Handle academic case of sigterm race condition

Process cancelation as control signal

Fully connect cancel method and run_dispatcher to control

Never transition workflows directly to canceled, add logs
@AlanCoding (Member Author) commented

I'm now working to test AlanCoding#48, which is an evolution of #12653 rebased on top of these changes. Much of those changes are non-trivial.

@AlanCoding (Member Author) commented

I wanted to be extra sure that merging this doesn't mess anything up. So let me explain some of the testing and where this led me.

control-standalone

This scenario has been very stable on this branch, and fixes 500 errors from some scenarios that run it up to near capacity. I don't know for sure what caused us to get closer to the connection limit, but I suspect it could be the change that added 7 additional max workers.

hybrid-cluster

This looks considerably better than the reference runs, because I think the reference runs have similar 500 errors.

control-hop-to-execution-nodes cluster

This branch had one more failure than the reference run, in a particularly concerning place: a workflow that was canceled and expected the jobs in it to be marked as canceled. I dove into this bug and found that canceling a job does not work as expected while its project sync is running, because the project sync reads the signal as a plain SIGTERM; from its perspective this contradicts the cancel_flag, so it falls into the special handling for a cluster shutdown (which is not the case here). The rework here places an additional burden on the fail chains: we need to set the cancel_flag for anything that might be the active process.
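
Roughly, this means the SIGTERM handler in the worker needs to consult cancel_flag before deciding how to react. A hypothetical sketch (the exception names are illustrative, not AWX's):

class CancelRequested(Exception):
    pass


class ShutdownRequested(Exception):
    pass


def make_sigterm_handler(unified_job):
    def handler(signum, frame):
        # Re-read the flag; the fail chain is expected to set it on a user cancel.
        unified_job.refresh_from_db(fields=['cancel_flag'])
        if unified_job.cancel_flag:
            raise CancelRequested()
        # No cancel flag: treat the SIGTERM as a node/cluster shutdown instead.
        raise ShutdownRequested()
    return handler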

Because of that, I pushed an additional commit which I found fixes this. I developed a minimal test case to hit the issue, and found that even devel doesn't handle canceling the project sync in this situation. So I'm going to need to gather up all these discoveries and document them.

With the additional commit, I feel safe about merging now.
