
Close database connections while processing job output #11745

Merged: 2 commits into devel from cancel_rework_no_close on Sep 6, 2022

Conversation

@AlanCoding (Member) commented Feb 15, 2022

SUMMARY

Replaces #11464

Most of the change here (taken from that prior PR) is to replace the cancel callback method so that it no longer needs to poll the database. Instead, it processes a SIGTERM signal to know when to close.

This closes database connections before we start the ansible-runner processing work that consumes output from a remote job. During processing, we need to pull from receptor and push to redis. After the job is finished, it re-connects to the database to update status and trigger other actions.

ISSUE TYPE
  • New or Enhanced Feature
COMPONENT NAME
  • API
ADDITIONAL INFORMATION

The challenging part of this is ordering of actions, race conditions, when to close stuff, and that sort of thing. This involved a lot of experimentation.

An important detail is that when you do supervisorctl service restart, it cancels jobs by this mechanism, because supervisor itself sends a SIGTERM signal to the process group.
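
A minimal sketch of the flow described in the SUMMARY, assuming a configured Django environment; consume_output and the job argument here are hypothetical stand-ins for the real receptor-to-redis processing, not the PR's literal code:

import signal

from django.db import connections


def process_remote_job_output(job, consume_output):
    sigterm_received = []

    def on_sigterm(signum, frame):
        # Cancel handling is signal-driven now; no database polling in the callback.
        sigterm_received.append(True)

    signal.signal(signal.SIGTERM, on_sigterm)

    # Close DB connections before the long-running output processing starts.
    connections.close_all()

    # Pull events from receptor and push them to redis, stopping early on SIGTERM.
    consume_output(should_stop=lambda: bool(sigterm_received))

    # Django reconnects lazily on the next ORM query, once processing is done.
    job.refresh_from_db()
    job.save(update_fields=['status'])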

@matburt (Member) commented May 3, 2022

This looks good to me... what's the hold up?

@AlanCoding (Member Author) commented

Through doing #12435, I've learned more, and I should apply those lessons correctly before reconsidering this.

Really quick run-through:

The signal callback method will only be run in the main process. This means the code in the awx/main/tasks/receptor.py module.

If you set the callback method to raise an exception, that may or may not be respected, depending on what the code is doing. File reads will ignore it, which means we can't 100% get rid of reads in the processing step; this is something I had already discovered the last time I was playing with this. However, the line first_future = concurrent.futures.wait(futures, return_when=concurrent.futures.FIRST_COMPLETED) will properly get a traceback raised from the signal-connected method. This is useful information, as we could reduce the thread pool workers from 2 to just 1, and we may be able to do this with no polling at all. This is very appealing, and I hope I can return to it.
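
A small illustration of that behavior, relying on CPython running signal handlers in the main thread; the exception class and the sleeping worker are purely illustrative:

import concurrent.futures
import signal
import time


class SigtermRequested(Exception):
    pass


def raise_on_sigterm(signum, frame):
    raise SigtermRequested()


signal.signal(signal.SIGTERM, raise_on_sigterm)

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
futures = [executor.submit(time.sleep, 3600)]

try:
    # A SIGTERM delivered while waiting here surfaces as SigtermRequested
    # in the main thread; a blocking file read would not be interrupted.
    done, pending = concurrent.futures.wait(futures, return_when=concurrent.futures.FIRST_COMPLETED)
except SigtermRequested:
    executor.shutdown(wait=False)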

@AlanCoding removed the blocked label Jul 6, 2022
@AlanCoding force-pushed the cancel_rework_no_close branch 3 times, most recently from e636da9 to 7018416 on July 6, 2022 01:06
logger.info(f'Job id {unified_job_id} has been deleted, cancel aborted')
return
time.sleep(1)
logger.info(f'{unified_job.log_format} stopped before obtaining a task id, sigterm not needed')
@AlanCoding (Member Author) commented on the diff above:

I'm increasingly thinking this method cancel_unified_job shouldn't exist. The race condition it exists to address (jobs in pending/waiting status) shouldn't actually be a problem, as the cancel_flag can be relied on as the signaling method in the cases that matter.

There are much narrower race conditions to worry about, like when the job is "running" but has not yet been submitted to receptor. Again, this would be the wrong place to handle that. After we set raise_exception, we should check whether the flag has been set, and treat it as a cancel if it has. By itself, that should be a complete solution to the potential race conditions.

@AlanCoding (Member Author) commented

I couldn't make myself happy with how I was breaking with existing code patterns in the dispatcher; now I think I can be happy with this.

The gist of the last change is that I stopped using a pseudo-task to tell the dispatcher parent process to SIGTERM a task's worker. This is replaced by the dispatcher control-and-reply mechanism. We were already using this mechanism in the actually_running property, but because it was both aggressive and slow, I believe it was causing a good number of race conditions. Instead of querying the dispatcher for all running tasks (for fairly minor reasons), this PR takes the more direct approach of telling the dispatcher "try to cancel task ____, and then tell me what you really did".

This gets us an acknowledgement that the quit signal was sent to the job's control process within the request-response cycle of a POST to the /cancel/ endpoint. After that, we still wait for the job's signal processing to close sockets, do a receptor cancel, and save any halfway progress the job has made.
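
A minimal sketch of that control-and-reply call, using the names visible in the diff hunk reviewed later in this thread; the import path and the surrounding method body are assumptions, not the PR's exact code:

import socket

from awx.main.dispatch.control import ControlDispatcher  # assumed import path


def cancel_dispatcher_process(self):
    """Ask the dispatcher on the controller node to SIGTERM this job's worker,
    and report back whether it actually did so."""
    if not self.celery_task_id:
        return False
    try:
        # Blocks until the dispatcher replies (or the reply times out).
        canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
    except socket.timeout:
        canceled = []  # no reply in time; callers fall back to cancel_flag handling
    return self.celery_task_id in canceled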

@AlanCoding force-pushed the cancel_rework_no_close branch 2 times, most recently from 5bc6146 to 3143449 on July 14, 2022 13:59
@AlanCoding (Member Author) commented

At last I've gotten around to putting up #12586

[Screenshot from 2022-07-28 13-15-00]

See the linked PR for a reference.

@matburt (Member) commented Aug 12, 2022

looking good now?

@AlanCoding (Member Author) commented

#12653 now creates very heavy conflicts with this. It's not solving the same problem, but the same code structures evolve in two different directions in these two PRs, so we need some discussion around that.

@AlanCoding force-pushed the cancel_rework_no_close branch 2 times, most recently from 4af0973 to a4ea8fa on August 18, 2022 18:13
@AlanCoding (Member Author) commented

I had to squash commits again in order to rebase, because some intermediate commits had conflicts with devel.

@jainnikhil30 (Contributor) commented Aug 25, 2022

Okay, I went ahead and tested this PR and it looks really good. The results follow:

  • Without the PR (with 3 concurrent long running jobs):
awx=> SELECT pid, age(clock_timestamp(), query_start), usename, state, application_name, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query = 'COMMIT' AND query NOT ILIKE '%pg_stat_activity%' AND datname='awx'
;
   pid   |       age       | usename | state |                     application_name                     | query  
---------+-----------------+---------+-------+----------------------------------------------------------+--------
 2172572 | 00:20:59.366217 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172900 | 00:20:40.453337 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172700 | 00:20:44.746103 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172502 | 00:00:00.277442 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172368 | 00:21:03.665407 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172194 | 00:21:21.029097 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172193 | 00:00:00.797692 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172170 | 00:21:23.316118 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT
 2172830 | 00:00:00.443355 | awx     | idle  | 165.192.128.236-62364-/usr/bin/awx-manage run_dispatcher | COMMIT

As you can see, we had 9 idle connections with query COMMIT, most of them with long ages (20 minutes and counting).

  • With this PR (with 3 concurrent long running jobs):
awx=> SELECT pid, age(clock_timestamp(), query_start), usename, state, application_name, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query = 'COMMIT' AND query NOT ILIKE '%pg_stat_activity%' AND datname='awx'
;
 pid | age | usename | state | application_name | query 
-----+-----+---------+-------+------------------+-------
(0 rows)

Now we have no idle connections with query COMMIT and with long age.

This solves a lot of connection problems. We have seen in the past that you can't run more than a certain number of jobs on the controller because we run out of DB connections. We have also seen that those idle connections consume resources (memory and CPU) and can cause issues. This PR addresses both of these issues. I think we should let it go through formal testing and merge it.

@shanemcd (Member) commented

Currently playing around with this PR and am seeing this in my logs when trying to cancel jobs:

tools_awx_1     | 2022-08-25 15:33:05,375 ERROR    [0d2b5791] awx.main.models.unified_jobs could not reach dispatcher on awx_1 within 5s

# Use control and reply mechanism to cancel and obtain confirmation
timeout = 5
canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
except (socket.timeout, RuntimeError):
Review comment from a Member on the diff above:

It's pretty dangerous to include both of these exception types here. We should catch RuntimeError separately and use logger.exception. Right now this is eating:

tools_awx_1     | 2022-08-25 16:12:37,974 ERROR    [bf11a9d4] awx.main.models.unified_jobs could not reach dispatcher on awx_1 within 5s
tools_awx_1     | Traceback (most recent call last):
tools_awx_1     |   File "/awx_devel/./awx/main/models/unified_jobs.py", line 1420, in cancel_dispatcher_process
tools_awx_1     |     canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/control.py", line 41, in cancel
tools_awx_1     |     return self.control_with_reply('cancel', *args, extra_data={'task_ids': task_ids}, **kwargs)
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/control.py", line 59, in control_with_reply
tools_awx_1     |     for reply in conn.events(select_timeout=timeout, yield_timeouts=True):
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/__init__.py", line 35, in events
tools_awx_1     |     raise RuntimeError('Listening for events can only be done in autocommit mode')
tools_awx_1     | RuntimeError: Listening for events can only be done in autocommit mode
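
One possible shape for the suggested split, assuming the surrounding names from the diff hunk above (logger, timeout, ControlDispatcher, self); a sketch, not the PR's final code:

try:
    canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
except socket.timeout:
    # Expected failure mode: the dispatcher simply did not answer in time.
    logger.error(f'could not reach dispatcher on {self.controller_node} within {timeout}s')
    canceled = []
except RuntimeError:
    # Unexpected failure mode: log the full traceback instead of swallowing it.
    logger.exception('error encountered while canceling task via the dispatcher')
    canceled = []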

Follow-up (Member):

Never mind. This is also happening on devel. Yikes.

Reply from @AlanCoding (Member Author):

Yep, I'm trying to pick up the error-handling changes faithfully with a rebase, and will push in a second.

@AlanCoding force-pushed the cancel_rework_no_close branch 2 times, most recently from c26139d to bc257e7 on August 30, 2022 00:49
If cancel was attempted before, still allow attempting another cancel;
in this case, attempt to send the SIGTERM signal again.
Keep clicking, you might help!

Replace other cancel_callbacks with sigterm watcher
  adapt special inventory mechanism for this too

Get rid of the cancel_watcher method with exception in main thread

Handle academic case of sigterm race condition

Process cancelation as control signal

Fully connect cancel method and run_dispatcher to control

Never transition workflows directly to canceled, add logs
@AlanCoding (Member Author) commented

I'm now working to test AlanCoding#48, which is an evolution of #12653 rebased on top of these changes. Much of those changes are non-trivial.

@AlanCoding (Member Author) commented

I wanted to be extra sure that merging this doesn't mess anything up. So let me explain some of the testing and where this led me.

control-standalone

This scenario has been very stable on this branch, and fixes 500 errors from some scenarios that run it up to near capacity. I don't know for sure what caused us to get closer to the connection limit, but I suspect it could be the change that added 7 additional max workers.

hybrid-cluster

This looks considerably better than the reference runs, because I think the reference runs have similar 500 errors.

control-hop-to-execution-nodes cluster

This branch had one more failure than the reference run, in a particularly concerning place: a workflow that was canceled and expected the jobs in it to be marked as canceled. I dove into this bug and found that canceling a job does not work as expected while its project sync is running, because the project sync reads the signal as a plain SIGTERM; from its perspective this contradicts the cancel_flag, so it falls into the special handling for a cluster shutdown (which is not the case here). The rework here places an additional burden on the fail chains: we need to set the cancel_flag for anything that might be the active process.
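
Roughly, this means the SIGTERM handler in the worker needs to consult cancel_flag before deciding how to react. A hypothetical sketch (the exception names are illustrative, not AWX's):

class CancelRequested(Exception):
    pass


class ShutdownRequested(Exception):
    pass


def make_sigterm_handler(unified_job):
    def handler(signum, frame):
        # Re-read the flag; the fail chain is expected to set it on a user cancel.
        unified_job.refresh_from_db(fields=['cancel_flag'])
        if unified_job.cancel_flag:
            raise CancelRequested()
        # No cancel flag: treat the SIGTERM as a node/cluster shutdown instead.
        raise ShutdownRequested()
    return handler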

Because of that, I pushed an additional commit which I found fixes this. I developed a minimal test case to hit the issue, and found that even devel doesn't handle canceling the project sync in this situation. So I'm going to need to gather up all these discoveries and document them.

With the additional commit, I feel safe about merging now.
