
Fix job canceling when getting stuck reading or writing receptor socket #12653

Closed · shanemcd wants to merge 3 commits

Conversation

shanemcd (Member)

SUMMARY

In a recent customer case, the customer reported being unable to cancel a running job. After looking at the code, there were 2 additional places (submit_work and get_work_results) where it is possible to get stuck on a read or write of the receptor socket. This patch introduces a new pattern that lets us eject whenever we might be stuck.
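The general shape of that pattern, as a minimal sketch (the helper name, poll interval, and cancel callback are illustrative, not the actual AWX code): run the blocking receptor operation in a worker thread, poll a cancel callback from the caller, and shut the socket down to force a stuck read or write to return.

```python
import concurrent.futures
import socket


def run_with_cancel(blocking_call, sock, cancel_requested, poll_interval=1.0):
    """Run a blocking receptor-socket operation in a worker thread, polling a
    cancel callback so we can eject if the read/write never returns."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(blocking_call)
        while True:
            try:
                return future.result(timeout=poll_interval)
            except concurrent.futures.TimeoutError:
                if cancel_requested():
                    # Shutting the socket down makes a blocked recv()/send()
                    # in the worker thread return, so the executor can exit.
                    sock.shutdown(socket.SHUT_RDWR)
                    sock.close()
                    raise RuntimeError("job canceled while talking to receptor")
```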

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

@AlanCoding (Member)

In that case, was it confirmed that the job process was really running? It would be good to get strace output to find out what specifically the control process is hanging on. I agree with your assessment, and this adds good general defense at code points that might result in hanging, but there are other scenarios for jobs stuck in running.

```python
try:
    ansible_runner.interface.run(streamer='transmit', _output=_socket.makefile('wb'), **self.runner_params)
finally:
    # Socket must be shutdown here, or the reader will hang forever.
    _socket.shutdown(socket.SHUT_WR)

@cleanup_new_process
def submitter(self, payload_reader):
    # Prepare the submit_work kwargs before creating threads, because references to settings are not thread-safe
```
The comment doesn't work, because you are already in a thread at this point.
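Aside from its placement, the shutdown(socket.SHUT_WR) in the snippet above is doing real work; a toy sketch (not AWX code) of the EOF behavior it relies on:

```python
import socket
import threading


def reader(sock):
    # Loops on recv() until EOF, which only arrives once the writer
    # shuts down (or closes) its write half of the connection.
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    print(b"".join(chunks))


a, b = socket.socketpair()
t = threading.Thread(target=reader, args=(b,))
t.start()
try:
    a.sendall(b"transmit payload")
finally:
    a.shutdown(socket.SHUT_WR)  # without this, the reader blocks in recv() forever
t.join()
```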

```python
try:
    receptor_ctl.simple_command(f"work release {self.unit_id}")
    self.receptor_ctl.simple_command(f"work release {unit_id}")
```
If the concept is that, generally, receptor commands are not safe from hangs, then I expect that would apply to receptorctl work release as well. I believe I've had it hang here before.
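A sketch of what guarding it could look like, reusing the same future-plus-timeout idea from this PR (release_with_timeout and the 30-second timeout are made up for illustration):

```python
import concurrent.futures


def release_with_timeout(receptor_ctl, unit_id, timeout=30):
    """Guard 'work release' the same way submit_work / get_work_results are
    guarded, instead of assuming the control socket never hangs."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(receptor_ctl.simple_command, f"work release {unit_id}")
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Closing the control socket unblocks the stuck worker thread;
        # the release itself is abandoned at this point.
        receptor_ctl.close()
        raise
    finally:
        executor.shutdown(wait=False)
```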

```python
if res and res.status == "canceled":
    return res, unit_id

resultsock, resultfile = work_results_future.result()
```
All work_results does is return the socket. It seems somewhat extraordinary that it would be hanging.

https://github.com/ansible/receptor/blob/ba3ed4532509e9d92a4d0bd89c0a284ee13f58a0/receptorctl/receptorctl/socket_interface.py#L248

It does read from the general communication socket to receptor, so it absolutely could hang.
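Given that, it seems worth not calling work_results_future.result() unbounded; a hedged sketch (the cancel_requested callback and poll interval are placeholders, not the actual code):

```python
import concurrent.futures


def wait_for_results(work_results_future, cancel_requested, receptor_ctl, poll=5):
    """Poll the work_results future instead of blocking forever, so a cancel
    request can still eject us while receptor is unresponsive."""
    while True:
        try:
            return work_results_future.result(timeout=poll)
        except concurrent.futures.TimeoutError:
            if cancel_requested():
                # Closing the control socket unblocks the thread stuck
                # inside receptorctl's get_work_results read loop.
                receptor_ctl.close()
                raise RuntimeError("canceled while waiting for work results")
```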

```python
payload_reader_file.close()
self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR)
self.receptor_ctl._socket.close()
self.receptor_ctl._sockfile.close()
```
Somewhat of a tangent - I have often thought it would be better to call self.receptor_ctl.close() to replace the 3 lines above. It's written defensively enough, and would be clearer / more DRY.

self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR) is not called in receptorctl's close, so we could keep that line. But the other 2 lines could be replaced.
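Putting the two comments together, the suggested shape would be something like this (untested sketch, not a committed change):

```python
payload_reader_file.close()
# shutdown() is not part of receptorctl's close(), so keep it explicit;
# close() then takes care of both the socket and its file wrapper.
self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR)
self.receptor_ctl.close()
```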

```python
# This ThreadPoolExecutor runs for the duration of the job.
# The cancel_func pattern is intended to guard against any situation where we may
# end up stuck while reading or writing from the receptor socket.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
```
I checked out your branch and ran some graphs.

[Screenshot from 2022-08-16 08-32-44: graph of database connection counts]

This is for running 5 jobs. The increase in database connections is 30 - 10 = 20, so 20 / 5 = 4 connections per job, one per thread, as expected. This is because max_workers=3 doesn't include the main thread, so there are 4 threads in total, and without connection pooling between threads a new connection is created for each. Currently it's 3 per job, and particularly if we're backporting, increasing this number by 1 will be disastrous for some large (or poorly balanced) deployments.

Really, it should be either 1 or 0. But for now, I'm going to pull out some of my tricks to suggest ways to mitigate what you're doing here, because watching for cancels during the transmit phase is something I know we were lacking.

@AlanCoding (Member)

Linking shanemcd#75; I may still have more changes I want to see after this.

At this stage I might also try to run some tests.

Use cancel_watcher as method, not thread
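For reference, the idea behind that commit, as a rough sketch (class and method names here are illustrative, not the actual change): the cancel check is called from the loop that is already waiting on futures, rather than given its own thread and, with it, its own database connection.

```python
import concurrent.futures


class ExampleTask:
    """Illustrative only: the cancel check runs in the main thread's wait
    loop instead of a dedicated watcher thread, so no extra thread (and no
    extra database connection) is needed per job."""

    def cancel_watcher(self):
        # In the real code this would check the job's cancel flag.
        return False

    def wait_or_cancel(self, future, poll=1.0):
        while True:
            try:
                return future.result(timeout=poll)
            except concurrent.futures.TimeoutError:
                if self.cancel_watcher():
                    raise RuntimeError("job canceled")
```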