
Fix job canceling when getting stuck reading or writing receptor socket #12653

Closed · shanemcd wants to merge 3 commits

Conversation

shanemcd (Member)

SUMMARY

In a recent customer case, the customer reported being unable to cancel a running job. After looking at the code, there were 2 additional places (submit_work and get_work_results) where it is possible to get stuck on a read or write of the receptor socket. This patch introduces a new pattern that lets us eject whenever we might be stuck.
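The general shape of that pattern, as a minimal sketch (the helper name, poll interval, and cancel callback are illustrative, not the actual AWX code): run the blocking receptor operation in a worker thread, poll a cancel callback from the caller, and shut the socket down to force a stuck read or write to return.

```python
import concurrent.futures
import socket


def run_with_cancel(blocking_call, sock, cancel_requested, poll_interval=1.0):
    """Run a blocking receptor-socket operation in a worker thread, polling a
    cancel callback so we can eject if the read/write never returns."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(blocking_call)
        while True:
            try:
                return future.result(timeout=poll_interval)
            except concurrent.futures.TimeoutError:
                if cancel_requested():
                    # Shutting the socket down makes a blocked recv()/send()
                    # in the worker thread return, so the executor can exit.
                    sock.shutdown(socket.SHUT_RDWR)
                    sock.close()
                    raise RuntimeError("job canceled while talking to receptor")
```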

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

@AlanCoding (Member)

In that case, was it confirmed that the job process was really running? It would be good to get strace output to find out what specifically the control process is hanging on. I agree with your assessment, and this adds good general defense at code points that might result in hanging, but there are other scenarios for jobs stuck in running.

```python
try:
    ansible_runner.interface.run(streamer='transmit', _output=_socket.makefile('wb'), **self.runner_params)
finally:
    # Socket must be shutdown here, or the reader will hang forever.
    _socket.shutdown(socket.SHUT_WR)

@cleanup_new_process
def submitter(self, payload_reader):
    # Prepare the submit_work kwargs before creating threads, because references to settings are not thread-safe
```
The comment doesn't work, because you are already in a thread at this point.
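Aside from its placement, the shutdown(socket.SHUT_WR) in the snippet above is doing real work; a toy sketch (not AWX code) of the EOF behavior it relies on:

```python
import socket
import threading


def reader(sock):
    # Loops on recv() until EOF, which only arrives once the writer
    # shuts down (or closes) its write half of the connection.
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    print(b"".join(chunks))


a, b = socket.socketpair()
t = threading.Thread(target=reader, args=(b,))
t.start()
try:
    a.sendall(b"transmit payload")
finally:
    a.shutdown(socket.SHUT_WR)  # without this, the reader blocks in recv() forever
t.join()
```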

```python
try:
    receptor_ctl.simple_command(f"work release {self.unit_id}")
    self.receptor_ctl.simple_command(f"work release {unit_id}")
```
If the concept is that, generally, receptor commands are not safe from hangs, then I expect that would apply to receptorctl work release as well. I believe I've had it hang here before.
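A sketch of what guarding it could look like, reusing the same future-plus-timeout idea from this PR (release_with_timeout and the 30-second timeout are made up for illustration):

```python
import concurrent.futures


def release_with_timeout(receptor_ctl, unit_id, timeout=30):
    """Guard 'work release' the same way submit_work / get_work_results are
    guarded, instead of assuming the control socket never hangs."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(receptor_ctl.simple_command, f"work release {unit_id}")
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Closing the control socket unblocks the stuck worker thread;
        # the release itself is abandoned at this point.
        receptor_ctl.close()
        raise
    finally:
        executor.shutdown(wait=False)
```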

```python
if res and res.status == "canceled":
    return res, unit_id

resultsock, resultfile = work_results_future.result()
```
All work_results does is return the socket. It seems somewhat extraordinary that it would be hanging.

https://github.com/ansible/receptor/blob/ba3ed4532509e9d92a4d0bd89c0a284ee13f58a0/receptorctl/receptorctl/socket_interface.py#L248

It does read from the general communication socket to receptor, so it absolutely could hang.
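Given that, it seems worth not calling work_results_future.result() unbounded; a hedged sketch (the cancel_requested callback and poll interval are placeholders, not the actual code):

```python
import concurrent.futures


def wait_for_results(work_results_future, cancel_requested, receptor_ctl, poll=5):
    """Poll the work_results future instead of blocking forever, so a cancel
    request can still eject us while receptor is unresponsive."""
    while True:
        try:
            return work_results_future.result(timeout=poll)
        except concurrent.futures.TimeoutError:
            if cancel_requested():
                # Closing the control socket unblocks the thread stuck
                # inside receptorctl's get_work_results read loop.
                receptor_ctl.close()
                raise RuntimeError("canceled while waiting for work results")
```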

```python
payload_reader_file.close()
self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR)
self.receptor_ctl._socket.close()
self.receptor_ctl._sockfile.close()
```
Somewhat of a tangent - I have often thought it would be better to call self.receptor_ctl.close() to replace the 3 lines above. It's written defensively enough, and would be clearer / more DRY.

self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR) is not called in receptorctl's close, so we could keep that line. But the other 2 lines could be replaced.
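Putting the two comments together, the suggested shape would be something like this (untested sketch, not a committed change):

```python
payload_reader_file.close()
# shutdown() is not part of receptorctl's close(), so keep it explicit;
# close() then takes care of both the socket and its file wrapper.
self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR)
self.receptor_ctl.close()
```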

```python
# This ThreadPoolExecutor runs for the duration of the job.
# The cancel_func pattern is intended to guard against any situation where we may
# end up stuck while reading or writing from the receptor socket.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
```
I checked out your branch and ran some graphs.

[Screenshot from 2022-08-16 08-32-44: graph of database connection counts]

This is for running 5 jobs. The increase in database connections is 30 - 10 = 20, so 20 / 5 = 4 connections per job, one per thread, as expected. This is because max_workers=3 doesn't include the main thread, so there are 4 threads in total, and without connection pooling between threads a new connection is created for each. Currently it's 3 per job, and particularly if we're backporting, increasing this number by 1 will be disastrous for some large (or poorly balanced) deployments.

Really, it should be either 1 or 0. But for now, I'm going to pull out some of my tricks to suggest ways to mitigate what you're doing here, because watching for cancels during the transmit phase is something I know we were lacking.

@AlanCoding (Member)

Linking shanemcd#75; I may still have more changes I want to see after this.

At this stage I might also try to run some tests.

Use cancel_watcher as method, not thread
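For reference, the idea behind that commit, as a rough sketch (class and method names here are illustrative, not the actual change): the cancel check is called from the loop that is already waiting on futures, rather than given its own thread and, with it, its own database connection.

```python
import concurrent.futures


class ExampleTask:
    """Illustrative only: the cancel check runs in the main thread's wait
    loop instead of a dedicated watcher thread, so no extra thread (and no
    extra database connection) is needed per job."""

    def cancel_watcher(self):
        # In the real code this would check the job's cancel flag.
        return False

    def wait_or_cancel(self, future, poll=1.0):
        while True:
            try:
                return future.result(timeout=poll)
            except concurrent.futures.TimeoutError:
                if self.cancel_watcher():
                    raise RuntimeError("job canceled")
```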