Gunicorn doesn't shutdown immediately on receiving SIGTERM #1256

Closed
suriya opened this Issue May 9, 2016 · 15 comments

Projects

None yet

3 participants

@suriya
Contributor
suriya commented May 9, 2016

I am seeing a weird behavior where Gunicorn does not shutdown immediately when sent a SIGTERM. I see this with Python 3.5 on Ubuntu 16.04. The following environments (that I have access to) do not exhibit this issue:

  • Python 2.7 on Ubuntu 16.04
  • Python 2.7 on Ubuntu 14.04
  • Python 3.4 on Ubuntu 14.04

Refer to https://gist.github.com/suriya/1d9a11ea9ad33a5b1df0b2ba7ed0cd0e on how to reproduce this issue.

After debugging a while, I found that the Gunicorn arbiter properly signals the child worker. The child workers handle the signal properly as well (on all versions). However, something prevents the loop in run_for_one function https://github.com/benoitc/gunicorn/blob/9d158bec94c91c62cd5cc28a27268a5b8804d795/gunicorn/workers/sync.py#L58 from exiting right away with Python 3.5.

Could Python 3.5's change in the behavior of select.select() https://docs.python.org/3/library/select.html#select.select explain this issue.

@tilgovi
Collaborator
tilgovi commented May 9, 2016

Do you have open connections when this happens? If so, that is expected. SIGTERM is a graceful shutdown and subject to the --graceful-timeout setting (default 30s).

At present, at least on the master branch, I observe a worker exiting within 1s of sending SIGTERM to the arbiter, tested on Python 3.5.

@suriya
Contributor
suriya commented May 10, 2016 edited

There is no open connection. As shown in the gist, All that the script does is start gunicorn and send a SIGTERM.

@suriya
Contributor
suriya commented May 10, 2016

Happens to me with the master branch as well. This is with Python 3.5 on Ubuntu 16.04. Maybe, this happens because of Ubuntu.

$ python --version
Python 3.5.1+
$ gunicorn --version
gunicorn (version 19.5.0)
$ time ./run-and-stop-gunicorn.sh 
[2016-05-10 01:32:59 +0000] [8537] [INFO] Starting gunicorn 19.5.0
[2016-05-10 01:32:59 +0000] [8537] [INFO] Listening at: http://127.0.0.1:8000 (8537)
[2016-05-10 01:32:59 +0000] [8537] [INFO] Using worker: sync
[2016-05-10 01:32:59 +0000] [8541] [INFO] Booting worker with pid: 8541
[2016-05-10 01:33:00 +0000] [8537] [INFO] Handling signal: term
[2016-05-10 01:33:30 +0000] [8537] [INFO] Shutting down: Master

real    0m31.102s
user    0m0.320s
sys 0m0.012s
@suriya
Contributor
suriya commented May 10, 2016

The same behavior happens with Python 3.5, Gunicorn (master, commit 9d158be), Centos 7.

$ time ./run-and-stop-gunicorn.sh 
CentOS Linux release 7.2.1511 (Core) 
Python 3.5.1
[2016-05-10 04:25:12 +0000] [1448] [INFO] Starting gunicorn 19.5.0
[2016-05-10 04:25:12 +0000] [1448] [INFO] Listening at: http://127.0.0.1:8000 (1448)
[2016-05-10 04:25:12 +0000] [1448] [INFO] Using worker: sync
[2016-05-10 04:25:12 +0000] [1452] [INFO] Booting worker with pid: 1452
[2016-05-10 04:25:13 +0000] [1448] [INFO] Handling signal: term
[2016-05-10 04:25:43 +0000] [1448] [INFO] Shutting down: Master

real    0m31.109s
user    0m0.231s
sys 0m0.053s

Python 3.4 does not have this issue.

$ time ./run-and-stop-gunicorn.sh 
CentOS Linux release 7.2.1511 (Core) 
Python 3.4.4
[2016-05-10 04:24:09 +0000] [1386] [INFO] Starting gunicorn 19.5.0
[2016-05-10 04:24:09 +0000] [1386] [INFO] Listening at: http://127.0.0.1:8000 (1386)
[2016-05-10 04:24:09 +0000] [1386] [INFO] Using worker: sync
[2016-05-10 04:24:09 +0000] [1390] [INFO] Booting worker with pid: 1390
[2016-05-10 04:24:09 +0000] [1386] [INFO] Handling signal: term
[2016-05-10 04:24:09 +0000] [1390] [INFO] Worker exiting (pid: 1390)
[2016-05-10 04:24:09 +0000] [1386] [INFO] Shutting down: Master

real    0m1.103s
user    0m0.240s
sys 0m0.058s
@tilgovi
Collaborator
tilgovi commented May 10, 2016

You're right. I do observe this on Python 3. Perhaps I made a mistake earlier when I was testing.

@tilgovi
Collaborator
tilgovi commented May 10, 2016

I also agree with the diagnosis.

@tilgovi
Collaborator
tilgovi commented May 10, 2016

I don't believe this affects any of the async workers or the threaded worker.

@suriya
Contributor
suriya commented May 10, 2016

PEP 475 https://www.python.org/dev/peps/pep-0475/#backward-compatibility has some information how this should be addressed.

Applications relying on the fact that system calls are interrupted with InterruptedError will hang. The authors of this PEP don't think that such applications exist, since they would be exposed to other issues such as race conditions (there is an opportunity for deadlock if the signal comes before the system call). Besides, such code would be non-portable.

In any case, those applications must be fixed to handle signals differently, to have a reliable behaviour on all platforms and all Python versions. A possible strategy is to set up a signal handler raising a well-defined exception, or use a wakeup file descriptor.

For applications using event loops, signal.set_wakeup_fd() is the recommanded option to handle signals. Python's low-level signal handler will write signal numbers into the file descriptor and the event loop will be awaken to read them. The event loop can handle those signals without the restriction of signal handlers (for example, the loop can be woken up in any thread, not just the main thread).

@benoitc
Owner
benoitc commented May 10, 2016

Just tested it and was able to reproduce it. At least for me gunicorn exit is delayed. Removing the --timeout arguments makes the problem gone.

After debugging a little, it appears that after receiving the signal, the select is indeed blocking there:
https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/sync.py#L35

I am not sure yet how to solve the issue, I will reread the PEP 475 carefully this morning to see if it can give a solution, thanks for the link!

@benoitc benoitc added a commit that referenced this issue May 10, 2016
@benoitc unblock the wait loop under python 3.5
in python 3.5 the select is blocking when waiting for it which prevent quick
exit on SIGTERM.

The problem is described:
https://www.python.org/dev/peps/pep-0475/#backward-compatibility

This change fix it by listening for signal event on the worker pipe. Once an
event is triggered it will forcefully wake up the select and return.

fix #1256
b0c0333
@benoitc
Owner
benoitc commented May 10, 2016

@suriya @tilgovi please test #1257 and let me know :)

@suriya
Contributor
suriya commented May 10, 2016

The test script run-and-stop-gunicorn.sh behaves identically with Python 3.5 and Python 3.4. The pull request addresses the SIGTERM issue.

@benoitc Are there other parts of the gunicorn that break because of PEP 475?

@benoitc
Owner
benoitc commented May 10, 2016

@suriya not to my knowledge, I tested with other workers and it was fine.

@benoitc
Owner
benoitc commented May 10, 2016

@suriya so the patch is fixing your issue right?

@suriya
Contributor
suriya commented May 10, 2016

@benoitc Yes, absolutely fixing the issue. Thank you.

I was just wondering if there would be other places that might need to be fixed since the effects of PEP 475 seem to be pervasive.

@benoitc
Owner
benoitc commented May 10, 2016

@suriya normally not, but will keep an eye on it :-)

@benoitc benoitc closed this in #1257 May 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment