New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gunicorn doesn't shutdown immediately on receiving SIGTERM #1256

Closed
suriya opened this Issue May 9, 2016 · 16 comments

Comments

Projects
None yet
4 participants
@suriya
Contributor

suriya commented May 9, 2016

I am seeing a weird behavior where Gunicorn does not shutdown immediately when sent a SIGTERM. I see this with Python 3.5 on Ubuntu 16.04. The following environments (that I have access to) do not exhibit this issue:

  • Python 2.7 on Ubuntu 16.04
  • Python 2.7 on Ubuntu 14.04
  • Python 3.4 on Ubuntu 14.04

Refer to https://gist.github.com/suriya/1d9a11ea9ad33a5b1df0b2ba7ed0cd0e on how to reproduce this issue.

After debugging a while, I found that the Gunicorn arbiter properly signals the child worker. The child workers handle the signal properly as well (on all versions). However, something prevents the loop in run_for_one function

while self.alive:
from exiting right away with Python 3.5.

Could Python 3.5's change in the behavior of select.select() https://docs.python.org/3/library/select.html#select.select explain this issue.

@tilgovi

This comment has been minimized.

Show comment
Hide comment
@tilgovi

tilgovi May 9, 2016

Collaborator

Do you have open connections when this happens? If so, that is expected. SIGTERM is a graceful shutdown and subject to the --graceful-timeout setting (default 30s).

At present, at least on the master branch, I observe a worker exiting within 1s of sending SIGTERM to the arbiter, tested on Python 3.5.

Collaborator

tilgovi commented May 9, 2016

Do you have open connections when this happens? If so, that is expected. SIGTERM is a graceful shutdown and subject to the --graceful-timeout setting (default 30s).

At present, at least on the master branch, I observe a worker exiting within 1s of sending SIGTERM to the arbiter, tested on Python 3.5.

@suriya

This comment has been minimized.

Show comment
Hide comment
@suriya

suriya May 10, 2016

Contributor

There is no open connection. As shown in the gist, All that the script does is start gunicorn and send a SIGTERM.

Contributor

suriya commented May 10, 2016

There is no open connection. As shown in the gist, All that the script does is start gunicorn and send a SIGTERM.

@suriya

This comment has been minimized.

Show comment
Hide comment
@suriya

suriya May 10, 2016

Contributor

Happens to me with the master branch as well. This is with Python 3.5 on Ubuntu 16.04. Maybe, this happens because of Ubuntu.

$ python --version
Python 3.5.1+
$ gunicorn --version
gunicorn (version 19.5.0)
$ time ./run-and-stop-gunicorn.sh 
[2016-05-10 01:32:59 +0000] [8537] [INFO] Starting gunicorn 19.5.0
[2016-05-10 01:32:59 +0000] [8537] [INFO] Listening at: http://127.0.0.1:8000 (8537)
[2016-05-10 01:32:59 +0000] [8537] [INFO] Using worker: sync
[2016-05-10 01:32:59 +0000] [8541] [INFO] Booting worker with pid: 8541
[2016-05-10 01:33:00 +0000] [8537] [INFO] Handling signal: term
[2016-05-10 01:33:30 +0000] [8537] [INFO] Shutting down: Master

real    0m31.102s
user    0m0.320s
sys 0m0.012s
Contributor

suriya commented May 10, 2016

Happens to me with the master branch as well. This is with Python 3.5 on Ubuntu 16.04. Maybe, this happens because of Ubuntu.

$ python --version
Python 3.5.1+
$ gunicorn --version
gunicorn (version 19.5.0)
$ time ./run-and-stop-gunicorn.sh 
[2016-05-10 01:32:59 +0000] [8537] [INFO] Starting gunicorn 19.5.0
[2016-05-10 01:32:59 +0000] [8537] [INFO] Listening at: http://127.0.0.1:8000 (8537)
[2016-05-10 01:32:59 +0000] [8537] [INFO] Using worker: sync
[2016-05-10 01:32:59 +0000] [8541] [INFO] Booting worker with pid: 8541
[2016-05-10 01:33:00 +0000] [8537] [INFO] Handling signal: term
[2016-05-10 01:33:30 +0000] [8537] [INFO] Shutting down: Master

real    0m31.102s
user    0m0.320s
sys 0m0.012s
@suriya

This comment has been minimized.

Show comment
Hide comment
@suriya

suriya May 10, 2016

Contributor

The same behavior happens with Python 3.5, Gunicorn (master, commit 9d158be), Centos 7.

$ time ./run-and-stop-gunicorn.sh 
CentOS Linux release 7.2.1511 (Core) 
Python 3.5.1
[2016-05-10 04:25:12 +0000] [1448] [INFO] Starting gunicorn 19.5.0
[2016-05-10 04:25:12 +0000] [1448] [INFO] Listening at: http://127.0.0.1:8000 (1448)
[2016-05-10 04:25:12 +0000] [1448] [INFO] Using worker: sync
[2016-05-10 04:25:12 +0000] [1452] [INFO] Booting worker with pid: 1452
[2016-05-10 04:25:13 +0000] [1448] [INFO] Handling signal: term
[2016-05-10 04:25:43 +0000] [1448] [INFO] Shutting down: Master

real    0m31.109s
user    0m0.231s
sys 0m0.053s

Python 3.4 does not have this issue.

$ time ./run-and-stop-gunicorn.sh 
CentOS Linux release 7.2.1511 (Core) 
Python 3.4.4
[2016-05-10 04:24:09 +0000] [1386] [INFO] Starting gunicorn 19.5.0
[2016-05-10 04:24:09 +0000] [1386] [INFO] Listening at: http://127.0.0.1:8000 (1386)
[2016-05-10 04:24:09 +0000] [1386] [INFO] Using worker: sync
[2016-05-10 04:24:09 +0000] [1390] [INFO] Booting worker with pid: 1390
[2016-05-10 04:24:09 +0000] [1386] [INFO] Handling signal: term
[2016-05-10 04:24:09 +0000] [1390] [INFO] Worker exiting (pid: 1390)
[2016-05-10 04:24:09 +0000] [1386] [INFO] Shutting down: Master

real    0m1.103s
user    0m0.240s
sys 0m0.058s
Contributor

suriya commented May 10, 2016

The same behavior happens with Python 3.5, Gunicorn (master, commit 9d158be), Centos 7.

$ time ./run-and-stop-gunicorn.sh 
CentOS Linux release 7.2.1511 (Core) 
Python 3.5.1
[2016-05-10 04:25:12 +0000] [1448] [INFO] Starting gunicorn 19.5.0
[2016-05-10 04:25:12 +0000] [1448] [INFO] Listening at: http://127.0.0.1:8000 (1448)
[2016-05-10 04:25:12 +0000] [1448] [INFO] Using worker: sync
[2016-05-10 04:25:12 +0000] [1452] [INFO] Booting worker with pid: 1452
[2016-05-10 04:25:13 +0000] [1448] [INFO] Handling signal: term
[2016-05-10 04:25:43 +0000] [1448] [INFO] Shutting down: Master

real    0m31.109s
user    0m0.231s
sys 0m0.053s

Python 3.4 does not have this issue.

$ time ./run-and-stop-gunicorn.sh 
CentOS Linux release 7.2.1511 (Core) 
Python 3.4.4
[2016-05-10 04:24:09 +0000] [1386] [INFO] Starting gunicorn 19.5.0
[2016-05-10 04:24:09 +0000] [1386] [INFO] Listening at: http://127.0.0.1:8000 (1386)
[2016-05-10 04:24:09 +0000] [1386] [INFO] Using worker: sync
[2016-05-10 04:24:09 +0000] [1390] [INFO] Booting worker with pid: 1390
[2016-05-10 04:24:09 +0000] [1386] [INFO] Handling signal: term
[2016-05-10 04:24:09 +0000] [1390] [INFO] Worker exiting (pid: 1390)
[2016-05-10 04:24:09 +0000] [1386] [INFO] Shutting down: Master

real    0m1.103s
user    0m0.240s
sys 0m0.058s
@tilgovi

This comment has been minimized.

Show comment
Hide comment
@tilgovi

tilgovi May 10, 2016

Collaborator

You're right. I do observe this on Python 3. Perhaps I made a mistake earlier when I was testing.

Collaborator

tilgovi commented May 10, 2016

You're right. I do observe this on Python 3. Perhaps I made a mistake earlier when I was testing.

@tilgovi

This comment has been minimized.

Show comment
Hide comment
@tilgovi

tilgovi May 10, 2016

Collaborator

I also agree with the diagnosis.

Collaborator

tilgovi commented May 10, 2016

I also agree with the diagnosis.

@tilgovi

This comment has been minimized.

Show comment
Hide comment
@tilgovi

tilgovi May 10, 2016

Collaborator

I don't believe this affects any of the async workers or the threaded worker.

Collaborator

tilgovi commented May 10, 2016

I don't believe this affects any of the async workers or the threaded worker.

@suriya

This comment has been minimized.

Show comment
Hide comment
@suriya

suriya May 10, 2016

Contributor

PEP 475 https://www.python.org/dev/peps/pep-0475/#backward-compatibility has some information how this should be addressed.

Applications relying on the fact that system calls are interrupted with InterruptedError will hang. The authors of this PEP don't think that such applications exist, since they would be exposed to other issues such as race conditions (there is an opportunity for deadlock if the signal comes before the system call). Besides, such code would be non-portable.

In any case, those applications must be fixed to handle signals differently, to have a reliable behaviour on all platforms and all Python versions. A possible strategy is to set up a signal handler raising a well-defined exception, or use a wakeup file descriptor.

For applications using event loops, signal.set_wakeup_fd() is the recommanded option to handle signals. Python's low-level signal handler will write signal numbers into the file descriptor and the event loop will be awaken to read them. The event loop can handle those signals without the restriction of signal handlers (for example, the loop can be woken up in any thread, not just the main thread).

Contributor

suriya commented May 10, 2016

PEP 475 https://www.python.org/dev/peps/pep-0475/#backward-compatibility has some information how this should be addressed.

Applications relying on the fact that system calls are interrupted with InterruptedError will hang. The authors of this PEP don't think that such applications exist, since they would be exposed to other issues such as race conditions (there is an opportunity for deadlock if the signal comes before the system call). Besides, such code would be non-portable.

In any case, those applications must be fixed to handle signals differently, to have a reliable behaviour on all platforms and all Python versions. A possible strategy is to set up a signal handler raising a well-defined exception, or use a wakeup file descriptor.

For applications using event loops, signal.set_wakeup_fd() is the recommanded option to handle signals. Python's low-level signal handler will write signal numbers into the file descriptor and the event loop will be awaken to read them. The event loop can handle those signals without the restriction of signal handlers (for example, the loop can be woken up in any thread, not just the main thread).

@benoitc

This comment has been minimized.

Show comment
Hide comment
@benoitc

benoitc May 10, 2016

Owner

Just tested it and was able to reproduce it. At least for me gunicorn exit is delayed. Removing the --timeout arguments makes the problem gone.

After debugging a little, it appears that after receiving the signal, the select is indeed blocking there:
https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/sync.py#L35

I am not sure yet how to solve the issue, I will reread the PEP 475 carefully this morning to see if it can give a solution, thanks for the link!

Owner

benoitc commented May 10, 2016

Just tested it and was able to reproduce it. At least for me gunicorn exit is delayed. Removing the --timeout arguments makes the problem gone.

After debugging a little, it appears that after receiving the signal, the select is indeed blocking there:
https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/sync.py#L35

I am not sure yet how to solve the issue, I will reread the PEP 475 carefully this morning to see if it can give a solution, thanks for the link!

benoitc added a commit that referenced this issue May 10, 2016

unblock the wait loop under python 3.5
in python 3.5 the select is blocking when waiting for it which prevent quick
exit on SIGTERM.

The problem is described:
https://www.python.org/dev/peps/pep-0475/#backward-compatibility

This change fix it by listening for signal event on the worker pipe. Once an
event is triggered it will forcefully wake up the select and return.

fix #1256
@benoitc

This comment has been minimized.

Show comment
Hide comment
@benoitc

benoitc May 10, 2016

Owner

@suriya @tilgovi please test #1257 and let me know :)

Owner

benoitc commented May 10, 2016

@suriya @tilgovi please test #1257 and let me know :)

@suriya

This comment has been minimized.

Show comment
Hide comment
@suriya

suriya May 10, 2016

Contributor

The test script run-and-stop-gunicorn.sh behaves identically with Python 3.5 and Python 3.4. The pull request addresses the SIGTERM issue.

@benoitc Are there other parts of the gunicorn that break because of PEP 475?

Contributor

suriya commented May 10, 2016

The test script run-and-stop-gunicorn.sh behaves identically with Python 3.5 and Python 3.4. The pull request addresses the SIGTERM issue.

@benoitc Are there other parts of the gunicorn that break because of PEP 475?

@benoitc

This comment has been minimized.

Show comment
Hide comment
@benoitc

benoitc May 10, 2016

Owner

@suriya not to my knowledge, I tested with other workers and it was fine.

Owner

benoitc commented May 10, 2016

@suriya not to my knowledge, I tested with other workers and it was fine.

@benoitc

This comment has been minimized.

Show comment
Hide comment
@benoitc

benoitc May 10, 2016

Owner

@suriya so the patch is fixing your issue right?

Owner

benoitc commented May 10, 2016

@suriya so the patch is fixing your issue right?

@suriya

This comment has been minimized.

Show comment
Hide comment
@suriya

suriya May 10, 2016

Contributor

@benoitc Yes, absolutely fixing the issue. Thank you.

I was just wondering if there would be other places that might need to be fixed since the effects of PEP 475 seem to be pervasive.

Contributor

suriya commented May 10, 2016

@benoitc Yes, absolutely fixing the issue. Thank you.

I was just wondering if there would be other places that might need to be fixed since the effects of PEP 475 seem to be pervasive.

@benoitc

This comment has been minimized.

Show comment
Hide comment
@benoitc

benoitc May 10, 2016

Owner

@suriya normally not, but will keep an eye on it :-)

Owner

benoitc commented May 10, 2016

@suriya normally not, but will keep an eye on it :-)

@benoitc benoitc closed this in #1257 May 10, 2016

fofanov pushed a commit to fofanov/gunicorn that referenced this issue Mar 16, 2018

unblock the wait loop under python 3.5
in python 3.5 the select is blocking when waiting for it which prevent quick
exit on SIGTERM.

The problem is described:
https://www.python.org/dev/peps/pep-0475/#backward-compatibility

This change fix it by listening for signal event on the worker pipe. Once an
event is triggered it will forcefully wake up the select and return.

fix #1256
@confiq

This comment has been minimized.

Show comment
Hide comment
@confiq

confiq Jun 19, 2018

Contributor

For those who arrive here and wonder in what version in gunicorn this was fixed, it's 19.5.0

Contributor

confiq commented Jun 19, 2018

For those who arrive here and wonder in what version in gunicorn this was fixed, it's 19.5.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment