New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRITICAL WORKER TIMEOUT when running Flask app #1801

Open
bigunyak opened this Issue Jun 5, 2018 · 30 comments

Comments

Projects
None yet
@bigunyak

bigunyak commented Jun 5, 2018

It seems there have been already several reports related to [CRITICAL] WORKER TIMEOUT error but it just keeps popping up. Here is my issue.

I'm running this Flask hello world application.

from flask import Flask
application = Flask(__name__)

@application.route('/')
def hello_world():
    return 'Hello, World!'

The gunicorn command is this one:

gunicorn -b 0.0.0.0:5000 --log-level=debug hello

And this is the console output:

[2018-06-05 14:56:21 +0200] [11229] [INFO] Starting gunicorn 19.8.1
[2018-06-05 14:56:21 +0200] [11229] [DEBUG] Arbiter booted
[2018-06-05 14:56:21 +0200] [11229] [INFO] Listening at: http://0.0.0.0:5000 (11229)
[2018-06-05 14:56:21 +0200] [11229] [INFO] Using worker: sync
[2018-06-05 14:56:21 +0200] [11232] [INFO] Booting worker with pid: 11232
[2018-06-05 14:56:21 +0200] [11229] [DEBUG] 1 workers
[2018-06-05 14:56:32 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:56:57 +0200] [11232] [DEBUG] Closing connection. 
[2018-06-05 14:57:16 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:57:47 +0200] [11229] [CRITICAL] WORKER TIMEOUT (pid:11232)
[2018-06-05 14:57:47 +0200] [11232] [INFO] Worker exiting (pid: 11232)
[2018-06-05 14:57:47 +0200] [11324] [INFO] Booting worker with pid: 11324

Can you please clearly explain why do I get the error and if it's expected in this example? How do I fix it or if it's an expected behavior why critical error then?

@tilgovi

This comment has been minimized.

Collaborator

tilgovi commented Jun 5, 2018

The error is not expected, but there is nothing from your example that shows why it happens. Tell us more about your environment.

  • What client is used to connect to Gunicorn?
  • What reverse proxy, if any, is used to connect to Gunicorn?
@bigunyak

This comment has been minimized.

bigunyak commented Jun 6, 2018

Tell us more about your environment.

  • What client is used to connect to Gunicorn?
    => I just send request from Chromium: http://localhost:5000/
  • What reverse proxy, if any, is used to connect to Gunicorn?
    => No proxy, just Gunicorn + Flask

I've just reproduced the problem on a completely fresh setup, here are the steps:

mkdir gunicorn
cd gunicorn/
pipenv --python 3.6
pipenv install flask
pipenv install gunicorn
vim hello.py
pipenv shell
gunicorn -b 0.0.0.0:5000 --log-level=debug hello

The hello.py is exactly the same Flask application as I posted in the initial report.
Below is the complete log.

~$ gunicorn -b 0.0.0.0:5000 --log-level=debug hello
[2018-06-06 09:16:21 +0200] [19829] [DEBUG] Current configuration:
  config: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 1
  worker_class: sync
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /home/dima/work/gunicorn
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1000
  group: 985
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  statsd_prefix: 
  proc_name: None
  default_proc_name: hello
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f9757112d08>
  on_reload: <function OnReload.on_reload at 0x7f9757112e18>
  when_ready: <function WhenReady.when_ready at 0x7f9757112f28>
  pre_fork: <function Prefork.pre_fork at 0x7f9756c230d0>
  post_fork: <function Postfork.post_fork at 0x7f9756c231e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f9756c232f0>
  worker_int: <function WorkerInt.worker_int at 0x7f9756c23400>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f9756c23510>
  pre_exec: <function PreExec.pre_exec at 0x7f9756c23620>
  pre_request: <function PreRequest.pre_request at 0x7f9756c23730>
  post_request: <function PostRequest.post_request at 0x7f9756c237b8>
  child_exit: <function ChildExit.child_exit at 0x7f9756c238c8>
  worker_exit: <function WorkerExit.worker_exit at 0x7f9756c239d8>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f9756c23ae8>
  on_exit: <function OnExit.on_exit at 0x7f9756c23bf8>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: TLSv1
  raw_paste_global_conf: []
[2018-06-06 09:16:21 +0200] [19829] [INFO] Starting gunicorn 19.8.1
[2018-06-06 09:16:21 +0200] [19829] [DEBUG] Arbiter booted
[2018-06-06 09:16:21 +0200] [19829] [INFO] Listening at: http://0.0.0.0:5000 (19829)
[2018-06-06 09:16:21 +0200] [19829] [INFO] Using worker: sync
[2018-06-06 09:16:21 +0200] [19832] [INFO] Booting worker with pid: 19832
[2018-06-06 09:16:22 +0200] [19829] [DEBUG] 1 workers
[2018-06-06 09:16:48 +0200] [19832] [DEBUG] GET /
[2018-06-06 09:17:19 +0200] [19829] [CRITICAL] WORKER TIMEOUT (pid:19832)
[2018-06-06 09:17:19 +0200] [19832] [INFO] Worker exiting (pid: 19832)
[2018-06-06 09:17:19 +0200] [19872] [INFO] Booting worker with pid: 19872
^C[2018-06-06 09:17:26 +0200] [19829] [INFO] Handling signal: int
[2018-06-06 09:17:26 +0200] [19872] [INFO] Worker exiting (pid: 19872)
[2018-06-06 09:17:26 +0200] [19829] [INFO] Shutting down: Master
~$ pip list
Package      Version
------------ -------
click        6.7    
Flask        1.0.2  
gunicorn     19.8.1 
itsdangerous 0.24   
Jinja2       2.10   
MarkupSafe   1.0    
pip          10.0.1 
setuptools   39.2.0 
Werkzeug     0.14.1 
wheel        0.31.1
@yunstanford

This comment has been minimized.

Contributor

yunstanford commented Jun 8, 2018

@bigunyak I think it's because of the default timeout, your worker has been silent for 30s. http://docs.gunicorn.org/en/stable/settings.html#timeout

From your log,

[2018-06-05 14:57:16 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:57:47 +0200] [11229] [CRITICAL] WORKER TIMEOUT (pid:11232)
[2018-06-06 09:16:48 +0200] [19832] [DEBUG] GET /
[2018-06-06 09:17:19 +0200] [19829] [CRITICAL] WORKER TIMEOUT (pid:19832)
@yunstanford

This comment has been minimized.

Contributor

yunstanford commented Jun 8, 2018

I'm seeing the same thing: workers are timing out even when serving no requests, with sync worker.
To that sense, the critical level log is quite confusing.

Try use Gevent worker could solve this.

@bigunyak

This comment has been minimized.

bigunyak commented Jun 8, 2018

To that sense, the critical level log is quite confusing.

Exactly, that was my original question: if it's an expected behavior why critical error then?
Would be also nice to get some background on why workers need to be restarted, maybe this could be added to the Design document.

@berkerpeksag

This comment has been minimized.

Collaborator

berkerpeksag commented Jun 8, 2018

I'm seeing this as well (reproduced by using examples/test.py with gunicorn test:app -b localhost:9595 --log-level=debug --timeout=5) and I agree that the critical level is a bit confusing. I'd be OK change it to debug level. @benoitc @tilgovi what do you think?

@yunstanford

This comment has been minimized.

Contributor

yunstanford commented Jun 9, 2018

I'm thinking info level might be a little bit better.

@neocolor

This comment has been minimized.

neocolor commented Jun 14, 2018

I had the same with MSYS2 on Win10 but finally could solved.

in notify() of ...\gunicorn\workers\workertmp.py, os.fchmod is used originally. But it does not work in MSYS. Instead of os.fchmod, I used os.utime. The code is followed. I think it could work for all platform.

    def notify(self):
        try:
            self.spinner = (self.spinner + 1) % 2
            os.fchmod(self._tmp.fileno(), self.spinner)
            if PLATFORM.startswith('MSYS') :
                os.utime(self._tmp.fileno(), None)
        except AttributeError:
            # python < 2.6
            self._tmp.truncate(0)
            os.write(self._tmp.fileno(), b"X")
@benoitc

This comment has been minimized.

Owner

benoitc commented Jun 20, 2018

@berkerpeksag I wouldn't expect that the worker exit because no requests happen. This error should only happen if the worker has been kept busy for a time > to the timeout. So the error is critical. Imo we should improve the documentation to provide more use cases and responses to such errors.

If the error still happen when the worker is not kept busy then there is something else happening and we have probably a bug.

@Tberdy

This comment has been minimized.

Tberdy commented Jul 6, 2018

[EDIT]
Same bug for me.
With Django 1.10 / gunicorn 19.6.0 / Python 2.7.15 in python2.7-alpine on Debian 8.8 and stock kernel 3.16, all was working fine.
After updating to Django 1.11 and gunicorn 19.8.1, the workers keep failing at boot with [CRITICAL WORKER TIMEOUT].
Downgrading gunicorn to 19.6.0 doesn't fix the problem.
We updated the host kernel to 4.9.0 (it was scheduled), and the workers successfully booted without errors.
But:

  • we use 4 workers
  • after exactly 4 calls to the Django app, next calls will timeout, showing [CRITICAL WORKER TIMEOUT] in the logs
  • the linux top command shows 4 gunicorn processes stuck with pretty high CPU consumption.

We're going to try gunicorn gevent to see if we'are able to get our app back online.

@Tberdy

This comment has been minimized.

Tberdy commented Jul 6, 2018

Using gunicorn with gevent didn't fixed the bug.

@DanielClaudeBaeriswyl

This comment has been minimized.

DanielClaudeBaeriswyl commented Aug 2, 2018

any update on this issue ?

@tilgovi

This comment has been minimized.

Collaborator

tilgovi commented Aug 3, 2018

It looks like @neocolor identified a real bug under MSYS. It might deserve a separate issue.

@bigunyak what platform are you running under? I have tried to reproduce with the simple example and I cannot do it following exactly the steps outlined above. This agrees with my experience running multiple applications in production on multiple frameworks. The worker notification system has not changed recently, to my knowledge. My platform is Python 3.7 on MacOS 10.13.6, but I run Gunicorn in production with sync workers for several applications on Python 2.7 and 3.6.5 and only see worker timeouts when there are legitimately long requests that block the workers.

For @Tberdy: what happens if you try to set --worker-tmp-dir to somewhere outside the tmpfs filesystem? I am just wondering if maybe alpine or docker or the combination interferes somehow with how workers notify the arbiter.

@tilgovi

This comment has been minimized.

Collaborator

tilgovi commented Aug 3, 2018

See also #1388 for Docker related tmpfs issues.

@timoj58

This comment has been minimized.

timoj58 commented Aug 7, 2018

i have this issue to.

@cjmash

This comment has been minimized.

cjmash commented Aug 9, 2018

I have this issue too, gunicorn sync was working perfectly well until yesternight, ut started reporting, workers timeout [CRITICAL] using gevent seems to solve my issue, but I'd really want to know why this happened .

@benoitc

This comment has been minimized.

Owner

benoitc commented Aug 9, 2018

@timoj58 @cjmash can you provides more detail about the issue ? How are you running gunicorn (in a vm?, options ...), which fileystem, OS? ANthing that could help to reproduce would be very helpful :)

@cjmash

This comment has been minimized.

cjmash commented Aug 10, 2018

@benoitc I am running gunicorn to start my Django project on kubernetes my gunicorn arguments are --bind=$port --workers=7 --timeout=1200 --log-level=debug --access-logfile - error-logfile -"

the errors i get from the logs

 
E  [2018-08-09 21:47:56 +0000] [14] [INFO] Booting worker with pid: 14
 
E  [2018-08-09 21:47:56 +0000] [12] [INFO] Booting worker with pid: 12
 
E  [2018-08-09 21:47:56 +0000] [1] [DEBUG] 7 workers
 
E  [2018-08-09 21:47:56 +0000] [11] [INFO] Booting worker with pid: 11
 
E  [2018-08-09 21:47:55 +0000] [10] [INFO] Booting worker with pid: 10
 
E  [2018-08-09 21:47:55 +0000] [9] [INFO] Booting worker with pid: 9
 
E  [2018-08-09 21:47:55 +0000] [8] [INFO] Booting worker with pid: 8
 
E  [2018-08-09 21:47:55 +0000] [1] [INFO] Using worker: sync
 
E  [2018-08-09 21:47:55 +0000] [1] [INFO] Listening at: http://0.0.0.0:4040 (1)
 
E  [2018-08-09 21:47:55 +0000] [1] [DEBUG] Arbiter booted
 
E  [2018-08-09 21:47:55 +0000] [1] [INFO] Starting gunicorn 19.7.1
 
E    raw_paste_global_conf: []
 
E    ciphers: TLSv1
 
E    do_handshake_on_connect: False
 
E    suppress_ragged_eofs: True
 
E    ca_certs: None
 
E    cert_reqs: 0
 
E    ssl_version: 2
 
E    certfile: None
 
E    keyfile: None
 
E    proxy_allow_ips: ['127.0.0.1']
 
E    proxy_protocol: False
 
E    on_exit: <function OnExit.on_exit at 0x7f9612aad7b8>
 
E    nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f9612aad620>
 
E    worker_exit: <function WorkerExit.worker_exit at 0x7f9612aad488>
 
E    child_exit: <function ChildExit.child_exit at 0x7f9612aad2f0>
 
E    post_request: <function PostRequest.post_request at 0x7f9612aad158>
 
E    pre_request: <function PreRequest.pre_request at 0x7f9612aad048>
 
E    pre_exec: <function PreExec.pre_exec at 0x7f9612aa8e18>
 
E    worker_abort: <function WorkerAbort.worker_abort at 0x7f9612aa8c80>
 
E    worker_int: <function WorkerInt.worker_int at 0x7f9612aa8ae8>
 
E    post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f9612aa8950>
 
E    post_fork: <function Postfork.post_fork at 0x7f9612aa87b8>
 
E    pre_fork: <function Prefork.pre_fork at 0x7f9612aa8620>
 
E    when_ready: <function WhenReady.when_ready at 0x7f9612aa8488>
 
E    on_reload: <function OnReload.on_reload at 0x7f9612aa82f0>
 
E    on_starting: <function OnStarting.on_starting at 0x7f9612aa8158>
 
E    paste: None
 
E    pythonpath: None
 
E    default_proc_name: art.wsgi
 
E    proc_name: None
 
E    statsd_prefix: 
 
E    statsd_host: None
 
E    enable_stdio_inheritance: False
 
E    syslog_facility: user
 
E    syslog_prefix: None
 
E    syslog: False
 
E    syslog_addr: udp://localhost:514
 
E    logconfig: None
 
E    logger_class: gunicorn.glogging.Logger
 
E    capture_output: False
 
E    loglevel: debug
 
E    errorlog: -
 
E    access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
 
E    accesslog: -
 
E    forwarded_allow_ips: ['127.0.0.1']
 
E    secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
 
E    tmp_upload_dir: None
 
E    initgroups: False
 
E    umask: 0
 
E    group: 0
 
E    user: 0
 
E    worker_tmp_dir: None
 
E    pidfile: None
 
E    raw_env: []
 
E    daemon: False
 
E    chdir: /usr/src/app
 
E    sendfile: None
 
E    preload_app: False
 
E    check_config: False
 
E    spew: False
 
E    reload_engine: auto
 
E    reload: False
 
E    limit_request_field_size: 8190
 
E    limit_request_fields: 100
 
E    limit_request_line: 4094
 
E    keepalive: 2
 
E    graceful_timeout: 30
 
E    timeout: 1200
 
E    max_requests_jitter: 0
 
E    max_requests: 0
 
E    worker_connections: 1000
 
E    threads: 1
 
E    worker_class: sync
 
E    workers: 7
 
E    backlog: 2048
 
E    bind: ['0.0.0.0:4040']
 
E    config: None
 
E  [2018-08-09 21:47:55 +0000] [1] [DEBUG] Current configuration:
 
I    No migrations to apply.
 
I  Running migrations:
 
I    Apply all migrations: admin, auth, contenttypes, core, dashboard, jet, oauth2_provider, sessions
 
I  Operations to perform:
 
E  [2018-08-09 21:47:20 +0000] [13] [INFO] Worker exiting (pid: 13)
 
E    os.path.dirname(os.path.dirname(__file__)), '.env'))
 
E  /usr/src/app/art/wsgi.py:19: UserWarning: Not reading /usr/src/app/.env - it doesn't exist.
 
E  [2018-08-09 21:47:00 +0000] [21] [INFO] Booting worker with pid: 21
 
E  [2018-08-09 21:47:00 +0000] [1] [INFO] Handling signal: term
 
E  [2018-08-09 21:46:35 +0000] [12] [INFO] Booting worker with pid: 12
 
E  [2018-08-09 21:46:34 +0000] [13] [INFO] Booting worker with pid: 13
 
E  [2018-08-09 21:46:34 +0000] [11] [INFO] Booting worker with pid: 11
 
E  [2018-08-09 21:46:34 +0000] [1] [DEBUG] 7 workers
 
E  [2018-08-09 21:46:34 +0000] [10] [INFO] Booting worker with pid: 10
 
E  [2018-08-09 21:46:34 +0000] [9] [INFO] Booting worker with pid: 9
 
E  [2018-08-09 21:46:34 +0000] [8] [INFO] Booting worker with pid: 8
 
E  [2018-08-09 21:46:34 +0000] [7] [INFO] Booting worker with pid: 7
 
E  [2018-08-09 21:46:34 +0000] [1] [INFO] Using worker: sync
 
E  [2018-08-09 21:46:34 +0000] [1] [INFO] Listening at: http://0.0.0.0:4040 (1)
 
E  [2018-08-09 21:46:34 +0000] [1] [DEBUG] Arbiter booted
 
E  [2018-08-09 21:46:34 +0000] [1] [INFO] Starting gunicorn 19.7.1
 
E    raw_paste_global_conf: []
 
E    ciphers: TLSv1
 
E    do_handshake_on_connect: False
 
E    suppress_ragged_eofs: True
 
E    ca_certs: None
 
E    cert_reqs: 0
 
E    ssl_version: 2
 
E    certfile: None
 
E    keyfile: None
 
E    proxy_allow_ips: ['127.0.0.1']
 
E    proxy_protocol: False
 
E    on_exit: <function OnExit.on_exit at 0x7f07f2cfe7b8>
 
E    nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f07f2cfe620>
 
E    worker_exit: <function WorkerExit.worker_exit at 0x7f07f2cfe488>
 
E    child_exit: <function ChildExit.child_exit at 0x7f07f2cfe2f0>
 
E    post_request: <function PostRequest.post_request at 0x7f07f2cfe158>
 
E    pre_request: <function PreRequest.pre_request at 0x7f07f2cfe048>
 
E    pre_exec: <function PreExec.pre_exec at 0x7f07f2cf9e18>
 
E    worker_abort: <function WorkerAbort.worker_abort at 0x7f07f2cf9c80>
 
E    worker_int: <function WorkerInt.worker_int at 0x7f07f2cf9ae8>
 
E    post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f07f2cf9950>
 
E    post_fork: <function Postfork.post_fork at 0x7f07f2cf97b8>
 
E    pre_fork: <function Prefork.pre_fork at 0x7f07f2cf9620>
 
E    when_ready: <function WhenReady.when_ready at 0x7f07f2cf9488>
 
E    on_reload: <function OnReload.on_reload at 0x7f07f2cf92f0>
 
E    on_starting: <function OnStarting.on_starting at 0x7f07f2cf9158>
 
E    paste: None
 
E    pythonpath: None
 
E    default_proc_name: art.wsgi
 
E    proc_name: None
 
E    statsd_prefix: 
 
E    statsd_host: None
 
E    enable_stdio_inheritance: False
 
E    syslog_facility: user
 
E    syslog_prefix: None
 
E    syslog: False
 
E    syslog_addr: udp://localhost:514
 
E    logconfig: None
 
E    logger_class: gunicorn.glogging.Logger
 
E    capture_output: False
 
E    loglevel: debug
 
E    errorlog: -
 
E    access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
 
E    accesslog: -
 
E    forwarded_allow_ips: ['127.0.0.1']
 
E    secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
 
E    tmp_upload_dir: None
 
E    initgroups: False
 
E    umask: 0
 
E    group: 0
 
E    user: 0
 
E    worker_tmp_dir: None
 
E    pidfile: None
 
E    raw_env: []
 
E    daemon: False
 
E    chdir: /usr/src/app
 
E    sendfile: None
 
E    preload_app: False
 
E    check_config: False
 
E    spew: False
 
E    reload_engine: auto
 
E    reload: False
 
E    limit_request_field_size: 8190
 
E    limit_request_fields: 100
 
E    limit_request_line: 4094
 
E    keepalive: 2
 
E    graceful_timeout: 30
 
E    timeout: 1200
 
E    max_requests_jitter: 0
 
E    max_requests: 0
 
E    worker_connections: 1000
 
E    threads: 1
 
E    worker_class: sync
 
E    workers: 7
 
E    backlog: 2048
 
E    bind: ['0.0.0.0:4040']
 
E    config: None
 
E  [2018-08-09 21:46:34 +0000] [1] [DEBUG] Current configuration:
 
@bigunyak

This comment has been minimized.

bigunyak commented Aug 10, 2018

I struggled a bit to reproduce the problem this time but it's still there in the latest gunicorn version 19.9.0.
The steps to reproduce it are exactly the same as I described in this post.
My system is Arch Linux x86_64 GNU/Linux (kernel 4.17.2-1-ARCH), Python 3.6.5
Here is a fresh log trace.

~$ gunicorn -b 0.0.0.0:5000 --log-level=debug hello
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] Current configuration:
  config: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 1
  worker_class: sync
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /home/dima/lerning/python/modules/gunicorn
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1000
  group: 985
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  statsd_prefix: 
  proc_name: None
  default_proc_name: hello
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f5f8b9dad08>
  on_reload: <function OnReload.on_reload at 0x7f5f8b9dae18>
  when_ready: <function WhenReady.when_ready at 0x7f5f8b9daf28>
  pre_fork: <function Prefork.pre_fork at 0x7f5f8b4ec0d0>
  post_fork: <function Postfork.post_fork at 0x7f5f8b4ec1e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f5f8b4ec2f0>
  worker_int: <function WorkerInt.worker_int at 0x7f5f8b4ec400>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f5f8b4ec510>
  pre_exec: <function PreExec.pre_exec at 0x7f5f8b4ec620>
  pre_request: <function PreRequest.pre_request at 0x7f5f8b4ec730>
  post_request: <function PostRequest.post_request at 0x7f5f8b4ec7b8>
  child_exit: <function ChildExit.child_exit at 0x7f5f8b4ec8c8>
  worker_exit: <function WorkerExit.worker_exit at 0x7f5f8b4ec9d8>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f5f8b4ecae8>
  on_exit: <function OnExit.on_exit at 0x7f5f8b4ecbf8>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: TLSv1
  raw_paste_global_conf: []
[2018-08-10 09:48:40 +0200] [23114] [INFO] Starting gunicorn 19.9.0
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] Arbiter booted
[2018-08-10 09:48:40 +0200] [23114] [INFO] Listening at: http://0.0.0.0:5000 (23114)
[2018-08-10 09:48:40 +0200] [23114] [INFO] Using worker: sync
[2018-08-10 09:48:40 +0200] [23117] [INFO] Booting worker with pid: 23117
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] 1 workers
[2018-08-10 09:48:45 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:48:54 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:00 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:18 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:49:18 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:23 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:37 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:49:37 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:50 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:51:11 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:51:13 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:51:43 +0200] [23114] [CRITICAL] WORKER TIMEOUT (pid:23117)
[2018-08-10 09:51:43 +0200] [23117] [INFO] Worker exiting (pid: 23117)
[2018-08-10 09:51:44 +0200] [23229] [INFO] Booting worker with pid: 23229

As before to test it I was just hitting http://0.0.0.0:5000/ in Chromium.
Below is the content of Pipfile and Pipfile.lock for you to see the exact environment.

  • Pipfile
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
flask = "*"
gunicorn = "*"

[dev-packages]

[requires]
python_version = "3.6"
  • Pipfile.lock
{
    "_meta": {
        "hash": {
            "sha256": "81cb5d5f0b11719d8d9c5ec9cc683fdcf959c652fda256d5552a82d0f459a99c"
        },
        "pipfile-spec": 6,
        "requires": {
            "python_version": "3.6"
        },
        "sources": [
            {
                "name": "pypi",
                "url": "https://pypi.org/simple",
                "verify_ssl": true
            }
        ]
    },
    "default": {
        "click": {
            "hashes": [
                "sha256:29f99fc6125fbc931b758dc053b3114e55c77a6e4c6c3a2674a2dc986016381d",
                "sha256:f15516df478d5a56180fbf80e68f206010e6d160fc39fa508b65e035fd75130b"
            ],
            "version": "==6.7"
        },
        "flask": {
            "hashes": [
                "sha256:2271c0070dbcb5275fad4a82e29f23ab92682dc45f9dfbc22c02ba9b9322ce48",
                "sha256:a080b744b7e345ccfcbc77954861cb05b3c63786e93f2b3875e0913d44b43f05"
            ],
            "index": "pypi",
            "version": "==1.0.2"
        },
        "gunicorn": {
            "hashes": [
                "sha256:aa8e0b40b4157b36a5df5e599f45c9c76d6af43845ba3b3b0efe2c70473c2471",
                "sha256:fa2662097c66f920f53f70621c6c58ca4a3c4d3434205e608e121b5b3b71f4f3"
            ],
            "index": "pypi",
            "version": "==19.9.0"
        },
        "itsdangerous": {
            "hashes": [
                "sha256:cbb3fcf8d3e33df861709ecaf89d9e6629cff0a217bc2848f1b41cd30d360519"
            ],
            "version": "==0.24"
        },
        "jinja2": {
            "hashes": [
                "sha256:74c935a1b8bb9a3947c50a54766a969d4846290e1e788ea44c1392163723c3bd",
                "sha256:f84be1bb0040caca4cea721fcbbbbd61f9be9464ca236387158b0feea01914a4"
            ],
            "version": "==2.10"
        },
        "markupsafe": {
            "hashes": [
                "sha256:a6be69091dac236ea9c6bc7d012beab42010fa914c459791d627dad4910eb665"
            ],
            "version": "==1.0"
        },
        "werkzeug": {
            "hashes": [
                "sha256:c3fd7a7d41976d9f44db327260e263132466836cef6f91512889ed60ad26557c",
                "sha256:d5da73735293558eb1651ee2fddc4d0dedcfa06538b8813a2e20011583c9e49b"
            ],
            "version": "==0.14.1"
        }
    },
    "develop": {}
}
@jdeltoft

This comment has been minimized.

jdeltoft commented Aug 15, 2018

Just FYI, I'm also seeing this failure very regularly with:

nohup gunicorn -w 8 --access-logfile=- --bind 127.0.0.1:5000 wsgi &

I just have a wsgi.py that has

from chart_api import application
if __name__ == "__main__":
    application.run()

Let me know if there is anything you want me to try/experiment or if there are specifics in the logs you want me to check. I'm running flask on a GCP VM.

@timoj58

This comment has been minimized.

timoj58 commented Aug 15, 2018

sorry for late reply. I am running it as

gunicorn --log-file=/home/ubuntu/log/gunicorn.log predictor_api:app -b localhost:5000 &

i did use the gevent setting etc, but i have changed my design of what i required this for in order to work around issue, hence the basic setting above (which also failed but this is to be expected given no gevent)

Python version 3.6
env: aws tensorflow_p36 optmized (ubuntu)
nginx is sat in front of gunicorn, which is executing a flask app.

flask version 1.0.2
nginx version 1.10.3
gunicorn version 19.9.0

I did change the nginx timeouts as well incase this may have caused it.

@tiru1930

This comment has been minimized.

tiru1930 commented Oct 23, 2018

Facing the same problem with gunicorn server

gunicorn ApplicationServer:app -b 0.0.0.0:6001 -w 8 --threads 4 --backlog 2048 \

--timeout 120 --graceful-timeout 60 --access-logfile logs/access.log \

--error-logfile logs/error.log --log-level=info

Flask==0.12.1
gunicorn==19.7.1

when i start server with above command, my system got freezing for some time and worker pids are keeping on booting , though i kept timeout 120 sec, and the server is not accepting single request.

@anilpai

This comment has been minimized.

anilpai commented Nov 5, 2018

Any update on this issue ? I have same issue

[CRITICAL] WORKER TIMEOUT

@javabrett

This comment has been minimized.

Contributor

javabrett commented Nov 5, 2018

Wondering whether anyone has successfully reproduced this in a Docker image?

@SuicidalTissue

This comment has been minimized.

SuicidalTissue commented Nov 7, 2018

Also seeing this when trying to implement datadog's ddtrace-run worker on an existing application starting with gunicorn -k gevent --threads 4.

Funny trace of a SystemExit I've never seen before too...
[2018-11-07 11:11:50 +0000] [15987] [INFO] Booting worker with pid: 15987 [2018-11-07 11:11:50 +0000] [15977] [DEBUG] 1 workers [2018-11-07 11:12:20 +0000] [15977] [CRITICAL] WORKER TIMEOUT (pid:15987) Exception SystemExit: 1 in <bound method LibevLoop._loop_will_run of <cassandra.io.libevreactor.LibevLoop object at 0x7f3cb66d4a50>> ignored

@anilpai

This comment has been minimized.

anilpai commented Nov 9, 2018

I am able to resolve this issue by matching the number of workers and number of threads.

I had set workers = (2 * cpu_count) + 1 and did not set threads.

Once I changed threads = workers , everything started working fine. Just in case, if this helps someone.

This is how it looks now

def run(host='0.0.0.0', port=8080, workers=1 + (multiprocessing.cpu_count() * 2)):
    """Run the app with Gunicorn."""

    if app.debug:
        app.run(host, int(port), use_reloader=False)
    else:
        gunicorn = WSGIApplication()
        gunicorn.load_wsgiapp = lambda: app
        gunicorn.cfg.set('bind', '%s:%s' % (host, port))
        gunicorn.cfg.set('workers', workers)
        gunicorn.cfg.set('threads', workers)
        gunicorn.cfg.set('pidfile', None)
        gunicorn.cfg.set('worker_class', 'sync')
        gunicorn.cfg.set('keepalive', 10)
        gunicorn.cfg.set('accesslog', '-')
        gunicorn.cfg.set('errorlog', '-')
        gunicorn.cfg.set('reload', True)
        gunicorn.chdir()
        gunicorn.run()
@clintonb

This comment has been minimized.

clintonb commented Nov 26, 2018

I encountered this issue running Django with a single Docker container on AWS Elastic Beanstalk. I resolved the issue by fixing my security groups to ensure my EC2 instance could talk to my RDS instance. I recognize this may not be the solution for 99% of folks on this issue, but I'm leaving this note to help others avoid wasting hours falling down this rabbit hole.

@fred-revel

This comment has been minimized.

fred-revel commented Dec 3, 2018

In my case the issue was resolved by performing systemctl daemon-reload.
Long story short I knew of the default timeout at 30sec, but was unable to change it until i figured i had to of course reload the systemd daemon to apply the changes to the service.

@asnisarenko

This comment has been minimized.

asnisarenko commented Dec 11, 2018

@bigunyak @benoitc @tilgovi
After a long google goose chase and some experimentation I believe the root cause of this issue is chrome "preconnect/prediction service" (which is enabled by default in both Chrome and Chromium).

@jeiting did a nice write up about this issue
https://hackernoon.com/chrome-preconnect-breaks-singly-threaded-servers-95944be16400
Some additional reading:
corydolphin/flask-cors#147
pallets/flask#2169

TLDR
In some cases Chrome/Chromium opens and holds an "empty" TCP connection(s) (i.e its predicting that its going to fetch another resource soon). If the "empty" TCP connection(s) hits your gunicorn server first then any subsequent "real" request from chrome can get stuck behind the "empty" request until the worker(s) handling the "empty" request times out. This is more likely to happen if you are only running a single sync worker in flask. But as my experiments found it can happen even if you are running multiple sync workers.

My Environment

  • My native OS is Ubuntu 17
  • I have a rest API running in a local docker container. Relevant lines from the Dockefile
FROM ubuntu:18.04
...
RUN pip3 install Flask==1.0.2
RUN pip3 install gunicorn==19.9.0
RUN pip3 install flask-cors==3.0.6
......
  • I have 3 browsers installed: Firefox 61.0.1, Chromium 67.0.3396.99, Chrome 70.0.3538.102
  • I have a react app that is served from a different docker container on a different port
  • The react app makes CORS AJAX requests to my API (they are CORS requests because the javascript is served on one localhost port while its sending API calls to another localhost port)
  • The request that always gets "blocked" in my experiments is a CORS OPTIONS request (triggered by the browser to ask for permission to then perform a POST request ). Seems logical that a browser would try to make predictive connections with the OPTIONS call because it's likely going to follow up with a POST call.

Experiment 1
Gunicorn config: 1 sync worker (default 30s timeout)

Firefox: On almost every react page load, the CORS OPTIONS request gets blocked for 5s, and then succeeds.
Chromium: On EVERY react page load, CORS OPTIONS request gets blocked for 1.5minutes!!!! and then succeeds.
Chromium(prediction service disabled): Everything loads fine
Chrome: Everything loads fine

Experiment 2
Gunicorn config: 4 sync workers (default 30s timeout)

Firefox: Everything loads fine
Chromium: On every 3rd react page load, CORS OPTIONS request gets blocked for 30s and then succeeds.
Chromium(prediction service disabled): Everything loads fine
Chrome: Everything loads fine

Experiment 3
Gunicorn config: 8 sync workers (default 30s timeout)

Firefox: Everything loads fine
Chromium: Everything loads fine
Chrome: Everything loads fine

Experiment 4
Run flask dev server with threaded=True

Firefox: Everything loads fine
Chromium: Everything loads fine
Chrome: Everything loads fine

Experiment Summary

  • Firefox seems to also have prediction service, but it seems more graceful. (Experiment 1)
  • Chromium seems the most aggressive in creating "prediction" connections. It appears to launch up to 3-4 "prediction" connections during an OPTIONS request (Experiment 1 and 2)
  • Surprisingly even though Chrome also has prediction service enabled it didn't cause any issues in any of the experiments (maybe the order or the number of "prediction" connections is different from Chromium)

Solution
In Production: Easiest fix is to put nginx in front of gunicorn. If you want to understand why nginx solves this issue, here is nice article that explains it: https://www.brianstorti.com/the-role-of-a-reverse-proxy-to-protect-your-application-against-slow-clients/

In dev: Easiest fix is to run flask dev server with threaded=True. Or you can disable the prediction service in Chrome/Chromium.

gunicorn debug improvements
To help debug an issue like this in the future, I think it's a good idea to add debug log statements next to the select() and accept() calls in the sync worker.

client, addr = listener.accept()

ret = select.select(self.wait_fds, [], [], timeout)

This would show that a worker has accepted a new TCP connection, but is not receiving any data.

@javabrett

This comment has been minimized.

Contributor

javabrett commented Dec 11, 2018

@asnisarenko super write-up, thanks. #1929 is another unusual case of a slow-client which yields similar symptom in the single-threaded server - in that case, TLS handshake to a non-TLS port looks like a slow-writing client, since it fails to send sensible headers quickly.

It may be the Gunicorn, for single-threaded sync workers is going to need some new tunable, an aggressive client-drop for clients which fail to send at least a first request-header line, if not all headers, within a reasonable time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment