Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't boot eventlet workers with eventlet 0.21.0 #1584

Closed
jwg4 opened this Issue Sep 6, 2017 · 17 comments

Comments

Projects
None yet
4 participants
@jwg4
Copy link

jwg4 commented Sep 6, 2017

Eventlet workers fail to boot using the most recent release of eventlet:

To reproduce this I am using the minimal app from http://docs.gunicorn.org/en/latest/run.html

Error message:

(venv) jack@neptune-vm:~/z/test_gunicorn$ gunicorn -b 0.0.0.0:5015 -w 4 myapp:app -k eventlet
[2017-09-06 17:40:13 +0000] [8760] [INFO] Starting gunicorn 19.7.1
[2017-09-06 17:40:13 +0000] [8760] [INFO] Listening at: http://0.0.0.0:5015 (8760)
[2017-09-06 17:40:13 +0000] [8760] [INFO] Using worker: eventlet
[2017-09-06 17:40:13 +0000] [8765] [INFO] Booting worker with pid: 8765
[2017-09-06 17:40:13 +0000] [8765] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/home/jack/z/test_gunicorn/venv/local/lib/python2.7/site-packages/gunicorn/arbiter.py", line 578, in spawn_worker
    worker.init_process()
  File "/home/jack/z/test_gunicorn/venv/local/lib/python2.7/site-packages/gunicorn/workers/geventlet.py", line 102, in init_process
    self.patch()
  File "/home/jack/z/test_gunicorn/venv/local/lib/python2.7/site-packages/gunicorn/workers/geventlet.py", line 91, in patch
    hubs.use_hub()
  File "/home/jack/z/test_gunicorn/venv/local/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 70, in use_hub
    mod = get_default_hub()
  File "/home/jack/z/test_gunicorn/venv/local/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 38, in get_default_hub
    import eventlet.hubs.epolls
  File "/home/jack/z/test_gunicorn/venv/local/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 27, in <module>
    from eventlet.hubs.hub import BaseHub
  File "/home/jack/z/test_gunicorn/venv/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 23, in <module>
    from eventlet.support import greenlets as greenlet, clear_sys_exc_info, monotonic, six
  File "/home/jack/z/test_gunicorn/venv/local/lib/python2.7/site-packages/eventlet/support/monotonic.py", line 167, in <module>
    raise RuntimeError('no suitable implementation for this system')
RuntimeError: no suitable implementation for this system
[2017-09-06 17:40:13 +0000] [8765] [INFO] Worker exiting (pid: 8765)
[2017-09-06 17:40:13 +0000] [8760] [INFO] Shutting down: Master
[2017-09-06 17:40:13 +0000] [8760] [INFO] Reason: Worker failed to boot.

System details:

(venv) jack@neptune-vm:~/z/test_gunicorn$ uname -a 
Linux neptune-vm 4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23 17:49:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
(venv) jack@neptune-vm:~/z/test_gunicorn$ python --version
Python 2.7.12
(venv) jack@neptune-vm:~/z/test_gunicorn$ pip freeze
enum-compat==0.0.2
enum34==1.1.6
eventlet==0.21.0
greenlet==0.4.12
gunicorn==19.7.1

To be clear, this works with eventlet 0.20.0. I reported this as a bug to eventlet. I posted to what seemed to be a related bug eventlet/eventlet#401 Based on the help of people there, it seems to be a bug which is not related to the eventlet bug in that issue. However, changes in eventlet seem to have caused this problem firstly to appear, and secondly to be caught as an exception by their new code for monkey-patching the monotonic clock. This explains why I got the same error message as the other users who were encountered the (now fixed) eventlet bug.

@geofft

This comment has been minimized.

Copy link

geofft commented Sep 6, 2017

The specific problem as I understand it is that, as of eventlet 0.21, eventlet.hubs.use_hub() calls import monotonic, which calls ctypes.util.find_library, which runs ldconfig -p in a subprocess (using either os.popen or subprocess.Popen depending on your Python version). Waiting on that child process fails, because gunicorn installs a SIGCHLD handler that waits on all children, so ldconfig has already been waited on by the time control returns to ctypes.util.find_library.

If gunicorn can delay installing the SIGCHLD handler until after hubs.use_hub() is called, that should fix the problem.

@jwg4

This comment has been minimized.

Copy link
Author

jwg4 commented Sep 6, 2017

I created a pull request as a suggestion here: #1585 I simply loop through the worker pids and wait on each of them, rather than wait on all child processes. This has apparently fixed the problem for me, in that I can run gunicorn, so I thought it might be useful. However, I might be completely wrong about how this would affect gunicorn's worker management, so please ignore it if it's misguided.

@jwg4

This comment has been minimized.

Copy link
Author

jwg4 commented Sep 6, 2017

It might be difficult to not add the SIGCHLD handler until before hubs.use_hubs() is called, since gunicorn wants to be able to handle the death of a worker even if it hasn't yet been correctly set up, and it wants to do this in a way that is independent of which worker is used.

temoto added a commit to eventlet/eventlet that referenced this issue Sep 7, 2017

init: second workaround for monotonic "no suitable implementation"; T…
…hanks to Geoffrey Thomas

Force sooner find_library child process by early import monotonic.
Helps with gunicorn and possibly other applications.

#401
benoitc/gunicorn#1584
@benoitc

This comment has been minimized.

Copy link
Owner

benoitc commented Sep 7, 2017

@jwg4 thanks will have a closer look over the week-end on the issue.

Probably a silly question but why the check isn't done during the package installation instead of doing it at runtime?

@benoitc

This comment has been minimized.

Copy link
Owner

benoitc commented Sep 11, 2017

@jwg4 i'm not sure how it is related though. Normally childs process outside the worker doesn't break the arbiter: https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L530-L532

The issue is more that eventlet is lazily testing a capavility of the system at startup and return a runtime error then. Which crash the worker.I'm not sure why it does that. Why doesn't it fallback to something else, instead of raising an issue. If it's required then it means you can't use eventlet at all anyway, so it's quite logic that the worker shutdown. Otherwise, it shouldn't raise a runtime error where I would expect an exception that can be catched eventually.

Alternatively what could be patched is the worker setup:
https://github.com/benoitc/gunicorn/blame/master/gunicorn/workers/geventlet.py#L90

and check for the runtime error....

@jwg4

This comment has been minimized.

Copy link
Author

jwg4 commented Sep 11, 2017

Thanks a lot for looking at this @benoitc. I don't think the worker is crashing. Rather I think the following happens:

  1. Gunicorn attaches a signal handler on SIGCHLD which will call reap_workers.
  2. Gunicorn calls into eventlet to setup the workers.
  3. eventlet while initializing the module, forks a child process (not a worker).
  4. The child process completes successfully and SIGCHLD is sent to the parent process.
  5. SIGCHLD is caught by gunicorn and the handler is triggered.
  6. reap_workers is called.
  7. reap_workers waits on the child process forked by eventlet. Since this process already closed, it retrieves the pid for this process.
  8. Since the pid isn't one of the worker pids it knows about, it ignores it (as you pointed out).
  9. Control reverts to the eventlet setup code.
  10. eventlet waits on the pid of the child process it forked.
  11. Since wait() was already called by reap_workers, it fails to wait on the pid of the child. This is what causes the exception to be thrown in eventlet, and no workers can be created.

Thus as you see, the problem is caused not by a runtime error, or a crash of the worker, but by the fact that gunicorn catches SIGCHLD for the eventlet child (gunicorn is expecting that the SIGCHLD relates to one of its own worker processes, but it's not the case), and gunicorn calling waitpid on the eventlet child (where eventlet expects that no-one else tries to retrieve the fact that the child terminated). I don't try here to imply that either gunicorn or eventlet is responsible or should be expected to change.

From the point of gunicorn, there are two things it could do:

  1. Don't start to handle SIGCHLD until later.
  2. Don't wait on the pid of the eventlet child.

Solution 1. was suggested by some people on the eventlet issue thread eventlet/eventlet#401 . I believe it is problematic, because gunicorn wants to be already listening for a worker crash, before it starts to call any workers. Otherwise workers crashing at startup would not be cleaned up correctly.

Solution 2. is what my patch #1585 tries to do. I agree with you and @RonRothman that this also causes a race condition - if the child fails immediately after fork() and before adding itself to the dict of workers, it also won't be cleaned up.

I will try and confirm better that the description of 1-11 above is correct. If you have advice on a correct approach to fix this in gunicorn let me know. I will also create a separate eventlet issue since this bug was confused with another similar. They have also suggested a fix so any change to gunicorn might not be needed.

Thanks for your help.

@benoitc

This comment has been minimized.

Copy link
Owner

benoitc commented Sep 11, 2017

@jwg4 cn you try to simply move the line https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/geventlet.py#L102 after it has initialised the process and let me know about the result?

@jwg4

This comment has been minimized.

Copy link
Author

jwg4 commented Sep 11, 2017

Sure @benoitc. It boots correctly for me with that change.

@benoitc

This comment has been minimized.

Copy link
Owner

benoitc commented Sep 11, 2017

@jwg4 just to confirm, so moving that line only doesn't raise the issue above?

I will test it myself later today anyway.

@jwg4

This comment has been minimized.

Copy link
Author

jwg4 commented Sep 11, 2017

Exactly. I used latest gunicorn and eventlet, gunicorn startup failed. Changed the one line that you suggested, it worked with no problem.

@mathewcohle

This comment has been minimized.

Copy link

mathewcohle commented Nov 2, 2017

Is there plan to fix this any time soon? :)

@benoitc

This comment has been minimized.

Copy link
Owner

benoitc commented Nov 2, 2017

@mathewcohle i've been sidetracked but there will be a release tomorrow. Can you confirm moving the line above also worked for you?

@mathewcohle

This comment has been minimized.

Copy link

mathewcohle commented Nov 2, 2017

I'll check tomorrow and let you know @benoitc.

mathewcohle pushed a commit to mathewcohle/gunicorn that referenced this issue Nov 3, 2017

mathewcohle
Fix eventlet workers boot
```
[2017-11-03 09:11:04 +0000] [6] [DEBUG] Current configuration:
  proxy_protocol: False
  worker_connections: 1000
  statsd_host: None
  max_requests_jitter: 0
  post_fork: <function post_fork at 0x7f810d1a8c08>
  errorlog: -
  enable_stdio_inheritance: False
  worker_class: eventlet
  ssl_version: 2
  suppress_ragged_eofs: True
  syslog: False
  syslog_facility: user
  when_ready: <function when_ready at 0x7f810d1a8938>
  pre_fork: <function pre_fork at 0x7f810d1a8aa0>
  cert_reqs: 0
  preload_app: False
  keepalive: 2
  accesslog: None
  group: 0
  graceful_timeout: 30
  do_handshake_on_connect: False
  spew: False
  workers: 1
  proc_name: None
  sendfile: None
  pidfile: None
  umask: 0
  on_reload: <function on_reload at 0x7f810d1a87d0>
  pre_exec: <function pre_exec at 0x7f810d1ae230>
  worker_tmp_dir: None
  limit_request_fields: 100
  pythonpath: None
  on_exit: <function on_exit at 0x7f810d1aea28>
  config: gunicorn.conf.py
  logconfig: None
  check_config: False
  statsd_prefix:
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  reload_engine: auto
  proxy_allow_ips: ['127.0.0.1']
  pre_request: <function pre_request at 0x7f810d1ae398>
  post_request: <function post_request at 0x7f810d1ae488>
  forwarded_allow_ips: ['127.0.0.1']
  worker_int: <function worker_int at 0x7f810d1a8ed8>
  raw_paste_global_conf: []
  threads: 1
  max_requests: 0
  chdir: /home/test
  daemon: False
  user: 0
  limit_request_line: 4094
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  certfile: None
  on_starting: <function on_starting at 0x7f810d1a8668>
  post_worker_init: <function post_worker_init at 0x7f810d1a8d70>
  child_exit: <function child_exit at 0x7f810d1ae5f0>
  worker_exit: <function worker_exit at 0x7f810d1ae758>
  paste: None
  default_proc_name: sk.api.api:app
  syslog_addr: udp://localhost:514
  syslog_prefix: None
  ciphers: TLSv1
  worker_abort: <function worker_abort at 0x7f810d1ae0c8>
  loglevel: DEBUG
  bind: [':8080']
  raw_env: []
  initgroups: False
  capture_output: False
  reload: False
  limit_request_field_size: 8190
  nworkers_changed: <function nworkers_changed at 0x7f810d1ae8c0>
  timeout: 30
  keyfile: None
  ca_certs: None
  tmp_upload_dir: None
  backlog: 2048
  logger_class: gunicorn.glogging.Logger
[2017-11-03 09:11:04 +0000] [6] [INFO] Starting gunicorn 19.7.1
[2017-11-03 09:11:04 +0000] [6] [DEBUG] Arbiter booted
[2017-11-03 09:11:04 +0000] [6] [INFO] Listening at: http://0.0.0.0:8080 (6)
[2017-11-03 09:11:04 +0000] [6] [INFO] Using worker: eventlet
[2017-11-03 09:11:04 +0000] [13] [INFO] Booting worker with pid: 13
[2017-11-03 09:11:04 +0000] [13] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 578, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/geventlet.py", line 102, in init_process
    self.patch()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/geventlet.py", line 91, in patch
    hubs.use_hub()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/__init__.py", line 70, in use_hub
    mod = get_default_hub()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/__init__.py", line 38, in get_default_hub
    import eventlet.hubs.epolls
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/epolls.py", line 27, in <module>
    from eventlet.hubs.hub import BaseHub
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 23, in <module>
    from eventlet.support import greenlets as greenlet, clear_sys_exc_info, monotonic, six
  File "/usr/local/lib/python2.7/dist-packages/eventlet/support/monotonic.py", line 167, in <module>
    raise RuntimeError('no suitable implementation for this system')
RuntimeError: no suitable implementation for this system
[2017-11-03 09:11:04 +0000] [13] [INFO] Worker exiting (pid: 13)
[2017-11-03 09:11:05 +0000] [6] [DEBUG] 1 workers
[2017-11-03 09:11:05 +0000] [6] [INFO] Shutting down: Master
[2017-11-03 09:11:05 +0000] [6] [INFO] Reason: Worker failed to boot.
```

Closes benoitc#1584.
@mathewcohle

This comment has been minimized.

Copy link

mathewcohle commented Nov 3, 2017

So I've done some testing.

I was able to reproduce and fix the error for our production code (running in Docker):

  • broken when I pip installed gunicorn from https://github.com/mathewcohle/gunicorn.git
  • fixed when I used https://github.com/mathewcohle/gunicorn.git@eb965739bbb405bbb9c07464e2e2c05dd2893783

and also for minimal Docker app:

FROM ubuntu:16.04

RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    python-dev \
    python-setuptools \
    python-pip \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip

RUN pip install \
    eventlet \
    git+https://github.com/mathewcohle/gunicorn.git@eb965739bbb405bbb9c07464e2e2c05dd2893783

COPY test.py /home/gun/test.py

WORKDIR /home/gun

CMD gunicorn --log-level DEBUG -k eventlet test:app

(failed from master, works from eb965739bb)

However, I am not able to reproduce the error locally. My setup:

$ uname -a
Darwin 172-3-0-4.lightspeed.irvnca.sbcglobal.net 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

$ python --version
Python 2.7.14

$ pip freeze
enum-compat==0.0.2
enum34==1.1.6
eventlet==0.21.0
greenlet==0.4.12
gunicorn==19.7.1

Which is kind of strange. Not sure wether it's related to the fact that it's running in Docker or it's Ubuntu/Linux related.

@benoitc

@geofft

This comment has been minimized.

Copy link

geofft commented Nov 3, 2017

@mathewcohle That's expected—monotonic only calls a subprocess (via ctypes.util.find_library) on Linux or BSD. On macOS it hardcodes the library path. https://github.com/atdt/monotonic/blob/master/monotonic.py

@mathewcohle

This comment has been minimized.

Copy link

mathewcohle commented Nov 3, 2017

Should I create PR? I mean, all credit goes to @jwg4 and @benoitc for debugging and solving the issue, it's just that I have the patch ready.

@benoitc benoitc closed this in 120f503 Nov 12, 2017

@benoitc

This comment has been minimized.

Copy link
Owner

benoitc commented Nov 12, 2017

@mathewcohle ah sorry didn't see your last message. I just committed the patch I had . Thanks for the test anyway!

fofanov pushed a commit to fofanov/gunicorn that referenced this issue Mar 16, 2018

fix eventlet worker boot
new version of eventlet uses an external process to check a feature at
runtime (monotonic). ut since gunicorn already installed a signal handler,
eventlet never get the return and then was raising a runtime error making the worker crash.

This patch fix it by patching the worker after it has been initialized
and the signals resetted.

fix benoitc#1584
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.