Threaded workers accept connections even when no ready thread exists #908

Closed
tahajahangir opened this Issue Oct 3, 2014 · 19 comments


tahajahangir commented Oct 3, 2014

When using threads (with the sync worker class), the worker accepts connections even when all threads are busy. In our environment, sometimes (because of a deadlock in the cairocffi library) all threads of a worker are locked, but the worker keeps accepting connections until the open-files limit is reached.

I looked at the code in gthread.py, and it does honor the worker_connections setting (despite the documentation saying it only affects the Eventlet and Gevent workers).

I think that for threaded workers, worker_connections should always be set to the number of threads.
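
(An editor's illustration, not part of the original report: assuming a standard gunicorn Python config file, the suggestion above would look roughly like this.)

# gunicorn.conf.py -- hypothetical config sketching the suggestion above
workers = 2
threads = 4
# Cap the connections a threaded worker will accept at its thread count,
# so it never takes on more requests than it has threads to serve.
worker_connections = threads

It would be started with something like gunicorn -c gunicorn.conf.py app_module.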

benoitc (Owner) commented Oct 3, 2014

What do you mean by "When using threads (with the sync worker class), the worker accepts connections even when all threads are busy"? With the --thread option?

tahajahangir commented Oct 3, 2014

Yes (actually the --threads option).

benoitc self-assigned this Oct 9, 2014

benoitc added this to the R19.2 milestone Oct 9, 2014

benoitc (Owner) commented Oct 9, 2014

Thanks for your answer. I will work on it during the week; expect a fix in the coming release this weekend.

benoitc (Owner) commented Oct 19, 2014

Hrm, I am not sure how to reproduce that issue. When I look at the code of the threaded worker, it appears that the worker should crash if the number of active connections is greater than the configured worker_connections:

https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/gthread.py#L174-L178

When that happens and no futures have returned within the timeout, the connection breaks. Are you sure the connections are not being handled by another worker?
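
(Editor's sketch of the guard being described, with hypothetical names; it is not the code behind that link.)

# Hypothetical illustration: once the number of live connections reaches
# worker_connections, block until some in-flight request finishes, and treat
# "nothing finished within the timeout" as fatal.
from concurrent import futures

def wait_for_free_slot(pending, nr_conns, worker_connections, timeout):
    if nr_conns >= worker_connections:
        done, _not_done = futures.wait(
            pending, timeout=timeout,
            return_when=futures.FIRST_COMPLETED)
        if not done:
            # No request completed in time: give up so the master restarts us.
            raise SystemExit("worker timed out waiting for a free thread")
        pending -= done
    return pending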

tahajahangir commented Oct 20, 2014

Steps to reproduce:

  • Create foo.py. It's a hello-world app that waits 100 seconds when the query string is 'wait':

import time
import os

def application(env, start):
    t = time.time()
    print('---- start in ', os.getpid())
    if env['QUERY_STRING'] == 'wait':
        # hold this thread for ~100 seconds
        while time.time() - t < 100:
            time.sleep(1)
    print('--- end in', os.getpid())
    start('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World']

  • Run gunicorn with:
    gunicorn --error-logfile - --workers 2 --threads 2 foo
    The server has 2 workers, each with 2 threads.
  • Open a browser and make exactly 3 requests to http://127.0.0.1:8000/?wait. The log now looks like:

[2014-10-20 10:26:49 +0330] [5419] [INFO] Starting gunicorn 19.1.1
[2014-10-20 10:26:49 +0330] [5419] [INFO] Listening at: http://127.0.0.1:8000 (5419)
[2014-10-20 10:26:49 +0330] [5419] [INFO] Using worker: threads
[2014-10-20 10:26:49 +0330] [5422] [INFO] Booting worker with pid: 5422
[2014-10-20 10:26:49 +0330] [5423] [INFO] Booting worker with pid: 5423
---- start in  5423
---- start in  5423
---- start in  5422

3 of the 4 threads are now busy, and one is idle.

  • Now make several requests to http://127.0.0.1:8000/ (without cache, use Control+F5). Some of the requests will succeed, but some of them won't produce any output (and get stuck in a "Waiting for 127.0.0.1" state), even though there is always one idle thread.

marcinkuzminski commented Oct 20, 2014

We're having the same problem; configuring workers with 2+ threads causes exactly the same issue as explained above.

benoitc (Owner) commented Oct 22, 2014

On which OS? Kernel?

marcinkuzminski commented Oct 22, 2014

Description: Ubuntu 12.04.4 LTS
Linux code 3.8.0-34-generic 49~precise1-Ubuntu SMP Wed Nov 13 18:05:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

benoitc (Owner) commented Oct 22, 2014

I've run the example provided above and I think I understand the problem.

The issue is that the threaded worker is not designed to accept long-running connections right now. It mostly works like the sync worker. Using 2 workers with 2 threads each means that only 4 simultaneous connections can be handled. Waiting like this will rapidly exhaust the threads. However, since the accepting loop is asynchronous and runs in its own loop, more connections may be accepted, and they will all wait for a free thread.

I need to think more about that design. I am considering removing the async accepting loop and handling the pool synchronously instead.
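
(For reference, a minimal sketch of what "handling the pool synchronously" could look like; this is an editor's illustration under assumed names, not the gunicorn implementation. The accept loop blocks until a thread is actually free, so the worker never takes on more connections than it can serve.)

# Illustration only: bound the accept loop by the number of worker threads.
from concurrent.futures import ThreadPoolExecutor
from threading import BoundedSemaphore

THREADS = 2
pool = ThreadPoolExecutor(max_workers=THREADS)
free_threads = BoundedSemaphore(THREADS)

def handle(conn):
    try:
        pass  # read the request, call the WSGI app, write the response (omitted)
    finally:
        conn.close()
        free_threads.release()        # this thread can take new work again

def accept_loop(listener):
    while True:
        free_threads.acquire()        # block here instead of over-accepting
        conn, _addr = listener.accept()
        pool.submit(handle, conn)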

tilgovi (Collaborator) commented Oct 22, 2014

👍 better to handle the pool synchronously. That is better for upstream load balancers because they will see a slow connect and might fail over.

marcinkuzminski commented Oct 24, 2014

Could it be that this error is also present even with 1 threaded worker? It seems we have an odd situation: even after commenting out threads=2 in our config, we can reproduce the state where a deadlocked worker accepts connections.

EDIT:

The interesting thing is that the deadlock happens during the request initialization phase.

user -> nginx -> gunicorn(webapp) -> rpcserver. The deadlock happens while connecting to the rpcserver, and when that happens gunicorn keeps accepting connections and waiting...

tahajahangir commented Oct 25, 2014

The listening socket is always open, and connecting clients will be queued in the TCP backlog (even if gunicorn does not accept a new connection). That's not related to this issue.

But there is another issue: after receiving the stop signal, while it waits for current requests to complete, gunicorn does not close the listening socket.
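
(A tiny demonstration of that backlog behaviour, added for illustration and not from the thread: the kernel completes the TCP handshake and queues the connection even though the server process never calls accept().)

# Illustration only: a client "connects" while the server never accepts.
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 8000))
srv.listen(128)  # up to 128 completed handshakes queue in the backlog

cli = socket.create_connection(("127.0.0.1", 8000))  # returns immediately
print("connected without the server ever calling accept()")
cli.close()
srv.close()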

benoitc (Owner) commented Oct 25, 2014

@tahajahangir this is actually by design. But why would you want it? Since this is not related to this ticket, can you open a new ticket about it?

tilgovi (Collaborator) commented Oct 25, 2014

I have raised this issue before and tried to solve it. I forget why I failed; I could dig up the comments, maybe. But we should stop listening as soon as we start shutting down.

👍 on opening a new issue for it.

benoitc (Owner) commented Oct 25, 2014

@tilgovi If I recall correctly, the reason we are not doing it is the case where we do a USR2 and then stop the old master. In that case we don't want to close the socket, so the master can still listen on it. I am actually wondering whether the system doesn't take care of that for us... (and whether it's portable)

tilgovi (Collaborator) commented Oct 27, 2014

That was indeed the reason. I'll follow up on the other issue.

diwu1989 commented Nov 1, 2014

When the max_requests option is in use and there are pending requests queued up for the worker, but the worker is shutting down, the queued-up requests are forcibly closed with a 5xx error.

benoitc (Owner) commented Dec 19, 2014

I am slightly revisiting the gthread worker. The new workflow is quite a bit simpler (see the sketch after this list):

  1. Listeners are put in the poller. On a read event we try to accept on them.
  2. When a connection is accepted, it is put in the execution queue.
  3. When a request is done and the socket can be kept alive, we put it back in the poller; on a read event we try to handle the next request. If it is not taken out of the poller before the keepalive timeout, the socket is closed.
  4. If all threads are busy, we wait until one request completes. If none completes before the timeout, we kill the worker.

I will have a patch available tomorrow.
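
(A rough sketch of that workflow for orientation; this is an editor's illustration with assumed names such as serve_one_request, not the actual patch.)

# Illustration only: one selector watches the listener and kept-alive client
# sockets, accepted connections run on a small thread pool, and the loop
# blocks when every thread is busy instead of accepting more work.
import selectors
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

THREADS = 2          # threads per worker
WAIT_TIMEOUT = 30    # how long to wait for a busy thread before giving up

def handle(conn):
    keep_alive = serve_one_request(conn)   # hypothetical request handler
    return conn if keep_alive else None

def run(listener):
    sel = selectors.DefaultSelector()
    pool = ThreadPoolExecutor(max_workers=THREADS)
    futures = set()
    sel.register(listener, selectors.EVENT_READ)            # 1. poll the listener
    while True:
        if len(futures) >= THREADS:                          # 4. all threads busy:
            done, futures = wait(futures, timeout=WAIT_TIMEOUT,
                                 return_when=FIRST_COMPLETED)
            if not done:                                     #    nothing finished in time
                raise SystemExit("worker timed out waiting for a free thread")
        else:
            done = {f for f in futures if f.done()}          # drain finished work
            futures -= done
            for key, _events in sel.select(timeout=1):
                sock = key.fileobj
                if sock is listener:
                    conn, _addr = sock.accept()              # accept on read
                else:
                    sel.unregister(sock)                     # kept-alive socket woke up
                    conn = sock
                futures.add(pool.submit(handle, conn))       # 2. execution queue
        for fut in done:
            conn = fut.result()
            if conn is not None:
                # 3. request finished on a keep-alive socket: poll it again
                # (expiring idle sockets after the keepalive timeout is omitted here)
                sel.register(conn, selectors.EVENT_READ, data=time.time())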

benoitc added a commit that referenced this issue Dec 20, 2014

stop accepting more requests when the maximum accepted is reached

This change makes sure that a worker doesn't handle more requests than it can complete. The new workflow is quite a bit simpler:

Listeners are put in the poller; on a read event we try to accept on them.
When a connection is accepted, it is put in the execution queue.
When a request is done and the socket can be kept alive, we put it back in the poller; on a read event we will try to handle the next request. If it is not taken out of the poller before the keepalive timeout, the socket is closed.
If all threads are busy, we wait until one request completes. If it doesn't complete before the timeout, we kill the worker.

fix #908
benoitc (Owner) commented Dec 20, 2014

Please review #962, which should solve the issue.

benoitc closed this in #962 Jan 21, 2015
