Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workers not timing out when requests.get hangs #1655

Closed
hingston opened this issue Dec 1, 2017 · 9 comments
Closed

Workers not timing out when requests.get hangs #1655

hingston opened this issue Dec 1, 2017 · 9 comments

Comments

@hingston
Copy link

hingston commented Dec 1, 2017

I'm making calls to an API and multiple times a day requests.get hangs then throws 'Remote end closed connection without response' after 1-2 minutes even with the timeout set to 40 seconds. The workers all hang until the exception is thrown. Nothing in the logs until [CRITICAL] WORKER TIMEOUT after the exception.

Expected Result

I would expect it to timeout after 40 seconds.

Actual Result (from logs)

No requests served for 1:55 minutes then all workers time out.

Time elapsed (hh:mm:ss.ms) 0:01:55.254653

  File "/home/vmagent/app/checker/views.py", line 127, in call_api
    return sessions[region].get(url, timeout=3)
  File "/env/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
    return self.request('GET', url, **kwargs)
  File "/env/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/env/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/env/lib/python3.6/site-packages/requests/adapters.py", line 490, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

[CRITICAL] WORKER TIMEOUT

Reproduction Steps

import datetime
import traceback

from requests import session, Timeout

s = session()
s.headers.update({'X-API-KEY': 'API-KEY-#####'})


def call_api(url):
    start_time = datetime.datetime.now()
    try:
        return s.get(url, timeout=3)
    except:
        time_elapsed = datetime.datetime.now() - start_time
        print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))
        traceback.print_exc()
        return 520  # Unknown error

gunicorn.conf.py

from psycogreen.gevent import patch_psycopg

workers = 3
keepalive = 10
timeout = 40
worker_class = 'gevent'
max_requests = 200
max_requests_jitter = 200
graceful_timeout = 5
worker_connections = 250
loglevel = 'debug'


def post_fork(server, worker):
    patch_psycopg()

app.yaml

runtime: python
env: flex
threadsafe: yes

manual_scaling:
  instances: 1

entrypoint: gunicorn -c gunicorn.conf.py -b :$PORT lolnamesgg.wsgi

beta_settings:
  cloud_sql_instances: lolnames-177809:us-central1:lolnames12

runtime_config:
  python_version: 3.6

System Information

Python 3.6.3
Google App Engine Flexible

pip==9.0.1
mysqlclient==1.3.12
Django==2.0rc1
django-db-geventpool==1.21
django-removewww==0.1.2
gevent==1.2.2
gunicorn==19.7.1
psycopg2==2.7.3.2
python-dateutil==2.6.1
pytz==2017.3
regex==2017.11.9
requests-toolbelt==0.8.0
requests[security]
six==1.11.0
wheel==0.30.0
@tilgovi
Copy link
Collaborator

tilgovi commented Dec 4, 2017

That is very strange. With the gevent worker, your requests.get() should not block the worker from sending heartbeats to the master process, so you should never see a worker timeout at all, no matter how long the request takes or what your timeout is set to.

Also, the WORKER TIMEOUT message is logged in the master process, so it should happen independently of anything happening in the worker.

I am not sure where to start debugging, beyond sharing that context.

@hingston
Copy link
Author

hingston commented Dec 4, 2017

I've been dealing with this for a long time. I initially thought it might have been a memory issue but doubling the memory does nothing to help. I'm not sure how to better debug this as I can't reproduce it on my local system. It only ever happens when a requests exception is thrown. It doesn't have to be the requests.exceptions.ConnectionError I mentioned it is sometimes related to SSL handshakes and other requests exceptions but it is only when a requests exception occurs that I see this global hang from gunicorn. My solution so far is to run 2 instances (doubling the cost and not ideal) as it happens in frequently (4-5 times a day for 1-8 minutes). My site serves ~80k people a day and I'm a student so it's a hobby so I'm not too concerned about user experience but I would like to keep my costs low!

@tilgovi
Copy link
Collaborator

tilgovi commented Dec 6, 2017

Could this be anything like #588?

@tilgovi
Copy link
Collaborator

tilgovi commented Dec 9, 2017

Here's a simple application I am using to try to reproduce the problem. I cannot.

import requests

s = requests.session()

def application(environ, start_response):
    try:
        return s.get('http://httpbin.org/delay/5', timeout=3)
    except:
        start_response('502 Bad Gateway', {})
        return iter([])

    start_response('200 OK', {})
    return iter([b'Hello'])

I start with gunicorn -k gevent api_client:application. With sync workers the result is the same. The endpoint takes 5 seconds to respond, but requests times out the attempt after 3. The error is raised, handled, and 502 is returned to my client.

Can you provide a simple example that reproduces the problem, maybe in a repo? Do you experience it locally, or only on App Engine?

@benoitc
Copy link
Owner

benoitc commented Dec 12, 2017

Sorry that’s not clear for me. By no requests served you mean that no other requests using different endpoints were also returning?

@hingston
Copy link
Author

I have to apologise for the slow reply. I'll have some time to try and replicate this more reliably over my xmas break.

@benoitc It completely hangs, then eventually serves 502s after the exception is raised or until (very occasionally) Google Cloud's health checks timeout at 3 minutes.

It does look similar to #588 in some ways. I'll be completely free after the 19th to investigate this fully.

@hingston
Copy link
Author

hingston commented Dec 13, 2017

Ohh sorry also @tilgovi return s.get('http://httpbin.org/delay/5', timeout=3) isn't the going to raise the requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',)) exception. Requests never actually reaches a timeout.

timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.

I don't know how to reproduce exceptions like that currently which has been my main struggle in reproducing this locally.

@hingston
Copy link
Author

I was never able to reproduce it locally. This week after the following changes:

pip==9.0.1
Django==2.0.2
django-db-geventpool==2
django-removewww==0.1.2
gevent==1.3a1
gunicorn==19.7.1
psycopg2-binary==2.7.4
python-dateutil==2.6.1
pytz==2018.3
regex==2018.2.8
requests[security]

It is all working fine. So it almost certainly wasn't gunicorn.

@tilgovi
Copy link
Collaborator

tilgovi commented Feb 11, 2018

I'll close this issue, but feel free to re-open if you track down any issues.

@tilgovi tilgovi closed this as completed Feb 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants