[cheroot==8.1.0 regression] Occasional connection resets with concurrent requests #263

Closed
tobiashenkel opened this issue Feb 3, 2020 · 12 comments · Fixed by #277
Labels: bug (Something is broken), regression (Something that worked earlier got broken in a new release), reproducer: present (this PR or issue contains code which reproduces the problem described, or a clearly understandable STR)

Comments

@tobiashenkel

I'm submitting a ...

  • [x] bug report
  • [ ] feature request
  • [ ] question about the decisions made in the repository

Do you want to request a feature or report a bug?

Reporting a bug

What is the current behavior?

When running concurrent requests against a CherryPy server, the client occasionally gets 'Connection reset by peer' errors.

If the current behavior is a bug, please provide the steps to reproduce and, if possible, screenshots and logs of the problem. If you can, show us your code.

This can be reproduced by running the following server and client code. Tested on macOS with Python 3.7 and on Ubuntu Bionic with Python 3.6; both show the same connection errors:

ERROR:test:FAILED http://localhost:5000/info: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
(...)
ERROR:test:FAILED http://localhost:5000/info: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
ERROR:test:FAILED http://localhost:5000/info: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
INFO:test:921 out of 1000 successful

When reproducing locally, the issue shows up occasionally with as few as 20 workers and on every run of the client script with 50 workers. I also played with various thread pool and queue size settings (see the commented-out options in the server code below) but couldn't make the errors go away.

Server code:

import logging
import cherrypy

logging.basicConfig(level=logging.DEBUG)


class TestAPI:

    @cherrypy.expose
    def info(self):
        return 'test\n'


def main():
    route_map = cherrypy.dispatch.RoutesDispatcher()
    api = TestAPI()
    route_map.connect('info', '/info', controller=api, action='info')

    conf = {
        '/': {
            'request.dispatch': route_map
        }
    }

    cherrypy.tree.mount(api, '/', config=conf)

    cherrypy.config.update({
        'global': {
            'environment': 'production',
            'server.socket_host': '0.0.0.0',
            'server.socket_port': 5000,
            # 'server.thread_pool': 20,
            # 'server.thread_pool_max': 20,
            # 'server.socket_queue_size': 128,
            # 'server.accepted_queue_size': 100000,
        },
    })

    cherrypy.engine.start()
    cherrypy.engine.block()


if __name__ == '__main__':
    main()

Client code:

from concurrent.futures.thread import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
import logging

baseurl = 'http://localhost:5000/'

log = logging.getLogger('test')


def configure_logging(context):
    logger = logging.getLogger(context)
    logger.setLevel(logging.DEBUG)


def request_info(session):
    url = urljoin(baseurl, 'info')
    try:
        r = session.get(url)
    except requests.exceptions.ConnectionError as e:
        log.error('FAILED %s: %s', url, e)
        return False
    r.raise_for_status()
    return True


def main():
    session = requests.session()
    a = requests.adapters.HTTPAdapter(pool_connections=1, pool_maxsize=1000)
    session.mount('http://', a)

    count = 1000
    success = 0
    with ThreadPoolExecutor(max_workers=50) as tp:
        tasks = []
        for i in range(count):
            tasks.append(tp.submit(request_info, session))

        for task in tasks:
            if task.result():
                success += 1

    log.info('%s out of %s successful', success, count)


if __name__ == "__main__":
    logging.basicConfig()
    configure_logging('test')
    main()

We see the same behavior when running a production service behind Apache as a reverse proxy.

What is the expected behavior?

There are no 'Connection reset by peer' errors.

What is the motivation / use case for changing the behavior?

Please tell us about your environment:

  • Cheroot version: 8.2.1
  • CherryPy version: 18.5.0
  • Python version: 3.6.9 (Ubuntu Bionic) and 3.7 (macOS)
  • OS: macOS and Ubuntu Bionic
  • Browser: requests library and Apache reverse proxy

Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, e.g. stackoverflow, gitter, etc.)

Note that the problem doesn't reproduce when running the same application with Tornado as the HTTP server:

import logging
import cherrypy

logging.basicConfig(level=logging.DEBUG)


class TestAPI:

    @cherrypy.expose
    def info(self):
        return 'test\n'


def main():
    import tornado
    import tornado.httpserver
    import tornado.ioloop  # imported explicitly; PeriodicCallback/IOLoop are used below
    import tornado.wsgi

    route_map = cherrypy.dispatch.RoutesDispatcher()
    api = TestAPI()
    route_map.connect('info', '/info', controller=api, action='info')

    conf = {
        '/': {
            'request.dispatch': route_map
        }
    }

    wsgiapp = cherrypy.tree.mount(api, '/', config=conf)

    # Disable autoreload, which won't play well with an external server
    cherrypy.config.update({'engine.autoreload.on': False})

    # let's not start the CherryPy HTTP server
    cherrypy.server.unsubscribe()

    # use CherryPy's signal handling
    cherrypy.engine.signals.subscribe()

    # Prevent CherryPy logs from being propagated
    # to the Tornado logger
    cherrypy.log.error_log.propagate = False

    # Run the engine but don't block on it
    cherrypy.engine.start()

    # Run the Tornado stack
    container = tornado.wsgi.WSGIContainer(wsgiapp)
    http_server = tornado.httpserver.HTTPServer(container)
    http_server.listen(5000)

    # Publish to the CherryPy engine as if
    # we were using its mainloop
    tornado.ioloop.PeriodicCallback(
        lambda: cherrypy.engine.publish('main'), 100).start()
    tornado.ioloop.IOLoop.instance().start()


if __name__ == '__main__':
    main()
@webknjaz (Member) commented Feb 3, 2020

Hm… I cannot reproduce this on my laptop (Gentoo Linux, Python 3.7). But maybe I have different kernel limits:

$ cat /proc/sys/net/core/somaxconn
128
$ sysctl -n net.core.somaxconn
128

@tobiashenkel could you please check what value you have there? Can you set server.socket_queue_size to that value? Can you also increase the system setting and test with that?

Ref: https://stackoverflow.com/a/2444483/595220
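
Something like this in the server reproducer above is what I have in mind (just a sketch; the value only mirrors the somaxconn output, and AFAIU this setting ends up as the listen() backlog of the underlying Cheroot server):

import cherrypy

# Sketch only: 128 just mirrors the somaxconn output above.
cherrypy.config.update({
    'global': {
        'environment': 'production',
        'server.socket_host': '0.0.0.0',
        'server.socket_port': 5000,
        'server.socket_queue_size': 128,
    },
})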

P.S. Another thing I noticed is that the network stack is completely different in these two examples: Tornado uses asyncio as its event loop, so some settings may come from that library, while Cheroot only has a synchronous, thread-based implementation.

@webknjaz (Member) commented Feb 3, 2020

Another setting to check:

$ sysctl -n net.ipv4.tcp_abort_on_overflow                  
0

Ref: https://serverfault.com/a/519152/111608

@webknjaz (Member) commented Feb 3, 2020

Another ref on the connection backlog: https://serverfault.com/a/715873/111608

@tobiashenkel (Author) commented Feb 3, 2020

Thanks, I re-tested and checked the mentioned settings:

  • net.core.somaxconn defaults to 128
  • net.ipv4.tcp_abort_on_overflow defaults to 0

Re-tested with somaxconn and socket_queue_size set to 128 and then to 1024 on Ubuntu, with no change in the observed behavior.

The Ubuntu test was done on a freshly created VM in AWS.

@webknjaz (Member) commented Feb 3, 2020

I'll need to test it in an Ubuntu container, then… Meanwhile, try downgrading to cheroot<8.1.0 to verify whether it's a regression from the refactoring introduced in #199.
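
To confirm which cheroot version the server process actually picks up after pinning, a quick sketch (using pkg_resources, which ships with setuptools):

import pkg_resources

# Print the cheroot version visible to the running interpreter,
# to confirm the downgrade/pin actually took effect.
print(pkg_resources.get_distribution('cheroot').version)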

@webknjaz transferred this issue from cherrypy/cherrypy Feb 3, 2020
@webknjaz added the bug, regression and reproducer: present labels Feb 3, 2020
@triage-new-issues bot removed the triage label Feb 3, 2020
@webknjaz changed the title from "Occasional connection resets with concurrent requests" to "[cheroot==8.1.0 regression] Occasional connection resets with concurrent requests" Feb 3, 2020
@webknjaz mentioned this issue Feb 3, 2020
@tobiashenkel (Author) commented Feb 3, 2020

I've successfully reproduced the issue directly with Cheroot using this server code:

from cheroot import wsgi

def my_crazy_app(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [b'Hello world!']


addr = '0.0.0.0', 5000
server = wsgi.Server(addr, my_crazy_app)
server.start()

The client code needed a few more threads, though, without CherryPy in the mix.
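
For reference, the CherryPy tuning knobs discussed above can also be passed straight to the Cheroot server when reproducing this way; a sketch with illustrative values only (assuming the numthreads/request_queue_size keyword arguments of cheroot.wsgi.Server):

from cheroot import wsgi


def my_crazy_app(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    return [b'Hello world!']


addr = '0.0.0.0', 5000
server = wsgi.Server(
    addr,
    my_crazy_app,
    numthreads=20,           # analogous to server.thread_pool
    request_queue_size=128,  # analogous to server.socket_queue_size
)
server.start()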

@webknjaz (Member) commented Feb 3, 2020

FTR, the client reproducer code uses urllib3.connectionpool.HTTPConnectionPool under the hood, meaning that TCP connections are reused. AFAIU, requests does not support HTTP pipelining, so these should be treated as ordinary HTTP requests that just happen to arrive over the same TCP connection.
I have a hunch that the current connection manager that @the-allanc implemented may be dropping TCP connections too early.
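
One cheap way to probe that hunch (a diagnostic sketch only, not a fix): make the client ask the server to close the connection after every response, so no connection is ever reused; if the resets disappear, that points at the keep-alive connection handling.

import requests

# Diagnostic sketch: defeat connection reuse on the client side by sending
# 'Connection: close' with each request against the reproducer server.
session = requests.session()
resp = session.get('http://localhost:5000/info', headers={'Connection': 'close'})
print(resp.status_code)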

@webknjaz (Member) commented Feb 3, 2020

Interesting, I can't seem to reproduce it with

$ ab -v4 -k -c 1020 -t 5 http://localhost:5000/info

(doesn't reuse connections?)

@the-allanc (Contributor)

I know what the cause of this is and I've got an in-progress fix for it. I'll post an update shortly.

@webknjaz (Member) commented Apr 6, 2020

@the-allanc thanks! Will it also fix #249?

@the-allanc (Contributor)

@the-allanc thanks! Will it also fix #249?

I'd be surprised if it does, but I'd be happy if it did. I've submitted #277 to address this issue.

jaraco added a commit that referenced this issue Apr 11, 2020
wmfgerrit pushed a commit to wikimedia/integration-zuul that referenced this issue Jun 15, 2020
CherryPy uses cheroot underneath as http server. The release 8.1.0
introduced a regression which leads to occasional connection resets
while handling concurrent requests [1]. Cap it until there is a
release available that fixes this issue.

[1] cherrypy/cheroot#263

Change-Id: I39716e9c203cc42ab320cb47e090fbc8f67320c9