[cheroot==8.1.0 regression] Occasional connection resets with concurrent requests #263

Closed
tobiashenkel opened this issue Feb 3, 2020 · 12 comments · Fixed by #277
Labels: bug (Something is broken), regression (Something that worked earlier got broken in a new release), reproducer: present (this PR or issue contains code which reproduces the problem described, or a clearly understandable STR)

Comments

@tobiashenkel

I'm submitting a ...

  • [x] bug report
  • [ ] feature request
  • [ ] question about the decisions made in the repository

Do you want to request a feature or report a bug?

Reporting a bug

What is the current behavior?

When running concurrent requests against a CherryPy server, the client occasionally gets 'Connection reset by peer' errors.

If the current behavior is a bug, please provide the steps to reproduce and, if possible, screenshots and logs of the problem. If you can, show us your code.

This can be reproduced by running the following server and client code. Tested on macOS with Python 3.7 and on Ubuntu Bionic with Python 3.6; both show the same connection errors:

ERROR:test:FAILED http://localhost:5000/info: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
(...)
ERROR:test:FAILED http://localhost:5000/info: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
ERROR:test:FAILED http://localhost:5000/info: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
INFO:test:921 out of 1000 successful

When reproducing locally, the issue shows up occasionally with as few as 20 workers and on every run of the client script with 50 workers. I also played with various thread pool and queue size settings (see the commented-out options in the server code below) but couldn't make the errors go away.

Server code:

import logging
import cherrypy

logging.basicConfig(level=logging.DEBUG)


class TestAPI:

    @cherrypy.expose
    def info(self):
        return 'test\n'


def main():
    route_map = cherrypy.dispatch.RoutesDispatcher()
    api = TestAPI()
    route_map.connect('info', '/info', controller=api, action='info')

    conf = {
        '/': {
            'request.dispatch': route_map
        }
    }

    cherrypy.tree.mount(api, '/', config=conf)

    cherrypy.config.update({
        'global': {
            'environment': 'production',
            'server.socket_host': '0.0.0.0',
            'server.socket_port': 5000,
            # 'server.thread_pool': 20,
            # 'server.thread_pool_max': 20,
            # 'server.socket_queue_size': 128,
            # 'server.accepted_queue_size': 100000,
        },
    })

    cherrypy.engine.start()
    cherrypy.engine.block()


if __name__ == '__main__':
    main()

Client code:

from concurrent.futures.thread import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
import logging

baseurl = 'http://localhost:5000/'

log = logging.getLogger('test')


def configure_logging(context):
    logger = logging.getLogger(context)
    logger.setLevel(logging.DEBUG)


def request_info(session):
    url = urljoin(baseurl, 'info')
    try:
        r = session.get(url)
    except requests.exceptions.ConnectionError as e:
        log.error('FAILED %s: %s', url, e)
        return False
    r.raise_for_status()
    return True


def main():
    session = requests.session()
    a = requests.adapters.HTTPAdapter(pool_connections=1, pool_maxsize=1000)
    session.mount('http://', a)

    count = 1000
    success = 0
    with ThreadPoolExecutor(max_workers=50) as tp:
        tasks = []
        for i in range(count):
            tasks.append(tp.submit(request_info, session))

        for task in tasks:
            if task.result():
                success += 1

    log.info('%s out of %s successful', success, count)


if __name__ == "__main__":
    logging.basicConfig()
    configure_logging('test')
    main()

We see the same behavior when running a production service behind Apache as a reverse proxy.

What is the expected behavior?

There are no 'Connection reset by peer' errors.

What is the motivation / use case for changing the behavior?

Please tell us about your environment:

  • Cheroot version: 8.2.1
  • CherryPy version: 18.5.0
  • Python version: 3.6.9 (Ubuntu Bionic) and 3.7 (macOS)
  • OS: macOS and Ubuntu Bionic
  • Browser: requests library and Apache reverse proxy

Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, e.g. stackoverflow, gitter, etc.)

Note that the problem doesn't reproduce when running the same application with Tornado as the HTTP server:

import logging
import cherrypy

logging.basicConfig(level=logging.DEBUG)


class TestAPI:

    @cherrypy.expose
    def info(self):
        return 'test\n'


def main():
    import tornado
    import tornado.httpserver
    import tornado.ioloop  # imported explicitly; PeriodicCallback/IOLoop are used below
    import tornado.wsgi

    route_map = cherrypy.dispatch.RoutesDispatcher()
    api = TestAPI()
    route_map.connect('info', '/info', controller=api, action='info')

    conf = {
        '/': {
            'request.dispatch': route_map
        }
    }

    wsgiapp = cherrypy.tree.mount(api, '/', config=conf)

    # Disable autoreload, which won't play well with an external server
    cherrypy.config.update({'engine.autoreload.on': False})

    # let's not start the CherryPy HTTP server
    cherrypy.server.unsubscribe()

    # use CherryPy's signal handling
    cherrypy.engine.signals.subscribe()

    # Prevent CherryPy logs from being propagated
    # to the Tornado logger
    cherrypy.log.error_log.propagate = False

    # Run the engine but don't block on it
    cherrypy.engine.start()

    # Run the Tornado stack
    container = tornado.wsgi.WSGIContainer(wsgiapp)
    http_server = tornado.httpserver.HTTPServer(container)
    http_server.listen(5000)

    # Publish to the CherryPy engine as if
    # we were using its mainloop
    tornado.ioloop.PeriodicCallback(
        lambda: cherrypy.engine.publish('main'), 100).start()
    tornado.ioloop.IOLoop.instance().start()


if __name__ == '__main__':
    main()
@webknjaz (Member) commented Feb 3, 2020

Hm… I cannot reproduce this on my laptop (Gentoo Linux, Python 3.7). But maybe I have different kernel limits:

$ cat /proc/sys/net/core/somaxconn
128
$ sysctl -n net.core.somaxconn
128

@tobiashenkel could you please check what value you have there? Can you set server.socket_queue_size to that value? Can you also increase the system setting and test with that?

Ref: https://stackoverflow.com/a/2444483/595220
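
Something like this in the server reproducer above is what I have in mind (just a sketch; the value only mirrors the somaxconn output, and AFAIU this setting ends up as the listen() backlog of the underlying Cheroot server):

import cherrypy

# Sketch only: 128 just mirrors the somaxconn output above.
cherrypy.config.update({
    'global': {
        'environment': 'production',
        'server.socket_host': '0.0.0.0',
        'server.socket_port': 5000,
        'server.socket_queue_size': 128,
    },
})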

P.S. Another thing I noticed is that the network stack is completely different in these two examples: Tornado uses asyncio as its event loop, so some settings may come from that library, while Cheroot only has a synchronous, thread-based implementation.

@webknjaz (Member) commented Feb 3, 2020

Another setting to check:

$ sysctl -n net.ipv4.tcp_abort_on_overflow                  
0

Ref: https://serverfault.com/a/519152/111608

@webknjaz (Member) commented Feb 3, 2020

Another ref on the connection backlog: https://serverfault.com/a/715873/111608

@tobiashenkel (Author) commented Feb 3, 2020

Thanks, I re-tested and checked the mentioned settings:

  • net.core.somaxconn defaults to 128
  • net.ipv4.tcp_abort_on_overflow defaults to 0

Re-tested with somaxconn and socket_queue_size set to 128 and then to 1024 on Ubuntu, with no change in the observed behavior.

The Ubuntu test was done on a freshly created VM in AWS.

@webknjaz (Member) commented Feb 3, 2020

I'll need to test it in an Ubuntu container, then… Meanwhile, try downgrading to cheroot<8.1.0 to verify whether it's a regression from the refactoring introduced in #199.
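
To confirm which cheroot version the server process actually picks up after pinning, a quick sketch (using pkg_resources, which ships with setuptools):

import pkg_resources

# Print the cheroot version visible to the running interpreter,
# to confirm the downgrade/pin actually took effect.
print(pkg_resources.get_distribution('cheroot').version)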

@webknjaz transferred this issue from cherrypy/cherrypy Feb 3, 2020
@webknjaz added the bug, regression and reproducer: present labels Feb 3, 2020
@triage-new-issues bot removed the triage label Feb 3, 2020
@webknjaz changed the title from "Occasional connection resets with concurrent requests" to "[cheroot==8.1.0 regression] Occasional connection resets with concurrent requests" Feb 3, 2020
@webknjaz mentioned this issue Feb 3, 2020
@tobiashenkel (Author) commented Feb 3, 2020

I've successfully reproduced the issue directly with Cheroot using this server code:

from cheroot import wsgi

def my_crazy_app(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [b'Hello world!']


addr = '0.0.0.0', 5000
server = wsgi.Server(addr, my_crazy_app)
server.start()

The client code needed a few more threads, though, without CherryPy in the mix.
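
For reference, the CherryPy tuning knobs discussed above can also be passed straight to the Cheroot server when reproducing this way; a sketch with illustrative values only (assuming the numthreads/request_queue_size keyword arguments of cheroot.wsgi.Server):

from cheroot import wsgi


def my_crazy_app(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    return [b'Hello world!']


addr = '0.0.0.0', 5000
server = wsgi.Server(
    addr,
    my_crazy_app,
    numthreads=20,           # analogous to server.thread_pool
    request_queue_size=128,  # analogous to server.socket_queue_size
)
server.start()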

@webknjaz (Member) commented Feb 3, 2020

FTR, the client reproducer code uses urllib3.connectionpool.HTTPConnectionPool under the hood, meaning that TCP connections are reused. AFAIU, requests does not support HTTP pipelining, so these should be treated as ordinary HTTP requests that just happen to arrive over the same TCP connection.
I have a hunch that the current connection manager that @the-allanc implemented may be dropping TCP connections too early.
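
One cheap way to probe that hunch (a diagnostic sketch only, not a fix): make the client ask the server to close the connection after every response, so no connection is ever reused; if the resets disappear, that points at the keep-alive connection handling.

import requests

# Diagnostic sketch: defeat connection reuse on the client side by sending
# 'Connection: close' with each request against the reproducer server.
session = requests.session()
resp = session.get('http://localhost:5000/info', headers={'Connection': 'close'})
print(resp.status_code)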

@webknjaz (Member) commented Feb 3, 2020

Interesting, I can't seem to reproduce it with

$ ab -v4 -k -c 1020 -t 5 http://localhost:5000/info

(doesn't reuse connections?)

@the-allanc (Contributor)

I know what the cause of this is and I've got an in-progress fix for it. I'll post an update shortly.

@webknjaz (Member) commented Apr 6, 2020

@the-allanc thanks! Will it also fix #249?

@the-allanc (Contributor)

@the-allanc thanks! Will it also fix #249?

I'd be surprised if it does, but I'd be happy if it did. I've submitted #277 to address this issue.

jaraco added a commit that referenced this issue Apr 11, 2020
wmfgerrit pushed a commit to wikimedia/integration-zuul that referenced this issue Jun 15, 2020
CherryPy uses cheroot underneath as http server. The release 8.1.0
introduced a regression which leads to occasional connection resets
while handling concurrent requests [1]. Cap it until there is a
release available that fixes this issue.

[1] cherrypy/cheroot#263

Change-Id: I39716e9c203cc42ab320cb47e090fbc8f67320c9