
speed up connector limiting #2937

Merged 24 commits into aio-libs:master on Apr 27, 2018

Conversation

thehesiod (Contributor):

see conversation in #1821

@@ -392,11 +396,18 @@ def closed(self):

try:
await fut
finally:
# remove a waiter even if it was cancelled
waiters.remove(fut)
Contributor Author (thehesiod):

this is slow because it's trying to find a future in a list for each request that's waiting on a connector

def _release_waiter(self):
# always release only one waiter

if self._limit:
# if we have limit and we have available
if self._limit - len(self._acquired) > 0:
for key, waiters in self._waiters.items():
if waiters:
if not waiters[0].done():
Contributor Author (thehesiod):

with the new model we can guarantee that there are only active waiters in the list so we can greatly simplify this to always popping and notifying the first item in the list

return False

waiter = waiters.pop(0)
waiter.set_result(None)
Contributor:

You cannot guarantee that the element is a non-cancelled future, even though you are removing it from the list here [1].

Once the future is set with an exception, the callback [2] that delivers that exception is only scheduled on the loop; so before the callback that cancels the task actually executes, the task that was holding a connection and wants to release it might pop a cancelled future.

[1] https://github.com/aio-libs/aiohttp/pull/2937/files#diff-7f25afde79f309e2f8722c26cf1f10adR399
[2] https://github.com/python/cpython/blob/master/Lib/asyncio/futures.py#L254

Contributor Author (thehesiod):

I don't see how this can happen, because the async stack ends at the await on the waiter future and there are no done callbacks associated with it; see this example: https://ideone.com/izIfqw

Think about how this would work: currently nothing sets an exception on this future other than cancelling the waiter at [1]. We can verify this because the future is only reachable from self._waiters and nothing else sets an exception on it.

So, if you cancel the task at [1], there are two options:

  1. you don't wait on the cancelled task:
    1.1) you can release it (ok, removed from the list), and then wait on the cancelled task (item already gone from the list)
    1.2) you can not release it (ok), then wait on the cancelled task (item will be removed from the list)
  2. you wait on the cancelled task, in which case the BaseException handler runs and removes the item from the list. Nothing else can happen, because we're just waiting on the future and no one else is waiting on it or has callbacks registered against it.

Contributor Author (thehesiod):

let me know if I'm missing something in my logic, I know it can get complex

Contributor:

The task only finishes once the loop schedules it again, and the result that is handed back to the task depends on the future the task was waiting on. Cancel is only a signal that wakes up the task with a CancelledError exception.

So the task might still be there, just waiting for its turn in the loop. While that is happening, you can hit the situation I described.

Please review the asyncio lock code; all of the different implementations take this into consideration.
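
A minimal illustration of this point (a sketch using plain asyncio, not aiohttp code): cancelling the future marks it done immediately, but the task awaiting it is only woken up on a later loop iteration.

import asyncio

loop = asyncio.get_event_loop()


async def waiter_task(fut):
    try:
        await fut
    except asyncio.CancelledError:
        print("task woke up with CancelledError")


async def main():
    fut = loop.create_future()
    t = asyncio.ensure_future(waiter_task(fut))

    # one loop iteration so the task reaches `await fut`
    await asyncio.sleep(0)

    fut.cancel()
    print(fut.cancelled())  # True: the future is marked cancelled right away
    print(t.done())         # False: the waiting task has not been woken yet

    # the next loop iteration runs the scheduled callback and wakes the task
    await asyncio.sleep(0)
    print(t.done())         # True

loop.run_until_complete(main())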

Contributor Author (thehesiod):

Thanks @pfreixes for that example; so it seems it depends on the ordering of the callbacks run during cancellation. Here's a question: can you have an outstanding release for a single connector? It seems like you could end up releasing two connections for a single cancel if the except clause executes before wakeup_next_waiter, no?

Contributor:

Not sure about the question, but once the race condition is considered and mitigated, the code should work as expected.

Contributor Author (thehesiod):

@pfreixes my "picture" :)

import asyncio

loop = asyncio.get_event_loop()
waiters = []


async def task(waiter):
    try:
        await waiter
    except:
        print("task finalized")
        if waiter in waiters:
            print("waiter removed")
            try:
                waiters.remove(waiter)
            except ValueError:
                print("Waiter already removed")
        else:
            print("waiter not present")
        raise


def wakeup_next_waiter():
    if not waiters:
        return

    waiter = waiters.pop(0)
    try:
        if not waiter.done():
            waiter.set_result(None)
    except Exception as e:
        print(f"Exception calling set_result {e!r}")
        raise


async def main(loop):
    # We add two waiters
    waiter = loop.create_future()
    waiters.append(waiter)

    waiter = loop.create_future()
    waiters.append(waiter)

    # create the task that will wait till either the waiter
    # is finished or the task is cancelled.
    t = asyncio.ensure_future(task(waiter))

    # make a loop iteration to allow the task reach the
    #     await waiter
    await asyncio.sleep(0)

    # put in the loop a callback to wake up the waiter.
    loop.call_later(0.1, wakeup_next_waiter)

    # cancel the task, this will mark the task as cancelled
    # but will be pending a loop iteration to wake up the
    # task, having as a result a CanceledError exception.
    # This implicitly will schedule the Task._wakeup function
    # to be executed in the next loop iteration.
    t.cancel()

    try:
        await t
    except asyncio.CancelledError:
        pass

    # wait for the release to run
    await asyncio.sleep(1)

    # now we have zero waiters even though only one was cancelled
    print(len(waiters))

loop.run_until_complete(main(loop))

Note that in this example I add two waiters and cancel one, but at the end none are left, because I ensure that wakeup_next_waiter runs after the cancelled task finalizes.

Contributor:

I don't see the issue in the code - be careful with extra ifs that are not needed in the task function. If you are claiming that both waiters are removed, this IMO is perfectly fine.

The first waiter that is created is removed through the happy path of wakeup_next_waiter, being removed by waiter = waiters.pop(0); no task is woken up because you didn't start any task with that waiter. The second waiter is cancelled, and the related task removes the waiter via waiters.remove(waiter).

Just to put all of us on the same page, the release of a used connection is done automatically by the context manager of a request calling resp.release() [1]. So every time a code path leaves the context, it releases the connection and tries to wake up pending requests.

[1] https://github.com/aio-libs/aiohttp/blob/master/aiohttp/client_reqrep.py#L786
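
For context, from the caller's point of view that release path looks roughly like this (an illustrative sketch, not the internals):

import asyncio
import aiohttp


async def fetch():
    async with aiohttp.ClientSession() as session:
        # entering the request context acquires a connection from the connector
        async with session.get("http://example.com") as resp:
            await resp.read()
        # leaving the block releases the response/connection back to the
        # connector, which then tries to wake up a pending waiter

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch())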

Contributor Author (thehesiod):

OK, I think the way to think about this is that there are N waiters and M connections; while they're represented in self._waiters, the "ownership" on each side is unique. You may inefficiently wake up a cancelled waiter with the current algorithm, but it's probably easier to deal with. Thanks again, the conversation was enlightening. I will work on the unit test fixes and the recommended change today.

if not waiters:
del self._waiters[key]
except BaseException:
# remove a waiter even if it was cancelled, normally it's
Contributor:

if the future has been canceled, we do need to wake up another waiter. Take a look at the semaphore implementation [1]

[1] https://github.com/python/cpython/blob/master/Lib/asyncio/locks.py#L478
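
For reference, the pattern used there (paraphrased, not the exact CPython source) keeps popping until it finds a waiter that is not already done:

from collections import deque


def wake_up_next(waiters: deque) -> None:
    # skip futures that are already done (e.g. cancelled) so the freed slot
    # is handed to a waiter that can still use it
    while waiters:
        waiter = waiters.popleft()
        if not waiter.done():
            waiter.set_result(None)
            return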

Contributor Author (@thehesiod, Apr 14, 2018):

This method created the future, so why would we need to wake up another waiter? That doesn't make sense, as it would imply that yet another connection is available. This is 1-1: one waiter was added, one removed. Also note that that code runs only if the future was not cancelled, while in this scenario it can only be cancelled.

Contributor:

My mistake, the wake-up will be done automatically by the exit of the context manager in any scenario, so forget about this.


# {host_key: FIFO list of waiters}
# NOTE: this is not a true FIFO because the true order is lost amongst
# the dictionary keys
Contributor:

I would suggest using the deque [1] data structure, which is the one used by all of the asyncio.locks implementations, most likely because lists have the following constraint:

lists incur O(n) memory movement costs for pop(0)

[1] https://github.com/python/cpython/blob/master/Lib/asyncio/locks.py#L437
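
A quick, illustrative way to see the difference (timings will vary by machine):

from collections import deque
from timeit import timeit

N = 100_000


def drain_list():
    waiters = list(range(N))
    while waiters:
        waiters.pop(0)      # O(N) per pop: shifts every remaining element


def drain_deque():
    waiters = deque(range(N))
    while waiters:
        waiters.popleft()   # O(1) per pop

print("list :", timeit(drain_list, number=1))
print("deque:", timeit(drain_deque, number=1))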

Contributor Author (thehesiod):

great idea, done, slight speed improvement :)

waiters.remove(fut)
if not waiters:
del self._waiters[key]
except BaseException:
Member (asvetlov):

Sorry, I don't understand why the deletion is moved from finally to except.
Why shouldn't we remove the waiter if no exception was raised?

Contributor Author (thehesiod):

@asvetlov this is for two reasons: performance, and that if no exception is thrown, the removal is already done by the release method.
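
As a rough sketch of that bookkeeping (illustrative names, not the actual connector code): the waiting side only cleans up after itself on failure, because on the happy path the releasing side already popped the waiter before waking it.

import asyncio
from collections import defaultdict, deque

# illustrative structure: {host_key: deque of waiter Futures}, FIFO per key
_waiters = defaultdict(deque)


async def _wait_for_slot(key, loop):
    fut = loop.create_future()
    _waiters[key].append(fut)
    try:
        await fut
    except BaseException:
        # failure/cancellation path: remove our own waiter, unless the
        # releasing side already popped it before we were woken up
        waiters = _waiters[key]
        if fut in waiters:
            waiters.remove(fut)
        if not waiters:
            del _waiters[key]
        raise
    # happy path: _release_one() already removed the waiter via popleft(),
    # so there is nothing to clean up here


def _release_one(key):
    waiters = _waiters.get(key)
    if not waiters:
        return False
    fut = waiters.popleft()
    if not waiters:
        del _waiters[key]
    if not fut.done():
        fut.set_result(None)
    return True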

@pfreixes (Contributor):

@thehesiod it would be nice if we can get the CI green; right now tests related to the stuff you are improving are failing.

If the numbers that you claim [1] improve aiohttp by such a measure, we would need to release a new version ASAP.

So, let's try to focus on resolving the MR comments and move this MR ahead.

[1] #1821 (comment)

@thehesiod (Contributor Author):

failures don't seem like my fault:
AttributeError: module 'asyncio.coroutines' has no attribute 'debug_wrapper'

codecov-io commented Apr 18, 2018

Codecov Report

Merging #2937 into master will increase coverage by <.01%.
The diff coverage is 80.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2937      +/-   ##
==========================================
+ Coverage   97.99%   97.99%   +<.01%     
==========================================
  Files          40       40              
  Lines        7520     7531      +11     
  Branches     1318     1317       -1     
==========================================
+ Hits         7369     7380      +11     
- Misses         48       49       +1     
+ Partials      103      102       -1
Impacted Files Coverage Δ
aiohttp/connector.py 96.86% <80.95%> (+0.04%) ⬆️
aiohttp/web_app.py 99.09% <0%> (+0.01%) ⬆️


- fix tests on OSX
- add new test for coverage
- fix wrong number of items in _conns
@pfreixes (Contributor):

@thehesiod it seems that you still have a linter issue:

12.70s$ flake8 aiohttp examples tests demos
tests/test_connector.py:1752:5: F841 local variable 'i' is assigned to but never used

if not waiters:
del self._waiters[key]

return True
Contributor:

If you don't do anything with the result, don't return a result at all; it's like dead code. This is just a small remark.

Contributor Author (thehesiod):

? result is used in _release_waiter

Contributor:

Ah I see, sorry


# {host_key: FIFO list of waiters}
# NOTE: this is not a true FIFO because the true order is lost amongst
# the dictionary keys
Contributor:

I guess this comment can be removed; the FIFO is per dictionary key. Different keys mean different (host, port) tuples, for which FIFO ordering does not make sense.

Contributor Author (@thehesiod, Apr 19, 2018):

Each deque is indeed a FIFO, as the first one in will be the first one out (among the items in that deque); however, across keys it's not a FIFO, because it currently iterates over the keys (which theoretically can be in any order) when choosing which FIFO to release from.

Contributor:

Which deque to release from is not a random choice; it's based on the hash of the host and port, so the requests that are waiting for a free connection and match the host and port will share the same deque in a FIFO way.

Yes, we are saying the same thing; the dictionary is just the structure that keeps all of the FIFO queues.

Let's save the comments for what is really hard to understand.

Contributor Author (thehesiod):

I didn't say it was random; I said it wasn't a true FIFO queue, because it chooses which queue to release a connection from in dictionary-key order, not in overall FIFO order. Anyway, I removed the comment and people will have to figure this out themselves now. If this were to be "correct" there would need to be a separate priority queue with pointers back to these queues... or perhaps a multi-index priority queue :)

@pfreixes (Contributor):

@asvetlov this MR is quite critical and, once merged, will ship a good performance improvement when the client handles a lot of concurrent connections and starts to apply backpressure; the figures speak for themselves.

Before

limit_builtin:
  Responses: Counter({200: 9999})
  Took:      49.72031831741333 seconds
limit_semaphore:
  Responses: Counter({200: 9999})
  Took:      22.222461223602295 seconds

After

limit_builtin:
  Responses: Counter({200: 9999})
  Took:      19.374465942382812 seconds
limit_semaphore:
  Responses: Counter({200: 9999})
  Took:      23.060685396194458 seconds

Let's keep pushing/discussing the requested changes so aiohttp can merge this improvement at some point. Also, as this is a critical part of the code, I would like to have a second approval from @asvetlov; mine will be there once the two pending issues that I've commented on are solved.

Member (@asvetlov) left a comment:

LGTM.

@pfreixes please merge when you are OK with all the changes.

if not waiters:
del self._waiters[key]
except ValueError: # fut may no longer be in list
pass
Contributor:

Be careful with that; in the case of a ValueError you will mask the original exception:

try:
    val = 1 / 0        # original exception (ZeroDivisionError)
except Exception:
    try:
        a = {}
        a['foo']       # raises a second exception (KeyError) in the handler
    except KeyError:   # if this doesn't match, the new exception masks the
        pass           # original one on the way out
    raise              # re-raises the original exception

Contributor:

I agree with @pfreixes , try to scope what you are catching as much as possible.

This is much better imo:

try:
    ...

    if not waiters:
        try:
            del self._waiters[key]
        except ValueError:
            pass

    ...

except:
    ...

Contributor Author (thehesiod):

I don't see what you're saying here: waiters.remove(fut) throws the ValueError, and del self._waiters[key] could throw a KeyError, not another ValueError. Not going to change this unless it's really needed.

Contributor:

Read my first comment: unless you make it explicit with raise ... from e, the second try/except masks the original exception in the ValueError case.
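
Concretely, the masking being discussed looks like this (a standalone illustration, not the connector code): if something raised inside the except block is not caught there, that new exception propagates and the original failure only survives as chained context.

waiters_by_key = {}   # pretend the key was already removed elsewhere

try:
    try:
        raise RuntimeError("original error while waiting")
    except BaseException:
        try:
            del waiters_by_key["host"]   # raises KeyError, not ValueError
        except ValueError:               # wrong type: the KeyError escapes
            pass
        raise                            # never reached
except Exception as exc:
    # the KeyError is what propagates; the RuntimeError survives only as
    # its __context__ ("During handling of the above exception, ...")
    print(type(exc).__name__, "| context:", type(exc.__context__).__name__)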

Contributor Author (thehesiod):

gotcha, thanks



@pfreixes (Contributor):

@thehesiod thanks for the hard work; let's try to close the open discussions.

I don't want to block the PR over just one source-code comment, but the masked exception is IMO something that needs to be addressed.

Contributor (@cecton) left a comment:

No method remove in defaultdict

# remove a waiter even if it was cancelled, normally it's
# removed when it's notified
try:
waiters.remove(fut)
Contributor (cecton):

I just checked and there is no remove method in defaultdict.


Contributor (@cecton) left a comment:

Ahhh, that waiters is a list. My bad, sorry again.


thehesiod commented Apr 24, 2018

enlightening PR for all of us it seems, mostly me I guess :) thanks guys

if not waiters:
return False

waiter = waiters.popleft()
Contributor:

Umm, you have to keep going until you reach a waiter that is not done, or you run out of waiters; see the CPython implementation [1].

[1] https://github.com/python/cpython/blob/master/Lib/asyncio/locks.py#L450

Contributor Author (thehesiod):

Are you saying the old implementation was wrong: https://github.com/aio-libs/aiohttp/pull/2937/files#diff-7f25afde79f309e2f8722c26cf1f10adL481 ? I don't believe that is the case. There are two ways a waiter can be removed:

  1. An exception happened while waiting (in the exception handler)
    1. a release was dispatched for said waiter (someone will see a release)
  2. through this method

What you describe would create a double release for 1.i. This is in fact the scenario you alluded to before.

Contributor (@pfreixes, Apr 26, 2018):

Yes, the issue was already there; indeed I can see the following issues with the code that we have in master:

  • The iteration until reaching a non-cancelled waiter has to be done through all of the items of a list; right now it is only done on the head of each list.
  • Each time we try to release a waiter we have to calculate whether the limit and the number of concurrent connections allow us to do it. This is done only when _release_waiter is called explicitly, but not when we had an exception trying to make the connection.
  • The limit per host, TBH, I would say is not well calculated.

So we have to fix them, but it's true that they were already there, and it would be nice to decouple the two things.

Contributor Author (thehesiod):

Ya, I have a feeling this is the tip of the iceberg :) I have a strong suspicion there's a leak in aiohttp, or in something aiohttp uses, as right now we're leaking ~40MB/week in prod.

@pfreixes (Contributor):

LGTM, I can't merge it because of the coverage issue. Can you help us @asvetlov?

In any case, let's hold off on a new release; I would like to work on the issues that I've commented on here [1]

[1] #2937 (comment)

@asvetlov (Member):

Sure, I can merge it, but what prevents us from adding new tests to ensure full coverage?
@pfreixes if you want to do it in a separate PR -- I'm fine with it.

@pfreixes (Contributor):

I will provide a new MR with the fixes and more coverage.

@asvetlov asvetlov merged commit 03d590e into aio-libs:master Apr 27, 2018
@asvetlov (Member):

Ok. Merged

@asvetlov (Member):

thanks to all

thehesiod added a commit to thehesiod-forks/aiohttp that referenced this pull request Apr 28, 2018
@thehesiod thehesiod deleted the thehesiod-connector-speed2 branch June 1, 2018 23:27

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs.
If you feel there are important points made in this discussion, please include those excerpts in the [new issue].
[new issue]: https://github.com/aio-libs/aiohttp/issues/new

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019
@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Oct 28, 2019