aiomcache concurrency issue #196
Comments
Summary: `aiocache` hits a `TimeoutError` and doesn't recover.
Hi @achedeuzot, I'll investigate this during the weekend. The small testcase would definitely help; if not, I'll also try to build something myself.

Mmmmm, I tried a really simple example to see if it was recovering, and it seems it does (I hacked the aiomcache code to add a sleep in the middle):

```python
import asyncio
from aiomcache import Client
from aiocache import MemcachedCache

client = MemcachedCache(pool_size=1)
timeout = 0

async def get(key):
    try:
        global timeout
        timeout += 1
        return await asyncio.wait_for(client.get(key), timeout=timeout)
    except asyncio.TimeoutError:
        print("Timeout Error")

loop = asyncio.get_event_loop()
print(loop.run_until_complete(client.set("Hello", "World")))
for x in range(5):
    print(loop.run_until_complete(get("Hello")))
```

Output:

```
True
Timeout Error
None
Timeout Error
None
World
World
World
```

I did this simple one to check whether aiomcache releases the pool connection correctly when there is a `TimeoutError`. The example is not accurate because in real code the cancellation event occurs while the connection is being used, whereas here it happens during the sleep. I'll try to adapt this to reproduce it.
I've done another test simulating a slow network with a 2s delay inside aiomcache. The updated example:

```python
import asyncio
from aiomcache import Client
from aiocache import MemcachedCache

MAX_TIMEOUT = 100
client = MemcachedCache(pool_size=1)

async def get(key, timeout):
    try:
        print("Calling client get with timeout {}, pool size {}".format(
            timeout, client.client._pool.size()))
        return await client.get(key, timeout=timeout)
    except asyncio.TimeoutError:
        print("Timeout Error")

loop = asyncio.get_event_loop()
for x in range(1, MAX_TIMEOUT, 2):
    print(loop.run_until_complete(get("Hello", x)))
```

and the output (I've added some print lines inside the aiomcache code to see where it was being cancelled; no sleeps in the middle this time):

```
Calling client get with timeout 1, pool size 0
Timeout Error
None
Calling client get with timeout 3, pool size 0
Timeout Error
None
Calling client get with timeout 5, pool size 0
aiomcache: received get b'Hello'
aiomcache: Wrote command
Timeout Error
None
Calling client get with timeout 7, pool size 0
aiomcache: received get b'Hello'
aiomcache: Wrote command
Timeout Error
None
Calling client get with timeout 9, pool size 0
aiomcache: received get b'Hello'
aiomcache: Wrote command
aiomcache: Read command: b'VALUE Hello 0 5 35\r\n'
aiomcache: retrieved get b'World'
World
Calling client get with timeout 11, pool size 1
aiomcache: received get b'Hello'
aiomcache: Wrote command
aiomcache: Read command: b'VALUE Hello 0 5 35\r\n'
aiomcache: retrieved get b'World'
World
```

Note that although the delay is only 2s, it takes much more than that to receive the value. This is because multiple loopback requests are made (i.e. acquire connection, write, read) and the 2s delay applies to each of them. Nevertheless, it ends up resolving correctly and recovering.
Hi Manuel, I've tried creating a simple test case as you did, but didn't manage to easily reproduce it. I do have more information about the circumstances where I see this behavior, though. Under load, it seems that the whole pool ends up broken. What I've currently done to 'fix' the issue is to catch the `TimeoutError`. Thank you very much for your help with this issue! :)
Do you have the script you used to generate the load, so I can reproduce the error from the example you've posted? I would like to play with it a bit and see it for myself.
Hi Manuel, I managed to reproduce it with the following script:

The pool size is small (2) and the timeout is low (0.1) so that the `TimeoutError` triggers quickly. The load is sent with an external load-testing tool. I created a repo with a Dockerfile + docker-compose to easily spawn the server here: https://github.com/achedeuzot/aiocache-test1. Thanks for your feedback if you manage to get the same results as I did.
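The original load script isn't included above, but the contention pattern it describes (a pool of 2 connections, a 0.1s timeout, many concurrent requests) can be sketched in a self-contained way. This is an illustrative simulation, not the reproduction script itself: a semaphore stands in for the connection pool and a random sleep stands in for the memcached round-trip.

```python
import asyncio
import random

TIMEOUT = 0.1  # the same low timeout as in the reproduction script
pool = None    # stands in for the pool_size=2 connection pool

async def cached_get(key):
    async with pool:  # wait for one of the two "connections"
        await asyncio.sleep(random.uniform(0.01, 0.05))  # fake round-trip
        return b"World"

async def request():
    try:
        return await asyncio.wait_for(cached_get("Hello"), TIMEOUT)
    except asyncio.TimeoutError:
        return None

async def main():
    global pool
    pool = asyncio.Semaphore(2)
    results = await asyncio.gather(*(request() for _ in range(100)))
    return results.count(None)

timeouts = asyncio.run(main())
print(timeouts > 0)  # True: under contention, most queued requests time out
```

With 100 concurrent requests and only 2 "connections", requests queue up behind the semaphore and the 0.1s deadline expires for most of them, reproducing the flood of `TimeoutError`s (here without any pool corruption, since the semaphore is released cleanly on cancellation).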
So, continuing the investigation, I can now catch the `TimeoutError`. I also noticed that the Memcached connection pool (from `aiomcache.MemcachePool`) doesn't refill on its own. If I increase the connection pool size, this error occurs a lot less, but it can still happen. I also noticed that, on average, a connection that was in use during a `TimeoutError` never gets released back to the pool. One final observation is that if I do only …
Huh, that's interesting, and nice progress. Being able to catch the exception makes a lot of sense; the situation you were describing before sounded really weird. Did you have to change anything regarding the catching of the exception? Regarding `release` not being called, I'll check this afternoon too. There must be some bug, because in the test script I posted in the previous comment the pool was recovering fine. PS: I don't know if it's the case, but this is starting to look like a bug in aiomcache rather than aiocache (although I will solve it anyway, ofc :P).
Yep, I had to await the return of the shielded future. I don't know if it's an `aiocache` or an `aiomcache` problem.
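The 'fix' being discussed (shield the inner call from the timeout's cancellation, then await it so it can still release its connection) can be sketched as follows. This is a hypothetical illustration against a fake cache, not aiocache's or aiomcache's actual code; `fake_get` and the `in_use` counter stand in for a real client call and its pool bookkeeping.

```python
import asyncio

in_use = 0  # counts "connections" currently checked out of the fake pool

async def fake_get(key):
    global in_use
    in_use += 1
    try:
        await asyncio.sleep(0.2)  # simulated slow memcached round-trip
        return b"World"
    finally:
        in_use -= 1  # the connection is released whenever the call finishes

async def get_with_timeout(key, timeout):
    inner = asyncio.ensure_future(fake_get(key))
    try:
        # shield() keeps wait_for's timeout from cancelling the inner call mid-I/O
        return await asyncio.wait_for(asyncio.shield(inner), timeout)
    except asyncio.TimeoutError:
        await inner  # let the shielded call finish so it releases its connection
        return None

async def main():
    result = await get_with_timeout("Hello", 0.05)
    return result, in_use

result, leaked = asyncio.run(main())
print(result, leaked)  # None 0 -- the call timed out, but no connection leaked
```

Without the `await inner` in the `except` branch, the shielded task would keep running unobserved; awaiting it is what guarantees the connection actually makes it back to the pool before the caller moves on.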
Yeah, but still, the user can use `asyncio.wait_for` directly, so this needs to work anyway. Here is my reproduction:

```python
import random
import string
import logging
import asyncio

import aiohttp
import aiocache
import aiomcache
from aiohttp import web

logger = logging.getLogger(__name__)


class CacheManager:
    def __init__(self):
        self.cache = aiomcache.Client("127.0.0.1", 11211, pool_size=2)

    async def get(self, key):
        return await asyncio.wait_for(self.cache.get(key), 0.1)

    async def set(self, key, value):
        return await asyncio.wait_for(self.cache.set(key, value), 0.1)


async def handler_post(req):
    try:
        data = await req.app['cache'].get(b'testkey')
        if data:
            return web.Response(text=data.decode())
    except asyncio.TimeoutError as e:
        logger.error("handler_post exception:")
        logger.error(e)

    data = ''.join(random.choices(string.ascii_uppercase + string.digits, k=1024))
    await req.app['cache'].set(b'testkey', data.encode())
    return web.Response(text=data)


if __name__ == '__main__':
    app = web.Application()
    app['cache'] = CacheManager()
    app.router.add_route('GET', '/', handler_post)
    web.run_app(app)
```

I've changed the query to a GET one to make the load testing easier. Debugging aiomcache, I see it's exiting in https://github.com/aio-libs/aiomcache/blob/master/aiomcache/pool.py#L55 when the exception is raised; still on it to see whether this is expected or not... @fafhrd91 @asvetlov any ideas?

TLDR: It seems aiomcache doesn't deal well with tasks being cancelled by a `TimeoutError` under high load.
So you managed to reproduce the same issue? That's good news! Now we need to find out why this happens and how to fix it :) Thank you so much for your cooperation on this!
Yup, sorry, my previous comment was wrong: I've managed to reproduce it both using aiocache and with just aiomcache (my example above uses just aiomcache). No problem, I'm checking it because aiocache really depends on aiomcache :P
Just for the record, it seems the same example with aioredis recovers fine.
check if aioredis uses streams api. |
@fafhrd91 It seems that's also the case; it uses the streams API too.
yup, it uses the streams API. Dunno if that's the main problem, but `deque` is thread-safe while `asyncio.Queue` is not.
@achedeuzot I'm closing this issue because there is nothing in `aiocache` itself to fix; the problem lives on the aiomcache side.
@argaen Great ! :D |
I'm having a strange issue with `wait_for(fut, timeout, *, loop=None)` + `aiocache` on memcache. We're storing values using `aiocache.MemcachedCache`, and most methods of `aiocache` are decorated with `@API.timeout`, which uses `await asyncio.wait_for(fn(self, *args, **kwargs), timeout)` (with a default timeout of 5 seconds).

When load testing our application, we see that with a big load the `asyncio` loop clogs up and some requests to memcache raise `asyncio.TimeoutError`, which is perfectly acceptable. The issue is that when we stop the load and allow the loop to catch up, if we make any new request, all the memcache connections fail with a `class 'concurrent.futures._base.TimeoutError'`. In other words, if we ever get a `TimeoutError`, the application cache is completely broken and the only way to repair the application is to kill and restart it, which is unacceptable. It seems as though the whole `aiocache` connection pool is closed, and I can't find where this happens or how to prevent it.

I've tried the following:

- wrapping `asyncio.wait_for()` in a `shield()` call so it won't cancel the associated `Task`: no difference
- catching `asyncio.CancelledError`, `TimeoutError`, `asyncio.futures.TimeoutError`, `asyncio.TimeoutError` or even a blanket `Exception`, with no success; it seems my catching of the error comes too late

The only thing that helps is increasing the connection pool size (2 by default, to 500 for example), but even with a big connection pool, if we ever hit a `TimeoutError`, we run into the same issue and the whole pool spins into everlasting errors.

And finally, if I remove the timeout by setting it to `0` or `None`, the library doesn't use `asyncio.wait_for()` but a simple `await fn()`, and even though we see some slowness under load, there is no `TimeoutError` and the application always works. But waiting too long for the cache is not a good idea, so I'd really like to use the timeout feature.

If anyone has any idea how to tackle this, I'd love to hear your input!

The versions involved:

I'm currently writing a small testcase to see if I can easily reproduce the issue. I'll post it here when it's done.
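The `@API.timeout` behavior described above (wrap the call in `asyncio.wait_for`, bypass it entirely when the timeout is `0` or `None`) can be sketched roughly like this. This is an illustrative reimplementation against a fake cache, not aiocache's actual decorator code:

```python
import asyncio
import functools

def timeout(default=5):
    # Sketch of a wait_for-based timeout decorator in the spirit of
    # aiocache's @API.timeout; names and details here are illustrative.
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(self, *args, **kwargs):
            t = kwargs.pop("timeout", default)
            if not t:  # 0 or None bypasses wait_for, as described above
                return await fn(self, *args, **kwargs)
            return await asyncio.wait_for(fn(self, *args, **kwargs), t)
        return wrapper
    return decorator

class FakeCache:
    @timeout(default=0.05)
    async def get(self, key):
        await asyncio.sleep(0.2)  # simulated slow backend
        return b"World"

async def main():
    try:
        await FakeCache().get("Hello")
        return "no timeout"
    except asyncio.TimeoutError:
        return "timed out"

outcome = asyncio.run(main())
print(outcome)  # timed out
```

This makes the reported behavior plausible: with a timeout set, `wait_for` cancels the in-flight cache call mid-I/O, whereas the `0`/`None` path (a plain `await fn()`) never cancels anything, which is why the application stays healthy without a timeout.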