aiobotocore or aiohttp stuck after many requests #738
One thing to note is that the code in question essentially does:

```python
return await self._aio_session.request(...)
```

while the official aiohttp pattern is:

```python
async with session.request(...) as resp:
    return xxx
```

The session is `__aenter__`'ed here: https://github.com/aio-libs/aiobotocore/blob/master/aiobotocore/endpoint.py#L98. You need to write your code like the following so that `__aexit__` gets called:

```python
response = await client.method(...)
async with response['Body'] as stream:
    pass
```

I've seen issues with newer aiohttp; could you try with 3.2.2 and Python 3.6? Another long-standing issue is that if you wait long enough, botocore will eventually attempt to refresh your credentials, and that whole section hasn't been async'ified yet: #619. At our company we have code that periodically refreshes the credentials; I'll see if I can open-source it.
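To make that pattern concrete, here is a minimal, self-contained sketch (the bucket and key names are hypothetical, and `asyncio.run` assumes Python 3.7+). Entering the body as an async context manager guarantees the underlying connection is released back to the pool:

```python
import asyncio
import aiobotocore

async def main():
    session = aiobotocore.get_session()
    async with session.create_client('s3') as client:
        response = await client.get_object(Bucket='my-bucket', Key='my-key')
        # __aexit__ on the body releases the aiohttp connection.
        async with response['Body'] as stream:
            data = await stream.read()
        print(len(data))

asyncio.run(main())
```

---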
On the other hand, this looks like aio-libs/aiohttp#4258.

---
Re: credential refresh

---
I've poked the instance where the container is running:

My hunch is that this is unrelated to credential rollover. It's something else, maybe the connection pool or DNS or whatnot...

---
I worked hard with 3.2.2 to ensure connect/read timeouts worked correctly, and DNS lookups also have a timeout in aiohttp. Can you try the above suggestions re: versions?

---
I'm curious as well, as we want to move to the new aiohttp and Python releases, but we cannot if socket timeouts are broken (a sketch of setting explicit timeouts follows).
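As an aside, socket timeouts can be pinned explicitly from the aiobotocore side. A minimal sketch, assuming `AioConfig` forwards botocore's standard `connect_timeout`/`read_timeout` options; the timeout values are illustrative:

```python
import asyncio
import aiobotocore
from aiobotocore.config import AioConfig

async def main():
    # Explicit connect/read timeouts so a wedged socket eventually raises
    # instead of hanging forever (values are illustrative).
    config = AioConfig(connect_timeout=15, read_timeout=60)
    session = aiobotocore.get_session()
    async with session.create_client('s3', config=config) as s3:
        await s3.list_buckets()

asyncio.run(main())
```

---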
This really sounds like aio-libs/aiohttp#4258; could you report there? Note from the linked bug:

> actually first I would come up with a reproducible test case. You can use moto for mocking the backend.

A sketch of such a test case follows. (Edit: this comment previously read "sounds like an aiohttp issue".)
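One possible shape for that reproduction, as a sketch only: this assumes a moto version that ships `ThreadedMotoServer` (moto's standalone server mode) and uses the aiobotocore 0.x-style API; the port, bucket name, and request count are illustrative.

```python
import asyncio
import aiobotocore
from moto.server import ThreadedMotoServer

async def repro():
    # Stand up a local S3-compatible backend so no real AWS calls are made.
    server = ThreadedMotoServer(port=5000)
    server.start()
    try:
        session = aiobotocore.get_session()
        async with session.create_client(
                's3', region_name='us-east-1',
                endpoint_url='http://127.0.0.1:5000',
                aws_access_key_id='test',
                aws_secret_access_key='test') as s3:
            await s3.create_bucket(Bucket='repro-bucket')
            # Fire many concurrent requests to try to trigger the hang.
            await asyncio.gather(*(
                s3.put_object(Bucket='repro-bucket', Key=f'k{i}', Body=b'x')
                for i in range(1000)))
    finally:
        server.stop()

asyncio.run(repro())
```

---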
|
I've updated the issue template again for future issues, to help narrow down the problem.

---
I wonder if something like #724 might fix it...

---
That was a no-op; it was already returning a future/coroutine before that, which could be awaited.

---
I've faced the problem too while uploading 1000 objects to S3 with code similar to this one:

```python
import asyncio
import aiobotocore

async def upload_all(instances, s3_bucket_name, event_loop):
    session = aiobotocore.get_session(loop=event_loop)
    async with session.create_client('s3') as s3_client:
        awaitables = []
        for instance in instances:
            awaitable = s3_client.put_object(
                Bucket=s3_bucket_name,
                Key=instance.key,
                Body=instance.body,
            )
            awaitables.append(awaitable)
        await asyncio.gather(*awaitables)
```

It happens in about two out of five tries: it uploads 999 files and then gets stuck. Python 3.7 and aiobotocore 0.11.1.

---
Still getting stuck for me with the following versions:

It's kinda reproducible (not every time) in prod and staging against the real AWS DynamoDB. I'm beginning to suspect it's somehow more complicated; for example, I'd guess that the DynamoDB endpoint resolves to multiple IP addresses, so perhaps re-connection, retry, and/or switching to another IP address is the key? 🤔

---
Can you try aiohttp 3.3.2?

---
Has something significant changed in subsequent versions, or what?

---
Our company has been using 3.3.2 for some time and has never run into this issue, and I've seen aiohttp bug reports about newer versions hanging, so I want to see if this is an aiohttp regression.

---
Hello, I am experiencing a similar issue. My code fetches messages from SQS and invokes a Lambda function asynchronously. I can reproduce this with a sample of 200k messages in SQS. A few requests to the Lambda client will hang and are released only when I cancel the task (a log message with the stack trace can be seen below). The error rate is around 0.5%. The code that hangs looks like this:

Python version: 3.6

Stack trace of the error when cancelling the task:

---
Re-formatted stack trace from @marcioemiranda:

```
Traceback (most recent call last):
  File "/var/task/orchestrator/lambda_function.py", line 242, in invoke_worker
    Payload=payload
  File "/opt/python/aiobotocore/client.py", line 77, in _make_api_call
    operation_model, request_dict, request_context)
  File "/opt/python/aiobotocore/client.py", line 96, in _make_request
    return await self._endpoint.make_request(operation_model, request_dict)
  File "/opt/python/aiobotocore/endpoint.py", line 74, in _send_request
    exception):
  File "/opt/python/aiobotocore/endpoint.py", line 176, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/opt/python/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/opt/python/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/opt/python/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/opt/python/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/opt/python/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/opt/python/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/opt/python/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/opt/python/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/opt/python/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/opt/python/aiobotocore/endpoint.py", line 137, in _do_get_response
    http_response = await self._send(request)
  File "/opt/python/aiobotocore/endpoint.py", line 218, in _send
    request.method, url=url, headers=headers_, data=data, proxy=proxy)
  File "/opt/python/aiohttp/client.py", line 483, in _request
    timeout=real_timeout
  File "/opt/python/aiohttp/connector.py", line 499, in connect
    raise e
  File "/opt/python/aiohttp/connector.py", line 490, in connect
    await fut
concurrent.futures._base.CancelledError
```

---
In case it helps anyone: we ran into this issue as well recently and were able to work around it with an asyncio `BoundedSemaphore`. Our specific scenario was:

We found that after uploading ~10-30k files we would reliably find a small number that hung "forever" inside of `put_object()`. For reference, this was tested on:

Note: we also tried downgrading. While we didn't yet find the underlying issue, we were able to work around it reliably by using an external semaphore with a limit set to the same value as the connection pool size. For example:

```python
max_connection_pool = 100

# BEFORE WORKAROUND - THIS WOULD HANG EVENTUALLY
# Note that this is simplified for clarity; we have a loop launching
# coroutines that each call s3_client.put_object() once.
await s3_client.put_object(Bucket=bucket, Key=path, Body=body_contents)

# WORKAROUND - THIS DOES NOT HANG
# shared state between all coroutines
upload_semaphore = asyncio.BoundedSemaphore(max_connection_pool)

# inside each coroutine
async with upload_semaphore:
    await s3_client.put_object(Bucket=bucket, Key=path, Body=body_contents)
```

This doesn't resolve the underlying problem, but in case someone else is blocked on this issue urgently and needs a robust workaround, we found the above to work very reliably. It hasn't been in production long, but we've gone from hitting this issue 100% of the time to 0% (so far), and it's an easy change. Note that this also lets you put an overall timeout on the whole call (see the sketch below).
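Building on that last point, one way to add the overall timeout is to wrap the semaphore-guarded call in `asyncio.wait_for`. This is only a sketch; the helper name and the 120-second value are made up here:

```python
import asyncio

async def bounded_put_object(s3_client, semaphore, bucket, path, body,
                             timeout=120.0):
    # The semaphore caps in-flight requests at the pool size, and wait_for
    # turns a hung request into a TimeoutError instead of an eternal hang.
    async with semaphore:
        return await asyncio.wait_for(
            s3_client.put_object(Bucket=bucket, Key=path, Body=body),
            timeout=timeout,
        )
```

---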
I am trying that; it has been running for a few hours now and the workaround seems to prevent tasks from getting stuck. I will share more after a few days.

---
Yes, using a semaphore that way should fix the problem. Unfortunately, it seems that aiohttp is not actively maintained at the moment, and there is only a small chance that a new version will be released soon 🙁

---
Hi @thehesiod, it's sad to hear that aiohttp is not as maintained as it used to be, after all the giant effort put in by the authors and the community. The connector pool has received many changes fixing several issues; from what I can see, one of the last ones spotted was this one [1], but it seems it was fixed in version 3.6.0, so it should not be related to what you are presenting here. It seems that @illia-v might have caught the bug, so it would be nice if someone from aiohttp could review the PR; if I have time I will do it.

In any case, something that is not really clear to me, from what I can read in the code (correct me if I'm wrong): the connection is never released unless you use a streaming response; otherwise, the object returned to the user is a simple dictionary which does not provide access to the methods for closing or releasing the connection. Am I wrong?

---
@pfreixes aiohttp since its inception has sort of been counter-culture: it refused to behave like all the other HTTP libraries, and just getting it to support connection/read timeouts was a battle. There have been many questionable design decisions, IMO. Bugs like these should not be happening in a library at this level of maturity; it speaks to some fundamental issues.

In terms of the last paragraph: are you saying it's never releasing the connection on the aiobotocore side? aiobotocore has code that detects whether the API has a streaming response or not. If it doesn't have a streaming response, it reads the body completely before returning the response dict to the client. This follows the botocore behavior; see aiobotocore/endpoint.py line 57 at commit ec0b676.

---
Hi @thehesiod, regarding when the connection is released and returned back to the pool, TBH I'm a bit confused: looking at this [1], it seems that the [...]. What am I missing here?

[1] https://github.com/aio-libs/aiobotocore/blob/master/aiobotocore/endpoint.py#L235

---
Hi @thehesiod, regarding your last message: did the update mean that there is a clear path to having a switchable backend? Thanks!

---
I meant that for httpx in particular I found some critical issues, so that one backend didn't seem worth the effort for our scenario; but in general I agree that having a switchable backend is a great idea.

---
Ah, this then sounds like aio-libs/aiohttp#4562. Mind trying with that fix?

---
Upgrading to the latest aiohttp did not resolve the issue.

---
**Describe the bug**

It's a bug that happens rarely in production. We've not been able to reproduce it in an isolated environment yet.

The code in question is a background worker: it pulls a job from SQS and runs it. The job in question was to delete hundreds of objects, which means ~800 concurrent coroutines, each of which issues one to several concurrent `delete_item` calls; roughly `gather(800 x obj)`, where each `obj = gather(1+ x delete_item())` (see the sketch below).

In this case, it just got stuck on object 799.
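As an illustration only (the table name and object structure are hypothetical, not from the actual worker), the fan-out shape described above looks roughly like this:

```python
import asyncio
import aiobotocore

async def delete_object(dynamo, obj):
    # Each object maps to one or more DynamoDB items, deleted concurrently.
    await asyncio.gather(*(
        dynamo.delete_item(TableName='objects', Key=key)
        for key in obj['item_keys']
    ))

async def run_job(objects):
    session = aiobotocore.get_session()
    async with session.create_client('dynamodb') as dynamo:
        # ~800 objects in flight at once: gather(800 x gather(1+ delete_item)).
        await asyncio.gather(*(delete_object(dynamo, o) for o in objects))
```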
**Checklist**

- [x] `pip check` passes without errors
- [x] `pip freeze` results

**pip freeze results**

**Environment:**

- Python: 3.7.5
- Docker image: 3.7.5-alpine10 (official Docker image + deps + our code)

**Additional context**

Thanks to https://github.com/dimaqq/awaitwhat I have a nice coroutine trace from when the process was stuck: