
aiobotocore or aiohttp stuck after many requests #738

Open · 2 of 4 tasks
dimaqq opened this issue Oct 24, 2019 · 36 comments


dimaqq commented Oct 24, 2019

Describe the bug
It's a bug that happens rarely in production.
We've not been able to reproduce it in an isolated environment yet.

The code in question is a background worker, it pulls a job from SQS and runs it.
The job in question was to delete hundreds of objects, which means ~800 concurrent coroutines, each of which issues one to several concurrent delete_item calls.

gather(800 obj = gather( 1+ delete_item()))

In this case, it just got stuck on object 799.
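
For illustration, a minimal sketch of that fan-out shape (names like keys_for are hypothetical, not our actual code; assumes an aiobotocore DynamoDB client):

import asyncio

async def delete_object(client, table, obj):
    # One object -> one or more concurrent delete_item calls.
    # keys_for() is a hypothetical helper returning that object's keys.
    await asyncio.gather(*(
        client.delete_item(TableName=table, Key=key)
        for key in keys_for(obj)
    ))

async def delete_all(client, table, objects):
    # ~800 objects deleted concurrently; the hang was observed on the last one.
    await asyncio.gather(*(
        delete_object(client, table, obj) for obj in objects
    ))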

Checklist

  • I have reproduced this in an environment where pip check passes without errors
  • I have provided pip freeze results
  • I have provided sample code or detailed way to reproduce
  • I have tried the same code in botocore to ensure this is an aiobotocore-specific issue

pip freeze results

aiobotocore==0.10.3
aiodns==2.0.0
aiodynamo==19.9
aiohttp==3.6.2
boto3==1.9.189
botocore==1.12.189
requests==2.22.0
requests-toolbelt==0.9.1
urllib3==1.25.6

Environment:

  • Python Version: 3.7.5
  • OS name and version: 3.7.5-alpine10 (official Docker image + deps + our code)

Additional context
Thanks to https://github.com/dimaqq/awaitwhat I have a nice coroutine trace from when the process was stuck:
[screenshot: aiobotocore-stuck coroutine trace]


dimaqq commented Oct 24, 2019

One thing to note is that the code in question does essentially:

return await self._aio_session.request(...)

While the official aiohttp documentation appears to recommend:

async with session.request(...):
    return xxx

@thehesiod
Collaborator

the session is __aenter__'ed: https://github.com/aio-libs/aiobotocore/blob/master/aiobotocore/endpoint.py#L98

you need to make your code like the following so that __aexit__ is called:

response = await client.method(...)
async with response['Body'] as stream:
    pass

I've seen issues with newer aiohttp; could you try 3.2.2 with Python 3.6? Another long-standing issue is that if you wait long enough, botocore will eventually attempt to refresh your credentials, and that whole section hasn't been async-ified yet: #619

At our company we have code that periodically refreshes the credentials; I'll see if I can open-source it.
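
Conceptually, such a periodic refresh is just a background task along these lines (a rough sketch only, not the code mentioned above; refresh_credentials() is a placeholder for whatever refresh mechanism fits your deployment):

import asyncio

async def refresh_credentials():
    # Placeholder: re-read credentials (e.g. from instance metadata) and
    # update or recreate the client. Deployment-specific.
    ...

async def credential_refresher(interval: float = 30 * 60):
    # Refresh well before expiry so botocore's synchronous refresh path
    # (see #619) never has to kick in mid-request.
    while True:
        await refresh_credentials()
        await asyncio.sleep(interval)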

@thehesiod
Collaborator

On the other hand, this looks like aio-libs/aiohttp#4258.


dimaqq commented Oct 25, 2019

re: credential refresh
I'm not well-versed in the deployment... Do I understand it right that if we don't use AWS_PROFILE, then there's nothing to refresh? As far as I can tell we don't use that.


dimaqq commented Oct 25, 2019

I've poked the instance where the container is running:

  • security-credentials are updated every hour
  • they are valid for 8h30min
  • process got stuck 40 minutes after the latest update, that's 20 minutes before the next one 🤔
  • process had been running (mostly idle) for 20h45min before it got stuck while executing a job

My hunch is that this is unrelated to credential rollover.

It's something else, maybe the connection pool or DNS or whatnot...

@thehesiod
Collaborator

I worked hard with 3.2.2 to ensure connect/read timeouts worked correctly. DNS also has a timeout in aiohttp. Can you try the above suggestions regarding versions?

@thehesiod
Collaborator

I'm curious as well, as we want to move to the new aiohttp and Python releases, but cannot if socket timeouts are broken.


dimaqq commented Oct 25, 2019

Edit: 100 was a red herring.

Was: 99 done tasks + 1 stuck task == 100, which was suspicious.

@thehesiod
Collaborator

Sounds like an aiohttp issue; max_pool_connections defaults to 10: https://github.com/boto/botocore/blob/develop/botocore/endpoint.py#L35, which we obey: https://github.com/aio-libs/aiobotocore/blob/master/aiobotocore/endpoint.py#L176
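
If pool exhaustion turns out to be a factor, the limit can be raised explicitly; a minimal sketch assuming the era-appropriate API (AioConfig from aiobotocore.config, which passes botocore Config options such as max_pool_connections through):

import asyncio
import aiobotocore
from aiobotocore.config import AioConfig

async def main():
    session = aiobotocore.get_session()
    config = AioConfig(max_pool_connections=50)  # botocore default is 10
    async with session.create_client('dynamodb', config=config) as client:
        tables = await client.list_tables()
        print(tables['TableNames'])

asyncio.run(main())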

@thehesiod
Collaborator

This really sounds like aio-libs/aiohttp#4258, could you report it there? Note from the linked bug: "We ran aiohttp 3.6.2 before on Hass.io 190. After we had some issues with ingress, we detected that a request got lost, or more specifically was blocked and never got a response inside aiohttp core."

@thehesiod
Collaborator

Actually, first I would come up with a reproducible test case. You can use moto to mock the backend.
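
For example, a reproduction sketch against a local DynamoDB-compatible mock (moto in standalone server mode, or dynalite) assumed to be listening on http://localhost:5000; the table name, key schema, and fan-out count below are illustrative, and the table is assumed to already exist:

import asyncio
import aiobotocore

ENDPOINT = 'http://localhost:5000'  # local moto server / dynalite (assumption)
TABLE = 'test-table'                # assumed to exist in the mock

async def delete_one(client, i):
    await client.delete_item(TableName=TABLE, Key={'id': {'S': f'obj-{i}'}})

async def main():
    session = aiobotocore.get_session()
    async with session.create_client(
        'dynamodb',
        endpoint_url=ENDPOINT,
        region_name='us-east-1',
        aws_access_key_id='test',
        aws_secret_access_key='test',
    ) as client:
        # Fan out far beyond the default pool size of 10, mimicking the
        # production job, and check whether any call hangs.
        await asyncio.gather(*(delete_one(client, i) for i in range(800)))

asyncio.run(main())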


dimaqq commented Oct 25, 2019

100 was a red herring, sorry. It was batching in our code. Original comment edited.

@thehesiod
Collaborator

I've updated the issue template again to help narrow down future issues.


dimaqq commented Oct 29, 2019

I wonder if something like #724 might fix it...
I'll post an update when 0.10.4 is thoroughly tested.

@thehesiod
Collaborator

That was a no-op; it was returning a future/coroutine before that could be awaited.


illia-v commented Feb 3, 2020

I've faced the problem too while uploading 1000 objects to S3 with code similar to this:

import asyncio

import aiobotocore

session = aiobotocore.get_session(loop=event_loop)
async with session.create_client('s3') as s3_client:
    awaitables = []
    for instance in instances:
        awaitable = s3_client.put_object(
            Bucket=s3_bucket_name,
            Key=instance.key,
            Body=instance.body,
        )
        awaitables.append(awaitable)
    await asyncio.gather(*awaitables)

It happens in around two of five attempts: it uploads 999 files and then gets stuck.

Python 3.7 and aiobotocore 0.11.1


dimaqq commented Feb 4, 2020

Still getting stuck for me with the following versions:

  • Python 3.8.1
  • aiobotocore 0.11.1
  • aiodns 2.0.0
  • aiohttp 3.6.2
  • botocore 1.13.14
  • boto3 1.10.14

It's kinda reproducible (not every time) in prod and staging against real AWS DynamoDB.
However, it's not reproducible locally against dynalite.

I'm beginning to suspect it's somehow more complicated. For example, I'd guess that the DynamoDB endpoint resolves to multiple IP addresses; perhaps re-connection or retry and/or switching to another IP address is the key? 🤔

@thehesiod
Collaborator

Can you try aiohttp 3.3.2?


dimaqq commented Feb 5, 2020

Can you try aiohttp 3.3.2?

Did something significant change in subsequent versions, or what?

@thehesiod
Collaborator

Our company has been using 3.3.2 for some time and has never run into this issue, and I've seen aiohttp bugs about newer versions hanging, so I want to see whether this is an aiohttp regression.

@marcioemiranda

Hello, I am experiencing a similar issue.

My code is fetching messages from SQS and invoking a lambda function asynchronously. I can reproduce this with a sample of 200k messages in SQS.

A few requests to the lambda client will hang and be released only when I cancel the task (a log message with the stack trace can be seen below). The error rate is around 0.5%.

The code that hangs looks like this:

from asyncio import Queue, ensure_future, gather
from os import environ

from aiobotocore import get_session

session = get_session()
aio_lambda = session.create_client('lambda')
q = Queue()
producers = [ensure_future(produce(q, context)) for _ in range(num_producers)]
consumers = [ensure_future(consume(q, context)) for _ in range(num_consumers)]
await gather(*producers)
...

# this is called 1800 times in 3 minutes inside a consumer task. Around 10 calls will hang
response = await aio_lambda.invoke(
                FunctionName=environ['WORKER_FUNCTION_ARN'],
                InvocationType='Event',
                Payload=payload
                )

Python version: 3.6
All dependencies came from pip via aioboto3:
aiobotocore 0.12.0
aiohttp 3.6.2

Stack trace of error when cancelling the task

{
    "exception": "Traceback (most recent call last):\n  File \"/var/task/orchestrator/lambda_function.py\", line 242, in invoke_worker\n    Payload=payload\n  File \"/opt/python/aiobotocore/client.py\", line 77, in _make_api_call\n    operation_model, request_dict, request_context)\n  File \"/opt/python/aiobotocore/client.py\", line 96, in _make_request\n    return await self._endpoint.make_request(operation_model, request_dict)\n  File \"/opt/python/aiobotocore/endpoint.py\", line 74, in _send_request\n    exception):\n  File \"/opt/python/aiobotocore/endpoint.py\", line 176, in _needs_retry\n    caught_exception=caught_exception, request_dict=request_dict)\n  File \"/opt/python/botocore/hooks.py\", line 356, in emit\n    return self._emitter.emit(aliased_event_name, **kwargs)\n  File \"/opt/python/botocore/hooks.py\", line 228, in emit\n    return self._emit(event_name, kwargs)\n  File \"/opt/python/botocore/hooks.py\", line 211, in _emit\n    response = handler(**kwargs)\n  File \"/opt/python/botocore/retryhandler.py\", line 183, in __call__\n    if self._checker(attempts, response, caught_exception):\n  File \"/opt/python/botocore/retryhandler.py\", line 251, in __call__\n    caught_exception)\n  File \"/opt/python/botocore/retryhandler.py\", line 269, in _should_retry\n    return self._checker(attempt_number, response, caught_exception)\n  File \"/opt/python/botocore/retryhandler.py\", line 317, in __call__\n    caught_exception)\n  File \"/opt/python/botocore/retryhandler.py\", line 223, in __call__\n    attempt_number, caught_exception)\n  File \"/opt/python/botocore/retryhandler.py\", line 359, in _check_caught_exception\n    raise caught_exception\n  File \"/opt/python/aiobotocore/endpoint.py\", line 137, in _do_get_response\n    http_response = await self._send(request)\n  File \"/opt/python/aiobotocore/endpoint.py\", line 218, in _send\n    request.method, url=url, headers=headers_, data=data, proxy=proxy)\n  File \"/opt/python/aiohttp/client.py\", line 483, in _request\n    timeout=real_timeout\n  File \"/opt/python/aiohttp/connector.py\", line 499, in connect\n    raise e\n  File \"/opt/python/aiohttp/connector.py\", line 490, in connect\n    await fut\nconcurrent.futures._base.CancelledError"
}


dimaqq commented Mar 31, 2020

Re-formatted stack trace from @marcioemiranda

Traceback (most recent call last):
  File "/var/task/orchestrator/lambda_function.py", line 242, in invoke_worker
    Payload=payload
  File "/opt/python/aiobotocore/client.py", line 77, in _make_api_call
    operation_model, request_dict, request_context)
  File "/opt/python/aiobotocore/client.py", line 96, in _make_request
    return await self._endpoint.make_request(operation_model, request_dict)
  File "/opt/python/aiobotocore/endpoint.py", line 74, in _send_request
    exception):
  File "/opt/python/aiobotocore/endpoint.py", line 176, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/opt/python/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/opt/python/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/opt/python/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/opt/python/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/opt/python/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/opt/python/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/opt/python/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/opt/python/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/opt/python/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/opt/python/aiobotocore/endpoint.py", line 137, in _do_get_response
    http_response = await self._send(request)
  File "/opt/python/aiobotocore/endpoint.py", line 218, in _send
    request.method, url=url, headers=headers_, data=data, proxy=proxy)
  File "/opt/python/aiohttp/client.py", line 483, in _request
    timeout=real_timeout
  File "/opt/python/aiohttp/connector.py", line 499, in connect
    raise e
  File "/opt/python/aiohttp/connector.py", line 490, in connect
    await fut
concurrent.futures._base.CancelledError


timburke commented Jun 4, 2020

In case it helps anyone, we ran into this issue as well recently and were able to work around it with an asyncio Semaphore to ensure we never attempt to make an aiohttp call when there is not an available connection pool slot.

Our specific scenario was:

Uploading thousands of files to S3 from a realtime messaging system. We set max_pool_connections in AioConfig to an N between 100 and 500 and would then launch a coroutine for every incoming message we received, each of which would await s3_client.put_object(), using the max_pool_connections limit to put backpressure on the file upload process and rate-limit everything back to the message broker queue (RabbitMQ).

We found that after uploading ~10-30k files we would reliably find a small number that hung "forever" inside put_object() and, based on reading this issue and linked ones, likely ultimately inside aiohttp.

For reference, this was tested on:

Windows 10
Selector event loop
aiobotocore == 1.0.4
aiodns  == 2.0.0
aiohttp == 3.6.2
aioitertools == 0.7.0

Note: we also tried downgrading to aiohttp == 3.3.2 and the problem remained unchanged.

While we haven't yet found the underlying issue, we were able to work around it reliably by using an external semaphore, with its limit set to the same max_pool_connections we passed to aiobotocore, to block concurrent calls into put_object() when we know they would block on the connection pool.

For example:

max_connection_pool = 100

# BEFORE WORKAROUND - THIS WOULD HANG EVENTUALLY
# Note that this is simplified for clarity; we have a loop launching coroutines
# that each call s3_client.put_object() once.
await s3_client.put_object(Bucket=bucket, Key=path, Body=body_contents)


# WORKAROUND - THIS DOES NOT HANG
# shared state between all coroutines
upload_semaphore = asyncio.BoundedSemaphore(max_connection_pool)

# inside each coroutine
async with upload_semaphore:
    await s3_client.put_object(Bucket=bucket, Key=path, Body=body_contents)

This doesn't resolve the underlying problem, but in case someone else is blocked on this issue urgently and needs a robust workaround, we found the above to work very reliably. It hasn't been in production long, but we've gone from hitting this issue 100% of the time to 0% (so far), and it's an easy change.

Note that this also lets you put an overall timeout on s3_client.put_object() as a last resort if needed, since you know it will never block waiting for a connection pool slot, only on actually connecting / sending data.
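
Combining the two ideas, a sketch of the semaphore plus a last-resort timeout (the 120 s value and the helper name are illustrative):

import asyncio

max_connection_pool = 100  # same value passed to aiobotocore via max_pool_connections
upload_semaphore = asyncio.BoundedSemaphore(max_connection_pool)

async def put_with_timeout(s3_client, bucket, path, body_contents, timeout=120):
    # Only enter put_object() once a pool slot is known to be free, so the
    # timeout below covers connecting/sending only, not waiting for the pool.
    async with upload_semaphore:
        await asyncio.wait_for(
            s3_client.put_object(Bucket=bucket, Key=path, Body=body_contents),
            timeout=timeout,
        )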


AyWa commented Jun 9, 2020

In case it helps anyone, we ran into this issue as well recently and were able to work around it with an asyncio Semaphore to ensure we never attempt to make an aiohttp call when there is not an available connection pool slot

I am trying that; it has only been running for a few hours, and the workaround seems to prevent tasks from getting stuck. I will share more after a few days.


illia-v commented Jun 9, 2020

In case it helps anyone, we ran into this issue as well recently and were able to work around it with an asyncio Semaphore to ensure we never attempt to make an aiohttp call when there is not an available connection pool slot.

Yes, using a semaphore that way should fix the problem.
When I was investigating the bug, this fix in the aiohttp code worked for me: https://github.com/aio-libs/aiohttp/pull/4562/files#diff-7f25afde79f309e2f8722c26cf1f10ad.

Unfortunately, it seems that aiohttp is not actively maintained at the moment and there is a small chance that a new version will be released soon 🙁


thehesiod commented Jun 9, 2020

I've been wondering about trying https://www.python-httpx.org/ as a backend (ping me privately)

@pfreixes

Hi @thehesiod, it's sad to hear that aiohttp is not as maintained as it used to be, after all of the giant effort put in by the authors and the community.

The connector pool has received many changes to fix several issues; from what I can see, one of the last ones spotted was this one [1], but it seems it was fixed in version 3.6.0, so it should not be related to what you are presenting here.

It seems that @illia-v might have caught the bug, so it would be nice if someone from aiohttp could review the PR; if I have time I will do it.

In any case, something is not really clear to me. From what I can read in the code (correct me if I'm wrong), the connection is never released unless you use a streaming response; otherwise, the object returned to the user is a simple dictionary which does not provide access to the methods for closing or releasing the connection. Am I wrong?

[1] aio-libs/aiohttp#3640

@thehesiod
Collaborator

@pfreixes aiohttp since its inception has sort of been counter-culture. It refused to behave like all the other HTTP libraries; just getting them to support connection/read timeouts was a battle. There have been many questionable design decisions, IMO. Bugs like these should not be happening in a library at this level of maturity; it speaks to some fundamental issues. In terms of the last paragraph, are you saying it's never releasing the connection on the aiobotocore side? aiobotocore has code that detects whether the API has a streaming response or not. If it doesn't have a streaming response, it reads the body completely before returning the response dict to the client. This follows the botocore behavior, see:

response_dict['body'] = await http_response.read()
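
From the caller's side the distinction looks roughly like this (a sketch; client names, table/bucket names, and keys are illustrative):

async def show_body_handling(dynamodb_client, s3_client):
    # Non-streaming API: the body was already read by aiobotocore (per the
    # snippet above), so the connection is back in the pool by the time the
    # response dict reaches the caller.
    response = await dynamodb_client.get_item(
        TableName='example-table', Key={'id': {'S': 'x'}})
    item = response.get('Item')

    # Streaming API (e.g. S3 GetObject): the caller holds the connection until
    # the stream is drained/closed, hence the async-with shown earlier.
    response = await s3_client.get_object(Bucket='example-bucket', Key='example-key')
    async with response['Body'] as stream:
        data = await stream.read()

    return item, data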

@pfreixes

Hi @thehesiod, regarding when the connection is released and returned to the pool, TBH I'm a bit confused. Looking at this [1], it seems that the response.close() or response.release() methods are never called unless the object returned is a stream, in which case the user has the chance of calling them implicitly thanks to the async context.

What am I missing here?

[1] https://github.com/aio-libs/aiobotocore/blob/master/aiobotocore/endpoint.py#L235


thehesiod commented Jun 21, 2020

@pfreixes I'll look into that ASAP; I've been working on a lot of stuff at work. However, related to my httpx comment above, I got to a breaking point with aiohttp when downloading weather data from NASA, so I swapped one of our prod ingests to httpx and have nothing but praise for it. So I think swapping over may be viable to solve all these hangs. I don't think it'll be too hard. I'll have to investigate how to make an interface so people can swap between the two (no more).


bosr commented Jul 17, 2020

Hi @thehesiod, regarding your last message, does the update mean there is a clear path to a switchable backend? Thanks!

@thehesiod
Collaborator

I meant that for httpx in particular I found some critical issues, so that particular backend didn't seem worth the effort for our scenario, but in general I agree that having a switchable backend is a great idea.


thehesiod commented Aug 4, 2020

In case it helps anyone, we ran into this issue as well recently and were able to work around it with an asyncio Semaphore to ensure we never attempt to make an aiohttp call when there is not an available connection pool slot.

Ah, this then sounds like aio-libs/aiohttp#4562. Mind trying with that fix?


alvassin commented Feb 3, 2021

Upgrading to the latest aiohttp did not resolve the issue.
