Describe the bug

We started investigating a "could not connect to the metadata endpoint URL" error:
File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 186, in _new_conn\
raise NewConnectionError(\
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0xffffb1c41910>: Failed to establish a new connection: [Errno 99] Cannot assign requested address\
During handling of the above exception, another exception occurred:\
Traceback (most recent call last):\
File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2917, in _get_response\
response = self._session.send(request.prepare())\
File "/usr/local/lib/python3.9/dist-packages/botocore/httpsession.py", line 484, in send\
raise EndpointConnectionError(endpoint_url=request.url, error=e)\
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "[http://169.254.170.2/v2/credentials/50fc788d-d1fc-4f36-a9e8-fb32e81a8455"\ ](http://169.254.170.2/v2/credentials/50fc788d-d1fc-4f36-a9e8-fb32e81a8455%22/)
After a lot of back and forth with AWS, this is what we found:
Confirmation that urllib3, which is what boto3 uses to establish connections, does not have idle timeouts [1]. The expectation is that a connection that will not be used again is closed by the caller. The client here [2], however, is created inline and never held as an object, so the close() calls [3] that would release connections once they are opened never happen. Without them, a connection is not closed until it has been idle for two hours, the kernel's TCP keepalive default [4]. The client call at line 712 of Cloud Custodian's aws.py must be invoked multiple times for the error to keep appearing in the logs after good calls that followed bad ones, because the credentials it uses are cached in the client for at least 10 minutes between refreshes [5].
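For illustration, a minimal sketch of the close() pattern being described (the S3 client and the API call are arbitrary stand-ins, not Custodian's actual code):

```python
import boto3

client = boto3.client("s3")
try:
    client.list_buckets()  # any API call; opens pooled connections
finally:
    # close() [3] shuts down the underlying urllib3 connection pool
    # instead of leaving sockets open until the ~2h keepalive expires.
    client.close()
```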
It was found that many sockets were in the CLOSE-WAIT state.
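As a sanity check, this is one way to count those (purely illustrative; `ss -t state close-wait | wc -l` gives the same answer). On Linux, state 08 in /proc/net/tcp is CLOSE_WAIT:

```python
def count_close_wait(path="/proc/net/tcp"):
    """Count sockets in CLOSE_WAIT (hex state 08 on Linux)."""
    with open(path) as f:
        next(f)  # skip the header row
        return sum(1 for line in f if line.split()[3] == "08")

print(count_close_wait())
```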
As per the boto3 service team: clients are generally thread safe, as noted here. In my tests with Batch, creating the client inside each of 100 threads resulted in around 400 to 500 sockets in use, while creating the client outside the threads and sharing it across the same 100 threads resulted in only around 5 to 10 additional sockets. It also produced only one or two connections to the ECS agent address rather than hundreds. If the client is created once, outside the threading, there is a single client that only needs to make one call to the ECS agent, and the underlying HTTP client can manage its connection pool efficiently; creating a new client per thread means connection management is effectively pushed onto the code spawning the clients. A hedged sketch of the two patterns is below.
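In this sketch the S3 client, bucket name, and thread count are illustrative, not Custodian's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

def per_thread_client(bucket):
    # Anti-pattern: every thread builds its own client, so every
    # thread fetches credentials from the ECS agent and opens its
    # own sockets.
    return boto3.client("s3").head_bucket(Bucket=bucket)

# Recommended pattern: one client, created outside the threads.
# boto3 clients are thread safe, so all workers share its urllib3
# connection pool and a single credential fetch.
shared = boto3.client("s3")

def shared_client(bucket):
    return shared.head_bucket(Bucket=bucket)

with ThreadPoolExecutor(max_workers=100) as pool:
    list(pool.map(shared_client, ["example-bucket"] * 100))
```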
I suspected the issue was in the S3Transfer library, but they confirmed that the S3Transfer package is vendored into boto3 [6] and has not been updated in 9 months.

[1] urllib3/urllib3#2498
[2] https://github.com/cloud-custodian/cloud-custodian/blob/main/c7n/resources/aws.py#L712
[3] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/close.html
[4] https://man7.org/linux/man-pages/man7/tcp.7.html
[5] https://github.com/boto/botocore/blob/master/botocore/credentials.py#L61
[6] https://github.com/boto/boto3/blob/master/boto3/s3/transfer.py
As per the AWS support team:

Additionally, I did replicate the error by reducing net.ipv4.ip_local_port_range from the default "32768 61000" to "52768 61000", cutting the ephemeral port range from roughly 28K ports to roughly 8K. The range can be widened up to "1024 65535". In a run with 20 containers there were 1,492 connections from processes named "custodian", and 781 of those were connections to 169.254.170.2. This is expected behavior when hundreds of clients are created by hundreds of threads, since each one establishes its own connection to the ECS agent to validate credentials. Moving the client creation out of the threads should cut the number of sockets in use by close to half, if not more, because it lets the underlying HTTP client manage the connection pool and reuse connections rather than each client opening its own connection to the service it is calling.
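To make the arithmetic concrete, an illustrative script that reads the ephemeral port range and counts sockets to the ECS agent (02AAFEA9 is 169.254.170.2 in the little-endian hex that /proc/net/tcp uses); widening the range would be `sysctl -w net.ipv4.ip_local_port_range="1024 65535"`:

```python
# Illustrative only: inspect the port budget and ECS agent sockets.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())
print(f"ephemeral ports available: {high - low + 1}")  # ~28233 by default

with open("/proc/net/tcp") as f:
    next(f)  # skip header row
    # rem_address column; 02AAFEA9 == 169.254.170.2 little-endian
    ecs = sum(1 for line in f if line.split()[2].startswith("02AAFEA9:"))
print(f"connections to 169.254.170.2: {ecs}")
```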
As part of investigating this we had to switch from an AWS Batch managed ECS cluster to an unmanaged one. I suspect that with multiple containers placed on the same instance and sockets not being reclaimed, we end up without enough ephemeral ports on the instance. I will admit that we were running with the default port range of "32768 61000". We're planning to bump the range to "1024 65535", as suggested, in our lower environment this week to see if that helps reduce the errors.
Extra information or context
2023-05-02 07:31:09,002: custodian.output.blob:DEBUG s3: uploading policy logs
2023-05-02 07:31:12,058: custodian.commands:ERROR Error while executing policy sqs-encrypt-at-rest, continuing
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
OSError: [Errno 99] Cannot assign requested address
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/botocore/httpsession.py", line 455, in send
urllib_response = conn.urlopen(
File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.9/dist-packages/urllib3/util/retry.py", line 525, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.9/dist-packages/urllib3/packages/six.py", line 770, in reraise
raise value
File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 244, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.9/http/client.py", line 1257, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.9/dist-packages/botocore/awsrequest.py", line 94, in _send_request
rval = super()._send_request(
File "/usr/lib/python3.9/http/client.py", line 1303, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.9/http/client.py", line 1252, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.9/dist-packages/botocore/awsrequest.py", line 123, in _send_output
self.send(msg)
File "/usr/local/lib/python3.9/dist-packages/botocore/awsrequest.py", line 218, in send
return super().send(str)
File "/usr/lib/python3.9/http/client.py", line 952, in send
self.connect()
File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f0028955640>: Failed to establish a new connection: [Errno 99] Cannot assign requested address
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2917, in _get_response
response = self._session.send(request.prepare())
File "/usr/local/lib/python3.9/dist-packages/botocore/httpsession.py", line 484, in send
raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://169.254.170.2/v2/credentials/XXXXXXXX-X7f6-4215-b042-0a124d28e71b"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 1929, in fetch_creds
response = self._fetcher.retrieve_full_uri(
File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2862, in retrieve_full_uri
return self._retrieve_credentials(full_url, headers)
File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2898, in _retrieve_credentials
return self._get_response(
File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2939, in _get_response
raise MetadataRetrievalError(error_msg=error_msg)
botocore.exceptions.MetadataRetrievalError: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Could not connect to the endpoint URL: "http://169.254.170.2/v2/credentials/XXXXXXXX-X7f6-4215-b042-0a124d28e71b"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/c7n/commands.py", line 305, in run
policy()
File "/usr/local/lib/python3.9/dist-packages/c7n/policy.py", line 1334, in __call__
resources = mode.run()
File "/usr/local/lib/python3.9/dist-packages/c7n/policy.py", line 355, in run
return []
File "/usr/local/lib/python3.9/dist-packages/c7n/ctx.py", line 103, in __exit__
self.output.__exit__(exc_type, exc_value, exc_traceback)
File "/usr/local/lib/python3.9/dist-packages/c7n/output.py", line 524, in __exit__
self.upload()
File "/usr/local/lib/python3.9/dist-packages/c7n/output.py", line 534, in upload
self.upload_file(os.path.join(root, f), key)
File "/usr/local/lib/python3.9/dist-packages/c7n/resources/aws.py", line 702, in upload_file
self.transfer.upload_file(
File "/usr/local/lib/python3.9/dist-packages/c7n/resources/aws.py", line 698, in transfer
self.ctx.session_factory(region=bucket_region, assume=False).client('s3'))
File "/usr/local/lib/python3.9/dist-packages/boto3/session.py", line 299, in client
return self._session.create_client(
File "/usr/local/lib/python3.9/dist-packages/botocore/session.py", line 951, in create_client
credentials = self.get_credentials()
File "/usr/local/lib/python3.9/dist-packages/botocore/session.py", line 507, in get_credentials
self._credentials = self._components.get_component(
File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 2039, in load_credentials
creds = provider.load()
File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 1902, in load
return self._retrieve_or_fail()
File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 1911, in _retrieve_or_fail
creds = fetcher()
File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 1936, in fetch_creds
raise CredentialRetrievalError(
botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from container-role: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Could not connect to the endpoint URL: "http://169.254.170.2/v2/credentials/XXXXXXXX-X7f6-4215-b042-0a124d28e71b"
2023-05-02 07:31:12,063: custodian.output:DEBUG Storing output with <LogFile file:///tmp/tmp2c5ptdi_/custodian-run.log>