Custodian job errors/failures potentially because of exhaustion of ephemeral ports on the instance #9031

Closed
PratMis opened this issue Oct 4, 2023 · 0 comments · Fixed by #9107


PratMis commented Oct 4, 2023

Describe the bug

We started investigating a "could not connect to the metadata endpoint URL" error:

File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0xffffb1c41910>: Failed to establish a new connection: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2917, in _get_response
    response = self._session.send(request.prepare())
  File "/usr/local/lib/python3.9/dist-packages/botocore/httpsession.py", line 484, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://169.254.170.2/v2/credentials/50fc788d-d1fc-4f36-a9e8-fb32e81a8455"

With a lot of back and forth with AWS, this is what we found:
Confirmation that urllib3, which boto3 uses to establish connections, does not have idle timeouts [1]; the expectation is that a connection which will not be used further is closed explicitly. The client here [2] is not kept as a named object, so the close() calls [3] that would release connections once they are opened are never made. Without them, a connection is not closed until it has been idle for two hours [4]. The client call at line 712 of aws.py in Cloud Custodian must be invoked multiple times for the error to keep appearing in the logs after good calls that followed bad calls, because the credentials used are cached in the client for at least 10 minutes between refreshes [5].
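The close-call pattern described above can be sketched as follows. This is a minimal illustration, not Custodian's actual code: `make_client` and `upload` are hypothetical stand-ins for the session factory and the transfer call.

```python
import contextlib

def upload_with_closing_client(make_client, upload):
    """Create the client once as a named object, use it, and guarantee
    close() is called so its pooled connections are released immediately
    instead of idling until the TCP keepalive timeout."""
    client = make_client()            # e.g. session.client('s3') with boto3
    with contextlib.closing(client):  # calls client.close() on exit
        return upload(client)
```

`contextlib.closing` simply invokes `close()` in a `finally` block, so the connection is released even if the upload raises.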

It was found that many sockets were in the CLOSE-WAIT state.

As per the boto3 service team: clients are generally thread safe, as noted in the documentation. In my tests with Batch, creating the client inside each of 100 threads resulted in around 400 to 500 sockets being used, while creating the client outside the threads and sharing it across the same 100 threads resulted in only around 5 to 10 additional sockets. It also produced only one or two connections to the ECS agent address rather than hundreds of them. If a single client is created outside the threading, it only needs to make one call to the ECS agent and can manage its connection pool efficiently; generating a new client for every thread means connections are effectively being managed by the code instead.
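The difference the service team describes can be sketched with a stand-in client that counts how many "connections" are opened. `FakeClient` is hypothetical; a real boto3 client is thread safe and can be shared across threads the same way.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class FakeClient:
    """Stand-in for a boto3 client: each instance 'opens a connection'
    to the ECS agent, modeled here as a class-level counter."""
    connections = 0
    _lock = threading.Lock()

    def __init__(self):
        with FakeClient._lock:
            FakeClient.connections += 1

    def describe(self, i):
        return i  # placeholder for an API call

def client_per_thread(n_tasks):
    # anti-pattern from this issue: every task builds its own client,
    # so every task opens its own connection to the credentials endpoint
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(lambda i: FakeClient().describe(i), range(n_tasks)))

def shared_client(n_tasks):
    # recommended pattern: one client, created once, shared across threads
    client = FakeClient()
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(client.describe, range(n_tasks)))
```

With 100 tasks, `shared_client` opens one connection while `client_per_thread` opens 100, mirroring the socket counts reported above.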

I suspected the issue was in the S3Transfer library, but they confirmed that the S3Transfer package in boto3 [6] has not been updated in 9 months.

[1] urllib3/urllib3#2498
[2] https://github.com/cloud-custodian/cloud-custodian/blob/main/c7n/resources/aws.py#L712
[3] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/close.html
[4] https://man7.org/linux/man-pages/man7/tcp.7.html
[5] https://github.com/boto/botocore/blob/master/botocore/credentials.py#L61
[6] https://github.com/boto/boto3/blob/master/boto3/s3/transfer.py

As per the AWS support team:

Additionally, I did replicate the error by reducing net.ipv4.ip_local_port_range from the default “32768 61000” to "52768 61000", a reduction from around 29K ports to around 9K. The port range can be increased up to "1024 65535". In a run with 20 containers there were 1,492 connections from the process called “custodian”, and 781 of those were connections to 169.254.170.2. This is expected behavior if hundreds of clients are being created by hundreds of threads, since each one establishes a connection to the ECS agent in order to validate credentials. Moving the client creation out of the threads should reduce the number of sockets in use by close to half, if not more, as it allows the underlying HTTP client to manage the connection pool and reuse connections rather than each client generating its own connection to the service it is connecting to.
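The port-range arithmetic above can be checked with a short sketch; the ranges are the inclusive sysctl values quoted in this thread (`net.ipv4.ip_local_port_range`):

```python
def ephemeral_port_count(lo: int, hi: int) -> int:
    """Number of local ports available in an inclusive
    net.ipv4.ip_local_port_range setting of "lo hi"."""
    return hi - lo + 1

default_range = ephemeral_port_count(32768, 61000)  # 28233 ports (default)
reduced_range = ephemeral_port_count(52768, 61000)  # 8233 ports (repro setting)
widest_range  = ephemeral_port_count(1024, 65535)   # 64512 ports (suggested max)
```

With 1,492 concurrent connections per container and 20 containers on one instance, the reduced range of ~8K ports is exhausted well before the workload finishes, which matches the `Errno 99` failures seen in the logs.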

As part of investigating this, we had to switch from an AWS Batch managed ECS cluster to an unmanaged AWS Batch ECS cluster. I suspect that with multiple containers placed on the same instance and sockets not being released, we run into a scenario where there are not enough ephemeral ports left on the instance. I will admit we were running with the default port range configuration of “32768 61000”. We're planning on bumping the port range up to "1024 65535", as suggested, in our lower environment this week to see if that helps reduce the errors.

Extra information or context

2023-05-02 07:31:09,002: custodian.output.blob:DEBUG s3: uploading policy logs
2023-05-02 07:31:12,058: custodian.commands:ERROR Error while executing policy sqs-encrypt-at-rest, continuing
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
OSError: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/botocore/httpsession.py", line 455, in send
    urllib_response = conn.urlopen(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.9/dist-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.9/http/client.py", line 1257, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.9/dist-packages/botocore/awsrequest.py", line 94, in _send_request
    rval = super()._send_request(
  File "/usr/lib/python3.9/http/client.py", line 1303, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1252, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/dist-packages/botocore/awsrequest.py", line 123, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/dist-packages/botocore/awsrequest.py", line 218, in send
    return super().send(str)
  File "/usr/lib/python3.9/http/client.py", line 952, in send
    self.connect()
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f0028955640>: Failed to establish a new connection: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2917, in _get_response
    response = self._session.send(request.prepare())
  File "/usr/local/lib/python3.9/dist-packages/botocore/httpsession.py", line 484, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://169.254.170.2/v2/credentials/XXXXXXXX-X7f6-4215-b042-0a124d28e71b"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 1929, in fetch_creds
    response = self._fetcher.retrieve_full_uri(
  File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2862, in retrieve_full_uri
    return self._retrieve_credentials(full_url, headers)
  File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2898, in _retrieve_credentials
    return self._get_response(
  File "/usr/local/lib/python3.9/dist-packages/botocore/utils.py", line 2939, in _get_response
    raise MetadataRetrievalError(error_msg=error_msg)
botocore.exceptions.MetadataRetrievalError: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Could not connect to the endpoint URL: "http://169.254.170.2/v2/credentials/XXXXXXXX-X7f6-4215-b042-0a124d28e71b"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/c7n/commands.py", line 305, in run
    policy()
  File "/usr/local/lib/python3.9/dist-packages/c7n/policy.py", line 1334, in __call__
    resources = mode.run()
  File "/usr/local/lib/python3.9/dist-packages/c7n/policy.py", line 355, in run
    return []
  File "/usr/local/lib/python3.9/dist-packages/c7n/ctx.py", line 103, in __exit__
    self.output.__exit__(exc_type, exc_value, exc_traceback)
  File "/usr/local/lib/python3.9/dist-packages/c7n/output.py", line 524, in __exit__
    self.upload()
  File "/usr/local/lib/python3.9/dist-packages/c7n/output.py", line 534, in upload
    self.upload_file(os.path.join(root, f), key)
  File "/usr/local/lib/python3.9/dist-packages/c7n/resources/aws.py", line 702, in upload_file
    self.transfer.upload_file(
  File "/usr/local/lib/python3.9/dist-packages/c7n/resources/aws.py", line 698, in transfer
    self.ctx.session_factory(region=bucket_region, assume=False).client('s3'))
  File "/usr/local/lib/python3.9/dist-packages/boto3/session.py", line 299, in client
    return self._session.create_client(
  File "/usr/local/lib/python3.9/dist-packages/botocore/session.py", line 951, in create_client
    credentials = self.get_credentials()
  File "/usr/local/lib/python3.9/dist-packages/botocore/session.py", line 507, in get_credentials
    self._credentials = self._components.get_component(
  File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 2039, in load_credentials
    creds = provider.load()
  File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 1902, in load
    return self._retrieve_or_fail()
  File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 1911, in _retrieve_or_fail
    creds = fetcher()
  File "/usr/local/lib/python3.9/dist-packages/botocore/credentials.py", line 1936, in fetch_creds
    raise CredentialRetrievalError(
botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from container-role: Error retrieving metadata: Received error when attempting to retrieve ECS metadata: Could not connect to the endpoint URL: "http://169.254.170.2/v2/credentials/XXXXXXXX-X7f6-4215-b042-0a124d28e71b"
2023-05-02 07:31:12,063: custodian.output:DEBUG Storing output with <LogFile file:///tmp/tmp2c5ptdi_/custodian-run.log>