Add the option to set socket.TCP_KEEPIDLE and socket.TCP_KEEPINTVL #2916
Comments
Thanks @mtoma for creating the feature request. Could you elaborate a bit more on why configuring Also linked in that post is another one on troubleshooting retry/timeout issues in Lambda that may also be relevant here: https://repost.aws/knowledge-center/lambda-function-retry-timeout-sdk. |
Sure, the current socket options only set the keepalive socket flag. The TCP probes are still sent using the OS default parameter /proc/sys/net/ipv4/tcp_keepalive_time as explained here: Also I did all my tests with the tcp_keepalive=True parameter set and validated that it still loses the connection with the Lambda backend after exactly 350 seconds. It should be feasible to set up a test environment for this issue using an EC2 instance in a VPC and let it call a Lambda that sleeps for 360 seconds. |
Thanks for following up. Regarding this:
Since this is a limitation on Fargate's side, it sounds like something that should be addressed on that end. I found an issue for this here: aws/containers-roadmap#460. It is marked as |
The problem is that this issue is not related to Fargate at all. Fargate is only an example of infrastructure where it happens and where it is impossible to fix using a system-wide parameter. Another, maybe much more realistic, use case is calling a Lambda from another Lambda when the caller Lambda is in a VPC. The real point here is that without this option tcp_keepalive=True seems entirely useless to me. From the parameter setting tcp_keepalive=True I'm expecting "don't let my TCP connection time out by sending TCP probes to keep it alive". This doesn't happen. If this isn't the purpose of the tcp_keepalive=True parameter then I don't really understand what it was designed for, as I don't see any use case where True/False would make any difference. Is there some use case I'm not aware of? Otherwise I don't understand why the user should be required to change a system-wide kernel parameter that could have a massive impact on the underlying system if it is simply possible to set it per socket connection with |
Just to pile on to this, though it seems like @mtoma has this covered: any infrastructure deployed in a private subnet that communicates with the internet via a NAT Gateway is impacted by this. The NAT Gateway will drop a keep-alive connection unexpectedly after 350 seconds, and an attempt by your application to re-use that connection will then fail. |
Thanks for following up here and sharing more info. I brought this feature request up for discussion with the team, and the consensus was there are some valid points made here that are worth further investigation. One of my colleagues also found this blog post on implementing long-running TCP Connections within VPC networking which provides some more context around the issue. https://aws.amazon.com/blogs/networking-and-content-delivery/implementing-long-running-tcp-connections-within-vpc-networking/. For now we can use this issue to track +1 (👍)s, use cases, and possible approaches to implementation. |
Much needed feature. Tweaking the OS-level configuration for Lambda code can be impactful. |
FWIW @Samuelstephenr this is what I'm doing to avoid patching the entire Lambda OS-level configuration.
|
Add the option to set socket in the value binding behaviour of the file in the section policy and proper binding status docs for contact and issue bending behaviour of the file and more prompt issue factors in it. |
Describe the feature
With the botocore.config.Config option tcp_keepalive=True the TCP socket is opened with the keep alive socket option (socket.SO_KEEPALIVE) but the interval used for TCP keepalive probes is taken from the system default values in the proc filesystem /proc/sys/net/ipv4/tcp_keepalive_time on Linux.
Changing this value requires root access, which is often not available, or the file is read-only, as in AWS Fargate containers.
The Linux default value for this parameter is 7200 seconds, which far exceeds the AWS VPC timeout of 350 seconds. As a result, a boto3 invoke() call that starts a Lambda function in synchronous mode (RequestResponse) from a VPC loses contact with the Lambda backend and times out without getting back any response if the Lambda execution time exceeds 350 seconds.
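The mismatch described above can be checked from Python without touching /proc. The following is a minimal sketch, assuming a Linux-like platform where the socket module exposes TCP_KEEPIDLE and where a fresh socket reports the system default idle time. The helper names and the VPC_IDLE_TIMEOUT constant are hypothetical; the 350 second figure is the one quoted in this issue.

```python
import socket

# Idle timeout quoted in this issue for VPC networking (seconds).
VPC_IDLE_TIMEOUT = 350


def default_keepalive_idle():
    """Return the idle time (seconds) before the first keepalive probe
    that a new TCP socket reports, or None if the platform's socket
    module does not expose TCP_KEEPIDLE."""
    opt = getattr(socket, "TCP_KEEPIDLE", None)
    if opt is None:
        return None
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Same flag that tcp_keepalive=True is described as setting.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        return s.getsockopt(socket.IPPROTO_TCP, opt)


def first_probe_is_too_late(idle_seconds, idle_timeout=VPC_IDLE_TIMEOUT):
    """True when the first probe would be sent only after the network
    path has already dropped the idle connection."""
    return idle_seconds >= idle_timeout


if __name__ == "__main__":
    idle = default_keepalive_idle()
    if idle is None:
        print("TCP_KEEPIDLE is not exposed on this platform")
    else:
        print(f"idle={idle}s, too late: {first_probe_is_too_late(idle)}")
```

With the Linux default of 7200 seconds, first_probe_is_too_late returns True, which matches the behaviour reported here: SO_KEEPALIVE is set, but no probe is sent before the connection is dropped.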
Use Case
It is currently impossible to get an answer from a Lambda function:
The invoke() call times out after read_timeout=XXX seconds because the TCP connection is lost.
I want this call to finish successfully by keeping the TCP connection alive with TCP probes sent at a user-defined interval.
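To illustrate what "probes sent at a user-defined interval" means at the socket level, here is a rough sketch that builds a list of (level, option, value) tuples and applies them to a plain socket. The function names keepalive_socket_options and apply_socket_options are hypothetical and are not part of botocore. The sketch uses socket.IPPROTO_TCP, which has the same value as socket.SOL_TCP on Linux, and it assumes that macOS exposes the idle option as TCP_KEEPALIVE rather than TCP_KEEPIDLE.

```python
import socket


def keepalive_socket_options(idle=60, interval=60):
    """Build socket option tuples that enable keepalive and, where the
    platform exposes them, set the idle time and probe interval."""
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]

    # Linux and recent Windows expose TCP_KEEPIDLE; macOS is assumed
    # to use TCP_KEEPALIVE for the same purpose.
    idle_opt = getattr(socket, "TCP_KEEPIDLE", None)
    if idle_opt is None:
        idle_opt = getattr(socket, "TCP_KEEPALIVE", None)
    if idle_opt is not None:
        options.append((socket.IPPROTO_TCP, idle_opt, idle))

    interval_opt = getattr(socket, "TCP_KEEPINTVL", None)
    if interval_opt is not None:
        options.append((socket.IPPROTO_TCP, interval_opt, interval))

    return options


def apply_socket_options(sock, options):
    """Apply each (level, option, value) tuple to an existing socket."""
    for level, name, value in options:
        sock.setsockopt(level, name, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        apply_socket_options(s, keepalive_socket_options(idle=60, interval=60))
        print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
```

Any idle value comfortably below 350 seconds should make the first probe arrive before the connection is dropped. How such a list reaches the sockets opened by botocore is not shown here, because that depends on botocore internals.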
Proposed Solution
The only workaround today is to override the _compute_socket_options private method of the botocore.args.ClientArgsCreator class.
The botocore.config.Config class should accept an additional parameter (for example tcp_keepalive_time) to allow the user to set the value for the (socket.SOL_TCP, socket.TCP_KEEPIDLE, 60), (socket.SOL_TCP, socket.TCP_KEEPINTVL, 60) parameters (on Linux; this is different on OSX and possibly Windows).
Other Information
The current workaround is quite ugly:
Acknowledgements
SDK version used
{'boto3_version': '1.26.7', 'botocore_version': '1.29.7'}
Environment details (OS name and version, etc.)
Amazon Linux