[aws-appmesh-envoy] Too many open files error in version v1.26.4.0+ in EKS Fargate #489

Closed
axot opened this issue Feb 22, 2024 · 11 comments

@axot

axot commented Feb 22, 2024

Summary
A customer has reported that aws-appmesh-envoy 1.25.4 works properly,
but upgrading to v1.27.2.0-prod results in a "Too many open files" error.

The issue has been confirmed in v1.27.2.0-prod and v1.27.3.0-prod.

[2024-02-20 05:50:14.275][70][critical][assert] [source/common/network/socket_interface_impl.cc:72] assert failure: SOCKET_VALID(result.return_value_). Details: socket(2) failed, got error: Too many open files

[2024-02-20 05:50:14.275][70][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers)

[2024-02-20 05:50:14.275][70][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: 79aa964fd5e123a215cb0a0d2d986db0953a62d0/1.27.2-appmesh.0/Modified/RELEASE/BoringSSL

Steps to Reproduce
Perform a load test on v1.27.2.0 or v1.27.3.0

Are you currently working around this issue?
N/A

Additional context
N/A

Attachments

@axot axot added the Bug Something isn't working label Feb 22, 2024
@BennettJames

Hey Zheng,

Thanks for the report. Can you confirm the environment details? Is this EKS or ECS, Fargate or EC2?

@axot
Author

axot commented Feb 27, 2024

Hi @BennettJames, the customer has confirmed that the issue was reproduced in EKS 1.27/1.28 with a Fargate profile.

@axot axot changed the title aws-appmesh-envoy v1.27.2.0-prod Too many open files aws-appmesh-envoy Too many open files Feb 29, 2024
@karanvasnani

Our local testing suggests that this is a regression introduced in our v1.26.4.0 image version, and we are currently narrowing down the root cause. Once we have identified it, we will provide an update on the next steps.

@karanvasnani

Just to provide an update, we've found that this only impacts the EKS + Fargate use case and is not a problem on the EC2 platform.

Lately, there have been many similar reports from Envoy users on k8s. For example, a change in the containerd runtime (containerd/containerd#8924) set the NOFILE soft limit to 1024 instead of the previous value of infinity, and this is expected to affect Envoy because it doesn't set any explicit limit itself. There is also an open thread on the Envoy GitHub about raising the soft limit: envoyproxy/envoy#31502.
EKS made a corresponding change in their AMI to lower the soft limit from infinity to 1024 as well (awslabs/amazon-eks-ami#1535), but this was later rolled back. However, I'm not yet sure why this would only impact certain later versions of Envoy and not the older ones.

Regarding the AppMesh image, it consists of an agent (see here) which forks Envoy as a child process and monitors its health. While we see that the agent process has its soft limit equal to the hard limit, the child isn't inheriting those limits and has a lower soft limit, the default of 1024. In my investigation, I'm exploring whether this has to do with the Linux capabilities not being preserved on the Envoy process when it is forked.
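
For anyone who wants to reproduce the parent/child mismatch described above outside the agent, here is a minimal Go sketch. It is not taken from the agent's source: the helper names `nofileLimits` and `childSoftLimit` are made up for illustration, and it assumes Linux with a POSIX `sh` on PATH. The child's value depends on the Go toolchain used to build it and on the container runtime's limits, so no particular output is implied.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"syscall"
)

// nofileLimits returns the soft and hard RLIMIT_NOFILE values of this process.
func nofileLimits() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

// childSoftLimit asks a freshly spawned shell for its own soft limit.
// It assumes a POSIX "sh" is on PATH; the result may be "unlimited".
func childSoftLimit() (string, error) {
	out, err := exec.Command("sh", "-c", "ulimit -n").Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	soft, hard, err := nofileLimits()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("parent: soft=%d hard=%d\n", soft, hard)

	child, err := childSoftLimit()
	if err != nil {
		fmt.Println("could not query child:", err)
		return
	}
	fmt.Println("child soft limit:", child)
}
```

If the parent reports a soft limit equal to the hard limit while the child reports a lower value, that matches the behavior observed with the agent and Envoy.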

@karanvasnani karanvasnani changed the title aws-appmesh-envoy Too many open files [aws-appmesh-envoy] Too many open files error in version v1.26.4.0+ in EKS Fargate Mar 8, 2024
@karanvasnani karanvasnani changed the title [aws-appmesh-envoy] Too many open files error in version v1.26.4.0+ in EKS Fargate [aws-appmesh-envoy] Too many open files error in version v1.26.4.0+ in EKS Fargate Mar 8, 2024
@axot
Author

axot commented Mar 11, 2024

I've checked the syscalls made by the agent, as shown below.

Before execve is executed, setrlimit is invoked with the value stored in rlimit.init, which in this case was 1024.
Therefore, if we apply setrlimit before invoking ForkExec, the rlimit settings will be passed on to the forked process.

[pid   120] dup3(7, 0, 0)               = 0
[pid   120] dup3(9, 1, 0)               = 1
[pid   120] dup3(10, 2, 0)              = 2
[pid   120] setrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=65535}) = 0
[pid   120] execve("/usr/bin/envoy", ["/usr/bin/envoy", "--version"], 0xc00033b688 /* 69 vars */ <unfinished ...>
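
The idea of applying setrlimit before the child is started can be sketched as follows. This is only an illustration under assumptions, not the agent's actual code or the eventual fix: `raiseNofileSoftLimit` is a hypothetical helper, the child is a plain shell rather than Envoy, and it targets Linux. It relies on my understanding that, in Go versions that restore the original NOFILE limit on exec, an explicit `syscall.Setrlimit` call for `RLIMIT_NOFILE` tells the runtime to stop doing that, so the child keeps the value set here.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// raiseNofileSoftLimit sets the RLIMIT_NOFILE soft limit to the hard limit
// and returns the limits that are in effect afterwards.
func raiseNofileSoftLimit() (syscall.Rlimit, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return lim, err
	}
	lim.Cur = lim.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return lim, err
	}
	// Read the value back rather than trusting what was requested.
	err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim)
	return lim, err
}

func main() {
	lim, err := raiseNofileSoftLimit()
	if err != nil {
		fmt.Println("could not raise soft limit:", err)
		return
	}
	fmt.Printf("parent after setrlimit: soft=%d hard=%d\n", lim.Cur, lim.Max)

	// Hypothetical child: a shell that prints its own soft limit.
	cmd := exec.Command("sh", "-c", "ulimit -n")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("child failed:", err)
	}
}
```

Running the same program under strace should show whether the setrlimit call immediately before execve still carries the lower value.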

@axot
Author

axot commented Mar 22, 2024

The "restore original NOFILE rlimit in child process" commit was introduced in Go 1.21:
https://cs.opensource.google/go/go/+/f5eef58e4381259cbd84b3f2074c79607fb5c821

go versions
v1.25.4.0

$ /usr/local/go/bin/go version /usr/bin/agent 
/usr/bin/agent: go1.20.3

v1.27.3.0

$ /usr/local/go/bin/go version /usr/bin/agent
/usr/bin/agent: go1.22.0

@karanvasnani

@axot thanks for following up on this; that change would explain the behavior we're observing. We were still on go1.20.6 in Envoy image v1.26.4.0, so I'll need to test that again to confirm whether it's impacted. One thing that I don't understand yet is why this only impacts EKS Fargate setups and not ECS Fargate, given that the same soft limits are imposed there as well.

@axot
Author

axot commented Mar 22, 2024

Hi @karanvasnani, thanks for confirming. I have checked the commit and found that it has been backported to go1.20.4 (https://go-review.googlesource.com/c/go/+/478659), so this also affects v1.26.4.0. I haven't had the opportunity to test ECS Fargate yet, but I plan to check in the upcoming week.

@axot
Author

axot commented Mar 22, 2024

Hi, I checked ECS Fargate, and it seems the soft limit is 65535, not 1024.

$ ulimit -n
65535

$ cat /proc/275/limits | grep 'Max open files'
Max open files            65535                65535                files

$ uname -a
Linux ip-10-0-12-168.ap-northeast-1.compute.internal 5.10.210-201.852.amzn2.x86_64 #1 SMP Tue Feb 27 17:09:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1337         1  0.0  0.2 1250240 23568 ?       Ssl  16:14   0:00 /usr/bin/agent
root        12  0.0  0.1 1846536 14876 ?       Ssl  16:14   0:00 /managed-agents/execute-command/amazon-ssm-agent
1337        34  0.2  0.6 2342264 55436 ?       Sl   16:14   0:01 /usr/bin/envoy -c /tmp/envoy-config-4251528163.yaml -l info --drain-time-s 20 --disable-hot-restart
root        49  0.0  0.3 1932888 26288 ?       Sl   16:14   0:00 /managed-agents/execute-command/ssm-agent-worker
root       267  0.1  0.2 1851388 20628 ?       Sl   16:17   0:00 /managed-agents/execute-command/ssm-session-worker i-063bb9c7447af9c9a-0b5e8a38c71f87cb1
root       275  0.0  0.0  13576  3224 pts/0    Ss   16:17   0:00 /bin/bash
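
To run the `Max open files` check above against many tasks or pods without a shell, the line from /proc/<pid>/limits can be parsed directly. The sketch below is a minimal version under assumptions: `parseMaxOpenFiles` is a hypothetical helper, it returns the values as strings because the kernel can print "unlimited", and the `main` function reads /proc/self/limits, which only exists on Linux.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// parseMaxOpenFiles extracts the soft and hard "Max open files" values
// from the text of a /proc/<pid>/limits file.
func parseMaxOpenFiles(limits string) (soft, hard string, err error) {
	const prefix = "Max open files"
	for _, line := range strings.Split(limits, "\n") {
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		// The remaining columns are: soft limit, hard limit, units.
		fields := strings.Fields(strings.TrimPrefix(line, prefix))
		if len(fields) < 2 {
			return "", "", errors.New("malformed \"Max open files\" line")
		}
		return fields[0], fields[1], nil
	}
	return "", "", errors.New("no \"Max open files\" line found")
}

func main() {
	data, err := os.ReadFile("/proc/self/limits")
	if err != nil {
		fmt.Println("cannot read limits:", err)
		return
	}
	soft, hard, err := parseMaxOpenFiles(string(data))
	if err != nil {
		fmt.Println("cannot parse limits:", err)
		return
	}
	fmt.Printf("soft=%s hard=%s\n", soft, hard)
}
```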

@karanvasnani

I have built a custom image with the change introduced in this PR (aws/amazon-ecs-service-connect-agent#73) for testing: 354290006986.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.27.3.0-raise-nofile-soft-limit, and will validate this week whether it works.

@axot
Author

axot commented Jun 26, 2024

Closing the issue, as v1.29.5.0 has been released.

@axot axot closed this as completed Jun 26, 2024

3 participants