
Bug: Virtual Gateway Memory Leak #399

Closed
mkielar opened this issue Mar 23, 2022 · 3 comments
Labels
Bug Something isn't working

Comments


mkielar commented Mar 23, 2022

Summary
We have an AppMesh Virtual Gateway deployed as an ECS Fargate service, attached to an NLB target group. The NLB is then attached to two API Gateways (public and internal), exposing different APIs of the underlying services.

The services themselves are also AppMesh Virtual Services, with Virtual Routers and Virtual Nodes, deployed as ECS Fargate services.

This is how the memory utilization has looked over the last few months:
[image: memory utilization of the Virtual Gateway ECS service over the last few months]

The orange spikes denote increases in the PendingCount metric of the service, which mark redeployments of the Virtual Gateway ECS service. The blue graph is the memory utilization, which is slowly but constantly increasing between deployments.

The Envoy version in use is public.ecr.aws/appmesh/aws-appmesh-envoy:v1.20.0.1-prod, with the following configured:

  • ENABLE_ENVOY_XRAY_TRACING = 1
  • XRAY_SAMPLING_RULE_MANIFEST is set
  • ENABLE_ENVOY_STATS_TAGS = 1
  • ENABLE_ENVOY_DOG_STATSD = 1
  • ENVOY_LOG_LEVEL = warning
  • APPMESH_METRIC_EXTENSION_VERSION = 1

The Envoy image in the Virtual Gateway ECS task is accompanied by X-Ray and CloudWatch sidecars to handle traces and statsd metrics; a rough sketch of the container setup is below.
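
For context, the relevant part of our task definition looks roughly like this (a sketch, not our exact definition; the family name, role ARN, resource ARN, and sampling-rule path are placeholders), expressed here with boto3:

```python
# Rough sketch of how the Envoy container for the Virtual Gateway is configured.
# Names, ARNs, and paths below are placeholders, not our real values.
import boto3

ecs = boto3.client("ecs")

envoy_container = {
    "name": "envoy",
    "image": "public.ecr.aws/appmesh/aws-appmesh-envoy:v1.20.0.1-prod",
    "essential": True,
    "environment": [
        {"name": "APPMESH_RESOURCE_ARN", "value": "arn:aws:appmesh:REGION:ACCOUNT:mesh/MESH/virtualGateway/GW"},  # placeholder
        {"name": "ENABLE_ENVOY_XRAY_TRACING", "value": "1"},
        {"name": "XRAY_SAMPLING_RULE_MANIFEST", "value": "/etc/xray/sampling-rules.json"},  # placeholder path
        {"name": "ENABLE_ENVOY_STATS_TAGS", "value": "1"},
        {"name": "ENABLE_ENVOY_DOG_STATSD", "value": "1"},
        {"name": "ENVOY_LOG_LEVEL", "value": "warning"},
        {"name": "APPMESH_METRIC_EXTENSION_VERSION", "value": "1"},
    ],
}

ecs.register_task_definition(
    family="virtual-gateway",                      # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",                                     # 256 CPU units at the task level
    memory="512",                                  # 512 MB RAM at the task level
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[envoy_container],        # X-Ray and CloudWatch sidecars omitted for brevity
)
```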

The VG ECS service runs on 256 CPU units and 512 MB of RAM.

Behind the Virtual Gateway we have around 15 Virtual Services that the VG routes to.

The number of requests on the Virtual Gateway does not exceed 1,000 requests/minute, and usually stays at around 900 requests/minute for the entire day (it falls significantly during the night).

Steps to Reproduce
N/A.

Are you currently working around this issue?
We occasionally redeploy / rotate the VG ECS tasks, which causes the ECS service's memory utilization to drop; a sketch of this is below.
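
Concretely, the rotation is just a forced new deployment of the service, roughly like this (cluster and service names are placeholders):

```python
# Minimal sketch of the workaround: force a new deployment so ECS replaces the
# running tasks and memory usage resets. Cluster/service names are placeholders.
import boto3

ecs = boto3.client("ecs")
ecs.update_service(
    cluster="my-cluster",                # placeholder
    service="virtual-gateway-service",   # placeholder
    forceNewDeployment=True,             # rotate tasks without changing the task definition
)
```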

@mkielar mkielar added the Bug Something isn't working label Mar 23, 2022
@herrhound herrhound added this to Researching in aws-app-mesh-known-issues Mar 28, 2022

mkielar commented Apr 1, 2022

Follow-up: we have just observed that the ECS task running the Virtual Gateway actually runs to the point where it reaches 100% memory utilization, and then fails. So it seems there is no GC; the problem really exists and causes the tasks to fail after some time.

Graphs:
[images: memory utilization climbing to 100% before the task fails]


shsahu commented Apr 1, 2022

Hi @mkielar, from the details shared here it is not clear whether the issue is in Envoy itself or caused by some configuration supplied in App Mesh. To understand the root cause of this behavior, we will need more details such as the App Mesh configuration, Envoy logs, and a heap dump. Please create a case with AWS App Mesh Customer Support so that we can work closely to unblock you.
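
For the memory details, something along these lines can be used to capture Envoy's memory stats over time (a rough sketch; it assumes the default App Mesh Envoy admin port 9901 is reachable from inside the task):

```python
# Sketch: periodically poll Envoy's /memory admin endpoint and log the result.
# Assumes the admin interface is on localhost:9901 (the App Mesh default); adjust as needed.
import json
import time
import urllib.request

ADMIN = "http://localhost:9901"  # placeholder; point at the Envoy admin address

while True:
    with urllib.request.urlopen(f"{ADMIN}/memory") as resp:
        stats = json.load(resp)
    # /memory reports tcmalloc figures such as "allocated" and "heap_size" (bytes),
    # which are the numbers expected to grow if there is a leak.
    print(time.strftime("%Y-%m-%dT%H:%M:%S"), stats)
    time.sleep(60)
```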


mkielar commented Apr 26, 2022

Hi @shwetasahuit,
I've submitted the support ticket and uploaded some debug logs; I hope they find their way to your team. It has also been suggested that this may be the cause: envoyproxy/envoy#20800. I'll see if I get a chance to downgrade our Envoys and check whether that helps.

@shsahu shsahu closed this as completed Jun 15, 2022
@shsahu shsahu moved this from Researching to Implementing in aws-app-mesh-known-issues Jun 15, 2022
@shsahu shsahu moved this from Implementing to Planned in aws-app-mesh-known-issues Jun 15, 2022
@shsahu shsahu moved this from Planned to Recently closed in aws-app-mesh-known-issues Jun 15, 2022