
Bug: Virtual Gateway Memory Leak #399

Closed
mkielar opened this issue Mar 23, 2022 · 3 comments
Labels
Bug Something isn't working

Comments


mkielar commented Mar 23, 2022

Summary
We have an AppMesh Virtual Gateway deployed as an ECS Fargate service, attached to an NLB target group. The NLB is then attached to two API Gateways (public and internal), exposing different APIs of the underlying services.

The services themselves are also AppMesh Virtual Services, with Virtual Routers and Virtual Nodes, deployed as ECS Fargate services.

This is how the memory utilization has looked over the last few months:
[image: memory utilization of the Virtual Gateway ECS service over the last few months]

The orange spikes denote increases in the PendingCount metric of the service, which mark redeployments of the Virtual Gateway ECS service. The blue graph is the memory utilization, which is slowly but constantly increasing between deployments.

The Envoy version in use is public.ecr.aws/appmesh/aws-appmesh-envoy:v1.20.0.1-prod, with the following configured:

  • ENABLE_ENVOY_XRAY_TRACING = 1
  • XRAY_SAMPLING_RULE_MANIFEST is set
  • ENABLE_ENVOY_STATS_TAGS = 1
  • ENABLE_ENVOY_DOG_STATSD = 1
  • ENVOY_LOG_LEVEL = warning
  • APPMESH_METRIC_EXTENSION_VERSION = 1

The Envoy image in the Virtual Gateway ECS task is accompanied by X-Ray and CloudWatch sidecars to handle traces and statsd metrics; a rough sketch of the container setup is below.
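
For context, the relevant part of our task definition looks roughly like this (a sketch, not our exact definition; the family name, role ARN, resource ARN, and sampling-rule path are placeholders), expressed here with boto3:

```python
# Rough sketch of how the Envoy container for the Virtual Gateway is configured.
# Names, ARNs, and paths below are placeholders, not our real values.
import boto3

ecs = boto3.client("ecs")

envoy_container = {
    "name": "envoy",
    "image": "public.ecr.aws/appmesh/aws-appmesh-envoy:v1.20.0.1-prod",
    "essential": True,
    "environment": [
        {"name": "APPMESH_RESOURCE_ARN", "value": "arn:aws:appmesh:REGION:ACCOUNT:mesh/MESH/virtualGateway/GW"},  # placeholder
        {"name": "ENABLE_ENVOY_XRAY_TRACING", "value": "1"},
        {"name": "XRAY_SAMPLING_RULE_MANIFEST", "value": "/etc/xray/sampling-rules.json"},  # placeholder path
        {"name": "ENABLE_ENVOY_STATS_TAGS", "value": "1"},
        {"name": "ENABLE_ENVOY_DOG_STATSD", "value": "1"},
        {"name": "ENVOY_LOG_LEVEL", "value": "warning"},
        {"name": "APPMESH_METRIC_EXTENSION_VERSION", "value": "1"},
    ],
}

ecs.register_task_definition(
    family="virtual-gateway",                      # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",                                     # 256 CPU units at the task level
    memory="512",                                  # 512 MB RAM at the task level
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[envoy_container],        # X-Ray and CloudWatch sidecars omitted for brevity
)
```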

The VG ECS service runs on 256 CPU units and 512 MB of RAM.

Behind the Virtual Gateway we have around 15 Virtual Services that the VG routes to.

The number of requests on the Virtual Gateway does not exceed 1,000 requests/minute, and usually stays at around 900 requests/minute for the entire day (it falls significantly during the night).

Steps to Reproduce
N/A.

Are you currently working around this issue?
We occasionally redeploy / rotate the VG ECS tasks, which causes the ECS service's memory utilization to drop; a sketch of this is below.
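
Concretely, the rotation is just a forced new deployment of the service, roughly like this (cluster and service names are placeholders):

```python
# Minimal sketch of the workaround: force a new deployment so ECS replaces the
# running tasks and memory usage resets. Cluster/service names are placeholders.
import boto3

ecs = boto3.client("ecs")
ecs.update_service(
    cluster="my-cluster",                # placeholder
    service="virtual-gateway-service",   # placeholder
    forceNewDeployment=True,             # rotate tasks without changing the task definition
)
```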

@mkielar mkielar added the Bug Something isn't working label Mar 23, 2022
@herrhound herrhound added this to Researching in aws-app-mesh-known-issues Mar 28, 2022

mkielar commented Apr 1, 2022

Follow-up: we have just observed that the ECS task running the Virtual Gateway actually runs to the point where it reaches 100% memory utilization, and then fails. So it seems there is no GC; the problem really exists and causes the tasks to fail after some time.

Graphs:
[images: memory utilization climbing to 100% before the task fails]


shsahu commented Apr 1, 2022

Hi @mkielar, from the details shared here it is not clear whether the issue is in Envoy itself or caused by some configuration supplied in App Mesh. To understand the root cause of this behavior, we will need more details such as the App Mesh configuration, Envoy logs, and a heap dump. Please create a case with AWS App Mesh Customer Support so that we can work closely to unblock you.
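
For the memory details, something along these lines can be used to capture Envoy's memory stats over time (a rough sketch; it assumes the default App Mesh Envoy admin port 9901 is reachable from inside the task):

```python
# Sketch: periodically poll Envoy's /memory admin endpoint and log the result.
# Assumes the admin interface is on localhost:9901 (the App Mesh default); adjust as needed.
import json
import time
import urllib.request

ADMIN = "http://localhost:9901"  # placeholder; point at the Envoy admin address

while True:
    with urllib.request.urlopen(f"{ADMIN}/memory") as resp:
        stats = json.load(resp)
    # /memory reports tcmalloc figures such as "allocated" and "heap_size" (bytes),
    # which are the numbers expected to grow if there is a leak.
    print(time.strftime("%Y-%m-%dT%H:%M:%S"), stats)
    time.sleep(60)
```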


mkielar commented Apr 26, 2022

Hi @shwetasahuit,
I've submitted the support ticket and uploaded some debug logs; I hope they find their way to your team. It has also been suggested that this may be the cause: envoyproxy/envoy#20800. I'll see if I get a chance to downgrade our Envoys and check whether that helps.

@shsahu shsahu closed this as completed Jun 15, 2022
@shsahu shsahu moved this from Researching to Implementing in aws-app-mesh-known-issues Jun 15, 2022
@shsahu shsahu moved this from Implementing to Planned in aws-app-mesh-known-issues Jun 15, 2022
@shsahu shsahu moved this from Planned to Recently closed in aws-app-mesh-known-issues Jun 15, 2022