Summary
We have an AppMesh Virtual Gateway deployed as an ECS Fargate service, attached to an NLB Target Group. The NLB is in turn attached to two API Gateways (Public and Internal), exposing different APIs of the underlying services.
The backing services are also AppMesh Virtual Services, with Virtual Routers and Virtual Nodes, deployed as ECS Fargate services.
This is how the memory utilization has looked over the last few months:
The orange spikes denote increases in the PendingCount metric of the service, which mark redeployments of the Virtual Gateway ECS service. The blue graph is the memory utilization, which increases slowly but constantly between deployments.
The Envoy version in use is public.ecr.aws/appmesh/aws-appmesh-envoy:v1.20.0.1-prod, with the following configured:
ENABLE_ENVOY_XRAY_TRACING = 1
XRAY_SAMPLING_RULE_MANIFEST is set
ENABLE_ENVOY_STATS_TAGS = 1
ENABLE_ENVOY_DOG_STATSD = 1
ENVOY_LOG_LEVEL = warning
APPMESH_METRIC_EXTENSION_VERSION = 1
The envoy image in the Virtual Gateway ECS task is accompanied by X-Ray and CloudWatch sidecars that handle traces and StatsD metrics.
The VG ECS service runs with 256 CPU units and 512 MB of RAM.
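For context, a minimal sketch of what such a Fargate task definition looks like; the APPMESH_RESOURCE_ARN value, the sampling-manifest path, and the sidecar image tags are illustrative placeholders rather than our exact values:

```json
{
  "family": "virtual-gateway",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "envoy",
      "image": "public.ecr.aws/appmesh/aws-appmesh-envoy:v1.20.0.1-prod",
      "essential": true,
      "environment": [
        { "name": "APPMESH_RESOURCE_ARN", "value": "arn:aws:appmesh:<region>:<account>:mesh/<mesh>/virtualGateway/<vgw>" },
        { "name": "ENABLE_ENVOY_XRAY_TRACING", "value": "1" },
        { "name": "XRAY_SAMPLING_RULE_MANIFEST", "value": "/tmp/sampling-rules.json" },
        { "name": "ENABLE_ENVOY_STATS_TAGS", "value": "1" },
        { "name": "ENABLE_ENVOY_DOG_STATSD", "value": "1" },
        { "name": "ENVOY_LOG_LEVEL", "value": "warning" },
        { "name": "APPMESH_METRIC_EXTENSION_VERSION", "value": "1" }
      ]
    },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "essential": false
    },
    {
      "name": "cloudwatch-agent",
      "image": "public.ecr.aws/cloudwatch-agent/cloudwatch-agent:latest",
      "essential": false
    }
  ]
}
```

The sidecars are marked non-essential so that a telemetry container stopping does not take the gateway task down with it.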
Behind the Virtual Gateway we have around 15 Virtual Services that the VG routes to.
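For illustration, each of those services is wired to the VG through a gateway route; a minimal sketch of creating one (mesh, gateway, and service names are placeholders) might look like:

```sh
# Attach one backend Virtual Service to the Virtual Gateway by path prefix.
aws appmesh create-gateway-route \
  --mesh-name <mesh-name> \
  --virtual-gateway-name <virtual-gateway-name> \
  --gateway-route-name service-a-route \
  --spec '{
    "httpRoute": {
      "match": { "prefix": "/service-a" },
      "action": { "target": { "virtualService": { "virtualServiceName": "service-a.internal.local" } } }
    }
  }'
```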
The number of requests on the Virtual Gateway does not exceed 1,000 req/minute, and usually stays at around 900 req/minute for the entire day (it drops significantly during the night).
Steps to Reproduce
N/A.
Are you currently working around this issue?
We redeploy / rotate the VG ECS tasks occasionally, which causes the ECS service memory to drop.
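For anyone applying the same workaround, a sketch of forcing such a rotation without touching the task definition (cluster and service names are placeholders):

```sh
# Start fresh tasks and drain the old ones; memory utilization resets with them.
aws ecs update-service \
  --cluster <cluster-name> \
  --service <virtual-gateway-service> \
  --force-new-deployment
```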
Follow-up: we have just observed that the Virtual Gateway ECS task actually runs until it reaches 100% memory utilization, and then fails. So it seems there is no GC; the problem really exists and causes the tasks to fail after some time.
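Until the root cause is found, an alarm along these lines (names, threshold, and the SNS topic are placeholders) would flag the creep before tasks start failing:

```sh
# Alert when the service averages above 90% memory utilization for 15 minutes,
# leaving time to rotate the tasks before they reach 100% and fail.
aws cloudwatch put-metric-alarm \
  --alarm-name vgw-memory-high \
  --namespace AWS/ECS \
  --metric-name MemoryUtilization \
  --dimensions Name=ClusterName,Value=<cluster-name> Name=ServiceName,Value=<virtual-gateway-service> \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions <sns-topic-arn>
```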
Hi @mkielar, from the details shared here it is not clear whether the issue is with Envoy itself or caused by some configuration supplied in App Mesh. To understand the root cause of this behavior, we will need more details, such as the App Mesh configuration, Envoy logs, and a heap dump. Please create a case with AWS App Mesh Customer Support so that we can work closely to unblock you.
Hi @shwetasahuit,
I've submitted the support ticket and uploaded some debug logs; I hope they find their way to your team. It has also been suggested that envoyproxy/envoy#20800 may be the reason. I'll see if I have a chance to downgrade our Envoys and check whether that helps.
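In the meantime, for anyone wanting to watch the growth from inside the task: Envoy's admin interface (port 9901 by default on the App Mesh image) exposes its heap figures:

```sh
# Heap figures as Envoy sees them; "allocated" growing steadily
# between deployments is consistent with a leak.
curl -s http://localhost:9901/memory

# The same numbers are also exported as server.memory_* stats.
curl -s http://localhost:9901/stats | grep server.memory
```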