ASP.NET Core hoarding memory #6803

Closed
vtortola opened this issue Jan 17, 2019 · 7 comments

@vtortola

This issue is following up on #1976

The problem

We have a websocket server that hoards memory over several days, to the point that Kubernetes eventually kills it. We monitor it using prometheus-net.

I can see in the graphs that the GC is collecting regularly in all generations.

Server GC is disabled using:

<PropertyGroup>
  <ServerGarbageCollection>false</ServerGarbageCollection>
</PropertyGroup>

Before disabling Server GC, the service's memory grew much faster. Now it takes two weeks to reach 512 MB.

Other services using ASP.NET Core in a request/response fashion do not show this problem. This one uses WebSockets, where each connection usually lasts around 10 minutes, so I guess everything related to the connection easily survives until Gen 2.

The application

The application is a very simple ASP.NET Core application with two controllers: a simple one for readiness/liveness probes from Kubernetes, and another for establishing WebSocket connections.

We did some preliminary, rough tests and verified that we could handle 500 concurrent websockets per pod using 512 MB. We ran for hours with 2 pods and 1000 concurrent connections with memory staying below 150 MB. The deployed application, with 2 pods, has between 150 and 300 concurrent connections at any moment, and memory ranges from less than 100 MB in the first few days to 512 MB after around 2 weeks. There seems to be no correlation between the number of connections and the memory used.

More than 70% of the connections last 10 minutes. Connections usually die abruptly because the load balancer cuts them after 600 seconds (10 min).

We have a limit of 512Mb per pod set using Kubernetes limits:

Limits:
  cpu:     1
  memory:  512Mi
Requests:
  cpu:     100m
  memory:  256Mi

Message rate is very low. We have a keep-alive interval defined:

        app.UseWebSockets(new WebSocketOptions()
        {
           KeepAliveInterval = TimeSpan.FromSeconds(5)
        });

We use dotnet 2.1.6 on Linux using microsoft/dotnet:2.1-aspnetcore-runtime as base.

# dotnet --info

Host (useful for support):
  Version: 2.1.6
  Commit:  3f4f8eebd8

.NET Core SDKs installed:
  No SDKs were found.

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

This is a usual pattern in the application metrics:

[six metrics graphs, images not included]

The surprising thing

When we connect remotely and take a memory dump (using createdump), the memory suddenly drops, without the service stopping, restarting, or losing any connected user. See the green line in the picture.

[image]
Note that there are two pods showing the same behaviour, and then one (the green) suddenly drops in memory usage when the memory dump is taken.

[image]

The pods did not restart while the memory dump was taken:
[image]

No connection was lost or restarted.

Memory dump data

I cannot share the dump for security reasons, but here is some data:

And the result of dumpheap -stat : https://pastebin.com/ERN7LZ0n

Heap:

(lldb) eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x00007F8481C8D0B0
generation 1 starts at 0x00007F8481C7E820
generation 2 starts at 0x00007F852A1D7000
ephemeral segment allocation context: none
         segment             begin         allocated              size
00007F852A1D6000  00007F852A1D7000  00007F853A1D5E90  0xfffee90(268430992)
00007F84807D0000  00007F84807D1000  00007F8482278000  0x1aa7000(27947008)
Large object heap starts at 0x00007F853A1D7000
         segment             begin         allocated              size
00007F853A1D6000  00007F853A1D7000  00007F853A7C60F8  0x5ef0f8(6222072)
Total Size:              Size: 0x12094f88 (302600072) bytes.
------------------------------
GC Heap Size:            Size: 0x12094f88 (302600072) bytes.
(lldb)

Free objects:

(lldb) dumpheap -type Free -stat
Statistics:
              MT    Count    TotalSize Class Name
00000000010c52b0   219774     10740482      Free
Total 219774 objects

Is there any explanation for this behaviour?

@sebastienros
Member

Thanks for all these details.

I notice you are not using 2.2. Would it be possible for you to upgrade your base image? A bug that was the cause of many performance issues was fixed there. In the meantime I will set up a service similar to yours and let it run for a few days.

@vtortola
Author

I will update it tomorrow, but we will have to wait at least for a week to see something relevant 👍

@vtortola
Author

Unfortunately, moving to 2.2 does not help. Memory keeps growing.

image

@ZOXEXIVO

ZOXEXIVO commented Mar 4, 2019

I hope that this issue will be the quintessence of the whole struggle with the GC in ASP.NET Core.

@Bio2hazard

Bio2hazard commented Mar 14, 2019

We are running an AspNet Core 2.1.4 microservice deployed on Alpine Linux, hosted on AWS ECS ( docker image microsoft/dotnet:2.1.4-aspnetcore-runtime-alpine ).

It's using server GC and has 1536MB available.

Our microservice with websockets is not experiencing any memory growth.

We do not explicitly set a KeepAliveInterval on the socket so it uses the default of 2 minutes. It's a shot in the dark, but have you tried to modify the KeepAliveInterval to see if it changes the rate at which the memory usage grows ?

Also, I assume the reason your memory usage dropped is due to createdump triggering a blocking collection prior to taking the dump.

You could verify that theory by wiring up some mechanism to run GC.Collect(2, GCCollectionMode.Forced, true); - if that causes a similar drop in memory usage as your createdump did, we can assume that the GC is deferring collecting them due to not wanting to interrupt the application.
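A minimal sketch of such a mechanism, assuming it is called from something like a diagnostics-only controller action. The `GcProbe` class and `ForceFullCollection` method names are hypothetical. Only the `GC` calls are real framework APIs:

```csharp
using System;

// Hypothetical helper: forces a blocking Gen 2 collection and reports the
// managed heap size before and after, so the drop (if any) can be logged.
public static class GcProbe
{
    public static (long Before, long After) ForceFullCollection()
    {
        // Managed heap size as currently tracked, without triggering a GC.
        long before = GC.GetTotalMemory(forceFullCollection: false);

        // The call suggested above: full, forced, blocking collection.
        GC.Collect(2, GCCollectionMode.Forced, blocking: true);

        // Let finalizers run, then collect whatever they released.
        GC.WaitForPendingFinalizers();
        GC.Collect(2, GCCollectionMode.Forced, blocking: true);

        long after = GC.GetTotalMemory(forceFullCollection: false);
        return (before, after);
    }
}
```

If the process memory graph shows a drop comparable to the one seen after createdump right after this call, that would support the deferred-collection theory. If it does not, the objects are probably still rooted somewhere.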

Lastly, in our experience running dotnet core on linux, ( all with server GC mind you ), it has a tendency to keep and re-use memory instead of returning it to the OS for performance gains.

image

Note how even though the GC.GetTotalMemory ( with force: false ) reports the size of allocated objects between 9 MB - 100 MB, the working set ( via Process.GetCurrentProcess().WorkingSet64 ) remains at 243 MB.
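For reference, the two numbers compared above can be sampled together roughly like this. The `MemoryReport` class and its member names are hypothetical. `GC.GetTotalMemory` and `Process.WorkingSet64` are the calls named in the comment:

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: samples the managed heap size and the process
// working set at the same moment so they can be logged side by side.
public static class MemoryReport
{
    public static (long ManagedBytes, long WorkingSetBytes) Capture()
    {
        // Size of allocated managed objects; 'false' avoids forcing a GC.
        long managed = GC.GetTotalMemory(forceFullCollection: false);

        // Physical memory currently mapped to the process, as seen by the OS.
        using (Process process = Process.GetCurrentProcess())
        {
            return (managed, process.WorkingSet64);
        }
    }

    public static string Describe()
    {
        var (managed, workingSet) = Capture();
        const long MB = 1024 * 1024;
        return $"managed={managed / MB} MB, workingSet={workingSet / MB} MB";
    }
}
```

A large, stable gap between the two values suggests the runtime is holding on to memory it has already freed internally, rather than the application retaining live objects.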

There had been issues in the past ( on 2.1.3 and earlier ) with .net growing too aggressively and not properly respecting the limits set by the container and getting memkilled, but they seem to be fixed in 2.1.4. I hope this helps a little bit. Good luck!

@vtortola
Author

Testing the GC.Collect(2, GCCollectionMode.Forced, true); thing now. I will share the results in a few days.

@vtortola
Author

I had the chance to run the application on a Windows machine, reproduce the conditions that lead to the memory leak, take a snapshot and analyze it with dotMemory.

The problem was a feature of RabbitMQ.Client named "AutorecoveringConnection", which keeps information about existing subscriptions in order to recover them in the event of a reconnection. Setting ConnectionFactory.AutomaticRecoveryEnabled to false in the RabbitMQ.Client configuration solved the problem (since we do not need this feature).
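A sketch of what that configuration change might look like, assuming the RabbitMQ.Client NuGet package is referenced. The `BrokerConnection` class is hypothetical and the host name is a placeholder, not the value used in this service:

```csharp
using RabbitMQ.Client;

// Hypothetical factory wrapper showing only the setting discussed above.
public static class BrokerConnection
{
    public static ConnectionFactory CreateFactory()
    {
        return new ConnectionFactory
        {
            HostName = "localhost", // placeholder, adjust to the real broker

            // Disables the auto-recovery feature, so the client no longer
            // records subscriptions in order to restore them on reconnect.
            AutomaticRecoveryEnabled = false
        };
    }
}
```

With recovery disabled, the application has to handle reconnection itself if it ever needs it.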

AutorecoveringConnection

Our application uses a channel per websocket connection, and it seems channels are not as lightweight as I thought. However, I see no reason to keep 11K+ EventingBasicConsumer objects in memory when only 500 users are connected at most. After disabling the auto-recovery feature, these objects are not piling up anymore :(

Why did memory usage drop when taking the memory snapshot? Probably RabbitMQ.Client disconnected by timeout (since the process freezes) and something happened internally in the component that released the held objects. I hope to have some time in the future to investigate this further, but for now it is clear this is not an ASP.NET Core issue.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 3, 2019