Native memory leak with ETW EventSource #12374

pharring · 2019-03-29T05:45:42Z

Reported originally at microsoft/ApplicationInsights-dotnet#1102

Repros on .NET Core 2.2, x86 and x64 on Windows.

Build the following application:

using System.Diagnostics.Tracing;

namespace RPESLeak
{
    class MyEventSource : EventSource
    {
        private const EventKeywords Operations = (EventKeywords)0x1;

        private const string EventProviderName = "PSH-LeakTest";

        public MyEventSource() : base(
               EventProviderName,
               EventSourceSettings.EtwSelfDescribingEventFormat)
        {
        }

        public void ProcessOperationStart(long operation)
        {
            if (IsEnabled(EventLevel.Informational, Operations))
            {
                var payload = new { Leak = "leaky", Id = operation };
                Write(
                    "Request",
                    new EventSourceOptions { Keywords = Operations, Opcode = EventOpcode.Start, Level = EventLevel.Informational },
                    payload);
            }
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var eventSource = new MyEventSource();
            for (long i = 0; ; i++)
            {
                eventSource.ProcessOperationStart(i);
            }
        }
    }
}

Run it on Windows and watch the memory usage of dotnet.exe in Task Manager. It settles down on my machine around 10 MB.

Now enable the ETW event listener for the event source. The easiest way to do that is to run PerfView. Press Alt+C to start a collection and make it look like this:

The important thing is to add this in Additional Providers:
PSH-LeakTest::Informational
You can turn off all the other options.
Now, when you start logging, watch the memory usage in dotnet.exe climb like crazy (over 200 MB in 5 seconds).
When you stop logging, the memory isn't freed. If you create a dump of the dotnet.exe process, the GC heap is fine. Most of the additional memory is in native heap and it seems to consist of the event manifests repeated over and over again.

This does not reproduce on .NET 4.7.2
I don't know about earlier .NET Core runtimes.

Please let me know if you need any more information, but I think this is a pretty minimal repro with clear instructions. One thing I could do is eliminate PerfView by writing a custom app based around TraceEvent or even a WPR profile.

To be clear how this affects Application Insights customers: The Application Insights Profiler runs in Azure App Services. It captures ETW sessions for 2 minutes at a time about once an hour. One of the providers it enables (listens to), is the Application Insights "Rich Payload Event Source" (see the original bug in the Application Insights repo). Customers are now encountering these memory leaks under load when profiling is running leading to application restarts.

The text was updated successfully, but these errors were encountered:

pharring · 2019-03-29T12:42:09Z

The thought occurred to me that PerfView is just the tool for tracking down the heap allocations. Indeed, here are the relevant allocating stacks:

Name	Inc %
\|\| + ntdll!RtlpAllocateHeapInternal	98.6
\|\| + coreclr!EEHeapAllocInProcessHeap	98.6
\|\| + coreclr!operator new[]	50.2
\|\| \|+ coreclr!EventPipeEvent::EventPipeEvent	50.2
\|\| \|\|+ coreclr!EventPipeProvider::AddEvent	50.2
\|\| \|\| + coreclr!EventPipeInternal::DefineEvent	50.2
\|\| \|\| + system.private.corelib!IEventProvider.DefineEventHandle	50.2
\|\| \|\| + system.private.corelib!NameInfo.GetOrCreateEventHandle	50.2
\|\| \|\| + system.private.corelib!EventSource.WriteImpl	50.2
\|\| \|\| + ?!?	50.2
\|\| \|+ coreclr!CStructArray::Grow	0.0
\|\| + coreclr!operator new	48.4
\|\| + coreclr!EventPipeProvider::AddEvent	48.4
\|\| + coreclr!EventPipeInternal::DefineEvent	33.8
\|\| \|+ system.private.corelib!IEventProvider.DefineEventHandle	33.8
\|\| \| + system.private.corelib!NameInfo.GetOrCreateEventHandle	33.8
\|\| \| + system.private.corelib!EventSource.WriteImpl	33.8
\|\| \| + ?!?	33.8
\|\| + coreclr!EventPipeProvider::AddEvent	14.6
\|\| + coreclr!EventPipeInternal::DefineEvent	14.6
\|\| + system.private.corelib!IEventProvider.DefineEventHandle	14.6
\|\| + system.private.corelib!NameInfo.GetOrCreateEventHandle	14.6
\|\| + system.private.corelib!EventSource.WriteImpl	14.6
\|\| + ?!?	14.6

vancem · 2019-03-29T15:23:46Z

@jorive who owns EventSource and EventPipe
@noahfalk owns the whole area
@cshung has been fixing bugs associated with EventPipe (which this seems to be)

@pharring We should probably move this bug to the coreclr repo since it is almost certainly going to be fixed by modifying the .NET Core runtime.

jorive · 2019-03-30T00:24:28Z

@pharring I'm taking a look.

pharring · 2019-04-01T16:43:04Z

FYI, I was able to eliminate PerfView from the repro by building my own TraceEvent-based listener. I also got a simplified repro in Azure App Service. They all have the same symptoms, pointing to this as the root cause.

jorive · 2019-04-01T16:54:43Z

Thanks @pharring, I am able to repro it locally with the code above.

vancem · 2019-04-01T17:57:26Z

Including @brianrob who wrote this.

I took a quick look and the key place is the stack

Name
--
\|\|  \|\|\|\|\|\|  \|+ system.private.corelib!System.Diagnostics.Tracing.EventSource.Write[System.__Canon](System.String, System.Diagnostics.Tracing.EventSourceOptions, System.__Canon)
\|\|  \|\|\|\|\|\|  \|\|+ system.private.corelib!EventSource.WriteImpl
\|\|  \|\|\|\|\|\|  \|\|\|+ system.private.corelib!NameInfo.GetOrCreateEventHandle
\|\|  \|\|\|\|\|\|  \|\|\|\|+ system.private.corelib!IEventProvider.DefineEventHandle
\|\|  \|\|\|\|\|\|  \|\|\|\|\|+ coreclr!EventPipeInternal::DefineEvent
\|\|  \|\|\|\|\|\|  \|\|\|\|\| + coreclr!EventPipeProvider::AddEvent

Which says that starting with an event write, we try to create a handle for it (for EventPipe), and this seems to allocate ALL THE TIME. However a handle to an event is about its Meta-Data, and thus there should be a lookup and MOST of the time we should NOT be creating new event meta-data.

Thus the basic issue is that we are missing our 'lookup' that would find that we already have meta-data for this event.

@brianrob - Can you quickly tell us where it SHOULD be happening?

There is a secondary issue that we should not need this meta-data at all unless EventPipe is turned on (which it is not), so it would be nice if we avoided even generating the handle if EventPipe is off.

@jorive - We should also check the scenario where we are collecting with EventPIpe instead of ETW before we call this bug fixed.

jorive · 2019-04-01T19:51:28Z

I'm unable to repro in netcoreapp3.0, and dotnet/coreclr#18355 seems to have fixed it.

vancem · 2019-04-01T20:18:18Z

@jorive - I assume you are going to try back-porting that fix (it is not that big).

@pharring - assuming that the fix above fixes things, the main thing we need to figure out is the impact of the bug. Can you detail the impact so that @jorive and make the case to the triage group.

In particular, in a 'normal' use of Application Insights, how bad is the leak? I am assuming that it is big enough that it forces customers to turn off App-Insights. It will also affect any other data collection (thus things leak if you also collect with PerfView).

We should err on the side of fixing this, Looking at https://github.com/dotnet/core/blob/master/release-notes/download-archive.md, we have already released 3 updates to 2.2. and presumably we will have another within a month. That will likely live at least 6 more months before V3.0 becomes available, so I think it is worth it.

brianrob · 2019-04-01T20:24:33Z

@vancem, you're right that this is about metadata, and prior to dotnet/coreclr#18355, the cache was essentially busted and so we always created a new metadata blob. With dotnet/coreclr#18355 this should no longer happen and only a single blob per event should be created.

It is likely possible to avoid creating the metadata if EventPipe isn't enabled, but if I recall correctly, I did the metadata generation regardless because it was hard to get to a place outside of event initialization that had enough of the data required to produce the metadata.

mattzink · 2019-04-01T20:30:45Z

FWIW, our team had to remove all usage of EventPipe used for diagnostics in our product, as the memory leak was extreme and would OOM kill our process every couple of hours. Backporting any fixes to 2.2 would be much appreciated.

yahorsi · 2019-04-01T20:35:04Z

as the memory leak was extreme and would OOM kill our process every couple of hours. Backporting any fixes to 2.2 would be much appreciated.

+1
Waiting for 3.0 to be fully shipped=>fully tested for our app=>being available in azure is a long path to go

I am assuming that it is big enough that it forces customers to turn off App-Insights.

It is actually depends on the load. The leak is about ~200 bytes per event as I understand

vancem · 2019-04-01T20:39:41Z

@jorive - it sounds like this bug is rather severe. I think we have a strong case to back-port it to V2.2. We should pursue that.

noahfalk · 2019-04-01T20:44:20Z

Agreed

jorive · 2019-04-01T22:37:34Z

I'm preparing port to 2.2 right now.

yahorsi · 2019-04-02T15:20:38Z

Any idea if this could be available on Azure at all? I mean deployed to Azure and so could be used in App Service? If yes then how long does it takes usually? :)

avandych · 2019-06-25T12:33:25Z

Seems like issue is still reproducible on 2.2.5. I'm using EventListener for monitoring events and when using it, memory usage in dotnet.exe climbs very fast

Host dotnet --info:

Host (useful for support):
Version: 2.2.5
Commit: 0a3c9209c0

.NET Core SDKs installed:
No SDKs were found.

.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.2.5 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.2.5 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.2.5 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

jenshenneberg · 2019-06-25T12:45:25Z

We are also still seeing this issue on 2.2.5. Exact same dotnet --info output as above.

jorive · 2019-06-25T15:02:32Z

@SPANIOL Does it cap around ~1GB (This is the default internal circular buffer allocated/used)?

avandych · 2019-06-25T18:21:52Z

@jorive yeah, i believe it may be the case, left service for 3 days without any traffic and it used around 900 mb of memory, while usually it uses around 150 mb.

jorive · 2019-06-25T18:45:09Z

That memory is reused to buffer events before writing to file, and it should not grow beyond the specified value. It's not a memory leak.

mjsabby · 2019-06-25T18:46:36Z

Is this memory not used when you disable EventPipe?

yahorsi · 2019-06-25T18:48:22Z

That memory is reused to buffer events before writing to file, and it should not grow beyond the specified value. It's not a memory leak.

How big is the buffer in 2.2?

PS: Has this changed in 3.0, taking in to account changes related to the Docker ?

avandych · 2019-06-25T18:50:53Z

@mjsabby memory isn't used when pipe is disabled, when enabled memory usage climbs like crazy under traffic, while heap usage remains at same level
@jorive is buffer size configurable? It looks like it's little bit to much, use 1 gb of memory if you want to gather some basic metrics for dashboard and monitoring

jorive · 2019-06-25T19:28:05Z

@SPANIOL Yes, there are few ways to configure the size of the buffer, and depends on how are you enabling EventPipe tracing in your project.

Before starting the process, set environment variable: COMPlus_EventPipeCircularMB (default is COMPlus_EventPipeCircularMB=1024)
If you are enabling EventPipe tracing via App.eventpipeconfig file drop, then you can add an entry to the file CircularMB (if not provided, it defaults to CircularMB=1024)

jorive · 2019-06-25T19:34:09Z

@yahorsi On 3.0, the default value changed to 256MB

avandych · 2019-06-25T19:34:29Z

@jorive sounds promising! i'll try configuration via env variables tomorrow and post results here in case if anyone will have such troubles, i can see that ticket for documentation related to EventPipe's is still open, so it's difficult to figured it out from the start, but the case to use them for basic monitoring will be common

sywhang · 2019-06-25T19:37:35Z

@SPANIOL For 3.0 we added runtime event counters, which is far more low-overhead for purposes like simple APMs or dashboards for "basic monitoring" scenarios. Unfortunately backporting that to 2.2 is far more costly than we can afford, but just wanted to let you know that better alternatives do exist in 3.0 :)

jorive · 2019-06-25T19:39:26Z

Is this memory not used when you disable EventPipe?

This memory is deallocated when the tracing session is disabled.

noahfalk · 2019-06-26T02:20:25Z

I created dotnet/coreclr#25413 to track this. FWIW, I think 256MB is probably too big of a default and we should look into changing it.

jorive · 2019-06-26T03:20:57Z

I believe 256MB is the default used by PerfView, but I agree we should consider lowering it.

tylerohlsen · 2019-08-21T18:31:08Z

Our microservices were growing at a rate of 40MB per hour (each) because of this issue. We have turned off the Prometheus.Net collectors that hook up to the event sources as a temporary workaround (djluck/prometheus-net.DotNetRuntime#6).

This is definitely not using a circular buffer as I have an instance that's been running for 9 days and is now at an approximate 7GB commit size and still growing!

I have confirmed .Net Core 3 - Preview 8 does not experience the same issue. I have also confirmed that .Net Core 2.2.0 - 2.2.6 do experience the issue. I have the same stack trace for the allocations as mentioned in the first reply above.

Please back-port the fix to .Net Core 2.2.

sywhang · 2019-08-21T20:45:58Z

cc @josalem

This is definitely not using a circular buffer as I have an instance that's been running for 9 days and is now at an approximate 7GB commit size and still growing!

That definitely doesn't sound normal to me either.

I have confirmed .Net Core 3 - Preview 8 does not experience the same issue.

This makes me less nervous. That being said, AFAIK @jorive already ported his fix (dotnet/coreclr#23661) to 2.2 and if it's not addressing the issue, we might be looking at a different issue from one that was already ported.

Do you happen to have a memory dump and/or a small repro that I can use to repro the behavior?

tylerohlsen · 2019-08-22T03:26:29Z

Sorry, I do not think I can share the full memory dump, but here is some data from the memory dump and I can get you more data upon request.

0:000> !heap -stat -h 000002b837d30000 
 heap @ 000002b837d30000
group-by: TOTSIZE max-display: 20
    size     #blocks     total     ( %) (percent of total busy bytes)
    19000 cbf9 - 13eb51000  (96.11)
    32000 3ae - b7fc000  (3.47)
    258000 2 - 4b0000  (0.09)
    38 d482 - 2e7c70  (0.05)
    30 eb62 - 2c2260  (0.05)

0:000> !heap -p -a 000001d85f2a5f80    
    address 000001d85f2a5f80 found in
    _HEAP @ 1d3c2ff0000
              HEAP_ENTRY Size Prev Flags            UserPtr UserSize - state
        000001d85f2a5f50 1903 0000  [00]   000001d85f2a5f80    19000 - (busy)
        7ff9014cff83 ntdll!RtlpCallInterceptRoutine+0x000000000000003f
        7ff901494423 ntdll!RtlpAllocateHeapInternal+0x0000000000078ba3
        7ff88c8755b9 coreclr!EEHeapAllocInProcessHeap+0x0000000000000029
        7ff88c875571 coreclr!operator new[]+0x0000000000000025
        7ff88caebd12 coreclr!EventPipeBuffer::EventPipeBuffer+0x0000000000000016
        7ff88ca7da1a coreclr!EventPipeBufferManager::AllocateBufferForThread+0x00000000000001e2
        7ff88ca7e2be coreclr!EventPipeBufferManager::WriteEvent+0x00000000000000e6
        7ff88ca0dd5e coreclr!EventPipe::WriteEventInternal+0x00000000000000be
        7ff88c8e3e9c coreclr!EventPipe::WriteEvent+0x000000000000003c
        7ff88c9643c5 coreclr!FireEtwThreadCreated+0x0000000000120415
        7ff88c843fa3 coreclr!ETW::ThreadLog::FireThreadCreated+0x0000000000000043
        7ff88c843ef7 coreclr!Thread::HasStarted+0x00000000000000ef
        7ff88c83a42e coreclr!ThreadNative::KickOffThread+0x00000000000000be
        7ff88c83a34a coreclr!Thread::intermediateThreadProc+0x000000000000008a
        7ff8ffa37bd4 kernel32!BaseThreadInitThunk+0x0000000000000014
        7ff90144ce71 ntdll!RtlUserThreadStart+0x0000000000000021

As you can see, at the time of that memory dump, there was 52,217 occurrences of these 102,400 byte objects that take up a total of 5,347,020,800 (5.3 GB) of memory. I spot checked some of the same size objects in a debug instance and they all had the same allocation stack trace listed here (sorry for the mix-up, my stack trace is different than the OP's stack trace).

The repro was pretty simple. I made an Asp.Net Core 2.2.4 targeted application that was self hosted as a console app and self contained with win7-x64 as the runtime target. Then I added a nuget reference to the latest of prometheus-net.DotNetRuntime and called DotNetRuntimeStatsBuilder.Default().StartCollecting(); in Startup.Configure.

I can create a project like that and attach a zip if you'd like.

tylerohlsen · 2019-08-22T17:40:39Z

Attached is a simple app the reproduces my issue. EventPipeMemoryLeak.zip

I just let the app sit idle and the memory usage is continuously increasing. And the heap stats show a regular increase in quantity of these 0x19000 size objects.

@sywhang and @josalem ^^

sywhang · 2019-08-23T01:26:22Z

@tylerohlsen I'm running the repro app and trying to see how it ends up in that state. So far I'm not seeing anything that's not expected - I'm waiting to see if the memory usage grows beyond 1GB which is what EventListener sessions have as the max limit.

That being said the 0x19000 size events does correspond to the EventPipe buffer blocks that we allocate internally to store the events. The repro app's been running for about ~1.5 hours now and it's at 380MB - I will come back in a few hours and see how things have changed and take it from there.

sywhang · 2019-08-23T22:12:24Z

@tylerohlsen I moved the issue to dotnet/coreclr#26344 because this issue is different from the original issue. I'll continue posting things in that thread.

sywhang · 2019-08-27T18:03:34Z

Closing this issue since the original issue was fixed in both master/2.2.

danmoseley transferred this issue from dotnet/corefx Mar 29, 2019

sywhang closed this as completed Aug 27, 2019

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the 2.2.x milestone Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 13, 2020

Native memory leak with ETW EventSource #12374

Native memory leak with ETW EventSource #12374

Comments

pharring commented Mar 29, 2019

pharring commented Mar 29, 2019

vancem commented Mar 29, 2019

jorive commented Mar 30, 2019

pharring commented Apr 1, 2019

jorive commented Apr 1, 2019

vancem commented Apr 1, 2019

jorive commented Apr 1, 2019

vancem commented Apr 1, 2019

brianrob commented Apr 1, 2019

mattzink commented Apr 1, 2019

yahorsi commented Apr 1, 2019

vancem commented Apr 1, 2019

noahfalk commented Apr 1, 2019

jorive commented Apr 1, 2019

yahorsi commented Apr 2, 2019 • edited Loading

avandych commented Jun 25, 2019 • edited Loading

jenshenneberg commented Jun 25, 2019

jorive commented Jun 25, 2019

avandych commented Jun 25, 2019 • edited Loading

jorive commented Jun 25, 2019

mjsabby commented Jun 25, 2019

yahorsi commented Jun 25, 2019

avandych commented Jun 25, 2019

jorive commented Jun 25, 2019

jorive commented Jun 25, 2019

avandych commented Jun 25, 2019

sywhang commented Jun 25, 2019

jorive commented Jun 25, 2019

noahfalk commented Jun 26, 2019

jorive commented Jun 26, 2019

tylerohlsen commented Aug 21, 2019 • edited Loading

sywhang commented Aug 21, 2019

tylerohlsen commented Aug 22, 2019 • edited Loading

tylerohlsen commented Aug 22, 2019 • edited Loading

sywhang commented Aug 23, 2019

sywhang commented Aug 23, 2019

sywhang commented Aug 27, 2019

yahorsi commented Apr 2, 2019 •

edited

Loading

avandych commented Jun 25, 2019 •

edited

Loading

avandych commented Jun 25, 2019 •

edited

Loading

tylerohlsen commented Aug 21, 2019 •

edited

Loading

tylerohlsen commented Aug 22, 2019 •

edited

Loading

tylerohlsen commented Aug 22, 2019 •

edited

Loading