
Use AutoResetEvent as opposed to ManualResetEventSlim #8575

Merged

Conversation

rokonec
Contributor

@rokonec rokonec commented Mar 17, 2023

Summary

Customers, mainly internal ones like XStore, with huge repos, using msbuild /graph /bl on powerful development and build machines, may experience a 15x or greater regression in evaluation time.

It has been identified as a performance bug in our logging event pub/sub mechanism. When the ingest queue reaches its bound, ManualResetEventSlim on .NET Framework 4.7.2 causes far too many Thread.Yield calls, flooding the system with thread context switches.
This hypothesis has been verified with the PerfMon performance counter System.ContextSwitches.

Although counterintuitive, AutoResetEvent, ManualResetEvent, or even spin locking produced better behavior, and with those the issue no longer reproduces.
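
For illustration, the shape of the fix is roughly as follows; the field name below is hypothetical and the actual diff in LoggingService should be consulted for specifics:

    // Before (net472): ManualResetEventSlim spins and yields in managed code before
    // falling back to a kernel wait, which floods a busy machine with context switches
    // when many producers are waiting.
    // private readonly ManualResetEventSlim _dequeueEvent = new ManualResetEventSlim(false);

    // After: AutoResetEvent is a thin wrapper over a kernel event; it does not spin
    // in managed code and releases exactly one waiting thread per Set().
    private readonly AutoResetEvent _dequeueEvent = new AutoResetEvent(false);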

Customer Impact

In the case of XStore, it was about 7 minutes of build time.

Regression?

Yes, introduced in VS 17.4.

Testing

Manual validation by @rokonec and automated tests. A local repro was used to verify that the changes fixed it.

Risk

Low

Note

It affects only VS MSBuild.exe. In dotnet build, ManualResetEventSlim works better.

@rokonec rokonec self-assigned this Mar 17, 2023
@danmoseley
Member

@stephentoub does this surprise you?

@stephentoub
Member

stephentoub commented Mar 17, 2023

ManualResetEventSlim spins by default prior to falling back to a kernel-based blocking wait. AutoResetEvent is just a thin wrapper for a kernel sync primitive, so there's no spinning before it in the managed world. If you want to disable that spinning with MRES, just pass 0 as the second argument to its ctor, e.g. make these https://github.com/dotnet/msbuild/pull/8575/files#diff-4e77c665c92becc2499e6b1ee5e7b673df485cba054b8c0ce31cd4974c9bbe70L1266-L1268 use new ManualResetEventSlim(false, 0) instead of new ManualResetEventSlim(false).
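
A minimal illustration of that suggestion (not the actual MSBuild code): the second constructor argument is the spin count used before the event falls back to a kernel wait, so 0 disables the managed spinning entirely.

    // Hypothetical example; 'gate' is not a real MSBuild field.
    // Default ctor spins (and yields) before blocking:
    // var gate = new ManualResetEventSlim(false);
    // Passing spinCount: 0 goes straight to the kernel wait:
    var gate = new ManualResetEventSlim(initialState: false, spinCount: 0);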

@danmoseley
Member

danmoseley commented Mar 17, 2023

So I guess the spinning here implies that the lock is typically held for a significant period of time, so spinning is just waste.

@rokonec Maybe try the other ctor on MRES?

I wonder if, separate to this change, there is scope to reduce the time spent holding the lock.

@dfederm
Contributor

dfederm commented Mar 17, 2023

@stephentoub @danmoseley For context this isn't a "lock" per se but rather a sort of throttle. This is an implementation of async logging where writers fill a buffer and 1 consumer drains the queue on a long-running background thread. Once the buffer fills, additional writers need to wait for there to be capacity.

My recommendation to @rokonec would be to use a bounded Channel, since it seems to fit this scenario much better (previously this code used an ActionBlock instead of the current impl), and I would guess Channels have this scenario well optimized and will likely continue to be optimized in the future.

Something along the lines of:

    // Initialization
    var channelOptions = new BoundedChannelOptions(maxMessagesToBuffer)
    {
        // If full, block
        FullMode = BoundedChannelFullMode.Wait,

        // Enable single-reader optimizations
        SingleReader = true,
    };

    _processLogMessageChannel = Channel.CreateBounded<LoggingEvent>(channelOptions);

    _processingTask = Task.Run(
        async () =>
        {
            while (await _processLogMessageChannel.Reader.WaitToReadAsync().ConfigureAwait(false))
            {
                while (_processLogMessageChannel.Reader.TryRead(out LoggingEvent? loggingEvent))
                {
                    // Do the thing
                }
            }
        });

    // Enqueuing
    public async Task EnqueueAsync(LoggingEvent loggingEvent)
    {
        ValueTask writeTask = _processLogMessageChannel.Writer.WriteAsync(loggingEvent);
        if (!writeTask.IsCompleted)
        {
            // If full, block until it's written.
            await writeTask;
        }
    }

    // Draining
    public async Task DrainAsync()
    {
        _processLogMessageChannel.Writer.Complete();
        await _processLogMessageChannel.Reader.Completion;
        await _processingTask;
    }

@rokonec
Contributor Author

rokonec commented Mar 17, 2023

@stephentoub with the .NET Core MRES I can't repro this issue. Unfortunately the .NET Framework MRES implementation is quite different; see https://referencesource.microsoft.com/#mscorlib/system/threading/ManualResetEventSlim.cs,542 for its odd combination of spinning, yielding, and sleeping.
When I passed 200+ as the spin count to the ctor it worked too, but that seemed too fragile to me. 0 in the ctor would probably work as well, if I read the sources right, but I actually like new AutoResetEvent better, as it unblocks just one thread, which is what is needed:

  • multiple threads are blocked on ingesting into the bounded queue
  • one event from the queue is consumed and processed
  • the dequeue auto event is set
  • one thread is unblocked
    As opposed to MRES, where all threads would unblock and then all but one would go back to sleep because the queue is full again (a rough sketch of this pattern follows below).
    I was originally thinking about BlockingCollection, which is, IIRC, designed with this use case in mind; however, since our code has to run on .NET 3.5, I had to implement it myself.
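
Here is a rough sketch of that bounded-queue pattern, assuming hypothetical names and types (the actual LoggingService code differs, and shutdown handling is omitted):

    // using System.Collections.Concurrent;
    // using System.Threading;

    private readonly ConcurrentQueue<object> _eventQueue = new ConcurrentQueue<object>();
    private readonly AutoResetEvent _dequeueEvent = new AutoResetEvent(false); // set by the consumer after each dequeue
    private readonly AutoResetEvent _enqueueEvent = new AutoResetEvent(false); // set by producers after each enqueue
    private const int QueueCapacity = 200000;

    public void Enqueue(object loggingEvent)
    {
        // Block while the queue is at capacity; each Set() from the consumer wakes
        // exactly one waiting producer. The bound is approximate under races.
        while (_eventQueue.Count >= QueueCapacity)
        {
            _dequeueEvent.WaitOne();
        }

        _eventQueue.Enqueue(loggingEvent);
        _enqueueEvent.Set();
    }

    private void ConsumeLoop() // runs on a dedicated long-lived thread
    {
        while (true)
        {
            if (_eventQueue.TryDequeue(out object loggingEvent))
            {
                ProcessLoggingEvent(loggingEvent); // hypothetical processing method
                _dequeueEvent.Set();               // release one blocked producer, if any
            }
            else
            {
                _enqueueEvent.WaitOne();           // wait for a producer to enqueue something
            }
        }
    }

Because AutoResetEvent latches a Set() that happens before WaitOne(), this hand-off does not lose wake-ups, and each Set() releases only a single waiter, matching the behavior described in the bullets above.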

@rokonec
Contributor Author

rokonec commented Mar 17, 2023

@dfederm as for Channel, it is an interesting idea and I will play with it when I have some time. However, since MSBuild is quite low in the stack and needs to support source build, we are trying to avoid external dependencies if possible. Additionally, any external dependency can cause assembly-loading conflicts in VS and often has to be handled by assembly binding redirects.

The purpose of this PR is to fix the bug, and the code provided does so. Since this will have to go into servicing for VS 17.4, I am trying to make minimal, safe changes.

I am open to advice on how best to implement the multiple-blocking-bounded-publishers, single-consumer pattern, as I like concurrency programming a lot, but I would rather address it in a different issue and PR.

@dfederm
Contributor

dfederm commented Mar 17, 2023

For a servicing fix, seems reasonable then if this small change mitigates the issue.

@rokonec
Contributor Author

rokonec commented Mar 17, 2023

I believe it does fix the issue:

new: [screenshot]

old: [screenshot]

Member

@ladipro ladipro left a comment


nit: It looks like you don't need to call _enqueueEvent.Reset() on line 1297 anymore.

Also, is it really expected to be logging at such a high rate that a 200k-item queue fills up?

@rokonec
Contributor Author

rokonec commented Mar 20, 2023

@ladipro asked:

Also, is it really expected to be logging at such a high rate that a 200k-item queue fills up?

Using /bl triggers "diag" logging verbosity, which causes a lot of messages to flow through the system.
For example, for OrchardCore with /graph /bl /m, stopped right after all projects are evaluated, it looks like this:
[screenshot]

For Orchard I started to see this only with MSBUILDLOGGINGQUEUECAPACITY=25000 and below. For bigger repositories built on more CPU cores, I believe it can easily reach the 200K queue size limit.

@danmoseley
Member

should diag verbosity cause a larger queue?

@rokonec rokonec changed the base branch from main to vs17.4 March 20, 2023 18:57
@rainersigwald
Member

Also, is it really expected to be logging at such a high rate that a 200k-item queue fills up?

This is a combination of a few things, which all make sense individually:

  1. Machines are getting more parallel threads.
  2. -graph evaluates projects for the whole build all at once up front, in order to discover graph structure.
  3. It does so in parallel.
  4. Evaluation produces log events.
  5. At diag/binlog verbosity, evaluation can log quite a lot of events.

When combined, that can mean that on a big machine + a big repo + -graph + -bl, lots of events are logged very quickly.

But I think that even with all of that considered together, the overall behavior still makes sense, given the backpressure that we can now provide from the logger back to the event production.

should diag verbosity cause a larger queue?

Personally, I would rather change the queue size for all configurations than tweak it by logger verbosity. The existing values aren't written in stone, but we did see OOM failures caused by the prior (broken) bounding approach, and I'd like any future fixes to that to be easy to diagnose and tweak.

@rokonec rokonec changed the base branch from vs17.4 to main March 20, 2023 18:58
@rokonec rokonec force-pushed the rokonec/8574-lock-contention-logging-service branch from 6533aa5 to f0f039c on March 20, 2023 19:03
@rokonec rokonec changed the base branch from main to vs17.4 March 20, 2023 19:08
@rokonec rokonec closed this Mar 20, 2023
@rokonec rokonec force-pushed the rokonec/8574-lock-contention-logging-service branch from f0f039c to 2eac915 on March 20, 2023 19:09
@rokonec rokonec reopened this Mar 20, 2023
@ladipro
Member

ladipro commented Mar 21, 2023

From @rokonec's screenshot it looks like the average cost of one event is ~60 bytes. Even if it were 100, I'd say the queue should be much larger than 200k. In massive builds, spending hundreds of megabytes is totally justifiable if the alternative is to block execution.

Maybe with the exception of running in a 32-bit process where we want to stay frugal.
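
For a rough sense of scale, assuming the ~100 bytes per queued event estimated above:

    200,000 events   x 100 bytes ≈  20 MB  (the current 200k bound)
    2,000,000 events x 100 bytes ≈ 200 MB  (a hypothetical 10x larger queue)

So even a queue an order of magnitude larger than today's would stay in the low hundreds of megabytes, which is the trade-off being argued for here (except in 32-bit processes).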

@ghost

ghost commented Mar 21, 2023

Hello! I noticed that you're targeting one of our servicing branches. Please consider updating the version.


@Forgind Forgind merged commit 7e646be into dotnet:vs17.4 Mar 21, 2023

rokonec added a commit that referenced this pull request Mar 27, 2023
* Concurrency bug fix - BuildManager instances acquire their own BuildTelemetry instance (#8561)

* BuildManager instances acquire their own BuildTelemetry instance (#8444)

Fixes https://devdiv.visualstudio.com/DevDiv/_workitems/edit/1708215

Context
In VS there are multiple instances of BuildManager called asynchronously: for DTB, for normal builds, and maybe others which I have not identified yet.

Changes Made
BuildManager instances acquire their own BuildTelemetry instance as opposed to sharing a single BuildTelemetry instance in a non-thread-safe manner.

Testing
Locally
# Conflicts:
#	src/Build/BackEnd/Client/MSBuildClient.cs - resolved with minimal and safe approach

* Bumping version

* Turn off static graph restore. (#8498)

Our CI builds fail because of bug NuGet/Home#12373.
It is fixed in NuGet/NuGet.Client#5010. 
We are waiting for it to flow to CI machines. Meanwhile this PR applies a workaround.

Note: This PR needs to be reverted once it happens.

---------

Co-authored-by: AR-May <67507805+AR-May@users.noreply.github.com>

* Use AutoResetEvent as opposed to ManualResetEventSlim (#8575)

Summary
Customers, mainly internal ones like XStore, with huge repos, using msbuild /graph /bl on powerful development and build machines, may experience a 15x or greater regression in evaluation time.

It has been identified as a performance bug in our logging event pub/sub mechanism. When the ingest queue reaches its bound, ManualResetEventSlim on .NET Framework 4.7.2 causes far too many Thread.Yield calls, flooding the system with thread context switches.
This hypothesis has been verified with the PerfMon performance counter System.ContextSwitches.

Although counterintuitive, AutoResetEvent, ManualResetEvent, or even spin locking produced better behavior, and with those the issue no longer reproduces.

Customer Impact
In the case of XStore, it was about 7 minutes of build time.

Regression?
Yes, introduced in VS 17.4.

Testing
Manual validation by @rokonec and automated tests. A local repro was used to verify that the changes fixed it.

Risk
Low

Note
It affects only VS MSBuild.exe. In dotnet build, ManualResetEventSlim works better.

---------

Co-authored-by: Roman Konecny <rokonecn@microsoft.com>
Co-authored-by: AR-May <67507805+AR-May@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[Performance]: Regression on large projects build with /graph /bl /m