
Use AutoResetEvent as opposed to ManualResetEventSlim #8575

Merged

Conversation

rokonec
Contributor

@rokonec rokonec commented Mar 17, 2023

Summary

Customers, mainly internal ones like XStore, with huge repos, using msbuild /graph /bl on powerful development and build machines, may experience a 15x or greater regression in evaluation time.

It has been identified as a performance bug in our logging event pub/sub mechanism. When the ingest queue reaches its bound, ManualResetEventSlim on .NET Framework 4.7.2 causes far too many Thread.Yield calls, flooding the system with thread context switches.
This hypothesis has been verified with the PerfMon performance counter System.ContextSwitches.

Although counterintuitive, AutoResetEvent, ManualResetEvent, or even spin locking produced better behavior, and with those the issue no longer reproduces.
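
For illustration, the shape of the fix is roughly as follows; the field name below is hypothetical and the actual diff in LoggingService should be consulted for specifics:

    // Before (net472): ManualResetEventSlim spins and yields in managed code before
    // falling back to a kernel wait, which floods a busy machine with context switches
    // when many producers are waiting.
    // private readonly ManualResetEventSlim _dequeueEvent = new ManualResetEventSlim(false);

    // After: AutoResetEvent is a thin wrapper over a kernel event; it does not spin
    // in managed code and releases exactly one waiting thread per Set().
    private readonly AutoResetEvent _dequeueEvent = new AutoResetEvent(false);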

Customer Impact

In the case of XStore, it was about 7 minutes of build time.

Regression?

Yes, introduced in VS 17.4.

Testing

Manual validation by @rokonec and automated tests. A local repro was used to verify that the changes fixed it.

Risk

Low

Note

It affects only VS MSBuild.exe. In dotnet build, ManualResetEventSlim works better.

@rokonec rokonec self-assigned this Mar 17, 2023
@danmoseley
Member

@stephentoub does this surprise you?

@stephentoub
Member

stephentoub commented Mar 17, 2023

ManualResetEventSlim spins by default prior to falling back to a kernel-based blocking wait. AutoResetEvent is just a thin wrapper for a kernel sync primitive, so there's no spinning before it in the managed world. If you want to disable that spinning with MRES, just pass 0 as the second argument to its ctor, e.g. make these https://github.com/dotnet/msbuild/pull/8575/files#diff-4e77c665c92becc2499e6b1ee5e7b673df485cba054b8c0ce31cd4974c9bbe70L1266-L1268 use new ManualResetEventSlim(false, 0) instead of new ManualResetEventSlim(false).
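
A minimal illustration of that suggestion (not the actual MSBuild code): the second constructor argument is the spin count used before the event falls back to a kernel wait, so 0 disables the managed spinning entirely.

    // Hypothetical example; 'gate' is not a real MSBuild field.
    // Default ctor spins (and yields) before blocking:
    // var gate = new ManualResetEventSlim(false);
    // Passing spinCount: 0 goes straight to the kernel wait:
    var gate = new ManualResetEventSlim(initialState: false, spinCount: 0);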

@danmoseley
Member

danmoseley commented Mar 17, 2023

So I guess the spinning here implies that the lock is typically held for a significant period of time, so spinning is just waste.

@rokonec Maybe try the other ctor on MRES?

I wonder if, separate to this change, there is scope to reduce the time spent holding the lock.

@dfederm
Contributor

dfederm commented Mar 17, 2023

@stephentoub @danmoseley For context this isn't a "lock" per se but rather a sort of throttle. This is an implementation of async logging where writers fill a buffer and 1 consumer drains the queue on a long-running background thread. Once the buffer fills, additional writers need to wait for there to be capacity.

My recommendation to @rokonec would be to use a bounded Channel, since it seems to fit this scenario much better (previously this code used an ActionBlock instead of the current impl), and I would guess Channels have this scenario well optimized and will likely continue to be optimized in the future.

Something along the lines of:

    // Initialization
    var channelOptions = new BoundedChannelOptions(maxMessagesToBuffer)
    {
        // If full, block
        FullMode = BoundedChannelFullMode.Wait,

        // Enable single-reader optimizations
        SingleReader = true,
    };

    _processLogMessageChannel = Channel.CreateBounded<LoggingEvent>(channelOptions);

    _processingTask = Task.Run(
        async () =>
        {
            while (await _processLogMessageChannel.Reader.WaitToReadAsync().ConfigureAwait(false))
            {
                while (_processLogMessageChannel.Reader.TryRead(out LoggingEvent? loggingEvent))
                {
                    // Do the thing
                }
            }
        });

    // Enqueuing
    public async Task EnqueueAsync(LoggingEvent loggingEvent)
    {
        ValueTask writeTask = _processLogMessageChannel.Writer.WriteAsync(loggingEvent);
        if (!writeTask.IsCompleted)
        {
            // If full, block until it's written.
            await writeTask;
        }
    }

    // Draining
    public async Task DrainAsync()
    {
        _processLogMessageChannel.Writer.Complete();
        await _processLogMessageChannel.Reader.Completion;
        await _processingTask;
    }

@rokonec
Contributor Author

rokonec commented Mar 17, 2023

@stephentoub with the .NET Core MRES I can't repro this issue. Unfortunately the .NET Framework MRES implementation is quite different; see https://referencesource.microsoft.com/#mscorlib/system/threading/ManualResetEventSlim.cs,542 for its odd combination of spinning, yielding, and sleeping.
When I passed 200+ as the spin count to the ctor it worked too, but that seemed too fragile to me. 0 in the ctor would probably work as well, if I read the sources right, but I actually like new AutoResetEvent better, as it unblocks just one thread, which is what is needed:

  • multiple threads are blocked on ingesting into the bounded queue
  • one event from the queue is consumed and processed
  • the dequeue auto event is set
  • one thread is unblocked
    As opposed to MRES, where all threads would unblock and then all but one would go back to sleep because the queue is full again (a rough sketch of this pattern follows below).
    I was originally thinking about BlockingCollection, which is, IIRC, designed with this use case in mind; however, since our code has to run on .NET 3.5, I had to implement it myself.
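
Here is a rough sketch of that bounded-queue pattern, assuming hypothetical names and types (the actual LoggingService code differs, and shutdown handling is omitted):

    // using System.Collections.Concurrent;
    // using System.Threading;

    private readonly ConcurrentQueue<object> _eventQueue = new ConcurrentQueue<object>();
    private readonly AutoResetEvent _dequeueEvent = new AutoResetEvent(false); // set by the consumer after each dequeue
    private readonly AutoResetEvent _enqueueEvent = new AutoResetEvent(false); // set by producers after each enqueue
    private const int QueueCapacity = 200000;

    public void Enqueue(object loggingEvent)
    {
        // Block while the queue is at capacity; each Set() from the consumer wakes
        // exactly one waiting producer. The bound is approximate under races.
        while (_eventQueue.Count >= QueueCapacity)
        {
            _dequeueEvent.WaitOne();
        }

        _eventQueue.Enqueue(loggingEvent);
        _enqueueEvent.Set();
    }

    private void ConsumeLoop() // runs on a dedicated long-lived thread
    {
        while (true)
        {
            if (_eventQueue.TryDequeue(out object loggingEvent))
            {
                ProcessLoggingEvent(loggingEvent); // hypothetical processing method
                _dequeueEvent.Set();               // release one blocked producer, if any
            }
            else
            {
                _enqueueEvent.WaitOne();           // wait for a producer to enqueue something
            }
        }
    }

Because AutoResetEvent latches a Set() that happens before WaitOne(), this hand-off does not lose wake-ups, and each Set() releases only a single waiter, matching the behavior described in the bullets above.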

@rokonec
Contributor Author

rokonec commented Mar 17, 2023

@dfederm as for Channel, it is an interesting idea and I will play with it when I have some time. However, since MSBuild is quite low in the stack and needs to support source build, we are trying to avoid external dependencies if possible. Additionally, any external dependency can cause assembly-loading conflicts in VS and often has to be handled by assembly binding redirects.

The purpose of this PR is to fix the bug, and the code provided does so. Since this will have to go into servicing for VS 17.4, I am trying to make minimal, safe changes.

I am open to advice on how best to implement the multiple-blocking-bounded-publishers, single-consumer pattern, as I like concurrency programming a lot, but I would rather address it in a different issue and PR.

@dfederm
Contributor

dfederm commented Mar 17, 2023

For a servicing fix, seems reasonable then if this small change mitigates the issue.

@rokonec
Contributor Author

rokonec commented Mar 17, 2023

I believe it does fix the issue:

new: [screenshot]

old: [screenshot]

Member

@ladipro ladipro left a comment


nit: It looks like you don't need to call _enqueueEvent.Reset() on line 1297 anymore.

Also, is it really expected to be logging at such a high rate that a 200k-item queue fills up?

@rokonec
Contributor Author

rokonec commented Mar 20, 2023

@ladipro asked:

Also, is it really expected to be logging at such a high rate that a 200k-item queue fills up?

Using /bl triggers "diag" logging verbosity, which causes a lot of messages to flow through the system.
For example, for OrchardCore with /graph /bl /m, stopped right after all projects are evaluated, it looks like this:
[screenshot]

For Orchard I started to see this only with MSBUILDLOGGINGQUEUECAPACITY=25000 and below. For bigger repositories built on more CPU cores, I believe it can easily reach the 200K queue size limit.

@danmoseley
Member

should diag verbosity cause a larger queue?

@rokonec rokonec changed the base branch from main to vs17.4 March 20, 2023 18:57
@rainersigwald
Member

Also, is it really expected to be logging at such a high rate that a 200k-item queue fills up?

This is a combination of a few things, which all make sense individually:

  1. Machines are getting more parallel threads.
  2. -graph evaluates projects for the whole build all at once up front, in order to discover graph structure.
  3. It does so in parallel.
  4. Evaluation produces log events.
  5. At diag/binlog verbosity, evaluation can log quite a lot of events.

When combined, that can mean that on a big machine + a big repo + -graph + -bl, lots of events are logged very quickly.

But I think that even with all of that considered together, the overall behavior still makes sense, given the backpressure that we can now provide from the logger back to the event production.

should diag verbosity cause a larger queue?

Personally, I would rather change the queue size for all configurations than tweak it by logger verbosity. The existing values aren't written in stone, but we did see OOM failures caused by the prior (broken) bounding approach, and I'd like any future fixes to that to be easy to diagnose and tweak.

@rokonec rokonec changed the base branch from vs17.4 to main March 20, 2023 18:58
@rokonec rokonec force-pushed the rokonec/8574-lock-contention-logging-service branch from 6533aa5 to f0f039c on March 20, 2023 19:03
@rokonec rokonec changed the base branch from main to vs17.4 March 20, 2023 19:08
@rokonec rokonec closed this Mar 20, 2023
@rokonec rokonec force-pushed the rokonec/8574-lock-contention-logging-service branch from f0f039c to 2eac915 on March 20, 2023 19:09
@rokonec rokonec reopened this Mar 20, 2023
@ladipro
Member

ladipro commented Mar 21, 2023

From @rokonec's screenshot it looks like the average cost of one event is ~60 bytes. Even if it were 100, I'd say the queue should be much larger than 200k. In massive builds, spending hundreds of megabytes is totally justifiable if the alternative is to block execution.

Maybe with the exception of running in a 32-bit process where we want to stay frugal.
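
For a rough sense of scale, assuming the ~100 bytes per queued event estimated above:

    200,000 events   x 100 bytes ≈  20 MB  (the current 200k bound)
    2,000,000 events x 100 bytes ≈ 200 MB  (a hypothetical 10x larger queue)

So even a queue an order of magnitude larger than today's would stay in the low hundreds of megabytes, which is the trade-off being argued for here (except in 32-bit processes).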

@ghost

ghost commented Mar 21, 2023

Hello! I noticed that you're targeting one of our servicing branches. Please consider updating the version.


@Forgind Forgind merged commit 7e646be into dotnet:vs17.4 Mar 21, 2023

rokonec added a commit that referenced this pull request Mar 27, 2023
* Concurrency bug fix - BuildManager instances acquire their own BuildTelemetry instance (#8561)

* BuildManager instances acquire their own BuildTelemetry instance (#8444)

Fixes https://devdiv.visualstudio.com/DevDiv/_workitems/edit/1708215

Context
In VS there are multiple instances of BuildManager called asynchronously: for DTB, for normal builds, and maybe others which I have not identified yet.

Changes Made
BuildManager instances acquire their own BuildTelemetry instance as opposed to sharing a single BuildTelemetry instance in a non-thread-safe manner.

Testing
Locally
# Conflicts:
#	src/Build/BackEnd/Client/MSBuildClient.cs - resolved with minimal and safe approach

* Bumping version

* Turn off static graph restore. (#8498)

Our CI builds fail because of bug NuGet/Home#12373.
It is fixed in NuGet/NuGet.Client#5010. 
We are waiting for it to flow to CI machines. Meanwhile this PR applies a workaround.

Note: This PR needs to be reverted once it happens.

---------

Co-authored-by: AR-May <67507805+AR-May@users.noreply.github.com>

* Use AutoResetEvent as opposed to ManualResetEventSlim (#8575)

Summary
Customers, mainly internal ones like XStore, with huge repos, using msbuild /graph /bl on powerful development and build machines, may experience a 15x or greater regression in evaluation time.

It has been identified as a performance bug in our logging event pub/sub mechanism. When the ingest queue reaches its bound, ManualResetEventSlim on .NET Framework 4.7.2 causes far too many Thread.Yield calls, flooding the system with thread context switches.
This hypothesis has been verified with the PerfMon performance counter System.ContextSwitches.

Although counterintuitive, AutoResetEvent, ManualResetEvent, or even spin locking produced better behavior, and with those the issue no longer reproduces.

Customer Impact
In the case of XStore, it was about 7 minutes of build time.

Regression?
Yes, introduced in VS 17.4.

Testing
Manual validation by @rokonec and automated tests. A local repro was used to verify that the changes fixed it.

Risk
Low

Note
It affects only VS MSBuild.exe. In dotnet build, ManualResetEventSlim works better.

---------

Co-authored-by: Roman Konecny <rokonecn@microsoft.com>
Co-authored-by: AR-May <67507805+AR-May@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[Performance]: Regression on large projects build with /graph /bl /m