Add cross-process BuildCoordinator for node budget management#13384
Open
JakeRadMSFT wants to merge 10 commits intodotnet:mainfrom
Open
Add cross-process BuildCoordinator for node budget management#13384JakeRadMSFT wants to merge 10 commits intodotnet:mainfrom
JakeRadMSFT wants to merge 10 commits intodotnet:mainfrom
Conversation
Add BuildCoordinator -- a long-lived process that manages node budget across concurrent MSBuild builds. It limits total worker node count, queues excess builds, and dynamically rebalances via heartbeat protocol. Add BuildCoordinatorClient for BuildManager integration. Add --coordinator CLI mode to XMake.cs. Add ShutdownExcessNodes to INodeManager for dynamic budget reduction. Includes 29 unit tests. Peak nodes reduced from 110 to 10.
Key changes: - Remove onBudgetChanged callback — heartbeats are for liveness only, never shrink a running build's node pool mid-flight - Add --min-builds parameter to floor the budget divisor so even a lone build reserves headroom for future concurrent builds - Example: --budget 12 --min-builds 2 gives each build max 6 nodes, so a second build can start immediately without renegotiation Test results (10 concurrent Picasso builds, --budget 12 --min-builds 2): - Peak nodes: 10 (vs 110 on main) — 91% reduction - Wall-clock: ~3:52 (vs ~3:35 uncontrolled) — only 8% slower - All nodes cleaned up within 20s of last build finishing - Zero risk of killing nodes between project assignments
Removed unused ShutdownExcessNodes from INodeManager, NodeManager, NodeProviderOutOfProc, and TaskHostNodeManager. This method was only called from the heartbeat budget-change callback which was removed in the coordinator v2 redesign.
Confirm build-2 is promoted via heartbeat before registering build-3, preventing a race where build-3 registers before the promotion completes and gets immediately activated instead of queued.
- Constructor validation (ArgumentOutOfRangeException) - UnixOnlyFact for platform test instead of runtime check - CTS leak fix in ListenLoop (dispose previous) - CLI arg validation (clamp to >=1) - Console.Error replaced with Trace.WriteLine for library safety - Test pipe name isolation (internal ctor with pipe override) - _queuedBuilds.Count read under lock - Heartbeat-gated promotion docs clarified (ordering, not mid-build) - Injectable stale timer intervals for fast tests - Test name typo fixed - Start() checks for existing coordinator before pipe delete - HandleRegister race fixed (lock) - Heartbeat comment clarified in client - TestEnvironment added to test class - Partial class extraction (BuildManager.Coordinator.cs) - MSBUILDCOORDINATORENABLED env var gate
Builds are now activated immediately when capacity is available (active < max) instead of being queued and waiting for all active builds to acknowledge an epoch via heartbeat. This removes ~135 lines of complexity that served no purpose since BuildManager applies granted budget once at start and never changes MaxNodeCount mid-build. Removed: _rebalanceEpoch field, AcknowledgedEpoch property, RebalanceAll(), epoch gating in TryPromotePending, EpochGating test. Renamed TryPromotePending to TryPromoteOne with simple capacity check.
- Check budget headroom (sum of granted < total) before admitting new builds, preventing over-commitment when existing builds hold large grants. - Enforce FIFO queue ordering: if queue is non-empty, new arrivals always queue instead of cutting ahead of waiting builds. - When queue has items, CalculateBudget targets maxConcurrentBuilds for fair share so active builds shrink toward desired state and make room for promotions. - Update tests to account for budget-aware admission (builds requesting full budget consume all headroom, forcing subsequent builds to queue).
…allocation - Split monolithic BuildCoordinatorClient into pure domain (BuildCoordinator) and transport layer (NamedPipeCoordinatorHost + NamedPipeCoordinatorClient) - Fix FairShareBudgetPolicy overallocation: track allocated nodes so grants never exceed total budget (was 38 nodes on budget=24, now correctly capped) - Constructor-inject promotion callback (was mutable Action property) - Host owns coordinator lifecycle (creates internally) - Move GetWaitPipeName to NamedPipeCoordinatorHost (transport concern) - Rename BuildCoordinatorClient -> NamedPipeCoordinatorClient - Extract ICoordinatorClient interface for BuildManager abstraction - Expand PROTOCOL.md with architecture docs and design decisions - 25 unit/integration tests passing
Contributor
There was a problem hiding this comment.
Pull request overview
Adds an opt-in, cross-process MSBuild BuildCoordinator that uses a named-pipe host/client to enforce a system-wide worker-node budget across concurrent builds, and integrates it into BuildManager and the msbuild entrypoint.
Changes:
- Introduces
BuildCoordinatordomain logic +FairShareBudgetPolicyand a named-pipe host/client transport. - Integrates coordinator registration/heartbeats into
BuildManager(NET-only) and disables node reuse when coordinated. - Adds
--coordinatormode inXMakeplus new unit/integration tests and protocol documentation.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| src/MSBuild/XMake.cs | Adds --coordinator mode and CLI parsing for budget/max-builds/startup-delay. |
| src/Build/Microsoft.Build.csproj | Includes new coordinator source files for non-net472 TFMs. |
| src/Build/BackEnd/BuildManager/BuildManager.cs | Makes BuildManager partial; hooks coordinator register/unregister; adjusts node reuse behavior. |
| src/Build/BackEnd/BuildManager/BuildManager.Coordinator.cs | NET-only coordinator integration (env-var gated), registration + heartbeat lifecycle. |
| src/Build/BackEnd/BuildCoordinator/PROTOCOL.md | Documents the named-pipe text protocol and design rationale. |
| src/Build/BackEnd/BuildCoordinator/NamedPipeCoordinatorHost.cs | Implements coordinator server transport over named pipes + promotion notifications. |
| src/Build/BackEnd/BuildCoordinator/NamedPipeCoordinatorClient.cs | Implements coordinator client transport, registration + queue wait-pipe + heartbeat timer. |
| src/Build/BackEnd/BuildCoordinator/INodeBudgetPolicy.cs | Defines the pluggable budget policy interface. |
| src/Build/BackEnd/BuildCoordinator/ICoordinatorClient.cs | Defines the BuildManager-facing coordinator client abstraction. |
| src/Build/BackEnd/BuildCoordinator/FairShareBudgetPolicy.cs | Implements fair-share budget allocation policy. |
| src/Build/BackEnd/BuildCoordinator/BuildCoordinator.cs | Implements queuing/promotion, staleness reaping, and budget granting logic. |
| src/Build.UnitTests/BackEnd/FairShareBudgetPolicy_Tests.cs | Adds unit tests for fair-share policy behavior. |
| src/Build.UnitTests/BackEnd/BuildCoordinator_Tests.cs | Adds domain tests and named-pipe integration tests for coordinator behavior. |
You can also share your feedback on Copilot code review. Take the survey.
Comment on lines
+1035
to
+1037
| #if NET | ||
| UnregisterFromCoordinator(); | ||
| #endif |
Comment on lines
+363
to
+364
| var process = Process.GetProcessById(pid); | ||
| return !process.HasExited; |
Comment on lines
+190
to
+223
| private string? SendCommand(string command) | ||
| { | ||
| const int maxAttempts = 5; | ||
| const int connectTimeoutMs = 5000; | ||
| const int retryDelayMs = 500; | ||
|
|
||
| for (int attempt = 0; attempt < maxAttempts; attempt++) | ||
| { | ||
| try | ||
| { | ||
| using var client = new NamedPipeClientStream(".", _pipeName, PipeDirection.InOut, System.IO.Pipes.PipeOptions.CurrentUserOnly); | ||
| client.Connect(connectTimeoutMs); | ||
|
|
||
| using var writer = new StreamWriter(client, leaveOpen: true) { AutoFlush = true }; | ||
| using var reader = new StreamReader(client, leaveOpen: true); | ||
|
|
||
| writer.WriteLine(command); | ||
| return reader.ReadLine(); | ||
| } | ||
| catch (TimeoutException) | ||
| { | ||
| if (attempt < maxAttempts - 1) | ||
| { | ||
| Thread.Sleep(retryDelayMs); | ||
| } | ||
| } | ||
| catch (IOException) | ||
| { | ||
| if (attempt < maxAttempts - 1) | ||
| { | ||
| Thread.Sleep(retryDelayMs); | ||
| } | ||
| } | ||
| } |
Comment on lines
+239
to
+248
| public class NamedPipeCoordinatorHost_Tests : IDisposable | ||
| { | ||
| private readonly NamedPipeCoordinatorHost _host; | ||
| private readonly string _pipeName; | ||
|
|
||
| public NamedPipeCoordinatorHost_Tests() | ||
| { | ||
| _pipeName = $"/tmp/MSBuild-Test-{Guid.NewGuid():N}"; | ||
| _host = new NamedPipeCoordinatorHost(new FairShareBudgetPolicy(12, 3), pipeName: _pipeName); | ||
| _host.Start(); |
Comment on lines
+15
to
+21
| public enum RegisterOutcome | ||
| { | ||
| /// <summary>Build was activated immediately.</summary> | ||
| Granted, | ||
| /// <summary>Build was queued — wait for promotion.</summary> | ||
| Queued, | ||
| } |
Comment on lines
+67
to
+69
| Console.WriteLine($"WARNING: Another coordinator may be running on {pipeName}. Taking over..."); | ||
| } | ||
|
|
Comment on lines
+149
to
+152
| public void StartHeartbeat() | ||
| { | ||
| _heartbeatTimer = new Timer(HeartbeatCallback, null, TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(2)); | ||
| } |
|
|
||
| public NamedPipeCoordinatorHost_Tests() | ||
| { | ||
| _pipeName = $"/tmp/MSBuild-Test-{Guid.NewGuid():N}"; |
Comment on lines
+187
to
+196
| // Wait for startup delay to fire | ||
| Thread.Sleep(300); | ||
|
|
||
| // All 3 should be promoted (maxConcurrent=3) | ||
| promoted.Count.ShouldBe(3); | ||
|
|
||
| var status = coord.GetStatus(); | ||
| status.ActiveCount.ShouldBe(3); | ||
| status.QueuedCount.ShouldBe(0); | ||
| } |
Comment on lines
+362
to
+384
| for (int i = 0; i < args.Length; i++) | ||
| { | ||
| if (args[i].Equals("--coordinator", StringComparison.OrdinalIgnoreCase)) | ||
| { | ||
| found = true; | ||
| } | ||
| else if (args[i].Equals("--budget", StringComparison.OrdinalIgnoreCase) && i + 1 < args.Length) | ||
| { | ||
| int.TryParse(args[i + 1], out budget); | ||
| i++; | ||
| } | ||
| else if (args[i].Equals("--max-builds", StringComparison.OrdinalIgnoreCase) && i + 1 < args.Length) | ||
| { | ||
| int.TryParse(args[i + 1], out maxBuilds); | ||
| i++; | ||
| } | ||
| else if (args[i].Equals("--startup-delay", StringComparison.OrdinalIgnoreCase) && i + 1 < args.Length) | ||
| { | ||
| int.TryParse(args[i + 1], out startupDelay); | ||
| i++; | ||
| } | ||
| } | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Introduces a cross-process BuildCoordinator that manages MSBuild worker node budgets across concurrent builds. When multiple builds run simultaneously on the same machine, each currently spawns up to
maxcpucountnodes independently — leading to node explosion, memory pressure, and thrashing. The coordinator caps total node count system-wide through a fair-share budget policy.Design
Layered Architecture
BuildCoordinator— Pure domain object. Tracks build registrations, grants node budgets via pluggableINodeBudgetPolicy, manages heartbeats/timeouts, handles queue promotion when builds finish. Zero I/O.NamedPipeCoordinatorHost— Server transport. Creates and owns theBuildCoordinator, exposes it over a named pipe using a line-based text protocol (REGISTER, HEARTBEAT, UNREGISTER, STATUS, SHUTDOWN).NamedPipeCoordinatorClient— Client transport. ImplementsICoordinatorClientinterface. Connects to the host, registers the build, maintains heartbeat, and reports granted node count back toBuildManager.ICoordinatorClient— Abstraction soBuildManagerdepends on the interface, not the concrete transport.FairShareBudgetPolicy— DefaultINodeBudgetPolicyimplementation. Divides remaining budget (TotalBudget - allocatedNodes) fairly across open slots, ensuring total grants never exceed the budget.Integration
BuildManager.Coordinator.cs— Opt-in glue that creates the client, registers before build, and unregisters after.XMake.cs—/coordinatorCLI switch launches the host with configurable budget and max-concurrent-builds.Tests
RapidArrival_NeverExceedsBudget— verifies 4 concurrent builds on budget=24 each get exactly 6 nodes, totaling 24Benchmark Results
10 concurrent large builds on macOS (12 cores / 36 GB).
Raw Results
Delta vs Node-Reuse Baseline (275s / 42,632 MB / 111 nodes)
Tighter budgets are actually faster than node-reuse while using 61–78% less memory. Less memory pressure means less swapping/GC contention — fewer nodes finishing faster beats many nodes thrashing.