Skip to content

Add cross-process BuildCoordinator for node budget management#13384

Open
JakeRadMSFT wants to merge 10 commits intodotnet:mainfrom
JakeRadMSFT:feature/build-coordinator-v3
Open

Add cross-process BuildCoordinator for node budget management#13384
JakeRadMSFT wants to merge 10 commits intodotnet:mainfrom
JakeRadMSFT:feature/build-coordinator-v3

Conversation

@JakeRadMSFT
Copy link
Member

@JakeRadMSFT JakeRadMSFT commented Mar 13, 2026

Summary

Introduces a cross-process BuildCoordinator that manages MSBuild worker node budgets across concurrent builds. When multiple builds run simultaneously on the same machine, each currently spawns up to maxcpucount nodes independently — leading to node explosion, memory pressure, and thrashing. The coordinator caps total node count system-wide through a fair-share budget policy.

Design

Layered Architecture

  • BuildCoordinator — Pure domain object. Tracks build registrations, grants node budgets via pluggable INodeBudgetPolicy, manages heartbeats/timeouts, handles queue promotion when builds finish. Zero I/O.
  • NamedPipeCoordinatorHost — Server transport. Creates and owns the BuildCoordinator, exposes it over a named pipe using a line-based text protocol (REGISTER, HEARTBEAT, UNREGISTER, STATUS, SHUTDOWN).
  • NamedPipeCoordinatorClient — Client transport. Implements ICoordinatorClient interface. Connects to the host, registers the build, maintains heartbeat, and reports granted node count back to BuildManager.
  • ICoordinatorClient — Abstraction so BuildManager depends on the interface, not the concrete transport.
  • FairShareBudgetPolicy — Default INodeBudgetPolicy implementation. Divides remaining budget (TotalBudget - allocatedNodes) fairly across open slots, ensuring total grants never exceed the budget.

Integration

  • BuildManager.Coordinator.cs — Opt-in glue that creates the client, registers before build, and unregisters after.
  • XMake.cs/coordinator CLI switch launches the host with configurable budget and max-concurrent-builds.

Tests

  • 25 unit/integration tests (13 domain + 12 integration)
  • RapidArrival_NeverExceedsBudget — verifies 4 concurrent builds on budget=24 each get exactly 6 nodes, totaling 24

Benchmark Results

10 concurrent large builds on macOS (12 cores / 36 GB).

Raw Results

Configuration Wall (s) Peak Nodes Peak RSS (MB)
No coordinator (no-reuse) 641 110 42,082
No coordinator (reuse) 275 111 42,632
Coordinator b24-m4 282 21 16,743
Coordinator b16-m3 248 13 11,581
Coordinator b12-m3 245 9 10,064
Coordinator b8-m3 229 5 9,422

Delta vs Node-Reuse Baseline (275s / 42,632 MB / 111 nodes)

Configuration Wall (s) Δ Wall Δ Wall % Peak RSS (MB) Δ RSS Δ RSS % Peak Nodes
coord-b24-m4 282 +7s +2.5% 16,743 -25,889 -60.7% 21
coord-b16-m3 248 -27s -9.8% 11,581 -31,051 -72.8% 13
coord-b12-m3 245 -30s -10.9% 10,064 -32,568 -76.4% 9
coord-b8-m3 229 -46s -16.7% 9,422 -33,210 -77.9% 5

Tighter budgets are actually faster than node-reuse while using 61–78% less memory. Less memory pressure means less swapping/GC contention — fewer nodes finishing faster beats many nodes thrashing.

Add BuildCoordinator -- a long-lived process that manages node budget
across concurrent MSBuild builds. It limits total worker node count,
queues excess builds, and dynamically rebalances via heartbeat protocol.

Add BuildCoordinatorClient for BuildManager integration.
Add --coordinator CLI mode to XMake.cs.
Add ShutdownExcessNodes to INodeManager for dynamic budget reduction.
Includes 29 unit tests. Peak nodes reduced from 110 to 10.
Key changes:
- Remove onBudgetChanged callback — heartbeats are for liveness only,
  never shrink a running build's node pool mid-flight
- Add --min-builds parameter to floor the budget divisor so even a
  lone build reserves headroom for future concurrent builds
- Example: --budget 12 --min-builds 2 gives each build max 6 nodes,
  so a second build can start immediately without renegotiation

Test results (10 concurrent Picasso builds, --budget 12 --min-builds 2):
- Peak nodes: 10 (vs 110 on main) — 91% reduction
- Wall-clock: ~3:52 (vs ~3:35 uncontrolled) — only 8% slower
- All nodes cleaned up within 20s of last build finishing
- Zero risk of killing nodes between project assignments
Removed unused ShutdownExcessNodes from INodeManager, NodeManager,
NodeProviderOutOfProc, and TaskHostNodeManager. This method was only
called from the heartbeat budget-change callback which was removed
in the coordinator v2 redesign.
Confirm build-2 is promoted via heartbeat before registering build-3,
preventing a race where build-3 registers before the promotion
completes and gets immediately activated instead of queued.
- Constructor validation (ArgumentOutOfRangeException)
- UnixOnlyFact for platform test instead of runtime check
- CTS leak fix in ListenLoop (dispose previous)
- CLI arg validation (clamp to >=1)
- Console.Error replaced with Trace.WriteLine for library safety
- Test pipe name isolation (internal ctor with pipe override)
- _queuedBuilds.Count read under lock
- Heartbeat-gated promotion docs clarified (ordering, not mid-build)
- Injectable stale timer intervals for fast tests
- Test name typo fixed
- Start() checks for existing coordinator before pipe delete
- HandleRegister race fixed (lock)
- Heartbeat comment clarified in client
- TestEnvironment added to test class
- Partial class extraction (BuildManager.Coordinator.cs)
- MSBUILDCOORDINATORENABLED env var gate
Builds are now activated immediately when capacity is available (active < max)
instead of being queued and waiting for all active builds to acknowledge an
epoch via heartbeat. This removes ~135 lines of complexity that served no
purpose since BuildManager applies granted budget once at start and never
changes MaxNodeCount mid-build.

Removed: _rebalanceEpoch field, AcknowledgedEpoch property, RebalanceAll(),
epoch gating in TryPromotePending, EpochGating test. Renamed TryPromotePending
to TryPromoteOne with simple capacity check.
- Check budget headroom (sum of granted < total) before admitting new builds,
  preventing over-commitment when existing builds hold large grants.
- Enforce FIFO queue ordering: if queue is non-empty, new arrivals always queue
  instead of cutting ahead of waiting builds.
- When queue has items, CalculateBudget targets maxConcurrentBuilds for fair share
  so active builds shrink toward desired state and make room for promotions.
- Update tests to account for budget-aware admission (builds requesting full budget
  consume all headroom, forcing subsequent builds to queue).
…allocation

- Split monolithic BuildCoordinatorClient into pure domain (BuildCoordinator)
  and transport layer (NamedPipeCoordinatorHost + NamedPipeCoordinatorClient)
- Fix FairShareBudgetPolicy overallocation: track allocated nodes so grants
  never exceed total budget (was 38 nodes on budget=24, now correctly capped)
- Constructor-inject promotion callback (was mutable Action property)
- Host owns coordinator lifecycle (creates internally)
- Move GetWaitPipeName to NamedPipeCoordinatorHost (transport concern)
- Rename BuildCoordinatorClient -> NamedPipeCoordinatorClient
- Extract ICoordinatorClient interface for BuildManager abstraction
- Expand PROTOCOL.md with architecture docs and design decisions
- 25 unit/integration tests passing
@JakeRadMSFT JakeRadMSFT changed the title BuildCoordinator: pure domain + transport layer, fix budget allocation Add cross-process BuildCoordinator for node budget management Mar 13, 2026
@JakeRadMSFT JakeRadMSFT marked this pull request as ready for review March 13, 2026 16:41
Copilot AI review requested due to automatic review settings March 13, 2026 16:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in, cross-process MSBuild BuildCoordinator that uses a named-pipe host/client to enforce a system-wide worker-node budget across concurrent builds, and integrates it into BuildManager and the msbuild entrypoint.

Changes:

  • Introduces BuildCoordinator domain logic + FairShareBudgetPolicy and a named-pipe host/client transport.
  • Integrates coordinator registration/heartbeats into BuildManager (NET-only) and disables node reuse when coordinated.
  • Adds --coordinator mode in XMake plus new unit/integration tests and protocol documentation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/MSBuild/XMake.cs Adds --coordinator mode and CLI parsing for budget/max-builds/startup-delay.
src/Build/Microsoft.Build.csproj Includes new coordinator source files for non-net472 TFMs.
src/Build/BackEnd/BuildManager/BuildManager.cs Makes BuildManager partial; hooks coordinator register/unregister; adjusts node reuse behavior.
src/Build/BackEnd/BuildManager/BuildManager.Coordinator.cs NET-only coordinator integration (env-var gated), registration + heartbeat lifecycle.
src/Build/BackEnd/BuildCoordinator/PROTOCOL.md Documents the named-pipe text protocol and design rationale.
src/Build/BackEnd/BuildCoordinator/NamedPipeCoordinatorHost.cs Implements coordinator server transport over named pipes + promotion notifications.
src/Build/BackEnd/BuildCoordinator/NamedPipeCoordinatorClient.cs Implements coordinator client transport, registration + queue wait-pipe + heartbeat timer.
src/Build/BackEnd/BuildCoordinator/INodeBudgetPolicy.cs Defines the pluggable budget policy interface.
src/Build/BackEnd/BuildCoordinator/ICoordinatorClient.cs Defines the BuildManager-facing coordinator client abstraction.
src/Build/BackEnd/BuildCoordinator/FairShareBudgetPolicy.cs Implements fair-share budget allocation policy.
src/Build/BackEnd/BuildCoordinator/BuildCoordinator.cs Implements queuing/promotion, staleness reaping, and budget granting logic.
src/Build.UnitTests/BackEnd/FairShareBudgetPolicy_Tests.cs Adds unit tests for fair-share policy behavior.
src/Build.UnitTests/BackEnd/BuildCoordinator_Tests.cs Adds domain tests and named-pipe integration tests for coordinator behavior.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +1035 to +1037
#if NET
UnregisterFromCoordinator();
#endif
Comment on lines +363 to +364
var process = Process.GetProcessById(pid);
return !process.HasExited;
Comment on lines +190 to +223
private string? SendCommand(string command)
{
const int maxAttempts = 5;
const int connectTimeoutMs = 5000;
const int retryDelayMs = 500;

for (int attempt = 0; attempt < maxAttempts; attempt++)
{
try
{
using var client = new NamedPipeClientStream(".", _pipeName, PipeDirection.InOut, System.IO.Pipes.PipeOptions.CurrentUserOnly);
client.Connect(connectTimeoutMs);

using var writer = new StreamWriter(client, leaveOpen: true) { AutoFlush = true };
using var reader = new StreamReader(client, leaveOpen: true);

writer.WriteLine(command);
return reader.ReadLine();
}
catch (TimeoutException)
{
if (attempt < maxAttempts - 1)
{
Thread.Sleep(retryDelayMs);
}
}
catch (IOException)
{
if (attempt < maxAttempts - 1)
{
Thread.Sleep(retryDelayMs);
}
}
}
Comment on lines +239 to +248
public class NamedPipeCoordinatorHost_Tests : IDisposable
{
private readonly NamedPipeCoordinatorHost _host;
private readonly string _pipeName;

public NamedPipeCoordinatorHost_Tests()
{
_pipeName = $"/tmp/MSBuild-Test-{Guid.NewGuid():N}";
_host = new NamedPipeCoordinatorHost(new FairShareBudgetPolicy(12, 3), pipeName: _pipeName);
_host.Start();
Comment on lines +15 to +21
public enum RegisterOutcome
{
/// <summary>Build was activated immediately.</summary>
Granted,
/// <summary>Build was queued — wait for promotion.</summary>
Queued,
}
Comment on lines +67 to +69
Console.WriteLine($"WARNING: Another coordinator may be running on {pipeName}. Taking over...");
}

Comment on lines +149 to +152
public void StartHeartbeat()
{
_heartbeatTimer = new Timer(HeartbeatCallback, null, TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(2));
}

public NamedPipeCoordinatorHost_Tests()
{
_pipeName = $"/tmp/MSBuild-Test-{Guid.NewGuid():N}";
Comment on lines +187 to +196
// Wait for startup delay to fire
Thread.Sleep(300);

// All 3 should be promoted (maxConcurrent=3)
promoted.Count.ShouldBe(3);

var status = coord.GetStatus();
status.ActiveCount.ShouldBe(3);
status.QueuedCount.ShouldBe(0);
}
Comment on lines +362 to +384
for (int i = 0; i < args.Length; i++)
{
if (args[i].Equals("--coordinator", StringComparison.OrdinalIgnoreCase))
{
found = true;
}
else if (args[i].Equals("--budget", StringComparison.OrdinalIgnoreCase) && i + 1 < args.Length)
{
int.TryParse(args[i + 1], out budget);
i++;
}
else if (args[i].Equals("--max-builds", StringComparison.OrdinalIgnoreCase) && i + 1 < args.Length)
{
int.TryParse(args[i + 1], out maxBuilds);
i++;
}
else if (args[i].Equals("--startup-delay", StringComparison.OrdinalIgnoreCase) && i + 1 < args.Length)
{
int.TryParse(args[i + 1], out startupDelay);
i++;
}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants