Fix internal error 'Invalid node id specified' from out-of-proc node teardown race (#12438) #13966

Draft
JanProvaznik wants to merge 1 commit into dotnet:main from JanProvaznik:dev/janprovaznik/fix-12438-node-teardown

@JanProvaznik
Member

Fixes #12438.

Symptom

Intermittent internal error during parallel builds (seen in CI and by external users):

MSBUILD : error MSB1025: An internal failure occurred while running MSBuild.
Microsoft.Build.Framework.InternalErrorException: MSB0001: Internal MSBuild Error: Invalid node id specified: 2.
   at Microsoft.Build.BackEnd.NodeProviderOutOfProc.SendData(Int32 nodeId, INodePacket packet)
   at Microsoft.Build.BackEnd.NodeManager.SendData(Int32 node, INodePacket packet)
   at Microsoft.Build.Execution.BuildManager.PerformSchedulingActions(IEnumerable`1 responses)
   at Microsoft.Build.Execution.BuildManager.ProcessPacket(Int32 node, INodePacket packet)

It sometimes manifests instead as 'Node N does not have a provider' or as the graceful MSB4166 'Child node "N" exited prematurely'.

Root cause — a node-teardown race

When an out-of-proc worker node dies/disconnects mid-build, its pipe read/IO thread tears the node down outside BuildManager._syncLock:

  • NodeManager.RoutePacket / DeserializeAndRoutePacket → RemoveNodeFromMapping removes it from _nodeIdToProvider, and
  • NodeContext.Close → NodeContextTerminated removes it from NodeProviderOutOfProc._nodeContexts.

Meanwhile the scheduler, on the single work-queue thread under _syncLock, calls NodeManager.SendData(node) → provider.SendData → _nodeContexts.ContainsKey(...). The authoritative "node is gone" signal (NodeShutdown → HandleNodeShutdown, which sets _shuttingDown and aborts) is only processed later on the serialized work queue. So there is a window in which the scheduler routes work to a node whose context/mapping the read thread just removed → the internal error. Whether you get the internal error or the graceful MSB4166 is pure timing.

This is a long-standing latent race (present since the multiproc engine's original design); recent IPC refactors and oversubscribed CI agents raised its probability enough to surface it. It reproduces only with out-of-proc nodes — which is why -m:1 (and, less reliably, -nr:false) work around it.
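The losing interleaving described above can be replayed deterministically in a toy model. This is only a sketch: ToyProvider and RaceDemo are hypothetical names invented for illustration, the packet is a plain string, and the three steps run on one thread in a fixed order instead of on the real read/IO and work-queue threads.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the provider's node-context map (not MSBuild's real type).
public sealed class ToyProvider
{
    private readonly Dictionary<int, string> _nodeContexts = new Dictionary<int, string>();

    public void AddNode(int nodeId)
    {
        _nodeContexts[nodeId] = "context-" + nodeId;
    }

    // Models the pre-fix behavior: the read/IO thread removes the context
    // as soon as the pipe closes, without holding the scheduler's lock.
    public void NodeContextTerminated(int nodeId)
    {
        _nodeContexts.Remove(nodeId);
    }

    // Models the scheduler's send: it assumes the node it picked still exists.
    public void SendData(int nodeId, string packet)
    {
        if (!_nodeContexts.ContainsKey(nodeId))
        {
            throw new InvalidOperationException("Invalid node id specified: " + nodeId + ".");
        }
        // A real provider would write the packet to the node's pipe here.
    }
}

public static class RaceDemo
{
    // Replays the losing order: the node dies between the scheduler choosing it
    // and the scheduler sending to it. Returns "sent" or the failure message.
    public static string ReplayLosingInterleaving()
    {
        var provider = new ToyProvider();
        provider.AddNode(2);               // scheduler believes node 2 is alive

        provider.NodeContextTerminated(2); // read/IO thread tears the node down

        try
        {
            provider.SendData(2, "BuildRequest"); // scheduler routes work to node 2
            return "sent";
        }
        catch (InvalidOperationException ex)
        {
            return ex.Message;
        }
    }
}
```

In the real engine the two threads are not ordered, so the same build can succeed, fail gracefully, or hit the internal error depending on which step wins.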

Fix — serialize node teardown through the work queue

Move all out-of-proc node-map removal onto the work-queue thread under _syncLock, driven by the NodeShutdown packet that HandleNodeShutdown already processes. The read/IO thread now only routes the NodeShutdown and disposes its own pipe — it no longer mutates the maps.

  • NodeManager.RoutePacket / DeserializeAndRoutePacket no longer call RemoveNodeFromMapping.
  • New NodeManager.RemoveNode(nodeId) (drops the provider context via INodeProvider.RemoveNodeContext, then the mapping) is called from BuildManager.HandleNodeShutdown under _syncLock.
  • NodeProviderOutOfProc.NodeContextTerminated is now a no-op; removal moves to RemoveNodeContext.
  • NodeManager._nodeIdToProvider becomes a ConcurrentDictionary — it is read off the sync lock by the SDK-resolver response path (MainNodeSdkResolverService.PacketReceived runs synchronously on a node's read/IO thread), so the non-concurrent Dictionary was a latent data race independent of this fix.
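As a rough illustration of the serialized teardown listed above, the following sketch collapses BuildManager, NodeManager and the provider into one toy class. ToyBuildManager, RouteNodeShutdown, ProcessWorkQueue and CanSendTo are hypothetical names chosen for the example; only the shape (the read thread enqueues, the work-queue thread removes under the lock, removal is idempotent) is taken from the description.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical, simplified model of the post-fix design (not MSBuild's real types).
public sealed class ToyBuildManager
{
    private readonly object _syncLock = new object();

    // Concurrent because, per the description, it is also read off the lock.
    private readonly ConcurrentDictionary<int, string> _nodeIdToProvider =
        new ConcurrentDictionary<int, string>();

    // Mutated only while _syncLock is held.
    private readonly Dictionary<int, string> _nodeContexts = new Dictionary<int, string>();

    // Stand-in for the serialized work queue of incoming packets.
    private readonly ConcurrentQueue<int> _pendingShutdowns = new ConcurrentQueue<int>();

    public void AddNode(int nodeId)
    {
        lock (_syncLock)
        {
            _nodeIdToProvider[nodeId] = "OutOfProc";
            _nodeContexts[nodeId] = "context-" + nodeId;
        }
    }

    // Read/IO thread: only posts the shutdown; it never touches the maps.
    public void RouteNodeShutdown(int nodeId)
    {
        _pendingShutdowns.Enqueue(nodeId);
    }

    // Work-queue thread: handles each queued shutdown under the lock.
    public void ProcessWorkQueue()
    {
        int nodeId;
        while (_pendingShutdowns.TryDequeue(out nodeId))
        {
            lock (_syncLock)
            {
                RemoveNode(nodeId);
            }
        }
    }

    // Drops the provider context, then the mapping. Removing an unknown or
    // already-removed node is a no-op, so repeated calls are safe.
    private bool RemoveNode(int nodeId)
    {
        bool hadContext = _nodeContexts.Remove(nodeId);
        string ignored;
        bool hadMapping = _nodeIdToProvider.TryRemove(nodeId, out ignored);
        return hadContext || hadMapping;
    }

    // Scheduler-side lookup, taken under the same lock as removal.
    public bool CanSendTo(int nodeId)
    {
        lock (_syncLock)
        {
            return _nodeIdToProvider.ContainsKey(nodeId) && _nodeContexts.ContainsKey(nodeId);
        }
    }
}
```

In this model, a shutdown that has been routed but not yet processed leaves both maps intact, so a lookup made in that interval still finds the node; the entry disappears only once the work-queue thread handles the shutdown.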

No new lock contention

The read/IO thread stays lock-free (it only Posts the NodeShutdown and disposes its pipe). The two O(1) map removals relocate into HandleNodeShutdown, which already holds _syncLock; node contexts are only ever removed at node death/shutdown (never on the build hot path), so nothing on a hot path is newly serialized.

New node lifecycle

Alive ──(HandleNodeShutdown → NodeManager.RemoveNode, under _syncLock)──► Dead

No "ghost window": the maps are only ever mutated while _syncLock is held.

Validation

  • Deterministic repro: a local build with the real race window widened, plus a harness that kills worker nodes mid-build, reproduced the exact issue stack 10/10 before the fix and 0/10 after.
  • Stock bits: on unmodified SDK bits the race reproduced ~4/30; 0 after the fix.
  • Unit tests: new NodeManager_Tests (3 tests) lock in the contract — RoutePacket(NodeShutdown) must not remove the mapping, RemoveNode removes both the mapping and the provider context, and RemoveNode is idempotent.

Notes / follow-ups (out of scope here)

  • MainNodeSdkResolverService sends responses on a node read/IO thread (off _syncLock). Because that send targets the requesting node on that node's own serial read loop, it cannot race its own teardown; the only real hazard there was the _nodeIdToProvider data race, now closed by the ConcurrentDictionary. Routing SDK responses through the work queue could be a future hardening.
  • Multi-threaded in-proc mode (BuildParameters.MultiThreaded) removes the in-proc node context in NodeProviderInProc.RoutePacket; an analogous serialization could be applied there as a follow-up (separate, experimental path).

When an out-of-proc worker node disconnects mid-build, its pipe read/IO
thread tore down node state outside BuildManager._syncLock: it removed the
node from NodeManager._nodeIdToProvider and from NodeProviderOutOfProc._nodeContexts.
The scheduler, running under _syncLock on the work-queue thread, could
concurrently call NodeManager.SendData for that node and observe a
half-torn-down node, throwing 'Invalid node id specified' (or the sibling
'Node N does not have a provider'). The race has existed since the engine's
multiproc design; recent IPC refactors and oversubscribed CI raised its
probability.

Serialize node-map teardown onto the work-queue thread under _syncLock,
driven by the NodeShutdown packet that BuildManager.HandleNodeShutdown
already processes:
- NodeManager.RoutePacket/DeserializeAndRoutePacket no longer remove the
  mapping; add NodeManager.RemoveNode, called from HandleNodeShutdown.
- NodeProviderOutOfProc.NodeContextTerminated is now a no-op; context removal
  moves to RemoveNodeContext on the serialized path. The read/IO thread only
  routes the NodeShutdown and disposes its pipe.
- Make NodeManager._nodeIdToProvider a ConcurrentDictionary; it is read off
  the sync lock by the SDK resolver response path.
- Add NodeManager regression tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Successfully merging this pull request may close these issues.

Potential MSBuild regression "internal failure"
