Fix internal error 'Invalid node id specified' from out-of-proc node teardown race (#12438)#13966
Draft
JanProvaznik wants to merge 1 commit into
Draft
Conversation
) When an out-of-proc worker node disconnects mid-build, its pipe read/IO thread tore down node state outside BuildManager._syncLock: it removed the node from NodeManager._nodeIdToProvider and from NodeProviderOutOfProc._nodeContexts. The scheduler, running under _syncLock on the work-queue thread, could concurrently call NodeManager.SendData for that node and observe a half-torn-down node, throwing 'Invalid node id specified' (or the sibling 'Node N does not have a provider'). The race has existed since the engine's multiproc design; recent IPC refactors and oversubscribed CI raised its probability. Serialize node-map teardown onto the work-queue thread under _syncLock, driven by the NodeShutdown packet that BuildManager.HandleNodeShutdown already processes: - NodeManager.RoutePacket/DeserializeAndRoutePacket no longer remove the mapping; add NodeManager.RemoveNode, called from HandleNodeShutdown. - NodeProviderOutOfProc.NodeContextTerminated is now a no-op; context removal moves to RemoveNodeContext on the serialized path. The read/IO thread only routes the NodeShutdown and disposes its pipe. - Make NodeManager._nodeIdToProvider a ConcurrentDictionary; it is read off the sync lock by the SDK resolver response path. - Add NodeManager regression tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #12438.
Symptom
Intermittent internal error during parallel builds (seen in CI and by external users):
It sometimes manifests instead as
Node N does not have a provideror the gracefulMSB4166 Child node "N" exited prematurely.Root cause — a node-teardown race
When an out-of-proc worker node dies/disconnects mid-build, its pipe read/IO thread tears the node down outside
BuildManager._syncLock:NodeManager.RoutePacket/DeserializeAndRoutePacket→RemoveNodeFromMappingremoves it from_nodeIdToProvider, andNodeContext.Close→NodeContextTerminatedremoves it fromNodeProviderOutOfProc._nodeContexts.Meanwhile the scheduler, on the single work-queue thread under
_syncLock, callsNodeManager.SendData(node)→provider.SendData→_nodeContexts.ContainsKey(...). The authoritative "node is gone" signal (NodeShutdown→HandleNodeShutdown, which sets_shuttingDownand aborts) is only processed later on the serialized work queue. So there is a window in which the scheduler routes work to a node whose context/mapping the read thread just removed → the internal error. Whether you get the internal error vs. the gracefulMSB4166is pure timing.This is a long-standing latent race (present since the multiproc engine's original design); recent IPC refactors and oversubscribed CI agents raised its probability enough to surface it. It reproduces only with out-of-proc nodes — which is why
-m:1(and, less reliably,-nr:false) work around it.Fix — serialize node teardown through the work queue
Move all out-of-proc node-map removal onto the work-queue thread under
_syncLock, driven by theNodeShutdownpacket thatHandleNodeShutdownalready processes. The read/IO thread now only routes theNodeShutdownand disposes its own pipe — it no longer mutates the maps.NodeManager.RoutePacket/DeserializeAndRoutePacketno longer callRemoveNodeFromMapping.NodeManager.RemoveNode(nodeId)(drops the provider context viaINodeProvider.RemoveNodeContext, then the mapping) is called fromBuildManager.HandleNodeShutdownunder_syncLock.NodeProviderOutOfProc.NodeContextTerminatedis now a no-op; removal moves toRemoveNodeContext.NodeManager._nodeIdToProviderbecomes aConcurrentDictionary— it is read off the sync lock by the SDK-resolver response path (MainNodeSdkResolverService.PacketReceivedruns synchronously on a node's read/IO thread), so the non-concurrentDictionarywas a latent data race independent of this fix.No new lock contention
The read/IO thread stays lock-free (it only
Posts theNodeShutdownand disposes its pipe). The two O(1) map removals relocate intoHandleNodeShutdown, which already holds_syncLock; node contexts are only ever removed at node death/shutdown (never on the build hot path), so nothing on a hot path is newly serialized.New node lifecycle
No "ghost window": the maps are only ever mutated while
_syncLockis held.Validation
NodeManager_Tests(3 tests) lock in the contract —RoutePacket(NodeShutdown)must not remove the mapping,RemoveNoderemoves both the mapping and the provider context, andRemoveNodeis idempotent.Notes / follow-ups (out of scope here)
MainNodeSdkResolverServicesends responses on a node read/IO thread (off_syncLock). Because that send targets the requesting node on that node's own serial read loop, it cannot race its own teardown; the only real hazard there was the_nodeIdToProviderdata race, now closed by theConcurrentDictionary. Routing SDK responses through the work queue could be a future hardening.BuildParameters.MultiThreaded) removes the in-proc node context inNodeProviderInProc.RoutePacket; an analogous serialization could be applied there as a follow-up (separate, experimental path).