Enhance crash telemetry with richer diagnostics and EndBuild hang detection #13304

Merged: YuliiaKovalova merged 3 commits into main from dev/endbuild-hang-telemetry on Mar 2, 2026

YuliiaKovalova (Member) commented Feb 27, 2026

Summary

When MSBuild crashes via throw-helpers like ErrorUtilities.ThrowInternalError, crash telemetry previously captured only the throw-helper frame in StackTop — making triage nearly impossible since all InternalErrorException crashes look identical. Additionally, when EndBuild() hangs waiting for submissions or nodes, no telemetry was emitted at all because the crash telemetry in the finally block is unreachable during a hang.

This PR addresses both problems.

Changes

Richer crash diagnostics (all crash types)

| Property | Purpose |
| --- | --- |
| StackCaller | First meaningful caller frame, skipping known throw-helpers (ThrowInternalError, VerifyThrow, etc.); this is the frame you actually need for triage |
| FullStackTrace | Complete stack trace with file paths sanitized, capped at 4096 chars |
| ExceptionMessage | Exception message truncated to 256 chars, with file paths redacted to avoid PII |
| CrashThreadName | Thread name at crash time (main, worker, node communication, etc.) |
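
To illustrate the StackCaller idea, here is a minimal sketch of skipping throw-helper frames in a .NET-style stack trace string. It is written in Python for brevity and is not the PR's C# implementation. The function name `extract_stack_caller`, the regex, and the exact-match rule are assumptions; only the two helper names come from the description above, and the real helper list is longer ("etc.").

```python
import re
from typing import Optional

# Only these two names are taken from the PR description; the full list
# used by the real implementation is not shown there.
THROW_HELPERS = ("ThrowInternalError", "VerifyThrow")

# Assumed frame shape: "   at Namespace.Type.Method(args) in file:line N".
_FRAME = re.compile(r"^\s*at\s+(?P<qualified>[^\s(]+)\(")


def extract_stack_caller(stack_trace: str) -> Optional[str]:
    """Return the first frame whose method is not a known throw-helper."""
    for line in stack_trace.splitlines():
        match = _FRAME.match(line)
        if match is None:
            continue  # not a frame line, e.g. an inner-exception separator
        method = match.group("qualified").rsplit(".", 1)[-1]
        if method in THROW_HELPERS:
            continue  # skip the helper and keep looking for the real caller
        return line.strip()
    return None
```

With a trace whose top two frames are `ThrowInternalError` and `VerifyThrow`, the function returns the third frame, which is the one that distinguishes otherwise identical InternalErrorException crashes.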

EndBuild hang detection

Replaces infinite WaitOne() calls in EndBuild() with timed 30-second loops that emit periodic diagnostic telemetry via CrashExitType.EndBuildHang:

| Property | Purpose |
| --- | --- |
| EndBuildWaitPhase | Which wait point is stuck ("WaitingForSubmissions" / "WaitingForNodes") |
| EndBuildWaitDurationMs | How long the hang has lasted |
| PendingSubmissionCount | Submissions still in the pending dictionary |
| SubmissionsWithResultNoLogging | Submissions that have a result but whose LoggingCompleted is false; these are the ones blocking EndBuild |
| ThreadExceptionRecorded | Whether a thread exception exists on the BuildManager |
| UnmatchedProjectStartedCount | Orphaned ProjectStarted events (no corresponding ProjectFinished) |

Hang state is also persisted to %TEMP%\MSBuild_pid-{pid}.hang.txt via DumpHangDiagnosticsToFile for later retrieval from customer machines.
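
The timed-wait pattern described in this section can be sketched as follows. This is a Python analogue, not the PR's C# code: `threading.Event.wait(timeout)` stands in for a timed `WaitOne`, and the names `wait_with_hang_reports` and `report` are hypothetical. The 30-second default and the phase strings are the only values taken from the description.

```python
import threading
import time
from typing import Callable


def wait_with_hang_reports(
    done: threading.Event,
    phase: str,
    report: Callable[[str, int], None],
    interval_seconds: float = 30.0,
) -> None:
    """Block until `done` is set, calling `report` each time an interval expires."""
    started = time.monotonic()
    # Event.wait returns False when the timeout elapses without the event
    # being set, so each pass through the loop body is one "still hung" report.
    while not done.wait(timeout=interval_seconds):
        elapsed_ms = int((time.monotonic() - started) * 1000)
        report(phase, elapsed_ms)
```

The point of the design is that the wait still blocks until completion, as the infinite wait did, but a stuck wait now produces a report every interval instead of staying silent. In .NET the same shape is available because `WaitHandle.WaitOne` has overloads that take a timeout and return false when it expires.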

PII protection

All new telemetry properties are sanitized to prevent PII leaks:

  • Stack frames: File paths replaced with <redacted> (preserves line numbers)
  • Exception messages: Regex redaction of Windows (C:\...) and Unix (/...) paths → <path>
  • MSB0001 prefix stripping: Removes boilerplate MSB0001: Internal MSBuild Error: prefix
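
As a rough sketch of the message sanitization listed above (prefix stripping, path redaction, truncation), written in Python rather than the PR's C#: the two regular expressions are assumptions, since the PR's actual patterns are not shown here, and `sanitize_message` is a hypothetical name. The `<path>` placeholder, the MSB0001 prefix text, and the 256-character limit come from the description.

```python
import re

MSB0001_PREFIX = "MSB0001: Internal MSBuild Error:"

# Assumed patterns, not the PR's: a drive-letter path running up to the next
# whitespace, and a Unix path with at least two segments that does not start
# in the middle of a word (so "and/or" is left alone).
_WINDOWS_PATH = re.compile(r"[A-Za-z]:\\\S*")
_UNIX_PATH = re.compile(r"(?<![\w.])/[\w.\-]+(?:/[\w.\-]+)+")


def sanitize_message(message: str, max_length: int = 256) -> str:
    """Strip the boilerplate prefix, redact file paths, then truncate."""
    if message.startswith(MSB0001_PREFIX):
        message = message[len(MSB0001_PREFIX):].lstrip()
    message = _WINDOWS_PATH.sub("<path>", message)
    message = _UNIX_PATH.sub("<path>", message)
    return message[:max_length]
```

Redaction runs before truncation in this sketch so that a path cut in half by the length limit cannot slip past the patterns.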

…ection

- Add StackCaller: skips throw-helper frames to find actual crash site
- Add FullStackTrace: multi-frame sanitized trace (4096 char cap)
- Add ExceptionMessage: truncated + path-redacted to avoid PII
- Add CrashThreadName: captures thread identity at crash time
- Add EndBuild hang detection: replace infinite WaitOne() with timed
  30s loops that emit periodic diagnostic telemetry
- Add CrashExitType.EndBuildHang with 6 diagnostic properties:
  EndBuildWaitPhase, EndBuildWaitDurationMs, PendingSubmissionCount,
  SubmissionsWithResultNoLogging, ThreadExceptionRecorded,
  UnmatchedProjectStartedCount
- Add DumpHangDiagnosticsToFile: persists hang state to disk
- PII protection: regex path redaction in exception messages,
  SanitizeFilePathsInText for stack traces, SanitizeStackFrame for
  individual frames
- 61 tests covering all new functionality
Copilot AI review requested due to automatic review settings on February 27, 2026 14:19

Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances MSBuild's crash telemetry infrastructure to improve diagnostics for two critical scenarios: crashes through throw-helpers (like ErrorUtilities.ThrowInternalError) and EndBuild hangs. Previously, all throw-helper crashes appeared identical in telemetry because only the throw-helper frame was captured, making triage nearly impossible. Additionally, when EndBuild hung indefinitely, no diagnostic data was emitted since the process never reached the finally block.

Changes:

  • Added richer crash diagnostics: StackCaller (skips throw-helpers to reveal the actual bug location), FullStackTrace (complete sanitized stack), ExceptionMessage (truncated with PII redaction), and CrashThreadName
  • Implemented EndBuild hang detection by replacing infinite WaitOne() calls with 30-second timed loops that periodically emit diagnostic telemetry via CrashExitType.EndBuildHang
  • Applied comprehensive PII protection through path redaction in stack frames, exception messages, and diagnostic outputs

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/Shared/ExceptionHandling.cs | Added DumpHangDiagnosticsToFile to persist hang diagnostics to disk for later retrieval from customer machines |
| src/Framework/Telemetry/CrashTelemetryRecorder.cs | Added CollectAndEmitEndBuildHangDiagnostics for immediate hang telemetry emission and defined 30-second diagnostic interval constant |
| src/Framework/Telemetry/CrashTelemetry.cs | Added new telemetry properties (StackCaller, FullStackTrace, ExceptionMessage, CrashThreadName, EndBuild hang properties), PII sanitization methods (TruncateMessage, ExtractStackCaller, ExtractFullStackTrace, SanitizeFilePathsInText), and new EndBuildHang exit type |
| src/Framework.UnitTests/CrashTelemetry_Tests.cs | Added 16 new tests covering StackCaller extraction for all throw-helpers, PII redaction, truncation, hang diagnostic serialization, and verification that dropped properties are absent |
| src/Build/BackEnd/BuildManager/BuildManager.cs | Replaced infinite WaitOne() calls in EndBuild with timed 30-second loops that emit periodic diagnostics via EmitEndBuildHangDiagnostics |

YuliiaKovalova and others added 2 commits February 27, 2026 15:31
Account for the suffix length when truncating, so total output
(content + '... [truncated]') never exceeds MaxStackTraceLength.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add null-forgiving operator after ShouldNotBeNull() assertions
for StackTop and FullStackTrace properties.
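
The truncation fix in the first follow-up commit above (reserving room for the suffix so the total never exceeds the cap) can be illustrated with a small sketch. It is Python, not the PR's C#, and `truncate_with_suffix` is a hypothetical name; the 4096 cap and the suffix text are taken from the PR description and commit message.

```python
MAX_STACK_TRACE_LENGTH = 4096          # cap stated in the PR description
TRUNCATION_SUFFIX = "... [truncated]"  # suffix quoted in the commit message


def truncate_with_suffix(
    text: str,
    max_length: int = MAX_STACK_TRACE_LENGTH,
    suffix: str = TRUNCATION_SUFFIX,
) -> str:
    """Return text unchanged if it fits; otherwise cut it so that the kept
    content plus the suffix is at most max_length characters."""
    if len(text) <= max_length:
        return text
    # Reserve space for the suffix before cutting the content.
    keep = max(max_length - len(suffix), 0)
    # The final slice only matters when max_length is shorter than the suffix.
    return (text[:keep] + suffix)[:max_length]
```

A naive version that cuts at `max_length` and then appends the suffix would overshoot the cap by the suffix length, which is the error the commit message describes.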

3 participants