DO NOT MERGE: Add strace filesystem tracing workflow #13916
Adds a Linux-only, workflow_dispatch GitHub Actions workflow that runs the MSBuild build under strace and uploads a filtered trace artifact so operators can answer "which process deleted/rewrote my file?".

Design notes:
- strace -P is exact-path, not recursive, and misses relative-path syscalls after chdir, so we capture the full trace and post-filter with grep -F against both absolute and repo-relative path forms.
- -e trace=%file only covers path-arg syscalls; we additionally trace write/writev/pwrite64/close/ftruncate/copy_file_range and use -yy so fd-arg syscalls carry their backing path for grep to attribute.
- Restore runs untraced (network/disk I/O outside the trace folder) to cut runtime roughly in half without losing signal.
- Includes a strace smoke test that verifies the capture-then-grep approach records subtree events before the real build runs.
- PID extraction uses sed -nE (mawk on Ubuntu does not support gawk's 3-arg match(..., arr) form).
- Top-N reductions use awk 'NR<=N' rather than '| head -N' to avoid SIGPIPE on upstream sort under the default -e -o pipefail shell.
- Build is scoped via --projects by default (Framework only) and --nodeReuse false for deterministic process boundaries; users can override both inputs at dispatch time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
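The capture-then-grep and PID-extraction notes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow's actual script: the sample trace lines, the `/repo` root, and the `artifacts/bin/Foo` subtree are made up, and prefixing the relative pattern with a double quote (so it only matches path arguments that start with the relative path) is one possible way to keep the two patterns distinct.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample standing in for `strace -f -yy -o trace.log ...` output.
trace="$(mktemp)"
cat > "$trace" <<'EOF'
4101 openat(AT_FDCWD, "/repo/artifacts/bin/Foo/a.dll", O_WRONLY|O_CREAT, 0644) = 5
4101 write(5</repo/artifacts/bin/Foo/a.dll>, "MZ", 2) = 2
4102 openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY) = 3
4103 unlink("artifacts/bin/Foo/a.dll") = 0
EOF

repo_root="/repo"              # assumed; a workflow would use $GITHUB_WORKSPACE
trace_rel="artifacts/bin/Foo"  # assumed repo-relative target subtree

# Fixed-string filter on both path forms:
#  - absolute form: matches quoted path args and -yy fd annotations (<path>)
#  - relative form: matches path args issued relative to the repo root
# `|| true` keeps a no-match result from failing the step under `set -e`.
filtered="$(grep -F -e "$repo_root/$trace_rel/" -e "\"$trace_rel/" "$trace" || true)"

# Extract the leading PID column with sed -nE (no gawk-only features needed).
pids="$(printf '%s\n' "$filtered" | sed -nE 's/^([0-9]+)[[:space:]].*/\1/p' | sort -u)"

printf 'Filtered events:\n%s\n' "$filtered"
printf 'PIDs touching the subtree:\n%s\n' "$pids"
rm -f "$trace"
```

In this sample the libc open is dropped, while the absolute-path open, the fd-annotated write, and the relative-path unlink are all kept.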
Adds a pull_request trigger so the workflow runs (and produces its artifact) on PRs without first merging to main. On pull_request events `inputs.*` is null, so env vars now use `|| 'default'` to fall back to the same defaults shown in the workflow_dispatch UI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a Linux GitHub Actions workflow that runs the repo build under strace, post-filters the trace to a target subtree, and uploads the filtered trace + a small summary as an artifact to help attribute filesystem mutations to specific PIDs.
Changes:
- Introduces a new `Strace Filesystem Tracing` workflow with `workflow_dispatch` inputs for target path/configuration/project scoping.
- Runs restore untraced, then runs the build under `strace` and filters the log down to a specified repo-relative path.
- Uploads a gzipped filtered trace plus a human-readable summary as a workflow artifact.
# strace adds noticeable per-syscall overhead (~2-10x slowdown on
# syscall-heavy workloads), so this workflow is `workflow_dispatch` only -
# it is not part of regular PR/CI validation.
pull_request:
  branches:
    - main

- name: Restore (untraced)
  # Restore is dominated by NuGet network/disk I/O that never touches
  # our trace folder. Running it without strace cuts the overall job
  # time in half without losing any signal in the filtered trace.
  run: ./eng/common/build.sh --restore --configuration "$BUILD_CONFIG"

Arcade's Build.proj resides in the NuGet cache; the ProjectToBuild items it builds from $(Projects) resolve relative paths against the cache directory, not the repo root. The first traced run failed with: error MSB3202: The project file "src/Framework/Microsoft.Build.Framework.csproj" was not found. Convert any relative BUILD_PROJECTS value to an absolute path under $GITHUB_WORKSPACE so caller-supplied repo-relative paths work the way users expect. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
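A minimal sketch of that conversion, assuming a single project path in `BUILD_PROJECTS`. The helper name `to_abs_project` and the fallback workspace path are hypothetical and only there to make the example runnable outside GitHub Actions; the workflow's real step may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: leave absolute paths alone, anchor relative ones
# under the given workspace directory.
to_abs_project() {
  local workspace="$1" project="$2"
  case "$project" in
    /*) printf '%s\n' "$project" ;;
    *)  printf '%s\n' "$workspace/$project" ;;
  esac
}

# Fallback values are assumptions for running this outside a workflow.
GITHUB_WORKSPACE="${GITHUB_WORKSPACE:-/tmp/workspace}"
BUILD_PROJECTS="${BUILD_PROJECTS:-src/Framework/Microsoft.Build.Framework.csproj}"

BUILD_PROJECTS="$(to_abs_project "$GITHUB_WORKSPACE" "$BUILD_PROJECTS")"
echo "Resolved project: $BUILD_PROJECTS"
```

Resolving the path before it reaches `--projects` means Build.proj never sees a relative value, so its location in the NuGet cache no longer matters.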
Review Summary
The workflow is well-documented and technically sound in its strace methodology (good smoke test, correct use of -yy for fd annotation, proper SIGPIPE avoidance, etc.). However, there is one blocking issue:
🚨 pull_request trigger runs this 120-minute workflow on every PR
The pull_request: branches: [main] trigger (line 51) will execute this heavyweight diagnostic workflow on every PR targeting main. This directly contradicts the header comment (lines 26-27) which states: "this workflow is workflow_dispatch only - it is not part of regular PR/CI validation."
Given this repo's CI patterns (no other workflow has a 120-minute timeout, and diagnostic workflows like this are typically workflow_dispatch-only), this trigger should be removed. If PR-based execution is needed, it should be gated by a label condition or path filter.
Other findings:
- Missing `set -e` in shell scripts (lines 120, 157): Setup code runs without error checking
- Direct `${{ }}` interpolation in shell (line 267): Safe here but anti-pattern for security; use env vars instead
- Comment inconsistency (lines 26-27 vs 51): Header contradicts actual behavior
What's done well:
- Excellent documentation explaining the strace methodology trade-offs
- Smart separation of restore (untraced) from build (traced) to reduce noise
- Good defensive shell coding (grep -F, grep -m, awk NR<=N instead of head to avoid SIGPIPE)
- Proper `|| true` for expected grep failures
- Reasonable artifact management (gzip filtered, delete full trace)
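For readers unfamiliar with the two defensive patterns praised above, here is a small sketch under `set -euo pipefail`. The PID list is made-up sample data, not output from the real workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical PID column, one entry per filtered trace event.
pids_file="$(mktemp)"
printf '%s\n' 4101 4103 4101 4102 4101 4103 > "$pids_file"

# Top-2 PIDs by event count. awk consumes all of its input, so the upstream
# `sort` is never killed by SIGPIPE the way it can be when `head` exits early,
# which would fail the whole pipeline under `pipefail`.
top2="$(sort "$pids_file" | uniq -c | sort -rn | awk 'NR<=2 {print $2 ":" $1}')"

# grep exits 1 when nothing matches; `|| true` marks that as an expected
# outcome rather than a step failure. With -c it still prints the count.
deletes="$(grep -cF 'unlink(' "$pids_file" || true)"

printf 'Top PIDs (pid:events):\n%s\n' "$top2"
printf 'unlink events found: %s\n' "$deletes"
rm -f "$pids_file"
```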
Note
🔒 Integrity filter blocked 1 item
The following item was blocked because it doesn't meet the GitHub integrity level.
- #13916
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
tools:
  github:
    min-integrity: approved # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13916 · ● 1.9M
Workflow run #26758406589 succeeded (11m32s). Excerpt from the uploaded For any future "who deleted/rewrote my file?" investigation, that PID column is enough to grep the unfiltered (per-PID) trace lines in the artifact and walk back the responsible MSBuild node / dotnet child process. This PR is intended as a proof of concept; please DO NOT MERGE.

@OvesN that pathway works for now, but as soon as we turn on multi-threaded mode by default, PID becomes useless to us. Is thread ID something that is surfaced by strace? If we can pin CLR logical threads to specific OS threads, that could be a replacement. Otherwise I think we'd need CLR-level tracing.

DO NOT MERGE - opened only to trigger the `Strace Filesystem Tracing` workflow on the head commit so the artifact can be downloaded and verified.

Adds a workflow_dispatch + pull_request GitHub Actions workflow that runs the MSBuild build under `strace` on Linux and uploads a filtered filesystem trace as an artifact (`msbuild-strace-trace`: gzipped log + summary). Used to attribute file operations (delete/rewrite/truncate) to specific PIDs.

Defaults: trace `artifacts/bin/Microsoft.Build.Framework/Debug/net10.0`, `Debug` config, scoped to `src/Framework/Microsoft.Build.Framework.csproj` to keep runtime down. All overridable via `workflow_dispatch` inputs.