Skip to content

DO NOT MERGE: Add strace filesystem tracing workflow#13916

Open
OvesN wants to merge 3 commits into
dotnet:mainfrom
OvesN:feat/strace-filesystem-tracing
Open

DO NOT MERGE: Add strace filesystem tracing workflow#13916
OvesN wants to merge 3 commits into
dotnet:mainfrom
OvesN:feat/strace-filesystem-tracing

Conversation

@OvesN
Copy link
Copy Markdown
Contributor

@OvesN OvesN commented Jun 1, 2026

DO NOT MERGE - opened only to trigger the Strace Filesystem Tracing workflow on the head commit so the artifact can be downloaded and verified.

Adds a workflow_dispatch + pull_request GitHub Actions workflow that runs the MSBuild build under strace on Linux and uploads a filtered filesystem trace as an artifact (msbuild-strace-trace: gzipped log + summary). Used to attribute file operations (delete/rewrite/truncate) to specific PIDs.

Defaults: trace artifacts/bin/Microsoft.Build.Framework/Debug/net10.0, Debug config, scoped to src/Framework/Microsoft.Build.Framework.csproj to keep runtime down. All overridable via workflow_dispatch inputs.

OvesN and others added 2 commits June 1, 2026 15:13
Adds a Linux-only, workflow_dispatch GitHub Actions workflow that runs
the MSBuild build under strace and uploads a filtered trace artifact so
operators can answer "which process deleted/rewrote my file?".

Design notes:
- strace -P is exact-path, not recursive, and misses relative-path
  syscalls after chdir, so we capture the full trace and post-filter
  with grep -F against both absolute and repo-relative path forms.
- -e trace=%file only covers path-arg syscalls; we additionally trace
  write/writev/pwrite64/close/ftruncate/copy_file_range and use -yy so
  fd-arg syscalls carry their backing path for grep to attribute.
- Restore runs untraced (network/disk I/O outside the trace folder) to
  cut runtime roughly in half without losing signal.
- Includes a strace smoke test that verifies the capture-then-grep
  approach records subtree events before the real build runs.
- PID extraction uses sed -nE (mawk on Ubuntu does not support gawk's
  3-arg match(..., arr) form).
- Top-N reductions use awk 'NR<=N' rather than '| head -N' to avoid
  SIGPIPE on upstream sort under the default -e -o pipefail shell.
- Build is scoped via --projects by default (Framework only) and
  --nodeReuse false for deterministic process boundaries; users can
  override both inputs at dispatch time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a pull_request trigger so the workflow runs (and produces its
artifact) on PRs without first merging to main. On pull_request events
`inputs.*` is null, so env vars now use `|| 'default'` to fall back to
the same defaults shown in the workflow_dispatch UI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 1, 2026 13:33
@OvesN OvesN requested a review from a team as a code owner June 1, 2026 13:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Linux GitHub Actions workflow that runs the repo build under strace, post-filters the trace to a target subtree, and uploads the filtered trace + a small summary as an artifact to help attribute filesystem mutations to specific PIDs.

Changes:

  • Introduces a new Strace Filesystem Tracing workflow with workflow_dispatch inputs for target path/configuration/project scoping.
  • Runs restore untraced, then runs the build under strace and filters the log down to a specified repo-relative path.
  • Uploads a gzipped filtered trace plus a human-readable summary as a workflow artifact.

Comment on lines +25 to +27
# strace adds noticeable per-syscall overhead (~2-10x slowdown on
# syscall-heavy workloads), so this workflow is `workflow_dispatch` only -
# it is not part of regular PR/CI validation.
Comment on lines +51 to +54
pull_request:
branches:
- main

Comment on lines +111 to +115
- name: Restore (untraced)
# Restore is dominated by NuGet network/disk I/O that never touches
# our trace folder. Running it without strace cuts the overall job
# time in half without losing any signal in the filtered trace.
run: ./eng/common/build.sh --restore --configuration "$BUILD_CONFIG"
Arcade's Build.proj resides in the NuGet cache; the ProjectToBuild items
it builds from $(Projects) resolve relative paths against the cache
directory, not the repo root. The first traced run failed with:

  error MSB3202: The project file
  "src/Framework/Microsoft.Build.Framework.csproj" was not found.

Convert any relative BUILD_PROJECTS value to an absolute path under
$GITHUB_WORKSPACE so caller-supplied repo-relative paths work the way
users expect.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The workflow is well-documented and technically sound in its strace methodology (good smoke test, correct use of -yy for fd annotation, proper SIGPIPE avoidance, etc.). However, there is one blocking issue:

🚨 pull_request trigger runs this 120-minute workflow on every PR

The pull_request: branches: [main] trigger (line 51) will execute this heavyweight diagnostic workflow on every PR targeting main. This directly contradicts the header comment (lines 26-27) which states: "this workflow is workflow_dispatch only - it is not part of regular PR/CI validation."

Given this repo's CI patterns (no other workflow has a 120-minute timeout, and diagnostic workflows like this are typically workflow_dispatch-only), this trigger should be removed. If PR-based execution is needed, it should be gated by a label condition or path filter.

Other findings:

  • Missing set -e in shell scripts (lines 120, 157): Setup code runs without error checking
  • Direct ${{ }} interpolation in shell (line 267): Safe here but anti-pattern for security; use env vars instead
  • Comment inconsistency (lines 26-27 vs 51): Header contradicts actual behavior

What's done well:

  • Excellent documentation explaining the strace methodology trade-offs
  • Smart separation of restore (untraced) from build (traced) to reduce noise
  • Good defensive shell coding (grep -F, grep -m, awk NR<=N instead of head to avoid SIGPIPE)
  • Proper || true for expected grep failures
  • Reasonable artifact management (gzip filtered, delete full trace)

Note

🔒 Integrity filter blocked 1 item

The following item were blocked because they don't meet the GitHub integrity level.

  • #13916 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13916 · ● 1.9M

@OvesN
Copy link
Copy Markdown
Contributor Author

OvesN commented Jun 1, 2026

Workflow run #26758406589 succeeded (11m32s).

Excerpt from the uploaded msbuild-strace-trace artifact summary, for the default trace path artifacts/bin/Microsoft.Build.Framework/Debug/net10.0 while building src/Framework/Microsoft.Build.Framework.csproj:

Filtered lines:        223

Distinct PIDs that touched the path:
3255, 3257, 3311, 3317, 3365, 3366, 3367, 3368, 3520

Top 30 syscalls by frequency:
     75 lstat
     24 openat
     24 close
     20 stat
     18 ftruncate
     17 utimensat
     17 copy_file_range
     14 mkdir
      1 pwrite64

For any future "who deleted/rewrote my file?" investigation, that PID column is enough to grep the unfiltered (per-PID) trace lines in the artifact and walk back the responsible MSBuild node / dotnet child process.

This PR is intended as a proof of concept — please DO NOT MERGE.

@baronfel
Copy link
Copy Markdown
Member

baronfel commented Jun 1, 2026

@OvesN that pathway works for now, but as soon as we turn on multi-threaded mode by default, PID becomes useless to us. Is thread ID something that is surfaced by strace? If we can pin CLR logical threads to specific OS threads that could be a replacement. Otherwise I think we'd need CLR level tracing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants