test(ci): probe — disable Defender + capture Event Log + fix tasklist sidecar#7855
JohnMcLear wants to merge 2 commits into
…tasklist sidecar
Three orthogonal probes against the Windows silent-ELIFECYCLE flake, landed in one PR because they're all workflow-only and complementary.
PROBE A: Defender real-time monitoring OFF for the test phase. The kill fingerprint (silent external termination, no JS-handler trace, no native abort report, sub-1s death window) matches Microsoft Defender's behavioural-monitoring TerminateProcess signature. GHA Windows runners have Defender RT enabled by default, and rapid loopback TCP fanout is on Defender's "suspect process behaviour" list. If kills disappear with RT off → causal, this PR is the fix-as-mitigation; if not → Defender ruled out.
PROBE H: pre-test wevtutil clear + post-test event log dump. We've never looked at the Windows event log around the kill. `Application`, `System`, `Microsoft-Windows-Windows Defender/Operational`, and the `Application Error`/`Application Hang`/`Windows Error Reporting` providers between them will surface who killed the process: Defender, Service Control Manager, Werfault, kernel guard, etc. Clear the logs pre-test so signal-to-noise is high; dump post-test regardless of pass/fail.
PROBE I: tasklist sidecar fix (latent bug from PR #7846). The bash `tasklist /v /fi "imagename eq node.exe" /fo csv` produced empty output on the runner because git-bash mangles tasklist's UTF-16-LE-with-BOM output. Switch to PowerShell's Get-CimInstance Win32_Process with explicit columns. This gives us the OS-side equivalent of the libuv handle table (HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime, UserModeTime) sampled every 500 ms. When Node's `_getActiveHandles` goes silent during the V8 starvation window, the OS still sees the process; this captures that view.
All three additions land in node-report/, which the existing artifact upload picks up on failure. No test-code changes. No new dependencies.
Expected outcomes:
- Defender root cause: Win-with-plugins flake rate drops materially over 5+ runs. event-defender.txt shows pre-kill threat-detection entries on the kills that DO still happen.
- Defender not the root cause: event-application.txt / event-system.txt names the actual terminator (Service Control Manager, kernel, Werfault). Probe G (procdump) is the next step.
- Neither: kernel-level kill bypassing all event logging, which escalates to ETW tracing or a procdump on kill-detect trigger.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
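Probe H's clear-then-dump flow could be sketched in bash roughly as follows. This is a minimal sketch, not the workflow's actual step: the helper names (`clear_event_logs`, `dump_event_log`, `dump_event_logs`), the 500-event cap, and the `MSYS_NO_PATHCONV=1` guard (a Git for Windows setting, assumed here to be needed so git-bash does not rewrite `/f:text`-style switches as paths) are all assumptions. The log names and output file names are the ones named in the commit message above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the 500-event cap are assumptions, not the PR's code.

DEFENDER_LOG="Microsoft-Windows-Windows Defender/Operational"

# Clear the three logs before the tests so the post-test dump only covers the run.
clear_event_logs() {
  for log in Application System "$DEFENDER_LOG"; do
    wevtutil cl "$log" || true   # a failing probe must never fail the job
  done
}

# Dump one log, newest first, as plain text. $1 = log name, $2 = output file.
dump_event_log() {
  # Assumption: MSYS_NO_PATHCONV=1 keeps git-bash from rewriting the /switches.
  MSYS_NO_PATHCONV=1 wevtutil qe "$1" /f:text /rd:true /c:500 > "$2" 2>&1 || true
}

# Dump all three logs into the directory given as $1.
dump_event_logs() {
  dump_event_log Application "$1/event-application.txt"
  dump_event_log System "$1/event-system.txt"
  dump_event_log "$DEFENDER_LOG" "$1/event-defender.txt"
}
```

Calling `clear_event_logs` before the test command and `dump_event_logs "$OUT"` after it, whatever the exit status, would match the "dump post-test regardless of pass/fail" intent.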
Review Summary by Qodo
Probe Windows silent-ELIFECYCLE flake with Defender, Event Log, tasklist fixes
Walkthroughs
Description
• Disable Windows Defender real-time monitoring during test execution
• Clear and capture Windows Event Log (Application, System, Defender) for diagnostics
• Fix tasklist sidecar to use PowerShell instead of bash for proper process metrics
• Add pre/post Defender state verification and process termination event tracking
Diagram
flowchart LR
A["Test Execution"] --> B["PROBE A: Disable Defender RT"]
A --> C["PROBE H: Clear Event Logs"]
A --> D["PROBE I: Fix tasklist Sidecar"]
B --> E["Capture Defender State Before/After"]
C --> F["Dump Application/System/Defender Events"]
D --> G["Sample Process Metrics via PowerShell"]
E --> H["Artifacts: defender-state-*.txt, event-*.txt, tasklist.log"]
F --> H
G --> H
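The PowerShell-based process sampler from Probe I could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's actual sidecar: the function names (`sidecar_script`, `start_sidecar`, `stop_sidecar`), the timestamp comment line, and the CSV output are all hypothetical. The `Win32_Process` columns and the 500 ms interval are the ones listed in the PR description.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the CSV layout are assumptions, not the PR's code.

# Print the PowerShell sampling loop. The quoted heredoc delimiter stops bash
# from expanding $true and $(...) before PowerShell sees them.
sidecar_script() {
  cat <<'PS'
while ($true) {
  Write-Output "# $(Get-Date -Format o)"
  Get-CimInstance Win32_Process -Filter "Name = 'node.exe'" |
    Select-Object ProcessId, HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime, UserModeTime |
    ConvertTo-Csv -NoTypeInformation
  Start-Sleep -Milliseconds 500
}
PS
}

# Start the sampler in the background, appending to the file given as $1.
start_sidecar() {
  powershell -NoProfile -Command "$(sidecar_script)" >> "$1" 2>&1 &
  SIDECAR_PID=$!
}

# Stop the sampler once the tests have finished.
stop_sidecar() {
  kill "$SIDECAR_PID" 2>/dev/null || true
  wait "$SIDECAR_PID" 2>/dev/null || true
}
```

A step would call `start_sidecar "$OUT/tasklist.log"` before the tests and `stop_sidecar` afterwards. Asking for explicit columns keeps each sample to one short CSV row per `node.exe` process.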
File Changes
1. .github/workflows/backend-tests.yml
Code Review by Qodo
1. Defender not restored
powershell -Command "Set-MpPreference -DisableRealtimeMonitoring \$true -ErrorAction SilentlyContinue; Get-MpPreference | Select-Object -Property DisableRealtimeMonitoring,DisableBehaviorMonitoring,DisableIOAVProtection,IsTamperProtected | Format-List" \
  > "$OUT/defender-state-before.txt" 2>&1 || true
1. Defender not restored 🐞 Bug ⛨ Security
The Windows backend test step disables Microsoft Defender real-time monitoring but never re-enables it, so subsequent steps in the same job run with AV protection reduced. This is a security posture regression introduced by this PR in both Windows jobs.
Agent Prompt
### Issue description
The workflow disables Microsoft Defender real-time monitoring via `Set-MpPreference -DisableRealtimeMonitoring $true` but never restores it to `$false` before exiting the step, leaving later job steps running with Defender RT disabled.
### Issue Context
This occurs in both Windows backend test jobs (with and without plugins). The step already has a post-test section; restoration should happen there and ideally be guarded with a bash `trap` so it runs even if the test command fails.
### Fix Focus Areas
- .github/workflows/backend-tests.yml[240-305]
- .github/workflows/backend-tests.yml[413-478]
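The `trap`-guarded restoration suggested in this review comment might be sketched as below. The helper names (`set_defender_rt_disabled`, `restore_defender_rt`, `with_defender_rt_off`) are hypothetical and the workflow's real step may be structured differently. Only the `Set-MpPreference -DisableRealtimeMonitoring` call is taken from the diff.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the workflow.

# Set Defender real-time monitoring state. $1 is a PowerShell literal passed
# single-quoted so bash leaves it alone: '$true' disables, '$false' restores.
set_defender_rt_disabled() {
  powershell -Command "Set-MpPreference -DisableRealtimeMonitoring $1 -ErrorAction SilentlyContinue" || true
}

restore_defender_rt() {
  set_defender_rt_disabled '$false'
}

# Run "$@" with real-time monitoring off and hand back its exit status.
with_defender_rt_off() {
  (
    # The subshell scopes the trap: it fires when the subshell exits,
    # whether the wrapped command passed or failed.
    trap restore_defender_rt EXIT
    set_defender_rt_disabled '$true'
    "$@"
    exit $?
  )
}
```

Wrapping the test invocation, for example `with_defender_rt_off pnpm test -- --exit`, would leave later steps in the job running with real-time monitoring back on while still propagating the test command's exit status.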
The first artifact upload step has `if: failure()`, so we only see node-report data on failure. For the Defender hypothesis (PR #7855) we need to compare event-defender.txt between a passing run (baseline) and a future failing run (kill signature); otherwise N=1 captures can't be evaluated.
Add a second upload step gated on `always()` that uploads only the small text files (event-*.txt, defender-*.txt) on every run regardless of outcome. The unique `-${{ github.run_attempt }}` suffix lets reruns accumulate separate artifacts for comparison. Each artifact is only a few KB, so this doesn't materially impact storage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
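One way to do the baseline-versus-failure comparison described here is to reduce each dump to its set of event IDs and list the IDs that appear only in the failing run. The sketch below assumes the dumps are plain-text event listings with one `Event ID: <number>` line per event, which is what `wevtutil qe ... /f:text` is expected to produce. The helper names (`event_ids`, `new_event_ids`) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes text dumps with "Event ID: <n>" lines; names are hypothetical.

# Print the sorted, de-duplicated event IDs found in the dump given as $1.
# The trailing .* also swallows a CR if the dump has Windows line endings.
event_ids() {
  sed -n 's/^[[:space:]]*Event ID:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | sort -u
}

# Print event IDs present in the failing dump ($2) but absent from the baseline ($1).
new_event_ids() {
  baseline_ids="$(mktemp)"
  failing_ids="$(mktemp)"
  event_ids "$1" > "$baseline_ids"
  event_ids "$2" > "$failing_ids"
  comm -13 "$baseline_ids" "$failing_ids"   # lines unique to the second file
  rm -f "$baseline_ids" "$failing_ids"
}
```

For example, `new_event_ids pass/event-defender.txt fail/event-defender.txt` would print only the IDs that first show up in the failing run, which narrows down which entries are worth reading in full.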
Summary
Workflow-only probe targeting the Windows silent-ELIFECYCLE flake. Three orthogonal additions in one PR:
… `Get-CimInstance Win32_Process` for clean ASCII with HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime, UserModeTime sampled every 500 ms)
Why
Every hypothesis tested so far is RULED OUT:
The kill fingerprint (silent external termination, no JS-handler trace, no native abort report, sub-1s death window) matches Microsoft Defender's behavioural-monitoring `TerminateProcess` signature more closely than any other plausible cause. Defender is enabled by default on GHA Windows runners, and rapid loopback TCP fanout is on its suspect-process-behaviour list. We've simply never tested it.
If kills disappear with RT off → causal. If kills persist but `event-defender.txt` shows pre-kill detection entries → Defender is involved with a more nuanced trigger. If kills persist with no Defender entries → the OS event log will name the actual terminator (Service Control Manager, kernel guard, Werfault, etc.).
What's captured in the artifact on failure
What this doesn't change
`pnpm test -- --exit` invocation unchanged on this branch (the `--exit` probe is `probe-flake-no-exit-flag` on a separate branch, run in parallel)
Test plan
🤖 Generated with Claude Code