test(ci): probe — disable Defender + capture Event Log + fix tasklist sidecar#7855

Open
JohnMcLear wants to merge 2 commits into develop from probe-flake-defender-eventlog-sidecar

@JohnMcLear
Member

Summary

Workflow-only probe targeting the Windows silent-ELIFECYCLE flake. Three orthogonal additions in one PR:

  1. A: disable Microsoft Defender real-time monitoring for the duration of the test step
  2. H: clear the Application and System event logs pre-test, then dump them post-test (pass or fail) to the artifact directory
  3. I: fix the tasklist sidecar introduced in #7846 ("test(ci): OS-level sidecar watcher for the Windows silent ELIFECYCLE"). It was producing empty output because git-bash mangled tasklist's UTF-16 BOM output; it now uses Get-CimInstance Win32_Process for clean ASCII, with HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime and UserModeTime sampled every 500 ms

Why

Every hypothesis tested so far is RULED OUT:

| Hypothesis | Ruled out by | How |
| --- | --- | --- |
| Memory/handle leak | every pre-kill node-report | nominal state |
| TIME_WAIT accumulation | #7852 | keepAlive collapsed it, kill survived |
| Rapid-sequential cadence | #7854 | setImmediate yield in root hook, kill survived |
| File-specific pathology | death corpus | 7 files, same fingerprint |

The kill fingerprint (silent external termination, no JS-handler trace, no native abort report, sub-1s death window) matches Microsoft Defender's behavioural-monitoring TerminateProcess signature more closely than any other plausible cause. Defender is enabled by default on GHA Windows runners, and rapid loopback TCP fanout is on its suspect-process-behaviour list. We've simply never tested it.

If kills disappear with RT off → causal. If kills persist but event-defender.txt shows pre-kill detection entries → Defender is involved with a more nuanced trigger. If kills persist with no Defender entries → the OS event log will name the actual terminator (Service Control Manager, kernel guard, Werfault, etc.).

What's captured in the artifact on failure

node-report/
├── defender-state-before.txt  # pre-test Defender config
├── defender-state-after.txt   # post-test Defender config (sanity)
├── event-clear.txt            # confirmation logs were cleared
├── event-application.txt      # last 500 Application events with timestamps
├── event-system.txt           # last 500 System events
├── event-defender.txt         # last 200 Defender Operational events
├── event-app-errors.txt       # specifically Application Error / Hang / WER
├── netstat.log                # (existing) localhost TCP every 500ms
├── tasklist.log               # (NOW WORKING) node.exe handle/CPU/RSS every 500ms
└── be-NNNN-*.json / hb-* / mt-*  # (existing) Node diagnostic reports
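
As a rough illustration of how the event-*.txt files listed above could be produced from the step's bash script, here is a minimal sketch. It assumes the dump is done with PowerShell's Get-WinEvent; the workflow itself may use a different command. The helper names dump_event_log and dump_all_event_logs are hypothetical, while the log names and event counts follow the listing above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, and Get-WinEvent is an
# assumption about how the post-test dump could be implemented.

# $1 = event log name, $2 = maximum number of events, $3 = output file
dump_event_log() {
  # stderr is folded into the output file so a failed query is visible
  # in the artifact; "|| true" keeps a dump failure from failing the step.
  powershell -NoProfile -Command \
    "Get-WinEvent -LogName '$1' -MaxEvents $2 | Format-List TimeCreated,ProviderName,Id,LevelDisplayName,Message" \
    >"$3" 2>&1 || true
}

# $1 = artifact directory (node-report/ in the listing above)
dump_all_event_logs() {
  out_dir=$1
  dump_event_log 'Application' 500 "$out_dir/event-application.txt"
  dump_event_log 'System' 500 "$out_dir/event-system.txt"
  dump_event_log 'Microsoft-Windows-Windows Defender/Operational' 200 "$out_dir/event-defender.txt"
}
```

In the step this would be called once after the tests, for example `dump_all_event_logs "$OUT"`. Get-WinEvent may report an error when a log holds no events, and with the redirection above that message ends up in the corresponding output file rather than being lost.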

What this doesn't change

  • No code changes (workflow only)
  • No test changes
  • pnpm test -- --exit invocation is unchanged on this branch (the --exit probe lives on the separate probe-flake-no-exit-flag branch and runs in parallel)
  • Linux jobs untouched
  • The Defender disable only applies to the test step on Windows; runner is reset between jobs anyway

Test plan

  • Linux ± plugins must pass (probe touches Windows only)
  • Windows ± plugins backend test reruns 5+ times to compare flake rate vs the ~22% baseline
  • On any failure: pull artifact, inspect event-*.txt for who terminated Node
  • tasklist.log shows real columns this time (not just headers)
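
To put the "5+ reruns" figure in context, the sketch below works out how often an all-green streak would occur purely by chance. It assumes every run fails independently at the ~22% baseline, which is a simplification, and the function names are illustrative only.

```shell
#!/usr/bin/env bash
# Assumption: each rerun fails independently at the ~22% baseline rate.
baseline=0.22

# Probability that $1 consecutive runs all pass even though nothing changed.
p_all_pass() {
  awk -v p="$baseline" -v n="$1" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

# Smallest number of consecutive passes whose chance probability is below $1.
runs_needed() {
  awk -v p="$baseline" -v a="$1" 'BEGIN {
    n = 0
    q = 1
    while (q >= a) {
      q *= (1 - p)
      n++
    }
    print n
  }'
}

p_all_pass 5      # prints 0.289
runs_needed 0.05  # prints 13
```

Under that assumption, five clean reruns still happen about 29% of the time with no fix at all, so roughly 13 consecutive clean runs would be needed before a chance streak drops below 5%.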

🤖 Generated with Claude Code

…tasklist sidecar

Three orthogonal probes against the Windows silent-ELIFECYCLE flake,
landed in one PR because they're all workflow-only and complementary.

PROBE A — Defender real-time monitoring OFF for the test phase.
The kill fingerprint (silent external termination, no JS-handler
trace, no native abort report, sub-1s death window) matches
Microsoft Defender's behavioural-monitoring TerminateProcess
signature. GHA Windows runners have Defender RT enabled by default,
and rapid loopback TCP fanout is on Defender's "suspect process
behaviour" list. If kills disappear with RT off → causal, this PR
is the fix-as-mitigation; if not → Defender ruled out.

PROBE H — pre-test wevtutil clear + post-test event log dump.
We've never looked at the Windows event log around the kill.
`Application`, `System`,
`Microsoft-Windows-Windows Defender/Operational`, and the
`Application Error`/`Application Hang`/`Windows Error Reporting`
providers between them will surface
who killed the process: Defender, Service Control Manager,
Werfault, kernel guard, etc. Clear the logs pre-test so
signal-to-noise is high; dump post-test regardless of pass/fail.

PROBE I — tasklist sidecar fix (latent bug from PR #7846).
The bash `tasklist /v /fi "imagename eq node.exe" /fo csv`
produced empty output on the runner — git-bash mangles tasklist's
UTF-16-LE-with-BOM output. Switch to PowerShell's
Get-CimInstance Win32_Process with explicit columns. This gives
us the OS-side equivalent of the libuv handle table (HandleCount,
ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime,
UserModeTime) sampled every 500 ms. When Node's
`process._getActiveHandles()`
goes silent during the V8 starvation window, the OS still
sees the process; this captures that view.

All three additions land in node-report/ which the existing
artifact upload picks up on failure. No test-code changes.
No new dependencies.

Expected outcomes:
  - Defender root cause: Win-with-plugins flake rate drops materially
    over 5+ runs. event-defender.txt shows pre-kill threat-detection
    entries on the kills that DO still happen.
  - Defender not the root cause: event-application.txt /
    event-system.txt names the actual terminator (Service Control
    Manager, kernel, Werfault). Probe G (procdump) is the next step.
  - Neither: kernel-level kill bypassing all event logging — escalates
    to ETW tracing or a procdump on kill-detect trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@qodo-free-for-open-source-projects

Review Summary by Qodo

Probe Windows silent-ELIFECYCLE flake with Defender, Event Log, tasklist fixes

🧪 Tests

Walkthroughs

Description
• Disable Windows Defender real-time monitoring during test execution
• Clear and capture Windows Event Log (Application, System, Defender) for diagnostics
• Fix tasklist sidecar to use PowerShell instead of bash for proper process metrics
• Add pre/post Defender state verification and process termination event tracking
Diagram
flowchart LR
  A["Test Execution"] --> B["PROBE A: Disable Defender RT"]
  A --> C["PROBE H: Clear Event Logs"]
  A --> D["PROBE I: Fix tasklist Sidecar"]
  B --> E["Capture Defender State Before/After"]
  C --> F["Dump Application/System/Defender Events"]
  D --> G["Sample Process Metrics via PowerShell"]
  E --> H["Artifact: defender-state-*.txt"]
  F --> H
  G --> H["Artifact: event-*.txt, tasklist.log"]

File Changes

1. .github/workflows/backend-tests.yml 🧪 Tests +110/-32

Windows CI workflow probes for silent process termination flake

• Added PROBE A: disable Windows Defender real-time monitoring via Set-MpPreference before test
 execution
• Added PROBE H (pre): clear Application and System event logs with wevtutil cl to reduce noise
• Added PROBE H (post): dump Windows Event Log entries (Application, System, Defender Operational,
 error/hang events) to artifact directory post-test
• Fixed PROBE I: replaced bash tasklist command with PowerShell Get-CimInstance Win32_Process to
 capture process metrics (HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime,
 UserModeTime) in clean ASCII format
• Added Defender state verification before and after test to confirm RT monitoring remained disabled
• Applied identical changes to both Windows test job sections (backend-tests and
 backend-tests-plugins)

.github/workflows/backend-tests.yml


@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 26, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)


Action required

1. Defender not restored 🐞 Bug ⛨ Security
Description
The Windows backend test step disables Microsoft Defender real-time monitoring but never re-enables
it, so subsequent steps in the same job run with AV protection reduced. This is a security posture
regression introduced by this PR in both Windows jobs.
Code

.github/workflows/backend-tests.yml[R240-241]

Evidence
The workflow sets DisableRealtimeMonitoring to $true and later only records the Defender state
(Get-MpPreference) without ever setting it back to $false, then proceeds to subsequent steps in
the job (e.g., vitest). The same pattern is duplicated in the plugins Windows job.

.github/workflows/backend-tests.yml[218-317]
.github/workflows/backend-tests.yml[391-489]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The workflow disables Microsoft Defender real-time monitoring via `Set-MpPreference -DisableRealtimeMonitoring $true` but never restores it to `$false` before exiting the step, leaving later job steps running with Defender RT disabled.

### Issue Context
This occurs in both Windows backend test jobs (with and without plugins). The step already has a post-test section; restoration should happen there and ideally be guarded with a bash `trap` so it runs even if the test command fails.

### Fix Focus Areas
- .github/workflows/backend-tests.yml[240-305]
- .github/workflows/backend-tests.yml[413-478]
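
One possible shape for the trap-guarded restoration described above, as a minimal sketch rather than the workflow's actual code. The helper names disable_defender_rt, restore_defender_rt and run_with_defender_off are hypothetical; the Set-MpPreference call mirrors the one quoted in the issue description.

```shell
#!/usr/bin/env bash
# Sketch only: all three helper names are hypothetical.

disable_defender_rt() {
  powershell -NoProfile -Command \
    'Set-MpPreference -DisableRealtimeMonitoring $true -ErrorAction SilentlyContinue' || true
}

restore_defender_rt() {
  # "|| true" so a failed restore cannot mask the test command's exit code.
  powershell -NoProfile -Command \
    'Set-MpPreference -DisableRealtimeMonitoring $false -ErrorAction SilentlyContinue' || true
}

# Runs "$@" with real-time monitoring off and returns its exit code.
run_with_defender_off() {
  (
    # The EXIT trap runs when this subshell ends for any reason, so
    # monitoring is switched back on even if the command fails.
    trap restore_defender_rt EXIT
    disable_defender_rt
    "$@"
    exit $?
  )
}

# Usage in the test step (command taken from the PR description):
#   run_with_defender_off pnpm test -- --exit
```

Scoping the trap to a subshell keeps the restore tied to the test command itself, so any post-test dump that follows in the same step runs with monitoring already back on. A step-level `trap restore_defender_rt EXIT` would be an alternative if the dumps should run while monitoring is still off.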


Remediation recommended

2. Sidecar errors discarded 🐞 Bug ◔ Observability
Description
The tasklist sidecar redirects PowerShell stderr to /dev/null, so CIM/WMI query failures will not
be recorded anywhere, making probe output misleading when data is missing. This affects both Windows
jobs because the same watcher loop is duplicated.
Code

.github/workflows/backend-tests.yml[R268-269]

Evidence
In both watcher loops, the PowerShell command that generates tasklist samples explicitly redirects
stderr to /dev/null, so any failures/warnings from Get-CimInstance are not captured in the
artifact directory.

.github/workflows/backend-tests.yml[259-271]
.github/workflows/backend-tests.yml[432-444]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The watcher loop discards stderr from the PowerShell `Get-CimInstance` call (`2>/dev/null`) and also suppresses failures (`|| true`). If CIM/WMI intermittently fails under load, `tasklist.log` may lack data with no recorded error context.

### Issue Context
Because the output is appended to an artifact for debugging flakes, losing the error stream reduces the probe's diagnostic value.

### Fix Focus Areas
- .github/workflows/backend-tests.yml[259-271]
- .github/workflows/backend-tests.yml[432-444]
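
A minimal sketch of a watcher loop that keeps the error stream, assuming a separate tasklist.err file next to tasklist.log. The function names and the tasklist.err file are hypothetical; the column list is the one given in the PR description, and the real loop in the workflow may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: the function names and tasklist.err are hypothetical.

# One sample of node.exe as seen by the OS.
sample_node_processes() {
  powershell -NoProfile -Command \
    "Get-CimInstance Win32_Process | Where-Object { \$_.Name -eq 'node.exe' } | Select-Object ProcessId,HandleCount,ThreadCount,WorkingSetSize,PageFileUsage,KernelModeTime,UserModeTime | Format-Table -AutoSize"
}

# $1 = artifact directory, $2 = number of samples, $3 = seconds between samples
watch_node_processes() {
  out_dir=$1
  samples=$2
  interval=${3:-0.5}
  i=0
  while [ "$i" -lt "$samples" ]; do
    {
      date -u +'=== %H:%M:%S ==='
      # stdout lands in tasklist.log; stderr is kept in tasklist.err
      # instead of being sent to /dev/null, and failures are noted there too.
      sample_node_processes 2>>"$out_dir/tasklist.err" ||
        echo "sample $i failed with exit code $?" >>"$out_dir/tasklist.err"
    } >>"$out_dir/tasklist.log"
    i=$((i + 1))
    sleep "$interval"
  done
}
```

A failed sample still leaves its timestamp header in tasklist.log, so a gap in the data can be matched against the corresponding entry in tasklist.err.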


Comment on lines +240 to +241
powershell -Command "Set-MpPreference -DisableRealtimeMonitoring \$true -ErrorAction SilentlyContinue; Get-MpPreference | Select-Object -Property DisableRealtimeMonitoring,DisableBehaviorMonitoring,DisableIOAVProtection,IsTamperProtected | Format-List" \
> "$OUT/defender-state-before.txt" 2>&1 || true

The first artifact upload step has `if: failure()` so we only see
node-report data on failure. For the Defender hypothesis (PR #7855)
we need to compare event-defender.txt between a passing run (baseline)
and a future failing run (kill signature) — otherwise N=1 captures
can't be evaluated. Add a second upload step gated on `always()`
that uploads only the small text files (event-*.txt, defender-*.txt)
on every run regardless of outcome. The unique `-${{ github.run_attempt }}`
suffix lets reruns accumulate separate artifacts for comparison.

Each artifact is only a few KB, so this doesn't materially impact storage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>