test(ci): probe — disable Defender + capture Event Log + fix tasklist sidecar by JohnMcLear · Pull Request #7855 · ether/etherpad

JohnMcLear · 2026-05-26T18:59:49Z

Summary

Workflow-only probe targeting the Windows silent-ELIFECYCLE flake. Three orthogonal additions in one PR:

A: disable Microsoft Defender real-time monitoring for the duration of the test step
H: clear Application + System event logs pre-test, dump them post-test (pass or fail) to the artifact directory
I: fix the tasklist sidecar from #7846 ("test(ci): OS-level sidecar watcher for the Windows silent ELIFECYCLE"). It was producing empty output because git-bash mangles tasklist's UTF-16 BOM output; it now uses Get-CimInstance Win32_Process for clean ASCII, with HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime, UserModeTime sampled every 500 ms

Why

Every hypothesis tested so far is RULED OUT:

| Hypothesis | Ruled out by | How |
| --- | --- | --- |
| Memory/handle leak | every pre-kill node-report | nominal state |
| TIME_WAIT accumulation | #7852 | keepAlive collapsed it, kill survived |
| Rapid-sequential cadence | #7854 | setImmediate yield in root hook, kill survived |
| File-specific pathology | death corpus | 7 files, same fingerprint |

The kill fingerprint (silent external termination, no JS-handler trace, no native abort report, sub-1s death window) matches Microsoft Defender's behavioural-monitoring TerminateProcess signature more closely than any other plausible cause. Defender is enabled by default on GHA Windows runners, and rapid loopback TCP fanout is on its suspect-process-behaviour list. We've simply never tested it.

If kills disappear with RT off → causal. If kills persist but event-defender.txt shows pre-kill detection entries → Defender is involved with a more nuanced trigger. If kills persist with no Defender entries → the OS event log will name the actual terminator (Service Control Manager, kernel guard, Werfault, etc.).
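
The three-way reading above can be sketched as a small triage helper for a downloaded artifact. This is only an illustration: `triage_kill` is a hypothetical name, and the default `detect` pattern is an assumption about how detection entries are rendered in event-defender.txt, not something taken from the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper (not part of the PR).
# $1: "yes" if the run hit the silent ELIFECYCLE kill, anything else otherwise
# $2: path to event-defender.txt from the artifact
# $3: pattern that marks a detection entry (ASSUMPTION: defaults to "detect")
triage_kill() {
  local kill_seen="$1"
  local defender_log="$2"
  local pattern="${3:-detect}"

  if [ "$kill_seen" != "yes" ]; then
    # No kill with RT off; over enough runs this points at Defender as causal.
    echo "no-kill"
  elif [ -s "$defender_log" ] && grep -qi -- "$pattern" "$defender_log"; then
    # Kill persisted and Defender logged something: more nuanced trigger.
    echo "defender-involved"
  else
    # Kill persisted with no Defender entries: read event-application.txt
    # and event-system.txt for the actual terminator.
    echo "check-os-event-logs"
  fi
}
```

A single "no-kill" result says little on its own; it only becomes evidence across enough reruns.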

What's captured in the artifact on failure

node-report/
├── defender-state-before.txt  # pre-test Defender config
├── defender-state-after.txt   # post-test Defender config (sanity)
├── event-clear.txt            # confirmation logs were cleared
├── event-application.txt      # last 500 Application events with timestamps
├── event-system.txt           # last 500 System events
├── event-defender.txt         # last 200 Defender Operational events
├── event-app-errors.txt       # specifically Application Error / Hang / WER
├── netstat.log                # (existing) localhost TCP every 500ms
├── tasklist.log               # (NOW WORKING) node.exe handle/CPU/RSS every 500ms
└── be-NNNN-*.json / hb-* / mt-*  # (existing) Node diagnostic reports
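
For orientation, one way the event-*.txt files could be produced from the bash step is sketched below. This is an assumption, not a copy of the workflow: `dump_event_log` is a hypothetical helper, and it uses PowerShell's `Get-WinEvent` with `-LogName` and `-MaxEvents`, which may differ from what the PR actually runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the workflow): dump the newest events of
# one Windows event log to a text file, never failing the surrounding step.
# $1: log name, e.g. "Application" or
#     "Microsoft-Windows-Windows Defender/Operational"
# $2: maximum number of events to keep
# $3: output file inside the artifact directory
dump_event_log() {
  local log_name="$1" max_events="$2" out_file="$3"
  # stderr is captured too, so a "no events found" error lands in the file
  # instead of disappearing.
  powershell -Command "Get-WinEvent -LogName '$log_name' -MaxEvents $max_events | Format-List TimeCreated,Id,ProviderName,LevelDisplayName,Message" \
    > "$out_file" 2>&1 || true
}
```

With `$OUT` pointing at node-report/, `dump_event_log Application 500 "$OUT/event-application.txt"` would correspond to the "last 500 Application events" entry in the tree.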

What this doesn't change

No code changes (workflow only)
No test changes
pnpm test -- --exit invocation unchanged on this branch (the --exit probe is probe-flake-no-exit-flag on a separate branch, run in parallel)
Linux jobs untouched
The Defender disable only applies to the test step on Windows; runner is reset between jobs anyway

Test plan

Linux ± plugins must pass (probe touches Windows only)
Windows ± plugins backend test reruns 5+ times to compare flake rate vs the ~22% baseline
On any failure: pull artifact, inspect event-*.txt for who terminated Node
tasklist.log shows real columns this time (not just headers)
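
A rough check on how much the rerun count can tell us, assuming runs fail independently at the ~22% baseline (independence is an assumption; the helper names are illustrative):

```shell
#!/usr/bin/env bash
# p_all_pass RATE RUNS: chance that RUNS independent runs all pass when each
# run fails with probability RATE.
p_all_pass() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

# runs_needed RATE ALPHA: smallest number of consecutive passing runs for
# which that chance drops below ALPHA.
runs_needed() {
  awk -v p="$1" -v a="$2" \
    'BEGIN { n = 0; q = 1; while (q >= a) { q *= (1 - p); n++ } print n }'
}

p_all_pass 0.22 5       # prints 0.289
runs_needed 0.22 0.05   # prints 13
```

So five clean reruns would still happen about 29% of the time with the flake fully present; under these assumptions roughly 13 consecutive passes are needed before a clean streak is unlikely (below 5%) to be chance.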

🤖 Generated with Claude Code

…tasklist sidecar

Three orthogonal probes against the Windows silent-ELIFECYCLE flake, landed in one PR because they're all workflow-only and complementary.

PROBE A: Defender real-time monitoring OFF for the test phase. The kill fingerprint (silent external termination, no JS-handler trace, no native abort report, sub-1s death window) matches Microsoft Defender's behavioural-monitoring TerminateProcess signature. GHA Windows runners have Defender RT enabled by default, and rapid loopback TCP fanout is on Defender's "suspect process behaviour" list. If kills disappear with RT off → causal, this PR is the fix-as-mitigation; if not → Defender ruled out.

PROBE H: pre-test wevtutil clear + post-test event log dump. We've never looked at the Windows event log around the kill. `Application`, `System`, `Microsoft-Windows-Windows Defender/Operational`, and the `Application Error`/`Application Hang`/`Windows Error Reporting` providers between them will surface who killed the process: Defender, Service Control Manager, Werfault, kernel guard, etc. Clear the logs pre-test so signal-to-noise is high; dump post-test regardless of pass/fail.

PROBE I: tasklist sidecar fix (latent bug from PR #7846). The bash `tasklist /v /fi "imagename eq node.exe" /fo csv` produced empty output on the runner because git-bash mangles tasklist's UTF-16-LE-with-BOM output. Switch to PowerShell's Get-CimInstance Win32_Process with explicit columns. This gives us the OS-side equivalent of the libuv handle table (HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime, UserModeTime) sampled every 500 ms. When Node's `_getActiveHandles` goes silent during the V8 starvation window, the OS still sees the process; this captures that view.

All three additions land in node-report/ which the existing artifact upload picks up on failure. No test-code changes. No new dependencies.

Expected outcomes:
- Defender root cause: Win-with-plugins flake rate drops materially over 5+ runs. event-defender.txt shows pre-kill threat-detection entries on the kills that DO still happen.
- Defender not the root cause: event-application.txt / event-system.txt names the actual terminator (Service Control Manager, kernel, Werfault). Probe G (procdump) is the next step.
- Neither: kernel-level kill bypassing all event logging. This escalates to ETW tracing or a procdump on kill-detect trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qodo-code-review · 2026-05-26T18:59:53Z

Qodo reviews are paused for this user.

qodo-free-for-open-source-projects · 2026-05-26T19:00:41Z

Review Summary by Qodo

Probe Windows silent-ELIFECYCLE flake with Defender, Event Log, tasklist fixes

🧪 Tests

Walkthroughs

Description

• Disable Windows Defender real-time monitoring during test execution
• Clear and capture Windows Event Log (Application, System, Defender) for diagnostics
• Fix tasklist sidecar to use PowerShell instead of bash for proper process metrics
• Add pre/post Defender state verification and process termination event tracking

Diagram

flowchart LR
  A["Test Execution"] --> B["PROBE A: Disable Defender RT"]
  A --> C["PROBE H: Clear Event Logs"]
  A --> D["PROBE I: Fix tasklist Sidecar"]
  B --> E["Capture Defender State Before/After"]
  C --> F["Dump Application/System/Defender Events"]
  D --> G["Sample Process Metrics via PowerShell"]
  E --> H["Artifact: defender-state-*.txt, event-*.txt, tasklist.log"]
  F --> H
  G --> H

File Changes

1. .github/workflows/backend-tests.yml 🧪 Tests +110/-32

Windows CI workflow probes for silent process termination flake

• Added PROBE A: disable Windows Defender real-time monitoring via Set-MpPreference before test
 execution
• Added PROBE H (pre): clear Application and System event logs with wevtutil cl to reduce noise
• Added PROBE H (post): dump Windows Event Log entries (Application, System, Defender Operational,
 error/hang events) to artifact directory post-test
• Fixed PROBE I: replaced bash tasklist command with PowerShell Get-CimInstance Win32_Process to
 capture process metrics (HandleCount, ThreadCount, WorkingSetSize, PageFileUsage, KernelModeTime,
 UserModeTime) in clean ASCII format
• Added Defender state verification before and after test to confirm RT monitoring remained disabled
• Applied identical changes to both Windows test job sections (backend-tests and
 backend-tests-plugins)

qodo-free-for-open-source-projects · 2026-05-26T19:00:43Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)

1. Defender not restored 🐞 Bug ⛨ Security

Description

The Windows backend test step disables Microsoft Defender real-time monitoring but never re-enables
it, so subsequent steps in the same job run with AV protection reduced. This is a security posture
regression introduced by this PR in both Windows jobs.

Code

.github/workflows/backend-tests.yml[R240-241]

Evidence

The workflow sets DisableRealtimeMonitoring to $true and later only records the Defender state
(Get-MpPreference) without ever setting it back to $false, then proceeds to subsequent steps in
the job (e.g., vitest). The same pattern is duplicated in the plugins Windows job.

.github/workflows/backend-tests.yml[218-317]
.github/workflows/backend-tests.yml[391-489]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The workflow disables Microsoft Defender real-time monitoring via `Set-MpPreference -DisableRealtimeMonitoring $true` but never restores it to `$false` before exiting the step, leaving later job steps running with Defender RT disabled.

### Issue Context
This occurs in both Windows backend test jobs (with and without plugins). The step already has a post-test section; restoration should happen there and ideally be guarded with a bash `trap` so it runs even if the test command fails.

### Fix Focus Areas
- .github/workflows/backend-tests.yml[240-305]
- .github/workflows/backend-tests.yml[413-478]
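
The `trap` guard suggested above could look roughly like the sketch below. `restore_defender` and `run_with_defender_restored` are hypothetical names, and how this would slot into the existing step is an assumption rather than the PR's actual fix.

```shell
#!/usr/bin/env bash
# Sketch only; helper names are illustrative, not from the workflow.

# Re-enable real-time monitoring. The restore itself must never fail the step.
restore_defender() {
  if command -v powershell >/dev/null 2>&1; then
    powershell -Command "Set-MpPreference -DisableRealtimeMonitoring \$false" || true
  fi
}

# Run the given command in a subshell whose EXIT trap restores Defender, so the
# restore happens whether the command passes, fails, or exits early. The exit
# status of the command is passed through unchanged.
run_with_defender_restored() {
  (
    trap restore_defender EXIT
    "$@"
  )
}
```

Usage would be along the lines of `run_with_defender_restored pnpm test -- --exit`. Later steps in the job then run with real-time monitoring back on, assuming tamper protection allows the setting to be changed at all.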

2. Sidecar errors discarded 🐞 Bug ◔ Observability

Description

The tasklist sidecar redirects PowerShell stderr to /dev/null, so CIM/WMI query failures will not
be recorded anywhere, making probe output misleading when data is missing. This affects both Windows
jobs because the same watcher loop is duplicated.

Code

.github/workflows/backend-tests.yml[R268-269]

Evidence

In both watcher loops, the PowerShell command that generates tasklist samples explicitly redirects
stderr to /dev/null, so any failures/warnings from Get-CimInstance are not captured in the
artifact directory.

.github/workflows/backend-tests.yml[259-271]
.github/workflows/backend-tests.yml[432-444]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The watcher loop discards stderr from the PowerShell `Get-CimInstance` call (`2>/dev/null`) and also suppresses failures (`|| true`). If CIM/WMI intermittently fails under load, `tasklist.log` may lack data with no recorded error context.

### Issue Context
Because the output is appended to an artifact for debugging flakes, losing the error stream reduces the probe's diagnostic value.

### Fix Focus Areas
- .github/workflows/backend-tests.yml[259-271]
- .github/workflows/backend-tests.yml[432-444]
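
One possible shape for keeping the error stream is sketched below. `sample_to_log` and the `tasklist.err` file name are hypothetical; the idea is simply to append stderr and any non-zero exit to a second file in the artifact directory instead of discarding them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not from the workflow): run one sampler command,
# append its stdout to the sample log and its stderr to an error log.
# $1: sample log (e.g. "$OUT/tasklist.log")
# $2: error log  (e.g. "$OUT/tasklist.err", an assumed file name)
# remaining args: the sampler command to run
sample_to_log() {
  local out_log="$1" err_log="$2"
  shift 2
  local rc=0
  "$@" >>"$out_log" 2>>"$err_log" || rc=$?
  if [ "$rc" -ne 0 ]; then
    # Record when and how the sampler failed, so gaps in the sample log
    # can be explained later.
    printf '%s sampler exited %d: %s\n' \
      "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rc" "$*" >>"$err_log"
  fi
  return 0   # keep the watcher loop alive, as `|| true` did
}
```

The watcher loop would then call `sample_to_log` around its existing `powershell -Command "Get-CimInstance ..."` invocation, and the error file would need to be covered by the artifact upload path.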

qodo-free-for-open-source-projects · 2026-05-26T19:03:54Z

+          powershell -Command "Set-MpPreference -DisableRealtimeMonitoring \$true -ErrorAction SilentlyContinue; Get-MpPreference | Select-Object -Property DisableRealtimeMonitoring,DisableBehaviorMonitoring,DisableIOAVProtection,IsTamperProtected | Format-List" \
+            > "$OUT/defender-state-before.txt" 2>&1 || true


1. Defender not restored 🐞 Bug ⛨ Security

The Windows backend test step disables Microsoft Defender real-time monitoring but never re-enables it, so subsequent steps in the same job run with AV protection reduced. This is a security posture regression introduced by this PR in both Windows jobs.

Agent Prompt

### Issue description
The workflow disables Microsoft Defender real-time monitoring via `Set-MpPreference -DisableRealtimeMonitoring $true` but never restores it to `$false` before exiting the step, leaving later job steps running with Defender RT disabled.

### Issue Context
This occurs in both Windows backend test jobs (with and without plugins). The step already has a post-test section; restoration should happen there and ideally be guarded with a bash `trap` so it runs even if the test command fails.

### Fix Focus Areas
- .github/workflows/backend-tests.yml[240-305]
- .github/workflows/backend-tests.yml[413-478]

The first artifact upload step has `if: failure()` so we only see node-report data on failure. For the Defender hypothesis (PR #7855) we need to compare event-defender.txt between a passing run (baseline) and a future failing run (kill signature); otherwise N=1 captures can't be evaluated.

Add a second upload step gated on `always()` that uploads only the small text files (event-*.txt, defender-*.txt) on every run regardless of outcome. The unique `-${{ github.run_attempt }}` suffix lets reruns accumulate separate artifacts for comparison. Each artifact is only a few KB, so this doesn't materially impact storage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JohnMcLear mentioned this pull request May 26, 2026

test(ci): probe — drop mocha --exit on Windows backend tests #7856

Open

3 tasks

qodo-free-for-open-source-projects Bot reviewed May 26, 2026

qodo-code-review Bot deleted a comment from qodo-free-for-open-source-projects Bot May 26, 2026

JohnMcLear wants to merge 2 commits into develop from probe-flake-defender-eventlog-sidecar
