Prevent deadlock on DiagnosticsCollector #3348

Merged: cotti merged 5 commits into main from fix/diagnostics_channel_stale on May 20, 2026

cotti commented May 19, 2026 (Contributor)

This pull request improves the reliability and safety of the DiagnosticsCollector by ensuring diagnostics are properly drained even if StartAsync was never called, preventing deadlocks during disposal or stopping. It also introduces new tests to verify this behavior and adds a guard to WaitForDrain to avoid misuse. Additionally, it makes minor code quality improvements and adds a new property for tracking start state.

Diagnostics channel and collector robustness:

  • Modified DiagnosticsCollector.StopAsync to synchronously drain diagnostics if StartAsync was never called, ensuring diagnostics are not lost and preventing deadlocks on disposal or stop.
  • Added the IsStarted property to both DiagnosticsCollector and IDiagnosticsCollector to track whether a background reader is active.
  • Updated WaitForDrain to throw an exception if called before the collector is started, preventing silent failures or deadlocks.
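
The hang described in these bullets comes down to how a channel reports completion. As a minimal sketch, assuming the collector wraps an unbounded System.Threading.Channels channel (the variable names are illustrative and not taken from the PR), the reader's Completion task stays pending while any item is still queued, and finishes only once the writer is completed and the queue has been emptied:

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Stand-in for the collector's channel: unbounded, one queued item, no reader.
var channel = Channel.CreateUnbounded<string>();
channel.Writer.TryWrite("warning: example diagnostic");
channel.Writer.TryComplete();

// Completion stays pending while an item is still queued, so awaiting it
// at this point without a reader would never return.
var pendingBeforeDrain = !channel.Reader.Completion.IsCompleted;

// Synchronous drain: read everything that is queued.
var drained = 0;
while (channel.Reader.TryRead(out _))
    drained++;

// Writer completed and queue empty: Completion can now finish.
// The timeout is only a safety net for this sketch.
await channel.Reader.Completion.WaitAsync(TimeSpan.FromSeconds(5));
var completedAfterDrain = channel.Reader.Completion.IsCompletedSuccessfully;

Console.WriteLine($"pending before drain: {pendingBeforeDrain}");
Console.WriteLine($"items drained: {drained}");
Console.WriteLine($"completed after drain: {completedAfterDrain}");
```

This is why a stop path that awaits completion needs either a running reader or an explicit drain beforehand.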

Testing and regression coverage:

  • Added DiagnosticsCollectorDisposeTests to verify that disposing or stopping a collector without starting it does not hang and that diagnostics are still processed. Also tests that WaitForDrain throws if misused.

Code quality and documentation:

  • Improved comments and documentation in DiagnosticsChannel and DiagnosticsCollector to clarify channel behavior and method responsibilities.
  • Minor code cleanup, such as removing the unused using System.Diagnostics; directive and correcting method visibility.

cotti self-assigned this May 19, 2026
cotti requested a review from a team as a code owner May 19, 2026 05:31
cotti added the fix label May 19, 2026
cotti requested a review from Mpdreamz May 19, 2026 05:31
coderabbitai Bot commented May 19, 2026 (Contributor)

Walkthrough

This PR fixes potential deadlocks in DiagnosticsCollector when StopAsync or DisposeAsync are called without StartAsync having been invoked. It introduces an IsStarted property to track initialization and modifies StopAsync to synchronously drain queued diagnostics before awaiting the background task when the reader was never started. The interface adds a guard in WaitForDrain to prevent misuse by throwing InvalidOperationException if called before initialization. Channel semantics are documented, and regression tests verify the fix prevents hangs in edge cases.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant DiagnosticsCollector
  participant BackgroundReader
  participant Channel
  participant Outputs

  Caller->>DiagnosticsCollector: StopAsync()
  DiagnosticsCollector->>DiagnosticsCollector: check _readerStarted
  alt _readerStarted == false
    DiagnosticsCollector->>DiagnosticsCollector: Drain() synchronously
    DiagnosticsCollector-->>Caller: return
  else _readerStarted == true
    DiagnosticsCollector->>BackgroundReader: await _started
    BackgroundReader->>Channel: complete reader / finish loop
    BackgroundReader->>DiagnosticsCollector: drain queued items
    DiagnosticsCollector->>Outputs: write drained diagnostics
    DiagnosticsCollector-->>Caller: return
  end
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.76%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Prevent deadlock on DiagnosticsCollector' directly and clearly summarizes the main objective of the changeset: addressing deadlock issues in the DiagnosticsCollector. |
| Description check | ✅ Passed | The description comprehensively explains the changes: preventing deadlocks by draining diagnostics even when StartAsync is not called, adding IsStarted tracking, updating WaitForDrain, and adding regression tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot left a comment (Contributor)

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Elastic.Documentation/Diagnostics/DiagnosticsCollector.cs`:
- Line 33: IsStarted currently returns true just because _started is non-null,
which becomes true when Task.Run returns a canceled Task (StartAsync with a
canceled token) even though the reader never executed; change IsStarted to check
that the background task is actually running by returning _started is not null
&& !_started.IsCompleted (i.e. use _started?.IsCompleted to guard), and ensure
StopAsync/Drain() logic already uses IsStarted so it will call Drain() only when
the background task truly started; update the IsStarted property (and any
dependent callers if necessary) to use this completed-state check rather than
null-only.
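
The Task.Run behavior this finding relies on can be checked in isolation. A minimal sketch, assuming only the standard Task.Run(Func<Task>, CancellationToken) overload; the variable names (started, delegateRan, nullOnlyCheck) are illustrative stand-ins for the collector's fields, not its actual code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var cts = new CancellationTokenSource();
cts.Cancel();

var delegateRan = false;

// Same shape as a StartAsync that stores the Task returned by Task.Run.
Task started = Task.Run(async () =>
{
    delegateRan = true; // would mark "the reader is executing"
    await Task.Yield();
}, cts.Token);

// A null-only check reports "started" even though the delegate never ran.
var nullOnlyCheck = started is not null;

Console.WriteLine($"null-only check: {nullOnlyCheck}");
Console.WriteLine($"task canceled: {started.IsCanceled}");
Console.WriteLine($"delegate ran: {delegateRan}");
```

With an already-canceled token, the returned task is in the Canceled state and the delegate is never scheduled, so a flag set inside the delegate is a more reliable signal than the task reference alone.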

Comment thread src/Elastic.Documentation/Diagnostics/DiagnosticsCollector.cs Outdated
ConcurrentDictionary<string, bool> InUseSubstitutionKeys { get; }

/// True once StartAsync has been called and a background reader is draining the channel.
bool IsStarted => true;

Member

Can you explain why IsStarted is initialized with true?

I would assume a boolean like this would be initialized with false and only becomes true when something starts and changes it.

Member

++ It's fine for the interface not to have a default implementation.

Mpdreamz left a comment (Member)

Would it be simpler to gate the write?

public virtual void Write(Diagnostic diagnostic)
{
	IncrementSeverityCount(diagnostic);
	if (_started)
		Channel.Write(diagnostic);
}

Mpdreamz left a comment (Member)

Just some raw thoughts, wdyt @cotti

Ultimately, not calling .StartAsync() is not necessarily an exceptional case: a collector could be instantiated and disposed, and that should be fine too. Can we extend the tests to assert that as well?

// instead of deadlocking on Channel.Reader.Completion.
if (_started is null)
{
	Drain();

Member

I think this is potentially problematic in a different way, since nothing is calling waittoread.

Maybe we need a new flag after:

_ = await Channel.WaitToWrite(cancellationToken);

or after while (Channel.Reader.TryRead(out var item))?

And base the decision to await Channel.Reader.Completion on that?

We also need to wait for .NET 11.0 (see e.g. dotnet/runtime#123555) to be completely safe.

- Gate Channel.Write on a _readerStarted flag set inside the reader delegate,
  so writes are dropped when no reader will drain them (per Mpdreamz).
- _readerStarted, not _started != null, drives IsStarted; Task.Run with an
  already-canceled token leaves _started non-null without ever running the
  delegate (per CodeRabbit).
- StopAsync no longer needs a sync-drain branch — gating keeps the channel
  empty when no reader started. Keep a defensive Drain after awaiting the
  reader Task to mop up if it exited early via cancellation.
- Drop the default IsStarted => true on IDiagnosticsCollector and require
  implementers to declare it (per reakaleek + Mpdreamz).
- Add a regression test for instantiate-and-dispose with no emissions
  (per Mpdreamz).

Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai Bot left a comment (Contributor)

♻️ Duplicate comments (1)
src/Elastic.Documentation/Diagnostics/DiagnosticsCollector.cs (1)

101-104: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

_readerStarted startup race can drop diagnostics and skip shutdown waiting.

Between StartAsync returning and the delegate setting _readerStarted on Line 53, StopAsync can early-return and Write can drop items even though startup is already in-flight.

Suggested minimal fix
 public virtual async Task StopAsync(Cancel cancellationToken)
 {
 	Channel.TryComplete();
-	// No reader was ever scheduled, so Write() gated everything out and the
-	// channel is empty — nothing to await, nothing to drain.
-	if (!_readerStarted)
+	if (_started is null)
 		return;
 
 	try
 	{
 		await _started!;
 	}
 	catch (OperationCanceledException)
 	{
 		// Reader was canceled before its final Drain(); mop up below.
 	}
 	// Defensive: if the reader exited early via cancellation, items may
 	// still be queued. Drain them synchronously so they're not lost.
 	Drain();
 }

 public virtual void Write(Diagnostic diagnostic)
 {
 	IncrementSeverityCount(diagnostic);
-	if (_readerStarted)
+	if (_readerStarted || _started is { IsCompleted: false })
 		Channel.Write(diagnostic);
 }

Also applies to: 127-128

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Elastic.Documentation/Diagnostics/DiagnosticsCollector.cs` around lines
101 - 104, Startup race: _readerStarted may be observed false by StopAsync and
Write while StartAsync is still in-flight, causing dropped diagnostics and
skipped shutdown waits. Fix by ensuring _readerStarted is set before StartAsync
returns and before scheduling the reader delegate (or use an
atomic/interlocked/volatile write or a TaskCompletionSource that StartAsync
awaits and StopAsync/Write await/check) — update the startup path in StartAsync
(the code that schedules the reader delegate and currently sets _readerStarted)
to set the flag (or complete the TCS) deterministically before returning, and
change the StopAsync and Write checks (the same checks referenced at 127-128) to
observe that same atomic/awaitable signal so they never early-return while
startup is in-flight.

cotti and others added 2 commits May 20, 2026 10:44
Per review: only flip the flag once the reader has cleared its initial
setup, so it strictly means "the delegate executed past the first await".
A canceled token or a Task.Run short-circuit leaves the flag false.

Co-authored-by: Cursor <cursoragent@cursor.com>
…el_stale

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	src/Elastic.Documentation/Diagnostics/IDiagnosticsCollector.cs
cotti temporarily deployed to integration-tests May 20, 2026 13:49 with GitHub Actions (Inactive)
cotti requested a review from Mpdreamz May 20, 2026 13:51
The Write gate (`if (_readerStarted) Channel.Write(...)`) had a
scheduling race: between StartAsync returning and the Task.Run
delegate hitting `await Channel.WaitToWrite`, any emitted diagnostic
was silently dropped. This broke the F# authoring suite, which
synchronously schedules the reader and emits diagnostics before the
reader had a chance to flip the flag.

- Write unconditionally — UnboundedChannel never blocks, and items in
  an undrained channel are GC'd with the collector.
- StopAsync now gates on `_started is null` (was StartAsync ever
  called?) rather than `_readerStarted` (is the reader executing?).
  Once StartAsync was called with a non-canceled token, awaiting
  `_started` is always meaningful: it completes when the reader exits
  after draining the channel.
- IsStarted still backed by `_readerStarted` for accurate
  introspection (execution proof, per earlier review feedback).
- Drop the leftover //TODO in DiagnosticsChannel.Write.

Co-authored-by: Cursor <cursoragent@cursor.com>
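
The shape this commit message describes (ungated writes, with the stop path keyed on whether a start was ever requested) can be sketched roughly as follows. This is not the merged implementation: it assumes a plain System.Threading.Channels unbounded channel, and RunScenario plus the local StartAsync, Write, StopAsync and Drain functions are hypothetical stand-ins named after the methods discussed in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Runs one collector lifecycle and returns how many diagnostics reached the output.
async Task<int> RunScenario(bool callStart)
{
    var channel = Channel.CreateUnbounded<string>();
    var outputs = new List<string>();
    Task? started = null; // stays null until StartAsync is called

    void Drain()
    {
        while (channel.Reader.TryRead(out var item))
            outputs.Add(item);
    }

    void StartAsync(CancellationToken ct) =>
        started = Task.Run(async () =>
        {
            while (await channel.Reader.WaitToReadAsync(ct))
                Drain();
        }, ct);

    // Ungated write: an unbounded channel accepts the item without blocking.
    void Write(string diagnostic) => channel.Writer.TryWrite(diagnostic);

    async Task StopAsync()
    {
        channel.Writer.TryComplete();
        if (started is null)
            return; // StartAsync was never called, so there is no reader to await

        try { await started; }
        catch (OperationCanceledException) { /* reader canceled; mop up below */ }
        Drain(); // defensive: pick up anything a canceled reader left behind
    }

    if (callStart)
        StartAsync(CancellationToken.None);
    Write("warning: emitted right after start"); // may land before the reader runs
    await StopAsync();
    return outputs.Count;
}

var withReader = await RunScenario(callStart: true);
var withoutReader = await RunScenario(callStart: false);
Console.WriteLine($"with reader: {withReader}, without reader: {withoutReader}");
```

In this sketch, a diagnostic written immediately after the start call is still delivered, because the write does not depend on the reader having begun executing. When no start is requested, the stop call returns at once and the queued item is simply left in the channel.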
cotti temporarily deployed to integration-tests May 20, 2026 14:08 with GitHub Actions (Inactive)
cotti enabled auto-merge (squash) May 20, 2026 14:10
cotti merged commit 0464e10 into main May 20, 2026
24 checks passed
cotti deleted the fix/diagnostics_channel_stale branch May 20, 2026 14:20
3 participants