Skip to content

Fix flaky NRE in ConcurrentLfuSoakTests scheduler teardown#758

Closed
Copilot wants to merge 3 commits intomainfrom
copilot/fix-flaky-ci-failure
Closed

Fix flaky NRE in ConcurrentLfuSoakTests scheduler teardown#758
Copilot wants to merge 3 commits intomainfrom
copilot/fix-flaky-ci-failure

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 13, 2026

RunIntegrityCheckAsync used as BackgroundThreadScheduler then immediately dereferenced the result, causing a spurious NullReferenceException if the cast returned null — masking the real failure and making the soak test flaky.

Changes

  • ConcurrentLfuSoakTests.cs — both RunIntegrityCheckAsync overloads (ConcurrentLfu<int,string> and ConcurrentLfu<string,string>): replace the silent as-cast-and-dereference with a FluentAssertions type assertion followed by a direct cast.
// Before — null-deref if scheduler isn't BackgroundThreadScheduler
var scheduler = lfu.Scheduler as BackgroundThreadScheduler;
scheduler.Dispose();
await scheduler.Completion;

// After — fails with a diagnostic message instead of an opaque NRE
lfu.Scheduler.Should().BeOfType<BackgroundThreadScheduler>(
    "soak tests require a BackgroundThreadScheduler to drain deterministically");
var scheduler = (BackgroundThreadScheduler)lfu.Scheduler;
scheduler.Dispose();
await scheduler.Completion;

If the scheduler type ever changes unexpectedly the test now produces a clear assertion failure rather than an unattributed NullReferenceException.

Original prompt

Create a fix for a flaky CI failure in BitFaster.Caching.

Context:

  • GitHub Actions job failed with: System.NullReferenceException in BitFaster.Caching.UnitTests.Lfu.ConcurrentLfuSoakTests.WhenConcurrentGetCacheEndsInConsistentState(iteration: 5).
  • Stack trace points to: BitFaster.Caching.UnitTests/Lfu/ConcurrentLfuSoakTests.cs at the call site await RunIntegrityCheckAsync(lfu, iteration); (around line 164).
  • The failure is caused by RunIntegrityCheckAsync doing var scheduler = lfu.Scheduler as BackgroundThreadScheduler; scheduler.Dispose(); await scheduler.Completion; which can dereference null if the scheduler is not a BackgroundThreadScheduler at runtime (or if it is unexpectedly null), leading to flaky NRE.

Goal:

  • Update the test code so it never throws NullReferenceException during scheduler teardown.
  • Prefer a deterministic and diagnostic failure when the scheduler isn’t the expected type.

Implementation guidance:

  1. In BitFaster.Caching.UnitTests/Lfu/ConcurrentLfuSoakTests.cs, update BOTH overloads of RunIntegrityCheckAsync (the ConcurrentLfu<int,string> and ConcurrentLfu<string,string> versions).
  2. Replace the as BackgroundThreadScheduler + dereference with either:
    • an assertion that lfu.Scheduler is of type BackgroundThreadScheduler, then cast and dispose/await completion; OR
    • a null-safe branch that disposes/awaits only when scheduler is background scheduler, and otherwise logs the actual scheduler type.
      Prefer the assertion approach to catch unexpected scheduler changes.
  3. Ensure the test still runs RunIntegrityCheck(lfu, output) after stopping the scheduler.
  4. Keep changes minimal and limited to test code unless you discover a deeper issue.

Reference:

Please open a PR with the fix and include a concise description in the PR body.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The job is failing due to a flaky NullReferenceException in the soak test ConcurrentLfuSoakTests.WhenConcurrentGetCacheEndsInConsistentState(iteration: 5) on macOS (dotnet test -f net6.0). The stack points at the integrity-check call site in ConcurrentLfuSoakTests.cs:164 (ref 49e105fd56fd3b880ba7024c3c2f8ce4948591bd), and the most likely null is the background scheduler cast in RunIntegrityCheckAsync:

var scheduler = lfu.Scheduler as BackgroundThreadScheduler;
scheduler.Dispose();          // <- scheduler can be null => NRE
await scheduler.Completion;

Even though the test constructs the cache with a BackgroundThreadScheduler, under certain race/teardown conditions (or if the scheduler gets swapped/disposed unexpectedly), that cast can return null, turning the test failure into a spurious NullReferenceException instead of a meaningful assertion failure.

Fix: make the test robust and fail with a clear message (or handle non-background scheduler)

Update both overloads of RunIntegrityCheckAsync to (1) avoid as+dereference, and (2) produce a clear failure if the scheduler isn’t the expected type.

Suggested patch (same change for both int and string overloads):

private async Task RunIntegrityCheckAsync(ConcurrentLfu<int, string> lfu, int iteration)
{
    this.output.WriteLine($"iteration {iteration} keys={string.Join(" ", lfu.Keys)}");

    lfu.Scheduler.Should().BeOfType<BackgroundThreadScheduler>(
        "these soak tests require a BackgroundThreadScheduler so we can await drain/stop deterministically");

    var scheduler = (BackgroundThreadScheduler)lfu.Scheduler;

    scheduler.Dispose();
    await scheduler.Completion;

    RunIntegrityCheck(lfu, this.output);
}

If you prefer not to assert the type, then at least guard and still run the integrity check:

var scheduler = lfu.Scheduler as BackgroundThreadScheduler;
if (scheduler != null)
{
    scheduler.Dispose();
    await scheduler.Completion;
}
else
{
    this.output.WriteLine($"Warning: scheduler was {lfu.Scheduler?.GetType().FullName ?? "null"}");
}
RunIntegrityCheck(lfu, this.output);

Why this solves the CI failure

  • It removes the NullReferenceException caused by dereferencing a potentially-null scheduler variable.
  • If the scheduler is unexpectedly not a BackgroundThreadScheduler, the test will now fail with a diagnostic assertion explaining what went wrong, instead of an opaque NRE.
  • This keeps the test focused on cache integrity, not on incidental teardown races.

Optional CI hardening (if you still see rare integrity flakiness)

These are “soak” tests ([Collection("Soak")], r...

This pull request was created from Copilot chat.

…eduler type before cast

Agent-Logs-Url: https://github.com/bitfaster/BitFaster.Caching/sessions/d56560c8-1ab4-47a6-8acc-3d850c94848b

Co-authored-by: bitfaster <12851828+bitfaster@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix flaky CI failure in BitFaster.Caching Fix flaky NRE in ConcurrentLfuSoakTests scheduler teardown Apr 13, 2026
Copilot AI requested a review from bitfaster April 13, 2026 06:49
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 13, 2026

Coverage Status

coverage: 98.931% (-0.1%) from 99.037% — copilot/fix-flaky-ci-failure into main

@bitfaster bitfaster closed this Apr 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants