Skip to content

Refactor health check service to be more responsive#7261

Merged
JamesNK merged 6 commits intomainfrom
jamesnk/healthcheckservice-refactor
Jan 27, 2025
Merged

Refactor health check service to be more responsive#7261
JamesNK merged 6 commits intomainfrom
jamesnk/healthcheckservice-refactor

Conversation

@JamesNK
Copy link
Copy Markdown
Member

@JamesNK JamesNK commented Jan 27, 2025

Description

Fixes #6112

The health check service immediately checks health when a resource moves to a running state. However, if the intial report is unhealthy then it waits 5 seconds before checking again. This PR refactors the health check service to be more responsive.

  • If a health report isn't healthy then the delay interval starts at 1 second and increases by 1 second per failure up to the existing 5 sec. Healthy delay is still 30 seconds.
  • Refactored the health check loop to use an instant interrupt instead of polling the state once per second. The DelayAsync method is complicated, but the actual health monitoring loop is much simpler.
  • Allow setting the health delays from tests so time based tests run a lot faster. No more 30 second tests.
  • No more blocking in tests: AutoResetEvent 👎

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?

@JamesNK JamesNK added the area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication label Jan 27, 2025
@JamesNK JamesNK requested review from davidfowl and eerhardt January 27, 2025 07:21
@JamesNK JamesNK requested a review from mitchdenny as a code owner January 27, 2025 07:22
Comment thread src/Aspire.Hosting/Health/ResourceHealthCheckService.cs Outdated
Comment thread src/Aspire.Hosting/Health/ResourceHealthCheckService.cs Outdated
Comment thread src/Aspire.Hosting/Health/ResourceHealthCheckService.cs Outdated
Comment thread src/Aspire.Hosting/Health/ResourceHealthCheckService.cs
Copy link
Copy Markdown
Contributor

@davidfowl davidfowl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit about adding comments but LGTM

@JamesNK JamesNK enabled auto-merge (squash) January 27, 2025 11:15
@JamesNK JamesNK merged commit d18ce7a into main Jan 27, 2025
@JamesNK JamesNK deleted the jamesnk/healthcheckservice-refactor branch January 27, 2025 12:05
ResourceMonitorState? state;

_latestEvents[resourceName] = resourceEvent;
lock (_resourceMonitoringStates)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code isn't multi-threaded, right? There is just 1 thread/loop processing these events. So is there a need to lock here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there is an internal GetResourceMonitorState that is used for tests that could be on a different thread.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I didn't want tests to fail because the dictionary was in a bad state during a get.

The lock will never be contended in the real world, and this isn't performance critical code, so it seemed better to add some safety in the cause of test stability.

@github-actions github-actions Bot locked and limited conversation to collaborators Feb 27, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow Health Check intervals to be configurable

3 participants