Exempting specific checks #169

Open
7 tasks
brendanheywood opened this issue Nov 22, 2023 · 3 comments
Comments

@brendanheywood
Contributor

We want the ability to add extra config in heartbeat which applies on top of all the status checks. The requirements are:

  • an override downgrades the check from critical to a warning
  • an override must have a note attached with a url
  • an override must have a time expiry
  • a nice simple GUI shows any checks which are currently not green, with a button to take you to the exempt page
  • an exemption can be extended after it has expired, or before it expires
  • a report similar to the existing status page also shows the overrides
  • checks which are managed inside heartbeat get bespoke handling, to enable exemption of specific adhoc tasks or scheduled tasks
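
To make the first few requirements concrete, here is a rough sketch of what an override record and the downgrade rule could look like. It is written in Python purely for illustration (the plugin itself is not Python), and every name in it (`Override`, `effective_status`, the status strings) is an assumption, not existing heartbeat code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical status names, chosen only for this sketch.
OK = "ok"
WARNING = "warning"
CRITICAL = "critical"


@dataclass
class Override:
    """Hypothetical exemption record for a single check."""
    check_ref: str
    note: str
    url: str
    expires_at: datetime

    def __post_init__(self) -> None:
        # Requirement: there must be a note attached with a url.
        if not self.note or not self.url:
            raise ValueError("an override needs both a note and a url")

    def is_active(self, now: datetime) -> bool:
        return now < self.expires_at

    def extend(self, new_expiry: datetime) -> None:
        # Works whether or not the override has already expired.
        if new_expiry <= self.expires_at:
            raise ValueError("new expiry must be later than the current one")
        self.expires_at = new_expiry


def effective_status(status: str, override: Optional[Override], now: datetime) -> str:
    """Downgrade critical to warning while an override is in place."""
    if status == CRITICAL and override is not None and override.is_active(now):
        return WARNING
    return status
```

In this sketch an expired override simply stops downgrading the check, so the check goes back to critical until someone extends the override.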
@brendanheywood
Contributor Author

Had a chat with @matthewhilton and I think the cleanest design is that overrides only ever happen at the check level, which will make the logic of handling them much easier. But this kind of moves the problem: we still have some checks like 'are there any slow tasks' which are really checking a whole bunch of things, and when they fail it could be on any type of task.

So the solution I have in mind is that various checks (only the ones in heartbeat) conditionally declare multiple checks, one for each class of issue. So let's say that a site is green and there are 100 tasks and they are all good, then there is 1 check and it is green.

Now let's say that 3 types of task start to fail. Then we will see the one original check which says 97 tasks are good, plus 3 new extra checks which say 'task foo is broken', 'task bar is broken', 'task blah is broken', and now we can address each of them individually. In other words the main check will never actually fail, it will only spawn failing checks. It also means the logic of looking for failing tasks needs to be moved back into lib.php (or called from there) rather than living inside the result object. A little weird but I think it's ok for this situation, as it's a fast query and it only moves the perf hit to a bit earlier.
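
A rough illustration of the "one summary check plus one check per failing task" idea, again in Python only as a sketch. The `Check` class, the `declare_task_checks` function and the `task:<name>` reference format are all made up for this example; the real declaration would live in lib.php as described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Check:
    """Hypothetical declared check: a reference, a status and a summary."""
    ref: str
    status: str
    summary: str


def declare_task_checks(task_failed: Dict[str, bool]) -> List[Check]:
    """Declare one always-green summary check plus one check per failing task.

    task_failed maps a task name to True when that task is currently failing.
    """
    failing = sorted(name for name, failed in task_failed.items() if failed)
    healthy = len(task_failed) - len(failing)

    # The main check never fails; it only reports how many tasks are good.
    checks = [Check("tasks", "ok", f"{healthy} tasks are good")]

    # Each failing task gets its own check, so it can be overridden on its own.
    for name in failing:
        checks.append(Check(f"task:{name}", "critical", f"task {name} is broken"))
    return checks
```

With 100 tasks of which foo, bar and blah are failing, this returns four checks: the green summary check reporting 97 good tasks, and three critical checks that can each be exempted separately.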

@brendanheywood
Contributor Author

One more thing: if we mark a failing check as muted for a month, and after that month the check is actually resolved (either the check is no longer declared, or it is declared and passing), then I think we should explicitly mark the override as having been resolved. If the check is still failing then it is shown as overdue and keeps alerting, and someone will probably extend it again and / or resolve it properly. We want a full audit trail of who added overrides, for when the same check fails intermittently over time.
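
The expiry rules above could be sketched roughly like this. It is Python for illustration only, and the state names (`active`, `resolved`, `overdue`), the `Override` fields and the shape of the audit history are all assumptions rather than anything that exists in heartbeat today.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Override:
    """Hypothetical override with an append-only audit history."""
    check_ref: str
    added_by: str
    created_at: datetime
    expires_at: datetime
    # Each entry is (when, who, action); entries are never removed.
    history: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.history.append((self.created_at, self.added_by, "added"))

    def extend(self, who: str, when: datetime, new_expiry: datetime) -> None:
        self.expires_at = new_expiry
        self.history.append((when, who, "extended"))


def override_state(override: Override, check_status: Optional[str], now: datetime) -> str:
    """Classify an override; check_status is None when the check is no longer declared."""
    if now < override.expires_at:
        return "active"
    if check_status is None or check_status == "ok":
        # Expired and the underlying problem has gone away.
        return "resolved"
    # Expired and still failing: keep alerting until extended or fixed.
    return "overdue"
```

Keeping the history on the override itself means that if the same check is exempted several times, each record still shows who added it and who extended it.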

@owenherbert
Contributor

I've created a consolidated README.md file on what these changes would look like. Let's discuss it on Monday.

bwalkerl added a commit that referenced this issue Mar 8, 2024
Co-authored-by: Owen Herbert <owenherbert@catalyst-au.net>
brendanheywood pushed a commit that referenced this issue Mar 12, 2024
Co-authored-by: Owen Herbert <owenherbert@catalyst-au.net>