A problem state may be opened which will never be closed by detection #53

Closed
hmpf opened this issue Jun 17, 2020 · 6 comments

@hmpf
Contributor

hmpf commented Jun 17, 2020

There is another issue which may need to be discussed, related to my experience with NAV.

A problem state may be opened which will never be closed by detection. E.g.:

  • If an admin physically removes a piece of hardware from a router, NAV will detect this and flag the problem.
  • However, the hardware was removed on purpose and will never be re-inserted. NAV will only consider the problem resolved once it detects the hardware as having been re-inserted, so the problem will stay open indefinitely.
  • The NAV admin solves this by manually resolving the problem from NAV's status UI.
  • However, this closes the problem state in NAV, but never sends an event/alert through the system.
  • This would cause the problem state to stay open in AAS, while it's closed in NAV, and the two systems will be out of sync.

So, what would be the best way to ensure the updated problem state in NAV is propagated to AAS? That's a discussion we need to have...

Originally posted by @lunkwill42 in #45 (comment)

@hmpf added the "discussion" label (Requires developer feedback/discussion before implementation) on Jun 17, 2020
@lunkwill42
Member

While I'm thinking about possible suggestions, I might add that the same issue would arise if the resolving alert never reaches AAS for some reason (server errors, network errors, or anything else). We could always make NAV generate the corresponding alert to trigger dispatch to AAS, but the generic issue remains: a resolving alert may never reach AAS, and the two systems will then be out of sync.

@hmpf
Contributor Author

hmpf commented Jun 19, 2020

Looks like we need both an ack function and a stay-dead function, I'd say. UI-wise, the turn-off button needs to be less comfortable to use so it isn't triggered by accident.

@lunkwill42
Member

Yes, we need a function that lets a human manually close an alert from the frontend. The close would be logged to the Event table and associated with that user, so you can tell the difference between an alert that was resolved by a message from the alert source and one that was closed manually. NAV has this same function.
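
To make the idea concrete, here is a minimal sketch of a manual close that leaves an audit trail. It is not the actual AAS data model: the `Event` and `Incident` classes, the `"START"`/`"END"`/`"CLOSE"` event types, and the `close_manually` method are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Event:
    """One row in a hypothetical event log for an incident."""
    type: str            # "START" / "END" come from the source, "CLOSE" is manual
    actor: str           # name of the source system, or the username of a human
    timestamp: datetime
    description: str = ""


@dataclass
class Incident:
    """Hypothetical incident; it is open until an END or CLOSE event is logged."""
    description: str
    source: str
    events: List[Event] = field(default_factory=list)

    @property
    def is_open(self) -> bool:
        return not any(e.type in ("END", "CLOSE") for e in self.events)

    def close_manually(self, username: str, reason: str = "") -> Event:
        """Close the incident on behalf of a human and record who did it."""
        if not self.is_open:
            raise ValueError("incident is already closed")
        event = Event("CLOSE", username, datetime.now(timezone.utc), reason)
        self.events.append(event)
        return event
```

Because the closing event carries its own type and the username, a later query can separate incidents resolved by the source (`"END"`) from those closed by hand (`"CLOSE"`).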

@hmpf
Contributor Author

hmpf commented Oct 14, 2020

We now have a manual way for a human to close an incident.

I wonder: could we also have an agent that can close "old" incidents? Maybe if we decommission a source, we could ask that agent to close everything from that source. Which reminds me, some logic that currently lives only in the API needs to move to the models/queryset.
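
A rough sketch of what such an agent could do when a source is decommissioned, assuming a simplified in-memory `Incident` record. The field names (`closed_at`, `closed_by`), the function name, and the default agent name are made up for this example and are not taken from the codebase.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass
class Incident:
    """Simplified, hypothetical incident record."""
    pk: int
    source: str
    closed_at: Optional[datetime] = None
    closed_by: Optional[str] = None


def close_all_from_source(
    incidents: Iterable[Incident],
    source: str,
    agent: str = "cleanup-agent",  # hypothetical name for the closing agent
) -> List[Incident]:
    """Close every still-open incident from `source`; return the ones closed."""
    now = datetime.now(timezone.utc)
    closed = []
    for incident in incidents:
        if incident.source == source and incident.closed_at is None:
            incident.closed_at = now
            incident.closed_by = agent
            closed.append(incident)
    return closed
```

Incidents that are already closed are left untouched, so their original closing time and actor are preserved. In a Django setting the same filter-then-update logic would more naturally sit on the model or queryset than in the API layer.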

@lunkwill42
Member

Adding a note from an old discussion concerning the glue service for NAV: We expect to make the glue service itself able to sync pre-existing NAV problems to Argus. Making it able to do it the other way around might be just as interesting, i.e. make it able to see which of its active Argus incidents have already been closed in NAV, and update the Argus incidents accordingly.
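
The two-way sync boils down to comparing two sets of identifiers. The sketch below assumes the glue service can obtain the IDs of problems currently open in NAV and the NAV-side IDs of incidents currently open in Argus; how it fetches them is left out, and `SyncPlan` and `plan_sync` are hypothetical names.

```python
from typing import NamedTuple, Set


class SyncPlan(NamedTuple):
    """What the glue service would need to do to bring Argus in line with NAV."""
    to_close_in_argus: Set[str]   # open in Argus, no longer open in NAV
    to_create_in_argus: Set[str]  # open in NAV, not yet known to Argus


def plan_sync(open_in_nav: Set[str], open_in_argus: Set[str]) -> SyncPlan:
    """Compare the open problems on both sides and work out the differences."""
    return SyncPlan(
        to_close_in_argus=open_in_argus - open_in_nav,
        to_create_in_argus=open_in_nav - open_in_argus,
    )
```

For example, `plan_sync({"a", "b"}, {"b", "c"})` would mark `"c"` for closing in Argus and `"a"` for creation. Run periodically, a comparison like this would also catch resolving alerts that were lost in transit.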

@katsel removed the "discussion" label (Requires developer feedback/discussion before implementation) on Jan 19, 2021
@katsel
Contributor

katsel commented Jan 19, 2021

@katsel closed this as completed on Jan 19, 2021
@katsel removed this from the Blue sky milestone on Jan 19, 2021
3 participants