Alternate Grafana dashboard better aiding incident resolution #2301

Closed
sagikazarmark opened this issue May 8, 2023 · 0 comments · Fixed by #2360

Describe the solution you'd like
The current dashboard is packed with information and gives a good overview of what's going on, but IMO it's less useful during an incident. I'm working on an alternate dashboard for that purpose.

What is the added value?
When trying to figure out the root cause of an incident, engineers need the most relevant information at hand. Sure, higher latency is a problem, but it's probably not the root cause of an incident. Failing provider requests are a more likely culprit.

By prioritizing the most important information on the dashboard, the engineer working on an incident will likely determine the root cause more quickly.

Give us examples of the outcome

Some information I'd like to see on such a dashboard:

  • Number of secret stores (w/ failing validation)
  • List of failing secret stores? (cardinality?)
  • Provider API call error rates
  • Number of failed secrets
  • Anything else that helps determine whether ESO is the culprit in an incident and, if so, how
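
To make the list above a bit more concrete, here is a rough sketch of how those items could map to Grafana panels, ordered by how useful they are during triage. Everything in it is an assumption for illustration: the metric names and label values (`secretstore_status_condition`, `externalsecret_provider_api_calls_count`, `externalsecret_status_condition`, `status="error"`) should be checked against what your ESO version actually exposes, and `build_dashboard` is a hypothetical helper, not part of ESO or Grafana.

```python
import json

# Assumed metric names and labels; verify them against your ESO /metrics endpoint.
PANELS = [
    ("Secret stores failing validation",
     'sum(secretstore_status_condition{condition="Ready", status="False"})'),
    ("Provider API call error rate",
     'sum by (provider, call) '
     '(rate(externalsecret_provider_api_calls_count{status="error"}[5m]))'),
    ("Failed ExternalSecrets",
     'sum(externalsecret_status_condition{condition="Ready", status="False"})'),
]


def build_dashboard(title, panels):
    """Return a minimal Grafana dashboard dict with one stat panel per query.

    Hypothetical helper: panels are laid out three per row on Grafana's
    24-column grid, in the order given (most important first).
    """
    return {
        "title": title,
        "panels": [
            {
                "id": i,
                "type": "stat",
                "title": name,
                "gridPos": {
                    "h": 6,
                    "w": 8,
                    "x": (i - 1) % 3 * 8,
                    "y": (i - 1) // 3 * 6,
                },
                "targets": [{"expr": expr, "refId": "A"}],
            }
            for i, (name, expr) in enumerate(panels, start=1)
        ],
    }


dashboard = build_dashboard("ESO incident triage (sketch)", PANELS)
print(json.dumps(dashboard, indent=2))
```

Aggregating with `sum` keeps cardinality low for the top-level panels; a per-store list of failures would need the store name as a label, which is where the cardinality question comes in.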