Offering Observability (and other) Templated Links/Queries by Use Case #1706

Open
vlerenc opened this issue Feb 6, 2024 · 3 comments
Labels
component/dashboard Gardener Dashboard kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age)


vlerenc commented Feb 6, 2024

What would you like to be added:
@ashwani2k proposed (and demonstrated with a text-based interactive prototype) a simpler way to collect templated links/queries and share them with operators and end users alike. This is especially handy for Prometheus and Vali (=Loki). An MCM colleague contributed his personal, task-oriented collection of links/queries grouped by use case, and Ashwani converted it into a machine-readable format (see https://github.tools.sap/kubernetes/ops-guide/pull/742, which is not accessible to everybody; a small excerpt follows):

categories:
- title: Observability
  description: "Useful collection of links and queries (with placeholders) for shoot clusters, grouped by categories/use cases, to support Gardener operators in their tasks."
  categories:
  - title: Machines
    description: "Everything around machines, i.e. backing VMs as well as Kubernetes nodes."
    categories:
    - title: Scale Up
      description: "Identify whether scale up was triggered by CA or not."
      queries:
      - title: Check the number of nodes which were scaled up.
        query:
          type: prom
          expression: '"shoot:kube_node_info:count"'
      - title: Check if CA has triggered the scale up.
        query:
          type: vali
          expression: '{container_name="cluster-autoscaler"} |~ "Final scale-up" |~ "shoot--$projectName--$shootName-$worker-pool"'
    - title: Scale Down
      description: "Identify whether scale down was triggered by CA or not."
      queries:
        ...
    - title: Upgrade
      description: "Check whether the upgrade is stuck due to any error in MCM or due to PDB violation."
      queries:
      - title: Check for errors for any machine in a worker-pool for a given provider.
        query:
          type: vali
          expression: '{container_name="machine-controller-manager-provider-$provider"} |~ "shoot--$projectName--$shootName-$worker-pool" |~ "machine codes error"'
      - title: Check if drain is stuck due to a PDB violation.
        query:
          type: vali
          expression: '{container_name="machine-controller-manager-provider-$provider"} |~ "could not be evicted from node" |~ "occur due to PDB violation"'
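The placeholders in the expressions above (`$projectName`, `$shootName`, `$worker-pool`, `$provider`) have to be filled in before a query can be run. A minimal sketch of how a client might do that, assuming that `$worker-pool` is a single placeholder name (the hyphen makes this ambiguous in the excerpt) and using a hypothetical helper `render` that is not part of any Gardener tool:

```python
import re


def render(expression: str, values: dict[str, str]) -> str:
    """Replace $name placeholders with the given values; fail on leftovers."""
    # Longest names first, so "$worker-pool" wins over a hypothetical "$worker".
    for name in sorted(values, key=len, reverse=True):
        expression = expression.replace(f"${name}", values[name])
    # Anything that still looks like a placeholder was not supplied by the caller.
    leftover = re.findall(r"\$[A-Za-z][\w-]*", expression)
    if leftover:
        raise KeyError(f"unresolved placeholders: {leftover}")
    return expression


# Example values are made up for illustration.
query = render(
    '{container_name="cluster-autoscaler"} |~ "Final scale-up" '
    '|~ "shoot--$projectName--$shootName-$worker-pool"',
    {"projectName": "demo", "shootName": "prod", "worker-pool": "pool-a"},
)
print(query)
```

Passing the placeholder names explicitly avoids guessing where a name ends, and raising on leftovers keeps a half-filled query from being sent to Prometheus or Vali.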

It would be great to make those links/queries available in the Gardener Dashboard (and maybe also in https://github.com/gardener/gardenctl-v2) for the benefit of everybody. The Dashboard is a natural fit because it is a GUI, just like Plutono (=Grafana).
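To present the queries by use case, a client such as the Dashboard or gardenctl would have to walk the nested `categories`. A rough sketch of that traversal, assuming the excerpt has already been parsed into plain dictionaries and lists (the `CATALOG` literal and the function name `iter_queries` are illustrative only):

```python
from typing import Iterator

# Hypothetical in-memory form of part of the excerpt, as a YAML parser would return it.
CATALOG = {
    "categories": [{
        "title": "Observability",
        "categories": [{
            "title": "Machines",
            "categories": [{
                "title": "Scale Up",
                "queries": [
                    {"title": "Check the number of nodes which were scaled up.",
                     "query": {"type": "prom",
                               "expression": '"shoot:kube_node_info:count"'}},
                    {"title": "Check if CA has triggered the scale up.",
                     "query": {"type": "vali",
                               "expression": '{container_name="cluster-autoscaler"} |~ "Final scale-up"'}},
                ],
            }],
        }],
    }],
}


def iter_queries(node: dict, path: tuple[str, ...] = ()) -> Iterator[tuple[str, str, str, str]]:
    """Yield (use-case path, title, type, expression) for every query below node."""
    for entry in node.get("queries") or []:
        query = entry["query"]
        yield " / ".join(path), entry["title"], query["type"], query["expression"]
    for child in node.get("categories") or []:
        yield from iter_queries(child, path + (child["title"],))


for use_case, title, kind, expression in iter_queries(CATALOG):
    print(f"[{kind}] {use_case}: {title}")
```

The flat list of rows lends itself both to a tree view in a GUI and to line-oriented output in a terminal.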

Why is this needed:
We do not share domain-specific knowledge well enough (within a team, with adopters/our community, with end users). Even where individuals keep personal notes, these usually cover only one specific subject. Newcomers start off with nothing. End users also have nothing and are even further removed from our observability stack. All of them would benefit from a curated list of templated links/queries to analyse their issues and understand what puzzles them.

@vlerenc vlerenc added kind/enhancement Enhancement, improvement, extension component/dashboard Gardener Dashboard labels Feb 6, 2024
@gardener gardener deleted a comment from gardener-robot Feb 6, 2024

vlerenc commented Feb 6, 2024

@petersutter suggested out-of-band to keep the configuration "in some cluster / in some configmap". It could be maintained in GitHub and deployed automatically, without fear of breaking anything. Tools like the Dashboard (or gardenctl) could fetch it on-the-fly and show its content as needed. When used in the context of tickets, modern LLMs can help select the most appropriate links and fill in the placeholders. That is also possible in the Dashboard (or gardenctl) if "there is space" to ask a question, but that would be a next step (if at all). For now, it would be great to:

  • Lower the entry barrier (as compared to Plutono(=Grafana) dashboards in Gardener itself) and facilitate a low-risk way to collect/capture expert knowledge
  • ...and make this information easily accessible in the clients Gardener offers, predominantly the Dashboard (but eventually also gardenctl with a focus on text-based access, e.g. log queries to be further processed with grep, sed, awk, etc.)
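As a rough illustration of the "in some configmap" idea, the sketch below decodes a catalog from a ConfigMap object and rejects unknown query types before anything is shown to a user. It assumes the catalog is stored as a JSON string under a hypothetical `catalog.json` key; the ConfigMap name, the namespace, and the helper names are made up for this example, and the two query types are the ones that appear in the excerpt.

```python
import json

KNOWN_TYPES = {"prom", "vali"}  # the only query types used in the excerpt

# Hypothetical ConfigMap, shaped like the output of `kubectl get configmap <name> -o json`.
CONFIGMAP = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "ops-queries", "namespace": "garden"},  # illustrative names
    "data": {
        # ConfigMap data values are strings, so the catalog is stored serialized.
        "catalog.json": json.dumps({
            "categories": [{
                "title": "Observability",
                "queries": [{
                    "title": "Check the number of nodes which were scaled up.",
                    "query": {"type": "prom",
                              "expression": '"shoot:kube_node_info:count"'},
                }],
            }],
        }),
    },
}


def validate(node: dict) -> None:
    """Recursively check that every query uses a known type."""
    for entry in node.get("queries") or []:
        kind = entry["query"]["type"]
        if kind not in KNOWN_TYPES:
            raise ValueError(f"unknown query type {kind!r} in {entry['title']!r}")
    for child in node.get("categories") or []:
        validate(child)


def load_catalog(configmap: dict, key: str = "catalog.json") -> dict:
    """Decode and validate the catalog stored under data[key]."""
    if configmap.get("kind") != "ConfigMap":
        raise ValueError("not a ConfigMap")
    catalog = json.loads(configmap["data"][key])
    validate(catalog)
    return catalog


catalog = load_catalog(CONFIGMAP)
print(catalog["categories"][0]["title"])
```

Validating on load means a faulty contribution can at worst be skipped by the client, which fits the goal of a low-risk way to collect expert knowledge.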

@gardener gardener deleted a comment from gardener-robot Feb 6, 2024

vlerenc commented Feb 6, 2024

@ScheererJ noted out-of-band that this idea can be expanded beyond links/queries, e.g. to also run pre-defined scripts (something like stored procedures for ops) in the browser or with technology similar to what we already use for web terminals.

@gardener-robot

@vlerenc You have mentioned internal references in the public. Please check.

@gardener gardener deleted a comment from gardener-robot Feb 17, 2024
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 28, 2024