
Status / health of workload clusters #13

Open
othylmann opened this issue Jun 26, 2018 · 49 comments
Labels
area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service epic/revisit-after-capi topic/observability ui/backstage The next generation web UI for Giant Swarm

Comments

@othylmann
Member

othylmann commented Jun 26, 2018

Epic Story

As a customer, I want to be able to easily and efficiently check the health of my workload clusters so I know whether they require attention or not.

(This Epic is in the ideation phase; the stories below have been created to collect the various sources of information we require in order to build the MVP of this Epic.)

Linked UserStories

  • Competitor Analysis giantswarm/giantswarm#6136
  • Customer Focus Sessions/ Feedback giantswarm/giantswarm#6137
  • Available Data giantswarm/giantswarm#6138
  • Mock ups giantswarm/giantswarm#6139
  • Architectural implications giantswarm/giantswarm#6312
  • In scope/ out of scope

User Personas

Linked Stories

@teemow teemow changed the title from "Product Spec for Status / Health" to "Status / Health of tenant clusters" on Sep 19, 2018

@marians
Member

marians commented May 23, 2019

I started breaking down what this story may be about, as a basis for upcoming brainstorming sessions. Credits to Marcel for helping me.

On the one hand, I would like to keep us open-minded about the vision. I know our customers mean a lot of different things at once; we will have to boil it down and draw a meaningful path. So, to start somewhere, I began mapping out what could flow into the thing we are talking about here.

https://docs.google.com/presentation/d/1I2_hz-bkOOK2--AqP63cUKREtLNnx6sfZJfJebFeG-Y/edit#slide=id.p

Typical disclaimer: WIP, early stage, nothing set in stone etc. More iterations needed even for the simple stuff in there.


@marians
Member

marians commented Jul 23, 2019

For future reference, here is a very high-level plan for how to tackle the topic:

https://docs.google.com/drawings/d/1CNzDAk6HPqeE8qqVtBu29iAZaauvPdP5Z9_XB7Bt4BM/edit


I am collecting some more input in this slides document: https://docs.google.com/presentation/d/1I2_hz-bkOOK2--AqP63cUKREtLNnx6sfZJfJebFeG-Y/edit#slide=id.p


@marians
Member

marians commented Aug 6, 2019

From giantswarm/giantswarm#6511:

As volumes filling up is a constant source of trouble in day-to-day life, it would be quite valuable to have volume usage data available in the MVP. I suggest that we discuss alternatives for acquiring that data (currently provided by the tenant cluster node exporter, fetched by Prometheus) without relying on Prometheus.

@puja108
Member

puja108 commented Aug 12, 2019

What are the reasons behind not using Prom and what would be the alternative?

@JosephSalisbury
Contributor

JosephSalisbury commented Aug 12, 2019

Yeah, not using Prometheus for this doesn't feel like the right direction

@J-K-C

J-K-C commented Aug 13, 2019

It does not scale and is not reliable enough, according to @teemow. However, @cornelius-keller will be deep-diving into the technical feasibility/architecture of this feature at some point, and then we can get some more data for a more data-driven discussion. Please watch this space :)

@cornelius-keller
Contributor

Since Prometheus is, to my knowledge, currently the only source for metrics data, I think we can use it. Even if we are looking for a replacement, it seems to me that this would take much longer than we want to wait for this feature. It seems a bad idea to couple a Prometheus replacement with this story.

As far as I understand the architecture, we need an API endpoint to expose the metrics data to the frontend anyway. This can act as a facade to Prometheus for now. Ideally, we can replace Prometheus later without changing this API endpoint.
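For illustration, a minimal sketch of what such a facade endpoint could look like, using the Prometheus Go client to serve volume usage per cluster (the endpoint path, metric labels, and response shape are assumptions for the sake of the example, not our actual API):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// volumeUsageHandler is a hypothetical facade endpoint: the frontend asks it
// for volume usage of one cluster, and it translates that into a Prometheus
// query. Swapping Prometheus out later only requires changing this handler.
func volumeUsageHandler(promAPI promv1.API) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		clusterID := r.URL.Query().Get("cluster_id")

		// node_filesystem_* metrics come from the node exporter; the
		// cluster_id label is an assumption about how the series are labelled.
		query := fmt.Sprintf(
			`1 - node_filesystem_avail_bytes{cluster_id=%q} / node_filesystem_size_bytes{cluster_id=%q}`,
			clusterID, clusterID,
		)

		ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
		defer cancel()

		result, _, err := promAPI.Query(ctx, query, time.Now())
		if err != nil {
			// Prometheus being unavailable must not break the endpoint;
			// report an "unknown" state instead of an error page.
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{"status": "unknown"})
			return
		}
		json.NewEncoder(w).Encode(map[string]interface{}{"status": "ok", "data": result})
	}
}

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	http.HandleFunc("/v1/volume-usage", volumeUsageHandler(promv1.NewAPI(client)))
	http.ListenAndServe(":8000", nil)
}
```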

@marians
Member

marians commented Aug 14, 2019

Let me reconstruct what I remember from @teemow's statements on Prometheus as we run it currently, with respect to this story, at the risk of getting it wrong:

  • Prometheus is currently in a state where it doesn't seem suited for higher usage. "The current setup doesn't scale".
  • We should focus first (=for the MVP) on data we can get without relying on Prometheus.

I think it's up to us to look at this in more detail. Like so:

  • If we decide, for example, to query tenant cluster volume data from Prometheus, how will this affect the resource usage and stability of Prometheus?
  • What will the query pattern look like? (Likely: all tenant clusters, all volumes, once a minute)
  • How can we deal with temporary unavailability of Prometheus data in the health UI? (In short: design for an "unknown" state in all metrics; see the sketch below)
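To illustrate the last point, a minimal sketch of how the health model could treat missing Prometheus data (type names and thresholds are made up for illustration, not agreed values):

```go
package health

// MetricState is a hypothetical representation of a single health metric in
// the UI. "Unknown" is a first-class state, not an error case, so a flapping
// or unreachable Prometheus degrades gracefully instead of breaking the view.
type MetricState string

const (
	StateOK       MetricState = "ok"
	StateWarning  MetricState = "warning"
	StateCritical MetricState = "critical"
	StateUnknown  MetricState = "unknown"
)

// classifyVolumeUsage maps a volume usage ratio to a state. The fetched flag
// is false whenever the metrics backend could not be reached or returned no
// data, in which case we show "unknown" rather than guessing.
func classifyVolumeUsage(usage float64, fetched bool) MetricState {
	if !fetched {
		return StateUnknown
	}
	switch {
	case usage >= 0.95:
		return StateCritical
	case usage >= 0.80:
		return StateWarning
	default:
		return StateOK
	}
}
```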

@puja108
Member

puja108 commented Aug 14, 2019

If you come up with such assertions, then please state what you are comparing them against. "Without relying on Prometheus" implies you rely on something else (at least for the metric that you are using as an example here). What system will you ask? Do you write your own? Will you keep it in a time series DB? Will you write your own DB or use Influx?
No matter how you answer these questions, I do not see a solution without a tool "like" Prometheus, so, like Cornelius said, why not just rely on what we have instead of building something completely redundant next to it? Also, if you want to put in the work, why not work on making Prometheus better or build workarounds (e.g. a cache) for the cases you mention?
Also, keep in mind that currently the context of this is Happa (AFAIK) and it is not a heavily used tool, so any query load you generate is short-lived and usually pertains to single users. Yes, it will need to scale at some point, but so does our Prom setup; we already know that, as we rely heavily on it for our SLA.
That all said, if there's data you can get without Prom, please do so, but include the effort of building your own tooling in your thoughts on this; otherwise this story might be blown out of proportion.

@marians
Member

marians commented Aug 14, 2019

The info and details (only some of them deserve the term "metrics") we want to focus on in the MVP are the ones we can get from

  • Kubernetes API: e.g. node details such as capacity, requests, limits
  • Provider resource details coming via our CRs, for example AutoScalingGroup details coming via the cluster resource or whatever the node pool equivalent will be

There might be more. These are two examples.

See https://github.com/giantswarm/giantswarm/issues/6139#issuecomment-516847947 for a visual representation of this sort of detail at the node level. EDIT: The volume data is currently only available in Prometheus.
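As an illustration of the first bullet, a minimal client-go sketch that reads node capacity and allocatable resources straight from the Kubernetes API (the kubeconfig path is a placeholder; how credentials are obtained in practice is out of scope here):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig for the workload cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, node := range nodes.Items {
		// Capacity and Allocatable come directly from the node status;
		// requests and limits would additionally require listing pods.
		fmt.Printf("%s: cpu capacity=%s allocatable=%s, memory capacity=%s allocatable=%s\n",
			node.Name,
			node.Status.Capacity.Cpu(), node.Status.Allocatable.Cpu(),
			node.Status.Capacity.Memory(), node.Status.Allocatable.Memory(),
		)
	}
}
```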

@puja108
Member

puja108 commented Aug 14, 2019

Those look fine to me, and most really do not need Prometheus, which is cool. I am not 100% sure on the volumes stuff, but if that is the only thing in there that is not available through K8s itself, then I would skip it in this phase.

Getting things from the K8s APIs is definitely OK. The backend for the metrics you get from the API currently comes from the metrics-server component, but in the future we might even have a local Prom serving that, so as long as you ask the K8s API for metrics we do not rely on a single backend.

BUT with the K8s API I would also be careful about hammering the APIs: have some caching in place and not too much "live" data involved.
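To make the caching point concrete, a tiny sketch of a TTL cache in front of the node listing (all names invented; a real implementation would probably use an informer or shared cache instead):

```go
package nodes

import (
	"context"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cachedNodeLister serves node lists from memory and only hits the Kubernetes
// API when the cached result is older than ttl, so a busy UI does not hammer
// the API server with "live" requests.
type cachedNodeLister struct {
	client    kubernetes.Interface
	ttl       time.Duration
	mu        sync.Mutex
	nodes     []corev1.Node
	fetchedAt time.Time
}

func (c *cachedNodeLister) List(ctx context.Context) ([]corev1.Node, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if time.Since(c.fetchedAt) < c.ttl && c.nodes != nil {
		return c.nodes, nil // serve from cache
	}

	list, err := c.client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		// On error, fall back to the stale cache if we have one.
		if c.nodes != nil {
			return c.nodes, nil
		}
		return nil, err
	}

	c.nodes = list.Items
	c.fetchedAt = time.Now()
	return c.nodes, nil
}
```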

@J-K-C

J-K-C commented Aug 14, 2019

Hi all, I suggest we park this discussion for a moment, as the architecture deep dive has been assigned to Cornelius, who is only in his 2nd week; plus, Timo is AFK. Tomorrow we have an introductory session on this epic for Cornelius and the Ludacris team, where we will also review the MVP. From there, I suggest we book a couple of sessions where we can have some data-driven discussions around this... maybe a session in Rome if the timing's right.

@puja108
Member

puja108 commented Aug 14, 2019

I don't think there's more need for discussion here. All good, move forward.

@teemow
Member

teemow commented Aug 27, 2019

My considerations were that in this story we are talking about the current state only. This is and should be in the status section of our CRDs. This is a very reliable source.

This story isn't about metrics and timeseries.

Prometheus isn't reliable. We have had many flapping Prometheus instances in control planes already. It can easily be wiped. Let's say the data structure is less defined and versioned. It is at its limit. So presenting metrics to the customer needs to wait until we have worked on the Prometheus topology and maybe long-term storage.
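To make the CRD-status point concrete, here is an illustrative sketch of what such a status section could look like as a Go type (field and type names are assumptions, not the actual Giant Swarm or Cluster API CRDs):

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterHealthStatus is a hypothetical status sub-struct; the real Giant Swarm
// CRDs (and later the upstream Cluster API types) define their own status fields.
type ClusterHealthStatus struct {
	// DesiredNodes is the number of nodes the cluster should have.
	DesiredNodes int `json:"desiredNodes"`
	// ReadyNodes is the number of nodes currently reporting Ready.
	ReadyNodes int `json:"readyNodes"`
	// Conditions follows the usual Kubernetes condition pattern.
	Conditions []ClusterCondition `json:"conditions,omitempty"`
}

// ClusterCondition mirrors the common type/status/reason/message shape.
type ClusterCondition struct {
	Type               string      `json:"type"`
	Status             string      `json:"status"`
	Reason             string      `json:"reason,omitempty"`
	Message            string      `json:"message,omitempty"`
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
}
```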

@puja108
Member

puja108 commented Aug 27, 2019

My considerations were that in this story we are talking about the current state only. This is and should be in the status section of our CRDs. This is a very reliable source.

This story isn't about metrics and timeseries.

Fully agree!

Prometheus isn't reliable. We have had many flapping Prometheus instances in control planes already. It can easily be wiped. Let's say the data structure is less defined and versioned. It is at its limit. So presenting metrics to the customer needs to wait until we have worked on the Prometheus topology and maybe long-term storage.

Ok, this is maybe where the confusion came from. What I read from this is not "we should build sth else for metrics and replace prom" but "we need to improve our prom/metrics setup before we can rely on it for metrics", right?

@cornelius-keller
Contributor

I have looked again through the history of this story and the related work, especially from @marians. It seems I initially underestimated the technical complexity and the difference between the requirements of this story from the customer/UI perspective and the internal technical challenges, like moving towards Cluster API and having an operator-readable cluster status that other operators can react to.

Given all that, I would like to suggest a new MVP.
I suggest a very simple traffic-light status per cluster, based in the beginning only on the number of desired nodes and the number of ready nodes.

If all desired nodes are ready -> the cluster is green.
If between 1 and 20% of the desired nodes are not ready -> the cluster is yellow.
If more than 20% of the nodes are not ready -> the cluster is red.

A node that is not there at all, for example because it has not yet been created by the infrastructure or because the infrastructure failed, will be considered not ready.

In the first iteration I think we can ignore intermediate states. As @teemow pointed out, during upgrades the new nodes are created before the old ones are deleted, so the cluster should stay green.

In other cases I think it is consistent and easiest to explain what happens if we keep the status evaluation simple in the beginning. Looking at Elasticsearch, for example, a cluster becomes yellow if you add a node and it starts to rebalance shards. Yellow would just mean "desired state is not current state, but it is not bad yet", whereas red means "desired state is not current state and it is probably bad".
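A minimal sketch of this rule (function and type names are made up; the 20% threshold is the one proposed above):

```go
package health

// TrafficLight is the proposed per-cluster status.
type TrafficLight string

const (
	Green  TrafficLight = "green"
	Yellow TrafficLight = "yellow"
	Red    TrafficLight = "red"
)

// ClusterTrafficLight implements the proposed rule: green if all desired
// nodes are ready, yellow if up to 20% are not ready, red beyond that.
// Nodes that do not exist yet (or failed to be created) simply count as
// not ready, because ready < desired in that case too.
func ClusterTrafficLight(desiredNodes, readyNodes int) TrafficLight {
	if desiredNodes == 0 || readyNodes >= desiredNodes {
		return Green
	}
	notReadyFraction := float64(desiredNodes-readyNodes) / float64(desiredNodes)
	if notReadyFraction <= 0.20 {
		return Yellow
	}
	return Red
}
```

For a three-node cluster with one node not ready, the not-ready fraction is 1/3 ≈ 33%, so the cluster would show red; this case comes up again below in the discussion of upgrades.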

I would like to have this for all clusters, regardless of whether they are using node pools or not.
We could have the same thing per node pool if the cluster uses them.

I think this is the minimal thing that provides user value and does not cause too many technical uncertainties, as the information probably already exists or could easily be added to the current CRs.

Based on customer feedback we can then decide to add more information or states to the traffic light or for example work on having single sign on for grafana so that we can reuse the dashboards that we have there.

WDYT?

@teemow
Member

teemow commented Dec 9, 2019

Sounds good to me. Small steps will help us to align with cluster-api upstream and the different implementation levels we have in the operators.

Will this distinguish between "node not ready" and "api unavailable"?

Btw, on Azure we don't create new instances first and then tear down the old ones, afaik. Not sure about KVM. On AWS the ASG definitely creates a new instance before the old one is torn down.

@cornelius-keller
Copy link
Contributor

cornelius-keller commented Dec 9, 2019

I think extending this so that "master down" also means red should be easy. With multi-master, the semantics would then probably be "all masters down".

Regarding the states during creation and upgrading: I still think that this is easy to explain to the users, and I would like to add more states based on user feedback.

For a three-node cluster this even means that it turns red during updates if we don't create the new node before deleting the old one. But if you run a three-node cluster in production, this is actually a bad thing, as it means you have lost 1/3 of your capacity and this will probably affect your workload. So showing it red during the update still seems appropriate to me.
On the other hand, we could think about having a threshold of 5% before switching from green to yellow, so big clusters don't turn yellow during upgrades. But this too I would like to tweak based on customer feedback.
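Continuing the sketch from the earlier comment (reusing the TrafficLight type defined there), the extension discussed here could look like this; purely illustrative, and the 5% figure is only the tweak mentioned above, not a decided value:

```go
// ClusterTrafficLightExtended adds the two ideas discussed here to the earlier
// sketch: all masters down forces red, and the green-to-yellow threshold is
// configurable (e.g. 0.05) so large clusters don't flap to yellow during upgrades.
func ClusterTrafficLightExtended(desiredNodes, readyNodes, readyMasters int, yellowThreshold float64) TrafficLight {
	if readyMasters == 0 {
		return Red // with multi-master, "all masters down" means red
	}
	if desiredNodes == 0 || readyNodes >= desiredNodes {
		return Green
	}
	notReadyFraction := float64(desiredNodes-readyNodes) / float64(desiredNodes)
	switch {
	case notReadyFraction <= yellowThreshold:
		return Green
	case notReadyFraction <= 0.20:
		return Yellow
	default:
		return Red
	}
}
```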

@puja108
Member

puja108 commented Dec 10, 2019

Going in the right direction. I suggest you evaluate two things in more detail. Both will need a bit of thought and maybe even some tryouts and tests, either manual or automated:

  1. What are the exact percentages/limits you want the status to change at?
  2. Should status be separated for masters vs nodes and between node pools?

Point 2 might be something you will just leave out of the MVP, but be careful: if you decide not to separate anything, you will need a bit more detailed communication to avoid confusing people about their status. In my experience people worry in different ways about the API being down vs. nodes not being ready.

@cornelius-keller cornelius-keller transferred this issue from another repository Dec 12, 2019
@cornelius-keller cornelius-keller added topic/observability area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service mission/🌹/observability labels Dec 12, 2019
@J-K-C J-K-C added this to the 2020 Q1 milestone Dec 13, 2019
@snizhana-dynnyk snizhana-dynnyk removed this from the 2020 Q1 milestone Dec 16, 2019
@cornelius-keller cornelius-keller moved this from In design to Ready to develop in Giant Swarm Roadmap (Deprecated) Jan 16, 2020
@cornelius-keller cornelius-keller moved this from In Design to Planned in Giant Swarm Roadmap (Deprecated) Mar 20, 2020
@marians marians changed the title from "Status / Health of tenant clusters" to "Status / health of workload clusters" on Jan 14, 2021
@snizhana-dynnyk
Contributor

A customer mentioned this in a feedback call about our monitoring - that they would like to have some sort of a traffic light system for workload clusters. This would help them to develop trust in our monitoring and also give an overview of workload clusters' health.

I will add this issue to a Product board so we can discuss it on Monday. We might consider implementing this traffic light system in Management API.

Additionally, defining the concepts of 'red' / 'yellow' clusters might be useful for our internal operations, e.g. prioritizing postmortems or even using the min number of 'red' clusters as an outcome of the reliability goal.

@puja108
Member

puja108 commented Jun 14, 2021

This still makes sense in Ludacris, but we'd currently not prioritize it very highly, as the source of truth for such health being in CRDs will change with CAPI (and upstream CAPI has similar health stories they are thinking about). Thus, I would revisit this once we're further along with CAPI, unless there's increased priority from the customer side on this.

For now I would also say that giving access to our Grafana dashboards should at least start giving a first picture: not in our own interface, but at least in a first interface. Ludacris will definitely also look at which dashboards they own and how those will be experienced by customers.

@JosephSalisbury
Contributor

@puja108 can we find a home for this? it looks lost somehow

@puja108
Member

puja108 commented Oct 20, 2022

I'd move this over to one of the KaaS teams, so it can be checked against, and maybe merged with, the general cluster-health-within-CAPI story. The idea back in Ludacris was mainly: for now, let's rely on the health status we get through CAPI. We'd then need to see where we expose it, i.e. dashboard vs/and happa/kgs. I know, for example, that the Azure CLI and in some way also clusterctl show an aggregate cluster health on the command line.

cc @alex-dabija @gawertm @cornelius-keller which team would be closest to this right now?

@alex-dabija

@puja108 can we find a home for this? it looks lost somehow

The issue is not lost. We agreed in KaaS (some time ago) that Ludacris' backlog will stay on the KaaS Sync's board until either Rocket or Hydra has a need to implement the feature.

We (@cornelius-keller, @gawertm and I) discussed it quickly today in the KaaS Product Sync and agreed that it's still best to pull the story into one of the teams when it's needed.

cc @alex-dabija @gawertm @cornelius-keller which team would be closest to this right now?

Unfortunately, it's difficult to say which team is closest because we are mostly focused on having stable clusters.

@teemow
Member

teemow commented Jun 4, 2024

@puja108 @marians this is still interesting in terms of fleet management, especially for an interface like Backstage in which you can drill down into (health) information about an installation or cluster, e.g. seeing the state of applications on a cluster or the current alerts for the cluster itself.

@marians marians added the ui/backstage The next generation web UI for Giant Swarm label Jun 6, 2024
@marians
Member

marians commented Jun 6, 2024

I'm putting these three related issues on Honeybadger's board for Backstage.
