
Status / health of workload clusters #13

Open
othylmann opened this issue Jun 26, 2018 · 49 comments
Labels
area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service epic/revisit-after-capi topic/observability ui/backstage The next generation web UI for Giant Swarm

Comments

@othylmann
Member

othylmann commented Jun 26, 2018

Epic Story

As a customer, I want to be able to easily and efficiently check the health of my workload clusters so I know whether they require attention or not.

(This Epic is in the ideation phase; the stories below have been created to collect the various sources of information we require in order to build the MVP of this Epic.)

Linked UserStories

  • Competitor Analysis giantswarm/giantswarm#6136
  • Customer Focus Sessions/ Feedback giantswarm/giantswarm#6137
  • Available Data giantswarm/giantswarm#6138
  • Mock ups giantswarm/giantswarm#6139
  • Architectural implications giantswarm/giantswarm#6312
  • In scope/ out of scope

User Personas

Linked Stories

@teemow teemow changed the title from "Product Spec for Status / Health" to "Status / Health of tenant clusters" on Sep 19, 2018

@marians
Member

marians commented May 23, 2019

I started breaking down what this story may be about, as a basis for upcoming brainstorming sessions. Credits to Marcel for helping me.

On the one hand, I would like to keep us open-minded about the vision. I know our customers mean a lot of different things at once; we will have to boil it down and draw a meaningful path. So, to start somewhere, I began mapping out what could flow into the thing we are talking about here.

https://docs.google.com/presentation/d/1I2_hz-bkOOK2--AqP63cUKREtLNnx6sfZJfJebFeG-Y/edit#slide=id.p

Typical disclaimer: WIP, early stage, nothing set in stone etc. More iterations needed even for the simple stuff in there.


@marians
Member

marians commented Jul 23, 2019

For future reference, here is a very high-level plan for how to tackle the topic:

https://docs.google.com/drawings/d/1CNzDAk6HPqeE8qqVtBu29iAZaauvPdP5Z9_XB7Bt4BM/edit


I am collecting some more input in this slides document: https://docs.google.com/presentation/d/1I2_hz-bkOOK2--AqP63cUKREtLNnx6sfZJfJebFeG-Y/edit#slide=id.p


@marians
Member

marians commented Aug 6, 2019

From giantswarm/giantswarm#6511:

As volumes filling up is a constant source of trouble in day-to-day life, it would be quite valuable to have volume usage data available in the MVP. I suggest that we discuss alternatives for acquiring that data (currently provided by the tenant cluster node exporter, fetched by Prometheus) without relying on Prometheus.

@puja108
Member

puja108 commented Aug 12, 2019

What are the reasons behind not using Prom and what would be the alternative?

@JosephSalisbury
Contributor

JosephSalisbury commented Aug 12, 2019

Yeah, not using Prometheus for this doesn't feel like the right direction

@J-K-C

J-K-C commented Aug 13, 2019

It does not scale and is not reliable enough, according to @teemow. However, @cornelius-keller will be deep-diving into the technical feasibility/architecture of this feature at some point, and then we can get some more data for a more data-driven discussion. Please watch this space :)

@cornelius-keller
Contributor

Since Prometheus is, to my knowledge, currently the only source for metrics data, I think we can use it. Even if we are looking for a replacement, it seems to me that this would take much longer than we want to wait for this feature. It seems a bad idea to couple a Prometheus replacement with this story.

As far as I understand the architecture, we need an API endpoint to expose the metrics data to the frontend anyway. This can act as a facade to Prometheus for now. Ideally, we can replace Prometheus later without changing this API endpoint.
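For illustration, a minimal sketch of what such a facade endpoint could look like, using the Prometheus Go client to serve volume usage per cluster (the endpoint path, metric labels, and response shape are assumptions for the sake of the example, not our actual API):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// volumeUsageHandler is a hypothetical facade endpoint: the frontend asks it
// for volume usage of one cluster, and it translates that into a Prometheus
// query. Swapping Prometheus out later only requires changing this handler.
func volumeUsageHandler(promAPI promv1.API) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		clusterID := r.URL.Query().Get("cluster_id")

		// node_filesystem_* metrics come from the node exporter; the
		// cluster_id label is an assumption about how the series are labelled.
		query := fmt.Sprintf(
			`1 - node_filesystem_avail_bytes{cluster_id=%q} / node_filesystem_size_bytes{cluster_id=%q}`,
			clusterID, clusterID,
		)

		ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
		defer cancel()

		result, _, err := promAPI.Query(ctx, query, time.Now())
		if err != nil {
			// Prometheus being unavailable must not break the endpoint;
			// report an "unknown" state instead of an error page.
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{"status": "unknown"})
			return
		}
		json.NewEncoder(w).Encode(map[string]interface{}{"status": "ok", "data": result})
	}
}

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	http.HandleFunc("/v1/volume-usage", volumeUsageHandler(promv1.NewAPI(client)))
	http.ListenAndServe(":8000", nil)
}
```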

@marians
Member

marians commented Aug 14, 2019

Let me reconstruct what I remember from @teemow's statements on Prometheus as we run it currently, with respect to this story, at the risk of getting it wrong:

  • Prometheus is currently in a state where it doesn't seem suited for higher usage. "The current setup doesn't scale".
  • We should focus first (=for the MVP) on data we can get without relying on Prometheus.

I think it's up to us to look at this in more detail. Like so:

  • If we decide, for example, to query tenant cluster volume data from Prometheus, how will this affect the resource usage and stability of Prometheus?
  • What will the query pattern look like? (Likely: all tenant clusters, all volumes, once a minute)
  • How can we deal with temporary unavailability of Prometheus data in the health UI? (In short: design for an "unknown" state in all metrics; see the sketch below)
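To illustrate the last point, a minimal sketch of how the health model could treat missing Prometheus data (type names and thresholds are made up for illustration, not agreed values):

```go
package health

// MetricState is a hypothetical representation of a single health metric in
// the UI. "Unknown" is a first-class state, not an error case, so a flapping
// or unreachable Prometheus degrades gracefully instead of breaking the view.
type MetricState string

const (
	StateOK       MetricState = "ok"
	StateWarning  MetricState = "warning"
	StateCritical MetricState = "critical"
	StateUnknown  MetricState = "unknown"
)

// classifyVolumeUsage maps a volume usage ratio to a state. The fetched flag
// is false whenever the metrics backend could not be reached or returned no
// data, in which case we show "unknown" rather than guessing.
func classifyVolumeUsage(usage float64, fetched bool) MetricState {
	if !fetched {
		return StateUnknown
	}
	switch {
	case usage >= 0.95:
		return StateCritical
	case usage >= 0.80:
		return StateWarning
	default:
		return StateOK
	}
}
```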

@puja108
Member

puja108 commented Aug 14, 2019

If you come up with such assertions, then please state what you are comparing them against. "Without relying on Prometheus" implies you rely on something else (at least for the metric that you are using as an example here). What system will you ask? Do you write your own? Will you keep it in a time series DB? Will you write your own DB or use Influx?
No matter how you answer these questions, I do not see a solution without a tool "like" Prometheus, so, like Cornelius said, why not just rely on what we have instead of building something completely redundant next to it? Also, if you want to put in the work, why not work on making Prometheus better or build workarounds (e.g. a cache) for the cases you mention?
Also, keep in mind that currently the context of this is Happa (AFAIK) and it is not a heavily used tool, so any query load you generate is short-lived and usually pertains to single users. Yes, it will need to scale at some point, but so does our Prom setup; we already know that, as we rely heavily on it for our SLA.
That all said, if there's data you can get without Prom, please do so, but include the effort of building your own tooling in your thoughts on this; otherwise this story might be blown out of proportion.

@marians
Member

marians commented Aug 14, 2019

The info and details (only some of them deserve the term "metrics") we want to focus on in the MVP are the ones we can get from

  • Kubernetes API: e.g. node details such as capacity, requests, limits
  • Provider resource details coming via our CRs, for example AutoScalingGroup details coming via the cluster resource or whatever the node pool equivalent will be

There might be more. These are two examples.

See https://github.com/giantswarm/giantswarm/issues/6139#issuecomment-516847947 for a visual representation of this sort of detail at the node level. EDIT: The volume data is currently only available in Prometheus.
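As an illustration of the first bullet, a minimal client-go sketch that reads node capacity and allocatable resources straight from the Kubernetes API (the kubeconfig path is a placeholder; how credentials are obtained in practice is out of scope here):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig for the workload cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, node := range nodes.Items {
		// Capacity and Allocatable come directly from the node status;
		// requests and limits would additionally require listing pods.
		fmt.Printf("%s: cpu capacity=%s allocatable=%s, memory capacity=%s allocatable=%s\n",
			node.Name,
			node.Status.Capacity.Cpu(), node.Status.Allocatable.Cpu(),
			node.Status.Capacity.Memory(), node.Status.Allocatable.Memory(),
		)
	}
}
```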

@puja108
Member

puja108 commented Aug 14, 2019

Those look fine to me, and most really do not need Prometheus, which is cool. I am not 100% sure on the volumes stuff, but if that is the only thing in there that is not available through K8s itself, then I would skip it in this phase.

Getting things from the K8s APIs is definitely OK. The backend for the metrics you get from the API currently comes from the metrics-server component, but in the future we might even have a local Prom serving that, so as long as you ask the K8s API for metrics we do not rely on a single backend.

BUT with the K8s API I would also be careful about hammering the APIs: have some caching in place and not too much "live" data involved.
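To make the caching point concrete, a tiny sketch of a TTL cache in front of the node listing (all names invented; a real implementation would probably use an informer or shared cache instead):

```go
package nodes

import (
	"context"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cachedNodeLister serves node lists from memory and only hits the Kubernetes
// API when the cached result is older than ttl, so a busy UI does not hammer
// the API server with "live" requests.
type cachedNodeLister struct {
	client    kubernetes.Interface
	ttl       time.Duration
	mu        sync.Mutex
	nodes     []corev1.Node
	fetchedAt time.Time
}

func (c *cachedNodeLister) List(ctx context.Context) ([]corev1.Node, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if time.Since(c.fetchedAt) < c.ttl && c.nodes != nil {
		return c.nodes, nil // serve from cache
	}

	list, err := c.client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		// On error, fall back to the stale cache if we have one.
		if c.nodes != nil {
			return c.nodes, nil
		}
		return nil, err
	}

	c.nodes = list.Items
	c.fetchedAt = time.Now()
	return c.nodes, nil
}
```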

@J-K-C

J-K-C commented Aug 14, 2019

Hi all, I suggest we park this discussion for a moment, as the architecture deep dive has been assigned to Cornelius, who is only in his 2nd week; plus, Timo is AFK. Tomorrow we have an introductory session on this epic for Cornelius and the Ludacris team, where we will also review the MVP. From there, I suggest we book a couple of sessions where we can have some data-driven discussions around this... maybe a session in Rome if the timing's right.

@puja108
Member

puja108 commented Aug 14, 2019

I don't think there's more need for discussion here. All good, move forward.

@teemow
Member

teemow commented Aug 27, 2019

My considerations were that in this story we are talking about the current state only. This is and should be in the status section of our CRDs. This is a very reliable source.

This story isn't about metrics and timeseries.

Prometheus isn't reliable. We have had many flapping Prometheus instances in control planes already. It can easily be wiped. Let's say the data structure is less defined and versioned. It is at its limit. So presenting metrics to the customer needs to wait until we have worked on the Prometheus topology and maybe long-term storage.
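To make the CRD-status point concrete, here is an illustrative sketch of what such a status section could look like as a Go type (field and type names are assumptions, not the actual Giant Swarm or Cluster API CRDs):

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterHealthStatus is a hypothetical status sub-struct; the real Giant Swarm
// CRDs (and later the upstream Cluster API types) define their own status fields.
type ClusterHealthStatus struct {
	// DesiredNodes is the number of nodes the cluster should have.
	DesiredNodes int `json:"desiredNodes"`
	// ReadyNodes is the number of nodes currently reporting Ready.
	ReadyNodes int `json:"readyNodes"`
	// Conditions follows the usual Kubernetes condition pattern.
	Conditions []ClusterCondition `json:"conditions,omitempty"`
}

// ClusterCondition mirrors the common type/status/reason/message shape.
type ClusterCondition struct {
	Type               string      `json:"type"`
	Status             string      `json:"status"`
	Reason             string      `json:"reason,omitempty"`
	Message            string      `json:"message,omitempty"`
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
}
```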

@puja108
Member

puja108 commented Aug 27, 2019

My considerations were that in this story we are talking about the current state only. This is and should be in the status section of our CRDs. This is a very reliable source.

This story isn't about metrics and timeseries.

Fully agree!

Prometheus isn't reliable. We have had many flapping Prometheus instances in control planes already. It can easily be wiped. Let's say the data structure is less defined and versioned. It is at its limit. So presenting metrics to the customer needs to wait until we have worked on the Prometheus topology and maybe long-term storage.

Ok, this is maybe where the confusion came from. What I read from this is not "we should build sth else for metrics and replace prom" but "we need to improve our prom/metrics setup before we can rely on it for metrics", right?

@cornelius-keller
Contributor

I have looked again through the history of this story and the related work, especially from @marians. It seems I initially underestimated the technical complexity and the difference between the requirements of this story from the customer/UI perspective and the internal technical challenges, like moving towards Cluster API and having an operator-readable cluster status that other operators can react to.

Given all that, I would like to suggest a new MVP.
I suggest a very simple traffic-light status per cluster, based in the beginning only on the number of desired nodes and the number of ready nodes.

If all desired nodes are ready -> the cluster is green.
If between 1 and 20% of the desired nodes are not ready -> the cluster is yellow.
If more than 20% of the nodes are not ready -> the cluster is red.

A node that is not there at all, for example because it has not yet been created by the infrastructure or because the infrastructure failed, will be considered not ready.

In the first iteration I think we can ignore intermediate states. As @teemow pointed out, during upgrades the new nodes are created before the old ones are deleted, so the cluster should stay green.

In other cases I think it is consistent and easiest to explain what happens if we keep the status evaluation simple in the beginning. Looking at Elasticsearch, for example, a cluster becomes yellow if you add a node and it starts to rebalance shards. Yellow would just mean "desired state is not current state, but it is not bad yet", whereas red means "desired state is not current state and it is probably bad".
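A minimal sketch of this rule (function and type names are made up; the 20% threshold is the one proposed above):

```go
package health

// TrafficLight is the proposed per-cluster status.
type TrafficLight string

const (
	Green  TrafficLight = "green"
	Yellow TrafficLight = "yellow"
	Red    TrafficLight = "red"
)

// ClusterTrafficLight implements the proposed rule: green if all desired
// nodes are ready, yellow if up to 20% are not ready, red beyond that.
// Nodes that do not exist yet (or failed to be created) simply count as
// not ready, because ready < desired in that case too.
func ClusterTrafficLight(desiredNodes, readyNodes int) TrafficLight {
	if desiredNodes == 0 || readyNodes >= desiredNodes {
		return Green
	}
	notReadyFraction := float64(desiredNodes-readyNodes) / float64(desiredNodes)
	if notReadyFraction <= 0.20 {
		return Yellow
	}
	return Red
}
```

For a three-node cluster with one node not ready, the not-ready fraction is 1/3 ≈ 33%, so the cluster would show red; this case comes up again below in the discussion of upgrades.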

I would like to have this for all clusters, regardless of whether they are using node pools or not.
We could have the same thing per node pool if the cluster uses them.

I think this is the minimal thing that provides user value and does not cause too many technical uncertainties, as the information probably already exists or could easily be added to the current CRs.

Based on customer feedback we can then decide to add more information or states to the traffic light or for example work on having single sign on for grafana so that we can reuse the dashboards that we have there.

WDYT?

@teemow
Member

teemow commented Dec 9, 2019

Sounds good to me. Small steps will help us to align with cluster-api upstream and the different implementation levels we have in the operators.

Will this distinguish between "node not ready" and "api unavailable"?

Btw, on Azure we don't create new instances first and then tear down the old ones, afaik. Not sure about KVM. On AWS the ASG definitely creates a new instance before the old one is torn down.

@cornelius-keller
Copy link
Contributor

cornelius-keller commented Dec 9, 2019

I think extending this so that "master down" also means red should be easy. With multi-master, the semantics would then probably be "all masters down".

Regarding the states during creation and upgrading: I still think that this is easy to explain to the users, and I would like to add more states based on user feedback.

For a three-node cluster this even means that it turns red during updates if we don't create the new node before deleting the old one. But if you run a three-node cluster in production, this is actually a bad thing, as it means you have lost 1/3 of your capacity and this will probably affect your workload. So showing it red during the update still seems appropriate to me.
On the other hand, we could think about having a threshold of 5% before switching from green to yellow, so big clusters don't turn yellow during upgrades. But this too I would like to tweak based on customer feedback.
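Continuing the sketch from the earlier comment (reusing the TrafficLight type defined there), the extension discussed here could look like this; purely illustrative, and the 5% figure is only the tweak mentioned above, not a decided value:

```go
// ClusterTrafficLightExtended adds the two ideas discussed here to the earlier
// sketch: all masters down forces red, and the green-to-yellow threshold is
// configurable (e.g. 0.05) so large clusters don't flap to yellow during upgrades.
func ClusterTrafficLightExtended(desiredNodes, readyNodes, readyMasters int, yellowThreshold float64) TrafficLight {
	if readyMasters == 0 {
		return Red // with multi-master, "all masters down" means red
	}
	if desiredNodes == 0 || readyNodes >= desiredNodes {
		return Green
	}
	notReadyFraction := float64(desiredNodes-readyNodes) / float64(desiredNodes)
	switch {
	case notReadyFraction <= yellowThreshold:
		return Green
	case notReadyFraction <= 0.20:
		return Yellow
	default:
		return Red
	}
}
```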

@puja108
Member

puja108 commented Dec 10, 2019

Going in the right direction. I suggest you evaluate two things in more detail. Both will need a bit of thought and maybe even some tryouts and tests, either manual or automated:

  1. What are the exact percentages/limits you want the status to change at?
  2. Should status be separated for masters vs nodes and between node pools?

Point 2 might be something you will just leave out of the MVP, but be careful: if you decide not to separate anything, you will need a bit more detailed communication to avoid confusing people about their status. In my experience people worry in different ways about the API being down vs. nodes not being ready.

@cornelius-keller cornelius-keller transferred this issue from another repository Dec 12, 2019
@cornelius-keller cornelius-keller added topic/observability area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service mission/🌹/observability labels Dec 12, 2019
@J-K-C J-K-C added this to the 2020 Q1 milestone Dec 13, 2019
@snizhana-dynnyk snizhana-dynnyk removed this from the 2020 Q1 milestone Dec 16, 2019
@cornelius-keller cornelius-keller moved this from In design to Ready to develop in Giant Swarm Roadmap (Deprecated) Jan 16, 2020
@cornelius-keller cornelius-keller moved this from In Design to Planned in Giant Swarm Roadmap (Deprecated) Mar 20, 2020
@marians marians changed the title from "Status / Health of tenant clusters" to "Status / health of workload clusters" on Jan 14, 2021
@snizhana-dynnyk
Contributor

A customer mentioned this in a feedback call about our monitoring - that they would like to have some sort of a traffic light system for workload clusters. This would help them to develop trust in our monitoring and also give an overview of workload clusters' health.

I will add this issue to a Product board so we can discuss it on Monday. We might consider implementing this traffic light system in Management API.

Additionally, defining the concepts of 'red' / 'yellow' clusters might be useful for our internal operations, e.g. prioritizing postmortems or even using the min number of 'red' clusters as an outcome of the reliability goal.

@puja108
Member

puja108 commented Jun 14, 2021

This still makes sense in Ludacris, but we'd currently not prioritize it very highly, as the source of truth for such health being in CRDs will change with CAPI (and upstream CAPI has similar health stories they are thinking about). Thus, I would revisit this once we're further along with CAPI, unless there's increased priority from the customer side on this.

For now I would also say that giving access to our Grafana dashboards should at least start giving a first picture: not in our own interface, but at least in a first interface. Ludacris will definitely also look at which dashboards they own and how those will be experienced by customers.

@JosephSalisbury
Contributor

@puja108 can we find a home for this? it looks lost somehow

@puja108
Member

puja108 commented Oct 20, 2022

I'd move this over to one of the KaaS teams, so it can be checked against, and maybe merged with, the general cluster-health-within-CAPI story. The idea back in Ludacris was mainly: for now, let's rely on the health status we get through CAPI. We'd then need to see where we expose it, i.e. dashboard vs/and happa/kgs. I know, for example, that the Azure CLI and in some way also clusterctl show an aggregate cluster health on the command line.

cc @alex-dabija @gawertm @cornelius-keller which team would be closest to this right now?

@alex-dabija

@puja108 can we find a home for this? it looks lost somehow

The issue is not lost. We agreed in KaaS (some time ago) that Ludacris' backlog will stay on the KaaS Sync's board until either Rocket or Hydra has a need to implement the feature.

We (@cornelius-keller, @gawertm and I) discussed it quickly today in the KaaS Product Sync and agreed that it's still best to pull the story into one of the teams when it's needed.

cc @alex-dabija @gawertm @cornelius-keller which team would be closest to this right now?

Unfortunately, it's difficult to say which team is closest because we are mostly focused on having stable clusters.

@teemow
Member

teemow commented Jun 4, 2024

@puja108 @marians this is still interesting in terms of fleet management, especially for an interface like Backstage in which you can drill down into (health) information about an installation or cluster, e.g. seeing the state of applications on a cluster or the current alerts for the cluster itself.

@marians marians added the ui/backstage The next generation web UI for Giant Swarm label Jun 6, 2024
@marians
Member

marians commented Jun 6, 2024

I'm putting these three related issues on Honeybadger's board for Backstage.
