Status / health of workload clusters #13
Comments
I started breaking down what this story may be about, as a basis for upcoming brainstorming sessions. Credits to Marcel for helping me. On the one hand, I would like to keep us open-minded about the vision; I know our customers mean a lot of different things at once. We will have to boil it down and draw a meaningful path. So, to start somewhere, I began mapping out what could flow into the thing we are talking about here: https://docs.google.com/presentation/d/1I2_hz-bkOOK2--AqP63cUKREtLNnx6sfZJfJebFeG-Y/edit#slide=id.p Typical disclaimer: WIP, early stage, nothing set in stone, etc. More iterations are needed, even for the simple stuff in there.
For future reference, here is a very high-level plan for how to tackle the topic: https://docs.google.com/drawings/d/1CNzDAk6HPqeE8qqVtBu29iAZaauvPdP5Z9_XB7Bt4BM/edit I am collecting some more input in this slides document: https://docs.google.com/presentation/d/1I2_hz-bkOOK2--AqP63cUKREtLNnx6sfZJfJebFeG-Y/edit#slide=id.p
From giantswarm/giantswarm#6511: As volumes filling up is a constant source of trouble in day-to-day life, it would be quite valuable to have volume usage data available in the MVP. I suggest that we discuss alternatives for acquiring that data (currently provided by the tenant cluster node exporter and fetched by Prometheus) without relying on Prometheus.
What are the reasons behind not using Prom, and what would be the alternative?
Yeah, not using Prometheus for this doesn't feel like the right direction.
It does not scale and is not reliable enough, according to @teemow. However, @cornelius-keller will be deep-diving into the technical feasibility and architecture of this feature at some point, and then we will have more data for a data-driven discussion. Please watch this space :)
For now, Prometheus is, to my knowledge, the only source of metrics data, so I think we can use it. Also, even if we are looking for a replacement, it seems to me that finding one would take much longer than we want to wait for this feature, so it seems a bad idea to couple a Prometheus replacement with this story. As far as I understand the architecture by now, we need an API endpoint to expose the metrics data to the frontend anyway. This can act as a facade to Prometheus for now. Ideally, we can later replace Prometheus without changing this API endpoint.
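To make the facade idea concrete, here is a minimal sketch in Go. The endpoint path, the Prometheus address, and the PromQL query are all assumptions for illustration, not an existing API; the point is only that the frontend talks to a stable endpoint of ours, which for now happens to answer by querying Prometheus behind the scenes.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"net/url"
)

// promURL is an assumption; it would point at whatever Prometheus instance
// currently scrapes the tenant cluster node exporters.
const promURL = "http://prometheus:9090/api/v1/query"

// volumeUsageHandler answers the frontend's request for volume usage by
// querying Prometheus. Replacing Prometheus later only changes this
// handler, not the endpoint the frontend depends on.
func volumeUsageHandler(w http.ResponseWriter, r *http.Request) {
	// Example PromQL for filesystem usage; exact metric names depend on
	// the node exporter version in use.
	query := `1 - node_filesystem_avail_bytes / node_filesystem_size_bytes`

	resp, err := http.Get(promURL + "?query=" + url.QueryEscape(query))
	if err != nil {
		http.Error(w, "metrics backend unavailable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// For the sketch, pass the Prometheus response through verbatim; a real
	// facade would map it into its own stable response schema.
	w.Header().Set("Content-Type", "application/json")
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/v1/metrics/volume-usage", volumeUsageHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```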
Let me reconstruct what I remember from @teemow's statements on Prometheus as we run it currently, with respect to this story, at the risk of getting it wrong:
I think it's up to us to look at this in more detail. Like so:
If you come up with such assertions, then please state what you are comparing against. "Without relying on Prometheus" implies you rely on something else (at least for the metric you are using as an example here). What system will you ask? Will you write your own? Will you keep the data in a time-series DB? Will you write your own DB or use Influx?
The info and details (only some of them deserve the term "metrics") we want to focus on in the MVP are the ones we can get from
There might be more; these are two examples. See https://github.com/giantswarm/giantswarm/issues/6139#issuecomment-516847947 for a visual representation of this sort of detail at the node level. EDIT: The volume data is currently only available in Prometheus.
Those look fine to me, and most really do not need Prometheus, which is cool. I am not 100% sure about the volumes stuff, but if that is the only thing in there that is not available through K8s itself, then I would skip it in this phase. Getting things from K8s APIs is definitely OK. The metrics you get from the API currently come from the metrics-server component, but in the future we might even have a local Prom serving that, so as long as you ask the K8s API for metrics, we do not rely on a single backend. BUT even with the K8s API, I would be careful about hammering it: add some caching and don't involve too much "live" data.
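As a rough illustration of "ask the K8s API, not Prometheus directly", here is a minimal client-go sketch that reads node conditions straight from the Kubernetes API. It assumes a local kubeconfig; a real service would use in-cluster config and, per the caching concern above, an informer cache rather than polling the API server.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig; in a real deployment this would be the
	// in-cluster config of the service talking to the workload cluster.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Node conditions (Ready, MemoryPressure, DiskPressure, ...) come
	// straight from the K8s API; no Prometheus involved.
	nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			fmt.Printf("%s: %s=%s\n", node.Name, cond.Type, cond.Status)
		}
	}
}
```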
Hi all, I suggest we park this discussion for a moment, as the architecture deep dive has been assigned to Cornelius, who is only in his second week, and Timo is AFK. Tomorrow we have an introductory session on this epic for Cornelius and the Ludacris team, where we will also review the MVP. From there, I suggest we book a couple of sessions for some data-driven discussions, maybe a session in Rome if the timing is right.
I don't think there's any more need for discussion here. All good, move forward.
My considerations were that in this story we are talking about the current state only. This is, and should be, in the cluster CRs. This story isn't about metrics and time series. Prometheus isn't reliable: we have had many flapping Prometheus instances in control planes already, it can easily be wiped, and, let's say, its data structure is less defined and versioned. It is at its limit. So presenting metrics to the customer needs to wait until we have worked on the Prometheus topology and maybe long-term storage.
Fully agree!
Ok, this is maybe where the confusion came from. What I read from this is not "we should build something else for metrics and replace Prom", but "we need to improve our Prom/metrics setup before we can rely on it for metrics", right?
I have looked again through all the history of this story and the related work, especially from @marians. Initially, it seems I underestimated the technical complexity, and the difference between the requirements on this story from the customer/UI perspective vs. the internal technical challenges, like moving towards Cluster API and having an operator-readable cluster status that other operators can react on.

After all this, I would like to suggest a new MVP: if all desired nodes are ready, the cluster is green. A node that is not there at all, for example because it has not yet been created by the infrastructure or because the infrastructure failed, will be considered not ready.

In the first iteration I think we can ignore the intermediate state. As @teemow pointed out, during upgrades the new nodes are created before the old ones are deleted, so the cluster should stay green. In other cases, I think it is consistent and easiest to explain what happens if we keep the status evaluation simple in the beginning. If I look at, for example, Elasticsearch, a cluster becomes yellow if you add a node and it starts to rebalance shards. Yellow would just mean "desired state is not current state, but it is not bad yet", whereas red means "desired state is not current state and it is probably bad".

I would like to have this for all clusters, regardless of whether they use node pools. I think this is the minimal thing that provides user value and does not cause too many technical uncertainties, as the information probably already exists or could easily be added to the current CRs. Based on customer feedback, we can then decide to add more information or states to the traffic light, or, for example, work on single sign-on for Grafana so that we can reuse the dashboards we have there. WDYT?
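A minimal sketch of that evaluation rule, assuming desired and ready node counts are already available from the cluster CRs (all names here are illustrative, not an existing API):

```go
package clusterstatus

// ClusterStatus is the traffic light value shown to the user.
type ClusterStatus string

const (
	StatusGreen ClusterStatus = "green"
	StatusRed   ClusterStatus = "red"
)

// EvaluateStatus implements the proposed MVP rule: if all desired nodes
// are ready, the cluster is green; anything else, including nodes the
// infrastructure has not created at all, counts as red. Intermediate
// yellow states are deliberately left out of the first iteration.
func EvaluateStatus(desiredNodes, readyNodes int) ClusterStatus {
	if readyNodes >= desiredNodes {
		return StatusGreen
	}
	return StatusRed
}
```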
Sounds good to me. Small steps will help us align with upstream Cluster API and the different implementation levels we have in the operators. Will this distinguish between "node not ready" and "API unavailable"? Btw, on Azure we don't create new instances first and then tear down the old ones, afaik. Not sure about KVM. On AWS, the ASG definitely creates a new instance before the old one is torn down.
I think extending this so that a master being down also means red should be easy. With multi-master, the semantics would then probably be: all masters down. Regarding the states during creation and upgrading: I still think this is easy to explain to users, and I would like to add more states based on user feedback. For a three-node cluster, this even means it turns red during updates if we don't create the new node before removing the old one. But if you run a three-node cluster in production, that is actually a bad thing, as it means you have lost 1/3 of your capacity, which will probably affect your workload. So showing the cluster red during the update still seems appropriate to me.
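That extension could look like the following addition to the sketch above, assuming control plane node counts are also known (again, purely illustrative names):

```go
// EvaluateStatusWithMasters extends the MVP rule: with multi-master, only
// losing all masters turns the cluster red, matching the semantics
// proposed here; otherwise the worker-node rule applies.
func EvaluateStatusWithMasters(desiredWorkers, readyWorkers, desiredMasters, readyMasters int) ClusterStatus {
	if desiredMasters > 0 && readyMasters == 0 {
		return StatusRed
	}
	return EvaluateStatus(desiredWorkers, readyWorkers)
}
```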
Going in the right direction. I suggest you evaluate two things in more detail. Both will need a bit of thought, and maybe even some tryouts and tests, either manual or automated:
Point 2 might be something you just leave out of the MVP, but be careful: if you decide not to separate anything, you will need somewhat more detailed communication so as not to confuse people about their status. In my experience, people worry in different ways about "API down" vs. "nodes not ready".
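One way to keep those two states separate, sketched with client-go and assuming we hold a clientset for the workload cluster: if the API server itself is unreachable, node readiness is unknowable, so that case is surfaced as its own state rather than being lumped in with "nodes not ready".

```go
package clusterstatus

import (
	"context"

	"k8s.io/client-go/kubernetes"
)

// CheckAPIAvailable reports whether the workload cluster's API server
// answers on /healthz. If it does not, callers should report "API
// unavailable" instead of evaluating node readiness at all.
func CheckAPIAvailable(ctx context.Context, clientset kubernetes.Interface) bool {
	err := clientset.Discovery().RESTClient().
		Get().
		AbsPath("/healthz").
		Do(ctx).
		Error()
	return err == nil
}
```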
A customer mentioned this in a feedback call about our monitoring: they would like some sort of traffic light system for workload clusters. This would help them develop trust in our monitoring and also give an overview of workload cluster health. I will add this issue to a Product board so we can discuss it on Monday. We might consider implementing this traffic light system in the Management API. Additionally, defining the concepts of "red"/"yellow" clusters might be useful for our internal operations, e.g. prioritizing postmortems, or even using the minimum number of "red" clusters as an outcome of the reliability goal.
This still makes sense in Ludacris, but we would currently not prioritize it very high, as the source of truth for such health being in CRDs will change with CAPI (and upstream CAPI has similar health stories in the works). Thus, I would revisit this once we're further along with CAPI, unless there's increased priority from the customer side. For now, I would also say that giving access to our Grafana dashboards should at least start painting a first picture: not in our own interface, but at least in a first interface. Ludacris will definitely also look at which dashboards they own and how those will be experienced by customers.
@puja108, can we find a home for this? It looks lost somehow.
I'd move this over to one of the KaaS teams, so it can be checked against, and maybe merged with, the general cluster health within CAPI story. The idea back in Ludacris was mainly: for now, rely on the health status we get through CAPI. We'd then need to see where we expose it, i.e. dashboard vs/and happa/kgs. I know for example that the azure CLI and in some way also… cc @alex-dabija @gawertm @cornelius-keller: which team would be closest to this right now?
The issue is not lost. We agreed in KaaS (some time ago) that Ludacris' backlog will stay on the KaaS Sync's board until either Rocket or Hydra has a need to implement the feature. We (@cornelius-keller, @gawertm and me) discussed it quickly today in the KaaS Product Sync and agreed that it's still best to pull the story into one of the teams when it's needed.
Unfortunately, it's difficult to say which team is closest, because we are mostly focused on having stable clusters.
I'm putting these three related issues on Honeybadger's board for Backstage. |
Epic Story
As a customer, I want to be able to easily and efficiently check the health of my workload clusters so I know whether they require attention or not.
(This Epic is in the ideation phase; the stories below have been created to collect the various sources of information we need in order to build the MVP of this Epic.)
Linked User Stories
User Personas
Linked Stories