Starting Kubernetes Observability through basic SLIs/SLOs #4997
One basic SLI/SLO should cover leader election of the Kubernetes native components: cluster-autoscaler, kube-controller-manager, and kube-scheduler all use leader-with-lease from client-go. What we want to monitor here is:
Running with no leader for a period of time is quite critical for a production Kubernetes cluster, hence we need to define proper SLIs/SLOs based on these observations. This information can be retrieved by the
SLO_a: At the moment we don't have a metricset/data_stream that specifically collects this information from
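For context: client-go's leader election publishes a LeaderElectionRecord (fields such as `holderIdentity`, `leaseDurationSeconds` and `renewTime`) in a Lease object or the `control-plane.alpha.kubernetes.io/leader` annotation. Below is a minimal sketch of how a "no healthy leader" condition could be derived from such a record, if we collected it; the record layout and the helper are assumptions for illustration, not an existing metricset.

```python
import json
from datetime import datetime, timedelta, timezone

def leader_is_healthy(record_json: str, now: datetime) -> bool:
    """Return True if a live leader holds the lease at `now`."""
    record = json.loads(record_json)
    holder = record.get("holderIdentity", "")
    if not holder:  # an empty holder means the lease was released
        return False
    renew = datetime.fromisoformat(record["renewTime"].replace("Z", "+00:00"))
    lease = timedelta(seconds=record["leaseDurationSeconds"])
    # The leader must have renewed within the lease duration.
    return now - renew <= lease

# Example LeaderElectionRecord as client-go serializes it (values made up):
record = ('{"holderIdentity":"kube-scheduler-node-a","leaseDurationSeconds":15,'
          '"renewTime":"2023-01-01T10:00:00Z","leaderTransitions":3}')
print(leader_is_healthy(record, datetime(2023, 1, 1, 10, 0, 10, tzinfo=timezone.utc)))  # True
print(leader_is_healthy(record, datetime(2023, 1, 1, 10, 5, 0, tzinfo=timezone.utc)))   # False
```

An SLI could then be the fraction of samples for which this returns True per component.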
Leader SLIs are needed, I agree; we have seen this in SDHs recently. But before discussing specific SLIs, I would suggest trying to cover the big categories of SLIs as described by Google here: https://sre.google/workbook/implementing-slos/. Now to your question:
Does this mean that, until we implement something, we will force users to install the Prometheus integration if they want to have alerts? If so, I don't like that much.
As a first step I will focus on identifying the high-level categories/groups of SLIs/SLOs. One good approach is to follow the Four Golden Signals used by Google SRE: Latency, Traffic, Errors and Saturation. A more specific approach is the RED method developed by Tom Wilkie, a former SRE at Google, which builds on the Four Golden Signals. The RED method drops saturation because it is used for more advanced cases and because people remember things better when they come in threes :-). However, I would personally be in favor of adding the saturation/utilization aspect into the game.
In summary, I would propose the following groups of SLIs/SLOs. If we agree on this first step as a base, I will then go ahead with defining specific SLIs/SLOs per group, based on the metrics/data that we currently collect for the k8s cluster. @gizas what do you think about this plan?
Yes of course Chris. I think we are aligned; that is why I added the 3rd link above in the description, to guide the assignee of the story to group the SLOs and justify the choices.
You suggest this and you propose 3 groups :) I got confused. I would suggest (see examples from here: https://sysdig.com/blog/golden-signals-kubernetes/). Just a reminder that configuration of those through the API is part of the story, so let's think about how we should save this information.
You see 3 groups because I grouped latency and traffic into one group. In any case, this is just a conceptual grouping and 3 or 4 doesn't really matter in the end.
Yeap, that should happen at some point; it's also included in the TODOs of the issue.
Heads up on this. After struggling a bit with ES queries and aggregations, I managed to create a first alert using the k8s API request duration. One major issue here is that this metric comes as a histogram. See how we store these metrics at https://github.com/elastic/beats/blob/b0bbd16ed0064a4ef480ed15c96d0b954f0748a7/metricbeat/module/kubernetes/apiserver/_meta/testdata/docs.plain-expected.json#L12. @gizas we had discussed this in the past with @MichaelKatsoulis and we concluded that maybe we need to add
I took the idea from
That's the very first alert, which I put here as an example. I will continue adding similar ones with the currently available metrics, and I will keep updating this comment with the upcoming queries/alerts.
Group_1: latency (of control plane, like apiserver, controller-manager etc)
k8s API request duration alert
k8s Controller manager request duration alert
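On the histogram issue mentioned above: the apiserver exposes cumulative sum/count pairs, so a latency alert has to work on deltas between consecutive samples rather than on raw values. A sketch of that calculation under this assumption (field names here are illustrative, not the actual data stream schema):

```python
# Average request duration between two cumulative samples: an alert on
# "average latency over the last window" uses delta(sum) / delta(count),
# never the raw cumulative counters. Field names are made up.

def avg_duration(prev: dict, curr: dict) -> float:
    """Average request duration over the window between two samples, in seconds."""
    d_sum = curr["duration_sum_s"] - prev["duration_sum_s"]
    d_count = curr["request_count"] - prev["request_count"]
    if d_count <= 0:  # counter reset or no traffic in the window
        return 0.0
    return d_sum / d_count

prev = {"duration_sum_s": 120.0, "request_count": 1000}
curr = {"duration_sum_s": 126.0, "request_count": 1100}
print(avg_duration(prev, curr))  # 6.0s over 100 requests -> 0.06
```

In ES query terms this is what a derivative aggregation over the sum and count fields would provide.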
Group_2: resource saturation (disk pressure etc)
k8s Node CPU utilization alert
k8s Node Memory Usage alert
Node unschedulable alert
Node not ready alert
Node memory_pressure
Node disk_pressure
Node out_of_disk
Node pid_pressure
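The node-condition alerts above could key off the metricbeat `state_node` fields. A minimal sketch of a query body for the "Node not ready" case; the field names (e.g. `kubernetes.node.status.ready`) and dataset name are assumptions and may differ from the actual schema:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "data_stream.dataset": "kubernetes.state_node" } },
        { "match": { "kubernetes.node.status.ready": "false" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  }
}
```

The other condition alerts (memory_pressure, disk_pressure, out_of_disk, pid_pressure) would follow the same shape with a different status field.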
Group_3: errors (container OOM etc)
Pod RX Error Rate
Pod TX Error Rate
Pod Terminated with Error
Pod Terminated OOMKilled
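Since the thread settles on Watchers for these alerts, here is a hedged sketch of what the OOMKilled alert could look like as a full Watcher body (the index pattern, field name `kubernetes.container.status.reason`, and action are assumptions for illustration):

```json
{
  "trigger": { "schedule": { "interval": "1m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["metrics-kubernetes.state_container-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "kubernetes.container.status.reason": "OOMKilled" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": { "compare": { "ctx.payload.hits.total": { "gt": 0 } } },
  "actions": {
    "log_oom": {
      "logging": { "text": "OOMKilled containers detected in the last 5m" }
    }
  }
}
```

The trigger/input/condition/actions structure follows the standard Watcher API; only the query and action contents change per alert.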
Group_4: Cluster Connectivity
Kubernetes Shipped Docs During last minute
@gizas I think we already have a good set of alerts listed in the above comment. All of them are fully functional and ready to be used. What would be the ideal place to "store" those? Maybe somewhere inside the Kubernetes package? Note that due to the long JSON bodies it would be better to store them in standalone files.
I tried to put them under the kubernetes package and it is possible to store them under the
I also tried to put them under
For now it would be ok to store them under this subdir and reference those files from our docs maybe? However, this would be problematic when it comes to version synchronization etc. @jsoriano any thoughts here on how we could approach this better?
Really great progress here Chris!!!! I reviewed them and I don't think we are missing anything important for a first iteration.
Now as far as the folder is concerned, I guess
A reminder that we would like a script to automate the installation of all of this in one run. Ideally the user should be able to choose which ones to install or not.
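A sketch of what such an installation script could look like, assuming each alert is kept as a standalone JSON Watcher file and installed via `PUT _watcher/watch/<watch_id>`. The paths, host, and file layout are made up; the snippet only builds the request list instead of sending anything, so the user-selection step stays visible.

```python
import json
import tempfile
from pathlib import Path

def build_requests(alert_dir: Path, es_url: str, selected=None):
    """Return (url, body) pairs for every selected alert JSON file."""
    requests = []
    for path in sorted(alert_dir.glob("*.json")):
        watch_id = path.stem  # e.g. apiserver-latency-alert
        if selected is not None and watch_id not in selected:
            continue  # let the user choose which alerts to install
        body = json.loads(path.read_text())
        requests.append((f"{es_url}/_watcher/watch/{watch_id}", body))
    return requests

# Demo with a throwaway directory instead of the real package layout:
with tempfile.TemporaryDirectory() as d:
    Path(d, "apiserver-latency-alert.json").write_text('{"trigger": {}}')
    reqs = build_requests(Path(d), "https://localhost:9200")
    print(reqs[0][0])  # https://localhost:9200/_watcher/watch/apiserver-latency-alert
```

Each pair would then be sent with an authenticated HTTP PUT (e.g. via curl or `requests.put`).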
Are these alerts intended for documentation purposes only, or do we want them installed when installing the packages? If they are only used for documentation, they could be included in the
If we want them to be installed when installing the package, then we should add them to their own place in the package-spec, and we should have a follow-up task in Kibana to install them when available. This process would start with a new issue in the package-spec repository.
The end goal would be to have them installed when installing the package! I would suggest for now to include them in the docs in order to proceed with our implementation, and to follow up with the stories as you suggest.
Perfect, please create an issue in the package-spec repository to discuss this.
Thanks for your feedback here, folks! I will file an issue in package-spec. As for placing these into the package docs for now, this seems to be problematic because of the elastic-package build:
Build the package
Error: updating files failed: updating readme file apiserver-latency-alert.md failed: rendering Readme failed: parsing README template failed (path: /home/chrismark/go/src/github.com/elastic/integrations/packages/kubernetes/_dev/build/docs/apiserver-latency-alert.md): template: apiserver-latency-alert.md:91: function "ctx" not defined
I wonder if we can escape this somehow 🤔 @jsoriano has this come up before in any of the packages?
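For context on the error: the alert bodies embed Watcher mustache templates like `{{ctx.metadata.name}}`, which the Go text/template engine used for README rendering tries to parse as its own actions, hence `function "ctx" not defined`. A possible workaround, assuming the docs are rendered with standard Go text/template, is to emit the braces as a quoted literal so the engine outputs them verbatim:

```
{{/* The inner braces are wrapped in a quoted string so the template engine
     prints them verbatim instead of parsing {{ctx...}} as an action: */}}
"text": "k8s API request latency is too high on {{ "{{ctx.metadata.name}}" }}"
```

Whether elastic-package exposes this escape in its README templates is an assumption to verify.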
Files in
Thanks @jsoriano, having them under
@gizas I have filed a PR to add them inside the package: #5364
Issue in package-spec: elastic/package-spec#484
One last from me here @ChrsMark. I see we now have a full first version of alerts. Can we distinguish those from SLOs? I mean, we can keep the current
Good point @gizas! One basic limitation I found in all APIs/methods other than Watchers is that they do not provide the flexibility to define advanced ES queries with aggregations. For example, if you check the examples at https://github.com/kdelemme/kibana/blob/main/x-pack/plugins/observability/dev_docs/slo.md, SLOs can only be defined on specific fields like
So in order to implement what we need, to conceptually cover the monitoring/alerting of a k8s cluster, we have to go with the advanced queries that Watchers allow. One thing here is that it won't be great for now to have 12 Watchers/alerts and then 2 SLOs; to my mind it is better to be consistent.
Hey @gizas I tried to come up with a basic SLO, and the only thing I could create looks like the following:
curl -i -k --request POST \
--url https://elastic:changeme@0.0.0.0:5601/api/observability/slos \
--header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
--header 'Content-Type: application/json' \
--header 'kbn-xsrf: oui' \
--data '{
"name": "Kubernetes Data",
"description": "My SLO Description",
"indicator": {
"type": "sli.kql.custom",
"params": {
"index": "metrics*",
"good": "data_stream.dataset: kubernetes.node",
"total": "",
"filter": ""
}
},
"timeWindow": {
"duration": "1d",
"isRolling": true
},
"budgetingMethod": "occurrences",
"objective": {
"target": 0.001
}
}'
Again, the API looks quite limiting at the moment; for example, a numerator query is required while I only wanted to perform a doc-count query. Having said this, I feel we would need to wait for https://github.com/elastic/actionable-observability/issues/26 before investing much time in creating SLOs that are not actually useful for the k8s use case. See the plans for the "Custom Metric SLI", which is most probably what we would need: https://github.com/elastic/actionable-observability/issues/26.
Update:
curl -i -k --request POST \
--url https://elastic:changeme@0.0.0.0:5601/api/observability/slos \
--header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
--header 'Content-Type: application/json' \
--header 'kbn-xsrf: oui' \
--data '{
"name": "Kubernetes Node Network Errors",
"description": "Kubernetes Node Network Errors",
"indicator": {
"type": "sli.kql.custom",
"params": {
"index": "metrics-kubernetes*",
"good": "",
"total": "kubernetes.node.network.rx.errors <= 5 and kubernetes.node.network.tx.errors <= 5",
"filter": ""
}
},
"timeWindow": {
"duration": "1d",
"isRolling": true
},
"budgetingMethod": "occurrences",
"objective": {
"target": 0.995
}
}'
But again, since we cannot group by node name, this indicator is still not very useful for k8s cluster monitoring/alerting.
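The missing group-by matters because a cluster-wide good/total ratio can stay above the target while a single node is failing badly. A small sketch with made-up sample documents, using the same "<= 5 errors is good" threshold as the SLO above:

```python
from collections import defaultdict

# Made-up observations: node-b is broken, everything else is clean.
docs = [
    {"node": "node-a", "rx_errors": 0, "tx_errors": 1},
    {"node": "node-a", "rx_errors": 2, "tx_errors": 0},
    {"node": "node-b", "rx_errors": 40, "tx_errors": 12},
    {"node": "node-b", "rx_errors": 33, "tx_errors": 9},
] + [{"node": f"node-{i}", "rx_errors": 0, "tx_errors": 0} for i in range(2, 100)]

def is_good(d):
    return d["rx_errors"] <= 5 and d["tx_errors"] <= 5

def per_node_ratios(docs):
    good, total = defaultdict(int), defaultdict(int)
    for d in docs:
        total[d["node"]] += 1
        good[d["node"]] += is_good(d)
    return {n: good[n] / total[n] for n in total}

cluster_ratio = sum(map(is_good, docs)) / len(docs)
per_node = per_node_ratios(docs)
print(round(cluster_ratio, 3))  # the cluster-wide SLI still looks healthy
print(per_node["node-b"])       # 0.0
```

This is the aggregation (a terms bucket on node name) that the current `sli.kql.custom` indicator cannot express.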
Blogpost was released: https://www.elastic.co/blog/enable-kubernetes-alerting-elastic-observability. Closing this one for now. |
Context
Observability includes much more than the collection of metrics and logs. We should be able to provide our users with a holistic approach where functionalities like alerting and SLO definition are part of the user experience for Kubernetes observability. In more detail, when users onboard a new Kubernetes cluster, we should help them configure specific alerts and SLOs and link those with specific instances of the data collected.
Action
As part of this story we should:
Useful Links
SLO API: https://github.com/kdelemme/kibana/blob/main/x-pack/plugins/observability/dev_docs/slo.md#L0-L1
Alert API: https://www.elastic.co/guide/en/kibana/current/alerting-apis.html
Deliverables
Relevant useful links:
TODOs
a. Identify which metrics need to be collected/improved ✅
b. Create snippets to install these alerts through the API ✅
a. PR to add them in k8s package docs: Add documentation for k8s alerts installation #5364
b. PR to add them in automation repo: Add k8s monitoring watchers k8s-integration-infra#16
Extras
5. Investigate how alerts can be part of the package: ✅ Issue created -> elastic/package-spec#484
6. Demo/blogpost: https://github.com/elastic/blog/issues/1869
7. Any follow up documentation.