Fix prometheus rules and improve/fix the SOP #151
@@ -15,7 +15,7 @@ spec:
   - name: general.rules
     rules:
     - alert: MobileSecurityServiceDown
-      expr: absent(up{job="mobile-security-service-application"} == 1)
+      expr: absent(kube_pod_container_status_running{namespace="mobile-security-service",container="application"}==1)
       for: 5m
The alert was not working here because this had changed.
Should this be >= 1?
@camilamacedo86 what about this comment?
That's true. Thanks a lot for catching it.
Done. I also added further information to the SOPs, as requested by OPS.
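For reference, a minimal sketch of how the rule might read with the suggested comparison, assuming `>= 1` was adopted as written in the review comment (the final committed form is not shown in this thread, and the indentation is illustrative):

```yaml
- alert: MobileSecurityServiceDown
  # The ">= 1" filter keeps only series whose value is at least 1,
  # i.e. "application" containers that report a running state.
  # absent() returns a single sample with value 1 only when that
  # filtered result is empty, so the alert fires when no such
  # container is running in the namespace.
  expr: absent(kube_pod_container_status_running{namespace="mobile-security-service",container="application"} >= 1)
  # The condition must hold for 5 minutes before the alert fires.
  for: 5m
```

Since `kube_pod_container_status_running` is reported as 0 or 1 per container, `>= 1` and `== 1` should select the same series in practice.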
         summary: Pod count for namespace mobile-security-service is {{ printf "%.0f" $value }}. Expected 3 pods. For more information see on the MSS operator https://github.com/aerogear/mobile-security-service-operator"
         sop_url: "https://github.com/aerogear/mobile-security-service-operator/blob/0.2.0/SOP/SOP-operator.adoc"
       expr: |
-        (1-absent(kube_pod_status_ready{condition="true", namespace="mobile-security-service"})) or sum(kube_pod_status_ready{condition="true", namespace="mobile-security-service"}) != 3
+        (1-absent(kube_pod_status_ready{condition="true", namespace="mobile-security-service"})) or sum(kube_pod_status_ready{condition="true", namespace="mobile-security-service"}) < 3
Rather than hard-coding the namespace value here, is it possible to get the namespace name dynamically? The problem is that the namespace name could differ between clusters.
cc @grdryn @laurafitzgerald: are we using a hard-coded namespace in UPS too?
As I understand it, this is handled in the integr8ly Ansible scripts in the same way as it is here, for example: https://github.com/integr8ly/installation/pull/687/files#diff-eaa212d002e12d63341d8140d3803235R18. The rule here works for the implementation in this repo; in RHMI, it is implemented by the Ansible scripts.
Please let me know if I have misunderstood how you did it, folks. @laurafitzgerald @austincunningham
Ideally, the monitoring resources for the service (rather than for the operator itself, although maybe it could provision those for itself too, just as it provisions its own Service for metrics exposure) would be provisioned by the operator, after it checks that the relevant CRDs already exist on the cluster. I think that's the best eventual solution. It may not be necessary for this release, and if you're correct about the integreatly installer, then that's probably enough for now.
The keylock is not managed that way either; it does not do things the way you are proposing here. In the future, though, this could be improved across all projects, as you suggested. For now, IMHO it is fine to move forward in a way that works for the current repo and lets us easily create the templates that the Ansible scripts manage, as has been done so far.
My understanding is that we agreed internally to move forward with this approach for now. Please let me know if there is anything else we should change here.
Let's create some issues to capture future improvements.
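For anyone picking up those follow-up issues, here is an annotated reading of the pod-count expression discussed in this thread. It is only a sketch: the alert name is hypothetical (the real one is not visible in the diff hunk), the indentation is illustrative, and the namespace literal is the hard-coded value the thread is about.

```yaml
# Hypothetical alert name, used for illustration only.
- alert: ExamplePodCount
  # Left side: absent(...) yields 1 only when no ready-status series
  # exist for the namespace, so "1 - absent(...)" yields a sample with
  # value 0 in that case and nothing otherwise.
  # Right side: sum(...) adds the 0/1 ready values, i.e. it counts the
  # ready pods, and "< 3" keeps the sample only when fewer than 3 are ready.
  # "or" binds more loosely than "<", so the rule returns a sample
  # (and can fire) when either side produces one.
  expr: |
    (1 - absent(kube_pod_status_ready{condition="true", namespace="mobile-security-service"}))
    or sum(kube_pod_status_ready{condition="true", namespace="mobile-security-service"}) < 3
```

The namespace appears twice in the expression, so any templating done by an installer would need to substitute both occurrences.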
deploy/monitor/prometheus_rule.yaml
Outdated
       annotations:
-        description: The Pod count for the mobile-security-service has changed in the last 5 minutes.
+        description: The Pod count for the mobile-security-service has changed in the last 5 minutes and has not the minimal quantity required.
The Pod count for the mobile-security-service has changed in the last 5 minutes and is less than the minimal required value.
Done 👍. However, any change to the PR files drops the approval. Could you please approve it again so that we can merge it?
A small suggestion to update the description. Otherwise LGTM.
Hi @maleck13 @Nanyte25, please feel free to make suggestions for any part of the SOP files.
Hi @wei-lee, I think we are able to move forward here now.
Motivation
What
Verification Steps
Check the SOPs (MSS and Operator).
Check the changed Prometheus rule here: https://prometheus-route-middleware-monitoring.apps.london-4bc4.openshiftworkshop.com/alerts
Checklist:
Progress
Additional Notes