This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

Fix prometheus rules and improve/fix the SOP #151

Merged
merged 6 commits into from
Jul 16, 2019

Conversation

camilamacedo86
Contributor

@camilamacedo86 camilamacedo86 commented Jul 12, 2019

Motivation

What

  • Remove the CLI steps from the SOPs
  • Small fixes in the steps
  • Fix MobileSecurityServicePodCount (it should alert when the pod count is < 3, NOT when it is != 3)
  • Fix MobileSecurityServiceDown (it was not working at all and fired even when the MSS pod was up and running)
  • Add steps to scale the MSS pods in case performance issues are faced
  • Add further information to address the OPS suggestions/requirements

Verification Steps

Checklist:

  • Code has been tested locally by PR requester
  • Changes have been successfully verified by another team member

Progress

  • Finished task
  • TODO

Additional Notes

(Screenshot attached: 2019-07-12 at 08:12:13)

@camilamacedo86 camilamacedo86 force-pushed the fix-prometheus-sop branch 2 times, most recently from 201cdd0 to 9109769 on July 12, 2019 11:11
@camilamacedo86 camilamacedo86 changed the title fix(SOP and Prometheus rules): explain how to increase the pods in the… Fix prometheus rules and improve/fix the SOP Jul 12, 2019
@camilamacedo86 camilamacedo86 force-pushed the fix-prometheus-sop branch 3 times, most recently from 827908a to 4a57e61 on July 12, 2019 12:28
@@ -15,7 +15,7 @@ spec:
     - name: general.rules
       rules:
       - alert: MobileSecurityServiceDown
-        expr: absent(up{job="mobile-security-service-application"} == 1)
+        expr: absent(kube_pod_container_status_running{namespace="mobile-security-service",container="application"}==1)
         for: 5m
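In other words, the old rule depended on the `up` series from the Prometheus scrape job, which did not behave as intended here, while the new rule checks kube-state-metrics' container-running status directly. A minimal sketch of the resulting rule, assuming kube-state-metrics is deployed on the cluster (the `severity` label is an illustrative assumption, not shown in the diff):

```yaml
groups:
  - name: general.rules
    rules:
      - alert: MobileSecurityServiceDown
        # absent(...) returns a sample only when no running "application"
        # container is reported in the mobile-security-service namespace,
        # so the alert keys off the container state rather than the
        # scrape job's `up` series
        expr: absent(kube_pod_container_status_running{namespace="mobile-security-service",container="application"} == 1)
        for: 5m
        labels:
          severity: critical  # assumption for illustration
```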
Contributor Author

@camilamacedo86 camilamacedo86 Jul 12, 2019


It was not working before; that is why this was changed.


Should this be >= 1?


@camilamacedo86 what about this comment?

Contributor Author


That's true. Thanks for catching it.

Contributor Author


Done, and I added further information in the SOPs as requested/suggested by OPS.

              summary: Pod count for namespace mobile-security-service is {{ printf "%.0f" $value }}. Expected 3 pods. For more information, see the MSS operator: https://github.com/aerogear/mobile-security-service-operator
              sop_url: "https://github.com/aerogear/mobile-security-service-operator/blob/0.2.0/SOP/SOP-operator.adoc"
              expr: |
-               (1-absent(kube_pod_status_ready{condition="true", namespace="mobile-security-service"})) or sum(kube_pod_status_ready{condition="true", namespace="mobile-security-service"}) != 3
+               (1-absent(kube_pod_status_ready{condition="true", namespace="mobile-security-service"})) or sum(kube_pod_status_ready{condition="true", namespace="mobile-security-service"}) < 3
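To unpack the corrected expression: `absent(...)` returns a sample when no ready-pod series exists at all, so `1-absent(...)` evaluates to 0 in that case and the alert fires (Prometheus fires an alert on any returned sample, regardless of its value); when the series does exist, the left term is empty and the `or` falls through to the `sum(...) < 3` comparison, which now fires only when fewer than 3 pods are ready instead of on any deviation from 3. A sketch of the rule as changed in this PR:

```yaml
- alert: MobileSecurityServicePodCount
  expr: |
    # left term fires when the metric is absent; right term fires
    # when fewer than 3 pods report Ready in the namespace
    (1-absent(kube_pod_status_ready{condition="true", namespace="mobile-security-service"})) or sum(kube_pod_status_ready{condition="true", namespace="mobile-security-service"}) < 3
  for: 5m
```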

Rather than hard-coding the namespace value here, is it possible to get the namespace name dynamically? The problem is that the namespace name could be different on different clusters.

cc @grdryn @laurafitzgerald are we using hardcoded namespace in UPS too?

Contributor Author

@camilamacedo86 camilamacedo86 Jul 15, 2019


I understand this is handled in the integr8ly Ansible scripts, as was done here, for example: https://github.com/integr8ly/installation/pull/687/files#diff-eaa212d002e12d63341d8140d3803235R18. The hard-coded value works for this repo's implementation; in RHMI, the namespace is set by the Ansible scripts.

Please let me know if I misunderstood how you did it, folks. @laurafitzgerald @austincunningham
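For reference, one way to avoid the hard-coded namespace is to template the rule at install time, as the integr8ly Ansible scripts do for other services. A minimal sketch, assuming a Jinja2 template with an `mss_namespace` variable (both the file name and the variable name here are hypothetical, not from this repo):

```yaml
# templates/mss-prometheus-rule.yaml.j2 (hypothetical template)
# {{ mss_namespace }} is substituted per cluster at install time
- alert: MobileSecurityServicePodCount
  expr: |
    sum(kube_pod_status_ready{condition="true", namespace="{{ mss_namespace }}"}) < 3
  for: 5m
```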

Member


Ideally, the operator would provision the monitoring resources for the service itself (after checking that the relevant CRDs exist on the cluster), just as it already provisions its own Service for metrics exposure; maybe it could provision monitoring for itself as well. I think that's the best eventual solution (maybe not necessary for this release, and if you're correct about the integreatly installer, then that's probably enough for now).

Contributor Author

@camilamacedo86 camilamacedo86 Jul 15, 2019


Keycloak is not managed that way either, and does not do things in the way you are proposing here. In the future it could perhaps be improved horizontally across all projects as you suggested. For now, IMHO it seems fine to move forward in a way that works for the current repo and lets us easily create the templates managed by the Ansible scripts, as has been done so far.

Contributor Author

@camilamacedo86 camilamacedo86 Jul 15, 2019


I understand we agreed internally to move forward with this approach for now. Please let me know if there is anything else we should change here.


Let's create some issues to capture the future improvements.

            annotations:
-             description: The Pod count for the mobile-security-service has changed in the last 5 minutes.
+             description: The Pod count for the mobile-security-service has changed in the last 5 minutes and has not the minimal quantity required.

The Pod count for the mobile-security-service has changed in the last 5 minutes and is less than the minimal required value.

Contributor Author


Done 👍. However, any change to the PR files dismisses the approval. Could you please approve it again so we can merge?

wei-lee
wei-lee previously approved these changes Jul 15, 2019

@wei-lee wei-lee left a comment


A small suggestion to update the description; otherwise LGTM.

@camilamacedo86
Contributor Author

camilamacedo86 commented Jul 15, 2019

Hi @maleck13 @Nanyte25, please feel free to make suggestions for any part of the SOP files.
See here: https://github.com/aerogear/mobile-security-service-operator/tree/master/SOP.

@camilamacedo86
Contributor Author

Hi @wei-lee, I think we can move forward here now.

@camilamacedo86 camilamacedo86 merged commit 82717bd into aerogear:master Jul 16, 2019
@camilamacedo86 camilamacedo86 deleted the fix-prometheus-sop branch July 16, 2019 08:35