Add missing hosts info to the prometheus exporter output. #8328
Codecov Report
Additional details and impacted files

@@             Coverage Diff              @@
##               main    #8328      +/-   ##
============================================
- Coverage     29.19%   23.30%    -5.90%
+ Complexity    31390    23701     -7689
============================================
  Files          5260     5128      -132
  Lines        369709   347679    -22030
  Branches      53890    49901     -3989
============================================
- Hits         107950    81038    -26912
- Misses       246976   254990     +8014
+ Partials      14783    11651     -3132

Flags with carried forward coverage won't be shown.
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7969
@blueorangutan test
@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
[SF] Trillian test result (tid-8506)
Code LGTM
LGTM - didn't test it
LGTM on a high level, good idea.
code lgtm
not tested it
Description
Occasionally the hostStats object for an agent becomes null in the management server. This is rare and we have not found the root cause yet, but it does occur in our CloudStack deployments with many hosts.
The hostStats object is null even though the agent is Up and hosting multiple VMs. The VM consoles remain accessible and tasks can still be executed on the VMs.
This pull request does not address the underlying issue. Instead, it exposes those hosts in the Prometheus exporter output so that we can restart the agent and get the necessary information.
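The idea described above can be illustrated with a minimal Python sketch. It is not the exporter's actual code (which lives in the CloudStack Prometheus module); the `Host` class, the `missing_info_samples` function and the `zone`/`hostname` label names are hypothetical and only show the shape of the logic: every host whose stats object is missing produces one sample of the `cloudstack_host_missing_info` metric in the Prometheus text format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    # Hypothetical stand-in for a host record; not a CloudStack class.
    name: str
    zone: str
    stats: Optional[dict]  # None models a null hostStats object


def missing_info_samples(hosts: List[Host]) -> List[str]:
    """Return one text-format sample per host whose stats are missing."""
    samples = []
    for host in hosts:
        if host.stats is None:
            # Label names are illustrative assumptions, not the real exporter labels.
            samples.append(
                f'cloudstack_host_missing_info{{zone="{host.zone}",hostname="{host.name}"}} 1'
            )
    return samples
```

Hosts with a populated stats object produce no sample here, so the metric only shows up for hosts that need attention.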
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
How Has This Been Tested?
1. Set prometheus.exporter.enable to true.
2. Run curl localhost:9595/metrics on the management server to make sure that the Prometheus exporter is working.
3. Run curl localhost:9595/metrics | grep cloudstack_host_missing_info. You get nothing in the output because the host state is still there. (If you wait a couple of minutes, the management server may remove it.)
4. Run curl localhost:9595/metrics | grep cloudstack_host_missing_info again to get the following output:
How did you try to break this feature and the system with this change?
The change should not affect other areas of the code, as the Prometheus module is a largely independent part of CloudStack.
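The curl and grep check from the testing steps could also be scripted, for example to run it periodically. The sketch below is an assumption-laden illustration: the function names `fetch_metrics` and `grep_metric` are made up for this example, and the URL is the one used in the testing steps. Unlike a plain grep, the filter keeps only sample lines and skips `# HELP` and `# TYPE` comment lines.

```python
import urllib.request
from typing import List

METRIC = "cloudstack_host_missing_info"


def fetch_metrics(url: str = "http://localhost:9595/metrics", timeout: float = 5.0) -> str:
    """Download the exporter output as text (endpoint taken from the testing steps)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def grep_metric(metrics_text: str, name: str = METRIC) -> List[str]:
    """Return the sample lines that start with the given metric name."""
    # Comment lines begin with '#', so they never match the name prefix.
    return [line for line in metrics_text.splitlines() if line.startswith(name)]
```

Calling `grep_metric(fetch_metrics())` on the management server returns an empty list while all hosts report stats, and one line per affected host otherwise.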