Monitoring
Greg DeKoenigsberg edited this page Nov 19, 2012
This page lists resources that a monitoring agent could check on a production Eucalyptus IaaS cloud.
Standard Operating System Checks
- CPU Load - Done
- System Load - Done
- Memory usage - Done
- Swap usage - Done
- Disk space - Done
- Network Traffic
- Total Processes - Done
- I/O (metric tbd)
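As an illustration of the operating system checks above, here is a minimal sketch of a load and disk space check using only the Python standard library. The threshold values and the function names (`load_per_core`, `disk_used_fraction`, `check_host`) are assumptions made for this example, not part of any existing Eucalyptus monitoring agent.

```python
import os
import shutil

# Hypothetical thresholds for illustration; tune them for your own hosts.
LOAD_PER_CORE_WARN = 1.0
DISK_USED_WARN = 0.90


def load_per_core(load_1min, cores):
    """Normalize the 1-minute load average by the number of cores."""
    return load_1min / max(cores, 1)


def disk_used_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def check_host(path="/"):
    """Return a list of warning strings; an empty list means nothing tripped."""
    warnings = []
    load_1min, _, _ = os.getloadavg()
    per_core = load_per_core(load_1min, os.cpu_count() or 1)
    if per_core > LOAD_PER_CORE_WARN:
        warnings.append("load per core is %.2f" % per_core)
    used = disk_used_fraction(path)
    if used > DISK_USED_WARN:
        warnings.append("%s is %.0f%% full" % (path, used * 100))
    return warnings
```

The same `disk_used_fraction` helper could be pointed at any of the Eucalyptus directories named elsewhere on this page.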
Cloud
- Running images
- Running instance type
- Available resources
- Available cores - Done
Cloud Controller
- PostgreSQL: number of connections, proxool errors in the CLC logs, CLC can connect to postgres
- Number of public IP addresses allocated - Done
- Number of available instances per type (warn when availability runs low)
- CLC logs indicate errors in VmInstance$RestoreAllocation
- CLC,Walrus,SC,VB/jvm (heap usage, full gc stop time threshold)
- Check TCP port 8773 is listening - Done
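The "TCP port is listening" check, which appears in each component list on this page, can be sketched as a plain connection attempt. This is a minimal example assuming the agent can reach the component over the network. The function name `port_is_listening` and the 3-second timeout are choices made for this illustration.

```python
import socket


def port_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8773 is the port listed above for the CLC; "localhost" is an assumption.
    print("8773 listening:", port_is_listening("localhost", 8773))
```

A successful connect only shows that something accepts connections on the port. It does not confirm that the service behind it is healthy.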
Walrus
- Cache size
- Bukkits disk usage
- DRBD sync & role (status)
- Walrus logs indicate "Peer is primary and I am supposed to be master! Unable to proceed!"
- Check TCP port 8773 is listening - Done
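The log check for the Walrus "Peer is primary" message could be sketched as a simple substring scan. The message text is taken from the list above. The log file location is not given on this page, so the path is left to the caller, and the function names are hypothetical.

```python
# Message text as quoted in the Walrus list on this page.
PEER_PRIMARY_MSG = "Peer is primary and I am supposed to be master! Unable to proceed!"


def find_matching_lines(lines, needle=PEER_PRIMARY_MSG):
    """Return (line_number, text) for every line that contains needle."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]


def scan_log(path, needle=PEER_PRIMARY_MSG):
    """Scan the log file at path (caller-supplied) for needle."""
    with open(path, "r", errors="replace") as handle:
        return find_matching_lines(handle, needle)
```

The same scan could be reused for the other log-based checks on this page by passing a different `needle`, although the exact strings to match would need to be confirmed against real logs.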
Cluster Controller
- Do the NODES entries match between the paired CCs in an HA CC setup
- CC/NC rampart package versions do not match (more specific to a Eucalyptus bug than to daily monitoring?)
- CC logs indicate errors communicating with an NC
- Network tomography (partition detection; routing/switching check)
- Check TCP port 8774 is listening - Done
- CC image cache size vs size of current cache (/var/lib/eucalyptus/dynserv/)
Storage Controller
- Local cache
- Loopback devices in use - Done
- Specifically disk space in /var/lib/eucalyptus/volumes
- Failed snapshots? (check for vol and snap IDs by filename length?)
- tgtd status
- NetApp/Equallogic logs have failures because an "iSCSI session from another initiator already exists"
- I/O Wait
- Check TCP port 8773 is listening - Done
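One possible way to count loopback devices in use, as listed for both the Storage Controller and the Node Controller, is to read the Linux sysfs tree, where a bound loop device exposes a `loop/backing_file` entry. This sketch assumes a Linux host with sysfs mounted at `/sys`. The function names, the `max_loop` argument, and the 80% warning fraction are assumptions for illustration.

```python
import glob
import os


def loop_devices_in_use(sys_block="/sys/block"):
    """Map loop device name to backing file for every bound loop device."""
    in_use = {}
    pattern = os.path.join(sys_block, "loop*", "loop", "backing_file")
    for backing in glob.glob(pattern):
        # backing looks like <sys_block>/loopN/loop/backing_file
        device = os.path.basename(os.path.dirname(os.path.dirname(backing)))
        try:
            with open(backing) as handle:
                in_use[device] = handle.read().strip()
        except OSError:
            # The device may have been detached between glob and open.
            continue
    return in_use


def loop_usage_warning(in_use_count, max_loop, warn_fraction=0.8):
    """Return True when usage reaches warn_fraction of the max_loop devices."""
    return in_use_count >= max_loop * warn_fraction
```

The caller would supply `max_loop` from however the host's loop device limit is configured, which this page does not specify.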
Node Controller
- NC/kernel (load average threshold, dmesg call stacks with io_sched in them)
- NC/libvirtd (is it alive)
- Loopback devices in use - Done
- iscsid - To rethink: if iscsid stops, Eucalyptus restarts it automatically when mounting an EBS volume.
- iscsiadm (can establish sessions)
- libvirtd - Done
- Specifically disk space in /var/lib/eucalyptus/instances/[cache,work]
- NC has not received a describeResources request in the last WARN_TIMEOUT minutes: Warning
- NC has not received a describeResources request in the last ERROR_TIMEOUT minutes: Error
- iscsiadm on NC is not responding to control operations or has errors about loading kernel module
- Check TCP port 80 is listening - Done
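The two describeResources items above describe a pair of staleness thresholds. A minimal sketch of that classification follows. It assumes the agent can already obtain the time of the last describeResources request on the NC (how to obtain it is not covered on this page), and the function name and return strings are hypothetical.

```python
from datetime import datetime, timedelta


def classify_describe_resources(last_request, now, warn_timeout_min, error_timeout_min):
    """Classify how stale the last describeResources request is.

    last_request and now are datetime objects; the timeouts are in minutes
    and correspond to WARN_TIMEOUT and ERROR_TIMEOUT in the list above.
    """
    age = now - last_request
    if age >= timedelta(minutes=error_timeout_min):
        return "ERROR"
    if age >= timedelta(minutes=warn_timeout_min):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    now = datetime(2012, 11, 19, 12, 0)
    last = now - timedelta(minutes=7)
    # The timeout values 5 and 15 are placeholders, not recommended settings.
    print(classify_describe_resources(last, now, 5, 15))
```

Checking the error threshold first means a request that is stale enough for both levels is reported once, at the higher severity.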
Made available under the CC-BY-3.0-US license.
© 2015 Hewlett-Packard Development Company, L.P.