Compare old/new plugin output for missing details #32

atc0005 · 2021-01-05T14:12:43Z

Off the top of my head, I'm thinking of the CRITICAL and WARNING threshold details shown in the one-line summary output of the older plugins. Those details make it easy to see at a glance why a Service Check has been determined to be in a non-OK state.


atc0005 · 2021-01-07T10:04:32Z

As an example, here is what the PowerCLI plugin uses for its one-line summary (not the whole output string):

$usageLevelSummary = "vCPU allocation for powered VMs is $($vCPUsPercentageUsedOfAllowed)% of $($MaxVCPUsAllowed) ($($vCPUsRemaining) remaining) [WARNING: $($WarningUse)% , CRITICAL: $($CriticalUse)%]"

and here is real production output for the VMs in a specific resource pool:

OK: vCPU allocation for powered VMs is 70% of 20 (6 remaining) [WARNING: 95% , CRITICAL: 97%]

The new plugin should provide this info in 1:1 form, or in a modified form that better communicates the details.

atc0005 · 2021-01-11T11:34:42Z

Here is some live output from the v0.1.0 (and likely v0.1.1) release of the vCPUs allocation plugin:

OK: 14 vCPUs allocated (70.0%): 6 more remaining from 20 allowed (evaluated 5 VMs, 1 Resource Pools)

OK: 115 vCPUs allocated (71.9%): 45 more remaining from 160 allowed (evaluated 61 VMs, 4 Resource Pools)

The question is whether that format is any better. We don't explicitly note the WARNING and CRITICAL thresholds in the one-line summary, but we do list how many VMs and how many Resource Pools have been evaluated. The hope is that having that right there in the summary will help pinpoint configuration issues more quickly than a reminder of what the thresholds are.

The threshold values are shown in the Long Service Output in the web UI like so:

Service State Information
Current Status:	  OK   (for 0d 15h 5m 4s)
Status Information:	OK: 115 vCPUs allocated (71.9%): 45 more remaining from 160 allowed (evaluated 61 VMs, 4 Resource Pools)

**ERRORS**

* None

**THRESHOLDS**

* CRITICAL: 100% of 160 vCPUs allocated
* WARNING: 97% of 160 vCPUs allocated

**DETAILED INFO**

* vCPUs
** Allocated: 115 (71.9%)
** Max Allowed: 160

This seems like a fair compromise?
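For comparison, here is a minimal sketch of a one-line summary that carries both the evaluated VM / Resource Pool counts and the bracketed thresholds from the PowerCLI plugin. The `vcpuCheck` type, the `oneLineSummary` function and the "at or above the threshold" comparison are assumptions made for illustration, not the plugin's actual code.

```go
package main

import "fmt"

// vcpuCheck is a hypothetical container for the values shown in the
// one-line summary; the field names are illustrative only.
type vcpuCheck struct {
	Allocated     int
	MaxAllowed    int
	WarningPct    int
	CriticalPct   int
	VMs           int
	ResourcePools int
}

// oneLineSummary combines the newer summary format with the bracketed
// threshold list used by the older PowerCLI plugin.
// Assumption: a state is triggered when usage is at or above its threshold.
func oneLineSummary(c vcpuCheck) string {
	pct := float64(c.Allocated) / float64(c.MaxAllowed) * 100

	state := "OK"
	switch {
	case pct >= float64(c.CriticalPct):
		state = "CRITICAL"
	case pct >= float64(c.WarningPct):
		state = "WARNING"
	}

	return fmt.Sprintf(
		"%s: %d vCPUs allocated (%.1f%%): %d more remaining from %d allowed "+
			"(evaluated %d VMs, %d Resource Pools) [WARNING: %d%% , CRITICAL: %d%%]",
		state, c.Allocated, pct, c.MaxAllowed-c.Allocated, c.MaxAllowed,
		c.VMs, c.ResourcePools, c.WarningPct, c.CriticalPct,
	)
}

func main() {
	// Values taken from the single resource pool example above.
	fmt.Println(oneLineSummary(vcpuCheck{
		Allocated: 14, MaxAllowed: 20,
		WarningPct: 95, CriticalPct: 97,
		VMs: 5, ResourcePools: 1,
	}))
}
```

The trade-off is a longer summary line, which is part of why leaving the thresholds in the Long Service Output may still be the better choice.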

atc0005 · 2021-01-11T11:36:31Z

One thing not clearly noted is whether the evaluated VMs are powered on or not. That's not noted in the one-line summary or the Long Service Output listing.

This should probably be noted for all plugins which allow filtering on power status. The same goes for any other explicit evaluation criteria toggled by the sysadmin configuring the service check command definition. Choices there should be explicitly noted in the Long Service Output, if not in the one-line summary.
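A rough sketch of how the chosen evaluation criteria could be listed in the Long Service Output, following the same section layout as THRESHOLDS and DETAILED INFO. The `EVALUATION CRITERIA` section name, the `criterion` type and the `criteriaSection` function are hypothetical; nothing here reflects an existing plugin feature.

```go
package main

import (
	"fmt"
	"strings"
)

// criterion is a hypothetical name/value pair describing one choice made
// by the sysadmin in the service check command definition.
type criterion struct {
	Name  string
	Value string
}

// criteriaSection renders the criteria as one more section of the Long
// Service Output. The section header is an assumption for illustration.
func criteriaSection(criteria []criterion) string {
	var b strings.Builder
	b.WriteString("**EVALUATION CRITERIA**\n\n")
	for _, c := range criteria {
		fmt.Fprintf(&b, "* %s: %s\n", c.Name, c.Value)
	}
	return b.String()
}

func main() {
	fmt.Print(criteriaSection([]criterion{
		{Name: "Powered off VMs evaluated", Value: "no"},
	}))
}
```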

As with several other plugins in this project, this one borrows heavily from existing projects. In particular, this plugin was initially based on a PowerShell / PowerCLI plugin I wrote in 2019. Doc updates have been applied, example usage has been added, including a command definition "contrib" file illustrating how the plugin would be referenced within a production Nagios configuration.

Note: Some minor scratch notes from my attempt at crafting a combined age/size plugin are also included. Those notes mostly focus on my attempts to understand the process of determining the size of a snapshot using govmomi and the vSphere Web Services API. Partial work towards implementing snapshot size monitoring has also been included, though it is non-functional at this time. I hope to return to this once I understand how the vSphere API (through govmomi) can be used to reliably determine snapshot size information.

Other small (unrelated) fixes have also been included, including some bad copy/paste/modify attempts in the README, doc comments, etc.

- refs GH-4
- refs GH-32

atc0005 · 2021-02-01T14:13:16Z

Working on deploying updated plugins based on a build of the current master branch. The work from #107 in particular is up for extended testing.

#69 provided a new plugin which omits the [WARNING: 97% , CRITICAL: 99%] values displayed by the PowerCLI plugin. Not sure yet if that is a "problem".

From a different service check: 0.96% of total capacity.

That's not included in the new plugin's output.

EDIT:

Probably easier to just include an entire line for completeness:

OK: Memory usage is at 93.89% of 40 GB allowed (2.45 GB remaining), 0.96% of total capacity. [WARNING: 101% , CRITICAL: 110%]

atc0005 · 2021-02-01T23:50:17Z

OK: Memory usage is at 93.89% of 40 GB allowed (2.45 GB remaining), 0.96% of total capacity. [WARNING: 101% , CRITICAL: 110%]

The 0.96% of total capacity remark seems to be computed using these bits of PowerCLI logic:

$poolDetails = @{
    "name" = $_.Name;
    "cpuActive" = ($_.Runtime.Cpu.OverallUsage / 1000);
    "memoryConsumed" = ($_.Runtime.Memory.OverallUsage / 1GB)
    "memoryTotal" = ($_.Runtime.Memory.MaxUsage / 1GB)
}

and

# This property is attached to each entry in the pool; fetch value from first
# array entry.
if ($detailedPools.Count -gt 0) {
    $totalMemoryAvailable = $detailedPools[0].memoryTotal
}

$memoryPercentageAllowed = [math]::Round(($totalMemoryUsed / $MaxMemoryAllowed) * 100, 2)
$memoryPercentageTotalCapacity = [math]::Round(($totalMemoryUsed / $totalMemoryAvailable) * 100, 2)
$memoryRemaining = [math]::Round(($MaxMemoryAllowed - $totalMemoryUsed), 2)

Per the Data Object - ResourcePoolResourceUsage(vim.ResourcePool.ResourceUsage) doc, this is what the maxUsage field is about:

NAME	TYPE	DESCRIPTION
maxUsage	xsd:long	Current upper-bound on usage. The upper-bound is based on the limit configured on this resource pool, as well as limits configured on any parent resource pool.

It may be that I was able to compute the total memory available in the cluster due to the memory limit on the pool being unlimited? This doesn't seem like a reliable way to list the overall percentage of memory consumed from the cluster. Instead you'd have to get the list of hosts, tally the total memory, then calculate per pool and in aggregate.

If there are pool caps, that would need to factor in somehow?
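The host-based approach described above could look roughly like this. It is only a sketch of the arithmetic: the `host` and `pool` types, their fields and the sample values are made up, and a real plugin would retrieve these numbers from vSphere (through govmomi) rather than hard-coding them. Pool caps are not handled here.

```go
package main

import "fmt"

// host and pool are hypothetical stand-ins for values a plugin would
// retrieve from vSphere; the field names are illustrative only.
type host struct {
	Name        string
	MemoryBytes int64
}

type pool struct {
	Name            string
	MemoryUsedBytes int64
}

// totalMemory tallies the physical memory of every host in the cluster.
func totalMemory(hosts []host) int64 {
	var total int64
	for _, h := range hosts {
		total += h.MemoryBytes
	}
	return total
}

// percentOf returns used as a percentage of total, or 0 when total is
// not positive, to avoid dividing by zero.
func percentOf(used, total int64) float64 {
	if total <= 0 {
		return 0
	}
	return float64(used) / float64(total) * 100
}

func main() {
	const gib = int64(1) << 30

	// Made-up sample values, not taken from a real cluster.
	hosts := []host{
		{Name: "esx1", MemoryBytes: 256 * gib},
		{Name: "esx2", MemoryBytes: 256 * gib},
	}
	pools := []pool{
		{Name: "pool-a", MemoryUsedBytes: 48 * gib},
		{Name: "pool-b", MemoryUsedBytes: 16 * gib},
	}

	clusterTotal := totalMemory(hosts)

	// Report each pool on its own, then all pools in aggregate.
	var usedAll int64
	for _, p := range pools {
		usedAll += p.MemoryUsedBytes
		fmt.Printf("%s: %.2f%% of total capacity\n",
			p.Name, percentOf(p.MemoryUsedBytes, clusterTotal))
	}
	fmt.Printf("all pools: %.2f%% of total capacity\n",
		percentOf(usedAll, clusterTotal))
}
```

Because the denominator comes from the hosts instead of the pool's `maxUsage` value, the result does not depend on whether the pool's memory limit happens to be unlimited.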

atc0005 · 2021-02-01T23:56:13Z

The question is whether that format is any better.

I'm biased, but I like the new format better. Overall I think I've met the original intent for this issue, so I'm considering it resolved. #110 was spun off to handle testing the addition of reporting the percentage of memory used from total cluster capacity, and I can spin off new issues for anything not already covered.

Considering this resolved.

atc0005 added this to the v0.1.0 milestone Jan 5, 2021

atc0005 self-assigned this Jan 5, 2021

atc0005 modified the milestones: v0.1.0, Future, v0.6.0 Jan 6, 2021

atc0005 pinned this issue Jan 6, 2021

This was referenced Jan 11, 2021

Choice of including/excluding VMs from evaluation based on power status not exposed #44

Closed

Initial draft of datastore monitoring plugin #58

Merged

atc0005 mentioned this issue Jan 19, 2021

Initial draft of snapshots age monitoring plugin #69

Merged

atc0005 unpinned this issue Jan 26, 2021

atc0005 mentioned this issue Feb 1, 2021

Add support for listing Resource Pool memory usage as percentage of total cluster capacity #110

Closed

atc0005 closed this as completed Feb 1, 2021

atc0005 added output/extended Long Service Output (aka, "extended" or "detailed") output/summary Service Output (aka, "one-line-summary") labels Feb 2, 2021
