ZenPack to model & monitor storage devices' S.M.A.R.T. status
- smartmontools
- Should work with versions 6.x and 7.x
- Potentially limited functionality with 5.x
- An account on the host, which can
- Log in via SSH with a key
- Use a bash-compatible shell
- Run the smartctl command with certain parameters via privilege escalation without a password
- This may not be required on some hosts, depending on configuration
- Currently tries to detect dzdo, doas, pfexec, and sudo
- ZenPackLib
Example entries in /etc/sudoers
Cmnd_Alias SMARTCTL = /usr/sbin/smartctl --info *
zenoss ALL=(ALL) NOPASSWD: SMARTCTL
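As a rough illustration of the escalation detection mentioned in the requirements, the following Python sketch picks the first of the four commands found on PATH and prefixes a smartctl invocation with it. The function names, the candidate order, and the use of shutil.which are assumptions made for this example; the pack performs its detection on the remote host over SSH and may work differently.

```python
import shutil

# Candidate commands named in the requirements above; the order is an assumption.
ESCALATION_COMMANDS = ("dzdo", "doas", "pfexec", "sudo")


def detect_escalation(candidates=ESCALATION_COMMANDS, which=shutil.which):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name):
            return name
    return None


def build_smartctl_command(args, escalation=None):
    """Prefix a smartctl invocation with the escalation command, if any."""
    command = ["smartctl"] + list(args)
    return [escalation] + command if escalation else command
```

With sudo detected, build_smartctl_command(["--info", "/dev/sda"], "sudo") yields a command that the SMARTCTL alias in the sudoers example would permit.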
zSmartDiskMapMatch
- Regex of device names for the modeler to match. If unset, there is no filtering, and all discovered devices (see below) are modeled.
zSmartIgnoreModels
- Regex of model names for the modeler to ignore. For SCSI devices with separate Vendor and Product fields, this is compared against Vendor and Product joined with a space.
zSmartIgnoreUnsupported
- Skips modeling of devices reported to not support SMART. Defaults to True.
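The way these three properties could combine is sketched below in Python. This is only an approximation under stated assumptions: the device dictionary layout, the should_model name, and the use of re.search (rather than anchored matching) are hypothetical and not taken from the pack's source.

```python
import re


def should_model(device, match=None, ignore_models=None, ignore_unsupported=True):
    """Decide whether a discovered device would be modeled.

    `device` is a hypothetical dict with a 'name' key and optional
    'model', 'vendor', 'product', and 'smart_supported' keys.
    """
    # zSmartDiskMapMatch: when unset, every discovered device passes.
    if match and not re.search(match, device["name"]):
        return False

    # zSmartIgnoreModels: SCSI devices report Vendor and Product separately,
    # so they are joined with a space before comparison.
    model = device.get("model") or " ".join(
        part for part in (device.get("vendor"), device.get("product")) if part
    )
    if ignore_models and model and re.search(ignore_models, model):
        return False

    # zSmartIgnoreUnsupported: skip devices reported as lacking SMART support.
    if ignore_unsupported and not device.get("smart_supported", True):
        return False
    return True
```

For example, a SCSI device with vendor DELL and product PERC H710 would be skipped when zSmartIgnoreModels is set to ^DELL PERC.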
On systems other than macOS, SMART-supporting devices are discovered with smartctl --scan.
This pack will not attempt to enable SMART on any device using smartctl --smart=on or set any other parameter. Configuration of smartmontools is outside the scope of this pack and document.
Due to the device name format that smartctl --scan returns on macOS, devices are discovered using diskutil list instead, and results found not to support SMART are ignored.
NVMe devices may not be reported by smartctl --scan, so discovery by ls /dev/nvme* is also attempted.
Devices on HP Smart Arrays using the CCISS driver may not be reported by smartctl --scan. Those and other non-reporting devices may be added manually to zenoss_smart.txt in the home directory of the account used by Zenoss to access the target system. Entries should be one device per line, in the format of smartctl --scan output.
Example:
/dev/sda -d cciss,1
/dev/sdb -d cciss,2
/dev/sdc -d cciss,3
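For readers writing their own zenoss_smart.txt, this Python sketch shows one way lines in smartctl --scan format could be parsed into a device path and a device type. The function names are hypothetical, and the pack's own parsing may differ; the sketch only assumes the usual layout of a device path, an optional -d TYPE pair, and an optional trailing # comment.

```python
def parse_scan_line(line):
    """Parse one line in smartctl --scan format into (device, type)."""
    # Drop the trailing "# ..." comment that smartctl --scan appends.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    tokens = line.split()
    device, dev_type = tokens[0], None
    # The device type, when present, follows the -d option.
    if "-d" in tokens[1:]:
        index = tokens.index("-d", 1)
        if index + 1 < len(tokens):
            dev_type = tokens[index + 1]
    return device, dev_type


def parse_device_list(text):
    """Parse a whole file, skipping blank and comment-only lines."""
    return [entry for entry in map(parse_scan_line, text.splitlines()) if entry]
```

Each entry from the example file would come back as a pair such as ("/dev/sda", "cciss,1"), which maps directly onto smartctl's DEVICE and -d TYPE arguments.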
Percentages come from the normalized "Value" columns as reported by smartctl --attributes. Values in excess of 100 are scaled to 0-100.
If a normalized value is at or below a non-zero threshold as reported by smartctl --attributes, an event is generated. This is not based on thresholds in the performance template.
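The threshold rule can be sketched in Python as follows. The parser assumes the standard ATA attribute table printed by smartctl --attributes (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); the function names and the returned dictionary are hypothetical, and the scaling of values above 100 is left out because its exact method is not described here.

```python
def parse_attribute_line(line):
    """Parse one row of the ATA attribute table, or return None for other lines."""
    # RAW_VALUE may contain spaces, so split into at most ten fields.
    fields = line.split(None, 9)
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "threshold": int(fields[5]),
        "raw": fields[9].strip(),
    }


def is_at_or_below_threshold(attr):
    """True when the normalized value has reached a non-zero threshold."""
    return attr["threshold"] > 0 and attr["value"] <= attr["threshold"]
```

A threshold of 0 never triggers, which matches the "non-zero threshold" condition above.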
Some normalized health values may not be available on SCSI/SAS or NVMe devices, based on limitations of smartctl output.
The normalized value of whichever of the following attributes is found first:
- 9 - Power On Hours
- 193 - Load Cycle Count
- 225 - Load/Unload Cycle Count
- 12 - Power Cycle Count
Failing that, the percentage of accumulated load-unload cycles versus the specified lifetime, or of accumulated start-stop cycles versus the specified lifetime.
The normalized value for Raw_Read_Error_Rate (1).
The normalized value for Reallocated_Sector_Ct (5).
The normalized value of whichever of the following attributes is found first:
- 169 - Remaining Life Percentage
- 202 - Percent Lifetime Remain / Data Address Mark Errors
- 173 - Ave Block-Erase Count / Wear Leveling Count / Media Wearout Indicator
- 177 - Wear Leveling Count / Wear Range Delta
- 231 - SSD Life Left
Failing that, the Percentage Used Endurance Indicator value from Device Stats or Percentage Used from NVMe information is subtracted from 100.
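The selection order described above can be sketched in Python as below. It assumes "found first" means first in the listed order, and that attributes are available as a mapping of attribute ID to normalized value; both are assumptions for illustration, as are the function names. The same first-match pattern would apply to the earlier list beginning with attribute 9.

```python
# Attribute IDs in the order listed above.
SSD_LIFE_ATTRIBUTE_IDS = (169, 202, 173, 177, 231)


def first_normalized_value(attributes, ids):
    """Return the normalized value of the first listed ID that is present."""
    for attr_id in ids:
        if attr_id in attributes:
            return attributes[attr_id]
    return None


def ssd_life(attributes, percentage_used=None):
    """Remaining SSD life as a percentage, or None if nothing is reported.

    `percentage_used` stands in for the Percentage Used Endurance Indicator
    (Device Stats) or Percentage Used (NVMe) value.
    """
    value = first_normalized_value(attributes, SSD_LIFE_ATTRIBUTE_IDS)
    if value is not None:
        return value
    if percentage_used is not None:
        return 100 - percentage_used
    return None
```

An NVMe device reporting 3 percent used and no ATA attributes would therefore show 97 percent life remaining.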
Lowest normalized value reported that is NOT:
- Temperature
- 0 value with 0 threshold
- Used in the Rated Lifetime datapoint
- Used in the SSD Life datapoint
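A minimal Python sketch of this exclusion logic follows. It assumes each attribute is a dictionary with id, name, value, and threshold keys, and that temperature attributes can be recognized by name; the function name and the excluded_ids parameter (the IDs already consumed by the Rated Lifetime and SSD Life datapoints) are hypothetical.

```python
def lowest_health_value(attributes, excluded_ids=()):
    """Return the lowest normalized value after applying the exclusions above."""
    candidates = [
        attr["value"]
        for attr in attributes
        # Temperature attributes are not health indicators.
        if "temperature" not in attr["name"].lower()
        # A 0 value with a 0 threshold carries no information.
        and not (attr["value"] == 0 and attr["threshold"] == 0)
        # Skip attributes already used by other datapoints.
        and attr["id"] not in excluded_ids
    ]
    return min(candidates) if candidates else None
```

Passing the ID chosen for Rated Lifetime in excluded_ids keeps a low Power On Hours value, for instance, from also dragging down this datapoint.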
Current raw value counts for Reallocated_Sector_Ct (5) and Offline_Uncorrectable (198). Number of Reallocated Logical Sectors is taken from Device Stats if Reallocated_Sector_Ct is not present, or Elements in grown defect list if Device Stats is not available.
Kingston SSDs may reset the attribute 198 counter at power-on, and consider the attribute to be "Uncorrectable Sector Count".
An event will be generated if either of these values is > 0. Increasing reallocation counts are treated as unique events that will not be deduplicated.
Raw values for the online and offline reallocation counts above, as rates, along with the current raw value of Current_Pending_Sector (197), or Number of Realloc. Candidate Logical Sectors from Device Stats.
Total of Non-medium error count for SCSI devices, Media and Data Integrity Errors for NVMe devices, and Number of Reported Uncorrectable Errors & Number of Interface CRC Errors from Device Stats, as a rate.
Blocks sent to initiator & Blocks received from initiator for SCSI devices, Data Units Read & Data Units Written for NVMe devices, or Logical Sectors Read & Logical Sectors Written from Device Stats, as rates.
Activity Graph values multiplied by the device's logical block size.
Sum of Number of read and write commands whose size <= segment size & Number of read and write commands whose size > segment size for SCSI devices, sum of Host Read Commands & Host Write Commands for NVMe devices, or sum of Number of Read Commands & Number of Write Commands from Device Stats, as a rate.
Current raw value for Temperature_Celsius (194) or Airflow_Temperature_Cel (190). Current Temperature is taken from SCT Temperature or Device Stats if neither attribute is present. Attribute 231 may be used if the device is a hard disk, or 189 if it is identified as Temperature.
Total PHY events from SCSI or SATA PHY event logs as a rate.
I'm not going to make any assumptions about your device class organization, so it's up to you to configure the daviswr.cmd.SMART modeler on the appropriate class or device.
While this ZenPack tries to be as generic as possible, please keep in mind your storage device manufacturer may have chosen to use attributes in a proprietary way.