detect ECC memory reported errors #1548

Merged
merged 4 commits into from Jan 11, 2017

Member
ktsaou commented Jan 11, 2017 edited

This PR adds 2 more charts:

  1. correctable ECC errors
  2. uncorrectable ECC errors

Both charts show the number of errors/s for each memory controller.
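As a rough illustration of where such per-controller numbers can come from, the sketch below reads the cumulative counters that the Linux EDAC driver exposes under `/sys/devices/system/edac/mc/mcN/` (`ce_count` and `ue_count`) and turns two snapshots into errors/s. This is not the plugin's code, which is written in C; the function names `read_edac_counts` and `rates` are made up for this example.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location (same directory the plugin monitors).
EDAC_DIR = Path("/sys/devices/system/edac/mc")


def read_edac_counts(base=EDAC_DIR):
    """Return {controller: (correctable, uncorrectable)} cumulative counts."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Controller without readable counters: skip it.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def rates(prev, curr, seconds):
    """Convert two cumulative snapshots into errors/s per controller."""
    return {
        name: tuple((c - p) / seconds for c, p in zip(curr[name], prev[name]))
        for name in curr
        if name in prev
    }
```

On a machine without ECC memory or without an EDAC driver loaded, `read_edac_counts()` simply returns an empty dictionary.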

It also adds 3 new alarms:

  1. 1hour ecc memory correctable
  2. 1hour ecc memory uncorrectable
  3. 1hour memory hw corrupted (this uses /proc/meminfo)

The charts and the alarms have been configured not to be added if there are no errors at all. They will be added when the first error occurs (in which case, alarms will be dispatched too).
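The on-demand behaviour can be sketched as follows. The example parses the `HardwareCorrupted` field of `/proc/meminfo` (a real kernel field, reported in kB) and applies a simplified `auto` rule. The helper names `hardware_corrupted_kb` and `chart_enabled`, and the exact rule, are assumptions made for illustration, not the plugin's implementation.

```python
def hardware_corrupted_kb(meminfo_text):
    """Return the HardwareCorrupted value in kB, or None if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("HardwareCorrupted:"):
            # Line looks like: "HardwareCorrupted:     0 kB"
            return int(line.split()[1])
    return None


def chart_enabled(setting, value, seen_before=False):
    """Hypothetical on-demand rule: 'yes' always shows the chart,
    'auto' shows it once a non-zero value has been observed."""
    if setting == "yes":
        return True
    if setting == "auto":
        return seen_before or value > 0
    return False


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        kb = hardware_corrupted_kb(f.read())
    if kb is None:
        print("HardwareCorrupted is not reported on this system")
    else:
        print("show chart:", chart_enabled("auto", kb))
```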

To check if your system has this capability, fetch http://your.netdata.ip:19999/netdata.conf and search for this section:

[plugin:proc:/sys/devices/system/edac/mc]
        # directory to monitor = /sys/devices/system/edac/mc
        # enable ECC memory correctable errors = auto
        # enable ECC memory uncorrectable errors = auto

If the last 2 lines exist, your system has this capability. You can also set them to yes (instead of auto) to view the (empty) charts in the memory section of the dashboard and verify that there are alarms for them. (If the charts are not empty, they will show up by themselves.)
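The same check can be scripted. The sketch below scans the text of `netdata.conf` for the section above and reports the two ECC settings, whether or not the lines are commented out. The function names are hypothetical and the parsing is deliberately simple; it is not a full parser for the netdata configuration format.

```python
SECTION = "[plugin:proc:/sys/devices/system/edac/mc]"
KEYS = (
    "enable ECC memory correctable errors",
    "enable ECC memory uncorrectable errors",
)


def ecc_settings(conf_text):
    """Return {key: value} for the ECC options found in the EDAC section."""
    found = {}
    in_section = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new section header starts; track whether it is the EDAC one.
            in_section = line == SECTION
            continue
        if not in_section or "=" not in line:
            continue
        # Defaults are listed as comments, so drop any leading "# ".
        key, _, value = line.lstrip("# ").partition("=")
        if key.strip() in KEYS:
            found[key.strip()] = value.strip()
    return found


def has_ecc_capability(conf_text):
    """True when both ECC options are present in the configuration."""
    return len(ecc_settings(conf_text)) == len(KEYS)
```

To use it against a live agent, download the configuration first, for example with `urllib.request.urlopen("http://your.netdata.ip:19999/netdata.conf").read().decode()`, and pass the resulting text to `has_ecc_capability`.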

The charts:

image

The alarms:

image

image

fixes #1508

ktsaou added some commits Jan 11, 2017
@ktsaou ktsaou detect ECC memory correctable and uncorrectable errors; fixes #1508 2ecf423
@ktsaou ktsaou added health.d/memory.conf to make files 8e31df7
@ktsaou ktsaou fix ondemand switch for hwcorrupted memory ca3edf9
@ktsaou ktsaou move hwcorrupted chart to ecc family 77cf788
@ktsaou ktsaou merged commit 074dbd1 into firehol:master Jan 11, 2017

2 checks passed

codeclimate no new or fixed issues
continuous-integration/travis-ci/pr The Travis CI build passed