Additional monitoring - ECC errors? #1508

Closed
skloeckner opened this Issue Jan 4, 2017 · 14 comments

@skloeckner
skloeckner commented Jan 4, 2017 edited

I did some quick googling to see if it is possible to monitor ECC errors, as this seems like a no-brainer benefit for netdata. I haven't found any documentation or pull request adding this, and I believe it would be very valuable to sysadmins who monitor bare metal.

The only results I found were for the actual kernel module, EDAC:

http://bluesmoke.sourceforge.net/

It seems this was put upstream back in kernel 2.6. Is this still a thing? If so, how can netdata properly monitor it while maintaining its low memory and resource footprint?

Here are entries in the syslog that show EDAC finding and correcting errors (this actually crashes the system for some reason, but at least it is detectable in some way):

Jan  3 22:28:25 ceph-osd1 kernel: [  590.529178] mce: [Hardware Error]: Machine check events logged
Jan  3 22:28:26 ceph-osd1 kernel: [  591.115453] EDAC MC0: 2 CE read ECC error on CPU#0Channel#1_DIMM#1 (channel:1 slot:1 page:0x1d1b99 offset:0x500 grain:8 syndrome:0x391a5d80 - read error)
Jan  3 22:28:27 ceph-osd1 kernel: [  591.923321] mce: [Hardware Error]: Machine check events logged
Jan  3 22:28:27 ceph-osd1 kernel: [  592.115662] EDAC MC0: 1 CE read ECC error on CPU#0Channel#1_DIMM#1 (channel:1 slot:1 page:0x1d1b91 offset:0x80 grain:8 syndrome:0x5537ac01 - read error)
Jan  3 22:28:27 ceph-osd1 kernel: [  592.115673] EDAC MC0: 2 CE read ECC error on CPU#0Channel#1_DIMM#1 (channel:1 slot:1 page:0x1d1b91 offset:0x400 grain:8 syndrome:0xa4110a04 - read error)
Jan  3 22:28:49 ceph-osd1 kernel: [  614.702120] perf interrupt took too long (2543 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Jan  3 22:31:54 ceph-osd1 kernel: [  799.803043] mce_notify_irq: 1 callbacks suppressed
Jan  3 22:31:54 ceph-osd1 kernel: [  799.803054] mce: [Hardware Error]: Machine check events logged
Jan  3 22:31:55 ceph-osd1 kernel: [  800.140717] EDAC MC0: 1 CE read ECC error on CPU#0Channel#1_DIMM#1 (channel:1 slot:1 page:0x1d1b91 offset:0x80 grain:8 syndrome:0x5537ac01 - read error)
Jan  3 22:34:09 ceph-osd1 kernel: [  934.783137] mce: [Hardware Error]: Machine check events logged
Jan  3 22:34:10 ceph-osd1 kernel: [  935.154440] EDAC MC0: 1 CE read ECC error on CPU#0Channel#1_DIMM#1 (channel:1 slot:1 page:0x1d1b91 offset:0x80 grain:8 syndrome:0xa4110a04 - read error)
Jan  3 22:34:26 ceph-osd1 kernel: [  951.159105] EDAC MC0: 1 CE read ECC error on CPU#0Channel#1_DIMM#1 (channel:1 slot:1 page:0x1d1b99 offset:0x80 grain:8 syndrome:0x920da040 - read error)

More examples of errors on a system of mine:

[  205.065427] EDAC i7core: New Corrected error(s): dimm0: +0, dimm1: +1, dimm2 +0  
[95329.936255] EDAC i7core: New Corrected error(s): dimm0: +0, dimm1: +1, dimm2 +0     
[98258.202811] EDAC i7core: New Corrected error(s): dimm0: +0, dimm1: +3, dimm2 +0     
[144730.442605] EDAC i7core: New Corrected error(s): dimm0: +0, dimm1: +2, dimm2 +0  
[146364.870888] EDAC i7core: New Corrected error(s): dimm0: +0, dimm1: +1, dimm2 +0 

I'm sure there are a lot of sysadmins out there who have to look after old systems. This would also greatly benefit us on any future system, if this kernel module is still supported on newer hardware with ECC.

I wouldn't mind doing more research on this to see if there is anything in memory I could find that netdata can quickly query. I am at work and about to head home right now.

Quick thought: maybe a plugin that parses the output of dmesg and just reports on failures would suffice for now? (There is a rough sketch of that idea at the end of this comment.)

I am starting to seriously learn Python and would not mind writing the plugin, but it may take me a lot of trial and error.

Will report back if I find anything.

@ktsaou
Member
ktsaou commented Jan 4, 2017

There should be a chart under "Memory" with the hardware errors. I couldn't test it, so there is no alarm yet: https://github.com/firehol/netdata/blob/0e6bc626cb01f374f7e6b89cecb471bd2da003ad/src/proc_meminfo.c#L265-L279

If it works, I could add an alarm to it.

@skloeckner
skloeckner commented Jan 4, 2017 edited

Hmm, I did not see any alarms on the server for anything related to this. Part of the problem is that netdata does not appear to persist data through a crash, but I do see alarms right up until the server crashed and rebooted.

I'm not so sure what I'm looking at with this code, to be honest. What metric is this code looking at, and how do I enable the chart so I can verify what it is doing?

@ktsaou
Member
ktsaou commented Jan 4, 2017

ok, could you please post your /proc/meminfo?

@skloeckner
skloeckner commented Jan 5, 2017 edited

This is with no errors and with the bad RAM taken out. My problem is that I can only cause ECC errors with the bad stick installed; however, the machine reboots to avoid any corruption, so I cannot get live output from /proc/meminfo while the ECC errors are happening.

Regardless, here it is currently:

MemTotal:        8166200 kB
MemFree:          149356 kB
MemAvailable:    5612040 kB
Buffers:              92 kB
Cached:          5621712 kB
SwapCached:           96 kB
Active:          4212604 kB
Inactive:        3548160 kB
Active(anon):    1406916 kB
Inactive(anon):   732512 kB
Active(file):    2805688 kB
Inactive(file):  2815648 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       3906556 kB
SwapFree:        3886324 kB
Dirty:            149012 kB
Writeback:             0 kB
AnonPages:       2138800 kB
Mapped:            26032 kB
Shmem:               548 kB
Slab:             163280 kB
SReclaimable:     125704 kB
SUnreclaim:        37576 kB
KernelStack:       39552 kB
PageTables:        12912 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     7989656 kB
Committed_AS:    8041924 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1091584 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       58944 kB
DirectMap2M:     8321024 kB

@skloeckner
skloeckner commented Jan 5, 2017 edited

I believe /sys/devices/system/edac/mc/mc0/ce_count is what maintains the count of ECC errors, at least on my Supermicro.

This path does not exist on my laptop, which obviously is not using ECC, but I can confirm that all of the rest of my Supermicro servers have this location present.

ansible-all "ls /sys/devices/system/edac/mc/mc0/" osds
ceph-osd1 | SUCCESS | rc=0 >>
all_channel_counts  inject_addrmatch  max_location	   size_mb
ce_count	    inject_eccmask    mc_name		   subsystem
ce_noinfo_count     inject_enable     power		   ue_count
dimm0		    inject_section    reset_counters	   ue_noinfo_count
dimm1		    inject_type       seconds_since_reset  uevent
Shared connection to ceph-osd1 closed.


ceph-osd0 | SUCCESS | rc=0 >>
all_channel_counts  dimm4	      max_location	   subsystem
ce_count	    inject_addrmatch  mc_name		   ue_count
ce_noinfo_count     inject_eccmask    power		   ue_noinfo_count
dimm0		    inject_enable     reset_counters	   uevent
dimm1		    inject_section    seconds_since_reset
dimm3		    inject_type       size_mb
Shared connection to ceph-osd0 closed.


ceph-osd4 | SUCCESS | rc=0 >>
all_channel_counts  dimm4	      max_location	   subsystem
ce_count	    inject_addrmatch  mc_name		   ue_count
ce_noinfo_count     inject_eccmask    power		   ue_noinfo_count
dimm0		    inject_enable     reset_counters	   uevent
dimm1		    inject_section    seconds_since_reset
dimm3		    inject_type       size_mb
Shared connection to ceph-osd4 closed.


ceph-osd2 | SUCCESS | rc=0 >>
all_channel_counts  dimm4	      max_location	   subsystem
ce_count	    inject_addrmatch  mc_name		   ue_count
ce_noinfo_count     inject_eccmask    power		   ue_noinfo_count
dimm0		    inject_enable     reset_counters	   uevent
dimm1		    inject_section    seconds_since_reset
dimm3		    inject_type       size_mb
Shared connection to ceph-osd2 closed.


ceph-osd6 | SUCCESS | rc=0 >>
ce_count	 dimm4	       power		    subsystem
ce_noinfo_count  dimm8	       reset_counters	    ue_count
dimm0		 max_location  seconds_since_reset  ue_noinfo_count
dimm12		 mc_name       size_mb		    uevent
Shared connection to ceph-osd6 closed.

I am not entirely sure whether that translates to the "HardwareCorrupted" field of /proc/meminfo.

Read more here:

http://www.admin-magazine.com/HPC/Articles/Memory-Errors

I'll see if I can re-install that bad DIMM, catch the errors as they happen, and hopefully capture output from both /proc/meminfo and the files above.

Will report back when I get a chance to test.

@ktsaou
Member
ktsaou commented Jan 5, 2017 edited

Well, there is already a counter in the output you posted:

HardwareCorrupted:     0 kB

This one is parsed by netdata, and if it is non-zero a new chart will appear on the netdata dashboard under the memory section, although there is currently no alarm for it. I could add one.

I see the directory /sys/devices/system/edac/mc provides:

  • ce_count - correctable errors
  • ue_count - uncorrectable errors

I think these could be very nice statistics to add to netdata, so I marked this thread as an enhancement.

@skloeckner
skloeckner commented Jan 8, 2017 edited

Testing now...

Already see the ce_count going up:

root@ceph-osd1:~# cat /sys/devices/system/edac/mc/mc0/ce_count 
112
root@ceph-osd1:~# cat /sys/devices/system/edac/mc/mc0/ue_count 
0

No HardwareCorrupted results yet.

root@ceph-osd1:~# cat /proc/meminfo 
MemTotal:       16423704 kB
MemFree:          177964 kB
MemAvailable:   13884204 kB
Buffers:            3512 kB
Cached:         13830524 kB
SwapCached:            0 kB
Active:          5797804 kB
Inactive:       10082268 kB
Active(anon):    1444516 kB
Inactive(anon):   603036 kB
Active(file):    4353288 kB
Inactive(file):  9479232 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       3906556 kB
SwapFree:        3906556 kB
Dirty:             80704 kB
Writeback:             0 kB
AnonPages:       2046792 kB
Mapped:            46076 kB
Shmem:               760 kB
Slab:             256820 kB
SReclaimable:     212044 kB
SUnreclaim:        44776 kB
KernelStack:       39440 kB
PageTables:        13328 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12118408 kB
Committed_AS:    7770528 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1028096 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       50752 kB
DirectMap2M:    16717824 kB

@ktsaou
Member
ktsaou commented Jan 8, 2017

ok. I think HardwareCorrupted is related to ue_count.

Could you please run the following and post the output:

find /sys/devices/system/edac/mc/ -type f -a -name "ce_count" -o -name "ue_count" | while read f; do echo "${f}"; cat "${f}"; done

mine on server A gives:

/sys/devices/system/edac/mc/mc0/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc0/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0

and on server B gives:

/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0
/sys/devices/system/edac/mc/mc0/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc0/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc0/csrow1/ue_count
0
/sys/devices/system/edac/mc/mc0/csrow1/ce_count
0
/sys/devices/system/edac/mc/mc0/csrow2/ue_count
0
/sys/devices/system/edac/mc/mc0/csrow2/ce_count
0
/sys/devices/system/edac/mc/mc0/csrow3/ue_count
0
/sys/devices/system/edac/mc/mc0/csrow3/ce_count
0

I think we only need the ce_count and ue_count found at /sys/devices/system/edac/mc/mcX/ (where X is a number), not the .../mcX/csrowY/... ones. The latter are used to locate the failing DIMM, which I think is not so important for netdata as long as the parent has the right counters.

@ktsaou
Member
ktsaou commented Jan 8, 2017

@candiao we are trying to find out what netdata should check to detect hardware memory errors. I recall you have a few nice and unusual servers with 1 TB of RAM each. Could you please run the command given above and post the output here? It would be useful for you too, if this works properly.

@candiao
candiao commented Jan 9, 2017 edited

Hello,

uname -a
3.10.0-514.2.2.el7.x86_64 #1 SMP Wed Nov 16 13:15:13 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

cat /sys/devices/system/edac/mc/mc0/ce_count
0
cat /sys/devices/system/edac/mc/mc0/ue_count
0

find /sys/devices/system/edac/mc/ -type f -a -name "ce_count" -o -name "ue_count" | while read f; do echo "${f}"; cat "${f}"; done

/sys/devices/system/edac/mc/mc0/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc0/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0
/sys/devices/system/edac/mc/mc1/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc1/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc1/ue_count
0
/sys/devices/system/edac/mc/mc1/ce_count
0
/sys/devices/system/edac/mc/mc2/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc2/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc2/ue_count
0
/sys/devices/system/edac/mc/mc2/ce_count
0
/sys/devices/system/edac/mc/mc3/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc3/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc3/ue_count
0
/sys/devices/system/edac/mc/mc3/ce_count
0
/sys/devices/system/edac/mc/mc4/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc4/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc4/ue_count
0
/sys/devices/system/edac/mc/mc4/ce_count
0
/sys/devices/system/edac/mc/mc5/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc5/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc5/ue_count
0
/sys/devices/system/edac/mc/mc5/ce_count
0
/sys/devices/system/edac/mc/mc6/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc6/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc6/ue_count
0
/sys/devices/system/edac/mc/mc6/ce_count
0
/sys/devices/system/edac/mc/mc7/csrow0/ue_count
0
/sys/devices/system/edac/mc/mc7/csrow0/ce_count
0
/sys/devices/system/edac/mc/mc7/ue_count
0
/sys/devices/system/edac/mc/mc7/ce_count
0

@skloeckner
skloeckner commented Jan 11, 2017 edited

I would say alarms for ce_count would be good. It would indicate that one of your RAM sticks is failing and might need to be tested. Then again, I suppose it would be a fine line to walk deciding what level of ce_count is acceptable (one way around that might be to alarm on increases rather than on the absolute count; there is a rough sketch of that idea after the output below). In my case, it seems to crash the server.

I was not able to replicate the server crash to trigger any other results from ue_count or /proc/meminfo. I will test once again to see what I end up with, so that it gives us better insight into what to alarm on. I plan on removing the other good memory sticks so that hopefully the bad stick will fill up faster and crash the server.

Here's the output across 6 servers (from the command above):

ceph-osd0 | SUCCESS | rc=0 >>
/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0
Shared connection to ceph-osd0 closed.


ceph-osd1 | SUCCESS | rc=0 >>
/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0
Shared connection to ceph-osd1 closed.


ceph-osd2 | SUCCESS | rc=0 >>
/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0
Shared connection to ceph-osd2 closed.


ceph-osd5 | SUCCESS | rc=0 >>
/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0
Shared connection to ceph-osd5 closed.


ceph-osd4 | SUCCESS | rc=0 >>
/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0
Shared connection to ceph-osd4 closed.


ceph-osd6 | SUCCESS | rc=0 >>
/sys/devices/system/edac/mc/mc0/ue_count
0
/sys/devices/system/edac/mc/mc0/ce_count
0
Shared connection to ceph-osd6 closed.
@ktsaou ktsaou added a commit to ktsaou/netdata that referenced this issue Jan 11, 2017
@ktsaou ktsaou detect ECC memory correctable and uncorrectable errors; fixes #1508 2ecf423
@ktsaou
Member
ktsaou commented Jan 11, 2017

I have implemented this in #1548

@ktsaou ktsaou added the fixed label Jan 11, 2017
@ktsaou
Member
ktsaou commented Jan 11, 2017

@skloeckner I would appreciate a test on your faulty memory modules.

@ktsaou ktsaou closed this in #1548 Jan 11, 2017
@ktsaou
Member
ktsaou commented Jan 11, 2017

Merged it too! If you find an issue, please post here and I'll re-open it.
