Linux kernel machine check handling middleware
Latest commit e6386a0 Sep 6, 2016 Andi Kleen Fix memory leak in sysfs reader for bad fields handling
Saves about 1.8k memory on my system.

Found by gcc -fsanitize=leak

Signed-off-by: Andi Kleen <ak@linux.intel.com>
input Fix GENMEM script to set EN and S bits May 10, 2016
tests Automatically try to load mce-inject driver in the testsuite Mar 12, 2015
triggers Set x bit on unknown error trigger Mar 22, 2015
.gitignore mcelog: Add version.c and version.tmp to gitignore May 16, 2016
CHANGES Mark CHANGES file obsolete Mar 17, 2015
Makefile Remove empty xeon-75xx files Aug 24, 2016
README.md format README -> README.md Sep 2, 2016
README.releases document new release scheme Oct 19, 2013
TODO Initial import of mcelog-0.8pre + some old patches Sep 4, 2008
bitfield.c Use %ll instead of %L in *printf to support musl libc. Feb 4, 2016
bitfield.h More fixes for %ll instead of %L in *printf to support musl libc. Feb 9, 2016
broadwell_de.c Add support to decode MSCOD values for Broadwell-{de,ep,ex} Jan 4, 2016
broadwell_de.h Add support to decode MSCOD values for Broadwell-{de,ep,ex} Jan 4, 2016
broadwell_epex.c Fix spelling errors. Mar 27, 2016
broadwell_epex.h Add support to decode MSCOD values for Broadwell-{de,ep,ex} Jan 4, 2016
bus.c Fix memory leaks in triggers Sep 28, 2015
bus.h Add a trigger for IO-MCA errors Sep 17, 2014
cache.c Fix parsing of sysfs CPU cache description Sep 28, 2015
cache.h Initial yellow bit support Nov 26, 2009
client.c mcelog, remove socket file on SIGINT Aug 15, 2016
client.h mcelog, remove socket file on SIGINT Aug 15, 2016
config-intro.man Add mcelog.conf.5 man page Mar 12, 2015
config.c Merge branch 'master' of git://git.kernel.org/pub/scm/utils/cpu/mce/m… Mar 1, 2010
config.h Move credentials setup into config.c Nov 30, 2009
core2.c Fix spelling errors. Mar 27, 2016
core2.h Update core2/P6old support Sep 10, 2008
dimm.c Use %ll instead of %L in *printf to support musl libc. Feb 4, 2016
dimm.h Use 64bit types for memory addresses Aug 25, 2015
dmi.c Read the DMI entries we need from /sys/firmware/dmi/entries if availa… Mar 21, 2016
dmi.h Mark dmi_entry as packed Sep 6, 2016
dunnington.c Fix incorrect dunnington specific decoding Sep 7, 2009
dunnington.h Add Dunnington support Sep 16, 2008
eventloop.c Handle old glibc without ppoll Dec 27, 2009
eventloop.h Wait for children in non daemon mode Nov 26, 2009
genconfig.py Add a missing escape in the mcelog.conf.5 generator Aug 25, 2015
haswell.c Fix spelling errors. Mar 27, 2016
haswell.h Add better decoding support for Haswell server processors Sep 8, 2014
intel.c Add Kabylake client support Sep 2, 2016
intel.h Add Kabylake client support Sep 2, 2016
ivy-bridge.c More compact data structures for reporting SNB/IVB memory controller … Sep 8, 2014
ivy-bridge.h Add Ivy Bridge support to mcelog Jan 18, 2013
k8.c Fix spelling errors. Mar 27, 2016
k8.h Enable -Wextra and some more warnings and clean them in code May 23, 2009
leaky-bucket.c Fix potential division by zero for unknown time unit Aug 24, 2016
leaky-bucket.h Fix error count during the last agetime Mar 31, 2015
list.h Initial memdb support Nov 26, 2009
lk10-mcelog.pdf Add LK10 mcelog paper Oct 17, 2010
mce.pdf Initial import of mcelog-0.8pre + some old patches Sep 4, 2008
mcelog.8 Add --is-cpu-supported command line option Sep 14, 2015
mcelog.c Add Kabylake client support Sep 2, 2016
mcelog.conf Fix spelling errors. Mar 27, 2016
mcelog.conf.5 Fix spelling errors. Mar 27, 2016
mcelog.cron Initial import of mcelog-0.8pre + some old patches Sep 4, 2008
mcelog.h Add Kabylake client support Sep 2, 2016
mcelog.init Write pidfile by default in daemon mode Feb 26, 2010
mcelog.logrotate Reopen log files on SIGUSR1 in daemon mode Feb 26, 2010
mcelog.service mcelog, fix systemd service stop Aug 9, 2016
mcelog.triggers.5 Add man page for mcelog trigger input arguments Mar 22, 2015
memdb.c Add support for DMI strings in AMI BIOS Jul 1, 2015
memdb.h Avoid filling prefilling memory data base when DMI is disabled Jun 9, 2015
memutil.c Check for out of memory in asprintf Feb 27, 2010
memutil.h Add xalloc_nonzero() to memutil May 24, 2009
msg.c Reopen log files on SIGUSR1 in daemon mode Feb 26, 2010
msg.h Reopen log files on SIGUSR1 in daemon mode Feb 26, 2010
msr.c mcelog: Fix file descriptor leak in domsr() May 16, 2016
nehalem.c Remove empty xeon-75xx files Aug 24, 2016
nehalem.h Fix channel enumeration on Knights Landing Sep 9, 2015
p4.c Add support to decode MSCOD values for Skylake server Apr 15, 2016
p4.h Pass socketid to cache error trigger Nov 26, 2009
page.c Add method to lookup whether a DIMM exists in memdb Sep 19, 2012
page.h Fix CMCI overflow count handling Jan 21, 2010
paths.h Write pidfile by default in daemon mode Feb 26, 2010
rbtree.c Initial page predictive failure analysis support Nov 26, 2009
rbtree.h Initial page predictive failure analysis support Nov 26, 2009
sandy-bridge.c More compact data structures for reporting SNB/IVB memory controller … Sep 8, 2014
sandy-bridge.h Add support for Sandy Bridge extended error logging Sep 19, 2012
server.c fix: server does not start because it assumed it is already running Aug 22, 2015
server.h Initial memdb support Nov 26, 2009
skylake_xeon.c Add support to decode MSCOD values for Skylake server Apr 15, 2016
skylake_xeon.h Add support to decode MSCOD values for Skylake server Apr 15, 2016
sysfs.c Fix memory leak in sysfs reader for bad fields handling Sep 6, 2016
sysfs.h Move sysfs write functions from page.c to sysfs.c Nov 28, 2009
trigger.c trigger: Avoid warning from earlier merge Mar 21, 2016
trigger.h mcelog: Abstract forking triggers for reuse Sep 9, 2010
tsc.c Use %ll instead of %L in *printf to support musl libc. Feb 4, 2016
tsc.h Enable -Wextra and some more warnings and clean them in code May 23, 2009
tulsa.c Add Tulsa support for Cache Bus controller and Bus and Interconnect E… Sep 7, 2009
tulsa.h Add Intel Xeon 71xx (Tulsa) MCA decoding support May 5, 2009
unknown.c Fix memory leaks in triggers Sep 28, 2015
unknown.h Add a trigger for unknown machine check errors Sep 20, 2014
version.h Output git tag for mcelog --version Sep 28, 2015
yellow.c Fix incorrect asprintf in yellow.c Sep 17, 2014
yellow.h Pass socketid to cache error trigger Nov 26, 2009

README.md

mcelog

mcelog is the user space backend for logging machine check errors reported by the hardware to the kernel. The kernel takes the immediate actions (such as killing processes), while mcelog decodes the errors and manages more advanced error responses such as offlining memory or CPUs, or triggering events. mcelog also handles corrected errors by logging and accounting them. It primarily handles machine checks and thermal events, which are reported for errors detected by the CPU.

For more details on what mcelog can do and the underlying theory see mcelog.org.

It is recommended that mcelog runs on all x86 machines, both 64-bit (since early 2.6 kernels) and 32-bit (since kernel 2.6.32).

mcelog can run in several modes:

  • cronjob
  • trigger
  • daemon

cronjob is the old method. mcelog runs every 5 minutes from cron and checks for errors. The disadvantage is that it can delay error reporting significantly (up to 10 minutes) and does not allow mcelog to keep extended state.

trigger is a newer method where the kernel runs mcelog on an error.

This is configured with:

echo /usr/sbin/mcelog > /sys/devices/system/machinecheck/machinecheck0/trigger

This is faster, but still doesn't allow mcelog to keep state, and has relatively high overhead for each error because a program has to be initialized from scratch.
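For a program that sets the trigger itself rather than going through a shell, the echo command above can be approximated in C. This is a minimal sketch, not code from mcelog: the function name set_mce_trigger is hypothetical, and it assumes the caller has permission to write the sysfs file (normally root).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "<program>\n" to a trigger file, roughly what
 * `echo program > trigger_file` does in a shell.
 * Returns 0 on success, -1 on any error.
 */
int set_mce_trigger(const char *trigger_file, const char *program)
{
	char line[4096];
	int len, fd;
	ssize_t written;

	assert(trigger_file != NULL && program != NULL);

	/* Build the line to write; reject paths that do not fit the buffer. */
	len = snprintf(line, sizeof line, "%s\n", program);
	if (len < 0 || (size_t)len >= sizeof line)
		return -1;

	/* Same open flags a shell uses for the `>` redirection. */
	fd = open(trigger_file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	written = write(fd, line, (size_t)len);
	if (close(fd) < 0 || written != (ssize_t)len)
		return -1;
	return 0;
}
```

A caller would pass the sysfs path shown above as the first argument and /usr/sbin/mcelog as the second, then check the return value.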

In daemon mode mcelog runs continuously as a daemon in the background and waits for errors. It is enabled by running mcelog --daemon & from an init script. This is the fastest and most full-featured mode.

The recommended mode is daemon, because several new functions (like page error predictive failure analysis) require a continuously running daemon.

Documentation

  • The primary reference documentation is the man pages.
  • lk10-mcelog.pdf has an overview of the errors mcelog handles (originally from Linux Kongress 2010).
  • mce.pdf is a very old paper describing the first releases of mcelog (some parts are obsolete).

For distributors

You can run mcelog from systemd or similar daemons. An example systemd unit file is in mcelog.service.

For older distributions using init scripts

Please install an init script by default that runs mcelog in daemon mode. The mcelog.init script is a good starting point. Also install a logrotate file (mcelog.logrotate) or equivalent when mcelog is running in daemon mode. Neither of these is installed by make install.

The installation also requires a config file /etc/mcelog.conf and the default triggers. These are all installed by make install.

/dev/mcelog is needed for mcelog operation. If it's not there it can be created with:

mknod /dev/mcelog c 10 227

Normally it should be created automatically in udev.

Security

mcelog needs to run as root because it might trigger actions like page-offlining, which require CAP_SYS_ADMIN. It also opens /dev/mcelog and a UNIX socket for client support.

It also opens /dev/mem to parse the BIOS DMI tables. It is careful to close the file descriptor and unmap any mappings after using them.

There is support for changing the user in daemon mode after opening the device and the sockets, but that would stop triggers from taking corrective actions that require root.

In principle it would be possible to keep only CAP_SYS_ADMIN for page-offlining, but that would prevent triggers from doing root-only actions not covered by it (and CAP_SYS_ADMIN is not that different from full root).

In daemon mode mcelog listens on a UNIX socket and processes requests from mcelog --client. This can be disabled in the configuration file. The uid/gid of the requestor is checked on access and is configurable (default 0/0 only). The command parsing code is very straightforward (server.c). The client parsing/reply is currently done with the full privileges of the daemon.
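On Linux, a common way to check the uid/gid of a UNIX socket peer is the SO_PEERCRED socket option. The sketch below shows that general technique. It is not taken from server.c: the function name peer_allowed is hypothetical, and requiring both uid and gid to match is an assumption, so the actual policy in mcelog may differ.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* needed for struct ucred with glibc */
#endif
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: compare the credentials of the process on the other
 * end of a connected UNIX socket against an allowed uid/gid pair.
 * Returns  1 if both uid and gid match (assumed policy),
 *          0 if they do not,
 *         -1 if the credentials cannot be read.
 */
int peer_allowed(int sock, uid_t allowed_uid, gid_t allowed_gid)
{
	struct ucred cred;
	socklen_t len = sizeof cred;

	assert(sock >= 0);

	if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
		return -1;
	return cred.uid == allowed_uid && cred.gid == allowed_gid;
}
```

With the default of 0/0 described above, a server would call peer_allowed(fd, 0, 0) on each accepted connection and drop the connection unless the result is 1.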

Testing

There is a simple test suite in tests/. The test suite must be run as root and requires access to mce-inject and a kernel with MCE injection support (CONFIG_X86_MCE_INJECT). It will kill any running mcelog daemon.

Run it with make test.

The test suite requires the mce-inject tool. The mce-inject executable must be either in $PATH or in the ../mce-inject directory.

You can also test under valgrind with make valgrind-test. This requires valgrind to be installed. Advanced valgrind options can be specified with:

make VALGRIND="valgrind --option" valgrind-test

Other checks

make iccverify and make clangverify run the static verifiers in icc and clang respectively.

License

This program is licensed under the terms of the GNU General Public License, version 2.