Linux kernel machine check handling middleware

Makefile: Allow defining MANDIR

Signed-off-by: Andi Kleen <ak@linux.intel.com>
latest commit 065adc4a63
@rworkman rworkman authored Andi Kleen committed
input Add a test case for iomca and unknown triggers
tests Automatically try to load mce-inject driver in the testsuite
triggers Set x bit on unknown error trigger
.gitignore Add more files to .gitignore
CHANGES Mark CHANGES file obsolete
Makefile Makefile: Allow defining MANDIR
README Add example systemd unit file
README.releases document new release scheme
TODO Initial import of mcelog-0.8pre + some old patches
TODO-diskdb Hook up diskdb memory error for intel
bitfield.c Don't drop MSB in number fields
bitfield.h Move test_prefix() definition from p4.c to bitfield.h
bus.c Add a trigger for IO-MCA errors
bus.h Add a trigger for IO-MCA errors
cache.c Fix cache map parsing
cache.h Initial yellow bit support
client.c mcelog: fix 'mcelog --client' blocked problem
client.h Initial client support
config-intro.man Add mcelog.conf.5 man page
config.c Merge branch 'master' of git://git.kernel.org/pub/scm/utils/cpu/mce/m…
config.h Move credentials setup into config.c
core2.c Update core2/P6old support
core2.h Update core2/P6old support
db.c Add notes that old diskdb files are obsolete
db.h Initial import of mcelog-0.8pre + some old patches
dbquery.c Use memutil functions everywhere
dimm.c Add notes that old diskdb files are obsolete
dimm.h Initial import of mcelog-0.8pre + some old patches
diskdb.c Add notes that old diskdb files are obsolete
diskdb.h Disable on disk DIMM database by default
dmi.c Avoid segfault when system has no SMBIOS entry point.
dmi.h Close /dev/mem file descriptor when not needed anymore
dunnington.c Fix incorrect dunnington specific decoding
dunnington.h Add Dunnington support
eventloop.c Handle old glibc without ppoll
eventloop.h Wait for children in non daemon mode
genconfig.py Add man page for mcelog trigger input arguments
haswell.c Fix a comment
haswell.h Add better decoding support for Haswell server processors
intel.c Add model number for Broadwell-DE
intel.h Support Broadwell-U
ivy-bridge.c More compact data structures for reporting SNB/IVB memory controller …
ivy-bridge.h Add Ivy Bridge support to mcelog
k8.c Add support for calling icc static verifier
k8.h Enable -Wextra and some more warnings and clean them in code
leaky-bucket.c Fix error count during the last agetime
leaky-bucket.h Fix error count during the last agetime
list.h Initial memdb support
lk10-mcelog.pdf Add LK10 mcelog paper
mce.pdf Initial import of mcelog-0.8pre + some old patches
mcelog.8 Add man page for mcelog trigger input arguments
mcelog.c Add all current Atom cpuids
mcelog.conf Some editorial editing of the comments in mcelog.conf
mcelog.conf.5 Add mcelog.conf.5 man page
mcelog.cron Initial import of mcelog-0.8pre + some old patches
mcelog.h Add all current Atom cpuids
mcelog.init Write pidfile by default in daemon mode
mcelog.logrotate Reopen log files on SIGUSR1 in daemon mode
mcelog.service Add example systemd unit file
mcelog.triggers.5 Add man page for mcelog trigger input arguments
memdb.c Fix error count during the last agetime
memdb.h Add method to lookup whether a DIMM exists in memdb
memutil.c Check for out of memory in asprintf
memutil.h Add xalloc_nonzero() to memutil
msg.c Reopen log files on SIGUSR1 in daemon mode
msg.h Reopen log files on SIGUSR1 in daemon mode
msr.c Return by default if no CPU matches in set_imc_log
nehalem.c Add Xeon75xx support
nehalem.h Add Xeon75xx support
p4.c Add a trigger for unknown machine check errors
p4.h Pass socketid to cache error trigger
page.c Add method to lookup whether a DIMM exists in memdb
page.h Fix CMCI overflow count handling
paths.h Write pidfile by default in daemon mode
rbtree.c Initial page predictive failure analysis support
rbtree.h Initial page predictive failure analysis support
sandy-bridge.c More compact data structures for reporting SNB/IVB memory controller …
sandy-bridge.h Add support for Sandy Bridge extended error logging
server.c Lower size of the ctl buffer to avoid potential DOS
server.h Initial memdb support
sysfs.c Fix fstat warning
sysfs.h Move sysfs write functions from page.c to sysfs.c
trigger.c mcelog: Abstract forking triggers for reuse
trigger.h mcelog: Abstract forking triggers for reuse
tsc.c Enable -Wextra and some more warnings and clean them in code
tsc.h Enable -Wextra and some more warnings and clean them in code
tulsa.c Add Tulsa support for Cache Bus controller and Bus and Interconnect E…
tulsa.h Add Intel Xeon 71xx (Tulsa) MCA decoding support
unknown.c Add a trigger for unknown machine check errors
unknown.h Add a trigger for unknown machine check errors
version.h Upgrade version number
xeon75xx.c Remove old xeon75xx aux format support
xeon75xx.h Add Xeon75xx support
yellow.c Fix incorrect asprintf in yellow.c
yellow.h Pass socketid to cache error trigger

README

mcelog is the user space backend for logging machine check errors
reported by the hardware to the kernel. The kernel does the immediate
actions (like killing processes etc.) and mcelog decodes the errors
and manages various other advanced error responses like
offlining memory, CPUs or triggering events. In addition,
mcelog handles corrected errors by logging and accounting for them.

It primarily handles machine checks and thermal events, which
are reported for errors detected by the CPU.

For more details on what mcelog can do and the underlying theory
see http://www.mcelog.org

It is recommended that mcelog runs on all x86 machines, both
64-bit (since early 2.6 kernels) and 32-bit (since 2.6.32).

mcelog can run in several modes: cronjob, trigger, daemon.

cronjob is the old method. mcelog runs every 5 minutes from cron and checks
for errors. The disadvantage is that it can delay error reporting
significantly (up to 10 minutes) and does not allow mcelog to keep extended state.

trigger is a newer method where the kernel runs mcelog on an error.
This is configured with
echo /usr/sbin/mcelog > /sys/devices/system/machinecheck/machinecheck0/trigger
This is faster, but still doesn't allow mcelog to keep state,
and has relatively high overhead for each error because a program has
to be initialized from scratch.
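
As a minimal sketch, the trigger setup above could be wrapped in a small
check so that a missing or read-only sysfs file is reported instead of
failing silently. The function name set_mce_trigger and its two arguments
are hypothetical and not part of mcelog; only the sysfs path and the
handler path come from the text above. Writing the real file requires root.

```shell
# Hypothetical helper, not part of mcelog: write a handler path into a
# machine check trigger file after checking that the file is writable.
set_mce_trigger() {
    trigger_file="$1"   # e.g. /sys/devices/system/machinecheck/machinecheck0/trigger
    handler="$2"        # e.g. /usr/sbin/mcelog
    if [ ! -w "$trigger_file" ]; then
        echo "cannot write $trigger_file (not root, or no machine check support?)" >&2
        return 1
    fi
    printf '%s\n' "$handler" > "$trigger_file"
}
```

It would be called as
set_mce_trigger /sys/devices/system/machinecheck/machinecheck0/trigger /usr/sbin/mcelog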

In daemon mode mcelog runs continuously as a daemon in the background
and waits for errors. It is enabled by running mcelog --daemon &
from an init script. This is the fastest mode and has the most features.

The recommended mode is daemon, because several new functions (like page error
predictive failure analysis) require a continuously running daemon.

Documentation:

The primary reference documentation is the man pages.
lk10-mcelog.pdf has an overview of the errors mcelog handles
(originally from Linux Kongress 2010).
mce.pdf is a very old paper describing the first releases of mcelog
(some parts are obsolete).

For distributors:

You can run mcelog from systemd or similar daemons. An example
systemd unit file is in mcelog.service.

For older distributions using init scripts:

Please install an init script by default that runs mcelog in daemon mode.
The mcelog.init script is a good starting point.

Also install a logrotate file (mcelog.logrotate) or equivalent
when mcelog is running in daemon mode.

These two are not installed by make install.

The installation also requires a config file (/etc/mcelog.conf) and
the default triggers. These are all installed by "make install".

/dev/mcelog is needed for mcelog operation.
If it's not there it can be created with mknod /dev/mcelog c 10 227
Normally it should be created automatically by udev.
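
A short sketch of the device check described above, assuming a POSIX
shell. The helper name check_mcelog_dev is hypothetical and not part of
mcelog. It only reports the state of the node and prints the mknod
command from the text; it does not create anything itself.

```shell
# Hypothetical helper, not part of mcelog: report whether the device node
# exists as a character device, and print the mknod command if it does not.
check_mcelog_dev() {
    dev="${1:-/dev/mcelog}"
    if [ -c "$dev" ]; then
        echo "$dev present"
    else
        echo "$dev missing; as root run: mknod $dev c 10 227"
        return 1
    fi
}
```

Called without arguments it checks /dev/mcelog; the optional argument
exists only so the check can be tried against another path.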

Security:

mcelog needs to run as root because it might trigger actions like
page-offlining, which require CAP_SYS_ADMIN. Also it opens /dev/mcelog
and a unix socket for client support.

It also opens /dev/mem to parse the BIOS DMI tables. It is careful
to close the file descriptor and unmap any mappings after using them.

There is support for changing the user in daemon mode after opening
the device and the sockets, but that would stop triggers from
doing corrective actions that require root.

In principle it would be possible to keep only CAP_SYS_ADMIN
for page-offlining, but that would prevent triggers from doing root-only
actions not covered by it (and CAP_SYS_ADMIN is not that different
from full root).

In daemon mode mcelog listens to a unix socket and processes
requests from mcelog --client. This can be disabled in the configuration file.
The uid/gid of the requestor is checked on access and is configurable
(default 0/0 only). The command parsing code is very straightforward
(server.c). The client parsing/reply is currently done with the full privileges
of the daemon.

Testing:

There is a simple test suite in tests/. The test suite requires root to 
run and access to mce-inject and a kernel with MCE injection support 
(CONFIG_X86_MCE_INJECT).  It will kill any running mcelog daemon.

Run it with "make test"

The test suite requires the mce-inject tool, available from
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
The mce-inject executable must be either in $PATH or in the
../mce-inject directory.
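
The lookup order described above can be sketched as follows, assuming a
POSIX shell and that ../mce-inject is taken relative to the current
directory. The function name find_mce_inject is hypothetical; the test
suite may locate the tool differently.

```shell
# Hypothetical helper: print the path of mce-inject, looking in $PATH first
# and then in ../mce-inject, the two locations named above.
find_mce_inject() {
    if command -v mce-inject >/dev/null 2>&1; then
        command -v mce-inject
    elif [ -x ../mce-inject/mce-inject ]; then
        echo ../mce-inject/mce-inject
    else
        echo "mce-inject not found in \$PATH or ../mce-inject" >&2
        return 1
    fi
}
```

Running find_mce_inject before "make test" gives an early hint when the
tool is missing.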

You can also test under valgrind with "make valgrind-test". For 
this valgrind needs to be installed of course.  Advanced
valgrind options can be specified with 
make VALGRIND="valgrind --option" valgrind-test

Other checks:

make iccverify and make clangverify run the static verifiers
in icc and clang respectively.

License:

This program is licensed under the GNU General Public
License, version 2.
