GitHub - aizvorski/mubench: low-level x86 instruction benchmark

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
modules/IPC		modules/IPC
.gitignore		.gitignore
COPYING		COPYING
MANIFEST		MANIFEST
Makefile		Makefile
PLEASE_CONTRIBUTE_RESULTS		PLEASE_CONTRIBUTE_RESULTS
README		README
TODO		TODO
collate.pl		collate.pl
instructions.pl		instructions.pl
mubench-web.pl		mubench-web.pl
mubench.pl		mubench.pl

Repository files navigation

mubench - low-level x86 instruction benchmark
Copyright (C) 2005-2006 Alex Izvorski <aizvorski@gmail.com>


== About ==

mubench is an in-depth, low-level benchmark for x86 processors.  Its primary goal is to provide useful information for people who optimize assembly code and for people who write compilers.  It measures latency and throughput for each individual instruction (sometimes several forms of the same instruction), as well as the throughput of arbitrary instruction mixes.  The results produced by mubench are typically an order of magnitude more detailed than those found in AMD or Intel manuals.

mubench results for a variety of processors are available at the mubench website, http://mubench.sourceforge.org/ .  If you find this information useful, please run mubench on your processor and upload the results to the website.  Instructions on how to contribute results are found below.

mubench fully supports all SIMD instruction sets for the x86, including SSSE3, SSE3, SSE2, SSE, MMX, MMX Ext, 3DNow! and 3DNow! Ext.  Support for non-SIMD instructions is partial: most data move, binary arithmetic, logical, shift/rotate and bit/byte instructions are supported, but other instructions, particularly branch and function call instructions or instructions manipulating the stack, are not supported.  Floating-point instructions for the x87 are not supported.  mubench only uses register-to-register (or immediate) forms of the instructions; memory operands are not supported.  These limitations will be gradually removed in later releases.


== Running ==

perl mubench.pl [options]

Options:
 --(no-)accurate           runs tests several times (default on)
 --mhz=2500                processor speed in MHz (default detected from /proc/cpuinfo, set here if that is wrong, for example if you have SpeedStep or Cool'n'Quiet enabled)
 --(no-)64bit              benchmark 64-bit (amd64, emt64, x86-64) instructions (default detected from uname)
 --(no-)32bit              benchmark 32-bit instructions
 --(no-)pairs              benchmark instruction mixes (default off, extremely slow but provides info on cross-latencies)
 --include=<pattern>,<pattern2>  benchmark only instructions matching the given list of patterns (regular expressions ok)
 --output=xml|csv|text     select output format
 --outfile=file.xml        output file to save results to (default mubench-results-<date>.xml if xml, standard output otherwise)
 --gcc=<path to gcc>       which compiler to use (must be gcc or derivative, e.g. kgcc, pgcc) (default gcc in path)
 --cflags=<compiler flags> which compiler flags to use (default enable all simd instruction sets)

Run this benchmark on an otherwise idle system (or as close as possible to idle: the benchmark will try to compensate for occasional cpu usage).

The quick benchmark runs in 5-15 minutes.  The full benchmark (with the --pairs option) takes 6-8 hours to comlpete on a x86-64 system, or 2-3 hours on a x86 system (since there are fewer instructions to try).


== Contributing results ==

Simply run perl mubench.pl with no options.  It will create a file mubench-results-<date>.xml.bz2.  Upload this file using the upload form at http://mubench.sourceforge.org/contribute.html .

If you have a lot of time available, you can run with the --pairs option, which produces a very complete set of timings but takes 6-9 hours to run.  We would love to get a contribution of the results for your processor with the --pairs option.  Thanks!


== Output ==

instruction     latency throughput
----------------------------------
paddb xmm, xmm  1.0038  0.50349

All numbers are measured in clock cycles.

Latency=2 means it takes two clock cycles for the result to be available.  Throughput=2 means a new instruction of the same kind can only be started once every two clock cycles.  Note that smaller latency *and smaller throughput* are faster.  Many instructions on recent processors have throughput < 1, meaning more than one of the same instruction can run in the same clock cycle.  It is normal to have some non-integer values, although a lot of instructions will typically have throughput=1.  The same instruction with different operands may have different performance.


== Requires ==

Perl modules: IPC::Run.
Recent versions of gcc and binutils (gcc >= 3.3, binutils >= 2.16.92 for SSSE3/MNI support) which must be in the path.


== Files ==

Creates test.c and test in the current working directory.

Tries to read /proc/cpuinfo on startup.


== Bugs ==

* mubench <= 0.2.1 did not work with gcc 4.x.x