As with any software, performance is a sensitive topic that often sparks heated discussions. It is objectively difficult to produce scientifically accurate figures, because they depend on many factors, including at least the hardware, operating system, Common Lisp implementation, optimization flags and usage pattern.
What follows is a list of micro-benchmarks, suitable for getting an initial idea of STMX performance for short, non-conflicting transactions.
Setup and optimization flags:
Before starting the REPL, it is recommended to remove any cached FASL file by deleting the folder ~/.cache/common-lisp/
Start the REPL and execute what follows:
    (declaim (optimize (compilation-speed 0) (space 0) (debug 0) (safety 0) (speed 3)))
    (ql:quickload "stmx")
    (in-package :stmx.util)

    (defvar *v* (tvar 0))
    (defvar *m*  (new 'rbmap :pred 'fixnum<))
    (defvar *tm* (new 'tmap  :pred 'fixnum<))
    (defvar *h*  (new 'ghash-table :test 'fixnum= :hash 'identity))
    (defvar *th* (new 'thash-table :test 'fixnum= :hash 'identity))

    ;; some initial values
    (setf (get-gmap *m* 1) 0)
    (setf (get-gmap *tm* 1) 0)
    (setf (get-ghash *h* 1) 0)
    (setf (get-ghash *th* 1) 0)

    (defmacro x3 (&rest body)
      `(let ((v *v*)
             (m *m*)
             (tm *tm*)
             (h *h*)
             (th *th*))
         (declare (ignorable v m tm h th))
         (dotimes (,(gensym) 3)
           ,@body)))

    (defmacro 1m (&rest body)
      `(time (dotimes (i 1000000)
               ,@body)))
To warm up STMX and the Common Lisp process before starting the benchmarks, it is also recommended to first run the test suite with:

    (ql:quickload "stmx.test")
    (fiveam:run! 'stmx.test:suite)
Run each benchmark inside an `(atomic ...)` block one million times (see the `1m` macro above) in a single thread. Repeat each run three times (see the `x3` macro above) and take the lowest of the three reported elapsed times. Divide by one million to get the average elapsed real time per iteration.
For example, to run the benchmark `($ v)`, type:

    (x3 (1m (atomic ($ v))))
All timings reported in the next section were produced by running the procedure just described on the author's system. For each benchmark they give the average elapsed real time per iteration, i.e. the total elapsed time divided by the number of iterations (one million).
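
The measurement procedure described above (repeat a timed loop three times, keep the lowest elapsed time, divide by the iteration count) can be sketched in portable Common Lisp, without relying on STMX. The helper names `elapsed-seconds` and `best-time-per-iteration` are hypothetical and are not part of STMX. Unlike the `1m` macro, this sketch calls the benchmarked code through `funcall`, so it adds a small constant overhead to every iteration.

```lisp
;; Hypothetical helpers, not part of STMX: portable Common Lisp only.

(defun elapsed-seconds (thunk)
  "Call THUNK once and return the elapsed real time in seconds (a rational)."
  (let ((start (get-internal-real-time)))
    (funcall thunk)
    (/ (- (get-internal-real-time) start)
       internal-time-units-per-second)))

(defun best-time-per-iteration (thunk &key (iterations 1000000) (runs 3))
  "Call THUNK ITERATIONS times in a loop, repeat that loop RUNS times,
and return the lowest elapsed time per iteration, in microseconds."
  (let ((best (loop repeat runs
                    minimize (elapsed-seconds
                              (lambda ()
                                (loop repeat iterations
                                      do (funcall thunk)))))))
    ;; BEST is in seconds per run: convert to microseconds per iteration.
    (float (/ (* best 1000000) iterations))))

;; Example with a trivial thunk and fewer iterations:
(best-time-per-iteration (lambda () (+ 1 2)) :iterations 1000)
```

With STMX loaded as in the setup above, passing `(lambda () (atomic ($ *v*)))` as the thunk should approximate the `($ v)` benchmark, although the figures in this document were obtained with the `x3` and `1m` macros, not with this helper.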
What follows are some timings obtained on the author's system. They by no means claim to be exact, absolute or reproducible: your mileage may vary.
Date: 12 April 2015
Hardware: Intel Core-i7 4770 @3.4 GHz (quad-core w/ hyper-threading), 16GB RAM
Software: Debian GNU/Linux 7.0 (x86_64), SBCL 1.2.10 (x86_64), STMX 2.0.4
Single-thread benchmarks, executed one million times
Average time in microseconds:

| name | executed code | STMX sw-only transactions | STMX hybrid hw+sw (requires Intel TSX and 64-bit SBCL) | HAND-OPTIMIZED hw transactions - see doc/benchmark.lisp |
|---|---|---|---|---|
| read-write-N | best fit of the 3 runs above | (0.142+N*0.0098) | (0.0226+N*0.0036) | (0.0260+N*0.0016) |
| orelse retry-N | best fit of the 3 runs above | (0.248+N*0.178) | (0.308+N*0.182) | |
| grow tmap from N to N+1 entries (up to 10) | | | | |
| grow tmap from N to N+1 entries (up to 100) | | | | |
| grow tmap from N to N+1 entries (up to 1000) | | | | |
| grow thash from N to N+1 entries (up to 10) | | | | |
| grow thash from N to N+1 entries (up to 100) | | | | |
| grow thash from N to N+1 entries (up to 1000) | | | | |
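
The best-fit expressions in the read-write-N row are linear models of the form intercept + N * slope, in microseconds. As a rough illustration, the following sketch evaluates them for a given N and computes where two fitted lines cross. The names `*read-write-fits*`, `fit-coefficients`, `estimated-microseconds` and `crossover-n` are hypothetical; only the coefficients come from the table.

```lisp
;; Coefficients copied from the read-write-N row: (kind intercept slope),
;; both in microseconds. All names below are hypothetical, not STMX API.
(defparameter *read-write-fits*
  '((:sw-only        0.142  0.0098)
    (:hybrid         0.0226 0.0036)
    (:hand-optimized 0.0260 0.0016)))

(defun fit-coefficients (kind)
  "Return the (intercept slope) list for KIND, or signal an error."
  (or (rest (assoc kind *read-write-fits*))
      (error "Unknown fit: ~S" kind)))

(defun estimated-microseconds (kind n)
  "Evaluate the linear fit for KIND at N."
  (destructuring-bind (intercept slope) (fit-coefficients kind)
    (+ intercept (* n slope))))

(defun crossover-n (kind-a kind-b)
  "Return the N at which the two fitted lines predict the same time.
Assumes the two slopes differ."
  (destructuring-bind (ia sa) (fit-coefficients kind-a)
    (destructuring-bind (ib sb) (fit-coefficients kind-b)
      ;; ia + sa*N = ib + sb*N  =>  N = (ib - ia) / (sa - sb)
      (/ (- ib ia) (- sa sb)))))

(estimated-microseconds :sw-only 10)
(crossover-n :hybrid :hand-optimized)
```

For N = 10 the sw-only fit gives roughly 0.24 microseconds and the hybrid fit roughly 0.059 microseconds. The hybrid and hand-optimized fits cross near N = 1.7, so on these figures the hand-optimized version is only ahead from N = 2 upward.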
Concurrent benchmarks on a 4-core CPU. They already iterate
ten million times, do not wrap them in |
Dining philosophers, load with|
Millions of transactions per second:

| number of threads | executed code | STMX sw-only transactions | STMX hybrid hw+sw | STMX hybrid hw+sw, HAND OPTIMIZED | hw-only, HAND-OPTIMIZED | LOCK (atomic compare-and-swap) | LOCK (bordeaux-threads mutex) |
|---|---|---|---|---|---|---|---|