Permalink
Fetching contributors…
Cannot retrieve contributors at this time
313 lines (239 sloc) 11.3 KB

STMX Performance

As for any software, the topic of performance is often sensitive and plagued with heated discussions. It is objectively difficult to come up with scientifically accurate figures as they depend on many factors, including at least hardware, operating system, common lisp implementation, optimization flags and usage pattern.

What follows is a list of micro-benchmarks, suitable to have an initial idea about STMX performance for short, non-conflicting transactions.

Benchmark procedure

Setup and optimization flags:

  1. Before starting the REPL, it is recommended to remove any cached FASL file by deleting the folder ~/.cache/common-lisp/

  2. Start the REPL and execute what follows:

     (declaim (optimize (compilation-speed 0) (space 0) (debug 0) (safety 0) (speed 3)))
     (ql:quickload "stmx")
     (in-package :stmx.util)
     (defvar *v* (tvar 0))
     (defvar *m*  (new 'rbmap :pred 'fixnum<)) 
     (defvar *tm* (new 'tmap  :pred 'fixnum<)) 
     (defvar *h*  (new 'ghash-table :test 'fixnum= :hash 'identity)) 
     (defvar *th* (new 'thash-table :test 'fixnum= :hash 'identity)) 
     ;; some initial values
     (setf (get-gmap *m* 1) 0)
     (setf (get-gmap *tm* 1) 0)
     (setf (get-ghash *h* 1) 0)
     (setf (get-ghash *th* 1) 0)
     (defmacro x3 (&rest body)
       `(let ((v *v*)
              (m *m*)
              (tm *tm*)
              (h *h*)
              (th *th*))
          (declare (ignorable v m tm h th))
          (dotimes (,(gensym) 3)
            ,@body)))
     (defmacro 1m (&rest body)
       `(time (dotimes (i 1000000)
          ,@body)))
    
  3. to warm-up STMX and the common-lisp process before starting the benchmarks, it is also recommended to run first the test suite with:

     (ql:quickload "stmx.test")
     (fiveam:run! 'stmx.test:suite)
    
  4. Run each benchmark inside an (atomic ...) block one million times (see 1m macro above) in a single thread. Repeat each run three times (see 3x macro above) and take the lowest of the three reported elapsed times. Divide by one million to get the average elapsed real time per iteration.

    This means for example that to run the benchmark ($ v) one has to type (x3 (1m (atomic ($ v))))

All timings reported in the next section are the output on the author's system of the procedure just described, and thus for each benchmark they contain the average elapsed real time per iteration, i.e. the total elapsed time divided by the number of iterations (one million).

Benchmark results

What follows are some timings obtained on the authors's system, and by no means they claim to be exact, absolute or reproducible: your mileage may vary.

Date: 12 April 2015

Hardware: Intel Core-i7 4770 @3.4 GHz (quad-core w/ hyper-threading), 16GB RAM

Software: Debian GNU/Linux 7.0 (x86_64), SBCL 1.2.10 (x86_64), STMX 2.0.4

Single-thread benchmarks, executed one million times with (x3 (1m (atomic ...)))
name executed code STMX sw-only transactions STMX hybrid hw+sw (requires Intel TSX and 64-bit SBCL) HAND-OPTIMIZED hw transactions - see doc/benchmark.lisp
average time in microseconds
nil nil 0.0710.0230.012
read-1 ($ v) 0.0890.0230.014
write-1 (setf ($ v) 1) 0.1130.0270.026
read-write-1(incf (the fixnum ($ v))) 0.1380.0270.024
read-write-10 (dotimes (j 10) (incf (the fixnum ($ v)))) 0.2260.0360.033
read-write-100 (dotimes (j 100) (incf (the fixnum ($ v)))) 1.1140.1970.196
read-write-1000 (dotimes (j 1000) (incf (the fixnum ($ v)))) 9.9091.8961.749
read-write-Nbest fit of the 3 runs above (0.142+N*0.0098)(0.0226+N*0.0036)(0.0260+N*0.0016)
orelse empty (orelse) 0.0420.0270.022
orelse unary (orelse ($ v)) 0.2190.263N/A
orelse retry-1 (orelse (retry) ($ v)) 0.3540.438N/A
orelse retry-2 (orelse (retry) (retry) ($ v)) 0.5010.586N/A
orelse retry-4 (orelse (retry) (retry)
(retry) (retry) ($ v))
0.7810.872N/A
orelse retry-N best fit of the 3 runs above (0.248+N*0.178)(0.308+N*0.182)
tmap read-1 (get-gmap tm 1) 0.2610.1840.178
tmap read-write-1 (incf (get-gmap tm 1)) 0.5580.4190.409
grow tmap from N to N+1 entries (up to 10) (when (zerop (mod i 10)) (clear-gmap tm))
(set-gmap tm i t)
3.2323.706
grow tmap from N to N+1 entries (up to 100) (when (zerop (mod i 100)) (clear-gmap tm))
(set-gmap tm i t)
4.6695.249
grow tmap from N to N+1 entries (up to 1000) (when (zerop (mod i 1000)) (clear-gmap tm))
(set-gmap tm i t)
5.8426.464
thash read-write-1 (incf (get-ghash th 1)) 0.9560.566
grow thash from N to N+1 entries (up to 10) (when (zerop (mod i 10)) (clear-ghash th))
(set-ghash th i t)
1.6691.738
grow thash from N to N+1 entries (up to 100) (when (zerop (mod i 100)) (clear-ghash th))
(set-ghash th i t)
1.2241.400
grow thash from N to N+1 entries (up to 1000) (when (zerop (mod i 1000)) (clear-ghash th))
(set-ghash th i t)
1.1741.325
Concurrent benchmarks on a 4-core CPU. They already iterate ten million times, do not wrap them in (1m ...).
Dining philosophers, load with
(load "stmx/example/dining-philosophers.stmx.lisp")
(load "stmx/example/dining-philosophers.stmx-hw.lisp")
(load "stmx/example/dining-philosophers.hw-only.lisp")
(load "stmx/example/dining-philosophers.lock.lisp")
(in-package :stmx.example.dining-philosophers.[...])
number of threads executed code STMX sw-only transactions STMX hybrid hw+sw STMX hybrid hw+sw, HAND OPTIMIZED hw-only, HAND-OPTIMIZED LOCK (atomic compare-and-swap) LOCK (bordeaux-threads mutex)
millions transactions per second
1 thread (dining-philosophers 1) 4.51124.4534.97 50.0068.9714.64
2 threads (dining-philosophers 2) 8.1509.4611.48 26.4456.9211.43
3 threads (dining-philosophers 3) 11.869.6710.40 30.3351.629.48
4 threads (dining-philosophers 4) 15.4811.8313.84 32.0544.9814.24
5 threads (dining-philosophers 5) 13.7213.0515.16 32.3863.1318.59
6 threads (dining-philosophers 6) 14.7913.9414.86 37.4872.4619.14
7 threads (dining-philosophers 7) 15.0014.3913.25 43.4886.6320.92
8 threads (dining-philosophers 8) 15.7913.5914.15 47.90102.1123.55
10 threads (dining-philosophers 10) 15.2413.9616.59 56.18117.1030.24
15 threads (dining-philosophers 15) 15.4316.2821.54 88.94165.2049.68
20 threads (dining-philosophers 20) 15.5518.5921.12 142.20203.7753.89
25 threads (dining-philosophers 25) 188.54
30 threads (dining-philosophers 30) 15.5115.8416.01 211.86235.9457.64
35 threads (dining-philosophers 35) 260.61
40 threads (dining-philosophers 40) 15.5020.1615.20 278.75254.6258.34
50 threads (dining-philosophers 50) 15.4216.3419.27 272.33262.6758.98
100 threads (dining-philosophers 100) 15.51 275.22274.80
200 threads (dining-philosophers 200) 15.53 284.21277.47