Compute Performance Benchmarks
Tests performed in February 2019 on the SCExAO instrument's mixed GPU/CPU system.
Extreme-AO test: 3 kHz loop, 14,400 sensors, 2000 actuators, 1089 control modes, modal control with pseudo-open-loop reconstruction.
The ExAO loop can run at approximately 4.2 kHz (2 GPUs total) or 5.3 kHz (3 GPUs total).
At a 2 kHz loop speed, a 16-step, 400-mode predictive control filter can be updated every 100 s, with a filter compute time of 26 s.
Test system:
- Dual-socket Intel Xeon Scalable 6140 running at 3 GHz (turbo), 315 GB RAM
- GPUs: Nvidia RTX2080Ti (x2), GTX1080Ti (x2), GTX980Ti (x4)
- Kernel: 4.18.0-15-lowlatency #16-Ubuntu SMP PREEMPT
- cacao v0.1.02-dev
- Nvidia driver 410.93, CUDA 10.0, MAGMA 2.5
In this test, a single wavefront sensor (WFS) drives a single deformable mirror (DM) at high speed (3 kHz). The compute-intensive operations, two matrix-vector multiplications (MVMs), are deployed to GPUs, with one GPU for each MVM.
Test case:
- 14,400 WFS sensors, 2000 DM actuators, 3 kHz loop.
- sensor mask reduces number of sensors to 6827 active pixels
- DM mask reduces number of actuators to 1833 active actuators
- Number of modes = 1089
- GPU performs MVM#1 from sensors to modes: [6827 x 1089] x [6827] -> [1089]
- Modal control done by CPU, includes real-time pseudo-open loop reconstruction
- GPU performs MVM#2 from modes to actuators: [1089 x 1833] x [1089] -> [1833]
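As a rough sizing check on the two MVMs listed above, the sketch below counts matrix elements and the memory traffic needed to read each matrix once per frame at 3 kHz. It assumes single-precision (4-byte) matrix elements and that the full matrix is read from GPU memory on every loop iteration; neither assumption is stated in this test description, and the function name `mvm_traffic` is purely illustrative.

```python
# Rough sizing of the two matrix-vector multiplications (MVMs).
# Assumptions (not from the benchmark): 4-byte single-precision elements,
# and the full matrix is read from GPU memory once per loop iteration.

BYTES_PER_ELEMENT = 4   # assumed single precision
LOOP_RATE_HZ = 3000     # 3 kHz loop, from the test case


def mvm_traffic(n_in, n_out, rate_hz=LOOP_RATE_HZ):
    """Return (matrix elements, matrix bytes, bytes read per second)."""
    elements = n_in * n_out
    matrix_bytes = elements * BYTES_PER_ELEMENT
    return elements, matrix_bytes, matrix_bytes * rate_hz


mvm1 = mvm_traffic(6827, 1089)   # MVM#1: active sensors -> modes
mvm2 = mvm_traffic(1089, 1833)   # MVM#2: modes -> active actuators

for name, (elements, nbytes, rate) in (("MVM#1", mvm1), ("MVM#2", mvm2)):
    print(f"{name}: {elements} elements, {nbytes / 1e6:.1f} MB, "
          f"{rate / 1e9:.1f} GB/s at {LOOP_RATE_HZ} Hz")
```

Under these assumptions the MVM#1 matrix is roughly 3.7 times larger than the MVM#2 matrix, which is consistent with MVM#1 being the heavier of the two operations.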
MVM#1 GPU utilization:
- RTX2080Ti: 43%
- GTX1080Ti: 65%
- GTX980Ti: 72%
MVM#2 GPU utilization:
- RTX2080Ti: 9%
- GTX1080Ti: 12%
- GTX980Ti: 15%
CPU utilization (Intel Xeon Scalable 6140):
- aolrun process (main loop): 17.5% per core, deployed on 6 cores
- dmcomb process: 11.4% (single core)
The highest load is on the GPU performing MVM#1 (43% utilization on the RTX2080Ti).
Detailed timing info (MVM#1 on RTX2080Ti, MVM#2 on GTX980Ti)
STATUS 0 0.69 % [ 690 / 100000 ] [ 2.305 us] LOAD IMAGE
STATUS 1 6.35 % [ 6354 / 100000 ] [ 21.229 us] DARK SUBTRACT
STATUS 2 0.56 % [ 555 / 100000 ] [ 1.854 us] COMPUTE WFS IMAGE TOTAL
STATUS 3 1.44 % [ 1437 / 100000 ] [ 4.801 us] NORMALIZE WFS IMAGE
STATUS 4 1.59 % [ 1592 / 100000 ] [ 5.319 us] SUBTRACT REFERENCE
STATUS 5 0.04 % [ 41 / 100000 ] [ 0.137 us] MULTIPLYING BY CONTROL MATRIX -> MODE VALUES : SETUP
STATUS 6 0.03 % [ 26 / 100000 ] [ 0.087 us] START CONTROL MATRIX MULTIPLICATION: CHECK IF NEW CM EXISTS
STATUS 7 0.02 % [ 17 / 100000 ] [ 0.057 us] CONTROL MATRIX MULT: CREATE COMPUTING THREADS
STATUS 8 52.86 % [ 52859 / 100000 ] [ 176.601 us] CONTROL MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS 9 0.40 % [ 403 / 100000 ] [ 1.346 us] CONTROL MATRIX MULT: COMBINE TRHEADS RESULTS
STATUS 10 1.39 % [ 1385 / 100000 ] [ 4.627 us] CONTROL MATRIX MULT: INCREMENT COUNTER AND EXIT FUNCTION
STATUS 11 6.18 % [ 6181 / 100000 ] [ 20.651 us] MULTIPLYING BY GAINS
STATUS 12 0.02 % [ 20 / 100000 ] [ 0.067 us] ENTER SET DM MODES
STATUS 13 0.00 % [ 0 / 100000 ] [ 0.000 us] START DM MODES MATRIX MULTIPLICATION
STATUS 14 0.00 % [ 0 / 100000 ] [ 0.000 us] MATRIX MULT: CREATE COMPUTING THREADS
STATUS 15 0.00 % [ 0 / 100000 ] [ 0.000 us] MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS 16 0.00 % [ 0 / 100000 ] [ 0.000 us] MATRIX MULT: COMBINE TRHEADS RESULTS
STATUS 17 0.00 % [ 0 / 100000 ] [ 0.000 us] MATRIX MULT: INCREMENT COUNTER AND EXIT FUNCTION
STATUS 18 0.04 % [ 38 / 100000 ] [ 0.127 us] LOG DATA
STATUS 19 0.04 % [ 43 / 100000 ] [ 0.144 us] READING IMAGE
STATUS 20 28.36 % [ 28359 / 100000 ] [ 94.747 us] ... WAITING FOR IMAGE
----1--------2--------3--------4--------5--------6----
wait im | ->GPU | COMPUTE | ->CPU
------------------------------------------------------
GPU 0 : 0.00 % 46.90 % 4.09 % 2.08 % 45.05 % 1.88 %
GPU 0 : 0.00 us 156.70 us 13.66 us 6.95 us 150.53 us 6.26 us
----1--------2--------3--------4--------5--------6----
GPU 0 : 0.00 % 0.00 % 0.00 % 0.00 % 0.00 % 0.00 %
--------------- MODAL STRING -------------------------------------------------------------
STATUSM 0 7.00 % [ 6998 / 100000 ] [ 23.380 us] DARK SUBTRACT
STATUSM 1 1.90 % [ 1900 / 100000 ] [ 6.348 us] NORMALIZE
STATUSM 2 56.90 % [ 56904 / 100000 ] [ 190.115 us] EXTRACT WFS MODES
STATUSM 3 0.92 % [ 919 / 100000 ] [ 3.070 us] UPDATE CURRENT DM STATE
STATUSM 4 0.05 % [ 54 / 100000 ] [ 0.180 us] MIX PREDICTION WITH CURRENT DM STATE
STATUSM 5 1.71 % [ 1713 / 100000 ] [ 5.723 us] MODAL FILTERING / CLIPPING
STATUSM 6 0.59 % [ 595 / 100000 ] [ 1.988 us] INTER-PROCESS LATENCY
STATUSM 10 23.81 % [ 23809 / 100000 ] [ 79.545 us] MODES TO DM ACTUATORS (GPU)
STATUSM 20 7.11 % [ 7108 / 100000 ] [ 23.748 us] ... WAITING FOR IMAGE imWFS0
The loop spends about 95 us per frame waiting for the next image (STATUS 20 in the listing above), out of a 333 us loop period. The maximum compute-limited loop speed for this configuration would therefore be approximately 4.2 kHz, at which point the total frame wait time goes to zero.
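The maximum loop speed quoted above follows from subtracting the per-frame wait time from the loop period. A minimal sketch of that arithmetic, assuming the time spent waiting for the image is the only slack in the loop and that no other step changes duration at higher frame rates (the function name is illustrative):

```python
# Estimate the maximum compute-limited loop rate from the measured idle time.
# Assumption: all time spent waiting for the next image could be removed by
# running the camera faster, and no other step changes duration.


def max_loop_rate_hz(loop_rate_hz, wait_us):
    """Loop rate at which the per-frame wait time would reach zero."""
    period_us = 1e6 / loop_rate_hz
    busy_us = period_us - wait_us
    if busy_us <= 0:
        raise ValueError("wait time must be shorter than the loop period")
    return 1e6 / busy_us


# 3 kHz loop, 94.747 us spent in STATUS 20 (waiting for image)
rate = max_loop_rate_hz(3000, 94.747)
print(f"{rate / 1e3:.1f} kHz")  # prints "4.2 kHz"
```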
Same test setup as above, with MVM#1 split over two RTX2080Ti GPUs.
MVM#1 GPU utilization:
- RTX2080Ti: 27%
- RTX2080Ti: 27%
Detailed timing:
STATUS 0 0.63 % [ 626 / 100000 ] [ 2.087 us] LOAD IMAGE
STATUS 1 6.61 % [ 6613 / 100000 ] [ 22.050 us] DARK SUBTRACT
STATUS 2 0.53 % [ 526 / 100000 ] [ 1.754 us] COMPUTE WFS IMAGE TOTAL
STATUS 3 1.42 % [ 1418 / 100000 ] [ 4.728 us] NORMALIZE WFS IMAGE
STATUS 4 1.51 % [ 1514 / 100000 ] [ 5.048 us] SUBTRACT REFERENCE
STATUS 5 0.04 % [ 43 / 100000 ] [ 0.143 us] MULTIPLYING BY CONTROL MATRIX -> MODE VALUES : SETUP
STATUS 6 0.04 % [ 38 / 100000 ] [ 0.127 us] START CONTROL MATRIX MULTIPLICATION: CHECK IF NEW CM EXISTS
STATUS 7 0.03 % [ 27 / 100000 ] [ 0.090 us] CONTROL MATRIX MULT: CREATE COMPUTING THREADS
STATUS 8 37.24 % [ 37242 / 100000 ] [ 124.177 us] CONTROL MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS 9 0.47 % [ 474 / 100000 ] [ 1.580 us] CONTROL MATRIX MULT: COMBINE TRHEADS RESULTS
STATUS 10 1.39 % [ 1392 / 100000 ] [ 4.641 us] CONTROL MATRIX MULT: INCREMENT COUNTER AND EXIT FUNCTION
STATUS 11 6.24 % [ 6240 / 100000 ] [ 20.806 us] MULTIPLYING BY GAINS
STATUS 12 0.03 % [ 27 / 100000 ] [ 0.090 us] ENTER SET DM MODES
STATUS 13 0.00 % [ 0 / 100000 ] [ 0.000 us] START DM MODES MATRIX MULTIPLICATION
STATUS 14 0.00 % [ 0 / 100000 ] [ 0.000 us] MATRIX MULT: CREATE COMPUTING THREADS
STATUS 15 0.00 % [ 0 / 100000 ] [ 0.000 us] MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS 16 0.00 % [ 0 / 100000 ] [ 0.000 us] MATRIX MULT: COMBINE TRHEADS RESULTS
STATUS 17 0.00 % [ 0 / 100000 ] [ 0.000 us] MATRIX MULT: INCREMENT COUNTER AND EXIT FUNCTION
STATUS 18 0.08 % [ 76 / 100000 ] [ 0.253 us] LOG DATA
STATUS 19 0.04 % [ 40 / 100000 ] [ 0.133 us] READING IMAGE
STATUS 20 43.70 % [ 43704 / 100000 ] [ 145.723 us] ... WAITING FOR IMAGE
----1--------2--------3--------4--------5--------6----
wait im | ->GPU | COMPUTE | ->CPU
------------------------------------------------------
GPU 0 : 0.00 % 64.19 % 2.68 % 1.98 % 29.29 % 1.86 %
GPU 1 : 0.00 % 0.00 % 0.00 % 0.00 % 0.00 % 0.00 %
GPU 0 : 0.00 us 214.03 us 8.93 us 6.61 us 97.65 us 6.21 us
GPU 1 : 0.00 us 0.00 us 0.00 us 0.00 us 0.00 us 0.00 us
----1--------2--------3--------4--------5--------6----
GPU 0 : 0.00 % 0.00 % 0.00 % 0.00 % 0.00 % 0.00 %
GPU 1 : 0.00 % 0.00 % 0.00 % 0.00 % 0.00 % 0.00 %
--------------- MODAL STRING -------------------------------------------------------------
STATUSM 0 7.24 % [ 7238 / 100000 ] [ 24.134 us] DARK SUBTRACT
STATUSM 1 1.86 % [ 1857 / 100000 ] [ 6.192 us] NORMALIZE
STATUSM 2 42.31 % [ 42310 / 100000 ] [ 141.075 us] EXTRACT WFS MODES
STATUSM 3 0.83 % [ 826 / 100000 ] [ 2.754 us] UPDATE CURRENT DM STATE
STATUSM 4 0.04 % [ 39 / 100000 ] [ 0.130 us] MIX PREDICTION WITH CURRENT DM STATE
STATUSM 5 1.56 % [ 1563 / 100000 ] [ 5.212 us] MODAL FILTERING / CLIPPING
STATUSM 6 0.59 % [ 589 / 100000 ] [ 1.964 us] INTER-PROCESS LATENCY
STATUSM 10 24.27 % [ 24271 / 100000 ] [ 80.927 us] MODES TO DM ACTUATORS (GPU)
STATUSM 20 21.31 % [ 21307 / 100000 ] [ 71.044 us] ... WAITING FOR IMAGE imWFS0
--------------- AUX MODAL STRING ---------------------------------------------------------
STATUSM1 0 0.22 % [ 224 / 100000 ] [ 0.747 us] WRITING MODAL CORRECTION IN CIRCULAR BUFFER
STATUSM1 1 0.28 % [ 283 / 100000 ] [ 0.944 us] COMPUTING TIME-DELAYED MODAL CORRECTION
STATUSM1 2 0.02 % [ 24 / 100000 ] [ 0.080 us] COMPUTING TIME-DELAYED PREDICTED CORRECTION
STATUSM1 3 0.56 % [ 556 / 100000 ] [ 1.854 us] COMPUTING OPEN LOOP WF
STATUSM1 4 1.75 % [ 1750 / 100000 ] [ 5.835 us] COMPUTING TELEMETRY
STATUSM1 5 97.16 % [ 97163 / 100000 ] [ 323.972 us] ... WAITING FOR INPUT
The loop spends about 145 us per frame waiting for the next image (STATUS 20 in the listing above), out of a 333 us loop period. The maximum compute-limited loop speed for this configuration would therefore be approximately 5.3 kHz, at which point the total frame wait time goes to zero.
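The timing listings above share a fixed line layout: a status group, a step number, a percentage, a count out of 100000, a time in microseconds, and a label. The sketch below parses such lines so the numbers can be cross-checked. The layout is inferred from the listings themselves, not from the cacao source, and reading the count as the share of samples spent in a step is an assumption.

```python
import re

# Layout inferred from the listings, e.g.
# "STATUS 8 37.24 % [ 37242 / 100000 ] [ 124.177 us] LABEL"
LINE_RE = re.compile(
    r"^(?P<group>STATUS\w*)\s+(?P<step>\d+)\s+(?P<percent>[\d.]+)\s*%\s*"
    r"\[\s*(?P<count>\d+)\s*/\s*(?P<total>\d+)\s*\]\s*"
    r"\[\s*(?P<time_us>[\d.]+)\s*us\]\s*(?P<label>.*)$"
)


def parse_status_line(line):
    """Return a dict for one timing line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "group": m["group"],
        "step": int(m["step"]),
        "percent": float(m["percent"]),
        "count": int(m["count"]),
        "total": int(m["total"]),
        "time_us": float(m["time_us"]),
        "label": m["label"].strip(),
    }


sample = [
    "STATUS 8 37.24 % [ 37242 / 100000 ] [ 124.177 us] CONTROL MATRIX MULT: WAIT FOR THREADS TO COMPLETE",
    "STATUS 20 43.70 % [ 43704 / 100000 ] [ 145.723 us] ... WAITING FOR IMAGE",
    "STATUSM1 5 97.16 % [ 97163 / 100000 ] [ 323.972 us] ... WAITING FOR INPUT",
]
rows = [r for r in map(parse_status_line, sample) if r is not None]

# If count/total is the share of the loop period spent in a step (assumed),
# each row implies a loop period of time_us / (count / total).
for r in rows:
    implied_period_us = r["time_us"] / (r["count"] / r["total"])
    print(f'{r["group"]} {r["step"]}: implied period {implied_period_us:.1f} us')
```

For these three lines the implied period comes out close to 333.4 us, which matches the 3 kHz loop rate of the test.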
In this test, the loop runs at 2 kHz. A new predictive filter is computed every 200,000 measurements (100 s). The data matrix is about 5 GB (400 x 16 x 200,000 elements, single precision). A single GPU computes the required pseudo-inverse of the data matrix. The problem size is just below the memory limit of an RTX2080Ti GPU.
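The data matrix size and accumulation time quoted above can be reproduced with a few lines of arithmetic. This sketch uses only the figures given in the text; the one assumption is that 1 GB means 10^9 bytes.

```python
# Size of the predictive-control data matrix and the time needed to fill it.
# All inputs come from the test description; 1 GB = 1e9 bytes is assumed.

N_MODES = 400           # modes under predictive control
N_STEPS = 16            # time steps per mode
N_SAMPLES = 200_000     # measurements per filter update
LOOP_RATE_HZ = 2000     # 2 kHz loop
BYTES_PER_ELEMENT = 4   # single precision

elements = N_MODES * N_STEPS * N_SAMPLES
size_gb = elements * BYTES_PER_ELEMENT / 1e9
accumulation_s = N_SAMPLES / LOOP_RATE_HZ

print(f"{elements} elements, {size_gb:.2f} GB, filled in {accumulation_s:.0f} s")
# prints "1280000000 elements, 5.12 GB, filled in 100 s"
```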
Time to accumulate a new data matrix: 100 s
Total predictive filter compute + update time: 26 s
Timing info:
[setup] 0.054 ms
[copy input to GPU] 1742.331 ms
[compute trans(A) x A] 1498.422 ms
[setup] 0.024 ms
[Compute eigenvalues] 1896.963 ms
[Select eigenvalues] 702.780 ms
[Compute M2] 32.967 ms
[Compute Ainv] 1172.749 ms
[Get Ainv from GPU] 2189.683 ms
[output setup] 0.005 ms
[Write output array] 2119.278 ms
[Test output] 0.001 ms
[Optional gemm] 701.399 ms
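To see where the time goes in the listing above, the sketch below sums the itemized steps and ranks them by duration. The values are copied from the listing; the variable names are illustrative.

```python
# Sum and rank the itemized steps from the timing listing (values in ms).
steps_ms = [
    ("setup", 0.054),
    ("copy input to GPU", 1742.331),
    ("compute trans(A) x A", 1498.422),
    ("setup", 0.024),
    ("Compute eigenvalues", 1896.963),
    ("Select eigenvalues", 702.780),
    ("Compute M2", 32.967),
    ("Compute Ainv", 1172.749),
    ("Get Ainv from GPU", 2189.683),
    ("output setup", 0.005),
    ("Write output array", 2119.278),
    ("Test output", 0.001),
    ("Optional gemm", 701.399),
]

total_ms = sum(t for _, t in steps_ms)
slowest = sorted(steps_ms, key=lambda item: item[1], reverse=True)[:3]

print(f"itemized total: {total_ms / 1e3:.2f} s")
for name, t in slowest:
    print(f"{name}: {t:.0f} ms ({100 * t / total_ms:.0f}% of itemized total)")
```

The itemized steps add up to about 12.06 s. The listing does not break down the remainder of the 26 s total compute and update time.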
compute and control for adaptive optics (cacao) - https://github.com/cacao-org/cacao