Compute Performance Benchmarks

Olivier Guyon edited this page Jul 5, 2019 · 7 revisions

Summary

Tests performed in Feb 2019 on the SCExAO instrument's mixed GPU/CPU system.

Extreme-AO test: 3kHz, 14,400 sensors, 2000 actuators, 1089 control modes, modal control with pseudo-open loop reconstruction.

The ExAO loop can run at approximately 4.2 kHz (2 GPUs total) or 5.3 kHz (3 GPUs total).

At 2 kHz loop speed, a 16-step, 400-mode predictive control filter can be updated every 100 sec, with a 26 sec filter compute time.


Test system:

Dual-socket Intel Xeon Scalable 6140 running at 3 GHz (turbo), RAM = 315 GB
GPUs: Nvidia RTX2080Ti (x2), GTX1080Ti (x2), GTX980Ti (x4)
kernel: 4.18.0-15-lowlatency #16-Ubuntu SMP PREEMPT
cacao v0.1.02-dev
nvidia driver 410.93, CUDA 10.0, MAGMA 2.5

Running AO control loop, mixed CPU/GPU, one GPU per MVM

In this test, a single WFS drives a single DM at high speed (3kHz). Compute-intensive operations (MVMs) are deployed to GPUs (one GPU for each of the two MVMs).

Test case:

  • 14,400 WFS sensors, 2000 DM actuators, 3 kHz loop.
  • sensor mask reduces number of sensors to 6827 active pixels
  • DM mask reduces number of actuators to 1833 active actuators
  • Number of modes = 1089
  • GPU performs MVM#1 from sensors to modes: [6827 x 1089] x [6827] -> [1089]
  • Modal control done by CPU, includes real-time pseudo-open loop reconstruction
  • GPU performs MVM#2 from modes to actuators: [1089 x 1833] x [1089] -> [1833]
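
To put the two MVMs in perspective, the sketch below estimates the matrix sizes and the sustained memory traffic and arithmetic rate each one implies at 3 kHz. It assumes the matrices are stored as 4-byte single-precision floats and are read once per loop iteration; neither is stated for the MVMs on this page, and the helper name `mvm_cost` is made up for this example.

```python
# Sketch: size and streaming rate of the two MVMs in this test case.
# Assumption: matrices are 4-byte single-precision floats, read once per frame.
BYTES_PER_ELEMENT = 4
LOOP_RATE_HZ = 3000


def mvm_cost(n_in, n_out, rate_hz=LOOP_RATE_HZ):
    """Return (matrix size in MB, memory traffic in GB/s, GFLOP/s) for one MVM."""
    elements = n_in * n_out
    size_mb = elements * BYTES_PER_ELEMENT / 1e6
    traffic_gb_s = elements * BYTES_PER_ELEMENT * rate_hz / 1e9
    gflops = 2 * elements * rate_hz / 1e9  # one multiply and one add per element
    return size_mb, traffic_gb_s, gflops


mvm1 = mvm_cost(6827, 1089)  # active sensors -> modes
mvm2 = mvm_cost(1089, 1833)  # modes -> active actuators

print("MVM#1: %.1f MB, %.1f GB/s, %.1f GFLOP/s" % mvm1)
print("MVM#2: %.1f MB, %.1f GB/s, %.1f GFLOP/s" % mvm2)
```

Under these assumptions MVM#1 is roughly 30 MB and MVM#2 roughly 8 MB, so MVM#1 is expected to be the heavier of the two.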

MVM#1 GPU utilization

RTX2080Ti    43%
GTX1080Ti    65%
GTX980Ti     72%

MVM#2 GPU utilization

RTX2080Ti     9%
GTX1080Ti    12%
GTX980Ti     15%

CPU utilization (Intel Xeon Scalable 6140)

aolrun process (main loop)   17.5%  (per core, deployed on 6 cores)
dmcomb process               11.4%  (single core)

The highest load is on the GPU performing MVM#1 (43% utilization on the RTX2080Ti, up to 72% on the GTX980Ti).

Detailed timing info (MVM#1 on RTX2080Ti, MVM#2 on GTX980Ti)

STATUS  0      0.69 %    [      690  /  100000  ]   [     2.305 us] LOAD IMAGE
STATUS  1      6.35 %    [     6354  /  100000  ]   [    21.229 us] DARK SUBTRACT
STATUS  2      0.56 %    [      555  /  100000  ]   [     1.854 us] COMPUTE WFS IMAGE TOTAL
STATUS  3      1.44 %    [     1437  /  100000  ]   [     4.801 us] NORMALIZE WFS IMAGE
STATUS  4      1.59 %    [     1592  /  100000  ]   [     5.319 us] SUBTRACT REFERENCE
STATUS  5      0.04 %    [       41  /  100000  ]   [     0.137 us] MULTIPLYING BY CONTROL MATRIX -> MODE VALUES : SETUP
STATUS  6      0.03 %    [       26  /  100000  ]   [     0.087 us] START CONTROL MATRIX MULTIPLICATION: CHECK IF NEW CM EXISTS
STATUS  7      0.02 %    [       17  /  100000  ]   [     0.057 us] CONTROL MATRIX MULT: CREATE COMPUTING THREADS
STATUS  8     52.86 %    [    52859  /  100000  ]   [   176.601 us] CONTROL MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS  9      0.40 %    [      403  /  100000  ]   [     1.346 us] CONTROL MATRIX MULT: COMBINE TRHEADS RESULTS
STATUS 10      1.39 %    [     1385  /  100000  ]   [     4.627 us] CONTROL MATRIX MULT: INCREMENT COUNTER AND EXIT FUNCTION
STATUS 11      6.18 %    [     6181  /  100000  ]   [    20.651 us] MULTIPLYING BY GAINS
STATUS 12      0.02 %    [       20  /  100000  ]   [     0.067 us] ENTER SET DM MODES
STATUS 13      0.00 %    [        0  /  100000  ]   [     0.000 us] START DM MODES MATRIX MULTIPLICATION
STATUS 14      0.00 %    [        0  /  100000  ]   [     0.000 us] MATRIX MULT: CREATE COMPUTING THREADS
STATUS 15      0.00 %    [        0  /  100000  ]   [     0.000 us] MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS 16      0.00 %    [        0  /  100000  ]   [     0.000 us] MATRIX MULT: COMBINE TRHEADS RESULTS
STATUS 17      0.00 %    [        0  /  100000  ]   [     0.000 us] MATRIX MULT: INCREMENT COUNTER AND EXIT FUNCTION
STATUS 18      0.04 %    [       38  /  100000  ]   [     0.127 us] LOG DATA
STATUS 19      0.04 %    [       43  /  100000  ]   [     0.144 us] READING IMAGE
STATUS 20     28.36 %    [    28359  /  100000  ]   [    94.747 us] ... WAITING FOR IMAGE

          ----1--------2--------3--------4--------5--------6----
                   wait im | ->GPU |     COMPUTE     |   ->CPU  
          ------------------------------------------------------
GPU  0  :    0.00 %  46.90 %   4.09 %   2.08 %  45.05 %   1.88 %
GPU  0  :   0.00 us 156.70 us 13.66 us  6.95 us 150.53 us  6.26 us

          ----1--------2--------3--------4--------5--------6----
GPU  0  :    0.00 %   0.00 %   0.00 %   0.00 %   0.00 %   0.00 %

--------------- MODAL STRING -------------------------------------------------------------
STATUSM   0      7.00 %    [     6998  /  100000  ]   [    23.380 us] DARK SUBTRACT
STATUSM   1      1.90 %    [     1900  /  100000  ]   [     6.348 us] NORMALIZE
STATUSM   2     56.90 %    [    56904  /  100000  ]   [   190.115 us] EXTRACT WFS MODES
STATUSM   3      0.92 %    [      919  /  100000  ]   [     3.070 us] UPDATE CURRENT DM STATE
STATUSM   4      0.05 %    [       54  /  100000  ]   [     0.180 us] MIX PREDICTION WITH CURRENT DM STATE
STATUSM   5      1.71 %    [     1713  /  100000  ]   [     5.723 us] MODAL FILTERING / CLIPPING
STATUSM   6      0.59 %    [      595  /  100000  ]   [     1.988 us] INTER-PROCESS LATENCY
STATUSM  10     23.81 %    [    23809  /  100000  ]   [    79.545 us] MODES TO DM ACTUATORS (GPU)
STATUSM  20      7.11 %    [     7108  /  100000  ]   [    23.748 us] ... WAITING FOR IMAGE imWFS0

About 95 us of each 333 us loop period is spent waiting for the next image (STATUS 20). The maximum compute loop speed for this configuration would be approximately 4.2 kHz, at which point the total frame wait time goes to 0.
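
The headroom estimate above can be reproduced from the STATUS 20 wait time in the timing listing: the loop is busy for the period minus the wait, and the maximum rate is the inverse of that busy time. This is a minimal sketch; the function name `max_loop_rate_hz` is made up for this example, and it assumes the busy time per frame does not change as the frame rate increases.

```python
def max_loop_rate_hz(loop_rate_hz, wait_us):
    """Estimate the loop rate at which the per-frame wait time reaches zero.

    loop_rate_hz: rate at which the timing was measured
    wait_us:      time spent waiting for the next image in each period
    Assumes the per-frame compute time is independent of the frame rate.
    """
    period_us = 1e6 / loop_rate_hz
    busy_us = period_us - wait_us
    if busy_us <= 0:
        raise ValueError("wait time must be shorter than the loop period")
    return 1e6 / busy_us


# Single GPU for MVM#1: 94.747 us spent in "... WAITING FOR IMAGE" at 3 kHz
rate = max_loop_rate_hz(3000, 94.747)
print("max loop rate: %.2f kHz" % (rate / 1e3))
```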

Splitting MVM#1 to multiple GPUs

Same test setup as above, with MVM#1 split over two RTX2080Ti GPUs.

MVM#1 GPU utilization

RTX2080Ti    27%
RTX2080Ti    27%

Detailed timing:

STATUS  0      0.63 %    [      626  /  100000  ]   [     2.087 us] LOAD IMAGE
STATUS  1      6.61 %    [     6613  /  100000  ]   [    22.050 us] DARK SUBTRACT
STATUS  2      0.53 %    [      526  /  100000  ]   [     1.754 us] COMPUTE WFS IMAGE TOTAL
STATUS  3      1.42 %    [     1418  /  100000  ]   [     4.728 us] NORMALIZE WFS IMAGE
STATUS  4      1.51 %    [     1514  /  100000  ]   [     5.048 us] SUBTRACT REFERENCE
STATUS  5      0.04 %    [       43  /  100000  ]   [     0.143 us] MULTIPLYING BY CONTROL MATRIX -> MODE VALUES : SETUP
STATUS  6      0.04 %    [       38  /  100000  ]   [     0.127 us] START CONTROL MATRIX MULTIPLICATION: CHECK IF NEW CM EXISTS
STATUS  7      0.03 %    [       27  /  100000  ]   [     0.090 us] CONTROL MATRIX MULT: CREATE COMPUTING THREADS
STATUS  8     37.24 %    [    37242  /  100000  ]   [   124.177 us] CONTROL MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS  9      0.47 %    [      474  /  100000  ]   [     1.580 us] CONTROL MATRIX MULT: COMBINE TRHEADS RESULTS
STATUS 10      1.39 %    [     1392  /  100000  ]   [     4.641 us] CONTROL MATRIX MULT: INCREMENT COUNTER AND EXIT FUNCTION
STATUS 11      6.24 %    [     6240  /  100000  ]   [    20.806 us] MULTIPLYING BY GAINS
STATUS 12      0.03 %    [       27  /  100000  ]   [     0.090 us] ENTER SET DM MODES
STATUS 13      0.00 %    [        0  /  100000  ]   [     0.000 us] START DM MODES MATRIX MULTIPLICATION
STATUS 14      0.00 %    [        0  /  100000  ]   [     0.000 us] MATRIX MULT: CREATE COMPUTING THREADS
STATUS 15      0.00 %    [        0  /  100000  ]   [     0.000 us] MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS 16      0.00 %    [        0  /  100000  ]   [     0.000 us] MATRIX MULT: COMBINE TRHEADS RESULTS
STATUS 17      0.00 %    [        0  /  100000  ]   [     0.000 us] MATRIX MULT: INCREMENT COUNTER AND EXIT FUNCTION
STATUS 18      0.08 %    [       76  /  100000  ]   [     0.253 us] LOG DATA
STATUS 19      0.04 %    [       40  /  100000  ]   [     0.133 us] READING IMAGE
STATUS 20     43.70 %    [    43704  /  100000  ]   [   145.723 us] ... WAITING FOR IMAGE

          ----1--------2--------3--------4--------5--------6----
                   wait im | ->GPU |     COMPUTE     |   ->CPU  
          ------------------------------------------------------
GPU  0  :    0.00 %  64.19 %   2.68 %   1.98 %  29.29 %   1.86 %
GPU  1  :    0.00 %   0.00 %   0.00 %   0.00 %   0.00 %   0.00 %
GPU  0  :   0.00 us 214.03 us  8.93 us  6.61 us 97.65 us  6.21 us
GPU  1  :   0.00 us  0.00 us  0.00 us  0.00 us  0.00 us  0.00 us

          ----1--------2--------3--------4--------5--------6----
GPU  0  :    0.00 %   0.00 %   0.00 %   0.00 %   0.00 %   0.00 %
GPU  1  :    0.00 %   0.00 %   0.00 %   0.00 %   0.00 %   0.00 %

--------------- MODAL STRING -------------------------------------------------------------
STATUSM   0      7.24 %    [     7238  /  100000  ]   [    24.134 us] DARK SUBTRACT
STATUSM   1      1.86 %    [     1857  /  100000  ]   [     6.192 us] NORMALIZE
STATUSM   2     42.31 %    [    42310  /  100000  ]   [   141.075 us] EXTRACT WFS MODES
STATUSM   3      0.83 %    [      826  /  100000  ]   [     2.754 us] UPDATE CURRENT DM STATE
STATUSM   4      0.04 %    [       39  /  100000  ]   [     0.130 us] MIX PREDICTION WITH CURRENT DM STATE
STATUSM   5      1.56 %    [     1563  /  100000  ]   [     5.212 us] MODAL FILTERING / CLIPPING
STATUSM   6      0.59 %    [      589  /  100000  ]   [     1.964 us] INTER-PROCESS LATENCY
STATUSM  10     24.27 %    [    24271  /  100000  ]   [    80.927 us] MODES TO DM ACTUATORS (GPU)
STATUSM  20     21.31 %    [    21307  /  100000  ]   [    71.044 us] ... WAITING FOR IMAGE imWFS0

--------------- AUX MODAL STRING ---------------------------------------------------------
STATUSM1  0      0.22 %    [      224  /  100000  ]   [     0.747 us] WRITING MODAL CORRECTION IN CIRCULAR BUFFER
STATUSM1  1      0.28 %    [      283  /  100000  ]   [     0.944 us] COMPUTING TIME-DELAYED MODAL CORRECTION
STATUSM1  2      0.02 %    [       24  /  100000  ]   [     0.080 us] COMPUTING TIME-DELAYED PREDICTED CORRECTION
STATUSM1  3      0.56 %    [      556  /  100000  ]   [     1.854 us] COMPUTING OPEN LOOP WF
STATUSM1  4      1.75 %    [     1750  /  100000  ]   [     5.835 us] COMPUTING TELEMETRY
STATUSM1  5     97.16 %    [    97163  /  100000  ]   [   323.972 us] ... WAITING FOR INPUT
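
For comparing runs, timing listings like the one above can be parsed into records. The sketch below is not part of cacao; the regular expression and the `parse_status_line` helper are written for this page's output format only. Treating labels that start with "..." as wait steps is a convention inferred from the listings, not a documented rule.

```python
import re

# Matches lines such as:
# STATUS  8     37.24 %    [    37242  /  100000  ]   [   124.177 us] LABEL
LINE_RE = re.compile(
    r"^(STATUS\w*)\s+(\d+)\s+([\d.]+)\s*%\s+"   # string name, step, percent
    r"\[\s*(\d+)\s*/\s*(\d+)\s*\]\s+"           # sample count / total samples
    r"\[\s*([\d.]+)\s*us\]\s*(.*)$"             # time in microseconds, label
)


def parse_status_line(line):
    """Return a dict for one timing line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    string, step, percent, count, total, time_us, label = m.groups()
    return {
        "string": string,
        "step": int(step),
        "percent": float(percent),
        "count": int(count),
        "total": int(total),
        "time_us": float(time_us),
        "label": label.strip(),
    }


sample = """\
STATUS  8     37.24 %    [    37242  /  100000  ]   [   124.177 us] CONTROL MATRIX MULT: WAIT FOR THREADS TO COMPLETE
STATUS 20     43.70 %    [    43704  /  100000  ]   [   145.723 us] ... WAITING FOR IMAGE
STATUSM1  5     97.16 %    [    97163  /  100000  ]   [   323.972 us] ... WAITING FOR INPUT
"""

records = [r for r in map(parse_status_line, sample.splitlines()) if r]
for r in records:
    # Assumed convention: wait steps are labelled with a leading "..."
    kind = "wait" if r["label"].startswith("...") else "busy"
    print(r["string"], r["step"], r["time_us"], kind)
```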

About 145 us of each 333 us loop period is spent waiting for the next image (STATUS 20). The maximum compute loop speed for this configuration would be approximately 5.3 kHz, at which point the total frame wait time goes to 0.

Predictive Control: 400 modes, 16th order prediction

In this test, the loop runs at 2 kHz. A new predictive filter is computed every 200,000 measurements (100 sec). The data matrix size is 5 GB (400 x 16 x 200,000 elements, single precision). A single GPU computes the required pseudo-inverse of the data matrix. The problem size is just below the memory limit of an RTX2080Ti GPU.
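
The sizes quoted above follow from simple arithmetic, sketched here. The last figure rests on an assumption not stated on this page: that the pseudo-inverse has as many elements as the data matrix and that both are held in GPU memory at the same time. If so, the two together come close to the 11 GB of an RTX2080Ti, which would be consistent with the problem sitting just below the memory limit.

```python
N_MODES = 400
PREDICTION_ORDER = 16
N_SAMPLES = 200_000
BYTES_PER_ELEMENT = 4  # single precision
LOOP_RATE_HZ = 2000

elements = N_MODES * PREDICTION_ORDER * N_SAMPLES
data_matrix_gb = elements * BYTES_PER_ELEMENT / 1e9
accumulation_s = N_SAMPLES / LOOP_RATE_HZ

# Assumption: input matrix and a same-size pseudo-inverse are both resident on the GPU.
input_plus_inverse_gb = 2 * data_matrix_gb

print("data matrix      : %.2f GB" % data_matrix_gb)
print("accumulation time: %.0f sec" % accumulation_s)
print("input + inverse  : %.2f GB" % input_plus_inverse_gb)
```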

Time to accumulate new data matrix               :  100 sec
Total predictive filter compute + update time    :   26 sec
Timing info: 
[setup]                                  0.054 ms
[copy input to GPU]                   1742.331 ms
[compute trans(A) x A]                1498.422 ms
[setup]                                  0.024 ms
[Compute eigenvalues]                 1896.963 ms
[Select eigenvalues]                   702.780 ms
[Compute M2]                            32.967 ms
[Compute Ainv]                        1172.749 ms
[Get Ainv from GPU]                   2189.683 ms
[output setup]                           0.005 ms
[Write output array]                  2119.278 ms
[Test output]                            0.001 ms
[Optional gemm]                        701.399 ms
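
As a quick check on the breakdown above, the sketch below sums the listed steps and ranks them by duration. The step names and times are copied from the listing. The listed steps add up to about 12.1 sec, less than half of the 26 sec total quoted above; this page does not break down the remaining time.

```python
# (step label, time in ms), copied from the timing listing
steps = [
    ("setup", 0.054),
    ("copy input to GPU", 1742.331),
    ("compute trans(A) x A", 1498.422),
    ("setup", 0.024),
    ("Compute eigenvalues", 1896.963),
    ("Select eigenvalues", 702.780),
    ("Compute M2", 32.967),
    ("Compute Ainv", 1172.749),
    ("Get Ainv from GPU", 2189.683),
    ("output setup", 0.005),
    ("Write output array", 2119.278),
    ("Test output", 0.001),
    ("Optional gemm", 701.399),
]

total_ms = sum(ms for _, ms in steps)
print("sum of listed steps: %.3f sec" % (total_ms / 1e3))

# Rank steps by duration to see where the listed time goes
for label, ms in sorted(steps, key=lambda s: s[1], reverse=True)[:4]:
    print("%-22s %9.3f ms  (%4.1f %%)" % (label, ms, 100 * ms / total_ms))
```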