"Large" and "sleep" versions of "CL N-pipe" #17

void234 commented Feb 23, 2021

Polling the GPU for results is inefficient: it wastes CPU time and
(in the case of dGPUs) PCIe bandwidth, especially if the CPU is
powerful while the (i)GPU is not. The original "CL N-pipe" cores and
the OpenCL kernels are not touched, but the scheduling code is modified
to permit 100 times larger work units ("CL 1-pipe large" etc.) and
also to flush the assignment to the GPU and put the CPU to sleep
("CL 1-pipe sleep" etc.).

"Large" cores are marginally faster than the original ones.
"Sleep" cores are slightly slower than "large" ones because the GPU may
sometimes finish processing a work unit while the CPU is still asleep.
These cores, however, consume zero CPU. All other cores consume 1
logical CPU unless the sleep is transparently performed by the GPU
driver. Intel does this for gen8 but not for newer GPUs, and it helps
only if the work unit is large enough for the CPU to sleep for several
milliseconds. The result is higher power efficiency and, if we are not
limited by TDP, a significant performance improvement. The effect is
more pronounced when the CPU does not support MT.
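
The trade-off between polling and sleeping can be illustrated with a small simulation. This is only a sketch on a virtual clock, not the scheduling code from this PR: the function names, the 1 us cost of a status check and the 100 us sleep slice (borrowed from the name of the CUDA core mentioned below) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WaitStats:
    noticed_at_us: int   # virtual time at which the host saw the result
    status_checks: int   # number of completion queries issued to the device
    cpu_busy_us: int     # virtual time the host spent awake and checking


def busy_poll(kernel_us: int, check_cost_us: int = 1) -> WaitStats:
    """Host spins on the completion status until the kernel is done."""
    now = checks = 0
    while now < kernel_us:
        now += check_cost_us      # CPU is busy for the whole wait
        checks += 1
    return WaitStats(now, checks, now)


def flush_and_sleep(kernel_us: int, sleep_us: int = 100,
                    check_cost_us: int = 1) -> WaitStats:
    """Host submits the work, then sleeps in slices and checks on wake-up."""
    now = checks = busy = 0
    while now < kernel_us:
        now += sleep_us           # CPU is idle (free for the CPU cruncher)
        now += check_cost_us      # one cheap status check per wake-up
        busy += check_cost_us
        checks += 1
    return WaitStats(now, checks, busy)


# Hypothetical 5 ms kernel: compare CPU cost and completion latency.
poll = busy_poll(5000)
nap = flush_and_sleep(5000)
print(poll.cpu_busy_us, nap.cpu_busy_us)        # 5000 50
print(nap.noticed_at_us - poll.noticed_at_us)   # 50 (overshoot in us)
```

In this model the sleeping host notices completion 50 us late, which corresponds to the "GPU finishes while the CPU still sleeps" penalty, but it spends about 1% of the CPU time of the polling host. The overshoot is bounded by one sleep slice, so it matters less as work units get larger.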

Note that with "sleep" cores there is no need to manually limit
the number of threads for the CPU cruncher.

Performance/efficiency can be further improved by growing the work
unit size faster. Wider testing and benchmarking (especially on
high-end GPUs) are welcome.
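
As a rough illustration of why the growth rate matters, the sketch below counts how many growth steps it takes to reach a 100 times larger work unit under two growth rules. Both rules, the starting size and the helper name are hypothetical; the actual growth policy in the scheduling code is not described here.

```python
def launches_to_reach(target: int, start: int, grow) -> int:
    """Count growth steps needed before the work unit reaches `target`."""
    size, steps = start, 0
    while size < target:
        size = min(grow(size), target)   # never exceed the cap
        steps += 1
    return steps


BASE = 1            # hypothetical starting size, in arbitrary units
CAP = 100 * BASE    # "100 times larger work units" from the description

additive = launches_to_reach(CAP, BASE, lambda s: s + BASE)
doubling = launches_to_reach(CAP, BASE, lambda s: s * 2)
print(additive, doubling)   # 99 7
```

Under these assumptions a multiplicative rule reaches the cap in 7 steps instead of 99, so fewer small, overhead-dominated work units are issued before the steady state.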

The benchmarks below were run with the CPU loaded by
2.9116.525-amd64, core #4 (YK AVX2).
The CUDA client is 2.9110.519b, core #10 (CUDA 1-pipe 64-thd sleep 100us).
"521" refers to 2.9112.521 (dnetc-win32-x86-opencl.zip /
dnetc-linux-amd64-opencl.tar.gz).
Power consumption is "measured" with "Core Temp" / "s-tui".

Core i5-8265U (15W, 4C8T, 14 nm, 1.6-3.9 GHz,
Intel UHD Graphics 620 [gen9] 1100 MHz), Ubuntu 20.04
CL 2-pipe/large/sleep

Mode               CPU     iGPU   Summary Power Efficiency
521, 7 threads     124      150       274    15   18.27
521, 8 threads     127      150       277    15   18.47
521, iGPU only       0      184       184    15   12.27
CPU only           181        0       181    15   12.07
Sleep, 8 threads   135      148       283    15   18.87
iGPU only, sleep     0      186       186    15   12.40
iGPU only, large     0      186       187    15   12.47

[1.022 efficiency improvement, "sleep" is optimal]

Core i7-9700K (95W, 8C8T, 14 nm, 3.6-4.9 GHz,
Intel UHD Graphics 630 [gen9] 1200 MHz), Windows 10 20H2
CL 2-pipe/large/sleep

Mode               CPU     iGPU   Summary Power Efficiency
521, 8 threads     480       92       572    95    6.02
521, 7 threads     406      187       593    95    6.24
Sleep, 8 threads   457      178       635    95    6.68
Sleep, 7 threads   403      188       591    95    6.22
CPU only           473        0       473    95    4.98
iGPU only, sleep     0      188       188    22    8.55
iGPU only, large     0      190       190    44    4.32

<Note terrible power efficiency of polling - "large" vs "sleep">
[1.071 efficiency improvement, "sleep" is optimal]

Core i5-5200U (15W, 2C4T, 14 nm, 2.2-2.7 GHz,
Intel HD Graphics 5500 [gen8] 900 MHz)
NVidia GeForce 820M 2048 MB, ForceWare 382.05
Windows 10 20H2
CL 4-pipe/large/sleep

Mode                   CPU   iGPU   dGPU Summary Power* Efficiency*
521, 4 threads          66     59      0     125     15        8.33
521, 3 threads          63    167      0     230   21.4       10.75
521, 3 threads, CUDA    29    161     89     279    15*           *
CPU only                67      0      0      67   10.2        6.57
Sleep, 4 threads        71    168      0     239   21.4       11.17
Large, 4 threads        68    172      0     240   21.4       11.21
iGPU only, sleep         0    173      0     173   13.5       12.81
iGPU only, large         0    175      0     175   13.5       12.96
dGPU only, sleep         0      0    123     123   1.3*           *
dGPU only, large         0      0    134     134   8.3*           *
Sleep, 4 threads, dG    42    153    119     314    15*           *
Custom**, 4 threads     41    155    120     316    15*           *

*dGPU is not included in power measurements
**Custom - "large" for iGPU (the gen8 driver idles the CPU itself),
"sleep" for dGPU
[1.043 efficiency improvement, "large" is optimal for iGPU]
[CPU+iGPU+dGPU: 1.133 performance improvement, "sleep" is optimal for dGPU]
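
The bracketed improvement figures can be reproduced from the table rows. The sketch below assumes that rates are in Mkeys/sec and power is in watts (the tables do not state units, but the rates match the "-bench" output), that Efficiency is Summary divided by Power, and that each figure compares the best new row against the best "521" row on the same machine. The choice of rows is an inference from the numbers, not something the tables state.

```python
def efficiency(summary: float, power_w: float) -> float:
    """Throughput per watt, as in the Efficiency column."""
    return summary / power_w


def improvement(new: float, old: float) -> float:
    """Ratio of the new figure to the baseline figure."""
    return new / old


cases = {
    # name: (new Summary, baseline "521" Summary, Power in W)
    "i5-8265U, sleep": (283, 277, 15),
    "i7-9700K, sleep": (635, 593, 95),
    "i5-5200U, large": (240, 230, 21.4),
}
for name, (new, old, watts) in cases.items():
    gain = improvement(efficiency(new, watts), efficiency(old, watts))
    print(f"{name}: {efficiency(new, watts):.2f} vs "
          f"{efficiency(old, watts):.2f} -> {gain:.3f}")

# CPU+iGPU+dGPU on the i5-5200U: raw throughput, since dGPU power is unknown.
perf_gain = improvement(316, 279)
print(f"i5-5200U, custom vs 521+CUDA: {perf_gain:.3f}")
```

Because both rows of each pair were measured at the same power, the efficiency ratio reduces to the ratio of the Summary columns.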

"-bench" Intel UHD Graphics 620 [gen9] 1100 MHz (Core i5-8265U)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe)
0.00:00:16.14 [113,990,283 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe)
0.00:00:16.32 [187,106,455 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe)
0.00:00:16.92 [184,015,486 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe)
0.00:00:16.80 [166,416,580 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large)
0.00:00:16.80 [184,818,394 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large)
0.00:00:16.81 [188,636,921 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large)
0.00:00:16.61 [170,029,327 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep)
0.00:00:16.05 [189,540,521 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep)
0.00:00:17.02 [192,711,899 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep)
0.00:00:16.93 [174,570,008 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #8 (CL 2-pipe sleep) 192,711,899 keys/sec

"-bench" Intel UHD Graphics 630 [gen9] 1200 MHz (Core i7-9700K)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe)
0.00:00:16.96 [124,370,534 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe)
0.00:00:16.84 [186,580,220 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe)
0.00:00:16.76 [189,445,953 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe)
0.00:00:16.53 [172,042,275 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large)
0.00:00:16.10 [191,761,686 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large)
0.00:00:16.84 [192,842,719 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large)
0.00:00:16.59 [176,169,744 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep)
0.00:00:16.59 [183,669,420 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep)
0.00:00:16.57 [186,548,997 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep)
0.00:00:16.35 [169,087,725 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #5 (CL 2-pipe large) 192,842,719 keys/sec

"-bench" Intel HD Graphics 5500 [gen8] 900 MHz (Core i5-5200U)

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe)
0.00:00:16.15 [9,209,485 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe)
0.00:00:16.06 [168,667,029 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe)
0.00:00:16.81 [168,043,318 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe)
0.00:00:17.03 [171,313,110 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large)
0.00:00:16.86 [173,663,198 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large)
0.00:00:17.06 [177,573,667 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large)
0.00:00:16.70 [176,852,285 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep)
0.00:00:16.51 [166,997,768 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep)
0.00:00:16.59 [168,755,292 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep)
0.00:00:16.64 [170,413,224 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #5 (CL 2-pipe large) 177,573,667 keys/sec

"-bench" NVidia GeForce 820M 2048 MB, ForceWare 382.05

RC5-72: using core #0 (CL ANSI 1-pipe).
RC5-72: Benchmark for core #0 (CL ANSI 1-pipe)
0.00:00:16.20 [102,620,050 keys/sec]
RC5-72: using core #1 (CL 1-pipe).
RC5-72: Benchmark for core #1 (CL 1-pipe)
0.00:00:16.98 [129,678,653 keys/sec]
RC5-72: using core #2 (CL 2-pipe).
RC5-72: Benchmark for core #2 (CL 2-pipe)
0.00:00:16.95 [123,092,851 keys/sec]
RC5-72: using core #3 (CL 4-pipe).
RC5-72: Benchmark for core #3 (CL 4-pipe)
0.00:00:16.98 [78,567,847 keys/sec]
RC5-72: using core #4 (CL 1-pipe large).
RC5-72: Benchmark for core #4 (CL 1-pipe large)
0.00:00:17.03 [135,449,921 keys/sec]
RC5-72: using core #5 (CL 2-pipe large).
RC5-72: Benchmark for core #5 (CL 2-pipe large)
0.00:00:16.89 [128,422,603 keys/sec]
RC5-72: using core #6 (CL 4-pipe large).
RC5-72: Benchmark for core #6 (CL 4-pipe large)
0.00:00:16.43 [78,558,193 keys/sec]
RC5-72: using core #7 (CL 1-pipe sleep).
RC5-72: Benchmark for core #7 (CL 1-pipe sleep)
0.00:00:16.65 [127,347,752 keys/sec]
RC5-72: using core #8 (CL 2-pipe sleep).
RC5-72: Benchmark for core #8 (CL 2-pipe sleep)
0.00:00:16.10 [117,091,782 keys/sec]
RC5-72: using core #9 (CL 4-pipe sleep).
RC5-72: Benchmark for core #9 (CL 4-pipe sleep)
0.00:00:16.14 [71,550,849 keys/sec]
RC5-72 benchmark summary :
Default core : #-1 (undefined) 0 keys/sec
Fastest core : #4 (CL 1-pipe large) 135,449,921 keys/sec

bovine (Member) commented Feb 23, 2021

Is clWaitForEvents() the main contributor to the wasted CPU time that the sleeping solves?

void234 (Author) commented Feb 24, 2021

> Is clWaitForEvents() the main contributor to the wasted CPU time that the sleeping solves?

Yes.
