
Feedback on v1.3.3.1 : GPU not being utilized #107

Open
cavalia88 opened this issue Nov 18, 2023 · 65 comments

@cavalia88

Hi,

I was previously using the par2j64 command line version 1.3.2.8. Tested out the latest 1.3.3.1 version and found that speeds are about double, mainly because more threads are being utilized. I'm using a Ryzen 9 7900X 12 core processor with a AMD 7900XTX Radeon graphics card. I have two questions:

(i) With the new 1.3.3.1 version, 24 threads are being used and I get 100% CPU utilization. Is there any way for me to limit the maximum number of threads?

(ii) I cannot seem to get the GPU to do the processing. In both versions, all the work is being done by the CPU (I see little GPU utilization in the task manager). How do we know if the GPU is being used by Multipar from the command line? Is there any command to force the use of GPU? My current command line is below.

par2j64 c -rr10 -ss640000 -lc32 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

 ======================================
Parchive 2.0 client version 1.3.2.8 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : Auto (85178 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 12
Slice distribution : 2, power of two (until 158)
Packet Repetition limit : 0

100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
100.0% : Creating recovery slice

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48

======================================

Parchive 2.0 client version 1.3.3.1 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 24 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (88769 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
100.0% : Creating recovery slice

======================================

@Yutaka-Sawada
Owner

Is there any way for me to limit the maximum number of threads?

To support using more threads, the lc option was changed in v1.3.3.0. lc32 meant "enable GPU acceleration" in v1.3.2.8 and earlier; now it means "use 32 threads" in v1.3.3.0 and later. (Because you set 32 threads, it uses 24 threads, the maximum for your CPU.) Please read "Command_par2j.txt" for the change. So, you must update your command line for the new version.
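
For example, to cap the run at 12 threads under the new meaning (12 is just an illustrative value; any count up to your CPU's 24 threads should behave the same way), the command from your first post would become:

par2j64 c -rr10 -ss640000 -lc12 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"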

How do we know if the GPU is being used by Multipar from the command line?

When the GPU is used, it prints a line near the end like:
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48

Is there any command to force the use of GPU?

In v1.3.3.0 and later, set lc256 to enable GPU acceleration. I'm sorry for the incompatibility. I will add a notice to the version history.

@cavalia88
Author

Thank you for the clarification. I managed to activate GPU mode with lc256.

With GPU mode and 12 threads, it took about 20 seconds. Without GPU and 24 threads, it took about 15 seconds. It seems that pure CPU is still faster than the GPU (even though I have an AMD Radeon 7900XTX).

In GPU mode, I find that my GPU is barely used, with the GPU load not even rising above 1%. Is there something more that can be done to optimize the GPU processing? It seems most of the work is still being done on the CPU side.

Do we have an option to measure the time used to complete the par file generation? This will make it easier for benchmarking the different settings.

par2j64 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.1 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (88904 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
100.0% : Creating recovery slice

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48

par2j64 c -rr10 -ss640000 -lc32 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.1 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 24 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (88970 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
100.0% : Creating recovery slice

Created successfully

@Yutaka-Sawada
Owner

Yutaka-Sawada commented Nov 19, 2023

Is there something more that can be done to optimize the GPU processing?

Because I used an NVIDIA GeForce graphics board to test OpenCL speed, my implementation is optimized for the GeForce series. It's known to be slow on AMD GPUs. Though there may be a technique to improve speed on AMD, I cannot test it myself. Some AMD users helped development and tested some methods a while ago, but I could not get a good result. If you want to help, I will try again. But I don't know what is good for AMD GPUs at this time.

Do we have an option to measure the time used to complete the par file generation?

No, such an option doesn't exist. But I made debug versions to show behavior and speed. I put the package (par2j_debug_2023-11-19.zip) in the "MultiPar_sample" folder on OneDrive. If you want to see internal calculation speed, please use them.

It seems that pure CPU is still faster than the GPU

The calculation power of a GPU is mostly less than that of a recent multi-core CPU. If the GPU's power is less than 13 CPU threads' worth, using 24 CPU threads may be faster than 12 threads (11 CPU threads plus the GPU thread). You should compare "with GPU mode and 24 threads" to "without GPU and 24 threads". The lc option is a bitwise OR: set lc280 (24 + 256 = 280) to use 24 threads and the GPU.
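
For example, the earlier command only needs its -lc value changed to combine both bits (280 = 24 OR 256, i.e. 24 threads plus the GPU flag):

par2j64 c -rr10 -ss640000 -lc280 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"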

@cavalia88
Author

lc24: "24 threads" = 14.77 seconds
lc280: "GPU + 24 threads" = 15 seconds
lc256: "GPU + 12 threads" = 20 seconds

Not much difference between "24 threads" and "GPU + 24 threads". Most of the work is done by the multi-core CPU; very little seems to be offloaded to the GPU.

I'm not familiar with programming or OpenCL, but if you need someone to help test out on AMD Radeon 7900XTX, happy to help out.

@Yutaka-Sawada
Owner

if you need someone to help test out on AMD Radeon 7900XTX, happy to help out.

Thank you for the help. I want to test some methods in the OpenCL implementation. There are two ways to improve speed:

  1. Data transfer between the PC's RAM and the GPU's VRAM
  2. Calculation on the GPU

First, I will try (1), data transfer.

Currently, I use an OpenCL buffer with the CL_MEM_USE_HOST_PTR flag. For an AMD or Intel integrated GPU this is best, because it can access the PC's RAM directly. This seems to work well on NVIDIA's discrete GPUs (GeForce graphics boards), too: the driver copies data from the PC's RAM to the GPU's VRAM automatically, so a discrete GPU uses fast VRAM during calculation even when the original data lives in the PC's RAM.

I don't know how AMD's OpenCL driver behaves, so I made two samples. One uses the CL_MEM_COPY_HOST_PTR flag, which always forces a copy from the PC's RAM to the GPU's VRAM. This may be the same behavior as CL_MEM_USE_HOST_PTR on NVIDIA GPUs.

The other uses both CL_MEM_ALLOC_HOST_PTR and CL_MEM_COPY_HOST_PTR, which may place the data in a pinned memory area. Though pinned memory is fast on NVIDIA GPUs, its size is limited, and it seems that pinned memory isn't fast on AMD GPUs.
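
For reference, a minimal host-side sketch of the three buffer strategies being compared (illustrative only: the flags are the standard OpenCL ones, but the helper function and the CL_MEM_READ_ONLY choice are assumptions, not par2j's actual code):

#include <CL/cl.h>

/* mode 0: current v1.3.3.1 behavior, mode 1: the "VRAM" sample, mode 2: the "PIN" sample */
static cl_mem create_source_buffer(cl_context ctx, size_t size, void *host_data, int mode, cl_int *err)
{
	cl_mem_flags flags = CL_MEM_READ_ONLY;
	if (mode == 0)
		flags |= CL_MEM_USE_HOST_PTR;                           /* let the driver map host RAM */
	else if (mode == 1)
		flags |= CL_MEM_COPY_HOST_PTR;                          /* force an explicit copy into device memory */
	else
		flags |= CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR;  /* copy via pinned (page-locked) host memory */
	return clCreateBuffer(ctx, flags, size, host_data, err);
}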

I put the package (par2j_debug_2023-11-20.zip) in the "MultiPar_sample" folder on OneDrive. Please test these three modes on the same data set. When some users tested on their graphics boards a while ago, there was no remarkable difference. But I don't know what happens on the latest AMD GPUs.

To compare speed, set lc256 only. If you set more threads than the number of CPU cores, it may already reach your PC's maximum speed. Basically, memory speed is the bottleneck in recent Reed-Solomon calculation, which is why using the CPU's L3 cache can improve speed. While using more threads increases calculation speed, it's limited by memory speed (or data transfer speed). Though the GPU's VRAM is faster than the PC's RAM, it requires a data copy anyway. If it's already at the maximum, I won't be able to improve it any more.

@cavalia88
Author

cavalia88 commented Nov 20, 2023

Test results attached below. Looks like the VRAM version is the fastest (around 11 seconds), the PIN version and 1331 are about the same (around 15 seconds), and 1329 is the slowest (around 57 seconds).

par2j64_VRAM c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (89226 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.594 sec, 3344 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
0.0% : Creating recovery slice
matrix size = 34 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 628 KB (643072 Bytes), OK
factor buf : 17550 Bytes (8775 factors), OK
CreateKernel : method2

Max number of work items = 12288 (256 * 48)
OpenCL_method = 2, vram_max = 8775
partial encode = 103 / 8775 (1.1%), read = 8775, skip = 0
remain = 8672, src_off = 103, src_max = 32
GPU: remain = 8640, src_off = 135, src_num = 64
GPU: remain = 8352, src_off = 423, src_num = 1670
GPU: remain = 3578, src_off = 5197, src_num = 1217
GPU: remain = 89, src_off = 8686, src_num = 32
CPU last: src_off = 8718, src_num = 57
100.0%
read 2.406 sec
write 0.797 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 499587
1st encode 2.375 sec, 30430 loop, 7857 MB/s
2nd encode 7.516 sec, 469157 loop, 38281 MB/s
sub-thread : total loop = 498562
1st encode 2.375 sec, 29899 loop, 7720 MB/s
2nd encode 7.516 sec, 468663 loop, 38241 MB/s
sub-thread : total loop = 483568
1st encode 2.375 sec, 30105 loop, 7773 MB/s
2nd encode 7.516 sec, 453463 loop, 37001 MB/s
sub-thread : total loop = 449254
2nd encode 7.516 sec, 449254 loop, 36657 MB/s
sub-thread : total loop = 445426
2nd encode 7.516 sec, 445426 loop, 36345 MB/s
sub-thread : total loop = 455024
2nd encode 7.516 sec, 455024 loop, 37128 MB/s
sub-thread : total loop = 459159
2nd encode 7.516 sec, 459159 loop, 37465 MB/s
sub-thread : total loop = 451689
2nd encode 7.516 sec, 451689 loop, 36856 MB/s
sub-thread : total loop = 444444
2nd encode 7.516 sec, 444444 loop, 36265 MB/s
sub-thread : total loop = 454247
2nd encode 7.516 sec, 454247 loop, 37065 MB/s
sub-thread : total loop = 444411
2nd encode 7.516 sec, 444411 loop, 36262 MB/s
gpu-thread :
2nd encode 7.656 sec, 2619074 loop, 209800 MB/s
total 11.391 sec

Created successfully

par2j64_PIN c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (89210 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.578 sec, 3377 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
0.0% : Creating recovery slice
matrix size = 34 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 628 KB (643072 Bytes), OK
factor buf : 17550 Bytes (8775 factors), OK
CreateKernel : method2

Max number of work items = 12288 (256 * 48)
OpenCL_method = 2, vram_max = 8775
partial encode = 108 / 8775 (1.2%), read = 8775, skip = 0
remain = 8667, src_off = 108, src_max = 32
GPU: remain = 8635, src_off = 140, src_num = 64
GPU: remain = 7515, src_off = 1260, src_num = 417
GPU: remain = 570, src_off = 8205, src_num = 33
CPU last: src_off = 8718, src_num = 57
100.0%
read 2.516 sec
write 0.813 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 704176
1st encode 2.500 sec, 31800 loop, 7800 MB/s
2nd encode 11.312 sec, 672376 loop, 36452 MB/s
sub-thread : total loop = 700455
1st encode 2.500 sec, 31352 loop, 7691 MB/s
2nd encode 11.312 sec, 669103 loop, 36275 MB/s
sub-thread : total loop = 697636
1st encode 2.500 sec, 31672 loop, 7769 MB/s
2nd encode 11.312 sec, 665964 loop, 36105 MB/s
sub-thread : total loop = 647689
2nd encode 11.328 sec, 647689 loop, 35064 MB/s
sub-thread : total loop = 647369
2nd encode 11.312 sec, 647369 loop, 35097 MB/s
sub-thread : total loop = 647741
2nd encode 11.312 sec, 647741 loop, 35117 MB/s
sub-thread : total loop = 638786
2nd encode 11.312 sec, 638786 loop, 34631 MB/s
sub-thread : total loop = 642898
2nd encode 11.312 sec, 642898 loop, 34854 MB/s
sub-thread : total loop = 636169
2nd encode 11.312 sec, 636169 loop, 34489 MB/s
sub-thread : total loop = 647081
2nd encode 11.312 sec, 647081 loop, 35081 MB/s
sub-thread : total loop = 643155
2nd encode 11.312 sec, 643155 loop, 34868 MB/s
gpu-thread :
2nd encode 11.390 sec, 451292 loop, 24299 MB/s
total 15.234 sec

Created successfully

par2j64_1331 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.1 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (89193 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.610 sec, 3310 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
0.0% : Creating recovery slice
matrix size = 34 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 628 KB (643072 Bytes), OK
factor buf : 17550 Bytes (8775 factors), OK
CreateKernel : method2

Max number of work items = 12288 (256 * 48)
OpenCL_method = 2, vram_max = 8775
partial encode = 104 / 8775 (1.1%), read = 8775, skip = 0
remain = 8671, src_off = 104, src_max = 32
GPU: remain = 8639, src_off = 136, src_num = 64
GPU: remain = 7583, src_off = 1192, src_num = 446
GPU: remain = 481, src_off = 8294, src_num = 29
CPU last: src_off = 8739, src_num = 36
100.0%
read 2.422 sec
write 0.812 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 687211
1st encode 2.407 sec, 30517 loop, 7775 MB/s
2nd encode 11.328 sec, 656694 loop, 35552 MB/s
sub-thread : total loop = 696290
1st encode 2.407 sec, 30364 loop, 7736 MB/s
2nd encode 11.328 sec, 665926 loop, 36052 MB/s
sub-thread : total loop = 685754
1st encode 2.407 sec, 30431 loop, 7753 MB/s
2nd encode 11.328 sec, 655323 loop, 35478 MB/s
sub-thread : total loop = 643399
2nd encode 11.313 sec, 643399 loop, 34878 MB/s
sub-thread : total loop = 648455
2nd encode 11.328 sec, 648455 loop, 35106 MB/s
sub-thread : total loop = 647931
2nd encode 11.312 sec, 647931 loop, 35127 MB/s
sub-thread : total loop = 649308
2nd encode 11.328 sec, 649308 loop, 35152 MB/s
sub-thread : total loop = 644136
2nd encode 11.327 sec, 644136 loop, 34875 MB/s
sub-thread : total loop = 631592
2nd encode 11.327 sec, 631592 loop, 34196 MB/s
sub-thread : total loop = 655282
2nd encode 11.312 sec, 655282 loop, 35526 MB/s
sub-thread : total loop = 641846
2nd encode 11.312 sec, 641846 loop, 34797 MB/s
gpu-thread :
2nd encode 11.453 sec, 473242 loop, 25340 MB/s
total 15.203 sec

Created successfully

par2j64_1329 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.2.9 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : Auto (88836 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
0.0% : Creating recovery slice
matrix size = 34.4 KB
get_io_size: part_min = 264, part_max = 878

read all source blocks, and keep some parity blocks
buffer size = 5892 MB, io_size = 640016, split = 1
cache: limit size = 131072, chunk_size = 128032, split = 5
prog_base = 7945022, unit_size = 640032, part_num = 878
partial encode = 150 / 8775 (1.7%), read = 8775, skip = 0
100.0%
read 3.219 sec
write 0.782 sec
sub-thread : total loop = 651205
1st encode 3.234 sec, 19855 loop, 3747 MB/s
2nd encode 53.500 sec, 631350 loop, 7203 MB/s
sub-thread : total loop = 653136
1st encode 3.234 sec, 20061 loop, 3786 MB/s
2nd encode 53.453 sec, 633075 loop, 7229 MB/s
sub-thread : total loop = 648458
1st encode 3.234 sec, 20558 loop, 3880 MB/s
2nd encode 53.469 sec, 627900 loop, 7167 MB/s
sub-thread : total loop = 648292
1st encode 3.234 sec, 20392 loop, 3848 MB/s
2nd encode 53.391 sec, 627900 loop, 7178 MB/s
sub-thread : total loop = 649919
1st encode 3.234 sec, 22019 loop, 4155 MB/s
2nd encode 53.407 sec, 627900 loop, 7176 MB/s
sub-thread : total loop = 665337
1st encode 3.234 sec, 28812 loop, 5437 MB/s
2nd encode 53.438 sec, 636525 loop, 7270 MB/s
sub-thread : total loop = 631350
2nd encode 53.453 sec, 631350 loop, 7209 MB/s
sub-thread : total loop = 629625
2nd encode 53.485 sec, 629625 loop, 7185 MB/s
sub-thread : total loop = 631350
2nd encode 53.485 sec, 631350 loop, 7205 MB/s
sub-thread : total loop = 629625
2nd encode 53.438 sec, 629625 loop, 7191 MB/s
sub-thread : total loop = 633075
2nd encode 53.407 sec, 633075 loop, 7235 MB/s
sub-thread : total loop = 633075
2nd encode 53.485 sec, 633075 loop, 7224 MB/s
total 57.594 sec

Created successfully

@Yutaka-Sawada
Owner

Looks like the VRAM version is the fastest

PC's RAM (CL_MEM_USE_HOST_PTR) : 25340 MB/s
GPU's VRAM (CL_MEM_COPY_HOST_PTR) : 209800 MB/s
Pinned memory (CL_MEM_ALLOC_HOST_PTR & CL_MEM_COPY_HOST_PTR) : 24299 MB/s

Thank you for the test. The difference is far bigger than I thought; the VRAM version is 8 times faster than the others. This might be why my OpenCL implementation was slow on AMD GPUs. I understand now that AMD's OpenCL driver doesn't cache data (copy it automatically) to VRAM under the CL_MEM_USE_HOST_PTR flag, so I must set the CL_MEM_COPY_HOST_PTR flag to copy data to VRAM explicitly for AMD's discrete GPUs. I also need to distinguish integrated GPUs (AMD APU or Intel CPU) from discrete GPUs (Radeon or GeForce). Because a GPU's VRAM is normally faster than the PC's RAM, setting the CL_MEM_COPY_HOST_PTR flag won't be bad for NVIDIA GPUs (GeForce) either. I will change the memory access setting in the next sample.

@Yutaka-Sawada
Owner

I found that the CL_MEM_COPY_HOST_PTR flag is good for AMD's discrete GPUs. I made a sample (par2j64_auto.exe), which switches the memory location (see the sketch after this list):
CL_MEM_COPY_HOST_PTR for discrete GPUs
CL_MEM_USE_HOST_PTR for integrated GPUs
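
A rough sketch of how such switching could work, based on the HOST_UNIFIED_MEMORY value that the later debug logs print (the actual logic in par2j64_auto.exe may differ):

#include <CL/cl.h>

static cl_mem_flags choose_source_flags(cl_device_id dev)
{
	cl_bool unified = CL_FALSE;
	clGetDeviceInfo(dev, CL_DEVICE_HOST_UNIFIED_MEMORY, sizeof(unified), &unified, NULL);
	if (unified)
		return CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR;   /* integrated GPU shares system RAM */
	return CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR;      /* discrete GPU: copy into VRAM */
}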

Next, I will try to improve the speed of the calculation on the GPU. 16-byte memory access seems to be good for AMD GPUs. (On the other hand, 4-byte access would be good for NVIDIA GPUs.) I now use OpenCL vector data types to support 16-byte memory access, and made two samples: one with a 4-byte data type (uchar4) and one with a 16-byte data type (uchar16).
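
As a rough illustration of the difference (these are not the real par2j kernels, which do GF(2^16) table lookups; the point is only the width of each memory transaction):

__kernel void xor_blocks_4(__global const uchar4 *src, __global uchar4 *dst, uint n)
{
	uint i = get_global_id(0);
	if (i < n)
		dst[i] ^= src[i];   /* each work-item moves 4 bytes per element */
}

__kernel void xor_blocks_16(__global const uchar16 *src, __global uchar16 *dst, uint n)
{
	uint i = get_global_id(0);
	if (i < n)
		dst[i] ^= src[i];   /* each work-item moves 16 bytes, so 4x fewer iterations for the same data */
}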

On my PC's Intel GPU;
Old : average 7958 MB/s
uchar4 : average 8130 MB/s
uchar16 : average 8991 MB/s

Using uchar4 is slightly (2%) faster than the old implementation. Using uchar16 is 12% faster than the old implementation. By using vector data types, I could simplify my source code, so it may be good for NVIDIA GPUs too (however, I'm not sure). Though uchar16 is faster than uchar4, it requires more local memory. (It may not work well on old graphics boards.) uchar16's speed may come from fewer loop iterations.

I put the package (par2j_debug_2023-11-22.zip) in the "MultiPar_sample" folder on OneDrive. Please test these three methods on the same data set. I don't know which is faster on recent discrete GPUs; calculation characteristics differ between NVIDIA and AMD GPUs. When there is no big difference, I prefer the simple implementation.

@cavalia88
Author

Fastest version is the 16 byte, followed by auto version, and then 4 byte version. Results below:

par2j64_4byte c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (87791 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.594 sec, 3344 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
0.0% : Creating recovery slice
matrix size = 34 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
HOST_UNIFIED_MEMORY = 0
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256
HOST_UNIFIED_MEMORY = 1

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 628 KB (643072 Bytes), OK
factor buf : 17550 Bytes (8775 factors), OK
CreateKernel : method2

Max number of work items = 12288 (256 * 48)
OpenCL_method = 2, vram_max = 8775
partial encode = 104 / 8775 (1.1%), read = 8775, skip = 0
remain = 8671, src_off = 104, src_max = 32
GPU: remain = 8639, src_off = 136, src_num = 64
GPU: remain = 8351, src_off = 424, src_num = 1670
GPU: remain = 3545, src_off = 5230, src_num = 1199
GPU last +: src_off = 8733, src_num = 10 + 32
100.0%
read 2.406 sec
write 0.797 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 498500
1st encode 2.390 sec, 30443 loop, 7811 MB/s
2nd encode 7.453 sec, 468057 loop, 38514 MB/s
sub-thread : total loop = 498994
1st encode 2.390 sec, 30438 loop, 7810 MB/s
2nd encode 7.469 sec, 468556 loop, 38473 MB/s
sub-thread : total loop = 502495
1st encode 2.390 sec, 30431 loop, 7808 MB/s
2nd encode 7.453 sec, 472064 loop, 38844 MB/s
sub-thread : total loop = 454009
2nd encode 7.469 sec, 454009 loop, 37278 MB/s
sub-thread : total loop = 462956
2nd encode 7.453 sec, 462956 loop, 38095 MB/s
sub-thread : total loop = 445529
2nd encode 7.453 sec, 445529 loop, 36661 MB/s
sub-thread : total loop = 438521
2nd encode 7.453 sec, 438521 loop, 36084 MB/s
sub-thread : total loop = 446860
2nd encode 7.453 sec, 446860 loop, 36770 MB/s
sub-thread : total loop = 445996
2nd encode 7.469 sec, 445996 loop, 36620 MB/s
sub-thread : total loop = 460185
2nd encode 7.469 sec, 460185 loop, 37785 MB/s
sub-thread : total loop = 438348
2nd encode 7.453 sec, 438348 loop, 36070 MB/s
gpu-thread :
2nd encode 7.672 sec, 2612050 loop, 208801 MB/s
total 11.391 sec

par2j64_16byte c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (87842 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.578 sec, 3377 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
0.0% : Creating recovery slice
matrix size = 34 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
HOST_UNIFIED_MEMORY = 0
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256
HOST_UNIFIED_MEMORY = 1

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 628 KB (643072 Bytes), OK
factor buf : 17550 Bytes (8775 factors), OK
CreateKernel : method2

Max number of work items = 12288 (256 * 48)
OpenCL_method = 2, vram_max = 8775
partial encode = 104 / 8775 (1.1%), read = 8775, skip = 0
remain = 8671, src_off = 104, src_max = 32
GPU: remain = 8639, src_off = 136, src_num = 64
GPU: remain = 8383, src_off = 392, src_num = 1862
GPU: remain = 4473, src_off = 4302, src_num = 2052
GPU: remain = 149, src_off = 8626, src_num = 69
CPU last: src_off = 8727, src_num = 48
100.0%
read 2.422 sec
write 0.813 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 412902
1st encode 2.406 sec, 30445 loop, 7760 MB/s
2nd encode 6.094 sec, 382457 loop, 38489 MB/s
sub-thread : total loop = 409418
1st encode 2.422 sec, 30407 loop, 7699 MB/s
2nd encode 6.094 sec, 379011 loop, 38142 MB/s
sub-thread : total loop = 399471
1st encode 2.406 sec, 30460 loop, 7764 MB/s
2nd encode 6.094 sec, 369011 loop, 37136 MB/s
sub-thread : total loop = 371212
2nd encode 6.094 sec, 371212 loop, 37357 MB/s
sub-thread : total loop = 368521
2nd encode 6.094 sec, 368521 loop, 37086 MB/s
sub-thread : total loop = 368499
2nd encode 6.094 sec, 368499 loop, 37084 MB/s
sub-thread : total loop = 367392
2nd encode 6.094 sec, 367392 loop, 36973 MB/s
sub-thread : total loop = 355936
2nd encode 6.094 sec, 355936 loop, 35820 MB/s
sub-thread : total loop = 369369
2nd encode 6.094 sec, 369369 loop, 37172 MB/s
sub-thread : total loop = 363376
2nd encode 6.094 sec, 363376 loop, 36569 MB/s
sub-thread : total loop = 365084
2nd encode 6.094 sec, 365084 loop, 36740 MB/s
gpu-thread :
2nd encode 6.265 sec, 3553266 loop, 347829 MB/s
total 10.016 sec

Created successfully

par2j64_auto c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (87839 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.578 sec, 3377 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
0.0% : Creating recovery slice
matrix size = 34 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
HOST_UNIFIED_MEMORY = 0
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256
HOST_UNIFIED_MEMORY = 1

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 628 KB (643072 Bytes), OK
factor buf : 17550 Bytes (8775 factors), OK
CreateKernel : method2

Max number of work items = 12288 (256 * 48)
OpenCL_method = 2, vram_max = 8775
partial encode = 104 / 8775 (1.1%), read = 8775, skip = 0
remain = 8671, src_off = 104, src_max = 32
GPU: remain = 8639, src_off = 136, src_num = 64
GPU: remain = 8351, src_off = 424, src_num = 1670
GPU: remain = 3513, src_off = 5262, src_num = 1180
GPU: remain = 125, src_off = 8650, src_num = 42
CPU last: src_off = 8724, src_num = 51
100.0%
read 2.406 sec
write 0.813 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 500794
1st encode 2.391 sec, 30555 loop, 7837 MB/s
2nd encode 7.437 sec, 470239 loop, 38777 MB/s
sub-thread : total loop = 502139
1st encode 2.391 sec, 30281 loop, 7766 MB/s
2nd encode 7.437 sec, 471858 loop, 38911 MB/s
sub-thread : total loop = 494881
1st encode 2.391 sec, 30476 loop, 7816 MB/s
2nd encode 7.437 sec, 464405 loop, 38296 MB/s
sub-thread : total loop = 454265
2nd encode 7.437 sec, 454265 loop, 37460 MB/s
sub-thread : total loop = 459013
2nd encode 7.437 sec, 459013 loop, 37851 MB/s
sub-thread : total loop = 454360
2nd encode 7.437 sec, 454360 loop, 37468 MB/s
sub-thread : total loop = 446594
2nd encode 7.437 sec, 446594 loop, 36827 MB/s
sub-thread : total loop = 446492
2nd encode 7.437 sec, 446492 loop, 36819 MB/s
sub-thread : total loop = 452466
2nd encode 7.437 sec, 452466 loop, 37311 MB/s
sub-thread : total loop = 449651
2nd encode 7.437 sec, 449651 loop, 37079 MB/s
sub-thread : total loop = 448424
2nd encode 7.437 sec, 448424 loop, 36978 MB/s
gpu-thread :
2nd encode 7.546 sec, 2595368 loop, 210931 MB/s
total 11.282 sec

@Yutaka-Sawada
Owner

Fastest version is the 16 byte, followed by auto version, and then 4 byte version.

Oh, I see. Thank you for confirming the behavior. Though I had read an OpenCL optimization guide, I didn't expect so much difference. 16-byte memory access is around 65% faster on AMD GPUs, but it's not so fast on Intel or NVIDIA GPUs. It took a long time to put in an old graphics board, install its driver, and test the OpenCL behavior. When I tested the GeForce GPU, 16-byte access was very slow. After the test, I needed to restore the Intel GPU, re-install the old driver, and install the new driver again.

AMD Radeon 7900XTX (64 KB local memory);
16-byte access: 347829 MB/s (165% of the 4-byte speed)
4-byte access : 208801 ~ 210931 MB/s

Intel UHD Graphics 630 (64 KB local memory);
16-byte access: 8987 ~ 8995 MB/s (112% of the 4-byte speed)
4-byte access : 7936 ~ 7973 MB/s

NVIDIA GeForce GT 240 (16 KB local memory);
16-byte access: 3045 MB/s (53% of the 4-byte speed)
4-byte access : 5742 ~ 5769 MB/s

I think that the GeForce GPU's slowness may come from its small local memory size. (This local memory is different from VRAM.) When I tested on the Intel GPU, using more local memory became slower. Traditionally, NVIDIA GeForce GPUs have less local memory than AMD GPUs: while AMD GPUs have 32 ~ 64 KB, NVIDIA GPUs have 16 ~ 47 KB. When there is enough local memory on a GPU, 16-byte memory access will be good. I need to check the local memory size and will switch functions (4-byte or 16-byte) accordingly. But I'm not sure how much local memory is required; maybe 32 KB would be enough, as AMD GPUs have that amount. If it becomes slow on an NVIDIA GPU, users will report the problem later.

When I read your test log, I found that the GPU's first task was not at full speed. The trial task size might be too small for the Radeon 7900XTX's calculation power. The maximum was set to 2 times the CPU's task size. (That task size was enough for a GeForce RTX 2060 or GeForce RTX 3070 previously.) I made a new sample with a 3-times-larger limit. I put the package (par2j_debug_2023-11-24.zip) in the "MultiPar_sample" folder on OneDrive. Please test it on the same data set. If there is no difference, I will return to the 2-times setting.

@cavalia88
Author

The 16byte3 version is slightly slower than the 16 byte version

par2j64_16byte3 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (89626 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.594 sec, 3344 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
0.0% : Creating recovery slice
matrix size = 34 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 0
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 628 KB (643072 Bytes), OK
factor buf : 17550 Bytes (8775 factors), OK
CreateKernel : method8

Max number of work items = 12288 (256 * 48)
OpenCL_method = 8, vram_max = 8775
partial encode = 106 / 8775 (1.2%), read = 8775, skip = 0
remain = 8669, src_off = 106, src_max = 32
GPU: remain = 8637, src_off = 138, src_num = 96
GPU: remain = 8317, src_off = 458, src_num = 2268
GPU: remain = 3681, src_off = 5094, src_num = 1744
GPU: remain = 625, src_off = 8150, src_num = 319
CPU last: src_off = 8725, src_num = 50
100.0%
read 2.516 sec
write 0.813 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 397679
1st encode 2.516 sec, 30920 loop, 7536 MB/s
2nd encode 6.844 sec, 366759 loop, 32864 MB/s
sub-thread : total loop = 373255
1st encode 2.501 sec, 31061 loop, 7616 MB/s
2nd encode 6.828 sec, 342194 loop, 30735 MB/s
sub-thread : total loop = 374370
1st encode 2.501 sec, 31087 loop, 7622 MB/s
2nd encode 6.844 sec, 343283 loop, 30761 MB/s
sub-thread : total loop = 334993
2nd encode 6.813 sec, 334993 loop, 30154 MB/s
sub-thread : total loop = 327126
2nd encode 6.829 sec, 327126 loop, 29377 MB/s
sub-thread : total loop = 325845
2nd encode 6.844 sec, 325845 loop, 29198 MB/s
sub-thread : total loop = 338550
2nd encode 6.829 sec, 338550 loop, 30403 MB/s
sub-thread : total loop = 336319
2nd encode 6.844 sec, 336319 loop, 30137 MB/s
sub-thread : total loop = 326257
2nd encode 6.829 sec, 326257 loop, 29299 MB/s
sub-thread : total loop = 343753
2nd encode 6.844 sec, 343753 loop, 30803 MB/s
sub-thread : total loop = 339394
2nd encode 6.829 sec, 339394 loop, 30479 MB/s
gpu-thread :
2nd encode 6.782 sec, 3886906 loop, 351484 MB/s
total 10.719 sec

Created successfully

@Slava46

Slava46 commented Nov 24, 2023

Is there any point in me testing those versions with the usual GUI?
You know my specs.

@Yutaka-Sawada
Owner

The 16byte3 version is slightly slower than the 16 byte version

Thanks cavalia88 for the test. Even when a GPU is a slow starter, the task management seems to work well enough. I returned the GPU thread's initial task size to 2 times that of a CPU thread. The OpenCL optimization for AMD GPUs was successful.

gpu-thread speed from tests on Radeon 7900XTX by cavalia88;
par2j64_1331.exe   :  25340 MB/s (Base line as v1.3.3.1)
par2j64_VRAM.exe   : 209800 MB/s (827% speed of base line)
par2j64_4byte.exe  : 208801 MB/s (823% speed of base line)
par2j64_16byte.exe : 347829 MB/s (1372% speed of base line)

At last, I could greatly improve speed for AMD GPUs. 13 times faster!! Using CL_MEM_COPY_HOST_PTR is much faster than the CL_MEM_USE_HOST_PTR flag on AMD GPUs; this may come from the behavior of AMD's OpenCL driver. Using uchar4 (a 4-byte vector data type) isn't faster (almost the same); this may come from the design of the graphics board. Using 16-byte memory access (uint4 and uchar16 vector data types) is fast, because the AMD GPU has a read cache for VRAM.

gpu-thread speed from tests on Intel UHD Graphics 630;
par2j64_1331.exe   : 7958 MB/s (Base line as v1.3.3.1)
par2j64_VRAM.exe   : 7925 MB/s (99% speed of base line)
par2j64_4byte.exe  : 8130 MB/s (102% speed of base line)
par2j64_16byte.exe : 8991 MB/s (112% speed of base line)

I tested the slow GPU integrated in my Intel Core i5-10400 CPU. Because an integrated GPU's VRAM is the same as system RAM, copying data to VRAM is useless on Intel GPUs. Though using uchar4 is slightly faster, the difference is mostly negligible. Using 16-byte memory access is faster, but the difference isn't as big as on the AMD GPU.

gpu-thread speed from tests on NVIDIA GeForce GT 240;
par2j64_1331.exe   : 5758 MB/s (Base line as v1.3.3.1)
par2j64_VRAM.exe   : 5753 MB/s (100% speed of base line)
par2j64_4byte.exe  : 5720 MB/s (99% speed of base line)
par2j64_16byte.exe : 3045 MB/s (52% speed of base line)

I tested an old GeForce graphics board. Using CL_MEM_COPY_HOST_PTR is the same speed as the CL_MEM_USE_HOST_PTR flag on NVIDIA GPUs, because NVIDIA's OpenCL driver caches data to VRAM automatically. Using uchar4 (a 4-byte vector data type) isn't faster, the same as on the AMD GPU; data conversion between vectors and 1-byte scalars seems to be slow on most GPUs. 16-byte memory access, though, is very slow; the slowness may come from the small local memory size.

Is there any point in me testing those versions with the usual GUI?

Oh, thanks Slava46 for the offer. While I tested an old GeForce GPU, recent high-end GPUs may differ. I made a package that contains those 4 samples. I put it (par2j_debug_2023-11-25.zip) in the "MultiPar_sample" folder on OneDrive. Though they are debug versions, they can be used from the MultiPar GUI, too. By enabling the log, the result is saved to MultiPar.log in MultiPar's save folder. Please test them and post the results. Because each log may be long, there's no need to post the whole log. (I know the OpenCL spec of your GeForce RTX 3070 already.) The gpu-thread speed is the important line, printed at the end of each result. I want to know which is faster (or slower) than the previous v1.3.3.1. A small difference like 1~2% is ignorable. Only when there is a big, noticeable difference will I adopt the method for the next v1.3.3.2.

@Slava46

Slava46 commented Nov 25, 2023

Sure. OpenCL specs from the log:
Platform[0] = NVIDIA CUDA
Platform version = OpenCL 3.0 CUDA 12.3.99

Device[0] = NVIDIA GeForce RTX 3070
Device version = OpenCL 3.0 CUDA
LOCAL_MEM_SIZE = 48 KB
MAX_MEM_ALLOC_SIZE = 2047 MB
MAX_COMPUTE_UNITS = 46
MAX_WORK_GROUP_SIZE = 1024
GLOBAL_MEM_SIZE = 8191 MB

Selected platform = NVIDIA CUDA
Selected device = NVIDIA GeForce RTX 3070
src buf : 2096248 KB (1162 blocks), possible
dst buf : 1804 KB (1847296 Bytes), OK
factor buf : 2324 Bytes (1162 factors), OK
CreateKernel : method2

OpenCL : NVIDIA GeForce RTX 3070, OpenCL 3.0 CUDA, 256*46

GPU enabled, CPU high:

The same 70 GB of test files as in previous tests.
par2j64_1331: GPU thread: 140137 MB/s; 06:37
CPU thread: 10914 - 11449 MB/s

par2j64_VRAM: GPU thread: 129142 MB/s; 06:37
par2j64_4byte: GPU thread: 129422 MB/s; 06:39
par2j64_16byte: GPU thread: 170206 MB/s; 06:19 - ~22% faster than par2j64_1331.

Seems par2j64_16byte faster than other methods.
Difference not so dramatic as for Radeon 7900XTX but still nice and faster.

P.S. I compared the times with my previous best results here: #99 (comment) and here: #99 (comment)

MultiPar_sample_2023-10-25: GPU enabled CPU high: 06:35
MultiPar_sample_2023-10-26: GPU enabled CPU high: 06:43

It's now a little faster again; the hardware, files, SSD, etc. are the same, just the newest Windows version and the latest NVIDIA drivers since those tests.

@Yutaka-Sawada
Owner

Difference not so dramatic as for Radeon 7900XTX but still nice and faster.

Thanks Slava46 for testing once again. Because recent GPUs have enough local memory, 16-byte memory access seems to be faster. I will use that function in the next v1.3.3.2.

gpu-thread speed from tests on GeForce RTX 3070 by Slava46;
par2j64_1331.exe   : 140137 MB/s (Base line as v1.3.3.1)
par2j64_VRAM.exe   : 129142 MB/s (92% speed of base line)
par2j64_4byte.exe  : 129422 MB/s (92% speed of base line)
par2j64_16byte.exe : 170206 MB/s (121% speed of base line)

One strange point is the slowness of the CL_MEM_COPY_HOST_PTR flag. Explicit copying seems to be slightly slower than automatic caching on a recent NVIDIA GPU. (However, the difference is negligible in total calculation time.) Maybe it comes from the difference between "one big copy at first" and "consecutive background copies". But I don't know how the NVIDIA GPU's caching system works.

Anyway, using CL_MEM_USE_HOST_PTR (the same as v1.3.3.1) may be good for NVIDIA GPUs. (Because it's slow on AMD GPUs, I need to switch flags per vendor.) To test the case of the CL_MEM_USE_HOST_PTR flag with 16-byte memory access, I made a new sample. I put it (par2j_debug_2023-11-26.zip) in the "MultiPar_sample" folder on OneDrive. When you have time, please test it. If the difference is noticeable, I will change the flag for NVIDIA GPUs.

@Slava46

Slava46 commented Nov 26, 2023

Also, the difference for the Radeon 7900XTX is so big because in the first place its speed was too low and something was wrong, while for NVIDIA it was already fine. So you fixed it for the Radeon 7900XTX, hence the big difference there and just a 21% increase for NVIDIA; but still, 21% faster is a good result.
And the Radeon 7900XTX is twice as fast as the 3070, seemingly because it's one generation newer and has 3 times more memory (24 GB vs 8 GB).

par2j_debug_2023-11-26

par2j64_Host16byte: GPU thread: 185141 MB/s; 06:07

So you made it faster again, hehe: 32% faster than par2j64_1331 and ~9% faster than yesterday's par2j64_16byte.

@cavalia88
Author

cavalia88 commented Nov 26, 2023

Nice to see that even the Nvidia cards are able to see improvements in GPU speeds.

I was monitoring the task manager for GPU load. The 7900XTX load was at most 5% whilst running the latest par2j64_16byte version. Seems like we have not tapped the full potential of the card yet. The CPU still seems to do quite a bit of the processing.

Nonetheless, already very happy that we have seen such a big increase in AMD GPU speeds.

Edit: I realized I was looking at the compute utilization for the AMD onboard iGPU. When I checked the compute utilization for the Radeon 7900XTX, it was indeed above 90%. So all good.

@Nodens-

Nodens- commented Nov 27, 2023

@Yutaka-Sawada Hi, I'm the developer of Realbench and I can tell you OpenCL performance suffers on Nvidia cards in general. You need to use CUDA for those cards as Nvidia is specifically crippling their performance under OpenCL (for the obvious reasons).

@cavalia88 You will not see Compute loads (CUDA/OpenCL) under GPU load in Task Manager or mainstream tools. That 5% you see is not representative of the actual Compute load. The metric is for normal graphics loads, not Compute.

@Yutaka-Sawada
Owner

So you made it faster again, hehe: 32% faster than par2j64_1331 and ~9% faster than yesterday's par2j64_16byte.

Thanks Slava46 for testing again. The point changed for AMD GPUs was bad for NVIDIA GPUs; sometimes the optimizations for NVIDIA and AMD GPUs are different. The next v1.3.3.2 will recognize them and select the faster method automatically.

I posted an alpha version of v1.3.3.2 on GitHub. I put the current sample (par2j_debug_2023-11-27.zip) in the "MultiPar_sample" folder on OneDrive. Anyone who wants to test their GPU may try them. If there is a problem, I will change it further.

Nonetheless, already very happy that we have seen such a big increase in AMD GPU speeds.

I'm glad, too. Though I don't use a (noisy) graphics board on my PC, you helped other GPU users. When some AMD users helped my development a while ago, I could not succeed; this time was a good chance for a retry. Thanks cavalia88 for the new optimization trials.

You need to use CUDA for those cards as Nvidia is specifically crippling their performance under OpenCL (for the obvious reasons).

Thanks Nodens for the advice and helpful information. I use OpenCL for general support of most GPUs. It's difficult to make a CUDA implementation without a real graphics board. I don't want to put a fast GeForce graphics board on my PC, because it's noisy. I may try if NVIDIA releases a fanless, silent GPU.

@Slava46

Slava46 commented Nov 27, 2023

So for par2j_debug_2023-11-27 there is nothing new that needs testing?

You need to use CUDA for those cards

Agreed with that; using CUDA for NVIDIA cards would increase performance a lot. I can test things if you add CUDA support.

I don't want to put a fast GeForce graphics board on my PC, because it's noisy.

Actually, a modern GPU can be pretty silent. I can't say my MSI 3070 Gaming Z is noisy.
Of course, at 70-100% load a GPU will be noisy, but that's not all the time, and some GPUs are much noisier than others.
So you need to ask people, read reviews/tests, etc.

For example, there is the ASUS GeForce RTX 3070 Noctua Edition, created for really silent GPU cooling - I read some reviews and it should be pretty silent.
There are also the ASUS RTX 3080 Noctua Edition and the ASUS RTX 4080 Noctua Edition.

It also depends on the PC case you're using, of course, water cooling or regular cooling, etc.

@Yutaka-Sawada
Owner

So for par2j_debug_2023-11-27 there is nothing new that needs testing?

Thank you, but you don't need to test it. It just selects the faster method for the device. Only if you want to confirm that the selection is correct, you may try the debug version (par2j64_1332.exe) to see the OpenCL_method value in the debug output.

OpenCL_method values in par2j64_1332.exe:
1 = general slow method
2 = 4-byte memory access for recent CPU (SSSE3 or AVX2)
3 = 16-byte memory access for recent CPU (SSSE3 or AVX2)
4 = slow method for old CPU (SSE2)
+8 = NVIDIA GPU or Integrated GPU (AMD APU or Intel CPU)

So the value should be 11 for an NVIDIA GeForce graphics board on a recent PC. (If not, I made a mistake somewhere and it failed to recognize the device.) It should be 3 for an AMD Radeon graphics board on a recent PC. When someone uses an old CPU or GPU, the value may differ; old devices need to be supported for compatibility.

@Slava46

Slava46 commented Nov 27, 2023

Tried par2j64_1332; looks fine to me.

Platform version = OpenCL 3.0 CUDA 12.3.99

Device[0] = NVIDIA GeForce RTX 3070
Device version = OpenCL 3.0 CUDA
HOST_UNIFIED_MEMORY = 0
LOCAL_MEM_SIZE = 48 KB
MAX_MEM_ALLOC_SIZE = 2047 MB
MAX_COMPUTE_UNITS = 46
MAX_WORK_GROUP_SIZE = 1024
GLOBAL_MEM_SIZE = 8191 MB

Selected platform = NVIDIA CUDA
Selected device = NVIDIA GeForce RTX 3070
src buf : 2096664 KB (1317 blocks), possible
dst buf : 1592 KB (1630208 Bytes), OK
factor buf : 2634 Bytes (1317 factors), OK
CreateKernel : method3

Max number of work items = 11776 (256 * 46)
OpenCL_method = 11, vram_max = 1317

@Nodens-

Nodens- commented Nov 27, 2023

Thanks Nodens for the advice and helpful information. I use OpenCL for general support of most GPUs. It's difficult to make a CUDA implementation without a real graphics board. I don't want to put a fast GeForce graphics board on my PC, because it's noisy. I may try if NVIDIA releases a fanless, silent GPU.

This is exactly the problem. Nvidia cards could easily dominate OpenCL performance, but Nvidia wants to lock the Compute market (scientific/research and now ML/ANN) into CUDA, exactly because AMD is not an option for CUDA. They want to avoid just that: single OpenCL implementations that make AMD a viable choice. Hence their drivers artificially cripple OpenCL performance so the only way to go is CUDA. As you can see in your tests, the performance of high-end cards is abysmal compared to AMD cards on the OpenCL implementation. This is intended.
On Realbench it was a very common issue: users complained about how badly their four-digit-cost cards performed on the OpenCL test. I could not implement a CUDA test, though, because it is a benchmark and it must run on everything on the same terms. :)

For testing, I suggest getting a cheap GT 1030. It comes fanless, with a passive heatsink, and in a low-profile bracket form too. It's slow, but for testing it's fine. :) They go for about 80 EUR new; you can probably find a used one for 40ish.

@Yutaka-Sawada
Owner

Tried par2j64_1332; looks fine to me.

Thank you for confirmation.

Hence their drivers artificially cripple OpenCL performance so the only way to go is CUDA.

Oh, I see. But, I don't plan to implement CUDA version at this time, sorry.

@Nodens-

Nodens- commented Nov 28, 2023

Oh, I see. But, I don't plan to implement CUDA version at this time, sorry.

No problem. The only reason I went into this conversation was because I read your notes regarding pushing for performance increases and I just wanted to warn you that there's no way to squeeze any real performance out of Nvidia cards with OpenCL, aiming to perhaps save you some frustration in the process. It was not a request for a CUDA implementation.
I've been using Multipar for ages, it's a great tool no matter what :) Have a great day!

@animetosho

@nodens what about Vulkan? Works on all vendors, and I can't see Nvidia hemorrhaging the performance of one of the most popular APIs.

@Yutaka-Sawada another idea is to pack multiple values into the lookup table. The classic lookup algorithm only fetches 16-bit products, but if 32-bit is more efficient, you could pack two products into each lookup entry (I'm surprised no-one has previously tried doing this).
I've been toying with the idea; the results don't seem to be all that impressive, but I'm still investigating.

@nodens

nodens commented Nov 28, 2023

@nodens what about Vulkan? Works on all vendors, and I can't see Nvidia hemorrhaging the performance of one of the most popular APIs.

You got the wrong nodens ;)
You want to notify @Nodens- instead

@Yutaka-Sawada
Owner

The classic lookup algorithm only fetches 16-bit products, but if 32-bit is more efficient, you could pack two products into each lookup entry (I'm surprised no-one has previously tried doing this).

I'm not sure how efficient they would be. Currently, I use 2 composite tables (High and Low, with 256 entries each) for the 16-bit multiply. Basically, the method is the same as par2cmdline's. If you mean vector data types (such as ushort2, ushort4, ushort8), most GPUs may be slow at handling vector values. I'm afraid the GPU seems to calculate each element one by one for a packed vector value. (But this is just my experience.)

For example, if I write the line:
uint4 b = a + 1;

then the GPU may calculate it as below:

b.x = a.x + 1;
b.y = a.y + 1;
b.z = a.z + 1;
b.w = a.w + 1;

While vector data looks simple in source code, the calculation cost is still heavy. Using vector types doesn't improve speed at all.

From OpenCL for NVIDIA GPUs by Chris Lamb;

NVIDIA GPUs have a scalar architecture
Use vector types in OpenCL for convenience, not performance

@animetosho

animetosho commented Nov 29, 2023

If you say about vector data type

Not quite. The typical approach is to do two lookups for one 16-bit multiply.

For example, if we have one source block and two recovery blocks, the process looks something like:

# generate lookup tables
for i=0 to 256
	lookup0_low[i] = gf_mul(factor0, i)
	lookup0_high[i] = gf_mul(lookup0_low[i], 256)
	
	lookup1_low[i] = gf_mul(factor1, i)
	lookup1_high[i] = gf_mul(lookup1_low[i], 256)

# compute recovery data for 2 blocks
for i=0 to block_size
	recovery0[i] = lookup0_low[input[i] & 0xff] ^ lookup0_high[input[i] >> 8]
	recovery1[i] = lookup1_low[input[i] & 0xff] ^ lookup1_high[input[i] >> 8]

This approach does two lookups for two 16-bit multiplies:

# generate lookup tables
for i=0 to 256
	lookup_low[i] = gf_mul(factor0, i) | (gf_mul(factor1, i) << 16)
	lookup_high[i] = gf_mul(factor0, i*256) | (gf_mul(factor1, i*256) << 16)

# compute recovery data for 2 blocks
for i=0 to block_size
	recoveryPair = lookup_low[input[i] & 0xff] ^ lookup_high[input[i] >> 8]
	
	# de-interleave; this can be deferred to a later step, after all recovery has been computed
	recovery0[i] = recoveryPair & 0xffff
	recovery1[i] = recoveryPair >> 16

Since 32-bit is as efficient as 16-bit, this should halve the number of lookups needed.
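
To make the packed-table idea concrete, here is a minimal, self-contained C sketch (the names gf16_mul and encode_two_blocks are just for illustration; this is not par2j or ParPar code). It uses the PAR2 GF(2^16) polynomial 0x1100B and treats the block as 16-bit words:

#include <stdint.h>
#include <stddef.h>

/* plain shift-and-xor multiply in GF(2^16) with the PAR2 polynomial 0x1100B */
static uint16_t gf16_mul(uint16_t a, uint16_t b) {
    uint32_t x = a, r = 0;
    while (b) {
        if (b & 1) r ^= x;
        x <<= 1;
        if (x & 0x10000) x ^= 0x1100B;
        b >>= 1;
    }
    return (uint16_t)r;
}

/* XOR one source block into two recovery blocks with one pair of lookups
   per word; block_size is the number of 16-bit words, and the recovery
   buffers are assumed to be zero-initialized before the first source block */
static void encode_two_blocks(const uint16_t *input, size_t block_size,
                              uint16_t factor0, uint16_t factor1,
                              uint16_t *recovery0, uint16_t *recovery1) {
    uint32_t lookup_low[256], lookup_high[256];
    for (int i = 0; i < 256; i++) {
        /* low 16 bits hold the product for recovery0, high 16 bits for recovery1 */
        lookup_low[i]  = gf16_mul(factor0, (uint16_t)i)
                       | ((uint32_t)gf16_mul(factor1, (uint16_t)i) << 16);
        lookup_high[i] = gf16_mul(factor0, (uint16_t)(i << 8))
                       | ((uint32_t)gf16_mul(factor1, (uint16_t)(i << 8)) << 16);
    }
    for (size_t i = 0; i < block_size; i++) {
        uint16_t s = input[i];
        uint32_t pair = lookup_low[s & 0xff] ^ lookup_high[s >> 8];
        /* de-interleave; as noted above, this step could be deferred */
        recovery0[i] ^= (uint16_t)pair;
        recovery1[i] ^= (uint16_t)(pair >> 16);
    }
}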

@Yutaka-Sawada
Owner

Since 32-bit is as efficient as 16-bit, this should halve the number of lookups needed.

Oh, it's a nice idea. Thank you for the detailed explanation and example. Because ParPar calculates 2 recovery blocks at once, it may be easy to implement there. CPU calculation will become faster when SSSE3 or AVX2 are unavailable. The only problem for an OpenCL implementation on GPU is that it requires more local memory. When there is enough local memory on a GPU, it will be fast; otherwise, it may be slow.

@cavalia88
Author

cavalia88 commented Dec 5, 2023

In the latest version, par2j64_16byte and par2j64_auto are about the same speed. par2j64_4byte is slightly slower. But all are faster than the previous debug versions.

par2j64_16byte = 9.000s
par2j64_auto = 8.984s
par2j64_4byte = 9.906s

par2j64_16byte c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (88854 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.593 sec, 3346 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.079 sec
0.0% : Creating recovery slice
matrix size = 51 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK
CreateKernel : method3

Max number of work items = 12288 (256 * 48)
OpenCL_method = 3, vram_max = 8775
partial encode = 107 / 8775 (1.2%), read = 8775, skip = 0
remain = 8668, src_off = 107, src_max = 32
GPU: remain = 8636, src_off = 139, src_num = 64
GPU: remain = 8412, src_off = 363, src_num = 2103
GPU: remain = 4773, src_off = 4002, src_num = 2655
GPU: remain = 166, src_off = 8609, src_num = 94
CPU last: src_off = 8735, src_num = 40
100.0%
read 2.484 sec
write 0.797 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 335347
1st encode 2.468 sec, 31439 loop, 7812 MB/s
2nd encode 4.953 sec, 303908 loop, 37629 MB/s
sub-thread : total loop = 334393
1st encode 2.468 sec, 31238 loop, 7762 MB/s
2nd encode 4.953 sec, 303155 loop, 37536 MB/s
sub-thread : total loop = 345214
1st encode 2.468 sec, 31269 loop, 7770 MB/s
2nd encode 4.953 sec, 313945 loop, 38872 MB/s
sub-thread : total loop = 300486
2nd encode 4.953 sec, 300486 loop, 37206 MB/s
sub-thread : total loop = 297912
2nd encode 4.953 sec, 297912 loop, 36887 MB/s
sub-thread : total loop = 304084
2nd encode 4.953 sec, 304084 loop, 37651 MB/s
sub-thread : total loop = 299955
2nd encode 4.953 sec, 299955 loop, 37140 MB/s
sub-thread : total loop = 294683
2nd encode 4.953 sec, 294683 loop, 36487 MB/s
sub-thread : total loop = 296668
2nd encode 4.953 sec, 296668 loop, 36733 MB/s
sub-thread : total loop = 293075
2nd encode 4.953 sec, 293075 loop, 36288 MB/s
sub-thread : total loop = 286380
2nd encode 4.953 sec, 286380 loop, 35459 MB/s
gpu-thread :
2nd encode 5.078 sec, 4316248 loop, 521282 MB/s
total 9.000 sec

Created successfully

par2j64_4byte c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (88578 MB available), Fast SSD

filename is invalid, UjPGgFjavolplR8geCekcqwXz.par2

par2j64_4byte c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (88745 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.593 sec, 3346 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.079 sec
0.0% : Creating recovery slice
matrix size = 51 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK
CreateKernel : method2

Max number of work items = 12288 (256 * 48)
OpenCL_method = 2, vram_max = 8775
partial encode = 106 / 8775 (1.2%), read = 8775, skip = 0
remain = 8669, src_off = 106, src_max = 32
GPU: remain = 8637, src_off = 138, src_num = 64
GPU: remain = 8413, src_off = 362, src_num = 2103
GPU: remain = 4038, src_off = 4737, src_num = 1889
GPU: remain = 101, src_off = 8674, src_num = 47
CPU last: src_off = 8721, src_num = 54
100.0%
read 2.453 sec
write 0.796 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 412169
1st encode 2.437 sec, 30993 loop, 7799 MB/s
2nd encode 5.922 sec, 381176 loop, 39474 MB/s
sub-thread : total loop = 404064
1st encode 2.437 sec, 30990 loop, 7798 MB/s
2nd encode 5.922 sec, 373074 loop, 38635 MB/s
sub-thread : total loop = 401749
1st encode 2.437 sec, 31085 loop, 7822 MB/s
2nd encode 5.922 sec, 370664 loop, 38385 MB/s
sub-thread : total loop = 361288
2nd encode 5.922 sec, 361288 loop, 37414 MB/s
sub-thread : total loop = 365071
2nd encode 5.922 sec, 365071 loop, 37806 MB/s
sub-thread : total loop = 355780
2nd encode 5.922 sec, 355780 loop, 36844 MB/s
sub-thread : total loop = 365958
2nd encode 5.922 sec, 365958 loop, 37898 MB/s
sub-thread : total loop = 358242
2nd encode 5.922 sec, 358242 loop, 37099 MB/s
sub-thread : total loop = 355095
2nd encode 5.922 sec, 355095 loop, 36773 MB/s
sub-thread : total loop = 360049
2nd encode 5.922 sec, 360049 loop, 37286 MB/s
sub-thread : total loop = 362545
2nd encode 5.922 sec, 362545 loop, 37545 MB/s
gpu-thread :
2nd encode 5.985 sec, 3602434 loop, 369140 MB/s
total 9.906 sec

Created successfully

par2j64_auto c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (88690 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.593 sec, 3346 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.079 sec
0.0% : Creating recovery slice
matrix size = 51 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK
CreateKernel : method3

Max number of work items = 12288 (256 * 48)
OpenCL_method = 3, vram_max = 8775
partial encode = 106 / 8775 (1.2%), read = 8775, skip = 0
remain = 8669, src_off = 106, src_max = 32
GPU: remain = 8637, src_off = 138, src_num = 64
GPU: remain = 8413, src_off = 362, src_num = 2103
GPU: remain = 4774, src_off = 4001, src_num = 2656
GPU: remain = 166, src_off = 8609, src_num = 94
CPU last: src_off = 8735, src_num = 40
100.0%
read 2.468 sec
write 0.813 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 336465
1st encode 2.453 sec, 31065 loop, 7766 MB/s
2nd encode 4.890 sec, 305400 loop, 38301 MB/s
sub-thread : total loop = 338566
1st encode 2.453 sec, 31043 loop, 7761 MB/s
2nd encode 4.906 sec, 307523 loop, 38442 MB/s
sub-thread : total loop = 330809
1st encode 2.453 sec, 30960 loop, 7740 MB/s
2nd encode 4.890 sec, 299849 loop, 37605 MB/s
sub-thread : total loop = 302507
2nd encode 4.859 sec, 302507 loop, 38181 MB/s
sub-thread : total loop = 302328
2nd encode 4.874 sec, 302328 loop, 38041 MB/s
sub-thread : total loop = 303313
2nd encode 4.874 sec, 303313 loop, 38164 MB/s
sub-thread : total loop = 294270
2nd encode 4.874 sec, 294270 loop, 37027 MB/s
sub-thread : total loop = 293475
2nd encode 4.890 sec, 293475 loop, 36806 MB/s
sub-thread : total loop = 294667
2nd encode 4.874 sec, 294667 loop, 37077 MB/s
sub-thread : total loop = 297806
2nd encode 4.890 sec, 297806 loop, 37349 MB/s
sub-thread : total loop = 293115
2nd encode 4.874 sec, 293115 loop, 36881 MB/s
gpu-thread :
2nd encode 5.046 sec, 4317126 loop, 524695 MB/s
total 8.984 sec

Created successfully

@Slava46

Slava46 commented Dec 5, 2023

For me it looks about the same as the previous version, with the same files.
CPU thread: 10551 - 11522 MB/s
par2j64_4byte: GPU thread: 143267 MB/s; 06:34
par2j64_16byte: GPU thread: 185287 MB/s; 06:12
par2j64_auto: GPU thread: 186310 MB/s; 06:05

@Yutaka-Sawada
Owner

Thanks cavalia88 and Slava46 for testing the new method. It's interesting that the AMD GPU becomes much faster (a 50% speed-up), while there is no difference on the NVIDIA GPU. Because the 16-byte access version is always faster than the 4-byte access version, both should have enough private memory (number of registers). This difference may come from the access speed of local memory or an effective caching system. When table lookup is already very fast, reducing the number of lookups would make little difference in the total calculation time. This might be why Anime Tosho could not get an impressive result in his tests.

gpu-thread speed from tests on Radeon 7900XTX by cavalia88;
 With classic method;
par2j64_4byte : 208801 MB/s
par2j64_16byte: 347829 MB/s
 With new method;
par2j64_4byte:  369140 MB/s
par2j64_16byte: 521282 MB/s
par2j64_auto:   524695 MB/s

gpu-thread speed from tests on GeForce RTX 3070 by Slava46;
 With classic method;
par2j64_Host16byte: 185141 MB/s
 With new method;
par2j64_4byte:  143267 MB/s
par2j64_16byte: 185287 MB/s
par2j64_auto:   186310 MB/s

When there is no big difference, I prefer the simpler implementation. I may switch functions between NVIDIA and AMD GPUs: for example, calculate 2 blocks at once on AMD or integrated GPUs, while using the classic method on NVIDIA's discrete GPUs. Because it already distinguishes NVIDIA GPUs, this is possible. I will try to make a new automatic selection mechanism.
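
As a rough sketch of that kind of switch (hypothetical code for illustration only, not the actual par2j source), the OpenCL vendor string plus the host-unified-memory flag would be enough to separate NVIDIA discrete GPUs from AMD or integrated ones:

#include <CL/cl.h>
#include <string.h>

enum kernel_method { METHOD_CLASSIC, METHOD_TWO_BLOCKS };

/* hypothetical selector: classic kernel for NVIDIA discrete GPUs,
   2-blocks-at-once kernel for AMD or integrated GPUs */
static enum kernel_method pick_method(cl_device_id device) {
    char vendor[128] = {0};
    cl_bool unified = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor), vendor, NULL);
    /* integrated GPUs normally report host-unified memory */
    clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY,
                    sizeof(unified), &unified, NULL);
    if (strstr(vendor, "NVIDIA") != NULL && unified == CL_FALSE)
        return METHOD_CLASSIC;
    return METHOD_TWO_BLOCKS;
}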

@animetosho

animetosho commented Dec 7, 2023

This might be why Anime Tosho could not get an impressive result in his tests.

Actually, I'm deferring de-interleave until the end, so it should just always be better. But I'm seeing little difference in my own tests.

Your tests, however, generally show better performance, except for the weird case of the RX570 being oddly slow.

Using a 2200 MB source file with arguments c -rr10 -ss1048576 -lc257 (or -lc513)

2023-11-27 : AMD RX570

Selected platform = AMD Accelerated Parallel Processing
Selected device = Ellesmere
src buf : 1048560 KB (1020 blocks), possible
dst buf : 1028 KB (1052672 Bytes), OK
factor buf : 2040 Bytes (1020 factors), OK
16byte
CreateKernel : method8

Max number of work items = 8192 (256 * 32)
OpenCL_method = 8, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1870, src_off = 330, src_num = 340
GPU: remain = 150, src_off = 2050, src_num = 29
CPU last: src_off = 2169, src_num = 31
100.0%
read   0.859 sec
write  0.406 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 389620
 2nd encode 11.313 sec, 389620 loop, 34574 MB/s
gpu-thread :
 2nd encode 11.454 sec, 94380 loop, 8272 MB/s
total  13.641 sec
Host16byte
CreateKernel : method8

Max number of work items = 8192 (256 * 32)
OpenCL_method = 24, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1720, src_off = 480, src_num = 215
GPU last ?: src_off = 2165, src_num = 7
CPU last: src_off = 2172, src_num = 28
100.0%
read   0.813 sec
write  0.516 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 421960
 2nd encode 12.031 sec, 421960 loop, 35209 MB/s
gpu-thread :
 2nd encode 12.093 sec, 62040 loop, 5150 MB/s
total  14.469 sec
VRAM
CreateKernel : method2

Max number of work items = 8192 (256 * 32)
OpenCL_method = 2, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1870, src_off = 330, src_num = 340
GPU: remain = 240, src_off = 1960, src_num = 48
CPU last: src_off = 2158, src_num = 42
100.0%
read   0.875 sec
write  0.422 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 385440
 2nd encode 12.703 sec, 385440 loop, 30460 MB/s
gpu-thread :
 2nd encode 12.657 sec, 98560 loop, 7817 MB/s
total  15.140 sec
1332
CreateKernel : method3

Max number of work items = 8192 (256 * 32)
OpenCL_method = 3, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1840, src_off = 360, src_num = 306
GPU: remain = 274, src_off = 1926, src_num = 52
CPU last: src_off = 2158, src_num = 42
100.0%
read   0.953 sec
write  0.578 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 392040
 2nd encode 11.235 sec, 392040 loop, 35030 MB/s
gpu-thread :
 2nd encode 11.235 sec, 91960 loop, 8217 MB/s
total  13.828 sec

2023-12-05 : AMD RX570

Selected platform = AMD Accelerated Parallel Processing
Selected device = Ellesmere
src buf : 1048560 KB (1020 blocks), possible
dst buf : 2056 KB (2105344 Bytes), OK
factor buf : 4080 Bytes (1020 factors), OK
4byte
CreateKernel : method2

Max number of work items = 8192 (256 * 32)
OpenCL_method = 2, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1810, src_off = 390, src_num = 278
GPU: remain = 62, src_off = 2138, src_num = 9
CPU last: src_off = 2147, src_num = 53
100.0%
read   0.922 sec
write  0.500 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 407660
 2nd encode 12.360 sec, 407660 loop, 33111 MB/s
gpu-thread :
 2nd encode 12.485 sec, 76340 loop, 6138 MB/s
total  15.172 sec
16byte
CreateKernel : method3

Max number of work items = 8192 (256 * 32)
OpenCL_method = 3, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1870, src_off = 330, src_num = 340
GPU: remain = 270, src_off = 1930, src_num = 55
CPU last: src_off = 2165, src_num = 35
100.0%
read   0.813 sec
write  0.782 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 383900
 2nd encode 16.250 sec, 383900 loop, 23716 MB/s
gpu-thread :
 2nd encode 16.453 sec, 100100 loop, 6107 MB/s
total  19.203 sec

2023-11-27 : Nvidia GTX 960

Selected platform = NVIDIA CUDA
Selected device = NVIDIA GeForce GTX 960
src buf : 523252 KB (509 blocks), possible
dst buf : 1028 KB (1052672 Bytes), OK
factor buf : 1018 Bytes (509 factors), OK
16byte
CreateKernel : method8

Max number of work items = 2048 (256 * 8)
OpenCL_method = 8, vram_max = 509
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 8
GPU: remain = 2192, src_off = 8, src_num = 16
GPU: remain = 2168, src_off = 32, src_num = 509
GPU: remain = 1467, src_off = 733, src_num = 509
GPU: remain = 782, src_off = 1418, src_num = 509
GPU: remain = 81, src_off = 2119, src_num = 58
CPU last: src_off = 2185, src_num = 15
100.0%
read   2.266 sec
write  1.015 sec

OpenCL : NVIDIA GeForce GTX 960, OpenCL 3.0 CUDA, 256*8
sub-thread : total loop = 131780
 2nd encode 18.860 sec, 131780 loop, 7014 MB/s
gpu-thread :
 2nd encode 18.766 sec, 352220 loop, 18842 MB/s
total  23.172 sec
Host16byte
CreateKernel : method8

Max number of work items = 2048 (256 * 8)
OpenCL_method = 24, vram_max = 509
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 8
GPU: remain = 2192, src_off = 8, src_num = 16
GPU: remain = 2168, src_off = 32, src_num = 509
GPU: remain = 1491, src_off = 709, src_num = 509
GPU: remain = 806, src_off = 1394, src_num = 509
GPU: remain = 121, src_off = 2079, src_num = 89
CPU last: src_off = 2192, src_num = 8
100.0%
read   2.265 sec
write  1.031 sec

OpenCL : NVIDIA GeForce GTX 960, OpenCL 3.0 CUDA, 256*8
sub-thread : total loop = 124960
 2nd encode 17.953 sec, 124960 loop, 6987 MB/s
gpu-thread :
 2nd encode 17.797 sec, 359040 loop, 20252 MB/s
total  22.266 sec
VRAM
CreateKernel : method2

Max number of work items = 2048 (256 * 8)
OpenCL_method = 2, vram_max = 509
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 8
GPU: remain = 2192, src_off = 8, src_num = 16
GPU: remain = 2168, src_off = 32, src_num = 509
GPU: remain = 1483, src_off = 717, src_num = 509
GPU: remain = 790, src_off = 1410, src_num = 509
GPU: remain = 105, src_off = 2095, src_num = 77
CPU last: src_off = 2188, src_num = 12
100.0%
read   2.266 sec
write  1.016 sec

OpenCL : NVIDIA GeForce GTX 960, OpenCL 3.0 CUDA, 256*8
sub-thread : total loop = 127600
 2nd encode 17.969 sec, 127600 loop, 7128 MB/s
gpu-thread :
 2nd encode 17.937 sec, 356400 loop, 19947 MB/s
total  22.156 sec
1332
CreateKernel : method3

Max number of work items = 2048 (256 * 8)
OpenCL_method = 11, vram_max = 509
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 8
GPU: remain = 2192, src_off = 8, src_num = 16
GPU: remain = 2168, src_off = 32, src_num = 509
GPU: remain = 1483, src_off = 717, src_num = 509
GPU: remain = 790, src_off = 1410, src_num = 509
GPU: remain = 89, src_off = 2111, src_num = 65
CPU last: src_off = 2192, src_num = 8
100.0%
read   2.250 sec
write  1.031 sec

OpenCL : NVIDIA GeForce GTX 960, OpenCL 3.0 CUDA, 256*8
sub-thread : total loop = 130240
 2nd encode 17.640 sec, 130240 loop, 7412 MB/s
gpu-thread :
 2nd encode 17.469 sec, 353760 loop, 20329 MB/s
total  21.969 sec

2023-12-05 : Nvidia GTX 960

Selected platform = NVIDIA CUDA
Selected device = NVIDIA GeForce GTX 960
src buf : 523252 KB (509 blocks), possible
dst buf : 2056 KB (2105344 Bytes), OK
factor buf : 2036 Bytes (509 factors), OK
4byte
CreateKernel : method2

Max number of work items = 2048 (256 * 8)
OpenCL_method = 10, vram_max = 509
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 8
GPU: remain = 2192, src_off = 8, src_num = 16
GPU: remain = 2168, src_off = 32, src_num = 509
GPU: remain = 1539, src_off = 661, src_num = 509
GPU: remain = 902, src_off = 1298, src_num = 509
GPU: remain = 273, src_off = 1927, src_num = 218
CPU last: src_off = 2185, src_num = 15
100.0%
read   2.281 sec
write  1.016 sec

OpenCL : NVIDIA GeForce GTX 960, OpenCL 3.0 CUDA, 256*8
sub-thread : total loop = 96580
 2nd encode 13.531 sec, 96580 loop, 7165 MB/s
gpu-thread :
 2nd encode 13.453 sec, 387420 loop, 28910 MB/s
total  17.406 sec
16byte
CreateKernel : method3

Max number of work items = 2048 (256 * 8)
OpenCL_method = 11, vram_max = 509
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 8
GPU: remain = 2192, src_off = 8, src_num = 16
GPU: remain = 2168, src_off = 32, src_num = 509
GPU: remain = 1539, src_off = 661, src_num = 509
GPU: remain = 902, src_off = 1298, src_num = 509
GPU: remain = 273, src_off = 1927, src_num = 218
CPU last: src_off = 2185, src_num = 15
100.0%
read   2.437 sec
write  1.031 sec

OpenCL : NVIDIA GeForce GTX 960, OpenCL 3.0 CUDA, 256*8
sub-thread : total loop = 96580
 2nd encode 13.672 sec, 96580 loop, 7091 MB/s
gpu-thread :
 2nd encode 13.484 sec, 387420 loop, 28844 MB/s
total  19.141 sec

2023-11-27 : Intel UHD 770

Selected platform = Intel(R) OpenCL HD Graphics
Selected device = Intel(R) AlderLake-S Mobile Graphics Controller
src buf : 4194240 KB (4080 blocks), possible
dst buf : 1028 KB (1052672 Bytes), OK
factor buf : 4400 Bytes (2200 factors), OK
16byte
CreateKernel : method8

Max number of work items = 8192 (256 * 32)
OpenCL_method = 40, vram_max = 2200
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 2020, src_off = 180, src_num = 673
GPU: remain = 537, src_off = 1663, src_num = 236
GPU last +: src_off = 2169, src_num = 1 + 30
100.0%
read   0.859 sec
write  0.469 sec

OpenCL : Intel(R) AlderLake-S Mobile Graphics Controller, OpenCL 3.0 NEO , 256*32
sub-thread : total loop = 264000
 2nd encode 9.797 sec, 264000 loop, 27052 MB/s
gpu-thread :
 2nd encode 10.015 sec, 220000 loop, 22052 MB/s
total  12.453 sec
Host16byte
CreateKernel : method8

Max number of work items = 8192 (256 * 32)
OpenCL_method = 24, vram_max = 2200
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 2020, src_off = 180, src_num = 673
GPU: remain = 537, src_off = 1663, src_num = 236
GPU last +: src_off = 2169, src_num = 1 + 30
100.0%
read   0.891 sec
write  0.453 sec

OpenCL : Intel(R) AlderLake-S Mobile Graphics Controller, OpenCL 3.0 NEO , 256*32
sub-thread : total loop = 264000
 2nd encode 9.719 sec, 264000 loop, 27269 MB/s
gpu-thread :
 2nd encode 10.047 sec, 220000 loop, 21982 MB/s
total  12.266 sec
1332
CreateKernel : method3

Max number of work items = 8192 (256 * 32)
OpenCL_method = 11, vram_max = 2200
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 2020, src_off = 180, src_num = 673
GPU: remain = 627, src_off = 1573, src_num = 292
CPU last: src_off = 2165, src_num = 35
100.0%
read   0.875 sec
write  0.469 sec

OpenCL : Intel(R) AlderLake-S Mobile Graphics Controller, OpenCL 3.0 NEO , 256*32
sub-thread : total loop = 258500
 2nd encode 10.390 sec, 258500 loop, 24976 MB/s
gpu-thread :
 2nd encode 10.250 sec, 225500 loop, 22085 MB/s
total  12.625 sec

2023-12-05 : Intel UHD 770

Selected platform = Intel(R) OpenCL HD Graphics
Selected device = Intel(R) AlderLake-S Mobile Graphics Controller
src buf : 4194240 KB (4080 blocks), possible
dst buf : 2056 KB (2105344 Bytes), OK
factor buf : 8800 Bytes (2200 factors), OK
4byte
CreateKernel : method2

Max number of work items = 8192 (256 * 32)
OpenCL_method = 10, vram_max = 2200
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 2050, src_off = 150, src_num = 820
GPU: remain = 270, src_off = 1930, src_num = 123
CPU last: src_off = 2143, src_num = 57
100.0%
read   0.828 sec
write  0.438 sec

OpenCL : Intel(R) AlderLake-S Mobile Graphics Controller, OpenCL 3.0 NEO , 256*32
sub-thread : total loop = 263340
 2nd encode 9.719 sec, 263340 loop, 27201 MB/s
gpu-thread :
 2nd encode 9.781 sec, 220660 loop, 22648 MB/s
total  12.125 sec
16byte
CreateKernel : method3

Max number of work items = 8192 (256 * 32)
OpenCL_method = 11, vram_max = 2200
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 2050, src_off = 150, src_num = 820
GPU: remain = 480, src_off = 1720, src_num = 245
CPU last: src_off = 2145, src_num = 55
100.0%
read   0.843 sec
write  0.531 sec

OpenCL : Intel(R) AlderLake-S Mobile Graphics Controller, OpenCL 3.0 NEO , 256*32
sub-thread : total loop = 236500
 2nd encode 8.750 sec, 236500 loop, 27134 MB/s
gpu-thread :
 2nd encode 8.390 sec, 247500 loop, 29614 MB/s
total  11.125 sec

@Yutaka-Sawada
Owner

Thanks Anime Tosho for the tests on several devices. The slow result of the Radeon RX 570 is interesting. Because the GPU was a low-priced model for its age, private memory may be lacking. While the Radeon RX 570 has 32 KB of local memory, that is the minimum size for OpenCL version 1.2. Though the Radeon RX 570 is fast with 16-byte memory access, it's slow at calculating 2 blocks at once.

 gpu-thread speed from tests on AMD Radeon RX 570 by Anime Tosho;
par2j64_Host16byte : 5150 MB/s (CL_MEM_USE_HOST_PTR is slow)
par2j64_VRAM       : 7817 MB/s (CL_MEM_COPY_HOST_PTR is faster)
par2j64_16byte     : 8272 MB/s (16-byte memory access is faster)
par2j64_4byte2     : 6138 MB/s (Calculating 2 blocks is slow)
par2j64_16byte2    : 6107 MB/s (Calculating 2 blocks is slow)

On the other hand, the Nvidia GeForce GTX 960 shows a different property. The GeForce GTX 960 isn't fast with 16-byte memory access, but calculating 2 blocks at once is fast. The GeForce GTX 960 has 48 KB of local memory. Though local memory size is different from private memory size, it may indicate the price level (a higher-rank model).

 gpu-thread speed from tests on NVIDIA GeForce GTX 960 by Anime Tosho;
par2j64_Host16byte : 20252 MB/s (16-byte memory access is faster)
par2j64_VRAM       : 19947 MB/s (CL_MEM_COPY_HOST_PTR is slightly slower)
par2j64_16byte     : 18842 MB/s (CL_MEM_COPY_HOST_PTR is slightly slower)
par2j64_4byte2     : 28910 MB/s (Calculating 2 blocks is faster)
par2j64_16byte2    : 28844 MB/s (Calculating 2 blocks is faster)

The Intel UHD Graphics 770 behaves similarly to the Intel UHD Graphics 630. The integrated GPU on a recent Intel CPU is almost the same speed as an old graphics board.

par2j64_Host16byte : 21982 MB/s (CL_MEM_USE_HOST_PTR)
par2j64_16byte     : 22052 MB/s (CL_MEM_COPY_HOST_PTR is same)
par2j64_4byte2     : 22648 MB/s (Calculating 2 blocks is faster)
par2j64_16byte2    : 29614 MB/s (Calculating 2 blocks is faster)

To distinguish a GPU's price level, checking the local memory size may be good. Even when the age is similar, a cheap model and a higher-rank model run at different speeds. For recent GPUs, calculating 2 blocks will be good (much faster, or at least a similar speed). I need to improve how the fast method is selected automatically.

@cavalia88
Author

To distinguish a GPU's price level, checking the local memory size may be good. Even when the age is similar, a cheap model and a higher-rank model run at different speeds. For recent GPUs, calculating 2 blocks will be good (much faster, or at least a similar speed). I need to improve how the fast method is selected automatically.

Perhaps you can introduce a new argument that allows users to try out the different methods themselves (if they wanted to)? Maybe by default, method selection is always set to auto, but the user can override it with /method-classic or /method-new etc.

This will allow more users to try it out for themselves and provide you with feedback on which method works best with different hardware setups.

@animetosho

animetosho commented Dec 9, 2023

Because the GPU was a low-priced model for its age, private memory may be lacking

The RX570 is much faster than the other two. And my own code seems to demonstrate such as well (I typically get >60GB/s).
I'm not sure why par2j's implementation is slower though.

@Yutaka-Sawada
Owner

Perhaps you can introduce a new argument that allows users to try out the different methods themselves (if they wanted to)?

Yes, I did. Thanks cavalia88 for the advice. I made new "lc" option values to test combinations. Now I can easily change the behavior and test the results. Because the MultiPar GUI doesn't support them, a user needs to test at the command line.

 You may set additional combinations for GPU control;
+256 or +512 (slower device) to enable GPU acceleration
+65536 for classic method, +131072 for composite table to calculate 2 blocks at once
+262144 for 4-byte memory access, +524288 for 16-byte memory access
+1048576 for CL_MEM_COPY_HOST_PTR, +2097152 for CL_MEM_USE_HOST_PTR
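
For example, combining GPU acceleration (+256) with the composite table that calculates 2 blocks at once (+131072) and 16-byte memory access (+524288) adds up to 655616, so a test run on the earlier sample set (paths simply copied from the commands above) would look like:

par2j64 c -rr10 -ss640000 -lc655616 -rd2 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"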

And my own code seems to demonstrate such as well (I typically get >60GB/s).

Oh, I see. I feel that private memory size is the reason. When an OpenCL function (kernel) uses only a few registers, it runs very fast; but it may be slow when the kernel uses many registers. I found a way to estimate private memory usage in OpenCL: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE seems to indicate how many work items are good. The value may be the minimum needed to run the kernel quickly (with enough private memory). When I use more variables (or more complex code) in my kernel source, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE becomes smaller.
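
For reference, these values can be queried from the host side with clGetKernelWorkGroupInfo; a minimal sketch in plain C (error checks omitted, kernel and device handles assumed to exist already):

#include <CL/cl.h>
#include <stdio.h>

static void print_kernel_limits(cl_kernel kernel, cl_device_id device) {
    size_t wg_size = 0, preferred_multiple = 0;
    cl_ulong private_mem = 0;

    /* largest work-group size this kernel can be launched with */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(wg_size), &wg_size, NULL);
    /* preferred multiple; it drops when the kernel needs more registers */
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred_multiple), &preferred_multiple, NULL);
    /* private (per work-item) memory usage reported by the driver */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);

    printf("WORK_GROUP_SIZE = %zu\n", wg_size);
    printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", preferred_multiple);
    printf("PRIVATE_MEM_SIZE = %llu bytes\n", (unsigned long long)private_mem);
}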

I have always set 256 work items per Compute Unit in each kernel. This might be bad for some (cheap or old) GPUs. Because Anime Tosho sets fewer work items, there was no such problem in his implementation. I will need to change the number of work items based on the result of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. On my integrated GPU, setting 128 work items is mostly faster than setting 256. When the number of work items is halved, they use half the number of registers in a Compute Unit. Because private memory isn't enough on my GPU, using more variables was slower before. The best setting depends on the GPU's spec (as indicated by CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE).

I made some samples to test lower private memory usage. From tests on my PC, setting fewer work items seems to be good (faster). Because the Intel GPU's behavior may be different from other GPUs, I included other samples, too. I put the package (par2j_debug_2023-12-10.zip) in the "MultiPar_sample" folder on OneDrive. If someone is interested in GPU optimization, please test them. I will reduce private memory usage by referring to the results from others.

@Slava46

Slava46 commented Dec 10, 2023

par2j_debug_2023-12-10
par2j64_item64: gpu-thread: 110225 MB/s; 06:51
par2j64_item128: 142530 MB/s; 06:34
par2j64_item256: 176494 MB/s; 06:21
par2j64_less: 164179 MB/s; 06:20
par2j64_table: 147754 MB/s; 06:37

For this test, par2j64_item256 seems to be the fastest for me.

but the speed was faster here than now: #107 (comment)

@Yutaka-Sawada
Owner

For this test, par2j64_item256 seems to be the fastest for me.

Thanks Slava46 for the tests. A recent GPU may have enough private memory even for the heavy kernel.

but the speed was faster here than now: #107 (comment)

This is strange. I'm not sure why there is a difference. I changed 2 points from the old source code:
(1) table construction (ternary operator or if)
(2) additional function (gpu_multiply_blocks2)

Though the speed is the same on my Intel GPU, it may differ on other, faster GPUs. I feel that the ternary operator might be better than using if. I reverted these changed points in the new package (par2j_debug_2023-12-11.zip). If you are curious about such a small difference, you may compare their results. Also, please post the log of par2j64_item256.exe so I can see the OpenCL properties and the automatic selection. (No need for the whole log, just the part after the OpenCL spec is shown.) I might have made a mistake somewhere.

@Slava46

Slava46 commented Dec 11, 2023

Of course, more testing is needed to find the truth.
par2j_debug_2023-12-11:
par2j64_item64: 114349 MB/s; 07:04
par2j64_item128: 142226 MB/s; 06:32
par2j64_item256: 180293 MB/s; 06:15
par2j64_less: 168113 MB/s; 06:22
par2j64_table: 146491 MB/s; 06:41

par2j64_item256 is faster than in the previous par2j_debug_2023-12-10, but this one is still a little faster hehe: #107 (comment)

par2j64_item256:


 read all source blocks, and keep all parity blocks (GPU)
buffer size = 7055 MB, io_size = 1630192, split = 17
cache: limit size = 65536, chunk_size = 65216, chunk_num = 25
unit_size = 1630208, cpu_num1 = 4, cpu_num2 = 16

Platform[0] = NVIDIA CUDA
Platform version = OpenCL 3.0 CUDA 12.3.99

Device[0] = NVIDIA GeForce RTX 3070
Device version = OpenCL 3.0 CUDA
MAX_MEM_ALLOC_SIZE = 2047 MB
MAX_COMPUTE_UNITS = 46
MAX_WORK_GROUP_SIZE = 1024
LOCAL_MEM_SIZE = 48 KB
GLOBAL_MEM_CACHE_SIZE = 1288 KB
GLOBAL_MEM_SIZE = 8191 MB

Selected platform = NVIDIA CUDA
Selected device = NVIDIA GeForce RTX 3070
src buf : 2096664 KB (1317 blocks), possible
dst buf : 3184 KB (3260416 Bytes), OK
factor buf : 5268 Bytes (1317 factors), OK
CreateKernel : method12
WORK_GROUP_SIZE = 256
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Max number of work items = 11776 (256 * 46)
OpenCL_method = 44, vram_max = 1317

@cavalia88
Author

cavalia88 commented Dec 11, 2023

My results:

par2j_debug_2023-12-11:
par2j64_item64: 263474 MB/s; 11.328 seconds
par2j64_item128: 373056 MB/s; 10.39 seconds
par2j64_item256: 493259 MB/s; 9.531 seconds
par2j64_less: 509592 MB/s; 9.453 seconds
par2j64_table: 377287 MB/s; 10.172 seconds

Slightly slower as compared to the earlier par2j_debug_2023-12-05 version

D:\Process>par2j64_item64 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory  : "I:\Output\Sample\"
Recovery File   : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread      : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra       : x64 AVX2 CLMUL
Memory usage    : Auto (88190 MB available), Fast SSD

Input File count        : 56
Input File total size   : 5589315326
Input File Slice size   : 640000
Input File Slice count  : 8775
Recovery Slice count    : 878
Redundancy rate         : 10.00%
Recovery File count     : 11
Slice distribution      : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.610 sec, 3310 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
  0.0% : Creating recovery slice
matrix size = 51 KB

 read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
LOCAL_MEM_SIZE = 64 KB
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK
CreateKernel : method12

Max number of work items = 3072 (64 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 115 / 8775 (1.3%), read = 8775, skip = 0
remain = 8660, src_off = 115, src_max = 32
GPU: remain = 8628, src_off = 147, src_num = 64
GPU: remain = 8308, src_off = 467, src_num = 1510
GPU: remain = 4558, src_off = 4217, src_num = 1748
GPU: remain = 282, src_off = 8493, src_num = 111
CPU last: src_off = 8732, src_num = 43
100.0%
read   2.657 sec
write  0.796 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 64*48
sub-thread : total loop = 465377
 1st encode 2.625 sec, 33858 loop, 7910 MB/s
 2nd encode 6.812 sec, 431519 loop, 38849 MB/s
sub-thread : total loop = 459489
 1st encode 2.625 sec, 33433 loop, 7810 MB/s
 2nd encode 6.812 sec, 426056 loop, 38357 MB/s
sub-thread : total loop = 461830
 1st encode 2.625 sec, 33679 loop, 7868 MB/s
 2nd encode 6.780 sec, 428151 loop, 38728 MB/s
sub-thread : total loop = 417418
 2nd encode 6.796 sec, 417418 loop, 37668 MB/s
sub-thread : total loop = 407159
 2nd encode 6.812 sec, 407159 loop, 36656 MB/s
sub-thread : total loop = 416355
 2nd encode 6.796 sec, 416355 loop, 37572 MB/s
sub-thread : total loop = 413946
 2nd encode 6.812 sec, 413946 loop, 37267 MB/s
sub-thread : total loop = 414716
 2nd encode 6.796 sec, 414716 loop, 37424 MB/s
sub-thread : total loop = 405407
 2nd encode 6.796 sec, 405407 loop, 36584 MB/s
sub-thread : total loop = 415415
 2nd encode 6.812 sec, 415415 loop, 37399 MB/s
sub-thread : total loop = 413160
 2nd encode 6.796 sec, 413160 loop, 37284 MB/s
gpu-thread :
 2nd encode 7.016 sec, 3014174 loop, 263474 MB/s
total  11.328 sec

Created successfully

=================================


D:\Process>par2j64_item128 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory  : "I:\Output\Sample\"
Recovery File   : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread      : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra       : x64 AVX2 CLMUL
Memory usage    : Auto (88201 MB available), Fast SSD

Input File count        : 56
Input File total size   : 5589315326
Input File Slice size   : 640000
Input File Slice count  : 8775
Recovery Slice count    : 878
Redundancy rate         : 10.00%
Recovery File count     : 11
Slice distribution      : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.593 sec, 3346 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.079 sec
  0.0% : Creating recovery slice
matrix size = 51 KB

 read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
LOCAL_MEM_SIZE = 64 KB
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK
CreateKernel : method12

Max number of work items = 6144 (128 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 116 / 8775 (1.3%), read = 8775, skip = 0
remain = 8659, src_off = 116, src_max = 32
GPU: remain = 8627, src_off = 148, src_num = 64
GPU: remain = 8339, src_off = 436, src_num = 1667
GPU: remain = 4976, src_off = 3799, src_num = 2338
GPU: remain = 398, src_off = 8377, src_num = 196
CPU last: src_off = 8733, src_num = 42
100.0%
read   2.687 sec
write  0.812 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 128*48
sub-thread : total loop = 396569
 1st encode 2.672 sec, 34005 loop, 7804 MB/s
 2nd encode 5.875 sec, 362564 loop, 37847 MB/s
sub-thread : total loop = 389874
 1st encode 2.672 sec, 33909 loop, 7782 MB/s
 2nd encode 5.875 sec, 355965 loop, 37158 MB/s
sub-thread : total loop = 392472
 1st encode 2.672 sec, 33934 loop, 7788 MB/s
 2nd encode 5.859 sec, 358538 loop, 37529 MB/s
sub-thread : total loop = 354491
 2nd encode 5.859 sec, 354491 loop, 37105 MB/s
sub-thread : total loop = 352122
 2nd encode 5.859 sec, 352122 loop, 36857 MB/s
sub-thread : total loop = 355052
 2nd encode 5.859 sec, 355052 loop, 37164 MB/s
sub-thread : total loop = 344814
 2nd encode 5.859 sec, 344814 loop, 36092 MB/s
sub-thread : total loop = 343776
 2nd encode 5.859 sec, 343776 loop, 35984 MB/s
sub-thread : total loop = 340294
 2nd encode 5.859 sec, 340294 loop, 35619 MB/s
sub-thread : total loop = 351767
 2nd encode 5.859 sec, 351767 loop, 36820 MB/s
sub-thread : total loop = 338546
 2nd encode 5.859 sec, 338546 loop, 35436 MB/s
gpu-thread :
 2nd encode 6.156 sec, 3744670 loop, 373056 MB/s
total  10.390 sec

Created successfully


===================================


D:\Process>par2j64_item256 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory  : "I:\Output\Sample\"
Recovery File   : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread      : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra       : x64 AVX2 CLMUL
Memory usage    : Auto (88197 MB available), Fast SSD

Input File count        : 56
Input File total size   : 5589315326
Input File Slice size   : 640000
Input File Slice count  : 8775
Recovery Slice count    : 878
Redundancy rate         : 10.00%
Recovery File count     : 11
Slice distribution      : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.594 sec, 3344 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.079 sec
  0.0% : Creating recovery slice
matrix size = 51 KB

 read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
LOCAL_MEM_SIZE = 64 KB
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK
CreateKernel : method12

Max number of work items = 12288 (256 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 117 / 8775 (1.3%), read = 8775, skip = 0
remain = 8658, src_off = 117, src_max = 32
GPU: remain = 8626, src_off = 149, src_num = 64
GPU: remain = 8370, src_off = 405, src_num = 1860
GPU: remain = 5102, src_off = 3673, src_num = 2760
GPU: remain = 326, src_off = 8449, src_num = 183
CPU last: src_off = 8728, src_num = 47
100.0%
read   2.703 sec
write  0.797 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 347544
 1st encode 2.687 sec, 34347 loop, 7839 MB/s
 2nd encode 5.063 sec, 313197 loop, 37937 MB/s
sub-thread : total loop = 350347
 1st encode 2.703 sec, 34308 loop, 7784 MB/s
 2nd encode 5.063 sec, 316039 loop, 38281 MB/s
sub-thread : total loop = 337194
 1st encode 2.703 sec, 34071 loop, 7730 MB/s
 2nd encode 5.063 sec, 303123 loop, 36717 MB/s
sub-thread : total loop = 300511
 2nd encode 5.063 sec, 300511 loop, 36400 MB/s
sub-thread : total loop = 295809
 2nd encode 5.063 sec, 295809 loop, 35831 MB/s
sub-thread : total loop = 304831
 2nd encode 5.063 sec, 304831 loop, 36924 MB/s
sub-thread : total loop = 304643
 2nd encode 5.063 sec, 304643 loop, 36901 MB/s
sub-thread : total loop = 300262
 2nd encode 5.063 sec, 300262 loop, 36370 MB/s
sub-thread : total loop = 294659
 2nd encode 5.063 sec, 294659 loop, 35692 MB/s
sub-thread : total loop = 298971
 2nd encode 5.063 sec, 298971 loop, 36214 MB/s
sub-thread : total loop = 296448
 2nd encode 5.063 sec, 296448 loop, 35908 MB/s
gpu-thread :
 2nd encode 5.313 sec, 4273226 loop, 493259 MB/s
total  9.531 sec

Created successfully


======================================================

D:\Process>par2j64_less c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory  : "I:\Output\Sample\"
Recovery File   : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread      : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra       : x64 AVX2 CLMUL
Memory usage    : Auto (88254 MB available), Fast SSD

filename is invalid, UjPGgFjavolplR8geCekcqwXz.par2

D:\Process>par2j64_less c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory  : "I:\Output\Sample\"
Recovery File   : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread      : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra       : x64 AVX2 CLMUL
Memory usage    : Auto (88315 MB available), Fast SSD

Input File count        : 56
Input File total size   : 5589315326
Input File Slice size   : 640000
Input File Slice count  : 8775
Recovery Slice count    : 878
Redundancy rate         : 10.00%
Recovery File count     : 11
Slice distribution      : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.594 sec, 3344 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
  0.0% : Creating recovery slice
matrix size = 51 KB

 read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
LOCAL_MEM_SIZE = 64 KB
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK
CreateKernel : method12

Max number of work items = 12288 (256 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 118 / 8775 (1.3%), read = 8775, skip = 0
remain = 8657, src_off = 118, src_max = 32
GPU: remain = 8625, src_off = 150, src_num = 64
GPU: remain = 8401, src_off = 374, src_num = 2100
GPU: remain = 4733, src_off = 4042, src_num = 2610
GPU: remain = 267, src_off = 8508, src_num = 151
CPU last: src_off = 8723, src_num = 52
100.0%
read   2.719 sec
write  0.812 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 332487
 1st encode 2.703 sec, 34500 loop, 7827 MB/s
 2nd encode 5.016 sec, 297987 loop, 36433 MB/s
sub-thread : total loop = 338513
 1st encode 2.703 sec, 34567 loop, 7842 MB/s
 2nd encode 5.016 sec, 303946 loop, 37161 MB/s
sub-thread : total loop = 331877
 1st encode 2.703 sec, 34537 loop, 7836 MB/s
 2nd encode 5.016 sec, 297340 loop, 36354 MB/s
sub-thread : total loop = 299671
 2nd encode 5.016 sec, 299671 loop, 36639 MB/s
sub-thread : total loop = 301957
 2nd encode 5.016 sec, 301957 loop, 36918 MB/s
sub-thread : total loop = 299104
 2nd encode 5.016 sec, 299104 loop, 36569 MB/s
sub-thread : total loop = 294000
 2nd encode 5.016 sec, 294000 loop, 35945 MB/s
sub-thread : total loop = 299195
 2nd encode 5.016 sec, 299195 loop, 36581 MB/s
sub-thread : total loop = 294271
 2nd encode 5.016 sec, 294271 loop, 35979 MB/s
sub-thread : total loop = 295973
 2nd encode 5.016 sec, 295973 loop, 36187 MB/s
sub-thread : total loop = 293247
 2nd encode 5.016 sec, 293247 loop, 35853 MB/s
gpu-thread :
 2nd encode 5.204 sec, 4324150 loop, 509592 MB/s
total  9.453 sec

Created successfully

=================================================


D:\Process>par2j64_table c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory  : "I:\Output\Sample\"
Recovery File   : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread      : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra       : x64 AVX2 CLMUL
Memory usage    : Auto (88345 MB available), Fast SSD

Input File count        : 56
Input File total size   : 5589315326
Input File Slice size   : 640000
Input File Slice count  : 8775
Recovery Slice count    : 878
Redundancy rate         : 10.00%
Recovery File count     : 11
Slice distribution      : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.609 sec, 3312 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.078 sec
  0.0% : Creating recovery slice
matrix size = 51 KB

 read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
LOCAL_MEM_SIZE = 64 KB
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
table & factor buf : 297244 Bytes (max 8775 factors), OK
CreateKernel : method12

Max number of work items = 12288 (256 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 116 / 8775 (1.3%), read = 8775, skip = 0
remain = 8659, src_off = 116, src_max = 32
GPU: remain = 8627, src_off = 148, src_num = 64
GPU: remain = 8339, src_off = 436, src_num = 1667
GPU: remain = 5040, src_off = 3735, src_num = 2410
GPU: remain = 198, src_off = 8577, src_num = 96
CPU last: src_off = 8737, src_num = 38
100.0%
read   2.672 sec
write  0.797 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 395911
 1st encode 2.641 sec, 33917 loop, 7876 MB/s
 2nd encode 5.844 sec, 361994 loop, 37988 MB/s
sub-thread : total loop = 392710
 1st encode 2.656 sec, 33947 loop, 7838 MB/s
 2nd encode 5.829 sec, 358763 loop, 37746 MB/s
sub-thread : total loop = 387111
 1st encode 2.656 sec, 33984 loop, 7847 MB/s
 2nd encode 5.844 sec, 353127 loop, 37057 MB/s
sub-thread : total loop = 355230
 2nd encode 5.829 sec, 355230 loop, 37374 MB/s
sub-thread : total loop = 353143
 2nd encode 5.829 sec, 353143 loop, 37154 MB/s
sub-thread : total loop = 346927
 2nd encode 5.829 sec, 346927 loop, 36500 MB/s
sub-thread : total loop = 349860
 2nd encode 5.829 sec, 349860 loop, 36809 MB/s
sub-thread : total loop = 349024
 2nd encode 5.829 sec, 349024 loop, 36721 MB/s
sub-thread : total loop = 353101
 2nd encode 5.829 sec, 353101 loop, 37150 MB/s
sub-thread : total loop = 354732
 2nd encode 5.829 sec, 354732 loop, 37322 MB/s
sub-thread : total loop = 346610
 2nd encode 5.844 sec, 346610 loop, 36373 MB/s
gpu-thread :
 2nd encode 6.047 sec, 3720086 loop, 377287 MB/s
total  10.172 sec

Created successfully

@animetosho

I found a way to determine private memory usage in OpenCL. CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE seems to indicate how many work items are good

The multiple likely indicates the number of units in a wavefront, but GPUs prefer having multiple threads per unit to hide latency. Therefore, it's often preferable to have a number larger than the multiple.

Because Anime Tosho sets fewer work items, there was no problem in his implementation

Actually I just use CL_DEVICE_MAX_WORK_GROUP_SIZE at the moment. This results in 256 being the workgroup size on the RX570, which seems to be ideal, but Nvidia cards tend to declare 1024 as the maximum, which is often too big, so there's performance loss from that.
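
For reference, the values being discussed come from standard OpenCL host-API queries; a minimal sketch (error handling omitted; `device` and `kernel` are assumed to have been created elsewhere):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Print the two limits discussed above for one device/kernel pair. */
void print_work_group_limits(cl_device_id device, cl_kernel kernel)
{
    size_t dev_max = 0, preferred = 0;

    /* Upper bound advertised by the device (1024 on many GeForce cards,
       256 on the RX570 mentioned above). */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max), &dev_max, NULL);

    /* Work-group sizes should be a multiple of this per-kernel value. */
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);

    printf("MAX_WORK_GROUP_SIZE = %zu\n", dev_max);
    printf("PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", preferred);
}
```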

Results for new code:

par2j64_item64
CreateKernel : method4

Max number of work items = 2048 (64 * 32)
OpenCL_method = 4, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1900, src_off = 300, src_num = 380
CPU last: src_off = 2150, src_num = 50
read   0.891 sec
write  0.422 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 64*32
sub-thread : total loop = 387200
 2nd encode 12.016 sec, 387200 loop, 32349 MB/s
gpu-thread :
 2nd encode 12.219 sec, 96800 loop, 7953 MB/s
total  15.219 sec
par2j64_item128
CreateKernel : method4

Max number of work items = 4096 (128 * 32)
OpenCL_method = 4, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1870, src_off = 330, src_num = 340
GPU: remain = 60, src_off = 2140, src_num = 11
CPU last: src_off = 2151, src_num = 49
read   0.906 sec
write  0.422 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 128*32
sub-thread : total loop = 393580
 2nd encode 11.453 sec, 393580 loop, 34499 MB/s
gpu-thread :
 2nd encode 11.375 sec, 90420 loop, 7980 MB/s
total  13.719 sec
par2j64_item256
CreateKernel : method4

Max number of work items = 8192 (256 * 32)
OpenCL_method = 4, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1870, src_off = 330, src_num = 340
GPU: remain = 210, src_off = 1990, src_num = 42
CPU last: src_off = 2152, src_num = 48
read   0.890 sec
write  0.406 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 386760
 2nd encode 11.672 sec, 386760 loop, 33265 MB/s
gpu-thread :
 2nd encode 11.578 sec, 97240 loop, 8431 MB/s
total  13.906 sec
par2j64_less
CreateKernel : method4

Max number of work items = 8192 (256 * 32)
OpenCL_method = 4, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1870, src_off = 330, src_num = 340
GPU: remain = 150, src_off = 2050, src_num = 29
CPU last: src_off = 2169, src_num = 31
read   0.890 sec
write  0.391 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 389620
 2nd encode 11.265 sec, 389620 loop, 34721 MB/s
gpu-thread :
 2nd encode 11.203 sec, 94380 loop, 8457 MB/s
total  13.422 sec
par2j64_table
CreateKernel : method4

Max number of work items = 8192 (256 * 32)
OpenCL_method = 4, vram_max = 1020
partial encode = 0 / 2200 (0.0%), read = 2200, skip = 0
remain = 2200, src_off = 0, src_max = 30
GPU: remain = 2170, src_off = 30, src_num = 60
GPU: remain = 1840, src_off = 360, src_num = 306
GPU: remain = 184, src_off = 2016, src_num = 33
CPU last: src_off = 2169, src_num = 31
read   0.875 sec
write  0.407 sec

OpenCL : Ellesmere, OpenCL 2.0 AMD-APP (3584.0), 256*32
sub-thread : total loop = 396220
 2nd encode 11.329 sec, 396220 loop, 35110 MB/s
gpu-thread :
 2nd encode 11.359 sec, 87780 loop, 7757 MB/s
total  13.516 sec

Maybe I'll see if I can find out what's going on.

@Yutaka-Sawada
Owner

par2j64_item256 is faster than the previous par2j_debug_2023-12-10, but this one is still a little faster hehe: #107 (comment)

Thanks Slava46 for the additional tests. I'm not sure where the difference comes from. Such a small difference may be negligible. The time and speed shown in the debug version were not very accurate. Because I used the GetTickCount Win32 API, readings could be skewed while the CPU is busy. I should have used CPU time when all CPU cores are in use.

So I changed the function to clock in the C runtime library to measure time and speed. It seems to use the GetProcessTimes Win32 API. Even when the CPU is busy with a heavy task, it shows a relatively accurate result. However, the timer resolution is only 1/64 of a second.
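
For reference, a minimal sketch of the two Win32 timers being compared here (not the actual par2j code): GetProcessTimes sums the kernel and user CPU time of the process, while GetTickCount is wall-clock milliseconds.

```c
#include <stdio.h>
#include <windows.h>

/* CPU time (kernel + user) consumed by this process, in seconds. */
static double process_cpu_seconds(void)
{
    FILETIME creation, exit_time, kernel, user;
    ULARGE_INTEGER k, u;

    GetProcessTimes(GetCurrentProcess(), &creation, &exit_time, &kernel, &user);
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;

    /* FILETIME counts 100-nanosecond intervals. */
    return (double)(k.QuadPart + u.QuadPart) / 10000000.0;
}

int main(void)
{
    DWORD  wall_start = GetTickCount();
    double cpu_start  = process_cpu_seconds();

    /* ... do the work to be measured ... */

    printf("wall : %lu ms\n", (unsigned long)(GetTickCount() - wall_start));
    printf("cpu  : %.3f s\n", process_cpu_seconds() - cpu_start);
    return 0;
}
```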

Running 256 work items is the maximum for method12, even when the GPU supports up to 1024. I understand that reducing work items is slow on a fast GPU.

Slightly slower compared to the earlier par2j_debug_2023-12-05 version

Thanks cavalia88 for the tests. As I wrote above, I don't know why. A difference of a few percent would be acceptable.

As with GeForce GPUs, running 256 work items is good on recent Radeon GPUs. Also, putting tables in VRAM was useless for speed. It failed to show the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value of your GPU; that was my mistake. When there are multiple OpenCL devices on a PC, the query failed. I fixed this bug in the next sample.

Actually I just use CL_DEVICE_MAX_WORK_GROUP_SIZE at the moment. This results in 256 being the workgroup size on the RX570, which seems to be ideal

Oh, I see. Then my OpenCL implementation is inefficient. Thanks Anime Tosho for the information and tests. Reducing the number of registers used in the kernel was not worthwhile.

Now, I tried two changes in the next sample. I removed an if from 3 kernels (method9, method10, and method12) in my OpenCL source code. These kernels switched code paths for 1 or 2 blocks; they now always calculate 2 blocks. Though it may become slightly slower for an odd number of blocks, the difference should be negligible. I cannot see any difference in my tests (it's neither faster nor slower).
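
Roughly, the change is like the following (a hypothetical sketch in plain C rather than the real OpenCL kernels; gf16_mul is a slow bitwise GF(2^16) multiply over the PAR2 polynomial 0x1100B, included only so the example is self-contained):

```c
#include <stddef.h>
#include <stdint.h>

/* Slow reference GF(2^16) multiply; the real kernels use table lookups. */
static uint16_t gf16_mul(uint16_t a, uint16_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++)
        if (b & (1u << i))
            r ^= (uint32_t)a << i;
    for (int i = 31; i >= 16; i--)        /* reduce modulo 0x1100B */
        if (r & (1u << i))
            r ^= 0x1100Bu << (i - 16);
    return (uint16_t)r;
}

/* Old style: one code path handles either 1 or 2 parity blocks,
   so every iteration carries a branch on 'two_blocks'. */
void encode_1or2(const uint16_t *src, uint16_t *par1, uint16_t *par2,
                 uint16_t f1, uint16_t f2, size_t n, int two_blocks)
{
    for (size_t i = 0; i < n; i++) {
        par1[i] ^= gf16_mul(src[i], f1);
        if (two_blocks)
            par2[i] ^= gf16_mul(src[i], f2);
    }
}

/* New style: always compute 2 blocks.  For an odd block count the caller
   points par2 at a scratch buffer, so the extra work is wasted but harmless. */
void encode_2(const uint16_t *src, uint16_t *par1, uint16_t *par2,
              uint16_t f1, uint16_t f2, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        par1[i] ^= gf16_mul(src[i], f1);
        par2[i] ^= gf16_mul(src[i], f2);
    }
}
```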

The other change is showing CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE to see the difference. Three kernels (method9, method10, and method12) may become slow on some (cheap or old) GPUs. To see when they happen to be slow, I added a debug output like the following;

Testing another kernel
CreateKernel : method4
WORK_GROUP_SIZE = 256
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32
ReleaseKernel, OK

CreateKernel : method12
WORK_GROUP_SIZE = 256
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 16

In this sample output, it compares method4 (calculates a single block) and method12 (calculates 2 blocks at once). While both kernels support a maximum of 256 work items, the preferred multiple differs. This may indicate that the kernel is heavy for the GPU.

I put the sample (par2j_debug_2023-12-12.zip) in the "MultiPar_sample" folder on OneDrive. It will always try to calculate 2 blocks at once. Though I'm not sure of the effect of removing the if from the code, it won't become slower. It may also print a slightly faster speed, because I changed the timer function. If your GPU has fewer registers for the kernel, the values in the debug output may differ, as in my example above. Please test it and post the results. As the year's end is close, there is no need to hurry.

@Slava46

Slava46 commented Dec 12, 2023

par2j_debug_2023-12-12
par2j64_always2:
GPU thread: 188408 MB/s; 06:07 - the max GPU speed here.

Also better CPU speed per thread; all threads are more stable and above 11000 MB/s (11009-11660 MB/s), versus 10551-11522 MB/s here: #107 (comment) and in other tests.

Testing another kernel
CreateKernel : method4
WORK_GROUP_SIZE = 256
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32
ReleaseKernel, OK

CreateKernel : method12
WORK_GROUP_SIZE = 256
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

@Yutaka-Sawada
Owner

Thanks Slava46 for testing again. While the GeForce RTX 3070 supports a maximum of 1024 work items per Compute Unit (reported by CL_DEVICE_MAX_WORK_GROUP_SIZE), it can use at most 256 work items per Compute Unit for my OpenCL kernels (reported by CL_KERNEL_WORK_GROUP_SIZE). So I don't need to run more than 256 work items per Compute Unit. Because GeForce (CUDA) normally runs 32 work items at once, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32 means that there is enough private memory for them. There should be no problem on recent GPUs.

Because I use many variables in my code, the value may drop to 16 or 8 on the Intel GPU I use. When a kernel is heavy and the GPU becomes too busy, the OS seems to decide that the GPU has frozen. I rarely saw the GPU freeze and reset when it calculates 2 blocks at once. It may be difficult to support slow GPUs. Anyway, a user will enable GPU acceleration only when he has a fast graphics board.
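
The "Max number of work items" figure in the logs (for example 12288 = 256 * 48 on the gfx1100) can be derived from these queries; a minimal sketch, not necessarily the exact par2j logic:

```c
#include <CL/cl.h>

/* Global work size = (work items per Compute Unit) * (number of CUs),
   capping the per-CU size at what this particular kernel allows. */
size_t max_work_items(cl_device_id device, cl_kernel kernel)
{
    size_t  dev_max = 0, krn_max = 0;
    cl_uint units = 0;

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max), &dev_max, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(units), &units, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(krn_max), &krn_max, NULL);

    /* RTX 3070 example from above: dev_max = 1024, but krn_max = 256 */
    size_t per_cu = (krn_max < dev_max) ? krn_max : dev_max;
    return per_cu * units;
}
```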

@Yutaka-Sawada
Owner

Because I could not write light code for old GPU devices, I simply avoid the (mostly) slow ones. I implemented a simple method to select which OpenCL kernel to use. It checks the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value of each kernel.

method2  : 4-byte memory access
method4  : 16-byte memory access
method10 : 4-byte memory access, and calculate 2 blocks at once
method12 : 16-byte memory access, and calculate 2 blocks at once

Possible theoretical speed:
(slow) method2 < method4 < method10 < method12 (fast)

Number of using registers:
(few = light) method2 < method10 < method4 < method12 (many = heavy)

First, it checks the method2 kernel to learn the lightest state. The value may be 32 for an Nvidia GeForce GPU and 64 for an AMD Radeon GPU.

Order of testing kernels:
(first) method2 -> method12 -> method10 -> method4 (last)

Then it checks the other methods' values. If a faster method's value is the same, it uses that method. If the faster method's value is smaller, it tests the next method. The result looks like the following on my Intel GPU;

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 16

Testing method10
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method10

Though method12 is slightly faster than method10 on my PC, they are almost the same speed, so the automatic selection should be acceptable. If there is no problem with this way of selecting, I will adopt it for version 1.3.3.2.
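
In code form, the selection flow might look roughly like this (a sketch under assumptions, not the actual par2j source; the multiples array stands in for the values queried from each kernel):

```c
#include <stddef.h>

/* 'multiples' holds the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE values
   reported for method2, method12, method10 and method4, in that test order. */
const char *select_method(const size_t multiples[4])
{
    static const char *name[4] = { "method2", "method12", "method10", "method4" };
    size_t baseline = multiples[0];        /* method2 = lightest kernel */

    for (int i = 1; i < 4; i++) {
        if (multiples[i] >= baseline)      /* not reduced: kernel fits the GPU */
            return name[i];
        /* reduced multiple: too many registers, try a lighter kernel */
    }
    return name[0];                        /* fall back to method2 */
}
```

With the Intel GPU values shown above (method2 = 32, method12 = 16, method10 = 32), this returns method10, matching the log.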

Comparison of GPU-thread speed on Intel Core i5-10400 (UHD Graphics 630).
method2  : 7422 MB/s
method4  : 7778 MB/s
method10 : 11686 MB/s
method12 : 12201 MB/s

I put the sample (par2j_debug_2023-12-16.zip) in the "MultiPar_sample" folder on OneDrive. It will select method12 (heavy but possibly fast) for recent fast GPUs. Comparing CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE values should work well for unknown GPUs. If someone is interested in the flow of the automatic selection, he may try the debug version. It's also possible to select the OpenCL kernel manually by setting the lc option on the command line.

@Slava46

Slava46 commented Dec 17, 2023

par2j_debug_2023-12-16
par2j64_compare: gpu-thread: 221110 MB/s; 05:59

Selected device = NVIDIA GeForce RTX 3070
src buf : 2096952 KB (1551 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 2704 KB (2768896 Bytes), OK
factor buf : 6204 Bytes (1551 factors), OK

We have increased GPU speed here.

But I tried a few more times with the same files and got these results:
2nd test:
Selected device = NVIDIA GeForce RTX 3070
src buf : 2096664 KB (1317 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 3184 KB (3260416 Bytes), OK
factor buf : 5268 Bytes (1317 factors), OK

GPU: 187985 MB/s; 06:10

3rd test:
Selected device = NVIDIA GeForce RTX 3070
src buf : 2096664 KB (1317 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 3184 KB (3260416 Bytes), OK
factor buf : 5268 Bytes (1317 factors), OK

GPU: 187724 MB/s; 06:05

4th test:
Selected platform = NVIDIA CUDA
Selected device = NVIDIA GeForce RTX 3070
src buf : 2096664 KB (1317 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 3184 KB (3260416 Bytes), OK
factor buf : 5268 Bytes (1317 factors), OK

GPU: 184936 MB/s; 06:14

So, I see that here for the 1st test:
src buf : 2096952 KB (1551 blocks)
dst buf : 2704 KB (2768896 Bytes), OK
factor buf : 6204 Bytes (1551 factors), OK

and for the 3 others it's almost all the same:
src buf : 2096664 KB (1317 blocks), possible
dst buf : 3184 KB (3260416 Bytes), OK
factor buf : 5268 Bytes (1317 factors), OK

and the 1st test was faster (221110 MB/s vs ~187000 MB/s); I'm curious why the first test chose a different buffer than the others.

@Yutaka-Sawada
Owner

I'm curious why the first test chose a different buffer than the others.

Thanks Slava46 for the repeated tests. When all the file data cannot fit in free memory, it splits the file data and processes the pieces one by one. For example, suppose you have 12 GB of file data. When the available free memory is 10 GB, it splits into 2 pieces and processes 6 GB twice. When the available free memory is 6 GB, it splits into 3 pieces and processes 4 GB three times. Even when the file data is the same, the buffer size may differ depending on the memory available at that time. You may see lines like the following in the debug version's log.

 read all source blocks, and keep all parity blocks (GPU)
buffer size = 8929 MB, io_size = 2101232, split = 1

The split = 1 value shows how many passes were made. You should compare results that have the same split value. Normally it would not differ, unless you (or the Windows OS) use a large amount of memory for another application.
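
As a rough sketch of that splitting rule (hypothetical numbers and names; the real program reserves some headroom, so its split count can be one larger than this simple calculation):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t total_data = 12ULL << 30;   /* 12 GB of source data             */
    uint64_t usable_mem = 10ULL << 30;   /* memory usable for the buffer now */

    /* ceiling division: how many passes are needed */
    uint64_t split    = (total_data + usable_mem - 1) / usable_mem;
    uint64_t per_pass = (total_data + split - 1) / split;

    /* prints "split = 2, about 6 GB per pass" for the example above */
    printf("split = %llu, about %llu GB per pass\n",
           (unsigned long long)split, (unsigned long long)(per_pass >> 30));
    return 0;
}
```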

@cavalia88
Author

cavalia88 commented Dec 17, 2023

My results are below. Slightly slower than the earlier par2j_debug_2023-12-05 version.

par2j64_compare c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (89292 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
0.0% : Creating recovery slice
matrix size = 51 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK

Max number of work items = 12288 (256 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 108 / 8775 (1.2%), read = 8775, skip = 0
remain = 8667, src_off = 108, src_max = 32
GPU: remain = 8635, src_off = 140, src_num = 64
GPU: remain = 8379, src_off = 396, src_num = 1862
GPU: remain = 5109, src_off = 3666, src_num = 2765
GPU: remain = 328, src_off = 8447, src_num = 184
CPU last: src_off = 8727, src_num = 48
100.0%
read 2.492 sec
write 0.796 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 335312
1st encode 2.483 sec, 31840 loop, 7864 MB/s
2nd encode 5.020 sec, 303472 loop, 37074 MB/s
sub-thread : total loop = 341179
1st encode 2.482 sec, 31352 loop, 7747 MB/s
2nd encode 5.024 sec, 309827 loop, 37821 MB/s
sub-thread : total loop = 337068
1st encode 2.482 sec, 31632 loop, 7816 MB/s
2nd encode 5.023 sec, 305436 loop, 37292 MB/s
sub-thread : total loop = 302806
2nd encode 5.023 sec, 302806 loop, 36971 MB/s
sub-thread : total loop = 303065
2nd encode 5.024 sec, 303065 loop, 36995 MB/s
sub-thread : total loop = 305542
2nd encode 5.024 sec, 305542 loop, 37298 MB/s
sub-thread : total loop = 309840
2nd encode 5.023 sec, 309840 loop, 37830 MB/s
sub-thread : total loop = 299052
2nd encode 5.025 sec, 299052 loop, 36498 MB/s
sub-thread : total loop = 294249
2nd encode 5.023 sec, 294249 loop, 35926 MB/s
sub-thread : total loop = 301270
2nd encode 5.022 sec, 301270 loop, 36791 MB/s
sub-thread : total loop = 294812
2nd encode 5.019 sec, 294812 loop, 36024 MB/s
gpu-thread :
2nd encode 5.287 sec, 4280250 loop, 496500 MB/s
total 9.221 sec

Created successfully

@Yutaka-Sawada
Owner

Thanks cavalia88 for the test. Now it can get the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value correctly, even when there are multiple OpenCL devices on a PC. While the optimal CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value for a Radeon GPU is 64, the result shows 32 for both kernels. My OpenCL code seems to be heavy (it may use many registers), but I don't know what is wrong in my code at this time.

I found a section about the ternary operator in the "AMD OpenCL Programming Optimization Guide". It describes a specific optimization technique for AMD GPUs, and I added it to my OpenCL source code. Putting constant values in a ternary operator seems to be good: method16 becomes 10% faster on my Intel GPU. Because other methods use it only for table setup, the speed improvement may be very small. At least, it won't be slower on other non-Radeon GPUs. I put the sample (par2j_debug_2023-12-18.zip) in the "MultiPar_sample" folder on OneDrive. If there is no problem, I will adopt it for version 1.3.3.2.
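
For illustration only (hypothetical values, not the real par2j table-setup code; OpenCL C shares this syntax with C): when both arms of the ?: operator are plain constants, the compiler can usually emit a single conditional-select instruction instead of flow control.

```c
#include <stdint.h>

/* if/else form: the GPU compiler may turn this into flow control. */
uint32_t pick_branch(int cond)
{
    uint32_t v;
    if (cond)
        v = 0x0000FFFFu;
    else
        v = 0xFFFF0000u;
    return v;
}

/* Ternary form with constant arms: typically becomes one select instruction. */
uint32_t pick_select(int cond)
{
    return cond ? 0x0000FFFFu : 0xFFFF0000u;
}
```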

@cavalia88
Author

Seems about the same, or slightly slower than the previous version.

par2j64_ternary c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (87946 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
0.0% : Creating recovery slice
matrix size = 51 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK

Max number of work items = 12288 (256 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 109 / 8775 (1.2%), read = 8775, skip = 0
remain = 8666, src_off = 109, src_max = 32
GPU: remain = 8634, src_off = 141, src_num = 64
GPU: remain = 8378, src_off = 397, src_num = 1861
GPU: remain = 5141, src_off = 3634, src_num = 2807
GPU: remain = 318, src_off = 8457, src_num = 180
CPU last: src_off = 8733, src_num = 42
100.0%
read 2.512 sec
write 0.799 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 333964
1st encode 2.487 sec, 32002 loop, 7892 MB/s
2nd encode 5.048 sec, 301962 loop, 36685 MB/s
sub-thread : total loop = 336347
1st encode 2.488 sec, 31793 loop, 7837 MB/s
2nd encode 5.049 sec, 304554 loop, 36993 MB/s
sub-thread : total loop = 337154
1st encode 2.489 sec, 31907 loop, 7862 MB/s
2nd encode 5.047 sec, 305247 loop, 37092 MB/s
sub-thread : total loop = 294343
2nd encode 5.049 sec, 294343 loop, 35753 MB/s
sub-thread : total loop = 303244
2nd encode 5.048 sec, 303244 loop, 36841 MB/s
sub-thread : total loop = 301778
2nd encode 5.046 sec, 301778 loop, 36678 MB/s
sub-thread : total loop = 303756
2nd encode 5.046 sec, 303756 loop, 36918 MB/s
sub-thread : total loop = 301472
2nd encode 5.047 sec, 301472 loop, 36633 MB/s
sub-thread : total loop = 299637
2nd encode 5.047 sec, 299637 loop, 36410 MB/s
sub-thread : total loop = 292172
2nd encode 5.046 sec, 292172 loop, 35510 MB/s
sub-thread : total loop = 287843
2nd encode 5.045 sec, 287843 loop, 34991 MB/s
gpu-thread :
2nd encode 5.312 sec, 4312736 loop, 497914 MB/s
total 9.350 sec

Created successfully

@Slava46

Slava46 commented Dec 18, 2023

par2j64_ternary

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 3008 KB (3080192 Bytes), OK
factor buf : 5576 Bytes (1394 factors), OK

GPU: 183723 MB/s; 06:11

@Yutaka-Sawada
Owner

Thanks cavalia88 and Slava46 for the tests. While the special ternary-operator usage for AMD GPUs isn't slower than the normal ternary operator, it isn't faster either. The interesting point is that both users report that the old "par2j_debug_2023-12-05.zip" was slightly faster. When I tested the old debug version again on my PC, the recent code was slightly faster than the old code on my Intel GPU. I'm not sure what causes such a difference. While their ways of table setup differ, I doubt that alone causes a 4-5% speed difference.

So, I made a new debug version with the old OpenCL code (par2j64_old1205.exe), which uses the more accurate timer function. On my PC, par2j64_16byte.exe (included in par2j_debug_2023-12-05.zip) and par2j64_old1205.exe (new timer with old code) are almost the same speed. Both are slightly slower than the new par2j64_ternary.exe. But the result may differ on other graphics boards.

If par2j64_16byte.exe and par2j64_old1205.exe show different speeds, the difference may come from their timers. Theoretically, CPU time shows a faster speed on a busy PC.

If par2j64_old1205.exe is faster than par2j64_ternary.exe, there is a problem in my OpenCL code. Code that is faster on one GPU can sometimes be slower on other GPUs.

If the same par2j64_16byte.exe shows a different speed than in the previous test, the old test might have been a rare, lucky result.

I put the sample (par2j_debug_2023-12-19.zip) in the "MultiPar_sample" folder on OneDrive. If you are interested in why the old code showed a faster speed before, you may test the 3 samples and compare their results. Though I think a very small difference is negligible, I'm happy to improve things or fix problems.

@cavalia88
Author

The 3 versions in the latest file are about the same speed.

par2j64_old1205 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (89236 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
0.0% : Creating recovery slice
matrix size = 51 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK

Max number of work items = 12288 (256 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 113 / 8775 (1.2%), read = 8775, skip = 0
remain = 8662, src_off = 113, src_max = 32
GPU: remain = 8630, src_off = 145, src_num = 64
GPU: remain = 8374, src_off = 401, src_num = 1860
GPU: remain = 5170, src_off = 3605, src_num = 2848
GPU: remain = 274, src_off = 8501, src_num = 155
CPU last: src_off = 8720, src_num = 55
100.0%
read 2.606 sec
write 0.810 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 340406
1st encode 2.576 sec, 33148 loop, 7892 MB/s
2nd encode 5.007 sec, 307258 loop, 37634 MB/s
sub-thread : total loop = 338818
1st encode 2.578 sec, 33036 loop, 7859 MB/s
2nd encode 5.005 sec, 305782 loop, 37469 MB/s
sub-thread : total loop = 333444
1st encode 2.579 sec, 33030 loop, 7854 MB/s
2nd encode 5.006 sec, 300414 loop, 36803 MB/s
sub-thread : total loop = 295809
2nd encode 5.004 sec, 295809 loop, 36254 MB/s
sub-thread : total loop = 293470
2nd encode 5.009 sec, 293470 loop, 35931 MB/s
sub-thread : total loop = 297896
2nd encode 5.003 sec, 297896 loop, 36517 MB/s
sub-thread : total loop = 295736
2nd encode 5.001 sec, 295736 loop, 36267 MB/s
sub-thread : total loop = 291904
2nd encode 5.004 sec, 291904 loop, 35775 MB/s
sub-thread : total loop = 297079
2nd encode 5.002 sec, 297079 loop, 36424 MB/s
sub-thread : total loop = 296537
2nd encode 5.005 sec, 296537 loop, 36336 MB/s
sub-thread : total loop = 297442
2nd encode 5.003 sec, 297442 loop, 36461 MB/s
gpu-thread :
2nd encode 5.273 sec, 4325906 loop, 503129 MB/s
total 9.349 sec

Created successfully

par2j64_16byte c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (89223 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
cpu_num = 12, entity_num = 56, multi_read = 5
100.0% : Computing file hash
hash 1.594 sec, 3344 MB/s
100.0% : Making index file
100.0% : Constructing recovery file
write 0.062 sec
0.0% : Creating recovery slice
matrix size = 51 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
LOCAL_MEM_SIZE = 64 KB
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK
CreateKernel : method3

Max number of work items = 12288 (256 * 48)
OpenCL_method = 3, vram_max = 8775
partial encode = 113 / 8775 (1.2%), read = 8775, skip = 0
remain = 8662, src_off = 113, src_max = 32
GPU: remain = 8630, src_off = 145, src_num = 64
GPU: remain = 8406, src_off = 369, src_num = 2101
GPU: remain = 4737, src_off = 4038, src_num = 2612
GPU: remain = 205, src_off = 8570, src_num = 115
CPU last: src_off = 8717, src_num = 58
100.0%
read 2.609 sec
write 0.797 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 343652
1st encode 2.578 sec, 33079 loop, 7869 MB/s
2nd encode 4.985 sec, 310573 loop, 38208 MB/s
sub-thread : total loop = 339646
1st encode 2.563 sec, 33026 loop, 7902 MB/s
2nd encode 4.985 sec, 306620 loop, 37722 MB/s
sub-thread : total loop = 338637
1st encode 2.578 sec, 33109 loop, 7876 MB/s
2nd encode 5.000 sec, 305528 loop, 37474 MB/s
sub-thread : total loop = 302483
2nd encode 4.969 sec, 302483 loop, 37332 MB/s
sub-thread : total loop = 298422
2nd encode 4.985 sec, 298422 loop, 36713 MB/s
sub-thread : total loop = 300924
2nd encode 4.984 sec, 300924 loop, 37028 MB/s
sub-thread : total loop = 299174
2nd encode 5.000 sec, 299174 loop, 36695 MB/s
sub-thread : total loop = 297641
2nd encode 4.985 sec, 297641 loop, 36617 MB/s
sub-thread : total loop = 291483
2nd encode 5.000 sec, 291483 loop, 35752 MB/s
sub-thread : total loop = 302043
2nd encode 4.984 sec, 302043 loop, 37166 MB/s
sub-thread : total loop = 295164
2nd encode 4.985 sec, 295164 loop, 36312 MB/s
gpu-thread :
2nd encode 5.235 sec, 4295176 loop, 503180 MB/s
total 9.297 sec

Created successfully

par2j64_ternary c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample\*.*"
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada

Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (89319 MB available), Fast SSD

Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0

read_block_num = 8775
2-pass processing is selected, -12
100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
0.0% : Creating recovery slice
matrix size = 51 KB

read all source blocks, and keep all parity blocks (GPU)
buffer size = 6458 MB, io_size = 643056, split = 1
cache: limit size = 131072, chunk_size = 128640, chunk_num = 5
unit_size = 643072, cpu_num1 = 3, cpu_num2 = 12

Platform[0] = AMD Accelerated Parallel Processing
Platform version = OpenCL 2.1 AMD-APP (3584.0)

Device[0] = gfx1100
Device version = OpenCL 2.0 AMD-APP (3584.0)
MAX_MEM_ALLOC_SIZE = 20876 MB
MAX_COMPUTE_UNITS = 48
MAX_WORK_GROUP_SIZE = 256
GLOBAL_MEM_SIZE = 24560 MB

Device[1] = gfx1036
Device version = OpenCL 2.0 AMD-APP (3584.0)
HOST_UNIFIED_MEMORY = 1
MAX_MEM_ALLOC_SIZE = 30893 MB
MAX_COMPUTE_UNITS = 1
MAX_WORK_GROUP_SIZE = 256

Selected platform = AMD Accelerated Parallel Processing
Selected device = gfx1100
src buf : 6286908 KB (10011 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 1256 KB (1286144 Bytes), OK
factor buf : 35100 Bytes (8775 factors), OK

Max number of work items = 12288 (256 * 48)
OpenCL_method = 12, vram_max = 8775
partial encode = 114 / 8775 (1.2%), read = 8775, skip = 0
remain = 8661, src_off = 114, src_max = 32
GPU: remain = 8629, src_off = 146, src_num = 64
GPU: remain = 8405, src_off = 370, src_num = 2101
GPU: remain = 4768, src_off = 4007, src_num = 2651
GPU: remain = 197, src_off = 8578, src_num = 112
CPU last: src_off = 8722, src_num = 53
100.0%
read 2.610 sec
write 0.797 sec

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
sub-thread : total loop = 335259
1st encode 2.600 sec, 33493 loop, 7900 MB/s
2nd encode 4.996 sec, 301766 loop, 37043 MB/s
sub-thread : total loop = 336117
1st encode 2.599 sec, 33295 loop, 7857 MB/s
2nd encode 4.996 sec, 302822 loop, 37173 MB/s
sub-thread : total loop = 339043
1st encode 2.600 sec, 33304 loop, 7856 MB/s
2nd encode 4.996 sec, 305739 loop, 37531 MB/s
sub-thread : total loop = 301307
2nd encode 4.994 sec, 301307 loop, 37002 MB/s
sub-thread : total loop = 302117
2nd encode 4.996 sec, 302117 loop, 37086 MB/s
sub-thread : total loop = 292336
2nd encode 4.996 sec, 292336 loop, 35886 MB/s
sub-thread : total loop = 295522
2nd encode 4.996 sec, 295522 loop, 36277 MB/s
sub-thread : total loop = 291094
2nd encode 4.993 sec, 291094 loop, 35755 MB/s
sub-thread : total loop = 298258
2nd encode 4.996 sec, 298258 loop, 36612 MB/s
sub-thread : total loop = 304343
2nd encode 4.996 sec, 304343 loop, 37359 MB/s
sub-thread : total loop = 282267
2nd encode 4.996 sec, 282267 loop, 34650 MB/s
gpu-thread :
2nd encode 5.208 sec, 4326784 loop, 509511 MB/s
total 9.287 sec

Created successfully

@Slava46

Slava46 commented Dec 20, 2023

par2j_debug_2023-12-19
par2j64_16byte: GPU: 149380 MB/s; 06:24
src buf : 2096248 KB (1162 blocks), possible
dst buf : 3608 KB (3694592 Bytes), OK
factor buf : 4648 Bytes (1162 factors), OK
CreateKernel : method3

par2j64_old1205: 142308 MB/s; 06:39
src buf : 2096248 KB (1162 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 3608 KB (3694592 Bytes), OK
factor buf : 4648 Bytes (1162 factors), OK

par2j64_ternary: 148345 MB/s; 06:26
src buf : 2096248 KB (1162 blocks), possible

Testing method2
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 3608 KB (3694592 Bytes), OK
factor buf : 4648 Bytes (1162 factors), OK

All show lower speeds.

@Yutaka-Sawada
Owner

Thanks cavalia88 and Slava46 for the additional tests. It seems that a small change in table setup doesn't affect total speed when the GPU is fast enough. A small speed difference (a few percent, like +-2%) might be normal. Anyway, the timer isn't very accurate.

gpu-thread speed from tests on Radeon 7900XTX by cavalia88:

unit_size = 628 KB
(Max VRAM usage = 628 KB * 8,775 = 5,381 MB)
par2j64_16byte (12-05) : 521282 MB/s
par2j64_16byte (12-19) : 503180 MB/s
par2j64_old1205(12-19) : 503129 MB/s
par2j64_compare(12-17) : 496500 MB/s
par2j64_ternary(12-18) : 497914 MB/s
par2j64_ternary(12-19) : 509511 MB/s

I sorted the results by their known unit_size in Slava46's tests. As he wrote before, a smaller buffer size seems to be faster for GPU calculation. When the buffer size is small, unit_size becomes smaller too, so it may hit the VRAM cache more often on Nvidia GeForce GPUs. Because recent GeForce GPUs contain large cache memory, considering cache usage would be good.

gpu-thread speed from tests on GeForce RTX 3070 by Slava46

unit_size = 1,804 KB
(Max VRAM usage = 1,804 KB * 1,162 = 2,047 MB)
par2j64_16byte (12-21) : 149380 MB/s
par2j64_old1205(12-21) : 142308 MB/s
par2j64_ternary(12-21) : 148345 MB/s

unit_size = 1,592 KB
(Max VRAM usage = 1,592 KB * 1,317 = 2,047 MB)
par2j64_item256(12-11) : 180293 MB/s
par2j64_compare(12-17) : 187985 MB/s
par2j64_compare(12-17) : 187724 MB/s
par2j64_compare(12-17) : 184936 MB/s

unit_size = 1,504 KB
(Max VRAM usage = 1,504 KB * 1,394 = 2,047 MB)
par2j64_ternary(12-19) : 183723 MB/s

unit_size = 1,352 KB
(Max VRAM usage = 1,352 KB * 1,551 = 2,047 MB)
par2j64_compare(12-17) : 221110 MB/s

@Yutaka-Sawada
Owner

Though I tried to improve VRAM cache usage, I could not succeed. As I'm tired after many trials, I want to finish the GPU optimization. Thanks to the testers for their long-term help. I made a sample version to test the behavior and posted an alpha version of v1.3.3.2 on GitHub. I put the current sample (par2j_sample_2023-12-26.zip) in the "MultiPar_sample" folder on OneDrive. If there is no problem, I will release the next version next year.

@Slava46

Slava46 commented Dec 26, 2023

par2j_sample_2023-12-26

par2j64_1329: 14:19 - very slow
par2j64_1331: GPU: 143534 MB/s; 06:26
src buf : 2096664 KB (1317 blocks), possible
dst buf : 1592 KB (1630208 Bytes), OK
factor buf : 2634 Bytes (1317 factors), OK
CreateKernel : method2

Max number of work items = 11776 (256 * 46)
OpenCL_method = 2, vram_max = 1317

par2j64_1332: GPU: 179446 MB/s; 06:19

src buf : 2096664 KB (1317 blocks), possible

Testing method12
PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 32

Selected method12
dst buf : 3184 KB (3260416 Bytes), OK
factor buf : 5268 Bytes (1317 factors), OK

Max number of work items = 11776 (256 * 46)
OpenCL_method = 44, vram_max = 1317

par2j64 main: 06:09; I don't see the GPU speed in the log.

@Yutaka-Sawada
Owner

Thanks Slava46 for confirmation. I will release v1.3.3.2 soon.
