Feedback on v1.3.3.1 : GPU not being utilized #107
Comments
To support using more threads, the flags of the `lc` option were changed. When the GPU is used, it prints a line near the end like;

OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48

At v1.3.3.0 or later, set 256 (instead of the old 32) in the `lc` option to enable the GPU.
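For example, the command line from the first post, with -lc32 changed to -lc256;

```
par2j64 c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample*.*"
```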
Thank you for the clarification. I managed to activate the GPU mode with lc256. With GPU mode and 12 threads, it took about 20 seconds. Without GPU and 24 threads, it took about 15 seconds. It seems that pure CPU is still faster than the GPU (even though I have an AMD Radeon 7900XTX). In GPU mode, I find that my GPU is barely used, with the GPU load not even rising above 1%. Is there something more that can be done to optimize the GPU processing? It seems most of the work is still being done on the CPU side. Do we have an option to measure the time used to complete the par file generation? That would make benchmarking the different settings easier.
Parchive 2.0 client version 1.3.3.1 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
100.0% : Computing file hash
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48

Parchive 2.0 client version 1.3.3.1 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
100.0% : Computing file hash
Created successfully
Because I used an NVIDIA GeForce graphics board to test OpenCL speed, my implementation is optimized for the GeForce series. It's known to be slow on AMD GPUs. Though there may be a technique to improve speed for AMD, I cannot test it myself. Some AMD users helped development and tested some methods a while ago, but I could not get a good result. If you want to help, I will try again. But I don't know what is good for AMD GPUs at this time.
No, such an option doesn't exist. But I made debug versions to see behavior and speed. I put the package (par2j_debug_2023-11-19.zip) in the "MultiPar_sample" folder on OneDrive. If you want to see internal calculation speed, please use them.
The calculation power of a GPU is mostly less than that of a recent multi-core CPU. If the GPU's power is less than 13 CPU threads' worth, using 24 CPU threads may be faster than 12 threads (11 CPU threads plus the GPU thread). You should compare "with GPU mode and 24 threads" against "without GPU and 24 threads".
lc24: "24 threads" = 14.77 seconds. Not much difference between "24 threads" and "GPU + 24 threads". Most of the work is done by the multi-core CPU; very little seems to be offloaded to the GPU. I'm not familiar with programming or OpenCL, but if you need someone to help test on an AMD Radeon 7900XTX, I'm happy to help out.
Thank you for the help. I want to test some methods in my OpenCL implementation. There are 2 ways to improve speed: (1) data transfer, and (2) calculation on the GPU.
At first, I try (1) data transfer. Currently, I use an OpenCL buffer with the CL_MEM_USE_HOST_PTR flag, and I don't know how the AMD OpenCL driver acts with it. I made 2 samples: one copies the data onto VRAM with the CL_MEM_COPY_HOST_PTR flag (the "VRAM" version), and another uses pinned host memory (the "PIN" version). I put the package (par2j_debug_2023-11-20.zip) in the "MultiPar_sample" folder on OneDrive. Please test these 3 modes on the same data set and compare their speeds. When some users tested on their graphics boards a while ago, there was no remarkable difference. But I don't know what happens on the latest AMD GPUs.
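Roughly, the modes look like this with the OpenCL API (a sketch only: which flag each debug sample actually uses is my guess from the version names, and error handling is omitted):

```c
/* Assumes cl_context ctx, cl_command_queue queue, void *host_ptr,
   size_t size and cl_int err already exist. */
#include <CL/cl.h>

/* (a) v1.3.3.1 behavior: the driver may read source data from host RAM */
cl_mem buf_use = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                size, host_ptr, &err);

/* (b) "VRAM" sample: copy the data into device memory at creation time */
cl_mem buf_vram = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 size, host_ptr, &err);

/* (c) "PIN" sample: driver-allocated (pinned) host memory, written explicitly */
cl_mem buf_pin = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);
err = clEnqueueWriteBuffer(queue, buf_pin, CL_TRUE, 0, size, host_ptr,
                           0, NULL, NULL);
```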
Test results attached below. Looks like the VRAM version is the fastest (around 11 seconds), the PIN version and 1331 are about the same (around 15 seconds), and 1329 is the slowest (around 57 seconds).
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully

Parchive 2.0 client version 1.3.3.1 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully

Parchive 2.0 client version 1.3.2.9 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep some parity blocks
Created successfully
PC's RAM (CL_MEM_USE_HOST_PTR) : 25340 MB/s

Thank you for the test. The difference is much bigger than I thought: the VRAM version is 8 times faster than the others. This might be why my OpenCL implementation was slow on AMD GPUs. I understand that AMD's OpenCL driver doesn't make a cache (copy data automatically) on VRAM when CL_MEM_USE_HOST_PTR is used.
I found that copying data onto VRAM is the way to go for AMD GPUs. Next, I try to improve the speed of calculation on the GPU. 16-byte memory access seems to be good for AMD GPUs. (On the other hand, 4-byte access would be good for NVIDIA GPUs.) Now, I use vector data types in OpenCL to support 16-byte memory access. I made 2 samples: a 4-byte data type (uchar4) and a 16-byte data type (uchar16). On my PC's Intel GPU, using uchar4 is slightly (2%) faster than the old implementation, and using uchar16 is 12% faster. By using vector data types, I could simplify my source code, so it may be good for NVIDIA GPUs too. (However, I'm not sure.) Though uchar16 is faster than uchar4, it requires more local memory. (It may not work well on old graphics boards.) The uchar16 speed may come from fewer loop iterations. I put the package (par2j_debug_2023-11-22.zip) in the "MultiPar_sample" folder on OneDrive. Please test these 3 methods on the same data set. I don't know which is faster on recent discrete GPUs; calculation properties differ between NVIDIA and AMD GPUs. When there is no big difference, I prefer the simple implementation.
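As an illustration of the access width (these toy kernels only XOR data; the real kernels do the Reed-Solomon multiply, and all names here are made up):

```c
// OpenCL C: each work item moves 4 bytes per load/store
__kernel void xor_4byte(__global const uchar4 *src, __global uchar4 *dst)
{
    size_t i = get_global_id(0);
    dst[i] ^= src[i];
}

// OpenCL C: each work item moves 16 bytes per load/store,
// so the same data needs 4x fewer loop iterations
__kernel void xor_16byte(__global const uchar16 *src, __global uchar16 *dst)
{
    size_t i = get_global_id(0);
    dst[i] ^= src[i];
}
```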
The fastest version is the 16 byte one, followed by the auto version, and then the 4 byte version. Results below:
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Oh, I see. Thank you for confirming the property. While I had read an OpenCL optimization guide, I didn't expect so much difference: 16-byte memory access is around 65% faster on AMD GPUs. But it's not so fast on Intel or NVIDIA GPUs. It took a long time to put in an old graphics board, install its driver, and test the OpenCL behavior. When I tested the GeForce GPU, 16-byte access was very slow. After the test, I needed to restore the Intel GPU, re-install the old driver, and install the new driver again.

AMD Radeon 7900XTX (64 KB local memory);
Intel UHD Graphics 630 (64 KB local memory);
NVIDIA GeForce GT 240 (16 KB local memory);

I think the slowness of the GeForce GPU may come from its small local memory size. (This local memory is different from VRAM.) While I tested on the Intel GPU, using more local memory became slower. Traditionally, NVIDIA GeForce GPUs have less local memory than AMD GPUs: while AMD GPUs have 32 ~ 64 KB, NVIDIA GPUs have 16 ~ 48 KB. When there is enough local memory on the GPU, 16-byte memory access will be good. I will check the local memory size and change functions (4-byte or 16-byte). But I'm not sure how much local memory is required. Maybe 32 KB would be enough, as AMD GPUs have that amount. If it becomes slow on an NVIDIA GPU, users will report the problem later.

When I read your test log, I found that the GPU's first task was not at full speed. The trial task size might be too small for the Radeon 7900XTX's calculation power. The max was set to 2 times the CPU's task size. (That task size was enough for a GeForce RTX 2060 or GeForce RTX 3070 before.) I made a new sample with a 3 times larger limit. I put the package (par2j_debug_2023-11-24.zip) in the "MultiPar_sample" folder on OneDrive. Please test it on the same data set. If there is no difference, I will return to the 2 times setting.
The 16byte3 version is slightly slower than the 16 byte version.
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully
Is there any point in me testing those versions with the usual GUI?
Thanks cavalia88 for the test. Even when a GPU is a slow starter, task management seems to work well enough. I returned the GPU thread's initial task size to 2 times the CPU thread's. Now, the OpenCL optimization for AMD GPUs was successful.
At last, I could largely improve speed for AMD GPUs. 13 times faster!! Using the CL_MEM_COPY_HOST_PTR flag is much faster than CL_MEM_USE_HOST_PTR on AMD GPUs. This may come from the behavior of AMD's OpenCL driver. Using uchar4 (4-byte vector data type) isn't fast (almost the same). This may come from the design of the graphics board. Using 16-byte memory access (uint4 and uchar16 vector data types) is fast. This is because AMD GPUs have a read cache for VRAM.
I tested a slow GPU: the integrated one in my Intel Core i5-10400 CPU. Because the VRAM speed of an integrated GPU is the same as system RAM, copying data to VRAM is useless on Intel GPUs. Though using uchar4 is slightly faster, the difference is mostly ignorable. Using 16-byte memory access is faster, however the difference isn't as big as on AMD GPUs.
I tested an old GeForce graphics board. Using CL_MEM_COPY_HOST_PTR is the same speed as the CL_MEM_USE_HOST_PTR flag on NVIDIA GPUs. This is because NVIDIA's OpenCL driver makes a cache of the data on VRAM automatically. Using uchar4 (4-byte vector data type) isn't fast, same as on AMD GPUs. Data conversion between vectors and 1-byte scalars seems to be slow on most GPUs. However, 16-byte memory access is very slow. The slowness may come from the small local memory size.
Oh, thanks Slava46 for the offer. While I tested an old GeForce GPU, recent high-end GPUs may differ. I made a package which contains those 4 samples. I put the sample (par2j_debug_2023-11-25.zip) in the "MultiPar_sample" folder on OneDrive. Though they are debug versions, it's possible to use them from the MultiPar GUI, too. By enabling the log, the result is saved to a log file.
For sure, OpenCL specs from the log:

Device[0] = NVIDIA GeForce RTX 3070
Selected platform = NVIDIA CUDA
OpenCL : NVIDIA GeForce RTX 3070, OpenCL 3.0 CUDA, 256*46

GPU enabled, CPU high; the same 70 GB of test files as in previous tests.

par2j64_VRAM: GPU thread: 129142 MB/s; 06:37

Seems par2j64_16byte is faster than the other methods.

P.S. So, I compared the time with my last best results here: #99 (comment) and here: #99 (comment). MultiPar_sample_2023-10-25: GPU enabled, CPU high: 06:35. Now it's a little faster again; the hardware is the same (files, SSD, etc.), just the newest Windows version and the latest NVIDIA drivers since those tests.
Thanks Slava46 for the tests. Because recent GPUs have enough local memory, 16-byte memory access seems to be faster. I will use the function in the next v1.3.3.2.
One strange point is the slowness of the CL_MEM_COPY_HOST_PTR flag. Explicit copying seems to be slightly slower than automatic caching on recent NVIDIA GPUs. (However, the difference is ignorable in total calculation time.) Maybe it comes from the difference between "big copy at first" and "consecutive background copy". But I don't know how NVIDIA GPU's caching system works. Anyway, using CL_MEM_USE_HOST_PTR (same as v1.3.3.1) may be good for NVIDIA GPUs. (Because it's slow on AMD GPUs, I need to switch flags.) To test the case of the CL_MEM_USE_HOST_PTR flag with 16-byte memory access, I made a new sample. I put the sample (par2j_debug_2023-11-26.zip) in the "MultiPar_sample" folder on OneDrive. When you have time, please test it. If the difference is noticeable, I will change the flag for NVIDIA GPUs.
Also, the difference for the Radeon 7900XTX is so big because in the first place its speed was too low and something was wrong, while for NVIDIA it was already fine. So you fixed it for the Radeon 7900XTX, and because of that the difference is so big, while for NVIDIA it's just a 21% increase. But still, 21% faster is a good result.
par2j64_Host16byte: GPU thread: 185141 MB/s; 06:07

So, you made it faster again, hehe: 32% faster than par2j64_1331 and ~9% faster than yesterday's par2j64_16byte.
Nice to see that even the Nvidia cards are able to see improvements in GPU speeds. I was monitoring the task manager for GPU load. The 7900XTX load was at most 5% whilst running the latest par2j64_16byte version. Seems like we have not tapped the full potential of the card yet; the CPU still seems to do quite a bit of the processing. Nonetheless, I'm already very happy that we have seen such a big increase in AMD GPU speeds.

Edit: I realized I was looking at the compute utilization for the AMD onboard iGPU. When I checked the compute utilization for the Radeon 7900XTX, it was indeed above 90%. So all good.
@Yutaka-Sawada Hi, I'm the developer of Realbench and I can tell you OpenCL performance suffers on Nvidia cards in general. You need to use CUDA for those cards, as Nvidia is specifically crippling their performance under OpenCL (for the obvious reasons). @cavalia88 You will not see compute loads (CUDA/OpenCL) under GPU load in Task Manager or mainstream tools. That 5% you see is not representative of the actual compute load; the metric is for normal graphics loads, not compute.
Thanks Slava46 for testing again. The point changed for AMD GPUs was bad for NVIDIA GPUs. Sometimes optimizations for NVIDIA GPUs and AMD GPUs differ. The next v1.3.3.2 will recognize them and select the faster method automatically. I posted an alpha version of v1.3.3.2 on GitHub. I put the current sample (par2j_debug_2023-11-27.zip) in the "MultiPar_sample" folder on OneDrive. When someone wants to test their GPU, they may try them. If there is a problem, I will change more.
I'm glad, too. Though I don't use a (noisy) graphics board on my PC, you helped other GPU users. When some AMD users helped my development a while ago, I could not succeed. This time was a good chance for a retry. Thanks cavalia88 for the new optimization trials.
Thanks Nodens for the advice and helpful information. I use OpenCL for general usage on most GPUs. It's difficult to make a CUDA implementation without a real graphics board. I don't want to put a fast GeForce graphics board in my PC, because it's noisy. I may try, if NVIDIA releases a fanless silent GPU.
So for par2j_debug_2023-11-27 there is nothing new that needs testing?
Agreed with that: using CUDA for NVIDIA cards would increase performance a lot. I can test things if you'll add CUDA support.
Actually, there are modern GPUs that can be pretty silent. I can't say my MSI 3070 Gaming Z is noisy. For example, there is the ASUS GeForce RTX 3070 Noctua Edition, created for really silent GPU cooling; I read some reviews and it should be pretty silent. It also depends on the PC case you're using, of course, water cooling or usual cooling, etc.
Thank you, but you don't need to test. It just selects the faster method for the device. Only when you want to confirm that the selection is correct, you may try the debug version (par2j64_1332.exe) to see which method is selected in its output.
Tried par2j64_1332; looks fine to me.
This is exactly the problem. Nvidia cards could easily dominate OpenCL performance, but Nvidia wants to lock the compute market (scientific/research and now ML/ANN) into CUDA, exactly because AMD is not an option for CUDA. They want to avoid just that: single OpenCL implementations that make AMD a viable choice. Hence their drivers artificially cripple OpenCL performance so the only way to go is CUDA. As you can see in your tests, the performance of high-end cards is abysmal compared to AMD cards on the OpenCL implementation. This is intended. For testing, I suggest getting a cheap 1030. It comes fanless, with a passive heatsink, and in low-profile bracket form too. It's slow, but for testing it's fine. :) They go for like 80 EUR new. Can probably find a used one for 40ish.
Thank you for confirmation.
Oh, I see. But I don't plan to implement a CUDA version at this time, sorry.
No problem. The only reason I joined this conversation was that I read your notes about pushing for performance increases, and I wanted to warn you that there's no way to squeeze any real performance out of Nvidia cards with OpenCL, hoping to perhaps save you some frustration in the process. It was not a request for a CUDA implementation.
@nodens what about Vulkan? It works on all vendors, and I can't see Nvidia hemorrhaging the performance of one of the most popular APIs. @Yutaka-Sawada another idea is to pack multiple values into the lookup table. The classic lookup algorithm only fetches 16-bit products, but if 32-bit access is just as efficient, you could pack two products into each lookup entry (I'm surprised no one has tried doing this before).
I'm not sure how they are. Currently, I use 2 composite tables (High and Low, with 256 entries each) for the 16-bit multiply. Basically, the method is the same as par2cmdline. If you mean vector data types (such as ushort2, ushort4, ushort8), most GPUs may be slow to handle vector values. I'm afraid the GPU seems to calculate each element one by one for a packed vector value. (But this is just my experience.) For example, when I write one line of vector math, the GPU may calculate it element by element, as below;
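Something like this (an illustration only; a and b are assumed ushort4 values):

```c
// OpenCL C: one line of vector math
ushort4 d = a * b;

// may be executed by the GPU as four scalar operations:
// d.s0 = a.s0 * b.s0;
// d.s1 = a.s1 * b.s1;
// d.s2 = a.s2 * b.s2;
// d.s3 = a.s3 * b.s3;
```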
While vector data looks simple in source code, the calculation cost is still heavy. Using vector types doesn't improve speed at all. From "OpenCL for NVIDIA GPUs" by Chris Lamb;
Not quite. The typical approach is to do two lookups for one 16-bit multiply. For example, if we have one source block and two recovery blocks, the process looks something like:
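In C-like pseudocode (a sketch; lo1/hi1 and lo2/hi2 are per-coefficient 256-entry tables, and all names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Tables for coefficient c1 (recovery block 1) and c2 (recovery block 2):
   loN[x] = cN * x, hiN[x] = cN * (x << 8) in GF(2^16). */
uint16_t lo1[256], hi1[256], lo2[256], hi2[256];

void process_classic(const uint16_t *src, uint16_t *rec1, uint16_t *rec2,
                     size_t words)
{
    for (size_t i = 0; i < words; i++) {
        uint16_t s = src[i];
        rec1[i] ^= lo1[s & 0xFF] ^ hi1[s >> 8];  /* 2 lookups for block 1 */
        rec2[i] ^= lo2[s & 0xFF] ^ hi2[s >> 8];  /* 2 more for block 2 */
    }
}
```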
This approach does two lookups for two 16-bit multiplies:
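As a sketch (gf16_mul is an assumed GF(2^16) multiply helper): each 32-bit table entry holds c1's product in the low half and c2's in the high half. XOR never carries across the halves, so one pair of lookups feeds both recovery blocks:

```c
#include <stdint.h>
#include <stddef.h>

extern uint16_t gf16_mul(uint16_t a, uint16_t b);  /* assumed GF(2^16) multiply */

uint32_t plo[256], phi[256];

void build_packed_tables(uint16_t c1, uint16_t c2)
{
    for (int x = 0; x < 256; x++) {
        plo[x] = gf16_mul(c1, (uint16_t)x)
               | ((uint32_t)gf16_mul(c2, (uint16_t)x) << 16);
        phi[x] = gf16_mul(c1, (uint16_t)(x << 8))
               | ((uint32_t)gf16_mul(c2, (uint16_t)(x << 8)) << 16);
    }
}

void process_packed(const uint16_t *src, uint16_t *rec1, uint16_t *rec2,
                    size_t words)
{
    for (size_t i = 0; i < words; i++) {
        uint32_t p = plo[src[i] & 0xFF] ^ phi[src[i] >> 8];  /* 2 lookups */
        rec1[i] ^= (uint16_t)p;          /* c1 * src[i] */
        rec2[i] ^= (uint16_t)(p >> 16);  /* c2 * src[i] */
    }
}
```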
Since 32-bit is as efficient as 16-bit, this should halve the number of lookups needed.
Oh, it's a nice idea. Thank you for the detailed explanation and example. Because ParPar calculates 2 recovery blocks at once, it may be easy to implement there. CPU calculation will become faster when SSSE3 or AVX2 are unavailable. The only problem in the OpenCL implementation for GPUs is that it requires more local memory. When there is enough local memory on a GPU, it will be fast. Otherwise, it may be slow.
In the latest version, par2j64_16byte and par2j64_auto are about the same speed. par2j64_4byte is slightly slower, but all are faster than the previous debug versions. par2j64_16byte = 9.000s
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully

par2j64_4byte c -rr10 -ss640000 -lc256 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample*.*"
Base Directory : "I:\Output\Sample"
filename is invalid, UjPGgFjavolplR8geCekcqwXz.par2

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully

Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully
Looks the same as before to me, with the same files.
Thanks cavalia88 and Slava46 for testing the new method. It's interesting that the AMD GPU becomes much faster (50% speed up), while there is no difference on the NVIDIA GPU. Because the 16-byte access version is always faster than the 4-byte access version, both should have enough private memory (number of registers). This difference may come from the access speed of local memory or an effective caching system. When table lookup speed is very fast already, reducing the number of lookups would be worthless in total calculation time. This might be why Anime Tosho could not get an impressive result in his tests.
When there is no big difference, I prefer the simple implementation. I may switch functions for NVIDIA and AMD GPUs: it would calculate 2 blocks at once for AMD GPUs or integrated GPUs, while using the classic method for NVIDIA's discrete GPUs. Because it distinguishes NVIDIA GPUs already, it's possible. I will try to make a new automatic selection mechanism.
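A sketch of how such a vendor switch might look (how par2j actually detects NVIDIA is not shown in this thread, so this is only a guess):

```c
#include <CL/cl.h>
#include <string.h>

/* Pick the kernel variant by vendor: 2-blocks-at-once for AMD/integrated
   GPUs, the classic single-block method for NVIDIA discrete GPUs. */
int use_two_block_kernel(cl_device_id dev)
{
    char vendor[256] = {0};
    clGetDeviceInfo(dev, CL_DEVICE_VENDOR, sizeof(vendor) - 1, vendor, NULL);
    return strstr(vendor, "NVIDIA") == NULL;  /* non-NVIDIA: 2 blocks at once */
}
```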
Actually, I'm deferring de-interleave until the end, so it should just always be better. But I'm seeing little difference in my own tests. Your tests, however, generally show better performance, except for the weird case of the RX570 being oddly slow. Using a 2200MB source file with the same arguments throughout:

2023-11-27 : AMD RX570
16byte
Host16byte
VRAM
1332
2023-12-05 : AMD RX570
4byte
16byte
2023-11-27 : Nvidia GTX 960
16byte
Host16byte
VRAM
1332
2023-12-05 : Nvidia GTX 960
4byte
16byte
2023-11-27 : Intel UHD 770
16byte
Host16byte
1332
2023-12-05 : Intel UHD 770
4byte
16byte
Thanks Anime Tosho for the tests on several devices. The slow result of the Radeon RX 570 is interesting. Because the GPU was a low-priced model of its age, private memory may be lacking. While the Radeon RX 570 has 32 KB local memory, that's the minimum size for OpenCL version 1.2. Though the Radeon RX 570 is fast with 16-byte memory access, it's slow at calculating 2 blocks at once.
On the other hand, the Nvidia GeForce GTX 960 shows a different property. The GeForce GTX 960 isn't fast with 16-byte memory access, but calculating 2 blocks at once is fast. The GeForce GTX 960 has 48 KB local memory. Though that's different from private memory size, it may indicate the price level (higher-rank model).
The Intel UHD Graphics 770's property is similar to the Intel UHD Graphics 630's. The integrated GPU on a recent Intel CPU is almost the same speed as an old graphics board.
To distinguish a GPU's price level, checking local memory size may be good. Even when the age is similar, a cheap model and a higher-rank model have different speeds. For recent GPUs, calculating 2 blocks will be good (much faster, or at least similar speed). I need to improve how to select the fast method automatically.
Perhaps you can introduce a new argument that allows users to try out the different methods themselves (if they want to)? Maybe by default, method selection is always set to auto, but the user can override with /method-classic or /method-new, etc. This would allow more users to try it out for themselves and provide you with feedback on which method works best with different hardware setups.
The RX570 is much faster than the other two. And my own code seems to demonstrate such as well (I typically get >60GB/s).
Yes, I do. Thanks cavalia88 for the advice. I made new "lc" option values to test the combinations. Now I can easily change the behavior and test results. Because the MultiPar GUI doesn't support them, a user needs to test at the command line.
Oh, I see. I feel that private memory size is the reason. When an OpenCL function (kernel) uses few registers, it runs very fast; but it may be slow when the kernel uses many registers. I found a way to determine private memory usage in OpenCL. I have always set 256 work items per Compute Unit in each kernel. This might be bad for some (cheap or old) GPUs. Because Anime Tosho sets fewer work items, there was no problem in his implementation. I will need to change the number of work items by the result of the per-kernel query, as sketched below. I made some samples to test less private memory usage. From tests on my PC, setting fewer work items seems to be good (faster). Because the Intel GPU's properties may differ from other GPUs', I included other samples, too. I put the package (par2j_debug_2023-12-10.zip) in the "MultiPar_sample" folder on OneDrive. If someone is interested in GPU optimization, please test them. I will reduce private memory usage by referring to the results from others.
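In OpenCL, that per-kernel information comes from clGetKernelWorkGroupInfo; a minimal sketch (the kernel and device handles are assumed to exist):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Query per-kernel limits for a given device. */
void show_kernel_limits(cl_kernel kernel, cl_device_id device)
{
    /* Max work items per group for this kernel; the driver lowers this
       when the kernel uses many registers (private memory). */
    size_t max_items = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_items), &max_items, NULL);

    /* Private memory used by each work item of this kernel. */
    cl_ulong private_bytes = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_bytes), &private_bytes, NULL);

    printf("max work items = %zu, private mem = %llu bytes\n",
           max_items, (unsigned long long)private_bytes);
}
```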
par2j_debug_2023-12-10: for this test, par2j64_item256 has the fastest speed for me. But the speed here was faster than now: #107 (comment)
Thanks Slava46 for the tests. Recent GPUs may have enough private memory even for the heavy kernel.
This is strange. I'm not sure why there is a difference. I changed 2 points from the old source code. Though the speed is the same on my Intel GPU, it may differ on other, faster GPUs.
Of course, testing is needed to find the truth. par2j64_item256 is faster than the previous par2j_debug_2023-12-10, but this one is still a little faster, hehe: #107 (comment)

par2j64_item256:
My results: par2j_debug_2023-12-11 is slightly slower compared to the earlier par2j_debug_2023-12-05 version.
The multiple likely indicates the number of units in a wavefront, but GPUs prefer having multiple threads per unit to hide latency. Therefore, it's often preferable to have a number larger than the multiple.
Actually, I just use a fixed size in my own code. Results for the new code:

par2j64_item64
par2j64_item128
par2j64_item256
par2j64_less
par2j64_table
Maybe I'll see if I can find out what's going on.
Thanks Slava46 for the additional tests. I'm not sure where the difference comes from; such a small difference may be ignorable. The time and speed shown in the debug version were not so accurate because of the timer function I used at first; I then changed to a more precise function. Running 256 work items is the max for method12, even when the GPU supports up to 1024. I understand that reducing work items is slow on a fast GPU.
Thanks cavalia88 for the tests. As I wrote above, I don't know why; a few percent difference would be acceptable. As on GeForce GPUs, running 256 work items is good on recent Radeon GPUs. Also, putting the tables on VRAM was useless for speed; it failed to show an improvement.
Oh, I see. Then my OpenCL implementation is bad (inefficient). Thanks Anime Tosho for the information and tests. Reducing the registers used in the kernel was worthless. Now, I tried 2 points in the next sample. One is removing a piece of the kernel code. Another is showing each kernel's work group size values in the debug output.
In the sample output, it compares method4 (calculate a single block) and method12 (calculate 2 blocks at once). While both kernels support a max of 256 work items, the minimum size is different; this may indicate the kernel will be heavy on the GPU. I put the sample (par2j_debug_2023-12-12.zip) in the "MultiPar_sample" folder on OneDrive. It will always try to calculate 2 blocks at once. Though I'm not sure about the effect of the removal.
par2j_debug_2023-12-12

Also better CPU speed per thread; all threads are more stable and above 11000 MB/s (11009 ~ 11660 MB/s); previous results here: #107 (comment). And for the other tests:

Testing another kernel
CreateKernel : method12
Thanks Slava46 for testing again. While the GeForce RTX 3070 supports a max of 1024 work items per Compute Unit (known by CL_KERNEL_WORK_GROUP_SIZE), the value depends on how heavy each kernel is. While I use many variables in my code, the value may become 16 or 8 on my Intel GPU. When a kernel is heavy and the GPU becomes too busy, the OS seems to decide that the GPU is frozen. I rarely saw the GPU freeze and reset when it calculated 2 blocks at once. It may be difficult to support slow GPUs. Anyway, a user will enable GPU acceleration only when he has a fast graphics board.
Because I could not write light code for old GPU devices, I just avoid the (mostly) slow ones. I implemented a simple method to select which OpenCL kernel to use. It will check each kernel's CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value.
At first, it checks the method2 kernel to know the lightest state. The value may be 32 for an Nvidia GeForce GPU, or 64 for an AMD Radeon GPU.
Then, it checks the other methods' values. If a faster method's value is the same, it uses that method. If the faster method's value is smaller, it tests the next method. The result is shown like below on my Intel GPU;
Though method12 is slightly faster than method10 on my PC, they are almost the same speed. So the automatic selection is acceptable. If there is no problem with this way of selection, I will adopt it for version 1.3.3.2.
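A sketch of that selection flow (kernel handles and method numbers are illustrative):

```c
#include <CL/cl.h>

static size_t pref_multiple(cl_kernel k, cl_device_id dev)
{
    size_t m = 0;
    clGetKernelWorkGroupInfo(k, dev, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(m), &m, NULL);
    return m;
}

cl_kernel select_kernel(cl_device_id dev, cl_kernel k_method2,
                        cl_kernel k_method10, cl_kernel k_method12)
{
    /* method2 is the lightest kernel: its multiple (32 on GeForce,
       64 on Radeon) is the baseline for this device */
    size_t base = pref_multiple(k_method2, dev);

    /* accept the heaviest method whose multiple has not shrunk;
       a smaller multiple means the kernel is too heavy for this GPU */
    if (pref_multiple(k_method12, dev) == base) return k_method12;
    if (pref_multiple(k_method10, dev) == base) return k_method10;
    return k_method2;
}
```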
I put the sample (par2j_debug_2023-12-16.zip) in the "MultiPar_sample" folder on OneDrive. It will select method12 (heavy but possibly fast) for recent fast GPUs. Comparing CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE values should work for unknown GPUs. If someone is interested in the flow of the automatic selection, they may try the debug version. It's possible to select the OpenCL kernel manually by setting an `lc` option value.
par2j_debug_2023-12-16

Selected device = NVIDIA GeForce RTX 3070
Testing method2
Testing method12
Selected method12

We have increased GPU speed here. But I tried the same files a few more times and got these results:

2nd test:
Testing method2
Testing method12
Selected method12
GPU: 187985 MB/s; 06:10

3rd test:
Testing method2
Testing method12
Selected method12
GPU: 187724 MB/s; 06:05

4th test:
Testing method2
Testing method12
Selected method12
GPU: 184936 MB/s; 06:14

So, I see that the 1st test and the 3 others are almost all the same overall, but the 1st test's GPU speed was faster: 221110 MB/s vs ~187000 MB/s. I'm curious why the first test chose one buffer size and the others chose another.
Thanks Slava46 for the repeated tests. When all the file data cannot fit in free memory space, it splits the file data and processes the pieces one by one. For example, suppose you have 12 GB of file data. When the available free memory space is 10 GB, it splits into 2 pieces and processes 6 GB x 2 times. When the available free memory space is 6 GB, it splits into 3 pieces and processes 4 GB x 3 times. Even when the file data is the same, the buffer size may differ by the available memory size at that time. You may see such buffer size lines in the debug version log.
My results below. Slightly slower than the earlier par2j_debug_2023-12-05 version.
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Testing method2
Testing method12
Selected method12
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully
Thanks cavalia88 for the test. Now it can get the kernel values correctly. I found a section about the "ternary operator" in the "AMD OpenCL Programming Optimization Guide"; there is a specific optimization technique for AMD GPUs, and I added it to my OpenCL source code. Putting constant values in a "ternary operator" seems to be good: method16 becomes 10% faster on my Intel GPU. Because the other methods use it for table setup only, the speed improvement may be very small. At least, it won't be slower on other non-Radeon GPUs. I put the sample (par2j_debug_2023-12-18.zip) in the "MultiPar_sample" folder on OneDrive. If there is no problem, I will adopt it for version 1.3.3.2.
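For example, a GF(2^16) table-setup step can be written with constant operands in the ternary operator, which AMD's compiler can turn into a single conditional select instead of a branch. (Whether this is the exact line in par2j is my assumption; v is assumed to be a uint holding a 16-bit value, and 0x1100B is the PAR2 generator polynomial.)

```c
// OpenCL C table setup: multiply v by 2 in GF(2^16).
// Both ternary operands are constants (0x1100B and 0), so the AMD
// compiler can emit a conditional select rather than a branch.
uint next = (v << 1) ^ ((v & 0x8000) ? 0x1100B : 0);
```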
Seems about the same, or slightly slower than the previous version.
Parchive 2.0 client version 1.3.3.2 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Testing method2
Testing method12
Selected method12
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully
par2j64_ternary

Testing method2
Testing method12
Selected method12
GPU: 183723 MB/s; 06:11
Thanks cavalia88 and Slava46 for the tests. While the special "ternary operator" usage for AMD GPUs isn't slower than the normal "ternary operator", it's not faster either. The interesting point is that both users report that the old "par2j_debug_2023-12-05.zip" was slightly faster. When I tested the old debug version again on my PC, the recent code was slightly faster than the old code on my Intel GPU. I'm not sure what makes such a difference. While their ways of table setup are different, I'm doubtful that causes a 4 ~ 5% speed difference. So, I made a new debug version with the old OpenCL code to compare directly. I put the sample (par2j_debug_2023-12-19.zip) in the "MultiPar_sample" folder on OneDrive. If you are interested in why the old code was faster before, you may test the 3 samples and compare their results. Though I think a very small difference is ignorable, I'm happy to improve things or solve a problem.
The 3 versions in the latest file are about the same speed.
Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Testing method2
Testing method12
Selected method12
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully

Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully

Base Directory : "I:\Output\Sample"
Input File count : 56
read_block_num = 8775
read all source blocks, and keep all parity blocks (GPU)
Platform[0] = AMD Accelerated Parallel Processing
Device[0] = gfx1100
Device[1] = gfx1036
Selected platform = AMD Accelerated Parallel Processing
Testing method2
Testing method12
Selected method12
Max number of work items = 12288 (256 * 48)
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
Created successfully
par2j_debug_2023-12-19

par2j64_old1205: 142308 MB/s; 06:39
Testing method2
Testing method12
Selected method12

par2j64_ternary: 148345 MB/s; 06:26
Testing method2
Testing method12
Selected method12

All low speed.
Thanks cavalia88 and Slava46 for the additional tests. It seems that a small change in table setup doesn't affect total speed when the GPU is fast enough. A small difference in speed (several percent, like +-2%) might be normal. Anyway, the timer isn't so accurate.
I sorted the results by their known values;
Though I tried to improve VRAM cache usage, I could not succeed. As I'm tired of so many trials, I want to finish the GPU optimization. Thanks to the testers for their long-term help. I made a sample version to test the behavior. I posted an alpha version of v1.3.3.2 on GitHub. I put the current sample (par2j_sample_2023-12-26.zip) in the "MultiPar_sample" folder on OneDrive. If there is no problem, I will release the next version next year.
par2j_sample_2023-12-26

par2j64_1329: 14:19 - very slow
Max number of work items = 11776 (256 * 46)

par2j64_1332: GPU: 179446 MB/s; 06:19
src buf : 2096664 KB (1317 blocks), possible
Testing method12
Selected method12
Max number of work items = 11776 (256 * 46)

par2j64 main: 06:09
I don't see the GPU speed in the log.
Thanks Slava46 for confirmation. I will release v1.3.3.2 soon. |
Hi,
I was previously using the par2j64 command-line version 1.3.2.8. I tested out the latest 1.3.3.1 version and found that speeds are about double, mainly because more threads are being utilized. I'm using a Ryzen 9 7900X 12-core processor with an AMD Radeon 7900XTX graphics card. I have two questions:
(i) With the new 1.3.3.1 version, 24 threads are being used and I get 100% CPU utilization. Is there any way for me to limit the maximum number of threads?
(ii) I cannot seem to get the GPU to do the processing. In both versions, all the work is being done by the CPU (I see little utilization of the GPU in the task manager). How do we know if the GPU is being used by MultiPar from the command line? Is there any command to force the use of the GPU? My current command line is as below.
par2j64 c -rr10 -ss640000 -lc32 -rd2 -lr316 "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz" "I:\Output\Sample*.*"
======================================
Parchive 2.0 client version 1.3.2.8 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 12 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : Auto (85178 MB available), Fast SSD
Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 12
Slice distribution : 2, power of two (until 158)
Packet Repetition limit : 0
100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
100.0% : Creating recovery slice
OpenCL : gfx1100, OpenCL 2.0 AMD-APP (3584.0), 256*48
======================================
Parchive 2.0 client version 1.3.3.1 by Yutaka Sawada
Base Directory : "I:\Output\Sample"
Recovery File : "I:\Output\Sample\UjPGgFjavolplR8geCekcqwXz.par2"
CPU thread : 24 / 24
CPU cache limit : 128 KB, 2048 KB
CPU extra : x64 AVX2 CLMUL
Memory usage : Auto (88769 MB available), Fast SSD
Input File count : 56
Input File total size : 5589315326
Input File Slice size : 640000
Input File Slice count : 8775
Recovery Slice count : 878
Redundancy rate : 10.00%
Recovery File count : 11
Slice distribution : 2, power of two (until 316)
Packet Repetition limit : 0
100.0% : Computing file hash
100.0% : Making index file
100.0% : Constructing recovery file
100.0% : Creating recovery slice
======================================