Conversation
@@ -26,7 +26,7 @@
 #endif

 __global__ void
-__launch_bounds__(TPB, BPSM)
+//__launch_bounds__(TPB, BPSM)
Can you explain this to a newbie?
Generally, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance. The compiler therefore uses heuristics to minimize register usage while keeping register spilling and instruction count to a minimum, and __launch_bounds__(TPB, BPSM) gives it extra constraints to work with.
It takes 2 parameters:
maxThreadsPerBlock: the maximum number of threads per block with which the application will ever launch the kernel.
minBlocksPerMultiprocessor: optional; the desired minimum number of resident blocks per multiprocessor.
However, for the ethminer code running on a GTX 1060, the restriction from __launch_bounds__ is too tight (the register count is limited to about 47, while the optimum is about 70), so the overhead from register spilling outweighs the performance benefit of the reduced register count.
Removing the launch bounds gives about a 2% performance improvement on the GTX 1060.
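For a newbie-friendly picture, here is a hedged sketch of what the qualifier does. The TPB/BPSM values below are illustrative placeholders, not the ones defined in ethash_cuda_miner_kernel.cu:

```cuda
// Sketch only: TPB and BPSM values here are hypothetical.
#define TPB 128   // max threads per block the kernel will ever launch with
#define BPSM 2    // desired minimum resident blocks per multiprocessor

// With the qualifier, nvcc caps register usage so that at least BPSM
// blocks of TPB threads can be resident per SM, spilling registers to
// local memory if it has to:
__global__ void
__launch_bounds__(TPB, BPSM)
bounded_kernel() { /* ... */ }

// Without it, nvcc's heuristics pick the register count freely (around
// 70 in the case described above), trading occupancy for fewer spills:
__global__ void unbounded_kernel() { /* ... */ }
```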
Can this decrease performance on other cards?
It depends. As long as the card has a large enough register file, this should increase performance.
Can you just remove the code?
Are TPB and BPSM defined by us?
Sure, I can.
TPB and BPSM are defined in the same file (libethash-cuda/ethash_cuda_miner_kernel.cu) as __launch_bounds__.
Benchmarks for Nvidia GTX 1070 (mobile):
Before: ethminer -U -Z 4000000, ethminer -U -M
After: ethminer -U -Z 4000000, ethminer -U -M
Improvement: ~3%
libethash-cuda/keccak.cuh
@@ -328,12 +331,22 @@ __device__ __forceinline__ void keccak_f1600_init(uint2* s)

 /* iota: a[0,0] ^= round constant */
 s[0] ^= vectorize(keccak_round_constants[23]);

 for(uint32_t i=0; i<12; i++)
Can you reformat this to for (int i = 0; i < 12; ++i) and skip the {}?
libethash-cuda/keccak.cuh
uint2 t[5], u, v;

for (uint32_t i = 0; i<12; i++)
The same reformatting here.
There are several for loops in this file that have the "for (uint32_t i = 0; i<?; i++)" style; should I re-format them all?
Only the ones you have added or modified.
libethash-cuda/keccak.cuh
@@ -328,12 +331,19 @@ __device__ __forceinline__ void keccak_f1600_init(uint2* s)

 /* iota: a[0,0] ^= round constant */
 s[0] ^= vectorize(keccak_round_constants[23]);

 for(int i=0; i<12; ++i)
I also had in mind to add some spaces: after for, around =, and around <.
Sorry, I don't quite understand. Should I add a new line between "s[0] ^= vectorize(keccak_round_constants[23]);" and "}"?
I mean for_(int i_=_0; i_<_12; ++i). Required spaces are marked with _.
Ah, sorry for my misunderstanding. I have checked in my code; please help review it. Thank you.
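Putting the exchange together, the requested style (space after for, spaces around = and <, an int index, pre-increment, and no braces around a single-statement body) would read as below. The loop body is a hypothetical placeholder, not the actual keccak.cuh statement:

```cuda
for (int i = 0; i < 12; ++i)
    round_step(s, i);  // hypothetical single-statement body
```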
Tested on an Nvidia GRID K520 on Ubuntu; performance decreased by about 5%. Maybe it's specific to my card.
min/mean/max: 0/4858402/7689557 H/s
This code is optimized for the GTX 1060 according to its architecture. As you know, to get the highest performance we have to optimize the code for the underlying architecture of each specific device. Can we separate the code for each device, or add switches/macros in the code to distinguish device-related optimizations?
I understand this, but it looks like we cannot ship it in this form. Yes to separating the code per architecture (not per device), but I don't know how to do it.
Yeah, adding a command line switch will surely work.
I'm happy with enabling it for 10XX GPUs (this works well for at least some of them). We can add a switch in the next iteration. This is all up to you. Do you have access to more hardware to perform more tests?
Agreed, we can enable it and wait for users' feedback.
Is this going to help with performance on all GDDR5X GPUs, or is it addressing a different issue?
@brianorwhatever This code is optimized for the GP106 with the 1 GPC GDDR5 case, and may be helpful for other chips. You can scale "#define PARALLEL_HASH 4" in dagger_shuffled.cuh from 1 to 8 to check whether it improves performance for your GPU.
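To make that knob concrete: PARALLEL_HASH is a compile-time constant, so trying other widths means editing the define and rebuilding. A hedged sketch of the fragment (surrounding code elided):

```cuda
// dagger_shuffled.cuh (sketch): how many hashes are mixed per shuffle
// group. The values suggested above are 1 through 8; re-benchmark
// after each change, since the optimum is device-specific.
#define PARALLEL_HASH 4
```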
Tested on M60: Before:
After:
Which is a 2% improvement. Didn't notice any difference between 4 and 8 parallel hashes.
@qtxwang Hi, can you test my code again with "./ethminer -M -U --benchmark-warmup 100"? Thank you very much. (Re: "min/mean/max: 0/4858402/7689557 H/s") @andrusha Hi, thank you for your testing. The code has the best performance on my GTX 1060 when PARALLEL_HASH = 4.
Hi davilizh, which 1060 did you test on?
Can anyone provide a Windows binary with this change? I would love to test it on a GTX 1070. Thanks!
Windows binaries are always available on AppVeyor CI.
Not sure if you're checking this anymore, but here are my results:
Occurs after around 20 minutes of mining; unable to reproduce on Claymore, which appears to run fine. Running on 6x Gigabyte 1060 3GB cards.
@michael-pesce which NVIDIA driver version are you running? How were you able to change clock speeds in Linux? I have tried via nvidia-smi and nvidia-settings but can't with the 375.66 Linux driver and a 1070 (devtalk.nvidia discussion).
@spieiga 375.66. First, make sure that the xorg.conf configuration allows overclocking by using this command and restarting:
Then use nvidia-settings to overclock. Note: you must be in an X session to do so. Here's a script I use at startup:
Each command would look something like this:
That sets fans to 60% and the memory transfer speed offset to +1000, and the script iterates through each GPU (in my case 0..5). Output of nvidia-smi -q, just so you can see version numbers and such:
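The actual commands above were lost from this thread. The snippet below is a hypothetical reconstruction, not michael-pesce's real script: the nvidia-settings attribute names (GPUFanControlState, GPUTargetFanSpeed, GPUMemoryTransferRateOffset) exist in the driver, but the indices and values are illustrative, and echo makes this a dry run so nothing is actually applied.

```shell
# Hypothetical dry-run of a startup overclock loop: 60% fans, +1000
# memory transfer rate offset, GPUs 0..5. Remove `echo` to apply for real.
for i in 0 1 2 3 4 5; do
  echo nvidia-settings -a "[gpu:$i]/GPUFanControlState=1" \
                       -a "[fan:$i]/GPUTargetFanSpeed=60" \
                       -a "[gpu:$i]/GPUMemoryTransferRateOffset[3]=1000"
done
```

Note that fan control and clock offsets only work if overclocking was enabled in xorg.conf (the "Coolbits" option) and an X session is running.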
@sleepeeg3, the code is for CUDA only.
Works like a charm.
4x Palit GTX 1060 3GB
@spieiga You can also find out the maximum value for your GPU's
@michael-pesce any idea how to get this working on a headless Ubuntu machine? I tried a fresh Ubuntu Server 16.04 LTS installation with xfce4 and nvidia-375...
@eliclement From what I've read, it can't be headless; otherwise the NVIDIA driver is not loaded.
@eliclement I couldn't figure it out. Instead, I bought these: https://www.amazon.com/gp/product/B00JKFTYA8/ They fake a monitor UDID (or whatever) so X loads. Then you can set a file to autostart when X loads that does the overclocking/fan speed setting. If you need to adjust, you simply screen share into the X session.
@michael-pesce thanks for the tip. I have such a dummy plug and tried it, but when setting nvidia-settings params I always get: (logged in through VNC, Ubuntu 16.04, xfce4, nvidia-375)
It works now after installing Ubuntu 16.04 LTS Desktop instead of the server distro, with the dummy plug.
@michael-pesce @eliclement you don't need a monitor connected to make X work. At the time of installation, save the EDID of the monitor using nvidia-settings, then use the edid.bin file in your xorg.conf to fake X into thinking a monitor is connected. I have this working on my rig and X has no issues. You can add the edid by using
@michael-pesce @eliclement
Anyone using Linux having success with the --cuda-parallel-hash setting? Not sure if it's related to the OS, but it seems to have no noticeable effect on my hashrates for the GTX 1060 from MSI Gaming with 6 GB.
@efrister I have, on Ubuntu 17.04, on a GTX 1060. The default (4) seems best for me, with slight decreases for 2 and 8. Are you starting ethminer with the -U flag? Use a large number (4000) with --farm-recheck for accurate hashrates.
@pabloi I get 23 MH/s using "-U --cuda-parallel-hash 4 -S $pool -SP 1 -O $walletETH.$rigName --farm-recheck 4000". I got 23 before when using Claymore, and I get the same when I just leave out --cuda-parallel-hash.
@efrister That is strange. Did you compile it yourself?
@pabloi I'm using it pre-packaged with the SimpleMining OS (simplemining.net). I don't think it's self-compiled there, but I can't say for sure. I'm using their beta image, so maybe it is related to that.
@efrister I'm not familiar with SimpleMining, but the CUDA miner is not compiled by default, so I built it myself with the appropriate flag set (see instructions on the repository main page). Perhaps that is the issue.
So you are getting 23 MH/s with both ethminer and Claymore. This is perfect: use ethminer, it's still free and has no dev fees! What are you asking for? The default for --cuda-parallel-hash is 4, so everything is OK.
@petunder I had similar luck on Ethermine. The miner's reported hashrate was about 100-102 MH/s, but the pool reported an average hashrate of 94. Claymore's miner reports a hashrate of 94 as well.
Trust the hashrate you are seeing locally on the GPUs. The hashrates reported by the pools depend on how many shares you are contributing: they can't measure your true hashrate, all they can go by is how many shares you submit, and the shares you submit depend on the difficulty of the block you are mining.
Glad to say my hashrate improved from 62 to 66! (3x GTX 1050 Ti + 1x 1060 6GB)
@ajthemacboy It seems that this fork of ethminer does not send the pool all the shares it finds. The Genoil version works fine; its local hashrate closely matches the pool-accepted hashrate.
I downloaded the exe, put it into a folder with my bat file, and when I try to run it, it keeps saying it is getting a work package and then failing to submit the hash, with a client connection error saying it can't connect to http blah blah. I even tried copying the four DLL files from the ethermine folder I am currently using that works, and still the same thing. Thoughts?
The performance is improved from 'min/mean/max: 22369621/22579336/22719146 H/s' to 'min/mean/max: 23767722/23907532/24117248 H/s' on a flashed GTX 1060 with 2 GPCs and 9 TPCs (the production chip should have 10 TPCs). Note that this was tested on code pulled on May 11; the current code from GitHub cannot generate reasonable scores ('min/max/avg is 0/0/0 H/s').
Optimizations include:
1. ethash_cuda_miner_kernel.cu
We have commented out "__launch_bounds__" in the code. __launch_bounds__ is discussed in detail in http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz4fzSzZc9p.
2. dagger_shuffled.cuh
The state handling in compute_hash of dagger_shuffled.cuh is modified.
Every thread is the master for calculating one hash value, and each thread initializes its own copy of state using keccak_f1600_init. In the original main loop, when i=0, threads 0-7 copy thread 0's state[0-7] into each thread's shuffle[0-7], do the main computation, and then thread 0 captures the result of shuffle[0-3] into its state[8-11]. On the next iteration, when i=1, threads 0-7 copy thread 1's state[0-7] into each thread's shuffle[0-7], do the main computation, and thread 1 captures the result of shuffle[0-3] into its state[8-11].
With the modification, taking PARALLEL_HASH=2 as an example: when i=0, threads 0-7 copy thread 0's state[0-7] into each thread's shuffle[0][0-7] and thread 1's state[0-7] into each thread's shuffle[1][0-7]. They do the main computation on these two shuffle vectors in parallel. Then thread 0 captures the result of shuffle[0][0-3] into its state[8-11], and thread 1 captures the result of shuffle[1][0-3] into its state[8-11].
3. keccak.cuh
Since the input argument uint2 *s is changed in dagger_shuffled.cuh, we have to modify keccak_f1600_init and keccak_f1600_final in keccak.cuh accordingly.
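The restructuring described above can be sketched as follows. This is a hedged illustration, not the actual dagger_shuffled.cuh code: the function name, the mix step, and THREADS_PER_HASH are placeholders, and the DAG mixing itself is elided.

```cuda
// Sketch of the PARALLEL_HASH outer loop. __shfl is the pre-Volta
// warp shuffle used by ethminer of this era; names are hypothetical.
#define THREADS_PER_HASH 8
#define PARALLEL_HASH 2   // 1..8; the PR reports 4 as best on GTX 1060

__device__ void outer_loop_sketch(uint2 state[12], const int thread_id)
{
    for (int i = 0; i < THREADS_PER_HASH; i += PARALLEL_HASH)
    {
        uint2 shuffle[PARALLEL_HASH][8];

        // Broadcast state[0..7] of the PARALLEL_HASH owner lanes
        // (lanes i .. i+PARALLEL_HASH-1) into every thread's registers.
        for (int p = 0; p < PARALLEL_HASH; ++p)
            for (int w = 0; w < 8; ++w)
            {
                shuffle[p][w].x = __shfl(state[w].x, i + p, THREADS_PER_HASH);
                shuffle[p][w].y = __shfl(state[w].y, i + p, THREADS_PER_HASH);
            }

        // ... the main DAG mixing runs on all PARALLEL_HASH shuffle
        // vectors here, giving the scheduler independent work to overlap ...

        // Each owner lane captures its own result back into state[8..11].
        for (int p = 0; p < PARALLEL_HASH; ++p)
            if (thread_id == i + p)
                for (int w = 0; w < 4; ++w)
                    state[8 + w] = shuffle[p][w];
    }
}
```

The point of widening the loop is latency hiding: with PARALLEL_HASH independent mixes in flight per iteration, memory stalls on one can be overlapped with arithmetic on another, at the cost of more registers per thread.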