This repository has been archived by the owner on Apr 24, 2022. It is now read-only.

The optimized code by David Li #18

Merged
merged 7 commits on Jun 27, 2017

Conversation

davilizh
Contributor

The performance is improved from 'min/mean/max: 22369621/22579336/22719146 H/s' to 'min/mean/max: 23767722/23907532/24117248 H/s' on a flashed GTX 1060 with 2 GPCs and 9 TPCs (the production chip should have 10 TPCs). Note that this was tested against the code pulled on May 11; the current code from GitHub cannot generate reasonable scores ('min/max/avg is 0/0/0 H/s').
Optimizations include:
1. ethash_cuda_miner_kernel.cu
   We commented out the __launch_bounds__ qualifier. __launch_bounds__ is discussed in detail in http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz4fzSzZc9p
2. dagger_shuffle.cuh

   1. We moved variable definitions around and reduced them to the minimum required. The compiler should be able to do this analysis itself, but it never hurts to help it out. The state in compute_hash of dagger_shuffle.cuh is modified accordingly.
   2. We simplified the nested if/else blocks into a switch statement.
   3. We simplified control flow: the conditional is removed from the inner loop so that all threads calculate the value, and then all threads use a __shfl to read thread t's value (discarding the other threads' calculated values).
   4. We increased the total number of LDGs in flight to increase occupancy. We define PARALLEL_HASH so that each warp has PARALLEL_HASH LDGs in flight at a time instead of one, as in the original code.
      Every thread is the master for calculating one hash value. Each thread initializes its copy of state using keccak_f1600_init. Then, in the main loop: when i=0, threads 0-7 copy thread 0's state[0-7] into their own shuffle[0-7], do the main computation, and thread 0 captures the result of shuffle[0-3] into state[8-11]. On the next iteration (i=1), threads 0-7 copy thread 1's state[0-7] into their own shuffle[0-7], do the main computation, and thread 1 captures the result of shuffle[0-3] into state[8-11].
      With this modification, if PARALLEL_HASH=2: when i=0, threads 0-7 copy thread 0's state[0-7] into their own shuffle[0][0-7] and thread 1's state[0-7] into their own shuffle[1][0-7]. They do the main computation on these two shuffle vectors in parallel. Then thread 0 captures the result of shuffle[0][0-3] into its state[8-11], and thread 1 captures the result of shuffle[1][0-3] into its state[8-11].
3. keccak.cuh
   Since the input argument uint2 *s changed in dagger_shuffle.cuh, keccak_f1600_init and keccak_f1600_final in keccak.cuh had to be modified accordingly.
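The PARALLEL_HASH restructuring above can be sketched as follows. This is a minimal illustration, not the actual miner kernel: mix_dag(), the 8-thread group size, and the loop bounds are assumptions for the example.

```cuda
// Illustrative sketch of the PARALLEL_HASH idea (hypothetical names;
// mix_dag() stands in for the real DAG-load + mixing step).
#define PARALLEL_HASH 2

__device__ void mix_dag(uint2 shuffle[8]) { /* DAG loads + mixing */ }

__global__ void parallel_hash_sketch()
{
    const int lane = threadIdx.x & 7;  // 8 threads cooperate per hash
    uint2 state[12];                   // each thread masters one hash
    // ... keccak_f1600_init(state) would run here ...

    for (int i = 0; i < 8; i += PARALLEL_HASH)
    {
        uint2 shuffle[PARALLEL_HASH][8];

        // Broadcast PARALLEL_HASH masters' states at once, so the
        // global loads issued inside mix_dag() can overlap in flight
        // instead of serializing one hash at a time.
        for (int p = 0; p < PARALLEL_HASH; p++)
            for (int j = 0; j < 8; j++)
            {
                shuffle[p][j].x = __shfl_sync(0xffffffff, state[j].x, i + p, 8);
                shuffle[p][j].y = __shfl_sync(0xffffffff, state[j].y, i + p, 8);
            }

        // All threads compute on all PARALLEL_HASH vectors.
        for (int p = 0; p < PARALLEL_HASH; p++)
            mix_dag(shuffle[p]);

        // Each master thread captures only its own result.
        for (int p = 0; p < PARALLEL_HASH; p++)
            if (lane == i + p)
                for (int j = 0; j < 4; j++)
                    state[8 + j] = shuffle[p][j];
    }
}
```

The point of the restructuring is purely latency hiding: the arithmetic is unchanged, but PARALLEL_HASH independent memory streams are kept in the pipeline per warp.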

@@ -26,7 +26,7 @@
#endif

__global__ void
__launch_bounds__(TPB, BPSM)
//__launch_bounds__(TPB, BPSM)
Contributor

Can you explain this to a newbie?

Contributor Author

Generally, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance. The compiler therefore uses heuristics to minimize register usage while keeping register spilling and instruction count to a minimum, and this behavior can be constrained with __launch_bounds__(TPB, BPSM).
It has 2 parameters:
maxThreadsPerBlock: specifies the maximum number of threads per block with which the application will ever launch the kernel
minBlocksPerMultiprocessor: optional; specifies the desired minimum number of resident blocks per multiprocessor
However, for the ethminer code running on a GTX 1060, the launch bounds restriction is too tight (the register count is limited to about 47, while the optimum is about 70), which makes the overhead from register spilling higher than the performance benefit of reducing register usage.
After deleting the launch bounds, we get about a 2% performance improvement on the GTX 1060.
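A minimal sketch of the qualifier being discussed (the TPB/BPSM values below are illustrative assumptions, not the miner's actual configuration):

```cuda
// Illustrative only; the real miner defines TPB/BPSM in
// ethash_cuda_miner_kernel.cu, possibly with different values.
#define TPB  128  // maxThreadsPerBlock: never launched with more threads
#define BPSM 4    // minBlocksPerMultiprocessor: desired resident blocks

__global__ void
__launch_bounds__(TPB, BPSM)  // caps registers/thread so BPSM blocks fit
bounded_kernel(float* out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = (float)idx;
}

// Without the qualifier, the compiler may allocate more registers per
// thread (fewer spills, lower occupancy); with it, register pressure is
// capped to guarantee the requested occupancy, possibly forcing spills.
```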

Contributor

Can this decrease performance on other cards?

Contributor Author

It depends. As long as the card has a large enough register file, this should increase performance.

Contributor

Can you just remove the code?
Are TPB and BPSM defined by us?

Contributor Author

Sure, I can.
TPB and BPSM are defined in the same file (libethash-cuda/ethash_cuda_miner_kernel.cu) as __launch_bounds__.

@chfast
Contributor

chfast commented May 15, 2017

Benchmarks for Nvidia GTX 1070 (mobile)

Before

ethminer -U -Z 4000000
ℹ 17:22:25|ethminer Mining on difficulty 29 23.07MH/s
ℹ 17:22:26|ethminer Mining on difficulty 29 26.21MH/s
ℹ 17:22:27|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:28|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:29|ethminer Mining on difficulty 29 26.21MH/s
ℹ 17:22:30|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:31|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:32|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:33|ethminer Mining on difficulty 29 26.21MH/s
ℹ 17:22:34|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:35|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:36|ethminer Mining on difficulty 29 26.21MH/s
ℹ 17:22:37|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:38|ethminer Mining on difficulty 29 25.17MH/s
ℹ 17:22:39|ethminer Mining on difficulty 29 26.21MH/s
ℹ 17:22:40|ethminer Mining on difficulty 29 25.17MH/s

ethminer -U -M
Trial 1... 25515349
Trial 2... 25864874
Trial 3... 25864874
Trial 4... 25515349
Trial 5... 25864874
min/mean/max: 25515349/25725064/25864874 H/s

After

ethminer -U -Z 4000000
ℹ 17:31:25|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:26|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:27|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:28|ethminer Mining on difficulty 31 27.26MH/s
ℹ 17:31:29|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:30|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:31|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:32|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:33|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:34|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:35|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:36|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:37|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:38|ethminer Mining on difficulty 31 26.21MH/s
ℹ 17:31:39|ethminer Mining on difficulty 31 26.21MH/s

ethminer -U -M
Trial 1... 26563925
Trial 2... 26563925
Trial 3... 26214400
Trial 4... 26563925
Trial 5... 26563925
min/mean/max: 26214400/26494020/26563925 H/s

Improvement ~3%

@@ -328,12 +331,22 @@ __device__ __forceinline__ void keccak_f1600_init(uint2* s)

/* iota: a[0,0] ^= round constant */
s[0] ^= vectorize(keccak_round_constants[23]);

for(uint32_t i=0; i<12; i++)
Contributor

Can you reformat this to for (int i = 0; i < 12; ++i) and skip {}?

uint2 t[5], u, v;

for (uint32_t i = 0; i<12; i++)
Contributor

The same reformatting here.

Contributor Author

There are several for loops in this file that have the "for (uint32_t i = 0; i<?; i++)" style; should I reformat them all?

Contributor

Only the ones you have added or modified.

@@ -328,12 +331,19 @@ __device__ __forceinline__ void keccak_f1600_init(uint2* s)

/* iota: a[0,0] ^= round constant */
s[0] ^= vectorize(keccak_round_constants[23]);

for(int i=0; i<12; ++i)
Contributor

I also had in mind to add some spaces. After for, around =, around <.

Contributor Author

Sorry, I don't quite understand. Should I add a new line between "s[0] ^= vectorize(keccak_round_constants[23]);" and "}"?

Contributor

I mean for_(int i_=_0; i_<_12; ++i). Required spaces marked with _.

Contributor Author

Ah, sorry for my misunderstanding. I have checked in my code; please help review it. Thank you.

@qtxwang

qtxwang commented May 25, 2017

Tested on an Nvidia GRID K520 on Ubuntu; performance decreased by about 5%. Maybe it's specific to my card.

@qtxwang

qtxwang commented May 25, 2017

min/mean/max: 0/4858402/7689557 H/s
inner mean: 5534151 H/s

@davilizh
Contributor Author

This code is optimized for the GTX 1060 according to its architecture. As you know, to get the highest performance we have to optimize the code for the underlying architecture of each specific device. Can we separate the code for each device, or add switches/macros in the code to distinguish device-related optimizations?

@chfast
Contributor

chfast commented May 26, 2017

I understand this, but it looks like we cannot ship it in this form.

Yes to separating code for each architecture (not device), but I don't know how to do it.
The best option might be a command line switch to enable this optimization, to get feedback from users on whether it makes things faster on their device.

@davilizh
Contributor Author

Yeah, adding a command line switch will surely work.
BTW, you can also use the architecture identification macro __CUDA_ARCH__ supported by NVCC, which can be used inside GPU functions to determine the virtual architecture currently being compiled for.
A detailed description of the macro is here: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#axzz4iAhM64FX
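A small sketch of what that could look like (hypothetical function and values; this is not code from this PR):

```cuda
// __CUDA_ARCH__ is defined only while device code is being compiled,
// and its value names the virtual architecture (e.g. 610 for Pascal
// GP10x parts such as the GTX 1060), so per-architecture tuning can
// be selected at compile time.
__device__ int pick_parallel_hash()
{
#if __CUDA_ARCH__ >= 610
    return 4;   // value reported to benchmark best on GTX 1060
#else
    return 1;   // conservative default for older architectures
#endif
}
```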

@chfast
Contributor

chfast commented May 26, 2017

I'm happy with enabling it for 10xx GPUs (it works well for at least some of them). We can add a switch in the next iteration. This is all up to you.

Do you have access to more hardware to perform more tests?

@davilizh
Contributor Author

Agreed, we can enable it and wait for users' feedback.
Currently I have no other device at hand, but I will probably evaluate the code on another device in the near future (maybe 1-2 weeks).

@davilizh davilizh changed the title The optimized code by Nvidia Architecturer. The optimized code by David Li May 26, 2017
@brianorwhatever

Is this going to help with performance on all GDDR5X gpus or is it addressing a different issue?

@davilizh
Contributor Author

davilizh commented Jun 14, 2017

@brianorwhatever This code is optimized for the GP106 with the 1 GPC GDDR5 configuration, and may be helpful for other chips. You can scale "#define PARALLEL_HASH 4" in dagger_shuffled.cuh from 1 to 8 to check whether it improves performance for your GPU.

@andrusha

Tested on M60:

Before:

min/mean/max: 17476266/17755886/17825792 H/s
inner mean: 5941930 H/s

After:

min/mean/max: 17825792/18035507/18175317 H/s
inner mean: 6058439 H/s

That's about a 2% improvement. I didn't notice any difference between 4 and 8 parallel hashes.

@davilizh
Contributor Author

davilizh commented Jun 23, 2017

@qtxwang Hi, can you test my code again with "./ethminer -M -U --benchmark-warmup 100"?
It's weird that your min score is 0, which might add noise to the comparison.
BTW, were your pasted results from my code or from the master branch code?
Can you paste the results for both versions of the code?

Thank you very much.

min/mean/max: 0/4858402/7689557 H/s
inner mean: 5534151 H/s

@andrusha Hi, thank you for your testing. The code has the best performance on my GTX 1060 when PARALLEL_HASH = 4.

@kiwina

kiwina commented Jun 24, 2017

Hi davilizh, which 1060 did you test on?
I can't get mine over 19.

@cfelicio

Can anyone provide a Windows binary with this change? I would love to test it on a GTX 1070. Thanks!

@chfast
Contributor

chfast commented Jun 24, 2017

Windows binaries are always available on AppVeyor CI.
https://ci.appveyor.com/project/ethereum-mining/ethminer/build/93/job/ss7k95dsy1kly4vl/artifacts

@emily-pesce

Not sure if you're checking this anymore but here are my results:

GTX 1070
Ubuntu 16.04
Memory offset +1500
Power ceiling 115w
./ethminer -U -M --cuda-parallel-hash X:
31.10 at 1
32.36 at 2
28.87 at 3
32.42 at 4
25.86 at 5
21.60 at 6
18.59 at 7
32.22 at 8

@jacksgituk

jacksgituk commented Jul 2, 2017

CUDA error in func 'ethash_cuda_miner::search' at line 359 : unspecified launch failure

Occurs after around 20 minutes of mining, unable to reproduce on Claymore which appears to run fine.

Running on 6x Gigabyte 1060 3GB cards

@spieiga

spieiga commented Jul 2, 2017

@michael-pesce which nvidia driver version are you running? how were you able to change clock speeds in linux? I have tried via nvidia-smi and nvidia-settings but can't with 375.66 linux driver and a 1070 devtalk.nvidia discussion

@emily-pesce

emily-pesce commented Jul 2, 2017

@spieiga 375.66

First, make sure that the X configuration allows overclocking by running this command and restarting:

sudo nvidia-xconfig -a --cool-bits=31 --allow-empty-initial-configuration

Then use nvidia-settings to overclock. Note: you must be in an X session to do so. Here's a script I use at startup:

#!/bin/bash

SET='/usr/bin/nvidia-settings'

NUMGPU="$(nvidia-smi -L | wc -l)"

echo "Setting up ${NUMGPU} GPU(s)"

n=0
while [ $n -lt $NUMGPU ]; do
    ${SET} -a [gpu:${n}]/GPUFanControlState=1 -a [fan:${n}]/GPUTargetFanSpeed=60 -a [gpu:${n}]/GPUPowerMizerMode=1 -a [gpu:${n}]/GPUMemoryTransferRateOffset[3]=1000
    let n=n+1
done
echo "Complete"
exit 0

Each command would look something like this:

nvidia-settings -a [gpu:0]/GPUFanControlState=1 -a [fan:0]/GPUTargetFanSpeed=60 -a [gpu:0]/GPUPowerMizerMode=1 -a [gpu:0]/GPUMemoryTransferRateOffset[3]=1000

That sets fans to 60% and memory transfer speed offset to +1000, and the script will iterate through each GPU (in my case 0..5)

Output of nvidia-smi -q just so you can see version numbers and such.

==============NVSMI LOG==============

Timestamp : Sun Jul 2 11:32:45 2017
Driver Version : 375.66

Attached GPUs : 6
GPU 0000:01:00.0
Product Name : GeForce GTX 1070
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-xxx
Minor Number : 0
VBIOS Version : 86.04.50.00.AB
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A

@davilizh
Contributor Author

davilizh commented Jul 3, 2017

@sleepeeg3, the code is for CUDA only.

@murathai

murathai commented Jul 5, 2017

Works like a charm.
4x Palit GTX 1060 3GB (+748 mem, -200 core (same rate at 0 core), 75% TDP (100% TDP also gives the same)) = 95.33-96.2 MH/s stable.
Claymore got 90 MH/s with the same settings.

@froggrog

froggrog commented Jul 5, 2017

4x Palit GTX 1060 3GB
Ubuntu 16.04
Claymore got 94 MH/s
ethminer got 104 MH/s
But the Nanopool statistics show a smaller hashrate than when I use Claymore. Why?

@froggrog

froggrog commented Jul 5, 2017

@spieiga
You can also use this command to increase the power limit, which increases the scope for overclocking:
sudo nvidia-smi -i GPU -pl WATT
example:
sudo nvidia-smi -i 1 -pl 140

You can also find out the maximum value for your GPUs:
nvidia-smi -q -d POWER

@LtMerlin

LtMerlin commented Jul 5, 2017

@michael-pesce any idea to get this working on a headless ubuntu machine? I tried a fresh Ubuntu server 16.04 LTS installation with xfce4 and nvidia-375...
Did you use an Ubuntu Desktop installation with standard Xorg config?

@spieiga

spieiga commented Jul 5, 2017

@eliclement from what I've read, it can't be headless, otherwise the NVIDIA driver is not loaded.

@emily-pesce

@eliclement I couldn't figure it out.

Instead, I bought these: https://www.amazon.com/gp/product/B00JKFTYA8/

They fake a monitor UDID (or whatever) so X loads. Then you can set a file to autostart when X loads that does the overclocking/fan speed setting. If you need to adjust you simply screen share into the X session.

@LtMerlin

LtMerlin commented Jul 6, 2017

@michael-pesce thanks for the tip. I have such a dummy plug and tried it, but when setting nvidia-settings params I always get:
ERROR: Error querying connected displays on GPU 0 (Missing Extension)

(logged in through VNC, Ubuntu 16.04, xfce4, nvidia-375)
Any ideas how to solve this?

@LtMerlin

LtMerlin commented Jul 6, 2017

It works now by installing the Ubuntu 16.04 LTS Desktop instead of the server distro with the dummy plug.

@rizwansarwar

@michael-pesce @eliclement You don't need a monitor connected to make X work. At installation time, save the monitor's EDID using nvidia-settings, then point your xorg.conf at the edid.bin file so X believes a monitor is connected. I have this working on my rig and X has no issues. You can add the EDID with nvidia-xconfig --custom-edid=<location of edid.bin>; this generates your X config using the fake EDID, and X should start fine after that.

@froggrog

froggrog commented Jul 7, 2017

@michael-pesce @eliclement
You can use this config; it is made for 4 GPUs.

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 375.66  (buildmeister@swio-display-x86-rhel47-06)  Mon May  1 15:45:32 PDT 2017

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    Screen      1  "Screen1" RightOf "Screen0"
    Screen      2  "Screen2" RightOf "Screen1"
    Screen      3  "Screen3" RightOf "Screen2"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    ModelName      "Unknown"
EndSection

Section "Monitor"
    Identifier     "Monitor2"
    VendorName     "Unknown"
    ModelName      "Unknown"
EndSection

Section "Monitor"
    Identifier     "Monitor3"
    VendorName     "Unknown"
    ModelName      "Unknown"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:1:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:4:0:0"
EndSection

Section "Device"
    Identifier     "Device2"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:5:0:0"
EndSection

Section "Device"
    Identifier     "Device3"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:7:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    Option         "Coolbits" "28"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
EndSection

Section "Screen"
    Identifier     "Screen2"
    Device         "Device2"
    Monitor        "Monitor2"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
EndSection

Section "Screen"
    Identifier     "Screen3"
    Device         "Device3"
    Monitor        "Monitor3"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
EndSection

@efrister

Anyone using Linux having success with the --cuda-parallel-hash setting? Not sure if it's related to the OS, but it seems to have no noticeable effect on my hashrates for the MSI Gaming GTX 1060 with 6 GB.

@pabloi

pabloi commented Jul 14, 2017

@efrister I have, on Ubuntu 17.04 with a GTX 1060. The default (4) seems best for me, with slight decreases at 2 and 8. Are you starting ethminer with the -U flag? Use a large number (4000) with --farm-recheck for accurate hashrates.

@efrister

@pabloi I get 23 MH/s using "-U --cuda-parallel-hash 4 -S $pool -SP 1 -O $walletETH.$rigName --farm-recheck 4000". I got 23 before when using Claymore, and I get the same when I just leave out the --cuda-parallel-hash.

@pabloi

pabloi commented Jul 14, 2017

@efrister That is strange. Did you compile it yourself?

@efrister

@pabloi I'm using it pre-packaged with the Simplemining OS (simplemining.net). Don't think that it's self-compiled there, but I can't say for sure. Using their beta image, so maybe it is related to that.

@pabloi

pabloi commented Jul 14, 2017

@efrister I am not familiar with Simple Mining, but the CUDA miner is not compiled by default, so I built it myself with the appropriate flag set (see the instructions on the repository main page). Perhaps that is the issue.

@Singman33

So, you are getting 23 MH/s with both ethminer and Claymore. That's perfect: use ethminer, it's still free with no dev fees! What more are you asking for? The default for --cuda-parallel-hash is 4, so everything is OK.
I'm using Ubuntu 16.04 and I'm getting the same hashrate with Claymore and ethminer; this is a great improvement.

@petunder

Tonight I experimented with miner version 0.11.0. Locally, it shows a significant increase in hashrate; in my case (3x 1050 Ti) from 36 to 40 MH/s. However, Nicehash does not actually accept this speed: the overnight average was exactly 36 MH/s, which means the miner was working into the void. I did not try the new version with other pools. In the morning I went back to the Genoil version; everything works fine and the speed is accepted normally by Nicehash.

@ajthemacboy

@petunder I had similar luck on Ethermine. The miner's reported hashrate was about 100-102 MH/s but the pool reported an average hashrate of 94. Claymore's miner reports a hashrate of 94 as well.

@spieiga

spieiga commented Jul 25, 2017

Trust the hashrate you see locally on the GPUs. The hashrate reported by a pool depends on how many shares you contribute: pools can't measure your true hashrate, all they can go by is how many shares you submit, and the shares you submit depend on the difficulty of the work you are mining.
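For intuition, a rough back-of-the-envelope relation (an illustrative assumption, not from this thread, with share difficulty $D$ measured in hashes per share):

```latex
\text{pool-estimated hashrate} \approx \frac{n \cdot D}{T},
\qquad
\mathbb{E}[n] = \frac{H \cdot T}{D}
```

where $n$ is the number of shares accepted over a window of $T$ seconds and $H$ is the true hashrate; the pool's estimate converges to $H$ only over long windows, and is noisy when $n$ is small.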

@charlesxabier

Glad to say my hashrate improved from 62 to 66! (3x GTX 1050 Ti + 1x GTX 1060 6GB)

@petunder

petunder commented Aug 4, 2017

@ajthemacboy It seems that this fork of ethminer does not send all found shares to the pool. The Genoil version works fine; the local hashrate closely matches the pool-accepted hashrate.

@djglamrock

Downloaded the exe and put it into a folder with my bat file. When I try to run it, it keeps saying it is getting a work package, then "failed to submit hash", then a client connection error that it can't connect to http blah blah. I even tried copying the four DLL files from the ethminer folder I currently use, which works fine, and still get the same thing. Thoughts?

@ethereum-mining ethereum-mining locked and limited conversation to collaborators Aug 6, 2017