~10% speedup #1

Closed · pallas1 wants to merge 1 commit

Conversation

@pallas1 commented Apr 2, 2016

Untested on ARCH < 5.2

@tpruvot (Collaborator) commented Apr 2, 2016

Thanks for the commit, I will try the changes on earlier CUDA cards and on the testnet... before committing.

PS: the testnet is not "testable" as-is; there is a line to comment out in ccminer to allow the empty version (data[0]).

@tpruvot (Collaborator) commented Apr 2, 2016

So, the first review is fine... with mixed results on Linux across GPU generations, but it works well on SM 3.0, 5.0 and 5.2. I'll ask the decred team whether I'm the one who should merge this ;)

@pallas1 (Author) commented Apr 2, 2016

Good. It will need tuning on the 750, maybe just removing the GS4 calls on arch < 5.2. I only have 970s.

@alexis78 commented Apr 2, 2016

Hi @pallas1. That's nice work there. I'd recommend using a return statement inside the final if condition, just like I did on 8-round blake: https://github.com/alexis78/ccminer/blob/windows/Algo256/blake256.cu#L287

This addition may sound trivial, but as you'll notice it's not trivial at all. In detail, it allows CUDA 7.5 to:

  1. Store the values fetched from constant memory in registers.
  2. Perform more precomputation outside the for loop.
  3. Stop the continuous fetches from constant memory while iterating.

On the other hand, the kernel's register usage increases.
As I previously stated here,
https://v.cash/forum/threads/ccminer-faster-8-round-blake-algo-1-16x.282/page-2#post-5039
you can observe these differences (with or without the return statement) by inspecting the generated PTX code.
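
(A minimal sketch of the early-return pattern being described; this is not the actual decred.cu code, and the names c_target, d_found and search are made up for illustration:)

#include <stdint.h>

__constant__ uint32_t c_target;  // hypothetical target word in constant memory

__global__ void search(const uint32_t *hashes, uint32_t threads, uint32_t *d_found)
{
    const uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread >= threads) return;

    #pragma unroll
    for (uint32_t i = 0; i < 8; i++) {
        if (hashes[8 * thread + i] <= c_target) {
            atomicMin(d_found, thread);
            // the early return: with it, the compiler can keep c_target in a
            // register and hoist loop-invariant work out of the loop instead
            // of re-fetching from constant memory on every iteration
            return;
        }
    }
}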

@pallas1 (Author) commented Apr 2, 2016

Thanks, I will try that!

@cjepson commented Apr 3, 2016

Waiting for @jolan's comment because I think he is the only one with GPUs.

@bshiner commented Apr 4, 2016

I have some 750 cards (Ubuntu 14.04 and CUDA 7.5). I see an increase (which is great, thanks!), but it's not that large with my setup: from about 461 to 462.5 MH/s. If I can provide any other info or help, I'd be glad to.

@tpruvot (Collaborator) commented Apr 4, 2016

The best improvements I've seen are on the GTX 9xx cards on Linux, but indeed it's odd that the 750 Ti doesn't benefit...

@pallas1 (Author) commented Apr 4, 2016

I had reports of both the 750 and the Windows build being slower. Unfortunately I have neither a 750 nor a Windows setup ready for testing and tuning. I suspect it has to do with the "aggregated" GS4 calls; please try replacing them with standard blake rounds. If that helps, we might remove the GS4 calls altogether, or add some preprocessor directives to select the best method based on operating system and card (a sketch follows below).
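
(As a sketch of that last idea; the GS4 arguments below are placeholders, the real signature is in the PR diff, and the exact guards would need tuning per platform:)

#if !defined(_WIN32) && (__CUDA_ARCH__ >= 520)
    // aggregated variant, where it has helped so far (SM 5.2 on Linux)
    GS4(0, 1, 2, 3, 14, 10, 4, 8, 9, 15, 13, 6);  // placeholder arguments
#else
    // standard blake rounds everywhere else
    GSPREC(0, 4, 0x8, 0xC, 14, 10);
    GSPREC(1, 5, 0x9, 0xD, 4,  8);
    GSPREC(2, 6, 0xA, 0xE, 9, 15);
    GSPREC(3, 7, 0xB, 0xF, 13, 6);
#endif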

@bshiner commented Apr 4, 2016

I'm not sure how to do that. If you can explain it to me, or send me the change, I can recompile and test.

@pallas1 (Author) commented Apr 4, 2016

Replace this function:

// fully unrolled 14-round BLAKE-256 compression; returns a candidate result
// word when the output passes the d_data7 check, or UINT32_MAX otherwise
__device__ __forceinline__
uint32_t blake256_compress_14(uint32_t *m, uint32_t *v_init, uint32_t d_data6, uint32_t d_data7)
{
    uint32_t v[16];

    #pragma unroll
    for (uint32_t i = 0; i < 16; i++) v[i] = v_init[i];
    // these two are not modified:
    v[ 9] = 0x85A308D3;
    v[13] = 0x299F31D0 ^ (180U*8U);

    // round 1 with nonce
    GSPREC(1, 5, 0x9, 0xD, 2,  3);
    GSPREC(0, 5, 0xA, 0xF, 8,  9);
    GSPREC(1, 6, 0xB, 0xC, 10, 11);
    GSPREC(2, 7, 0x8, 0xD, 12, 13);
    GSPREC(3, 4, 0x9, 0xE, 14, 15);
    // round 2
    GSPREC(0, 4, 0x8, 0xC, 14, 10);
    GSPREC(1, 5, 0x9, 0xD, 4,  8);
    GSPREC(2, 6, 0xA, 0xE, 9,  15);
    GSPREC(3, 7, 0xB, 0xF, 13, 6);
    GSPREC(0, 5, 0xA, 0xF, 1,  12);
    GSPREC(1, 6, 0xB, 0xC, 0,  2);
    GSPREC(2, 7, 0x8, 0xD, 11, 7);
    GSPREC(3, 4, 0x9, 0xE, 5,  3);
    // round 3
    GSPREC(0, 4, 0x8, 0xC, 11, 8);
    GSPREC(1, 5, 0x9, 0xD, 12, 0);
    GSPREC(2, 6, 0xA, 0xE, 5,  2);
    GSPREC(3, 7, 0xB, 0xF, 15, 13);
    GSPREC(0, 5, 0xA, 0xF, 10, 14);
    GSPREC(1, 6, 0xB, 0xC, 3,  6);
    GSPREC(2, 7, 0x8, 0xD, 7,  1);
    GSPREC(3, 4, 0x9, 0xE, 9,  4);
    // round 4
    GSPREC(0, 4, 0x8, 0xC, 7,  9);
    GSPREC(1, 5, 0x9, 0xD, 3,  1);
    GSPREC(2, 6, 0xA, 0xE, 13, 12);
    GSPREC(3, 7, 0xB, 0xF, 11, 14);
    GSPREC(0, 5, 0xA, 0xF, 2,  6);
    GSPREC(1, 6, 0xB, 0xC, 5,  10);
    GSPREC(2, 7, 0x8, 0xD, 4,  0);
    GSPREC(3, 4, 0x9, 0xE, 15, 8);
    // round 5
    GSPREC(0, 4, 0x8, 0xC, 9,  0);
    GSPREC(1, 5, 0x9, 0xD, 5,  7);
    GSPREC(2, 6, 0xA, 0xE, 2,  4);
    GSPREC(3, 7, 0xB, 0xF, 10, 15);
    GSPREC(0, 5, 0xA, 0xF, 14, 1);
    GSPREC(1, 6, 0xB, 0xC, 11, 12);
    GSPREC(2, 7, 0x8, 0xD, 6,  8);
    GSPREC(3, 4, 0x9, 0xE, 3,  13);
    // round 6
    GSPREC(0, 4, 0x8, 0xC, 2, 12);
    GSPREC(1, 5, 0x9, 0xD, 6, 10);
    GSPREC(2, 6, 0xA, 0xE, 0, 11);
    GSPREC(3, 7, 0xB, 0xF, 8, 3);
    GSPREC(0, 5, 0xA, 0xF, 4, 13);
    GSPREC(1, 6, 0xB, 0xC, 7, 5);
    GSPREC(2, 7, 0x8, 0xD, 15,14);
    GSPREC(3, 4, 0x9, 0xE, 1, 9);
    // round 7
    GSPREC(0, 4, 0x8, 0xC, 12, 5);
    GSPREC(1, 5, 0x9, 0xD, 1, 15);
    GSPREC(2, 6, 0xA, 0xE, 14,13);
    GSPREC(3, 7, 0xB, 0xF, 4, 10);
    GSPREC(0, 5, 0xA, 0xF, 0,  7);
    GSPREC(1, 6, 0xB, 0xC, 6,  3);
    GSPREC(2, 7, 0x8, 0xD, 9,  2);
    GSPREC(3, 4, 0x9, 0xE, 8, 11);
    // round 8
    GSPREC(0, 4, 0x8, 0xC, 13,11);
    GSPREC(1, 5, 0x9, 0xD, 7, 14);
    GSPREC(2, 6, 0xA, 0xE, 12, 1);
    GSPREC(3, 7, 0xB, 0xF, 3,  9);
    GSPREC(0, 5, 0xA, 0xF, 5,  0);
    GSPREC(1, 6, 0xB, 0xC, 15, 4);
    GSPREC(2, 7, 0x8, 0xD, 8,  6);
    GSPREC(3, 4, 0x9, 0xE, 2, 10);
    // round 9
    GSPREC(0, 4, 0x8, 0xC, 6, 15);
    GSPREC(1, 5, 0x9, 0xD, 14, 9);
    GSPREC(2, 6, 0xA, 0xE, 11, 3);
    GSPREC(3, 7, 0xB, 0xF, 0,  8);
    GSPREC(0, 5, 0xA, 0xF, 12, 2);
    GSPREC(1, 6, 0xB, 0xC, 13, 7);
    GSPREC(2, 7, 0x8, 0xD, 1,  4);
    GSPREC(3, 4, 0x9, 0xE, 10, 5);
    // round 10
    GSPREC(0, 4, 0x8, 0xC, 10, 2);
    GSPREC(1, 5, 0x9, 0xD, 8,  4);
    GSPREC(2, 6, 0xA, 0xE, 7,  6);
    GSPREC(3, 7, 0xB, 0xF, 1,  5);
    GSPREC(0, 5, 0xA, 0xF, 15,11);
    GSPREC(1, 6, 0xB, 0xC, 9, 14);
    GSPREC(2, 7, 0x8, 0xD, 3, 12);
    GSPREC(3, 4, 0x9, 0xE, 13, 0);
    // round 11
    GSPREC(0, 4, 0x8, 0xC, 0,  1);
    GSPREC(1, 5, 0x9, 0xD, 2,  3);
    GSPREC(2, 6, 0xA, 0xE, 4,  5);
    GSPREC(3, 7, 0xB, 0xF, 6,  7);
    GSPREC(0, 5, 0xA, 0xF, 8,  9);
    GSPREC(1, 6, 0xB, 0xC, 10,11);
    GSPREC(2, 7, 0x8, 0xD, 12,13);
    GSPREC(3, 4, 0x9, 0xE, 14,15);
    // round 12
    GSPREC(0, 4, 0x8, 0xC, 14,10);
    GSPREC(1, 5, 0x9, 0xD, 4,  8);
    GSPREC(2, 6, 0xA, 0xE, 9, 15);
    GSPREC(3, 7, 0xB, 0xF, 13, 6);
    GSPREC(0, 5, 0xA, 0xF, 1, 12);
    GSPREC(1, 6, 0xB, 0xC, 0,  2);
    GSPREC(2, 7, 0x8, 0xD, 11, 7);
    GSPREC(3, 4, 0x9, 0xE, 5,  3);
    // round 13
    GSPREC(0, 4, 0x8, 0xC, 11, 8);
    GSPREC(1, 5, 0x9, 0xD, 12, 0);
    GSPREC(2, 6, 0xA, 0xE, 5,  2);
    GSPREC(3, 7, 0xB, 0xF, 15,13);
    GSPREC(0, 5, 0xA, 0xF, 10,14);
    GSPREC(1, 6, 0xB, 0xC, 3,  6);
    GSPREC(2, 7, 0x8, 0xD, 7,  1);
    GSPREC(3, 4, 0x9, 0xE, 9,  4);
    // round 14
    GSPREC(0, 4, 0x8, 0xC, 7,  9);
    GSPREC(1, 5, 0x9, 0xD, 3,  1);
    GSPREC(2, 6, 0xA, 0xE, 13,12);
    GSPREC(3, 7, 0xB, 0xF, 11,14);
    GSPREC(0, 5, 0xA, 0xF, 2,  6);
    GSPREC(2, 7, 0x8, 0xD, 4,  0);

    // early exit: compute the last two G calls and the return value only
    // when the target word check passes, skipping them otherwise
    if ((d_data7 ^ v[7] ^ v[15]) == 0) {
        GSPREC(1, 6, 0xB, 0xC, 5, 10);
        GSPREC(3, 4, 0x9, 0xE, 15, 8);
        return (d_data6 ^ v[6] ^ v[14]);
    }
    return UINT32_MAX;
}
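
(For readers following along: GSPREC is the standard BLAKE-256 G function with the message and constant words passed by index. The actual macro lives in decred.cu; the definition below is a representative sketch, and the z[] constant-table name is an assumption:)

#define ROTR32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

// BLAKE-256 G function on state words v[a], v[b], v[c], v[d],
// mixing message words m[x], m[y] with round constants z[x], z[y]
#define GSPREC(a, b, c, d, x, y) { \
    v[a] += (m[x] ^ z[y]) + v[b]; \
    v[d] = ROTR32(v[d] ^ v[a], 16); \
    v[c] += v[d]; \
    v[b] = ROTR32(v[b] ^ v[c], 12); \
    v[a] += (m[y] ^ z[x]) + v[b]; \
    v[d] = ROTR32(v[d] ^ v[a], 8); \
    v[c] += v[d]; \
    v[b] = ROTR32(v[b] ^ v[c], 7); \
}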

@tpruvot (Collaborator) commented Apr 4, 2016

Nope, that doesn't improve... but the vanilla "return;" does.

@pallas1 (Author) commented Apr 4, 2016

I tried the "return" (also "break"), and it hashed slower on my rig.

@alexis78 commented Apr 4, 2016

Read or pastebin the PTX obtained by cuobjdump. "break;" won't work.
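
(For reference, the PTX can be dumped from the compiled binary with "cuobjdump -ptx ccminer", or generated straight from the .cu file with "nvcc -ptx"; both are standard CUDA toolkit invocations.)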

@bshiner commented Apr 4, 2016

I don't see any difference going back to the standard blake rounds.

@pallas1 (Author) commented Apr 4, 2016

Try playing with TPB and NPR; they are defined at the beginning of the file.
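
(For anyone tuning along: a minimal sketch of how such defines typically drive the launch configuration. TPB = threads per block and NPR = nonces per thread are guesses from the names, the values are examples, and decred_gpu_hash is a placeholder kernel:)

#include <stdint.h>

#define TPB 256   // threads per block (example value)
#define NPR 128   // nonces hashed per thread (example value)

__global__ void decred_gpu_hash(uint32_t threads, uint32_t startNonce);

void launch_scan(uint32_t throughput, uint32_t startNonce)
{
    // each block covers TPB * NPR nonces, so size the grid accordingly;
    // TPB trades register pressure against occupancy, NPR trades launch
    // overhead against per-thread work
    dim3 grid((throughput + TPB * NPR - 1) / (TPB * NPR));
    dim3 block(TPB);
    decred_gpu_hash<<<grid, block>>>(throughput, startNonce);
}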

@tpruvot (Collaborator) commented Apr 4, 2016

I tried too, but "return;" improves from 520 to 540 MH/s.

see c0fca5c

@bshiner commented Apr 4, 2016

Any tips on what types of values to try for TPB and NPR? Stay in powers of two? Multiples of the number of cores?

@pallas1 (Author) commented Apr 4, 2016

For TPB, try 256; for NPR, powers of two.

@bshiner commented Apr 4, 2016

I tried a bunch of different settings and here's what I got:

TPB 256, NPR 64/128/256 all gave about 456 MH/s
TPB 512, NPR 64/128/256 gave about 460/462/460 MH/s

TPB 512, NPR 128 with the return line added from c0fca5c gave 463 MH/s

So not much difference either way; it seems like your initial TPB/NPR settings were optimal, and the return line made another small improvement.

@tpruvot are you getting 540 MH/s on a 750 or something else? If you are, then perhaps I'm doing something else wrong to have such a low starting point compared to you and not seeing any noticeable improvements, which would invalidate my tests.

@tpruvot (Collaborator) commented Apr 4, 2016

It's a 750 Ti Black Edition, a bit faster one; the important thing is the +20 MH/s.

[2016-04-04 21:10:25] GPU #1: Gigabyte GTX 750 Ti, 542.10 MH/s
[2016-04-04 21:10:25] accepted: 31/31 (100.00%), 1508.38 MH/s yes!
[2016-04-04 21:10:25] Stratum difficulty set to 2
[2016-04-04 21:10:28] GPU #0: Gigabyte GTX 960, 974.30 MH/s
[2016-04-04 21:10:28] accepted: 32/32 (100.00%), 1508.40 MH/s yes!
[2016-04-04 21:10:31] GPU #0: Gigabyte GTX 960, 965.74 MH/s
[2016-04-04 21:10:32] GPU #0: Gigabyte GTX 960, 965.55 MH/s
[2016-04-04 21:10:32] accepted: 33/33 (100.00%), 1507.88 MH/s yes!

NVIDIA-SMI 352.79
Linux 3.19.0-30-lowlatency #34~14.04.1-Ubuntu

| 0 GeForce GTX 960 Off | 0000:01:00.0 Off | N/A |
| 39% 67C P0 114W / 160W | 63MiB / 4095MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 750 Ti Off | 0000:02:00.0 Off | N/A |
| 35% 48C P0 35W / 46W | 44MiB / 2047MiB | 99% Default |

@bshiner commented Apr 4, 2016

@tpruvot right, I know the change is what's important, but I'm not seeing that much of a change... strange. My cards are stock 750 Ti from EVGA.

@alexis78 In my first post I mentioned that I'm using CUDA 7.5 on Ubuntu 14.04.

from nvidia-smi:

  • power: 33 W
  • temp: 48 C
  • SM: 99%
  • mem: 0%
  • memory: 2700 MHz
  • core: 1163 MHz

@pallas1 (Author) commented Apr 4, 2016

For the record, my ~10% improvement is measured at fixed power usage, attained using a low TDP setting via nvidia-smi.
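
(One common way to pin power for such measurements, assuming the driver supports it on the card: enable persistence mode with "nvidia-smi -pm 1", then set a cap in watts with, e.g., "nvidia-smi -i 0 -pl 100".)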

@bshiner commented Apr 5, 2016

Sorry, I double-checked and I hadn't actually pulled down the file from c0fca5c. With that version I'm getting 481 MH/s, up from 463 MH/s. That's +18 MH/s.

@jolan commented Apr 5, 2016

I tested c0fca5c on Arch Linux with driver 361.28 on a 750 Ti, 960, 970 and various 980 Ti models, and see a noticeable improvement across the board.

My EVGA 750 Ti SC also went from 520MH/s to 540MH/s.

@tpruvot Whenever you want to merge is fine by me.

@tpruvot (Collaborator) commented Apr 5, 2016

done

@tpruvot closed this Apr 5, 2016