~10% speedup #1

Closed · pallas1 wants to merge 1 commit

Conversation

@pallas1 commented Apr 2, 2016

Untested on ARCH < 5.2

@tpruvot (Collaborator) commented Apr 2, 2016

Thanks for the commit, I will try the changes on earlier CUDA cards and on the testnet... before committing.

PS: the testnet is not "testable" as-is; there is a line to comment out in ccminer to allow the empty version (data[0]).

@tpruvot (Collaborator) commented Apr 2, 2016

So, the first review is fine... with mixed results on Linux across GPU generations, but it works well on SM 3.0, 5.0 and 5.2. I'll ask the decred team whether I'm the one who should merge this ;)

@pallas1 (Author) commented Apr 2, 2016

Good. It will need tuning on the 750, maybe just removing the GS4 calls on arch < 5.2. I only have 970s.

@alexis78 commented Apr 2, 2016

Hi @pallas1. That's nice work there. I'd recommend using a return statement inside the final if condition, just like I did on 8-round blake: https://github.com/alexis78/ccminer/blob/windows/Algo256/blake256.cu#L287

This addition may sound trivial, but as you'll notice it's not trivial at all. In detail, it allows CUDA 7.5 to:

  1. Store the values fetched from constant memory in registers.
  2. Perform more precomputation outside the for loop.
  3. Stop the continuous fetches from constant memory while iterating.

On the other hand, the kernel's register usage increases.
As I previously stated here,
https://v.cash/forum/threads/ccminer-faster-8-round-blake-algo-1-16x.282/page-2#post-5039
you can observe these differences (with or without the return statement) by inspecting the generated PTX code.
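
(A minimal sketch of the early-return pattern being described; this is not the actual decred.cu code, and the names c_target, d_found and search are made up for illustration:)

#include <stdint.h>

__constant__ uint32_t c_target;  // hypothetical target word in constant memory

__global__ void search(const uint32_t *hashes, uint32_t threads, uint32_t *d_found)
{
    const uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread >= threads) return;

    #pragma unroll
    for (uint32_t i = 0; i < 8; i++) {
        if (hashes[8 * thread + i] <= c_target) {
            atomicMin(d_found, thread);
            // the early return: with it, the compiler can keep c_target in a
            // register and hoist loop-invariant work out of the loop instead
            // of re-fetching from constant memory on every iteration
            return;
        }
    }
}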

@pallas1 (Author) commented Apr 2, 2016

Thanks, I will try that!

@cjepson commented Apr 3, 2016

Waiting for @jolan's comment because I think he is the only one with GPUs.

@bshiner commented Apr 4, 2016

I have some 750 cards (Ubuntu 14.04 and CUDA 7.5). I see an increase (which is great, thanks!), but it's not that large with my setup: from about 461 to 462.5 MH/s. If I can provide any other info or help, I'd be glad to.

@tpruvot (Collaborator) commented Apr 4, 2016

The best improvements I've seen are on the GTX 9xx cards on Linux, but indeed it's odd that the 750 Ti doesn't benefit...

@pallas1 (Author) commented Apr 4, 2016

I had reports of both the 750 and the Windows build being slower. Unfortunately I have neither a 750 nor a Windows setup ready for testing and tuning. I suspect it has to do with the "aggregated" GS4 calls; please try replacing them with standard blake rounds. If that helps, we might remove the GS4 calls altogether, or add some preprocessor directives to select the best method based on operating system and card (a sketch follows below).
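
(As a sketch of that last idea; the GS4 arguments below are placeholders, the real signature is in the PR diff, and the exact guards would need tuning per platform:)

#if !defined(_WIN32) && (__CUDA_ARCH__ >= 520)
    // aggregated variant, where it has helped so far (SM 5.2 on Linux)
    GS4(0, 1, 2, 3, 14, 10, 4, 8, 9, 15, 13, 6);  // placeholder arguments
#else
    // standard blake rounds everywhere else
    GSPREC(0, 4, 0x8, 0xC, 14, 10);
    GSPREC(1, 5, 0x9, 0xD, 4,  8);
    GSPREC(2, 6, 0xA, 0xE, 9, 15);
    GSPREC(3, 7, 0xB, 0xF, 13, 6);
#endif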

@bshiner commented Apr 4, 2016

I'm not sure how to do that. If you can explain it to me, or send me the change, I can recompile and test.

@pallas1 (Author) commented Apr 4, 2016

Replace this function:

// fully unrolled 14-round BLAKE-256 compression; returns a candidate result
// word when the output passes the d_data7 check, or UINT32_MAX otherwise
__device__ __forceinline__
uint32_t blake256_compress_14(uint32_t *m, uint32_t *v_init, uint32_t d_data6, uint32_t d_data7)
{
    uint32_t v[16];

    #pragma unroll
    for (uint32_t i = 0; i < 16; i++) v[i] = v_init[i];
    // these two are not modified:
    v[ 9] = 0x85A308D3;
    v[13] = 0x299F31D0 ^ (180U*8U);

    // round 1 with nonce
    GSPREC(1, 5, 0x9, 0xD, 2,  3);
    GSPREC(0, 5, 0xA, 0xF, 8,  9);
    GSPREC(1, 6, 0xB, 0xC, 10, 11);
    GSPREC(2, 7, 0x8, 0xD, 12, 13);
    GSPREC(3, 4, 0x9, 0xE, 14, 15);
    // round 2
    GSPREC(0, 4, 0x8, 0xC, 14, 10);
    GSPREC(1, 5, 0x9, 0xD, 4,  8);
    GSPREC(2, 6, 0xA, 0xE, 9,  15);
    GSPREC(3, 7, 0xB, 0xF, 13, 6);
    GSPREC(0, 5, 0xA, 0xF, 1,  12);
    GSPREC(1, 6, 0xB, 0xC, 0,  2);
    GSPREC(2, 7, 0x8, 0xD, 11, 7);
    GSPREC(3, 4, 0x9, 0xE, 5,  3);
    // round 3
    GSPREC(0, 4, 0x8, 0xC, 11, 8);
    GSPREC(1, 5, 0x9, 0xD, 12, 0);
    GSPREC(2, 6, 0xA, 0xE, 5,  2);
    GSPREC(3, 7, 0xB, 0xF, 15, 13);
    GSPREC(0, 5, 0xA, 0xF, 10, 14);
    GSPREC(1, 6, 0xB, 0xC, 3,  6);
    GSPREC(2, 7, 0x8, 0xD, 7,  1);
    GSPREC(3, 4, 0x9, 0xE, 9,  4);
    // round 4
    GSPREC(0, 4, 0x8, 0xC, 7,  9);
    GSPREC(1, 5, 0x9, 0xD, 3,  1);
    GSPREC(2, 6, 0xA, 0xE, 13, 12);
    GSPREC(3, 7, 0xB, 0xF, 11, 14);
    GSPREC(0, 5, 0xA, 0xF, 2,  6);
    GSPREC(1, 6, 0xB, 0xC, 5,  10);
    GSPREC(2, 7, 0x8, 0xD, 4,  0);
    GSPREC(3, 4, 0x9, 0xE, 15, 8);
    // round 5
    GSPREC(0, 4, 0x8, 0xC, 9,  0);
    GSPREC(1, 5, 0x9, 0xD, 5,  7);
    GSPREC(2, 6, 0xA, 0xE, 2,  4);
    GSPREC(3, 7, 0xB, 0xF, 10, 15);
    GSPREC(0, 5, 0xA, 0xF, 14, 1);
    GSPREC(1, 6, 0xB, 0xC, 11, 12);
    GSPREC(2, 7, 0x8, 0xD, 6,  8);
    GSPREC(3, 4, 0x9, 0xE, 3,  13);
    // round 6
    GSPREC(0, 4, 0x8, 0xC, 2, 12);
    GSPREC(1, 5, 0x9, 0xD, 6, 10);
    GSPREC(2, 6, 0xA, 0xE, 0, 11);
    GSPREC(3, 7, 0xB, 0xF, 8, 3);
    GSPREC(0, 5, 0xA, 0xF, 4, 13);
    GSPREC(1, 6, 0xB, 0xC, 7, 5);
    GSPREC(2, 7, 0x8, 0xD, 15,14);
    GSPREC(3, 4, 0x9, 0xE, 1, 9);
    // round 7
    GSPREC(0, 4, 0x8, 0xC, 12, 5);
    GSPREC(1, 5, 0x9, 0xD, 1, 15);
    GSPREC(2, 6, 0xA, 0xE, 14,13);
    GSPREC(3, 7, 0xB, 0xF, 4, 10);
    GSPREC(0, 5, 0xA, 0xF, 0,  7);
    GSPREC(1, 6, 0xB, 0xC, 6,  3);
    GSPREC(2, 7, 0x8, 0xD, 9,  2);
    GSPREC(3, 4, 0x9, 0xE, 8, 11);
    // round 8
    GSPREC(0, 4, 0x8, 0xC, 13,11);
    GSPREC(1, 5, 0x9, 0xD, 7, 14);
    GSPREC(2, 6, 0xA, 0xE, 12, 1);
    GSPREC(3, 7, 0xB, 0xF, 3,  9);
    GSPREC(0, 5, 0xA, 0xF, 5,  0);
    GSPREC(1, 6, 0xB, 0xC, 15, 4);
    GSPREC(2, 7, 0x8, 0xD, 8,  6);
    GSPREC(3, 4, 0x9, 0xE, 2, 10);
    // round 9
    GSPREC(0, 4, 0x8, 0xC, 6, 15);
    GSPREC(1, 5, 0x9, 0xD, 14, 9);
    GSPREC(2, 6, 0xA, 0xE, 11, 3);
    GSPREC(3, 7, 0xB, 0xF, 0,  8);
    GSPREC(0, 5, 0xA, 0xF, 12, 2);
    GSPREC(1, 6, 0xB, 0xC, 13, 7);
    GSPREC(2, 7, 0x8, 0xD, 1,  4);
    GSPREC(3, 4, 0x9, 0xE, 10, 5);
    // round 10
    GSPREC(0, 4, 0x8, 0xC, 10, 2);
    GSPREC(1, 5, 0x9, 0xD, 8,  4);
    GSPREC(2, 6, 0xA, 0xE, 7,  6);
    GSPREC(3, 7, 0xB, 0xF, 1,  5);
    GSPREC(0, 5, 0xA, 0xF, 15,11);
    GSPREC(1, 6, 0xB, 0xC, 9, 14);
    GSPREC(2, 7, 0x8, 0xD, 3, 12);
    GSPREC(3, 4, 0x9, 0xE, 13, 0);
    // round 11
    GSPREC(0, 4, 0x8, 0xC, 0,  1);
    GSPREC(1, 5, 0x9, 0xD, 2,  3);
    GSPREC(2, 6, 0xA, 0xE, 4,  5);
    GSPREC(3, 7, 0xB, 0xF, 6,  7);
    GSPREC(0, 5, 0xA, 0xF, 8,  9);
    GSPREC(1, 6, 0xB, 0xC, 10,11);
    GSPREC(2, 7, 0x8, 0xD, 12,13);
    GSPREC(3, 4, 0x9, 0xE, 14,15);
    // round 12
    GSPREC(0, 4, 0x8, 0xC, 14,10);
    GSPREC(1, 5, 0x9, 0xD, 4,  8);
    GSPREC(2, 6, 0xA, 0xE, 9, 15);
    GSPREC(3, 7, 0xB, 0xF, 13, 6);
    GSPREC(0, 5, 0xA, 0xF, 1, 12);
    GSPREC(1, 6, 0xB, 0xC, 0,  2);
    GSPREC(2, 7, 0x8, 0xD, 11, 7);
    GSPREC(3, 4, 0x9, 0xE, 5,  3);
    // round 13
    GSPREC(0, 4, 0x8, 0xC, 11, 8);
    GSPREC(1, 5, 0x9, 0xD, 12, 0);
    GSPREC(2, 6, 0xA, 0xE, 5,  2);
    GSPREC(3, 7, 0xB, 0xF, 15,13);
    GSPREC(0, 5, 0xA, 0xF, 10,14);
    GSPREC(1, 6, 0xB, 0xC, 3,  6);
    GSPREC(2, 7, 0x8, 0xD, 7,  1);
    GSPREC(3, 4, 0x9, 0xE, 9,  4);
    // round 14
    GSPREC(0, 4, 0x8, 0xC, 7,  9);
    GSPREC(1, 5, 0x9, 0xD, 3,  1);
    GSPREC(2, 6, 0xA, 0xE, 13,12);
    GSPREC(3, 7, 0xB, 0xF, 11,14);
    GSPREC(0, 5, 0xA, 0xF, 2,  6);
    GSPREC(2, 7, 0x8, 0xD, 4,  0);

    // early exit: compute the last two G calls and the return value only
    // when the target word check passes, skipping them otherwise
    if ((d_data7 ^ v[7] ^ v[15]) == 0) {
        GSPREC(1, 6, 0xB, 0xC, 5, 10);
        GSPREC(3, 4, 0x9, 0xE, 15, 8);
        return (d_data6 ^ v[6] ^ v[14]);
    }
    return UINT32_MAX;
}
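
(For readers following along: GSPREC is the standard BLAKE-256 G function with the message and constant words passed by index. The actual macro lives in decred.cu; the definition below is a representative sketch, and the z[] constant-table name is an assumption:)

#define ROTR32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

// BLAKE-256 G function on state words v[a], v[b], v[c], v[d],
// mixing message words m[x], m[y] with round constants z[x], z[y]
#define GSPREC(a, b, c, d, x, y) { \
    v[a] += (m[x] ^ z[y]) + v[b]; \
    v[d] = ROTR32(v[d] ^ v[a], 16); \
    v[c] += v[d]; \
    v[b] = ROTR32(v[b] ^ v[c], 12); \
    v[a] += (m[y] ^ z[x]) + v[b]; \
    v[d] = ROTR32(v[d] ^ v[a], 8); \
    v[c] += v[d]; \
    v[b] = ROTR32(v[b] ^ v[c], 7); \
}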

@tpruvot (Collaborator) commented Apr 4, 2016

Nope, that doesn't improve... but the vanilla "return;" does.

@pallas1 (Author) commented Apr 4, 2016

I tried the "return" (also "break"), and it hashed slower on my rig.

@alexis78 commented Apr 4, 2016

Read or pastebin the PTX obtained by cuobjdump. "break;" won't work.
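
(For reference, the PTX can be dumped from the compiled binary with "cuobjdump -ptx ccminer", or generated straight from the .cu file with "nvcc -ptx"; both are standard CUDA toolkit invocations.)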

@bshiner commented Apr 4, 2016

I don't see any difference going back to the standard blake rounds.

@pallas1 (Author) commented Apr 4, 2016

Try playing with TPB and NPR; they are defined at the beginning of the file.
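
(For anyone tuning along: a minimal sketch of how such defines typically drive the launch configuration. TPB = threads per block and NPR = nonces per thread are guesses from the names, the values are examples, and decred_gpu_hash is a placeholder kernel:)

#include <stdint.h>

#define TPB 256   // threads per block (example value)
#define NPR 128   // nonces hashed per thread (example value)

__global__ void decred_gpu_hash(uint32_t threads, uint32_t startNonce);

void launch_scan(uint32_t throughput, uint32_t startNonce)
{
    // each block covers TPB * NPR nonces, so size the grid accordingly;
    // TPB trades register pressure against occupancy, NPR trades launch
    // overhead against per-thread work
    dim3 grid((throughput + TPB * NPR - 1) / (TPB * NPR));
    dim3 block(TPB);
    decred_gpu_hash<<<grid, block>>>(throughput, startNonce);
}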

@tpruvot (Collaborator) commented Apr 4, 2016

I tried too, but "return;" improves from 520 to 540 MH/s.

see c0fca5c

@bshiner commented Apr 4, 2016

Any tips on what types of values to try for TPB and NPR? Stay in powers of two? Multiples of the number of cores?

@pallas1 (Author) commented Apr 4, 2016

For TPB, try 256; for NPR, powers of two.

@bshiner commented Apr 4, 2016

I tried a bunch of different settings and here's what I got:

TPB 256, NPR 64/128/256 all gave about 456 MH/s
TPB 512, NPR 64/128/256 gave about 460/462/460 MH/s

TPB 512, NPR 128 with the return line added from c0fca5c gave 463 MH/s

So not much difference either way; it seems like your initial TPB/NPR settings were optimal, and the return line made another small improvement.

@tpruvot are you getting 540 MH/s on a 750 or something else? If you are, then perhaps I'm doing something else wrong to have such a low starting point compared to you and not seeing any noticeable improvements, which would invalidate my tests.

@tpruvot (Collaborator) commented Apr 4, 2016

It's a 750 Ti Black Edition, a bit faster one; the important thing is the +20 MH/s.

[2016-04-04 21:10:25] GPU #1: Gigabyte GTX 750 Ti, 542.10 MH/s
[2016-04-04 21:10:25] accepted: 31/31 (100.00%), 1508.38 MH/s yes!
[2016-04-04 21:10:25] Stratum difficulty set to 2
[2016-04-04 21:10:28] GPU #0: Gigabyte GTX 960, 974.30 MH/s
[2016-04-04 21:10:28] accepted: 32/32 (100.00%), 1508.40 MH/s yes!
[2016-04-04 21:10:31] GPU #0: Gigabyte GTX 960, 965.74 MH/s
[2016-04-04 21:10:32] GPU #0: Gigabyte GTX 960, 965.55 MH/s
[2016-04-04 21:10:32] accepted: 33/33 (100.00%), 1507.88 MH/s yes!

NVIDIA-SMI 352.79
Linux 3.19.0-30-lowlatency #34~14.04.1-Ubuntu

| 0 GeForce GTX 960 Off | 0000:01:00.0 Off | N/A |
| 39% 67C P0 114W / 160W | 63MiB / 4095MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 750 Ti Off | 0000:02:00.0 Off | N/A |
| 35% 48C P0 35W / 46W | 44MiB / 2047MiB | 99% Default |

@bshiner commented Apr 4, 2016

@tpruvot right, I know the change is what's important, but I'm not seeing that much of a change... strange. My cards are stock 750 Ti from EVGA.

@alexis78 In my first post I mentioned that I'm using CUDA 7.5 on Ubuntu 14.04.

from nvidia-smi:

  • power: 33 W
  • temp: 48 C
  • SM: 99%
  • mem: 0%
  • memory: 2700 MHz
  • core: 1163 MHz

@pallas1 (Author) commented Apr 4, 2016

For the record, my ~10% improvement is measured at fixed power usage, attained using a low TDP setting via nvidia-smi.
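
(One common way to pin power for such measurements, assuming the driver supports it on the card: enable persistence mode with "nvidia-smi -pm 1", then set a cap in watts with, e.g., "nvidia-smi -i 0 -pl 100".)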

@bshiner commented Apr 5, 2016

Sorry, I double-checked and I hadn't actually pulled down the file from c0fca5c. With that version I'm getting 481 MH/s, up from 463 MH/s. That's +18 MH/s.

@jolan commented Apr 5, 2016

I tested c0fca5c on Arch Linux with driver 361.28 on a 750 Ti, 960, 970 and various 980 Ti models, and see a noticeable improvement across the board.

My EVGA 750 Ti SC also went from 520MH/s to 540MH/s.

@tpruvot Whenever you want to merge is fine by me.

@tpruvot (Collaborator) commented Apr 5, 2016

done

@tpruvot closed this Apr 5, 2016