
RFv2 speed #15

Open
itwysgsl opened this issue Apr 3, 2019 · 145 comments

@itwysgsl (Contributor) commented Apr 3, 2019:

Hello again @bschn2, I have concerns about v2 speed. Don't you think that Rainforest became way too slow? My MacBook's Intel i7 produces ~30 H/s, and that's not enough to mine even one block at the lowest diff on the test network.

@wtarreau commented Apr 5, 2019:

Strange about these 30 H/s, that's 40 times slower than my i7 (mine does 1200). Did you build with -march=native -O3 ?
How did you test: using rf_test, or did you simply iterate over rf_hash() with a NULL rambox ? The latter will solely depend on memory speed, since it needs to initialize the rambox each time. That's what I figured when implementing the bench mode in rf_test. You have to pre-allocate a rambox and pass it to rf_hash(), which restores it at the end. At first I didn't like the principle, but thinking about the reasons (making it very hard for FPGAs and ASICs to try to emulate the RAM) I finally found it very smart, as the cost is only incurred by those trying to massively scale it ;-)
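
The pre-allocated-rambox principle can be sketched like this (illustrative only: the real API and its 96 MB box live in the rfv2 sources; `hash_with_rambox()` and `BOX_WORDS` are made-up stand-ins):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BOX_WORDS (1u << 16)  /* tiny box for the sketch; rfv2 uses 96 MB */

/* Stand-in for rf_hash(): mixes the message with dependent rambox lookups,
 * mutates the box along the way, then restores it before returning, so the
 * caller can keep reusing the same allocation across hashes. */
static uint64_t hash_with_rambox(const void *msg, size_t len, uint64_t *box)
{
    uint32_t idx[8];
    uint64_t old[8];
    uint64_t h = 0xcbf29ce484222325ULL;

    for (size_t i = 0; i < len; i++)
        h = (h ^ ((const uint8_t *)msg)[i]) * 0x100000001b3ULL;

    for (int i = 0; i < 8; i++) {
        idx[i] = (uint32_t)(h % BOX_WORDS);
        old[i] = box[idx[i]];            /* log the old value... */
        box[idx[i]] += h;                /* ...mutate the box... */
        h = (h >> 7) ^ (h << 3) ^ old[i];
    }
    for (int i = 7; i >= 0; i--)         /* ...and undo, newest first */
        box[idx[i]] = old[i];
    return h;
}
```

A bench loop then does `calloc(BOX_WORDS, sizeof(uint64_t))` once and calls the hash in a loop; benchmarking with a NULL rambox instead would re-pay the initialization on every hash, which is the trap described above.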

@wtarreau commented Apr 5, 2019:

My old ATOM D510 does 55 H/s :-)

@itwysgsl (Contributor, Author) commented Apr 5, 2019:

@wtarreau oh, I guess there is a little misunderstanding. It gives me ~30 H/s per thread using the CPU miner source code example from the rfv2 branch. I just tried recompiling the test example with the -march=native -O3 options and now it gives me ~130 H/s per thread :D

@itwysgsl (Contributor, Author) commented Apr 5, 2019:

Interesting, what is the purpose of these -march=native -O3 options?

@itwysgsl (Contributor, Author) commented Apr 5, 2019:

Hello again @wtarreau. After some tests (~20 minutes mining on testnet at the lowest difficulty without any blocks) I still think that RFv2 in its current state is way too slow :(

@LinuXperia commented Apr 5, 2019:

I was never a fan of this difficulty target calculation idea.

The whole Bitcoin difficulty target calculation is a second-class solution and does not really work.

Imagine the hashrate goes very high and the difficulty increases to the maximum.
Then only one hash, or a very small number of hashes, can be below such an extreme target,
and producing such a hash may not even be possible, as no data in the block header can produce such a hash.

It is highly questionable whether it will even be possible to create the needed hashes for MicroBitcoin, as 21 trillion coins need to be mined instead of the 21 million coins Bitcoin has.

We may very well run into problems with not having enough hashes to be mined in MicroBitcoin, as trillions of coins are planned to be mined.

Because of this, a lot of wasted hash calculations are produced and energy is wasted.

Reducing the loops, as I see it, benefits big power-hungry CPUs over small power-efficient CPUs and goes against the idea of the Rainforest hash algorithm.

I suggest going the path that Equihash uses, as implemented for example in Bitcoin Gold or ZCash: instead of finding a hash that is below
a target, find half-hash collisions inside the rambox.

I really recommend dropping this whole target calculation and instead using
the Equihash way of finding collisions inside the rambox.

If my memory serves me right, Equihash creates one rambox
and populates it for every nonce using 1024 loops of Salsa20-calculated hashes.

After this it looks for hash collisions inside this rambox, and if it finds one,
appends it after the nonce in the block header and submits it to the network.

Because of this, the Equihash block header has 80 bytes plus the 32-byte collision hash.

Maybe something similar can also be done with the Rainforest hash algorithm?

@bschn2 (Owner) commented Apr 5, 2019:

Gentlemen,

please be careful if you start to change the number of loops, iterations per round, or the loop curve!

@wtarreau at least make sure you round the pow up by adding 1.5 and not 1.0. This aside, your suggestion looks reasonable to me, as it gives more of a body-like shape which stays longer in the higher range, hence reduces the peaks. In any case you must check the average number of history buffer entries and double it in the define (well, maybe less than double now with the curve change, but it must be at least 20-25% larger to avoid repopulating the rambox from scratch at the end). If the number of entries is lower than 512 then you need to decrease the clrsbl threshold so that it writes more often.

The divbox+scramble calls you suggested to remove were indeed here only to balance the power between low-end CPUs and high-end ones. There are exactly 3 which are doubled and could safely be reduced to 1 each (the original smhasher tests were run with both configurations).

And yes, please keep an eye on Raspberry Pi-type devices, as it seems capital to me that such machines be about as fast as regular PCs if we really want to incentivize energy savings. Your numbers are fine by me, as I initially targeted 1k to 10k H/s/core, so we're pretty much in that area here.

@jdelorme3 (Contributor) commented Apr 5, 2019:

And how many times faster is it on large devices versus small devices with Equihash? Are you sure we don't favour only the large ones there?

@jdelorme3 (Contributor) commented Apr 5, 2019:

Also the hash verification time counts a lot. I think that Rainforest is great for this: it costs, but not way too much.

@bschn2 (Owner) commented Apr 5, 2019:

Dear @LinuXperia,

rfv2's rambox is not far from what you describe, since the rambox is modified by every lookup based on the hashed message (thus it includes the nonce). However, keep in mind that Salsa20 was designed to be extremely fast on x86 processors (typically less than 4 cycles per byte), and that this can hardly be considered fair for emerging countries where such hardware simply is not available; all people have is a previous-generation smartphone to do everything. There, the power often comes from local solar panels, so maintaining a decent capacity on such devices is very important for overall fossil energy consumption.

@wtarreau commented Apr 5, 2019:

@bschn2 good catch for 1.5, I'll try this to stay safe. Thanks for confirming the divbox calls that could be reduced. Regarding the write ratio, if you notice, I already adjusted it, but granted, I didn't check the values. I'll do that and will prepare a patch with all this soon.

bschn2 added a commit that referenced this issue Apr 6, 2019

refine the sin_scaled function to pack extremes more
In this discussion #15 @wtarreau experimented with two sqrt() around the sine to pack values better. This experiment proves to provide smoother oscillations and can be simplified by reducing the exponentiation by two. Also the extra addition is simplified with a call to round(). With this we can safely divide the number of rounds by almost 3 and see the hash rate increase by as much. This needs to adjust the memory access ratios however (4 times more) which overall increases both performance and security.

@LinuXperia commented Apr 6, 2019:

@bschn2
I am sorry, it looks like I did not express myself very well on how to improve the current situation with Rainforest: instead of looking for a rare hash with leading zeros, we look for a collision of each calculated nonce hash in the rambox.

Here is my easy-to-implement improvement suggestion:

The problem is the part that requires us to check for a specific rare hash with leading zeros in front of the hash.
Such a rare hash occurrence requires a lot of brute-force hash calculation, which is against what the Rainforest algorithm stands for.

My working approach, which solves this problem without changing a lot of the code, is that instead of looking for a rare end hash with leading zeros that is under a target hash, we use the number of leading zeros in the nBits field value as the number of leading bytes of the calculated nonce end hash to be matched in the rambox.

Let's say the lowest difficulty requires us to have a hash with two leading zeros.
Instead of brute-forcing hashes until a rare hash with two leading zeros is found, what we do now with Rainforest is take two bytes of the calculated end hash and look whether such a byte-pattern collision exists in the rambox using memcmp().

The value of how many leading zeros are required is stored in the 80-byte pdata block header that each thread has. So implementing this is very easy and should work like a charm.

If somebody finds such a byte end-hash collision very fast, then the difficulty automatically adjusts the nBits field and requires us to find a hash with, say, 8 leading zeros now, which is harder than before.

Again, instead of looking for an end hash with 8 leading zeros, with Rainforest we just match 8 bytes of each nonce end hash and look in the rambox whether such a byte combination exists.

If yes, and this was again found faster than the 1-minute requirement that MicroBitcoin has for mining a block, the difficulty algorithm will adjust the nBits field value to require a hash with, say, 16 leading zeros, which for us and the Rainforest algorithm means we need to match 16 bytes of each calculated nonce end hash in the rambox.

If finding such a 16-byte hash combination now takes 2 minutes instead of the required 1 minute, then the difficulty algorithm will drop the difficulty in the nBits field to look for 14 leading zeros, aka 14 bytes in the rambox, making it easier than before.

By this, everything adjusts automatically so we stay inside the 1-minute time frame for mining a block, without needing to brute-force hash calculations to find a rare hash with leading zeros.
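
The lookup step described above could look like this minimal sketch (my illustration, not proposed code: `find_prefix_collision()` and its linear scan are assumptions about how the memcmp() matching would work):

```c
#include <stdint.h>
#include <string.h>

/* Scan the rambox (viewed as raw bytes) for the first occurrence of the
 * hash's leading <nbytes> bytes; <nbytes> would be derived from the nBits
 * difficulty field. Returns the byte offset of the collision, or -1 to
 * signal that the next nonce should be tried. */
static long find_prefix_collision(const uint8_t *hash, size_t nbytes,
                                  const uint8_t *rambox, size_t rambox_len)
{
    for (size_t i = 0; i + nbytes <= rambox_len; i++)
        if (memcmp(rambox + i, hash, nbytes) == 0)
            return (long)i;
    return -1;
}
```

One trade-off this sketch makes visible: each extra matched byte multiplies the rarity by 256 rather than by 2 per bit, so an nBits mapping based on whole bytes would adjust difficulty in very coarse steps.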

@itwysgsl (Contributor, Author) commented Apr 6, 2019:

Hello again @bschn2 @wtarreau.
I just tested this 3b35a37 commit and it's actually ~3-4 times faster on my i7. But it's still not fast enough. I tried to mine for around 10 minutes on the same testnet with low diff etc., but it was still the same. After that I started experimenting by dividing the amount of loops at this line by 3, 6, 12 and so on (don't ask why, I just wanted to test the speed):

loops = sin_scaled(msgh);

And here is what I got:

  • Divided by 3: 1111.15 H/s/thread, without any blocks in a long time
  • Divided by 6: 2533.53 H/s/thread
  • Divided by 12: 5446.98 H/s/thread
  • Divided by 24: 11606.02 H/s/thread; at this point I start getting blocks at a more or less decent speed

Maybe this test would be helpful in some way :)

@bschn2 (Owner) commented Apr 6, 2019:

@itwysgsl do you mean you don't find shares, or you don't find blocks? Not finding shares would indeed be problematic, but it should simply be a matter of difficulty. At 1.11 kH/s you scan a full 16-bit range every 60 seconds, so if the pool's difficulty is low enough you must find these shares. What is your difficulty in this case?

If what you don't find is a block, this sounds normal, as the purpose is that the chances to find a block are equally shared among miners: if you have 1000 miners, one will mine the block while the 999 others will not. So if you emit a new block every minute, each of 1000 miners would on average find a block every 1000 minutes. But even then it's a matter of adjusting the difficulty: if the target is 0x0000FFFF...FFFF then at 1 kH/s you will find it in about 60 seconds.

Last point: I'm a bit surprised by your i7's performance here, did you enable the correct build options? This is roughly 5 times slower than mine without dividing, hence 15 times slower overall. Did you enable -O3 and -march=native?

@bschn2 (Owner) commented Apr 6, 2019:

Oh and by the way, many thanks for sharing your observations!

@bschn2 (Owner) commented Apr 6, 2019:

@LinuXperia I'm still unsure I really understand the principle you're describing. I think it is very similar to hashing except that you look up some bits in the rambox. What I don't understand is how you populate it and how you validate a share or a block afterwards. Also in any case the computation time spent is required as a proof of work. Whether you find the bits in the rambox or anywhere else inside the hash algorithm, it's the same, you have to iterate over nonces so that most participants find shares to be paid and that one of them finds the block. So it's unclear to me what your method brings at this point.

@wtarreau commented Apr 6, 2019:

@itwysgsl I am also surprised by your numbers; how do you test? Is it with the patched cpuminer maybe? Have you tried "rfv2_test -b -t $(nproc)"? I must confess I have not checked how or when it initializes the rambox. I hope it does so only once, on the first call, but I don't know. This could explain your low performance if it builds a full rambox for each hash.

@LinuXperia commented Apr 6, 2019:

@LinuXperia Whether you find the bits in the rambox or anywhere else inside the hash algorithm, it's the same, you have to iterate over nonces so that most participants find shares to be paid and that one of them finds the block. So it's unclear to me what your method brings at this point.

@bschn2
The way I understand it is that the 110 H/s/thread are way too low to find a block in about 60 seconds as a single miner at the lowest difficulty.

Because of this problem, improvements are needed so that a single miner using a single-board computer running the MicroBitcoin Rainforest miner is able to mine a block as a solo miner in about 60 seconds.
The test of just a single miner running just one node and one miner failed, as he was not able to mine any blocks in the given time.

So the minimal requirement of testing on one blockchain node, using one miner, to mine one block in the given period failed.

His hashing numbers look okay, as he has the same hash speed on his i7 as I have on mine.

Because of this I suggested abandoning the Bitcoin approach of looking for a rare hash with leading zeros and instead using the Equihash approach of finding bit collisions.

Finding bit collisions makes it easier to mine blocks, as we don't need to brute-force a huge number of hashes and lose time until we find such a rare hash.

@wtarreau commented Apr 6, 2019:

@LinuXperia how do you measure this performance, and how is this lowest difficulty calculated or configured? (Sorry, I'm not much aware of all this; I'm only using cpuminer to validate the thermal robustness of my build farm.) With rfv2_test I'm seeing numbers 8 times larger than yours:
$ gcc -march=native -O3 -o rfv2_test rfv2_test.c -pthread -lm
$ ./rfv2_test -b -t 1
847 hashes, 1.021 sec, 1 thread, 829.374 H/s, 829.374 H/s/thread
860 hashes, 1.000 sec, 1 thread, 859.975 H/s, 859.975 H/s/thread
849 hashes, 1.000 sec, 1 thread, 848.988 H/s, 848.988 H/s/thread
858 hashes, 1.000 sec, 1 thread, 857.988 H/s, 857.988 H/s/thread
^C
And on ARM:
$ ./rfv2_test -b -t 1
1334 hashes, 1.071 sec, 1 thread, 1245.832 H/s, 1245.832 H/s/thread
1336 hashes, 1.000 sec, 1 thread, 1335.768 H/s, 1335.768 H/s/thread
1333 hashes, 1.000 sec, 1 thread, 1332.841 H/s, 1332.841 H/s/thread
^C

@bschn2 (Owner) commented Apr 6, 2019:

@LinuXperia well, I really don't understand the method you're trying to explain, I'm sorry. I don't understand why you say "rare hash with leading zeroes": the number of zeroes is log2(1/frequency), so if a matching hash is rare it's because the difficulty has made it so.
I will have a look at equihash to try to understand how it differs regarding this, but I still fail to see how that would change anything given that we want a miner to spend time to prove his work.

@bschn2 (Owner) commented Apr 6, 2019:

Well, after having read a bit about Equihash, I think I get it a bit better, but in my humble opinion it focuses solely on the memory-bound aspect, and as a result it has already been ported to an ASIC (Bitmain's Z9, which is 10 times faster for the price than a GPU): https://www.heise.de/newsticker/meldung/Ende-der-Grafikkarten-Aera-8000-ASIC-Miner-fuer-Zcash-Bitcoin-Gold-Co-4091821.html
This even resulted in a 51% attack on Bitcoin Gold and a loss of $18M. This is exactly the type of things I want to avoid.

Also looking at the numbers, it's said that an Nvidia 1080Ti does only 650 sol/s (=hashes/s) so it's even way lower than what we're doing on rfv2. The main challenge we have to address is to make sure that MBC's short-lived blocks can be mined in the block's life, and the solution above apparently makes this situation worse from what I'm reading.

@endlessloop2 commented Apr 6, 2019:

Hello y'all, I've been following RF since last year and implemented V1 on my unreleased coin.
I'd like to know what the purpose of making the algorithm "faster" is, considering coins can change their starting difficulty.
What kind of tests are you doing that make you find it "slow to find blocks"?
I believe the algorithm is fine at this point, except for any bugs that may remain unsolved, but it shouldn't be changed just to make it "faster".

@wtarreau commented Apr 6, 2019:

Interesting. It's important to keep in mind that memory speed varies with the device's price. The DRAM access times I've measured so far: http://git.1wt.eu/web?p=ramspeed.git;a=blob;f=data/results.txt
So a cheap board has 2-3 times the access time of a PC, and it's said (though I cannot verify it) that GPUs are even faster with GDDR5. Also the PC's memory controller can initiate multiple accesses at once while cheap devices can't, resulting in almost a 10-times difference in multi-core tests. It's not unreasonable to imagine someone plugging SRAM into an FPGA or ASIC and getting a 12 ns access time where a PC needs 60. The cost of 96 MB of SRAM would certainly be prohibitive though. I think that for what you guys are looking for, the algo mixes a lot of expensive features and makes itself prohibitive to implement in hardware. I do have ideas on how to help your algo be memory-bound, but they would be extremely slow, and from what I'm reading, speed seems to be an issue for your use case.
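
Access times like those in the linked table come from dependent-load ("pointer chasing") measurements. A minimal probe in the same spirit (my sketch, not the ramspeed tool's code) builds one big random cycle and walks it, so every load must wait for the previous one:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Turn chain[] into one random cycle covering all <n> slots (Sattolo's
 * algorithm), so chasing it defeats prefetchers and exposes raw latency. */
static void build_chain(size_t *chain, size_t n)
{
    srand(42);                            /* fixed seed: reproducible walk */
    for (size_t i = 0; i < n; i++)
        chain[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;    /* j < i keeps a single cycle */
        size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }
}

/* Walk <steps> dependent loads starting from slot 0. */
static size_t chase(const size_t *chain, size_t steps)
{
    size_t p = 0;
    while (steps--)
        p = chain[p];
    return p;
}

/* Average nanoseconds per dependent access over a <words>-slot area. */
static double avg_access_ns(size_t words)
{
    size_t *chain = malloc(words * sizeof(*chain));
    struct timespec t0, t1;

    build_chain(chain, words);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = chase(chain, words);       /* p is 0 again: full cycle */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(chain);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec) + p)
           / (double)words;
}
```

Calling avg_access_ns(1 << 23) (64 MB of size_t on 64-bit) gives a number comparable to the table; small sizes report cache latency instead of DRAM latency.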

@jdelorme3 (Contributor) commented Apr 6, 2019:

Also looking at the numbers, it's said that an Nvidia 1080Ti does only 650 sol/s (=hashes/s) so it's even way lower than what we're doing on rfv2.

So this is what I was afraid of: Equihash mostly targets high-performance hardware. I doubt I can run it on my Raspberry Pi!

@bschn2 (Owner) commented May 15, 2019:

Great news Sir!
Ideally we should rebuild the cpuminer patch set and try to get it merged upstream so that you don't need to maintain this fork anymore.

@djm34 commented May 15, 2019:

On Tue, May 14, 2019 at 07:00:51PM -0700, djm34 wrote: yes indeed, keeping only the loops = 2 allows many simplifications, such as there being no need to update the rambox.

Why would you not update the rambox in this case? It's used inside each round, so if you don't update it you will definitely produce bad hashes from time to time if the same location is visited multiple times.

The probability of multiple readings is very low, especially in the 2-loops case. You have roughly 50 rounds over a huge rambox...

@itwysgsl (Contributor, Author) commented May 15, 2019:

@bschn2 since we added some platform-specific functions like be32toh which are present only on Linux, I think we should include this header file in the repository to make it compatible with all systems. I tested it with cpuminer and it works like a charm :)

@wtarreau commented May 15, 2019:

@itwysgsl we can do even simpler. I normally never use these macros; it's just that I lazily did that inside cpuminer, which already uses them, and didn't realize that by moving the code back to rfv2 I'd introduce this dependency. I'll replace them with generic functions. They're not even on a fast path, so there's no reason to bother anyone with such dependencies.

@wtarreau commented May 15, 2019:

PR #36 fixes this now.

@MikeMurdo (Contributor) commented May 15, 2019:

Ideally we should rebuild the cpuminer patch set and try to get it merged upstream so that you don't need to maintain this fork anymore.
Later today I should have some time to work on this and update my pending PR.

@MikeMurdo (Contributor) commented May 15, 2019:

Still late on my work, I'm afraid it's not for today anymore. Wanted to let you know.

@itwysgsl (Contributor, Author) commented May 16, 2019:

Btw, yesterday @jareso (author of an optimized private miner for rfv2) showed up in our discord and left this message:

Right now his miner, on average, produces 18 MH/s per miner.
Just want to let you know :)

@itwysgsl (Contributor, Author) commented May 16, 2019:

Also, I tested RFv2 with the recent patches on my Android phone, and it produces around 10 kH/s.
itwysgsl/RainforestAndroidMiner@ac744b3

@bschn2 (Owner) commented May 16, 2019:

@itwysgsl his response is totally valid regarding the fact that others are attacking algorithms as well. What I'm contesting is not this. It's the fact that we're all working on a way to better decentralize mining, and that this person has the required skills to spot weaknesses before the algorithm is released, yet silently keeps them secret, incurring lots of work for you and the pools upon each release, while disclosing such issues early would result in a more robust initial design. Thus I stand by my words: his primary motive is to grab most of the shares, not to help with fairness.

With this said, I really doubt he achieves 18 MH/s per card; I suspect there are quite a number of cards per miner. The algorithm involves operations that are extremely fast on ARM, very fast on x86 and not natively implemented on many other platforms, thus costing more there, so the base performance per core will be lower. Anyway, this helps fuel my ideas for a v3 :)

I'm seeing on miningpoolstats that our latest update has done lots of good, with most miners going back to the public pools. Also there are mostly GPUs on zergpool and mostly CPUs on skypool. So far, so good.

Regarding your phone, it's ARMv7, right ? If so it's pretty decent for such a platform which doesn't have AES, IDIV, CRC nor 64-bit native operations! Let's wait for some feedback from users of more recent phones using ARMv8 (with CPUs like Snapdragon or Kryo).

@MikeMurdo (Contributor) commented May 16, 2019:

@bschn2 the CPU vs GPU specialization of pools is natural, expected and nothing new: if you're having low power (say a CPU) you'd rather join a pool where users are mostly like you, so that you have a chance to get a respectable share every time a block's found, even if it's not frequent. But if you have 40 GPUs and can expect to help a pool mine a block 3 times more frequently and get 50% of the shares, you'd rather join a pool already showing a high hash rate because you'll get a big share frequently. And if you know you're concentrating most of the power, you'd rather join the pool with the lowest fees or run solo.

And with few miners at the moment, MBC is extremely appealing to mine : even with moderate power you can expect to make a good share of 12.5k every 2 minutes or so. I watched yesterday and estimated that with only 1 MH/s it was possible to make roughly $25/day. This is why those with large enough power immediately jump onto such coins, they are extremely profitable at opening. Once more miners join, the revenue is more evenly spread and it's less possible to expect such high revenues. At the moment it seems to go mostly to Jareso though.

@bschn2: most android phones with armv8 seem to enable armv7a only: https://i.stack.imgur.com/7EF24.jpg https://i.stack.imgur.com/XCGxi.jpg
However as you can see, aes, idiv and crc are present there and I think you may be able to enable them even in armv7 mode. I remember seeing a raspi somewhere in armv7 mode using CRC32 instructions. Possibly that building with "-march=armv7-a+idiv+crypto+crc" would enable everything. Or maybe with "-m32 -mcpu=cortex-a53+crypto+crc".

@itwysgsl (Contributor, Author) commented May 16, 2019:

@bschn2 I believe my phone has a Snapdragon 835 inside. Also, to make the Android miner work I commented out this part, otherwise an error related to __builtin_clrsbll pops up.

@wtarreau commented May 16, 2019:

It could be indicative of an arch mismatch indeed. Have you tried to pass "-march=armv8-a+crypto+crc" for example when building, instead of "-march=native" ?

@itwysgsl (Contributor, Author) commented May 16, 2019:

@wtarreau nope, but I can try right now.

@itwysgsl (Contributor, Author) commented May 16, 2019:

Here is the error message itself: error: definition of builtin function '__builtin_clrsbll' static. Probably this check doesn't have an appropriate condition to handle the Android build.

@wtarreau commented May 16, 2019:

Can you please report the output of "gcc -v" ? I guess the test on the version is not correct indeed. I did this part myself to try to build on older compilers. In the worst case you can simply disable this block and see if your perf is better with the build options.

@itwysgsl (Contributor, Author) commented May 16, 2019:

@wtarreau I think Android Studio uses clang for compilation, but still, here is the output of gcc -v:

Apple LLVM version 10.0.0 (clang-1000.10.44.2)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

@wtarreau commented May 16, 2019:

Ah, got it! It's clang indeed, so it advertises v4.2.1. We'd need to check for the clang version then, as your version appears to include this builtin while the one I tested last time didn't. In the mean time it's harmless, feel free to comment out the whole block to be able to retest the performance on armv8 mode with all extensions enabled.

@itwysgsl (Contributor, Author) commented May 16, 2019:

@wtarreau so, with the -march=armv8-a+crypto+crc flag the miner loses around 20-40% of performance.

@wtarreau commented May 16, 2019:

Oh, that's really funny; probably some architectural optimizations are lost. From what I've found, the SD835 is an octa-core A73 at up to 2.45 GHz, it must be awesome! I'm having a hard time imagining that it can only be a compiler issue. Please try "-mcpu=cortex-a73+crypto+crc" instead, and add -m64 to be certain it builds in 64-bit mode. Also double-check that you didn't lose the optimization level (typically -O3/-Ofast) when forcing the other flags.

@MikeMurdo (Contributor) commented May 16, 2019:

I've updated the cpuminer PR with the latest speedups and fixes.

@itwysgsl (Contributor, Author) commented May 16, 2019:

@wtarreau with -mcpu=cortex-a73+crypto+crc I'm getting around 14 kH/s :)
Can't compile with the -m64 flag yet since it causes some trouble with the sha256 implementation in cpuminer. Working on a fix now.

@itwysgsl (Contributor, Author) commented May 16, 2019:

So, I tried it, but with the -m64 flag the miner loses speed and doesn't produce any shares.

@wtarreau commented May 16, 2019:

Ah, so at least it proves it's currently running in 32-bit, which explains the lower performance. If it's cpuminer you're using, you should build it with -DNOASM. Last time I updated the code there, I even added a script called "build-linux-arm" or something like that, because the default options didn't work for me.

@djm34 commented May 16, 2019:

Btw, yesterday @jareso (author of an optimized private miner for rfv2) showed up in our discord and left this message:

Right now his miner, on average, produces 18 MH/s per miner.
Just want to let you know :)

LOL gpu dev on the brink of a burnout

@wtarreau commented May 17, 2019:

Guys, I've reviewed some details of the algo and also the performance ratio mentioned above. I'm thinking that the main issue is that the whole rambox should be used, to make sure there is no way to bypass it nor to precompute it. The problem is that memory controllers nowadays are extremely fast and prefetch tons of data on large systems (high-end CPUs, GPUs) but are dumb and unable to prefetch much on cheap hardware. However, the cheap hardware possesses architectural optimizations (crc32, aes) that large ones only partially possess (aes for x86) and that GPUs don't have.

If we consider the instruction-equivalent latency, by carefully chaining all this it should be possible to make everyone run at a bounded speed. Let's take my Neo4's 180ns RAM access time as a reference. This device has fast CPU cores with all extensions enabled. If we focus on a 200ns target as the time for a single operation, it means this machine must consume 20 more ns doing things that other machines will be slower at. My skylake has a 65ns access time. It does have AES but lacks CRC32. Fine, let's figure the fastest CRC32 implementation, repeat it as needed to fill. A raspi 3B+ shows 160ns, and it does have CRC32 but lacks AES. So we can put maybe one or two AES calls so that AES+RAM+CRC still gives 200ns.

Note, the PC will be able to prefetch multiple accesses at once so multiple threads will benefit from this. But we don't need to read lots of data, the bandwidth is not interesting here. Let's say we use a 64 MB work area (26 bits). This can be turned to 24 bits by considering 32-bit words. An AES operation returns 128 bits. This can be sufficient to produce 5 memory addresses to fetch data from. You also want to always write so that there is no efficient way to keep a shared read-only copy of this.
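
That slicing works out exactly: 5 x 24 = 120 of the 128 bits. A sketch of the extraction (hypothetical helper, treating the AES result as two 64-bit halves):

```c
#include <stdint.h>

/* Split a 128-bit value (lo = bits 0..63, hi = bits 64..127) into five
 * 24-bit word indices into a 2^24-word area (64 MB of 32-bit words). */
static void addrs_from_128(uint64_t hi, uint64_t lo, uint32_t idx[5])
{
    for (int i = 0; i < 5; i++) {
        unsigned shift = 24u * (unsigned)i;   /* slice's bit offset */
        uint64_t bits;

        if (shift + 24 <= 64)
            bits = lo >> shift;               /* fully in the low half */
        else if (shift >= 64)
            bits = hi >> (shift - 64);        /* fully in the high half */
        else                                  /* straddles both halves */
            bits = (lo >> shift) | (hi << (64 - shift));
        idx[i] = (uint32_t)(bits & 0xFFFFFFu);
    }
}
```

Each call thus yields five independent fetch addresses from one AES result, with 8 bits left over.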

From what I'm reading at a few places, GPUs like the 1080Ti mentioned above seem to feature many memory channels (11, or 12 minus one to be more precise):
https://www.anandtech.com/show/11180/the-nvidia-geforce-gtx-1080-ti-review
https://www.anandtech.com/show/11180/the-nvidia-geforce-gtx-1080-ti-review/3
https://www.quora.com/Why-is-the-size-of-RAM-in-my-GPU-GTX-1080-Ti-not-in-the-power-of-2

The prefetch is a full 64 bytes cache-line like on many other CPUs. So a GPU could be 12 times faster than a PC or an SBC thanks to this. There's not much that can be done by adding extra instructions in the loop, as these could be amortized on the huge number of cores. However the memory size can limit the number of active cores. A GPU with 12 GB of RAM could only have 192 active threads with 64 MB of work area. But even with 192 cores at work, I expect they could perform well on AES or CRC. But in this case we're at 12 times the performance, not 100 times or so.

Am I overlooking anything ?

@itwysgsl (Contributor, Author) commented May 17, 2019:

@wtarreau nice research. What do you think about creating a separate issue to discuss Rainforest improvements?

@wtarreau commented May 17, 2019:

you're right, we're not that much in the speed area anymore :-)

@bschn2 (Owner) commented May 17, 2019:

Hello!
Indeed we'd rather move such a discussion somewhere else. A quick note though, as this /is/ related to this issue: manipulating large amounts of RAM does have an impact on the hash rate. The initialization time for your 64 MB rambox can take a while. Even if we ignore it, let's say you want to modify your 64 MB one 32-bit word at a time, with a 200 ns access time. That is 16M x 200 ns = 3.2 seconds of total access time alone, which would result in 0.3 H/s in the best case; not practical. And if you only write 1M times, you only have a 6% chance of hitting a given location. By considering them as 128-bit locations instead, you'd have 4M locations; at 1 million changes that's a 22% chance of collision and 5 H/s. We could also decide to write full cache lines (64 bytes) so that one million writes do touch the whole rambox, but that is going to be expensive; even 512k writes would result in only a 40% chance of collision, for 10 H/s. At a 40% chance of collision you can use shared memory and scale a lot: if you use 3000 threads and only 60% of them provide usable shares, that's still 1800 threads, so the large memory doesn't protect us here.

But let's open a new issue for this discussion.
