
Rfv2 #7

Merged
merged 25 commits into bschn2:rfv2 on Mar 23, 2019

Conversation

@wtarreau

commented Mar 23, 2019

Here are the cleanups and code reorganization. rainforest.h should also be rearranged or maybe even removed. I prefer not to go too far for now.

Willy Tarreau added some commits Mar 23, 2019

Willy Tarreau
rainforest: rename aes2r.c -> rf_aes2r.c
This file only contains the AES code used by rainforest.
Willy Tarreau
rf_aes2r: quickly reindent the whole file
Nothing else was changed but indentation to make it more readable
and easier to edit.
Willy Tarreau
rf_aes2r: add support for RF_NOASM to disable asm code
This is already used in the cpuminer variant and is useful to validate that
the code also works on other platforms.
Willy Tarreau
rf_aes2r: fix some uint8_t vs const on add_round_key()
This argument is definitely const; better let the compiler know.
Willy Tarreau
rainforest: extract the crc32 calculation code into rf_crc32
This one is huge and makes the code less readable. Let's move it into
its own file which is included as-is at the same place. It was not
changed at all.
Willy Tarreau
rf_crc32: minimal reindent of the code
This is to make it more readable. Also it now includes stdint so that
it can be built standalone.
Willy Tarreau
rf_crc32: properly condition the table and ASM instructions
We don't want to build the rf_crc32_table if the CPU supports the CRC32
instruction set. Also let RF_NOASM disable this.
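
For illustration, a minimal sketch of this kind of guard, assuming a hardware/software split; the actual macro layout in rf_crc32.c may differ, __ARM_FEATURE_CRC32 is the compiler-provided feature macro, and the fallback below is a plain bitwise CRC rather than the real table code:

#include <stdint.h>

#if defined(__ARM_FEATURE_CRC32) && !defined(RF_NOASM)
#include <arm_acle.h>
/* hardware path: no table needs to be built at all */
static inline uint32_t crc32_u8(uint32_t crc, uint8_t b)
{
    return __crc32b(crc, b);
}
#else
/* portable path, also forced by RF_NOASM: plain bitwise CRC-32 */
static inline uint32_t crc32_u8(uint32_t crc, uint8_t b)
{
    crc ^= b;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    return crc;
}
#endif
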
Willy Tarreau
rainforest: extract the test code into rf_test.c
Now just building rf_test.c will result in the test code being built.
The code was moved out as-is.
Willy Tarreau
rf_test: basic reindent
Same as for other files.
Willy Tarreau
rainforest: rename the core code to rf_core.c
This way it will be more obvious what the role of each part is.
Willy Tarreau
rf_core: fix system include files
We need stdint but not stdio/stdlib/unistd anymore.
Willy Tarreau
rf_core: statify the two const arrays
No need to export them.
Willy Tarreau
rf_core: rename type hash256_t to rf_hash256_t
Since it's visible in the .h it's better to use a less generic name.
Willy Tarreau
rf_core: use RF_ALIGN() instead of __attribute__((aligned()))
This will help with portability, as cpuminer logs indicate it has faced
build issues on Windows.
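
As a rough idea of what such a wrapper can look like (a hedged sketch; the actual RF_ALIGN() definition in the repository may differ):

#include <stdint.h>

/* MSVC does not understand the GCC attribute syntax, so fall back to
 * __declspec(align()) there */
#if defined(_MSC_VER)
#define RF_ALIGN(x) __declspec(align(x))
#else
#define RF_ALIGN(x) __attribute__((aligned(x)))
#endif

/* example use: a cache-line aligned scratch buffer */
static RF_ALIGN(64) uint8_t scratch[256];
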
Willy Tarreau
rf_core: reintroduce the windows build fixes from cpuminer
cpuminer faced some build issues that were addressed with a few casts
and defines, let's apply the same here.
Willy Tarreau
rf_core: optimize rf_rambox()
It is counterproductive to avoid writing 50% of the time: it adds
conditional jumps which are mispredicted half of the time. Better to
use conditional moves and always write. This increases performance
by 6% on ARMv8.
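
A hedged illustration of the idea only, not the actual rf_rambox() code:

#include <stdint.h>

/* branchy form: the store depends on a data-driven test, which costs a
 * conditional jump that is mispredicted about half of the time */
static void update_branchy(uint64_t *cell, uint64_t v)
{
    if (v & 1)
        *cell += v;
}

/* branchless form: compute a mask and always store; the selection is done
 * with plain arithmetic (or a conditional move), so nothing to predict */
static void update_branchless(uint64_t *cell, uint64_t v)
{
    uint64_t mask = (uint64_t)0 - (v & 1);   /* all-ones or zero */
    *cell += v & mask;                       /* adds v or 0, always writes */
}
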
Willy Tarreau
rf_core: optimize rf_ctx organization for better cache locality
By moving some fields in the structure, we can increase the performance
by an extra 6% on Cortex-A53 at least.
Willy Tarreau
rf_cpuminer: add the cpuminer-specific hash scan code
This is only the scanhash_rf256() function. It entirely relies on the
other common functions.
Willy Tarreau
rf_cpuminer: avoid closing the hash when it doesn't match
Drop all hashes which will have one of their highest 16 bits set since
they will not match. This saves 4 calls to rf256_one_round() via
rf256_final() and almost doubles the performance.
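
A minimal sketch of the test implied above; the function name and the exact word being checked are hypothetical:

#include <stdint.h>

/* If any of the 16 most significant bits of the candidate word is set, the
 * hash cannot meet the target, so the 4 rf256_one_round() calls performed
 * by rf256_final() can be skipped for that nonce. */
static inline int rfv2_worth_finalizing(uint32_t high_word)
{
    return (high_word & 0xffff0000u) == 0;
}
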
Willy Tarreau
rainforest: avoid the slow memcpy() operation to reset the state
It's really expensive to use memcpy() to copy 16 kB of data on some
small processors like ARM Cortex A53 which only have 64-bit data paths
from the L1 cache. This roughly consumes 2k cycles just for the copy.
"Perf top" shows that half of the time is spent in memcpy(), and given
that this exhausts the L1 cache, the rest of the operations must cause
a lot of thrashing.

Since there are few modifications applied to the rambox between two
consecutive calls, better keep a history of recent changes inside
the context itself. This doesn't cost much because the write bus
between the CPU and the L1 cache is 128 bit on A53 so we can afford
a few writes. Also, the typical number of updates is apparently between
16 and 32, so it makes sense to put an upper bound at 32 and keep the
memory footprint low.

The performance is roughly multiplied by 5 on A53 just by doing this,
the hash rate reaches about 14.4k/s on NanoPI-Neo4, or almost 10 times
the performance of the original code.
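
A hedged sketch of the history mechanism; the field names, sizes and fallback policy below are illustrative and not the actual rf_ctx layout:

#include <stdint.h>

#define HIST_MAX 32    /* upper bound on recorded changes, as discussed above */

struct rambox_hist {
    uint32_t idx[HIST_MAX];   /* which rambox cells were modified */
    uint64_t old[HIST_MAX];   /* their previous contents */
    unsigned changes;         /* number of recorded writes, starts at 0 */
};

/* record the old value, then write; a real implementation would fall back
 * to a full re-initialization if the bound were ever exceeded */
static void rambox_write(uint64_t *rambox, struct rambox_hist *h,
                         uint32_t idx, uint64_t val)
{
    if (h->changes < HIST_MAX) {
        h->idx[h->changes] = idx;
        h->old[h->changes] = rambox[idx];
        h->changes++;
    }
    rambox[idx] = val;
}

/* undo the few recorded writes instead of memcpy()ing the whole area back */
static void rambox_restore(uint64_t *rambox, struct rambox_hist *h)
{
    while (h->changes) {
        h->changes--;
        rambox[h->idx[h->changes]] = h->old[h->changes];
    }
}
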
Willy Tarreau
rf_cpuminer: make sure to properly adjust the target in benchmark mode
The ptarget[7] was updated but not Htarg which is used to break out of
the loop, making slow hashes take ages to complete.
Willy Tarreau
rf_core: reindent the whole file
It's now much more readable.
Willy Tarreau
rf_core: add spaces around operators to make the code more readable
It's much more readable this way, with less visible confusion
between unary and binary operators.
Willy Tarreau
rf_crc32: add spaces around operators
The code is now more readable as well.
Willy Tarreau
rf_crc32: split rf_add64_crc32() in two
This function has one part which is standard CRC and another part
which is for the RF algo. Let's keep only the standard part in the
CRC. This allows simplifying it so that it no longer performs shift
operations on 64-bit registers when using the table, since 32-bit ones
are sufficient. This was tested both on x86_64 and arm64. The x86_64
code is now 1% faster.
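
A hedged sketch of the shape of that split; rf_crc32_table is the standard reflected CRC-32 table already present in the file, but the real function names and signatures may differ:

#include <stdint.h>

extern const uint32_t rf_crc32_table[256];

/* pure CRC-32 of one 32-bit word: every shift stays on 32-bit registers */
static uint32_t rf_crc32_u32(uint32_t crc, uint32_t msg)
{
    int i;

    for (i = 0; i < 4; i++) {
        crc = rf_crc32_table[(crc ^ msg) & 0xff] ^ (crc >> 8);
        msg >>= 8;
    }
    return crc;
}

/* 64-bit input handled as two 32-bit halves; any RF-specific mixing that
 * used to live in rf_add64_crc32() stays outside this standard part */
static uint32_t rf_crc32_u64(uint32_t crc, uint64_t msg)
{
    crc = rf_crc32_u32(crc, (uint32_t)msg);
    crc = rf_crc32_u32(crc, (uint32_t)(msg >> 32));
    return crc;
}
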
@bschn2

Owner

commented Mar 23, 2019

This looks pretty good, thank you! Your work will definitely help me because I planned to extract the history buffer changes from your cpuminer code.
I see you haven't updated the OpenCL code or the patches; I guess this is on purpose while waiting for the algorithm update.

@bschn2 bschn2 merged commit e65cef5 into bschn2:rfv2 Mar 23, 2019

@itwysgsl

Contributor

commented Mar 23, 2019

I think we can move the discussion of rainforest v2 here :)
What do you think?

@bschn2

Owner

commented Mar 23, 2019

Gentlemen,

So I have uploaded a tentative version in the rfv2 branch which uses more RAM and supports the history buffer in order to restore the buffer between hashes. I also tried to pre-init the rambox from a template using memcpy(), but it actually is slower due to excessive cache thrashing. The loop is run 320 times. My Skylake at 4 GHz does about 1250 H/s. Finally, I placed some perturbations based on float rounding error. I would like to see this code tested on ARM; my Geekbox doesn't boot anymore, I think it's dead, so I'd appreciate it if @wtarreau or anyone else tried it on such devices.

The code can be tested directly by building and running rf_test thanks to @wtarreau's reorganization: it does 1000 hashes, varying only the last word like a nonce, displays the result and quits.
I'll have to switch to redoing the cpuminer and yiimp patches. For sgminer, this will require porting to OpenCL again. The code didn't change that much, but likely enough to cause difficulties.

It would be the right moment to rename the algorithm, especially to avoid confusion with the current version. "lopohash" was suggested; I'm fine with this and open to suggestions.

@LinuXperia


commented Mar 27, 2019

@itwysgsl Parallel block mining and parallel nonce calculation are two different things.

The benchmarks he refers to do not state that nonce parallelisation is easily possible.
They only state that you can run the block mining in parallel.

He clearly has no clue what he is talking about and has no clue how the algo works.

What he is saying is:
"See, I started the same web browser as you, twice, on the same computer as you,
so my web browsers now run two times faster than yours and are able to show the same website on each browser two times faster than yours, on the same computer and with the same browser."

LOL

@itwysgsl

Contributor

commented Mar 27, 2019

@LinuXperia I know about nonce parallelisation and that's totally fine, but I'm curious whether there is any possibility to parallelize the algo itself (for example to make it easier on FPGAs or ASICs) 🤔

@bschn2

Owner

commented Mar 27, 2019

Hello @itwysgsl !
@LinuXperia is right. The algorithm doesn't prevent parallelization, it limits it by cost and time. The principles are the following:

  • 96 MB of dedicated RAM are required per hash output. This means that in order to run 1000 parallel hashes you will need 96 GB of RAM. This is not available nowadays on GPUs, so even a powerful GPU with 16 GB of RAM will be limited to around 160 parallel hashes (see the quick arithmetic after this list).
  • the processing cost of each hash: if you want to port the algorithm to an FPGA or an ASIC and implement one dedicated RAM bus for each, then you need to run a large number of rounds, each of which executes much faster on a regular CPU. And you cannot expect to parallelize the steps by implementing a pipeline, since the RAM box is the same for all steps, so each step must run with its own copy of the whole RAM.
    Even assuming that a large ASIC vendor would eventually manage to build an ASIC making use of the finest pitch and the most advanced ALU blocks to reach the performance of CPUs, it would simply have the same performance as a generic CPU for thousands of times the cost.
    Given the numbers reported by @wtarreau and @jdelorme3, I predict that the performance of CPUs on this algorithm will only slowly improve over time, but not much. Huge GPUs will likely be slightly faster than CPUs, but overall the expected gains diminish as power usage increases, so you'd rather mine with a low-end efficient device than with a high-end powerful one.
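
As a quick check of the RAM argument above, using only the figures already quoted:

    1000 hashes x 96 MB  = 96,000 MB, i.e. roughly 96 GB of dedicated RAM
    16 GB / 96 MB        = 16,384 MB / 96 MB ≈ 170 hashes as a hard ceiling,
                           hence "around 160" once per-thread overhead is set aside
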
@itwysgsl

Contributor

commented Mar 29, 2019

Hello @bschn2, what do you think about RFv2 in its current state? Should I try to set up a test network for it, or wait for more changes?

@bschn2

Owner

commented Mar 29, 2019

Dear @itwysgsl, a quick response before going to get some sleep.
I think it is not bad. I wanted to finalize the OpenCL port but my plans were disturbed by some unexpected family issues (the elderly break bones when falling in their bathtub). Mom is now out of trouble, so I should be able to finish the code this weekend. I would really like to give a chance to @wtarreau's idea of a variable area size, as it made progress in my mind and I see how to do it nicely. If you think you can easily change your test code, then maybe it could be nice to start with the current one and update it later. I don't remember in what state I left the cpuminer-multi code and, to be honest (though others already noticed, ping @MikeMurdo :-)), I'm not the best at recreating patches, so if someone is willing to copy-paste the new code into cpuminer-multi and yiimp, that would definitely help me. For cpuminer-multi it's just the same as in the benchmark part of rf_test: you need to allocate one rambox per thread first, call rf_raminit() with it once, then you always pass it to rf256_hash(), which guarantees to restore it before returning.
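
For the record, a hedged sketch of that per-thread flow; the exact prototypes of rf_raminit() and rf256_hash() are defined in the repository and may take different arguments, so the declarations below are placeholders:

#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

extern void rf_raminit(void *rambox);
extern void rf256_hash(void *out, const void *in, size_t len, void *rambox);

#define RAMBOX_SIZE (96UL * 1024 * 1024)   /* 96 MB work area per thread */

/* done once when the mining thread starts */
static void *rfv2_thread_init(void)
{
    void *rambox = malloc(RAMBOX_SIZE);

    if (rambox)
        rf_raminit(rambox);        /* expensive init, but only once */
    return rambox;
}

/* done for every nonce: rf256_hash() restores the rambox before returning,
 * so the same area is reused without any re-initialization */
static void rfv2_hash_one(void *rambox, uint8_t out[32],
                          const uint8_t data[80])
{
    rf256_hash(out, data, 80, rambox);   /* 80-byte header, as in cpuminer */
}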

@itwysgsl

Contributor

commented Mar 29, 2019

@bschn2 oh, I hope your mother is fine. I'm going to set up the testnet soon and update cpuminer, yiimp and nomp; it shouldn't be too hard, I think.

@itwysgsl

Contributor

commented Mar 29, 2019

Hello again @bschn2 @wtarreau, the author of the CUDA miner, @djm34, just messaged me and said the following:

I am looking a bit at the effect of the rambox... it is pretty much useless. It can be entirely bypassed.
Basically the carry is just the crc with 0x50 in front, I mean 0xNNNNNNNN just becomes 0x50NNNNNNNN.
So the rambox "adds" the 50 in front
(I checked only 100 nonces)
@wtarreau

Author

commented Mar 29, 2019

@djm34


commented Mar 29, 2019

Sorry, that was a mistake while reading the code. However, if the current AMD miner can do 33 GH/s, it definitely means that it can be bypassed somehow.

@wtarreau

Author

commented Mar 29, 2019

@djm34


commented Mar 30, 2019

Can you elaborate a little bit? (I am trying to write the CUDA code.)
I noticed that, indeed, the carry can be computed in advance of the round.
However, for the moment, I don't understand how to deal with the 4 rounds in rf256_final.

@djm34


commented Mar 30, 2019

lol, ok, found it. The problem, in my opinion, is more the full hash rotation rf256_rot32x256(&ctx->hash)
and the fact that we are modifying only 128 bits of the hash at each round.

The last 32 bits of the final hash are known before computing the final 4 rounds, and even if that isn't enough to say whether it is a solution or not, the probability is sufficiently high to just stop the calculation here...
and just submit the result.

Removing the full hash rotation (or doing it the other way around at the beginning), or rather doing it at the end, would probably solve the problem (or postpone it... but it would be enough to still need the rambox, which is nice btw, since not many can fit into GPU shared memory).

In my opinion, it is a bad idea not to mix the full hash at each round; this always leaves some possibility for attacks.
Considering that rf256_one_round is already a rather long mixing, one could mix
q, q+1
q+1, q+2
q+2, q+3
q+3, q
or something like that. I don't think this would affect the original speed performance, but it would definitely avoid bypassing some of the calculation.

@bschn2

Owner

commented Mar 30, 2019

Hello Sir,

indeed, you've probably understood why it's called "carry": it's kept from the previous step to the next one and is not used for the current one. This guarantees that each step depends on both N-1 and N-2. This is a protection against reversibility, because when you know step N, even if you figure out how to reverse a crypto function, you'd need knowledge of step N-2 in order to get back to step N-1. I considered using it for step N as well, but that would break this nice property. Besides, now that we're running many loops (in v2), this is something we don't need to care about anymore. Also, to be honest, it took a very long time to tune the function to successfully pass all SMHasher tests, and I admit I'd be a bit worried to detune it and risk introducing some bias. Running these tests takes one full week on a quad-core Skylake, so I tend to trust our ability to chain these layers many times more than trying to refine them a little bit at the risk of weakening them.
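
To illustrate the dependency being described, a toy model only; mix() and the state layout are made up, this is not the real round function:

#include <stdint.h>

struct toy_state {
    uint64_t hash;    /* result of the previous step (N-1) */
    uint64_t carry;   /* produced during step N-1, consumed only at step N */
};

static uint64_t mix(uint64_t a, uint64_t b)
{
    a ^= b;
    a *= 0x9e3779b97f4a7c15ULL;
    return a ^ (a >> 29);
}

/* step N: the new hash uses the carry kept from the previous step, so it
 * depends on both state N-1 and state N-2; reversing one step is not enough */
static void toy_step(struct toy_state *s, uint64_t msg)
{
    uint64_t next_carry = mix(s->hash, msg);   /* kept for step N+1 */

    s->hash  = mix(s->hash, s->carry);
    s->carry = next_carry;
}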

A quick update on my progress: I've wasted a lot of time trying to get the rambox area properly allocated in OpenCL and it's OK now; I can produce the same hashes as with rf_test. My son's RX560 is able to deliver up to 4.8 kH/s peak for now, and the performance decreases as the number of threads increases, likely due to contention in the memory controller.

I've implemented the random position and length in the work area as suggested by @wtarreau, with a minor difference: the randomness is applied anywhere in the area, with any length. It tends to increase the performance on GPUs; I should probably use a dampening function for the length. I'll send an updated version of the code soon.

@wtarreau

Author

commented Mar 30, 2019

If you apply random offsets and lengths, you will end up reducing the risk of collisions between multiple threads sharing the same area, which would allow some GPUs to use a shared area. You'd rather keep the area centered around the middle or the beginning instead, so that the same cells are intensively used. It may slightly increase the cache hit ratio, but not significantly enough that we'd care about it.

@bschn2

Owner

commented Mar 30, 2019

Dear @wtarreau thanks for this comment, you're absolutely right and I've just fixed it. I'm currently looking at the best way to implement the variable loop count. I wouldn't want to depend on over-optimized sin() versions but maybe I'm worrying for no reason.

bschn2 added a commit that referenced this pull request Mar 31, 2019

implement a variable work area width
this idea was first suggested by @wtarreau on github :
   #7 (comment)

this version has a few differences in that the beginning and end can be
anywhere in the block.

bschn2 added a commit that referenced this pull request Mar 31, 2019

implement a variable number of loops
this idea was first suggested by @wtarreau on github :
   #7 (comment)

It was a bit refined by exponentiating the sine wave with an odd power, providing a heartbeat-like curve giving short variations around a moderately stable average. The max number of rounds can now reach 765 with an average around 383.

The sine is passed a 32-bit discrete value and its output is rounded to an integer number of loops as well. IEEE754-compliant implementations return the exact same values when operating on 64-bit floats. Simplified implementations and those based on successive approximations will get a number of values wrong, possibly even the remainder of the input. A wrong number of rounds will result in a wrong hash.

The OpenCL implementation was validated against the x86_64 one and both return similar hashes.
@bschn2

Owner

commented Mar 31, 2019

Gentlemen,

I have pushed what I consider stable for version 2. I studied various sine-based curves for the loop count and ended up with sin(x/16)^5, which brings the merit of not being implementable as a table (as no single integer value is a multiple of PI) and provides both smooth and fast variations depending on the location relative to the period. I got it to work and to report the same hashes on my son's RX560 under OpenCL and on my Skylake, indicating both platforms are IEEE754 compliant. Non-compliant platforms (table-based, or successive approximations as could be done in FPGAs) will have a hard time returning a valid number of rounds and will emit lots of invalid shares. This now requires building with -lm, but this should not be an issue at all.
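
A hedged sketch of such a loop-count function; the scaling and rounding used in the repository may differ, this only shows the shape and why -lm is needed:

#include <math.h>
#include <stdint.h>

/* sin(x/16)^5: the odd power keeps the sign and gives a heartbeat-like
 * curve; since x is an integer, x/16.0 never hits an exact multiple of pi,
 * so the function cannot be collapsed into a small lookup table.  Scaled
 * here so the result stays near the ~383 average / 765 max quoted earlier. */
static unsigned toy_loop_count(uint32_t x)
{
    double s = sin((double)x / 16.0);
    double p = s * s * s * s * s;

    return (unsigned)lrint(382.5 * (1.0 + p));
}

Compile with -lm; any deviation from IEEE754 double precision in sin() shifts the rounded result and thus the number of rounds, which is exactly what penalizes non-compliant implementations.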

The C code is now slightly slower due to the slight increase in the number of rounds, but this remains fairly acceptable. My son's RX560 manages to reach about 5 times my Skylake's performance. I wouldn't be surprised if a better expert like WildRig's author manages to get more. I'm interested in ARM's performance here, as the history buffer had to be doubled to stay on the safe side, and I know that such boards may be impacted a bit there.

While trying to build for cpuminer-multi on top of V1, I figured we should rename the functions to avoid name collisions and confusion. It could be the last round of patches. I think we can rename all rf_* and rf256_* prefixes to rfv2_ and be done with it. Lastly, we'll have to update the comments and the performance reports in the comments.

Please just let me know if I forgot anything.

@LinuXperia


commented Mar 31, 2019

Hi @bschn2

Thank you very much for your hard work.
I was able to compile the code on an Ubuntu Linux 18.04 machine with an
Intel(R) Core(TM) i7-7700HQ CPU (quad core, 8 threads) @ 2.80GHz.
Here are the full specs of the CPU,
followed by the rfv2 check and benchmark results:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping: 9
CPU MHz: 800.057
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 5616.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d

./rf_test -c
Single hash:
valid: ad43e40193aee36e9fcb8307956a74a3.91a7f7ee087ee5e2e410230d708500d5
256-loop hash:
valid: b7981703e677f598226320c8fa1c77ae.c7cc0654b9930ba28a52baedeeba5563

./rf_test -b -t 4
467 hashes, 1.107 sec, 4 threads, 421.722 H/s, 105.431 H/s/thread
473 hashes, 1.000 sec, 4 threads, 472.979 H/s, 118.245 H/s/thread
520 hashes, 1.000 sec, 4 threads, 519.985 H/s, 129.996 H/s/thread
554 hashes, 1.000 sec, 4 threads, 553.983 H/s, 138.496 H/s/thread
510 hashes, 1.000 sec, 4 threads, 509.984 H/s, 127.496 H/s/thread
514 hashes, 1.000 sec, 4 threads, 513.985 H/s, 128.496 H/s/thread
523 hashes, 1.000 sec, 4 threads, 522.985 H/s, 130.746 H/s/thread
538 hashes, 1.000 sec, 4 threads, 537.984 H/s, 134.496 H/s/thread
496 hashes, 1.000 sec, 4 threads, 495.986 H/s, 123.996 H/s/thread
566 hashes, 1.000 sec, 4 threads, 565.984 H/s, 141.496 H/s/thread
503 hashes, 1.000 sec, 4 threads, 502.987 H/s, 125.747 H/s/thread
531 hashes, 1.000 sec, 4 threads, 530.984 H/s, 132.746 H/s/thread
513 hashes, 1.000 sec, 4 threads, 512.985 H/s, 128.246 H/s/thread
503 hashes, 1.000 sec, 4 threads, 502.985 H/s, 125.746 H/s/thread
493 hashes, 1.000 sec, 4 threads, 492.986 H/s, 123.246 H/s/thread
499 hashes, 1.000 sec, 4 threads, 498.987 H/s, 124.747 H/s/thread
476 hashes, 1.000 sec, 4 threads, 475.987 H/s, 118.997 H/s/thread
488 hashes, 1.000 sec, 4 threads, 487.986 H/s, 121.997 H/s/thread

Are there any instructions on how to test the CL code on my Nvidia GTX 1060?
I would also like to run the benchmark on the GTX 1060 if possible, but I don't know at the moment what steps are needed to run this benchmark with the CL code.

@jdelorme3

Contributor

commented Mar 31, 2019

Ah, I didn't succeed with the cpuminer bench; apparently the patch is still v1 and I was not sure what to do from there. So I tried rf_test on my Raspberry Pi 3B+ instead (it requires -lm as well) and it says this:

./rf_test -b -t 4
962 hashes, 1.613 sec, 4 threads, 596.516 H/s, 149.129 H/s/thread
980 hashes, 1.002 sec, 4 threads, 978.483 H/s, 244.621 H/s/thread
979 hashes, 1.000 sec, 4 threads, 978.627 H/s, 244.657 H/s/thread
976 hashes, 1.000 sec, 4 threads, 975.542 H/s, 243.886 H/s/thread
979 hashes, 1.000 sec, 4 threads, 978.635 H/s, 244.659 H/s/thread
979 hashes, 1.000 sec, 4 threads, 978.612 H/s, 244.653 H/s/thread
977 hashes, 1.000 sec, 4 threads, 976.642 H/s, 244.160 H/s/thread

So it seems faster than the i7-7700! Fantastic!
@LinuXperia perhaps you can try to activate hyperthreading. On my i7-6700 with 4 threads it says this:

./rf_test -b -t 4
520 hashes, 1.077 sec, 4 threads, 482.974 H/s, 120.744 H/s/thread
562 hashes, 1.000 sec, 4 threads, 561.981 H/s, 140.495 H/s/thread
585 hashes, 1.000 sec, 4 threads, 584.988 H/s, 146.247 H/s/thread
595 hashes, 1.000 sec, 4 threads, 594.987 H/s, 148.747 H/s/thread

And with 8 threads it says this:

 ./rf_test -b -t 8
877 hashes, 1.144 sec, 8 threads, 766.653 H/s, 95.832 H/s/thread
899 hashes, 1.000 sec, 8 threads, 898.968 H/s, 112.371 H/s/thread
990 hashes, 1.000 sec, 8 threads, 989.979 H/s, 123.747 H/s/thread
1046 hashes, 1.000 sec, 8 threads, 1045.978 H/s, 130.747 H/s/thread
1006 hashes, 1.000 sec, 8 threads, 1005.980 H/s, 125.747 H/s/thread

So it is almost two times faster, and becomes faster (though not by much) than the Raspberry Pi.
Fantastic work!

@bschn2

Owner

commented Mar 31, 2019

@LinuXperia many thanks for testing! You should also try with -t 8, it will almost double your numbers as the code works very well with hyperthreading thanks to the limited ILP that allows the two threads to use different ports in the execution unit.

Regarding the tests on your Nvidia card, I don't really know. Last time I checked, OpenCL support was claimed to be fairly complete, but I've read in various places that OpenCL is much slower than CUDA on these boards. I guess @djm34 has way more experience on this subject.

Best regards,
B.S.

@bschn2

Owner

commented Mar 31, 2019

Thank you for your test, @jdelorme3, this reassures me that I didn't break anything for ARM!

Regarding your confusion on the cpuminer patch, yes it's totally my fault, I should have removed the patches instead. At least it confirms that the algo needs to be renamed to remove this confusion later. I think we can put "rfv2" everywhere.

@bschn2

Owner

commented Mar 31, 2019

Also Gentlemen,
please do not forget to run "rf_test -c" at least once when you compile to verify that you're producing valid hashes, especially if you're seeing strange numbers.

@jdelorme3

Contributor

commented Mar 31, 2019

Ah, good catch, I had started an old executable on my Raspberry Pi! Here it is again:

./rf_test -c
Single hash:
valid: ad43e40193aee36e9fcb8307956a74a3.91a7f7ee087ee5e2e410230d708500d5
256-loop hash:
valid: b7981703e677f598226320c8fa1c77ae.c7cc0654b9930ba28a52baedeeba5563
./rf_test -b -t 4
643 hashes, 1.620 sec, 4 threads, 396.810 H/s, 99.202 H/s/thread
721 hashes, 1.001 sec, 4 threads, 720.619 H/s, 180.155 H/s/thread
760 hashes, 1.000 sec, 4 threads, 759.723 H/s, 189.931 H/s/thread
714 hashes, 1.000 sec, 4 threads, 713.716 H/s, 178.429 H/s/thread
749 hashes, 1.000 sec, 4 threads, 748.680 H/s, 187.170 H/s/thread

So it's a bit slower than before, but still faster than 4 threads on the i7!

@LinuXperia


commented Mar 31, 2019

@LinuXperia many thanks for testing! You should also try with -t 8, it will almost double your numbers as the code works very well with hyperthreading thanks to the limited ILP that allows the two threads to use different ports in the execution unit.

I should have posted it, yes, sorry for that.
Here are the 8-thread results for the same CPU.

Hyperthreading should already have been activated: I always see 8 CPUs displayed,
i.e. 4 real CPU cores and 8 virtual CPU cores.

CPU: Intel Core i7-7700HQ (-MT-MCP-) cache: 6144 KB
clock speeds: max: 3800 MHz 1: 2800 MHz 2: 2800 MHz 3: 2800 MHz 4: 2800 MHz 5: 2800 MHz 6: 2800 MHz 7: 2800 MHz 8: 2800 MHz

./rf_test -c
Single hash:
valid: ad43e40193aee36e9fcb8307956a74a3.91a7f7ee087ee5e2e410230d708500d5
256-loop hash:
valid: b7981703e677f598226320c8fa1c77ae.c7cc0654b9930ba28a52baedeeba5563

./rf_test -b -t 4
520 hashes, 1.000 sec, 4 threads, 519.984 H/s, 129.996 H/s/thread
554 hashes, 1.000 sec, 4 threads, 553.984 H/s, 138.496 H/s/thread
513 hashes, 1.000 sec, 4 threads, 512.987 H/s, 128.247 H/s/thread
508 hashes, 1.000 sec, 4 threads, 507.985 H/s, 126.996 H/s/thread
528 hashes, 1.000 sec, 4 threads, 527.985 H/s, 131.996 H/s/thread
537 hashes, 1.000 sec, 4 threads, 536.984 H/s, 134.246 H/s/thread
498 hashes, 1.000 sec, 4 threads, 497.987 H/s, 124.497 H/s/thread
565 hashes, 1.000 sec, 4 threads, 564.984 H/s, 141.246 H/s/thread
504 hashes, 1.000 sec, 4 threads, 503.985 H/s, 125.996 H/s/thread
531 hashes, 1.000 sec, 4 threads, 530.984 H/s, 132.746 H/s/thread
512 hashes, 1.000 sec, 4 threads, 511.987 H/s, 127.997 H/s/thread
503 hashes, 1.000 sec, 4 threads, 502.984 H/s, 125.746 H/s/thread
493 hashes, 1.000 sec, 4 threads, 492.984 H/s, 123.246 H/s/thread
498 hashes, 1.000 sec, 4 threads, 497.987 H/s, 124.497 H/s/thread

./rf_test -b -t 8
874 hashes, 1.000 sec, 8 threads, 873.973 H/s, 109.247 H/s/thread
850 hashes, 1.000 sec, 8 threads, 849.975 H/s, 106.247 H/s/thread
898 hashes, 1.000 sec, 8 threads, 897.973 H/s, 112.247 H/s/thread
842 hashes, 1.000 sec, 8 threads, 841.974 H/s, 105.247 H/s/thread
830 hashes, 1.000 sec, 8 threads, 829.976 H/s, 103.747 H/s/thread
918 hashes, 1.000 sec, 8 threads, 917.973 H/s, 114.747 H/s/thread
876 hashes, 1.000 sec, 8 threads, 875.974 H/s, 109.497 H/s/thread
842 hashes, 1.000 sec, 8 threads, 841.974 H/s, 105.247 H/s/thread
826 hashes, 1.000 sec, 8 threads, 825.975 H/s, 103.247 H/s/thread
930 hashes, 1.000 sec, 8 threads, 929.971 H/s, 116.246 H/s/thread
858 hashes, 1.000 sec, 8 threads, 857.974 H/s, 107.247 H/s/thread
878 hashes, 1.000 sec, 8 threads, 877.973 H/s, 109.747 H/s/thread
828 hashes, 1.000 sec, 8 threads, 827.976 H/s, 103.497 H/s/thread
874 hashes, 1.000 sec, 8 threads, 873.975 H/s, 109.247 H/s/thread

Thank you again for this amazing work @bschn2

Wish you all the best in Life.
Romeo.

@wtarreau

This comment has been minimized.

Copy link
Author

commented Apr 1, 2019

FWIW, here it is on my i7-6700k at 4.4 GHz:

./rf_test -b -t 8
1041 hashes, 1.135 sec, 8 threads, 917.322 H/s, 114.665 H/s/thread
1128 hashes, 1.000 sec, 8 threads, 1127.966 H/s, 140.996 H/s/thread
1179 hashes, 1.000 sec, 8 threads, 1178.981 H/s, 147.373 H/s/thread
1201 hashes, 1.000 sec, 8 threads, 1200.978 H/s, 150.122 H/s/thread
1124 hashes, 1.000 sec, 8 threads, 1123.979 H/s, 140.497 H/s/thread

On nanopi-neo4 (RK3399) at 2*2.0(A72) + 4*1.5(A53) GHz :

  1. with crypto extensions disabled (like raspi):
$ ./rf_test-armv8-noaes -b -t 6
1143 hashes, 1.509 sec, 6 threads, 757.701 H/s, 126.284 H/s/thread
1262 hashes, 1.000 sec, 6 threads, 1261.820 H/s, 210.303 H/s/thread
1285 hashes, 1.000 sec, 6 threads, 1284.868 H/s, 214.145 H/s/thread
1282 hashes, 1.000 sec, 6 threads, 1281.873 H/s, 213.646 H/s/thread
1277 hashes, 1.000 sec, 6 threads, 1276.874 H/s, 212.812 H/s/thread
  2. with crypto extensions:
$ ./rf_test-armv8 -b -t 6
1484 hashes, 1.504 sec, 6 threads, 986.518 H/s, 164.420 H/s/thread
1672 hashes, 1.000 sec, 6 threads, 1671.734 H/s, 278.622 H/s/thread
1633 hashes, 1.000 sec, 6 threads, 1632.807 H/s, 272.135 H/s/thread
1635 hashes, 1.000 sec, 6 threads, 1634.830 H/s, 272.472 H/s/thread
1643 hashes, 1.000 sec, 6 threads, 1642.834 H/s, 273.806 H/s/thread

And on nanopi-fire3 at 8*1.6 GHz (A53) (edited, previous ones were wrong) :

$ ./rf_test -b -t 8
1902 hashes, 2.166 sec, 8 threads, 877.968 H/s, 109.746 H/s/thread
2289 hashes, 1.004 sec, 8 threads, 2280.005 H/s, 285.001 H/s/thread
2250 hashes, 1.004 sec, 8 threads, 2241.076 H/s, 280.135 H/s/thread
2236 hashes, 1.004 sec, 8 threads, 2227.136 H/s, 278.392 H/s/thread
2250 hashes, 1.004 sec, 8 threads, 2241.121 H/s, 280.140 H/s/thread
2242 hashes, 1.004 sec, 8 threads, 2233.090 H/s, 279.136 H/s/thread

Given the nice advantage on the low-power machines, I think that the name "lopohash" would be deserved ;-)

@itwysgsl

Contributor

commented Apr 1, 2019

I'm going to set up a new testnet as soon as possible, thanks again for all your hard work!

@wtarreau

Author

commented Apr 1, 2019

On a fanless celeron-J4105 at 4x2.4 GHz:

$ ./rf_test -b -t 4
376 hashes, 1.353 sec, 4 threads, 277.902 H/s, 69.475 H/s/thread
380 hashes, 1.000 sec, 4 threads, 379.958 H/s, 94.990 H/s/thread
424 hashes, 1.000 sec, 4 threads, 423.970 H/s, 105.992 H/s/thread
424 hashes, 1.000 sec, 4 threads, 423.971 H/s, 105.993 H/s/thread
437 hashes, 1.000 sec, 4 threads, 436.971 H/s, 109.243 H/s/thread

Not bad! It's only 1/3 of my core i7!

Just out of curiosity I built it on my old single-core, dual-thread Atom (N410). I had to redefine __builtin_clrsb(), which wasn't provided by the compiler (I took it from the .cl version), but it fails the check and I don't know why. Maybe my __builtin_clz(), I'm not sure.

@wtarreau

Author

commented Apr 1, 2019

So my Atom now works; it turns out I needed to use __builtin_clzl() and not __builtin_clz(). I've sent an update for this in PR #10 .
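
For reference, a small illustration of the difference (nothing project-specific here): __builtin_clz() takes an unsigned int, so a 64-bit operand gets silently truncated, while __builtin_clzl()/__builtin_clzll() operate on unsigned long / unsigned long long; all of them are undefined for a zero argument.

#include <stdint.h>

/* counts leading zeros of a 64-bit value whatever the data model,
 * with an explicit guard since the builtins are undefined on 0 */
static inline unsigned clz64(uint64_t v)
{
    return v ? (unsigned)__builtin_clzll(v) : 64;
}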

@itwysgsl referenced this pull request on Apr 11, 2019: Bug in RFv2 #20 (closed)
