
rfv2_mix_fp_loss #25

Closed
djm34 opened this issue Apr 24, 2019 · 24 comments

@djm34 commented Apr 24, 2019

This routine is giving me trouble due to that strange conversion to double: the GPU card doesn't get the same result as the CPU...

I am pretty sure this is running into some machine-dependent (as well as language-dependent) precision problem, and this shouldn't be done in a cryptographic algorithm.

https://docs.nvidia.com/cuda/floating-point/index.html

@itwysgsl (Contributor) commented Apr 25, 2019

@bschn2 I did some reading and it seems like the IEEE 754 standard could cause trouble in the long run. Any thoughts on this?

@itwysgsl (Contributor) commented Apr 25, 2019

Here is more info about IEEE 754 on CUDA, which I managed to find: https://stackoverflow.com/a/10335070/9217774

@djm34 (Author) commented Apr 25, 2019

Also, an xor operation on a double doesn't make much sense, and probably for that reason CUDA didn't implement full IEEE 754 compliance for that operation.

@bschn2 (Owner) commented Apr 25, 2019

Hello!

Well, this is bad news, because CUDA clearly is a platform advocating full support for IEEE 754. The simple fact that some implementations take shortcuts and give bad results is a good indicator of how difficult it is to get this right from scratch, which makes the check quite tempting to keep. However, figuring out why it fails there and how to work around it will take time, and Mr @itwysgsl needs this soon, so we have to help him and step back on this point. The original version didn't rely on FP and I expect it to be rock solid already (except the off-by-one padding round, which I messed up, of course). So we could move the FP stuff to a v3 and take more time to port it to various architectures and/or avoid a few corner cases, if this is the only problem. I'm still confident there is a way to make it work fine everywhere, given that @wtarreau, @jdelorme3 and @LinuXperia managed to get it to work on x86_64, armv8 and i686, and me on AMD GPU; thus different architectures and bit sizes. But I admit that the sample is still quite small to generalize from for a short deployment.

I noticed in issue #26 that @wtarreau proposes an FP-less variant. I'm open to it, of course! What matters is that MBC has no trouble launching soon, and no need to fork again in 3 months.

By the way, there is no xor on a double. The xor is performed between two integers, one of them coming from a perfectly defined conversion from a double, which is also supported by the fact that the 4 completely different architectures above produce the same hashes. It is used for error accumulation. But I agree that incomplete implementations taking illegal shortcuts will cause problems (this was the initial goal) and it's too late to try to enumerate them now. Let's see what @wtarreau proposes first, otherwise this week-end I'll get back to the keyboard. I will possibly not be able to access the GPU though (or I'll have to negotiate with my son, who didn't inherit my passion for crypto and prefers games :-))

@djm34 (Author) commented Apr 25, 2019

No offense, but quite frankly I really don't understand what the point of this routine is, nor why it seems so important to you. I don't think a cryptographic algorithm is the place to probe for IEEE 754 compliance in mining devices.

@bschn2 (Owner) commented Apr 25, 2019

It's not important in itself; the purpose is to keep ASICs and FPGAs away. It is difficult to get FP right, and almost impossible for such categories of hardware in a reasonable time. CPUs have got it right for the last two decades, and GPUs normally get it right nowadays. The fact that you found one relevant GPU which doesn't is enough for me to consider that we probably raised the bar too high and need to step back on this one. The goal is not to annoy developers, just to keep massively scalable hardware away.

@djm34 (Author) commented Apr 25, 2019

I did some more research with CUDA: it actually defines 4 intrinsics to convert a double to a 64-bit unsigned integer. I tried them all. None of them returns a result which agrees with the CPU code.

Here is an example of the code I implemented:
__device__ void rfv2_mix_fp_loss(ulong *p, ulong *q)
{
	ulong p0, q0;
	ulong lp, lq;
	double fp, fq;

	p0 = *p; q0 = *q;
	// round-trip through double, keep only the bits lost in conversion
	fp = __ull2double_rd(p0); fq = __ull2double_rd(q0);
	lp = __double2ull_rd(fp) ^ p0; lq = __double2ull_rd(fq) ^ q0;

	// accumulate each word's error into the other one
	p0 += lq;             q0 += lp;
	*p = p0;              *q = q0;
}

_rd stands for round towards -inf
_ru for round towards +inf
_rz for round towards zero
_rn for round to nearest, ties to even

which are in principle the four rounding modes of the IEEE 754 standard. None agrees with the CPU...

To be honest, it isn't obvious which one is wrong or which one is right here.

@wtarreau commented Apr 26, 2019

I'm sure that on CPUs, double-to-long conversion is performed by rounding towards zero. In the other direction there is no rounding mode to choose as such, there is just a precision loss: above 53 bits you use the exponent and the mantissa keeps the 53 highest bits. It is possible that the "rounding" in this case is made to nearest in order to reduce the precision loss. Just for the science, I would suggest trying _rz for double2ull and _rn for ull2double. But I also agree that when it becomes difficult to implement something in software, it might be going too far, especially since certain languages will not necessarily offer this level of control (e.g. if someone wants to implement your algo in JavaScript it can be problematic).

@djm34 (Author) commented Apr 26, 2019

I tried your suggestion but it didn't work...

@wtarreau commented Apr 26, 2019

OK thanks for testing.

@djm34 (Author) commented Apr 28, 2019

I checked a little more into why it wasn't working, and it seems that in certain cases on x86, it will sometimes round to the upper value and sometimes to the lower value, while the GPU always does one or the other... :D

Here is an example on the CPU:

b391c2b4c29bf5a0 gets rounded to b391c2b4c29bf000
6a5848969d42df38 gets rounded to 6a5848969d42e000

The GPU, depending on the type of rounding, does this:

b391c2b4c29bf5a0 rounded to b391c2b4c29bf800 or b391c2b4c29bf000
6a5848969d42df38 rounded to 6a5848969d42e000 or 6a5848969d42dc00

So no matter which rounding type I choose, I always get one wrong (compared to the CPU).

To be honest, I would tend to believe that the GPU approach is more rigorous than the CPU one. For the first number, what the CPU does is even a little strange... (compared to mathematical rounding, I mean the one done by us, humans)

@wtarreau commented Apr 28, 2019

I noticed that CPUs will use different FP implementations depending on whether they use x87, soft-float, SSE, AVX, etc. Most notably, if you build with -O3 (or simply with -ffast-math) you can end up with incorrect values. I'm not sure we can reliably decide what direction to round towards on CPUs, so better stay safe. I've commented out mix_fp_loss() on my side.

@wtarreau commented Apr 28, 2019

I hit another issue, with the sine. I couldn't get check_sin() to report a correct value on armv7. I don't know why, but given that the purpose of the function is to detect a risk of malfunction, we're sure this risk exists, so I didn't even spend time going further to check whether the hashes were correct.

I went back to my old integer sine wave code. I got it to produce the same form as the current one, but quite frankly I think it is overkill; I could simplify it and get a comparable curve with the same performance, which will perfectly suit the initial purpose: to have variations of the loop count that prevent unrolling. The initial implementation I had was this one:

unsigned char sq8(signed char a)
{
	a += 128;
	a = ((unsigned int)(a * a)) >> 7;
	return ~a;
}

// returns a 8-bit sine of angle/16 centered around 128
unsigned char sin8_16th(unsigned int angle)
{
	angle = (angle * 42722829) >> 24;
	angle &= 255;
	if (angle & 128)
		return ~sq8(angle << 1);
	else
		return sq8(angle << 1);
}

// returns the square root of x (22 bit input max)
unsigned int isqrt(unsigned int x)
{
	unsigned int bit = 1 << 20;
	unsigned int res = 0;
 
	while (bit > x)
		bit >>= 2;

	while (bit != 0) {
		if (x >= res + bit) {
			x  -= res + bit;
			res = (res >> 1) + bit;
		}
		else
			res >>= 1;
		bit >>= 2;
	}
	return res;
}

static unsigned int sin_scaled(unsigned int angle)
{
	int i;

	i = sin8_16th(angle) - 128;
	i = i * i * i;
	i += 2097152;
	return isqrt(i) >> 3;
}

Instead I came up with a much simpler one using the same period and average, which I'm going to send in a PR:

static unsigned int sin_scaled(unsigned int x)
{
	int i = ((x * 42722829) >> 24) - 128;
	x = 15 * i * i * abs(i);  // 0 to 15<<21
	x = (x + (x >> 4)) >> 17;
	return 256 - x;
}

With this I get the same hashes on armv7/v8 and x86.

wtarreau pushed a commit to wtarreau/rainforest that referenced this issue Apr 28, 2019

Willy Tarreau
rfv2_core: temporarily disable rfv2_mix_fp_loss()
the results are inconsistent across platforms. See issue bschn2#25 and
measurements performed by @djm34.

bschn2 added a commit that referenced this issue Apr 28, 2019

disable floating point precision loss measurement, it is not reliable enough and gives different results on CUDA from OpenCL and x86/arm, as reported by djm34 in issue #25
@wtarreau commented Apr 28, 2019

updated in PR#29

@wtarreau commented Apr 28, 2019

Ah no, sorry, I forgot to update the test patterns; rfv2_test -c is wrong now. Let me update it again.

@wtarreau commented Apr 28, 2019

OK now updating the PR again. You should get this:
$ ./rfv2_test -c
Single hash:
valid: dee9fb19e450c6283598e14f507ab1faa543bcfecef3364f1472bd33060187eb
256-loop hash:
valid: 0c58f30c14e3d430368a3126cdc297ac9fd707d3d0077ded9a326cac270390c1

@wtarreau commented Apr 28, 2019

That's PR#30 this time. I've deleted the two previous ones to avoid confusion.

bschn2 added a commit that referenced this issue Apr 28, 2019

Merge pull request #30 from wtarreau/remove-fp3
floating point removal
see issues #25 and #26. It would be nice to revisit this later still.
@bschn2 (Owner) commented Apr 28, 2019

@wtarreau you suggested that I rename the rfv2 branch to master. What is the proper way to do this? Will we lose the changes in the master branch?

@wtarreau commented Apr 28, 2019

Well, if rfv2 is directly derived from master, you could create a v1 branch from the current master and just merge rfv2 into it. If they have diverged, you may still do it, but you can run into merge conflicts. Given the small number of projects following this code (no offence), it could possibly also be acceptable to switch the branches and force-push master so that it becomes the current v2.

It's important not to lose what you did in v1 even if it's broken, some projects might rely on it, or it might be studied for academic purposes.

I'll have a quick look to see if it makes sense to merge rfv2 into master and will let you know.

@wtarreau commented Apr 28, 2019

OK, it's easy enough. I instead rebased rfv2 on top of master so that you keep a cleaner (more linear) history. Here's all you have to do:

  • create an rfv1 branch out of master:
    $ git branch rfv1 master
    $ git push github rfv1
  • merge the PR I'll send you containing rfv2-rebased. I verified it's the exact same as rfv2. You can check yourself by doing this once merged:
    $ git diff rfv2 master
    $ git push github master
  • delete the rfv2 branch, you'll create it once v2 is released and enters maintenance mode:
    $ git branch -D rfv2
    $ git push github :rfv2

It's also better for those of us who could still be tracking rfv2, because if it's not deleted we might miss updates without noticing.

Normally the right way to proceed is to create maintenance branches for each major version and only merge fixes there (ideally backported from the master branch). Then your master is always the most up to date and contains all known fixes.
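As a concrete sketch of that flow (generic git commands on a throwaway repository; the branch and commit names are only illustrative, not taken from this project):

```shell
# Demonstrate the maintenance-branch flow on a scratch repository:
# fixes land on master first, then get backported to the version branch.
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "v1 release"

git branch rfv1                        # maintenance branch for v1

echo "fix" > file.txt                  # a fix lands on master first...
git add file.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -qm "fix: some bug"

git checkout -q rfv1                   # ...then is backported
git -c user.name=demo -c user.email=demo@example.com \
    cherry-pick master
```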

Let me know if you need some help.

@wtarreau commented Apr 28, 2019

Well, it merges even better than it rebases in the end; I'll do it with a merge instead so that we keep the commit IDs.

@bschn2 (Owner) commented Apr 28, 2019

Looks like it worked like a charm :-)
Many thanks for your detailed instructions, Sir, I am truly impressed!

Gentlemen, as discussed, the latest updates for v2 are now in the master branch, which is likely more consistent with where people expect to find them. It's also the branch listed on the front page by default. I am going to close these FP issues now.

bschn2 closed this Apr 28, 2019

@itwysgsl (Contributor) commented Apr 28, 2019

Thanks for all your hard work @bschn2 !

@bschn2 (Owner) commented Apr 28, 2019

You're welcome, Sir! But this time again I wasn't the one doing most of the work :-)
