
Tolerance added for rff tests #1829

Closed
wants to merge 1 commit into master from Anipik:rffworkout

Conversation

5 participants
@Anipik
Member

Anipik commented Dec 5, 2018

Fixes #1825

RFF uses the new CPU math algorithm for matrix multiplication, so the results sometimes differ from the original baseline in the last few decimal places (due to a different number of multiplications). Adding a tolerance corrects the error.

@Anipik Anipik force-pushed the Anipik:rffworkout branch from f36d190 to 26cf0fd Dec 5, 2018

@Anipik Anipik changed the title from [WIP]tolerance added for rff tests to Tolerance added for rff tests Dec 5, 2018

@Anipik Anipik requested review from tannergooding and eerhardt Dec 5, 2018

@wschin


Member

wschin commented Dec 5, 2018

It increases tolerance from 1e-6 to 1e-3? Is the result of matrix multiplication not deterministic? If numerical error accumulates very quickly, I'd shrink the data size and the number of model parameters involved in the related tests.

@Anipik


Member

Anipik commented Dec 5, 2018

> It increases tolerance from 1e-6 to 1e-3?

Yes, it does. It also uses our new floating-point comparison algorithm.

> Is the result of matrix multiplication not deterministic?

Sometimes it results in slight differences in the last few decimal places due to a different number of multiplications. @tannergooding is better equipped to explain the reason.
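For reference, a tolerance check of this kind can be sketched as a combined relative/absolute comparison. This is a hypothetical illustration, not the actual comparison code from the ML.NET test suite; the function name and the `abs_tol` floor are assumptions:

```python
def nearly_equal(a, b, rel_tol=1e-3, abs_tol=1e-10):
    # Equal when the difference is small relative to the larger magnitude,
    # with an absolute floor so values near zero can still match.
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
```

With `rel_tol=1e-3`, two results that agree in roughly their first three significant digits compare equal, which is the effect of raising the tolerance from 1e-6 to 1e-3.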

@singlis


Member

singlis commented Dec 5, 2018

Thank you, this will likely fix issue #1825, where our RFF tests are failing intermittently.

@singlis

singlis approved these changes Dec 5, 2018

:shipit:

@wschin


Member

wschin commented Dec 5, 2018

@Anipik, a different number of multiplications might not be the reason: as long as the operation sequence is fixed and only one thread is used, the computation result is expected to be the same.

@tannergooding


Member

tannergooding commented Dec 6, 2018

> different number of multiplications might not be the reason because as long as operation sequence is fixed and only one thread is used, the computation result is expected to be the same.

This is not true. Vectorization and alignment of the input data are both factors that impact how results are compounded together.

For a serial (non-vectorized) algorithm, you perform the operation on each element sequentially. For vectorized code, you perform the operation on batches of 4 or 8 elements at a time, sequentially, accumulating partial results per lane.
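The effect can be demonstrated without any SIMD hardware. Re-associating a sum into independent per-lane accumulators, the way a 4-wide vector loop does, changes which low-order bits are rounded away. A minimal Python model (illustrative only, not ML.NET code; Python doubles stand in for the floats):

```python
def serial_sum(data):
    # Scalar order: one running total, left to right.
    s = 0.0
    for x in data:
        s += x
    return s

def lane_sum(data, lanes=4):
    # SIMD-style order: four independent accumulators, one per lane,
    # combined with a horizontal add at the end.
    partial = [0.0] * lanes
    for i in range(0, len(data), lanes):
        for j in range(lanes):
            partial[j] += data[i + j]
    s = 0.0
    for p in partial:
        s += p
    return s

data = [1e16, 1.0, 1.0, 1.0, -1e16, 1.0, 1.0, 1.0]
```

Here `serial_sum(data)` yields 3.0 while `lane_sum(data)` yields 6.0: in the serial order each 1.0 added to 1e16 is absorbed before the cancellation, while the lane order keeps them in separate accumulators.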

Additionally, if the input is not 16-byte aligned, you must first process up to X-1 leading elements individually (where X is the number of elements per vector operation) until the pointer becomes aligned, then process the bulk of the elements with aligned vector operations, and finally process up to X-1 trailing elements that can't fill a single vectorized operation.
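A small model of such a peeled loop makes the alignment sensitivity concrete: shifting the starting offset by one element regroups which values share a lane, and the result changes. Illustrative Python only, not the actual CpuMath code:

```python
def peeled_lane_sum(data, misalign, lanes=4):
    # 'misalign' leading elements are handled scalar until the pointer
    # would be aligned, the middle is summed in `lanes` independent
    # accumulators, and the leftover tail is scalar again.
    s = 0.0
    for x in data[:misalign]:                    # scalar prologue
        s += x
    n_core = ((len(data) - misalign) // lanes) * lanes
    partial = [0.0] * lanes
    for i in range(misalign, misalign + n_core, lanes):
        for j in range(lanes):                   # one add per lane
            partial[j] += data[i + j]
    for p in partial:                            # horizontal reduction
        s += p
    for x in data[misalign + n_core:]:           # scalar epilogue
        s += x
    return s

data = [1e16, 1.0, 1.0, 1.0, -1e16, 1.0, 1.0, 1.0, 1.0]
```

`peeled_lane_sum(data, 0)` gives 7.0 and `peeled_lane_sum(data, 1)` gives 6.0, so the same buffer summed at two different alignments produces two different answers.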

@TomFinley


Contributor

TomFinley commented Dec 6, 2018

> This is not true. Vectorization and alignment of the input data are both factors that impact how results are compounded together.

I am perhaps fine with the results being subject to vectorization differences (e.g., whether it's a 128-, 256-, or 512-bit register). I am absolutely not fine with them being subject to alignment differences, because those could differ from run to run even with a fixed seed. We have elsewhere spent significant architectural effort at making sure that on the same machine with the same seed you get the same result, and we consider any other behavior to be a bug.

Perhaps we could try to make sure that our results are consistent even with different alignment? Thanks.

@TomFinley

TomFinley requested changes Dec 6, 2018

My understanding from @tannergooding's response here is that there is an issue with alignment causing different results. If that is so, this is a bug, and we ought to fix it instead of trying to mask it by upping the tolerance by three orders of magnitude.

I have a particular interest in this. I have observed lately, as Scott has, that the RFF tests have been failing intermittently in a way that suggests they have become non-deterministic. The most suspicious culprit for why this transform could suddenly have become non-deterministic is #1657.

To reiterate what I said in my comment, for our math ops to become nondeterministic even on the same machine is not acceptable.

@Anipik


Member

Anipik commented Dec 6, 2018

#1657 is responsible for introducing the alignment change

@tannergooding


Member

tannergooding commented Dec 6, 2018

@TomFinley, there is no way to guarantee alignment of data in the CLR today without over-allocating, manually copying the data to be aligned, and pinning said data for the length of the operation (basically what AlignedArray was doing).

However, this adds quite a bit of overhead when dealing with large amounts of data, as it requires you to check alignment and move the entire contents of the data if it is not already aligned. For all data sets this essentially doubles the number of reads/writes from/to memory, and it will generally (especially with larger data sets) end up invalidating the CPU cache (which makes the actual algorithm slower on startup, as everything will be a cache miss).
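For illustration, the over-allocate-and-slide trick (roughly what AlignedArray did) can be sketched in a few lines; this is a hypothetical helper, not the ML.NET implementation:

```python
import ctypes

def aligned_floats(n_floats, alignment=16):
    # Over-allocate by alignment-1 bytes, then slide forward to the first
    # aligned address. The caller must keep 'raw' alive, since the view
    # does not own the memory -- this is the pinning concern above.
    raw = (ctypes.c_char * (n_floats * 4 + alignment - 1))()
    base = ctypes.addressof(raw)
    offset = (-base) % alignment
    view = (ctypes.c_float * n_floats).from_address(base + offset)
    return raw, view

raw, view = aligned_floats(8)
for i in range(8):
    view[i] = float(i)   # the extra copy of the whole input happens here
```

The copy loop at the end is the doubled memory traffic described above: every element is written once into the aligned buffer before the real work ever reads it.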

@tannergooding


Member

tannergooding commented Dec 6, 2018

It may also be worth mentioning that all the tolerance ranges I've seen updated so far have (to my knowledge) matched on the 16 most-significant bits (which puts them on parity with bfloat16 for precision).
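Agreement at bfloat16 precision can be checked mechanically by truncating the float32 bit pattern to its upper 16 bits. A sketch (an assumption for illustration, not the comparison actually used in the tests):

```python
import struct

def top16(x):
    # Reinterpret the float32 as its 32-bit pattern and keep the upper
    # half: sign, 8 exponent bits, and the top 7 mantissa bits (bfloat16).
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    return bits >> 16
```

Two results that agree under this check differ only below the seventh mantissa bit of the float32 representation.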

@wschin


Member

wschin commented Dec 6, 2018

@tannergooding, sorry for my limited knowledge about vectorization; could you please elaborate a bit more? My understanding is that even when using SSE vectorization plus some memory alignment, the order of arithmetic operations should still be the same when the same input data is used. Therefore, I still don't understand why vectorization and alignment cause those mismatches. I can imagine that .NET Core could possibly use dynamic vectorization, so that different machines may lead to different vectorized code and therefore different results. Is that the case here?

@TomFinley


Contributor

TomFinley commented Dec 6, 2018

> so that different machines may lead to different vectorized code and therefore different results. Is it the case here?

I'm afraid not, @wschin; it happens even on the same machine. If you run RffStatic and RffWorkout a number of times, they will usually produce the same result, but sometimes won't. What actually brought me here was test failures in RffStatic and RffWorkout on my machine; then I saw this PR. 😄

> It may also be worth mentioning that all the tolerance ranges I've seen updated so far have (to my knowledge) matched on the 16 most-significant bits (which puts them on parity with bfloat16 for precision)

This may be so, but it is not really the point. If the result differed by even one bit, that would be a sign of a bug. We want algorithms to be capable of determinism and reproducibility.

Also, the idea that it is impossible to achieve deterministic results without aligning data and using SSE/AVX is, I know, not really correct. I certainly see how alignment helps simplify the implementation (which I guess is why whoever wrote this code did it that way in the first place), and I can see how some implementations might be sensitive to it, but then the lesson is that we need to change our implementations. Assuredly, even without alignment, we have before now succeeded (sometimes with effort, and perhaps sometimes with some perf loss) at having vectorized code that produces consistent results, even on potentially unaligned data coming from the CLR. This algorithm as written clearly doesn't achieve determinism, but the idea that none could without alignment is one I'm afraid I reject. If an implementation can't achieve determinism, then it has gone too far. Let's say we must have determinism.

Now then, some algorithms have nondeterminism from multithreading, and some might produce different results on different machines, but as far as I am aware we've never before had a non-parallel algorithm that was incapable of producing deterministic results. 😄 The fact that this bug appears to lie in our CPU math utils is especially concerning.

Judging from the delay, and from the fact that we're arguing it is impossible, I'm guessing we don't know how to do that yet. So, until we can investigate further, I am going to post a PR to revert #1657. If the bug can be fixed in a timely fashion I will of course close that PR.

@tannergooding


Member

tannergooding commented Dec 6, 2018

> Also, the idea that it is impossible to achieve deterministic results without aligning data and using SSE/AVX is, I know, not really correct.

No, it isn't impossible. You can also achieve it by treating all data as unaligned. However, this incurs decent/significant perf penalties for unaligned data: for 128-bit vectorized code a penalty on roughly every 4th read, and for 256-bit vectorized code on every 2nd read (due to a read/write that splits a cache line). There are more significant penalties whenever a read/write splits a page boundary.
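The cache-line arithmetic is easy to verify: with the usual 64-byte x86 lines, a stream of unaligned 16-byte (128-bit) loads straddles a line boundary once every 4 loads, and 32-byte (256-bit) loads once every 2. A quick Python check (the base address offset of 4 is an arbitrary example of misalignment):

```python
def splits_cache_line(addr, width, line=64):
    # A load straddles a cache line when its first and last bytes fall
    # in different 64-byte lines.
    return addr // line != (addr + width - 1) // line

# 16 consecutive loads from a misaligned base address (offset 4):
splits_128 = sum(splits_cache_line(4 + 16 * i, 16) for i in range(16))
splits_256 = sum(splits_cache_line(4 + 32 * i, 32) for i in range(16))
```

`splits_128` comes out to 4 (one in four loads) and `splits_256` to 8 (one in two); with an aligned base address both counts are zero.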

It also requires you to own the entire stack and guarantee that all algorithms behave as desired (and, AFAIK, we also call into some native math libraries for some of these, which often opt for performance over determinism as well).

@wschin


Member

wschin commented Dec 6, 2018

@tannergooding, a possibility is to write the sequential code in an SSE (or AVX) style so that the computation can be platform independent. For example,

__m128 sum = _mm_setzero_ps();
for (int i = 0; i < n; i += 4)
   sum = _mm_add_ps(sum, _mm_loadu_ps(data + i));

is equivalent to

float sum_0 = 0;
float sum_1 = 0;
float sum_2 = 0;
float sum_3 = 0;
for (int i = 0; i < n; i += 4)
{
   sum_0 += *(data + i);
   sum_1 += *(data + i + 1);
   sum_2 += *(data + i + 2);
   sum_3 += *(data + i + 3);
}

With the use of _mm_loadu_ps, I still expect we would have consistent results across sequential and vectorized implementations.

@Anipik Anipik closed this Dec 6, 2018

@Anipik Anipik deleted the Anipik:rffworkout branch Dec 11, 2018
