# Tolerance added for rff tests #1829

Closed · wants to merge 1 commit · +3 −5

## Conversation

5 participants
Member

### Anipik commented Dec 5, 2018 • edited

 Fixes #1825. RFF is using the new CpuMath algorithm for matrix multiplication, so the result sometimes differs from the original baseline in the last few decimal places (due to a different number of multiplications); adding a tolerance corrects the error.
Commit: `tolerance added for rff tests` (`26cf0fd`)

Member

### wschin commented Dec 5, 2018 • edited

 It increases the tolerance from 1e-6 to 1e-3? Is the result of matrix multiplication not deterministic? If numerical error accumulates that quickly, I'd shrink the data size and the number of model parameters involved in those tests.
Member

### Anipik commented Dec 5, 2018

> It increases tolerance from 1e-6 to 1e-3?

Yes, it does. It also uses our new floating-point comparison algorithm.

> Is the result of matrix multiplication not deterministic?

Sometimes it results in slight differences in the last few decimal places due to a different number of multiplications. @tannergooding is better equipped to explain the reason.
Member

### singlis commented Dec 5, 2018

 Thank you; this will likely fix issue #1825, where our RFF tests have been failing intermittently.


Member

### wschin commented Dec 5, 2018

 @Anipik, a different number of multiplications might not be the reason: as long as the operation sequence is fixed and only one thread is used, the computation result is expected to be the same.
Member

### tannergooding commented Dec 6, 2018

> different number of multiplications might not be the reason because as long as operation sequence is fixed and only one thread is used, the computation result is expected to be the same.

This is not true. Vectorization and alignment of the input data are both factors that impact how results are compounded together. For a serial (non-vectorized) algorithm, you perform the operation on each element sequentially. For vectorized code, you perform the operation on batches of 4 or 8 elements, sequentially. Additionally, if the input is not 16-byte aligned, you must first process up to X-1 leading elements (where X is the number of elements per operation) to become aligned, then process the majority of the elements as aligned, and finally process up to X-1 trailing elements that don't fill a single vectorized operation.
Contributor

### TomFinley commented Dec 6, 2018

> This is not true. Vectorization and alignment of the input data are both factors that impact how results are compounded together.

I am fine, perhaps, with the results being subject to vectorization differences (e.g., whether it's a 128, 256, or 512-bit register). I am absolutely not fine with them being subject to alignment differences, because those could differ from run to run even with a fixed seed. We have elsewhere spent significant architectural effort at making sure that on the same machine with the same seed you get the same result, and we consider the effect otherwise to be a bug. Perhaps we could try to make sure that our results are consistent even with different alignment? Thanks.

### TomFinley requested changes Dec 6, 2018 • edited

My understanding from @tannergooding's response here is that there is an issue with alignment causing different results. If that is so, this is a bug, and we ought to fix it instead of masking it by upping the tolerance by three orders of magnitude.

I have particular interest in this. I have observed lately, as Scott has, that the RFF tests have been failing intermittently in a way that suggests they have become non-deterministic. The most suspicious culprit for why this transform could suddenly have become non-deterministic is #1657.

To reiterate what I said in my comment, for our math ops to become nondeterministic even on the same machine is not acceptable.

Member

### Anipik commented Dec 6, 2018

 #1657 is responsible for introducing the alignment change
Member

### tannergooding commented Dec 6, 2018

 @TomFinley, there is no way to guarantee alignment of data in the CLR today without over-allocating, manually copying the data to be aligned, and pinning said data for the length of the operation (basically what `AlignedArray` was doing). However, this adds quite a bit of overhead when dealing with large amounts of data as it requires you to check alignment and move the entire contents of the data if it is not already aligned. For all data sets, this means we are essentially doubling the number of reads/writes from/to memory and will generally (especially with larger data sets) end up with you invalidating the CPU cache (which makes the actual algorithm slower on startup as everything will be a cache-miss).
Member

### tannergooding commented Dec 6, 2018

 It may also be worth mentioning that all the tolerance ranges I've seen updated so far have (to my knowledge) matched on the 16 most-significant bits (which puts it on parity with `bfloat16` for precision).
Member

### wschin commented Dec 6, 2018

 @tannergooding, sorry for my limited knowledge about vectorization; could you please elaborate a bit more? My understanding is that even using vectorization with SSE plus some memory alignment, the order of arithmetic operations should still be the same when the same input data is used. Therefore, I still don't understand why `vectorization` and `alignment` are causes of those mismatches. I can imagine that .NET Core could possibly use dynamic vectorization, so that different machines may lead to different vectorized code and therefore different results. Is that the case here?
Contributor

### TomFinley commented Dec 6, 2018 • edited

> so that different machines may lead to different vectorized code and therefore different results. Is it the case here?

I'm afraid not, @wschin; it happens even on the same machine. If you run `RffStatic` and `RffWorkout` a number of times, they will usually produce the same result, but sometimes won't. That's actually what brought me here: test failures on my machine in `RffStatic` and `RffWorkout`. Then I see this PR. 😄

> It may also be worth mentioning that all the tolerance ranges I've seen updated so far have (to my knowledge) matched on the 16 most-significant bits (which puts it on parity with `bfloat16` for precision)

This may be so, but it is not really the point. If the result differed by even one bit, that would be a sign of a bug. We want algorithms to be capable of determinism and reproducibility.

Also, the idea that it is impossible to achieve deterministic results without aligning data and using SSE/AVX is, I know, not really correct. I certainly see how alignment helps simplify the implementation (which I guess is why whoever wrote this code did it that way in the first place), and I could see how some implementations might be sensitive to it, but then the lesson is that we need to change our implementations. We have before succeeded (sometimes with effort, and perhaps even with some perf loss) at having vectorized code that produces consistent results, even on potentially unaligned data coming from the CLR. This algorithm as written clearly doesn't achieve determinism, but the idea that none could without alignment I'm afraid I reject. Let's say we must have determinism: some algorithms have nondeterminism from multithreading, and some might produce different results on different machines, but as far as I am aware we've never yet had a non-parallel algorithm that was incapable of producing deterministic results.

The fact that this bug appears to lie in our CPU math utils is especially concerning. I'm guessing, from the delay and the fact that we're arguing it is impossible, that we don't yet know how to fix it. So, until we can investigate further, I am going to post a PR to revert #1657. If the bug can be fixed in a timely fashion I will of course close that PR.


Member

### tannergooding commented Dec 6, 2018

> Also, the idea that it is impossible to achieve deterministic results without aligning data and using SSE/AVX is, I know, not really correct

No, it isn't impossible. You can also achieve it by handling all data as unaligned. However, this incurs decent-to-significant perf penalties for data that actually is unaligned: for 128-bit vectorized code a penalty on every 4th read, and for 256-bit vectorized code on every 2nd read (due to having a read/write that splits a cache line). There are more significant penalties whenever a read/write splits a page boundary. It also requires you to own the entire stack and guarantee that all algorithms behave as desired (and, AFAIK, we also call into some native math libraries for some of these, which often opt for performance over determinism as well).
Member

### wschin commented Dec 6, 2018 • edited

 @tannergooding, a possibility is writing the sequential code in an SSE (or AVX) way so that the computation is platform-independent. For example,

```c
__m128 sum = _mm_setzero_ps();
for (int i = 0; i < n; i += 4)
    sum = _mm_add_ps(sum, _mm_loadu_ps(data + i));
```

is equivalent to

```c
float sum_0 = 0, sum_1 = 0, sum_2 = 0, sum_3 = 0;
for (int i = 0; i < n; i += 4) {
    sum_0 += *(data + i);
    sum_1 += *(data + i + 1);
    sum_2 += *(data + i + 2);
    sum_3 += *(data + i + 3);
}
```

With the use of `_mm_loadu_ps` (an unaligned load), I still expect we would have consistent results across sequential and vectorized implementations.