Intermittent failure of thin lens CI test #116
Comments
Confirmed: switching the order of the two subtests makes the smaller focal length fail, not the larger. So this is not focal length related. Test output:
|
May also fail with a |
The bug reappeared a few CI runs later. Failed for |
Breakthrough: the NaNs appear right after taking the FFT of the input array. While the size of the input array is not a standard one (513x513 px), this should never happen with a commercial library. It does, however, explain the intermittent behaviour, and why this only happens on the first run of the test and not the second, independent of the actual math that is going on. Logging operating system and build image for the Ubuntu system running the failed test:
Installed mkl versions:
Output of the offending test.
Relevant part of the code:
# Don't overwrite f if it's the input array.
if _use_mkl:
    kwargs = {'overwrite_x': self.cutout is not None and field.grid.ndim > 1}
else:
    kwargs = {}
if not np.all(np.isfinite(f)):
    import warnings
    warnings.warn('Nans detected in f')
f = _fft_module.fftn(f, axes=tuple(range(-self.input_grid.ndim, 0)), **kwargs)
if not np.all(np.isfinite(f)):
    import warnings
    warnings.warn('Nans detected after fftn')
# Omitted code for getting tf in the right format.
f *= tf
if not np.all(np.isfinite(f)):
    import warnings
    warnings.warn('Nans detected after multiplication with tf.')
if _use_mkl:
    kwargs = {'overwrite_x': True}
else:
    kwargs = {}
f = _fft_module.ifftn(f, axes=tuple(range(-self.input_grid.ndim, 0)), **kwargs)
if not np.all(np.isfinite(f)):
    import warnings
    warnings.warn('Nans detected after ifftn')
|
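The repeated finite-value checks in the debugging code above could be folded into one small helper. This is only a sketch: the name warn_if_nonfinite is hypothetical and not part of the library, and numpy's built-in FFT stands in for _fft_module here.

```python
import warnings

import numpy as np


def warn_if_nonfinite(arr, stage):
    """Warn and return True if arr contains NaN or Inf, else return False.

    Hypothetical debugging helper; `stage` is a free-form label such as
    'after fftn' that ends up in the warning message.
    """
    if np.all(np.isfinite(arr)):
        return False
    warnings.warn(f'Non-finite values detected {stage}', RuntimeWarning, stacklevel=2)
    return True


# Usage, with np.fft standing in for _fft_module (an assumption).
f = np.ones((4, 4), dtype=complex)
warn_if_nonfinite(f, 'in f')
f = np.fft.fftn(f, axes=(-2, -1))
warn_if_nonfinite(f, 'after fftn')
```

Returning a boolean makes it easy to also dump the array or abort the test at the first stage that goes bad.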
This is still super weird. I have never had this issue with Python 3.8 on Ubuntu, or on my Windows machines. I really don't understand what changed in Python 3.9. It already happens after the first fftn. Could it be an error in the new build of mkl? Does this also happen with pyfftw? |
I just ran the test code 20 times or so and it passed every time. I never get this error. I used: No errors whatsoever, even after many reruns. Is it the CI machines? |
I'm like 95% sure this is mkl_fft related, yes, combined with their machines. There should be no reason for an array with no NaNs to be generating NaNs after an FFT. However, I can't find any bug report on this at all from others. It might be related to the 513px, so I might change that to 512. All builds on CI use mkl_fft, so I didn't check other FFT libraries (including the pocketfft that is built into numpy). I printed MKL versions + Azure Ubuntu image info to check for any software related differences between working and non-working runs. But so far I haven't seen the mkl versions change. I haven't been able to reproduce this on my Ubuntu 20.04 machine with an identical conda env. I haven't tried the Azure dockerfile yet. |
I've been manually triggering more CI runs over the day. A few runs with 512px across and a few with 513px across. The 512px runs all passed (6/6 passed). The 513px runs pass intermittently (1/7 passed). I'm gonna make the change to 512px. |
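For anyone who wants to try reproducing this outside the test suite, here is a rough standalone sketch of the 512px vs 513px comparison. The function name count_nonfinite_ffts is made up for this example. It defaults to numpy's built-in FFT; to exercise MKL, the fftn function from mkl_fft could be passed in instead, assuming that package is installed and exposes a compatible fftn.

```python
import numpy as np


def count_nonfinite_ffts(n_pixels, n_trials=20, fftn=np.fft.fftn, seed=0):
    """Count how many FFTs of finite random input give NaN/Inf output.

    Hypothetical helper for this issue; `fftn` can be swapped for another
    library's fftn with the same (array, axes=...) call signature.
    """
    rng = np.random.default_rng(seed)
    shape = (n_pixels, n_pixels)
    failures = 0
    for _ in range(n_trials):
        # Finite complex input, so any NaN in the output comes from the FFT.
        f = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
        out = fftn(f, axes=(-2, -1))
        if not np.all(np.isfinite(out)):
            failures += 1
    return failures


for n in (512, 513):
    print(n, 'px:', count_nonfinite_ffts(n), 'non-finite results out of 20')
```

A nonzero count for finite input would point at the FFT backend or the machine rather than at the propagation math.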
This test has been failing intermittently only on Linux with Python 3.9. It has been noted in #114 and #115 among others. This issue will further track this bug to reduce chatter on the mentioned PRs.
The thin lens CI test propagates light through a thin lens with a certain focal length. It then progressively propagates that light through focus using an AngularSpectrumPropagator object, and finds where the maximum peak intensity occurs as a function of distance from the lens. This distance should match the focal length of the thin lens. The test is performed twice: once at a focal length of 30 m, and once at double that.
The intermittent nature makes this hard to debug. I've been unable to reproduce this on my Linux machine with identical package versions. Branch thin_lens_test_failure has been set up for debugging.
It turns out the CI fails because the AngularSpectrumPropagator returns NaNs for the propagated wavefront, but only for the first of the two focal length tests. Test output copied below: