
Intermittent failure of thin lens CI test #116

Closed
ehpor opened this issue Nov 14, 2021 · 8 comments · Fixed by #119

@ehpor (Owner) commented Nov 14, 2021

This test has been failing intermittently, but only on Linux with Python 3.9. It has been noted in #114 and #115, among others. This issue will track the bug further, to reduce chatter on the mentioned PRs.

The thin lens CI test propagates light through a thin lens with a certain focal length. It then progressively propagates that light through focus using an AngularSpectrumPropagator object, and finds where the peak intensity is maximized as a function of distance from the lens. This distance should match the focal length of the thin lens. The test is performed twice: once with a 30 m focal length, and once with double that.
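
For illustration, here is the gist of the test in plain numpy (a minimal sketch: the grid size, pixel pitch, wavelength and aperture below are illustrative values, not the test's actual parameters, and the real test uses the AngularSpectrumPropagator class rather than raw FFTs):

import numpy as np

n, dx = 512, 1e-4             # grid size and pixel pitch [m]; illustrative values
wavelength = 500e-9           # [m]
f = 30.0                      # focal length [m]
k = 2 * np.pi / wavelength

x = (np.arange(n) - n // 2) * dx
X, Y = np.meshgrid(x, x)
r2 = X**2 + Y**2

# Circular aperture with a thin-lens phase exp(-i k r^2 / (2 f)).
aperture = (np.sqrt(r2) < n * dx / 4).astype(float)
field = aperture * np.exp(-1j * k * r2 / (2 * f))

# Angular-spectrum transfer function; fx, fy are spatial frequencies in cycles/m.
# The complex cast makes evanescent components decay instead of oscillate.
fx = np.fft.fftfreq(n, dx)
FX, FY = np.meshgrid(fx, fx)
kz = 2 * np.pi * np.sqrt((1 / wavelength**2 - FX**2 - FY**2).astype(complex))

def propagate(field, z):
    return np.fft.ifft2(np.fft.fft2(field) * np.exp(1j * kz * z))

# Scan through focus; the peak intensity should be maximal near z = f.
zs = np.linspace(0, 2 * f, 21)
peaks = [np.abs(propagate(field, z)).max()**2 for z in zs]
print('peak at z =', zs[np.argmax(peaks)], 'm; expected ~', f, 'm')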

The intermittent nature makes this hard to debug. I've been unable to reproduce this on my Linux machine with identical package versions. The branch thin_lens_test_failure has been set up for debugging.

It turns out the CI failure occurs because the AngularSpectrumPropagator returns NaNs for the propagated wavefront, but only for the first of the two focal length tests. Test output is copied below:

z [ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27. 30. 33. 36. 39. 42. 45. 48. 51.
 54. 57. 60.]
peak [Field(1.94038034e-05), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan)]
z2 [  0.   6.  12.  18.  24.  30.  36.  42.  48.  54.  60.  66.  72.  78.
  84.  90.  96. 102. 108. 114. 120.]
peak2 [Field(1.94038034e-05), Field(3.35065073e-05), Field(4.375403e-05), Field(6.10293918e-05), Field(8.11504359e-05), Field(0.00011799), Field(0.00047752), Field(0.00038099), Field(0.001278), Field(0.00728061), Field(0.02073922), Field(0.00769245), Field(0.00100904), Field(0.00039627), Field(0.00048316), Field(0.00016863), Field(0.00010752), Field(6.44379246e-05), Field(8.44530374e-05), Field(9.41318839e-05), Field(6.92255138e-05)]
@ehpor ehpor added the bug Something isn't working label Nov 14, 2021
@ehpor ehpor self-assigned this Nov 14, 2021
@ehpor (Owner, Author) commented Nov 14, 2021

Confirmed: switching the order of the two subtests makes the smaller focal length fail instead of the larger one, so this is not related to the focal length. Test output:

z2 [  0.   6.  12.  18.  24.  30.  36.  42.  48.  54.  60.  66.  72.  78.
  84.  90.  96. 102. 108. 114. 120.]
peak2 [Field(1.94038034e-05), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan)]
z [ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27. 30. 33. 36. 39. 42. 45. 48. 51.
 54. 57. 60.]
peak [Field(1.94038034e-05), Field(3.33415885e-05), Field(4.23350558e-05), Field(5.55878397e-05), Field(0.0001673), Field(0.00027638), Field(0.00023232), Field(0.00084203), Field(0.00173035), Field(0.00345173), Field(0.08296981), Field(0.00397364), Field(0.00105164), Field(0.00078266), Field(0.00023865), Field(0.00030616), Field(0.00010211), Field(9.74986306e-05), Field(0.00010154), Field(4.61751568e-05), Field(2.95089006e-05)]

@ehpor (Owner, Author) commented Nov 14, 2021

The test may also fail with a FresnelPropagator instead of an AngularSpectrumPropagator. However, the bug disappeared on a second confirmation run. I'll have to pick this back up when it resurfaces.

@ehpor (Owner, Author) commented Nov 14, 2021

The bug reappeared a few CI runs later. It failed for AngularSpectrumPropagator but passed for FresnelPropagator in the same run. I'm gonna push a change to FresnelPropagator to master and rebase #115 to see if this fixes it. The underlying cause has not been found yet.

@ehpor (Owner, Author) commented Nov 14, 2021

Breakthrough: the NaNs appear right after taking the FFT of the input array inside the FourierFilter class. There are no NaNs in the input array to the FFT, while there are NaNs in the output array. The transfer function itself contains no NaNs.

While the number of pixels in the input array is not standard (513x513 px), this should never happen with a commercial library. It does, however, explain the intermittent behaviour, and why this only happens in the first subtest and not the second, independent of the actual math that is going on.
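
A minimal stress test along these lines (hypothetical snippet, not in the repository) can check whether mkl_fft.fftn ever produces NaNs for a finite 513x513 input:

import numpy as np
import mkl_fft

rng = np.random.default_rng(0)
a = rng.standard_normal((513, 513)) + 1j * rng.standard_normal((513, 513))

# Hammer the forward FFT; on an affected machine some fraction of calls
# should produce NaNs in the output despite the input being finite.
for i in range(1000):
    out = mkl_fft.fftn(a.copy(), axes=(-2, -1))
    if not np.all(np.isfinite(out)):
        print(f'NaNs in output on iteration {i}')
        break
else:
    print('no NaNs in 1000 iterations')

On an unaffected machine this may well never trigger, which matches the difficulty of reproducing locally.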

Logging operating system and build image for the Ubuntu system running the failed test:

Agent name: 'Azure Pipelines 4'
Agent machine name: 'fv-az105-922'
Current agent version: '2.195.0'
Operating System: Ubuntu 20.04.3 LTS
Virtual Environment
Environment: ubuntu-20.04
Version: 20211108.1
Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20211108.1/images/linux/Ubuntu2004-README.md
Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20211108.1
Virtual Environment Provisioner: 1.0.0.0-master-20211108-1
Current image version: '20211108.1'
Agent running as: 'vsts'

Installed mkl versions:

  mkl                pkgs/main/linux-64::mkl-2021.4.0-h06a4308_640
  mkl-service        pkgs/main/linux-64::mkl-service-2.4.0-py39h7f8727e_0
  mkl_fft            pkgs/main/linux-64::mkl_fft-1.3.1-py39hd3c417c_0
  mkl_random         pkgs/main/linux-64::mkl_random-1.2.2-py39h51133e4_0

Output of the offending test.

##[warning]Nans detected after fftn
##[warning]Nans detected after multiplication with tf.
##[warning]Nans detected after ifftn
##[warning]Nans detected in f
##[warning]NaNs detected in peak: [Field(1.94038034e-05), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan), Field(nan)] for focal length 30.0.
##[warning]tf nans for 30.0? False.
##[warning]transfer function: [[1.        -2.94491137e-05j 0.99999996-2.65068558e-04j

Relevant part of the code in FourierFilter._operation():

import warnings

# Don't overwrite f if it's the input array.
if _use_mkl:
    kwargs = {'overwrite_x': self.cutout is not None and field.grid.ndim > 1}
else:
    kwargs = {}

# Temporary instrumentation: check for NaNs before and after every step.
if not np.all(np.isfinite(f)):
    warnings.warn('Nans detected in f')

f = _fft_module.fftn(f, axes=tuple(range(-self.input_grid.ndim, 0)), **kwargs)

if not np.all(np.isfinite(f)):
    warnings.warn('Nans detected after fftn')

# Omitted code for getting tf in the right format.
f *= tf

if not np.all(np.isfinite(f)):
    warnings.warn('Nans detected after multiplication with tf.')

# The intermediate result is no longer needed, so mkl_fft may always overwrite it.
if _use_mkl:
    kwargs = {'overwrite_x': True}
else:
    kwargs = {}

f = _fft_module.ifftn(f, axes=tuple(range(-self.input_grid.ndim, 0)), **kwargs)

if not np.all(np.isfinite(f)):
    warnings.warn('Nans detected after ifftn')

@syhaffert (Collaborator) commented Nov 15, 2021

This is still super weird. I have never had this issue with Python 3.8 on Ubuntu, or on my Windows machines. I really don't understand what changed in Python 3.9. It already happens after the first fftn. Could it be an error in the new build of MKL? Does this also happen with pyfftw?

@syhaffert (Collaborator) commented

I just ran the test code 20 times or so and it passed every time. I never get this error. I used:
Ubuntu 18.04.4 LTS
Python 3.9.4
NumPy 1.21.4 (standard, linked with MKL)

No errors whatsoever, even after many reruns. Is it the CI machines?

@ehpor (Owner, Author) commented Nov 15, 2021

I'm like 95% sure this is mkl_fft-related, yes, combined with their machines. There should be no reason for an array without NaNs to produce NaNs after an FFT. However, I can't find any bug report on this at all from others. It might be related to the 513 px array size, so I might change that to 512.

All CI builds use mkl_fft, so I didn't check other FFT libraries (nor pocketfft, which is built into numpy).

I printed the MKL versions and the Azure Ubuntu image info to check for any software-related differences between working and non-working runs, but so far I haven't seen the MKL versions change. I haven't been able to reproduce this on my Ubuntu 20.04 machine with an identical conda env. I haven't tried the Azure dockerfile yet.
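
Such a cross-check could look something like this (hypothetical sketch, comparing mkl_fft against numpy's built-in pocketfft on the same 513x513 input):

import numpy as np
import mkl_fft

a = np.random.default_rng(1).standard_normal((513, 513)).astype(complex)

out_mkl = mkl_fft.fftn(a.copy())
out_np = np.fft.fftn(a)  # numpy's built-in pocketfft

print('mkl_fft output finite:', np.all(np.isfinite(out_mkl)))
print('max |mkl - pocketfft|:', np.abs(out_mkl - out_np).max())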

@ehpor (Owner, Author) commented Nov 16, 2021

I've been manually triggering more CI runs over the day: a few with 512 px across and a few with 513 px across. The 512 px runs all passed (6/6), while the 513 px runs passed intermittently (1/7). I'm gonna make the change to 512 px.
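
As a side note on why the array size could matter: 512 = 2**9 factors entirely into twos, while 513 = 3**3 * 19 contains the prime factor 19, which may push the FFT library onto a different, less commonly exercised code path (e.g. Rader's or Bluestein's algorithm). If a power of two is ever not an option, scipy can suggest a nearby FFT-friendly length (illustrative snippet, not part of the fix):

import scipy.fft

# Smallest length >= 513 whose prime factors are all small (2, 3, 5, 7 or 11).
print(scipy.fft.next_fast_len(513))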
