Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unit test failures on Perlmutter #1963

Closed
marcelo-alvarez opened this issue Jan 11, 2023 · 24 comments · Fixed by #2029
Closed

unit test failures on Perlmutter #1963

marcelo-alvarez opened this issue Jan 11, 2023 · 24 comments · Fixed by #2029
Assignees

Comments

@marcelo-alvarez
Copy link
Contributor

Two unit tests fail in the main environment (desiconda/20220119-2.0.1) on Perlmutter:

source /global/common/software/desi/desi_environment.sh main
git clone https://github.com/desihub/desispec.git
cd desispec
python setup.py test 
...
======================================================================
ERROR: test_load_gdarc_lines (desispec.test.test_bootcalib.TestBoot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/code/desispec/main/py/desispec/test/test_bootcalib.py", line 124, in test_load_gdarc_lines
    llist = desiboot.load_arcline_list(camera)
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/code/desispec/main/py/desispec/bootcalib.py", line 640, in load_arcline_list
    tbl = parse_nist(iline, vacuum=vacuum)
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/code/desispec/main/py/desispec/bootcalib.py", line 563, in parse_nist
    nist_tbl = Table.read(nist_file, format='ascii.fixed_width')
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/astropy/table/connect.py", line 62, in __call__
    out = self.registry.read(cls, *args, **kwargs)
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/astropy/io/registry/core.py", line 199, in read
    data = reader(*args, **kwargs)
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/astropy/io/ascii/connect.py", line 18, in io_read
    return read(filename, **kwargs)
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/astropy/io/ascii/ui.py", line 317, in read
    table = fileobj.read()
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 613: ordinal not in range(128)

======================================================================
ERROR: test_parse_nist (desispec.test.test_bootcalib.TestBoot)
Test parsing of NIST arc line files.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/code/desispec/main/py/desispec/test/test_bootcalib.py", line 114, in test_parse_nist
    tbl = desiboot.parse_nist('CdI')
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/code/desispec/main/py/desispec/bootcalib.py", line 563, in parse_nist
    nist_tbl = Table.read(nist_file, format='ascii.fixed_width')
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/astropy/table/connect.py", line 62, in __call__
    out = self.registry.read(cls, *args, **kwargs)
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/astropy/io/registry/core.py", line 199, in read
    data = reader(*args, **kwargs)
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/astropy/io/ascii/connect.py", line 18, in io_read
    return read(filename, **kwargs)
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/site-packages/astropy/io/ascii/ui.py", line 317, in read
    table = fileobj.read()
  File "/global/common/software/desi/perlmutter/desiconda/20220119-2.0.1/conda/lib/python3.9/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 613: ordinal not in range(128)

@sbailey the same two errors also occur with a fresh desiconda install, so these failures should not block our updating desiconda, since desispec is working fine on Perlmutter in spite of these two failures. While not a high priority, we should get this fixed at some point.

@weaverba137
Copy link
Member

Is there a specific reason you're using python setup.py test? You should try to run the tests as similar as possible to the GitHub Actions tests, and those use pytest.

@marcelo-alvarez
Copy link
Contributor Author

@weaverba137 thanks for the suggestion. There is no specific reason other than for consistency with what is done in desitest. Did you determine that using pytest would solve these failures?

@weaverba137
Copy link
Member

If the tests are passing on GitHub Actions, then one should try to minimize the differences between that environment and perlmutter. Also, desitest is probably not what we want to be consistent with even in the near-term future.

@sbailey
Copy link
Contributor

sbailey commented Jan 11, 2023

Curiously, 4 (not 2!) bootcalib tests fail with pytest on perlmutter when running the full test suite (they all boil down to hitting the same "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 613: ordinal not in range(128)")

pytest py/desispec/test

But they pass if you run just the bootcalib test on its own:

pytest py/desispec/test/test_bootcalib.py

I don't have a good explanation for why or how that could matter.

Although the test failures are probably harmless, we should get them fixed ASAP so that we don't get used to failing unit tests, which eventually leads to not noticing a genuine failure.

@marcelo-alvarez
Copy link
Contributor Author

@sbailey @weaverba137 I suspect this failure is related to #291 but you may be able to find a better solution. Do you want to take a try at resolving this? Thanks.

@weaverba137
Copy link
Member

The short version is that we might not need the elaborate locale manipulations any longer because perlmutter sets LANG=en_US.UTF-8. Something for tomorrow.

@marcelo-alvarez
Copy link
Contributor Author

Sadly getting rid of the locale manipulations didn't fix it (although it would be great to get rid of them if no longer necessary, even cori sets LANG=en_US.UTF-8).

Something about test_binscripts.py changes the state such that the bootcalib unit test fails with it and succeeds when it is not included. Investigating further now.

@marcelo-alvarez
Copy link
Contributor Author

Replacing the line

        device_smoothed = cupyx.scipy.ndimage.median_filter(device_flux, width, mode='constant')

with

        return scipy.ndimage.median_filter(flux, width, mode='constant')

results in success of desispec unit tests on Perlmutter.

@sbailey please have a look and let me know if you have any idea why calling cupyx.scipy.ndimage might cause this type of error downstream. I debugged with

        sys.stdout = open('/global/cfs/cdirs/desi/users/malvarez/testscripts/env_before','w'); print(os.environ)
        device_smoothed = cupyx.scipy.ndimage.median_filter(device_flux, width, mode='constant')
        sys.stdout = open('/global/cfs/cdirs/desi/users/malvarez/testscripts/env_after','w'); print(os.environ);        

and verified that the environments are identical before and after this call, so there are no hidden changes to environment variables in the call to cupyx.scipy.ndimage.

@marcelo-alvarez
Copy link
Contributor Author

see the bootcalib-test branch for the change that "fixes" the unit test failures @sbailey

@sbailey
Copy link
Contributor

sbailey commented Mar 23, 2023

I have no idea. @dmargala or @craigwarner-ufastro do you have any ideas for why a call to cupyx.scipy.ndimage.median_filter could have a side effect of breaking the ability to parse a unicode string?

@craigwarner-ufastro
Copy link
Contributor

Wow. That is really strange. I was able to reproduce the error and the success when replacing with scipy.ndimage.median_filter but I have no idea why or how cupyx.scipy.ndimage.median_filter could cause that side effect.

@weaverba137
Copy link
Member

Do any of the scripts tested in test_binscripts.py use desispec.bootcalib at all?

Is the same cupyx failure seen on main, or just the bootcalib-test branch?

@dmargala
Copy link
Contributor

It seems like the underlying issue here is that there are non-ascii characters in a file that is parsed using an ascii codec. Is that correct?

From a quick and non-thorough look at the related issues referenced above, it seems like a work-around was put in place instead of a fix to the underlying problem. I'm not sure what or why but perhaps something in cupy is undo-ing that work-around?

source /global/common/software/desi/desi_environment.sh main
git clone https://github.com/desihub/desispec.git
cd desispec
pytest -x py/desispec/test # fails
sed -i s/Å/Ang/g py/desispec/data/arc_lines/*ascii # replace offending character
pytest -x py/desispec/test # passes 

@weaverba137
Copy link
Member

Yes, the file contains non-ascii characters.

The key point here is that it is the test that is broken, not the actual code.

If you simply open Python and load the file using the function in desispec.bootcalib, it just works.

@marcelo-alvarez
Copy link
Contributor Author

@weaverba137 I know the order is determined alphabetically, so maybe not feasible, but in the spirit of making the smallest change possible to keep the unit test in and not getting sidetracked, what do you think of somehow changing the order of unit tests so bootcalib comes first?

@weaverba137
Copy link
Member

@marcelo-alvarez, maybe I'm misunderstanding something here, but it seems like it's a bit too early to give up on fixing the problem.

@marcelo-alvarez
Copy link
Contributor Author

marcelo-alvarez commented Mar 24, 2023

That all depends on your definition of fixing the problem, but I'm happy to use yours :)

@weaverba137
Copy link
Member

OK, let's back up a step and let me ask: in detail, how did you identify that cupyx function as influencing the result?

@sbailey
Copy link
Contributor

sbailey commented Mar 24, 2023

The key point here is that it is the test that is broken, not the actual code.

I don't think that's accurate. The test is checking whether desiboot.load_arcline_list(camera) and desiboot.parse_nist('CdI') work, and it has somewhat accidentally discovered that those commands don't work if cupy is imported. We could say that we don't care because we never need to use bootcalib with cupy (bootcalib is for determining the wavelength solution of a newly installed CCD and is run by hand, not part of the regular pipeline). But it is a code error that it is using an ascii reader for a non-ascii file and that there are some situations where apparently that doesn't work.

Reproducer outsider of the framework of the tests:

from desispec import bootcalib

#- Works
bootcalib.parse_nist('CdI')

#- bootcalib fails if cupy is imported and used
import cupy as cp
x = cp.arange(10)

#- Fails
bootcalib.parse_nist('CdI')

For further context, the files that is it trying to parse are downloaded directly from NIST, so it would be nice to be able to use them as-is and not require a manual cleaning step if we ever need to download an updated version. But that might be less painful that writing our own unicode-aware equivalent of Table.read(filename, format='ascii.fixed_width')

@weaverba137
Copy link
Member

@sbailey, thank you for the clarification. I'm still very interested to hear exactly how @marcelo-alvarez discovered that cupy is having this effect, that could provide additional clues.

@marcelo-alvarez
Copy link
Contributor Author

marcelo-alvarez commented Mar 24, 2023

Thanks for the context @sbailey. I think the solution here may depend on how likely it is that these particular calib files will ever be read after importing and using cupy/cupyx. Given that it hasn't happened in Iron or on nights daily processing was on Perlmutter, maybe it's not likely?

OK, let's back up a step and let me ask: in detail, how did you identify that cupyx function as influencing the result?

@weaverba137 I traversed all lines previous to the unit test failures in question systematically until I came upon the line that, if removed, would result in success.

@dmargala
Copy link
Contributor

Using cupy does seem to have a side effect of changing the preferred encoding that astropy uses when it tries to load the file.

This works (no error):

> python -c 'from astropy.table import Table; Table.read("ArI_air.ascii", format="ascii.fixed_width")'

Adding a bit of cupy reproduces before that reproduces the error:

> python -c 'import cupy; cupy.arange(10); from astropy.table import Table; Table.read("ArI_air.ascii", format="ascii.fixed_width")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/conda/envs/test-ascii/lib/python3.10/site-packages/astropy/table/connect.py", line 62, in __call__
    out = self.registry.read(cls, *args, **kwargs)
  File "/tmp/conda/envs/test-ascii/lib/python3.10/site-packages/astropy/io/registry/core.py", line 219, in read
    data = reader(*args, **kwargs)
  File "/tmp/conda/envs/test-ascii/lib/python3.10/site-packages/astropy/io/ascii/connect.py", line 19, in io_read
    return read(filename, **kwargs)
  File "/tmp/conda/envs/test-ascii/lib/python3.10/site-packages/astropy/io/ascii/ui.py", line 350, in read
    table = fileobj.read()
  File "/tmp/conda/envs/test-ascii/lib/python3.10/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 675: ordinal not in range(128)

Comparing locale.getpreferredencoding() (this seems to be how astropy decides which encoding to use) before and after cupy usage:

> python -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
> python -c 'import cupy; cupy.arange(10); import locale; print(locale.getpreferredencoding())'
ANSI_X3.4-1968

It looks like there is a special Python UTF-8 mode that you can use with either:

> python -X utf8 -c 'import cupy; cupy.arange(10); import locale; print(locale.getpreferredencoding())'
UTF-8
> PYTHONUTF8=1 python -c 'import cupy; cupy.arange(10); import locale; print(locale.getpreferredencoding())'
UTF-8

Forcing Python UTF-8 mode avoids the error:

> python -X utf8 -c 'import cupy; cupy.arange(10); from astropy.table import Table; Table.read("ArI_air.ascii", format="ascii.fixed_width")'

This actually seems more like a CUDA issue rather than cupy. Adding the "ANSI_X3.4-1968" to google search reveals similar occurrences of this issue in other libraries using CUDA besides cupy. This issue NVIDIA/cuda-python#29 has a reference to an nvidia bug report at https://developer.nvidia.com/nvidia_bug/3833924 but I am unable to view that.

@marcelo-alvarez
Copy link
Contributor Author

@dmargala thanks for getting (closer) to the bottom of this. I can't test it myself right now, but maybe a more comprehensive forcing of the encoding in parse_nist could be the solution

@weaverba137
Copy link
Member

I'm planning to run some tests now. I'll work on the bootcalib-test branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants