{lib}[GCCcore/11.2.0,foss/2021b] CuPy v11.4.0, pytest v7.2.2 w/ Python 3.9.6 #17526

Merged

Conversation

akesandgren
Contributor

@akesandgren akesandgren commented Mar 14, 2023

(created using eb --new-pr)

Supersedes #17136

The problems with the scipy tests need more investigation.

Depends on easybuilders/easybuild-easyblocks#2872

Contributor

@Micket Micket left a comment

lgtm

@Micket
Contributor

Micket commented Mar 14, 2023

Test report by @Micket
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
vera-gpu2 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 510.73.08, Python 3.6.8
See https://gist.github.com/532841327d7fe94f33e3100917e8adee for a full test report.

@Micket
Contributor

Micket commented Mar 14, 2023

Test report by @Micket
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
vera-r07-03 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 510.73.08, Python 3.6.8
See https://gist.github.com/e9b66d2fbdc8d90c05cd89a030c8fbda for a full test report.

@akesandgren
Contributor Author

Test report by @akesandgren
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
b-cn1501.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, 2 x NVIDIA Tesla V100-PCIE-16GB, 470.161.03, Python 3.8.10
See https://gist.github.com/613f8ddfadef02445b442a00242ad718 for a full test report.

@Micket
Contributor

Micket commented Mar 14, 2023

Test report by @Micket
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
vera-r07-03 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 510.73.08, Python 3.6.8
See https://gist.github.com/f4babeedc0ffdec0b0dd78566939f864 for a full test report.

@Micket
Contributor

Micket commented Mar 14, 2023

Test report by @Micket
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
vera-gpu2 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 510.73.08, Python 3.6.8
See https://gist.github.com/182b92283c01e4c51a50e4ccc59a7c76 for a full test report.

@Micket
Contributor

Micket commented Mar 14, 2023

Damnit, I filled up my home dir quota with .cupy/kernel_cache. I wonder if we can point this to the build dir.
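
The test command quoted later in this thread sets CUPY_CACHE_DIR to move the kernel cache out of the home directory. A minimal Python sketch of the same idea follows. It assumes the variable is set before CuPy compiles its first kernel, and the temporary directory (with its prefix) is a hypothetical stand-in for a build directory:

```python
import os
import tempfile

# Hypothetical stand-in for an EasyBuild build directory.
cache_dir = tempfile.mkdtemp(prefix="cupy_kernel_cache_")

# CuPy's compiled-kernel cache defaults to ~/.cupy/kernel_cache;
# CUPY_CACHE_DIR overrides that location.
# Set it before importing cupy so no kernels land in the home directory.
os.environ["CUPY_CACHE_DIR"] = cache_dir

print("CuPy kernel cache redirected to", os.environ["CUPY_CACHE_DIR"])
```

Exporting the variable in the shell before invoking pytest, as the later command does, has the same effect for the whole test run.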

@Micket
Contributor

Micket commented Mar 14, 2023

Test report by @Micket
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
vera-r07-03 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 510.73.08, Python 3.6.8
See https://gist.github.com/1af599a8b0f63ed2bb206a0a645a4390 for a full test report.

@Micket
Contributor

Micket commented Mar 14, 2023

uh, i have no idea how the pytest test suite failed on the V100 nodes. it worked before.

('cuDNN', '8.2.2.26', versionsuffix, SYSTEM),
('NCCL', '2.10.3', versionsuffix),
('cuTENSOR', '1.6.1.5', versionsuffix, SYSTEM),
# ('cuSPARSELt', '0.3.0.3', versionsuffix, SYSTEM),
Contributor

Why is this commented? If it's really not needed, can't it just be removed?

Contributor

It was an optional thing, but i had problems getting that to work in my PR. I've forgotten what the reason was, only that i didn't care enough about that optional feature.

Contributor Author

It currently crashes/fails to compile some part, so I left it in place to be enabled for a newer toolchain.

Contributor

Should we put that as a comment? I.e. that it's an optional dep, commented out because it gave issues, but could be re-enabled for newer toolchain?

@casparvl
Contributor

Test report by @casparvl
FAILED
Build succeeded for 1 out of 3 (2 easyconfigs in total)
gcn1.local.snellius.surf.nl - Linux Rocky Linux 8.7, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 515.86.01, Python 3.6.8
See https://gist.github.com/272ac1bd28b66167b669bb39bd439833 for a full test report.

@Micket
Contributor

Micket commented Mar 14, 2023

Test report by @Micket
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
vera-r07-03 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 510.73.08, Python 3.6.8
See https://gist.github.com/41abadd119a6531eb493507d3818718a for a full test report.

@Micket
Contributor

Micket commented Mar 14, 2023

Test report by @Micket
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
vera-gpu2 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 510.73.08, Python 3.6.8
See https://gist.github.com/df325434e96f1d1c0202d8235444ec79 for a full test report.

@casparvl
Contributor

You can ignore my test, I completely overlooked the dependency on the EasyBlock. I'll re-test...

@casparvl
Contributor

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
gcn1.local.snellius.surf.nl - Linux Rocky Linux 8.7, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 515.86.01, Python 3.6.8
See https://gist.github.com/8c4663a10bc2c8ad67d588f971aadd66 for a full test report.

@casparvl
Contributor

Same result, unfortunately...

@akesandgren
Contributor Author

@casparvl you have an odd test failure in pytest already, can you look closer at that?
the pytest part doesn't need the updated easyblock

@boegelbot
Collaborator

@akesandgren: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/4423597098
Output from first failing test suite run:

FAIL: test__parse_easyconfig_pytest-7.2.2-GCCcore-11.2.0.eb (test.easyconfigs.easyconfigs.EasyConfigTest)
Test for easyconfig pytest-7.2.2-GCCcore-11.2.0.eb
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 1543, in innertest
    template_easyconfig_test(self, spec_path)
  File "test/easyconfigs/easyconfigs.py", line 1350, in template_easyconfig_test
    "binutils or GCC is a build dep in %s: %s" % (spec, dep_names))
AssertionError: binutils or GCC is a build dep in /home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/p/pytest/pytest-7.2.2-GCCcore-11.2.0.eb: ['Python']

----------------------------------------------------------------------
Ran 16610 tests in 638.954s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@Micket
Contributor

Micket commented Mar 15, 2023

A40's are failing cupy test

_________________________________________________________________ TestProduct.test_tensordot __________________________________________________________________
../../../../../eb-fn7mklhy/tmphplkbcb0/lib/python3.9/site-packages/cupy/testing/_loops.py:844: in test_func
    impl(*args, **kw)
../../../../../eb-fn7mklhy/tmphplkbcb0/lib/python3.9/site-packages/cupy/testing/_loops.py:362: in test_func
    check_func(cupy_r, numpy_r)
../../../../../eb-fn7mklhy/tmphplkbcb0/lib/python3.9/site-packages/cupy/testing/_loops.py:514: in check_func
    _array.assert_allclose(c, n, rtol1, atol1, err_msg, verbose)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

actual = array([[2938., 3016., 3094., 3172., 3250.],
       [7044., 7264., 7484., 7708., 7932.]], dtype=float16)
desired = array([[2938., 3016., 3094., 3172., 3250.],
       [7040., 7264., 7488., 7708., 7928.]], dtype=float16), rtol = 1e-07, atol = 0, err_msg = ''
verbose = True

@boegel boegel added this to the release after 4.7.1 milestone Mar 15, 2023
@Micket
Contributor

Micket commented Mar 15, 2023

Test report by @Micket
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
vera-r07-03 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 510.73.08, Python 3.6.8
See https://gist.github.com/393f55ec2b0409c1822188e2b019361a for a full test report.

@akesandgren
Contributor Author

You forgot to include the easyblock, CuPy uses "testinstall" which is fixed in that easyblock

@casparvl
Contributor

Ah, my bad, will retest. Thanks for the pointer.

@Micket
Contributor

Micket commented Mar 16, 2023

I was investigating the TestProduct.test_tensordot failure for the A40.
The failing test is comparing "np.tensordot" with "cp.tensordot" for float16

    @testing.for_all_dtypes()
    @testing.numpy_cupy_allclose()
    def test_tensordot(self, xp, dtype):
        a = testing.shaped_arange((2, 3, 4), xp, dtype)
        b = testing.shaped_arange((3, 4, 5), xp, dtype)
        return xp.tensordot(a, b)

I can generate the test data manually

>>> cp.testing.shaped_arange((2, 3, 4), cupy, cupy.float16)
array([[[ 1.,  2.,  3.,  4.],
        [ 5.,  6.,  7.,  8.],
        [ 9., 10., 11., 12.]],

       [[13., 14., 15., 16.],
        [17., 18., 19., 20.],
        [21., 22., 23., 24.]]], dtype=float16)
>>> cp.testing.shaped_arange((3, 4, 5), cupy, cupy.float16)
array([[[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.],
        [11., 12., 13., 14., 15.],
        [16., 17., 18., 19., 20.]],

       [[21., 22., 23., 24., 25.],
        [26., 27., 28., 29., 30.],
        [31., 32., 33., 34., 35.],
        [36., 37., 38., 39., 40.]],

       [[41., 42., 43., 44., 45.],
        [46., 47., 48., 49., 50.],
        [51., 52., 53., 54., 55.],
        [56., 57., 58., 59., 60.]]], dtype=float16)

So, with float16 precision on a T4, it does compute the expected output

>>> np.tensordot(a_np, b_np)
array([[2938., 3016., 3094., 3172., 3250.],
       [7040., 7264., 7488., 7708., 7928.]], dtype=float16)
>>> cp.tensordot(a, b)
array([[2938., 3016., 3094., 3172., 3250.],
       [7040., 7264., 7488., 7708., 7928.]], dtype=float16)

Since it wasn't clear from the actual/desired output, i can confirm that my icelake CPUs also produce the same output using numpy.
So, it's really just the A40 that is computing something different here, incorrectly resulting in

actual = array([[2938., 3016., 3094., 3172., 3250.],
       [7044., 7264., 7484., 7708., 7932.]], dtype=float16)   # wrong results on A40

float16s have poor enough precision that these are actually rounding errors. With higher precision,

>>> np.tensordot(np.array(a, dtype=np.float64), np.array(b, dtype=np.float64))
array([[2938., 3016., 3094., 3172., 3250.],
       [7042., 7264., 7486., 7708., 7930.]])

So these numbers aren't really any worse in terms of floating point accuracy. Annoying that it isn't 1 to 1 with numpy, but i think we just have to ignore this test.
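
The comparison above can be reproduced with NumPy alone. This is a small sketch, not part of the CuPy test suite: it assumes shaped_arange simply fills the array with 1..N (as the printed arrays suggest), and it hard-codes the two float16 results reported in this thread to measure how far each is from the exact float64 answer.

```python
import numpy as np

# Assumption: shaped_arange yields 1..N reshaped, matching the arrays printed above.
a = np.arange(1, 2 * 3 * 4 + 1, dtype=np.float64).reshape(2, 3, 4)
b = np.arange(1, 3 * 4 * 5 + 1, dtype=np.float64).reshape(3, 4, 5)

# Default axes=2: contract the last two axes of a with the first two of b.
exact = np.tensordot(a, b)

# float16 results reported in this thread, widened to float64 for comparison.
numpy_f16 = np.array([[2938, 3016, 3094, 3172, 3250],
                      [7040, 7264, 7488, 7708, 7928]], dtype=np.float64)
a40_f16 = np.array([[2938, 3016, 3094, 3172, 3250],
                    [7044, 7264, 7484, 7708, 7932]], dtype=np.float64)

numpy_err = np.abs(numpy_f16 - exact).max()
a40_err = np.abs(a40_f16 - exact).max()

print(exact)
print("max |NumPy float16 - exact|:", numpy_err)
print("max |A40 float16 - exact|:  ", a40_err)
```

Both float16 results sit equally far from the exact values, which is consistent with the reading that the A40 output is a different rounding rather than a wrong computation.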

@casparvl
Contributor

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
gcn61.local.snellius.surf.nl - Linux Rocky Linux 8.7, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 1 x NVIDIA NVIDIA A100-SXM4-40GB, 515.86.01, Python 3.6.8
See https://gist.github.com/5c2829a1d37077587c1a8cbd6928508e for a full test report.

@casparvl
Contributor

Ok, finally one build that completes. But only when using local scratch disks. Also, I don't see the same test failures that @Micket sees on the A40 (we have A100s). From his comment above, I interpret it as a flaky test though.

How do you both think we should proceed on this? I see two options:

  1. Would be to strip the tests that failed for me from pytest:
FAILED testing/_py/test_local.py::TestPOSIXLocalPath::test_copy_stat_file - AssertionError: asse...
FAILED testing/_py/test_local.py::TestPOSIXLocalPath::test_copy_stat_dir - assert (1678828669.54...
FAILED testing/test_assertrewrite.py::test_source_mtime_long_long[1] - OSError: [Errno 75] Value...
FAILED testing/test_stepwise.py::test_sw_skip_help - Failed: nomatch: '*Implicitly enables --ste...

And the one from @Micket from CuPy. Neither seems to point to issues in the build itself; I'd expect the installation to be sane in both cases.

  2. Would be to keep the tests, but include a remark in the EasyConfigs (or have them print a warning to stdout upon test failure?) that these are known to cause issues, and suggest that EasyBuild users can use --ignore-test-failure to get it installed (or install on a different FS, in the case of pytest).

What do you think?

@Micket
Contributor

Micket commented Mar 16, 2023

Test report by @Micket
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
vera-r07-05 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 510.73.08, Python 3.6.8
See https://gist.github.com/0d1172d424f9b56c0b6ed1b38bf3e0a2 for a full test report.

@akesandgren
Contributor Author

akesandgren commented Mar 22, 2023

A40's are failing cupy test

_________________________________________________________________ TestProduct.test_tensordot __________________________________________________________________
../../../../../eb-fn7mklhy/tmphplkbcb0/lib/python3.9/site-packages/cupy/testing/_loops.py:844: in test_func
    impl(*args, **kw)
../../../../../eb-fn7mklhy/tmphplkbcb0/lib/python3.9/site-packages/cupy/testing/_loops.py:362: in test_func
    check_func(cupy_r, numpy_r)
../../../../../eb-fn7mklhy/tmphplkbcb0/lib/python3.9/site-packages/cupy/testing/_loops.py:514: in check_func
    _array.assert_allclose(c, n, rtol1, atol1, err_msg, verbose)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

actual = array([[2938., 3016., 3094., 3172., 3250.],
       [7044., 7264., 7484., 7708., 7932.]], dtype=float16)
desired = array([[2938., 3016., 3094., 3172., 3250.],
       [7040., 7264., 7488., 7708., 7928.]], dtype=float16), rtol = 1e-07, atol = 0, err_msg = ''
verbose = True

float16 values in the range 4096 to 8192 are spaced 4 apart, so the test is incorrectly written; rtol should probably be {'float16': 1e-3}, which would handle both ranges.
I.e. the representable values in that range are 7040, 7044, 7048, ...

Conclusion: the test is wrong and/or very badly chosen.
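
The spacing argument can be checked directly with NumPy. The sketch below uses the second row of the actual/desired arrays from the failure output; the comparison is done in float64 so the tolerance arithmetic itself is not rounded to float16. It only illustrates the reasoning and is not the change made to the CuPy test.

```python
import numpy as np

# Second row of the failing comparison, as reported above.
actual = np.array([7044., 7264., 7484., 7708., 7932.], dtype=np.float16)   # A40
desired = np.array([7040., 7264., 7488., 7708., 7928.], dtype=np.float16)  # NumPy

# Gap between neighbouring float16 values near 7040.
step = float(np.spacing(np.float16(7040)))
print("float16 spacing near 7040:", step)

# Widen to float64 before comparing.
act64 = actual.astype(np.float64)
des64 = desired.astype(np.float64)
rel_err = np.abs(act64 - des64) / np.abs(des64)
print("max relative error:", rel_err.max())

passes_tight = bool(np.allclose(act64, des64, rtol=1e-7, atol=0))
passes_loose = bool(np.allclose(act64, des64, rtol=1e-3, atol=0))
print("passes rtol=1e-7:", passes_tight)
print("passes rtol=1e-3:", passes_loose)
```

A difference of one float16 step at this magnitude is a relative error of roughly 6e-4, so it fails at rtol=1e-7 and passes at rtol=1e-3.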

@akesandgren
Contributor Author

@casparvl @Micket good enough?
@Micket can you test that change on A40's, it should be correct but I'm a bit too occupied to verify it.

@casparvl
Contributor

@akesandgren Yep, that's a clear description. I would suggest also linking this PR, or more specifically this comment (#17526 (comment)), in there though. That will give people a concrete output to compare their own to if they run into this issue, and provide the full context.

@akesandgren
Contributor Author

@casparvl does that look good?

@Micket
Contributor

Micket commented Apr 4, 2023

Test report by @Micket
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2872
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
alvis7-02 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 520.61.05, Python 3.6.8
See https://gist.github.com/Micket/bfa0af6f2a1ae1f5cc3f46b47bef5f01 for a full test report.

@Micket
Contributor

Micket commented Apr 5, 2023

It's still running the tensordot test.

I tried specifying the full path to this particular test, including the ::test_tensordot part, but it still ran and failed it.

export CUPY_TEST_GPU_LIMIT=1 CUPY_CACHE_DIR="/local/tmp.919635/CuPy/11.4.0/foss-2021b-CUDA-11.4.1" && pytest --ignore=tests/cupyx_tests/scipy_tests --ignore=tests/cupyx_tests/distributed_tests --ignore=tests/cupyx_tests/tools_tests --ignore=tests/example_tests --ignore=tests/cupy_tests/testing_tests/test_parameterized.py --ignore=tests/cupy_tests/core_tests/test_carray.py::TestCArray32BitBoundary_param_ --ignore=tests/cupy_tests/fft_tests/test_fft.py --ignore=tests/cupy_tests/linalg_tests/test_product.py::TestProduct::test_tensordot tests
...
============================================================================================ short test summary info =============================================================================================
FAILED tests/cupy_tests/linalg_tests/test_product.py::TestProduct::test_tensordot - AssertionError: 
============================================================== 1 failed, 29238 passed, 760 skipped, 35 xfailed, 200 warnings in 1350.78s (0:22:30) ===============================================================

'tests/cupy_tests/fft_tests/test_fft.py',
# float16 has too low precision for these tests as they are written
# See https://github.com/easybuilders/easybuild-easyconfigs/pull/17526#issuecomment-1470843170 for details.
'tests/cupy_tests/linalg_tests/test_product.py::TestProduct',
Contributor

So, it turns out this syntax isn't a thing in pytest. This line does nothing. It looks like the syntax from pytest-dev/pytest#3198, but that was never supported; the issue was closed in favor of a regex of what to include.
It needs to be a different flag: -k "not test_tensordot"

Ugh. This sucks. It won't even let us specify the full name, like "TestProduct::test_tensordot", just the last part.

Contributor

So options are

  1. let this fail on A40, whoever installs it just has to ignore test failures.
  2. patch this test out
  3. add the -k flag, which just really sucks if we ever need to add more than one.

Regardless, this broken ::TestProduct ignore line should be removed, as it does nothing

@Micket
Contributor

Micket commented Apr 6, 2023

Test report by @Micket
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
alvis7-02 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 520.61.05, Python 3.6.8
See https://gist.github.com/Micket/083f2e6f0d34ffbcfa12fe67a5df23e9 for a full test report.

@boegel boegel modified the milestones: next release (4.7.2), 4.x Apr 12, 2023
@akesandgren
Contributor Author

@Micket that last commit seems to fix the A40 problem, at least my build on A40s passed cleanly.
Can you verify?

@Micket
Contributor

Micket commented Apr 24, 2023

Why leave in 'tests/cupy_tests/linalg_tests/test_product.py::TestProduct'? It's not a syntax that pytest supports.

@akesandgren
Contributor Author

It is supported when using --deselect, according to the actually merged PR for that part.
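
The distinction between the two flags can be sketched as follows: --ignore works on file or directory paths, while --deselect takes a node ID (or node ID prefix) containing '::'. The helper name build_pytest_args is hypothetical and only illustrates the mapping; how the merged easyconfig and easyblock actually assemble the command is not shown in this thread.

```python
def build_pytest_args(excluded, test_root="tests"):
    """Map excluded entries to pytest flags (hypothetical helper for illustration)."""
    args = []
    for entry in excluded:
        # Node IDs contain '::' and need --deselect; plain paths can use --ignore.
        flag = "--deselect" if "::" in entry else "--ignore"
        args.append(f"{flag}={entry}")
    args.append(test_root)
    return args


excluded = [
    "tests/cupy_tests/fft_tests/test_fft.py",
    "tests/cupy_tests/linalg_tests/test_product.py::TestProduct::test_tensordot",
]
pytest_args = build_pytest_args(excluded)
print(" ".join(["pytest"] + pytest_args))
```

Compared with -k "not test_tensordot", this keeps one entry per excluded test and lets the full node ID be spelled out, so adding further exclusions does not require editing a single combined expression.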

@Micket
Contributor

Micket commented Apr 24, 2023

Test report by @Micket
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
vera-r07-01 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A40, 520.61.05, Python 3.6.8
See https://gist.github.com/Micket/338bb49dd925914a9a3fc897efbe1ca8 for a full test report.

Contributor

@Micket Micket left a comment

lgtm

@Micket
Contributor

Micket commented May 2, 2023

Going in, thanks @akesandgren!

@Micket Micket merged commit 6da8458 into easybuilders:develop May 2, 2023
10 checks passed
@akesandgren akesandgren deleted the 20230314080051_new_pr_CuPy1140 branch May 2, 2023 13:27
@boegel boegel modified the milestones: 4.x, next release (4.7.2) May 26, 2023