
ROCm 3.5+ support roadmap #4132

Closed
emcastillo opened this issue Oct 14, 2020 · 26 comments
Labels: hip (Topic: AMD ROCm / HIP), prio:high

Comments

emcastillo (Member) commented Oct 14, 2020

Tracking issue for ROCm tasks before v9:

  • rocSOLVER / hipMAGMA (@leofang)
kmaehashi added the hip (Topic: AMD ROCm / HIP) and prio:high labels on Oct 20, 2020
leofang (Member) commented Oct 20, 2020

Library support:

Improvement:

Known bugs:

We also need a "known ROCm issues" doc page (discussed with @kmaehashi)

leofang (Member) commented Nov 2, 2020

Looks like ROCm 3.9.0 is out...😞😞😞

leofang (Member) commented Nov 2, 2020

> Looks like ROCm 3.9.0 is out...😞😞😞

It seems CuPy master builds just fine on 3.9.0: https://ci.preferred.jp/cupy.py3.amd/64299/ 🦾

leofang (Member) commented Nov 26, 2020

btw I noticed the cythonized .c & .cpp code can be very different between CUDA and HIP builds, so if uploading an sdist to PyPI is on the roadmap, this issue must be addressed (e.g., can we upload two sets of sdists, one for CUDA and the other for HIP?)

leofang (Member) commented Dec 9, 2020

kmaehashi (Member) commented:

#4461:

> grids with total size of 2**32 do not launch threads at all.

This is not fixed as of ROCm 3.10/4.0.0.
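
For context, a hedged repro sketch of that symptom (not taken from #4461 itself; it assumes a ROCm build of CuPy and enough free device memory):

import cupy

# Launch a trivial kernel on a grid whose total thread count is 2**32 and check
# whether any thread actually ran; on the affected ROCm versions the launch
# reportedly does nothing, leaving the flag at 0.
kernel = cupy.RawKernel(r'''
extern "C" __global__ void touch(int* flag) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        flag[0] = 1;   // set by the very first thread, if the grid launches at all
    }
}
''', 'touch')

flag = cupy.zeros(1, dtype=cupy.int32)
block = 256
grid = (2 ** 32) // block            # total threads = grid * block = 2**32
kernel((grid,), (block,), (flag,))
print(int(flag[0]))                  # expected 1; the reported bug would leave 0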

kmaehashi (Member) commented:

We have to take care of branching based on CUDA versions.
e.g.

cupy/cupy/core/core.pyx

Lines 1971 to 1977 in 65ca081

cuda_path = cuda.get_cuda_path()
if bundled_include is None and cuda_path is None:
    raise RuntimeError(
        'Failed to auto-detect CUDA root directory. '
        'Please specify `CUDA_PATH` environment variable if you '
        'are using CUDA versions not yet supported by CuPy.')
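
As an illustration only (not CuPy's actual internal logic), the kind of backend-aware branch this implies; it assumes the cupy.cuda.runtime.is_hip flag and runtimeGetVersion() are available in the installed build:

import cupy

# The CUDA_PATH auto-detection above only makes sense on CUDA builds; a HIP
# build would need to consult the ROCm installation instead.
if cupy.cuda.runtime.is_hip:
    print('HIP build; runtime reports', cupy.cuda.runtime.runtimeGetVersion())
else:
    print('CUDA build; runtime reports', cupy.cuda.runtime.runtimeGetVersion())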

leofang (Member) commented Dec 24, 2020

> We have to take care of branching based on CUDA versions.

@kmaehashi This will be handled in #4102.

kmaehashi (Member) commented:

> > We have to take care of branching based on CUDA versions.
>
> @kmaehashi This will be handled in #4102.

Ah, nice! 👍🏼

leofang (Member) commented Jan 16, 2021

I noticed that in many tests we call runtimeGetVersion() to determine whether to skip a test. The effect of this on HIP is to skip unconditionally, since on HIP runtimeGetVersion() gives an unexpected version number like 3212 (likely from the HIP runtime ROCclr, in which AMD_PLATFORM_BUILD_NUMBER is defined as such) that is much smaller than the currently supported CUDA versions. This is OK, but we should probably revisit it at some point.
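
For illustration, a skip guard of the kind described above (hypothetical threshold, not a specific CuPy test); on HIP it always skips because the reported version is far below any CUDA release number:

import unittest
import cupy

# On CUDA builds runtimeGetVersion() returns values like 10020 (CUDA 10.2) or
# 11000; on HIP it returns a small ROCclr build number such as 3212, so this
# condition is always true there and the test is skipped unconditionally.
@unittest.skipIf(cupy.cuda.runtime.runtimeGetVersion() < 10020,
                 'requires CUDA 10.2 or later')
class TestNeedsRecentCuda(unittest.TestCase):
    def test_something(self):
        self.assertTrue(True)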

leofang (Member) commented Jan 21, 2021

For texture memory: ROCm has had renewed interest in it since 3.6.0 (the support is added in the ROCclr layer). I have a local branch that makes host-device texture copies work: https://github.com/leofang/cupy/tree/hip_texture, but device-device copies are not working. I consider this low priority unless there are ROCm users with a serious need; otherwise, I think we should disable texture support and its tests for ROCm/HIP until it catches up.

kmaehashi (Member) commented:

I started to publish ROCm 4.0 test results for the latest master branch commit, every hour (note that the full unit test run takes 4 hours):
https://github.com/kmaehashi/cupy-rocm-ci-report/commits/gh-pages
(It seems there's a breaking change for ROCm between commit 1f6f9eb and f1bda44 ...)

I've asked @takagi to work on resolving test failures.

kmaehashi assigned takagi and emcastillo and unassigned emcastillo on Feb 9, 2021
leofang (Member) commented Feb 9, 2021

> (It seems there's a breaking change for ROCm between commit 1f6f9eb and f1bda44 ...)
>
> tests/cupy_tests/cuda_tests/test_stream.py ..Fatal Python error: Segmentation fault

@kmaehashi @takagi It's likely we forgot to disable testing PTDS (#4322) on HIP. We hit the same segfault before. Stream 1 & 2 cannot be used on HIP.
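
Background sketch of what "Stream 1 & 2" refers to (my reading; the thread does not spell it out): CUDA reserves the pseudo-handles 1 (cudaStreamLegacy) and 2 (cudaStreamPerThread), and a PTDS build routes work through handle 2; HIP has no equivalent pseudo-handles, so exercising them there can crash.

import cupy

# Wrapping the reserved CUDA stream handles; on a HIP build these pseudo-handles
# do not exist, which is consistent with the segfault seen in test_stream.py.
legacy = cupy.cuda.ExternalStream(1)       # cudaStreamLegacy
per_thread = cupy.cuda.ExternalStream(2)   # cudaStreamPerThread (what PTDS uses)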

leofang (Member) commented Feb 9, 2021

I sent #4651 to fix PTDS.

leofang (Member) commented Feb 10, 2021

Looks like we are back in the game after merging #4651 🙂
kmaehashi/cupy-rocm-ci-report@989f57f

kmaehashi (Member) commented:

The CI script has died with srun: error: Node failure on xxx-xxx2-09... I modified the script to respawn when a node fails, and restarted it.

leofang (Member) commented Feb 19, 2021

Cool, thanks. Maybe it was due to allocation timeout?

kmaehashi (Member) commented:

The CI script is on the login node, and I srun each test run, so I guess it's unlikely.
https://github.com/kmaehashi/cupy-rocm-ci-report/blob/master/tools/daemon.sh

leofang (Member) commented Feb 19, 2021

I see, I thought it was run on the compute node interactively.

leofang (Member) commented Mar 5, 2021

@kmaehashi @takagi Looks like something went wrong recently. The number of CI errors went from 9X to 18X:
https://github.com/kmaehashi/cupy-rocm-ci-report/compare/ce51038..d66d2c55
Looks like sorting is broken.

leofang (Member) commented Mar 5, 2021

> @kmaehashi @takagi Looks like something went wrong recently. The number of CI errors went from 9X to 18X:
> https://github.com/kmaehashi/cupy-rocm-ci-report/compare/ce51038..d66d2c55
> Looks like sorting is broken.

This is the offending commit: 74c54b0

takagi (Member) commented Mar 11, 2021

Just got some strange new failures...

FAILED tests/cupy_tests/lib_tests/test_polynomial.py::TestPoly1dInit_param_0_{variable=None}::test_poly1d_numpy_poly1d
FAILED tests/cupy_tests/lib_tests/test_polynomial.py::TestPoly1dInit_param_0_{variable=None}::test_poly1d_numpy_poly1d_variable
FAILED tests/cupy_tests/lib_tests/test_polynomial.py::TestPoly1dInit_param_1_{variable='y'}::test_poly1d_numpy_poly1d
FAILED tests/cupy_tests/lib_tests/test_polynomial.py::TestPoly1dInit_param_1_{variable='y'}::test_poly1d_numpy_poly1d_variable
E       AssertionError: 
E       Arrays are not equal
E       
E       Mismatched elements: 2 / 5 (40%)
E       Max absolute difference: 1.
E       Max relative difference: 1.
E        x: array([0.   , 1.875, 3.   , 4.   , 5.   ], dtype=float32)
E        y: array([1., 2., 3., 4., 5.], dtype=float32)
>>> import cupy, numpy
>>> cupy.asarray(numpy.arange(5, dtype='d')+1).get()
array([1., 2., 3., 4., 5.])
>>> cupy.asarray(numpy.arange(5, dtype='f')+1).get()
array([0.   , 1.875, 3.   , 4.   , 5.   ], dtype=float32)

1.875 is the upper four bytes of the double 1.0 interpreted as a (single-precision) float, so the first two elements seem to be occupied by 1.0 in double.
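
As a sanity check on that reading (a small verification snippet, not from the thread):

>>> import struct
>>> # little-endian double 1.0 is b'\x00\x00\x00\x00\x00\x00\xf0?'; its upper
>>> # four bytes are 0x3FF00000, which reinterpreted as a float32 decode to 1.875
>>> struct.unpack('<f', struct.pack('<d', 1.0)[4:])[0]
1.875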

kmaehashi (Member) commented:

Is it OK to close this?

leofang (Member) commented Jun 17, 2021

I dunno 😅 Do we want to resolve all xfail tests first (if at all possible)?

takagi (Member) commented Mar 10, 2022

Let's close this, as the situation has been updated.

takagi closed this as completed on Mar 10, 2022