ROCm 3.5+ support roadmap #4132
Library support: Improvement:
Known bugs:
We also need a "known ROCm issues" doc page (discussed with @kmaehashi)
|
Looks like ROCm 3.9.0 is out...😞😞😞 |
It seems CuPy master builds just fine on 3.9.0: https://ci.preferred.jp/cupy.py3.amd/64299/ 🦾 |
BTW, I noticed that the cythonized .c and .cpp code can differ substantially between CUDA and HIP builds, so if uploading an sdist to PyPI is on the roadmap, this must be addressed (e.g., can we upload two sdists, one for CUDA and the other for HIP?) |
This is not fixed as of ROCm 3.10/4.0.0. |
We have to take care of branching based on CUDA versions. Lines 1971 to 1977 in 65ca081
|
@kmaehashi This will be handled in #4102. |
Ah, nice! 👍🏼 |
I noticed that in many tests we call |
For texture memory: ROCm has shown renewed interest in it since 3.6.0 (support was added in the ROCclr layer). I have a local branch that makes host-device texture copy work: https://github.com/leofang/cupy/tree/hip_texture, but device-device copy is not working. I consider this low priority unless there are ROCm users with a serious need; otherwise I think we should disable texture support and tests for ROCm/HIP until it catches up. |
I started publishing ROCm 4.0 test results for the latest master branch commit every hour (note that the full unit test run takes 4 hours): I've asked @takagi to work on resolving test failures. |
@kmaehashi @takagi It's likely we forgot to disable testing PTDS (#4322) on HIP. We hit the same segfault before. Stream 1 & 2 cannot be used on HIP. |
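For illustration, a minimal sketch of how a PTDS test could be skipped on HIP builds. It assumes that `cupy.cuda.runtime.is_hip` is the flag identifying a ROCm/HIP build; `detect_hip` and `TestPTDS` are hypothetical names, and this is not necessarily how the actual fix is written.

```python
import unittest


def detect_hip():
    """Return True only if CuPy imports and reports a HIP/ROCm build."""
    try:
        from cupy.cuda import runtime  # needs a working CuPy install
    except Exception:
        # CuPy is missing or unusable here; treat it as "not HIP".
        return False
    # Assumption: `is_hip` is the build flag; default to False if absent.
    return bool(getattr(runtime, "is_hip", False))


IS_HIP = detect_hip()


class TestPTDS(unittest.TestCase):
    # Hypothetical test case; the real PTDS tests in CuPy look different.
    @unittest.skipIf(IS_HIP, "PTDS is not usable on HIP")
    def test_ptds(self):
        # Placeholder body standing in for the real PTDS checks.
        self.assertFalse(IS_HIP)
```

With this pattern the test is reported as skipped on HIP instead of reaching the code path that segfaults, and it still runs on CUDA builds.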
I sent #4651 to fix PTDS. |
Looks like we are back in the game after merging #4651 🙂 |
The CI script has died with |
Cool, thanks. Maybe it was due to allocation timeout? |
The CI script is on the login node, and I'm |
I see, I thought it was run on the compute node interactively. |
@kmaehashi @takagi Looks like something went wrong recently. The number of CI errors went from 9X to 18X: |
This is the offending commit: 74c54b0 |
Some strange new failures have appeared...
|
Is it OK to close this? |
I dunno 😅 Do we want to resolve all xfail tests first (if at all possible)? |
Let's close this as the situation is updated. |
Tracking issue for ROCm tasks before v9
@leofang