ci : cache ROCm installation in windows-latest-cmake-hip #15887
This commit attempts to fix the sporadic failures of the windows-latest-cmake-hip job. This job currently fails sometimes with the following error:

```console
CMake Warning (dev) at C:/Program Files/AMD/ROCm/6.1/lib/cmake/hip/hip-config-amd.cmake:86 (message):
  amdgpu-arch failed with error Failed to load amdhip64.dll: amdhip64.dll:
  Can't open: The specified module could not be found.  (0x7E)
Call Stack (most recent call first):
  C:/Program Files/AMD/ROCm/6.1/lib/cmake/hip/hip-config.cmake:149 (include)
  ggml/src/ggml-hip/CMakeLists.txt:39 (find_package)
```

I was able to reproduce this locally and noticed that my installation of ROCm 6.1 does not have this .dll; it is instead named amdhip64_6.dll. Creating a symbolic link made the command work, and hopefully this will also work for the CI job. Perhaps different runners have the original/old .dll name, which could be why this job sometimes works and sometimes does not.
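The symbolic-link workaround described above can be sketched as follows. This is a minimal illustration in Python, not the change made in this PR: the helper name `ensure_dll_alias` and the `bin_dir` argument are hypothetical, and the directory that actually holds the DLL on a runner is an assumption the caller has to supply. Only the two file names come from the description above.

```python
import os
from pathlib import Path


def ensure_dll_alias(
    bin_dir: Path,
    expected: str = "amdhip64.dll",   # name the tool tries to load (from the error)
    actual: str = "amdhip64_6.dll",   # name found in the local ROCm 6.1 install
) -> bool:
    """Create bin_dir/expected as a symlink to bin_dir/actual when it is missing.

    Returns True if a link was created, False if nothing was done
    (the expected name already exists, or the actual DLL is absent).
    """
    expected_path = bin_dir / expected
    actual_path = bin_dir / actual

    # Leave an existing file or link (even a dangling one) untouched.
    if expected_path.exists() or expected_path.is_symlink():
        return False
    # Nothing to point at: the install is incomplete, a link would not help.
    if not actual_path.exists():
        return False

    os.symlink(actual_path, expected_path)
    return True
```

On Windows, `os.symlink` generally needs administrator rights or Developer Mode, so a CI step using this approach would have to run with suitable privileges.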
This looks like it might work:
The real reason for the sporadic failures is that the installer (for unknown reasons) randomly hangs. I added a 10-minute timeout in #15365, which bypasses the hang, but then the installation is incomplete and you get this build failure instead.
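To make the failure mode concrete, here is a small Python sketch of running an installer under a timeout and treating a timeout as a failed install rather than continuing with a partial one. It is not the code from #15365; the function name `run_installer` is hypothetical, and the command and timeout value are left to the caller.

```python
import subprocess
import sys


def run_installer(cmd: list[str], timeout_s: float) -> bool:
    """Run cmd and return True only if it exits with code 0 within timeout_s seconds."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising, so a hung
        # installer does not keep running; the install must be treated as incomplete.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in for a hanging installer: sleeps far longer than the timeout.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_installer(hung, timeout_s=1.0))
```

Failing the step at this point surfaces the hang directly, instead of letting it show up later as a missing-DLL error during the CMake configure step.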
Thanks, I completely missed that PR. I'll see if we might be able to cache the installation and perhaps that might help with this situation. I've tried this out on my fork, re-running the job to verify that the cache worked:
The |
Looks good in general, but my worry is that we will cache a broken install and have a broken CI until the cache is evicted. Since it seems the missing DLL is a common denominator, perhaps add a check for it and force a reinstall if it is missing? Also, once verified working this needs to be added to
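The suggested guard against caching a broken install could look roughly like this. It is a sketch under stated assumptions, not part of the PR: the function name `rocm_cache_needs_reinstall` is hypothetical, and the relative path `bin/amdhip64_6.dll` is an assumed location for the DLL named earlier in this thread.

```python
from pathlib import Path

# Assumed marker file(s) for a complete install; the bin/ location is a guess.
REQUIRED_FILES = ("bin/amdhip64_6.dll",)


def rocm_cache_needs_reinstall(
    cache_hit: bool,
    rocm_root: Path,
    required: tuple[str, ...] = REQUIRED_FILES,
) -> bool:
    """Decide whether the installer must run.

    Reinstall when the cache was not restored, or when it was restored
    but any required file is missing (i.e. a broken install was cached).
    """
    if not cache_hit:
        return True
    return any(not (rocm_root / rel).is_file() for rel in required)
```

With a check like this, a restored cache that lacks the DLL is treated the same as a cache miss, so a bad cache entry would trigger a fresh install instead of breaking every run until eviction.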
We could manually delete the cache if it turns out that this breaks CI, and at that point add something to work around it. To save some work in case it is not needed, perhaps we can try this out for a few days and see.
This commit applies to the release workflow the same caching that already exists for the main CI workflow, which was introduced in commit ff02caf ("ci : cache ROCm installation in windows-latest-cmake-hip (ggml-org#15887)").
This commit adds caching of the ROCm installation for the windows-latest-cmake-hip job.
The motivation for this is that the installation can sometimes hang and/or not complete properly, leaving an invalid installation that later fails the build. By caching the installation, we can hopefully keep a good installation available in the cache and avoid the installation step.
Refs: #15365