@danbev danbev commented Sep 9, 2025

This commit adds caching of the ROCm installation for the windows-latest-cmake-hip job.

The motivation is that the installation can sometimes hang or fail to complete, leaving an invalid installation that later breaks the build. By caching the installation, we can hopefully keep a good installation available in the cache and skip the installation step.
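As a rough illustration of the idea, a cache step placed before the install step could look like the sketch below. The step id `cache-rocm` and the cache key are hypothetical names chosen for this example, and the install step body is a placeholder. Only the ROCm path comes from the error output in this thread.

```yaml
# Sketch only, not the merged workflow. Step id and cache key are hypothetical.
- name: Cache ROCm installation
  id: cache-rocm                          # hypothetical step id
  uses: actions/cache@v4
  with:
    path: C:\Program Files\AMD\ROCm       # install root seen in the CMake error
    key: rocm-6.1-${{ runner.os }}-v1     # hypothetical key; bump the suffix to force a fresh install

- name: Install ROCm
  # Skip the installer entirely when the cache was restored
  if: steps.cache-rocm.outputs.cache-hit != 'true'
  run: |
    Write-Host "Download and run the ROCm installer here (omitted in this sketch)"
```

With this layout the installer only runs on a cache miss, so a hang can only occur on runs that have to populate the cache.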

Refs: #15365

This commit attempts to fix the sporadic failures of the
windows-latest-cmake-hip job.

This job currently fails sometimes with the following error:
```console
CMake Warning (dev) at C:/Program Files/AMD/ROCm/6.1/lib/cmake/hip/hip-config-amd.cmake:86 (message):
  amdgpu-arch failed with error Failed to load amdhip64.dll: amdhip64.dll:
  Can't open: The specified module could not be found.  (0x7E)
Call Stack (most recent call first):
  C:/Program Files/AMD/ROCm/6.1/lib/cmake/hip/hip-config.cmake:149 (include)
  ggml/src/ggml-hip/CMakeLists.txt:39 (find_package)
```
I was able to reproduce this locally and noticed that my installation
of ROCm 6.1 does not have this .dll; it is instead named
amdhip64_6.dll. Creating a symbolic link allowed this command
to work, and hopefully this will work for the CI job.
Perhaps different runners have the original/old .dll name, and this could
be why this job sometimes works and sometimes does not.
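For reference, the symbolic-link workaround described above might be expressed as a workflow step along these lines. This is a sketch under assumptions: the `bin` directory as the DLL location and the step name are guesses for illustration, not taken from the commit.

```yaml
# Sketch only. The bin directory is an assumed location for the DLLs.
- name: Link amdhip64.dll to amdhip64_6.dll
  run: |
    $bin    = "C:\Program Files\AMD\ROCm\6.1\bin"   # assumed DLL directory
    $link   = Join-Path $bin "amdhip64.dll"
    $target = Join-Path $bin "amdhip64_6.dll"
    # Only create the link when the old name is missing and the new one exists
    if ((-not (Test-Path $link)) -and (Test-Path $target)) {
        New-Item -ItemType SymbolicLink -Path $link -Target $target | Out-Null
    }
```

Creating symbolic links on Windows normally needs elevated privileges or Developer Mode, so a step like this depends on how the runner account is configured.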
@github-actions github-actions bot added the devops label (improvements to build systems and github actions) Sep 9, 2025

danbev commented Sep 9, 2025

@danbev danbev marked this pull request as ready for review September 9, 2025 06:21

CISC commented Sep 9, 2025

The real reason for the sporadic failures is that the installer (for unknown reasons) randomly hangs. I added a 10 minute timeout in #15365 which bypasses the hang, but then the installation is incomplete and you get this build failure instead.


danbev commented Sep 9, 2025

> I added a 10 minute timeout in #15365 which bypasses the hang, but then the installation is incomplete and you get this build failure instead.

Thanks, I completely missed that PR. I'll see if we might be able to cache the installation and perhaps that might help with this situation.

I've tried this out on my fork, re-running the job to verify that the cache worked:
https://github.com/danbev/llama.cpp/actions/runs/17579193508/job/49934493336#step:4:18


CISC commented Sep 9, 2025

The $proc.WaitForExit call will return False if timeout occurs, so perhaps just invalidate the cache and bail out in this case. The symlink is no longer needed.
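A minimal sketch of that timeout handling follows. The installer path and the `-install` argument are assumptions made for illustration; the 10 minute limit matches the timeout mentioned earlier in the thread.

```yaml
# Sketch only. Installer path and "-install" argument are assumptions.
- name: Install ROCm
  run: |
    $installer = "${env:RUNNER_TEMP}\rocm-install.exe"   # hypothetical download location
    $proc = Start-Process $installer -ArgumentList '-install' -NoNewWindow -PassThru
    # WaitForExit(milliseconds) returns $false if the process is still running at the timeout
    $completed = $proc.WaitForExit(600000)               # 10 minutes
    if (-not $completed) {
        $proc.Kill()
        Write-Host "ROCm installation timed out after 10 minutes"
        exit 1                                           # fail the step instead of continuing
    }
```

Failing the step here, rather than continuing with a partial install, should also keep the broken directory out of the cache, since to my knowledge `actions/cache` only saves in its post step when the job succeeds.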


CISC commented Sep 9, 2025

Looks good in general, but my worry is that we will cache a broken install and have a broken CI until cache is evicted. Since it seems the missing DLL is a common denominator, perhaps add a check for it and force reinstall if missing?
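One way such a check could be wired in is sketched below. The step id `rocm-check`, the output name `valid`, and the DLL directory are all hypothetical. Accepting either DLL name is also an assumption, based on the report that some installations ship `amdhip64_6.dll` instead of `amdhip64.dll`.

```yaml
# Sketch only. Step id, output name, and DLL directory are hypothetical.
- name: Check ROCm installation
  id: rocm-check
  run: |
    $bin = "C:\Program Files\AMD\ROCm\6.1\bin"   # assumed DLL directory
    # Treat the install as valid if either DLL name is present
    $ok = (Test-Path "$bin\amdhip64.dll") -or (Test-Path "$bin\amdhip64_6.dll")
    "valid=$($ok.ToString().ToLower())" >> $env:GITHUB_OUTPUT

- name: Install ROCm
  # Runs on a cache miss and also when a restored cache fails the check
  if: steps.rocm-check.outputs.valid != 'true'
  run: |
    Write-Host "Run the ROCm installer here (omitted in this sketch)"
```

Gating the install on the check rather than on the cache-hit flag means a restored but broken cache still triggers a reinstall.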

Also, once verified working this needs to be added to release.yml too.


danbev commented Sep 9, 2025

> Looks good in general, but my worry is that we will cache a broken install and have a broken CI until cache is evicted.

We could manually delete the cache if it turns out that this breaks CI, and at that point add something to work around it. To save some work in case it is not needed, perhaps we can try this out for a few days and see.
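If manual deletion is ever needed, it can be done from the repository's Actions cache page, or scripted. The sketch below is a hypothetical manually triggered workflow that uses the GitHub CLI. The workflow name and the cache key are made up for this example and would have to match whatever key the cache step actually uses.

```yaml
# Sketch only. Workflow name and cache key are hypothetical.
name: Delete ROCm cache
on: workflow_dispatch          # run by hand from the Actions tab
permissions:
  actions: write               # needed to delete Actions caches
jobs:
  delete:
    runs-on: ubuntu-latest
    steps:
      - name: Delete cache entry
        run: gh cache delete "rocm-6.1-Windows-v1" --repo "$GITHUB_REPOSITORY"
        env:
          GH_TOKEN: ${{ github.token }}
```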

@danbev danbev changed the title from "ci : add symbolic link for amdhip64.dll if needed" to "ci : cache ROCm installation in windows-latest-cmake-hip" Sep 10, 2025
@danbev danbev merged commit ff02caf into ggml-org:master Sep 10, 2025
44 checks passed
danbev added a commit to danbev/llama.cpp that referenced this pull request Sep 10, 2025
This commit applies the same caching to the release workflow which
currently exists for the main CI workflow that was introduced in Commit
ff02caf ("ci : cache ROCm installation
in windows-latest-cmake-hip (ggml-org#15887)").
danbev added a commit that referenced this pull request Sep 10, 2025
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
…15924)
