
Conversation

@ggerganov
Member

ggerganov commented Sep 19, 2025

  • With this change, there is no longer a need to add ggml-ci to commit messages to trigger the CI
  • The V100 runner now runs both CUDA and Vulkan CI
  • Add T4 runner for CUDA, Vulkan Coopmat1 and Vulkan Coopmat2 runs
  • The Mac Mini runner will run both Metal and Vulkan CI
  • Disable test-backend-ops in Debug builds due to slowness (alternatively, we could add a Release build config with asserts enabled and run just that)
  • Add instructions for adding self-hosted runners (a rough workflow sketch follows this list)
  • Move the MUSA CI instructions to separate README
  • Add AMD V710 runner for Vulkan and ROCm workflows
  • Switch to more lightweight Qwen3 0.6B model
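
As a rough sketch of the shape of such a job (not the exact workflow from this PR): a self-hosted runner is selected by its labels and invokes the existing ci/run.sh driver. The label set, workflow name, and build toggle below are illustrative assumptions.

```yaml
# Illustrative sketch only -- labels and paths are assumptions,
# not the exact workflow added in this PR.
name: ggml-ci

on: [push, pull_request]   # fires on every push; no "ggml-ci" commit keyword required

jobs:
  cuda-v100:
    # the self-hosted runner is selected by its labels (assumed names)
    runs-on: [self-hosted, Linux, X64, V100]
    steps:
      - uses: actions/checkout@v4
      - name: Run ggml-ci (CUDA)
        run: |
          # ci/run.sh drives the full ggml-ci; GG_BUILD_CUDA selects the CUDA build
          GG_BUILD_CUDA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
```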

@github-actions github-actions bot added the devops label (improvements to build systems and github actions) Sep 19, 2025
@ggerganov ggerganov force-pushed the gg/ggml-ci-self-host branch 3 times, most recently from e05d6c2 to b50a96f on September 19, 2025 17:47
@ggerganov ggerganov marked this pull request as ready for review September 19, 2025 19:12
@ggerganov ggerganov changed the title ci : try to migrate ggml ci to a self-hosted runner ci : migrate ggml ci to a self-hosted runner Sep 19, 2025
@ggerganov ggerganov changed the title ci : migrate ggml ci to a self-hosted runner ci : migrate ggml ci to a self-hosted runners Sep 19, 2025
@ggerganov ggerganov changed the title ci : migrate ggml ci to a self-hosted runners ci : migrate ggml ci to self-hosted runners Sep 19, 2025
@netrunnereve
Collaborator

It'll definitely be good to have a Vulkan CI. Is it not crashing now on Nvidia?

BTW the V100 is pretty old, and I wonder if it's worth upgrading to something newer with coopmat and coopmat2 support. Budget-wise it should be possible to move the existing ARM and x86 runs to the free GitHub machines (I think 16 GB and 4 cores are enough for the CI) or have the AMX machine do double duty as an x86 CPU runner.

@ggerganov
Member Author

It'll definitely be good to have a Vulkan CI. Is it not crashing now on Nvidia?

On both the V100 and the T4, the F16 EXP tests fail because the Vulkan F16 kernels return 65504.000000 (the largest finite F16 value) instead of inf:

[EXP] inf mismatch: Vulkan0=65504.000000 CPU=inf   EXP(type=f16,ne_a=[128,2,2,2],v=0): FAIL
[EXP] inf mismatch: Vulkan0=65504.000000 CPU=inf   EXP(type=f16,ne_a=[5,7,11,13],v=0): FAIL

Is this a known issue, and is there a workaround?
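
For anyone reproducing this, test-backend-ops can be narrowed to a single op and backend. A hypothetical CI step (the binary path is an assumption; flag spellings are worth double-checking against the tool's --help):

```yaml
- name: Reproduce the F16 EXP mismatch (illustrative)
  run: |
    # -o filters by op name, -b by backend name as printed in the log above
    ./build/bin/test-backend-ops test -b Vulkan0 -o EXP
```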

@am17an
Collaborator

am17an commented Sep 20, 2025

Maybe #15652?

@ggerganov
Member Author

Maybe #15652?

Yes, probably a patch like this is needed. I'll leave it to be fixed on master.

@netrunnereve

For the Vulkan runs, we now have:

  • V100 (no coopmat)
  • T4 with coopmat1
  • T4 with coopmat2
  • Will add a Mac workflow on Tuesday

Let me know if there are any additional Vulkan configs that we can add. Or any machines from the Azure cloud that can be useful to exercise in addition to the current ones.

@ggerganov ggerganov requested a review from slaren September 20, 2025 08:40
@netrunnereve
Collaborator

Let me know if there are any additional Vulkan configs that we can add. Or any machines from the Azure cloud that can be useful to exercise in addition to the current ones.

For Vulkan your runs already cover most cases, including the FP16 path (Mac with MoltenVK), the integer dot product path (V100), and coopmat 1 and 2 (T4). We also already run the tests for the FP16 path with llvmpipe using the standard GitHub machines. The only remaining path is the FP32 one (run with GGML_VK_DISABLE_F16, GGML_VK_DISABLE_COOPMAT, GGML_VK_DISABLE_COOPMAT2, and GGML_VK_DISABLE_INTEGER_DOT_PRODUCT), which is only used on old GPUs. If we have room for that then go for it; otherwise I don't think it's that important.
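
If the FP32 path does get added, a sketch of the step, assuming the four environment variables above are the only toggles needed (the binary path is also an assumption):

```yaml
- name: test-backend-ops, Vulkan FP32 path (illustrative)
  run: |
    # force plain FP32 shaders by disabling every fast path
    GGML_VK_DISABLE_F16=1 \
    GGML_VK_DISABLE_COOPMAT=1 \
    GGML_VK_DISABLE_COOPMAT2=1 \
    GGML_VK_DISABLE_INTEGER_DOT_PRODUCT=1 \
    ./build/bin/test-backend-ops test -b Vulkan0
```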

If you're thinking of getting another machine, the NVads V710 is an AMD one which could be used for the ROCm CI as well as an additional Vulkan test case for benchmarking and detecting driver bugs. Having an Intel GPU would be nice for SYCL and Vulkan as well, but Azure doesn't have those.

FYI, while I was looking into this I also discovered that the V100 Azure machines will be discontinued at the end of this month, so that'll have to be dealt with eventually.

@slaren
Member

slaren left a comment

I think this is a good starting point. Some notes:

  • A lot of this could be consolidated into a single job using a matrix (see the sketch after these notes)
  • It may be too heavy to run the full ggml-ci on every PR push; it may be necessary to leave the full version for master only and use lighter tests for PRs
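
A minimal sketch of the matrix idea, assuming one label per runner and GG_BUILD_* toggles like those used by ci/run.sh (all names here are illustrative):

```yaml
jobs:
  ggml-ci:
    strategy:
      matrix:
        include:
          - { runner: v100, build: GG_BUILD_CUDA=1 }    # assumed runner labels
          - { runner: v100, build: GG_BUILD_VULKAN=1 }
          - { runner: t4,   build: GG_BUILD_VULKAN=1 }
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - run: ${{ matrix.build }} bash ./ci/run.sh ./tmp/results ./tmp/mnt
```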

@ggerganov
Member Author

ggerganov commented Sep 21, 2025

@netrunnereve Do you know which driver I need to install for the AMD V710 GPU?

https://github.com/ggml-org/llama.cpp/actions/runs/17892383263/job/50874519651?pr=16116

Currently I have installed the Vulkan SDK and the project builds, but it does not detect the GPU.

Edit: I think I found it: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/azure-n-series-amd-gpu-driver-linux-installation-guide
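
For the record, the linked guide goes through AMD's amdgpu-install package. A heavily hedged outline of the steps, with <version> as a placeholder to be taken from the guide (this is one-time VM setup rather than a per-run workflow step):

```yaml
- name: Install AMD GPU driver (one-time setup, illustrative)
  run: |
    # <version> and the Ubuntu release path are placeholders -- use the guide's URLs
    wget https://repo.radeon.com/amdgpu-install/<version>/ubuntu/jammy/amdgpu-install_<version>_all.deb
    sudo apt install -y ./amdgpu-install_<version>_all.deb
    # "graphics" pulls in the Vulkan userspace driver, "rocm" adds the ROCm stack
    sudo amdgpu-install -y --usecase=graphics,rocm
```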

@CISC
Collaborator

CISC commented Sep 21, 2025

It may be too heavy to run the full ggml-ci on every PR push; it may be necessary to leave the full version for master only and use lighter tests for PRs

Perhaps limit them to pushes that touch the related backend?
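
GitHub Actions supports this directly with paths filters on the trigger; a sketch, assuming the current repo layout (the glob is an assumption):

```yaml
# run the Vulkan job only when Vulkan-related sources change (illustrative)
on:
  pull_request:
    paths:
      - 'ggml/src/ggml-vulkan/**'
```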

@ggerganov ggerganov merged commit 28baac9 into master Sep 21, 2025
1 check passed
@ggerganov ggerganov deleted the gg/ggml-ci-self-host branch September 21, 2025 13:50
struct pushed a commit to struct/llama.cpp that referenced this pull request Sep 26, 2025
* ci : migrate ggml ci to self-hosted runners

* ci : add T4 runner

* ci : add instructions for adding self-hosted runners

* ci : disable test-backend-ops from debug builds due to slowness

* ci : add AMD V710 runner (vulkan)

* cont : add ROCM workflow

* ci : switch to qwen3 0.6b model

* cont : fix the context size