ci: add GPU backend test matrix and transform all builds into test jobs #65
Add a new `build-features` job with a matrix covering all GPU backend features: cuda, flash-attn, vulkan, rocm, and metal.
Each entry builds, runs clippy, and compile-checks tests (--no-run) since CI runners have no GPU hardware. CUDA entries install the CUDA toolkit via Jimver/cuda-toolkit. Metal runs on macOS; the rest on Ubuntu. Existing test, clippy, Android, and iOS jobs are unchanged.
Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/93e5edd8-b2ee-40c9-8dbd-caaafd26cafa
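The per-feature plan described in this commit can be pictured with a small dry-run script. This is only a sketch: `plan_for_feature` is a hypothetical helper, the runner labels `ubuntu-latest` and `macos-latest` are assumptions, and the exact cargo arguments in ci.yml may differ (for example workspace or package flags). It prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the runner and the compile-only checks for one feature.
# plan_for_feature and the runner labels are hypothetical, not taken from ci.yml.
plan_for_feature() {
  local feature="$1" runner
  case "$feature" in
    metal)                       runner="macos-latest" ;;   # Metal needs macOS
    cuda|flash-attn|vulkan|rocm) runner="ubuntu-latest" ;;  # the rest on Ubuntu
    *) echo "unknown feature: $feature" >&2; return 1 ;;
  esac
  printf 'runner: %s\n' "$runner"
  printf 'cargo build --features %s\n' "$feature"
  printf 'cargo clippy --features %s\n' "$feature"
  # --no-run compiles the test binaries without executing them (no GPU on CI).
  printf 'cargo test --no-run --features %s\n' "$feature"
}

# Print the plan for every entry of the matrix.
for f in cuda flash-attn vulkan rocm metal; do
  plan_for_feature "$f"
done
```

Keeping the feature-to-runner mapping in one place mirrors what a matrix `include` list does in the workflow file: each entry carries its own runner and feature name.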
- Upgrade Jimver/cuda-toolkit from v0.2.16 to v0.2.30 in both ci.yml and release.yml (v0.2.16 returns 404 for the CUDA keyring on Ubuntu 24.04)
- Allow clippy::missing_transmute_annotations on RocmFfi::load() (the sym! macro infers transmute target type from struct field assignment context)
Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/519aae09-3cfa-4d45-9cdd-0941e0f275de
Fixed in a95ebc3:
evilsocket left a comment
Fix the issues, test locally until it works before sending a commit:
/usr/bin/sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
/usr/bin/sudo add-apt-repository deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /
Get:1 file:/etc/apt/apt-mirrors.txt Mirrorlist [144 B]
Hit:2 http://azure.archive.ubuntu.com/ubuntu noble InRelease
Hit:7 https://packages.microsoft.com/repos/azure-cli noble InRelease
Hit:3 http://azure.archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:4 http://azure.archive.ubuntu.com/ubuntu noble-backports InRelease
Hit:5 http://azure.archive.ubuntu.com/ubuntu noble-security InRelease
Hit:8 https://packages.microsoft.com/ubuntu/24.04/prod noble InRelease
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 InRelease [1581 B]
Get:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 Packages [1312 kB]
Fetched 1313 kB in 1s (2257 kB/s)
Reading package lists...
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list:1 and /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list:1
W: Target Translations (en) is configured multiple times in /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list:1 and /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list:1
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list:1 and /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list:1
W: Target Translations (en) is configured multiple times in /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list:1 and /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list:1
Repository: 'deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /'
Description:
Archive for codename: / components:
More info: http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/
Adding repository.
Adding deb entry to /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list
/usr/bin/sudo apt-get update
Get:1 file:/etc/apt/apt-mirrors.txt Mirrorlist [144 B]
Hit:2 http://azure.archive.ubuntu.com/ubuntu noble InRelease
Hit:3 http://azure.archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:4 http://azure.archive.ubuntu.com/ubuntu noble-backports InRelease
Hit:5 http://azure.archive.ubuntu.com/ubuntu noble-security InRelease
Hit:6 https://packages.microsoft.com/repos/azure-cli noble InRelease
Hit:7 https://packages.microsoft.com/ubuntu/24.04/prod noble InRelease
Hit:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 InRelease
Reading package lists...
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list:1 and /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list:1
W: Target Translations (en) is configured multiple times in /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list:1 and /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list:1
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list:1 and /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list:1
W: Target Translations (en) is configured multiple times in /etc/apt/sources.list.d/archive_uri-http_developer_download_nvidia_com_compute_cuda_repos_ubuntu2404_x86_64_-noble.list:1 and /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list:1
/usr/bin/sudo apt-get -y install cuda-12-4
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package cuda-12-4
Error: Error: The process '/usr/bin/sudo' failed with exit code 100
The `method: network` apt-based install fails on Ubuntu 24.04 because the `cuda-12-4` package doesn't exist in NVIDIA's apt repo for that OS. Switch to `method: local`, which downloads the .run installer directly, bypassing the broken apt repository. Also skip samples installation to save CI time.
Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/890ff0a7-afaf-4c50-be99-050a61dcf68a
Fixed in 173bf4f. Switched CUDA toolkit install from |
evilsocket left a comment
/usr/bin/sudo cuda_installer-linux-6.14.0-1017-azure-x64-12.4.0/cuda_installer-linux-6.14.0-1017-azure-x64_12.4.0.run --silent --toolkit
/usr/bin/sudo chmod 644 /var/log/cuda-installer.log
/usr/bin/sudo chown runner /var/log/cuda-installer.log
Artifact name is valid!
Root directory input is valid!
Error: Error: Failed to CreateArtifact: Received non-retryable error: Failed request: (409) Conflict: an artifact with this name already exists on the workflow run
…e collision
When multiple matrix jobs use Jimver/cuda-toolkit in the same workflow run, they each try to upload a log artifact with the same default name, causing a 409 Conflict. The log-file-suffix parameter makes each artifact name unique per matrix entry.
Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/cdaf66ca-e20b-4d76-86e4-ccdd54b95a5d
Fixed in d08888e. Added
Same fix as above in d08888e — |
The candle-kernels build script (bindgen_cuda) calls `nvidia-smi` to detect the GPU compute capability. CI runners have no GPU, so this panics. Setting CUDA_COMPUTE_CAP=89 (Ada Lovelace) skips the nvidia-smi probe.
Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/359801b6-0cfe-4555-8736-6a8f781803dd
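The override-then-probe order described in this commit can be sketched in shell. This is a minimal illustration, not the build script's actual code: `resolve_compute_cap` and `FALLBACK_CAP` are hypothetical names, and the `nvidia-smi --query-gpu=compute_cap` query assumes a driver recent enough to support that field.

```shell
#!/usr/bin/env bash
# Resolve a CUDA compute capability for the build without requiring a GPU.
# FALLBACK_CAP and resolve_compute_cap are hypothetical names for this sketch.
FALLBACK_CAP=89  # Ada Lovelace, the value used in this PR

resolve_compute_cap() {
  local cap
  # 1. An explicit override always wins and skips the probe entirely.
  if [ -n "${CUDA_COMPUTE_CAP:-}" ]; then
    printf '%s\n' "$CUDA_COMPUTE_CAP"
    return 0
  fi
  # 2. If a driver is present, ask it (prints e.g. "8.9") and strip the dot.
  if command -v nvidia-smi >/dev/null 2>&1; then
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
            | head -n 1 | tr -d '. ')"
    if [ -n "$cap" ]; then
      printf '%s\n' "$cap"
      return 0
    fi
  fi
  # 3. GPU-less runner: fall back to a fixed value instead of failing.
  printf '%s\n' "$FALLBACK_CAP"
}

# Export the result so later build steps in the same shell see it.
export CUDA_COMPUTE_CAP="$(resolve_compute_cap)"
```

In the workflow itself, setting the variable directly in the job's `env` is simpler; the function form is mainly useful for scripts that run both on developer machines with a GPU and on CI.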
Fixed in 7cb6ce5. The actual build failure was |
evilsocket left a comment
Transform all the Build actions into Test actions for the respective platforms with the specific features enabled. Also make sure that the Clippy action lints all features.
…tures
- Renamed build-features job to test-features, changed job name from "Build" to "Test"
- Replaced cargo build + cargo test --no-run with actual test runs (lib, unit, protocol) matching the CPU test job pattern
- Removed separate Clippy step from matrix (tests now cover compilation)
- Updated standalone Clippy job to lint vulkan+rocm features (cuda/metal covered by their respective platform matrix entries)
Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/54368b4a-fd95-4169-a4b4-a9c7bc87c90d
Fixed in 0c4683e.
The CUDA toolkit installs development libraries but not libcuda.so.1 (the driver runtime). This commit creates a libcuda.so.1 symlink pointing at the toolkit's stub library and sets LD_LIBRARY_PATH so test binaries can load on GPU-less runners.
Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/645784f1-b536-4f48-a2fa-e3350119b36b
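The stub workaround can be sketched roughly as follows. The helper name `link_cuda_stub` is hypothetical, and the stub location `<toolkit>/lib64/stubs/libcuda.so` is an assumption based on the usual toolkit layout; the actual step in ci.yml may use a different path or target directory.

```shell
#!/usr/bin/env bash
# link_cuda_stub is a hypothetical helper name; the stub path is an assumption
# based on the usual toolkit layout (<toolkit>/lib64/stubs/libcuda.so).
link_cuda_stub() {
  local toolkit_root="$1"
  local stub_dir="$toolkit_root/lib64/stubs"

  if [ ! -f "$stub_dir/libcuda.so" ]; then
    echo "no libcuda.so stub under $stub_dir" >&2
    return 1
  fi

  # Test binaries ask the loader for libcuda.so.1, which the stub directory
  # does not provide, so point that name at the stub library.
  ln -sf "$stub_dir/libcuda.so" "$stub_dir/libcuda.so.1"

  # Print the directory so the caller can add it to LD_LIBRARY_PATH.
  printf '%s\n' "$stub_dir"
}

# Usage on a runner (assumes CUDA_PATH points at the toolkit root; sudo may be
# needed if the toolkit directory is root-owned):
#   stub_dir="$(link_cuda_stub "$CUDA_PATH")"
#   export LD_LIBRARY_PATH="$stub_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

The stub only satisfies the dynamic loader. Any test that actually calls into the driver would still fail on a runner without a GPU, so this suits suites whose GPU paths are not exercised at runtime.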
Fixed in d9cbc7a. The |
evilsocket left a comment
Also change the iOS and Android builds into test builds that run all the unit tests plus the ones for the specific features (Metal on iOS and just CPU on Android).
Android: runs full test suite (lib, unit, protocol) on host with CPU features.
iOS: runs full test suite on host with metal feature enabled.
Both still cross-compile cake-mobile and run clippy for their respective mobile targets.
Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/a46ad566-5c6b-426f-a72c-32f220812d7b
Fixed in 56d255b.
GPU backend features were not compile-checked or tested in CI. Only CPU builds were tested. Android and iOS jobs only compiled tests without running them.
Adds a `test-features` matrix job that tests each GPU backend; CUDA entries install the toolkit via `Jimver/cuda-toolkit`.
Each entry runs the full test suite (lib, unit, protocol) with the specific GPU feature enabled, matching the existing CPU `test` job pattern. CUDA entries use `CUDA_COMPUTE_CAP=89` since CI runners have no physical GPU for `nvidia-smi` probing.
Transforms Android and iOS build jobs into test jobs:
- Android: runs the full test suite on the host with CPU features
- iOS: runs the full test suite on the host with `--features metal`
The standalone Clippy job now lints with `vulkan,rocm` features in addition to defaults (CUDA/flash-attn and Metal are covered by their respective platform matrix entries).