CI build-cache rollout (sccache Phase 2 + GGUF models) + CUDA fast-build, dep bumps, NativeServer scaffold by bernardladenthin · Pull Request #245 · bernardladenthin/java-llama.cpp

bernardladenthin · 2026-06-20T12:59:58Z

Summary

Rolls the shared CI build cache out across (almost) the whole native build matrix and bundles several independent build/quality improvements. Net effect: warm CI builds drop from tens of minutes to a few minutes, test-model downloads stop hitting HuggingFace every run, and CUDA validation builds are no longer the ~70-minute long pole — while every distributed artifact stays bit-identical to a clean release build.

Headline results measured on this branch:

manylinux2014: warm sccache 99.64% hits (277/278), build step ~1m46s
macOS (3 jobs): ~40 min → ~6 min warm
GGUF models: pulled from HuggingFace only on a cold cache; served from GitHub's cache afterwards

1. sccache / Depot compiler cache — Phase 2 complete (all dockcross jobs)

Probe-compile health check (sccache_can_wrap_compiler in build.sh): compiles a trivial translation unit (TU) through sccache before enabling it as the compiler launcher. A present-but-crashing sccache (the v0.8.2 in-container panic that stalled the first attempt) now falls back to a clean, uncached, green -O3 build instead of turning CI red, and logs the panic backtrace plus the detached-server log for diagnosis.
sccache pinned to v0.16.0, overridable per-job via SCCACHE_DL_VERSION (the panic is gone on 0.16.0).
All 5 dockcross cross-compile jobs enabled (the 3 macOS jobs were Phase 1):
- ✅ crosscompile-linux-x86_64 (manylinux2014): verified green, 99.64% warm hits
- 🔄 crosscompile-linux-x86_64-cuda: only the gcc C/C++ TUs are cached (nvcc kernels cannot be; see §3)
- ✅ crosscompile-linux-aarch64
- ✅ crosscompile-android-aarch64
- ✅ crosscompile-android-aarch64-opencl: build_opencl_android.sh now execs build.sh, inheriting the probe + launcher (same pattern as build_cuda_linux.sh)
The cache is inert when DEPOT_TOKEN is absent (fork PRs) or when use_cache=false; the probe makes it safe to enable all jobs at once.
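The probe-then-enable idea described in this section can be sketched roughly as follows. This is a minimal illustration and not the actual build.sh code: the function names `probe_launcher` and `launcher_flags` are hypothetical, and the real `sccache_can_wrap_compiler` may differ in how it logs and cleans up.

```shell
#!/bin/sh
# Hypothetical sketch of a probe-compile health check (not the real build.sh).

# probe_launcher LAUNCHER COMPILER
# Returns 0 only if "LAUNCHER COMPILER -c probe.c -o probe.o" succeeds
# and actually produces a non-empty object file.
probe_launcher() {
  local launcher="$1" compiler="$2" tmp rc=1
  tmp="$(mktemp -d)" || return 1
  printf 'int main(void) { return 0; }\n' > "$tmp/probe.c"
  if "$launcher" "$compiler" -c "$tmp/probe.c" -o "$tmp/probe.o" \
       > "$tmp/probe.log" 2>&1 && [ -s "$tmp/probe.o" ]; then
    rc=0
  else
    # Surface whatever the launcher printed (e.g. a panic backtrace).
    echo "launcher probe failed; building without the cache" >&2
    cat "$tmp/probe.log" >&2
  fi
  rm -rf "$tmp"
  return "$rc"
}

# launcher_flags LAUNCHER COMPILER
# Prints the CMake launcher flags when the probe passes, nothing otherwise,
# so a missing or crashing launcher simply yields an uncached build.
launcher_flags() {
  if command -v "$1" >/dev/null 2>&1 && probe_launcher "$1" "$2"; then
    printf -- '-DCMAKE_C_COMPILER_LAUNCHER=%s -DCMAKE_CXX_COMPILER_LAUNCHER=%s\n' "$1" "$1"
  fi
}
```

A caller could then write something like `cmake $(launcher_flags sccache cc) ..`; because the flags are empty on any probe failure, the build proceeds uncached rather than failing.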

2. GGUF test-model cache (GitHub `actions/cache`)

Caches models/ (~5 GB) across the 4 Java-test jobs under one platform-independent key (gguf-models-v1), so CodeLlama/Qwen/SmolVLM/etc. GGUFs are downloaded only when the cache is cold. Every download step is guarded by `test -f … || curl …`.
This deliberately uses GitHub's free cache rather than Depot, where GB-scale blobs are usage-priced, and it intentionally has no on/off flag because it is free and safe. Documented in publish.yml and CLAUDE.md.
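The download guard described above amounts to "skip the fetch when the file is already in the restored cache directory". A minimal sketch, assuming hypothetical helper names `download` and `fetch_model` (the workflow itself uses an inline `test -f … || curl …`):

```shell
#!/bin/sh
# Hypothetical sketch of a cache-aware model download step.

# download DEST URL
# Thin wrapper around curl; --retry helps with transient failures or rate limits.
download() {
  curl --fail --location --retry 5 --output "$1" "$2"
}

# fetch_model DIR NAME URL
# Downloads NAME into DIR only when it is not already present.
fetch_model() {
  local dir="$1" name="$2" url="$3"
  mkdir -p "$dir"
  if test -f "$dir/$name"; then
    echo "cache hit: $name"
    return 0
  fi
  echo "cache miss: downloading $name"
  # Download under a temporary name so an interrupted transfer never
  # leaves a partial file that a later run would mistake for a hit.
  if download "$dir/$name.part" "$url"; then
    mv "$dir/$name.part" "$dir/$name"
  else
    rm -f "$dir/$name.part"
    return 1
  fi
}
```

The temporary `.part` name is a design choice of this sketch, not something the PR describes; it matters because the cache would otherwise persist a truncated file across runs.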

3. CUDA fast-build knob (`CUDA_FAST_BUILD`)

nvcc recompiles each .cu kernel once per GPU arch — the dominant cost of the ~70-min CUDA job (sccache can't cache nvcc kernels). New opt-in CUDA_FAST_BUILD builds a single arch to cut that time.
Release-safe by policy: in CI it's auto-derived from publish_to_central — fast single-arch for PR/push validation, full arch set whenever publishing to Central. Every artifact that reaches Central is the full set; only validation runs are fast.
Validation arch pinned to sm_120 (the newest compute capability in CUDA 13.2: consumer Blackwell / RTX 50xx). Default off for local/manual builds (full arch set = release-safe).
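The policy above (fast single-arch for validation, full arch set for anything published) can be sketched in two small steps. This is an illustration under assumptions: `cuda_build_env` and `cuda_arch_flag` are hypothetical names, the use of `-DCMAKE_CUDA_ARCHITECTURES` is assumed rather than confirmed for build_cuda_linux.sh, and the full release arch list is passed in by the caller because it is not reproduced in this PR.

```shell
#!/bin/sh
# Hypothetical sketch of the CUDA_FAST_BUILD policy (not the real scripts).

# cuda_build_env PUBLISH_TO_CENTRAL
# CI side: derive the knob from the publish input.
cuda_build_env() {
  if [ "$1" = "true" ]; then
    echo "CUDA_FAST_BUILD=0"
  else
    # CI has no GPU, so the fast path pins a fixed arch instead of "native".
    echo "CUDA_FAST_BUILD=1 CUDA_ARCH=120"
  fi
}

# cuda_arch_flag FULL_ARCH_LIST
# Build-script side: one arch when the knob is on, the full list otherwise.
cuda_arch_flag() {
  if [ "${CUDA_FAST_BUILD:-0}" = "1" ]; then
    printf -- '-DCMAKE_CUDA_ARCHITECTURES=%s\n' "${CUDA_ARCH:-native}"
  else
    printf -- '-DCMAKE_CUDA_ARCHITECTURES=%s\n' "$1"
  fi
}
```

Defaulting the knob to off means a build that sets nothing gets the full arch list, which matches the "release-safe unless you opt in" rule stated above.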

4. Dependency / image bumps

NullAway 0.13.6 → 0.13.7, pitest-maven 1.25.4 → 1.25.5 (same bump applied to the 3 sibling repos on matching branches).
googletest v1.15.2 → v1.17.0 (test-only; the tag has no constraint behind it and simply tracks latest-stable, so it should be bumped periodically, as now documented).
All 5 dockcross images → 20260515-5fd14ac (latest).

5. NativeServer scaffold (server package)

New net.ladenthin.llama.server.NativeServer — the planned entry point for the native HTTP transport (server-http.cpp + cpp-httplib, already compiled into libjllama), the only path able to serve the embedded WebUI. Scaffold only: start() throws UnsupportedOperationException until the native routes are wired to JNI (a separate, detailed step). Fixes the package/API shape so the real wiring lands cleanly.
NativeServerSmokeTest — 3 model-free tests (construct, start() throws, close() no-op); no model / no libjllama required.
Makes explicit that two server classes now exist: the working Java OpenAiCompatServer (today's runnable server) and the NativeServer scaffold.

Docs

CLAUDE.md: Phase 2 rollout status, CUDA_FAST_BUILD policy, GGUF-cache rationale, googletest note, cross-repo scope pointer.
TODO.md: Windows sccache item (needs Ninja Multi-Config per upstream; evaluate shipping Ninja + MSVC artifacts in parallel) + the deferred NativeServer native-route wiring.

CI status / blockers

No code blockers. CI is re-running on the latest commit; the previously-enabled jobs (manylinux2014 + all 3 macOS) are proven green, and the remaining dockcross jobs use the same probe-guarded path. Windows jobs are unchanged (still VS generator, uncached — deferred as a documented TODO).
SonarCloud Quality Gate passes (100% coverage on new code, 0 security hotspots); the 5 new issues are java:S5786 public-test-visibility on the smoke test — cosmetic.
Everything stays bit-identical to a clean -O3 / full-arch release build; publishing remains gated behind publish_to_central.

Deferred to follow-up sessions (tracked in `TODO.md`)

Windows compiler cache — Ninja Multi-Config + dual-artifact (Ninja & MSVC) evaluation.
NativeServer full implementation — native startServer/stopServer JNI methods, route wiring, lifecycle/threading, WebUI serving.

🤖 Generated with Claude Code

https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

Add a 'Cross-repo scope' note to the CI build cache section explaining the sccache+Depot compiler cache benefits only this repo's native build, and link the workspace crossrepostatus.md non-parity entry. No build/CI behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

Keep only the one-line 'jllama-only, it's the sole repo with a native build' fact and defer the full rationale (Maven repos, GitHub-hosted runners, inert DEPOT_TOKEN, badge) to workspace/crossrepostatus.md instead of duplicating it here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

…ncached

build.sh uses sccache as the compiler launcher, so a present-but-crashing sccache (the static-musl panic seen inside the dockcross cross-compile containers) failed every compile and turned the whole build red. The inert-safe guard only covered sccache being absent, not present-but-crashing.

Add sccache_can_wrap_compiler(): probe-compile a trivial TU through sccache and only enable -DCMAKE_{C,CXX}_COMPILER_LAUNCHER=sccache when it succeeds. On any failure it logs the captured Rust panic backtrace (and the detached server's SCCACHE_ERROR_LOG when a job sets one) and builds WITHOUT the cache, producing a clean green -O3 build.

Also make the fetched sccache version a SCCACHE_DL_VERSION knob (default bumped 0.8.2 -> 0.15.0, overridable per-job) and only run --show-stats when sccache was actually used.

Verified locally with fake sccache/cmake across every variant: no token, use_cache=false, crashing sccache, and working sccache all produce a green build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

…b 1) First dockcross job re-enabled after the phase-2 revert, now safe behind the build.sh probe. Forwards the Depot cache env into the container via DOCKCROSS_ARGS and enables SCCACHE_LOG=debug + SCCACHE_ERROR_LOG + RUST_BACKTRACE=full so this run captures the in-container panic root cause if it recurs (the probe keeps the build green either way). The CUDA, aarch64, Android, OpenCL-Android and Windows jobs stay uncached until this one is verified green in CI — one job at a time. Document the staged rollout and the probe in CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
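Forwarding the cache environment into the container, as this commit describes, could look roughly like the sketch below. It assumes a hypothetical helper `forward_env_args`; the variable names in the usage note are examples, since the commit only names the `SCCACHE_WEBDAV_*` family without listing each variable.

```shell
#!/bin/sh
# Hypothetical sketch: build "-e NAME" arguments for DOCKCROSS_ARGS.

# forward_env_args NAME...
# Emits "-e NAME" for every listed variable that is exported in the
# current environment, so unset values are not forwarded as empty strings.
forward_env_args() {
  local name args=""
  for name in "$@"; do
    if printenv "$name" >/dev/null 2>&1; then
      args="${args:+$args }-e $name"
    fi
  done
  printf '%s\n' "$args"
}
```

A job step might then run `export DOCKCROSS_ARGS="$(forward_env_args USE_CACHE SCCACHE_WEBDAV_ENDPOINT SCCACHE_WEBDAV_TOKEN)"` before invoking the dockcross wrapper. Passing `-e NAME` without a value tells Docker to copy the host's value, which keeps secrets out of the command line.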

Bump the SCCACHE_DL_VERSION default 0.15.0 -> 0.16.0 (released 2026-06-19, the current latest). The x86_64-unknown-linux-musl asset is confirmed published; the fetch stays fail-safe (a missing version just falls back to an uncached build) and the value is overridable per-job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

Bump DEFAULT_DOCKCROSS_IMAGE in all five wrappers from 20260312/13-9b3357c to 20260515-5fd14ac — the newest dockcross release on Docker Hub (verified: a full tag scan shows nothing dated later than 2026-05-15 across the images, no 2026-06 build exists, and 'latest' points to the same digest). This is a tag-pin bump on line 3 (the operative pin), not a full update.sh docker regeneration (which needs Docker unavailable here); the wrapper body is version-stable. It changes the toolchain for every cross-compiled native artifact, so each platform should be confirmed green in CI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

…(phase 2, job 2) manylinux2014 (job 1) verified green in PR #245: sccache v0.16.0 probe passed inside the container (devtoolset-10 gcc), cache ON over Depot WebDAV, cold run stored 275 objects. The v0.8.2 in-container panic does not occur on v0.16.0. Dropped job 1's first-run diagnostics (SCCACHE_LOG/SCCACHE_ERROR_LOG/RUST_BACKTRACE) to its steady-state env. Enable job 2: crosscompile-linux-x86_64-cuda (manylinux_2_28 + CUDA via build_cuda_linux.sh, which execs build.sh, so the same probe guards it). Diagnostics on for its first run on the manylinux_2_28 image. Only the gcc C/C++ TUs cache; nvcc .cu kernels are not wrapped. aarch64/android/opencl-android/Windows stay uncached until each is verified — one job at a time. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

Both are the latest stable patch releases on Maven Central. NullAway runs at -Xep:NullAway:ERROR and was verified clean with 'mvn compile' in this repo; pitest-maven is a plugin-only patch bump. Part of the cross-repo dependency freshness sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

… dev knob

googletest: bump the BUILD_TESTING-only FetchContent (used only by jllama_test's C++ unit tests, not the shipped library and not coupled to llama.cpp) from v1.15.2 to v1.17.0. There is no constraint behind the tag; it is just latest-stable, and CLAUDE.md now says to bump it periodically.

CUDA_FAST_BUILD: add an opt-in, default-OFF env knob to build_cuda_linux.sh that builds CUDA for a single architecture (default 'native', override CUDA_ARCH=<cc>) instead of the full release arch set, to speed up local iteration. Default + CI/release behaviour is unchanged (full arch set), so released jars keep full GPU coverage. nvcc .cu kernels are not sccache-cached (limited support), so fewer archs is the real CUDA build-time lever; rationale documented in CLAUDE.md and inline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

… re-downloads) Each Java-test job re-downloaded ~5 GB of GGUF models from HuggingFace every run. Add an actions/cache@v5 step (path models/, shared key gguf-models-v1) to all four Java-test jobs and guard every model curl with 'test -f models/$NAME ||' so a cache hit skips the download. GGUF files are platform-independent, so ubuntu + macOS share one ~5 GB entry (well under GitHub's free 10 GB/repo cache). Deliberately GitHub's free cache, NOT Depot: Depot Cache is usage-priced (GB-scale model blobs would raise the bill, unlike the tiny content-addressed sccache objects) and its general file cache only works on Depot-hosted runners. Bonus: cache hits also dodge HuggingFace 429s (the reason for the curl --retry flags). Bump the key suffix when the model set/URLs change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

Expand the inline comment on the model-cache step: it exists to avoid re-downloading ~5 GB of GGUF test models from HuggingFace every run (and to dodge HF rate-limits). It is always ON by design — no on/off flag — unlike the sccache compiler cache, which the use_cache input / USE_CACHE env toggles. Notes it uses GitHub's free cache, not Depot. Comment-only; no behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5


…r + WebUI) Minimal structural wiring for the planned native server: NativeServer sits next to OpenAiCompatServer (the Java server) as the entry point for the upstream native HTTP transport (server-http.cpp + cpp-httplib) already compiled into libjllama — the only component that can serve the embedded WebUI. Scaffold only: start() throws UnsupportedOperationException until the upstream routes (server.cpp's registration) are wired to a JNI entry point; isRunning()/getHost()/getPort()/close() are model-free placeholders. The native methods + C++ implementation + lifecycle are a separate, detailed step. Adds a model-free smoke test (NativeServerSmokeTest, 3 tests). Verified locally: compile (Error Prone/NullAway/Checker), javadoc (failOnWarnings), SpotBugs Max/Low (0 bugs, @tostring clears IMC), ArchUnit (12/12). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

…nly on publish Invert the CUDA build-time/coverage trade-off in CI without risking the distributed jar. The crosscompile-linux-x86_64-cuda job now sets CUDA_FAST_BUILD=1 (single arch, CUDA_ARCH=90) for validation runs (PR/push/non-publish dispatch) to cut nvcc time, and CUDA_FAST_BUILD=0 (full arch set) only when publish_to_central is set. Because publish-snapshot/publish-release require publish_to_central, every artifact that reaches Maven Central is still built for every GPU generation — only non-distributed validation builds go fast. CI has no GPU so the fast path pins a fixed CUDA_ARCH (native would fail at configure); both vars are forwarded into the dockcross container via DOCKCROSS_ARGS -e. build_cuda_linux.sh's own default stays off, so local/manual builds remain release-safe unless you opt in. Docs updated in CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

…build Change the fast-path CUDA_ARCH from 90 to 120 (the newest CUDA 13.2 compute capability, consumer Blackwell / RTX 50xx) per request. Only affects the fast single-arch validation build (PR/push); publish runs still build the full arch set. Bump as newer GPU generations ship. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

Add the USE_CACHE / SCCACHE_WEBDAV_* / DOCKCROSS_ARGS env to crosscompile-linux-aarch64, crosscompile-android-aarch64, and crosscompile-android-aarch64-opencl (jobs 3-5). Jobs 1-2 were already enabled (manylinux2014 verified green, CUDA first run in progress). The build.sh probe-compile health-check makes it safe to enable all jobs simultaneously: any container where sccache crashes automatically falls back to an uncached green build, so there is no need to stage one job at a time anymore. build_opencl_android.sh previously called cmake directly; changed to exec build.sh (same pattern as build_cuda_linux.sh) so it inherits the sccache probe + Depot launcher + --show-stats without duplicating any download/probe logic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

…ifact

Record the investigation outcome for caching the two Windows native build jobs (the only remaining uncached native builds):
- Root cause: the Visual Studio generator ignores CMAKE_<LANG>_COMPILER_LAUNCHER (and ggml's GGML_CCACHE RULE_LAUNCH_COMPILE), so sccache can only cache under Ninja/Makefiles.
- Upstream evidence: llama.cpp b9682 builds windows-cpu + windows-cuda with Ninja Multi-Config (+ ccache); the VS generator is only used by legacy jobs.
- Chosen path: don't flip the working build blindly. Validate Ninja Multi-Config in a separate build, or ship two Windows artifacts (Ninja + MSVC) in parallel so end users can test both before committing. The Windows build runs twice during the transition.
- Implementation notes captured (sccache+Depot backend, build.bat generator wiring, files to touch, bounded risk via the publish gate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
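The root cause above suggests a simple guard: only pass the compiler launcher when the chosen CMake generator is one that honours it. A minimal sketch, assuming a hypothetical helper name `generator_honours_launcher` (the Windows jobs are driven by build.bat, so this shell version is illustrative only):

```shell
#!/bin/sh
# Hypothetical sketch: decide whether a compiler launcher is worth passing.

# generator_honours_launcher GENERATOR
# Returns 0 for the Ninja and Makefile generator families, which act on
# CMAKE_<LANG>_COMPILER_LAUNCHER, and 1 for anything else (for example
# the Visual Studio generators, which ignore it).
generator_honours_launcher() {
  case "$1" in
    Ninja*|*Makefiles*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `generator_honours_launcher "Ninja Multi-Config"` succeeds while `generator_honours_launcher "Visual Studio 17 2022"` does not, so a wrapper could skip the sccache setup entirely on the latter instead of configuring a cache that never gets used.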

sonarqubecloud · 2026-06-20T16:13:03Z

Quality Gate passed

Issues
5 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

claude added 6 commits June 20, 2026 12:02

bernardladenthin temporarily deployed to startgate June 20, 2026 13:00 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 20, 2026 13:19 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 20, 2026 13:55 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 20, 2026 14:07 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 20, 2026 14:35 — with GitHub Actions Inactive

bernardladenthin had a problem deploying to startgate June 20, 2026 14:43 — with GitHub Actions Error

bernardladenthin temporarily deployed to startgate June 20, 2026 15:06 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 20, 2026 15:24 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 20, 2026 15:27 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 20, 2026 15:40 — with GitHub Actions Inactive

bernardladenthin had a problem deploying to startgate June 20, 2026 16:11 — with GitHub Actions Error

bernardladenthin changed the title ~~Add sccache health-check probe and Phase 2 dockcross rollout~~ CI build-cache rollout (sccache Phase 2 + GGUF models) + CUDA fast-build, dep bumps, NativeServer scaffold Jun 20, 2026

bernardladenthin merged commit 162c5fc into main Jun 20, 2026
14 of 43 checks passed

bernardladenthin deleted the claude/gracious-thompson-640r4f branch June 20, 2026 16:18

bernardladenthin merged 16 commits into main from claude/gracious-thompson-640r4f
