CI build-cache rollout (sccache Phase 2 + GGUF models) + CUDA fast-build, dep bumps, NativeServer scaffold #245

Merged: bernardladenthin merged 16 commits into main from claude/gracious-thompson-640r4f on Jun 20, 2026

bernardladenthin (Owner) commented Jun 20, 2026

Summary

Rolls the shared CI build cache out across (almost) the whole native build matrix and bundles several independent build/quality improvements. Net effect: warm CI builds drop from tens of minutes to a few minutes, test-model downloads stop hitting HuggingFace every run, and CUDA validation builds are no longer the ~70-minute long pole — while every distributed artifact stays bit-identical to a clean release build.

Headline results measured on this branch:

  • manylinux2014: warm sccache 99.64% hits (277/278), build step ~1m46s
  • macOS (3 jobs): ~40 min → ~6 min warm
  • GGUF models: pulled from HuggingFace only on a cold cache; served from GitHub's cache afterwards

1. sccache / Depot compiler cache — Phase 2 complete (all dockcross jobs)

  • Probe-compile health check (sccache_can_wrap_compiler in build.sh): compiles a trivial TU through sccache before enabling it as the launcher. A present-but-crashing sccache (the v0.8.2 in-container panic that stalled the first attempt) now falls back to a clean, uncached, green -O3 build instead of turning CI red, and logs the panic backtrace plus the detached-server log for diagnosis.
  • sccache pinned to v0.16.0, overridable per-job via SCCACHE_DL_VERSION (the panic is gone on 0.16.0).
  • All 5 dockcross cross-compile jobs enabled (the 3 macOS jobs were Phase 1):
    • crosscompile-linux-x86_64 (manylinux2014) — verified green, 99.64% warm hits
    • 🔄 crosscompile-linux-x86_64-cuda — gcc/C++ TUs cache (nvcc kernels can't; see §3)
    • crosscompile-linux-aarch64
    • crosscompile-android-aarch64
    • crosscompile-android-aarch64-opencl: build_opencl_android.sh now execs build.sh, inheriting the probe + launcher (same pattern as build_cuda_linux.sh)
  • Inert without DEPOT_TOKEN (fork PRs) and with use_cache=false; the probe makes enabling all jobs at once safe.
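
The probe-then-enable idea above can be sketched roughly as follows. This is a minimal illustration, not the actual sccache_can_wrap_compiler from build.sh: the function names probe_can_wrap_compiler and launcher_flags, their arguments, and the log wording are all hypothetical.

```shell
# Hypothetical sketch of a probe-compile health check; not the real build.sh code.

# Returns 0 only if "$launcher $compiler" can compile a trivial translation unit.
# $1 = launcher (for example sccache), $2 = compiler (for example cc).
probe_can_wrap_compiler() {
  local launcher="$1" compiler="$2" tmpdir rc=0
  tmpdir="$(mktemp -d)" || return 1
  printf 'int main(void) { return 0; }\n' > "$tmpdir/probe.c"
  # Capture stderr so a crash (such as a panic backtrace) can be logged.
  "$launcher" "$compiler" -c "$tmpdir/probe.c" -o "$tmpdir/probe.o" \
    2> "$tmpdir/probe.err" || rc=$?
  if [ "$rc" -ne 0 ] || [ ! -s "$tmpdir/probe.o" ]; then
    echo "launcher probe failed (rc=$rc); building uncached" >&2
    cat "$tmpdir/probe.err" >&2
    rc=1
  fi
  rm -rf "$tmpdir"
  return "$rc"
}

# Prints the CMake launcher flags only when the probe passes; prints nothing otherwise,
# so the caller configures a plain uncached build.
launcher_flags() {
  if probe_can_wrap_compiler "$1" "$2"; then
    printf '%s\n' "-DCMAKE_C_COMPILER_LAUNCHER=$1 -DCMAKE_CXX_COMPILER_LAUNCHER=$1"
  fi
}
```

A caller could splice the output into its configure step, for example `cmake $(launcher_flags sccache cc) ..`; an empty result simply leaves the launcher unset.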

2. GGUF test-model cache (GitHub actions/cache)

  • Caches models/ (~5 GB) across the 4 Java-test jobs under one platform-independent key (gguf-models-v1), so CodeLlama/Qwen/SmolVLM/etc. GGUFs are downloaded only when the cache is cold. Every download step is guarded by test -f … || curl ….
  • Deliberately uses GitHub's free cache, not Depot (GB-scale blobs are usage-priced there), and intentionally has no on/off flag (free + safe). Documented in publish.yml + CLAUDE.md.
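
The guarded-download pattern can be sketched like this. It is an illustration only: fetch_model_if_missing, the FETCH_CMD override, the retry count, and the .part temp-file step are assumptions, and the real model names, URLs, and curl flags live in publish.yml.

```shell
# Hypothetical helper: download a model only when it is not already in the cache dir.
# FETCH_CMD is an illustrative override (called as: FETCH_CMD <output-file> <url>)
# so the logic can be exercised without network access.
fetch_model_if_missing() {
  local dir="$1" name="$2" url="$3"
  mkdir -p "$dir"
  # Cache hit: the file exists, so the download is skipped entirely.
  test -f "$dir/$name" || {
    # Download to a temp name first so an interrupted transfer never
    # leaves a partial file that would later pass the test -f guard.
    ${FETCH_CMD:-curl -fL --retry 5 -o} "$dir/$name.part" "$url" &&
      mv "$dir/$name.part" "$dir/$name"
  }
}
```

Calling it twice with the same arguments fetches once; the second call returns immediately because the file is already present, which is the behaviour a warm actions/cache restore relies on.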

3. CUDA fast-build knob (CUDA_FAST_BUILD)

  • nvcc recompiles each .cu kernel once per GPU arch — the dominant cost of the ~70-min CUDA job (sccache can't cache nvcc kernels). New opt-in CUDA_FAST_BUILD builds a single arch to cut that time.
  • Release-safe by policy: in CI it is auto-derived from publish_to_central (fast single-arch for PR/push validation, full arch set whenever publishing to Central). Every artifact that reaches Central is the full set; only validation runs are fast.
  • Validation arch pinned to sm_120 (the newest compute capability in CUDA 13.2: consumer Blackwell / RTX 50xx). Default off for local/manual builds (full = release-safe).
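
The CI-side policy can be sketched as a small derivation step. This is a hedged illustration: derive_cuda_build_env is a hypothetical name, and the "true"/"false" string form of the publish flag is an assumption; only the variable names CUDA_FAST_BUILD and CUDA_ARCH and the value 120 come from this PR.

```shell
# Hypothetical sketch: derive the CUDA build mode from the publish flag.
# $1 = value of publish_to_central ("true" when publishing; anything else is validation).
derive_cuda_build_env() {
  local publish_to_central="${1:-false}"
  if [ "$publish_to_central" = "true" ]; then
    # Publishing: full arch set, so released artifacts keep full GPU coverage.
    echo "CUDA_FAST_BUILD=0"
  else
    # Validation: one fixed arch, because CI has no GPU for 'native' detection.
    echo "CUDA_FAST_BUILD=1"
    echo "CUDA_ARCH=120"
  fi
}
```

In a GitHub Actions step the output lines could be appended to "$GITHUB_ENV", and the variables then forwarded into the container with docker-style `-e CUDA_FAST_BUILD -e CUDA_ARCH` arguments.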

4. Dependency / image bumps

  • NullAway 0.13.6 → 0.13.7, pitest-maven 1.25.4 → 1.25.5 (same bump applied to the 3 sibling repos on matching branches).
  • googletest v1.15.2 → v1.17.0 (test-only; the tag is not tied to any upstream constraint, so CLAUDE.md now says to bump it periodically).
  • All 5 dockcross images → 20260515-5fd14ac (latest).

5. NativeServer scaffold (server package)

  • New net.ladenthin.llama.server.NativeServer — the planned entry point for the native HTTP transport (server-http.cpp + cpp-httplib, already compiled into libjllama), the only path able to serve the embedded WebUI. Scaffold only: start() throws UnsupportedOperationException until the native routes are wired to JNI (a separate, detailed step). Fixes the package/API shape so the real wiring lands cleanly.
  • NativeServerSmokeTest — 3 model-free tests (construct, start() throws, close() no-op); no model / no libjllama required.
  • Makes explicit that two server classes now exist: the working Java OpenAiCompatServer (today's runnable server) and the NativeServer scaffold.

Docs

  • CLAUDE.md: Phase 2 rollout status, CUDA_FAST_BUILD policy, GGUF-cache rationale, googletest note, cross-repo scope pointer.
  • TODO.md: Windows sccache item (needs Ninja Multi-Config per upstream; evaluate shipping Ninja + MSVC artifacts in parallel) + the deferred NativeServer native-route wiring.

CI status / blockers

  • No code blockers. CI is re-running on the latest commit; the previously-enabled jobs (manylinux2014 + all 3 macOS) are proven green, and the remaining dockcross jobs use the same probe-guarded path. Windows jobs are unchanged (still VS generator, uncached — deferred as a documented TODO).
  • SonarCloud Quality Gate passes (100% coverage on new code, 0 security hotspots); the 5 new issues are java:S5786 public-test-visibility on the smoke test — cosmetic.
  • Everything stays bit-identical to a clean -O3 / full-arch release build; publishing remains gated behind publish_to_central.

Deferred to follow-up sessions (tracked in TODO.md)

  • Windows compiler cache — Ninja Multi-Config + dual-artifact (Ninja & MSVC) evaluation.
  • NativeServer full implementation — native startServer/stopServer JNI methods, route wiring, lifecycle/threading, WebUI serving.

🤖 Generated with Claude Code

https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5

claude added 6 commits June 20, 2026 12:02
Add a 'Cross-repo scope' note to the CI build cache section explaining the sccache+Depot compiler cache benefits only this repo's native build, and link the workspace crossrepostatus.md non-parity entry. No build/CI behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Keep only the one-line 'jllama-only, it's the sole repo with a native build' fact and defer the full rationale (Maven repos, GitHub-hosted runners, inert DEPOT_TOKEN, badge) to workspace/crossrepostatus.md instead of duplicating it here.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…ncached

build.sh uses sccache as the compiler launcher, so a present-but-crashing sccache (the static-musl panic seen inside the dockcross cross-compile containers) failed every compile and redded the whole build. The inert-safe guard only covered sccache being absent, not present-but-crashing.

Add sccache_can_wrap_compiler(): probe-compile a trivial TU through sccache and only enable -DCMAKE_{C,CXX}_COMPILER_LAUNCHER=sccache when it succeeds. On any failure it logs the captured Rust panic backtrace (and the detached server's SCCACHE_ERROR_LOG when a job sets one) and builds WITHOUT the cache — a clean green -O3 build. Also make the fetched sccache version a SCCACHE_DL_VERSION knob (default bumped 0.8.2 -> 0.15.0, overridable per-job) and only run --show-stats when sccache was actually used.

Verified locally with fake sccache/cmake across every variant: no token, use_cache=false, crashing sccache, and working sccache all produce a green build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…b 1)

First dockcross job re-enabled after the phase-2 revert, now safe behind the build.sh probe. Forwards the Depot cache env into the container via DOCKCROSS_ARGS and enables SCCACHE_LOG=debug + SCCACHE_ERROR_LOG + RUST_BACKTRACE=full so this run captures the in-container panic root cause if it recurs (the probe keeps the build green either way). The CUDA, aarch64, Android, OpenCL-Android and Windows jobs stay uncached until this one is verified green in CI — one job at a time. Document the staged rollout and the probe in CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Bump the SCCACHE_DL_VERSION default 0.15.0 -> 0.16.0 (released 2026-06-19, the current latest). The x86_64-unknown-linux-musl asset is confirmed published; the fetch stays fail-safe (a missing version just falls back to an uncached build) and the value is overridable per-job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Bump DEFAULT_DOCKCROSS_IMAGE in all five wrappers from 20260312/13-9b3357c to 20260515-5fd14ac, the newest dockcross release on Docker Hub (verified: a full tag scan shows nothing dated later than 2026-05-15 across the images, no 2026-06 build exists, and 'latest' points to the same digest). This is a tag-pin bump on line 3 (the operative pin), not a full update.sh docker regeneration (which needs Docker, unavailable here); the wrapper body is version-stable. It changes the toolchain for every cross-compiled native artifact, so each platform should be confirmed green in CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…(phase 2, job 2)

manylinux2014 (job 1) verified green in PR #245: sccache v0.16.0 probe passed inside the container (devtoolset-10 gcc), cache ON over Depot WebDAV, cold run stored 275 objects. The v0.8.2 in-container panic does not occur on v0.16.0. Job 1's env is reduced to its steady state by dropping the first-run diagnostics (SCCACHE_LOG/SCCACHE_ERROR_LOG/RUST_BACKTRACE).

Enable job 2: crosscompile-linux-x86_64-cuda (manylinux_2_28 + CUDA via build_cuda_linux.sh, which execs build.sh, so the same probe guards it). Diagnostics on for its first run on the manylinux_2_28 image. Only the gcc C/C++ TUs cache; nvcc .cu kernels are not wrapped. aarch64/android/opencl-android/Windows stay uncached until each is verified — one job at a time.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Both are the latest stable patch releases on Maven Central. NullAway runs at -Xep:NullAway:ERROR and was verified clean with 'mvn compile' in this repo; pitest-maven is a plugin-only patch bump. Part of the cross-repo dependency freshness sweep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
… dev knob

googletest: bump the BUILD_TESTING-only FetchContent (used only by jllama_test's C++ unit tests, not the shipped library and not coupled to llama.cpp) from v1.15.2 to v1.17.0. There is no constraint behind the tag — it is just latest-stable; CLAUDE.md now says to bump it periodically.

CUDA_FAST_BUILD: add an opt-in, default-OFF env knob to build_cuda_linux.sh that builds CUDA for a single architecture (default 'native', override CUDA_ARCH=<cc>) instead of the full release arch set, to speed up local iteration. Default + CI/release behaviour is unchanged (full arch set), so released jars keep full GPU coverage. nvcc .cu kernels are not sccache-cached (limited support), so fewer archs is the real CUDA build-time lever; rationale documented in CLAUDE.md and inline.
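
On the script side, the opt-in knob could translate into a CMake argument along these lines. This is a sketch under assumptions: cuda_arch_flag is a hypothetical name, and mapping the knob onto CMAKE_CUDA_ARCHITECTURES is a guess at how build_cuda_linux.sh applies it (CMake 3.24 and later accept the value native for that variable).

```shell
# Hypothetical sketch: turn the CUDA_FAST_BUILD / CUDA_ARCH env knobs into a CMake flag.
# Prints nothing when the knob is off, leaving the project's full arch set untouched.
cuda_arch_flag() {
  if [ "${CUDA_FAST_BUILD:-0}" = "1" ]; then
    # Single-arch build: 'native' by default, or an explicit compute capability.
    printf '%s\n' "-DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH:-native}"
  fi
}
```

With the knob unset the function emits nothing, which mirrors the default-OFF behaviour described above: a local build stays release-equivalent unless the developer opts in.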

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
… re-downloads)

Each Java-test job re-downloaded ~5 GB of GGUF models from HuggingFace every run. Add an actions/cache@v5 step (path models/, shared key gguf-models-v1) to all four Java-test jobs and guard every model curl with 'test -f models/$NAME ||' so a cache hit skips the download. GGUF files are platform-independent, so ubuntu + macOS share one ~5 GB entry (well under GitHub's free 10 GB/repo cache).

Deliberately GitHub's free cache, NOT Depot: Depot Cache is usage-priced (GB-scale model blobs would raise the bill, unlike the tiny content-addressed sccache objects) and its general file cache only works on Depot-hosted runners. Bonus: cache hits also dodge HuggingFace 429s (the reason for the curl --retry flags). Bump the key suffix when the model set/URLs change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Expand the inline comment on the model-cache step: it exists to avoid re-downloading ~5 GB of GGUF test models from HuggingFace every run (and to dodge HF rate-limits). It is always ON by design — no on/off flag — unlike the sccache compiler cache, which the use_cache input / USE_CACHE env toggles. Notes it uses GitHub's free cache, not Depot. Comment-only; no behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…r + WebUI)

Minimal structural wiring for the planned native server: NativeServer sits next to OpenAiCompatServer (the Java server) as the entry point for the upstream native HTTP transport (server-http.cpp + cpp-httplib) already compiled into libjllama — the only component that can serve the embedded WebUI. Scaffold only: start() throws UnsupportedOperationException until the upstream routes (server.cpp's registration) are wired to a JNI entry point; isRunning()/getHost()/getPort()/close() are model-free placeholders. The native methods + C++ implementation + lifecycle are a separate, detailed step.

Adds a model-free smoke test (NativeServerSmokeTest, 3 tests). Verified locally: compile (Error Prone/NullAway/Checker), javadoc (failOnWarnings), SpotBugs Max/Low (0 bugs, @tostring clears IMC), ArchUnit (12/12).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…nly on publish

Invert the CUDA build-time/coverage trade-off in CI without risking the distributed jar. The crosscompile-linux-x86_64-cuda job now sets CUDA_FAST_BUILD=1 (single arch, CUDA_ARCH=90) for validation runs (PR/push/non-publish dispatch) to cut nvcc time, and CUDA_FAST_BUILD=0 (full arch set) only when publish_to_central is set. Because publish-snapshot/publish-release require publish_to_central, every artifact that reaches Maven Central is still built for every GPU generation — only non-distributed validation builds go fast.

CI has no GPU so the fast path pins a fixed CUDA_ARCH (native would fail at configure); both vars are forwarded into the dockcross container via DOCKCROSS_ARGS -e. build_cuda_linux.sh's own default stays off, so local/manual builds remain release-safe unless you opt in. Docs updated in CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…build

Change the fast-path CUDA_ARCH from 90 to 120 (the newest CUDA 13.2 compute capability, consumer Blackwell / RTX 50xx) per request. Only affects the fast single-arch validation build (PR/push); publish runs still build the full arch set. Bump as newer GPU generations ship.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
Add the USE_CACHE / SCCACHE_WEBDAV_* / DOCKCROSS_ARGS env to
crosscompile-linux-aarch64, crosscompile-android-aarch64, and
crosscompile-android-aarch64-opencl (jobs 3-5). Jobs 1-2 were already
enabled (manylinux2014 verified green, CUDA first run in progress).

The build.sh probe-compile health-check makes it safe to enable all jobs
simultaneously: any container where sccache crashes automatically falls
back to an uncached green build, so there is no need to stage one job at
a time anymore.

build_opencl_android.sh previously called cmake directly; changed to
exec build.sh (same pattern as build_cuda_linux.sh) so it inherits the
sccache probe + Depot launcher + --show-stats without duplicating any
download/probe logic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
…ifact

Record the investigation outcome for caching the two Windows native build
jobs (the only remaining uncached native builds):

- Root cause: the Visual Studio generator ignores CMAKE_<LANG>_COMPILER_LAUNCHER
  (and ggml's GGML_CCACHE RULE_LAUNCH_COMPILE), so sccache can only cache under
  Ninja/Makefiles.
- Upstream evidence: llama.cpp b9682 builds windows-cpu + windows-cuda with
  Ninja Multi-Config (+ ccache); the VS generator is only used by legacy jobs.
- Chosen path: don't flip the working build blindly. Validate Ninja Multi-Config
  in a separate build, or ship two Windows artifacts (Ninja + MSVC) in parallel
  so end users can test both before committing — Windows build runs twice during
  the transition.
- Implementation notes captured (sccache+Depot backend, build.bat generator
  wiring, files to touch, bounded risk via the publish gate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
@bernardladenthin bernardladenthin changed the title from "Add sccache health-check probe and Phase 2 dockcross rollout" to "CI build-cache rollout (sccache Phase 2 + GGUF models) + CUDA fast-build, dep bumps, NativeServer scaffold" on Jun 20, 2026
@bernardladenthin bernardladenthin merged commit 162c5fc into main Jun 20, 2026
14 of 43 checks passed
@bernardladenthin bernardladenthin deleted the claude/gracious-thompson-640r4f branch June 20, 2026 16:18