NativeServer mode, GPU/CPU classifier matrix (ROCm/SYCL/OpenVINO/Win-arm64/s390x), llama.cpp b9870 + bump automation #293
Merged
…s (Granite-4)
Recurrent/hybrid models (granitehybrid, Mamba, Jamba) can only roll a slot back
to a saved context checkpoint. In upstream b9859 the near-prompt-end checkpoints
are gated by checkpoint_min_step (default 8192 tokens) and new checkpoints are
otherwise only created at user-message boundaries. An agentic tool-calling
conversation appends only assistant/tool messages after turn 1, so no new
checkpoint is ever created and every turn re-prefills the whole conversation
tail.
Measured on a synthetic granitehybrid model (llama-server, 6-turn tool loop,
~643 new tokens/turn): prefilled tokens per turn grew
901 -> 1544 -> 2187 -> 2830 -> 3473 unpatched, i.e. quadratic total prefill.
patches/0005 (upstream-submittable, server-context.cpp):
- exempt near-prompt-end checkpoints from the min-step spacing when the memory
can only roll back via checkpoints (seq-rm type FULL or RS); SWA-only models
are unaffected
- never create a checkpoint at the same position as the newest one (the
last-user-message checkpoint was re-created identically every turn, flooding
the 32-entry checkpoint list)
With the patch the same loop prefills a constant 647 tokens/turn (each turn
restores the previous turn's near-end checkpoint): 5.4x less prefill at turn 6,
growing with conversation length. Outputs verified byte-identical to unpatched
at temperature=0.
ModelParameters gains setCtxCheckpoints(int) / setCheckpointMinStep(int)
(--ctx-checkpoints / --checkpoint-min-step, both LLAMA_EXAMPLE_SERVER scope,
reach the embedded server through common_params_parse) so callers can tune
checkpoint density/RAM from Java. +2 unit tests (144 pass), javadoc clean,
spotless applied.
Complements open upstream PRs #24035/#24899/#24891 (checkpoint
invalidation/retention); this fixes checkpoint starvation. Drop the patch once
upstream lands role-boundary checkpoint placement.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
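The measured prefill series follows simple arithmetic, which the sketch below reproduces. It assumes only the numbers quoted in the commit message (901 as the first measured value, a step of 643 between consecutive values, 647 per turn with the patch); the class and method names are hypothetical and not part of the project.

```java
final class PrefillGrowthSketch {
    // Values read off the measurements quoted in the commit message.
    static final int FIRST_MEASURED = 901;   // first value in the unpatched series
    static final int STEP = 643;             // difference between consecutive values
    static final int PATCHED_PER_TURN = 647; // constant prefill with the patch

    /** Unpatched prefill for the n-th value of the reported series (n starts at 0). */
    static int unpatched(int n) {
        return FIRST_MEASURED + n * STEP;
    }

    /** Ratio of unpatched to patched prefill for the n-th value. */
    static double ratio(int n) {
        return (double) unpatched(n) / PATCHED_PER_TURN;
    }

    public static void main(String[] args) {
        for (int n = 0; n < 5; n++) {
            System.out.println(unpatched(n) + " tokens unpatched, ratio " + ratio(n));
        }
    }
}
```

Because each unpatched turn adds a fixed amount on top of the previous one, the running total grows quadratically, while the patched total grows linearly.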
Pure readability refactor of the checkpoint-starvation fix — no behavior
change. The two compound `do_checkpoint = do_checkpoint && (empty || ...)`
assignments are lifted into named locals so the final gate reads:
do_checkpoint = do_checkpoint && checkpoint_well_spaced && checkpoint_not_duplicate;
- checkpoint_well_spaced: the min-step spacing test with the last-user-message
and near-prompt-end (checkpoint-only-rollback) exemptions
- checkpoint_not_duplicate: the same-position dedup guard
Each named bool keeps the leading `checkpoints.empty() ||` so the
`checkpoints.back()` access stays short-circuit-guarded (identical semantics
to the previous inlined `&&`-chains). Compiles clean; patch re-verified to
apply and reverse-check (idempotence) against pristine b9859 via the same
`git apply` path the CMake applier uses.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
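The named-boolean shape of the gate can be illustrated with a Java analogue. This is a simplified model, not the patched C++: the parameters (`positions`, `pos`, `minStep`, `exempt`) and the spacing test are assumptions chosen only to show how each local keeps its leading empty check.

```java
import java.util.List;

final class CheckpointGateSketch {
    /**
     * Simplified model of the gate. positions holds the token offsets of the
     * existing checkpoints, oldest first. The exact upstream conditions are
     * not reproduced here.
     */
    static boolean shouldCheckpoint(boolean doCheckpoint, List<Integer> positions,
                                    int pos, int minStep, boolean exempt) {
        // Each local leads with isEmpty(), so the last-element read is never
        // evaluated on an empty list (short-circuit ||).
        boolean wellSpaced = positions.isEmpty() || exempt
                || pos - last(positions) >= minStep;
        boolean notDuplicate = positions.isEmpty() || last(positions) != pos;
        return doCheckpoint && wellSpaced && notDuplicate;
    }

    private static int last(List<Integer> positions) {
        return positions.get(positions.size() - 1);
    }
}
```

Keeping the empty check inside each named local, rather than hoisting it once, means either local can be read or reordered on its own without risking an out-of-bounds access.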
…enAiServerCli
The fat-jar launcher (OpenAiCompatServer.main) parses args via OpenAiServerCli,
which only understood a subset of flags and threw on anything else. Extend it
with the seven tuning flags that a llama-server.exe user needs so the bundled
`java -jar …-jar-with-dependencies.jar` covers a full invocation without any
custom Java:
-b/--batch-size, -ub/--ubatch-size -> ModelParameters.setBatchSize/setUbatchSize
-tb/--threads-batch -> setThreadsBatch
-ctk/--cache-type-k, -ctv/--cache-type-v -> setCacheTypeK/V (case-insensitive
CacheType lookup; unknown -> error)
--jinja -> enableJinja
--chat-template-kwargs <json> -> setChatTemplateKwargs
--chat-template-kwargs is parsed here (Jackson, already a server-package dep)
into the raw-per-value map setChatTemplateKwargs expects, so a malformed object
fails fast with usage text instead of at native model load. All setters already
existed; the ints/CacheType/kwargs use 0/null "leave the default" sentinels
mirroring the existing ctx/threads/parallel handling.
+13 unit tests (30 pass total); usage() and README flag list updated; javadoc
and spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
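A case-insensitive enum lookup with a null "leave the default" sentinel, as described for -ctk/-ctv, might look like the sketch below. The `CacheType` enum here is a hypothetical stand-in with illustrative constants; the project's real enum and error text may differ.

```java
import java.util.Locale;

final class CacheTypeLookupSketch {
    /** Hypothetical stand-in for the project's CacheType enum. */
    enum CacheType { F16, Q8_0, Q4_0 }

    /**
     * Returns null when the flag was not given (the "leave the default"
     * sentinel) and throws on a name that matches no constant.
     */
    static CacheType parse(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            // Locale.ROOT keeps the upper-casing independent of the host locale.
            return CacheType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unknown cache type: " + raw, e);
        }
    }
}
```

Failing in the parser keeps the error next to the usage text instead of surfacing later during native model load.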
…LL via JNI
Second server mode alongside the Java-transport OpenAiCompatServer: NativeServer
runs the *full* upstream llama.cpp HTTP server — embedded WebUI included — inside
libjllama over JNI, with no separate llama-server executable. It forwards the raw
llama-server argv verbatim, so every flag works exactly as for the standalone
binary (no per-flag Java mapping).
How: b9859 already exposes `int llama_server(int, char**)` (no main() in
server.cpp). patches/0006 makes it embeddable — skips installing process-wide
SIGINT/SIGTERM handlers when embedded (they would hijack the JVM), parses the
forwarded argv via common_params_parse instead of common_params_parse_main
(whose GetCommandLineW recovery would grab java.exe's command line — the Windows
bug class 0001 fixes), and adds llama_server_request_shutdown() for out-of-band
stop (ctx_server is loop-local). native_server.cpp's JNI bridge runs llama_server
on a worker thread; start/stop/isRunning map to the three native methods.
CMake: server.cpp + server-tools.cpp are now compiled in (non-Android — both pull
subprocess.h/posix_spawn_*, so they share server-models.cpp's guard), plus
native_server.cpp.
NativeServer is an independent lifecycle (loads its own model from the argv, like
llama-server.exe), single-instance per process (upstream keeps shutdown state in
file-scope globals), and unavailable on Android. Reusing an already-loaded
LlamaModel's context is a documented TODO. libjllama loads lazily in start(), so
construction/arg-parsing/close stay pure-Java unit-testable.
Verified end-to-end on Linux x86_64 with a synthetic granitehybrid model: server
starts, GET /health -> 200 {"status":"ok"}, /v1/models and /props served, / is
the native WebUI route (404 locally with the empty-asset stub; serves index.html
in released jars that bake in webui-generated assets), close() shuts down cleanly.
7 pure-Java NativeServer tests + javadoc + spotless + clang-format(22.1.5) clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…CompatServer)
Two runnable server mains now exist. The fat jar's default Main-Class becomes
NativeServer, so `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080`
runs the full native llama.cpp server with its embedded WebUI, forwarding every
argument. OpenAiCompatServer is unchanged and still runnable via
`java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`.
- NativeServer.main(args): forwards argv, starts the server, registers a JVM
shutdown hook (the embedded server installs no signal handlers of its own, see
patches/0006, so the hook is what stops it cleanly on Ctrl-C/SIGTERM), and
blocks until the native worker exits.
- llama/pom.xml assembly profile: Main-Class OpenAiCompatServer -> NativeServer.
- README + CLAUDE.md: document the two modes and how to select each.
Verified end-to-end (Linux x86_64, synthetic granitehybrid):
`java -cp … NativeServer -m model --port 8972` serves /health=ok after load;
SIGTERM to the JVM fires the shutdown hook -> clean "cleaning up before exit" ->
port down. Javadoc + spotless clean; 7 pure-Java NativeServer tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…ai-compat
Tiny dispatcher as the fat jar's Main-Class: with --open-ai-compat it runs
OpenAiCompatServer (Java-transport OpenAI API), without it the default
NativeServer (full native llama.cpp server + WebUI). The --open-ai-compat marker
is stripped (it is not a llama.cpp flag); all other args are forwarded verbatim
to the chosen server. Both underlying mains stay runnable directly by class name
via java -cp.
Note: the two servers accept different flag sets (NativeServer forwards every
llama-server flag, OpenAiCompatServer's CLI accepts a curated subset and rejects
unknown flags), so native-only flags can't be combined with --open-ai-compat.
Dispatch logic split into pure static helpers (selectsOpenAiCompat / withoutFlag)
with 7 unit tests. Verified at runtime:
`ServerLauncher --open-ai-compat -m model --port 8973` starts the Java server
(/ -> invalid_request_error, its handler), and without the flag starts
NativeServer (/ -> native File Not Found); both shut down cleanly on SIGTERM.
pom Main-Class NativeServer -> ServerLauncher; README + CLAUDE.md updated.
spotless + javadoc clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…g to --openai-compat
Per review: collapse the dispatch to a single pure helper. withoutFlag(args, flag)
strips the selector and returns a (possibly shorter) array; main() selects the
mode purely by whether that shortened the list (present iff result is smaller),
so the separate selectsOpenAiCompat method + its baked-in constant are gone. The
helper takes the flag as a parameter, so it is general and testable independent
of the flag's meaning.
Also rename the selector --open-ai-compat -> --openai-compat ("OpenAI" is one
word, matching the brand and the codebase's oaicompat / OpenAiCompatServer);
constant OPEN_AI_COMPAT_FLAG -> OPENAI_COMPAT_FLAG.
Tests rewritten around withoutFlag: the length-change selection signal (shorter
iff present, same length iff absent, position-independent) plus stripping
behaviour (strips all occurrences, preserves order, no-op when absent, empty).
7 pass. Verified at runtime: `ServerLauncher --openai-compat -m model --port
8974` routes to OpenAiCompatServer (/ -> invalid_request_error) and shuts down
cleanly; no-flag routes to NativeServer. README + CLAUDE.md + pom updated;
spotless + javadoc clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
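The single-helper dispatch described in this commit can be sketched as follows. The signature `withoutFlag(args, flag)` and its behaviour (strip every occurrence, preserve order, no-op when absent) come from the commit message; the class name and the `selected` convenience method are hypothetical additions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class WithoutFlagSketch {
    /** Returns a copy of args with every occurrence of flag removed, order preserved. */
    static String[] withoutFlag(String[] args, String flag) {
        List<String> kept = new ArrayList<>(args.length);
        for (String arg : args) {
            if (!flag.equals(arg)) {
                kept.add(arg);
            }
        }
        return kept.toArray(new String[0]);
    }

    /** Hypothetical helper: the flag was present iff stripping shortened the array. */
    static boolean selected(String[] args, String flag) {
        return withoutFlag(args, flag).length < args.length;
    }

    public static void main(String[] args) {
        String[] forwarded = withoutFlag(args, "--openai-compat");
        boolean openAiCompat = forwarded.length < args.length;
        System.out.println((openAiCompat ? "OpenAiCompatServer" : "NativeServer")
                + " with " + forwarded.length + " forwarded args");
    }
}
```

Using the length change as the selection signal means the helper never needs to know what the flag means, which is what makes it testable in isolation.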
Rename the fat-jar dispatch selector from --openai-compat to --jllama-openai-compat so it can never collide with a current or future llama.cpp / llama-server flag: upstream owns the --* space, this launcher owns --jllama-*. The jllama prefix is the project's native-library name, which upstream will never use, and it stays a lowercase-hyphen CLI token (not the verbose FQN, not the class name). ServerLauncher strips it before forwarding, so it never reaches llama_server (which rejects unknown flags). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Additive-only upgrade — no incompatibilities, no project source changes.
Bumps GIT_TAG (and the TTS provenance banner), the README badge/link, and
the CLAUDE.md pinned-version line + build examples.
The b9859..b9862 diff touches two patch-target files (server-context.cpp/.h,
for the new model_ftype field + get_meta/get_model_info additions). Patches
0002/0003/0005 were applied in sequence against the actual b9862
server-context.{cpp,h} and all apply cleanly (their regions are disjoint from
and far from the additions). Patches 0001/0004/0006 target files not in the
changed-file list (common/arg.*, server-common.cpp, test-chat.cpp, server.cpp,
the ~34 standalone mains) and the OuteTTS generator anchors (tts.cpp) are
unchanged, so they apply byte-identically.
New upstream features surfaced by the bump (documented in the breaking-changes
history):
- Additive C API llama_ftype_name / llama_model_ftype; the native server now
emits a model 'ftype' in get_model_info(), so NativeServer mode surfaces the
quant type automatically. Optional follow-up: bind it into LlamaModel and the
Java OpenAiCompatServer propsJson.
- CUDA gated-delta-net fused snapshot-copy path (decode win for gated-delta /
hybrid-recurrent models on the cuda13 classifiers).
- Vendored cpp-httplib bumped to v0.49.0 (locale-independent ASCII classifiers,
additive MultipartFormDataWriter, base64 UB fix) — internal to the compiled
server transport, no bound symbol.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Wires the new b9862 llama_ftype_name / llama_model_ftype quant-type info (exposed
by server_context_meta::model_ftype) up through JNI to Java:
- jllama.cpp getModelMetaJson now emits "ftype" from server_context_meta.
- ModelMeta.getFtype() and the convenience LlamaModel.getModelFtype() expose the
quant label (e.g. "Q4_K - Medium"; a guessed type is prefixed with "(guessed) "),
empty when the native layer does not report it.
- OpenAiCompatServer advertises it as data[].ftype in GET /v1/models, matching
the upstream server's get_model_info() key. The value is threaded through
OpenAiServerConfig.modelFtype (new field/builder/getter) from the loaded model,
mirroring how supportsVision is threaded, keeping the "models built from config
alone" invariant. The field is omitted when unknown/blank.
Tests: +2 ModelMeta, +1 OpenAiServerConfig, +1 OpenAiSseFormatter (ftype
present/omitted). Verified end-to-end: full native jllama build against b9862
with all six patches applied (100% link), model-free load smoke test green, and
63 model-free Java unit tests pass; clang-format 22.1.5 + spotless + javadoc all
clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Extends the artifact matrix toward upstream llama.cpp's release set with three
new native builds (all wired for CI; the user runs CI to validate on the GPU/
arm runners this environment lacks):
1. vulkan-linux-x86-64 — Linux x86_64 Vulkan classifier JAR
2. vulkan-linux-aarch64 — Linux aarch64 Vulkan classifier JAR
3. Windows arm64 CPU — folded into the DEFAULT JAR (no classifier)
Linux Vulkan (vendor-neutral GPU jar, no CUDA toolkit) — the intersection of
the existing Vulkan-Windows and CUDA-Linux wiring:
- CMakeLists: the elseif(GGML_VULKAN) branch is now OS-aware like GGML_CUDA
(Windows -> resources_windows_vulkan, else resources_linux_vulkan/.../Linux/
${OS_ARCH}); one tree holds both arches.
- pom.xml: profiles vulkan-linux / vulkan-linux-aarch64, both reading the shared
resources_linux_vulkan tree with an arch-scoped resource-copy <includes>
(Linux/x86_64 vs Linux/aarch64), so each classifier JAR carries only its arch.
Verified locally with staged dummy natives: each jar contains exactly one
libjllama.so for its arch.
- publish.yml: build-linux-x86_64-vulkan (native ubuntu-latest) +
build-linux-aarch64-vulkan (ubuntu-24.04-arm, GCC 14); both apt-install the
Vulkan SDK, build -DGGML_VULKAN=ON -DGGML_NATIVE=OFF, build-only (GPU-less
runners). Artifacts merge into one resources_linux_vulkan tree in package/
publish; profiles added to the three -P lists.
- .gitignore: ignore resources_linux_vulkan (also fixed the pre-existing
resources_cuda_linux -> resources_linux_cuda typo).
Windows arm64 CPU (default JAR):
- build-windows-arm64 on the free windows-11-arm runner (msvc-dev-cmd arch:arm64,
Ninja Multi-Config, -DOS_ARCH=aarch64, build + ctest), emitting to the canonical
resources/.../Windows/aarch64 and uploading Windows-aarch64-libraries, which the
*-libraries glob merges into the default tree. No Java change: OSInfo already
maps a Windows-on-ARM JVM (os.arch=aarch64) to Windows/aarch64. Matches the
existing Windows CPU jobs (committed jllama.h + bundled JNI headers, so no
mvn compile / setup-java needed).
All three added to package.needs. Runtime GPU libs are never bundled (driver
supplies libvulkan.so.1) — same policy as every GPU classifier.
Local verification (CI does the real GPU/arm builds): CMake configures clean and
the CPU branch still routes correctly; pom.xml is well-formed and both new
profiles are recognized and activate; the per-arch classifier split was proven by
packaging staged dummy natives; the workflow YAML parses (40 jobs) with all needs
resolving and all -P lists updated. README classifier table + snippets and CLAUDE.md
document the additions.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
KISS analogue of upstream llama.cpp's ccache-action "CCache Statistics" table,
but for this repo's sccache-over-Depot cache (ccache-action can't be dropped in:
it manages its own ccache + actions/cache backend, conflicting with the Depot
WebDAV design).
build.sh/build.bat already print `sccache --show-stats` to the log; now, when
running in CI (GITHUB_STEP_SUMMARY set) and sccache was actually the launcher,
they also parse those stats and append a small markdown table:
### sccache statistics
| Cache hits | Requests | Hit rate |
|------------|----------|----------|
| 589        | 600      | 98.2%    |
Per-job (GitHub does not merge job summaries), covering every native build job
uniformly: build.sh handles the dockcross/native-Linux/aarch64/vulkan-linux and
macOS jobs; build.bat the Windows jobs. Parses the text stats (top-level
"Compile requests" = total, top-level "Cache hits" = hits; the per-language
"Cache hits (C/C++)" line is skipped by the digit-anchored regex). Best-effort:
skipped silently if the numbers can't be parsed or there were no requests, and
never emitted for local runs (no GITHUB_STEP_SUMMARY), so local builds are
untouched.
Verified the build.sh parse end-to-end against a realistic sccache --show-stats
sample (req=600, hits=589 -> 98.2%, with the "Compile requests executed" line
correctly excluded); build.sh passes bash -n. The Windows batch path (integer
math with rounding, escaped pipes/parens) is validated by CI on the Windows
runners.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
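The digit-anchored parse can be illustrated in Java (the real implementation lives in build.sh and build.bat). The sketch assumes the stats text puts a label, some spaces, and a number on each line, as the commit message's line names suggest; the class name and the exact row format are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SccacheSummarySketch {
    // Label, then only spaces or tabs, then digits to end of line. This rejects
    // "Compile requests executed 595" and "Cache hits (C/C++) 589" because a
    // non-digit follows the label.
    private static final Pattern REQUESTS =
            Pattern.compile("^Compile requests[ \\t]+(\\d+)[ \\t]*$", Pattern.MULTILINE);
    private static final Pattern HITS =
            Pattern.compile("^Cache hits[ \\t]+(\\d+)[ \\t]*$", Pattern.MULTILINE);

    /** Returns one markdown table row, or empty if parsing fails or there were no requests. */
    static Optional<String> summaryRow(String stats) {
        Matcher req = REQUESTS.matcher(stats);
        Matcher hit = HITS.matcher(stats);
        if (!req.find() || !hit.find()) {
            return Optional.empty();
        }
        long requests = Long.parseLong(req.group(1));
        long hits = Long.parseLong(hit.group(1));
        if (requests == 0) {
            return Optional.empty();
        }
        String rate = String.format(Locale.ROOT, "%.1f%%", 100.0 * hits / requests);
        return Optional.of("| " + hits + " | " + requests + " | " + rate + " |");
    }
}
```

Returning an empty Optional mirrors the best-effort policy: a parse failure produces no summary rather than a failed build step.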
Additive-only upgrade — no incompatibilities, no project source changes.
Bumps GIT_TAG (and the TTS provenance banner), the README badge/link, and the
CLAUDE.md pinned-version line + build examples.
The b9862..b9864 diff is almost entirely the Svelte WebUI (tools/ui/**, which
auto-follows the pinned GIT_TAG via the build-webui CI job) plus one small server
change: a new per-request sse_ping_interval in the completion API (task_params
field + make_llama_cmpl_schema field + handle_completions_impl capture). It's
inside upstream-compiled server TUs the project already links; NativeServer mode
gets it for free, and the project binds no new symbol.
Patch verification: the diff touches exactly one patch-target file
(server-context.cpp, only in handle_completions_impl ~L4089, far below every
patched region). Patches 0002/0003/0005 were applied in sequence against the
actual b9864 server-context.{cpp,h} — all clean; server-context.h is unchanged,
and server-schema.cpp/server-task.h are not patch targets. Patches 0001/0004/0006
target files not in the changed-file list, so they apply unchanged. Confirmed
end-to-end by a clean cmake configure: b9864 fetched and all six patches applied
via the fail-loud PATCH_COMMAND (exit 0), OuteTTS generator anchors held.
Optional future work (documented in the breaking-changes history): expose
sse_ping_interval on the Java InferenceParameters — it would flow through the
OAI-compat completion path via eval_llama_cmpl_schema like any other field.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…meters
Wires the b9864 per-request sse_ping_interval into the Java API and, from a
completion-schema audit, adds the other already-parseable-but-unexposed plain
scalars so callers of the OAI-compat completion path can set them.
New withers (each emits a JSON key honored by eval_llama_cmpl_schema):
- withSsePingInterval(int) -> sse_ping_interval (b9864; -1 disables pings)
- withXtcProbability(float) / withXtcThreshold(float) -> XTC sampler
- withNDiscard(int) -> n_discard (context-shift discard)
- withNIndent(int) -> n_indent (infill indentation)
- withTMaxPredictMs(int) -> t_max_predict_ms (generation time budget)
- withPostSamplingProbs(boolean) -> post_sampling_probs
- withTimingsPerToken(boolean) -> timings_per_token
- withReturnTokens(boolean) -> return_tokens
Audit method: extracted every field name from b9864's make_llama_cmpl_schema and
diffed against the InferenceParameters keys. t_max_prompt_ms was deliberately
skipped (commented out upstream, so not parseable). The remaining unexposed
fields are OAI aliases already covered (max_tokens/max_completion_tokens ->
n_predict) or OAI/server-internal / array-shaped / advanced knobs (n, logprobs,
echo, verbose, include_usage, return_progress, response_fields, lora,
grammar_lazy/grammar_triggers/preserved_tokens, chat_format, parse_tool_calls,
reasoning_control, backend_sampling, adaptive_*), left out on purpose and
documented in the breaking-changes history.
Tests: +2 Java withers tests (InferenceParametersTest -> 104 pass) and +3 C++
schema round-trip guards in test_server.cpp pinning that the native parser honors
sse_ping_interval (round-trip, -1 disables, below-hard-limit throws, absent
inherits the server default) -> full C++ suite 462 pass (was 459). javadoc (llama
module) + spotless + clang-format all clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…arm64
Fixes three PR #293 CI failures.
1. SpotBugs (spotbugs-exclude.xml): the branch's new server classes tripped 14
fb-contrib/findsecbugs findings not yet covered by the exclude file. All are
false-positives or intentional tradeoffs mirroring suppressions the project
already applies elsewhere. Added six method-scoped Match blocks with rationale:
- NativeServer: IMC_IMMATURE_CLASS_NO_EQUALS, MDM_THREAD_YIELD (main() poll),
UVA_USE_VAR_ARGS (native JNI method), WEM_WEAK_EXCEPTION_MESSAGING.
- ServerLauncher.main: THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION.
- OpenAiServerCli parse/usage/parseChatTemplateKwargs: ENMI_NULL_ENUM_VALUE
(@nullable CacheType unset sentinel), POTENTIAL_XML_INJECTION + PRMC (plain-text
usage() help, no XML), EXS_EXCEPTION_SOFTENING (Jackson->CLI error, cause
chained), PSC_PRESIZE_COLLECTIONS.
- OpenAiServerCli$Options.getChatTemplateKwargs: EI_EXPOSE_REP (returns an
already-unmodifiable map).
Verified: mvn spotbugs:check -> BugInstance size 0, BUILD SUCCESS.
2. Linux Vulkan (publish.yml): both build-linux-*-vulkan jobs failed
find_package(SPIRV-Headers CONFIG REQUIRED) because the apt install omitted it.
Added spirv-headers to both apt lines (exact parity with upstream llama.cpp's
build-vulkan.yml: 'glslc libvulkan-dev spirv-headers').
3. Windows arm64 (publish.yml + CLAUDE.md): ggml aborts 'MSVC is not supported
for ARM, use clang'. The generator (Ninja) was never the issue; the compiler was.
Switched the job to clang-cl (-DCMAKE_C_COMPILER=clang-cl
-DCMAKE_CXX_COMPILER=clang-cl): it satisfies ggml's guard
(if (MSVC AND NOT CMAKE_C_COMPILER_ID STREQUAL "Clang")) while keeping CMake's
MSVC=TRUE, so our static /MT CRT block still applies and the runner + Ninja +
ctest all stay. msvc-dev-cmd (arm64) supplies the MSVC headers/libs. First-run
watch item: clang-cl must be on PATH (VS Clang component / LLVM); if CI reports
it missing, add an LLVM setup step.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Two LlamaArchitectureTest rules failed on PR #293 because of this branch's server
additions:
- layeredArchitecture (12 violations): the branch adds two legitimate new edges
out of the Server layer -- Server -> Args (OpenAiServerCli maps -ctk/-ctv to the
args.CacheType enum) and Server -> Loader (NativeServer.start() calls
LlamaLoader.initialize() before launching the embedded native server). The rule
documents itself as the EXACT set of accessors today, to be updated when a new
dependency is intended, so Server is added to the Loader and Args
mayOnlyBeAccessedByLayers lists (+ a doc note). Server remains the only layer
allowed to reach the Api root and stays mayNotBeAccessedByAnyLayer.
- noThreadSleep (1 violation): NativeServer.main() kept the JVM alive with a
while(isRunning()) Thread.sleep(200) poll loop. The rule bans Thread.sleep and
has no suppression seam (it prefers Condition.await/poll), so main() now blocks
on a bounded CountDownLatch.await(200ms) signalled by the shutdown hook. This is
also a behavioural improvement: Ctrl-C/SIGTERM wakes the wait immediately instead
of after up to a 200 ms tick, while the timeout still re-checks isRunning() to
catch a self-terminated native worker.
Verified: LlamaArchitectureTest 12/12 pass; server-package tests 44/44 pass
(ServerLauncher, OpenAiServerCli, NativeServerSmoke); javadoc + spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
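The bounded latch wait that replaces the sleep loop might be sketched as below. `CountDownLatch.await(long, TimeUnit)` is the standard JDK call; the class name, the `BooleanSupplier` standing in for isRunning(), and the boolean return value are assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class LatchWaitSketch {
    /**
     * Blocks until the latch is counted down (stop requested) or isRunning
     * turns false (worker ended on its own). Returns true in the first case.
     */
    static boolean awaitStop(CountDownLatch stop, BooleanSupplier isRunning) {
        try {
            while (isRunning.getAsBoolean()) {
                // Wakes immediately on countDown(); the 200 ms timeout only
                // bounds how long a self-terminated worker can go unnoticed.
                if (stop.await(200, TimeUnit.MILLISECONDS)) {
                    return true;
                }
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return true;                        // treat interruption as a stop request
        }
    }
}
```

A shutdown hook would call `stop.countDown()` after asking the native server to stop, which releases the waiting thread without a polling delay.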
The clang-cl fix worked (jllama.dll linked for Windows/aarch64), but the next
step failed at test discovery: gtest_discover_tests could not launch
jllama_test.exe -> exit 0xc0000135 (STATUS_DLL_NOT_FOUND).
Root cause: with clang-cl, ggml links LLVM's OpenMP runtime (libomp.lib -> needs
libomp140.aarch64.dll at run time). Unlike MSVC's ambient vcomp140.dll on x64,
that DLL is not on PATH, so neither the test exe nor a consumer could load the
binary. (Upstream llama.cpp works around this by copying libomp140.aarch64.dll
next to its arm64 output.)
Fix: pass -DGGML_OPENMP=OFF for the arm64 job. ggml falls back to its own
std::thread threadpool, so both jllama_test.exe and the shipped arm64 jllama.dll
are self-contained with no libomp dependency to ship, which is cleaner than
bundling an LLVM OpenMP DLL into the default JAR. The x86_64/x86 jobs keep OpenMP
(MSVC vcomp, which is ambient and already proven).
Also updated the job comment + CLAUDE.md to record that VC\Tools\Llvm\ARM64
supplies clang-cl/lld-link (no separate LLVM install needed) and the OpenMP
rationale. The getenv/strdup/ctime deprecation messages in the same log are
warnings only (clang-cl flagging POSIX names against the MSVC UCRT headers), not
the failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Trivial additive upgrade — no incompatibilities, no project source changes. Bumps GIT_TAG (+ the TTS provenance banner), the README badge/link, and the CLAUDE.md pinned-version line + build examples. The b9864..b9866 diff is backend/WebUI-only: the CUDA topk-moe kernel gains a case 288 instantiation + accepts n_expert==288 (StepFun 3.7's non-power-of-2 expert count) — device-side, affecting only the cuda13 classifiers; a test-backend-ops.cpp case (not built here, LLAMA_BUILD_TESTS OFF); and WebUI changes (a config string-boolean normalization migration + a thinking-default flip) that auto-follow the pinned GIT_TAG via the build-webui job. The project binds no new symbol. Patch verification: the diff touches no patch-target file and no OuteTTS anchor, so all six patches are byte-identical to b9864. Confirmed end-to-end by a clean cmake configure: b9866 fetched (case 288 present) and all six patches applied via the fail-loud PATCH_COMMAND (exit 0; 0005 + 0006 markers present), OuteTTS anchors held. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Add a diff-size-driven bump workflow so a version upgrade never lands an
unreviewably large diff in one step.
- .github/scripts/llama-next-version.sh: read-only helper that computes the next
reviewable step. Reads the current pin from llama/CMakeLists.txt and the target
from an explicit b<nnnn> arg or the GitHub releases atom feed, against a cached
blobless mirror clone. If git diff cur..target is under the threshold
(LLAMA_BUMP_MAX_DIFF_KB, default 100 KiB) it bumps straight to the target;
otherwise it binary-searches the intermediate tags for the largest one still
under the threshold and prints that chunk plus its compare/.patch URLs.
LLAMA_BUMP_EXCLUDE_WEBUI sizes the diff excluding the auto-followed tools/ui
WebUI.
- docs/upgrade/llama-cpp-version-bump.md: the runbook (documentation root) for
target selection, byte-size chunking, the helper, and the edit/verify/commit
loop.
- CLAUDE.md: link the runbook from the Upgrading/Downgrading section.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
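The chunk selection can be illustrated with a small binary search, written in Java here although the real helper is a shell script. The sketch assumes the diff size never shrinks as the tag gets further from the current pin, and that at least one tag is taken even when every candidate exceeds the threshold; both are assumptions, as are the class name and the `diffBytes` callback standing in for measuring `git diff`.

```java
import java.util.function.IntToLongFunction;

final class BumpChunkSketch {
    /**
     * tags: non-empty ascending tag numbers after the current pin, ending at the target.
     * diffBytes: size of the diff from the current pin to a given tag.
     * Returns the target if its diff is under maxBytes, else the largest tag
     * whose diff is still under maxBytes (or the first tag if none qualifies).
     */
    static int nextStep(int[] tags, long maxBytes, IntToLongFunction diffBytes) {
        int last = tags.length - 1;
        if (diffBytes.applyAsLong(tags[last]) < maxBytes) {
            return tags[last]; // small enough: bump straight to the target
        }
        int lo = 0;
        int hi = last - 1;
        int best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (diffBytes.applyAsLong(tags[mid]) < maxBytes) {
                best = mid;   // qualifies, try a later tag
                lo = mid + 1;
            } else {
                hi = mid - 1; // too large, try an earlier tag
            }
        }
        return tags[best >= 0 ? best : 0];
    }
}
```

Binary search keeps the number of diff measurements logarithmic in the number of intermediate tags, which matters when each measurement is a `git diff` over a mirror clone.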
First bump driven by the new .github/scripts/llama-next-version.sh helper: b9866 -> b9867 is a 2 KiB single-commit final chunk (well under the 100 KiB threshold), so it bumps straight to the latest release. b9867 (spec: support spec-draft-p-min in DFlash) changes only common/speculative.cpp: the DFlash draft path now also clamps n_min to the block size, raises the draft sampler top_k 1 -> 10, stops drafting when the top candidate probability drops below p_min, and discards a step producing fewer than n_min tokens. All three use existing common_speculative_params fields; common/speculative.h is untouched. Entirely inside upstream-compiled common; the project binds no common_speculative_* symbol. No project source changes required. Re-verified all six patches (0001-0006) apply cleanly against b9867 via a fresh fail-loud cmake PATCH_COMMAND configure (0005/0006 markers present); OuteTTS generator anchors held. Appended the b9866->b9867 history rows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
The new docs/upgrade/llama-cpp-version-bump.md lacked copyright/licensing info, failing the License Compliance (REUSE) check. Add the top-of-file HTML-comment SPDX block used by the sibling docs (docs/history/*.md, docs/feature-investigation-*.md). reuse lint now reports 310/310 files compliant with REUSE 3.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Driven by .github/scripts/llama-next-version.sh: b9867 -> b9870 is a 21 KiB (11 KiB excl. WebUI) three-commit final chunk, under the 100 KiB threshold, so it bumps straight to the latest release. The only source edit in b9867..b9870 is common/chat.cpp: a StepFun message-content whitespace workaround (issue #24181) that trims leading and trailing whitespace from each common_chat_msg content, reasoning_content and text content-part before Jinja rendering, detected by the StepFun template signature. It uses existing common_chat_msg fields; common/chat.h is untouched. The removed stepfun-ai-Step-3.5-Flash.jinja template and the test-chat additions are not built here (LLAMA_BUILD_TESTS OFF); tools/ui is the auto-followed WebUI. No project source changes required. Re-verified all six patches (0001-0006) apply cleanly against b9870 via a fresh fail-loud cmake PATCH_COMMAND configure (0005/0006 markers and the b9870 trim_all_content change present); OuteTTS generator anchors held. Appended the b9867->b9870 history rows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…VINO
Wire eight new GPU-backend classifiers following the exact same 5-place pattern
as the existing CUDA/Vulkan classifiers (fail-loud, in package.needs, no
continue-on-error, no special cases):
rocm-linux-x86-64        GGML_HIP       Linux x86_64    (AMD ROCm/HIP)
rocm-windows-x86-64      GGML_HIP       Windows x86_64  (AMD HIP SDK)
sycl-fp16-linux-x86-64   GGML_SYCL+F16  Linux x86_64    (Intel oneAPI, fp16)
sycl-fp32-linux-x86-64   GGML_SYCL      Linux x86_64    (Intel oneAPI, fp32)
sycl-windows-x86-64      GGML_SYCL      Windows x86_64  (Intel oneAPI)
opencl-windows-aarch64   GGML_OPENCL    Windows aarch64 (Snapdragon/Adreno)
openvino-linux-x86-64    GGML_OPENVINO  Linux x86_64    (Intel OpenVINO)
openvino-windows-x86-64  GGML_OPENVINO  Windows x86_64  (Intel OpenVINO)
- llama/CMakeLists.txt: extend the OS-aware backend routing with GGML_HIP,
GGML_SYCL (Linux fp16/fp32 split by GGML_SYCL_F16) and GGML_OPENVINO branches.
- llama/pom.xml: eight classifier profiles; the existing opencl-windows include
is now arch-scoped to Windows/x86_64 so the new aarch64 OpenCL build sharing the
resources_windows_opencl tree does not leak into it (vulkan-linux split
precedent).
- .github/workflows/publish.yml: eight build jobs (build-only; GitHub runners
have no matching GPU), all added to package.needs and to the download +
profile-activation steps of package/publish-snapshot/publish-release. Vendor
toolchain installs are first-pass and intentionally fail loud if a URL/version
is stale.
- README.md + CLAUDE.md: classifier table rows, dependency snippets, and a
wiring/routing section.
- .gitignore: the seven new resources_* trees.
All build-only, no vendor runtime bundled (consumer's driver/toolkit supplies
it). Validated locally: CMake CPU reconfigure parses the extended routing, Maven
recognizes all 8 profiles, publish.yml is valid YAML, pom.xml is well-formed,
REUSE compliant.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Mirror upstream llama.cpp's own release-job recipes for the Windows SYCL and
Windows HIP builds, and fix the two OpenVINO installs:
- Windows ROCm/HIP: the AMD HIP SDK URL 404'd and find_package(hip) could not
  locate the SDK. Use HIP SDK 26.Q1 (upstream's pin), resolve HIP_PATH from the
  installed ROCm dir, and pass -DCMAKE_PREFIX_PATH plus the SDK's own
  clang/clang++ so ggml-hip's find_package(hip) resolves (GPU_TARGETS, upstream
  spelling).
- Windows SYCL: the oneAPI offline installer URL returned 403. Use upstream's
  intel-deep-learning-essentials-2025.3.3.18 offline installer with the extract
  + bootstrapper silent install (DPC++/MKL/oneDNN/TBB components), then setvars
  intel64 --force and build with cl (C) + icx (C++), matching upstream.
- Linux OpenVINO: OpenVINOConfig.cmake's find_package(TBB) failed. Add
  libtbb-dev (supplies TBBConfig.cmake).
- Windows OpenVINO: the archive extracts into a nested versioned folder, so the
  hard-coded C:\openvino\runtime\cmake did not exist. Resolve the nested dir and
  pass -DOpenVINO_DIR explicitly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…O OpenCL
Second round of fail-loud CI fixes for the new GPU classifiers, from the actual
build logs:
- Windows ROCm/HIP: device-code compile failed because ROCm 7.1's HIP clang
headers cannot overload the __host__ __device__ isgreater/isless/... that the
very new VS 2026 MSVC <cmath> declares via _CLANG_BUILTIN2. Move the job to
windows-2022 (MSVC 14.4x), which is what upstream llama.cpp uses for win-hip.
- Windows SYCL: icx rejected the project's static /MT CRT with '-fsycl'
("invalid argument 'MT' not allowed with '-fsycl'"). Exempt GGML_SYCL (and
GGML_OPENVINO, whose import libs are /MD) from the static-CRT force in
CMakeLists so they build with the dynamic /MD runtime. Those classifiers
already need the vendor runtime on the host, so the self-contained-DLL
rationale doesn't apply; CPU + CUDA/Vulkan/OpenCL keep /MT.
- Linux OpenVINO: past the TBB fix, ggml-openvino's find_package(OpenCL) failed.
Add ocl-icd-opencl-dev + opencl-headers to the apt install.
- Windows OpenVINO: same find_package(OpenCL) need — build it via
build_opencl_windows.bat (stages the Khronos headers + OpenCL.lib, then
delegates to build.bat) instead of build.bat directly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
OpenVINO backend now compiles but failed on both platforms:
- ov::Allocator template error (allocate/is_equal on a const void*): version
  mismatch. llama.cpp's ggml-openvino targets OpenVINO 2026.2.1 (what upstream
  ships), not the 2025.0.0 I pinned. Bump Linux apt to openvino-2026.2.1 (repo
  /openvino/2026) and the Windows archive to 2026.2.1.
- Windows 'CL/cl2.hpp' not found: the staged Khronos OpenCL-Headers dropped
  cl2.hpp. Install OpenCL via vcpkg (opencl:x64-windows ships cl2.hpp) and pass
  the vcpkg toolchain file, mirroring upstream's windows-openvino job; drop the
  build_opencl_windows.bat staging for this job.
- Linux: add opencl-clhpp-headers + intel-opencl-icd to the apt set (upstream's
  full OpenCL package list for ubuntu-openvino).
Also raise cmake_minimum_required 3.15 -> 3.22 to match what the build actually
relies on (runners ship 3.31); no behavior change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…2026 apt repo
The apt repo https://apt.repos.intel.com/openvino/2026 returns 404: Intel only
publishes OpenVINO apt repos up to ~2025, and 2025.x has the older ov::Allocator
API that breaks ggml-openvino's template compile.

Switch Linux OpenVINO to the archive for 2026.2.1, exactly as upstream
llama.cpp's linux-setup-openvino composite action does:
  storage.openvinotoolkit.org/repositories/openvino/packages/2026.2.1/linux/
  openvino_toolkit_ubuntu24_2026.2.1.21919.ede283a88e3_x86_64.tgz
extracted to /opt/intel/openvino, with OpenVINO_DIR set to its runtime/cmake.
OpenCL headers (incl. the C++ CL/cl2.hpp via opencl-clhpp-headers) come from
Ubuntu's own repos, so no Intel apt repo is needed at all.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Wire build-linux-s390x: cross-compile for IBM Z (s390x, big-endian) with the GCC
cross toolchain (native x86 speed), then run the full 462-test C++ suite under
qemu-user as a real big-endian correctness gate for our byte-order-sensitive
code (the little-endian WAV writer, JSON/token/embedding transforms, JNI
helpers). Model-backed Java tests are not run under emulation (slow/flaky); the
Java<->JNI boundary uses host-native array copies, so the C++ gate covers the
actual endian risk.
- publish.yml: build-linux-s390x (g++-s390x-linux-gnu + qemu-user-static;
  CMAKE_CROSSCOMPILING_EMULATOR + QEMU_LD_PREFIX make ctest run the s390x exe;
  GGML_OPENMP=OFF avoids cross-libgomp). s390x is a default-jar CPU platform
  like aarch64, so the artifact merges via the *-libraries glob (no classifier /
  pom profile). Fail-loud and in package.needs.
- OSInfo.java: map os.arch=s390x -> Linux/s390x (S390X constant + archMapping).
- README/CLAUDE.md: document the platform + the big-endian gate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
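To make the endian risk concrete, here is a small sketch of the class of bug the big-endian gate is meant to catch. The project's WAV writer is part of the C++ code and is not shown in this PR text; the Java class EndianSketch and its two methods are hypothetical and exist only for illustration. A file format that mandates little-endian fields must encode them explicitly, because copying the in-memory representation only gives the right bytes on a little-endian host.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical illustration of byte-order-sensitive encoding; not the project's code. */
final class EndianSketch {
    /** Encodes v as 4 little-endian bytes by shifting, so host byte order never matters. */
    static byte[] le32(int v) {
        return new byte[] {
            (byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)
        };
    }

    /** Lays v out in the given order, mimicking a raw memory copy on a host with that order. */
    static byte[] rawCopy(int v, ByteOrder hostOrder) {
        return ByteBuffer.allocate(4).order(hostOrder).putInt(v).array();
    }
}
```

For a sample rate of 44100 (0x0000AC44), `le32` yields the bytes 44 AC 00 00, which equals `rawCopy(44100, ByteOrder.LITTLE_ENDIAN)` but not `rawCopy(44100, ByteOrder.BIG_ENDIAN)`. A writer that relied on a raw copy would therefore pass on x86 and aarch64 and produce a corrupt header on s390x.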
Address 'Resources should be closed' (java:S2095) on NativeServer.main. The
server was already closed on every real path (shutdown hook on SIGTERM, explicit
close on self-termination), but not in a structure Sonar recognizes.

Wrap the body in try/finally so close() is guaranteed on normal or exceptional
exit; this is S2095's 'close in a finally clause' option. try-with-resources is
deliberately NOT used: the shutdown hook must also call close() explicitly,
which javac flags under -Werror as 'explicit call to close() on an
auto-closeable resource'. close() is idempotent (guards on a zero handle), so
the finally and the hook both firing is safe. The now-redundant stoppedByHook
flag is dropped.

All 7 NativeServerSmokeTest cases still pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
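The shape described in this commit can be sketched roughly as below. This is not the actual NativeServer code, which is not reproduced on this page: SketchServer, SketchMain, the placeholder handle value 42 and the releases counter are all hypothetical names introduced for the example. It shows a close() that guards on a zero handle, a shutdown hook that calls it, and a try/finally that calls it again without harm.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a native-backed server; not the project's NativeServer. */
final class SketchServer implements AutoCloseable {
    // Non-zero while the pretend native handle is live (42 is a placeholder value).
    private final AtomicLong handle = new AtomicLong(42L);
    // Counts how many times the pretend native release actually ran.
    final AtomicInteger releases = new AtomicInteger();

    @Override
    public void close() {
        // Swap the handle to zero; only the caller that observed a non-zero value releases.
        long h = handle.getAndSet(0L);
        if (h != 0L) {
            releases.incrementAndGet();
        }
    }

    boolean isOpen() {
        return handle.get() != 0L;
    }
}

final class SketchMain {
    /** Runs body with a shutdown hook installed, closing the server on any exit path. */
    static void run(SketchServer server, Runnable body) {
        // The hook covers SIGTERM-style exits; it may fire after the finally already closed.
        Runtime.getRuntime().addShutdownHook(new Thread(server::close, "sketch-shutdown"));
        try {
            body.run();
        } finally {
            server.close(); // safe even if the hook also runs, because close() is idempotent
        }
    }
}
```

Because the handle is swapped to zero atomically, the release runs exactly once no matter whether the finally block, the hook, or both call close(), which is the property that makes the stoppedByHook flag unnecessary.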


Summary
Server
- libjllama over JNI, as the default server mode.
- ServerLauncher fat-jar entry point dispatches between NativeServer (default) and OpenAiCompatServer (via --jllama-openai-compat).
- OpenAiServerCli parses llama.cpp tuning flags (-b, -ub, -tb, -ctk, -ctv, --jinja, --chat-template-kwargs) with JSON validation for chat-template kwargs.

Native classifier matrix expansion: new GPU backends, each wired with the same 5-place pattern (CMakeLists routing → build job in package.needs, fail-loud → pom profile → README row → git-ignored resources_* tree), all build-only (runners have no matching GPU), no vendor runtime bundled:
- vulkan-linux-x86-64, vulkan-linux-aarch64 (vendor-neutral NVIDIA/AMD/Intel)
- rocm-linux-x86-64, rocm-windows-x86-64
- sycl-fp16-linux-x86-64, sycl-fp32-linux-x86-64, sycl-windows-x86-64
- openvino-linux-x86-64, openvino-windows-x86-64 (2026.2.1 archive + OpenCL via apt/vcpkg)
- opencl-windows-aarch64

New default-jar CPU platforms
- GGML_OPENMP=OFF)
- OSInfo maps os.arch=s390x → Linux/s390x.

llama.cpp upgrade b9864 → b9870
- sse_ping_interval, model ftype (quantization) on /v1/models, plus 8 audited withers (xtcProbability/Threshold, nDiscard, nIndent, tMaxPredictMs, postSamplingProbs, timingsPerToken, returnTokens).

Build / CI / tooling
- .github/scripts/llama-next-version.sh (diff-size chunking against a cached mirror) + docs/upgrade/llama-cpp-version-bump.md runbook, linked from CLAUDE.md.

Test plan
- OpenAiServerCliTest, ServerLauncherTest, NativeServerSmokeTest, InferenceParametersTest, ModelParametersExtendedTest)
- PATCH_COMMAND

Related issues / PRs
Implements the native server mode + WebUI embedding and the GPU/CPU classifier matrix expansion described in CLAUDE.md and TODO.md.

Checklist
- CONTRIBUTING.md and CODE_OF_CONDUCT.md

https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7