Claude/update b8913 compatibility eq8 n8 #95
Merged
bernardladenthin merged 6 commits on Apr 24, 2026
No breaking API changes for this project in this range.

Changes applied:
- CMakeLists.txt: bump GIT_TAG to b8913
- README.md: bump version badge to b8913
- CLAUDE.md: update pinned version; document b8887–b8913 changes
- server.hpp: clamp n_discard to non-negative (mirrors server-task.cpp fix)

Other upstream changes handled automatically (server-chat.cpp, server-common.cpp included directly from llama.cpp source):
- convert_transcriptions_to_chatcmpl gained a tmpls parameter
- parallel_tool_calls now defaults to model capability
- normalize_anthropic_billing_header added for Claude Code prefix caching
- chat_template_kwargs passed through in Anthropic→OAI conversion
- common_chat_prompt_preset / common_chat_get_asr_prompt added (chat.h)
- string_starts_with(char) overload added (common.h)
- LLAMA_ROPE_TYPE_NONE added to mtmd rope-type switch

https://claude.ai/code/session_01JkSSBJ1A1k9tXh54m4oP15
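The n_discard clamp mentioned above can be illustrated with a minimal sketch. The helper names and signatures here are hypothetical and are not taken from server.hpp; the rule that a value of 0 means "discard half of the remaining tokens" is an assumption about the context-shift logic, stated only to show why a negative value would be harmful.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not the actual server.hpp code):
// a negative request value is treated as 0.
inline int32_t clamp_n_discard(int32_t requested) {
    return std::max<int32_t>(0, requested);
}

// Assumed context-shift rule: 0 means "discard half of what is left
// after the kept prefix"; any positive value is used as-is.
inline int32_t effective_n_discard(int32_t requested, int32_t n_left) {
    const int32_t n = clamp_n_discard(requested);
    return n > 0 ? n : n_left / 2;
}
```

Without the clamp, a negative value would pass the "non-zero" check and be used directly as a token count, which is presumably what the upstream fix guards against.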
- logit_bias object format: add else-if branch handling {"tok": bias}
map format; previously silently dropped if client sent an object
instead of an array
- ignore_eos EOG injection: inject -INF logit bias for every EOG token
when ignore_eos=true, matching upstream behaviour (b8913 server-task.cpp)
- antiprompt fallback: apply server-configured stop sequences when the
request omits the "stop" field; previously defaults.antiprompt was
never initialised from params_base
https://claude.ai/code/session_01JkSSBJ1A1k9tXh54m4oP15
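As a rough illustration of the first two items, the sketch below turns an object-style logit_bias into bias entries and then appends a -INF bias for each end-of-generation token. It uses a plain std::map in place of the JSON object, and the names logit_bias_entry, biases_from_object and inject_eog_bias are hypothetical. Skipping non-numeric keys is a simplification made for this sketch, not a statement about what the server does with them.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a (token id, bias) pair.
struct logit_bias_entry {
    int32_t token;
    float   bias;
};

// Object format: {"<token id>": bias, ...}. Keys that are not plain
// integers, or that fall outside [0, n_vocab), are skipped in this sketch.
std::vector<logit_bias_entry> biases_from_object(
        const std::map<std::string, float>& obj, int32_t n_vocab) {
    std::vector<logit_bias_entry> out;
    for (const auto& kv : obj) {
        const std::string& key = kv.first;
        if (key.empty()) continue;
        char* end = nullptr;
        const long tok = std::strtol(key.c_str(), &end, 10);
        if (*end != '\0') continue;               // trailing non-digits
        if (tok < 0 || tok >= n_vocab) continue;  // out of vocabulary range
        out.push_back({static_cast<int32_t>(tok), kv.second});
    }
    return out;
}

// ignore_eos=true: forbid every end-of-generation token via a -INF bias.
void inject_eog_bias(std::vector<logit_bias_entry>& biases,
                     const std::vector<int32_t>& eog_tokens) {
    for (int32_t tok : eog_tokens) {
        biases.push_back({tok, -INFINITY});
    }
}
```

Appending the EOG entries after the client-supplied ones keeps the two sources of bias separate, so a client bias on an EOG token does not prevent the -INF entry from being added.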
Port all remaining feature-parity items from upstream:

defaults:
- initialise defaults.n_predict and defaults.cache_prompt from params_base
- use defaults.cache_prompt instead of hardcoded true

basic params:
- add max_completion_tokens as OAI alias for max_tokens / n_predict
- add adaptive_target, adaptive_decay, backend_sampling sampling fields

speculative:
- seed speculative struct from defaults before per-field overrides
- add speculative.type via common_speculative_type_from_name
- add speculative.ngram_size_n/m and ngram_min_hits with clamping

grammar:
- inherit defaults.sampling.grammar before parsing request grammar
- add grammar_type == "tool_calls" -> COMMON_GRAMMAR_TYPE_TOOL_CALLS
- remove unnecessary inner block nesting

chat format:
- allow per-request reasoning_format override via common_reasoning_format_from_name
- sync params.sampling.generation_prompt from oaicompat_chat_syntax
- add chat_parser field via oaicompat_chat_syntax.parser.load

reasoning budget:
- add full reasoning_budget_tokens/start_tag/end_tag/message block

https://claude.ai/code/session_01JkSSBJ1A1k9tXh54m4oP15
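The max_completion_tokens alias can be pictured as a fallback chain over the request fields. The sketch below is illustrative only: int_fields and resolve_n_predict are hypothetical names, a std::map stands in for the parsed request body, and the lookup order among the three names is an assumption that has not been checked against server.hpp.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the parsed request body: field name -> integer.
using int_fields = std::map<std::string, int>;

// Returns the first alias present in the request, else the server default.
// The order of the aliases below is an assumption made for this sketch.
int resolve_n_predict(const int_fields& req, int default_n_predict) {
    static const char* const aliases[] = {
        "n_predict", "max_tokens", "max_completion_tokens"};
    for (const char* name : aliases) {
        const auto it = req.find(name);
        if (it != req.end()) {
            return it->second;
        }
    }
    return default_n_predict;
}
```

Falling back to a value taken from the server defaults, rather than a constant, is what lets the "initialise defaults.n_predict from params_base" item take effect when the request names none of the aliases.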
…upstream

slot_params struct:
- add include_usage, return_progress, n_cmpl, n_cache_reuse fields matching task_params in upstream server-task.h
- fix typo 'mininum' -> 'minimum' in n_indent comment

slot_params::to_json:
- fix std::move on const ref in grammar_triggers loop (silently a copy; remove move)
- add speculative.type, speculative.ngram_size_n/m, speculative.ngram_m_hits, backend_sampling to JSON output

result_timings:
- add cache_n field and emit it first in to_json(), matching upstream

params_from_json_cmpl:
- add defaults.n_cache_reuse = params_base.n_cache_reuse
- add stream_options / include_usage parsing
- add return_progress, n_cmpl (with "n" alias), n_cache_reuse parsing
- align column spacing with upstream for readability
- merge split TODO comment onto one line
- add n_cmpl > n_parallel validation at end of function

https://claude.ai/code/session_01JkSSBJ1A1k9tXh54m4oP15
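The std::move item is worth a short illustration, since the behaviour is easy to miss. Calling std::move on a const reference yields a const rvalue reference, which cannot bind to a move constructor and therefore selects the copy constructor. The trigger type and copy_triggers function below are hypothetical stand-ins, not the actual grammar_triggers code.

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for a grammar trigger entry.
struct trigger {
    std::string value;
};

// std::move on a const lvalue produces `const trigger&&`, not `trigger&&`.
static_assert(
    std::is_same<decltype(std::move(std::declval<const trigger&>())),
                 const trigger&&>::value,
    "std::move preserves constness");

std::vector<trigger> copy_triggers(const std::vector<trigger>& src) {
    std::vector<trigger> out;
    for (const auto& t : src) {
        // Looks like a move, but `const trigger&&` only binds to the copy
        // constructor, so `t` is copied and the source stays intact.
        out.push_back(std::move(t));
    }
    return out;
}
```

Dropping the std::move, as the commit does, leaves behaviour unchanged and makes the copy explicit to the reader.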
- Add n_prompt_tokens_cache to server_slot (field, reset, capture at prompt processing, propagate to result structs and get_timings)
- get_timings() now populates result_timings.cache_n from cache field
- Add include_usage to server_task_result_cmpl_final; wire from slot params so streaming usage chunk is only sent when requested
- Fix to_json_oaicompat_chat_stream(): usage now goes in a separate final chunk with empty choices per OpenAI spec, not in finish chunk
- Add usage_json_oaicompat() helper to server_task_result_cmpl_final; includes prompt_tokens_details.cached_tokens in all OAI responses
- n_prompt_tokens_cache also added to server_task_result_cmpl_partial

https://claude.ai/code/session_01JkSSBJ1A1k9tXh54m4oP15
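The streaming change can be sketched as follows: the finish chunk carries only the finish reason, and a separate trailing chunk with an empty choices array carries the usage object when include_usage was requested. This is a simplified illustration that builds JSON by string concatenation; usage_counts, usage_json and final_stream_chunks are hypothetical names, and real chunks contain more fields (id, model, created) than shown here.

```cpp
#include <string>
#include <vector>

// Hypothetical token counters for one completed request.
struct usage_counts {
    int prompt_tokens;
    int completion_tokens;
    int cached_tokens;  // prompt tokens served from the cache
};

// Usage object including prompt_tokens_details.cached_tokens.
std::string usage_json(const usage_counts& u) {
    return "{\"prompt_tokens\":" + std::to_string(u.prompt_tokens) +
           ",\"completion_tokens\":" + std::to_string(u.completion_tokens) +
           ",\"total_tokens\":" +
           std::to_string(u.prompt_tokens + u.completion_tokens) +
           ",\"prompt_tokens_details\":{\"cached_tokens\":" +
           std::to_string(u.cached_tokens) + "}}";
}

// Final chunks of a chat stream: the finish chunk, then (only when
// requested) a usage chunk whose choices array is empty.
std::vector<std::string> final_stream_chunks(const usage_counts& u,
                                             bool include_usage) {
    std::vector<std::string> chunks;
    chunks.push_back(
        "{\"object\":\"chat.completion.chunk\",\"choices\":"
        "[{\"index\":0,\"delta\":{},\"finish_reason\":\"stop\"}]}");
    if (include_usage) {
        chunks.push_back(
            "{\"object\":\"chat.completion.chunk\",\"choices\":[],"
            "\"usage\":" + usage_json(u) + "}");
    }
    return chunks;
}
```

Clients that index choices[0] on every chunk need to tolerate the empty array in the trailing usage chunk.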
- Add result_prompt_progress struct (total, cache, processed, time_ms) with to_json() matching upstream server-task.h
- Add is_progress and progress fields to server_task_result_cmpl_partial
- Update to_json_non_oaicompat(), to_json_oaicompat(), and to_json_oaicompat_chat() in partial result to emit prompt_progress when is_progress is true
- send_partial_response() gains is_progress parameter (default false); when true, populates progress fields and leaves content/tokens empty
- Emit progress in the decode loop for SLOT_STATE_PROCESSING_PROMPT and SLOT_STATE_DONE_PROMPT slots when stream and return_progress are set

https://claude.ai/code/session_01JkSSBJ1A1k9tXh54m4oP15
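A minimal sketch of the progress struct described above is shown here. The four field names come from the commit message; the field types and the string-based to_json() are assumptions made so the example stays free of external JSON libraries, and the real struct presumably returns a JSON object instead.

```cpp
#include <cstdint>
#include <string>

// Sketch of the prompt-progress payload; field types are assumed.
struct result_prompt_progress {
    int32_t total     = 0;  // prompt tokens in total
    int32_t cache     = 0;  // prompt tokens reused from the cache
    int32_t processed = 0;  // prompt tokens handled so far
    int64_t time_ms   = 0;  // elapsed prompt-processing time

    // Serialises the four fields in declaration order.
    std::string to_json() const {
        return "{\"total\":" + std::to_string(total) +
               ",\"cache\":" + std::to_string(cache) +
               ",\"processed\":" + std::to_string(processed) +
               ",\"time_ms\":" + std::to_string(time_ms) + "}";
    }
};
```

A partial result flagged with is_progress would embed this object under prompt_progress and leave its content and token fields empty, as the commit describes.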
bernardladenthin pushed a commit that referenced this pull request on May 22, 2026
Fetched the verbatim text of the LIKELY FIXED / PARTIALLY FIXED issues from github.com/kherud/java-llama.cpp and appended a Verification plan section with:
(a) a table of new info extracted from each issue body,
(b) four concrete JUnit test sketches that would close out #80, #95, #98, #102,
(c) a non-unit-testable bucket for #34, #50, #86, #103, #121 with the corresponding action (feature, docs, CI matrix),
(d) a recommended PR sequencing.

Notable finding: #98's original repro did not call enableEmbedding() at all. The binding never forwarded --embedding to the upstream server-context, so the result_output assertion fired because the embedding pipeline was never initialised. enableEmbedding() now exists in ModelParameters (line 1040), so the fix is essentially code-confirmed; an integration test against nomic-embed-text is optional confirmation.
bernardladenthin added a commit that referenced this pull request on May 22, 2026
)

* Enrich open-issues baseline with current-fork status

Appends a Status in fork subsection to each of the 37 upstream issues with a verdict, file:line evidence, and next steps; adds a Status overview table summarising verdicts across all issues.

* Add deep-dive analysis for likely/partially fixed issues

Appends a per-issue Deep-dive analysis block to each of the 9 LIKELY FIXED / PARTIALLY FIXED entries, and adds a top-level Deep-dive verdict guide categorising which issues are confirmable from code inspection, which need one targeted JUnit test, and which genuinely require platform-specific runtime reproduction. Updates the Status overview table for #121 (FIXED for 64-bit Android) and #86 (CUDA jar requires libcudart at runtime, not auto-fallback).

* Add verification plan with original-issue research and test sketches

Fetched the verbatim text of the LIKELY FIXED / PARTIALLY FIXED issues from github.com/kherud/java-llama.cpp and appended a Verification plan section with:
(a) a table of new info extracted from each issue body,
(b) four concrete JUnit test sketches that would close out #80, #95, #98, #102,
(c) a non-unit-testable bucket for #34, #50, #86, #103, #121 with the corresponding action (feature, docs, CI matrix),
(d) a recommended PR sequencing.

Notable finding: #98's original repro did not call enableEmbedding() at all. The binding never forwarded --embedding to the upstream server-context, so the result_output assertion fired because the embedding pipeline was never initialised. enableEmbedding() now exists in ModelParameters (line 1040), so the fix is essentially code-confirmed; an integration test against nomic-embed-text is optional confirmation.

---------

Co-authored-by: Claude <noreply@anthropic.com>
bernardladenthin pushed a commit that referenced this pull request on May 22, 2026
Updates docs/history/49be664_open_issues.md to reflect that the four JUnit regression tests called for in the verification plan have been added on this branch:
- Deep-dive verdict guide now lists each test name and self-skip behaviour next to its issue bullet
- Per-issue Status blocks for #80, #95, #98, #102 annotated as "LIKELY FIXED -> FIXED on CI green" with the covering test
- Status overview table rows for the same four issues updated
- "What the original issues actually contain" feasibility table marks all four as DONE with the commit reference
- "Concrete test plan" gains a status callout noting the as-shipped implementation matches the sketches
- "Recommended sequencing" step 1 marked DONE and enumerates what shipped; remaining steps (#86 docs, #103/#34 typed image API, Android emulator CI) carried forward as the next deliverables

No code or behaviour change, documentation only.

https://claude.ai/code/session_01LR7Gw1pyKS7wvxXfZjnxNW
bernardladenthin added a commit that referenced this pull request on May 22, 2026
* test: add JUnit regressions for kherud open issues #80, #95, #98, #102

Adds four small JUnit tests proposed in the verification plan section of docs/history/49be664_open_issues.md to upgrade the corresponding upstream issues from LIKELY FIXED to FIXED:
- MemoryManagementTest#testOpenCloseLoopDoesNotLeak (#102): 20-iteration open/close loop; on Linux asserts VmRSS delta < 200 MB. Degenerates to a no-crash smoke test on non-Linux hosts where /proc/self/status is absent.
- MemoryManagementTest#testOpenCloseWithoutGeneration (#80): 20 open + immediate close without any generation; exercises the half-initialised worker race closed by the double server.terminate() in jllama.cpp.
- LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt (#95): asserts the iterator terminates within nPredict+1 steps on a deliberately repetitive prompt.
- LlamaEmbeddingsTest#testNomicEmbedLoads (#98): gated on system property net.ladenthin.llama.nomic.path; reproduces the reporter's batch/ubatch config plus the fix (enableEmbedding()), and asserts a 768-dim vector for nomic-embed-text-v1.5.

Wires up the optional nomic GGUF download in the linux-x86_64 Java test job in .github/workflows/publish.yml. Other test jobs cleanly self-skip via Assume because the system property is unset.

Documents the local native-build workflow in CLAUDE.md: per-host output paths, mvn-cmake handoff, optional model handling, and the restricted-network caveat for environments that block huggingface.co.

https://claude.ai/code/session_01LR7Gw1pyKS7wvxXfZjnxNW

* docs: record #80/#95/#98/#102 regression tests added in 713d426

Updates docs/history/49be664_open_issues.md to reflect that the four JUnit regression tests called for in the verification plan have been added on this branch:
- Deep-dive verdict guide now lists each test name and self-skip behaviour next to its issue bullet
- Per-issue Status blocks for #80, #95, #98, #102 annotated as "LIKELY FIXED -> FIXED on CI green" with the covering test
- Status overview table rows for the same four issues updated
- "What the original issues actually contain" feasibility table marks all four as DONE with the commit reference
- "Concrete test plan" gains a status callout noting the as-shipped implementation matches the sketches
- "Recommended sequencing" step 1 marked DONE and enumerates what shipped; remaining steps (#86 docs, #103/#34 typed image API, Android emulator CI) carried forward as the next deliverables

No code or behaviour change, documentation only.

https://claude.ai/code/session_01LR7Gw1pyKS7wvxXfZjnxNW

---------

Co-authored-by: Claude <noreply@anthropic.com>
bernardladenthin added a commit that referenced this pull request on May 22, 2026
* docs: mark #80/#95/#98/#102 as FIXED now that PR #185 is merged

PR #185 (commit cba693c) merged the four regression tests sketched in the 49be664 open-issues verification plan. Update the per-issue blocks, the status overview table, the top-level deep-dive verdict guide, and the recommended-sequencing section to reflect that #80, #95, #98 and #102 are now FIXED (no longer "LIKELY FIXED → FIXED on CI green").

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43

* docs: add README "Choosing the right classifier" section

Closes the documentation gap for issue #86 (does the CUDA jar fall back to CPU?) and the 32-bit Android tail of #121 (armeabi-v7a not published).

The new section enumerates the three published classifiers (default CPU, cuda13-linux-x86-64, opencl-android-aarch64), their backends, target platforms, and runtime requirements. It explicitly states that the CUDA JAR is CUDA-only at runtime (it dlopens libcudart.so.13/libcublas.so.13 and has no automatic CPU fallback) and that Android armeabi-v7a is not shipped as a released artifact.

Updates docs/history/49be664_open_issues.md to mark #86 as FIXED-AS-DOCUMENTED and #121 as FIXED (64-bit) with the 32-bit limitation now documented.

https://claude.ai/code/session_01R3jVWHsB3zymwAQtj8GT43

---------

Co-authored-by: Claude <noreply@anthropic.com>