feat(server): stream POST /v1/completions token-by-token (no new nati…#266
Merged
bernardladenthin merged 1 commit into bernardladenthin:main on Jun 21, 2026
…ve code)

The native streaming raw-completion path already exists (requestCompletion / receiveCompletionJson, exposed as LlamaModel.generate(InferenceParameters) -> LlamaIterable), so streaming /v1/completions is pure server wiring, with no JNI/C++ change:

- OpenAiRequestMapper.toCompletionParameters maps an OpenAI completion request (prompt + sampling) to InferenceParameters; the shared sampling/cache/output fields are factored into applyCommonFields (reused by the chat mapper, whose tests confirm no behaviour change).
- OpenAiBackend.streamCompletions(request, sink) (default throws UnsupportedOperationException) + LlamaModelBackend impl drives generate() and emits one OpenAI text_completion chunk per token, mapping StopReason -> finish_reason (length / stop); a sink IOException cancels the native task via LlamaIterable.close().
- OpenAiSseFormatter.completionChunk builds the text_completion chunk; OpenAiCompatServer.handleCompletions branches on stream:true to a new streamCompletions SSE handler (mirrors streamChat: heartbeats, [DONE], graceful client-disconnect).

Verified: new streaming HTTP test + 16 mapper + 36 server + 40 adjacent server tests green; Spotless + Javadoc clean.

TODO.md updated: corrects the stale "new native method needed" note, marks /v1/completions done, and adds a grounded future-modality (audio/image OUTPUT) design note. llama.cpp generates text only, so that surface stays a documented extension point rather than speculative dead code.

Remaining consumers (same pattern, follow-ups): token-streaming Ollama /api/generate and Continue's native POST /completion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
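To make the chunk shape concrete, here is a minimal sketch of how one token could be framed as a server-sent event. `SseSketch` and all of its method signatures are hypothetical stand-ins, not the project's `OpenAiSseFormatter`; the field set follows the OpenAI `text_completion` chunk shape, and the `id`, `model`, and `created` values are illustrative only.

```java
/** Hypothetical stand-in for an SSE formatter; NOT the project's OpenAiSseFormatter. */
final class SseSketch {

    /** Minimal JSON string escaping (quotes, backslashes, control characters). */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Builds one text_completion chunk; finishReason is null until the final chunk. */
    static String completionChunk(String id, String model, long created,
                                  String text, String finishReason) {
        String finish = finishReason == null ? "null" : jsonString(finishReason);
        return "{\"id\":" + jsonString(id)
                + ",\"object\":\"text_completion\""
                + ",\"created\":" + created
                + ",\"model\":" + jsonString(model)
                + ",\"choices\":[{\"index\":0,\"text\":" + jsonString(text)
                + ",\"finish_reason\":" + finish + "}]}";
    }

    /** Wraps a payload as one SSE event: a "data:" line followed by a blank line. */
    static String event(String data) {
        return "data: " + data + "\n\n";
    }

    /** Terminal sentinel that OpenAI-style streams send after the last chunk. */
    static String done() {
        return event("[DONE]");
    }

    public static void main(String[] args) {
        System.out.print(event(completionChunk("cmpl-demo", "demo-model", 0L, "Hello", null)));
        System.out.print(event(completionChunk("cmpl-demo", "demo-model", 0L, "", "stop")));
        System.out.print(done());
    }
}
```

A real formatter would more likely use a JSON library than manual string building; the hand-rolled escaping here only keeps the sketch dependency-free.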
1302ff0 into bernardladenthin:main
23 of 33 checks passed
vaiju1981 pushed a commit to vaiju1981/java-llama.cpp that referenced this pull request on Jun 22, 2026
…adenthin#266 regression)

LlamaArchitectureTest.layeredArchitecture was already failing on main (not introduced by the TTS work): the streaming-completions merge (bernardladenthin#266) added LlamaModelBackend (server layer) reads of StopReason / LlamaOutput (value layer), but the Value layer's mayOnlyBeAccessedByLayers list, documented as "the EXACT set of packages that reference it today", was not updated. Add "Server" to it, the same maintenance the rule's own javadoc prescribes.

Unrelated to TTS but folded in here because it blocks PR bernardladenthin#268's CI; kept as its own commit so it can be cherry-picked to main independently.
vaiju1981 pushed a commit to vaiju1981/java-llama.cpp that referenced this pull request on Jun 22, 2026
…ladenthin#266/bernardladenthin#267 findings

SpotBugs (effort=Max) flagged 5 Low/High findings; all are established false-positive categories already suppressed elsewhere with the same rationale:

This PR (TextToSpeech, a native-handle wrapper like LlamaModel):
- IMC_IMMATURE_CLASS_NO_TOSTRING: the only field is the opaque native handle; a toString would emit just a pointer (mirrors the LlamaModelBackend suppression).
- WEM_WEAK_EXCEPTION_MESSAGING on synthesize(): fixed "TextToSpeech is closed" precondition guard (mirrors the server request-parser guards).

Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the rebase; main is also red on them):
- OpenAiRequestMapper.toCompletionParameters WEM: same input-validation guard as the already-suppressed toInferenceParameters; extended the existing Or-block.
- ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON: the canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII pair; same false-positive class as the server.* IMPROPER_UNICODE block.

Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).
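For readers unfamiliar with the toLowerCase(Locale.ROOT)+equals idiom named in the last finding, a minimal sketch follows. The class name `AudioFormatSketch`, the method name, and the specific pair "wav"/"mp3" are assumptions made for illustration; the commit only says the check runs over "a fixed ASCII pair" and does not show the actual ContentPart.inputAudio code.

```java
import java.util.Locale;

/** Hypothetical illustration; NOT the project's ContentPart.inputAudio. */
final class AudioFormatSketch {

    /**
     * Case-insensitive membership test over a fixed ASCII pair.
     * Locale.ROOT keeps lower-casing independent of the JVM's default locale,
     * so the comparison behaves identically on every machine.
     */
    static boolean isSupportedFormat(String format) {
        if (format == null) {
            return false;
        }
        String normalized = format.toLowerCase(Locale.ROOT);
        // "wav" and "mp3" are assumed values, chosen only for the example.
        return normalized.equals("wav") || normalized.equals("mp3");
    }

    public static void main(String[] args) {
        System.out.println(isSupportedFormat("WAV"));
        System.out.println(isSupportedFormat("flac"));
    }
}
```

Because both accepted values are plain ASCII literals, the Unicode-handling and literal-comparison warnings do not point at a real defect in a check of this shape, which is consistent with the commit suppressing them as false positives.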
Summary
Stream POST /v1/completions token-by-token, with no new native/JNI code. The streaming raw-completion path already exists in the JNI layer (requestCompletion / receiveCompletionJson, exposed as LlamaModel.generate(InferenceParameters) → LlamaIterable), so this is purely Java server wiring; the earlier "a new requestCompletionStream native method is needed" note was stale.

- OpenAiRequestMapper.toCompletionParameters maps an OpenAI completion request (prompt + sampling) to InferenceParameters. The shared sampling/cache/output fields are factored into a new applyCommonFields reused by the chat mapper (its tests confirm no behaviour change).
- OpenAiBackend.streamCompletions(request, sink) (default throws UnsupportedOperationException) + a LlamaModelBackend impl that drives generate() and emits one OpenAI text_completion chunk per token, mapping StopReason → finish_reason (length / stop). A sink IOException (client disconnect) cancels the native task via LlamaIterable.close().
- OpenAiSseFormatter.completionChunk builds the chunk; handleCompletions branches on stream:true to a new streamCompletions SSE handler that mirrors streamChat (heartbeats, [DONE], graceful disconnect).
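The per-token loop described above can be sketched roughly as follows. Every type here (`StopReason` values, `Output`, `TokenStream`, `ChunkSink`, `ListStream`) is a hypothetical stand-in defined only for this example and does not reproduce the project's LlamaIterable, LlamaOutput, or sink API. The use of try-with-resources to trigger close() is likewise an assumption about one way to get the cancel-on-disconnect behaviour.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of a token-streaming loop; NOT the project's LlamaModelBackend. */
final class StreamLoopSketch {

    /** Assumed stop reasons; the real enum's constants may differ. */
    enum StopReason { NONE, EOS, STOP_STRING, MAX_TOKENS }

    /** One generated token plus the stop reason reported with it. */
    static final class Output {
        final String text;
        final StopReason stop;
        Output(String text, StopReason stop) { this.text = text; this.stop = stop; }
    }

    /** Stand-in for a closeable token iterable; close() represents cancelling the task. */
    interface TokenStream extends Iterable<Output>, AutoCloseable {
        @Override void close();
    }

    /** Stand-in for the sink that writes one chunk to the client. */
    interface ChunkSink {
        void accept(String text, String finishReason) throws IOException;
    }

    /** Maps a stop reason to an OpenAI finish_reason; null while generation continues. */
    static String finishReason(StopReason reason) {
        switch (reason) {
            case MAX_TOKENS: return "length";
            case EOS:
            case STOP_STRING: return "stop";
            default: return null;
        }
    }

    /**
     * Emits one chunk per token. Returns true if the stream completed, false if the
     * sink failed (for example, the client disconnected). try-with-resources closes
     * the stream on both paths, before the catch block runs.
     */
    static boolean stream(TokenStream tokens, ChunkSink sink) {
        try (TokenStream t = tokens) {
            for (Output o : t) {
                sink.accept(o.text, finishReason(o.stop));
            }
            return true;
        } catch (IOException clientGone) {
            return false;
        }
    }

    /** In-memory stream used to exercise the loop without a model. */
    static final class ListStream implements TokenStream {
        private final List<Output> items;
        boolean closed;
        ListStream(List<Output> items) { this.items = items; }
        @Override public Iterator<Output> iterator() { return items.iterator(); }
        @Override public void close() { closed = true; }
    }
}
```

The point of routing cancellation through close() is that generation stops as soon as a write fails, rather than continuing to produce tokens nobody will read.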
Test plan
/v1/completions HTTP test (FakeBackend), 16 mapper tests + 36 server tests + 40 adjacent server tests; Spotless + Javadoc clean.
/v1/completions streaming done).
Related issues / PRs
Server follow-up from the OpenAI-compatible endpoint work.
Note for reviewer
Remaining consumers are queued as follow-ups (same generate()-driven pattern, no new infrastructure): token-streaming Ollama /api/generate and Continue's native POST /completion. Verification gap: the streaming path is proven by the FakeBackend HTTP test (SSE framing, [DONE], the mapper) but not yet by a real-model streaming test; the gated OpenAiServerCompletionIntegrationTest is the place to add one.
Checklist
CONTRIBUTING.md and CODE_OF_CONDUCT.md
SECURITY.md)