feat(server): stream POST /v1/completions token-by-token (no new nati… by vaiju1981 · Pull Request #266 · bernardladenthin/java-llama.cpp

vaiju1981 · 2026-06-21T18:59:08Z

Summary

Stream POST /v1/completions token-by-token — with no new native/JNI code. The streaming
raw-completion path already exists in the JNI layer (requestCompletion/receiveCompletionJson,
exposed as LlamaModel.generate(InferenceParameters) → LlamaIterable), so this is purely Java server
wiring; the earlier "a new requestCompletionStream native method is needed" note was stale.

OpenAiRequestMapper.toCompletionParameters maps an OpenAI completion request (prompt + sampling)
to InferenceParameters. The shared sampling/cache/output fields are factored into a new
applyCommonFields reused by the chat mapper (its tests confirm no behaviour change).
OpenAiBackend.streamCompletions(request, sink) (default throws UnsupportedOperationException) +
a LlamaModelBackend impl that drives generate() and emits one OpenAI text_completion chunk per
token, mapping StopReason → finish_reason (length / stop). A sink IOException (client
disconnect) cancels the native task via LlamaIterable.close().
OpenAiSseFormatter.completionChunk builds the chunk; handleCompletions branches on stream:true
to a new streamCompletions SSE handler that mirrors streamChat (heartbeats, [DONE], graceful
disconnect).

Test plan

Affected tests pass locally — new streaming /v1/completions HTTP test (FakeBackend), 16
mapper tests + 36 server tests + 40 adjacent server tests; Spotless + Javadoc clean.
CI is green on this branch
Docs — server class Javadoc updated; TODO.md corrected (stale "new native method" note; marks
/v1/completions streaming done).

Related issues / PRs

Server follow-up from the OpenAI-compatible endpoint work.

Note for reviewer

Remaining consumers are queued as follow-ups (same generate()-driven pattern, no new infrastructure):
token-streaming Ollama /api/generate and Continue's native POST /completion. Verification gap: the
streaming path is proven by the FakeBackend HTTP test (SSE framing, [DONE], the mapper) but not yet
by a real-model streaming test — the gated OpenAiServerCompletionIntegrationTest is the place to add one.

Checklist

I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
My commits follow Conventional Commits
No security-sensitive changes (if there are, I have notified the maintainer privately per SECURITY.md)

…ve code) The native streaming raw-completion path already exists (requestCompletion / receiveCompletionJson, exposed as LlamaModel.generate(InferenceParameters) -> LlamaIterable), so streaming /v1/completions is pure server wiring — no JNI/C++ change: - OpenAiRequestMapper.toCompletionParameters maps an OpenAI completion request (prompt + sampling) to InferenceParameters; the shared sampling/cache/output fields are factored into applyCommonFields (reused by the chat mapper, whose tests confirm no behaviour change). - OpenAiBackend.streamCompletions(request, sink) (default throws UnsupportedOperationException) + LlamaModelBackend impl drives generate() and emits one OpenAI text_completion chunk per token, mapping StopReason -> finish_reason (length / stop); a sink IOException cancels the native task via LlamaIterable.close(). - OpenAiSseFormatter.completionChunk builds the text_completion chunk; OpenAiCompatServer.handleCompletions branches on stream:true to a new streamCompletions SSE handler (mirrors streamChat: heartbeats, [DONE], graceful client-disconnect). Verified: new streaming HTTP test + 16 mapper + 36 server + 40 adjacent server tests green; Spotless + Javadoc clean. TODO.md updated: corrects the stale "new native method needed" note, marks /v1/completions done, and adds a grounded future-modality (audio/image OUTPUT) design note — llama.cpp generates text only, so that surface stays a documented extension point rather than speculative dead code. Remaining consumers (same pattern, follow-ups): token-streaming Ollama /api/generate and Continue's native POST /completion. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…adenthin#266 regression) LlamaArchitectureTest.layeredArchitecture was already failing on main (not introduced by the TTS work): the streaming-completions merge (bernardladenthin#266) added LlamaModelBackend (server layer) reads of StopReason / LlamaOutput (value layer), but the Value layer's mayOnlyBeAccessedByLayers list — documented as "the EXACT set of packages that reference it today" — was not updated. Add "Server" to it, the same maintenance the rule's own javadoc prescribes. Unrelated to TTS but folded in here because it blocks PR bernardladenthin#268's CI; kept as its own commit so it can be cherry-picked to main independently.

…ladenthin#266/bernardladenthin#267 findings SpotBugs (effort=Max) flagged 5 Low/High findings; all are established false-positive categories already suppressed elsewhere with the same rationale: This PR (TextToSpeech, a native-handle wrapper like LlamaModel): - IMC_IMMATURE_CLASS_NO_TOSTRING — only field is the opaque native handle; a toString would emit just a pointer (mirrors the LlamaModelBackend suppression). - WEM_WEAK_EXCEPTION_MESSAGING on synthesize() — fixed "TextToSpeech is closed" precondition guard (mirrors the server request-parser guards). Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the rebase; main is also red on them): - OpenAiRequestMapper.toCompletionParameters WEM — same input-validation guard as the already-suppressed toInferenceParameters; extended the existing Or-block. - ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON — the canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII pair; same false-positive class as the server.* IMPROPER_UNICODE block. Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).

vaiju1981 requested a review from bernardladenthin as a code owner June 21, 2026 18:59

vaiju1981 temporarily deployed to startgate June 21, 2026 19:01 — with GitHub Actions Inactive

bernardladenthin merged commit 1302ff0 into bernardladenthin:main Jun 21, 2026
23 of 33 checks passed

vaiju1981 mentioned this pull request Jun 23, 2026

feat(tts): text-to-speech via the OuteTTS + WavTokenizer pipeline #268

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(server): stream POST /v1/completions token-by-token (no new nati…#266

feat(server): stream POST /v1/completions token-by-token (no new nati…#266
bernardladenthin merged 1 commit into
bernardladenthin:mainfrom
vaiju1981:feat/streaming-completions

vaiju1981 commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vaiju1981 commented Jun 21, 2026

Summary

Test plan

Related issues / PRs

Note for reviewer

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants