
feat(server): stream POST /v1/completions token-by-token (no new nati…#266

Merged
bernardladenthin merged 1 commit into bernardladenthin:main from vaiju1981:feat/streaming-completions
Jun 21, 2026

@vaiju1981


Summary

Stream POST /v1/completions token-by-token — with no new native/JNI code. The streaming
raw-completion path already exists in the JNI layer (requestCompletion/receiveCompletionJson,
exposed as LlamaModel.generate(InferenceParameters) → LlamaIterable), so this is purely Java server
wiring; the earlier "a new requestCompletionStream native method is needed" note was stale.

  • OpenAiRequestMapper.toCompletionParameters maps an OpenAI completion request (prompt + sampling)
    to InferenceParameters. The shared sampling/cache/output fields are factored into a new
    applyCommonFields reused by the chat mapper (its tests confirm no behaviour change).
  • OpenAiBackend.streamCompletions(request, sink) (default throws UnsupportedOperationException) +
    a LlamaModelBackend impl that drives generate() and emits one OpenAI text_completion chunk per
    token, mapping StopReason → finish_reason (length / stop). A sink IOException (client
    disconnect) cancels the native task via LlamaIterable.close().
  • OpenAiSseFormatter.completionChunk builds the chunk; handleCompletions branches on stream:true
    to a new streamCompletions SSE handler that mirrors streamChat (heartbeats, [DONE], graceful
    disconnect).
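
The generate()-driven loop described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `Stop`, `Token`, `ChunkSink`, and `TokenStream` are hypothetical stand-ins for `StopReason`, `LlamaOutput`, the sink, and `LlamaIterable`, and the real signatures may differ. It shows the two behaviours the summary names: mapping the stop state to `finish_reason`, and closing the iterable when the sink throws an `IOException`.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class StreamCompletionsSketch {

    /** Hypothetical stand-in for StopReason; only the outcomes named in the summary are modeled. */
    enum Stop { NONE, MAX_TOKENS, STOP }

    /** Hypothetical stand-in for one generated token plus the stop state reported with it. */
    static final class Token {
        final String text;
        final Stop stop;
        Token(String text, Stop stop) { this.text = text; this.stop = stop; }
    }

    /** Hypothetical sink: receives one chunk per token; throws when the client has gone away. */
    interface ChunkSink {
        void emit(String text, String finishReason) throws IOException;
    }

    /** Hypothetical stand-in for LlamaIterable: iterable, and close() would cancel the native task. */
    static final class TokenStream implements Iterable<Token>, AutoCloseable {
        private final List<Token> tokens;
        boolean closed;
        TokenStream(List<Token> tokens) { this.tokens = tokens; }
        @Override public Iterator<Token> iterator() { return tokens.iterator(); }
        @Override public void close() { closed = true; }
    }

    /** Maps the stop state to an OpenAI finish_reason; null means "still generating". */
    static String finishReason(Stop stop) {
        switch (stop) {
            case MAX_TOKENS: return "length";
            case STOP: return "stop";
            default: return null;
        }
    }

    /** Emits one chunk per token and returns how many chunks were delivered. */
    static int stream(TokenStream source, ChunkSink sink) {
        int emitted = 0;
        // try-with-resources closes the stream on normal completion and on failure,
        // so a sink IOException (client disconnect) also cancels generation.
        try (TokenStream tokens = source) {
            for (Token t : tokens) {
                sink.emit(t.text, finishReason(t.stop));
                emitted++;
            }
        } catch (IOException clientGone) {
            // The stream is already closed here; there is nobody left to report to.
        }
        return emitted;
    }

    public static void main(String[] args) {
        TokenStream tokens = new TokenStream(Arrays.asList(
                new Token("Hel", Stop.NONE), new Token("lo", Stop.NONE), new Token("!", Stop.STOP)));
        stream(tokens, (text, finish) -> System.out.println(text + " -> " + finish));
    }
}
```

Using try-with-resources keeps the cancel path and the normal completion path identical, which is one plausible way to guarantee the native task is released however the loop exits.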

Test plan

  • Affected tests pass locally — new streaming /v1/completions HTTP test (FakeBackend), 16
    mapper tests + 36 server tests + 40 adjacent server tests; Spotless + Javadoc clean.
  • CI is green on this branch
  • Docs — server class Javadoc updated; TODO.md corrected (stale "new native method" note; marks
    /v1/completions streaming done).

Related issues / PRs

Server follow-up from the OpenAI-compatible endpoint work.

Note for reviewer

Remaining consumers are queued as follow-ups (same generate()-driven pattern, no new infrastructure):
token-streaming Ollama /api/generate and Continue's native POST /completion. Verification gap: the
streaming path is proven by the FakeBackend HTTP test (SSE framing, [DONE], the mapper) but not yet
by a real-model streaming test — the gated OpenAiServerCompletionIntegrationTest is the place to add one.

Checklist

  • I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
  • My commits follow Conventional Commits
  • No security-sensitive changes (if there are, I have notified the maintainer privately per SECURITY.md)

…ve code)

The native streaming raw-completion path already exists (requestCompletion / receiveCompletionJson,
exposed as LlamaModel.generate(InferenceParameters) -> LlamaIterable), so streaming /v1/completions is
pure server wiring — no JNI/C++ change:

- OpenAiRequestMapper.toCompletionParameters maps an OpenAI completion request (prompt + sampling) to
  InferenceParameters; the shared sampling/cache/output fields are factored into applyCommonFields
  (reused by the chat mapper, whose tests confirm no behaviour change).
- OpenAiBackend.streamCompletions(request, sink) (default throws UnsupportedOperationException) +
  LlamaModelBackend impl drives generate() and emits one OpenAI text_completion chunk per token,
  mapping StopReason -> finish_reason (length / stop); a sink IOException cancels the native task via
  LlamaIterable.close().
- OpenAiSseFormatter.completionChunk builds the text_completion chunk; OpenAiCompatServer.handleCompletions
  branches on stream:true to a new streamCompletions SSE handler (mirrors streamChat: heartbeats, [DONE],
  graceful client-disconnect).
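
For orientation, the SSE framing that the last bullet refers to might look like the sketch below. It is not the actual OpenAiSseFormatter: the class and method names are hypothetical, the JSON is assembled by hand only to keep the example dependency-free, and the heartbeat text is an assumption (an SSE comment line, which clients ignore). The chunk shape follows the OpenAI text_completion streaming format, and the stream ends with a literal [DONE] data event.

```java
public class SseFramingSketch {

    /** Minimal JSON string escaping: quotes, backslashes, and control characters. */
    static String jsonString(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    /** Builds one text_completion chunk; finishReason is null until the final token. */
    static String completionChunk(String id, long created, String model,
                                  String text, String finishReason) {
        String finish = finishReason == null ? "null" : jsonString(finishReason);
        return "{\"id\":" + jsonString(id)
                + ",\"object\":\"text_completion\""
                + ",\"created\":" + created
                + ",\"model\":" + jsonString(model)
                + ",\"choices\":[{\"text\":" + jsonString(text)
                + ",\"index\":0,\"finish_reason\":" + finish + "}]}";
    }

    /** Wraps a payload as a single SSE data event; the blank line terminates the event. */
    static String dataEvent(String payload) {
        return "data: " + payload + "\n\n";
    }

    /** Terminal sentinel that tells OpenAI-style clients the stream is complete. */
    static String done() {
        return dataEvent("[DONE]");
    }

    /** Assumed heartbeat form: an SSE comment line (leading colon). */
    static String heartbeat() {
        return ": keep-alive\n\n";
    }

    public static void main(String[] args) {
        System.out.print(dataEvent(completionChunk("cmpl-demo", 0L, "demo-model", "Hello", null)));
        System.out.print(heartbeat());
        System.out.print(dataEvent(completionChunk("cmpl-demo", 0L, "demo-model", "", "stop")));
        System.out.print(done());
    }
}
```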

Verified: new streaming HTTP test + 16 mapper + 36 server + 40 adjacent server tests green; Spotless +
Javadoc clean. TODO.md updated: corrects the stale "new native method needed" note, marks /v1/completions
done, and adds a grounded future-modality (audio/image OUTPUT) design note — llama.cpp generates text only,
so that surface stays a documented extension point rather than speculative dead code.

Remaining consumers (same pattern, follow-ups): token-streaming Ollama /api/generate and Continue's
native POST /completion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bernardladenthin bernardladenthin merged commit 1302ff0 into bernardladenthin:main Jun 21, 2026
23 of 33 checks passed
vaiju1981 pushed a commit to vaiju1981/java-llama.cpp that referenced this pull request Jun 22, 2026
…adenthin#266 regression)

LlamaArchitectureTest.layeredArchitecture was already failing on main (not
introduced by the TTS work): the streaming-completions merge (bernardladenthin#266) added
LlamaModelBackend (server layer) reads of StopReason / LlamaOutput (value layer),
but the Value layer's mayOnlyBeAccessedByLayers list — documented as "the EXACT
set of packages that reference it today" — was not updated. Add "Server" to it,
the same maintenance the rule's own javadoc prescribes.

Unrelated to TTS but folded in here because it blocks PR bernardladenthin#268's CI; kept as its
own commit so it can be cherry-picked to main independently.
vaiju1981 pushed a commit to vaiju1981/java-llama.cpp that referenced this pull request Jun 22, 2026
…ladenthin#266/bernardladenthin#267 findings

SpotBugs (effort=Max) flagged 5 Low/High findings; all are established
false-positive categories already suppressed elsewhere with the same rationale:

This PR (TextToSpeech, a native-handle wrapper like LlamaModel):
- IMC_IMMATURE_CLASS_NO_TOSTRING — only field is the opaque native handle; a
  toString would emit just a pointer (mirrors the LlamaModelBackend suppression).
- WEM_WEAK_EXCEPTION_MESSAGING on synthesize() — fixed "TextToSpeech is closed"
  precondition guard (mirrors the server request-parser guards).

Pre-existing on main from the merged bernardladenthin#266/bernardladenthin#267 (this branch inherits them via the
rebase; main is also red on them):
- OpenAiRequestMapper.toCompletionParameters WEM — same input-validation guard as
  the already-suppressed toInferenceParameters; extended the existing Or-block.
- ContentPart.inputAudio IMPROPER_UNICODE + LSC_LITERAL_STRING_COMPARISON — the
  canonical toLowerCase(Locale.ROOT)+equals format validation over a fixed ASCII
  pair; same false-positive class as the server.* IMPROPER_UNICODE block.

Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs).
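
The toLowerCase(Locale.ROOT)+equals idiom mentioned for ContentPart.inputAudio can be illustrated with a small sketch. The class and method names are hypothetical, and the "wav"/"mp3" pair is an assumption chosen for illustration; the commit message only says the check covers a fixed ASCII pair. Locale.ROOT is what makes the comparison locale-independent: under a Turkish default locale, for example, a plain toLowerCase() maps "I" to a dotless "ı".

```java
import java.util.Locale;

public class AudioFormatCheckSketch {

    /**
     * Case-insensitive check against a fixed ASCII pair.
     * The accepted values ("wav", "mp3") are assumed for this sketch.
     */
    static boolean isSupportedFormat(String format) {
        if (format == null) {
            return false;
        }
        // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
        String normalized = format.toLowerCase(Locale.ROOT);
        return "wav".equals(normalized) || "mp3".equals(normalized);
    }

    public static void main(String[] args) {
        System.out.println(isSupportedFormat("WAV"));
        System.out.println(isSupportedFormat("ogg"));
    }
}
```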
3 participants