Release xllama 0.2.0 · gianlucamazza/xllama

xllama 0.2.0 — 2026-05-22

Highlights

Persistent inference session (no more per-turn model reload)
xllama::Session keeps the model and tokenizer loaded across chat turns. After the first message (~1–2 s cold load), subsequent turns jump straight to generation — no "Loading model..." pause. The session is transparently rebuilt if the model changes via Settings.

Multi-turn chat + persistent history
Full ChatML prompt construction with context trimming, conversation persistence to LocalState/chats/ (JSON), history browser overlay, and system prompt editor.

Bundled SmolLM2-360M-Instruct INT4 CPU (~403 MB)
Ships inside the MSIX; model is copied to LocalState on first launch. No external download required for the base model.

USB and in-app HF download fallbacks
Three-step model bootstrap: LocalState → InstalledPath bundle → USB stick (E:\xllama\models\<name>) → Hugging Face HttpClient download. Enables larger models (SmolLM2-1.7B, ~1.4 GB) via USB without MSIX rebuild.

Bench diagnostics
Each bench run now logs prompt=N tok, max_length=M (new≤K) so n-token counts in CSV are self-explanatory. Bench cap raised to 512 new tokens (was 128).

Performance baseline (Xbox Series S, CPU EP)

Model	tok/s	RAM peak
SmolLM2-360M INT4	69–73	680 MB
SmolLM2-1.7B INT4	23.6	2195 MB

Fixed

weakly_canonical: Access is denied crash: merge ONNX external data into monolithic model.onnx to bypass ORT AppContainer path-walking.
OgaModelPtr typo in OrtSession UWP build (MSVC C2065).
ASCII-safe status strings (removed em-dash / ellipsis that caused MSVC C4566).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xllama 0.2.0

Choose a tag to compare

Sorry, something went wrong.