Anna

Anna is a local inference runtime and OpenAI-compatible API server focused on PyTorch/XPU execution for Intel Arc Alchemist . The current project can serve Qwen3.5 text-generation checkpoints, Gemma 4 multimodal checkpoints, and Qwen3-TTS speech synthesis checkpoints from local model directories. Qwen3.5 model directories can now be loaded either from Hugging Face-style configs / safetensors files or from GGUF checkpoints with an optional mmproj-*.gguf vision tower file.

Features

OpenAI-compatible routes:
- GET /healthz
- GET /v1/models
- POST /v1/chat/completions
- POST /v1/completions
- POST /v1/audio/speech
Non-streaming and streaming generation for chat and text completions
Optional reasoning separation through reasoning_content
Tool/function calling for chat models, including streamed tool-call deltas
Local multimodal chat input support
- Qwen3.5: text, image, video
- Gemma4: text, image, video, audio
Local speech synthesis for qwen3_tts through API and anna-speak
CLI entry points: anna-serve, anna-generate, anna-bench, anna-speak
XPU-oriented runtime controls:
- torch.compile
- prefill chunking
- exact prompt KV-cache reuse
- TurboQuant KV-cache compression
- runtime int4 weight quantization
- Qwen MoE expert offload and expert caching
- continuous batching
- optional SYCL fused custom ops
Health, memory, cache, and request metrics exposed through /healthz and terminal logs

Supported Model Families

Anna selects the runtime from the top-level model_type in config.json:

qwen3_tts -> Qwen3-TTS runtime
gemma4 -> Gemma4 runtime
everything else -> Qwen3.5 text runtime

Family	`config.json` top-level `model_type`	Other recognized model types/configs	Capabilities	Main commands
Qwen3.5 text runtime	Any non-`qwen3_tts` and non-`gemma4` value; known values include `qwen3_5`, `qwen3_5_text`, `qwen3_5_moe`, `qwen3_5_vl`	Optional `vision_config`; Qwen quantization configs such as AWQ and AutoRound are parsed from model files	Text completions, chat completions, streaming, reasoning output, tool calling, multimodal chat with image/video	`anna-serve`, `anna-generate`, `anna-bench`
Gemma4 runtime	`gemma4`	`text_config.model_type=gemma4_text`; optional `vision_config.model_type=gemma4_vision`; optional `audio_config.model_type=gemma4_audio`	Text completions, chat completions, streaming, reasoning output, tool calling, multimodal chat with image/video/audio	`anna-serve`, `anna-generate`, `anna-bench`
Qwen3-TTS runtime	`qwen3_tts`	Runtime `tts_model_type` can be `base`, `custom_voice`, or `voice_design`	Speech synthesis through API or CLI	`anna-serve`, `anna-speak`

Notes:

anna-generate and anna-bench are only for text-generation families (qwen3_5_text and gemma4).
anna-speak is only for qwen3_tts.
/v1/audio/speech only succeeds when the loaded model supports speech synthesis.
anna-bench supports --image and --video, but not audio benchmarking.
Multimodal media input is available through chat-style message content arrays and the benchmark CLI; /v1/completions remains text-only.

Requirements

Python 3.11+
A local model directory containing either:
- config.json, tokenizer files, and weights
- or a Qwen GGUF model file plus optional mmproj-*.gguf
PyTorch installed separately for your platform
- CPU works for debugging and tests
- Intel Arc + a PyTorch build with xpu support is the intended runtime path
imageio and imageio-ffmpeg for video input support
qwen-tts for qwen3_tts models
To build the optional fused XPU operator:
- Intel oneAPI DPC++/C++ Compiler
- Visual Studio Build Tools on Windows

The Python package dependencies above are installed by pip install -e .; PyTorch and oneAPI are environment-specific and should be installed separately.

Installation

1. Clone and enter the repository

git clone <your-repo-url> Anna
cd Anna

2. Create and activate a virtual environment

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -U pip

Linux:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

3. Install Anna

python -m pip install -e .

Optional development dependencies:

python -m pip install -e ".[dev]"

4. Install and verify PyTorch

Install a PyTorch build that matches your target device. If you plan to use Intel Arc, install a build with xpu support and verify it:

python -c "import torch; print(torch.__version__); print(hasattr(torch, 'xpu')); print(torch.xpu.is_available() if hasattr(torch, 'xpu') else False)"

5. Build the fused XPU operator

This is recommended for XPU performance work, but not required for basic CPU use.

python tools/build_gated_delta_fused_op.py

The build output is written under .build/anna_gated_delta_fused.

Example Local Model Directories

If you already downloaded checkpoints into the repository-local models/ directory, examples currently present in this workspace include:

models/Intel/Qwen3___5-2B-int4-AutoRound
models/Intel/Qwen3___5-35B-A3B-int4-AutoRound
models/google/gemma-4-E4B-it
models/Qwen/Qwen3-TTS-12Hz-1___7B-Base

Any command below expects --model-dir to point at a directory like the ones above.

Quick Start

Serve an OpenAI-compatible API

Minimal example:

anna-serve --model-dir <path-to-model> --host 127.0.0.1 --port 8000

Typical XPU-oriented example:

anna-serve \
  --model-dir <path-to-model> \
  --device xpu \
  --dtype bfloat16 \
  --kv-cache-quantization turboquant \
  --kv-cache-quant-bits 4 \
  --weight-quant auto \
  --prompt-cache-size 4

For large Qwen MoE models, you can additionally tune options such as:

--offload-mode experts
--expert-quant int4
--cached-experts-per-layer <N> (64 recommended)
--resident-expert-layers <N>

Generate text locally

anna-generate is a prompt-to-text CLI for text-generation families:

anna-generate --model-dir <path-to-text-model> --prompt "Write a short summary of KV cache."

Benchmark a model

Text benchmark:

anna-bench --model-dir <path-to-text-model> --prompt "Explain grouped-query attention." --warmup 1 --runs 3

Image benchmark:

anna-bench --model-dir <path-to-multimodal-model> --prompt "Describe the image." --image ./demo.png

Video benchmark:

anna-bench --model-dir <path-to-multimodal-model> --prompt "Summarize the clip." --video ./demo.mp4

Synthesize speech with Qwen3-TTS

Base voice-clone model:

anna-speak \
  --model-dir <path-to-qwen3-tts-base> \
  --input "Hello from Anna." \
  --output out.wav \
  --ref-audio ref.wav \
  --ref-text "Reference transcript."

CustomVoice model:

anna-speak \
  --model-dir <path-to-qwen3-tts-custom> \
  --input "Hello from Anna." \
  --output out.wav \
  --speaker Vivian \
  --instruct "Speak with energy."

VoiceDesign model:

anna-speak \
  --model-dir <path-to-qwen3-tts-voice-design> \
  --input "Hello from Anna." \
  --output out.wav \
  --instruct "A calm, warm female voice."

The CLI supports wav and flac output through --response-format.

OpenAI-Compatible API

The server always exposes the same routes, but request success depends on the loaded model family.

Routes

GET /healthz
GET /v1/models
POST /v1/chat/completions
POST /v1/completions
POST /v1/audio/speech

Example requests

List models:

curl http://127.0.0.1:8000/v1/models

Chat completion:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Write a haiku about Intel Arc."}],
    "reasoning_format": "deepseek"
  }'

Text completion:

curl -X POST http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "prompt": "Anna is",
    "max_tokens": 64
  }'

Speech synthesis with a Base voice-clone TTS model:

curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "input": "Hello from Anna.",
    "ref_audio": "ref.wav",
    "ref_text": "Reference transcript.",
    "response_format": "wav"
  }' \
  --output speech.wav

For custom_voice models, pass speaker or voice; for voice_design models, pass instruct.

Chat requests also accept:

multimodal content arrays with text, image_url, video_url, and audio_url parts
tools, tool_choice, and parallel_tool_calls
enable_thinking or chat_template_kwargs.enable_thinking

Useful Runtime Flags

Area	Options	Notes
Common	`--device`, `--dtype`, `--compile-mode`, `--compile-fullgraph`	`device` supports `auto`, `cpu`, and `xpu`; `dtype` supports `auto`, `float32`, `float16`, and `bfloat16`
Prefill and prompt cache	`--prefill-chunk-size`, `--prompt-cache-size`, `--prompt-cache-max-tokens`, `--profile-runtime`	Available on text-generation CLIs and server
KV cache	`--kv-cache-quantization`, `--kv-cache-quant-bits`, `--kv-cache-residual-len`	TurboQuant is supported by the Qwen3.5 and Gemma4 runtimes
Qwen large-model controls	`--offload-mode`, `--offload-vision`, `--expert-quant`, `--weight-quant`, `--resident-expert-layers`, `--resident-expert-layer-indices`, `--cached-experts-per-layer`	Mainly useful for large Qwen3.5 MoE or vision-capable checkpoints
Server defaults and safety	`--enable-thinking`, `--disable-thinking`, `--reasoning-format`, `--max-completion-tokens`, `--min-free-memory-mib`, `--reserve-memory-mib`, `--max-estimated-usage-ratio`, `--generation-memory-safety-factor`	Controls default chat behavior and request admission
Server batching and metrics	`--scheduler-max-batch-size`, `--scheduler-batch-wait-ms`, `--metrics-log-interval-seconds`	Continuous batching is enabled only when batch size is above `1`
Speech synthesis	`--language`, `--speaker`, `--instruct`, `--ref-audio`, `--ref-text`, `--x-vector-only-mode`, `--response-format`	`anna-speak` supports `wav` and `flac`; the API also supports `pcm`

Development

Run the test suite:

python -m pytest

If you are working on XPU fused kernels, build the custom operator first and then use the project CLIs or tests against the compiled library in .build/anna_gated_delta_fused.

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
src/anna		src/anna
tests		tests
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Anna

Features

Supported Model Families

Requirements

Installation

1. Clone and enter the repository

2. Create and activate a virtual environment

3. Install Anna

4. Install and verify PyTorch

5. Build the fused XPU operator

Example Local Model Directories

Quick Start

Serve an OpenAI-compatible API

Generate text locally

Benchmark a model

Synthesize speech with Qwen3-TTS

OpenAI-Compatible API

Routes

Example requests

Useful Runtime Flags

Development

About

Uh oh!

Releases 10

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Anna

Features

Supported Model Families

Requirements

Installation

1. Clone and enter the repository

2. Create and activate a virtual environment

3. Install Anna

4. Install and verify PyTorch

5. Build the fused XPU operator

Example Local Model Directories

Quick Start

Serve an OpenAI-compatible API

Generate text locally

Benchmark a model

Synthesize speech with Qwen3-TTS

OpenAI-Compatible API

Routes

Example requests

Useful Runtime Flags

Development

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 10

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages