5 changes: 3 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
# Yasha

Self-hosted, multi-model AI inference server. Runs LLMs alongside specialized models (TTS, speech-to-text, embeddings) on one or more GPUs, exposing an OpenAI-compatible API. Built on [vLLM](https://github.com/vllm-project/vllm) and [Ray](https://github.com/ray-project/ray).
Self-hosted, multi-model AI inference server. Runs LLMs alongside specialized models (TTS, speech-to-text, embeddings, image generation) on one or more GPUs, exposing an OpenAI-compatible API. Built on [vLLM](https://github.com/vllm-project/vllm) and [Ray](https://github.com/ray-project/ray).

## Architecture

@@ -36,7 +36,7 @@ Each model runs as an isolated Ray Serve deployment with its own lifecycle, heal

## Features

- **Multi-model on a single GPU** — run chat, embedding, STT, and TTS models simultaneously with tunable per-model GPU memory allocation
- **Multi-model on a single GPU** — run chat, embedding, STT, TTS, and image generation models simultaneously with tunable per-model GPU memory allocation
- **Per-model isolated deployments** — each model runs in its own Ray Serve deployment with independent lifecycle, health checks, and failure isolation
- **OpenAI-compatible API** — drop-in replacement for any OpenAI SDK client
- **Streaming** — SSE streaming for chat completions and TTS audio
@@ -55,6 +55,7 @@ Each model runs as an isolated Ray Serve deployment with its own lifecycle, heal
| `POST /v1/audio/transcriptions` | Speech-to-text |
| `POST /v1/audio/translations` | Audio translation |
| `POST /v1/audio/speech` | Text-to-speech (SSE streaming or single-response) |
| `POST /v1/images/generations` | Image generation |
| `GET /v1/models` | List available models |
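
Assuming the new images endpoint follows the OpenAI Images API response convention (a `data` list of `b64_json` entries; this diff does not show the response schema, so treat the shape as an assumption), a client could decode a returned image like so:

```python
import base64

# Illustrative response payload; a real one comes from
# POST /v1/images/generations. The "b64_json" field name follows the
# OpenAI Images API convention and is an assumption here.
response = {
    "created": 1700000000,
    "data": [{"b64_json": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")}],
}

# Decode the first image back to raw PNG bytes.
image_bytes = base64.b64decode(response["data"][0]["b64_json"])
```

Since the API is OpenAI-compatible, any OpenAI SDK client should be able to call the endpoint the same way it would the upstream service.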

## Quick Start
2 changes: 2 additions & 0 deletions docs/architecture.md
@@ -34,6 +34,7 @@ Each deployment uses one of four loaders:
|--------|---------|-----------|
| `vllm` | vLLM engine | Chat/generation, embeddings, transcription, translation |
| `transformers` | PyTorch + HuggingFace | Custom model implementations |
| `diffusers` | HuggingFace Diffusers | Image generation (any `AutoPipelineForText2Image` model) |
| `custom` | Plugin system | TTS backends (Kokoro, Bark, Orpheus) |
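
As a sketch, a minimal `config/models.yaml` entry selecting the new loader might look like this (the model ID is only an example; see the model configuration docs for the full field reference):

```yaml
# Example entry (model ID illustrative); any AutoPipelineForText2Image
# model can be used with loader: diffusers.
- name: "sdxl"
  model: "stabilityai/stable-diffusion-xl-base-1.0"
  usecase: "image"
  loader: "diffusers"
  num_gpus: 0.5
```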

## GPU Allocation
@@ -66,5 +67,6 @@ See [Plugin Development](plugins.md) for details.
| `yasha/infer/model_deployment.py` | Ray Serve deployment actor |
| `yasha/infer/infer_config.py` | Pydantic config models and protocols |
| `yasha/infer/vllm/vllm_infer.py` | vLLM engine wrapper |
| `yasha/infer/diffusers/diffusers_infer.py` | Diffusers pipeline wrapper |
| `yasha/plugins/base_plugin.py` | Plugin base classes |
| `config/models.yaml` | Model configuration |
31 changes: 29 additions & 2 deletions docs/model-configuration.md
@@ -8,13 +8,14 @@ Models are configured in `config/models.yaml`. Each entry defines one deployment
|---|---|---|
| `name` | string | Model identifier used in API requests |
| `model` | string | HuggingFace model ID |
| `usecase` | string | `generate`, `embed`, `transcription`, `translation`, or `tts` |
| `loader` | string | `vllm`, `transformers`, or `custom` |
| `usecase` | string | `generate`, `embed`, `transcription`, `translation`, `tts`, or `image` |
| `loader` | string | `vllm`, `transformers`, `diffusers`, or `custom` |
| `plugin` | string | Plugin module name (required when `loader: custom`); must be installed via `uv sync --extra <plugin>` |
| `num_gpus` | float | Fraction of a GPU to allocate (0.0–1.0); also sets vLLM `gpu_memory_utilization` |
| `num_cpus` | float | CPU units to allocate (default `0.1`) |
| `use_gpu` | int \| string | Pin to a specific GPU (see below) |
| `vllm_engine_kwargs` | object | Passed directly to the vLLM engine — see [vLLM engine args](https://docs.vllm.ai/en/latest/configuration/engine_args.html) |
| `diffusers_config` | object | Diffusers pipeline options (see below) |
| `plugin_config` | object | Plugin-specific options passed through to the plugin |

## GPU Pinning
@@ -29,6 +30,32 @@ Models are configured in `config/models.yaml`. Each entry defines one deployment
The name itself is arbitrary; it only has to match the value in `use_gpu`. The `models.example.2x16GB.yaml` preset uses `"dual_16gb"` for a TP=2 LLM deployment.
- **omit** — Ray schedules the deployment freely across available GPUs

## Diffusers Config

Options for `loader: diffusers` models (image generation via HuggingFace Diffusers):

| Field | Type | Default | Description |
|---|---|---|---|
| `torch_dtype` | string | `float16` | Torch dtype (`float16`, `bfloat16`, `float32`) |
| `num_inference_steps` | int | `30` | Default denoising steps (can be overridden per request) |
| `guidance_scale` | float | `7.5` | Default classifier-free guidance scale (can be overridden per request) |

Any model supported by `AutoPipelineForText2Image` works out of the box — Stable Diffusion 1.5/2.x/XL/3.x, SDXL Turbo, Flux, PixArt, Kandinsky, etc.

Example:

```yaml
- name: "sdxl-turbo"
model: "stabilityai/sdxl-turbo"
usecase: "image"
loader: "diffusers"
num_gpus: 0.35
diffusers_config:
torch_dtype: "float16"
num_inference_steps: 4
guidance_scale: 0.0
```
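
The configured `num_inference_steps` and `guidance_scale` are only defaults; since the table above marks them as overridable per request, a request body could carry the overrides directly (whether these exact field names are accepted at request level is an assumption of this sketch):

```python
import json

# Hypothetical request body for POST /v1/images/generations; the override
# field names mirror diffusers_config and are assumed, not confirmed.
payload = {
    "model": "sdxl-turbo",
    "prompt": "a lighthouse at dusk, oil painting",
    "num_inference_steps": 1,  # override the configured default of 4
    "guidance_scale": 0.0,
}
body = json.dumps(payload)
```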

## Environment Variables

| Variable | Description | Default |
6 changes: 4 additions & 2 deletions pyproject.toml
@@ -1,13 +1,13 @@
[project]
name = "yasha"
version = "0.1.14"
description = "Self-hosted, multi-model AI inference server. Run LLMs, TTS, STT, and embeddings with an OpenAI-compatible API."
description = "Self-hosted, multi-model AI inference server. Run LLMs, TTS, STT, embeddings, and image generation with an OpenAI-compatible API."
authors = [
{ name = "Alex Margarit" }
]
readme = { file = "README.md", content-type = "text/markdown" }
license = { text = "MIT" }
keywords = ["ai", "inference", "vllm", "ray", "openai", "llm", "tts", "stt", "embeddings", "self-hosted"]
keywords = ["ai", "inference", "vllm", "ray", "openai", "llm", "tts", "stt", "embeddings", "image-generation", "diffusers", "self-hosted"]
classifiers = [
"Development Status :: 3 - Alpha",
"License :: OSI Approved :: MIT License",
@@ -34,6 +34,8 @@ dependencies = [
"vllm==0.18.0",
"vllm[audio]==0.18.0",
"requests>=2.32.5",
"accelerate>=1.6.0",
"diffusers>=0.31.0",
]

[project.optional-dependencies]
32 changes: 15 additions & 17 deletions tests/test_config.py
@@ -18,6 +18,7 @@ def test_minimal_vllm_model(self):
name="test-llm",
model="some-org/some-model",
usecase=ModelUsecase.generate,
loader=ModelLoader.vllm,
)
assert config.name == "test-llm"
assert config.loader == ModelLoader.vllm
@@ -46,35 +47,27 @@ def test_custom_loader_with_plugin(self):
def test_custom_loader_plugin_only(self):
config = YashaModelConfig(
name="test-tts",
model="some-model",
usecase=ModelUsecase.tts,
loader=ModelLoader.custom,
plugin="kokoro",
)
assert config.model is None
assert config.plugin == "kokoro"

def test_vllm_loader_requires_model(self):
with pytest.raises(ValidationError, match="cannot be both empty"):
def test_model_required(self):
with pytest.raises(ValidationError, match="Field required"):
YashaModelConfig(
name="test-llm",
usecase=ModelUsecase.generate,
loader=ModelLoader.vllm,
)

def test_transformers_loader_requires_model(self):
with pytest.raises(ValidationError, match="cannot be both empty"):
def test_loader_required(self):
with pytest.raises(ValidationError, match="Field required"):
YashaModelConfig(
name="test-llm",
model="some-model",
usecase=ModelUsecase.generate,
loader=ModelLoader.transformers,
)

def test_model_and_plugin_both_empty_fails(self):
with pytest.raises(ValidationError, match="cannot be both empty"):
YashaModelConfig(
name="test",
usecase=ModelUsecase.generate,
loader=ModelLoader.custom,
)

def test_gpu_index_with_tensor_parallelism_fails(self):
@@ -83,6 +76,7 @@
name="test-llm",
model="some-model",
usecase=ModelUsecase.generate,
loader=ModelLoader.vllm,
use_gpu=0,
vllm_engine_kwargs=VllmEngineConfig(tensor_parallel_size=2),
)
@@ -92,6 +86,7 @@ def test_gpu_index_with_tp1_ok(self):
name="test-llm",
model="some-model",
usecase=ModelUsecase.generate,
loader=ModelLoader.vllm,
use_gpu=0,
vllm_engine_kwargs=VllmEngineConfig(tensor_parallel_size=1),
)
@@ -102,6 +97,7 @@ def test_named_gpu_resource_with_tp(self):
name="test-llm",
model="some-model",
usecase=ModelUsecase.generate,
loader=ModelLoader.vllm,
use_gpu="dual_16gb",
vllm_engine_kwargs=VllmEngineConfig(tensor_parallel_size=2),
)
@@ -112,6 +108,7 @@ def test_gpu_allocation_fraction(self):
name="test-llm",
model="some-model",
usecase=ModelUsecase.generate,
loader=ModelLoader.vllm,
num_gpus=0.70,
)
assert config.num_gpus == 0.70
@@ -122,16 +119,15 @@ def test_all_usecases_valid(self):
name=f"test-{usecase.value}",
model="some-model",
usecase=usecase,
loader=ModelLoader.vllm,
)
assert config.usecase == usecase

def test_all_loaders_valid(self):
for loader in ModelLoader:
kwargs = {"name": "test", "usecase": ModelUsecase.generate}
kwargs = {"name": "test", "model": "some-model", "usecase": ModelUsecase.generate}
if loader == ModelLoader.custom:
kwargs["plugin"] = "test-plugin"
else:
kwargs["model"] = "some-model"
config = YashaModelConfig(loader=loader, **kwargs)
assert config.loader == loader

@@ -165,10 +161,12 @@ def test_multi_model_config(self):
name="llm",
model="some-org/some-llm",
usecase=ModelUsecase.generate,
loader=ModelLoader.vllm,
num_gpus=0.70,
),
YashaModelConfig(
name="tts",
model="some-model",
usecase=ModelUsecase.tts,
loader=ModelLoader.custom,
plugin="kokoro",
42 changes: 42 additions & 0 deletions uv.lock

8 changes: 8 additions & 0 deletions yasha/infer/custom/custom_infer.py
@@ -12,6 +12,7 @@
EmbeddingRequest,
ErrorInfo,
ErrorResponse,
ImageGenerationRequest,
RawSpeechResponse,
SpeechRequest,
TranscriptionRequest,
@@ -79,3 +80,10 @@ async def create_speech(
error=ErrorInfo(message="model does not support this action", type="invalid_request_error", code=404)
)
return await self.serving_speech.create_speech(request, cast("Request", raw_request))

async def create_image_generation(
self, _request: ImageGenerationRequest, _raw_request: DisconnectProxy
) -> ErrorResponse:
return ErrorResponse(
error=ErrorInfo(message="model does not support this action", type="invalid_request_error", code=404)
)