
LLMSim Library and Server #1

Merged

chaliy merged 10 commits into main from claude/implement-plan-sections-1-3-sayVs on Dec 26, 2025

Conversation

chaliy (Owner) commented on Dec 26, 2025

What

Implement Phases 1-3 of the LLMSim project plan - a lightweight, high-performance LLM API simulator that replicates the traffic shape of real LLM APIs without running actual models.

Why

Load testing applications that integrate with LLMs is challenging due to:

  • Cost: Real API calls during load tests are expensive
  • Rate Limits: Production APIs have rate limits that prevent realistic load testing
  • Inconsistency: Real models produce variable responses, making test reproducibility difficult
  • Traffic Shape: LLM responses have unique characteristics (streaming, variable latency, token-based billing) that generic mock servers don't replicate

How

Phase 1: Project Foundation

  • Initialized Cargo workspace with llmsim (library) and llmsim-server (binary) crates
  • Defined OpenAI-compatible API types (sketched after this list):
    • ChatCompletionRequest/Response for chat completions
    • ChatCompletionChunk for streaming responses
    • Message, Role, Usage, ToolCall, Function
    • ErrorResponse for OpenAI-style error handling
  • Set up GitHub Actions CI workflow (format, lint, test, build)
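
For orientation, a trimmed-down sketch of roughly what the request-side types look like. The field set here is abbreviated and illustrative; the actual definitions in the crate carry the full OpenAI schema (tools, usage, streaming chunks, errors):

```rust
use serde::{Deserialize, Serialize};

// Abbreviated sketch of the chat completion request shape; field names follow
// the OpenAI API, but the real types define many more optional fields.
#[derive(Debug, Serialize, Deserialize)]
pub struct ChatCompletionRequest {
    pub model: String,
    pub messages: Vec<Message>,
    #[serde(default)]
    pub stream: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub max_tokens: Option<u32>,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Message {
    pub role: Role,
    pub content: String,
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}
```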

Phase 2: Core Library

  • Token Counter (tokens.rs): Accurate token counting using tiktoken-rs with model-specific encodings (o200k, cl100k, p50k, r50k); see the sketch after this list
  • Latency Profiles (latency.rs): Configurable TTFT and inter-token delays for GPT-4, GPT-4o, GPT-3.5, Claude Opus/Sonnet/Haiku, Gemini
  • Response Generators (generator.rs): Multiple strategies - lorem, echo, fixed, random, sequence
  • Streaming Engine (stream.rs): SSE streaming with realistic timing simulation
  • Error Injection (errors.rs): Configurable rate limits (429), server errors (500/503), timeouts
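
As a rough illustration of the counting approach (not the exact code in tokens.rs), model names can be mapped to a tiktoken-rs encoding and the prompt tokenized with it:

```rust
use tiktoken_rs::{cl100k_base, o200k_base};

// Pick an encoding by model family, then count tokens in the text.
// Prefix matching and error handling are simplified for illustration.
fn count_tokens(model: &str, text: &str) -> usize {
    let bpe = if model.starts_with("gpt-4o") {
        o200k_base().expect("failed to load o200k_base encoding")
    } else {
        cl100k_base().expect("failed to load cl100k_base encoding")
    };
    bpe.encode_with_special_tokens(text).len()
}

fn main() {
    println!("{} tokens", count_tokens("gpt-4o", "Hello from LLMSim!"));
}
```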

Phase 3: Server Binary

  • Axum HTTP server with graceful shutdown
  • OpenAI-compatible endpoints (router sketched after this list):
    • GET /health - Health check
    • POST /v1/chat/completions - Chat completions (streaming + non-streaming)
    • GET /v1/models - List available models
    • GET /v1/models/:id - Get specific model
  • YAML configuration with environment variable overrides
  • Docker and docker-compose support
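
A minimal sketch of the route layout (axum 0.7-style paths, with stub handlers standing in for the real ones):

```rust
use axum::{
    extract::Path,
    routing::{get, post},
    Json, Router,
};
use serde_json::{json, Value};

async fn health() -> &'static str {
    "ok"
}

// Stub handlers only; the real ones build simulated completions,
// apply latency profiles, and support SSE streaming.
async fn chat_completions(Json(req): Json<Value>) -> Json<Value> {
    Json(json!({ "object": "chat.completion", "model": req["model"] }))
}

async fn list_models() -> Json<Value> {
    Json(json!({ "object": "list", "data": [] }))
}

async fn get_model(Path(id): Path<String>) -> Json<Value> {
    Json(json!({ "id": id, "object": "model" }))
}

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/health", get(health))
        .route("/v1/chat/completions", post(chat_completions))
        .route("/v1/models", get(list_models))
        .route("/v1/models/:id", get(get_model));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```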

Test Plan

  • [x] All 54 unit tests pass
  • [x] Code compiles without errors
  • [x] cargo fmt --check passes
  • [x] cargo clippy passes (only minor dead-code warnings for future use)
  • [ ] Manual testing: Start the server and test with curl/the OpenAI SDK (see the sketch below)
  • [ ] Docker build and run
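
The curl check can also be done from Rust; a throwaway reqwest snippet like the following (reqwest with the `json` feature enabled, server assumed to listen on localhost:8080) is enough to eyeball a non-streaming response:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    // Send a minimal non-streaming chat completion request to the simulator.
    let resp = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&json!({
            "model": "gpt-4o",
            "messages": [{ "role": "user", "content": "Hello" }]
        }))
        .send()
        .await?;
    println!("status: {}", resp.status());
    println!("body: {}", resp.text().await?);
    Ok(())
}
```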

Risk

  • Low
  • This is the initial implementation with no breaking changes to existing code
  • All new code with comprehensive test coverage

Checklist

  • [x] Tests added (54 unit tests)
  • [x] CI workflow configured
  • [x] Docker support added
  • [x] Configuration documented in code

This commit implements the foundation of the LLMSim project:

Phase 1 - Project Foundation:
- Initialize Cargo workspace with llmsim (library) and llmsim-server (binary)
- Define OpenAI-compatible API types (ChatCompletionRequest/Response, streaming)
- Set up CI workflow for format, lint, test, and build

Phase 2 - Core Library:
- Token counter using tiktoken-rs for accurate token counting
- Latency profiles for GPT-4, GPT-3.5, Claude, and Gemini models
- Response generators (lorem, echo, fixed, random, sequence)
- Streaming engine with SSE support and realistic timing
- Error injection for testing (rate limits, server errors, timeouts)

Phase 3 - Server Binary:
- Axum HTTP server with graceful shutdown
- POST /v1/chat/completions endpoint (streaming and non-streaming)
- GET /v1/models endpoint listing available models
- YAML configuration with environment variable overrides
- Docker and docker-compose support for containerization

All 54 tests pass and the code compiles cleanly.
chaliy changed the title from "Implement plan sections one through three" to "LLMSim Library and Server" on Dec 26, 2025

- Remove GPT-3.5-turbo support
- Add GPT-5 variants: gpt-5, gpt-5-mini, gpt-5-nano
- Add GPT-5.1/5.2 series support
- Add O-series reasoning models (o1, o3)
- Update latency profiles with GPT-5 characteristics:
  - gpt-5: 600ms TTFT, 40ms TBT (flagship)
  - gpt-5-mini: 300ms TTFT, 20ms TBT (lightweight)
  - gpt-5-nano: 150ms TTFT, 10ms TBT (speed optimized)
  - o-series: 2000ms TTFT (reasoning overhead)
- Update token counting to use o200k_base for GPT-5
- Add Claude 4 models to default model list
- Update default latency to GPT-5 profile

Based on models.dev API and OpenAI documentation.

Remove models not in models.dev, add correct model IDs:

GPT-5 family:
- gpt-5, gpt-5-mini, gpt-5-codex
- gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-codex-max
- gpt-5.2
- Removed: gpt-5-nano, gpt-5.1-mini, gpt-5.1-nano (don't exist)

O-series reasoning models:
- o3, o3-mini, o4-mini
- Removed: o1, o1-mini, o1-preview (not in models.dev)

GPT-4 family:
- Added: gpt-4.1

Claude family (updated to models.dev naming):
- claude-3.5-sonnet, claude-3.7-sonnet
- claude-sonnet-4, claude-sonnet-4.5
- claude-opus-4, claude-opus-4.5
- claude-haiku-4.5

Reorganize workspace to use standard crates/ folder structure:
- crates/llmsim/ - core library
- crates/llmsim-server/ - HTTP server binary

Updated Cargo.toml to use `members = ["crates/*"]` glob pattern.
Updated Dockerfile paths accordingly.

- Convert from workspace with two crates to single crate with lib + bin
- Add `llmsim serve` subcommand using clap for CLI
- Move server code to src/cli/ module
- Update imports to use crate:: instead of llmsim::
- Update Dockerfile for new single binary structure
- Update PLAN.md with new architecture and mark completed items
- Add specs/architecture.md documenting new structure

Usage: `llmsim serve --port 8080 --latency-profile gpt5`
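
The CLI shape is roughly as sketched below (field names are illustrative; the real definitions live in src/cli/):

```rust
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "llmsim")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Start the HTTP simulator
    Serve {
        #[arg(long, default_value_t = 8080)]
        port: u16,
        #[arg(long, default_value = "gpt5")]
        latency_profile: String,
    },
}

fn main() {
    match Cli::parse().command {
        Command::Serve { port, latency_profile } => {
            // The real subcommand builds the Axum router and blocks on the server here.
            println!("would serve on port {port} with latency profile {latency_profile}");
        }
    }
}
```
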
Latency is now automatically derived from the model field in each
chat completion request, making the --latency-profile CLI option redundant.

The LatencyProfile::from_model() function maps models to appropriate
latency profiles (e.g., gpt-5 → 600ms TTFT, o3 → 2000ms TTFT).
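
Conceptually the mapping looks something like the sketch below; the numbers mirror the profiles listed earlier, and the inter-token delay for the o-series is an assumption here (the commit only specifies its TTFT):

```rust
use std::time::Duration;

pub struct LatencyProfile {
    pub ttft: Duration,        // time to first token
    pub inter_token: Duration, // delay between streamed tokens
}

impl LatencyProfile {
    // Illustrative prefix matching only; the real table in latency.rs covers more models.
    pub fn from_model(model: &str) -> Self {
        let (ttft_ms, tbt_ms) = if model.starts_with("o3") || model.starts_with("o4") {
            (2000, 40) // reasoning overhead before the first token; 40ms TBT assumed
        } else if model.starts_with("gpt-5-mini") {
            (300, 20)
        } else {
            (600, 40) // gpt-5 flagship default
        };
        Self {
            ttft: Duration::from_millis(ttft_ms),
            inter_token: Duration::from_millis(tbt_ms),
        }
    }
}
```
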
chaliy merged commit 822dae0 into main on Dec 26, 2025
7 checks passed
chaliy deleted the claude/implement-plan-sections-1-3-sayVs branch on December 26, 2025 at 05:52
chaliy added a commit that referenced this pull request Dec 26, 2025