Merged
Conversation
This commit implements the foundation of the LLMSim project.

Phase 1 - Project Foundation:
- Initialize Cargo workspace with llmsim (library) and llmsim-server (binary)
- Define OpenAI-compatible API types (ChatCompletionRequest/Response, streaming)
- Set up CI workflow for format, lint, test, and build

Phase 2 - Core Library:
- Token counter using tiktoken-rs for accurate token counting
- Latency profiles for GPT-4, GPT-3.5, Claude, and Gemini models
- Response generators (lorem, echo, fixed, random, sequence)
- Streaming engine with SSE support and realistic timing
- Error injection for testing (rate limits, server errors, timeouts)

Phase 3 - Server Binary:
- Axum HTTP server with graceful shutdown
- POST /v1/chat/completions endpoint (streaming and non-streaming)
- GET /v1/models endpoint listing available models
- YAML configuration with environment variable overrides
- Docker and docker-compose support for containerization

All 54 tests pass and the code compiles cleanly.
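To make the "OpenAI-compatible API types" concrete, here is a minimal sketch of what the request types might look like. The field set and names follow the public OpenAI chat completions schema; they are illustrative, not the crate's actual definitions.

```rust
use serde::{Deserialize, Serialize};

/// Role of a chat message.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}

/// A single message in the conversation.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Message {
    pub role: Role,
    pub content: String,
}

/// Incoming body for POST /v1/chat/completions.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChatCompletionRequest {
    /// Model ID, e.g. "gpt-4"; also drives which latency profile is simulated.
    pub model: String,
    pub messages: Vec<Message>,
    /// When true, the server answers with SSE chunks instead of a single JSON body.
    #[serde(default)]
    pub stream: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub max_tokens: Option<u32>,
}
```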
- Remove GPT-3.5-turbo support
- Add GPT-5 variants: gpt-5, gpt-5-mini, gpt-5-nano
- Add GPT-5.1/5.2 series support
- Add O-series reasoning models (o1, o3)
- Update latency profiles with GPT-5 characteristics:
  - gpt-5: 600ms TTFT, 40ms TBT (flagship)
  - gpt-5-mini: 300ms TTFT, 20ms TBT (lightweight)
  - gpt-5-nano: 150ms TTFT, 10ms TBT (speed optimized)
  - o-series: 2000ms TTFT (reasoning overhead)
- Update token counting to use o200k_base for GPT-5
- Add Claude 4 models to default model list
- Update default latency to GPT-5 profile

Based on models.dev API and OpenAI documentation.
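The numbers above read naturally as per-model latency profiles. A minimal sketch of how they could be represented, assuming a simple struct with time-to-first-token (TTFT) and time-between-tokens (TBT) fields; the names, constructor, and constants are illustrative rather than the crate's actual API, and the o-series TBT is a placeholder since the commit only specifies its TTFT.

```rust
use std::time::Duration;

/// Timing characteristics used to pace a simulated response.
#[derive(Debug, Clone, Copy)]
pub struct LatencyProfile {
    /// Time to first token.
    pub ttft: Duration,
    /// Time between subsequent tokens.
    pub tbt: Duration,
}

impl LatencyProfile {
    pub const fn new(ttft_ms: u64, tbt_ms: u64) -> Self {
        Self {
            ttft: Duration::from_millis(ttft_ms),
            tbt: Duration::from_millis(tbt_ms),
        }
    }
}

// Values taken from the commit message above.
pub const GPT_5: LatencyProfile = LatencyProfile::new(600, 40); // flagship
pub const GPT_5_MINI: LatencyProfile = LatencyProfile::new(300, 20); // lightweight
pub const GPT_5_NANO: LatencyProfile = LatencyProfile::new(150, 10); // speed optimized
pub const O_SERIES: LatencyProfile = LatencyProfile::new(2000, 40); // TBT not given in the commit; placeholder
```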
Remove models not in models.dev, add correct model IDs:

GPT-5 family:
- gpt-5, gpt-5-mini, gpt-5-codex
- gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-codex-max
- gpt-5.2
- Removed: gpt-5-nano, gpt-5.1-mini, gpt-5.1-nano (don't exist)

O-series reasoning models:
- o3, o3-mini, o4-mini
- Removed: o1, o1-mini, o1-preview (not in models.dev)

GPT-4 family:
- Added: gpt-4.1

Claude family (updated to models.dev naming):
- claude-3.5-sonnet, claude-3.7-sonnet
- claude-sonnet-4, claude-sonnet-4.5
- claude-opus-4, claude-opus-4.5
- claude-haiku-4.5
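For reference, the IDs named in this commit could be kept as a single constant behind `GET /v1/models`. The constant name and grouping below are illustrative, and the list covers only the IDs explicitly mentioned above.

```rust
/// Model IDs named in this change (illustrative constant, not the crate's actual list).
pub const DEFAULT_MODELS: &[&str] = &[
    // GPT-5 family
    "gpt-5", "gpt-5-mini", "gpt-5-codex",
    "gpt-5.1", "gpt-5.1-codex", "gpt-5.1-codex-mini", "gpt-5.1-codex-max",
    "gpt-5.2",
    // O-series reasoning models
    "o3", "o3-mini", "o4-mini",
    // GPT-4 family
    "gpt-4.1",
    // Claude family (models.dev naming)
    "claude-3.5-sonnet", "claude-3.7-sonnet",
    "claude-sonnet-4", "claude-sonnet-4.5",
    "claude-opus-4", "claude-opus-4.5",
    "claude-haiku-4.5",
];
```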
Reorganize workspace to use standard crates/ folder structure:
- crates/llmsim/ - core library
- crates/llmsim-server/ - HTTP server binary

Updated Cargo.toml to use `members = ["crates/*"]` glob pattern. Updated Dockerfile paths accordingly.
- Convert from workspace with two crates to single crate with lib + bin
- Add `llmsim serve` subcommand using clap for CLI
- Move server code to src/cli/ module
- Update imports to use crate:: instead of llmsim::
- Update Dockerfile for new single binary structure
- Update PLAN.md with new architecture and mark completed items
- Add specs/architecture.md documenting new structure

Usage: `llmsim serve --port 8080 --latency-profile gpt5`
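A minimal sketch of how the `serve` subcommand could be declared with clap's derive API as of this commit (the following commit drops `--latency-profile`). The struct, field names, and defaults are illustrative, not necessarily the crate's actual CLI definition.

```rust
use clap::{Parser, Subcommand};

/// Top-level CLI for the single `llmsim` binary.
#[derive(Parser)]
#[command(name = "llmsim")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Start the HTTP simulator server.
    Serve {
        /// Port to listen on.
        #[arg(long, default_value_t = 8080)]
        port: u16,
        /// Named latency profile, e.g. "gpt5".
        #[arg(long, default_value = "gpt5")]
        latency_profile: String,
    },
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::Serve { port, latency_profile } => {
            // The real binary would start the Axum server here.
            println!("would serve on port {port} with latency profile {latency_profile}");
        }
    }
}
```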
Latency is now automatically derived from the model field in each chat completion request, making the CLI option redundant. The LatencyProfile::from_model() function maps models to appropriate latency profiles (e.g., gpt-5 → 600ms TTFT, o3 → 2000ms TTFT).
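A minimal sketch of such a mapping, reusing the illustrative `LatencyProfile` and constants from the earlier sketch. The matching rules are guesses based on the model names mentioned above, not the crate's actual `from_model()` implementation.

```rust
impl LatencyProfile {
    /// Pick a latency profile from the request's `model` field.
    /// Unknown models fall back to the GPT-5 default profile.
    pub fn from_model(model: &str) -> Self {
        match model {
            // O-series reasoning models: long TTFT from reasoning overhead.
            m if m.starts_with("o1") || m.starts_with("o3") || m.starts_with("o4") => O_SERIES,
            m if m.starts_with("gpt-5-nano") => GPT_5_NANO,
            m if m.starts_with("gpt-5-mini") => GPT_5_MINI,
            m if m.starts_with("gpt-5") => GPT_5,
            // Default latency is the GPT-5 profile per the earlier commit.
            _ => GPT_5,
        }
    }
}
```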
chaliy added a commit that referenced this pull request on Dec 26, 2025
## What

Implement Phases 1-3 of the LLMSim project plan - a lightweight, high-performance LLM API simulator that replicates the traffic shape of real LLM APIs without running actual models.

## Why

Load testing applications that integrate with LLMs is challenging due to:

- **Cost**: Real API calls during load tests are expensive
- **Rate Limits**: Production APIs have rate limits that prevent realistic load testing
- **Inconsistency**: Real models produce variable responses, making test reproducibility difficult
- **Traffic Shape**: LLM responses have unique characteristics (streaming, variable latency, token-based billing) that generic mock servers don't replicate

## How

### Phase 1: Project Foundation

- Initialized Cargo workspace with `llmsim` (library) and `llmsim-server` (binary) crates
- Defined OpenAI-compatible API types:
  - `ChatCompletionRequest/Response` for chat completions
  - `ChatCompletionChunk` for streaming responses
  - `Message`, `Role`, `Usage`, `ToolCall`, `Function`
  - `ErrorResponse` for OpenAI-style error handling
- Set up GitHub Actions CI workflow (format, lint, test, build)

### Phase 2: Core Library

- **Token Counter** (`tokens.rs`): Accurate token counting using tiktoken-rs with model-specific encodings (o200k, cl100k, p50k, r50k)
- **Latency Profiles** (`latency.rs`): Configurable TTFT and inter-token delays for GPT-4, GPT-4o, GPT-3.5, Claude Opus/Sonnet/Haiku, Gemini
- **Response Generators** (`generator.rs`): Multiple strategies - lorem, echo, fixed, random, sequence
- **Streaming Engine** (`stream.rs`): SSE streaming with realistic timing simulation
- **Error Injection** (`errors.rs`): Configurable rate limits (429), server errors (500/503), timeouts

### Phase 3: Server Binary

- Axum HTTP server with graceful shutdown
- OpenAI-compatible endpoints:
  - `GET /health` - Health check
  - `POST /v1/chat/completions` - Chat completions (streaming + non-streaming)
  - `GET /v1/models` - List available models
  - `GET /v1/models/:id` - Get specific model
- YAML configuration with environment variable overrides
- Docker and docker-compose support

## Test Plan

- [x] All 54 unit tests pass
- [x] Code compiles without errors
- [x] `cargo fmt --check` passes
- [x] `cargo clippy` passes (only minor dead-code warnings for future use)
- [ ] Manual testing: Start server and test with curl/OpenAI SDK
- [ ] Docker build and run

## Risk

- **Low** - This is the initial implementation with no breaking changes to existing code
- All new code with comprehensive test coverage

## Checklist

- [x] Tests added (54 unit tests)
- [x] CI workflow configured
- [x] Docker support added
- [x] Configuration documented in code

---------

Co-authored-by: Claude <noreply@anthropic.com>
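Since the endpoints are OpenAI-compatible, any OpenAI client should work against the simulator by pointing its base URL at it. A minimal sketch using plain reqwest (assuming reqwest with the `blocking` and `json` features plus serde_json as dependencies, and the server running locally on port 8080); the endpoint path is the one listed above, everything else is illustrative.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Non-streaming chat completion against the local simulator.
    let client = reqwest::blocking::Client::new();
    let body = json!({
        "model": "gpt-4",
        "messages": [{ "role": "user", "content": "Hello from a load test" }],
        "stream": false
    });

    let resp: serde_json::Value = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;

    // The simulator returns an OpenAI-shaped response, so the usual fields apply.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```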