
LLMSim Library and Server #1

Merged

chaliy merged 10 commits into main from claude/implement-plan-sections-1-3-sayVs on Dec 26, 2025

Conversation

chaliy (Owner) commented on Dec 26, 2025

What

Implement Phases 1-3 of the LLMSim project plan - a lightweight, high-performance LLM API simulator that replicates the traffic shape of real LLM APIs without running actual models.

Why

Load testing applications that integrate with LLMs is challenging due to:

  • Cost: Real API calls during load tests are expensive
  • Rate Limits: Production APIs have rate limits that prevent realistic load testing
  • Inconsistency: Real models produce variable responses, making test reproducibility difficult
  • Traffic Shape: LLM responses have unique characteristics (streaming, variable latency, token-based billing) that generic mock servers don't replicate

How

Phase 1: Project Foundation

  • Initialized Cargo workspace with llmsim (library) and llmsim-server (binary) crates
  • Defined OpenAI-compatible API types (sketched after this list):
    • ChatCompletionRequest/Response for chat completions
    • ChatCompletionChunk for streaming responses
    • Message, Role, Usage, ToolCall, Function
    • ErrorResponse for OpenAI-style error handling
  • Set up GitHub Actions CI workflow (format, lint, test, build)
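
For orientation, a trimmed-down sketch of roughly what the request-side types look like. The field set here is abbreviated and illustrative; the actual definitions in the crate carry the full OpenAI schema (tools, usage, streaming chunks, errors):

```rust
use serde::{Deserialize, Serialize};

// Abbreviated sketch of the chat completion request shape; field names follow
// the OpenAI API, but the real types define many more optional fields.
#[derive(Debug, Serialize, Deserialize)]
pub struct ChatCompletionRequest {
    pub model: String,
    pub messages: Vec<Message>,
    #[serde(default)]
    pub stream: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub max_tokens: Option<u32>,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Message {
    pub role: Role,
    pub content: String,
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}
```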

Phase 2: Core Library

  • Token Counter (tokens.rs): Accurate token counting using tiktoken-rs with model-specific encodings (o200k, cl100k, p50k, r50k); see the sketch after this list
  • Latency Profiles (latency.rs): Configurable TTFT and inter-token delays for GPT-4, GPT-4o, GPT-3.5, Claude Opus/Sonnet/Haiku, Gemini
  • Response Generators (generator.rs): Multiple strategies - lorem, echo, fixed, random, sequence
  • Streaming Engine (stream.rs): SSE streaming with realistic timing simulation
  • Error Injection (errors.rs): Configurable rate limits (429), server errors (500/503), timeouts
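
As a rough illustration of the counting approach (not the exact code in tokens.rs), model names can be mapped to a tiktoken-rs encoding and the prompt tokenized with it:

```rust
use tiktoken_rs::{cl100k_base, o200k_base};

// Pick an encoding by model family, then count tokens in the text.
// Prefix matching and error handling are simplified for illustration.
fn count_tokens(model: &str, text: &str) -> usize {
    let bpe = if model.starts_with("gpt-4o") {
        o200k_base().expect("failed to load o200k_base encoding")
    } else {
        cl100k_base().expect("failed to load cl100k_base encoding")
    };
    bpe.encode_with_special_tokens(text).len()
}

fn main() {
    println!("{} tokens", count_tokens("gpt-4o", "Hello from LLMSim!"));
}
```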

Phase 3: Server Binary

  • Axum HTTP server with graceful shutdown
  • OpenAI-compatible endpoints (router sketched after this list):
    • GET /health - Health check
    • POST /v1/chat/completions - Chat completions (streaming + non-streaming)
    • GET /v1/models - List available models
    • GET /v1/models/:id - Get specific model
  • YAML configuration with environment variable overrides
  • Docker and docker-compose support
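
A minimal sketch of the route layout (axum 0.7-style paths, with stub handlers standing in for the real ones):

```rust
use axum::{
    extract::Path,
    routing::{get, post},
    Json, Router,
};
use serde_json::{json, Value};

async fn health() -> &'static str {
    "ok"
}

// Stub handlers only; the real ones build simulated completions,
// apply latency profiles, and support SSE streaming.
async fn chat_completions(Json(req): Json<Value>) -> Json<Value> {
    Json(json!({ "object": "chat.completion", "model": req["model"] }))
}

async fn list_models() -> Json<Value> {
    Json(json!({ "object": "list", "data": [] }))
}

async fn get_model(Path(id): Path<String>) -> Json<Value> {
    Json(json!({ "id": id, "object": "model" }))
}

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/health", get(health))
        .route("/v1/chat/completions", post(chat_completions))
        .route("/v1/models", get(list_models))
        .route("/v1/models/:id", get(get_model));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```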

Test Plan

  • [x] All 54 unit tests pass
  • [x] Code compiles without errors
  • [x] cargo fmt --check passes
  • [x] cargo clippy passes (only minor dead-code warnings for future use)
  • [ ] Manual testing: Start the server and test with curl/the OpenAI SDK (see the sketch below)
  • [ ] Docker build and run
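
The curl check can also be done from Rust; a throwaway reqwest snippet like the following (reqwest with the `json` feature enabled, server assumed to listen on localhost:8080) is enough to eyeball a non-streaming response:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    // Send a minimal non-streaming chat completion request to the simulator.
    let resp = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&json!({
            "model": "gpt-4o",
            "messages": [{ "role": "user", "content": "Hello" }]
        }))
        .send()
        .await?;
    println!("status: {}", resp.status());
    println!("body: {}", resp.text().await?);
    Ok(())
}
```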

Risk

  • Low
  • This is the initial implementation with no breaking changes to existing code
  • All new code with comprehensive test coverage

Checklist

  • [x] Tests added (54 unit tests)
  • [x] CI workflow configured
  • [x] Docker support added
  • [x] Configuration documented in code

This commit implements the foundation of the LLMSim project:

Phase 1 - Project Foundation:
- Initialize Cargo workspace with llmsim (library) and llmsim-server (binary)
- Define OpenAI-compatible API types (ChatCompletionRequest/Response, streaming)
- Set up CI workflow for format, lint, test, and build

Phase 2 - Core Library:
- Token counter using tiktoken-rs for accurate token counting
- Latency profiles for GPT-4, GPT-3.5, Claude, and Gemini models
- Response generators (lorem, echo, fixed, random, sequence)
- Streaming engine with SSE support and realistic timing
- Error injection for testing (rate limits, server errors, timeouts)

Phase 3 - Server Binary:
- Axum HTTP server with graceful shutdown
- POST /v1/chat/completions endpoint (streaming and non-streaming)
- GET /v1/models endpoint listing available models
- YAML configuration with environment variable overrides
- Docker and docker-compose support for containerization

All 54 tests pass and the code compiles cleanly.
chaliy changed the title from "Implement plan sections one through three" to "LLMSim Library and Server" on Dec 26, 2025

- Remove GPT-3.5-turbo support
- Add GPT-5 variants: gpt-5, gpt-5-mini, gpt-5-nano
- Add GPT-5.1/5.2 series support
- Add O-series reasoning models (o1, o3)
- Update latency profiles with GPT-5 characteristics:
  - gpt-5: 600ms TTFT, 40ms TBT (flagship)
  - gpt-5-mini: 300ms TTFT, 20ms TBT (lightweight)
  - gpt-5-nano: 150ms TTFT, 10ms TBT (speed optimized)
  - o-series: 2000ms TTFT (reasoning overhead)
- Update token counting to use o200k_base for GPT-5
- Add Claude 4 models to default model list
- Update default latency to GPT-5 profile

Based on models.dev API and OpenAI documentation.

Remove models not in models.dev, add correct model IDs:

GPT-5 family:
- gpt-5, gpt-5-mini, gpt-5-codex
- gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-codex-max
- gpt-5.2
- Removed: gpt-5-nano, gpt-5.1-mini, gpt-5.1-nano (don't exist)

O-series reasoning models:
- o3, o3-mini, o4-mini
- Removed: o1, o1-mini, o1-preview (not in models.dev)

GPT-4 family:
- Added: gpt-4.1

Claude family (updated to models.dev naming):
- claude-3.5-sonnet, claude-3.7-sonnet
- claude-sonnet-4, claude-sonnet-4.5
- claude-opus-4, claude-opus-4.5
- claude-haiku-4.5

Reorganize workspace to use standard crates/ folder structure:
- crates/llmsim/ - core library
- crates/llmsim-server/ - HTTP server binary

Updated Cargo.toml to use `members = ["crates/*"]` glob pattern.
Updated Dockerfile paths accordingly.

- Convert from workspace with two crates to single crate with lib + bin
- Add `llmsim serve` subcommand using clap for CLI
- Move server code to src/cli/ module
- Update imports to use crate:: instead of llmsim::
- Update Dockerfile for new single binary structure
- Update PLAN.md with new architecture and mark completed items
- Add specs/architecture.md documenting new structure

Usage: `llmsim serve --port 8080 --latency-profile gpt5`
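
The CLI shape is roughly as sketched below (field names are illustrative; the real definitions live in src/cli/):

```rust
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "llmsim")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Start the HTTP simulator
    Serve {
        #[arg(long, default_value_t = 8080)]
        port: u16,
        #[arg(long, default_value = "gpt5")]
        latency_profile: String,
    },
}

fn main() {
    match Cli::parse().command {
        Command::Serve { port, latency_profile } => {
            // The real subcommand builds the Axum router and blocks on the server here.
            println!("would serve on port {port} with latency profile {latency_profile}");
        }
    }
}
```
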
Latency is now automatically derived from the model field in each
chat completion request, making the --latency-profile CLI option redundant.

The LatencyProfile::from_model() function maps models to appropriate
latency profiles (e.g., gpt-5 → 600ms TTFT, o3 → 2000ms TTFT).
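
Conceptually the mapping looks something like the sketch below; the numbers mirror the profiles listed earlier, and the inter-token delay for the o-series is an assumption here (the commit only specifies its TTFT):

```rust
use std::time::Duration;

pub struct LatencyProfile {
    pub ttft: Duration,        // time to first token
    pub inter_token: Duration, // delay between streamed tokens
}

impl LatencyProfile {
    // Illustrative prefix matching only; the real table in latency.rs covers more models.
    pub fn from_model(model: &str) -> Self {
        let (ttft_ms, tbt_ms) = if model.starts_with("o3") || model.starts_with("o4") {
            (2000, 40) // reasoning overhead before the first token; 40ms TBT assumed
        } else if model.starts_with("gpt-5-mini") {
            (300, 20)
        } else {
            (600, 40) // gpt-5 flagship default
        };
        Self {
            ttft: Duration::from_millis(ttft_ms),
            inter_token: Duration::from_millis(tbt_ms),
        }
    }
}
```
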
chaliy merged commit 822dae0 into main on Dec 26, 2025
7 checks passed
chaliy deleted the claude/implement-plan-sections-1-3-sayVs branch on December 26, 2025 at 05:52
chaliy added a commit that referenced this pull request Dec 26, 2025