Add model-cli gateway #812
Conversation
Force-pushed 5369ad7 to cb7f667
Hey - I've found 3 issues and left some high-level feedback:
- The retry and fallback logic in `Router::chat_completion`, `chat_completion_stream`, and `embedding` is nearly identical; consider extracting a shared helper (e.g. a generic `run_with_retries_and_fallbacks`) to reduce duplication and keep behavior consistent across request types.
- In `build_app`, the master key is inserted into request extensions only to be immediately re-read in `auth::auth_middleware`; you can simplify the auth middleware by capturing the `master_key` directly in the closure and passing it to the auth function instead of going through extensions.
- The streaming branch in `chat_completion_handler` re-wraps the provider `ByteStream` into an unbounded MPSC channel and `UnboundedReceiverStream`; you can simplify and reduce overhead by mapping the `ByteStream` directly into an `axum::body::Body::from_stream` without spawning an extra task and channel.
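Two of these points lend themselves to quick sketches. First, a minimal shape for the shared retry/fallback helper (hypothetical names: `Deployment` and an `AppError::NoDeployments` variant; assumes `Deployment: Clone` so the closure can own its argument):

```rust
async fn run_with_retries_and_fallbacks<T, F, Fut>(
    deployments: &[Deployment],
    max_attempts: usize,
    mut call: F,
) -> Result<T, AppError>
where
    F: FnMut(Deployment) -> Fut,
    Fut: std::future::Future<Output = Result<T, AppError>>,
{
    if deployments.is_empty() {
        return Err(AppError::NoDeployments); // hypothetical error variant
    }
    let mut last_err = None;
    for attempt in 0..max_attempts {
        // Round-robin across deployments so each retry falls back to the next one.
        let dep = deployments[attempt % deployments.len()].clone();
        match call(dep).await {
            Ok(resp) => return Ok(resp),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.unwrap_or(AppError::NoDeployments))
}
```

Second, for the streaming branch, axum 0.7's `Body::from_stream` can wrap the provider stream directly, with no extra task or channel (assuming the `ByteStream` items are `Result<Bytes, E>` with `E: Into<BoxError>`):

```rust
// `byte_stream` stands in for the provider `ByteStream`.
let body = axum::body::Body::from_stream(byte_stream);
let response = axum::response::Response::builder()
    .header("content-type", "text/event-stream")
    .body(body)
    .expect("static headers are valid");
```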
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The retry and fallback logic in `Router::chat_completion`, `chat_completion_stream`, and `embedding` is nearly identical; consider extracting a shared helper (e.g. a generic `run_with_retries_and_fallbacks`) to reduce duplication and keep behavior consistent across request types.
- In `build_app`, the master key is inserted into request extensions only to be immediately re-read in `auth::auth_middleware`; you can simplify the auth middleware by capturing the `master_key` directly in the closure and passing it to the auth function instead of going through extensions.
- The streaming branch in `chat_completion_handler` re-wraps the provider `ByteStream` into an unbounded MPSC channel and `UnboundedReceiverStream`; you can simplify and reduce overhead by mapping the `ByteStream` directly into an `axum::body::Body::from_stream` without spawning an extra task and channel.
## Individual Comments
### Comment 1
<location path="model-cli/src/providers/anthropic.rs" line_range="362-363" />
<code_context>
+ buffer = buffer[event_end + 2..].to_string();
+
+ let mut event_data = String::new();
+ for line in event_text.lines() {
+ if let Some(data) = line.strip_prefix("data: ") {
+ event_data = data.to_string();
+ }
</code_context>
<issue_to_address>
**issue:** SSE parsing only keeps the last `data:` line per event, which may drop data for multi-line events.
Per the SSE spec, multiple `data:` lines in one event should be concatenated with `\n`. Here, `event_data` is overwritten on each `data:` line, so any earlier lines in a multi-line event would be lost. Please accumulate all `data:` lines (e.g., into a `Vec<String>` and `join('\n')`) so multi-line events are handled correctly.
</issue_to_address>
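Per that suggestion, the accumulation could look like this (a minimal sketch; `event_text` and `event_data` mirror the variables in the snippet above):

```rust
let mut data_lines: Vec<&str> = Vec::new();
for line in event_text.lines() {
    if let Some(rest) = line.strip_prefix("data:") {
        // The SSE spec strips at most one leading space after the colon.
        data_lines.push(rest.strip_prefix(' ').unwrap_or(rest));
    }
}
// Multi-line events are joined with '\n', as the spec requires.
let event_data = data_lines.join("\n");
```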
### Comment 2
<location path="model-cli/src/router.rs" line_range="130" />
<code_context>
+
+ let dep_count = self.deployments.get(model).map_or(0, |v| v.len());
+
+ for attempt in 0..max_attempts.max(dep_count) {
+ if let Some(idx) = self.next_index(model) {
+ let dep = &self.deployments[model][idx];
</code_context>
<issue_to_address>
**question (bug_risk):** The attempt loop bounds may exceed the configured retry count when multiple deployments exist.
Because the loop uses `max_attempts.max(dep_count)` with `max_attempts = num_retries + 1`, multiple deployments can cause more total attempts than `num_retries` suggests (e.g., 3 deployments, `num_retries = 0` → 3 attempts). If `num_retries` is intended as a hard cap on attempts, consider `min(max_attempts, dep_count)` instead, or clarify in docs that each deployment is tried at least once even if that exceeds `num_retries`.
</issue_to_address>
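For reference, the hard-cap variant named above is a one-line change (sketch only; `max_attempts` and `dep_count` are the variables from the snippet):

```rust
// With num_retries = 0 (max_attempts = 1) and 3 deployments, this makes
// exactly 1 attempt instead of 3 under `max_attempts.max(dep_count)`.
for attempt in 0..max_attempts.min(dep_count) {
    // ... same body as the existing loop ...
}
```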
### Comment 3
<location path="model-cli/src/handlers.rs" line_range="99-103" />
<code_context>
+
+// ── Embeddings ──
+
+pub async fn embeddings_handler(
+ State(state): State<Arc<AppState>>,
+ Json(request): Json<EmbeddingRequest>,
+) -> Result<impl IntoResponse, AppError> {
+ let response = state.router.embedding(&request).await?;
+ Ok(Json(response))
+}
</code_context>
<issue_to_address>
**nitpick:** Embeddings responses don’t include the `x-model-cli-version` header like chat completions do.
Here the JSON is returned directly, so the version header isn’t added. For consistency with non-streaming chat completions and to support client-side introspection, consider wrapping the embeddings result in a `Response` and setting the same `x-model-cli-version` header.
</issue_to_address>
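A sketch of the suggested change (assumes the chat handler sets the same `x-model-cli-version` header; deriving the value from `CARGO_PKG_VERSION` is illustrative, not the PR's actual source):

```rust
pub async fn embeddings_handler(
    State(state): State<Arc<AppState>>,
    Json(request): Json<EmbeddingRequest>,
) -> Result<impl IntoResponse, AppError> {
    let response = state.router.embedding(&request).await?;
    // Wrap the JSON in a full Response so the version header can be attached.
    let mut resp = Json(response).into_response();
    resp.headers_mut().insert(
        "x-model-cli-version",
        axum::http::HeaderValue::from_static(env!("CARGO_PKG_VERSION")),
    );
    Ok(resp)
}
```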
doringeman left a comment
LGTM! WDYT of adding some Makefile rules for it?
Warning: Gemini is experiencing higher than usual traffic and was unable to create the review. Please try again in a few hours by commenting `/gemini review`.
Code Review
This pull request introduces model-cli, a Rust-based LLM gateway featuring provider routing, load balancing, and fallback mechanisms. Feedback highlights critical security concerns regarding timing attacks in API key validation and permissive CORS configurations. Other recommendations focus on improving performance and maintainability through optimized dependency features, simplified stream processing, and more idiomatic Axum middleware patterns.
```rust
fn build_app(state: Arc<AppState>) -> AxumRouter {
    let master_key = state.master_key.clone();

    // Middleware that injects the master key into request extensions for auth checking.
    let auth_layer = axum::middleware::from_fn(
        move |mut request: axum::extract::Request, next: axum::middleware::Next| {
            let key = master_key.clone();
            async move {
                request.extensions_mut().insert(key);
                auth::auth_middleware(request, next).await
            }
        },
    );
```
The method used to pass the `master_key` to the authentication middleware is a bit of a workaround. A more idiomatic axum approach is to use `middleware::from_fn_with_state`, which passes state directly to the middleware function. This would make the `auth_middleware` function cleaner, as it wouldn't need to extract the key from request extensions.

This would require changing the signature of `auth_middleware` in `src/auth.rs` to:

```rust
pub async fn auth_middleware(
    axum::extract::State(master_key): axum::extract::State<Option<String>>,
    request: axum::extract::Request,
    next: axum::middleware::Next,
) -> Result<Response, AppError>
```

You can then use `master_key` directly instead of extracting it from extensions.
Suggested change:

```diff
 fn build_app(state: Arc<AppState>) -> AxumRouter {
-    let master_key = state.master_key.clone();
-
-    // Middleware that injects the master key into request extensions for auth checking.
-    let auth_layer = axum::middleware::from_fn(
-        move |mut request: axum::extract::Request, next: axum::middleware::Next| {
-            let key = master_key.clone();
-            async move {
-                request.extensions_mut().insert(key);
-                auth::auth_middleware(request, next).await
-            }
-        },
-    );
+    let auth_layer =
+        axum::middleware::from_fn_with_state(state.master_key.clone(), auth::auth_middleware);
```
Force-pushed cb7f667 to 03779c5
Force-pushed ab207ad to 92736d4
Force-pushed ff616b3 to 3f8328d
Force-pushed 3f8328d to bd0003b
OpenAI-compatible LLM gateway, invokable via `model-cli gateway --config <file>`. Supports Docker Model Runner as a first-class provider alongside Ollama, OpenAI, Anthropic, and 15+ others.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
Force-pushed bd0003b to 0d9fc40