fix(catalog): default phi-4-mini context to 8K (closes #386) #387
Merged
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Closes #386.
The bug
`catalog/catalog.yaml` shipped `phi-4-mini` with `context_size: 128000` paired with a `memory: 4Gi` request. The KV cache for a 128K context on a 3.8B model exceeds 4Gi during model load, so the pod lands in `CrashLoopBackOff` with repeated `SystemOOM` events.
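Back-of-envelope on the numbers (the architecture figures here are assumptions for a 3.8B-class model with GQA, not taken from this PR): with roughly 32 layers, 8 KV heads of head dim 128, and an fp16 cache, the KV cache costs about 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KiB per token. A 128K-token window is then ≈ 16 GiB of KV cache alone, roughly four times the pod's entire memory request before model weights are even counted.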
This is the first model the quickstart docs recommend deploying. A copy-pasted `llmkube deploy phi-4-mini` using the catalog defaults breaks in minute one of a new user's session. HN-killer for the launch window.
The fix
Drop the catalog default to `context_size: 8192`. Phi-4 still supports 128K when users explicitly set `spec.contextSize` on the InferenceService or pass `--context 128000` on the CLI. Other small models in the catalog (3B–7B class) already default to 8192–32768 context; phi-4-mini was the outlier.
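For concreteness, a sketch of the changed catalog entry. Only the `context_size` values and the 4Gi memory request come from this PR; the surrounding entry shape is illustrative:

```yaml
# catalog/catalog.yaml and pkg/cli/catalog.yaml (entry shape is illustrative)
- name: phi-4-mini
  memory: 4Gi          # unchanged
  context_size: 8192   # was 128000
```

Opting back into the full window stays explicit. `spec.contextSize` and the `--context` flag are named in this PR; the `apiVersion` and `model` field below are assumptions:

```yaml
apiVersion: llmkube.dev/v1alpha1   # assumed group/version
kind: InferenceService
metadata:
  name: phi-4-mini-128k
spec:
  model: phi-4-mini                # assumed field name
  contextSize: 128000              # explicit opt-in to the full window
```

or equivalently `llmkube deploy phi-4-mini --context 128000`. Note that anyone opting into 128K will also need a memory request well above 4Gi, for the same KV-cache reason as above.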
Test plan
Verified locally on a fresh kind cluster: with the fix, `llmkube deploy phi-4-mini` reaches Ready in 13 seconds and serves inference at the OpenAI-compatible endpoint as expected.
Note on the duplicate catalog files
This repo currently maintains two copies of `catalog.yaml`: one at the repo root (`catalog/catalog.yaml`) and one embedded into the CLI binary (`pkg/cli/catalog.yaml` via `go:embed`). Both are updated in this PR so the next binary build picks up the fix.
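For reference, the embedding presumably looks something like this (variable name and exact package layout are assumptions, not from this PR):

```go
package cli

import _ "embed"

// The CLI binary carries its own copy of the catalog, so a change to
// catalog/catalog.yaml only reaches CLI users once this copy is updated
// and the binary is rebuilt.
//
//go:embed catalog.yaml
var catalogYAML []byte
```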
The duplication is asking for drift. Worth a follow-up issue to consolidate to a single source of truth (probably `catalog/catalog.yaml` as canonical, with a build-time copy or symlink into `pkg/cli/`). Out of scope for this PR.