Merged
Conversation
added 2 commits
April 30, 2026 20:47
Pre-flight every built-in-loader model on the driver before any Ray actor spins up: download with a universal `*.safetensors`-preferred filter, validate local paths, and surface auth / missing-repo / missing-file errors at startup instead of inside an UNHEALTHY replica.

- GGUF variants are selected with a `model: repo:filename` syntax (globs supported); when a multi-variant GGUF repo has no selector, the resolver raises and lists the available variants.
- `hf_filename` is gone: selectors now live on the `model:` field for both HF and local sources.
- Built-in loaders (vllm, transformers, diffusers, llama_cpp) consume `config._resolved_path` and fail fast when it isn't populated.
- Plugins are unchanged: they keep managing their own downloads, and `model:` stays optional for `loader=custom`.
- The resolver runs after the gateway comes up, so /health and /readyz keep answering during long downloads.
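For reference, a minimal sketch of how such a `repo:filename` value could be split into a source and a selector; `split_model_selector` is a hypothetical helper, not the resolver's actual API, and it ignores edge cases such as Windows drive letters:

```python
def split_model_selector(model: str) -> tuple[str, str | None]:
    """Split a 'repo_or_path:filename' model value into (source, selector).

    'org/repo'              -> ('org/repo', None)
    'org/repo:Q4_K_M*.gguf' -> ('org/repo', 'Q4_K_M*.gguf')
    '/models/gemma:*.gguf'  -> ('/models/gemma', '*.gguf')
    """
    source, sep, selector = model.partition(":")
    return (source, selector) if sep else (model, None)
```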
Code Review
This pull request introduces a centralized model resolution system that supports HuggingFace repositories, local paths, and a new repo:filename syntax for specific file selection. By moving resolution to the driver during startup, the system can catch configuration errors early. The llama_cpp, vllm, transformers, and diffusers loaders have been updated to utilize these resolved paths, and the hf_filename field in the llama_cpp configuration has been removed. Review feedback highlights critical issues where the resolver returns directory paths instead of specific file paths for GGUF models—both in single-file and sharded scenarios—which would cause the llama_cpp loader to fail.
Three follow-up fixes after the initial integration:
- Set HF_HOME / VLLM_CACHE_ROOT / FLASHINFER_CACHE_DIR at the very top of
  mship_deploy.py, before any import that pulls in huggingface_hub: that
  package latches its HF_HOME constant at import time, so driver-side
  downloads were landing in ~/.cache/huggingface instead of MSHIP_CACHE_DIR
  (see the first sketch after this list).
- Always return a file path (never a directory) when resolving GGUF-style
  models. Three branches were updated to match what llama.cpp expects
  (see the second sketch after this list):
  * Local dir + selector matching multiple shards: sort and return the
    first shard's path (was raising RuntimeError).
  * HF repo + selector matching multiple shards: snapshot_download all
    shards, then return the first shard's full path inside the snapshot
    (was returning the snapshot dir).
  * HF repo with exactly one .gguf file and no selector: hf_hub_download
    it directly and return the file path (was returning the snapshot dir
    via the universal-filter snapshot_download fallback).
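A minimal sketch of the import-ordering fix, assuming the cache root is passed in via a MSHIP_CACHE_DIR environment variable (the default path below is illustrative): the cache variables must be exported before anything imports huggingface_hub, because that package reads HF_HOME into a module-level constant on import.

```python
# mship_deploy.py -- must execute before any import that pulls in huggingface_hub.
import os

_cache = os.environ.get("MSHIP_CACHE_DIR", "/opt/mship/cache")  # illustrative default
os.environ.setdefault("HF_HOME", os.path.join(_cache, "huggingface"))
os.environ.setdefault("VLLM_CACHE_ROOT", os.path.join(_cache, "vllm"))
os.environ.setdefault("FLASHINFER_CACHE_DIR", os.path.join(_cache, "flashinfer"))

# Only now is it safe to import modules that (transitively) import huggingface_hub,
# since huggingface_hub captures HF_HOME once at import time.
from huggingface_hub import snapshot_download  # noqa: E402
```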
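And a sketch of the three GGUF branches described above, assuming huggingface_hub is available; `resolve_gguf` and its signature are illustrative, not the project's actual resolver API.

```python
import fnmatch
import glob
import os

from huggingface_hub import hf_hub_download, list_repo_files, snapshot_download


def resolve_gguf(repo_or_dir: str, selector: str | None) -> str:
    """Always return a .gguf file path (never a directory), as llama.cpp expects."""
    if os.path.isdir(repo_or_dir):
        # Local dir: glob the selector; with multiple shards, sort and return the first.
        matches = sorted(glob.glob(os.path.join(repo_or_dir, selector or "*.gguf")))
        if not matches:
            raise FileNotFoundError(f"no GGUF matching {selector!r} in {repo_or_dir}")
        return matches[0]

    ggufs = sorted(f for f in list_repo_files(repo_or_dir) if f.endswith(".gguf"))
    if selector:
        shards = sorted(f for f in ggufs if fnmatch.fnmatch(os.path.basename(f), selector))
        if not shards:
            raise FileNotFoundError(f"no GGUF matching {selector!r} in {repo_or_dir}")
        if len(shards) > 1:
            # Multi-shard match: download every shard, then return the first shard's
            # path inside the snapshot so the loader still receives a file, not a dir.
            snapshot_dir = snapshot_download(repo_or_dir, allow_patterns=shards)
            return os.path.join(snapshot_dir, shards[0])
        return hf_hub_download(repo_or_dir, shards[0])
    if len(ggufs) == 1:
        # Exactly one .gguf and no selector: download just that file directly.
        return hf_hub_download(repo_or_dir, ggufs[0])
    raise RuntimeError(
        f"multiple GGUF variants in {repo_or_dir}; select one with 'repo:filename': {ggufs}"
    )
```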