Commit ec18edf

ngxson, allozaur, and ggerganov authored
server: introduce API for serving / loading / unloading multiple models (#17470)
* server: add model management and proxy
* fix compile error
* does this fix windows?
* fix windows build
* use subprocess.h, better logging
* add test
* fix windows
* feat: Model/Router server architecture WIP
* more stable
* fix unsafe pointer
* also allow terminate loading model
* add is_active()
* refactor: Architecture improvements
* tmp apply upstream fix
* address most problems
* address thread safety issue
* address review comment
* add docs (first version)
* address review comment
* feat: Improved UX for model information, modality interactions etc
* chore: update webui build output
* refactor: Use only the message data `model` property for displaying model used info
* chore: update webui build output
* add --models-dir param
* feat: New Model Selection UX WIP
* chore: update webui build output
* feat: Add auto-mic setting
* feat: Attachments UX improvements
* implement LRU
* remove default model path
* better --models-dir
* add env for args
* address review comments
* fix compile
* refactor: Chat Form Submit component
* add endpoint docs
* Merge remote-tracking branch 'webui/allozaur/server_model_management_v1_2' into xsn/server_model_maagement_v1_2
  Co-authored-by: Aleksander <aleksander.grygier@gmail.com>
* feat: Add copy to clipboard to model name in model info dialog
* feat: Model unavailable UI state for model selector
* feat: Chat Form Actions UI logic improvements
* feat: Auto-select model from last assistant response
* chore: update webui build output
* expose args and exit_code in API
* add note
* support extra_args on loading model
* allow reusing args if auto_load
* typo docs
* oai-compat /models endpoint
* cleaner
* address review comments
* feat: Use `model` property for displaying the `repo/model-name` naming format
* refactor: Attachments data
* chore: update webui build output
* refactor: Enum imports
* feat: Improve Model Selector responsiveness
* chore: update webui build output
* refactor: Cleanup
* refactor: Cleanup
* refactor: Formatters
* chore: update webui build output
* refactor: Copy To Clipboard Icon component
* chore: update webui build output
* refactor: Cleanup
* chore: update webui build output
* refactor: UI badges
* chore: update webui build output
* refactor: Cleanup
* refactor: Cleanup
* chore: update webui build output
* add --models-allow-extra-args for security
* nits
* add stdin_file
* fix merge
* fix: Retrieve lost setting after resolving merge conflict
* refactor: DatabaseStore -> DatabaseService
* refactor: Database, Conversations & Chat services + stores architecture improvements (WIP)
* refactor: Remove redundant settings
* refactor: Multi-model business logic WIP
* chore: update webui build output
* feat: Switching models logic for ChatForm or when regenerating messages + modality detection logic
* chore: update webui build output
* fix: Add `untrack` inside chat processing info data logic to prevent infinite effect
* fix: Regenerate
* feat: Remove redundant settings + rearrange
* fix: Audio attachments
* refactor: Icons
* chore: update webui build output
* feat: Model management and selection features WIP
* chore: update webui build output
* refactor: Improve server properties management
* refactor: Icons
* chore: update webui build output
* feat: Improve model loading/unloading status updates
* chore: update webui build output
* refactor: Improve API header management via utility functions
* remove support for extra args
* set hf_repo/docker_repo as model alias when possible
* refactor: Remove ConversationsService
* refactor: Chat requests abort handling
* refactor: Server store
* tmp webui build
* refactor: Model modality handling
* chore: update webui build output
* refactor: Processing state reactivity
* fix: UI
* refactor: Services/Stores syntax + logic improvements
  Refactors components to access stores directly instead of using exported getter functions.
  This change centralizes store access and logic, simplifying component code and improving maintainability by reducing the number of exported functions and promoting direct store interaction.
  Removes exported getter functions from `chat.svelte.ts`, `conversations.svelte.ts`, `models.svelte.ts` and `settings.svelte.ts`.
* refactor: Architecture cleanup
* feat: Improve statistic badges
* feat: Condition available models based on modality + better model loading strategy & UX
* docs: Architecture documentation
* feat: Update logic for PDF as Image
* add TODO for http client
* refactor: Enhance model info and attachment handling
* chore: update webui build output
* refactor: Components naming
* chore: update webui build output
* refactor: Cleanup
* refactor: DRY `getAttachmentDisplayItems` function + fix UI
* chore: update webui build output
* fix: Modality detection improvement for text-based PDF attachments
* refactor: Cleanup
* docs: Add info comment
* refactor: Cleanup
* re
* refactor: Cleanup
* refactor: Cleanup
* feat: Attachment logic & UI improvements
* refactor: Constants
* feat: Improve UI sidebar background color
* chore: update webui build output
* refactor: Utils imports + move types to `app.d.ts`
* test: Fix Storybook mocks
* chore: update webui build output
* test: Update Chat Form UI tests
* refactor: Tooltip Provider from core layout
* refactor: Tests to separate location
* decouple server_models from server_routes
* test: Move demo test to tests/server
* refactor: Remove redundant method
* chore: update webui build output
* also route anthropic endpoints
* fix duplicated arg
* fix invalid ptr to shutdown_handler
* server : minor
* rm unused fn
* add ?autoload=true|false query param
* refactor: Remove redundant code
* docs: Update README documentations + architecture & data flow diagrams
* fix: Disable autoload on calling server props for the model
* chore: update webui build output
* fix ubuntu build
* fix: Model status reactivity
* fix: Modality detection for MODEL mode
* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 parent 7733409 commit ec18edf

178 files changed: +11511 -4224 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -613,3 +613,4 @@ $ echo "source ~/.llama-completion.bash" >> ~/.bashrc
 - [linenoise.cpp](./tools/run/linenoise.cpp/linenoise.cpp) - C++ library that provides readline-like line editing capabilities, used by `llama-run` - BSD 2-Clause License
 - [curl](https://curl.se/) - Client-side URL transfer library, used by various tools/examples - [CURL License](https://curl.se/docs/copyright.html)
 - [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain
+- [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain

common/arg.cpp

Lines changed: 36 additions & 13 deletions
@@ -212,13 +212,13 @@ struct handle_model_result {
 static handle_model_result common_params_handle_model(
         struct common_params_model & model,
         const std::string & bearer_token,
-        const std::string & model_path_default,
         bool offline) {
     handle_model_result result;
     // handle pre-fill default model path and url based on hf_repo and hf_file
     {
         if (!model.docker_repo.empty()) { // Handle Docker URLs by resolving them to local paths
             model.path = common_docker_resolve_model(model.docker_repo);
+            model.name = model.docker_repo; // set name for consistency
         } else if (!model.hf_repo.empty()) {
             // short-hand to avoid specifying --hf-file -> default it to --model
             if (model.hf_file.empty()) {
@@ -227,7 +227,8 @@ static handle_model_result common_params_handle_model(
                     if (auto_detected.repo.empty() || auto_detected.ggufFile.empty()) {
                         exit(1); // built without CURL, error message already printed
                     }
-                    model.hf_repo = auto_detected.repo;
+                    model.name = model.hf_repo; // repo name with tag
+                    model.hf_repo = auto_detected.repo; // repo name without tag
                     model.hf_file = auto_detected.ggufFile;
                     if (!auto_detected.mmprojFile.empty()) {
                         result.found_mmproj = true;
@@ -257,8 +258,6 @@ static handle_model_result common_params_handle_model(
                 model.path = fs_get_cache_file(string_split<std::string>(f, '/').back());
             }

-        } else if (model.path.empty()) {
-            model.path = model_path_default;
         }
     }

@@ -405,7 +404,7 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context

     // handle model and download
     {
-        auto res = common_params_handle_model(params.model, params.hf_token, DEFAULT_MODEL_PATH, params.offline);
+        auto res = common_params_handle_model(params.model, params.hf_token, params.offline);
         if (params.no_mmproj) {
             params.mmproj = {};
         } else if (res.found_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty()) {
@@ -415,12 +414,18 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
         // only download mmproj if the current example is using it
         for (auto & ex : mmproj_examples) {
             if (ctx_arg.ex == ex) {
-                common_params_handle_model(params.mmproj, params.hf_token, "", params.offline);
+                common_params_handle_model(params.mmproj, params.hf_token, params.offline);
                 break;
             }
         }
-        common_params_handle_model(params.speculative.model, params.hf_token, "", params.offline);
-        common_params_handle_model(params.vocoder.model, params.hf_token, "", params.offline);
+        common_params_handle_model(params.speculative.model, params.hf_token, params.offline);
+        common_params_handle_model(params.vocoder.model, params.hf_token, params.offline);
+    }
+
+    // model is required (except for server)
+    // TODO @ngxson : maybe show a list of available models in CLI in this case
+    if (params.model.path.empty() && ctx_arg.ex != LLAMA_EXAMPLE_SERVER) {
+        throw std::invalid_argument("error: --model is required\n");
     }

     if (params.escape) {
@@ -2090,11 +2095,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     add_opt(common_arg(
         {"-m", "--model"}, "FNAME",
         ex == LLAMA_EXAMPLE_EXPORT_LORA
-            ? std::string("model path from which to load base model")
-            : string_format(
-                "model path (default: `models/$filename` with filename from `--hf-file` "
-                "or `--model-url` if set, otherwise %s)", DEFAULT_MODEL_PATH
-            ),
+            ? "model path from which to load base model"
+            : "model path to load",
         [](common_params & params, const std::string & value) {
             params.model.path = value;
         }
@@ -2492,6 +2494,27 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             }
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}));
+    add_opt(common_arg(
+        {"--models-dir"}, "PATH",
+        "directory containing models for the router server (default: disabled)",
+        [](common_params & params, const std::string & value) {
+            params.models_dir = value;
+        }
+    ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODELS_DIR"));
+    add_opt(common_arg(
+        {"--models-max"}, "N",
+        string_format("for router server, maximum number of models to load simultaneously (default: %d, 0 = unlimited)", params.models_max),
+        [](common_params & params, int value) {
+            params.models_max = value;
+        }
+    ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_MODELS_MAX"));
+    add_opt(common_arg(
+        {"--no-models-autoload"},
+        "disables automatic loading of models (default: enabled)",
+        [](common_params & params) {
+            params.models_autoload = false;
+        }
+    ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_NO_MODELS_AUTOLOAD"));
     add_opt(common_arg(
         {"--jinja"},
         string_format("use jinja template for chat (default: %s)\n", params.use_jinja ? "enabled" : "disabled"),
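
Note on the new "model is required" check in common/arg.cpp: with the default model path removed, an empty path is now an error for every tool except the server. The rule can be sketched in isolation as below. This is only an illustration of the check shown in the diff; `example_kind` and `require_model` are hypothetical names, not identifiers from the codebase.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the parser's example enum (assumption, not the real type).
enum class example_kind { cli, server };

// Throws std::invalid_argument when no model path was resolved,
// unless the caller is the server, which may start without a model.
inline void require_model(const std::string & model_path, example_kind ex) {
    if (model_path.empty() && ex != example_kind::server) {
        throw std::invalid_argument("error: --model is required\n");
    }
}
```

Calling `require_model("", example_kind::server)` returns normally, while the same call with `example_kind::cli` throws.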

common/common.cpp

Lines changed: 11 additions & 3 deletions
@@ -912,7 +912,7 @@ std::string fs_get_cache_file(const std::string & filename) {
     return cache_directory + filename;
 }

-std::vector<common_file_info> fs_list_files(const std::string & path) {
+std::vector<common_file_info> fs_list(const std::string & path, bool include_directories) {
     std::vector<common_file_info> files;
     if (path.empty()) return files;

@@ -927,14 +927,22 @@ std::vector<common_file_info> fs_list_files(const std::string & path) {
             const auto & p = entry.path();
             if (std::filesystem::is_regular_file(p)) {
                 common_file_info info;
-                info.path = p.string();
-                info.name = p.filename().string();
+                info.path   = p.string();
+                info.name   = p.filename().string();
+                info.is_dir = false;
                 try {
                     info.size = static_cast<size_t>(std::filesystem::file_size(p));
                 } catch (const std::filesystem::filesystem_error &) {
                     info.size = 0;
                 }
                 files.push_back(std::move(info));
+            } else if (include_directories && std::filesystem::is_directory(p)) {
+                common_file_info info;
+                info.path   = p.string();
+                info.name   = p.filename().string();
+                info.size   = 0; // Directories have no size
+                info.is_dir = true;
+                files.push_back(std::move(info));
             }
         } catch (const std::filesystem::filesystem_error &) {
             // skip entries we cannot inspect

common/common.h

Lines changed: 8 additions & 3 deletions
@@ -26,8 +26,6 @@
     fprintf(stderr, "%s: built with %s for %s\n", __func__, LLAMA_COMPILER, LLAMA_BUILD_TARGET); \
 } while(0)

-#define DEFAULT_MODEL_PATH "models/7B/ggml-model-f16.gguf"
-
 struct common_time_meas {
     common_time_meas(int64_t & t_acc, bool disable = false);
     ~common_time_meas();
@@ -223,6 +221,7 @@ struct common_params_model {
     std::string hf_repo = ""; // HF repo // NOLINT
     std::string hf_file = ""; // HF file // NOLINT
     std::string docker_repo = ""; // Docker repo // NOLINT
+    std::string name = ""; // in format <user>/<model>[:<tag>] (tag is optional) // NOLINT
 };

 struct common_params_speculative {
@@ -478,6 +477,11 @@ struct common_params {
     bool endpoint_props = false; // only control POST requests, not GET
     bool endpoint_metrics = false;

+    // router server configs
+    std::string models_dir = ""; // directory containing models for the router server
+    int models_max = 4; // maximum number of models to load simultaneously
+    bool models_autoload = true; // automatically load models when requested via the router server
+
     bool log_json = false;

     std::string slot_save_path;
@@ -641,8 +645,9 @@ struct common_file_info {
     std::string path;
     std::string name;
     size_t size = 0; // in bytes
+    bool is_dir = false;
 };
-std::vector<common_file_info> fs_list_files(const std::string & path);
+std::vector<common_file_info> fs_list(const std::string & path, bool include_directories);

 //
 // Model utils
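
Note on `models_max`: the commit message mentions "implement LRU", and the `--models-max` help text says 0 means unlimited, but the router code itself (server-models.cpp) is not part of the hunks shown here. The sketch below is therefore only an assumption about how such a cap could interact with least-recently-used eviction; `loaded_models_lru` and `touch` are hypothetical names and this is not the author's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical tracker: the front of the list is the most recently used model.
class loaded_models_lru {
public:
    explicit loaded_models_lru(int models_max) : max_(models_max) {}

    // Marks `name` as used. Returns the name of the model that should be
    // unloaded to stay within the cap, or an empty string if none.
    std::string touch(const std::string & name) {
        auto it = std::find(order_.begin(), order_.end(), name);
        if (it != order_.end()) {
            // already loaded: move it to the front, nothing to evict
            order_.splice(order_.begin(), order_, it);
            return "";
        }
        order_.push_front(name);
        // max_ == 0 is treated as "unlimited", following the help text
        if (max_ > 0 && order_.size() > static_cast<std::size_t>(max_)) {
            std::string evicted = order_.back();
            order_.pop_back();
            return evicted;
        }
        return "";
    }

    std::size_t size() const { return order_.size(); }

private:
    int max_;
    std::list<std::string> order_;
};
```

With a cap of 2, touching `a`, `b`, `a`, `c` in that order evicts `b`, because `a` was used more recently.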

common/download.cpp

Lines changed: 1 addition & 1 deletion
@@ -1054,7 +1054,7 @@ std::string common_docker_resolve_model(const std::string &) {
 std::vector<common_cached_model_info> common_list_cached_models() {
     std::vector<common_cached_model_info> models;
     const std::string cache_dir = fs_get_cache_directory();
-    const std::vector<common_file_info> files = fs_list_files(cache_dir);
+    const std::vector<common_file_info> files = fs_list(cache_dir, false);
     for (const auto & file : files) {
         if (string_starts_with(file.name, "manifest=") && string_ends_with(file.name, ".json")) {
             common_cached_model_info model_info;

common/download.h

Lines changed: 3 additions & 1 deletion
@@ -14,8 +14,10 @@ struct common_cached_model_info {
     std::string model;
     std::string tag;
     size_t size = 0; // GGUF size in bytes
+    // return string representation like "user/model:tag"
+    // if tag is "latest", it will be omitted
     std::string to_string() const {
-        return user + "/" + model + ":" + tag;
+        return user + "/" + model + (tag == "latest" ? "" : ":" + tag);
     }
 };
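
Note on `to_string()`: the effect of the new expression is easiest to see on two values. The struct below is a hypothetical stand-in (`cached_model_ref`) carrying only the three string fields; the return expression is the one from the diff, and the user and model names used with it are made up for illustration.

```cpp
#include <string>

// Hypothetical reduced copy of the struct, keeping only the naming fields.
struct cached_model_ref {
    std::string user;
    std::string model;
    std::string tag;

    // "user/model:tag", with the ":tag" suffix omitted when tag is "latest"
    std::string to_string() const {
        return user + "/" + model + (tag == "latest" ? "" : ":" + tag);
    }
};
```

A reference with tag `latest` renders as `user/model`; any other tag is appended after a colon, so the default tag no longer shows up in model listings.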

scripts/sync_vendor.py

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@
     "https://github.com/mackron/miniaudio/raw/669ed3e844524fcd883231b13095baee9f6de304/miniaudio.h": "vendor/miniaudio/miniaudio.h",

     "https://raw.githubusercontent.com/yhirose/cpp-httplib/refs/tags/v0.28.0/httplib.h": "vendor/cpp-httplib/httplib.h",
+
+    "https://raw.githubusercontent.com/sheredom/subprocess.h/b49c56e9fe214488493021017bf3954b91c7c1f5/subprocess.h": "vendor/sheredom/subprocess.h",
 }

 for url, filename in vendor.items():

tests/test-quantize-stats.cpp

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
 #endif

 struct quantize_stats_params {
-    std::string model = DEFAULT_MODEL_PATH;
+    std::string model = "models/7B/ggml-model-f16.gguf";
     bool verbose = false;
     bool per_layer_stats = false;
     bool print_histogram = false;

tools/server/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ set(TARGET_SRCS
     server.cpp
     server-http.cpp
     server-http.h
+    server-models.cpp
+    server-models.h
     server-task.cpp
     server-task.h
     server-queue.cpp
