[API] feat: add model management endpoints for loading and unloading models #1553
Conversation
Summary of Changes
Hello @AlpinDale, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Aphrodite API server's capabilities by introducing dynamic model management. It allows users to unload the currently active model to free up GPU resources and then load a different model or reload the existing one with new configurations without requiring a full server restart. This provides greater operational flexibility and efficiency for managing large language models in a serving environment.
Summary of Changes
Hello @AlpinDale, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Aphrodite API server's flexibility by introducing robust model management capabilities. It allows users to dynamically unload the currently active large language model, thereby releasing GPU resources, and subsequently load a different model or reconfigure the existing one without requiring a full server restart. This is achieved through new "/v1/unload_model" and "/v1/load_model" endpoints, which support various methods for specifying model configurations, including inline JSON, YAML file uploads, and "aphrodite_config.yaml" files located in model directories. The changes also ensure that the server's internal monitoring systems correctly adapt to these dynamic model changes.
Code Review
This pull request introduces valuable model management endpoints for loading and unloading models, which is a significant enhancement for development and resource management. The implementation is generally robust, handling various configuration sources and correctly updating the server state. My feedback focuses on a few areas to improve the implementation further: enhancing error handling for malformed API requests, refactoring duplicated code to improve maintainability, and standardizing the location of imports to align with common Python conventions.
except json.JSONDecodeError:
    pass
Silently ignoring a JSONDecodeError can lead to confusing behavior for the user. If a user provides a malformed JSON body, the request will proceed as if no body was sent, loading the default model instead of returning an error. It's better to raise an HTTPException to inform the user about the malformed request.
- except json.JSONDecodeError:
-     pass
+ except json.JSONDecodeError as e:
+     raise HTTPException(
+         status_code=HTTPStatus.BAD_REQUEST.value,
+         detail=f"Invalid JSON in request body: {e}"
+     ) from e
state.engine_args = args
if not hasattr(state, 'original_engine_args'):
    # Only store original args on first init, not on subsequent loads
    from copy import deepcopy
shutdown_start = time.time()
old_engine_client.shutdown()

import torch.cuda
if config_data is None:
    model_config_yaml = get_model_config_yaml(
        args.model, getattr(args, 'download_dir', None)
    )

    if model_config_yaml:
        logger.info(
            f"Found aphrodite_config in model directory with "
            f"{len(model_config_yaml)} settings"
        )
        for key, value in model_config_yaml.items():
            attr_name = key.replace('-', '_')
            # Don't override the model path if it was explicitly provided in request
            if attr_name == 'model' and model is not None:
                continue
            if hasattr(args, attr_name):
                old_value = getattr(args, attr_name)
                setattr(args, attr_name, value)
                config_applied[key] = {
                    "old": old_value,
                    "new": value,
                    "source": "model_dir"
                }
                logger.info(
                    f"Config from model dir: {key} = {value} (was: {old_value})"
                )
            else:
                logger.warning(
                    f"Unknown config key in model directory: {key} - ignoring"
                )
This block of logic for applying configuration from aphrodite_config.yaml is very similar to the logic in api_server.py (lines 185-200). To improve maintainability and reduce code duplication, consider refactoring this into a shared helper function that can be called from both places. The helper could take the args object, the model_config_yaml dict, and optionally a dictionary to populate with applied changes for the API response.
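A minimal sketch of what such a shared helper could look like. The function name apply_model_config, its parameters, and the use of argparse.Namespace as a stand-in for the engine args are all hypothetical and are not taken from the PR; only the key normalization, the skip-model rule, and the shape of the applied-changes record mirror the reviewed block.

```python
import argparse
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def apply_model_config(
    args: Any,
    model_config: dict,
    *,
    source: str = "model_dir",
    skip_keys: frozenset = frozenset(),
    applied: Optional[dict] = None,
) -> None:
    """Copy recognised keys from model_config onto args (hypothetical helper).

    Keys use CLI-style hyphens and are mapped to attribute names with
    underscores. Keys in skip_keys are left untouched, unknown keys are
    logged and ignored, and each applied change is recorded in `applied`
    when a dict is passed in.
    """
    for key, value in model_config.items():
        attr_name = key.replace("-", "_")
        if attr_name in skip_keys:
            continue
        if not hasattr(args, attr_name):
            logger.warning("Unknown config key from %s: %s - ignoring", source, key)
            continue
        old_value = getattr(args, attr_name)
        setattr(args, attr_name, value)
        if applied is not None:
            applied[key] = {"old": old_value, "new": value, "source": source}
        logger.info("Config from %s: %s = %s (was: %s)", source, key, value, old_value)


# Usage: argparse.Namespace stands in for the real engine args object.
args = argparse.Namespace(model="Qwen/Qwen3-0.6B", max_model_len=2048)
applied = {}
apply_model_config(
    args,
    {"max-model-len": 4096, "model": "some/other-model", "bogus-key": 1},
    skip_keys=frozenset({"model"}),  # model was given explicitly in the request
    applied=applied,
)
```

The startup path could call the same helper without an applied dict, while the load endpoint passes one to build its API response.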
logger.info("Model load requested - initializing engine...")

from aphrodite.v1.engine.async_llm import AsyncLLM
    return None

try:
    import yaml
Code Review
This pull request introduces a powerful new feature for dynamically loading and unloading models via API endpoints. The implementation is well-structured, covering various configuration methods and ensuring proper state management, especially with the updated watchdog logic. My review focuses on enhancing error handling and logging to improve the robustness and debuggability of this new functionality. Overall, these are great additions to the project.
try:
    body = await raw_request.json()
    model_name = body.get("model")
    config_data = body.get("config")
except json.JSONDecodeError:
    pass
When handling a JSON request in the load_model endpoint, a json.JSONDecodeError is silently ignored. This means if a user sends a malformed JSON body, the request will proceed as if no body was sent, likely reloading the original model without any changes. This can be confusing for the user, who would expect an error message indicating that their JSON is invalid. It's better to catch this exception and return a 400 Bad Request response with a descriptive error message.
except json.JSONDecodeError as e:
    raise HTTPException(
        status_code=HTTPStatus.BAD_REQUEST.value,
        detail=f"Invalid JSON body: {e}",
    ) from e

except Exception:
    return None
The broad except Exception block when trying to download aphrodite_config.yaml from the Hugging Face Hub currently swallows any exception silently and returns None. While the config file is optional, this behavior can make it difficult to debug issues related to network connectivity, permissions, or unexpected errors from the huggingface_hub library. I suggest adding a warning log that includes the exception details to provide better visibility when a download fails.
except Exception as e:
    logger.warning(
        "Failed to download aphrodite_config.yaml from hub for %s: %s",
        model_name_or_path,
        e,
    )
    return None
Needs the env var APHRODITE_SERVER_DEV_MODE=1 enabled.
You can call the /v1/unload_model endpoint to unload the model by killing the engine and worker processes (which frees up all memory), then call the /v1/load_model endpoint to reload your model.
Unloading a model
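A minimal sketch of an unload call from Python. The base URL (localhost on port 2242) and the use of a POST with an empty body are assumptions, not something this description confirms; the request is only built here, not sent.

```python
import urllib.request

# Assumption: server address and port; adjust to your deployment.
BASE_URL = "http://localhost:2242"


def build_unload_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a request to the unload endpoint.

    Assumption: the endpoint accepts a POST with an empty body.
    """
    return urllib.request.Request(
        f"{base_url}/v1/unload_model",
        data=b"",
        method="POST",
    )


req = build_unload_request()

# To send it against a server started with APHRODITE_SERVER_DEV_MODE=1:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode())
```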
Loading a model
Running the /v1/load_model endpoint by itself will reload your original model with no changes. If you wish to swap out the model, or change the launch args, there are a few options:
Pass a config YAML file
You can call the /v1/load_model endpoint with a file upload that contains your YAML config with the desired args, e.g.:
And use the endpoint like this:
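A minimal sketch of such an upload using only the standard library. The multipart field name ("config_file"), the base URL, and the YAML keys are hypothetical placeholders chosen for illustration; check the endpoint's actual signature before relying on any of them. The request is built but not sent.

```python
import uuid
import urllib.request

# Assumption: server address and port; adjust to your deployment.
BASE_URL = "http://localhost:2242"

# Illustrative config; keys are assumed to mirror launch arguments.
CONFIG_YAML = "model: Qwen/Qwen3-0.6B\nmax-model-len: 4096\n"


def build_upload_request(
    yaml_text: str,
    base_url: str = BASE_URL,
    field_name: str = "config_file",  # hypothetical form field name
) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the YAML config as a file."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        'filename="config.yaml"\r\n'
        "Content-Type: application/x-yaml\r\n"
        "\r\n"
        f"{yaml_text}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/load_model",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


req = build_upload_request(CONFIG_YAML)

# To send: urllib.request.urlopen(req)
```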
Pass the launch arguments directly
You don't need to provide a YAML config:
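A minimal sketch of a JSON request. The "model" and "config" body keys match what the reviewed handler reads from the request body; the base URL and the specific config key (max_model_len) are assumptions for illustration. The request is built but not sent.

```python
import json
import urllib.request
from typing import Optional

# Assumption: server address and port; adjust to your deployment.
BASE_URL = "http://localhost:2242"


def build_load_request(
    model: Optional[str] = None,
    config: Optional[dict] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a JSON POST for the load endpoint.

    With neither argument set, the body is an empty JSON object, which
    should correspond to reloading the original model unchanged.
    """
    payload = {}
    if model is not None:
        payload["model"] = model
    if config is not None:
        payload["config"] = config
    return urllib.request.Request(
        f"{base_url}/v1/load_model",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Swap the model and override one launch argument (illustrative key name).
req = build_load_request(
    model="Qwen/Qwen3-0.6B",
    config={"max_model_len": 4096},
)

# To send: urllib.request.urlopen(req)
```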
Aphrodite config in model directory
You may also put an aphrodite_config.yaml in your model directory (local or on HF), and Aphrodite will use the args there when loading the model via this endpoint.
Then load your model:
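A minimal sketch of this flow for a local model. A temporary directory stands in for a real model directory, and the config key and base URL are assumptions for illustration; the request is built but not sent.

```python
import json
import tempfile
import urllib.request
from pathlib import Path

# Assumption: server address and port; adjust to your deployment.
BASE_URL = "http://localhost:2242"

# Stand-in for a real local model directory.
model_dir = Path(tempfile.mkdtemp(prefix="my-model-"))

# Illustrative setting; hyphenated keys are mapped to underscored
# argument names by the loading code reviewed above.
(model_dir / "aphrodite_config.yaml").write_text(
    "max-model-len: 4096\n",
    encoding="utf-8",
)

# Only the model path is sent; the settings come from the file on disk.
req = urllib.request.Request(
    f"{BASE_URL}/v1/load_model",
    data=json.dumps({"model": str(model_dir)}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To send: urllib.request.urlopen(req)
```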
Inline model loading
If you launch the Aphrodite API server with --enable-inline-model-loading, you can use a different model name in your request, and it'll swap the models out and complete your request with the provided one instead.
For example, if we have the model Qwen/Qwen3-0.6B loaded but want to run inference on a local model at /root/models/Meta-Llama-3.1-8B-Instruct (which contains an aphrodite_config.yaml in its directory), we can simply send our completions request as normal, but with the desired model:
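A minimal sketch of such a request. The /v1/completions path and the prompt and max_tokens fields follow the usual OpenAI-style completions convention and are assumptions here, as is the base URL; only the model path comes from the example above. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: server address and port; adjust to your deployment.
BASE_URL = "http://localhost:2242"

payload = {
    # Differs from the loaded Qwen/Qwen3-0.6B, so the server is expected
    # to swap models before completing the request.
    "model": "/root/models/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",  # illustrative prompt
    "max_tokens": 32,              # illustrative limit
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/completions",  # assumed OpenAI-style path
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To send against a server launched with --enable-inline-model-loading:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```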