[API] feat: add model management endpoints for loading and unloading models#1553

Merged: AlpinDale merged 3 commits into main from model_load_unload on Nov 3, 2025

AlpinDale (Member) commented Nov 3, 2025

Requires the env var APHRODITE_SERVER_DEV_MODE=1 to be set.

You can call the /v1/unload_model endpoint to unload the model by killing the engine and worker processes (which frees up all memory), then call the /v1/load_model endpoint to reload your model.

Unloading a model

# curl -s -X POST http://localhost:2242/v1/unload_model | jq .
{
  "status": "success",
  "message": "Model unloaded successfully in 1.28s. All GPU memory has been freed.",
  "drain_time_s": 0.0,
  "shutdown_time_s": 1.28,
  "total_time_s": 1.28
}
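
For scripted use, the same call can be made from Python. This is a minimal standard-library sketch, assuming the server listens on localhost:2242 as in the curl example above; the helper names `build_unload_request` and `send` are illustrative and not part of Aphrodite.

```python
import json
import urllib.request

BASE_URL = "http://localhost:2242"  # assumption: same host/port as the curl example


def build_unload_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a body-less POST to /v1/unload_model."""
    return urllib.request.Request(f"{base_url}/v1/unload_model", method="POST")


def send(req: urllib.request.Request, timeout: float = 600.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `send(build_unload_request())` against a running dev-mode server should return a dict shaped like the JSON above, with keys such as `status` and `total_time_s`.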

Loading a model

Running the /v1/load_model endpoint by itself will reload your original model with no changes. If you wish to swap out the model, or change the launch args, there are a few options:

Pass a config YAML file

You can call the /v1/load_model endpoint with a file upload that contains your YAML config with the desired args, e.g.:

model: Qwen/Qwen3-32B-FP8
tensor_parallel_size: 2
max_model_len: 8192

And use the endpoint like this:

# curl -s -X POST http://localhost:2242/v1/load_model -F "config=@config.yaml" | jq .
{
  "status": "success",
  "message": "Model loaded successfully in 87.01s.",
  "load_time_s": 87.01,
  "model": "Qwen/Qwen3-32B-FP8",
  "config_applied": {
    "model": {
      "value": "Qwen/Qwen3-32B-FP8",
      "source": "uploaded"
    },
    "tensor_parallel_size": {
      "value": 2,
      "source": "uploaded"
    },
    "max_model_len": {
      "value": 8192,
      "source": "uploaded"
    }
  }
}
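
If curl is not available, the upload can be approximated from Python. The sketch below hand-builds the multipart/form-data body that `-F "config=@config.yaml"` would send, using only the standard library. The field name `config` comes from the curl example; the helper name `build_config_upload` and the `application/octet-stream` part type are assumptions.

```python
import urllib.request
import uuid


def build_config_upload(
    base_url: str, yaml_text: str, filename: str = "config.yaml"
) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one file field named 'config'."""
    boundary = uuid.uuid4().hex  # random delimiter, assumed not to occur in the YAML
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="config"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
        f"{yaml_text}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/load_model",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The returned request can be passed to `urllib.request.urlopen`; model loading took well over a minute in the example above, so a generous timeout is advisable.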

Pass the launch arguments directly

You don't need to provide a YAML config; the launch args can go in the JSON request body instead:

# curl -s -X POST http://localhost:2242/v1/load_model \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B-FP8",
    "config": {
      "tensor_parallel_size": 2,
      "max_model_len": 32768
    }
  }' | jq .
{
  "status": "success",
  "message": "Model loaded successfully in 97.10s.",
  "load_time_s": 97.1,
  "model": "Qwen/Qwen3-32B-FP8",
  "config_applied": {
    "model": {
      "value": "Qwen/Qwen3-32B-FP8",
      "source": "request"
    },
    "tensor_parallel_size": {
      "value": 2,
      "source": "uploaded"
    },
    "max_model_len": {
      "value": 32768,
      "source": "uploaded"
    }
  }
}
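
A Python equivalent of the JSON request above, as a rough sketch. The payload shape (`model` plus a nested `config` object) is taken from the curl example; the helper name `build_load_request` is hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_load_request(
    base_url: str, model: Optional[str] = None, config: Optional[dict] = None
) -> urllib.request.Request:
    """Build a POST to /v1/load_model; with no arguments it carries no body."""
    payload = {}
    if model is not None:
        payload["model"] = model
    if config:
        payload["config"] = config
    if not payload:
        # No overrides: per the description, this reloads the original model.
        return urllib.request.Request(f"{base_url}/v1/load_model", method="POST")
    return urllib.request.Request(
        f"{base_url}/v1/load_model",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```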

Aphrodite config in model directory

You may also put an aphrodite_config.yaml in your model directory (local or on HF), and Aphrodite will use the args there when loading the model via this endpoint.

# cat /root/models/Meta-Llama-3.1-8B-Instruct/aphrodite_config.yaml 
max_model_len: 4096
tensor_parallel_size: 2
gpu_memory_utilization: 0.9

Then load your model:

# curl -s -X POST http://localhost:2242/v1/load_model \
  -H "Content-Type: application/json" \
  -d '{"model": "/root/models/Meta-Llama-3.1-8B-Instruct"}' | jq .
{
  "status": "success",
  "message": "Model loaded successfully in 44.03s.",
  "load_time_s": 44.03,
  "model": "/root/models/Meta-Llama-3.1-8B-Instruct",
  "config_applied": {
    "model": {
      "value": "/root/models/Meta-Llama-3.1-8B-Instruct",
      "source": "request"
    },
    "max_model_len": {
      "value": 4096,
      "source": "model_dir"
    },
    "tensor_parallel_size": {
      "value": 2,
      "source": "model_dir"
    },
    "gpu_memory_utilization": {
      "value": 0.9,
      "source": "model_dir"
    }
  }
}
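
To check where each setting came from, the `config_applied` object in the response can be grouped by its `source` field. A small sketch, assuming the response shape shown above (`value` and `source` per key); `settings_by_source` is an illustrative helper, not an Aphrodite function.

```python
from collections import defaultdict


def settings_by_source(response: dict) -> dict:
    """Group the applied settings of a /v1/load_model response by their source."""
    grouped = defaultdict(dict)
    for key, entry in response.get("config_applied", {}).items():
        grouped[entry["source"]][key] = entry["value"]
    return dict(grouped)
```

For the response above this yields one `request` group containing the model path and one `model_dir` group containing the three settings read from aphrodite_config.yaml.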

Inline model loading

If you launch the Aphrodite API server with --enable-inline-model-loading, you can use a different model name in your request, and it'll swap the models out and complete your request with the provided one instead.

For example, suppose we have the model Qwen/Qwen3-0.6B loaded, but we want to run inference on a local model at /root/models/Meta-Llama-3.1-8B-Instruct (which contains an aphrodite_config.yaml in its directory). We can simply send our completions request as normal, but with the desired model:

# curl -s -X POST http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/root/models/Meta-Llama-3.1-8B-Instruct", "prompt": "Once upon a time", "max_tokens": 128}' | jq .
{
  "id": "cmpl-140069eb7f1a44d49c0b09f49967f3a9",
  "object": "text_completion",
  "created": 1762199086,
  "model": "/root/models/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " there was a man who devoted his entire life to the service of God. He lived a simple and austere life on the banks of the River Ganga, surrounded by the vast and serene landscape of the Indian Himalayan mountains. This saint, known as Ganga Baba, was revered and respected everywhere for his deep devotion, compassion, and holiness.\nAs he traveled from place to place, spreading the message of love, equality, and non-violence, he attracted thousands of people belonging to all walks of life. The curious and the indifferent, the rich and the poor, the young and the old—they all gathered ‘round to listen",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 133,
    "completion_tokens": 128,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
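
From a client's point of view, inline loading needs nothing beyond a normal completions request. A sketch of that request, plus a check that the response was produced by the requested model; it assumes the server was started with --enable-inline-model-loading, and both helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 128
) -> urllib.request.Request:
    """Build a /v1/completions POST naming the model that should serve it."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def answered_by(response: dict, model: str) -> bool:
    """True if the completion response reports the requested model."""
    return response.get("model") == model
```

Since the swap reloads the engine, the first request for a new model can be expected to take roughly as long as a /v1/load_model call, so client timeouts may need to be raised.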

@gemini-code-assist

Summary of Changes

Hello @AlpinDale, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Aphrodite API server's capabilities by introducing dynamic model management. It allows users to unload the currently active model to free up GPU resources and then load a different model or reload the existing one with new configurations without requiring a full server restart. This provides greater operational flexibility and efficiency for managing large language models in a serving environment.

Highlights

  • Model Unloading Endpoint: Introduced a new API endpoint, /v1/unload_model, which allows for the graceful shutdown of the current model engine, draining all in-flight requests, and freeing up GPU memory and CUDA context.
  • Model Loading Endpoint: Added a new API endpoint, /v1/load_model, enabling dynamic loading of models. This endpoint supports loading the original model, a new model, or reloading with updated configurations provided via JSON body, YAML file upload, or an aphrodite_config.yaml located in the model directory.
  • Flexible Configuration: The /v1/load_model endpoint offers a flexible configuration priority system: request-provided configurations (JSON or uploaded YAML) take precedence, followed by original server startup arguments, and then aphrodite_config.yaml from the model directory (if no explicit config is provided in the request).
  • Dynamic Watchdog Updates: The server's watchdog loop has been updated to dynamically check the current engine client from the application state, ensuring it monitors the correct engine instance even after models are unloaded and reloaded.
  • Model Directory Configuration: Implemented functionality to automatically detect and apply configuration settings from an aphrodite_config.yaml file within the model's directory, simplifying model deployment and configuration management.
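
The watchdog change in the highlights above can be pictured as a loop that re-reads the engine client from shared state on every iteration, rather than capturing one client at startup. This is only an illustrative sketch: the attribute names `engine_client` and `errored`, and the `on_error` callback, are assumptions and may not match the actual server code.

```python
import asyncio
from types import SimpleNamespace


async def watchdog_loop(state, interval: float, on_error) -> None:
    """Poll whichever engine client is currently stored on `state`."""
    while True:
        # Re-read on every pass so an unload/reload swaps the monitored client.
        client = getattr(state, "engine_client", None)
        if client is not None and client.errored:
            on_error(client)
            return
        await asyncio.sleep(interval)


async def demo() -> list:
    """Swap the client while the loop runs and record which one is reported."""
    state = SimpleNamespace(engine_client=SimpleNamespace(name="old", errored=False))
    reported = []
    task = asyncio.create_task(watchdog_loop(state, 0.01, reported.append))
    await asyncio.sleep(0.03)
    state.engine_client = None  # model unloaded: nothing to monitor
    await asyncio.sleep(0.03)
    state.engine_client = SimpleNamespace(name="new", errored=True)  # reloaded, then fails
    await asyncio.wait_for(task, timeout=1.0)
    return reported
```

Running `asyncio.run(demo())` reports only the client installed after the reload, which is the behavior a captured reference to the old client would miss.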

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist

Summary of Changes

Hello @AlpinDale, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Aphrodite API server's flexibility by introducing robust model management capabilities. It allows users to dynamically unload the currently active large language model, thereby releasing GPU resources, and subsequently load a different model or reconfigure the existing one without requiring a full server restart. This is achieved through new "/v1/unload_model" and "/v1/load_model" endpoints, which support various methods for specifying model configurations, including inline JSON, YAML file uploads, and "aphrodite_config.yaml" files located in model directories. The changes also ensure that the server's internal monitoring systems correctly adapt to these dynamic model changes.

Highlights

  • Dynamic Model Management Endpoints: Introduced "/v1/unload_model" and "/v1/load_model" API endpoints for runtime model control.
  • Model Unloading: Allows graceful shutdown of the current model engine, freeing up all associated GPU memory and resources.
  • Flexible Model Loading: Enables loading new models or reloading the current one with configurable parameters, supporting JSON body, YAML file uploads, and "aphrodite_config.yaml" files within model directories.
  • Configuration Prioritization: Establishes a clear hierarchy for applying model configurations: request-provided settings take precedence over model directory configurations, which in turn override original server startup arguments.
  • Watchdog Adaptability: The server's watchdog mechanism has been updated to dynamically monitor the active engine client, ensuring proper error handling even after models are reloaded.
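
The ordering stated in this summary (request over model directory over startup args) can be sketched as simple layering. Note that the earlier summary comment lists a different order, and the code excerpt under review only consults the model directory when the request carries no config, so this should be read as an illustration of the idea rather than the merged behavior. The function name and the `startup` source tag are assumptions.

```python
from typing import Optional


def resolve_config(
    startup: dict, model_dir: Optional[dict], request: Optional[dict]
) -> dict:
    """Layer settings so that later (higher-priority) sources override earlier ones."""
    resolved = {k: {"value": v, "source": "startup"} for k, v in startup.items()}
    # Apply the lower-priority override first so the request layer wins ties.
    for source, layer in (("model_dir", model_dir), ("request", request)):
        for key, value in (layer or {}).items():
            resolved[key] = {"value": value, "source": source}
    return resolved
```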

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces valuable model management endpoints for loading and unloading models, which is a significant enhancement for development and resource management. The implementation is generally robust, handling various configuration sources and correctly updating the server state. My feedback focuses on a few areas to improve the implementation further: enhancing error handling for malformed API requests, refactoring duplicated code to improve maintainability, and standardizing the location of imports to align with common Python conventions.

Comment on lines +1205 to +1206
    except json.JSONDecodeError:
        pass

high

Silently ignoring a JSONDecodeError can lead to confusing behavior for the user. If a user provides a malformed JSON body, the request will proceed as if no body was sent, loading the default model instead of returning an error. It's better to raise an HTTPException to inform the user about the malformed request.

Suggested change
-    except json.JSONDecodeError:
-        pass
+    except json.JSONDecodeError as e:
+        raise HTTPException(
+            status_code=HTTPStatus.BAD_REQUEST.value,
+            detail=f"Invalid JSON in request body: {e}"
+        ) from e

state.engine_args = args
if not hasattr(state, 'original_engine_args'):
    # Only store original args on first init, not on subsequent loads
    from copy import deepcopy

medium

For consistency and adherence to PEP 8, imports should be at the top of the file. Please remove this import from here and add from copy import deepcopy to the file-level imports.

shutdown_start = time.time()
old_engine_client.shutdown()

import torch.cuda

medium

For consistency and adherence to PEP 8, imports should be at the top of the file. Please remove this import from here and add import torch.cuda to the file-level imports. This improves readability by declaring all dependencies in one place.

Comment on lines +380 to +409
if config_data is None:
    model_config_yaml = get_model_config_yaml(
        args.model, getattr(args, 'download_dir', None)
    )

    if model_config_yaml:
        logger.info(
            f"Found aphrodite_config in model directory with "
            f"{len(model_config_yaml)} settings"
        )
        for key, value in model_config_yaml.items():
            attr_name = key.replace('-', '_')
            # Don't override the model path if it was explicitly provided in request
            if attr_name == 'model' and model is not None:
                continue
            if hasattr(args, attr_name):
                old_value = getattr(args, attr_name)
                setattr(args, attr_name, value)
                config_applied[key] = {
                    "old": old_value,
                    "new": value,
                    "source": "model_dir"
                }
                logger.info(
                    f"Config from model dir: {key} = {value} (was: {old_value})"
                )
            else:
                logger.warning(
                    f"Unknown config key in model directory: {key} - ignoring"
                )

medium

This block of logic for applying configuration from aphrodite_config.yaml is very similar to the logic in api_server.py (lines 185-200). To improve maintainability and reduce code duplication, consider refactoring this into a shared helper function that can be called from both places. The helper could take the args object, the model_config_yaml dict, and optionally a dictionary to populate with applied changes for the API response.
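
One possible shape for the shared helper suggested here, sketched from the excerpt above. The function name `apply_model_dir_config` and the `keep_model` flag are hypothetical; the key normalization, the `old`/`new`/`source` record, and the warning for unknown keys follow the quoted code.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def apply_model_dir_config(
    args,
    model_config_yaml: dict,
    config_applied: Optional[dict] = None,
    keep_model: bool = False,
) -> None:
    """Apply aphrodite_config.yaml settings onto `args`, optionally recording changes."""
    for key, value in model_config_yaml.items():
        attr_name = key.replace("-", "_")
        # Don't override the model path if the caller supplied one explicitly.
        if attr_name == "model" and keep_model:
            continue
        if not hasattr(args, attr_name):
            logger.warning("Unknown config key in model directory: %s - ignoring", key)
            continue
        old_value = getattr(args, attr_name)
        setattr(args, attr_name, value)
        if config_applied is not None:
            config_applied[key] = {
                "old": old_value,
                "new": value,
                "source": "model_dir",
            }
```

The startup path could call it without `config_applied`, while the endpoint would pass a dict to populate for the API response.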


logger.info("Model load requested - initializing engine...")

from aphrodite.v1.engine.async_llm import AsyncLLM

medium

For consistency and adherence to PEP 8, imports should be at the top of the file. While lazy importing can help with startup times, it can also make dependencies harder to track. Please move this import to the top of the file.

return None

try:
    import yaml

medium

For consistency and adherence to PEP 8, imports should be at the top of the file. Please move import yaml to the file-level imports.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a powerful new feature for dynamically loading and unloading models via API endpoints. The implementation is well-structured, covering various configuration methods and ensuring proper state management, especially with the updated watchdog logic. My review focuses on enhancing error handling and logging to improve the robustness and debuggability of this new functionality. Overall, these are great additions to the project.

Comment on lines +1201 to +1206
try:
    body = await raw_request.json()
    model_name = body.get("model")
    config_data = body.get("config")
except json.JSONDecodeError:
    pass

medium

When handling a JSON request in the load_model endpoint, a json.JSONDecodeError is silently ignored. This means if a user sends a malformed JSON body, the request will proceed as if no body was sent, likely reloading the original model without any changes. This can be confusing for the user, who would expect an error message indicating that their JSON is invalid. It's better to catch this exception and return a 400 Bad Request response with a descriptive error message.

                except json.JSONDecodeError as e:
                    raise HTTPException(
                        status_code=HTTPStatus.BAD_REQUEST.value,
                        detail=f"Invalid JSON body: {e}",
                    ) from e
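
A framework-free sketch of the parsing behavior being requested. Here `BadRequest` stands in for the `HTTPException` with status 400 and is not a real Aphrodite or FastAPI class. Treating an empty body as "no overrides" is an assumption, made so that a bare POST can still reload the original model as the PR description says.

```python
import json
from typing import Optional, Tuple


class BadRequest(ValueError):
    """Stand-in for an HTTP 400 response in this sketch."""


def parse_load_body(raw: bytes) -> Tuple[Optional[str], Optional[dict]]:
    """Return (model, config) from a request body, rejecting malformed JSON."""
    if not raw.strip():
        return None, None  # empty body: no overrides requested
    try:
        body = json.loads(raw)
    except json.JSONDecodeError as e:
        # Surface the problem instead of silently falling back to defaults.
        raise BadRequest(f"Invalid JSON body: {e}") from e
    if not isinstance(body, dict):
        raise BadRequest("JSON body must be an object")
    return body.get("model"), body.get("config")
```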

Comment on lines +919 to +920
except Exception:
    return None

medium

The broad except Exception block when trying to download aphrodite_config.yaml from the Hugging Face Hub currently swallows any exception silently and returns None. While the config file is optional, this behavior can make it difficult to debug issues related to network connectivity, permissions, or unexpected errors from the huggingface_hub library. I suggest adding a warning log that includes the exception details to provide better visibility when a download fails.

        except Exception as e:
            logger.warning(
                "Failed to download aphrodite_config.yaml from hub for %s: %s",
                model_name_or_path,
                e,
            )
            return None
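
The suggestion can be exercised in isolation with a stub downloader. In this sketch the `download` callable is a placeholder for whatever function actually fetches the file from the hub, and `try_fetch_config` and the logger name are hypothetical.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("aphrodite_config_sketch")  # hypothetical logger name


def try_fetch_config(
    model_name_or_path: str, download: Callable[[str], str]
) -> Optional[str]:
    """Return the downloaded config path, or None (with a warning) on any failure."""
    try:
        return download(model_name_or_path)
    except Exception as e:
        # The config file is optional, so log and continue instead of failing.
        logger.warning(
            "Failed to download aphrodite_config.yaml from hub for %s: %s",
            model_name_or_path,
            e,
        )
        return None
```

A missing optional file and a genuine network or permission error both land in the same warning here; a stricter version could catch the hub client's "file not found" error separately and log it at a lower level.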

AlpinDale merged commit 5785dbd into main on Nov 3, 2025 (0 of 4 checks passed)
AlpinDale deleted the model_load_unload branch on November 3, 2025 19:51