[API] feat: add model management endpoints for loading and unloading models by AlpinDale · Pull Request #1553 · dphnAI/sonar

AlpinDale · 2025-11-03T19:22:08Z

These endpoints require the environment variable APHRODITE_SERVER_DEV_MODE=1 to be set when launching the server.

You can call the /v1/unload_model endpoint to unload the model by killing the engine and worker processes (which frees up all memory), then call the /v1/load_model endpoint to reload your model.

Unloading a model

# curl -s -X POST http://localhost:2242/v1/unload_model | jq .
{
  "status": "success",
  "message": "Model unloaded successfully in 1.28s. All GPU memory has been freed.",
  "drain_time_s": 0.0,
  "shutdown_time_s": 1.28,
  "total_time_s": 1.28
}
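
For scripted use, the same calls can be made from Python. This is a minimal sketch using only the standard library; `BASE_URL`, `build_post`, `post_json` and `unload_then_reload` are illustrative names rather than anything shipped with Aphrodite, and the address is simply the one used in the curl examples here.

```python
import json
import urllib.request

# Assumed server address, copied from the curl examples in this description.
BASE_URL = "http://localhost:2242"


def build_post(path, payload=None):
    """Build (but do not send) a POST request for a management endpoint."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method="POST"
    )


def post_json(path, payload=None, timeout=600.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_post(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def unload_then_reload():
    """Unload the current model, then reload the original one unchanged."""
    unloaded = post_json("/v1/unload_model")
    loaded = post_json("/v1/load_model")
    return unloaded, loaded
```

Calling `unload_then_reload()` needs a running server started in dev mode. The generous default timeout is there because loading can take over a minute, as the timings in the load examples show.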

Loading a model

Calling the /v1/load_model endpoint with no body reloads your original model with no changes. If you wish to swap out the model or change the launch args, there are a few options:

Pass a config YAML file

You can call the /v1/load_model endpoint with a file upload that contains your YAML config with the desired args, e.g.:

model: Qwen/Qwen3-32B-FP8
tensor_parallel_size: 2
max_model_len: 8192

And use the endpoint like this:

# curl -s -X POST http://localhost:2242/v1/load_model -F "config=@config.yaml" | jq .
{
  "status": "success",
  "message": "Model loaded successfully in 87.01s.",
  "load_time_s": 87.01,
  "model": "Qwen/Qwen3-32B-FP8",
  "config_applied": {
    "model": {
      "value": "Qwen/Qwen3-32B-FP8",
      "source": "uploaded"
    },
    "tensor_parallel_size": {
      "value": 2,
      "source": "uploaded"
    },
    "max_model_len": {
      "value": 8192,
      "source": "uploaded"
    }
  }
}
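
If you would rather not shell out to curl, the upload can be built by hand. The sketch below assembles a multipart/form-data body with the standard library; the form field name `config` comes from the curl command above, while `build_config_upload` and the `application/octet-stream` part type (what curl sends by default for `-F` file uploads) are assumptions for illustration.

```python
import urllib.request
import uuid


def build_config_upload(url, yaml_text, filename="config.yaml"):
    """Build a multipart/form-data POST carrying a YAML config as field 'config'."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="config"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
        f"{yaml_text}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

The returned request can be passed to `urllib.request.urlopen` once the server is up.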

Pass the launch arguments directly

You don't need to provide a YAML config; the model and launch arguments can be sent directly in a JSON body:

# curl -s -X POST http://localhost:2242/v1/load_model \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B-FP8",
    "config": {
      "tensor_parallel_size": 2,
      "max_model_len": 32768
    }
  }' | jq .
{
  "status": "success",
  "message": "Model loaded successfully in 97.10s.",
  "load_time_s": 97.1,
  "model": "Qwen/Qwen3-32B-FP8",
  "config_applied": {
    "model": {
      "value": "Qwen/Qwen3-32B-FP8",
      "source": "request"
    },
    "tensor_parallel_size": {
      "value": 2,
      "source": "uploaded"
    },
    "max_model_len": {
      "value": 32768,
      "source": "uploaded"
    }
  }
}
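
The JSON body above has a simple shape: the model at the top level and the launch arguments nested under `config`. A small sketch of building it programmatically follows; `load_model_payload` is a hypothetical helper name, and only the nested layout shown in the curl example is assumed.

```python
import json


def load_model_payload(model, **launch_args):
    """Serialize a /v1/load_model body: model at top level, launch args under 'config'."""
    payload = {"model": model}
    if launch_args:
        payload["config"] = launch_args
    return json.dumps(payload, indent=2)
```

For example, `load_model_payload("Qwen/Qwen3-32B-FP8", tensor_parallel_size=2, max_model_len=32768)` produces the same body as the curl command.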

Aphrodite config in model directory

You may also put an aphrodite_config.yaml in your model directory (local or on HF), and Aphrodite will use the args there when loading the model via this endpoint.

# cat /root/models/Meta-Llama-3.1-8B-Instruct/aphrodite_config.yaml 
max_model_len: 4096
tensor_parallel_size: 2
gpu_memory_utilization: 0.9

Then load your model:

# curl -s -X POST http://localhost:2242/v1/load_model \
  -H "Content-Type: application/json" \
  -d '{"model": "/root/models/Meta-Llama-3.1-8B-Instruct"}' | jq .
{
  "status": "success",
  "message": "Model loaded successfully in 44.03s.",
  "load_time_s": 44.03,
  "model": "/root/models/Meta-Llama-3.1-8B-Instruct",
  "config_applied": {
    "model": {
      "value": "/root/models/Meta-Llama-3.1-8B-Instruct",
      "source": "request"
    },
    "max_model_len": {
      "value": 4096,
      "source": "model_dir"
    },
    "tensor_parallel_size": {
      "value": 2,
      "source": "model_dir"
    },
    "gpu_memory_utilization": {
      "value": 0.9,
      "source": "model_dir"
    }
  }
}
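
The `config_applied` field reports where each setting came from, which is handy when several sources are in play. A minimal sketch for grouping it by source, assuming the response shape shown in the examples here (each entry has `value` and `source`); `group_by_source` is an illustrative name.

```python
from collections import defaultdict


def group_by_source(config_applied):
    """Turn {key: {"value": v, "source": s}} into {s: {key: v}}."""
    grouped = defaultdict(dict)
    for key, entry in config_applied.items():
        grouped[entry["source"]][key] = entry["value"]
    return dict(grouped)
```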

Inline model loading

If you launch the Aphrodite API server with --enable-inline-model-loading, you can use a different model name in your request, and it'll swap the models out and complete your request with the provided one instead.

For example, suppose the model Qwen/Qwen3-0.6B is loaded, but we want to run inference on a local model at /root/models/Meta-Llama-3.1-8B-Instruct (which contains an aphrodite_config.yaml in its directory). We can simply send our completions request as normal, but with the desired model:

# curl -s -X POST http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/root/models/Meta-Llama-3.1-8B-Instruct", "prompt": "Once upon a time", "max_tokens": 128}' | jq .
{
  "id": "cmpl-140069eb7f1a44d49c0b09f49967f3a9",
  "object": "text_completion",
  "created": 1762199086,
  "model": "/root/models/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " there was a man who devoted his entire life to the service of God. He lived a simple and austere life on the banks of the River Ganga, surrounded by the vast and serene landscape of the Indian Himalayan mountains. This saint, known as Ganga Baba, was revered and respected everywhere for his deep devotion, compassion, and holiness.\nAs he traveled from place to place, spreading the message of love, equality, and non-violence, he attracted thousands of people belonging to all walks of life. The curious and the indifferent, the rich and the poor, the young and the old—they all gathered ‘round to listen",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 133,
    "completion_tokens": 128,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}


gemini-code-assist · 2025-11-03T19:22:24Z

Summary of Changes

Hello @AlpinDale, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Aphrodite API server's capabilities by introducing dynamic model management. It allows users to unload the currently active model to free up GPU resources and then load a different model or reload the existing one with new configurations without requiring a full server restart. This provides greater operational flexibility and efficiency for managing large language models in a serving environment.

Highlights

Model Unloading Endpoint: Introduced a new API endpoint, /v1/unload_model, which allows for the graceful shutdown of the current model engine, draining all in-flight requests, and freeing up GPU memory and CUDA context.
Model Loading Endpoint: Added a new API endpoint, /v1/load_model, enabling dynamic loading of models. This endpoint supports loading the original model, a new model, or reloading with updated configurations provided via JSON body, YAML file upload, or an aphrodite_config.yaml located in the model directory.
Flexible Configuration: The /v1/load_model endpoint offers a flexible configuration priority system: request-provided configurations (JSON or uploaded YAML) take precedence, followed by original server startup arguments, and then aphrodite_config.yaml from the model directory (if no explicit config is provided in the request).
Dynamic Watchdog Updates: The server's watchdog loop has been updated to dynamically check the current engine client from the application state, ensuring it monitors the correct engine instance even after models are unloaded and reloaded.
Model Directory Configuration: Implemented functionality to automatically detect and apply configuration settings from an aphrodite_config.yaml file within the model's directory, simplifying model deployment and configuration management.
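
The idea behind the Dynamic Watchdog Updates highlight can be sketched as follows. This is not the PR's implementation: the attribute names `engine_client` and `errored`, and the `max_checks` parameter (added only so the loop can terminate in a test), are assumptions. The point is that the engine is looked up on the application state on every pass rather than captured once before the loop.

```python
import asyncio


async def watchdog_loop(state, interval_s=5.0, max_checks=None):
    """Poll whichever engine is currently stored on `state`.

    Returns the engine that reported an error, or None if `max_checks`
    passes complete without one.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        # Re-read the engine each iteration so a reloaded model is monitored
        # instead of a stale reference to the engine that was shut down.
        engine = getattr(state, "engine_client", None)
        if engine is not None and engine.errored:
            return engine
        await asyncio.sleep(interval_s)
    return None
```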

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist · 2025-11-03T19:23:07Z

Summary of Changes

Hello @AlpinDale, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Aphrodite API server's flexibility by introducing robust model management capabilities. It allows users to dynamically unload the currently active large language model, thereby releasing GPU resources, and subsequently load a different model or reconfigure the existing one without requiring a full server restart. This is achieved through new "/v1/unload_model" and "/v1/load_model" endpoints, which support various methods for specifying model configurations, including inline JSON, YAML file uploads, and "aphrodite_config.yaml" files located in model directories. The changes also ensure that the server's internal monitoring systems correctly adapt to these dynamic model changes.

Highlights

Dynamic Model Management Endpoints: Introduced "/v1/unload_model" and "/v1/load_model" API endpoints for runtime model control.
Model Unloading: Allows graceful shutdown of the current model engine, freeing up all associated GPU memory and resources.
Flexible Model Loading: Enables loading new models or reloading the current one with configurable parameters, supporting JSON body, YAML file uploads, and "aphrodite_config.yaml" files within model directories.
Configuration Prioritization: Establishes a clear hierarchy for applying model configurations: request-provided settings take precedence over model directory configurations, which in turn override original server startup arguments.
Watchdog Adaptability: The server's watchdog mechanism has been updated to dynamically monitor the active engine client, ensuring proper error handling even after models are reloaded.


gemini-code-assist

Code Review

This pull request introduces valuable model management endpoints for loading and unloading models, which is a significant enhancement for development and resource management. The implementation is generally robust, handling various configuration sources and correctly updating the server state. My feedback focuses on a few areas to improve the implementation further: enhancing error handling for malformed API requests, refactoring duplicated code to improve maintainability, and standardizing the location of imports to align with common Python conventions.

gemini-code-assist · 2025-11-03T19:24:19Z

+                except json.JSONDecodeError:
+                    pass


Silently ignoring a JSONDecodeError can lead to confusing behavior for the user. If a user provides a malformed JSON body, the request will proceed as if no body was sent, loading the default model instead of returning an error. It's better to raise an HTTPException to inform the user about the malformed request.

Suggested change

-                except json.JSONDecodeError:
-                    pass
+                except json.JSONDecodeError as e:
+                    raise HTTPException(
+                        status_code=HTTPStatus.BAD_REQUEST.value,
+                        detail=f"Invalid JSON in request body: {e}"
+                    ) from e

gemini-code-assist · 2025-11-03T19:24:19Z

+    state.engine_args = args
+    if not hasattr(state, 'original_engine_args'):
+        # Only store original args on first init, not on subsequent loads
+        from copy import deepcopy


For consistency and adherence to PEP 8, imports should be at the top of the file. Please remove this import from here and add from copy import deepcopy to the file-level imports.
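
With the import hoisted to module level, the surrounding logic might look like the sketch below. The diff excerpt stops at the import line, so the `deepcopy(args)` assignment and the function name `remember_engine_args` are assumptions about what follows, not a quote from the PR.

```python
from copy import deepcopy  # module-level import, as suggested


def remember_engine_args(state, args):
    """Record the active args; snapshot the originals only on first init."""
    state.engine_args = args
    if not hasattr(state, "original_engine_args"):
        # A deep copy keeps the snapshot isolated from later in-place edits.
        state.original_engine_args = deepcopy(args)
```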

gemini-code-assist · 2025-11-03T19:24:20Z

+        shutdown_start = time.time()
+        old_engine_client.shutdown()
+
+        import torch.cuda


For consistency and adherence to PEP 8, imports should be at the top of the file. Please remove this import from here and add import torch.cuda to the file-level imports. This improves readability by declaring all dependencies in one place.

gemini-code-assist · 2025-11-03T19:24:20Z

+        if config_data is None:
+            model_config_yaml = get_model_config_yaml(
+                args.model, getattr(args, 'download_dir', None)
+            )
+
+            if model_config_yaml:
+                logger.info(
+                    f"Found aphrodite_config in model directory with "
+                    f"{len(model_config_yaml)} settings"
+                )
+                for key, value in model_config_yaml.items():
+                    attr_name = key.replace('-', '_')
+                    # Don't override the model path if it was explicitly provided in request
+                    if attr_name == 'model' and model is not None:
+                        continue
+                    if hasattr(args, attr_name):
+                        old_value = getattr(args, attr_name)
+                        setattr(args, attr_name, value)
+                        config_applied[key] = {
+                            "old": old_value,
+                            "new": value,
+                            "source": "model_dir"
+                        }
+                        logger.info(
+                            f"Config from model dir: {key} = {value} (was: {old_value})"
+                        )
+                    else:
+                        logger.warning(
+                            f"Unknown config key in model directory: {key} - ignoring"
+                        )


This block of logic for applying configuration from aphrodite_config.yaml is very similar to the logic in api_server.py (lines 185-200). To improve maintainability and reduce code duplication, consider refactoring this into a shared helper function that can be called from both places. The helper could take the args object, the model_config_yaml dict, and optionally a dictionary to populate with applied changes for the API response.
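
One possible shape for such a helper, sketched from the diff above. The name `apply_model_dir_config` and the `keep_model` flag are hypothetical; the key normalisation, the skip for an explicitly requested model, and the `old`/`new`/`source` record mirror the quoted block.

```python
import logging

logger = logging.getLogger(__name__)


def apply_model_dir_config(args, model_config_yaml, config_applied=None, keep_model=False):
    """Apply settings from an aphrodite_config.yaml dict onto `args` in place.

    If `config_applied` is given, it is populated with the changes made so an
    API handler can return them; startup code can simply omit it.
    """
    for key, value in model_config_yaml.items():
        attr_name = key.replace("-", "_")
        # Don't override the model path if it was explicitly provided.
        if attr_name == "model" and keep_model:
            continue
        if not hasattr(args, attr_name):
            logger.warning("Unknown config key in model directory: %s - ignoring", key)
            continue
        old_value = getattr(args, attr_name)
        setattr(args, attr_name, value)
        if config_applied is not None:
            config_applied[key] = {"old": old_value, "new": value, "source": "model_dir"}
        logger.info("Config from model dir: %s = %s (was: %s)", key, value, old_value)
    return args
```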

gemini-code-assist · 2025-11-03T19:24:20Z

+
+        logger.info("Model load requested - initializing engine...")
+
+        from aphrodite.v1.engine.async_llm import AsyncLLM


For consistency and adherence to PEP 8, imports should be at the top of the file. While lazy importing can help with startup times, it can also make dependencies harder to track. Please move this import to the top of the file.

gemini-code-assist · 2025-11-03T19:24:20Z

+        return None
+
+    try:
+        import yaml


For consistency and adherence to PEP 8, imports should be at the top of the file. Please move import yaml to the file-level imports.

gemini-code-assist

Code Review

This pull request introduces a powerful new feature for dynamically loading and unloading models via API endpoints. The implementation is well-structured, covering various configuration methods and ensuring proper state management, especially with the updated watchdog logic. My review focuses on enhancing error handling and logging to improve the robustness and debuggability of this new functionality. Overall, these are great additions to the project.

gemini-code-assist · 2025-11-03T19:24:36Z

+                try:
+                    body = await raw_request.json()
+                    model_name = body.get("model")
+                    config_data = body.get("config")
+                except json.JSONDecodeError:
+                    pass


When handling a JSON request in the load_model endpoint, a json.JSONDecodeError is silently ignored. This means if a user sends a malformed JSON body, the request will proceed as if no body was sent, likely reloading the original model without any changes. This can be confusing for the user, who would expect an error message indicating that their JSON is invalid. It's better to catch this exception and return a 400 Bad Request response with a descriptive error message.

                except json.JSONDecodeError as e:
                    raise HTTPException(
                        status_code=HTTPStatus.BAD_REQUEST.value,
                        detail=f"Invalid JSON body: {e}",
                    ) from e
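
The behaviour being asked for can be shown without the web framework. In this standalone sketch, `BadRequest` stands in for the framework's `HTTPException` and `parse_load_body` is a hypothetical helper: an empty body still means "reload the original model", while a malformed body is rejected instead of being silently treated as empty.

```python
import json


class BadRequest(Exception):
    """Stand-in for an HTTP 400 error in this framework-free sketch."""

    status_code = 400


def parse_load_body(raw):
    """Return (model, config) from a JSON body string; both None if the body is empty."""
    if not raw.strip():
        return None, None
    try:
        body = json.loads(raw)
    except json.JSONDecodeError as e:
        raise BadRequest(f"Invalid JSON in request body: {e}") from e
    if not isinstance(body, dict):
        raise BadRequest("Request body must be a JSON object")
    return body.get("model"), body.get("config")
```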

gemini-code-assist · 2025-11-03T19:24:37Z

+        except Exception:
+            return None


The broad except Exception block when trying to download aphrodite_config.yaml from the Hugging Face Hub currently swallows any exception silently and returns None. While the config file is optional, this behavior can make it difficult to debug issues related to network connectivity, permissions, or unexpected errors from the huggingface_hub library. I suggest adding a warning log that includes the exception details to provide better visibility when a download fails.

        except Exception as e:
            logger.warning(
                "Failed to download aphrodite_config.yaml from hub for %s: %s",
                model_name_or_path,
                e,
            )
            return None
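
A self-contained illustration of the same pattern, with the download abstracted behind a callable so it runs without network access. `try_fetch_config` and the `fetch` parameter are illustrative names, not part of the PR or of huggingface_hub; the file stays optional, but a failure now leaves a trace in the logs.

```python
import logging

logger = logging.getLogger("model_dir_config")


def try_fetch_config(model_name_or_path, fetch):
    """Call `fetch(model_name_or_path)`; on any failure, log a warning and return None."""
    try:
        return fetch(model_name_or_path)
    except Exception as e:
        logger.warning(
            "Failed to download aphrodite_config.yaml from hub for %s: %s",
            model_name_or_path,
            e,
        )
        return None
```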


[API] feat: add model management endpoints for loading and unloading …

82e7efc

…models

gemini-code-assist Bot reviewed Nov 3, 2025


AlpinDale added 2 commits November 3, 2025 19:28

[API] feat: enhance load_model endpoint to support flat and nested co…

c88b6bf

…nfig formats

add inline loading

229ce4a

AlpinDale merged commit 5785dbd into main Nov 3, 2025
0 of 4 checks passed

AlpinDale deleted the model_load_unload branch November 3, 2025 19:51

AlpinDale mentioned this pull request Nov 3, 2025

[API] feat: add multi-model support #1554

Merged

