diff --git a/README.md b/README.md
index ee58de0d..51720641 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
 [![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/develop/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)
 ![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](vec_inf/launch_server.sh), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.csv`](vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](vec_inf/cli/_helper.py), [`cli/_config.py`](vec_inf/cli/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm), and the model configurations in [`models.yaml`](vec_inf/config/models.yaml) accordingly.

 ## Installation
 If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install package:
@@ -22,8 +22,7 @@ Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up

 ### `launch` command

-The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
-the user to send requests for inference.
+The `launch` command allows users to deploy a model as a Slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

 We will use the Llama 3.1 model as example, to launch an OpenAI compatible inference server for Meta-Llama-3.1-8B-Instruct, run:

@@ -32,11 +31,12 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct
 ```
 You should see an output like the following:

-launch_img
+launch_img
+

 #### Overrides

-Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
+Models that are already supported by `vec-inf` will be launched using the cached configuration or [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
 overriden. For example, if `qos` is to be overriden:

 ```bash
 vec-inf launch Meta-Llama-3.1-8B-Instruct --qos
 ```
@@ -46,11 +46,11 @@
 #### Custom models

 You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), and make sure to follow the instructions below:
-* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` (`$MODEL_VARIANT` is OPTIONAL).
 * Your model weights directory should contain HuggingFace format weights.
-* You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
-Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
-should be specified in that config file.
+* You should specify your model configuration by:
+  * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Using launch command options to specify your model setup.
 * For other model launch parameters you can reference the default values for similar models using the [`list` command ](#list-command).

 Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -64,7 +64,7 @@ models:
     model_family: Qwen2.5
     model_variant: 7B-Instruct-1M
     model_type: LLM
-    num_gpus: 2
+    gpus_per_node: 1
     num_nodes: 1
     vocab_size: 152064
     max_model_len: 1010000
@@ -74,9 +74,6 @@ models:
     qos: m2
     time: 08:00:00
     partition: a40
-    data_type: auto
-    venv: singularity
-    log_dir: default
     model_weights_parent_dir: /h//model-weights
 ```
@@ -86,17 +83,21 @@ You would then set the `VEC_INF_CONFIG` path using:
 ```
 export VEC_INF_CONFIG=/h//my-model-config.yaml
 ```

-Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+Note that there are other parameters that can also be added to the config but are not shown in this example, such as `data_type` and `log_dir`.

 ### `status` command
 You can check the inference server status by providing the Slurm job ID to the `status` command:

 ```bash
-vec-inf status 13014393
+vec-inf status 15373800
 ```

-You should see an output like the following:
+If the server is still pending (waiting for resources), you should see an output like this:
+
+status_pending_img

-status_img
+When the server is ready, you should see an output like this:
+
+status_ready_img

 There are 5 possible states:
@@ -111,19 +112,19 @@ Note that the base URL is only available when model is in `READY` state, and if

 ### `metrics` command
 Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:

 ```bash
-vec-inf metrics 13014393
+vec-inf metrics 15373800
 ```

-And you will see the performance metrics streamed to your console, note that the metrics are updated with a 10-second interval.
+And you will see the performance metrics streamed to your console; note that the metrics are updated at a 2-second interval.

-metrics_img
+metrics_img

 ### `shutdown` command
 Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:

 ```bash
-vec-inf shutdown 13014393
+vec-inf shutdown 15373800

-> Shutting down model with Slurm Job ID: 13014393
+> Shutting down model with Slurm Job ID: 15373800
 ```

 ### `list` command
@@ -133,19 +134,50 @@ vec-inf list
 ```

 list_img
+NOTE: The above screenshot does not show the full list of supported models.

 You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:

 ```bash
 vec-inf list Meta-Llama-3.1-70B-Instruct
 ```

-list_model_img
+list_model_img

 `launch`, `list`, and `status` command supports `--json-mode`, where the command output would be structured as a JSON string.

 ## Send inference requests
-Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
-> {"id":"cmpl-c08d8946224747af9cce9f4d9f36ceb3","object":"text_completion","created":1725394970,"model":"Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" is a question that many people may wonder. The answer is, of course, Ottawa. But if","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
-
+Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:
+
+```json
+{
+  "id":"chatcmpl-387c2579231948ffaf66cdda5439d3dc",
+  "choices": [
+    {
+      "finish_reason":"stop",
+      "index":0,
+      "logprobs":null,
+      "message": {
+        "content":"Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+        "role":"assistant",
+        "function_call":null,
+        "tool_calls":[],
+        "reasoning_content":null
+      },
+      "stop_reason":null
+    }
+  ],
+  "created":1742496683,
+  "model":"Meta-Llama-3.1-8B-Instruct",
+  "object":"chat.completion",
+  "system_fingerprint":null,
+  "usage": {
+    "completion_tokens":66,
+    "prompt_tokens":32,
+    "total_tokens":98,
+    "prompt_tokens_details":null
+  },
+  "prompt_logprobs":null
+}
+```
 **NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.

 ## SSH tunnel from your local device
diff --git a/docs/source/index.md b/docs/source/index.md
index cfb94c9e..aadb09bf 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -11,7 +11,7 @@ user_guide
 ```

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/launch_server.sh), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm) and [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_helper.py), [`cli/_config.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_config.py), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm), and the model configurations in [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.

 ## Installation
diff --git a/docs/source/user_guide.md b/docs/source/user_guide.md
index 6a69936b..dced9db2 100644
--- a/docs/source/user_guide.md
+++ b/docs/source/user_guide.md
@@ -4,8 +4,7 @@

 ### `launch` command

-The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
-the user to send requests for inference.
+The `launch` command allows users to deploy a model as a Slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

 We will use the Llama 3.1 model as example, to launch an OpenAI compatible inference server for Meta-Llama-3.1-8B-Instruct, run:

@@ -14,12 +13,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct
 ```

 You should see an output like the following:
-launch_img
+launch_img

 #### Overrides

-Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
-overriden. For example, if `qos` is to be overriden:
+Models that are already supported by `vec-inf` will be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be overridden. For example, if `qos` is to be overridden:

 ```bash
 vec-inf launch Meta-Llama-3.1-8B-Instruct --qos
@@ -28,11 +26,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct --qos
 #### Custom models

 You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), and make sure to follow the instructions below:
-* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` (`$MODEL_VARIANT` is OPTIONAL).
 * Your model weights directory should contain HuggingFace format weights.
-* You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
-Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
-should be specified in that config file.
+* You should specify your model configuration by:
+  * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Using launch command options to specify your model setup.
 * For other model launch parameters you can reference the default values for similar models using the [`list` command ](#list-command).

 Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -46,7 +44,7 @@ models:
     model_family: Qwen2.5
     model_variant: 7B-Instruct-1M
     model_type: LLM
-    num_gpus: 2
+    num_gpus: 1
     num_nodes: 1
     vocab_size: 152064
     max_model_len: 1010000
@@ -56,9 +54,6 @@ models:
     qos: m2
     time: 08:00:00
     partition: a40
-    data_type: auto
-    venv: singularity
-    log_dir: default
     model_weights_parent_dir: /h//model-weights
 ```
@@ -68,19 +63,23 @@ You would then set the `VEC_INF_CONFIG` path using:
 ```
 export VEC_INF_CONFIG=/h//my-model-config.yaml
 ```

-Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+Note that there are other parameters that can also be added to the config but are not shown in this example, such as `data_type` and `log_dir`.

 ### `status` command
 You can check the inference server status by providing the Slurm job ID to the `status` command:

 ```bash
-vec-inf status 13014393
+vec-inf status 15373800
 ```

-You should see an output like the following:
+If the server is still pending (waiting for resources), you should see an output like this:
+
+status_pending_img
+
+When the server is ready, you should see an output like this:

-status_img
+status_ready_img

 There are 5 possible states:
@@ -96,20 +95,20 @@
 Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:

 ```bash
-vec-inf metrics 13014393
+vec-inf metrics 15373800
 ```

-And you will see the performance metrics streamed to your console, note that the metrics are updated with a 10-second interval.
+And you will see the performance metrics streamed to your console; note that the metrics are updated at a 2-second interval.

-metrics_img
+metrics_img

 ### `shutdown` command
 Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:

 ```bash
-vec-inf shutdown 13014393
+vec-inf shutdown 15373800

-> Shutting down model with Slurm Job ID: 13014393
+> Shutting down model with Slurm Job ID: 15373800
 ```

 (list-command)=
@@ -130,34 +129,43 @@ You can also view the default setup for a specific supported model by providing
 vec-inf list Meta-Llama-3.1-70B-Instruct
 ```

-list_model_img
+list_model_img

 `launch`, `list`, and `status` command supports `--json-mode`, where the command output would be structured as a JSON string.

 ## Send inference requests

-Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](https://github.com/VectorInstitute/vector-inference/blob/main/examples) folder. Make sure to update the model server URL and the model weights location in the scripts.
-For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
+Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](https://github.com/VectorInstitute/vector-inference/blob/main/examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:

 ```json
 {
-  "id": "cmpl-c08d8946224747af9cce9f4d9f36ceb3",
-  "object": "text_completion",
-  "created": 1725394970,
-  "model": "Meta-Llama-3.1-8B-Instruct",
+  "id":"chatcmpl-387c2579231948ffaf66cdda5439d3dc",
   "choices": [
     {
-      "index": 0,
-      "text": " is a question that many people may wonder. The answer is, of course, Ottawa. But if",
-      "logprobs": null,
-      "finish_reason": "length",
-      "stop_reason": null
+      "finish_reason":"stop",
+      "index":0,
+      "logprobs":null,
+      "message": {
+        "content":"Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+        "role":"assistant",
+        "function_call":null,
+        "tool_calls":[],
+        "reasoning_content":null
+      },
+      "stop_reason":null
     }
   ],
+  "created":1742496683,
+  "model":"Meta-Llama-3.1-8B-Instruct",
+  "object":"chat.completion",
+  "system_fingerprint":null,
   "usage": {
-    "prompt_tokens": 8,
-    "total_tokens": 28,
-    "completion_tokens": 20
-  }
+    "completion_tokens":66,
+    "prompt_tokens":32,
+    "total_tokens":98,
+    "prompt_tokens_details":null
+  },
+  "prompt_logprobs":null
 }
 ```
diff --git a/examples/inference/llm/chat_completions.py b/examples/inference/llm/chat_completions.py
index ffe303eb..3fc796d8 100644
--- a/examples/inference/llm/chat_completions.py
+++ b/examples/inference/llm/chat_completions.py
@@ -18,4 +18,4 @@
     ],
 )

-print(completion)
+print(completion.model_dump_json())
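For reference, here is a minimal sketch of the kind of request the updated `examples/inference/llm/chat_completions.py` example sends, assuming the OpenAI Python client, a placeholder base URL taken from the `vec-inf status` output, and a dummy API key; the exact prompt and client setup in the repository's script may differ:

```python
from openai import OpenAI

# Placeholder host/port: use the base URL reported by `vec-inf status` once the
# model is READY. vLLM's OpenAI-compatible server accepts any placeholder API key.
client = OpenAI(base_url="http://gpuXXX:8080/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Who are you?"},
    ],
)

# Dump the full response as JSON, as the updated example does.
print(completion.model_dump_json())
```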