diff --git a/README.md b/README.md
index ee58de0d..51720641 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
[](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](vec_inf/launch_server.sh), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.csv`](vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](vec_inf/cli/_helper.py), [`cli/_config.py`](vec_inf/cli/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm), and the model configurations in [`models.yaml`](vec_inf/config/models.yaml) accordingly.
## Installation
If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install package:
@@ -22,8 +22,7 @@ Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up
### `launch` command
-The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
-the user to send requests for inference.
+The `launch` command allows users to deploy a model as a Slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.
We will use the Llama 3.1 model as example, to launch an OpenAI compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
@@ -32,11 +31,12 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct
```
You should see an output like the following:
-
+
+
#### Overrides
-Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
+Models that are already supported by `vec-inf` will be launched using the cached configuration or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
overriden. For example, if `qos` is to be overriden:
```bash
@@ -46,11 +46,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct --qos
#### Custom models
You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), and make sure to follow the instructions below:
-* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` (`$MODEL_VARIANT` is optional).
* Your model weights directory should contain HuggingFace format weights.
-* You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
-Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
-should be specified in that config file.
+* You should specify your model configuration by:
+  * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file; all parameters for the model should be specified there.
+  * Using launch command options to specify your model setup.
* For other model launch parameters you can reference the default values for similar models using the [`list` command ](#list-command).
Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -64,7 +64,7 @@ models:
model_family: Qwen2.5
model_variant: 7B-Instruct-1M
model_type: LLM
- num_gpus: 2
+ gpus_per_node: 1
num_nodes: 1
vocab_size: 152064
max_model_len: 1010000
@@ -74,9 +74,6 @@ models:
qos: m2
time: 08:00:00
partition: a40
- data_type: auto
- venv: singularity
- log_dir: default
model_weights_parent_dir: /h//model-weights
```
@@ -86,17 +83,21 @@ You would then set the `VEC_INF_CONFIG` path using:
export VEC_INF_CONFIG=/h//my-model-config.yaml
```
-Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+Note that other parameters not shown in this example, such as `data_type` and `log_dir`, can also be added to the config.
### `status` command
You can check the inference server status by providing the Slurm job ID to the `status` command:
```bash
-vec-inf status 13014393
+vec-inf status 15373800
```
-You should see an output like the following:
+If the server is still pending for resources, you should see an output like this:
+
+
-
+When the server is ready, you should see an output like this:
+
+
There are 5 possible states:
@@ -111,19 +112,19 @@ Note that the base URL is only available when model is in `READY` state, and if
### `metrics` command
Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
```bash
-vec-inf metrics 13014393
+vec-inf metrics 15373800
```
-And you will see the performance metrics streamed to your console, note that the metrics are updated with a 10-second interval.
+And you will see the performance metrics streamed to your console. Note that the metrics are updated at a 2-second interval.
-
+
### `shutdown` command
Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
```bash
-vec-inf shutdown 13014393
+vec-inf shutdown 15373800
-> Shutting down model with Slurm Job ID: 13014393
+> Shutting down model with Slurm Job ID: 15373800
```
### `list` command
@@ -133,19 +134,50 @@ vec-inf list
```
+NOTE: The screenshot above does not show the full list of supported models.
You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
```bash
vec-inf list Meta-Llama-3.1-70B-Instruct
```
-
+
`launch`, `list`, and `status` command supports `--json-mode`, where the command output would be structured as a JSON string.
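+
+If you are scripting around the CLI, the JSON output can be parsed programmatically. The sketch below is illustrative and not part of `vec-inf` itself: it launches a model with `--json-mode` and reads the resulting JSON, but the field name `slurm_job_id` is an assumption, so inspect the actual output on your cluster before relying on it.
+
+```python
+import json
+import subprocess
+
+# Launch a supported model with --json-mode and capture the structured output.
+result = subprocess.run(
+    ["vec-inf", "launch", "Meta-Llama-3.1-8B-Instruct", "--json-mode"],
+    capture_output=True,
+    text=True,
+    check=True,
+)
+launch_info = json.loads(result.stdout)
+
+# NOTE: "slurm_job_id" is a hypothetical key used for illustration only;
+# check the real JSON output for the actual field name.
+print("Launched Slurm job:", launch_info.get("slurm_job_id"))
+```
+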
## Send inference requests
-Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
-> {"id":"cmpl-c08d8946224747af9cce9f4d9f36ceb3","object":"text_completion","created":1725394970,"model":"Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" is a question that many people may wonder. The answer is, of course, Ottawa. But if","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
-
+Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:
+
+```json
+{
+ "id":"chatcmpl-387c2579231948ffaf66cdda5439d3dc",
+ "choices": [
+ {
+ "finish_reason":"stop",
+ "index":0,
+ "logprobs":null,
+ "message": {
+ "content":"Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+ "role":"assistant",
+ "function_call":null,
+ "tool_calls":[],
+ "reasoning_content":null
+ },
+ "stop_reason":null
+ }
+ ],
+ "created":1742496683,
+ "model":"Meta-Llama-3.1-8B-Instruct",
+ "object":"chat.completion",
+ "system_fingerprint":null,
+ "usage": {
+ "completion_tokens":66,
+ "prompt_tokens":32,
+ "total_tokens":98,
+ "prompt_tokens_details":null
+ },
+ "prompt_logprobs":null
+}
+```
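+
+For reference, a comparable request can be made directly with the OpenAI Python client. This is a minimal sketch rather than the exact contents of `chat_completions.py`; the base URL placeholder and the example messages are assumptions you would replace with your own values (the base URL is reported by the `status` command once the server is `READY`).
+
+```python
+from openai import OpenAI
+
+# Placeholder: use the base URL reported by `vec-inf status` for your job.
+# The API key is a placeholder as well; adjust if your deployment requires one.
+client = OpenAI(base_url="<BASE_URL>", api_key="EMPTY")
+
+completion = client.chat.completions.create(
+    model="Meta-Llama-3.1-8B-Instruct",
+    messages=[
+        {"role": "system", "content": "You are a pirate chatbot."},  # example prompt, not from the repo
+        {"role": "user", "content": "Who are you?"},
+    ],
+)
+
+# Print the full response as a JSON string, similar to the example output above.
+print(completion.model_dump_json())
+```
+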
**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
## SSH tunnel from your local device
diff --git a/docs/source/index.md b/docs/source/index.md
index cfb94c9e..aadb09bf 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -11,7 +11,7 @@ user_guide
```
-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/launch_server.sh), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm) and [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_helper.py), [`cli/_config.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_config.py), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm), and the model configurations in [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
## Installation
diff --git a/docs/source/user_guide.md b/docs/source/user_guide.md
index 6a69936b..dced9db2 100644
--- a/docs/source/user_guide.md
+++ b/docs/source/user_guide.md
@@ -4,8 +4,7 @@
### `launch` command
-The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
-the user to send requests for inference.
+The `launch` command allows users to deploy a model as a Slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.
We will use the Llama 3.1 model as example, to launch an OpenAI compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
@@ -14,12 +13,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct
```
You should see an output like the following:
-
+
#### Overrides
-Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
-overriden. For example, if `qos` is to be overriden:
+Models that are already supported by `vec-inf` will be launched using the cached configuration or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be overridden. For example, if `qos` is to be overridden:
```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct --qos
@@ -28,11 +26,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct --qos
#### Custom models
You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), and make sure to follow the instructions below:
-* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` (`$MODEL_VARIANT` is optional).
* Your model weights directory should contain HuggingFace format weights.
-* You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
-Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
-should be specified in that config file.
+* You should specify your model configuration by:
+  * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file; all parameters for the model should be specified there.
+  * Using launch command options to specify your model setup.
* For other model launch parameters you can reference the default values for similar models using the [`list` command ](#list-command).
Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -46,7 +44,7 @@ models:
model_family: Qwen2.5
model_variant: 7B-Instruct-1M
model_type: LLM
- num_gpus: 2
+ gpus_per_node: 1
num_nodes: 1
vocab_size: 152064
max_model_len: 1010000
@@ -56,9 +54,6 @@ models:
qos: m2
time: 08:00:00
partition: a40
- data_type: auto
- venv: singularity
- log_dir: default
model_weights_parent_dir: /h//model-weights
```
@@ -68,19 +63,23 @@ You would then set the `VEC_INF_CONFIG` path using:
export VEC_INF_CONFIG=/h//my-model-config.yaml
```
-Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+Note that other parameters not shown in this example, such as `data_type` and `log_dir`, can also be added to the config.
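+
+As an optional sanity check before launching, you can confirm that your custom config parses as expected. The snippet below is not part of `vec-inf`; it assumes PyYAML is installed and that the file uses the top-level `models:` key shown in the example above (the exact nesting of entries may differ in your config).
+
+```python
+import os
+
+import yaml  # PyYAML, assumed to be available in your environment
+
+config_path = os.environ["VEC_INF_CONFIG"]
+with open(config_path) as f:
+    config = yaml.safe_load(f)
+
+# Print what was parsed under the top-level `models:` key so that typos in
+# field names or indentation can be spotted before launching.
+print(config["models"])
+```
+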
### `status` command
You can check the inference server status by providing the Slurm job ID to the `status` command:
```bash
-vec-inf status 13014393
+vec-inf status 15373800
```
-You should see an output like the following:
+If the server is still pending for resources, you should see an output like this:
+
+
+
+When the server is ready, you should see an output like this:
-
+
There are 5 possible states:
@@ -96,20 +95,20 @@ Note that the base URL is only available when model is in `READY` state, and if
Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
```bash
-vec-inf metrics 13014393
+vec-inf metrics 15373800
```
-And you will see the performance metrics streamed to your console, note that the metrics are updated with a 10-second interval.
+And you will see the performance metrics streamed to your console. Note that the metrics are updated at a 2-second interval.
-
+
### `shutdown` command
Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
```bash
-vec-inf shutdown 13014393
+vec-inf shutdown 15373800
-> Shutting down model with Slurm Job ID: 13014393
+> Shutting down model with Slurm Job ID: 15373800
```
(list-command)=
@@ -130,34 +129,43 @@ You can also view the default setup for a specific supported model by providing
vec-inf list Meta-Llama-3.1-70B-Instruct
```
-
+
`launch`, `list`, and `status` command supports `--json-mode`, where the command output would be structured as a JSON string.
## Send inference requests
-Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](https://github.com/VectorInstitute/vector-inference/blob/main/examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
+Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](https://github.com/VectorInstitute/vector-inference/blob/main/examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:
```json
{
- "id": "cmpl-c08d8946224747af9cce9f4d9f36ceb3",
- "object": "text_completion",
- "created": 1725394970,
- "model": "Meta-Llama-3.1-8B-Instruct",
+ "id":"chatcmpl-387c2579231948ffaf66cdda5439d3dc",
"choices": [
{
- "index": 0,
- "text": " is a question that many people may wonder. The answer is, of course, Ottawa. But if",
- "logprobs": null,
- "finish_reason": "length",
- "stop_reason": null
+ "finish_reason":"stop",
+ "index":0,
+ "logprobs":null,
+ "message": {
+ "content":"Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+ "role":"assistant",
+ "function_call":null,
+ "tool_calls":[],
+ "reasoning_content":null
+ },
+ "stop_reason":null
}
],
+ "created":1742496683,
+ "model":"Meta-Llama-3.1-8B-Instruct",
+ "object":"chat.completion",
+ "system_fingerprint":null,
"usage": {
- "prompt_tokens": 8,
- "total_tokens": 28,
- "completion_tokens": 20
- }
+ "completion_tokens":66,
+ "prompt_tokens":32,
+ "total_tokens":98,
+ "prompt_tokens_details":null
+ },
+ "prompt_logprobs":null
}
```
diff --git a/examples/inference/llm/chat_completions.py b/examples/inference/llm/chat_completions.py
index ffe303eb..3fc796d8 100644
--- a/examples/inference/llm/chat_completions.py
+++ b/examples/inference/llm/chat_completions.py
@@ -18,4 +18,4 @@
],
)
-print(completion)
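+# Print the full response as a JSON string.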
+print(completion.model_dump_json())