84 changes: 58 additions & 26 deletions README.md
[![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/develop/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)
![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](vec_inf/cli/_helper.py), [`cli/_config.py`](vec_inf/cli/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.yaml`](vec_inf/config/models.yaml) accordingly.

## Installation
If you are using the Vector cluster environment and you don't need any customization to the inference server environment, run the following to install the package:
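
The exact command is a sketch here, assuming the package is published on PyPI under the same name as the CLI (`vec-inf`):

```bash
pip install vec-inf
```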
Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package.

### `launch` command

The `launch` command allows users to deploy a model as a Slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:

```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct
```
You should see an output like the following:

<img width="600" alt="launch_img" src="https://github.com/user-attachments/assets/ab658552-18b2-47e0-bf70-e539c3b898d5">
<img width="600" alt="launch_img" src="https://github.com/user-attachments/assets/883e6a5b-8016-4837-8fdf-39097dfb18bf">


#### Overrides

Models that are already supported by `vec-inf` are launched using the cached configuration or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be overridden. For example, if `qos` is to be overridden:

```bash
vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>
```

#### Custom models

You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html). Make sure to follow the instructions below:
* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` (`$MODEL_VARIANT` is optional).
* Your model weights directory should contain HuggingFace format weights.
* You should specify your model configuration by either:
  * Creating a custom configuration file for your model and specifying its path via the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file; all the parameters for the model should be specified in that config file.
  * Using launch command options to specify your model setup.
* For other model launch parameters, you can reference the default values for similar models using the [`list` command](#list-command).

Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not among the default supported models. First, create a custom configuration file (for example, `/h/<username>/my-model-config.yaml`):

```yaml
models:
  Qwen2.5-7B-Instruct-1M:
    model_family: Qwen2.5
    model_variant: 7B-Instruct-1M
    model_type: LLM
    gpus_per_node: 1
    num_nodes: 1
    vocab_size: 152064
    max_model_len: 1010000
    qos: m2
    time: 08:00:00
    partition: a40
    model_weights_parent_dir: /h/<username>/model-weights
```

You would then set the `VEC_INF_CONFIG` path using:

```bash
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```

Note that there are other parameters that can also be added to the config but not shown in this example, such as `data_type` and `log_dir`.
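
With the configuration file in place and `VEC_INF_CONFIG` exported, the custom model is launched the same way as a supported model. The command below is a sketch that assumes the model key from the example config above:

```bash
vec-inf launch Qwen2.5-7B-Instruct-1M
```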

### `status` command
You can check the inference server status by providing the Slurm job ID to the `status` command:
```bash
vec-inf status 15373800
```

If the server is pending for resources, you should see an output like this:

<img width="400" alt="status_pending_img" src="https://github.com/user-attachments/assets/b659c302-eae1-4560-b7a9-14eb3a822a2f">

<img width="400" alt="status_img" src="https://github.com/user-attachments/assets/7385b9ca-9159-4ca9-bae2-7e26d80d9747">
When the server is ready, you should see an output like this:

<img width="400" alt="status_ready_img" src="https://github.com/user-attachments/assets/672986c2-736c-41ce-ac7c-1fb585cdcb0d">

There are 5 possible states:

Note that the base URL is only available when the model is in `READY` state.
### `metrics` command
Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
```bash
vec-inf metrics 15373800
```

You will see the performance metrics streamed to your console; note that the metrics are updated at a 2-second interval.

<img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/e5ff2cd5-659b-4c88-8ebc-d8f3fdc023a4">
<img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/3ee143d0-1a71-4944-bbd7-4c3299bf0339">

### `shutdown` command
Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
```bash
vec-inf shutdown 15373800

> Shutting down model with Slurm Job ID: 15373800
```

### `list` command
You can view the full list of available models with the `list` command:

```bash
vec-inf list
```
<img width="940" alt="list_img" src="https://github.com/user-attachments/assets/8cf901c4-404c-4398-a52f-0486f00747a3">

NOTE: The above screenshot does not represent the full list of models supported.

You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
```bash
vec-inf list Meta-Llama-3.1-70B-Instruct
```
<img width="400" alt="list_model_img" src="https://github.com/user-attachments/assets/30e42ab7-dde2-4d20-85f0-187adffefc3d">
<img width="500" alt="list_model_img" src="https://github.com/user-attachments/assets/34e53937-2d86-443e-85f6-34e408653ddb">

The `launch`, `list`, and `status` commands support `--json-mode`, where the command output is structured as a JSON string.
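
For example, a quick sketch of retrieving the server status as a JSON string (the job ID is a placeholder):

```bash
vec-inf status 15373800 --json-mode
```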

## Send inference requests

Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:

```json
{
  "id": "chatcmpl-387c2579231948ffaf66cdda5439d3dc",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
        "role": "assistant",
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "stop_reason": null
    }
  ],
  "created": 1742496683,
  "model": "Meta-Llama-3.1-8B-Instruct",
  "object": "chat.completion",
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 66,
    "prompt_tokens": 32,
    "total_tokens": 98,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
```
**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
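
If you prefer to send a request without the example scripts, the server exposes an OpenAI-compatible API, so a plain `curl` call works as well. The sketch below assumes a chat model and uses a placeholder base URL — substitute the server URL reported by `vec-inf status`:

```bash
# Placeholder base URL; replace with the URL reported by `vec-inf status`
export BASE_URL="http://<server-node>:<port>"

curl "${BASE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'
```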

## SSH tunnel from your local device
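
The details depend on your cluster setup; as a generic sketch with placeholder hostnames and ports, you can forward a local port to the compute node running the server and then point your client at `localhost`:

```bash
# Forward local port 8080 to the inference server's port on the compute node
# (all names and ports below are placeholders)
ssh -L 8080:<compute-node>:<server-port> <username>@<cluster-login-node>
```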
2 changes: 1 addition & 1 deletion docs/source/index.md
user_guide

```

This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_helper.py), [`cli/_config.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_config.py), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm), and model configurations in [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.

## Installation
