# Deploying Llama-3.1-8B-Instruct using vLLM

vLLM is an open-source library designed to deliver high throughput and low latency for large language model (LLM) inference. It optimizes text generation workloads by efficiently batching requests and making full use of GPU resources, empowering developers to manage complex tasks like code generation and large-scale conversational AI.

This tutorial guides you through setting up and running vLLM on AMD Instinct™ GPUs using the ROCm software stack. Learn how to configure your environment, containerize your workflow, and send test queries to the vLLM-supported inference server.

## Deploying the LLM using vLLM

Deploy the LLM (`meta-llama/Meta-Llama-3.1-8B-Instruct`) with vLLM from the Jupyter notebook as follows:

### Start the vLLM server 

Open a new tab in this Jupyter server, click the terminal icon to open a new terminal, then copy and run the following command to launch the vLLM server:

```bash
HIP_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
        --gpu-memory-utilization 0.9 \
        --swap-space 16 \
        --disable-log-requests \
        --dtype float16 \
        --max-model-len 131072 \
        --tensor-parallel-size 1 \
        --host 0.0.0.0 \
        --port 3000 \
        --num-scheduler-steps 10 \
        --max-num-seqs 128 \
        --max-num-batched-tokens 131072 \
        --distributed-executor-backend "mp"
```

Once the server has started successfully, the terminal displays `INFO:     Application startup complete.`.

**Note**: In a multi-GPU environment, set `HIP_VISIBLE_DEVICES=x`, where `x` is the index of your preferred GPU, to deploy the LLM on that GPU.
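Loading the model can take a while, so it can be convenient to wait for the server programmatically instead of watching the terminal. The following is a minimal sketch, assuming the server exposes a `/health` endpoint that returns HTTP 200 once it is ready and that it listens on port 3000 as in the command above. The helper name `wait_for_server` and its default timeout values are illustrative, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires.

    Assumption: the server answers /health with status 200 when it is ready.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # The server is not accepting connections yet, or it answered
            # with an error status; retry after a short pause.
            pass
        time.sleep(interval_s)
    return False
```

For example, `wait_for_server("http://localhost:3000")` returns `True` as soon as the health check succeeds and `False` if the timeout expires first.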

### Start the client

Once the server is running, run the following code to start the client:

```python
import requests

url = "http://localhost:3000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are an expert in the field of AI. Make sure to provide an explanation in few sentences."
        },
        {
            "role": "user",
            "content": "Explain the concept of AI."
        }
    ],
    "stream": False,
    "max_tokens": 128
}

response = requests.post(url, headers=headers, json=data)
print(response.json())
```


```
{'id': 'chatcmpl-70f5e43b166347388234b0226355b030', 'object': 'chat.completion', 'created': 1749483992, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'reasoning_content': None, 'content': 'Artificial Intelligence (AI) is a field of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as learning, problem-solving, decision-making, and perception. These machines are designed to simulate human-like intelligence, enabling them to interact with their environment, understand and interpret data, and make decisions based on that information. AI systems use algorithms, data, and computational power to analyze and process information, allowing them to learn from experience and improve their performance over time.', 'tool_calls': []}, 'logprobs': None, 'finish_reason': 'stop', 'stop_reason': None}], 'usage': {'prompt_tokens': 62, 'total_tokens': 164, 'comp
```
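The printed response is a nested dictionary, and the generated text sits under `choices[0]["message"]["content"]`. The sketch below shows one way to pull out just the reply. The helper name `extract_reply` and the abridged `sample` dictionary are illustrative assumptions modeled on the response shape shown above. In practice you would pass `response.json()` instead.

```python
def extract_reply(payload: dict) -> str:
    """Return the assistant's text from a chat-completion response dictionary."""
    choices = payload.get("choices") or []
    if not choices:
        # Error responses typically carry no "choices" list.
        raise ValueError(f"no choices in response: {payload}")
    return choices[0]["message"]["content"]


# Abridged, hypothetical payload with the same shape as the output above.
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Artificial Intelligence (AI) is ..."},
            "finish_reason": "stop",
        }
    ],
}

print(extract_reply(sample))
```

A `finish_reason` of `stop` indicates that the model ended its answer on its own, whereas `length` indicates that generation was cut off at the `max_tokens` limit.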

**Note**: The port in the client URL, for instance http://localhost:**3000**, must match the `--port` value passed to `vllm serve` (**3000**). If that port is already in use by another application, choose a different number and change it in both places.
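To keep the two values in sync, you can derive the client URL from a single port constant and check beforehand whether something is already listening on that port. This is a minimal sketch using only the Python standard library. The names `PORT`, `is_port_free`, and `chat_url` are illustrative assumptions.

```python
import socket

PORT = 3000  # must equal the value passed to `vllm serve --port`


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if no process is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, which means
        # some application is already listening on that port.
        return sock.connect_ex((host, port)) != 0


def chat_url(port: int, host: str = "localhost") -> str:
    """Build the chat-completions URL for the given host and port."""
    return f"http://{host}:{port}/v1/chat/completions"
```

Run `is_port_free(PORT)` before launching the server. If it returns `False`, pick another port and use `chat_url` with that number as the client `url`.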

If the connection is successful, the output will be:

``` bash
{"id":"chat-xx","object":"chat.completion","created":1736494622,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) is a field of computer science ...}
```
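The client above sets `"stream": False`, so the full answer arrives in one response. If you set `"stream": True` instead, an OpenAI-compatible server sends the answer incrementally as server-sent events: lines of the form `data: {...}`, ending with `data: [DONE]`. The following sketch parses such lines into text fragments. The helper name `iter_stream_text` is an assumption for illustration, and the usage shown in the comments assumes the `url`, `headers`, and `data` variables from the client code.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(body)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Possible usage with the client variables defined earlier:
# with requests.post(url, headers=headers, json={**data, "stream": True}, stream=True) as r:
#     for piece in iter_stream_text(r.iter_lines()):
#         print(piece, end="", flush=True)
```

Streaming does not change the total generation time, but it lets an interactive client display the first tokens as soon as they are produced.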