# Deploying with vLLM

For simplicity, this demo deploys on a single AWS EC2 instance. Beforehand, make sure your account has a large enough vCPU quota for G and VT instances:
- [Service Quotas](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas): Running On-Demand G and VT instances

In the EC2 dashboard, launch an instance (e.g. `g4dn.xlarge`) from an AMI that ships with NVIDIA drivers (e.g. `Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Amazon Linux 2023)`). Select or create a key pair when launching the instance. You can then ssh into the machine:
```bash
ssh -i vllm-keypair.pem ec2-user@ec2-54-153-9-44.us-west-1.compute.amazonaws.com
```

If you selected the `Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Amazon Linux 2023)` AMI, you can activate the virtual environment:
```bash
source /opt/pytorch/bin/activate
```
And you can then install vLLM:
```bash
pip install vllm
```
Some Hugging Face models require a read access token:
```bash
export HF_TOKEN=[YOUR TOKEN]
```

Now, you can serve a model:
```bash
vllm serve meta-llama/Llama-3.2-1B-Instruct --gpu-memory-utilization 0.9 --max-model-len 200
```

## The OpenAI Client

vLLM provides an HTTP server that implements OpenAI's Completions and Chat Completions APIs, so on the client side we only need to install the `openai` package:
```bash
pip install openai
```
Make sure that port 8000 (the port vLLM listens on by default) is open for inbound traffic on the EC2 instance, for example in its security group. We can now instantiate a client:


In [None]:
from openai import OpenAI

client = OpenAI(
    base_url="http://ec2-54-153-9-44.us-west-1.compute.amazonaws.com:8000/v1",
    api_key="none",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "user", "content": "How are you?"}
    ],
)

print(response.choices[0].message)
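
Before sending chat requests, it can help to confirm that the server is reachable and to see which model name it expects. The sketch below assumes the server exposes the OpenAI-style "list models" route at `/v1/models`; `served_model_ids` and `fake_fetch` are hypothetical helper names, and the canned response only stands in for a real server so the example runs offline.

```python
import json
from urllib.request import urlopen


def served_model_ids(base_url, fetch=None):
    """Return the model IDs advertised by an OpenAI-compatible server."""
    if fetch is None:
        # Default transport: a plain HTTP GET using only the standard library.
        def fetch(url):
            with urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8")

    payload = json.loads(fetch(base_url.rstrip("/") + "/models"))
    # The "list models" response keeps its entries under the "data" key.
    return [entry["id"] for entry in payload.get("data", [])]


def fake_fetch(url):
    """Canned response (an assumption) used instead of a live server."""
    return json.dumps(
        {"object": "list", "data": [{"id": "meta-llama/Llama-3.2-1B-Instruct"}]}
    )


demo_ids = served_model_ids("http://localhost:8000/v1", fetch=fake_fetch)
print(demo_ids)
```

Against a live instance, calling `served_model_ids` with the same `base_url` as the client and no `fetch` argument should return the name passed to `vllm serve`.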

We can pass arguments such as the temperature, the number of output sequences, penalties, or beam search.
- Some of the chat completion API parameters are supported: https://platform.openai.com/docs/api-reference/chat/create
- And additional parameters are also supported: https://docs.vllm.ai/en/v0.4.0.post1/serving/openai_compatible_server.html

In [None]:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "user", "content": "Which one is greater, 9.11 or 9.8?"},
    ],
    extra_body={'use_beam_search': True},
    temperature=0.7
)

print(response.choices[0].message.content)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "user", "content": "Which one is greater, 9.11 or 9.8?"},
    ],
    n=1,
    frequency_penalty=2,
)

print(response.choices[0].message)
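
The difference between standard arguments and `extra_body` can be pictured as follows: standard arguments such as `temperature` become top-level fields of the JSON request, and the keys of `extra_body` are merged in next to them. This is only an approximation of what the `openai` client does, not its actual code, and `build_chat_payload` is a hypothetical name used for illustration.

```python
import json


def build_chat_payload(model, messages, extra_body=None, **standard_params):
    """Approximate the JSON body sent for a chat completion request."""
    payload = {"model": model, "messages": messages}
    # Standard OpenAI parameters (temperature, n, ...) are top-level fields.
    payload.update(standard_params)
    # Server-specific parameters from extra_body end up at the same level.
    payload.update(extra_body or {})
    return payload


payload = build_chat_payload(
    "meta-llama/Llama-3.2-1B-Instruct",
    [{"role": "user", "content": "Which one is greater, 9.11 or 9.8?"}],
    extra_body={"use_beam_search": True},
    temperature=0.7,
)
print(json.dumps(payload, indent=2))
```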

You can also enable streaming output:

In [None]:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "user", "content": "2 + 2=?. Give me your reasoning!"}
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
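
If the full text is needed after streaming, the deltas can be accumulated instead of only printed. The sketch below uses `SimpleNamespace` objects as stand-ins for real chunks so that it runs without a server; `collect_stream` and `fake_chunk` are hypothetical names. It also skips chunks with an empty `choices` list, on the assumption that some servers may send such chunks (for example to report usage).

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # nothing to read in this chunk
        piece = chunk.choices[0].delta.content
        if piece:  # skips None and empty strings
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(None),
    fake_chunk("2 + 2 "),
    fake_chunk("= 4"),
    SimpleNamespace(choices=[]),
]
full_text = collect_stream(fake_stream)
print(full_text)
```

With a live server, the `response` object returned by `create(..., stream=True)` can be passed to `collect_stream` directly.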

## Tool Calling

To use tools, restart the server with automatic tool choice enabled and a tool-call parser that matches the model:
```bash
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --guided-decoding-backend outlines \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4000 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json
```

For example, let's use Tavily as a tool:

In [None]:
from tavily import TavilyClient
from pydantic import BaseModel, Field
import json


tavily = TavilyClient(api_key='[YOUR TAVILY KEY]')


# We define the tool as a function
def run_tavily(query: str, max_results: int = 5):
    """
    Executes a web search and returns a JSON string the model can read.
    """
    results = tavily.search(
        query=query,
        max_results=max_results,
        include_answer=False,             # raw results are usually best for an LLM
    )
    # The open-source models like compact JSON without Python objects
    return json.dumps(results)


# We can use Pydantic to correctly define the arguments to the tools
class SearchFormat(BaseModel):
    query: str = Field(
        ..., description="The search query (what the user wants to know)."
    )
    max_results: int = Field(
        ..., description="How many URLs to return (1-10)."
    )

SearchFormat.model_json_schema()['properties']


tools = [
    {
        "type": "function",
        "function": {
            "name": "tavily_search",
            "description": "Search the web with Tavily and return the top results.",
            "parameters": {
                "type": "object",
                "properties": SearchFormat.model_json_schema()['properties'],
                "required": ["query"]
            },
        },
    }
]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the 2024 Turing Award and why?"},
]

while True:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=messages,
        tools=tools,
        tool_choice="auto",               # let the model decide if it needs search
    )

    msg = response.choices[0].message
    tool_calls = msg.tool_calls

    # If the model wants to use Tavily, satisfy the request and continue
    if tool_calls:
        messages.append(
            {
                "role": "assistant",
                "tool_calls": tool_calls,
            }
        )
        for call in tool_calls:
            if call.function.name == "tavily_search":
                args = json.loads(call.function.arguments)
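                # args is a dict, so pass its fields rather than the dict itself
                tool_output = run_tavily(args["query"], args.get("max_results", 5))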

                messages.append(
                    {
                        "role": "tool",
                        "name": call.function.name,
                        "tool_call_id": call.id,
                        "content": tool_output,
                    }
                )
                break
        continue               # call the LLM again with new context

    # otherwise, the model is done – print its answer
    print(msg.content)
    break
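
When several tools are available, the `if call.function.name == ...` test in the loop above can be replaced by a lookup table. The sketch below is one possible arrangement, not the only one: `TOOL_REGISTRY`, `execute_tool_call`, and `fake_search` are hypothetical names, and `fake_search` stands in for `run_tavily` so the example runs offline. Errors are returned to the model as JSON instead of raising, so that the conversation can continue.

```python
import json
from types import SimpleNamespace


def fake_search(query, max_results=5):
    """Hypothetical stand-in for run_tavily that needs no network."""
    return json.dumps({"query": query, "results": [], "max_results": max_results})


TOOL_REGISTRY = {"tavily_search": fake_search}


def execute_tool_call(call, registry=None):
    """Run one tool call and return the message to append to the conversation."""
    registry = TOOL_REGISTRY if registry is None else registry
    func = registry.get(call.function.name)
    if func is None:
        content = json.dumps({"error": f"unknown tool: {call.function.name}"})
    else:
        try:
            # The model sends its arguments as a JSON-encoded string.
            args = json.loads(call.function.arguments)
            content = func(**args)
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or unexpected argument names.
            content = json.dumps({"error": str(exc)})
    return {"role": "tool", "tool_call_id": call.id, "content": content}


demo_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(
        name="tavily_search", arguments='{"query": "Turing Award 2024"}'
    ),
)
tool_message = execute_tool_call(demo_call)
print(tool_message)
```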

## Structured Output

We can impose a specific output format:

In [None]:
from pydantic import BaseModel, Field


class Reasoning(BaseModel):
    steps: list[str] = Field(..., description="The reasoning steps to answer the question")
    answer: str
    # Pydantic spells the bounds `ge` / `le` (greater/less than or equal)
    confidence: float = Field(..., ge=0, le=1)

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "reasoning",
        "schema": Reasoning.model_json_schema(),
        # 'strict': True
    },
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "user", "content": "what is 235 + 526"},
    ],
    response_format=response_format
)

print(response.choices[0])


Or by using the `parse` API, which accepts the Pydantic model directly and returns the validated object in `message.parsed`:

In [None]:
response = client.chat.completions.parse(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
         {"role": "user", "content": "what is 235 + 526"},
    ],
    response_format=Reasoning
)

print(response.choices[0])
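
With the `response_format` dictionary variant, the structured result arrives as a JSON string in `message.content` and still has to be decoded. The following is a minimal standard-library sketch of that step; `ReasoningResult` and `parse_reasoning` are hypothetical names, and `sample_content` is a made-up string in the expected shape, not real model output. With Pydantic available, `Reasoning.model_validate_json(content)` plays the same role.

```python
import json
from dataclasses import dataclass


@dataclass
class ReasoningResult:
    steps: list
    answer: str
    confidence: float


def parse_reasoning(content):
    """Decode a JSON string in the Reasoning format and check its fields."""
    data = json.loads(content)
    result = ReasoningResult(
        steps=[str(step) for step in data["steps"]],
        answer=str(data["answer"]),
        confidence=float(data["confidence"]),
    )
    # Mirror the 0..1 bound declared on the schema.
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {result.confidence}")
    return result


# Made-up example in the expected shape (an assumption, not model output).
sample_content = '{"steps": ["235 + 526 = 761"], "answer": "761", "confidence": 0.9}'
parsed = parse_reasoning(sample_content)
print(parsed.answer, parsed.confidence)
```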

## Reasoning Models

With reasoning models, we can isolate the reasoning:
```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1 --gpu-memory-utilization 0.9 --max-model-len 4000
```


In [None]:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[
        {"role": "user", "content": "Which one is greater, 9.11 or 9.8?"},
    ],
)

print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.content)
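
Conceptually, a reasoning parser separates the model's thinking from its final answer. DeepSeek-R1 style models wrap the thinking in `<think>...</think>` tags, so the idea can be sketched as a simple string split. This is a simplified illustration and not vLLM's actual parser; `split_reasoning` is a hypothetical helper and `raw` is a made-up output string.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split raw model output into (reasoning, answer)."""
    before, sep, after = text.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return None, text.strip()
    # Drop the opening tag if the model emitted it.
    reasoning = before.replace(open_tag, "", 1).strip()
    return reasoning, after.strip()


# Made-up output string used only for illustration.
raw = "<think>9.8 is 9.80, and 9.80 > 9.11.</think>9.8 is greater."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```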

## Multimodal Models

vLLM supports multimodal models:
```bash
vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}' --gpu-memory-utilization 0.9
```

In [None]:
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": image_url},
            },
        ],
    }
]

response = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=messages,
)

print(response.choices[0].message.content)

Or from a local image:

In [None]:
from IPython.display import Image
import base64

img = Image(filename="Transformers-One.png")
data_uri = f"data:image/png;base64,{base64.b64encode(img.data).decode('utf-8')}"

response = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }],
    model="microsoft/Phi-3.5-vision-instruct",
    # max_completion_tokens=64,
)

print(response.choices[0].message.content)
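
Outside a notebook, the same data URI can be built without IPython. The sketch below relies only on the standard library and infers the MIME type from the file name; `to_data_uri` and `image_message` are hypothetical helpers, and the eight bytes used in the demo are just the PNG signature, not a complete image.

```python
import base64
import mimetypes


def to_data_uri(image_bytes, filename):
    """Encode raw image bytes as a data URI, inferring the type from the name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(prompt, data_uri):
    """Build a user message that pairs a text prompt with one image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }


# Demo with the PNG signature bytes only (not a valid, complete image).
demo_uri = to_data_uri(b"\x89PNG\r\n\x1a\n", "Transformers-One.png")
demo_message = image_message("What's in this image?", demo_uri)
print(demo_uri)
```

For a real file, the bytes would typically come from `open(path, "rb").read()`, and `demo_message` would be passed in the `messages` list as in the cell above.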