This notebook demonstrates how to interact with a local vLLM OpenAI-compatible server for chat and completion tasks, including advanced features such as tool/function calling. It covers sending chat prompts, handling tool calls for weather information, and integrating external APIs (like wttr.in) to provide dynamic responses. The workflow showcases both direct HTTP requests and usage of the OpenAI Python client with custom endpoints.

The notebook also includes examples with several models, such as Mistral-7B-Instruct and Llama-3.1-8B-Instruct. To use Llama-3.1-8B-Instruct, start the vLLM server with that model and a matching chat template, then send chat completions and tool calls through the same OpenAI-compatible API.

In [1]:
import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = "token"  # Sent as "Bearer <key>" in the header; vLLM does not enforce auth by default

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

data = {
    "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    "messages": [
        {"role": "user", "content": "Who is the best French painter? Answer in one short sentence."}
    ],
    "max_tokens": 100,
    "temperature": 0.7
}

response = requests.post(VLLM_URL, headers=headers, json=data)

if response.status_code == 200:
    reply = response.json()["choices"][0]["message"]["content"]
    print("Assistant:", reply)
else:
    print("Error:", response.status_code, response.text)


Error: 404 {"object":"error","message":"The model `TheBloke/Mistral-7B-Instruct-v0.2-AWQ` does not exist.","type":"NotFoundError","param":null,"code":404}
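The 404 above indicates that the hard-coded `model` value does not match any model the server is currently serving. One way to avoid this, sketched below under assumptions, is to ask the server's `/v1/models` endpoint which ids it serves (the same information `client.models.list()` returns in later cells) and fall back to the first one. The helper names `pick_model_id` and `fetch_served_model`, the `MODELS_URL` constant, and the use of `urllib` instead of `requests` are illustrative choices, not part of the original notebook.

```python
import json
from typing import Optional
from urllib.request import urlopen

# Assumption: the model list lives on the same host and port as VLLM_URL.
MODELS_URL = "http://localhost:8000/v1/models"


def pick_model_id(models_payload: dict, preferred: Optional[str] = None) -> str:
    """Return `preferred` if the server lists it, else the first served model id."""
    served = [entry["id"] for entry in models_payload.get("data", [])]
    if not served:
        raise RuntimeError("The server reports no models.")
    return preferred if preferred in served else served[0]


def fetch_served_model(preferred: Optional[str] = None) -> str:
    """Query a running server for its model list (requires a live server)."""
    with urlopen(MODELS_URL, timeout=10) as resp:
        return pick_model_id(json.load(resp), preferred)
```

With a server running, `data["model"] = fetch_served_model("TheBloke/Mistral-7B-Instruct-v0.2-AWQ")` would use the preferred model when it is available and otherwise whatever the server actually serves.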


In [2]:
# SPDX-License-Identifier: Apache-2.0

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Completion API
stream = False
completion = client.completions.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=False,
    n=2,
    stream=stream,
    logprobs=3)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

Completion results:
Completion(id='cmpl-3ac37427f2d5489fa63f7a9218dafffa', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=Logprobs(text_offset=[0, 3, 4, 12, 15, 21, 22, 28, 30, 36, 42, 45, 50, 53, 58, 60], token_logprobs=[-0.3707573413848877, -0.07804243266582489, -0.0024917051196098328, -0.0005330810672603548, -0.0017513189231976867, -0.0010155049385502934, -0.00037079135654494166, -0.0016156489728018641, -7.807903602952138e-05, -0.0005930095794610679, -0.004116039723157883, -0.0003779412363655865, -5.0424259825376794e-05, -0.0006279165390878916, -4.012005805969238, -10.061521530151367], tokens=['▁or', ',', '▁through', '▁in', 'action', ',', '▁allow', '▁a', '▁human', '▁being', '▁to', '▁come', '▁to', '▁harm', '▁(', 'Zero'], top_logprobs=[{'▁or': -0.3707573413848877, ',': -1.3082573413848877, '.': -3.9645073413848877}, {',': -0.07804243266582489, '▁through': -2.640542507171631, '▁allow': -5.757730007171631}, {'▁through': -0.0024917051196098328, '<0x0A>': -7.354053974
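Because the request sets `logprobs=3`, each choice carries per-token log-probabilities. The sketch below converts them to probabilities and flags the tokens the model was least sure about. The token and log-probability lists are copied from the first choice in the output above; the helper name `low_confidence_tokens` and the 0.5 threshold are arbitrary, illustrative choices.

```python
import math

# Tokens and log-probabilities copied from the first choice printed above.
tokens = ['▁or', ',', '▁through', '▁in', 'action', ',', '▁allow', '▁a',
          '▁human', '▁being', '▁to', '▁come', '▁to', '▁harm', '▁(', 'Zero']
token_logprobs = [
    -0.3707573413848877, -0.07804243266582489, -0.0024917051196098328,
    -0.0005330810672603548, -0.0017513189231976867, -0.0010155049385502934,
    -0.00037079135654494166, -0.0016156489728018641, -7.807903602952138e-05,
    -0.0005930095794610679, -0.004116039723157883, -0.0003779412363655865,
    -5.0424259825376794e-05, -0.0006279165390878916, -4.012005805969238,
    -10.061521530151367,
]


def low_confidence_tokens(tokens, logprobs, threshold=0.5):
    """Return (token, probability) pairs whose probability is below `threshold`."""
    pairs = ((tok, math.exp(lp)) for tok, lp in zip(tokens, logprobs))
    return [(tok, prob) for tok, prob in pairs if prob < threshold]


flagged = low_confidence_tokens(tokens, token_logprobs)
for tok, prob in flagged:
    print(f"{tok!r}: p={prob:.5f}")
```

On a live response the same lists are available as `completion.choices[0].logprobs.tokens` and `completion.choices[0].logprobs.token_logprobs`.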

In [None]:
# SPDX-License-Identifier: Apache-2.0
"""
An example showing how to generate chat completions from reasoning models
like DeepSeek-R1.

To run this example, you need to start the vLLM server with the reasoning 
parser:

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
     --enable-reasoning --reasoning-parser deepseek_r1
```

This example demonstrates how to generate chat completions from reasoning models
using the OpenAI Python client library.
"""

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content for Round 1:", reasoning_content)
print("content for Round 1:", content)

# Round 2
messages.append({"role": "assistant", "content": content})
messages.append({
    "role": "user",
    "content": "How many Rs are there in the word 'strawberry'?",
})
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content for Round 2:", reasoning_content)
print("content for Round 2:", content)

reasoning_content for Round 1: None
content for Round 1:  The number 9.11 is greater than the number 9.8. However, it's important to note that 9.11 does not represent a common mathematical expression or operation. Usually, when comparing numbers, we don't have a decimal point in the same place for both numbers. In this case, since 9.11 has a higher number before the decimal point, it is a greater number than 9.8.
reasoning_content for Round 2: None
content for Round 2:  There are no Rs in the word "strawberry." The letter R appears neither in the word "strawberry" nor in its plural form "strawberries."
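Both rounds above print `None` for `reasoning_content`. One likely explanation is that the server was not started with a reasoning parser, or was serving a non-reasoning model, since `models.data[0].id` simply picks whatever is being served. The sketch below reads the field defensively so that the same code works in either case. `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for `response.choices[0].message`, not real server output.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when the server sent none."""
    reasoning = getattr(message, "reasoning_content", None)
    answer = (message.content or "").strip()
    return reasoning, answer


# Stand-ins with the same attribute names as a chat completion message.
plain = SimpleNamespace(content=" 9.8 is greater.")
reasoned = SimpleNamespace(content="9.8", reasoning_content="9.80 > 9.11")

for msg in (plain, reasoned):
    reasoning, answer = split_reasoning(msg)
    if reasoning is None:
        print("No reasoning returned; check the served model and parser flags.")
    print("Answer:", answer)
```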


In [7]:
# SPDX-License-Identifier: Apache-2.0
"""
Set up this example by starting a vLLM OpenAI-compatible server with tool call
options enabled.

IMPORTANT: for Mistral, you must use one of the provided Mistral tool call
templates, or your own; the model default doesn't work for tool calls with vLLM.
See the vLLM docs on OpenAI server & tool calling for more details.

For example:

vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
            --chat-template examples/tool_chat_template_mistral.jinja \
            --enable-auto-tool-choice --tool-call-parser mistral

OR

vllm serve NousResearch/Hermes-2-Pro-Llama-3-8B \
            --chat-template examples/tool_chat_template_hermes.jinja \
            --enable-auto-tool-choice --tool-call-parser hermes
"""
import json

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type":
                    "string",
                    "description":
                    "The city to find the weather for, e.g. 'San Francisco'"
                },
                "state": {
                    "type":
                    "string",
                    "description":
                    "the two-letter abbreviation for the state that the city is"
                    " in, e.g. 'CA' which would mean 'California'"
                },
                "unit": {
                    "type": "string",
                    "description": "The unit to fetch the temperature in",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["city", "state", "unit"]
        }
    }
}]

messages = [{
    "role": "user",
    "content": "Hi! How are you doing today?"
}, {
    "role": "assistant",
    "content": "I'm doing well! How can I help you?"
}, {
    "role":
    "user",
    "content":
    "Can you tell me what the temperature will be in San Jose in degrees?"
}]

chat_completion = client.chat.completions.create(messages=messages,
                                                 model=model,
                                                 tools=tools)

print("Chat completion results:")
print(chat_completion)
print("\n\n")

tool_calls_stream = client.chat.completions.create(messages=messages,
                                                   model=model,
                                                   tools=tools,
                                                   stream=True,
                                                   tool_choice="auto")

chunks = []
for chunk in tool_calls_stream:
    chunks.append(chunk)
    if chunk.choices[0].delta.tool_calls:
        print(chunk.choices[0].delta.tool_calls[0])
    else:
        print(chunk.choices[0].delta)

arguments = []
tool_call_idx = -1
for chunk in chunks:

    if chunk.choices[0].delta.tool_calls:
        tool_call = chunk.choices[0].delta.tool_calls[0]

        if tool_call.index != tool_call_idx:
            if tool_call_idx >= 0:
                print(
                    f"streamed tool call arguments: {arguments[tool_call_idx]}"
                )
            tool_call_idx = chunk.choices[0].delta.tool_calls[0].index
            arguments.append("")
        if tool_call.id:
            print(f"streamed tool call id: {tool_call.id} ")

        if tool_call.function:
            if tool_call.function.name:
                print(f"streamed tool call name: {tool_call.function.name}")

            if tool_call.function.arguments:
                arguments[tool_call_idx] += tool_call.function.arguments

if len(arguments):
    print(f"streamed tool call arguments: {arguments[-1]}")

print("\n\n")

messages.append({
    "role": "assistant",
    "tool_calls": chat_completion.choices[0].message.tool_calls
})



import requests

def get_current_weather(city: str, state: str, unit: str = "fahrenheit") -> str:
    try:
        location = f"{city},{state}"
        # wttr.in JSON API
        url = f"https://wttr.in/{location}"
        params = {
            "format": "j1"  # JSON output
        }
        headers = {
            "User-Agent": "curl"  # ensures plain-text/JSON output
        }
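        # Time out rather than hang, and surface HTTP errors before parsing JSON.
        resp = requests.get(url, params=params, headers=headers, timeout=10)
        resp.raise_for_status()
        data = resp.json()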

        current = data["current_condition"][0]
        temp_c = float(current["temp_C"])
        temp_f = float(current["temp_F"])
        condition = current["weatherDesc"][0]["value"]

        if unit.lower().startswith("c"):
            return f"The weather in {city}, {state} is {temp_c:.1f}°C with {condition.lower()}."
        else:
            return f"The weather in {city}, {state} is {temp_f:.1f}°F with {condition.lower()}."

    except Exception as e:
        return f"Failed to fetch weather via wttr.in: {e}"


available_tools = {"get_current_weather": get_current_weather}

# tool_calls is None when the model answers directly, so fall back to an empty list.
completion_tool_calls = chat_completion.choices[0].message.tool_calls or []
for call in completion_tool_calls:
    tool_to_call = available_tools[call.function.name]
    args = json.loads(call.function.arguments)
    result = tool_to_call(**args)
    print(result)
    messages.append({
        "role": "tool",
        "content": result,
        "tool_call_id": call.id,
        "name": call.function.name
    })

chat_completion_2 = client.chat.completions.create(messages=messages,
                                                   model=model,
                                                   tools=tools,
                                                   stream=False)
print("\n\n")
print(chat_completion_2)

Chat completion results:
ChatCompletion(id='chatcmpl-37031a4e0a6e4d7bb8b0264b25e8752e', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-5f85a5be66904099b5b7ec2a7f074c6d', function=Function(arguments='{"city": "San Jose", "state": "CA", "unit": "celsius"}', name='get_current_weather'), type='function')], reasoning_content=None), stop_reason=128008)], created=1749624120, model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=32, prompt_tokens=399, total_tokens=431, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, kv_transfer_params=None)



ChoiceDelta(content='', function_call=None, refusal=None, role='assistant', tool_calls=None)
ChoiceD
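The streaming loop in the cell above reads only the first tool-call delta of each chunk and tracks the current index by hand. A more compact variant, sketched below, merges every delta into a dictionary keyed by `index`, so the accumulated `arguments` string can be parsed once the stream ends. `collect_tool_calls` and `fake_chunk` are hypothetical helpers; the fake chunks only mimic the attribute layout of streamed chunks (`choices[0].delta.tool_calls`) and are not real server output.

```python
import json
from types import SimpleNamespace as NS


def collect_tool_calls(chunks):
    """Merge streamed tool-call deltas into {index: {"id", "name", "arguments"}}."""
    calls = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        for delta in chunk.choices[0].delta.tool_calls or []:
            call = calls.setdefault(
                delta.index, {"id": None, "name": None, "arguments": ""})
            if delta.id:
                call["id"] = delta.id
            if delta.function and delta.function.name:
                call["name"] = delta.function.name
            if delta.function and delta.function.arguments:
                # Argument JSON arrives in fragments; concatenate in order.
                call["arguments"] += delta.function.arguments
    return calls


def fake_chunk(index, call_id=None, name=None, arguments=None):
    """Build a stand-in object shaped like one streamed chunk."""
    fn = NS(name=name, arguments=arguments)
    delta = NS(tool_calls=[NS(index=index, id=call_id, function=fn)])
    return NS(choices=[NS(delta=delta)])


chunks = [
    NS(choices=[NS(delta=NS(tool_calls=None))]),  # role-only first chunk
    fake_chunk(0, call_id="call-0", name="get_current_weather"),
    fake_chunk(0, arguments='{"city": "San Jose", '),
    fake_chunk(0, arguments='"state": "CA", "unit": "celsius"}'),
]
calls = collect_tool_calls(chunks)
print(calls[0]["name"], json.loads(calls[0]["arguments"]))
```

With a live server, `collect_tool_calls(tool_calls_stream)` could consume the stream directly instead of the saved `chunks` list.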

In [6]:
import json

import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def get_weather(city: str, state: str, unit: str = "fahrenheit") -> str:
    try:
        location = f"{city},{state}"
        # wttr.in JSON API
        url = f"https://wttr.in/{location}"
        params = {
            "format": "j1"  # JSON output
        }
        headers = {
            "User-Agent": "curl"  # ensures plain-text/JSON output
        }
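        # Time out rather than hang, and surface HTTP errors before parsing JSON.
        resp = requests.get(url, params=params, headers=headers, timeout=10)
        resp.raise_for_status()
        data = resp.json()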

        current = data["current_condition"][0]
        temp_c = float(current["temp_C"])
        temp_f = float(current["temp_F"])
        condition = current["weatherDesc"][0]["value"]

        if unit.lower().startswith("c"):
            return f"The weather in {city}, {state} is {temp_c:.1f}°C with {condition.lower()}."
        else:
            return f"The weather in {city}, {state} is {temp_f:.1f}°F with {condition.lower()}."

    except Exception as e:
        return f"Failed to fetch weather via wttr.in: {e}"


# Keys must match the function name declared in the tool schema below.
tool_functions = {"get_current_weather": get_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type":
                    "string",
                    "description":
                    "The city to find the weather for, e.g. 'San Francisco'"
                },
                "state": {
                    "type":
                    "string",
                    "description":
                    "the two-letter abbreviation for the state that the city is"
                    " in, e.g. 'CA' which would mean 'California'"
                },
                "unit": {
                    "type": "string",
                    "description": "The unit to fetch the temperature in",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["city", "state", "unit"]
        }
    }
}]

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto"
)

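tool_calls = response.choices[0].message.tool_calls
if not tool_calls:
    # The model answered directly instead of requesting a tool.
    print("No tool call returned:", response.choices[0].message.content)
else:
    tool_call = tool_calls[0].function
    print(f"Function called: {tool_call.name}")
    print(f"Arguments: {tool_call.arguments}")
    print(f"Result: {get_weather(**json.loads(tool_call.arguments))}")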

Function called: get_current_weather
Arguments: {"city": "San Francisco", "state": "CA", "unit": "fahrenheit"}
Result: The weather in San Francisco, CA is 56.0°F with partly cloudy.
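The cell above calls `get_weather` directly, whatever function name the model asked for. When more than one tool is registered, dispatching through a registry keyed by the schema name is safer. The sketch below does that and wraps the result in the same `tool` message shape used in the earlier cell. `run_tool_call` and `fake_weather` are hypothetical names; `fake_weather` is an offline stand-in for `get_weather` so the example runs without network access.

```python
import json
from types import SimpleNamespace


def run_tool_call(call, registry):
    """Execute one tool call and return the `tool` message to append to the chat."""
    name = call.function.name
    if name not in registry:
        result = f"Unknown tool: {name}"
    else:
        try:
            result = registry[name](**json.loads(call.function.arguments))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that do not fit the function signature.
            result = f"Bad arguments for {name}: {exc}"
    return {"role": "tool", "tool_call_id": call.id, "name": name, "content": result}


def fake_weather(city, state, unit="fahrenheit"):
    """Offline stand-in for get_weather."""
    return f"Weather for {city}, {state} in {unit}"


# Stand-in shaped like an entry of `response.choices[0].message.tool_calls`.
call = SimpleNamespace(
    id="call-0",
    function=SimpleNamespace(
        name="get_current_weather",
        arguments='{"city": "San Francisco", "state": "CA", "unit": "fahrenheit"}',
    ),
)
message = run_tool_call(call, {"get_current_weather": fake_weather})
print(message["content"])
```

Returning an error string instead of raising lets the follow-up chat completion see what went wrong, which gives the model a chance to retry with corrected arguments.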
