This tutorial can be downloaded as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/blob/wallaroo2025.2_tutorials/wallaroo-llms/llm-deploy/llm-openapi-endpoint-api-spec).

## Inference Endpoint API Spec for OpenAI Compatibility Enabled Models Tutorial

Models deployed to Wallaroo are available for inference requests either via the Wallaroo SDK or through API calls through the **inference endpoint**.  The **inference endpoint API spec retrieval command** creates an `yaml` file in OpenAPI format that details the inference URL, its endpoints, input and output parameters, and other useful information for developers.

This tutorial demonstrates how to:

* Deploy a vLLM model with OpenAI Compatibility in Wallaroo.
* Perform sample inferences via the Wallaroo SDK.
* Generate the inference endpoint API spec retrieval command and display some of the data.
* Perform a sample inference using the inference endpoint.

### Tutorial Prerequisites

* Wallaroo Version 2025.2 or above instance.

### References

* [Deploy LLMs with OpenAI Compatibility](https://docs.wallaroo.ai/wallaroo-llm/wallaroo-llm-package-deployment/wallaroo-llm-optimizations-openai-compatibility/)
* [Retrieve Inference Endpoint API Spec](https://docs.wallaroo.ai/wallaroo-model-operations/wallaroo-model-operations-serve/#retrieve-inference-endpoint-api-spec)

## Tutorial Steps

### Import Libraries

The first step is to import the libraries we'll be using.  These are included by default in the Wallaroo instance's JupyterHub service.

In [1]:
import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.framework import Framework
from wallaroo.openai_config import OpenaiConfig
import pyarrow as pa
from wallaroo.framework import VLLMConfig

### Open a Connection to Wallaroo

The next step is connect to Wallaroo through the Wallaroo client.  The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more details on logging in through Wallaroo, see the [Wallaroo SDK Essentials Guide: Client Connection](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [None]:
wl = wallaroo.Client(request_timeout=600)

### Create Workspace

We will now create the Wallaroo workspace to store our model and set it as the current workspace.  Future commands will default to this workspace for pipeline creation, model uploads, etc.

In [None]:
workspace = wl.get_workspace(name='vllm-openai-test', create_if_not_exist=True)
wl.set_current_workspace(workspace)

### Upload the vLLM with OpenAI API Compatibility Enabled

The model is uploaded with vLLM configuration options enabled, and continuous batch configuration. 

In [None]:
# Uploading the model

model_step = wl.upload_model(
    "tinyllamaopenaiyns4",
    "vllm-openai_tinyllama.zip",
    framework=Framework.VLLM,
    framework_config=VLLMConfig(
        gpu_memory_utilization=0.9, 
        max_model_len=128000,
        kv_cache_dtype='auto'
    ),
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
    convert_wait=True,
    accel=Acceleration.CUDA
)

In [4]:
model_step=wl.get_model("tinyllamaopenaiyns4")

In [5]:
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 256)

### Set OpenAI Configuration Options

The OpenAI configuration options are set.  Note that these are compliant with the OpenAI API settings.

In [6]:
openai_config = OpenaiConfig(
    enabled=True,
    completion_config={
        "temperature": .3,
        "max_tokens": 200
    },
    chat_completion_config={
        "temperature": .3,
        "max_tokens": 200,
        "chat_template": """
        {% for message in messages %}
            {% if message['role'] == 'user' %}
                {{ '<|user|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'system' %}
                {{ '<|system|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'assistant' %}
                {{ '<|assistant|>\n'  + message['content'] + eos_token }}
            {% endif %}
            
            {% if loop.last and add_generation_prompt %}
                {{ '<|assistant|>' }}
            {% endif %}
        {% endfor %}"""
    })

model_step = model_step.configure(openai_config=openai_config,continuous_batching_config = cbc)

### Deploy the OpenAI Compatibility Enabled Model

With the model uploaded and settings configured, the model is deployed.  The following deployment configuration set the hardware allocated to the model's use, including 1 gpu.

In [7]:
# Deploying

deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(.5) \
    .memory("1Gi") \
    .sidekick_cpus(model_step, 1) \
    .sidekick_memory(model_step, '8Gi') \
    .sidekick_gpus(model_step, 1) \
    .deployment_label('wallaroo.ai/accelerator:l4') \
    .build()

In [None]:
# pipeline.undeploy()
pipeline = wl.build_pipeline('tinyllama-openai')
pipeline.undeploy()
pipeline.clear()
pipeline.add_model_step(model_step)
pipeline.deploy(deployment_config = deployment_config)

In [9]:
pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.3.3',
   'name': 'engine-cfc8445-5wdql',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'tinyllama-openai',
      'status': 'Running',
      'version': '3454c2e8-fe0f-4b9b-bfc3-8fa80b79881b'}]},
   'model_statuses': {'models': [{'model_version_id': 868,
      'name': 'tinyllamaopenaiyns4',
      'sha': 'db68af9c290cdc8d047b7ac70f5acbd446435d2767ac4dfd51509b750a78bdd0',
      'status': 'Running',
      'version': 'e86f14e6-e46c-4a79-a4ae-266b7008bc44'}]}}],
 'engine_lbs': [{'ip': '10.4.4.10',
   'name': 'engine-lb-5bc46b68cf-ltwx9',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.3.7',
   'name': 'engine-sidekick-tinyllamaopenaiyns4-868-5f8647bcfb-4dxx8',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

### Generate the Inference Endpoint API Spec

The method `wallaroo.pipeline.Pipeline.generate_api_spec()` returns the pipeline inference endpoint specification `yaml` format under the OpenAPI 3.1.1 format.  This provides developers the ability to import this `yaml` file into their development environments and have:

* Inference API endpoint(s).
  * Inference endpoints require authentication bearer tokens.  For more details, see [Retrieve Pipeline Inference URL Token]({{<ref "wallaroo-mlops-connection-guide#retrieve-pipeline-inference-url-token">}}).
* Input fields and data schemas.
* Output fields and data schemas.

#### Retrieve Inference Endpoint API Spec Parameters

| Field | Type | Description |
|---|---|---|
| **path** | *String* (*Optional*) | The file path where the `yaml` file is downloaded.  If not specified, the default location is in the current directory of the SDK session with the pipeline name.  For example, for the pipeline `sample-pipeline`, the endpoint specification inference endpoint file is downloaded to `./sample-pipeline.yaml`. |

#### Retrieve Inference Endpoint API Spec Returns

A `yaml` file in OPenAPI 3.1.1 format for the specific pipeline that contains:

* **URL**:  The deployed pipeline URL, for example, for the pipeline `sample-pipeline` this URL could be:  `https://example.wallaroo.ai/v1/api/pipelines/infer/sample-pipeline-414/sample-pipeline`
* **PATHS**:  The paths for each endpoint enabled.  Endpoints differ depending on whether pipelines include models with OpenAI API compatibility enabled.

#### Retrieve Inference Endpoint API Spec Example

The following example creates the inference endpoint API spec as the file `token_streaming.yaml`.  For this example, we extract the URL and the OpenAI API endpoints.

In [13]:
pipeline.generate_api_spec(path="./token_streaming.yaml")

In [None]:
import yaml
spec_file = "./token_streaming.yaml"

# open the yaml file
with open(spec_file, 'r') as f:
    data = yaml.load(f, Loader=yaml.SafeLoader)

# display the deploy_url
deploy_url = data['servers'][0]['url']
display(deploy_url)

# show the endpoints
paths = data['paths']
for item in paths:
    display(item)

'https://autoscale-uat-gcp.wallaroo.dev/v1/api/pipelines/infer/tinyllama-openai-414/tinyllama-openai'

'/openai/v1/completions'

'/openai/v1/chat/completions'

In [17]:
pipeline = wl.get_pipeline('tinyllama-openai')

In [18]:
pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.4.12',
   'name': 'engine-7c55fff9f8-txpp2',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'tinyllama-openai',
      'status': 'Running',
      'version': 'c4335986-6816-47fc-b0d9-bba7ac68849e'}]},
   'model_statuses': {'models': [{'model_version_id': 868,
      'name': 'tinyllamaopenaiyns4',
      'sha': 'db68af9c290cdc8d047b7ac70f5acbd446435d2767ac4dfd51509b750a78bdd0',
      'status': 'Running',
      'version': 'e86f14e6-e46c-4a79-a4ae-266b7008bc44'}]}}],
 'engine_lbs': [{'ip': '10.4.8.10',
   'name': 'engine-lb-b5474885d-6xrvz',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.10.6',
   'name': 'engine-sidekick-tinyllamaopenaiyns4-868-c66cf7d49-4rwtr',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

### Sample Inferences

The following inferences show using the OpenAI client with the inference endpoint URL and endpoints as derived from the spec file.

In [20]:
# Now using the OpenAI client

token = wl.auth.auth_header()['Authorization'].split()[1]

from openai import OpenAI
client = OpenAI(
    base_url='https://autoscale-uat-gcp.wallaroo.dev/v1/api/pipelines/infer/tinyllama-openai-414/tinyllama-openai/openai/v1',
    api_key=token
)

In [21]:
for chunk in client.chat.completions.create(model="dummy", messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=1000, stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)

Title: The Love Story of Max and Lily

Max had always been a solitary soul. He had grown up in a small town, where everyone knew everyone else's business. He had never felt the need to connect with others, preferring to spend his days alone in his room, reading books and writing poetry.

But one day, everything changed. Max met Lily, a beautiful and enigmatic girl, at a local coffee shop. At first, Max was hesitant to approach her, but something about her drew him in. He couldn't shake the feeling that she was different from everyone else he had ever met.

As they spent more time together, Max began to realize that Lily was more than just a beautiful girl. She was intelligent, kind, and passionate about life. Max was fascinated by her, and he couldn't help but feel a sense of attraction towards her.

However, Max's past had taught him to be cautious. He knew that he had to be careful with Lily, as she was a sensitive and emotional person. Max tried to keep his feelings hidden, but he c

In [48]:
for chunk in client.completions.create(model="dummy", prompt="tell me about wallaroo.AI", max_tokens=1000, stream=True):
    print(chunk.choices[0].text, end="", flush=True)

's AI-powered chatbot platform?

In [None]:
token = wl.auth.auth_header()['Authorization'].split()[1]

'eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJoVUcyQ1puTTZpa0EtQlNRVFNsVkJnaEd0dk45QXItN0g2R3NLcHlrY0ZjIn0.eyJleHAiOjE3NTIxNzYwOTYsImlhdCI6MTc1MjE2ODg5NiwiYXV0aF90aW1lIjoxNzUyMTU0MDk5LCJqdGkiOiJlNmE4Y2MzZi00ZjRmLTQzZGMtYWFmYi0zYzZiNDg5ODlkNmYiLCJpc3MiOiJodHRwczovL2F1dG9zY2FsZS11YXQtZ2NwLndhbGxhcm9vLmRldi9hdXRoL3JlYWxtcy9tYXN0ZXIiLCJhdWQiOlsibWFzdGVyLXJlYWxtIiwiYWNjb3VudCJdLCJzdWIiOiJmYmRkZmY5ZC00OTE2LTRmNDYtYTkwNi0wYmUxYjM3MmE5ZjIiLCJ0eXAiOiJCZWFyZXIiLCJhenAiOiJzZGstY2xpZW50Iiwic2Vzc2lvbl9zdGF0ZSI6IjFmNDM0MzNiLWY1M2UtNDFjNC04OTc2LTg2Zjg2NzdiZjc4NyIsImFjciI6IjAiLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsiY3JlYXRlLXJlYWxtIiwiZGVmYXVsdC1yb2xlcy1tYXN0ZXIiLCJvZmZsaW5lX2FjY2VzcyIsImFkbWluIiwidW1hX2F1dGhvcml6YXRpb24iXX0sInJlc291cmNlX2FjY2VzcyI6eyJtYXN0ZXItcmVhbG0iOnsicm9sZXMiOlsidmlldy1pZGVudGl0eS1wcm92aWRlcnMiLCJ2aWV3LXJlYWxtIiwibWFuYWdlLWlkZW50aXR5LXByb3ZpZGVycyIsImltcGVyc29uYXRpb24iLCJjcmVhdGUtY2xpZW50IiwibWFuYWdlLXVzZXJzIiwicXVlcnktcmVhbG1zIiwidmlldy1hdXRob3JpemF0aW9uIiwicXVlcnktY2xpZW50cyI

In [None]:
# Streaming: Completion
!curl -X POST \
  -H "Authorization: Bearer abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100, "stream": true, "stream_options": {"include_usage": true}}' \
  https://autoscale-uat-gcp.wallaroo.dev/v1/api/pipelines/infer/tinyllama-openai-414/tinyllama-openai/openai/v1/completions

data: {"id":"cmpl-3e1d3dbf65ef48bfabd8da910b2338a2","created":1752169114,"model":"tinyllama.zip","choices":[{"text":" about","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-3e1d3dbf65ef48bfabd8da910b2338a2","created":1752169114,"model":"tinyllama.zip","choices":[{"text":" a","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-3e1d3dbf65ef48bfabd8da910b2338a2","created":1752169114,"model":"tinyllama.zip","choices":[{"text":" person","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-3e1d3dbf65ef48bfabd8da910b2338a2","created":1752169114,"model":"tinyllama.zip","choices":[{"text":" who","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-3e1d3dbf65ef48bfabd8da910b2338a2","created":1752169114,"model":"tinyllama.zip","choices":[{"text":" went","index":0,"logprobs":null,"finish_reason":null,"s

In [53]:
# Streaming: Chat completion
!curl -X POST \
  -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJoVUcyQ1puTTZpa0EtQlNRVFNsVkJnaEd0dk45QXItN0g2R3NLcHlrY0ZjIn0.eyJleHAiOjE3NTAxMzExMzcsImlhdCI6MTc1MDEyMzkzNywiYXV0aF90aW1lIjoxNzUwMDk4NTQ3LCJqdGkiOiJmZjlkMWEyNy0wZmY2LTQ2YzItODExMC03NDZkODFmNTU3YzkiLCJpc3MiOiJodHRwczovL2F1dG9zY2FsZS11YXQtZ2NwLndhbGxhcm9vLmRldi9hdXRoL3JlYWxtcy9tYXN0ZXIiLCJhdWQiOlsibWFzdGVyLXJlYWxtIiwiYWNjb3VudCJdLCJzdWIiOiJmYmRkZmY5ZC00OTE2LTRmNDYtYTkwNi0wYmUxYjM3MmE5ZjIiLCJ0eXAiOiJCZWFyZXIiLCJhenAiOiJzZGstY2xpZW50Iiwic2Vzc2lvbl9zdGF0ZSI6ImJmN2YzNDk3LTMwZmItNDllYy05NmY2LTkwYjhjZTc5NWQ0ZiIsImFjciI6IjAiLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsiY3JlYXRlLXJlYWxtIiwiZGVmYXVsdC1yb2xlcy1tYXN0ZXIiLCJvZmZsaW5lX2FjY2VzcyIsImFkbWluIiwidW1hX2F1dGhvcml6YXRpb24iXX0sInJlc291cmNlX2FjY2VzcyI6eyJtYXN0ZXItcmVhbG0iOnsicm9sZXMiOlsidmlldy1pZGVudGl0eS1wcm92aWRlcnMiLCJ2aWV3LXJlYWxtIiwibWFuYWdlLWlkZW50aXR5LXByb3ZpZGVycyIsImltcGVyc29uYXRpb24iLCJjcmVhdGUtY2xpZW50IiwibWFuYWdlLXVzZXJzIiwicXVlcnktcmVhbG1zIiwidmlldy1hdXRob3JpemF0aW9uIiwicXVlcnktY2xpZW50cyIsInF1ZXJ5LXVzZXJzIiwibWFuYWdlLWV2ZW50cyIsIm1hbmFnZS1yZWFsbSIsInZpZXctZXZlbnRzIiwidmlldy11c2VycyIsInZpZXctY2xpZW50cyIsIm1hbmFnZS1hdXRob3JpemF0aW9uIiwibWFuYWdlLWNsaWVudHMiLCJxdWVyeS1ncm91cHMiXX0sImFjY291bnQiOnsicm9sZXMiOlsibWFuYWdlLWFjY291bnQiLCJtYW5hZ2UtYWNjb3VudC1saW5rcyIsInZpZXctcHJvZmlsZSJdfX0sInNjb3BlIjoiZW1haWwgb3BlbmlkIHByb2ZpbGUiLCJzaWQiOiJiZjdmMzQ5Ny0zMGZiLTQ5ZWMtOTZmNi05MGI4Y2U3OTVkNGYiLCJlbWFpbF92ZXJpZmllZCI6ZmFsc2UsImh0dHBzOi8vaGFzdXJhLmlvL2p3dC9jbGFpbXMiOnsieC1oYXN1cmEtdXNlci1pZCI6ImZiZGRmZjlkLTQ5MTYtNGY0Ni1hOTA2LTBiZTFiMzcyYTlmMiIsIngtaGFzdXJhLXVzZXItZW1haWwiOiJ5b3VuZXMuYW1hckB3YWxsYXJvby5haSIsIngtaGFzdXJhLWRlZmF1bHQtcm9sZSI6ImFkbWluX3VzZXIiLCJ4LWhhc3VyYS1hbGxvd2VkLXJvbGVzIjpbInVzZXIiLCJhZG1pbl91c2VyIl0sIngtaGFzdXJhLXVzZXItZ3JvdXBzIjoie30ifSwibmFtZSI6IllvdW5lcyBBbWFyIiwicHJlZmVycmVkX3VzZXJuYW1lIjoieW91bmVzLmFtYXJAd2FsbGFyb28uYWkiLCJnaXZlbl9uYW1lIjoiWW91bmVzIiwiZmFtaWx5X25hbWUiOiJBbWFyIiwiZW1haWwiOiJ5b3VuZXMuYW1hckB3YWxsYXJvby5haSJ9.PiSbiPmvZYf9gXC12QFdHKwlHRQUN8T3Qkb7deK2RCxJOH_iyDnZZl6zkvHqZX3aH7DraMUgaHJK5b9QEvInb_5YtezOSSRIxZiZWK9S-osTTAV4x6N8sg6W8LLpv1h1JIodiZobB0Hrn3K6qru6jMB4xh7AxpUZ2J-gqQivxVmGg46RDHy5X_cGwYE6OYN315o7gxSADdNxKDlry0M-fOjyRyz1yUKou1zkUM7ti3h9wVb3sDBOudqie6IJuQVNrd_P6JR62cugIjs17Jc5XRyYK_QdsHdqiw7K_E6EXv7mtckoWI1Fr4m4JXXWZ6t1PSEmkS-iaL65Zp9DLQoRCA" \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100, "stream": true, "stream_options": {"include_usage": true}}' \
  https://autoscale-uat-gcp.wallaroo.dev/v1/api/pipelines/infer/tinyllama-openai-414/tinyllama-openai/openai/v1/chat/completions


data: {"id":"chatcmpl-2a31009c58c14e0cb935b13cf960ba69","object":"chat.completion.chunk","created":1750125078,"model":"tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":"assistant","content":""}}],"usage":null}

data: {"id":"chatcmpl-2a31009c58c14e0cb935b13cf960ba69","object":"chat.completion.chunk","created":1750125078,"model":"tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"Of"}}],"usage":null}

data: {"id":"chatcmpl-2a31009c58c14e0cb935b13cf960ba69","object":"chat.completion.chunk","created":1750125078,"model":"tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" course"}}],"usage":null}

data: {"id":"chatcmpl-2a31009c58c14e0cb935b13cf960ba69","object":"chat.completion.chunk","created":1750125078,"model":"tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"!"}}],"usage":null}

data: {