This tutorial and the assets can be downloaded as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/blob/wallaroo2025.1_tutorials/wallaroo-llms/llm-performance-optimizations/continuous-batching-standard-llama).

## Continuous Batching with Llama 3.1 8B Instruct Tutorial

Wallaroo **continuous batching** for vLLMs provides:

* Standards-based vLLM deployment options.
* Increased performance for vLLM deployments that leverage GPUs.

**Continuous Batching** improves throughput by dynamically grouping incoming inference requests in real time to optimize processing. It is most useful when LLM-based or agentic AI applications run at scale and submit many concurrent inference requests, where it balances latency, throughput, and resource use.
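To make the throughput claim concrete, here is a minimal, self-contained simulation. It is a toy model, not Wallaroo's or vLLM's actual scheduler. It assumes each request needs a known number of decoding steps and that the server has a fixed number of slots; the function names are hypothetical. Static batching holds every slot until the longest request in the batch finishes, while continuous batching hands a freed slot to the next waiting request immediately.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Total steps when a finished request frees its slot right away."""
    waiting = deque(lengths)
    running = []  # remaining decoding steps for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One step produces one token for every in-flight request;
        # requests that reach zero remaining steps leave the batch.
        running = [left - 1 for left in running if left - 1 > 0]
        steps += 1
    return steps


# Illustrative workload: alternating short and long generations, two slots.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this toy workload the short requests no longer wait on the long ones, so the same work finishes in fewer steps. Real gains depend on the model, the hardware, and the traffic pattern.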

Performance is fine-tuned through **Framework Configurations**, which tailor how the vLLM is deployed and how it processes requests.

Wallaroo continuous batching is available for the following frameworks:

* `wallaroo.framework.Framework.VLLM` aka "Standard Framework":  Hugging Face vLLM models compatible with NVIDIA CUDA.
* `wallaroo.framework.Framework.CUSTOM` aka "Custom Framework":  Wallaroo Custom Models aka BYOP (Bring Your Own Predict) provide greater flexibility through Python scripts included with the model artifacts.  For more details, see [Custom Model Upload](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-arbitrary-python/).

This tutorial demonstrates deploying the Llama 3.1 8B Instruct LLM with continuous batching in Wallaroo, using CUDA AI acceleration and the Standard Framework.  For access to these sample models and for a demonstration of how to use Continuous Batching to improve LLM performance:

* Contact your Wallaroo Support Representative **OR**
* [Schedule Your Wallaroo.AI Demo Today](https://wallaroo.ai/request-a-demo/)

## Tutorial Overview

This tutorial demonstrates using Wallaroo to:

* Upload an LLM with the following options:
  * Framework:  `vLLM`
  * Framework Configuration to specify LLM options.
* Define a Continuous Batching Configuration and apply it to the LLM model configuration.
* Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Framework Configuration is applied at the LLM level, so it is inherited during deployment.
* Demonstrate how to perform a sample inference.
* Demonstrate publishing a Wallaroo pipeline to an Open Container Initiative (OCI) registry for deployment in multi-cloud or edge environments.

## Requirements

This tutorial requires the following:

* Llama 3.1 8B Instruct vLLM.  This is available through a Wallaroo representative.
* Wallaroo version 2025.1 or above.

## Tutorial Steps

### Library Imports

We start by importing the libraries used for this tutorial, including the [Wallaroo SDK](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/).  This is provided by default when executing this Jupyter Notebook in the Wallaroo JupyterHub service.

In [None]:
import wallaroo
import pyarrow as pa
import pandas as pd
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework, VLLMConfig
from wallaroo.engine_config import Acceleration
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.object import EntityNotFoundError

### Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client.  The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [2]:
wl = wallaroo.Client()

### Define Schemas and Upload Model

The model is uploaded via the Wallaroo SDK method `wallaroo.client.Client.upload_model` which takes the following parameters.

| Parameter | Type | Description |
|---|---|---|
| `name` | `string` (*Required*) | The name of the model.  Model names are unique **per workspace**.  Models that are uploaded with the same name are assigned as a new **version** of the model. |
| `path` | `string` (*Required*) | The path to the model file being uploaded. |
| `framework` | `string` (*Required*) | The framework of the model from `wallaroo.framework.Framework`.  For vLLMs, this framework is `wallaroo.framework.Framework.VLLM`. |
| `input_schema` | `pyarrow.lib.Schema` <ul><li>Native Wallaroo Runtimes: (*Optional*)</li><li>Non-Native Wallaroo Runtimes: (*Required*)</li></ul> | The input schema in Apache Arrow schema format. |
| `output_schema` | `pyarrow.lib.Schema` <ul><li>Native Wallaroo Runtimes: (*Optional*)</li><li>Non-Native Wallaroo Runtimes: (*Required*)</li></ul> | The output schema in Apache Arrow schema format. |
| `framework_config` | `wallaroo.framework.VLLMConfig` (*Optional*) | Sets the vLLM configuration options based on the [Framework Configuration Parameters](#framework-configuration-parameters).  If no options are specified, the default values are applied. |
| `convert_wait` | `bool` (*Optional*) | <ul><li>**True**: Waits in the script for the model conversion to complete.</li><li>**False**:  Proceeds with the script without waiting for the model conversion process to complete.</li></ul> |

`wallaroo.framework.VLLMConfig` contains the following parameters.

| Parameters | Type |
|---|---|
| **max_num_seqs** | *Integer* (*Default: 256*) |
| **max_model_len** | *Integer* (*Default: None*) |
| **max_seq_len_to_capture** | *Integer* (*Default: 8192*) |
| **quantization** | (*Default: None*)  |
| **kv_cache_dtype** | (*Default: 'auto'*) |
| **gpu_memory_utilization** | *Float* (*Default: 0.9*) |
| **block_size** | (*Default: None*)  |
| **device_group** |  (*Default: None*) This setting is ignored for CUDA acceleration. |

#### Define Input and Output Schemas

The input and output schemas are defined in Apache Arrow schema format using `pyarrow`.


In [None]:
input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64())
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64())
])
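The input schema above fixes the shape of every request: a string `prompt` and an integer `max_tokens`. As a rough illustration, the following sketch mirrors that schema in plain Python and checks request rows before they are turned into a DataFrame. The `EXPECTED_INPUT_FIELDS` mapping and `check_request` helper are hypothetical names for this example and are not part of the Wallaroo SDK or `pyarrow`.

```python
# Plain-Python mirror of the input schema above (assumption: rows are
# ordinary dicts holding built-in str and int values).
EXPECTED_INPUT_FIELDS = {"prompt": str, "max_tokens": int}


def check_request(row):
    """Return a list of problems found in one request row (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_INPUT_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif type(row[name]) is not expected_type:
            problems.append(
                f"{name} should be {expected_type.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    # Fields outside the schema are reported as well.
    for name in row:
        if name not in EXPECTED_INPUT_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems


print(check_request({"prompt": "What is Wallaroo.AI?", "max_tokens": 200}))
print(check_request({"prompt": "What is Wallaroo.AI?", "max_tokens": "200"}))
```

A check like this catches mistakes such as a quoted number for `max_tokens` before the request reaches the deployed model.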

#### Define VLLMConfig

We define the `wallaroo.framework.VLLMConfig` object and pass it to `upload_model` as the `framework_config` parameter.

For this example, the `VLLMConfig` parameters are set with the following:

* `gpu_memory_utilization=0.9` 
* `max_model_len=128`

Other parameters will use the default values.

In [None]:
vllm_config = VLLMConfig(
        gpu_memory_utilization=0.9, 
        max_model_len=128
    )
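To give a feel for why `max_model_len` and `gpu_memory_utilization` matter, the following back-of-the-envelope sketch estimates key-value cache memory. It rests on several assumptions that are not stated in this tutorial: the published Llama 3.1 8B architecture (32 layers, 8 key-value heads, head dimension 128), a 16-bit cache, and every one of the default 256 sequences reaching the full `max_model_len`. The actual allocation made by vLLM depends on block size, memory profiling, and other settings, so treat this as a rough estimate only.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of key-value cache for a single token across all layers."""
    # The leading 2 accounts for storing both a key and a value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Assumed architecture figures for Llama 3.1 8B with a 16-bit cache.
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2
)

max_model_len = 128  # value set in this tutorial
max_num_seqs = 256   # default listed in the VLLMConfig table

# Worst case: every concurrent sequence uses its full token budget.
worst_case_tokens = max_model_len * max_num_seqs
worst_case_gib = per_token * worst_case_tokens / 2**30

print(f"{per_token} bytes per token")
print(f"{worst_case_gib:.1f} GiB for {worst_case_tokens} cached tokens")
```

Under these assumptions, a larger `max_model_len` scales the worst-case cache linearly, which is one reason to keep it close to what the application needs.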

#### Upload model via the Wallaroo SDK

With our values set, we upload the model with the `wallaroo.client.Client.upload_model` method with the following parameters:

* Model name and path to the Llama 3.1 8B Instruct LLM.
* `framework_config` set to our defined `VLLMConfig`.
* Input and output schemas.
* `accel` set to `wallaroo.engine_config.Acceleration.CUDA`.

In [None]:
model = wl.upload_model(
    "vllm-llama31-8b-async-demo", 
    "./models/vLLM_llama-31-8b.zip",
    framework=Framework.VLLM,
    framework_config=vllm_config,
    input_schema=input_schema, 
    output_schema=output_schema,
    accel=Acceleration.CUDA
)
model

Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime.
.............................................successful

Ready


| Field | Value |
|---|---|
| Name | vllm-llama31-8b-async-demo |
| Version | 422d3ad9-1bc7-40c1-99af-0ba109964bfd |
| File Name | vLLM_llama-31-8b.zip |
| SHA | 62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838 |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132 |
| Architecture | x86 |
| Acceleration | cuda |
| Updated At | 2025-08-May 19:24:36 |
| Workspace id | 60 |



### Set Continuous Batching Configuration

The model configuration is set either during model upload or after model upload.  We define the continuous batching configuration with the maximum concurrent batch size set to `100`, then apply it to the model configuration.

If `max_concurrent_batch_size` is **not** specified, it defaults to `256`.

When applying a continuous batch configuration to a model configuration, the input and output schemas **must** be included.

In [None]:
# Define continuous batching for the async vLLM; max_concurrent_batch_size sets the number of concurrent connections
cbc = ContinuousBatchingConfig(max_concurrent_batch_size=100)

In [None]:
vllm_with_continuous_batching = model.configure(
    input_schema = input_schema,
    output_schema = output_schema,
    continuous_batching_config = cbc
)

In [None]:
vllm_with_continuous_batching

| Field | Value |
|---|---|
| Name | vllm-llama31-8b-async-demo |
| Version | 422d3ad9-1bc7-40c1-99af-0ba109964bfd |
| File Name | vLLM_llama-31-8b.zip |
| SHA | 62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838 |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132 |
| Architecture | x86 |
| Acceleration | cuda |
| Updated At | 2025-08-May 19:24:36 |
| Workspace id | 60 |


### Deploy Model

Models are deployed in Wallaroo via **Wallaroo Pipelines** through the following process.

* (Optional): Create a **deployment configuration**.  If no deployment configuration is specified, the default values are used.  For our deployment, the LLM is assigned the following resources:
  * 1 CPU
  * 10 Gi RAM
  * 1 GPU from the nodepool `"wallaroo.ai/accelerator:a100"`.  Wallaroo deployments and pipelines inherit the acceleration settings from the model, so this will be `CUDA`.
* Create the Wallaroo pipeline.
* Assign the model as a **pipeline step** to process incoming data and return the inference results.
* Deploy the pipeline with the deployment configuration.

#### Define the Deployment Configuration

The deployment configuration allocates resources for the LLM's exclusive use.  These resources are used by the LLM until the pipeline is **undeployed** and the resources are returned.

In [None]:
deployment_config = DeploymentConfigBuilder() \
    .cpus(1.).memory('1Gi') \
    .sidekick_cpus(vllm_with_continuous_batching, 1.) \
    .sidekick_memory(vllm_with_continuous_batching, '10Gi') \
    .sidekick_gpus(vllm_with_continuous_batching, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()
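The builder calls above request resources in two groups: `cpus` and `memory` for the Wallaroo engine, and the `sidekick_*` values for the container that runs the LLM. As a rough sanity check, the following sketch adds the two groups together. It uses plain dictionaries rather than the SDK, and the `gi_to_int` helper is a hypothetical name that only handles whole-number `Gi` values.

```python
def gi_to_int(quantity):
    """Parse a whole-number 'Gi' quantity such as '10Gi' into an int."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"expected a value ending in 'Gi', got {quantity!r}")
    return int(quantity[:-2])


# Values copied from the deployment configuration above.
engine = {"cpus": 1.0, "memory": "1Gi"}
llm_container = {"cpus": 1.0, "memory": "10Gi", "gpus": 1}

total_cpus = engine["cpus"] + llm_container["cpus"]
total_memory_gi = gi_to_int(engine["memory"]) + gi_to_int(llm_container["memory"])

print(f"total CPUs: {total_cpus}")
print(f"total memory: {total_memory_gi}Gi")
print(f"GPUs: {llm_container['gpus']}")
```

These totals line up with the `--cpus=2.0 --memory=11g` flags in the `docker run` command returned by the publish step later in this tutorial.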

#### Deploy vLLM

Next, we deploy the model by creating the pipeline, adding the vLLM as the pipeline step, and deploying the pipeline with the deployment configuration.

Once complete, the model is ready to accept inference requests.

In [None]:
pipeline = wl.build_pipeline("llama-31-8b-vllm-demo")
pipeline.clear()

pipeline.add_model_step(vllm_with_continuous_batching)
pipeline.deploy(deployment_config=deployment_config)

In [13]:
pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.8.2',
   'name': 'engine-8558f6576d-8h7pc',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-31-8b-vllm-demo',
      'status': 'Running',
      'version': '62806288-5f42-44b8-9345-bb4dfb613801'}]},
   'model_statuses': {'models': [{'model_version_id': 443,
      'name': 'vllm-llama31-8b-async-demo',
      'sha': '62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838',
      'status': 'Running',
      'version': '422d3ad9-1bc7-40c1-99af-0ba109964bfd'}]}}],
 'engine_lbs': [{'ip': '10.4.1.17',
   'name': 'engine-lb-5cf49f9d5f-sqr4f',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.8.7',
   'name': 'engine-sidekick-vllm-llama31-8b-async-demo-443-75d58845c-svvll',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
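The status dictionary above reports a status for the pipeline itself and for each engine, engine load balancer, and sidekick. A small helper like the following sketch can confirm that everything is `Running` before sending inference requests. The helper name `components_not_running` is hypothetical, and the sample dictionary is a trimmed copy of the output above rather than a live call.

```python
def components_not_running(status):
    """Return the names of components whose status is not 'Running'."""
    problems = []
    if status.get("status") != "Running":
        problems.append("pipeline")
    for group in ("engines", "engine_lbs", "sidekicks"):
        for component in status.get(group, []):
            if component.get("status") != "Running":
                problems.append(component.get("name", group))
    return problems


# Trimmed copy of the status output shown above.
sample_status = {
    "status": "Running",
    "engines": [{"name": "engine-8558f6576d-8h7pc", "status": "Running"}],
    "engine_lbs": [{"name": "engine-lb-5cf49f9d5f-sqr4f", "status": "Running"}],
    "sidekicks": [
        {
            "name": "engine-sidekick-vllm-llama31-8b-async-demo-443-75d58845c-svvll",
            "status": "Running",
        }
    ],
}

print(components_not_running(sample_status))
```

An empty list means every reported component is running. In a notebook, the same helper could be given the result of `pipeline.status()`.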

### Inference

Inference requests are submitted to deployed models as either pandas DataFrames or Apache Arrow tables.  The inference data must match the input schemas defined earlier.

Our sample inference request submits a pandas DataFrame with a simple prompt and the `max_tokens` field set to 200.  We receive a pandas DataFrame in return with the outputs labeled as `out.{variable_name}`, with `variable_name` matching the output schemas defined at model upload.

In [14]:
data = pd.DataFrame({"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]})

In [15]:
pipeline.infer(data)

| | time | in.max_tokens | in.prompt | out.generated_text | out.num_output_tokens | anomaly.count |
|---|---|---|---|---|---|---|
| 0 | 2025-05-08 19:42:06.259 | 200 | What is Wallaroo.AI? | Cloud and AutoML with Python\nWallaroo.AI is a... | 122 | 0 |
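Continuous batching has the most to work with when several requests are in flight at once. The following sketch shows one generic way to submit prompts concurrently from a client using the Python standard library. The `fake_infer` function is a hypothetical stand-in that only sleeps briefly; in practice it would be replaced with a real inference request to the deployed pipeline, and the worker count is an arbitrary illustrative value.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer(prompt, max_tokens=200):
    """Hypothetical stand-in for one inference request."""
    time.sleep(0.01)  # simulate network and generation latency
    return {"prompt": prompt, "generated_text": f"(response to: {prompt})"}


prompts = [f"Question {i}" for i in range(20)]

# Submit the prompts from a small pool of worker threads so that
# several requests overlap; map() returns results in input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fake_infer, prompts))

print(len(results), "responses")
print(results[0]["generated_text"])
```

Because results come back in input order, each response can be matched to its prompt by position.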


### Publish Pipeline

Wallaroo pipelines are published to OCI Registries via the `wallaroo.pipeline.Pipeline.publish` method.  This stores the following in the OCI registry:

* The LLM set as the pipeline step.
* The Wallaroo engine used to deploy the LLM.  The engine used is targeted based on settings inherited from the LLM set during the **model upload** stage.  These settings include:
  * Architecture
  * AI accelerations
  * Framework Configuration
* The deployment configuration included as a parameter to the publish command.

For more details on publishing, deploying, and inferencing in multi-cloud and edge with Wallaroo, see [Edge and Multi-cloud Model Publish and Deploy](https://docs.wallaroo.ai/wallaroo-model-operations-run-anywhere/wallaroo-model-operations-run-anywhere-inference/wallaroo-model-operations-run-anywhere-publish/).

Note that when published to an OCI registry, the `publish` command returns the `docker run` and `helm install` commands used to deploy the specified LLM.

In [16]:
pipeline.publish(deployment_config=deployment_config)

Waiting for pipeline publish... It may take up to 600 sec.
............................................... Published.


| Field | Value |
|---|---|
| ID | 36 |
| Pipeline Name | llama-31-8b-vllm-demo |
| Pipeline Version | a5b7a202-9923-4d8d-ba4c-31e22a83cddc |
| Status | Published |
| Workspace Id | 60 |
| Workspace Name | younes.amar@wallaroo.ai - Default Workspace |
| Edges | |
| Engine URL | us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-main-6132 |
| Pipeline URL | us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/pipelines/llama-31-8b-vllm-demo:a5b7a202-9923-4d8d-ba4c-31e22a83cddc |
| Helm Chart URL | oci://us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/charts/llama-31-8b-vllm-demo |

docker run \
  -p $EDGE_PORT:8080 \
  -e OCI_USERNAME=$OCI_USERNAME \
  -e OCI_PASSWORD=$OCI_PASSWORD \
  -e PIPELINE_URL=us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/pipelines/llama-31-8b-vllm-demo:a5b7a202-9923-4d8d-ba4c-31e22a83cddc \
  -e CONFIG_CPUS=1.0 --gpus all --cpus=2.0 --memory=11g \
  us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-main-6132

helm install --atomic $HELM_INSTALL_NAME \
  oci://us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/charts/llama-31-8b-vllm-demo \
  --namespace $HELM_INSTALL_NAMESPACE \
  --version 0.0.1-a5b7a202-9923-4d8d-ba4c-31e22a83cddc \
  --set ociRegistry.username=$OCI_USERNAME \
  --set ociRegistry.password=$OCI_PASSWORD
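The `docker run` and `helm install` commands above contain shell placeholders such as `$EDGE_PORT` and `$OCI_USERNAME` that must be set before the commands are run. As a small convenience, the following sketch lists the placeholders in a command that have no value in a given environment mapping. The helper name `unset_placeholders` is hypothetical, and the command string is an abbreviated copy of the `docker run` output that keeps only the parts containing placeholders.

```python
import re


def unset_placeholders(command, env):
    """Return $NAME placeholders in command that are missing or empty in env."""
    names = []
    for name in re.findall(r"\$([A-Z_][A-Z0-9_]*)", command):
        if name not in names:  # keep first-seen order, drop repeats
            names.append(name)
    return [name for name in names if not env.get(name)]


# Abbreviated copy of the docker run command, placeholders only.
command = (
    "docker run -p $EDGE_PORT:8080 "
    "-e OCI_USERNAME=$OCI_USERNAME "
    "-e OCI_PASSWORD=$OCI_PASSWORD"
)

print(unset_placeholders(command, {"EDGE_PORT": "8080"}))
```

Passing `os.environ` as the second argument would check the current shell environment instead of a hand-written dictionary.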


### Undeploy

With the tutorial complete, we undeploy the pipeline to return its resources to the Wallaroo environment.

In [91]:
pipeline.undeploy()

 ok


| Field | Value |
|---|---|
| name | llama-31-8b-vllm-ynsv5 |
| created | 2025-05-06 12:31:40.360907+00:00 |
| last_updated | 2025-05-06 19:51:47.490400+00:00 |
| deployed | False |
| workspace_id | 60 |
| workspace_name | younes.amar@wallaroo.ai - Default Workspace |
| arch | x86 |
| accel | cuda |
| tags | |
| versions | b82ed30f-e937-4b49-94d5-63e6e798cc4b, b0a4ab4d-28ee-4470-9391-888a486375d2, 47760536-b263-428d-a9eb-f763c84f8920, 632917ff-0ffd-49be-abca-5a69a6432f93, 18cc0cad-cf6c-4abf-9083-ee90c2e704e2 |

