This tutorial and the assets can be downloaded as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/blob/wallaroo2025.1_tutorials/wallaroo-llms/llm-performance-optimizations/continuous-batching-custom-llama).

## Continuous Batching with Llama 3 8B Instruct Custom Config Tutorial

Wallaroo **continuous batching** for vLLMs provides:

* Standards-based vLLM deployment options.
* Increased performance for vLLM deployments that leverage GPUs.

**Continuous Batching** improves throughput by dynamically grouping incoming inference requests in real time to optimize processing. It is most useful when LLM-based or agentic AI applications serve many concurrent inference requests at scale, where it balances latency, throughput, and resource use.
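To make the idea concrete, the following is a minimal, self-contained simulation, not Wallaroo or vLLM code. It compares a static batcher, which waits for the longest request in each batch, with a continuous batcher that refills a slot as soon as a request finishes. The function names and the request lengths are illustrative assumptions.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request frees its slot immediately."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Requests with one step left finish now and leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))      # 16
print(continuous_batching_steps(lengths, batch_size=4))  # 10
```

In this toy workload the short requests no longer wait on the long ones, so the same eight requests complete in 10 decode steps instead of 16.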

Wallaroo continuous batching is available for the following frameworks:

* `wallaroo.framework.Framework.VLLM` aka "Standard Framework":  Hugging Face vLLM models compatible with NVIDIA CUDA.
* `wallaroo.framework.Framework.CUSTOM` aka "Custom Framework":  Wallaroo Custom Models aka BYOP (Bring Your Own Predict) provide greater flexibility through Python scripts included with the model artifacts.  For more details, see [Custom Model Upload](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-arbitrary-python/).

This tutorial demonstrates deploying the Llama V3 Instruct LLM with continuous batching in Wallaroo, using CUDA AI Acceleration and the Custom framework configuration.  For access to these sample models and for a demonstration of how to use Continuous Batching to improve LLM performance:

* Contact your Wallaroo Support Representative **OR**
* [Schedule Your Wallaroo.AI Demo Today](https://wallaroo.ai/request-a-demo/)

## Tutorial Overview

This tutorial demonstrates using Wallaroo to:

* Upload an LLM with the following options:
  * Framework:  `Custom`.  The Wallaroo Custom Model for this tutorial includes extensions to enable continuous batching with its deployment.
  * Framework Configuration to specify LLM options.
* Define a Continuous Batching Configuration and apply it to the LLM model configuration.
* Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Framework Configuration is applied at the LLM level, so it is inherited during deployment.
* Demonstrate how to perform a sample inference.

## Requirements

This tutorial requires the following:

* Llama V3 Instruct vLLM encapsulated in the Wallaroo Custom Model aka BYOP Framework.  This is available through a Wallaroo representative.
* Wallaroo version 2025.1 and above.

### Custom vLLM Requirements

To leverage continuous batching for Custom Models, the following requirements must be met.  For full Custom model details, see [Custom Model aka BYOP](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-arbitrary-python/).

Wallaroo Custom Models include the following artifacts.

| Artifact | Type | Description |
|---|---|---|
| Python scripts aka `.py` files with classes that extend `mac.inference.AsyncInference` and `mac.inference.creation.InferenceBuilder` | Python Script | Extend the classes `mac.inference.AsyncInference` and `mac.inference.creation.InferenceBuilder`.  These are included with the [Wallaroo SDK]({{<ref "wallaroo-sdk-install-guides">}}).  Further details are in [Custom Model Script Requirements](#arbitrary-python-script-requirements).  Note that there are no naming requirements for the classes that extend `mac.inference.AsyncInference` and `mac.inference.creation.InferenceBuilder` - any qualified class name is sufficient as long as these two classes are extended as defined below. |
| `requirements.txt` | Python requirements file | This sets the Python libraries used for the Custom Model.  These libraries should be targeted for Python 3.10 compliance.  **These requirements and the versions of libraries should be exactly the same between creating the model and deploying it in Wallaroo**.  This ensures that the script and methods will function exactly the same as during the model creation process. |
| Other artifacts | Files | Other models, files, and other artifacts used in support of this model. |

For Custom Models with continuous batching, the following additions to the standard Wallaroo Custom Models are required.

* In the `requirements.txt` file, the `vllm` library **must** be included.  For optimal performance, use the version specified below.

    ```python
    {{<param python_library_version_vllm>}}
    ```

* Import the following libraries

    ```python
    from vllm import AsyncLLMEngine, SamplingParams
    from vllm.engine.arg_utils import AsyncEngineArgs
    ```

* The class that extends `InferenceBuilder` implements:
  * `def inference(self) -> AsyncVLLMInference`: Specifies the Inference instance used by `create`.
  * `def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference:`  Creates the inference subclass and adds the vLLM for use with the inference requests.

The following shows an example of extending `inference` and `create` for `AsyncVLLMInference`.

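```python
# Wallaroo custom model classes (import paths as used in Wallaroo's
# Custom Model examples)
from mac.config.inference import CustomInferenceConfig
from mac.inference.creation import InferenceBuilder

# vllm import libraries
from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs

# AsyncVLLMInference is the class extending mac.inference.AsyncInference,
# defined elsewhere in the same script.


class AsyncVLLMInferenceBuilder(InferenceBuilder):
    """Inference builder class for AsyncVLLMInference."""

    @property  # create() reads self.inference as an attribute, so this must be a property
    def inference(self) -> AsyncVLLMInference:
        """Returns an Inference subclass instance.
        This specifies the Inference instance to be used
        by create() to build additionally needed components."""
        return AsyncVLLMInference()

    def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference:
        """Creates an Inference subclass and assigns a model to it.
        :param config: Inference configuration
        :return: Inference subclass
        """
        inference = self.inference
        # Load the vLLM engine from the "model" directory in the uploaded artifacts.
        inference.model = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model=(config.model_path / "model").as_posix(),
            ),
        )
        return inference
```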
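The reason for building on an asynchronous engine can be sketched with the standard library alone. The `FakeAsyncEngine` class below is a hypothetical stand-in and is not the vLLM or `mac` API; it only shows that, with an async interface, several requests can be in flight at once, which is what gives an engine the opportunity to batch them.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for an async LLM engine (NOT the vLLM API)."""

    def __init__(self):
        self.in_flight = 0
        self.peak_in_flight = 0

    async def generate(self, prompt: str, max_tokens: int) -> str:
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        try:
            for _ in range(max_tokens):
                # Yield to the event loop once per simulated decode step.
                await asyncio.sleep(0)
            return f"echo: {prompt}"
        finally:
            self.in_flight -= 1


async def predict_many(engine, prompts, max_tokens=4):
    """Submit all prompts at once and collect the results in prompt order."""
    return await asyncio.gather(
        *(engine.generate(prompt, max_tokens) for prompt in prompts)
    )


engine = FakeAsyncEngine()
results = asyncio.run(predict_many(engine, ["a", "b", "c"]))
print(results)                # ['echo: a', 'echo: b', 'echo: c']
print(engine.peak_in_flight)  # 3
```

Inside a notebook that already runs an event loop, `await predict_many(engine, prompts)` would replace the `asyncio.run(...)` call.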


## Tutorial Steps

### Library Imports

We start by importing the libraries used for this tutorial, including the [Wallaroo SDK](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/).  This is provided by default when executing this Jupyter Notebook in the Wallaroo JupyterHub service.

In [None]:
import wallaroo
import pyarrow as pa
import pandas as pd
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.object import EntityNotFoundError

### Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client.  The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [None]:
wl = wallaroo.Client()

### Define Schemas and Upload Model

The model is uploaded via the Wallaroo SDK method `wallaroo.client.Client.upload_model` which takes the following parameters.

| Parameter | Type | Description |
|---|---|---|
| `name` | `string` (*Required*) | The name of the model.  Model names are unique **per workspace**.  Models that are uploaded with the same name are assigned as a new **version** of the model. |
| `path` | `string` (*Required*) | The path to the model file being uploaded. |
| `framework` |`string` (*Required*) | The framework of the model from `wallaroo.framework.Framework`.  For the Custom framework used in this tutorial, this is `wallaroo.framework.Framework.CUSTOM`; standard vLLM uploads use `wallaroo.framework.Framework.VLLM`. |
| `input_schema` | `pyarrow.lib.Schema` <ul><li>Native Wallaroo Runtimes: (*Optional*)</li><li>Non-Native Wallaroo Runtimes: (*Required*)</li></ul> | The input schema in Apache Arrow schema format. |
| `output_schema` | `pyarrow.lib.Schema` <ul><li>Native Wallaroo Runtimes: (*Optional*)</li><li>Non-Native Wallaroo Runtimes: (*Required*)</li></ul> | The output schema in Apache Arrow schema format. |
| `framework_config` | `wallaroo.framework.CustomConfig` (*Optional*) | Sets the vLLM configuration options based on the [Framework Configuration Parameters](#framework-configuration-parameters).  If no options are specified, the default values are applied. |
| `convert_wait` | `bool` (*Optional*) | <ul><li>**True**: Waits in the script for the model conversion completion.</li><li>**False**:  Proceeds with the script without waiting for the model conversion process to display complete.</li></ul> |

`wallaroo.framework.CustomConfig` contains the following parameters.

| Parameters | Type |
|---|---|
| **max_num_seqs** | *Integer* (*Default: 256*) |
| **max_model_len** | *Integer* (*Default: None*) |
| **max_seq_len_to_capture** | *Integer* (*Default: 8192*) |
| **quantization** | (*Default: None*)  |
| **kv_cache_dtype** | (*Default: 'auto'*) |
| **gpu_memory_utilization** | *Float* (*Default: 0.9*) |
| **block_size** | (*Default: None*) |
| **device_group** | (*Default: None*)  This setting is ignored for CUDA acceleration. |

#### Define Input and Output Schemas

The input and output schemas are defined in Apache Arrow schema format using `pyarrow`.


In [4]:
input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64())
])
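Because inference data must match the input schema, it can help to check request rows before submitting them. The following is a minimal standard-library sketch; `INPUT_FIELDS` and `validate_request` are hypothetical helpers that mirror the two input fields above and are not part of the Wallaroo SDK.

```python
# Mirrors the tutorial's input schema: field name -> expected Python type.
INPUT_FIELDS = {"prompt": str, "max_tokens": int}


def validate_request(row: dict) -> list:
    """Return a list of problems; an empty list means the row matches."""
    problems = []
    for name, expected in INPUT_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(row[name], bool) or not isinstance(row[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(row[name]).__name__}"
            )
    for name in row:
        if name not in INPUT_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems


print(validate_request({"prompt": "What is Wallaroo.AI?", "max_tokens": 200}))
# []
print(validate_request({"prompt": "What is Wallaroo.AI?", "max_tokens": "200"}))
# ['max_tokens: expected int, got str']
```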

#### Define CustomConfig

We define the `wallaroo.framework.CustomConfig` object, which accepts the parameters listed in the `wallaroo.framework.CustomConfig` table above.

For this example, the `CustomConfig` parameters are set as follows:

* `gpu_memory_utilization=0.9`
* `max_model_len=128`

All other parameters use their default values.

In [None]:
from wallaroo.framework import CustomConfig

custom_vllm_config = CustomConfig(
        gpu_memory_utilization=0.9, 
        max_model_len=128
    )
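To get a feel for how `max_model_len` and `max_num_seqs` interact with GPU memory, the following back-of-the-envelope sketch estimates a worst-case KV cache size. It assumes the published Llama 3 8B architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; the uploaded artifact may differ, and vLLM allocates its cache in blocks from whatever memory remains inside the `gpu_memory_utilization` budget, so treat this as rough arithmetic and not as the engine's actual allocation.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies (2x for the key and value tensors)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama 3 8B values; adjust for the model actually deployed.
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2
)

max_num_seqs = 256   # CustomConfig default
max_model_len = 128  # value set in this tutorial

# Upper bound: every sequence slot filled to the maximum length.
worst_case_bytes = per_token * max_num_seqs * max_model_len

print(per_token)                 # 131072 bytes, i.e. 128 KiB per token
print(worst_case_bytes / 2**30)  # 4.0 GiB
```

Under these assumptions, raising `max_model_len` scales the worst case linearly, which is one reason to keep it only as large as the workload needs.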

#### Upload model via the Wallaroo SDK

With our values set, we upload the model with the `wallaroo.client.Client.upload_model` method with the following parameters:

* Model name and path to the Llama V3 Instruct LLM.
* `framework_config` set to our defined `CustomConfig`.
* Input and output schemas.
* `accel` set to `wallaroo.engine_config.Acceleration.CUDA`.

In [None]:
model = wl.upload_model(
    "byop-vllm-tinyllama-ynsv5", 
    "./byop_tinyllama_vllm_v4.zip",
    framework=Framework.CUSTOM,
    framework_config=custom_vllm_config,
    input_schema=input_schema, 
    output_schema=output_schema,
    accel=Acceleration.CUDA
)
model

Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime.
.............................successful

Ready


| Field | Value |
|---|---|
| Name | byop-vllm-tinyllama-ynsv5 |
| Version | 4b40ba86-8af1-4945-bde6-137245d5e618 |
| File Name | byop_tinyllama_vllm_v4.zip |
| SHA | 5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132 |
| Architecture | x86 |
| Acceleration | cuda |
| Updated At | 2025-08-May 18:22:35 |
| Workspace id | 60 |


### Set Continuous Batching Configuration

The model configuration is set either during model upload or post model upload.  We define the continuous batching configuration with the maximum concurrent batch size (`max_concurrent_batch_size`) set to 256, then apply it to the model configuration.

When applying a continuous batch configuration to a model configuration, the input and output schemas **must** be included.

In [None]:
# Define continuous batching for Async vLLM (you can choose the number of connections you want)
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 256)

In [8]:
batch = model.configure(
    input_schema = input_schema,
    output_schema = output_schema,
    continuous_batching_config = cbc
)
batch

| Field | Value |
|---|---|
| Name | byop-vllm-tinyllama-ynsv5 |
| Version | 4b40ba86-8af1-4945-bde6-137245d5e618 |
| File Name | byop_tinyllama_vllm_v4.zip |
| SHA | 5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132 |
| Architecture | x86 |
| Acceleration | cuda |
| Updated At | 2025-08-May 18:22:35 |
| Workspace id | 60 |


### Deploy Model

Models are deployed in Wallaroo via **Wallaroo Pipelines** through the following process.

* (Optional): Create a **deployment configuration**.  If no deployment configuration is specified, then the default values are used.  For our deployment, the LLM is assigned the following resources:
  * 1 cpu
  * 10 Gi RAM
  * 1 gpu from the nodepool `"wallaroo.ai/accelerator:t4-shared"`.  Wallaroo deployments and pipelines inherit the acceleration settings from the model, so this will be `CUDA`.
* Create the Wallaroo pipeline.
* Assign the model as a **pipeline step** to process incoming data and return the inference results.
* Deploy the pipeline with the deployment configuration.

#### Define the Deployment Configuration

The deployment configuration allocates resources for the LLM's exclusive use.  These resources are used by the LLM until the pipeline is **undeployed** and the resources returned.

In [15]:
deployment_config = DeploymentConfigBuilder() \
    .cpus(1.).memory('1Gi') \
    .sidekick_cpus(batch, 1.) \
    .sidekick_memory(batch, '10Gi') \
    .sidekick_gpus(batch, 1) \
    .deployment_label("wallaroo.ai/accelerator:t4-shared") \
    .build()
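Before deploying, it can be useful to total what the deployment configuration requests so the target nodepool can be checked for capacity. The sketch below uses only the standard library; `parse_memory` and the `requests` dictionary are hypothetical helpers that restate the values from the configuration above, assuming `cpus`/`memory` apply to the Wallaroo engine and the `sidekick_*` settings apply to the LLM container.

```python
# Binary suffixes used in Kubernetes-style memory quantities.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '10Gi' to bytes; bare numbers are bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


# Values restated from the deployment configuration in this tutorial.
requests = {
    "engine": {"cpus": 1.0, "memory": "1Gi", "gpus": 0},
    "llm_sidekick": {"cpus": 1.0, "memory": "10Gi", "gpus": 1},
}

total_cpus = sum(r["cpus"] for r in requests.values())
total_memory_gib = sum(parse_memory(r["memory"]) for r in requests.values()) / 2**30
total_gpus = sum(r["gpus"] for r in requests.values())

print(total_cpus, total_memory_gib, total_gpus)  # 2.0 11.0 1
```

These totals cover a single replica and do not include any overhead from other Wallaroo services.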

#### Deploy vLLM

Next, we deploy the model by creating the pipeline, adding the vLLM as the pipeline step, and deploying the pipeline with the deployment configuration.

Once complete, the model is ready to accept inference requests.

In [None]:
pipeline = wl.build_pipeline("byop-tinyllama-cutom-vllm")
pipeline.undeploy()
pipeline.clear()

pipeline.add_model_step(batch)
pipeline.deploy(deployment_config=deployment_config)

In [17]:
pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.7.8',
   'name': 'engine-65bc55d64f-mdrnh',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'byop-tinyllama-cutom-vllm',
      'status': 'Running',
      'version': '95a07681-e434-4108-8e9c-01c052b7b5ec'}]},
   'model_statuses': {'models': [{'model_version_id': 434,
      'name': 'byop-vllm-tinyllama-ynsv5',
      'sha': '5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e',
      'status': 'Running',
      'version': '4b40ba86-8af1-4945-bde6-137245d5e618'}]}}],
 'engine_lbs': [{'ip': '10.4.1.15',
   'name': 'engine-lb-5cf49f9d5f-dkvsz',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.7.9',
   'name': 'engine-sidekick-byop-vllm-tinyllama-ynsv5-434-5cc6f466fc-zqzbk',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
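The status dictionary can also be checked programmatically before sending inference requests. The helper below is a minimal sketch written against the dictionary shape shown above; `all_running` is a hypothetical function, not part of the Wallaroo SDK, and the sample data is a trimmed copy of the output.

```python
def all_running(status: dict) -> bool:
    """True when the pipeline and every listed component report 'Running'."""
    if status.get("status") != "Running":
        return False
    for group in ("engines", "engine_lbs", "sidekicks"):
        for component in status.get(group, []):
            if component.get("status") != "Running":
                return False
    return True


# Trimmed copy of the status output shown above.
sample_status = {
    "status": "Running",
    "engines": [{"name": "engine-65bc55d64f-mdrnh", "status": "Running"}],
    "engine_lbs": [{"name": "engine-lb-5cf49f9d5f-dkvsz", "status": "Running"}],
    "sidekicks": [
        {
            "name": "engine-sidekick-byop-vllm-tinyllama-ynsv5-434-5cc6f466fc-zqzbk",
            "status": "Running",
        }
    ],
}

print(all_running(sample_status))  # True
```

In the notebook, this could be called as `all_running(pipeline.status())`.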

### Inference

Inference requests are submitted to deployed models as either pandas DataFrames or Apache Arrow tables.  The inference data must match the input schemas defined earlier.

Our sample inference request submits a pandas DataFrame with a simple prompt and the `max_tokens` field set to 200.  We receive a pandas DataFrame in return with the outputs labeled as `out.{variable_name}`, with `variable_name` matching the output schemas defined at model upload.

In [18]:
data = pd.DataFrame({"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]})

In [19]:
pipeline.infer(data, timeout=600)

| | time | in.max_tokens | in.prompt | out.generated_text | out.num_output_tokens | anomaly.count |
|---|---|---|---|---|---|---|
| 0 | 2025-05-08 18:41:35.436 | 200 | What is Wallaroo.AI? | \n2.2 How does Wallaroo.AI's Asset Composition... | 200 | 0 |
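Continuous batching only pays off when several requests arrive at the same time. One way to generate such load from a client is a thread pool, sketched below with the standard library. `run_concurrently` is a hypothetical helper and `fake_infer` is a stand-in so the example runs without a deployment; whether a single pipeline object can safely be shared across threads is an assumption to verify for your environment.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(infer_fn, prompts, max_workers=8):
    """Call infer_fn(prompt) for each prompt from a thread pool.

    Results are returned in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(infer_fn, prompts))


def fake_infer(prompt):
    """Stand-in for a real inference call; returns a dict shaped like one result row."""
    return {
        "out.generated_text": prompt.upper(),
        "out.num_output_tokens": len(prompt.split()),
    }


results = run_concurrently(fake_infer, ["what is wallaroo", "hello"])
print(results[0]["out.generated_text"])     # WHAT IS WALLAROO
print(results[1]["out.num_output_tokens"])  # 1
```

Against a deployed pipeline, `infer_fn` could wrap the call used above, for example a function that builds a one-row DataFrame from the prompt and passes it to `pipeline.infer(..., timeout=600)`.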


### Undeploy

With the tutorial complete, the pipeline is undeployed to return its resources to the Wallaroo environment.

In [14]:
pipeline.undeploy()

Waiting for undeployment - this will take up to 45s ..................................... ok


| Field | Value |
|---|---|
| name | byop-tinyllama-demo-yns-cudafix |
| created | 2025-05-08 18:23:23.012161+00:00 |
| last_updated | 2025-05-08 18:23:23.094326+00:00 |
| deployed | False |
| workspace_id | 60 |
| workspace_name | younes.amar@wallaroo.ai - Default Workspace |
| arch | x86 |
| accel | cuda |
| tags | |
| versions | 2ae66497-d235-44b5-8be5-52a6b83cf945, 2c8d7c28-1702-4e6a-9805-c8f5b918ab36 |

