This tutorial is available on the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/blob/wallaroo2025.1_tutorials/wallaroo-run-anywhere/inference-on-any-architecture/cuda/wallaroo-gpu-llm-summarization).

## LLM Summarization GPU Edge Deployment on CUDA

This tutorial demonstrates how to use Wallaroo with GPU processors to perform inferences with a pre-trained Hugging Face LLM summarization model.  This demonstration assumes that:

* A Wallaroo instance, version 2023.3 or above, is installed.
* A nodepool with GPUs is part of the Kubernetes cluster.  See [Create GPU Nodepools for Kubernetes Clusters](https://docs.wallaroo.ai/wallaroo-operations-guide/wallaroo-install-guides/wallaroo-install-configurations/wallaroo-gpu-nodepools/) for more details.
* The model [`hf-summarization-bart-large-samsun.zip` (1.4 G)](https://storage.googleapis.com/wallaroo-public-data/llm-models/model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip) has been downloaded to the `./models` folder.

### Tutorial Goals

For our example, we will perform the following:

* Create a workspace for our work.
* Upload the model.
* Create a pipeline and specify the GPUs in the pipeline deployment configuration.
* Perform a sample inference.


## Steps

### Import Libraries

The first step will be to import our libraries.

In [4]:
import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework

import pyarrow as pa
import numpy as np
import pandas as pd

### Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client.  The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [11]:
wl = wallaroo.Client()

### Configure PyArrow Schema

You can find more info on the available inputs under [TextSummarizationInputs](https://github.com/WallarooLabs/platform/blob/main/conductor/model-auto-conversion/flavors/hugging-face/src/io/pipeline_inputs/text_summarization_inputs.py#L14) or under the [official source code](https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/pipelines/text2text_generation.py#L241) from `🤗 Hugging Face`.

In [12]:
input_schema = pa.schema([
    pa.field('inputs', pa.string()),
    pa.field('return_text', pa.bool_()),
    pa.field('return_tensors', pa.bool_()),
    pa.field('clean_up_tokenization_spaces', pa.bool_()),
    # pa.field('generate_kwargs', pa.map_(pa.string(), pa.null())), # dictionaries are not currently supported by the engine
])

output_schema = pa.schema([
    pa.field('summary_text', pa.string()),
])
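
Before sending data to the pipeline, it can help to check locally that each record carries the fields declared in `input_schema`.  The following is a minimal plain-Python sketch of that idea.  `EXPECTED_INPUT_FIELDS` and `check_record` are hypothetical names introduced for illustration; they are not part of the Wallaroo SDK or PyArrow, and the type mapping simply mirrors the schema above.

```python
# Hypothetical local check that mirrors the fields declared in input_schema.
# This is an illustration only; it is not part of the Wallaroo SDK or PyArrow.
EXPECTED_INPUT_FIELDS = {
    "inputs": str,
    "return_text": bool,
    "return_tensors": bool,
    "clean_up_tokenization_spaces": bool,
}


def check_record(record):
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_INPUT_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    for name in record:
        if name not in EXPECTED_INPUT_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems


sample_record = {
    "inputs": "Text to summarize.",
    "return_text": True,
    "return_tensors": False,
    "clean_up_tokenization_spaces": False,
}
print(check_record(sample_record))  # prints []
```

A non-empty result lists each missing, mistyped, or unexpected field, which is usually easier to read than a schema error returned by a remote inference call.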

### Upload Model

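We will now upload the model to the Wallaroo instance.  Because `convert_wait=False` is set, the upload call returns without waiting for the model conversion to finish, so the model's status is polled afterwards until it reports `ready` or `error`.

The model's AI Accelerator is set during the model upload process.  For this model, that is set to `CUDA`.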

In [13]:
model = wl.upload_model('hf-summarization', 
                        './models/hf_summarization.zip', 
                        framework=Framework.HUGGING_FACE_SUMMARIZATION, 
                        input_schema=input_schema, 
                        output_schema=output_schema,
                        accel=wallaroo.engine_config.Acceleration.CUDA,
                        convert_wait=False)
model

| Field | Value |
|---|---|
| Name | hf-summarization |
| Version | d206bb27-50cf-4beb-8bab-4e65747bc423 |
| File Name | hf_summarization.zip |
| SHA | ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268 |
| Status | pending_load_container |
| Image Path | |
| Architecture | x86 |
| Acceleration | cuda |
| Updated At | 2025-14-Jul 17:06:37 |
| Workspace id | 108 |


In [15]:
import time
while model.status() != "ready" and model.status() != "error":
    print(model.status())
    time.sleep(10)
print(model.status())

attempting_load_container
attempting_load_container
attempting_load_container
attempting_load_container
attempting_load_container
attempting_load_container
attempting_load_container
attempting_load_container
attempting_load_container
ready
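
The polling loop above runs until the model reports `ready` or `error`, with no upper bound on how long it waits.  One possible variation, sketched below, adds a timeout.  `wait_for_status` is a hypothetical helper written for this example, not a Wallaroo SDK function; it accepts any zero-argument callable that returns a status string, and the 30 minute default timeout is an arbitrary assumption.

```python
import time


def wait_for_status(get_status, done_states=("ready", "error"),
                    interval=10, timeout=1800, sleep=time.sleep):
    """Poll get_status() until it returns a value in done_states.

    Raises TimeoutError once roughly `timeout` seconds have been spent sleeping.
    The `sleep` parameter exists only so the wait can be replaced in tests.
    """
    waited = 0
    status = get_status()
    while status not in done_states:
        if waited >= timeout:
            raise TimeoutError(f"status still '{status}' after {waited} seconds")
        print(status)
        sleep(interval)
        waited += interval
        status = get_status()
    return status


# Example usage with the model uploaded above (not executed here):
# final_status = wait_for_status(model.status)
```

The same helper could be reused for the pipeline deployment later in the tutorial by passing a callable that returns `pipeline.status()['status']` and setting `done_states=("Running",)`.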


### Deploy Pipeline

With the model uploaded, we can add it as a step in the pipeline, then deploy it.

For GPU deployment, the pipeline deployment configuration allocates the CPUs, RAM, GPUs, and other settings for the pipeline.  For GPUs, both the number of GPUs and the nodepool containing the GPUs must be specified.

For Wallaroo Native Runtime models (`onnx`, `tensorflow`), the `DeploymentConfigBuilder.gpus(int)` method allocates the number of GPUs to the pipeline.  This applies to all Wallaroo Native Runtime models in the pipeline.

For Wallaroo Containerized models (`hugging-face`, etc), the `DeploymentConfigBuilder.sidekick_gpus(model, int)` method allocates the number of GPUs to the specified model.

The deployment label is set with the `DeploymentConfigBuilder.deployment_label(string)` method.

For more information on allocating resources to a Wallaroo pipeline for deployment, see [Wallaroo SDK Essentials Guide: Pipeline Deployment Configuration](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-pipelines/wallaroo-sdk-essentials-pipeline-deployment-config/).

For this example, 1 GPU will be allocated to the containerized model from the nodepool with the deployment label `wallaroo.ai/accelerator: a100`.

Note that the accelerator setting for the deployment configuration is not specified; this is inherited from the model's `accel` parameter.

In [16]:
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('1Gi') \
    .sidekick_gpus(model, 1) \
    .sidekick_cpus(model,4) \
    .sidekick_memory(model, '8Gi') \
    .deployment_label('wallaroo.ai/accelerator: a100') \
    .build()

In [17]:
deployment_config

{'engine': {'cpu': 1,
  'resources': {'limits': {'cpu': 1, 'memory': '1Gi'},
   'requests': {'cpu': 1, 'memory': '1Gi'}},
  'node_selector': 'wallaroo.ai/accelerator: a100'},
 'enginelb': {},
 'engineAux': {'images': {'hf-summarization-828': {'resources': {'limits': {'nvidia.com/gpu': 1,
      'cpu': 4,
      'memory': '8Gi'},
     'requests': {'nvidia.com/gpu': 1, 'cpu': 4, 'memory': '8Gi'}},
    'node_selector': 'wallaroo.ai/accelerator: a100'}}}}

In [18]:
pipeline_name = "hf-summarization-pipeline"
pipeline = wl.build_pipeline(pipeline_name)
pipeline.add_model_step(model)

| Field | Value |
|---|---|
| name | hf-summarization-pipeline |
| created | 2025-07-14 17:14:28.980544+00:00 |
| last_updated | 2025-07-14 17:14:28.980544+00:00 |
| deployed | (none) |
| workspace_id | 108 |
| workspace_name | john.hummel@wallaroo.ai - Default Workspace |
| arch | |
| accel | |
| tags | |
| versions | 07f4e9c6-e1e4-4a68-8c5e-82841f45d9f9 |


In [19]:
pub = pipeline.publish(deployment_config)

Waiting for pipeline publish... It may take up to 600 sec.
.................................. Published.


In [20]:
display(pub)

| Field | Value |
|---|---|
| ID | 99 |
| Pipeline Name | hf-summarization-pipeline |
| Pipeline Version | 782e26c5-a88c-49d0-b2ec-8d35ab363a3e |
| Status | Published |
| Workspace Id | 108 |
| Workspace Name | john.hummel@wallaroo.ai - Default Workspace |
| Edges | |
| Engine URL | us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-6245 |
| Pipeline URL | us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/pipelines/hf-summarization-pipeline:782e26c5-a88c-49d0-b2ec-8d35ab363a3e |
| Helm Chart URL | oci://us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/charts/hf-summarization-pipeline |

Docker run command:

docker run \
  -p $EDGE_PORT:8080 \
  -e OCI_USERNAME=$OCI_USERNAME \
  -e OCI_PASSWORD=$OCI_PASSWORD \
  -e PIPELINE_URL=us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/pipelines/hf-summarization-pipeline:782e26c5-a88c-49d0-b2ec-8d35ab363a3e \
  -e CONFIG_CPUS=1.0 --gpus all --cpus=5.0 --memory=9g \
  us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-6245

Podman run command:

podman run \
  -p $EDGE_PORT:8080 \
  -e OCI_USERNAME=$OCI_USERNAME \
  -e OCI_PASSWORD=$OCI_PASSWORD \
  -e PIPELINE_URL=us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/pipelines/hf-summarization-pipeline:782e26c5-a88c-49d0-b2ec-8d35ab363a3e \
  -e CONFIG_CPUS=1.0 --device nvidia.com/gpu=all --cpus=5.0 --memory=9g \
  us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-6245

Helm install command:

helm install --atomic $HELM_INSTALL_NAME \
  oci://us-central1-docker.pkg.dev/wallaroo-dev-253816/uat/charts/hf-summarization-pipeline \
  --namespace $HELM_INSTALL_NAMESPACE \
  --version 0.0.1-782e26c5-a88c-49d0-b2ec-8d35ab363a3e \
  --set ociRegistry.username=$OCI_USERNAME \
  --set ociRegistry.password=$OCI_PASSWORD
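
The published `docker run` and `podman run` commands pass `--cpus=5.0 --memory=9g`.  These values appear to correspond to the engine allocation (1 CPU, `1Gi`) plus the containerized model's allocation (4 CPUs, `8Gi`) from the deployment configuration.  The sketch below reproduces that arithmetic from an abbreviated copy of the configuration shown earlier.  `parse_gi` and `total_resources` are hypothetical helpers written for this example, and only `Gi` memory values are handled.

```python
def parse_gi(value):
    """Convert a memory string such as '8Gi' to a number of gibibytes."""
    if not value.endswith("Gi"):
        raise ValueError(f"only 'Gi' values are handled in this sketch: {value}")
    return float(value[:-2])


def total_resources(config):
    """Sum CPU, memory, and GPU limits across the engine and each containerized model."""
    limits = [config["engine"]["resources"]["limits"]]
    for image in config["engineAux"]["images"].values():
        limits.append(image["resources"]["limits"])
    return {
        "cpus": sum(float(item["cpu"]) for item in limits),
        "memory_gi": sum(parse_gi(item["memory"]) for item in limits),
        "gpus": sum(item.get("nvidia.com/gpu", 0) for item in limits),
    }


# Abbreviated copy of the deployment configuration displayed earlier.
example_config = {
    "engine": {"resources": {"limits": {"cpu": 1, "memory": "1Gi"}}},
    "engineAux": {"images": {"hf-summarization-828": {"resources": {
        "limits": {"nvidia.com/gpu": 1, "cpu": 4, "memory": "8Gi"}}}}},
}
print(total_resources(example_config))
# prints {'cpus': 5.0, 'memory_gi': 9.0, 'gpus': 1}
```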


In [21]:
pipeline.deploy(deployment_config=deployment_config, 
               wait_for_status=False)

Deployment initiated for hf-summarization-pipeline. Please check pipeline status.


| Field | Value |
|---|---|
| name | hf-summarization-pipeline |
| created | 2025-07-14 17:14:28.980544+00:00 |
| last_updated | 2025-07-14 17:22:07.234072+00:00 |
| deployed | True |
| workspace_id | 108 |
| workspace_name | john.hummel@wallaroo.ai - Default Workspace |
| arch | x86 |
| accel | cuda |
| tags | |
| versions | 2234dc3f-014f-4608-b3b4-b9e8c787c477, 782e26c5-a88c-49d0-b2ec-8d35ab363a3e, 07f4e9c6-e1e4-4a68-8c5e-82841f45d9f9 |


In [27]:
time.sleep(30)

# Poll every 30 seconds until the pipeline reports 'Running'.
while pipeline.status()['status'] != 'Running':
    time.sleep(30)
    print("Waiting for deployment.")
pipeline.status()['status']


Waiting for deployment.
Waiting for deployment.
Waiting for deployment.
Waiting for deployment.
Waiting for deployment.
Waiting for deployment.


'Running'

In [28]:
pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.2.3',
   'name': 'engine-5467cf7b9f-qrbfk',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'hf-summarization-pipeline',
      'status': 'Running',
      'version': '2234dc3f-014f-4608-b3b4-b9e8c787c477'}]},
   'model_statuses': {'models': [{'model_version_id': 828,
      'name': 'hf-summarization',
      'sha': 'ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268',
      'status': 'Running',
      'version': 'd206bb27-50cf-4beb-8bab-4e65747bc423'}]}}],
 'engine_lbs': [{'ip': '10.4.11.14',
   'name': 'engine-lb-648945b8b4-mdnxx',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.2.7',
   'name': 'engine-sidekick-hf-summarization-828-5d64cbbfdd-5nks7',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
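
The status dictionary lists the engine, the engine load balancer, and the containerized model's sidekick separately, and each has its own `status` field.  As a rough sketch, the hypothetical helper `not_running` below (not a Wallaroo SDK function) walks a dictionary of that shape and reports any component that is not yet `Running`.  The sample is an abbreviated copy of the output above.

```python
def not_running(status):
    """List the names of deployment components whose status is not 'Running'."""
    pending = []
    for group in ("engines", "engine_lbs", "sidekicks"):
        for component in status.get(group, []):
            if component.get("status") != "Running":
                pending.append(component.get("name"))
    return pending


# Abbreviated copy of the pipeline.status() output shown above.
sample_status = {
    "status": "Running",
    "engines": [{"name": "engine-5467cf7b9f-qrbfk", "status": "Running"}],
    "engine_lbs": [{"name": "engine-lb-648945b8b4-mdnxx", "status": "Running"}],
    "sidekicks": [{"name": "engine-sidekick-hf-summarization-828-5d64cbbfdd-5nks7",
                   "status": "Running"}],
}
print(not_running(sample_status))  # prints []
```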

### Run inference

We can now run a sample inference using the `wallaroo.pipeline.infer` method and display the results.

In [29]:
input_data = {
        "inputs": ["LinkedIn (/lɪŋktˈɪn/) is a business and employment-focused social media platform that works through websites and mobile apps. It launched on May 5, 2003. It is now owned by Microsoft. The platform is primarily used for professional networking and career development, and allows jobseekers to post their CVs and employers to post jobs. From 2015 most of the company's revenue came from selling access to information about its members to recruiters and sales professionals. Since December 2016, it has been a wholly owned subsidiary of Microsoft. As of March 2023, LinkedIn has more than 900 million registered members from over 200 countries and territories. LinkedIn allows members (both workers and employers) to create profiles and connect with each other in an online social network which may represent real-world professional relationships. Members can invite anyone (whether an existing member or not) to become a connection. LinkedIn can also be used to organize offline events, join groups, write articles, publish job postings, post photos and videos, and more"], # required
        "return_text": [True], # optional: using the defaults, similar to not passing this parameter
        "return_tensors": [False], # optional: using the defaults, similar to not passing this parameter
        "clean_up_tokenization_spaces": [False], # optional: using the defaults, similar to not passing this parameter
}
dataframe = pd.DataFrame(input_data)
dataframe

| | inputs | return_text | return_tensors | clean_up_tokenization_spaces |
|---|---|---|---|---|
| 0 | LinkedIn (/lɪŋktˈɪn/) is a business and employ... | True | False | False |


In [30]:
out = pipeline.infer(dataframe)
out

| | time | in.clean_up_tokenization_spaces | in.inputs | in.return_tensors | in.return_text | out.summary_text | anomaly.count |
|---|---|---|---|---|---|---|---|
| 0 | 2025-07-14 17:29:14.893 | False | LinkedIn (/lɪŋktˈɪn/) is a business and employ... | False | True | LinkedIn is a business and employment-focused ... | 0 |


In [31]:
out["out.summary_text"][0]

'LinkedIn is a business and employment-focused social media platform that works through websites and mobile apps. It launched on May 5, 2003. LinkedIn allows members (both workers and employers) to create profiles and connect with each other in an online social network which may represent real-world professional relationships.'

### Model Inferencing with Pipeline Deployment Endpoint

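Alternatively, inference requests can be submitted to the pipeline's inference endpoint with an HTTP `POST`.  Here the input DataFrame is saved as pandas records JSON and sent with `curl`.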

In [36]:
dataframe.to_json("./data/sample_request.json", orient="records")

In [37]:
!curl -X POST {pipeline.url()} \
    -H "Content-Type: application/json; format=pandas-records" \
    -d @./data/sample_request.json

[{"time":1752514324749,"in":{"clean_up_tokenization_spaces":false,"inputs":"LinkedIn (/lɪŋktˈɪn/) is a business and employment-focused social media platform that works through websites and mobile apps. It launched on May 5, 2003. It is now owned by Microsoft. The platform is primarily used for professional networking and career development, and allows jobseekers to post their CVs and employers to post jobs. From 2015 most of the company's revenue came from selling access to information about its members to recruiters and sales professionals. Since December 2016, it has been a wholly owned subsidiary of Microsoft. As of March 2023, LinkedIn has more than 900 million registered members from over 200 countries and territories. LinkedIn allows members (both workers and employers) to create profiles and connect with each other in an online social network which may represent real-world professional relationships. Members can invite anyone (whether an existing member or not) to become a conne
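
The same request can also be prepared from Python.  The sketch below uses only the standard library and mirrors the `curl` call above: an HTTP `POST` with the `Content-Type: application/json; format=pandas-records` header and a JSON list of records as the body.  `build_inference_request` and its `extra_headers` parameter are hypothetical names introduced here; `extra_headers` is a placeholder for anything your environment may require, such as an authorization header, which is an assumption and not shown in the original command.

```python
import json
import urllib.request


def build_inference_request(url, records, extra_headers=None):
    """Build (but do not send) a POST request carrying pandas-records JSON."""
    headers = {"Content-Type": "application/json; format=pandas-records"}
    headers.update(extra_headers or {})
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Example usage (not executed here); pipeline.url() and dataframe come from
# the cells above:
# request = build_inference_request(pipeline.url(),
#                                   dataframe.to_dict(orient="records"))
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

Building the request separately from sending it makes the headers and body easy to inspect before any network call is made.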

### Undeploy the Pipeline

With the demonstration complete, we can undeploy the pipeline and return its resources to the Wallaroo instance.

In [38]:
pipeline.undeploy()

Waiting for undeployment - this will take up to 45s .................................... ok


| Field | Value |
|---|---|
| name | hf-summarization-pipeline |
| created | 2025-07-14 17:14:28.980544+00:00 |
| last_updated | 2025-07-14 17:22:07.234072+00:00 |
| deployed | False |
| workspace_id | 108 |
| workspace_name | john.hummel@wallaroo.ai - Default Workspace |
| arch | x86 |
| accel | cuda |
| tags | |
| versions | 2234dc3f-014f-4608-b3b4-b9e8c787c477, 782e26c5-a88c-49d0-b2ec-8d35ab363a3e, 07f4e9c6-e1e4-4a68-8c5e-82841f45d9f9 |
