This tutorial and the assets can be downloaded as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/blob/wallaroo2024.4_tutorials/wallaroo-llms/llamacpp-with-safeguards).

The following tutorial demonstrates the Llama 3 70b Instruct Q5 Large Language Model (LLM) with a Harmful Language Listener (HLL). The HLL provides validation monitoring that detects language that could be considered harmful: obscene, racist, insulting, or harmful by other criteria.

This tutorial demonstrates how to:

* Upload the LLM and HLL.
* Create a Wallaroo pipeline and set the LLM and then the HLL as pipeline steps.
* Deploy the models and perform sample inferences.

For access to these sample models and for a demonstration of how to use an LLM Validation Listener:

* Contact your Wallaroo Support Representative **OR**
* [Schedule Your Wallaroo.AI Demo Today](https://wallaroo.ai/request-a-demo/)

## Model Overview

The LLM used in this demonstration has the following attributes:

* Framework: `vllm` for more optimized model deployment, uploaded to Wallaroo in the [Wallaroo Arbitrary Python aka Bring Your Own Predict (BYOP) Framework](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-arbitrary-python/).
* Artifacts: The original model is the Hugging Face model [Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
* Input/Output Types:  Both the input and outputs are text.

The HLL used in this demonstration has the following attributes:

* Framework: `vllm` for more optimized model deployment, uploaded to Wallaroo in the [Wallaroo Arbitrary Python aka Bring Your Own Predict (BYOP) Framework](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-arbitrary-python/).
* Artifacts:  The HLL model is encapsulated as part of the BYOP framework.
* Input/Output Types:  The HLL takes the following inputs and outputs.
  * HLL Input:
    * `text` (*String*): The original input text to the LLM.
    * `generated_text` (*String*): The text created by the LLM.  This will be evaluated by the HLL for any harmful language.
  * HLL Output:
    * `harmful` (*Boolean*): Determines if the `generated_text` is harmful.
    * `reasoning` (*String*): The reasons why the `generated_text` is considered harmful or not.
    * `confidence` (*Float*): The confidence the model has in its determination of whether the `generated_text` is harmful or not.
    * `generated_text` (*String*): The text generated by the LLM.  This is passed on as part of the HLL's output.
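
The HLL output fields listed above can be pictured as a simple record. The following is a minimal sketch in plain Python that checks whether a record carries the four documented fields with the documented types. The helper name `validate_hll_output` and the sample values are hypothetical and are not part of the Wallaroo SDK or of the model's real output.

```python
# Expected HLL output fields and their Python types, taken from the list above.
HLL_OUTPUT_FIELDS = {
    "harmful": bool,
    "reasoning": str,
    "confidence": float,
    "generated_text": str,
}


def validate_hll_output(record: dict) -> list:
    """Hypothetical helper: return a list of problems, empty if the record is well formed."""
    problems = []
    for name, expected_type in HLL_OUTPUT_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    return problems


# Illustrative values only, not real model output.
sample = {
    "harmful": False,
    "reasoning": "Example reasoning.",
    "confidence": 0.95,
    "generated_text": "Example generated text.",
}
problems = validate_hll_output(sample)
print(problems)
```

A check like this could be applied to each row of an inference result before the HLL verdict is used by downstream code.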

## Tutorial Steps

### Import Libraries

We start by importing the required libraries.  This includes the following:

* [Wallaroo SDK](https://pypi.org/project/wallaroo/):  Used to upload and deploy the model in Wallaroo.
* [pyarrow](https://pypi.org/project/pyarrow/):  Used to define the input and output schemas of the models uploaded to Wallaroo.
* [pandas](https://pypi.org/project/pandas/):  Data is submitted to models deployed in Wallaroo as either Apache Arrow Table format or pandas Record Format as a DataFrame.

In [1]:
import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
from wallaroo.dynamic_batching_config import DynamicBatchingConfig

import pyarrow as pa
import numpy as np
import pandas as pd

### Connect to the Wallaroo Instance

A connection to Wallaroo is established through the Wallaroo client.  The Python library is included in the Wallaroo install and is available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [17]:
wl = wallaroo.Client()

### Upload the LLM

To upload the LLM and HLL, we use the `wallaroo.client.Client.upload_model` method, which takes the following parameters:

* The name to assign to the LLM.
* The file path to upload the LLM.
* The Framework set to `wallaroo.framework.Framework.CUSTOM` for our Hugging Face model encapsulated in the BYOP framework.
* The input and output schemas.

For more information, see the Wallaroo [Model Upload](https://docs.wallaroo.ai/wallaroo-model-operations/wallaroo-model-operations-deploy/wallaroo-model-operations-upload-register/) guide.

First we'll set the input and output schemas for our LLM in Apache Arrow schema format using `pyarrow`.

In [19]:
input_schema = pa.schema([
    pa.field("text", pa.string())
])

output_schema = pa.schema([
    pa.field("text", pa.string()),
    pa.field("generated_text", pa.string())
])

Then issue the upload command.  For this example, we'll add a **model configuration** to specify [Dynamic Batching for LLMs](https://docs.wallaroo.ai/wallaroo-llm/wallaroo-llm-optimizations/wallaroo-llm-optimizations-dynamic-batching/) which improves the performance of LLMs.

In [20]:
llm = wl.upload_model('llama-cpp-sdk-safeguards', 
    'byop_llamacpp_safeguards.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
).configure(input_schema=input_schema,
            output_schema=output_schema,
            dynamic_batching_config=DynamicBatchingConfig(max_batch_delay_ms=1000, 
                                                          batch_size_target=8)
            )
llm

Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime............successful

Ready


| Field | Value |
|---|---|
| Name | llama-cpp-sdk-safeguards |
| Version | c28e8fee-a1e0-48eb-a906-430fe1eba7ac |
| File Name | byop_llamacpp_safeguards.zip |
| SHA | 45752b3566691a641787abd9b1b9d94809f8a74d545283d599e8a2cdc492d110 |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.4.0-5825 |
| Architecture | x86 |
| Acceleration | none |
| Updated At | 2024-12-Dec 15:34:44 |
| Workspace id | 60 |
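
To build intuition for the two dynamic batching parameters used in the upload, here is a conceptual sketch in plain Python. It is not Wallaroo's implementation. It assumes a batch is dispatched either when it reaches `batch_size_target` requests or when a request arrives more than `max_batch_delay_ms` after the first request in the open batch. The function name `group_into_batches` and the arrival times are hypothetical.

```python
from typing import List


def group_into_batches(arrival_ms: List[int],
                       max_batch_delay_ms: int,
                       batch_size_target: int) -> List[List[int]]:
    """Toy model of dynamic batching (assumed semantics, see the text above)."""
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrival_ms):
        # Close the open batch if its delay window expired before this request arrived.
        if current and t - current[0] > max_batch_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it reaches the target size.
        if len(current) >= batch_size_target:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times in milliseconds: two slow trickles, then a burst of 8.
arrivals = [0, 100, 200, 1500, 1600] + [3000 + i for i in range(8)]
batches = group_into_batches(arrivals, max_batch_delay_ms=1000, batch_size_target=8)
print([len(b) for b in batches])
```

Under these assumptions, sparse traffic produces small batches bounded by the delay, while a burst fills a batch to the target size right away.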


Next we upload the HLL using the same process:  define the input and output schemas, and then upload the model.

Note that for the HLL, the inputs are the LLM's **outputs**.  The HLL's outputs include the LLM's `generated_text` field so that it is passed back to the original requester.

In [5]:
#Safeguards Harmful Language Listener
#Define schemas
input_schema = pa.schema([
    pa.field("text", pa.string()),
    pa.field("generated_text", pa.string())
])

output_schema = pa.schema([
    pa.field("harmful", pa.bool_()),
    pa.field("reasoning", pa.string()),
    pa.field("confidence", pa.float32()),
    pa.field("generated_text", pa.string())
])

In [6]:
#upload harmful language listener
listener = wl.upload_model('byop-safeguards-harmful-5', 
    'byop-safeguards-harmful.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
listener

Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime................................successful

Ready


| Field | Value |
|---|---|
| Name | byop-safeguards-harmful-5 |
| Version | 98893de8-6c13-44cf-b098-b4f1f44ff483 |
| File Name | byop-safeguards-harmful.zip |
| SHA | c41ff30b7032262e6ceffed2da658a44d16e698c1e826c3526b6a2379c8d2b1b |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.4.0-5825 |
| Architecture | x86 |
| Acceleration | none |
| Updated At | 2024-12-Dec 14:50:39 |
| Workspace id | 60 |


### Deployment

For our deployment, we deploy both the LLM and HLL in the same pipeline as **pipeline steps**.  Input provided to the pipeline is submitted first to the LLM.  The output from the LLM is then the input to the HLL, and the HLL's output is then provided back to the requester.
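
The step chaining described above can be sketched in plain Python. This is only an illustration of the data flow, assuming stand-in functions in place of the real models. `fake_llm`, `fake_listener`, and `run_steps` are hypothetical names, and the keyword check is a placeholder, not how the HLL evaluates text.

```python
def fake_llm(record: dict) -> dict:
    """Stand-in for the LLM step: returns the prompt plus generated text."""
    return {
        "text": record["text"],
        "generated_text": f"Response to: {record['text']}",
    }


def fake_listener(record: dict) -> dict:
    """Stand-in for the HLL step: flags text containing a placeholder keyword."""
    harmful = "badword" in record["generated_text"].lower()
    return {
        "harmful": harmful,
        "reasoning": "placeholder keyword found" if harmful else "no placeholder keyword found",
        "confidence": 1.0,
        "generated_text": record["generated_text"],
    }


def run_steps(record: dict, steps: list) -> dict:
    """Feed each step's output into the next step, as pipeline steps do."""
    for step in steps:
        record = step(record)
    return record


final = run_steps({"text": "Describe what Wallaroo.AI is"}, [fake_llm, fake_listener])
print(final)
```

Only the last step's output reaches the requester, which is why the HLL repeats `generated_text` in its own output.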

The deployment configuration sets the resources allocated for the LLM and the HLL with the following options:

* LLM
  * CPUs: 6
  * Memory:  10 Gi
* HLL
  * CPUs: 2
  * Memory:  10 Gi

In [21]:
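# Load the environment variables passed to the listener's container,
# closing the file once it has been read.
with open("credentials.json", "r") as f:
    listener_env = json.load(f)

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 6) \
    .sidekick_memory(llm, '10Gi') \
    .sidekick_cpus(listener, 2) \
    .sidekick_memory(listener, '10Gi') \
    .sidekick_env(listener, listener_env) \
    .build()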
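
As a quick sanity check before deploying, the resources requested in the configuration above can be added up. The sketch below is plain arithmetic, assuming that `.cpus(1).memory('2Gi')` applies to the pipeline engine, that a single replica is deployed, and that no other overhead is counted. The `gi_to_int` helper and the `requests` dictionary are hypothetical.

```python
def gi_to_int(value: str) -> int:
    """Parse a whole-number 'Gi' memory string such as '10Gi' into an integer."""
    if not value.endswith("Gi"):
        raise ValueError(f"expected a value ending in 'Gi', got {value!r}")
    return int(value[:-2])


# Values copied from the deployment configuration above.
requests = {
    "engine": {"cpus": 1, "memory": "2Gi"},
    "llm": {"cpus": 6, "memory": "10Gi"},
    "listener": {"cpus": 2, "memory": "10Gi"},
}

total_cpus = sum(r["cpus"] for r in requests.values())
total_memory_gi = sum(gi_to_int(r["memory"]) for r in requests.values())
print(f"{total_cpus} CPUs, {total_memory_gi}Gi memory")
```

Under these assumptions the cluster needs at least 9 CPUs and 22 Gi of memory available for the deployment to be scheduled.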

The Wallaroo pipeline is created with the `build_pipeline` method.  The LLM and HLL are set as the **pipeline steps**, then deployed with the previously defined deployment configuration.

In [None]:
pipeline = wl.build_pipeline("safeguards-llamacpp-2")
pipeline.add_model_step(llm)
pipeline.add_model_step(listener)
pipeline.deploy(deployment_config=deployment_config, wait_for_status=False)

Once deployed, we'll check on the `status`.  When the `status` is `Running`, we continue to the inference steps.

In [27]:
pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.4.27',
   'name': 'engine-6c578848c9-bhs29',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'safeguards-llamacpp-2',
      'status': 'Running',
      'version': '2b61e016-1e92-4f7a-8efb-f09b29cd126a'}]},
   'model_statuses': {'models': [{'model_version_id': 151,
      'name': 'byop-safeguards-harmful-5',
      'sha': 'c41ff30b7032262e6ceffed2da658a44d16e698c1e826c3526b6a2379c8d2b1b',
      'status': 'Running',
      'version': '98893de8-6c13-44cf-b098-b4f1f44ff483'},
     {'model_version_id': 152,
      'name': 'llama-cpp-sdk-safeguards',
      'sha': '45752b3566691a641787abd9b1b9d94809f8a74d545283d599e8a2cdc492d110',
      'status': 'Running',
      'version': 'c28e8fee-a1e0-48eb-a906-430fe1eba7ac'}]}}],
 'engine_lbs': [{'ip': '10.4.4.26',
   'name': 'engine-lb-6676794678-bbpfm',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip':
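
Because the pipeline was deployed with `wait_for_status=False`, the status may need to be checked more than once. A minimal polling sketch follows. `wait_until_running` is a hypothetical helper, not part of the Wallaroo SDK. It relies only on the top-level `status` key shown in the output above, and it is demonstrated here with a fake status source whose `Starting` value is illustrative.

```python
import time
from typing import Callable


def wait_until_running(get_status: Callable[[], dict],
                       timeout_s: float = 600.0,
                       poll_s: float = 10.0) -> dict:
    """Call get_status() until it reports 'Running' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status.get("status") == "Running":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status still {status.get('status')!r} after {timeout_s}s")
        time.sleep(poll_s)


# Demonstration with a fake status source that is not 'Running' for the first two calls.
_responses = iter([{"status": "Starting"}, {"status": "Starting"}, {"status": "Running"}])
final_status = wait_until_running(lambda: next(_responses), timeout_s=5.0, poll_s=0.01)
print(final_status["status"])
```

With a deployed pipeline, the same helper could be called as `wait_until_running(pipeline.status)`.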

### Inference

For our inference, we submit either a pandas DataFrame or an Apache Arrow table with our text query.  In this case:  `Describe what Wallaroo.AI is`.

Once submitted, we display the `harmful`, `confidence`, and `reasoning` fields from the result.

In [28]:
data = pd.DataFrame({'text': ['Describe what Wallaroo.AI is']})

In [None]:
result = pipeline.infer(data, timeout=10000)

In [36]:
result["out.confidence"][0]

0.95

In [35]:
result["out.harmful"][0]

False

In [34]:
result["out.reasoning"][0]

'This response provides a neutral and informative description of Wallaroo.AI, highlighting its capabilities without perpetuating any biases or stereotypes.'
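
An application receiving these fields would typically decide whether to show the generated text. The following is a minimal sketch of one such policy. The `select_response` function, the `min_confidence` threshold of 0.8, and the placeholder messages are all hypothetical choices and are not part of Wallaroo or the HLL.

```python
def select_response(harmful: bool,
                    confidence: float,
                    generated_text: str,
                    min_confidence: float = 0.8) -> str:
    """Return the generated text only for a confident 'not harmful' verdict."""
    if harmful:
        return "[response withheld: flagged as harmful]"
    if confidence < min_confidence:
        return "[response withheld: low-confidence verdict]"
    return generated_text


# Illustrative values matching the verdict shown above; the text is a placeholder.
reply = select_response(harmful=False, confidence=0.95,
                        generated_text="Example generated text.")
print(reply)
```

With the DataFrame result from this tutorial, the arguments could be taken from `result["out.harmful"][0]`, `result["out.confidence"][0]`, and `result["out.generated_text"][0]`, assuming the `generated_text` output field appears in the result with the same `out.` prefix as the other fields.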

### Undeploy the Models

With the tutorial complete, we undeploy the pipeline and return the resources to the cluster.

In [None]:
pipeline.undeploy()