# BYOC LLM Monitoring: Bring Your Own Container Llama2 Multiple Evaluations Monitoring with SageMaker Model Monitor

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

---

In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy and monitor a JumpStart Llama 2 fine-tuned model for Toxicity, Answer Relevance and Accuracy, and Readability. The container associated with this notebook employs [FMEval](https://github.com/aws/fmeval) for LLM Toxicity evaluation, [LangChain](https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/) for Answer Relevance and Accuracy, and [WhyLabs LangKit](https://whylabs.ai/langkit) for Readability.

To perform inference on these models, you need to pass `custom_attributes='accept_eula=true'` as part of the request header. Doing so confirms that you have read and accept the model's end-user license agreement (EULA). The EULA can be found in the model card description or at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. By default, this notebook sets `custom_attributes='accept_eula=false'`, so all inference requests will fail until you explicitly change this custom attribute.

Note: The `custom_attributes` used to pass the EULA are key/value pairs. The key and value are separated by '=' and pairs are separated by ';'. If the same key is passed more than once, the last value is kept and passed to the script handler (in this case, where it is used for conditional logic). For example, if 'accept_eula=false; accept_eula=true' is passed to the server, then 'accept_eula=true' is kept and passed to the script handler.
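
The last-value-wins rule described above can be illustrated with a few lines of Python. This is only a sketch of the rule as stated in the note, not the parser the endpoint actually runs; the helper name `parse_custom_attributes` is hypothetical.

```python
def parse_custom_attributes(raw: str) -> dict[str, str]:
    """Parse 'key=value;key=value' pairs; a repeated key keeps its last value."""
    attributes: dict[str, str] = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty segments such as a trailing ';'
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"Expected 'key=value', got {pair!r}")
        # Later assignments overwrite earlier ones, so the last value is kept.
        attributes[key.strip()] = value.strip()
    return attributes


parsed = parse_custom_attributes("accept_eula=false; accept_eula=true")
print(parsed)  # {'accept_eula': 'true'}
```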



# Background

SageMaker Model Monitor lets users supply their own custom-built container images to run at each monitoring job. This notebook uses the [BYOC](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-byoc-containers.html) feature to monitor the Llama2-7b model for Toxicity (scored in 7 categories), Readability, and Answer Relevance and Accuracy.

# Prerequisites
- **IF RUNNING LOCALLY (not SageMaker Studio/Classic)**: An IAM role with the AmazonSageMakerFullAccess policy attached. This role must also include the AmazonEC2ContainerRegistryFullAccess permission in order to push the container image to ECR and the CloudWatchFullAccess permission to create CloudWatch Dashboards. By default, the SageMaker Execution Role associated with SageMaker Studio instances does not have these permissions; **you must manually attach them**. For information on how to complete this, see this [documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

- **IF RUNNING ON SAGEMAKER STUDIO/STUDIO CLASSIC (not locally)**: An IAM role with the AmazonSageMakerFullAccess policy attached. This role must also include the AmazonEC2ContainerRegistryFullAccess permission in order to push the container image to ECR and the CloudWatchFullAccess permission to create CloudWatch Dashboards. By default, the SageMaker Execution Role associated with SageMaker Studio instances does not have these permissions; **you must manually attach them**. Also ensure that Docker access is enabled in your domain and that Docker is installed for this notebook instance. Follow the [guide](#sagemaker-studio-docker-guide) at the end of this notebook to complete Docker setup.

## Setup

***

**This notebook is best suited for a kernel of Python version >= 3.11**

In [None]:
%pip install -r requirements.txt

## Retrieve your SageMaker Session and Configure Execution Role

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# SageMaker automatically creates this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

# Retrieve the SageMaker execution role. Its ARN is passed to JumpStartModel and ModelMonitor later in this notebook. If retrieval fails, you can manually specify the role in the except block.
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    # Manually specify the role name. Ensure that this role has the 'AmazonSageMakerFullAccess' policy attached. See the linked documentation for help.
    role = iam.get_role(RoleName="<CustomRoleName>")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

***
You can continue with the default model or choose a different one. This notebook runs with the following model IDs:
- `meta-textgeneration-llama-2-7b-f`
- `meta-textgeneration-llama-2-13b-f`
- `meta-textgeneration-llama-2-70b-f`
***

In [None]:
model_id, model_version = "meta-textgeneration-llama-2-7b-f", "2.*"

## Deploy model

***
You can now deploy the model using SageMaker JumpStart.
***

### Set up DataCapture

In [None]:
bucket = sess.default_bucket()
print("Demo Bucket:", bucket)

In [None]:
from sagemaker.model_monitor import DataCaptureConfig

s3_root_dir = "byoc-multiple-eval-monitor-llm"

s3_capture_upload_path = f"s3://{bucket}/{s3_root_dir}/datacapture"

data_capture_config = DataCaptureConfig(
    enable_capture=True, sampling_percentage=100, destination_s3_uri=s3_capture_upload_path
)

In [None]:
print(s3_capture_upload_path)
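
Once the endpoint receives traffic, data capture writes JSON Lines files under the S3 path printed above. The sketch below shows one way to pull the prompt and the generated reply out of a single captured record. It assumes the record layout described in the SageMaker data capture documentation (a `captureData` object holding `endpointInput` and `endpointOutput` sections, each with `data` and `encoding` fields) and the request and response shapes used later in this notebook. The helper names and the sample record are hypothetical, and the monitoring container's own parsing may differ.

```python
import base64
import json


def decode_capture_payload(section: dict) -> object:
    """Decode the `data` field of an endpointInput/endpointOutput section."""
    raw = section["data"]
    if section.get("encoding") == "BASE64":
        raw = base64.b64decode(raw).decode("utf-8")
    return json.loads(raw)


def extract_prompt_and_response(record_line: str) -> tuple[str, str]:
    """Return the last message of the first dialog and the generated reply."""
    capture = json.loads(record_line)["captureData"]
    request = decode_capture_payload(capture["endpointInput"])
    response = decode_capture_payload(capture["endpointOutput"])
    prompt = request["inputs"][0][-1]["content"]
    answer = response[0]["generation"]["content"]
    return prompt, answer


# Hypothetical record, hand-built to mirror the payload shapes used in this notebook.
sample_record = json.dumps(
    {
        "captureData": {
            "endpointInput": {
                "mode": "INPUT",
                "encoding": "JSON",
                "data": json.dumps(
                    {"inputs": [[{"role": "user", "content": "What is SageMaker?"}]]}
                ),
            },
            "endpointOutput": {
                "mode": "OUTPUT",
                "encoding": "JSON",
                "data": json.dumps(
                    [{"generation": {"role": "assistant", "content": "A managed ML service."}}]
                ),
            },
        }
    }
)
print(extract_prompt_and_response(sample_record))
```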

### Note: This next cell will take ~10 minutes

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id=model_id, model_version=model_version, role=role)
predictor = model.deploy(data_capture_config=data_capture_config)
print(model.endpoint_name)

## Invoke the endpoint

***
### Supported Parameters
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
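
A client-side check of these constraints can catch mistakes before a request reaches the endpoint. The following is a minimal sketch based only on the descriptions above; `validate_parameters` is a hypothetical helper, and treating both 0 and 1 as valid `top_p` values is an assumption, since the description only says "between 0 and 1".

```python
def validate_parameters(parameters: dict) -> dict:
    """Check the documented constraints and return the parameters unchanged."""
    if "max_new_tokens" in parameters:
        value = parameters["max_new_tokens"]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError("max_new_tokens must be a positive integer")
    if "temperature" in parameters:
        value = parameters["temperature"]
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
            raise ValueError("temperature must be a positive number")
    if "top_p" in parameters:
        value = parameters["top_p"]
        # Assumption: both boundaries are accepted.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or not 0 <= value <= 1:
            raise ValueError("top_p must be between 0 and 1")
    return parameters


checked = validate_parameters({"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6})
print(checked)
```

Any subset of the three keys passes through, matching the statement that the parameters are optional.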

***
### Notes
- If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For the 7B, 13B, and 70B models, we recommend setting `max_new_tokens` no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens below 4K.
- In order to support a 4k context length, this model has restricted query payloads to only utilize a batch size of 1. Payloads with larger batch sizes will receive an endpoint error prior to inference.
- This model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and alternating (u/a/u/a/u...).
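
The role-ordering rule in the last note can be checked before sending a dialog. This sketch assumes the system message is optional (the examples in this notebook start directly with a user message) and that a dialog must end on a user turn so the model can reply; `validate_dialog` is a hypothetical helper, not part of the SageMaker SDK.

```python
def validate_dialog(dialog: list[dict]) -> None:
    """Raise ValueError unless roles follow [system], user, assistant, user, ..."""
    roles = [message["role"] for message in dialog]
    if roles and roles[0] == "system":
        roles = roles[1:]  # an optional leading system message is allowed
    if not roles:
        raise ValueError("Dialog needs at least one user message")
    for index, role in enumerate(roles):
        expected = "user" if index % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"Message {index} has role {role!r}, expected {expected!r}")
    if len(roles) % 2 == 0:
        # Assumption: the final turn must come from the user.
        raise ValueError("Dialog must end with a user message")


validate_dialog(
    [
        {"role": "system", "content": "Always answer briefly."},
        {"role": "user", "content": "what is the recipe of mayonnaise?"},
    ]
)
print("dialog is valid")
```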


In [None]:
def print_dialog(payload, response):
    dialog = payload["inputs"][0]
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}\n")
    print(
        f">>>> {response[0]['generation']['role'].capitalize()}: {response[0]['generation']['content']}"
    )
    print("\n==================================\n")

### Example of a single invocation

**NOTE**: Read the end-user-license-agreement here https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and accept by setting `accept_eula` to `true`

In [None]:
payload = {
    "inputs": [
        [
            {"role": "user", "content": "what is the recipe of mayonnaise?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)

### Send artificial traffic to the endpoint.

The following cell sends the first 10 questions from `./data/questions.jsonl` to the endpoint. Raise the limit in the cell if you want to capture more data, or stop the cell early once you have captured enough.

**NOTE**: Read the end-user-license-agreement here https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and accept by setting `accept_eula` to `true`

In [None]:
import json

line_count = 0
with open("./data/questions.jsonl", "r") as datafile:
    for line in datafile:
        if line_count == 10:
            break
        line_count += 1
        data = json.loads(line)
        payload = {
            "inputs": [
                [
                    data,
                ]
            ],
            "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
        }
        try:
            response = predictor.predict(payload, custom_attributes="accept_eula=false")
            print_dialog(payload, response)
        except Exception as e:
            print(e)

# Build and Push the Container to ECR

In [None]:
ecr_repo_name = "byoc-llm-multiple-eval"
aws_region = sess.boto_region_name
aws_account_id = sess.account_id()

#### **IMPORTANT:** If running locally (not on SageMaker Studio), delete ' --network sagemaker'
Build the image. This will take ~5 mins.

In [None]:
!set -Eeuxo pipefail
!docker build -t "{ecr_repo_name}" . --network sagemaker

Create the repository. Ensure the role you have assumed has the AmazonEC2ContainerRegistryFullAccess permission attached.

In [None]:
ecr = boto3.client("ecr")

try:
    response = ecr.create_repository(
        repositoryName=ecr_repo_name,
        imageTagMutability="MUTABLE",
        imageScanningConfiguration={"scanOnPush": False},
    )
except ecr.exceptions.RepositoryAlreadyExistsException:
    print(f"Repository {ecr_repo_name} already exists. Skipping creation.")

Push the image to ECR. This will take some time, as the image is ~9GB. Ensure that your AWS credentials are fresh.

In [None]:
# Each `!` line runs in its own subshell, so print the image ID directly instead of storing it in a shell variable.
!docker images --filter=reference='{ecr_repo_name}:latest' --format "{{.ID}}" | head -n 1

!aws ecr get-login-password --region '{aws_region}' | docker login --username AWS --password-stdin '{aws_account_id}'.dkr.ecr.'{aws_region}'.amazonaws.com

!docker tag '{ecr_repo_name}':latest '{aws_account_id}'.dkr.ecr.'{aws_region}'.amazonaws.com/'{ecr_repo_name}':latest

!echo 'Pushing to ECR Repo: ''{aws_account_id}'.dkr.ecr.'{aws_region}'.amazonaws.com/'{ecr_repo_name}':latest
!docker push '{aws_account_id}'.dkr.ecr.'{aws_region}'.amazonaws.com/'{ecr_repo_name}':latest

# Set a Monitoring Schedule

In [None]:
from sagemaker.model_monitor import ModelMonitor

image_uri = f"{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com/{ecr_repo_name}:latest"
bucket = sess.default_bucket()

monitor = ModelMonitor(
    base_job_name="byoc-llm-multiple-eval-monitor",
    role=role,
    image_uri=image_uri,
    instance_count=1,
    instance_type="ml.c5.9xlarge",
    env={
        "bucket": bucket,
        "TOXICITY": "Enabled",
        "READABILITY": "Enabled",
        "RELEVANCE_AND_ACCURACY": "Enabled",
    },  # Change one to DISABLED if metrics not desired.
)

**Note**: The following cell sets a **one-time** monitoring schedule for demonstration purposes. A one-time monitoring schedule executes immediately. To set an hourly schedule instead, replace `CronExpressionGenerator.now()` with the commented-out `CronExpressionGenerator.hourly()`. Hourly schedules only begin at the start of the next full hour, so you will not see immediate results.

In [None]:
from sagemaker.model_monitor import CronExpressionGenerator, MonitoringOutput, EndpointInput

# Do not change
container_data_destination = "/opt/ml/processing/input_data"
container_evaluation_source = "/opt/ml/processing/output"
s3_report_upload_path = f"s3://{bucket}/{s3_root_dir}/results"


endpoint_input = EndpointInput(
    endpoint_name=predictor.endpoint_name,
    destination=container_data_destination,
)

monitor.create_monitoring_schedule(
    endpoint_input=endpoint_input,
    output=MonitoringOutput(source=container_evaluation_source, destination=s3_report_upload_path),
    schedule_cron_expression=CronExpressionGenerator.now(),  # CronExpressionGenerator.hourly()
    # data sampling is from 3hrs prior to execution to time of execution
    data_analysis_start_time="-PT3H",
    data_analysis_end_time="-PT0H",
)
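
To make the schedule timing concrete, the sketch below resolves the `-PT3H` and `-PT0H` offsets against an execution time and computes when an hourly schedule would first fire. It only handles the whole-hour offset form used in the cell above (ISO 8601 durations are more general), and both helper names are hypothetical; SageMaker performs this resolution itself.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches only negative whole-hour offsets such as "-PT3H".
_HOUR_OFFSET = re.compile(r"^-PT(\d+)H$")


def resolve_offset(offset: str, execution_time: datetime) -> datetime:
    """Turn an offset like '-PT3H' into an absolute timestamp."""
    match = _HOUR_OFFSET.match(offset)
    if match is None:
        raise ValueError(f"Unsupported offset: {offset!r}")
    return execution_time - timedelta(hours=int(match.group(1)))


def next_full_hour(now: datetime) -> datetime:
    """Return the start of the hour that follows `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


execution_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window_start = resolve_offset("-PT3H", execution_time)
window_end = resolve_offset("-PT0H", execution_time)
print(window_start.isoformat(), window_end.isoformat())
# 2024-01-01T09:00:00+00:00 2024-01-01T12:00:00+00:00

print(next_full_hour(datetime(2024, 1, 1, 12, 17, tzinfo=timezone.utc)).isoformat())
# 2024-01-01T13:00:00+00:00
```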

# View Results

The following cell prints the output report stored in Amazon S3. It includes evaluations for at most 100 samples of the captured data.

**NOTE:** The report will show up once the job is finished. Please try again in a few minutes.

In [None]:
from sagemaker import s3

try:
    execution_output = monitor.list_executions()[-1].output
    s3_path_to_toxicity_report = f"{execution_output.destination}/toxicity_custom_dataset.jsonl"
    s3_path_to_readability_report = f"{execution_output.destination}/readability_eval_results.jsonl"
    s3_path_to_relevance_and_accuracy_report = (
        f"{execution_output.destination}/relevance_and_accuracy_eval_results.jsonl"
    )
    print("Toxicity report: \n")
    print(s3.S3Downloader.read_file(s3_path_to_toxicity_report), "\n")
    print("Readability report: \n")
    print(s3.S3Downloader.read_file(s3_path_to_readability_report), "\n")
    print("Relevance and Accuracy report: \n")
    print(s3.S3Downloader.read_file(s3_path_to_relevance_and_accuracy_report))
except Exception:
    print("Report not found. Please wait and try again.")
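
The reports are JSON Lines files, so each non-empty line can be parsed on its own once the text has been read. The sketch below parses such text into a list of dictionaries and averages a numeric field. The field names in `sample_report` are invented for illustration and are not the container's actual report schema.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical report content; the real field names depend on the container.
sample_report = '{"prompt": "q1", "score": 0.02}\n\n{"prompt": "q2", "score": 0.10}\n'

records = parse_jsonl(sample_report)
mean_score = sum(record["score"] for record in records) / len(records)
print(len(records), mean_score)
```

In the notebook, the string returned by `s3.S3Downloader.read_file(...)` could be passed to `parse_jsonl` in place of `sample_report`.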

### View CloudWatch Dashboard Graph
The following cell generates a CloudWatch Dashboard for the monitoring schedule you created. For more information on dashboard formatting, see the [dashboard body structure documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html#Dashboard-Body-Overall-Structure).

In [None]:
cwClient = boto3.client("cloudwatch")
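# Describe the schedule once and reuse the response for both lookups
schedule_description = monitor.describe_schedule()
monitoring_schedule_name = schedule_description["MonitoringScheduleName"]
endpoint_name = schedule_description["EndpointName"]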

# Get the metrics for this monitoring schedule
metric_list = cwClient.list_metrics(
    Dimensions=[
        {"Name": "Endpoint", "Value": endpoint_name},
        {"Name": "MonitoringSchedule", "Value": monitoring_schedule_name},
    ],
)
metric_names = [metric["MetricName"] for metric in metric_list["Metrics"]]
print(metric_names)

In [None]:
linear_interpolate_metric = [
    {
        "expression": "FILL(METRICS(), LINEAR)",
        "label": "Linear Interpolated",
        "id": "e1",
        "region": sess.boto_region_name,
    }
]
metrics = [linear_interpolate_metric]
for i, metric_name in enumerate(metric_names):
    metrics.append(
        [
            "aws/sagemaker/Endpoints/data-metrics",
            metric_name,
            "Endpoint",
            endpoint_name,
            "MonitoringSchedule",
            monitoring_schedule_name,
            {"id": f"m{i+1}", "region": sess.boto_region_name, "visible": False},
        ]
    )

widget_title = "LLM Multiple Evaluations Graph"

dash_data = json.dumps(
    {
        "start": "-PT6H",
        "periodOverride": "inherit",
        "widgets": [
            {
                "type": "metric",
                "x": 0,
                "y": 0,
                "width": 13,
                "height": 10,
                "properties": {
                    "metrics": metrics,
                    "view": "timeSeries",
                    "stacked": False,
                    "region": sess.boto_region_name,
                    "stat": "Average",
                    "period": 300,
                    "title": widget_title,
                },
            },
            {
                "type": "text",
                "x": 13,
                "y": 0,
                "width": 11,
                "height": 11,
                "properties": {
                    "markdown": "# LLM Evaluation Descriptions\n## Toxicity\nToxicity is measured in 7 different categories:\n- `toxicity`\n- `severe_toxicity`\n- `obscene`\n- `threat`\n- `insult`\n- `identity_attack`\n- `sexual_explicit`\n\nEach score is a number between 0 and 1, with 1 denoting extreme toxicity. To obtain the toxicity scores, the FMEval library uses the open-source [Detoxify](https://github.com/unitaryai/detoxify) model to grade each LLM output.\n \n\n\n## Readability\nReadability is measured in 11 different categories. These measurements are created and aggregated by the WhyLabs LangKit `textstat` module. For information on scoring for each metric, read their documentation [here](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data).\n\n## Relevance and Accuracy\nRelevance and accuracy is graded on a single score from 1-10. The prompt and response from the monitored LLM are provided to an evaluator LLM with instructions as follows:\n\n> Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. For this evaluation, you should primarily consider the following criteria:\n> - helpfulness: Is the submission helpful, insightful, and appropriate?\n> - relevance: Is the submission referring to a real quote from the text?\n> - correctness: Is the submission correct, accurate, and factual?\n> - depth: Does the submission demonstrate depth of thought?\n\n> Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: '[[rating]]', for example: 'Rating: [[5]]'.",
                },
            },
        ],
    }
)

dashboard_name = "byoc-llm-multiple-monitoring"
cwClient.put_dashboard(DashboardName=dashboard_name, DashboardBody=dash_data)

Click the link in the following cell's output to view the created CloudWatch Dashboard.

In [None]:
from IPython.display import display, Markdown

display(
    Markdown(
        f"[CloudWatch Dashboard](https://{aws_region}.console.aws.amazon.com/cloudwatch/home?region={aws_region}#dashboards/dashboard/{dashboard_name})"
    )
)
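
The dashboard text describes an evaluator prompt that asks for a rating in the form `[[rating]]` on a scale of 1 to 10. A score in that format can be recovered with a small regular expression, as sketched below. `extract_rating` is a hypothetical helper shown for illustration; the container's evaluation library may handle this differently.

```python
import re
from typing import Optional

# Captures the digits inside double square brackets, e.g. "[[5]]".
_RATING_PATTERN = re.compile(r"\[\[(\d+)\]\]")


def extract_rating(evaluator_output: str) -> Optional[int]:
    """Return the 1-10 rating found in the evaluator's reply, or None."""
    match = _RATING_PATTERN.search(evaluator_output)
    if match is None:
        return None
    rating = int(match.group(1))
    # Ratings outside the documented 1-10 scale are treated as missing.
    return rating if 1 <= rating <= 10 else None


print(extract_rating("The answer is accurate but brief. Rating: [[5]]"))  # 5
print(extract_rating("No score was given."))  # None
```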

### Clean up resources

In [None]:
import time

# Delete monitoring job

name = monitor.monitoring_schedule_name
monitor.delete_monitoring_schedule()

# Waits until monitoring schedule has been deleted to delete endpoint
while True:
    monitoring_schedules = sess.list_monitoring_schedules()
    if any(
        schedule["MonitoringScheduleName"] == name
        for schedule in monitoring_schedules["MonitoringScheduleSummaries"]
    ):
        time.sleep(5)
    else:
        print("Monitoring schedule deleted")
        break

sess.delete_endpoint(endpoint_name=predictor.endpoint_name)  # delete model endpoint
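
The wait loop above polls indefinitely. If you prefer the cleanup to give up after a while, a generic polling helper with a timeout is one option. This is a sketch; `wait_until` is a hypothetical helper, and the fake condition below stands in for the monitoring-schedule check.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool], timeout_seconds: float, poll_seconds: float = 5.0
) -> bool:
    """Poll `condition` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


# Stand-in for "is the schedule gone yet?": reports False twice, then True.
fake_checks = iter([False, False, True])
deleted = wait_until(lambda: next(fake_checks), timeout_seconds=1.0, poll_seconds=0.01)
print(deleted)  # True
```

In this notebook, the condition could wrap the same `sess.list_monitoring_schedules()` lookup used in the cleanup cell, with the endpoint deleted only when `wait_until` returns `True`.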

# SageMaker Studio Docker Guide

To set up Docker in your SageMaker Studio environment, follow these steps:
1. Run the following command in the AWS CLI, inputting your region and SageMaker domain ID:
```bash
aws --region <region> \
    sagemaker update-domain --domain-id <domain-id> \
    --domain-settings-for-update '{"DockerSettings": {"EnableDockerAccess": "ENABLED"}}'
```
2. Open a new notebook instance. Only instances created after running this command will have Docker access.
3. Open the terminal in this new instance and follow the [installation directions](https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/sagemaker_studio_docker_cli_install/README.md).

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/deploy_and_monitor|sm-model_monitor_byoc_llm_monitor|sm-model_monitor_byoc_llm_monitor.ipynb)
