# Testing LLM latency, throughput, and price-performance

> *This notebook has been tested in the Python 3 kernel of SageMaker Studio JupyterLab (Distribution v2.0)*

Response quality and robustness of LLM-powered applications must also be weighed against their **speed** and **cost**: high latency or high prices can make a solution unviable, and a wide range of models is available with different trade-offs between these factors.

This notebook briefly demonstrates two open-source tools you can use to test and compare the response speeds of LLMs (and LLM-powered applications), and to consider the implications for cost of ownership: [LLMeter](https://pypi.org/project/llmeter/) and [FMBench](https://github.com/aws-samples/foundation-model-benchmarking-tool).

The two halves are independent, but both assume you've deployed example SageMaker JumpStart endpoints as detailed in the [accompanying guided workshop](https://catalog.workshops.aws/workshops/ab6c96d3-53cf-4730-b0fe-f4762dbbb6eb/en-US/20-model-shortlisting/21-model-setup). We also recommend installing the dependencies up-front so you can move between halves without having to restart your notebook kernel:

In [None]:
# (list botocore to help pip find the existing installation and speed things up)
%pip install botocore "llmeter[plotting]>=0.1.2,<0.2" "transformers>=4.30,<5"

## Lightweight latency and throughput testing with LLMeter

LLMeter (on [PyPI](https://pypi.org/project/llmeter/) and [GitHub](https://github.com/awslabs/llmeter)) offers simple but extensible tools for performance testing LLMs hosted on a wide range of platforms.

It offers out-of-the-box analyses of:

- **Latency by input & output token count**: Helping you understand the impact of your generation lengths and input prompt lengths on response speed
- **Latency and throughput by concurrent requests**: Exploring how container/server-based deployment models behave under load

...and the ability to create other custom "experiments" on top of the core instrumentation framework, if you need them.

In [None]:
from llmeter.endpoints import BedrockConverseStream, SageMakerStreamEndpoint
from llmeter.experiments import LatencyHeatmap, LoadTest

# We'll also use some utilities for fetching example payloads for SM JumpStart endpoints:
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.jumpstart.session_utils import get_model_info_from_endpoint

### Latency heatmapping by I/O token counts

First, we'll need to set up an LLMeter "endpoint" object to connect our chosen model/API with instrumentation. In this example, we'll target Anthropic Claude 3 Haiku on Amazon Bedrock - with **response streaming** enabled:

In [None]:
bedrock_endpoint_stream = BedrockConverseStream(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
)

This endpoint object both invokes the model and tracks latency metrics for each invocation:

In [None]:
sample_response = bedrock_endpoint_stream.invoke(
    payload=BedrockConverseStream.create_payload(
        "Create a list of 3 pop songs",
        max_tokens=512,
        system=[{"text": "you're an expert in pop and indie music"}],
    ),
)
print(sample_response)

> ℹ️ **Note:** We're using the *streaming* variant of the endpoint here, which means the "Time-to-First Token" metric is also available. If your use-case is not able to consume streaming responses, you may see more accurate/representative overall (Time-to-Last Token) latency metrics using the non-streaming `BedrockConverse` endpoint.
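
To make the two metrics concrete, here is a minimal sketch of how Time-to-First-Token (TTFT) and Time-to-Last-Token (TTLT) can be measured around any streaming response. It does not use LLMeter or Bedrock: `fake_token_stream` and `time_stream` are hypothetical helpers, and the stream is simulated with `time.sleep`.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(n_chunks: int = 5, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    for i in range(n_chunks):
        time.sleep(delay_s)  # simulate generation time for each chunk
        yield f"chunk-{i} "


def time_stream(stream: Iterable[str]) -> dict:
    """Consume a stream, recording time to the first and last chunk (seconds)."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            # The first chunk has just arrived:
            ttft = time.perf_counter() - start
        chunks.append(chunk)
    ttlt = time.perf_counter() - start
    return {"text": "".join(chunks), "n_chunks": len(chunks), "ttft": ttft, "ttlt": ttlt}


timing = time_stream(fake_token_stream())
print(f"TTFT: {timing['ttft']:.3f}s, TTLT: {timing['ttlt']:.3f}s")
```

With a non-streaming API the client only sees the complete response, so only the overall (TTLT-like) duration can be observed.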

If your team has an extensive, use-case-representative dataset on hand, you could use LLMeter's [Runner](https://github.com/awslabs/llmeter/blob/main/llmeter/runner.py) class to run batch requests through the LLM and analyze the statistics by input and output token count. However, projects may need to perform initial investigations before such a dataset is available.

LLMeter's "Latency Heatmap" experiment explores latency as a function of prompt length and completion length by *automatically* generating prompts of various lengths from a source text (with the aim of producing long outputs), and using the `max_tokens` inference parameter to limit generation lengths.
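
As a rough illustration of the idea (not LLMeter's actual implementation), prompts of different target lengths can be cut from a single source text. In this sketch the helper `make_prompts` is hypothetical, and whitespace-separated words stand in for tokens, which is only an approximation of a real tokenizer.

```python
def make_prompts(source_text: str, input_lengths: list[int]) -> dict[int, str]:
    """Cut the start of source_text to roughly N 'tokens' for each target length.

    Assumption: one whitespace-separated word counts as one token. Real
    tokenizers usually produce more tokens than words.
    """
    words = source_text.split()
    return {n: " ".join(words[:n]) for n in input_lengths}


sample_text = "word " * 2000  # placeholder standing in for a real source text
prompts = make_prompts(sample_text, [50, 500, 1000])
for n, prompt in prompts.items():
    print(f"target={n} words={len(prompt.split())}")
```

Because the word count only approximates the model's token count, the measured input lengths will generally differ somewhat from the targets.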

Here we'll use the same source text as LLMeter's own examples: the text of the novel "Frankenstein" by Mary Shelley:

In [None]:
# (--create-dirs makes the datasets/ folder if it doesn't exist yet)
!curl --create-dirs -o datasets/MaryShelleyFrankenstein.txt \
    https://raw.githubusercontent.com/awslabs/llmeter/main/examples/MaryShelleyFrankenstein.txt

With a source text and a function (below) to format example requests from fragments of that text, we're ready to run our experiment to measure latency across various input and output lengths:

In [None]:
def prompt_fn(prompt, **kwargs):
    formatted_prompt = f"Create a story based on the following prompt: {prompt}"
    return BedrockConverseStream.create_payload(
        formatted_prompt, inferenceConfig={"temperature": 1.0}, **kwargs
    )

latency_heatmap = LatencyHeatmap(
    endpoint=bedrock_endpoint_stream,
    clients=4,
    requests_per_combination=20,
    output_path=f"data/llmeter/{bedrock_endpoint_stream.model_id}/heatmap",
    source_file="datasets/MaryShelleyFrankenstein.txt",
    input_lengths=[50, 500, 1000],
    output_lengths=[128, 512, 1024],
    create_payload_fn=prompt_fn,
)

heatmap_results = await latency_heatmap.run()

fig, axs = latency_heatmap.plot_heatmap()

In our test, we found that the *Time-to-Last-Token* (the overall response time) correlated most strongly with the output token count - while the *Time-to-First-Token* (until the streaming response returned the first part of the response) correlated most strongly with the input prompt length.

Note that we kept the parallel `clients` and total `requests_per_combination` small for this example to avoid running into quota issues in large workshops. With only 20 requests per combination, **p99 results are not statistically meaningful**: they are determined almost entirely by the slowest one or two requests. You'll also see that the graphed input and output token counts don't correspond exactly to the test inputs. The lengths of generated input prompts are approximate, based on a local tokenizer (you can pass in a `tokenizer` argument to provide something that corresponds more accurately with your model under test), and outputs may be shorter than requested when the model stops generating before reaching `max_tokens`. The graphed bins are therefore calculated automatically from the actual data obtained by the test.
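
The point about small samples can be checked with a few lines of standard-library Python. This sketch uses a nearest-rank percentile (one of several common definitions, chosen here for simplicity) on made-up latencies; `nearest_rank_percentile` is a hypothetical helper, not part of LLMeter.

```python
import math
import random


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


random.seed(0)
# Simulated latencies in seconds (made-up data, not measurements):
latencies = [random.lognormvariate(0, 0.5) for _ in range(20)]

p50 = nearest_rank_percentile(latencies, 50)
p99 = nearest_rank_percentile(latencies, 99)
print(f"p50={p50:.2f}s  p99={p99:.2f}s  max={max(latencies):.2f}s")
```

With 20 samples, the nearest-rank p99 is simply the maximum, so a single slow request sets the value. Interpolating definitions differ in detail but are still dominated by the top one or two samples.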

In addition to the visual output here in the notebook, you'll find that the plot and additional detail files have been saved to disk in the provided `output_path` - which could instead have been an `s3://...` URI to write directly to Amazon S3.

The summary and full request-level details are also available here in Python, for more custom analyses:

In [None]:
print(heatmap_results.output_path)

print("\nResults summary:")
print(heatmap_results)

print("\nIndividual invocation example:")
print(heatmap_results.responses[0])

### Load testing with concurrent requests

For large-scale API-based services like Amazon Bedrock, the total request volume for a specific use-case is likely to be insignificant relative to the overall size of the service. It's still important to check and request appropriate [quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html) for your proposed usage, but there may be limited value in testing how latency varies with load.

For deployments like Amazon SageMaker JumpStart, however, where you choose the **number and type of instances** on which to deploy your model, it's important to understand how your solution will behave under many concurrent requests so that you can configure your infrastructure and auto-scaling appropriately.

LLMeter's **`LoadTest` experiment** provides a tool to test the latency and overall throughput your model endpoint can deliver as the number of concurrent requests increases.
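
For intuition about what such a sweep measures, here is a minimal standard-library sketch, assuming a simulated endpoint. It is not how `LoadTest` is implemented: `fake_invoke` and `run_level` are hypothetical names, and the "endpoint" is just a 20 ms sleep with no capacity limit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(payload: str) -> float:
    """Hypothetical stand-in for one endpoint call; returns its latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.02)  # pretend the model takes 20 ms to respond
    return time.perf_counter() - start


def run_level(n_clients: int, requests_per_client: int = 3) -> dict:
    """Send requests from up to n_clients parallel workers and summarize the run."""
    n_requests = n_clients * requests_per_client
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        latencies = list(pool.map(fake_invoke, ["hello"] * n_requests))
    elapsed = time.perf_counter() - start
    return {
        "clients": n_clients,
        "requests": len(latencies),
        "avg_latency_s": sum(latencies) / len(latencies),
        "requests_per_s": len(latencies) / elapsed,
    }


sweep = [run_level(n) for n in [1, 5, 20]]
for row in sweep:
    print(row)
```

Because the simulated endpoint never saturates, throughput here grows roughly in proportion to the number of clients. A real instance-backed endpoint would typically show latency rising and throughput levelling off once its capacity is reached, which is the behavior a load test is meant to reveal.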

We'll run the `LoadTest` against one of the SageMaker JumpStart endpoints you deployed earlier (make sure the `sm_endpoint_name` below matches one of your deployed endpoints!):

In [None]:
sm_endpoint_name = "demo-llama-3-8b-instruct"  # <- Check this matches one of your endpoints!

# Look up the JumpStart model ID from the SageMaker endpoint:
model_id, model_version, _, _, _ = get_model_info_from_endpoint(endpoint_name=sm_endpoint_name)
print(f"{sm_endpoint_name} is a deployment of model ID:\n    {model_id}")

# Look up example payloads from SageMaker JumpStart for this model:
smjs_model = JumpStartModel(model_id=model_id, model_version=model_version)
sample_payloads = [k.body for k in (smjs_model.retrieve_all_examples() or []) if k.body]
print(f"Got {len(sample_payloads)} example request payloads for this model from SageMaker")

# Create the LLMeter 'endpoint':
sagemaker_endpoint = SageMakerStreamEndpoint(
    endpoint_name=sm_endpoint_name, model_id=model_id
)

# Test the endpoint with one of the example payloads:
print(f"\nTesting endpoint with payload:\n{sample_payloads[0]}")
sample_response = sagemaker_endpoint.invoke(payload=sample_payloads[0])

print("Got result:")
print(sample_response)

Once the target endpoint is set up, we can run a `LoadTest` by defining what concurrency levels we'd like to test and how many requests should be sent for each one:

In [None]:
sweep_test = LoadTest(
    endpoint=sagemaker_endpoint,
    payload=sample_payloads,
    sequence_of_clients=[1, 5, 20, 100, 200],
    min_requests_per_client=3,
    min_requests_per_run=20,
    output_path=f"data/llmeter/{sagemaker_endpoint.model_id}/sweep",
)
sweep_results = await sweep_test.run()

In [None]:
loadfig, loadaxs = sweep_test.plot_sweep_results()

LLMeter does not attempt to integrate **pricing** information directly from cloud services, but in general you can relate the results to pricing in terms of:
- The observed total input and output lengths, for per-token pricing deployments like [base models on Amazon Bedrock](https://aws.amazon.com/bedrock/pricing/)
- The total test duration & achievable throughput in requests per minute, for per-infrastructure pricing deployments like [SageMaker JumpStart Endpoints](https://aws.amazon.com/sagemaker/pricing/)
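
The two pricing models in the list above can be turned into a per-request cost with simple arithmetic. The sketch below is only illustrative: both helper functions are hypothetical, and every price and volume is a placeholder rather than a real AWS price.

```python
def cost_per_request_token_priced(
    input_tokens: float,
    output_tokens: float,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Per-request cost when you pay per input and output token."""
    return input_tokens / 1000 * usd_per_1k_input + output_tokens / 1000 * usd_per_1k_output


def cost_per_request_instance_priced(
    usd_per_instance_hour: float,
    n_instances: int,
    requests_per_minute: float,
) -> float:
    """Per-request cost when you pay per instance-hour, at a sustained throughput."""
    requests_per_hour = requests_per_minute * 60
    return usd_per_instance_hour * n_instances / requests_per_hour


# All numbers below are placeholders, NOT real AWS prices or measured results:
token_cost = cost_per_request_token_priced(
    input_tokens=1000, output_tokens=500, usd_per_1k_input=0.001, usd_per_1k_output=0.002
)
instance_cost = cost_per_request_instance_priced(
    usd_per_instance_hour=2.0, n_instances=1, requests_per_minute=100
)
print(f"Token-priced:    ${token_cost:.6f} per request")
print(f"Instance-priced: ${instance_cost:.6f} per request")
```

The instance-priced figure assumes the endpoint is kept busy at the measured throughput. Idle time is still billed, so lower utilization raises the effective cost per request.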

For more details on the low-level components in LLMeter, refer to their published [example notebooks](https://github.com/awslabs/llmeter/tree/main/examples).

## A more involved, cost-aware setup with FMBench

Another option for latency and load testing LLMs is [FMBench from AWS Samples](https://github.com/aws-samples/foundation-model-benchmarking-tool).

FMBench offers extra functionality, including:
- Built-in awareness of AWS Pricing to link metrics to actual dollar costs
- Built-in options to deploy (and tear down) model endpoints as part of the test run
- Extra analyses on response quality, as well as speed/cost

However, its larger scope makes it more complex to install and configure. In particular, FMBench's dependency on Python>=3.11 and its range of tightly-pinned other dependencies make it challenging to install on a SageMaker Studio notebook. Generally, the tool works well when installed in its own independent environment, away from other projects.

If you deployed [this workshop's CloudFormation stack](https://github.com/aws-samples/llm-evaluation-methodology/tree/main#readme) with the performance testing stack enabled, a [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-overview.html) has already been set up encapsulating an FMBench workflow for you.

> ℹ️ If the below SageMaker Pipeline isn't already set up in your environment, refer to the [deployment instructions](https://github.com/aws-samples/llm-evaluation-methodology/tree/main#readme) for extra guidance.

Run the code below to look up your deployed FMBench pipeline and its default input parameters:

In [None]:
import json
from sagemaker.workflow.pipeline import Pipeline

fmbench_pipeline = Pipeline("sm-fmbench-pipeline")

# Look up the pre-created pipeline and its configured input parameters:
pipeline_desc = fmbench_pipeline.describe()
print(f"Found existing FMBench pipeline '{fmbench_pipeline.name}'")

pipeline_defn = json.loads(pipeline_desc["PipelineDefinition"])
pipeline_params = pipeline_defn["Parameters"]

print("\nPipeline parameters:")
print(json.dumps(pipeline_params, indent=2))

try:
    default_config_s3uri = next(
        p for p in pipeline_params if p["Name"] == "ConfigS3Uri"
    )["DefaultValue"]
except StopIteration as e:
    # Chain the original exception so the traceback shows where the lookup failed:
    raise ValueError(
        f"Couldn't find 'ConfigS3Uri' parameter in pipeline parameters: {pipeline_params}"
    ) from e

To get started, you should be able to directly invoke this pipeline with the default parameters either through the ["Pipelines" interface in the SageMaker Studio sidebar menu](https://docs.aws.amazon.com/sagemaker/latest/dg/run-pipeline.html), or by running the code cell below:

In [None]:
fmbench_execution_1 = fmbench_pipeline.start()
fmbench_execution_1

While the pipeline runs (which may take several minutes), let's explore the configuration files that were pre-loaded to Amazon S3:

In [None]:
fmbench_config_s3root = "/".join(default_config_s3uri.split("/")[:-2])

print(fmbench_config_s3root + "\n")

!aws s3 sync {fmbench_config_s3root} data/fmbench_configs

The downloaded folder should contain (at least) **two configurations** for FMBench experiments: One for Llama, and one for Mistral. Check out the [data/fmbench_configs folder](data/fmbench_configs) for these pre-prepared `config.yaml` examples.

Note that in particular, the `ep_name` in the config YAML must exactly match the name of your deployed SageMaker endpoint. If you deployed your endpoint with a different name, you may need to edit this file and then re-upload it to S3 for your pipeline to work correctly.
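
The `fmbench_config_s3root` expression used earlier derives the shared config folder by dropping the last two path segments of the config URI. The sketch below shows that manipulation on a made-up URI: the bucket name and folder layout are hypothetical, and `s3_parent` is an illustrative helper rather than part of any SDK.

```python
from urllib.parse import urlparse


def s3_parent(s3_uri: str, levels: int = 1) -> str:
    """Return the S3 prefix `levels` path segments above the given URI."""
    return "/".join(s3_uri.split("/")[:-levels])


# Hypothetical URI for illustration only; your pipeline's default will differ:
example_config_uri = "s3://example-bucket/fmbench/configs/llama3/config.yaml"

config_root = s3_parent(example_config_uri, levels=2)
print(config_root)

# urlparse splits an s3:// URI into bucket (netloc) and object key (path):
parsed = urlparse(example_config_uri)
bucket, key = parsed.netloc, parsed.path.lstrip("/")
print(bucket, key)
```

After editing a config file locally, the same bucket and key structure tells you where to upload it (for example with `aws s3 cp`) so the pipeline can read the updated version.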

You can check the status of your running pipeline execution in the SageMaker Studio Pipelines UI, or from code here in the notebook:

In [None]:
exec_1_desc = fmbench_execution_1.describe()
print(f"Pipeline run status: {exec_1_desc['PipelineExecutionStatus']}\n")

exec_1_steps = fmbench_execution_1.list_steps()
exec_1_steps

▶️ **TODO:** Can you edit the `ConfigS3Uri` parameter in the below cell to trigger an additional run of the pipeline to test the **other** model that wasn't in the default run?

Use the default pipeline parameters above as a guide, and check out which other config(s) were downloaded from this bucket that you could use.

In [None]:
fmbench_execution_2 = fmbench_pipeline.start(
    parameters={
        "ConfigS3Uri": "TODO: Replace with your alternative s3://... URI"
    },
    # You can also optionally name each pipeline execution for tracking purposes:
    # execution_display_name="llama3-run-1",
)

### Exploring FMBench results

When (one or both of) your pipeline runs have finished, you can explore the detailed results and reports saved by FMBench to Amazon S3.

The code below waits for both pipeline runs to finish, and then copies the contents of the output bucket (assuming you didn't override it in the pipeline parameters) to the notebook's local storage:

In [None]:
# Wait for both pipeline runs to finish:
fmbench_execution_1.wait()
fmbench_execution_2.wait()

# Look up the (default) S3 output location for the runs:
try:
    default_output_bucket = next(
        p for p in pipeline_params if p["Name"] == "OutputS3BucketName"
    )["DefaultValue"]
except StopIteration as e:
    # Chain the original exception so the traceback shows where the lookup failed:
    raise ValueError(
        f"Couldn't find 'OutputS3BucketName' parameter in pipeline parameters: {pipeline_params}"
    ) from e

# Download the S3 output folder locally:
print("Downloading FMBench results...\n")
!aws s3 sync s3://{default_output_bucket}/ data/fmbench_outputs

You'll see separate subfolders created in the outputs for each model/experiment you ran, and the `metrics` subfolder will contain detailed reports including a `business_summary.html`, `tokens_vs_latency.png` chart, and a range of other data and charts.
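
To locate the reports without browsing folders by hand, a recursive search of the download folder can help. This is a small sketch that assumes only the `data/fmbench_outputs` folder and the `business_summary.html` file name mentioned above. `find_reports` is a hypothetical helper, and it makes no assumption about the exact subfolder layout.

```python
from pathlib import Path


def find_reports(root: str, filename: str = "business_summary.html") -> list[Path]:
    """Recursively find files with the given name under root (empty list if none)."""
    return sorted(Path(root).rglob(filename))


reports = find_reports("data/fmbench_outputs")
print(f"Found {len(reports)} report(s)")
for path in reports:
    print(path)
```

Passing a different `filename`, such as `tokens_vs_latency.png`, lists the corresponding chart for each run.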

## Summary

Particularly for real-time or high-volume applications, it's important to weigh the achievable quality of responses against the speed and cost-to-serve.

In this notebook we briefly explored two different tools for latency and throughput testing:

- [**LLMeter**](https://github.com/awslabs/llmeter#readme), from AWS Labs:
    - is straightforward to install
    - connects to OpenAI and a range of other 3rd-party model providers, as well as AWS-hosted models
    - includes pre-built "experiments" for mapping latency by prompt+completion length, and performance by number of concurrent clients/requests
    - keeps a simple API for defining more customized test runs and drilling further into the metrics
- [**FMBench**](https://aws-samples.github.io/foundation-model-benchmarking-tool/), from AWS Samples:
    - brings in AWS pricing information to directly link latency and throughput metrics to cost-of-ownership
    - offers some response quality evaluations as well as performance (based on LongBench and LLM-judged evaluations by default)
    - requires some more complex setup and configuration due to its feature set, but this complexity can be contained by packaging the tool in a SageMaker Pipeline as we showed here
 
You can find more detailed information and feature updates on the tools' own websites.

Whatever tooling you use, accounting for response speed and cost considerations can help you:
- Make informed trade-offs to select the best model for your use-case
- Understand potential latency & cost rewards for optimizing the length of your prompt templates, typical output generations, and hard `max_tokens` limits on generation length