# Modelling Costs with LLMeter

This notebook shows how to use LLMeter's `CostModel` callback to estimate costs, and how to factor these into your comparisons between different LLMs and solution configurations.

If you're new to LLMeter, you may find it helpful to follow through one of the introductory "LLMeter with..." notebooks first.

## Overview

At a high level, to model costs with LLMeter you'll:
1. Start by building up a cost model in line with whatever "dimensions" of charges are applicable for your chosen FM provider (such as charges per token, infrastructure charges for endpoint uptime, and so on).
2. Include this model as a callback when running an LLMeter Run or Experiment, **or** calculate pricing against a previous run for which the callback wasn't enabled.
3. Explore the request-level and run-level cost estimates calculated by your model, to help evaluate and compare FMs.

> ⚠️ **An important warning: Pricing is complicated!**
>
> In general, many factors can affect the final charges you might incur when using Foundation Models or other Cloud services - and make comparing different types of hosting services more complicated.
>
> For example: services may offer volume discounts, reserved-capacity discounts, private pricing agreements, tiered pricing, or even free tiers, in which case the marginal costs of deploying a new use-case may depend on other workloads you're already committed to. Additional factors like networking and data transfer charges, gateways or other solution components may also contribute to pricing in complex ways.
>
> **You are ultimately responsible for understanding your cost structure.** Even in cases where LLMeter provides examples or utilities that attempt to simplify modelling the costs of endpoint types we natively support, we cannot guarantee these will be authoritative estimates *or* that they capture all the nuances of costing in your environment.

## Setting up

First, ensure your environment has LLMeter installed and import the key components:

In [None]:
%pip install llmeter

In [None]:
from dataclasses import dataclass

from llmeter.callbacks import CostModel
from llmeter.callbacks.cost import CalculatedCostWithDimensions, dimensions
from llmeter.endpoints.bedrock import BedrockConverseStream
from llmeter.endpoints.sagemaker import SageMakerEndpoint
from llmeter.experiments import LoadTest  # Example of a higher-level "experiment"
from llmeter.results import InvocationResponse, Result
from llmeter.runner import Runner  # Low-level test runner

## Request-based pricing example (with Amazon Bedrock)

For many popular "as-a-service" Foundation Model APIs (like OpenAI, Anthropic, or Amazon Bedrock Converse), the major components of pricing center around the **number of tokens** sent through input prompts and returned in output completions.

Often these are charged at different rates (usually with higher prices per output token), and we can model these with LLMeter's built-in `InputTokens` and `OutputTokens` dimensions.

Let's assume, for example, that you'd like to call the Anthropic Claude 3 Haiku model on Amazon Bedrock with On-Demand throughput. Referring to the [Amazon Bedrock pricing page](https://aws.amazon.com/bedrock/pricing/), we can set up a basic cost model as shown below:

In [None]:
endpoint = BedrockConverseStream(
    # Model IDs from the table here:
    # https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html
    model_id="anthropic.claude-3-haiku-20240307-v1:0"
)

cost_model = CostModel(
    # Request-level costs:
    request_dims=[
        # At the time of writing, Bedrock pricing page for Claude 3 Haiku *in us-east-1* lists:
        # $0.00025 per thousand input tokens, $0.00125 per thousand output tokens
        dimensions.InputTokens(price_per_million=0.25),
        dimensions.OutputTokens(price_per_million=1.25),
    ]
)
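
As a quick sanity check on the prices configured above, the arithmetic behind per-token pricing can be sketched in plain Python. This is a minimal illustration rather than LLMeter's implementation: the `token_cost` helper and the token counts are hypothetical, chosen only to show how a per-thousand price maps to `price_per_million`.

```python
def token_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated charge (USD) for one request under simple per-token pricing."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# A price quoted per thousand tokens becomes a per-million price when multiplied by 1,000
input_price_per_million = 0.00025 * 1000
output_price_per_million = 0.00125 * 1000

# Hypothetical request: 20 input tokens and 150 output tokens
example_cost = token_cost(20, 150, input_price_per_million, output_price_per_million)
print(f"Estimated request cost: ${example_cost:f}")
```

Because output tokens are priced higher here, the length of the completion dominates the estimate for this hypothetical request.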

To check things are working, we can invoke our endpoint with an example payload and estimate the costs for that specific request:

In [None]:
sample_payloads = [
    BedrockConverseStream.create_payload(
        "Tell me a short story about a caterpillar that learns to forgive",
        max_tokens=2000,
    ),
    BedrockConverseStream.create_payload(
        "In what year did Singapore Changi airport open?",
        max_tokens=200,
    ),
    BedrockConverseStream.create_payload(
        "When should I use lists vs tuples in Python?",
        max_tokens=1000,
    ),
]

response = endpoint.invoke(payload=sample_payloads[1])
print("--- Response ---")
print(response)

print("\n--- Cost Estimate ---")
req_est = await cost_model.calculate_request_cost(response)
print(f"Dimensions: {req_est}")
print(f"Total: ${req_est.total:f}")

More interestingly, we can analyze overall and average-per-request costs by passing the model in as a **callback** to a test run:

In [None]:
endpoint_test = Runner(
    endpoint,
    output_path=f"outputs/{endpoint.model_id}",
    callbacks=[cost_model],  # <- Specify our cost model
)
results = await endpoint_test.run(
    payload=sample_payloads,
    n_requests=9,
    clients=3,
)

original_total_cost = results.cost_total
original_per_req_avg = results.stats.get("cost_per_request-average")
print(results)

On these results:

- `cost_InputTokens` is the total estimated charges *for input tokens* in the test Run
- `cost_OutputTokens` is the total estimated charges *for output tokens* in the test Run
- `cost_total` is the total of all estimated costs for the test Run (should equal `cost_InputTokens + cost_OutputTokens` in this case)
- You'll also see **statistics** for the total and per-dimension costs at the request level. For example:
    - `cost_OutputTokens_per_request-p50` is the *median* output tokens charge per request in the test run
    - `cost_per_request-average` is the *mean* overall cost per request in the test run

In this example we deliberately set up some payloads with different expected output lengths, so you should see some variation between the average, `p50`, and `p90` request costs.
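
To see why these statistics diverge, here is a small standalone sketch using Python's `statistics` module. The per-request costs are made-up values (not LLMeter output), and the percentile interpolation used here is an assumption that may differ slightly from the method LLMeter applies.

```python
import statistics

# Hypothetical per-request costs (USD): mostly short answers, plus a few long ones
request_costs = [
    0.00002, 0.00003, 0.00003, 0.00004, 0.00004, 0.00005,
    0.0004, 0.0009, 0.0012,
]

mean_cost = statistics.mean(request_costs)
median_cost = statistics.median(request_costs)
# quantiles(n=10) returns the nine cut points p10..p90, so the last entry is p90
p90_cost = statistics.quantiles(request_costs, n=10, method="inclusive")[-1]

print(f"average: ${mean_cost:f}")
print(f"p50:     ${median_cost:f}")
print(f"p90:     ${p90_cost:f}")
```

A handful of long completions pulls the average well above the median, which is why it helps to look at several statistics rather than the mean alone.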

It's also possible to re-analyze your run Result with a different Cost Model:

In [None]:
cost_model_2 = CostModel(
    request_dims=[
        dimensions.InputTokens(price_per_million=5),
        dimensions.OutputTokens(price_per_million=10),
    ],
)

cost_estimate_2 = await cost_model_2.calculate_run_cost(
    results,
    # Optionally, add `save=True` here to overwrite the estimates and stats on the `results` object
)

print("--- Original estimates ---")
print(f"Total run cost: ${original_total_cost:f}")
print(f"Average cost per request: ${original_per_req_avg:f}")

print("\n--- Alternative model ---")
print(f"Total run cost: ${cost_estimate_2.total:f}")
# Without `save`-ing the estimates to the results, `results.stats` won't be updated. You can still
# generate the stats separately as shown below - it's just more complex:
new_req_costs = [
    await cost_model_2.calculate_request_cost(r) for r in results.responses
]
new_cost_stats = CalculatedCostWithDimensions.summary_statistics(new_req_costs)
print(f"Average cost per request: ${new_cost_stats.get('total-average')}")

By associating a cost model with your runs (or analyzing runs with one retrospectively), we've seen how you can explore detailed cost breakdowns, totals, and summary statistics over all the requests in the run.

In the next sections, we'll explore more complex scenarios, including cost drivers that are independent of individual requests and custom cost dimensions that you define yourself.

> ⚠️ **Warning:** The `cost_per_request-average` statistic shown above is the *average of request-level costs*. If you use a Cost Model with *both* request-level and run-level cost dimensions (as described below), you probably want `results.cost_total / results.total_requests` instead!

## Infrastructure-based pricing example (with Amazon SageMaker)

For *deployment-based* services like [Amazon Bedrock Provisioned Throughput](https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html), self-managed model servers, or [Amazon SageMaker Real-Time Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html), pricing is mainly driven by the amount of time your provisioned endpoint is available - rather than charges per-request.

In these cases, we can use LLMeter's **run-level cost dimensions** to model costs that can't be broken down to the individual request level - but may still be useful to analyze at the run level.

As an example, let's consider load-testing a single-instance SageMaker real-time inference endpoint to explore how cost, latency, and throughput interact.

Assuming we've deployed a smaller LLM (like Mistral 7B or Llama 3.1 8B) from SageMaker JumpStart, to a single instance, we can:

- Refer to the [host instance storage volumes table](https://aws.amazon.com/releasenotes/host-instance-storage-volumes-table/) to find the default EBS volume size attached to our chosen instance type
- Refer to the [SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/) to find:
  - The price per hour for the instance type deployed (for example, at the time of writing, an `ml.g5.4xlarge` in region `us-east-1` was listed at $2.03/hour)
  - The price per month of provisioned SSD storage ($0.14 per GB-month at writing)

In [None]:
model_id = ""  # TODO: Your deployed SageMaker JumpStart model ID
model_version = "*"  # TODO: Replace with your specific version, or leave '*' for latest
endpoint_name = ""  # TODO: Your deployed SageMaker JumpStart endpoint name

cost_model = CostModel(
    # Run-level costs:
    run_dims={
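        # Note we can provide a dictionary instead of a list, to explicitly name our cost
        # dimensions:
        "ComputeHours": dimensions.EndpointTime(price_per_hour=2.03),
        # Storage is priced per GB-month, so convert to an hourly rate: this assumes a 600GB
        # volume and a 30-day month. Check the volume size for your own instance type.
        "EBSStorage": dimensions.EndpointTime(price_per_hour=0.14 * 600 / (24 * 30)),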
    }
)
cost_model
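
The `EBSStorage` rate in the cost model above folds a per-GB-month price into an hourly figure. The conversion can be sketched on its own as below: `storage_price_per_hour` is a hypothetical helper (not part of LLMeter), and the 600 GB volume size and 30-day month are assumptions carried over from the example.

```python
def storage_price_per_hour(
    price_per_gb_month: float,
    volume_gb: float,
    hours_per_month: float = 24 * 30,  # Assumes a 30-day month
) -> float:
    """Convert a per-GB-month storage price to an hourly rate for one volume."""
    return price_per_gb_month * volume_gb / hours_per_month


compute_hourly = 2.03  # Example instance price per hour from the text
ebs_hourly = storage_price_per_hour(0.14, 600)
total_hourly = compute_hourly + ebs_hourly

print(f"Storage: ${ebs_hourly:.4f}/hour")
print(f"Total:   ${total_hourly:.4f}/hour")
```

In this example storage is a small fraction of the hourly total, but the share grows for cheaper instance types with large attached volumes.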

In fact, `llmeter.callbacks.cost.providers.sagemaker` provides tools that can attempt to look up these costs automatically, based on your endpoint name or deployed instances. For this to work, you'll need the AWS IAM permissions [`pricing:GetProducts`](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awspricelist.html) (for price lookup), plus [`sagemaker:DescribeEndpoint` and `sagemaker:DescribeEndpointConfig`](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html) (for endpoint instance type/count lookup).

If you have the relevant IAM permissions, try running the cell below to set up the automatic cost model and see how it compares to your manually-crafted one above. Otherwise, you can continue with the manual cost model.

In [None]:
from llmeter.callbacks.cost.providers.sagemaker import (
    cost_model_from_sagemaker_realtime_endpoint,
)

cost_model_auto = cost_model_from_sagemaker_realtime_endpoint(endpoint_name)
cost_model_auto

Below, we'll try configuring the LLMeter endpoint and fetching the JumpStart-provided example payloads from the parameters you provided above. If you find errors, check your deployed model ID, version and endpoint type.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

endpoint = SageMakerEndpoint(
    endpoint_name=endpoint_name,
    model_id=model_id,
    # See 'LLMeter with Amazon SageMaker JumpStart.ipynb' for tips on checking this value for your
    # model:
    generated_text_jmespath="generated_text",
)

print(
    f"Fetching sample payloads for JumpStart model {model_id}, version {model_version}"
)
model = JumpStartModel(model_id=model_id, model_version=model_version)
sample_payloads = [k.body for k in (model.retrieve_all_examples() or []) if k.body]
print(f"Got {len(sample_payloads)} sample payloads")

print("Testing endpoint with first sample payload")
print(endpoint.invoke(sample_payloads[0]))

With the endpoint set up and a cost model defined, we can run a load test and attach the cost model as a callback:

In [None]:
sweep_test = LoadTest(
    endpoint=endpoint,
    payload=sample_payloads,
    sequence_of_clients=[1, 5, 20, 50, 100, 500],
    min_requests_per_client=5,
    min_requests_per_run=20,
    output_path=f"outputs/{endpoint.model_id}/sweep",
    callbacks=[cost_model],  # <- Include the cost model as callback
)
sweep_results = await sweep_test.run()

In a `LoadTest`, a `Run`/`Result` is created for each concurrency level specified in `sequence_of_clients` - and each should now be annotated with cost estimates.

In this model, our only cost contributors are driven by the overall time the endpoint is provisioned. We can compare the latency, error rate, successful requests-per-second throughput, and estimated cost per successful request - to understand the trade-offs of different request concurrencies to this deployment:

In [None]:
for result in sweep_results:
    print(f"{result.clients} concurrent clients:")
    successful_requests = result.total_requests - result.stats["failed_requests"]
    print(f"  - Request error rate {result.stats.get('failed_requests_rate'):.2%}")
    print(
        f"  - Avg latency {result.stats.get('time_to_last_token-average') * 1000:.0f}ms"
    )
    print(f"  - p90 latency {result.stats.get('time_to_last_token-p90') * 1000:.0f}ms")
    print(
        f"  - {successful_requests / (result.end_time - result.start_time):.2f} reqs/sec"
    )
    print(
        f"  - Est. ${result.cost_total / successful_requests:f} per successful request"
    )

Based on these results, we can understand both the quality of service and cost-to-serve offered by this deployment as a function of concurrent request volume. This can help us decide how many instances to deploy based on our expected production request rate.
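
When the only cost drivers are time-based, the cost per successful request follows directly from the hourly price and the achieved throughput. The sketch below shows that relationship in isolation: `cost_per_successful_request` is a hypothetical helper, and the throughput figures are illustrative rather than measured.

```python
def cost_per_successful_request(
    price_per_hour: float, successful_requests_per_second: float
) -> float:
    """Cost per request for an endpoint billed purely by uptime."""
    if successful_requests_per_second <= 0:
        raise ValueError("Throughput must be positive to amortize endpoint cost")
    return price_per_hour / (successful_requests_per_second * 3600)


price_per_hour = 2.03  # Example instance price per hour from the text

# Hypothetical sustained throughputs (successful requests per second)
for rps in (0.5, 2.0, 8.0):
    cost = cost_per_successful_request(price_per_hour, rps)
    print(f"{rps:>4} reqs/sec -> ${cost:f} per successful request")
```

Higher sustained throughput spreads the same hourly charge over more requests, so cost per request falls until latency or error rates become unacceptable.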

Alternatively, you could run load tests (or single test Runs) with different instance types to explore the best available trade-off for a single-instance endpoint.

## Writing custom cost dimensions

`llmeter.callbacks.cost.dimensions` provides a few pre-built cost dimensions to support the most common pricing types (and if others would be useful for you, please raise an issue and/or pull request!). But what if you need to model something different?

You can define your own cost dimensions at request level, run level, or both. The easiest way is to inherit from the `RequestCostDimensionBase` or `RunCostDimensionBase` class, respectively.

Request cost dimensions must implement `calculate(...)` to calculate a **request**'s cost based on the initial LLMeter `InvocationResponse` (which includes metadata like the numbers of input & output tokens, the latency, and so on). For example, maybe you're using a service like AWS Lambda, which charges based on the actual run-time of each function call. You could define a dimension something like:

In [None]:
@dataclass
class RequestRunTime(dimensions.RequestCostDimensionBase):
    price_per_millisecond: float

    async def calculate(self, response: InvocationResponse) -> float:
        # `time_to_last_token` is measured in seconds, so convert to milliseconds:
        return response.time_to_last_token * self.price_per_millisecond * 1000

Run cost dimensions must implement `calculate(...)` to calculate a **run**'s cost based on the initial LLMeter `Result`.

They can also **optionally** implement `before_run_start(...)`, which receives the initial `Runner`, in case you need to set up any initial state to support cost monitoring.

For example, maybe you're using a service that charges by "session" and running each Run as a separate session in the API. You might define a dimension like:

In [None]:
@dataclass
class FlatChargePerRun(dimensions.RunCostDimensionBase):
    price: float

    async def calculate(self, result: Result) -> float:
        return self.price if len(result.responses) > 0 else 0

These dimensions can be included in cost models just like the built-in `InputTokens`, `OutputTokens`, and `EndpointTime` we used above.

In [None]:
cdm_cost_model = CostModel(
    request_dims=[RequestRunTime(price_per_millisecond=0.01)],
    run_dims=[FlatChargePerRun(price=2)],
)

cdm_runner = Runner(
    endpoint,
    output_path=f"outputs/{endpoint.model_id}",
    callbacks=[cdm_cost_model],  # <- Specify our cost model
)
cdm_result = await cdm_runner.run(
    payload=sample_payloads,
    n_requests=9,
    clients=3,
)

**Remember:** With a composite cost model like this one, which includes both run-level and request-level dimensions, "per-request" statistics will **exclude** any overall run-level costs.

In this example:

- `cost_total` includes both dimensions
- `cost_per_request-average` **only** includes the `RequestRunTime` dimension
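
To make the difference concrete, here is a standalone sketch of the arithmetic using the prices from the cost model above. The three latencies are hypothetical values rather than measurements, and the variable names are illustrative, not LLMeter fields.

```python
# Prices from the composite cost model above
price_per_millisecond = 0.01
flat_charge_per_run = 2.0

# Hypothetical request latencies in seconds (illustrative only)
latencies_s = [0.8, 1.2, 1.0]

# Request-level dimension: each request is charged for its own run-time
request_costs = [t * 1000 * price_per_millisecond for t in latencies_s]
request_level_average = sum(request_costs) / len(request_costs)

# Run total: all request-level costs plus the flat run-level charge
run_total = sum(request_costs) + flat_charge_per_run
all_in_per_request = run_total / len(latencies_s)

print(f"Average of request-level costs: ${request_level_average:.2f}")
print(f"Run total / number of requests: ${all_in_per_request:.2f}")
```

The second figure is higher because it amortizes the flat run-level charge across the requests, which the request-level average never sees.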

In [None]:
{k: v for k, v in cdm_result.stats.items() if k.startswith("cost_")}

## Summary

Cost is important, but complex to analyze. From tiered pricing to enterprise agreements to shared infrastructure and resources, a wide range of factors can complicate analysis and comparison.

LLMeter provides flexible tooling for you to define the cost components that are important for your use-case, and draw comparisons that make sense in your context.

While this can include comparing models with very different pricing structures (such as token-based pricing vs infrastructure-based pricing), it's important to remember that such a comparison can only make sense based on an expected workload. With LLMeter, you can run a range of tests with different datasets and concurrency rates, to explore different scenarios and the actual trade-offs observed between them.
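
As a rough illustration of why the expected workload matters, the sketch below estimates the request rate at which a fixed hourly endpoint charge would match token-based charges. Everything here is an assumption for illustration: the helper names are hypothetical, the workload shape is made up, and the calculation ignores storage, discounts, and whether the endpoint can actually sustain that rate at acceptable latency.

```python
def token_cost_per_request(
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Average per-request charge (USD) under per-token pricing."""
    return (
        avg_input_tokens * input_price_per_million
        + avg_output_tokens * output_price_per_million
    ) / 1_000_000


def break_even_requests_per_hour(
    endpoint_price_per_hour: float, per_request_token_cost: float
) -> float:
    """Requests per hour at which both pricing structures cost the same."""
    return endpoint_price_per_hour / per_request_token_cost


# Hypothetical workload: 500 input and 200 output tokens per request on average
per_request = token_cost_per_request(500, 200, 0.25, 1.25)
break_even = break_even_requests_per_hour(2.03, per_request)

print(f"Token-based cost per request: ${per_request:f}")
print(f"Break-even volume: {break_even:.0f} requests/hour")
```

Below the break-even volume, per-token pricing is cheaper under these assumptions; above it, the fixed endpoint is cheaper, provided a load test confirms it can serve that volume.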