<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>
<br>

# <font color="#76b900">TCO For On-Premise and Cloud Inference</font>

In this notebook, you will learn how to estimate the Total Cost of Ownership (`TCO`) for inference workloads, both on-premises (on-prem) and in the cloud. With LLMs, the basic unit of outcome is the token. The cost of a token depends on the LLM size and on the other choices we discussed in previous notebooks. Generally, inference configurations with higher throughput lead to a lower cost per token.

## Learning Objectives
- Estimate the on-premise annual cost under various assumptions.
- Calculate the cost of using cloud APIs for inference.
- Compare the costs between on-premise and cloud deployments with attention to inherent pros and cons.

## Table of Contents

- [**Starting Assumptions**](#Starting-Assumptions)
- [**Peak vs Average Prompts**](#Peak-vs-Average-Prompts)
- [**On-Premise Cost Per Year**](#On-Premise-Cost-Per-Year)
- [**Cloud API Cost Per 1k Prompts**](#Cloud-API-Cost-Per-1k-Prompts)
- [**[EXERCISE] Comparing On-Prem vs Cloud Costs**](#[EXERCISE]-Comparing-On-Prem-vs-Cloud-Costs)

<br><hr>

## **Starting Assumptions**

We start by assuming that you have computed the number of GPUs needed for your use case, following the logic explained in the previous notebooks. Once you have selected a configuration and know how many prompts per second it serves per 8 GPUs, you can compute the number of DGX systems (each with 8 GPUs) as follows:
- Take the requests per second that your application is expected to receive.
- Divide it by the prompts per second per 8 GPUs of your configuration. 

In [28]:
target_prompts_per_second_per_system = 200
prompts_per_second_per_8_gpus = 41.6
num_dgxs_needed = target_prompts_per_second_per_system / prompts_per_second_per_8_gpus

print(f"The required number of servers is {num_dgxs_needed:.1f}")

The required number of servers is 4.8
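
Servers can only be purchased in whole units, so the fractional result above is usually rounded up when provisioning. The following is a minimal sketch of that rounding step; the helper name `servers_needed` is illustrative and not part of the notebook.

```python
import math

def servers_needed(target_prompts_per_second: float,
                   prompts_per_second_per_server: float) -> int:
    """Smallest whole number of 8-GPU servers that covers the target rate."""
    if target_prompts_per_second < 0 or prompts_per_second_per_server <= 0:
        raise ValueError("rates must be non-negative and throughput must be positive")
    # Round up: a fraction of a server still requires a whole server.
    return math.ceil(target_prompts_per_second / prompts_per_second_per_server)

# Same inputs as the cell above (200 prompts/s target, 41.6 prompts/s per 8 GPUs)
whole_servers = servers_needed(200, 41.6)
print(f"Whole servers to provision: {whole_servers}")
```

The rest of the notebook keeps the fractional value, which is convenient for cost estimates that scale linearly with the number of servers.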


<br><hr>

## **Peak vs Average Prompts**

When observing the daily distribution of requests for a production application, there are usually peaks and valleys. The requests may decrease during the night and then spike up during working hours. However, for our estimation of the number of DGX systems needed, we need a constant reference of requests per second, which in this case we set to 200.

Our recommendation is to set that reference to 95% of the expected peak requests per second. This percentage aims to balance the periods where the GPUs are underutilized (the valleys) and the times when the system is at its peak. If we choose a percentage closer to 100%, our average price per token increases since more GPUs are underutilized during the valleys. If the percentage is lower, the average price per token decreases, but during the peaks we may need to relax the latency conditions to keep up with the demand.
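
To make this trade-off concrete, here is a small sketch that relates the chosen reference rate to the average utilization and to the resulting cost per token. The traffic numbers (an average of 150 and a peak of 250 requests per second) are hypothetical, and the helper names are illustrative. The sketch assumes that the yearly cost is fixed by the provisioned capacity, so the cost per token scales with the inverse of utilization.

```python
def average_utilization(average_rate: float, provisioned_rate: float) -> float:
    """Fraction of the provisioned capacity that is used on average."""
    if average_rate < 0 or provisioned_rate <= 0:
        raise ValueError("rates must be non-negative and capacity must be positive")
    return min(average_rate / provisioned_rate, 1.0)

def relative_cost_per_token(utilization: float) -> float:
    """Cost per token relative to a fully utilized system with the same fixed cost."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization

# Hypothetical traffic profile (assumption, not from the notebook)
average_rate = 150.0
peak_rate = 250.0

for fraction_of_peak in (0.80, 0.95, 1.00):
    provisioned = fraction_of_peak * peak_rate
    utilization = average_utilization(average_rate, provisioned)
    print(
        f"Reference at {fraction_of_peak:.0%} of peak: "
        f"utilization {utilization:.2f}, "
        f"relative cost per token {relative_cost_per_token(utilization):.2f}x"
    )
```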



In case you only have an estimate of the average requests per second, we recommend using the [**Poisson distribution**](https://en.wikipedia.org/wiki/Poisson_distribution) to estimate the peak requests. This is a common distribution to model the number of requests in many applications such as web server traffic, call centers or network packets. You can use this code to visualize a distribution and compute a particular percentage of the peak requests, given the average:

In [29]:
average_prompts_per_second_per_system = 200
percentage_peak_requests = 95
sz = 5*60 # how many seconds to simulate

from scipy.stats import poisson
import plotly.graph_objects as go
import numpy as np

rng = np.random.default_rng()
x = np.arange(sz)
y = rng.poisson(average_prompts_per_second_per_system, sz)

y_avg = y.mean()
y_max = y.max()
y_percentile = np.percentile(y, percentage_peak_requests)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=x,
    y=y,
    name='Actual Per Second Rate',
))

def add_constant(name, value):
    fig.add_trace(go.Scatter(
        x=[0, sz],
        y=[value, value],
        name=name,
        mode="lines"
    ))

add_constant(f"Average Over {sz} Seconds", y_avg)
add_constant(f'Max Over {sz} Seconds', y_max)
add_constant(f'{percentage_peak_requests}-th Percentile Over {sz} Seconds', y_percentile)

fig.update_xaxes(title_text="Time (s)")
fig.update_yaxes(title_text="Request Rate (1/s)")
fig.show()

print(f"Distribution lambda: {average_prompts_per_second_per_system}")
print(f"Actual average over {sz} seconds: {y_avg:.2f}")
print(f"Actual max over {sz} seconds: {y_max:.2f}")
print(f"Actual {percentage_peak_requests}-th Percentile over {sz} seconds: {y_percentile:.2f}")
poisson_estimation = poisson.ppf(percentage_peak_requests/100., average_prompts_per_second_per_system)
print(f"The expected {percentage_peak_requests}-th percentile of requests per second is {poisson_estimation}")



Distribution lambda: 200
Actual average over 300 seconds: 200.19
Actual max over 300 seconds: 254.00
Actual 95-th Percentile over 300 seconds: 226.05
The expected 95-th percentile of requests per second is 224.0


The relative gap between the average and the 95-th percentile grows as the request rate drops: the lower the average rate, the larger the peaks are in proportion to it.

In [30]:
from scipy.stats import poisson
import plotly.graph_objects as go
import numpy as np
from plotly.subplots import make_subplots


percentage_peak_requests = 95

poisson_lambda = np.arange(0.5, 50, 0.5)
poisson_percentile = poisson.ppf(percentage_peak_requests/100., poisson_lambda)

fig = make_subplots(rows=1, cols=1, specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(
        x=poisson_lambda,
        y=poisson_percentile,
        name=f'{percentage_peak_requests}-th percentile',
    ),
    secondary_y=True,
)

fig.add_trace(
    go.Scatter(
        x=poisson_lambda,
        y=poisson_percentile/poisson_lambda,
        name=f'{percentage_peak_requests}-th percentile to avg ratio',
    ),
    secondary_y=False,
)

fig.update_xaxes(title_text="Request Rate (1/s)")
fig.update_yaxes(title_text=f"p{percentage_peak_requests} Request Rate (1/s)", secondary_y=True)
fig.update_yaxes(title_text=f"Ratio of p{percentage_peak_requests} Request Rate to Avg", secondary_y=False)
fig.show()
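
The trend in the plot above can be understood with a normal approximation of the Poisson distribution, whose mean and variance both equal the rate. Under that approximation, the percentile is roughly `rate + z * sqrt(rate)`, so the ratio to the average is `1 + z / sqrt(rate)` and shrinks as the rate grows. The sketch below uses only the standard library. The function names are illustrative, and the approximation is rough at low rates.

```python
from statistics import NormalDist

def approx_poisson_percentile(rate: float, percentile: float = 95.0) -> float:
    """Normal approximation of a Poisson percentile (mean = variance = rate)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    z = NormalDist().inv_cdf(percentile / 100.0)  # about 1.645 for the 95-th percentile
    return rate + z * rate ** 0.5

def approx_peak_to_average_ratio(rate: float, percentile: float = 95.0) -> float:
    """Approximate ratio of the percentile to the average, 1 + z / sqrt(rate)."""
    return approx_poisson_percentile(rate, percentile) / rate

for rate in (2, 20, 200):
    print(
        f"rate={rate:>3}: approx p95 = {approx_poisson_percentile(rate):.1f}, "
        f"ratio to average = {approx_peak_to_average_ratio(rate):.2f}"
    )
```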



<br><hr>

## **On-Premise Cost Per Year**

**The on-prem annual cost is computed by adding up three elements:**

1) **Price of purchasing the GPU server:** this price varies depending on the hardware provider and type of system. Usually, the price is amortized over several years. As a simplified depreciation model, we divide the total price by the number of years over which the server depreciates.

2) **Datacenter cost to host the servers:** this includes costs related to electricity, renting of the space and building, staff, etc.

3) **NVIDIA AI Enterprise License per GPU in the server:** NIMs are available to enterprises for production applications via a license, which has an annual cost.


The following code illustrates the three elements. Note that the input prices are estimates in USD, and we recommend that you adjust them for your specific use case.

In [31]:
# Price of purchasing the GPU server
cost_per_server = 320_000
depreciation_period = 4 # in years
cost_per_server_per_year = cost_per_server / depreciation_period

# Datacenter hosting cost
hosting_cost_per_year = 3_000

# NVIDIA AI Enterprise License Cost per server
number_of_gpus_in_server = 8
price_of_license_per_year = 4500 # talk to your distributor to get a quote
cost_of_licenses_per_year = number_of_gpus_in_server * price_of_license_per_year

# Total on-prem cost per year
cost_per_dgx = cost_per_server_per_year + hosting_cost_per_year + cost_of_licenses_per_year
on_prem_cost_per_year = num_dgxs_needed * cost_per_dgx

print(
    f"The yearly on-prem cost for the required {num_dgxs_needed:.1f} servers"
    f" is {on_prem_cost_per_year/1000:.0f}k USD"
)

The yearly on-prem cost for the required 4.8 servers is 572k USD
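
Since the purchase price dominates the yearly cost, the result is sensitive to the depreciation period. As a sketch, the three elements can be wrapped in a function to try different assumptions. The function name and the alternative depreciation periods are illustrative, and the defaults mirror the example prices used above.

```python
def yearly_cost_per_server(
    cost_per_server: float = 320_000,
    depreciation_years: float = 4,
    hosting_cost_per_year: float = 3_000,
    gpus_per_server: int = 8,
    license_per_gpu_per_year: float = 4_500,
) -> float:
    """Yearly cost of one server: amortized purchase + hosting + licenses."""
    if depreciation_years <= 0:
        raise ValueError("depreciation_years must be positive")
    amortized_purchase = cost_per_server / depreciation_years
    licenses = gpus_per_server * license_per_gpu_per_year
    return amortized_purchase + hosting_cost_per_year + licenses

num_servers = 200 / 41.6  # same fractional server count as above

for years in (3, 4, 5):
    per_server = yearly_cost_per_server(depreciation_years=years)
    print(
        f"Depreciation over {years} years: {per_server / 1000:.0f}k USD per server, "
        f"{num_servers * per_server / 1000:.0f}k USD for {num_servers:.1f} servers"
    )
```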


<br><hr>

## **Cloud API Cost Per 1k Prompts**

Several companies offer cloud APIs to call their managed models. The pricing model of these easy-to-use APIs is simple: they charge a fixed amount per input or output token, defined per model. For example, if one million input tokens of a particular model cost 1 USD, three million input tokens cost 3 USD. This pricing model is simpler and more flexible than on-prem, but it can lead to higher costs in the long term. In addition, there is less control over the latency and throughput of the prompts.

Let's suppose that we want to estimate the cost of 1000 prompts, each with 2000 input tokens and 200 output tokens. In the following code we estimate the cloud API costs, but we recommend that you adjust the input and output token prices to match your specific cloud provider. Generally, output tokens are more expensive than input tokens.

In [32]:
# Price for input tokens
cloud_api_cost_per_1M_input_tokens = 1
input_len = 2000
input_cost_per_1M_prompts = cloud_api_cost_per_1M_input_tokens * input_len

# Price for output tokens
cloud_api_cost_per_1M_output_tokens = 3
output_len = 200
output_cost_per_1M_prompts = cloud_api_cost_per_1M_output_tokens * output_len

# Sum input and output cost
cloud_api_cost_per_1k_prompts = (input_cost_per_1M_prompts + output_cost_per_1M_prompts) / 1000
print(f"The cloud API price for 1000 prompts is {cloud_api_cost_per_1k_prompts} USD")

The cloud API price for 1000 prompts is 2.6 USD
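
The same arithmetic can be written as a reusable function and scaled up to a yearly bill. This is a sketch under a strong assumption: the request rate is constant throughout the year. The function names are illustrative, and the prices are the example values used above.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365

def cloud_cost_per_1k_prompts(
    input_len: int,
    output_len: int,
    price_per_1M_input_tokens: float,
    price_per_1M_output_tokens: float,
) -> float:
    """Cloud API cost in USD for 1000 prompts of the given lengths."""
    cost_per_1M_prompts = (
        input_len * price_per_1M_input_tokens
        + output_len * price_per_1M_output_tokens
    )
    return cost_per_1M_prompts / 1000

def cloud_cost_per_year(prompts_per_second: float, cost_per_1k_prompts: float) -> float:
    """Yearly cloud API cost, assuming a constant request rate all year."""
    prompts_per_year = prompts_per_second * SECONDS_PER_YEAR
    return prompts_per_year / 1000 * cost_per_1k_prompts

per_1k = cloud_cost_per_1k_prompts(2000, 200, 1, 3)
per_year = cloud_cost_per_year(200, per_1k)
print(f"Cloud API cost per 1000 prompts: {per_1k} USD")
print(f"Cloud API cost per year at 200 prompts/s: {per_year / 1e6:.1f}M USD")
```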


<br><hr>

## **[EXERCISE] Comparing On-Prem vs Cloud Costs**

Comparing on-prem (self-hosted) versus cloud-based (externally managed) costs is essential to make an educated choice between the two. 

**To have a fair comparison, we assume the following:**
1. The models deployed on-prem and cloud-based are equivalent in quality.
2. The latency and throughput achieved on-prem and in the cloud are similar.

The first step in the cost comparison is to compute the on-prem cost per 1000 prompts:

In [33]:
# Compute prompts per year with on-prem configuration
prompts_per_year_per_8_gpus = prompts_per_second_per_8_gpus * 3600 * 24 * 365

# Compute cost per 1k on-prem prompts
on_prem_cost_per_1k_prompts = on_prem_cost_per_year / prompts_per_year_per_8_gpus * 1000

print(f"The on-prem cost per 1000 prompts is {on_prem_cost_per_1k_prompts:.2f} USD")

The on-prem cost per 1000 prompts is 0.44 USD


This on-prem cost can be directly compared to the cloud API price for 1000 prompts:

In [34]:
print(f"The cloud API price for 1000 prompts is {cloud_api_cost_per_1k_prompts} USD")

The cloud API price for 1000 prompts is 2.6 USD


For this particular use case, on-prem is cheaper: 0.44 USD per 1k prompts, compared to 2.6 USD per 1k prompts in the cloud.
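
Note that the on-prem cell above divides the yearly cost of all servers by the yearly prompts of a single 8-GPU system. As a complementary sketch, the cost can also be expressed at fleet level: the yearly cost of the fleet divided by the prompts the whole fleet serves at an assumed average utilization. The function names and the `utilization` parameter are illustrative and not part of the notebook. The inputs reuse the example values from the previous sections.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365

def fleet_cost_per_1k_prompts(
    fleet_cost_per_year: float,
    fleet_prompts_per_second: float,
    utilization: float = 1.0,
) -> float:
    """On-prem cost per 1000 prompts for the whole fleet at a given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    prompts_per_year = fleet_prompts_per_second * utilization * SECONDS_PER_YEAR
    return fleet_cost_per_year / prompts_per_year * 1000

def break_even_utilization(
    fleet_cost_per_year: float,
    fleet_prompts_per_second: float,
    cloud_cost_per_1k_prompts: float,
) -> float:
    """Utilization at which the on-prem cost per prompt equals the cloud price."""
    full = fleet_cost_per_1k_prompts(fleet_cost_per_year, fleet_prompts_per_second)
    return full / cloud_cost_per_1k_prompts

# Example values from the notebook: 200 / 41.6 servers at 119k USD per year each
fleet_cost = (200 / 41.6) * 119_000
fleet_rate = 200.0

for utilization in (1.0, 0.5, 41.6 / 200):
    cost = fleet_cost_per_1k_prompts(fleet_cost, fleet_rate, utilization)
    print(f"Utilization {utilization:.0%}: {cost:.2f} USD per 1000 prompts")

print(f"Break-even utilization vs cloud: {break_even_utilization(fleet_cost, fleet_rate, 2.6):.1%}")
```

Under these assumptions, a fully utilized fleet costs roughly 0.09 USD per 1000 prompts, and the 0.44 USD figure above corresponds to an average utilization of about 21%.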

To take the analysis further, we can also compare the price per input and output token that the cloud APIs offer with the equivalent on-prem price. Above we took the following values as an example:

In [35]:
print(f"The cloud API cost per 1M input tokens is {cloud_api_cost_per_1M_input_tokens} USD")
print(f"The cloud API cost per 1M output tokens is {cloud_api_cost_per_1M_output_tokens} USD")

The cloud API cost per 1M input tokens is 1 USD
The cloud API cost per 1M output tokens is 3 USD


Similarly, we would like to compute an on-prem cost per 1M input and output tokens. We refer to these two variables as `on_prem_cost_per_1M_input_tokens` and `on_prem_cost_per_1M_output_tokens`. To obtain these two variables, we consider two equations:

1) The input and output token costs add up to the cost of one million prompts, which is 1000 times the cost of 1000 prompts:

```
on_prem_cost_per_1M_input_tokens * input_len
+ on_prem_cost_per_1M_output_tokens * output_len
= on_prem_cost_per_1k_prompts * 1000
```

2) The ratio between input and output token costs is the same on-prem as in the cloud. *Note, this is just an arbitrary assumption to simplify the comparison. There are multiple other ways to fix this ratio. For example, we could instead fix the cost ratio to the ratio of prefill and decoding latencies to approximate GPU utilization by the stages.*

```
cloud_api_cost_per_1M_input_tokens / cloud_api_cost_per_1M_output_tokens
= on_prem_cost_per_1M_input_tokens / on_prem_cost_per_1M_output_tokens
```

This is a system of two equations with two unknown variables, `on_prem_cost_per_1M_input_tokens` and `on_prem_cost_per_1M_output_tokens`. Let's solve it in Python using `sympy`, a popular symbolic mathematics library:

In [36]:
import sympy as sp

def solve_cost_equations(
    on_prem_cost_per_1k_prompts, 
    input_len, output_len, 
    cloud_api_cost_per_1M_input_tokens, 
    cloud_api_cost_per_1M_output_tokens,
):
    # Define the variables, which are represented symbolically
    on_prem_cost_per_1M_input_tokens  = sp.symbols('on_prem_cost_per_1M_input_tokens')
    on_prem_cost_per_1M_output_tokens = sp.symbols('on_prem_cost_per_1M_output_tokens')
    var_symbols = on_prem_cost_per_1M_input_tokens, on_prem_cost_per_1M_output_tokens

    ## Define the equations, which can incorporate both 
    ##    "constant" variables and "symbolic" variables

    ## Equation 1, modeling the cost of 1M prompts (1000x the cost of 1k prompts).
    lhs = on_prem_cost_per_1M_output_tokens * output_len
    lhs += on_prem_cost_per_1M_input_tokens * input_len 
    rhs = on_prem_cost_per_1k_prompts * 1000
    eq1 = sp.Eq(lhs, rhs)

    # TODO: Equation 2, modeling cost ratio of cloud vs on-prem
    eq2 = sp.Eq(
        ## TODO: LHS and RHS, can be done without extra variables
        on_prem_cost_per_1M_input_tokens / on_prem_cost_per_1M_output_tokens,
        cloud_api_cost_per_1M_input_tokens / cloud_api_cost_per_1M_output_tokens
    )

    # Solve the system of equations for the set of wanted symbols
    solution = sp.solve((eq1, eq2), var_symbols)
    
    assert on_prem_cost_per_1M_input_tokens in solution, solution
    assert on_prem_cost_per_1M_output_tokens in solution, solution
    return {
        'on_prem_cost_per_1M_input_tokens': solution[on_prem_cost_per_1M_input_tokens],
        'on_prem_cost_per_1M_output_tokens': solution[on_prem_cost_per_1M_output_tokens]
    }

print(f"""
(Average) {on_prem_cost_per_1k_prompts = }
(Average) {input_len = }
(Average) {output_len = }
{cloud_api_cost_per_1M_input_tokens = }
{cloud_api_cost_per_1M_output_tokens = }
""")

solution = solve_cost_equations(
    on_prem_cost_per_1k_prompts, 
    input_len, output_len, 
    cloud_api_cost_per_1M_input_tokens, 
    cloud_api_cost_per_1M_output_tokens
)

print(f"The on-prem cost per 1M input tokens is {solution['on_prem_cost_per_1M_input_tokens']:.2f} USD")
print(f"The on-prem cost per 1M output tokens is {solution['on_prem_cost_per_1M_output_tokens']:.2f} USD")


(Average) on_prem_cost_per_1k_prompts = 0.43609759223233935
(Average) input_len = 2000
(Average) output_len = 200
cloud_api_cost_per_1M_input_tokens = 1
cloud_api_cost_per_1M_output_tokens = 3

The on-prem cost per 1M input tokens is 0.17 USD
The on-prem cost per 1M output tokens is 0.50 USD
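
The symbolic solution above can be cross-checked by hand. Substituting the ratio equation into the cost equation gives a closed form: with `ratio = cloud_api_cost_per_1M_output_tokens / cloud_api_cost_per_1M_input_tokens`, the input token cost is `on_prem_cost_per_1k_prompts * 1000 / (input_len + ratio * output_len)` and the output token cost is `ratio` times that. A minimal sketch follows. The function name `solve_cost_split` is illustrative.

```python
def solve_cost_split(
    cost_per_1k_prompts: float,
    input_len: int,
    output_len: int,
    reference_cost_per_1M_input_tokens: float,
    reference_cost_per_1M_output_tokens: float,
) -> tuple[float, float]:
    """Split a per-prompt cost into per-1M input/output token costs.

    Assumes the output/input cost ratio matches the reference (cloud) prices.
    """
    ratio = reference_cost_per_1M_output_tokens / reference_cost_per_1M_input_tokens
    cost_per_1M_prompts = cost_per_1k_prompts * 1000
    # cost_in * input_len + (ratio * cost_in) * output_len = cost_per_1M_prompts
    cost_in = cost_per_1M_prompts / (input_len + ratio * output_len)
    cost_out = ratio * cost_in
    return cost_in, cost_out

cost_in, cost_out = solve_cost_split(0.43609759223233935, 2000, 200, 1, 3)
print(f"The on-prem cost per 1M input tokens is {cost_in:.2f} USD")
print(f"The on-prem cost per 1M output tokens is {cost_out:.2f} USD")
```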


With these tools, you can compare the on-prem and cloud costs of your LLM inference application. We hope this helps you make more informed decisions about the cost of your inference deployments.