## Comparing time per output token

In this notebook we load the results of two runs and compare their time per output token (TPOT from now on).  
This analysis also generalizes to dissimilar workloads. Ideally (and excluding non-linear approaches like speculative decoding), the time per output token should be a property of the underlying accelerated compute, independent of the number of input and generated tokens.
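
To make the quantity concrete, the sketch below computes TPOT with one common definition: the decoding time (end-to-end latency minus time to first token) divided by the number of tokens generated after the first. The function and argument names are illustrative, and whether LLMeter computes its `time_per_output_token` field exactly this way is an assumption.

```python
from typing import Optional


def time_per_output_token(
    time_to_last_token: float,
    time_to_first_token: float,
    num_tokens_output: int,
) -> Optional[float]:
    """Average decoding time per token, in the same unit as the inputs.

    Assumed definition (not taken from LLMeter's source): the first token
    is excluded because its latency is dominated by prompt processing.
    """
    if num_tokens_output <= 1:
        return None  # undefined when no token follows the first one
    decode_time = time_to_last_token - time_to_first_token
    return decode_time / (num_tokens_output - 1)


# Hypothetical request: 2.5 s end to end, 0.5 s to first token, 101 tokens.
print(time_per_output_token(2.5, 0.5, 101))
```

Under this definition a longer prompt increases the time to first token but leaves TPOT unchanged, which is why the metric can be compared across runs.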

For the analysis we'll leverage some of the plotting functions provided by LLMeter. These functions use Plotly and can be combined to create custom visualizations.

In [None]:
import plotly.graph_objects as go
import plotly.io as pio

from llmeter.plotting import (
    boxplot_by_dimension,
    histogram_by_dimension,
    scatter_histogram_2d,
)
from llmeter.results import Result

Set the Plotly template for the rest of the notebook to `plotly_white`.

In [None]:
pio.templates.default = "plotly_white"

## Loading dataset

Load the two datasets. The comparison is most meaningful when the two runs are compatible, that is, when the number of input tokens is the same or close.

In [None]:
result_1 = Result.load("<path of saved results of first run>")
result_2 = Result.load("<path of saved results of second run>")

In [None]:
dimension = "time_per_output_token"

## TPOT vs. number of output tokens

In [None]:
fig = scatter_histogram_2d(result_1, "num_tokens_output", dimension, 20, 20)
fig.update_layout(title=result_1.run_name)
fig

In [None]:
fig = scatter_histogram_2d(result_2, "num_tokens_output", dimension, 20, 20)
fig.update_layout(title=result_2.run_name)
fig
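
The visual impression from the plots above can be backed by a number. The sketch below computes the Spearman rank correlation between output length and TPOT with `scipy.stats.spearmanr`; a value near zero is consistent with TPOT being independent of the number of generated tokens. The helper `tpot_token_correlation` is hypothetical (not part of LLMeter), and the data is synthetic.

```python
import numpy as np
from scipy.stats import spearmanr


def tpot_token_correlation(num_tokens, tpot):
    """Spearman rank correlation between output length and TPOT.

    Pairs with a missing (None) value on either side are dropped.
    """
    pairs = [
        (n, t) for n, t in zip(num_tokens, tpot) if n is not None and t is not None
    ]
    tokens, times = zip(*pairs)
    rho, p_value = spearmanr(tokens, times)
    return float(rho), float(p_value)


# Synthetic stand-in data (not real measurements): TPOT independent of length.
rng = np.random.default_rng(0)
tokens = rng.integers(50, 500, size=300).tolist()
flat_tpot = (0.02 + rng.normal(0.0, 0.002, size=300)).tolist()
rho_flat, p_flat = tpot_token_correlation(tokens, flat_tpot)

# Synthetic counter-example: TPOT grows with the number of output tokens.
growing_tpot = [0.02 + 1e-5 * n for n in tokens]
rho_growing, _ = tpot_token_correlation(tokens, growing_tpot)

print(f"independent: rho={rho_flat:.2f}, growing: rho={rho_growing:.2f}")
```

With real data, the two lists could come from `result_1.get_dimension("num_tokens_output")` and `result_1.get_dimension(dimension)`, assuming both return values in the same per-request order.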

## Distribution comparison

To better understand the actual distribution of the TPOT, we can inspect it with boxplots or histograms. We'll start by creating a boxplot for each run using `boxplot_by_dimension()`, and then combine them to provide a clear comparison.

In [None]:
fig = go.Figure()

tr1 = boxplot_by_dimension(result=result_1, dimension=dimension)
tr2 = boxplot_by_dimension(result=result_2, dimension=dimension)
fig.add_traces(
    [
        tr1,
        tr2,
    ]
)

# use log scale for the time axis
fig.update_xaxes(type="log")

fig.update_layout(
    legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99),
    title=f"Comparison of {dimension.replace('_', ' ').capitalize()}",
)
fig

We can also create a histogram to visualize the two distributions using `histogram_by_dimension()`. This function is based on Plotly's `go.Histogram()` and accepts its keyword arguments. In this case, we set the histogram bin size to 1 ms (`size=0.001`, since TPOT is reported in seconds).

In [None]:
# 1 ms bins (values are in seconds), shared by both histograms
xbins = dict(size=0.001)

fig = go.Figure()
h1 = histogram_by_dimension(
    result_1,
    dimension,
    xbins=xbins,
)
h2 = histogram_by_dimension(
    result_2,
    dimension,
    xbins=xbins,
)

fig.add_traces([h1, h2])

fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99))
fig
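
The same distributions can also be summarized numerically, which is convenient when the plots need to be reported as a few figures. The sketch below computes selected percentiles with `np.percentile`; the helper `summarize_tpot` and the sample values are hypothetical.

```python
import numpy as np


def summarize_tpot(values, percentiles=(50, 90, 99)):
    """Return {"p50": ..., "p90": ..., ...} for the non-missing values."""
    clean = np.asarray([v for v in values if v is not None], dtype=float)
    return {f"p{p}": float(np.percentile(clean, p)) for p in percentiles}


# Synthetic stand-in data, in seconds (not real measurements).
example = [0.018, 0.019, 0.020, None, 0.021, 0.050]
print(summarize_tpot(example))
```

With loaded results, `summarize_tpot(result_1.get_dimension(dimension))` would produce the same summary for a run.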

## Estimating the difference of the median values

The data is potentially highly skewed. Assuming there are enough representative data points, we'll use bootstrapping to estimate confidence intervals on the statistic of interest, in this case the median.
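
As an illustration of the idea, the sketch below implements a plain percentile bootstrap using only the standard library: resample the data with replacement many times, compute the median of each resample, and take the empirical quantiles of those medians as the interval. This is a simplified variant for intuition; `scipy.stats.bootstrap` defaults to the BCa method, so its intervals can differ. The function name and the synthetic data are assumptions.

```python
import random
import statistics


def percentile_bootstrap_median(data, n_resamples=2000, confidence_level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Median of each resample (drawn with replacement, same size as the data)
    medians = sorted(
        statistics.median(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence_level) / 2.0
    low_idx = int(alpha * (n_resamples - 1))
    high_idx = int((1.0 - alpha) * (n_resamples - 1))
    return medians[low_idx], medians[high_idx]


# Synthetic, right-skewed sample standing in for TPOT values (seconds).
sample_rng = random.Random(42)
sample = [sample_rng.lognormvariate(-4.0, 0.5) for _ in range(200)]

low, high = percentile_bootstrap_median(sample)
print(f"median={statistics.median(sample):.3g} CI=({low:.3g}, {high:.3g})")
```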

In [None]:
import numpy as np
from scipy.stats import bootstrap

#### Median

In [None]:
# drop missing (None) values; `is not None` keeps legitimate zeros
data_1 = [k for k in result_1.get_dimension(dimension) if k is not None]
data_2 = [k for k in result_2.get_dimension(dimension) if k is not None]

In [None]:
res = bootstrap((data_1,), np.median, confidence_level=0.95)
print(
    f"Median of {dimension} for {result_1.run_name}\n"
    f"{np.median(data_1):.3g} ({res.confidence_interval.low:.3g}, {res.confidence_interval.high:.3g})s"
)

In [None]:
res = bootstrap((data_2,), np.median, confidence_level=0.95)
print(
    f"Median of {dimension} for {result_2.run_name}\n"
    f"{np.median(data_2):.3g} ({res.confidence_interval.low:.3g}, {res.confidence_interval.high:.3g})s"
)

#### Difference between medians

In [None]:
def obj_f(sample1, sample2, axis=-1):
    """Difference of medians (sample2 - sample1), vectorized along `axis`."""
    median_1 = np.median(sample1, axis=axis)
    median_2 = np.median(sample2, axis=axis)
    return median_2 - median_1

In [None]:
data = (data_1, data_2)

res = bootstrap(data, obj_f, confidence_level=0.95)

# obj_f returns median_2 - median_1, so a positive value means the second run is slower
print(
    f"Difference in median {dimension} ({result_2.run_name} minus {result_1.run_name}) is\n"
    f"{obj_f(data[0], data[1]):.3g} ({res.confidence_interval.low:.3g}, {res.confidence_interval.high:.3g})s"
)
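
One common way to read the resulting interval: if it excludes zero, the data supports a difference in medians at the chosen confidence level; if it contains zero, the comparison is inconclusive. A minimal sketch of that rule follows, using a hypothetical helper that is not part of LLMeter or SciPy and assuming the difference is computed as second run minus first run.

```python
def interpret_difference(ci_low: float, ci_high: float) -> str:
    """Classify a confidence interval for (median_2 - median_1)."""
    if ci_low > 0:
        return "run 2 has a higher median TPOT (slower)"
    if ci_high < 0:
        return "run 2 has a lower median TPOT (faster)"
    return "inconclusive: the interval contains zero"


# Hypothetical interval bounds, in seconds.
print(interpret_difference(0.001, 0.003))
print(interpret_difference(-0.002, 0.001))
```

With the bootstrap result above, the call would be `interpret_difference(res.confidence_interval.low, res.confidence_interval.high)`.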
