# Large Model Inference Testing Analysis

This notebook helps us analyze the results of the large model inference testing. We start by specifying the `model` whose results we want to analyze. We then list all the folders under the specified `model` folder; each folder corresponds to a [Deep Java Library LMI engine](https://docs.djl.ai/docs/serving/serving/docs/lmi/conceptual_guide/lmi_engine.html).

In [None]:
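import os
import glob
from pathlib import Path

# Name of the model folder under ./output whose results we want to analyze.
model = "llama-2-7b-chat-hf"
assert model is not None, "Please specify model"

# Results are read from <current working directory>/output/<model>.
directory = os.path.join(Path().resolve(), "output", model)
path = Path(directory)

# One sub-directory per LMI engine; sorted so the display order is stable.
lmi_dirs = sorted(p for p in path.iterdir() if p.is_dir())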

## Find Latest Results

For each [LMI engine](https://docs.djl.ai/docs/serving/serving/docs/lmi/conceptual_guide/lmi_engine.html), we scan the `results-*.json` files for the specified model and select the latest one, that is, the file with the most recent modification time.

In [None]:
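import os
import glob

# For each engine directory, keep the most recently modified results file.
# Note: max() raises ValueError if a directory contains no results-*.json file.
results_latest = []
for lmi_dir in lmi_dirs:
    all_files = glob.glob(os.path.join(lmi_dir, "results-*.json"))
    results_latest.append(max(all_files, key=os.path.getmtime))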
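
The selection above fails with a `ValueError` when an engine folder has no results yet. The following is a minimal sketch of a more defensive variant, not part of the original notebook: the helper name `latest_result` is hypothetical, and the demonstration runs on a temporary directory with made-up file names.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_result(engine_dir: Path, pattern: str = "results-*.json") -> Optional[Path]:
    """Return the most recently modified match in engine_dir, or None if there is none."""
    candidates = list(Path(engine_dir).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demonstration on a throwaway directory with two fake results files.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "results-old.json"
    new = Path(tmp) / "results-new.json"
    old.write_text("{}\n")
    new.write_text("{}\n")
    # Set explicit modification times so the ordering does not depend on timing.
    os.utime(old, (1_000, 1_000))
    os.utime(new, (2_000, 2_000))
    latest_name = latest_result(Path(tmp)).name

    # A directory without any results file yields None instead of raising.
    empty_dir = Path(tmp) / "empty"
    empty_dir.mkdir()
    missing = latest_result(empty_dir)

print(latest_name, missing)
```

If a helper like this is used in the notebook, engine folders that return `None` need to be skipped consistently in the later cells, because those cells pair `pdfs[i]` with `lmi_dirs[i]` by position.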


## Read the Results into Pandas DataFrames

We read the latest results for the specified `model` into Pandas data frames, using one data frame for each [Deep Java Library LMI engine](https://docs.djl.ai/docs/serving/serving/docs/lmi/conceptual_guide/lmi_engine.html).

In [None]:
import pandas as pd

# lines=True reads each file as JSON Lines: one JSON record per line.
pdfs = [pd.read_json(result, lines=True) for result in results_latest]
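
To illustrate what `lines=True` expects, here is a small sketch that parses two hand-written records from an in-memory string. The records and their values are invented for illustration and only use column names described in this notebook; real results files may contain more fields.

```python
import io

import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample = (
    '{"prompt": "Hi", "text": "Hi there", "n_tokens": 2, "latency": 0.5}\n'
    '{"prompt": "Bye", "text": "Bye now", "n_tokens": 2, "latency": 0.25}\n'
)

# Each line becomes one row; the keys become the columns.
sample_df = pd.read_json(io.StringIO(sample), lines=True)
print(sample_df)
```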


## Display Top Rows from Each Data Frame

For each Pandas data frame, we display `top_n` rows. The columns in each data frame are as follows:

1. `prompt` column contains the prompt
2. `text` column contains the complete generated text, which includes the `prompt`
3. `n_tokens` column contains the number of tokens in `text`
4. `latency` column contains the total request latency, measured from sending the request to receiving the complete response
5. `tps` column contains tokens per second, which is the number of tokens divided by the request latency
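
As a worked example of the `tps` definition, the sketch below computes tokens per second for one made-up request. It assumes `latency` is expressed in seconds, which the notebook does not state; the function name `tokens_per_second` is hypothetical.

```python
def tokens_per_second(n_tokens: int, latency: float) -> float:
    """Divide the token count by the request latency (assumed to be in seconds)."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    return n_tokens / latency


# A made-up request: 120 tokens returned after 4 seconds.
example_tps = tokens_per_second(120, 4.0)
print(example_tps)  # 30.0
```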

In [None]:
from IPython.display import display

top_n = 2  # number of rows to preview per engine
for i, df in enumerate(pdfs):
    caption = f"{model}/{os.path.basename(lmi_dirs[i])}".upper()
    # Keep the first top_n rows and number them from 1 for display.
    df = df.head(top_n).copy()
    df.index += 1
    styled = (
        df.style
        .format(precision=5)
        .format_index(str.upper, axis=1)
        .set_properties(**{'text-align': 'left'})
        .set_caption(caption)
    )
    display(styled)

## Results Metrics

Below we show the key result metrics for every request: `n_tokens`, `latency`, `tps`, and `ttft` (time to first token).

In [None]:
from IPython.display import display

for i, df in enumerate(pdfs):
    caption = f"{model}/{os.path.basename(lmi_dirs[i])}".upper()
    # Select only the metric columns; the copy leaves pdfs[i] untouched.
    df = df[['n_tokens', 'latency', 'tps', 'ttft']].copy()
    df.index += 1
    styled = (
        df.style
        .format(precision=5)
        .format_index(str.upper, axis=1)
        .set_properties(**{'text-align': 'left'})
        .set_caption(caption)
    )
    display(styled)
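
The per-request tables can be long, so a per-engine summary may be easier to compare. The sketch below shows one possible way to do this with `pd.concat` and `groupby`. It is not part of the original notebook: the engine names and all numbers are made up, and the dictionary `frames` stands in for the real `pdfs` list.

```python
import pandas as pd

# Hypothetical per-engine frames standing in for `pdfs`; the numbers are made up.
frames = {
    "engine-a": pd.DataFrame(
        {"n_tokens": [100, 200], "latency": [2.0, 4.0], "tps": [50.0, 50.0]}
    ),
    "engine-b": pd.DataFrame(
        {"n_tokens": [100, 200], "latency": [4.0, 8.0], "tps": [25.0, 25.0]}
    ),
}

# Stack the frames; the dictionary keys become the outer index level "engine".
combined = pd.concat(frames, names=["engine", "request"])

# Mean latency and throughput per engine.
summary = combined.groupby(level="engine")[["latency", "tps"]].mean()
print(summary)
```

With the real data, a dictionary such as `{d.name: df for d, df in zip(lmi_dirs, pdfs)}` could take the place of `frames`, assuming both lists are in the same order.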