# Tutorial 2 - Basic Workflow - Add Your Own Tests 

**Scenario**: 

You have developed a chatbot and you want to test how it performs for your unique use case.

In this case, you have already identified a set of questions to test your chatbot and their correct answers. You want to use Moonshot to administer these tests to your chatbot to benchmark its performance.

How can you add your custom dataset into Moonshot and run it with your system?

In this tutorial, you will learn how to:

- Add your own `dataset` into Moonshot
    1. Create the dataset manually.
    2. Convert a CSV dataset to a Moonshot dataset through the API.
    3. Download a Hugging Face dataset as a Moonshot dataset through the API.
- Create and run your own `recipe`
- Create and run your own `cookbook`

Prerequisite:

1. You have added your OpenAI connector configuration named `my-openai-endpoint` in Moonshot. If you are unsure how to do it, please refer to "<b>Tutorial 1</b>" in the same folder.

**Before starting this tutorial, please make sure you have already installed `moonshot` and `moonshot-data`.**<br>
Otherwise, please refer to "<b>Moonshot - Pre-Req - Setup.ipynb</b>" to install and configure Moonshot first.

## Import and configure Moonshot

In this section, we prepare our Jupyter notebook environment by importing the libraries needed to add datasets and to create and run benchmarks.

> ⚠️ **Note:** Check that `moonshot_data_path` below matches the location where you installed `moonshot-data` and edit the code to match your location if needed.

In [1]:
# Python built-ins:
import os
import json
import sys

# IF you're running this notebook from the moonshot/examples/jupyter-notebook folder, the below
# line will enable you to import moonshot from the local source code. If you installed moonshot
# from pip, you can remove this:
sys.path.insert(0, '../../')

# Import moonshot utilities:
from moonshot.api import (
    api_create_recipe,
    api_create_cookbook,
    api_convert_dataset,
    api_download_dataset,
    api_load_runner,
    api_set_environment_variables
)

# Environment Configuration
# Here we set up the environment variables for the Moonshot framework.
# These variables define the paths to various modules and components used by Moonshot,
# organizing the framework's structure and access points.

# modify moonshot_data_path to point to your own copy of moonshot-data
moonshot_data_path = "./moonshot-data"
env = {
    "ATTACK_MODULES": os.path.join(moonshot_data_path, "attack-modules"),
    "BOOKMARKS": os.path.join(moonshot_data_path, "generated-outputs/bookmarks"),
    "CONNECTORS": os.path.join(moonshot_data_path, "connectors"),
    "CONNECTORS_ENDPOINTS": os.path.join(moonshot_data_path, "connectors-endpoints"),
    "CONTEXT_STRATEGY": os.path.join(moonshot_data_path, "context-strategy"),
    "COOKBOOKS": os.path.join(moonshot_data_path, "cookbooks"),
    "DATABASES": os.path.join(moonshot_data_path, "generated-outputs/databases"),
    "DATABASES_MODULES": os.path.join(moonshot_data_path, "databases-modules"),
    "DATASETS": os.path.join(moonshot_data_path, "datasets"),
    "IO_MODULES": os.path.join(moonshot_data_path, "io-modules"),
    "METRICS": os.path.join(moonshot_data_path, "metrics"),
    "PROMPT_TEMPLATES": os.path.join(moonshot_data_path, "prompt-templates"),
    "RECIPES": os.path.join(moonshot_data_path, "recipes"),
    "RESULTS": os.path.join(moonshot_data_path, "generated-outputs/results"),
    "RESULTS_MODULES": os.path.join(moonshot_data_path, "results-modules"),
    "RUNNERS": os.path.join(moonshot_data_path, "generated-outputs/runners"),
    "RUNNERS_MODULES": os.path.join(moonshot_data_path, "runners-modules"),
}

# Check user has set moonshot_data_path correctly:
if not os.path.isdir(env["ATTACK_MODULES"]):
    raise ValueError(
        "Configured path %s does not exist. Is moonshot-data installed at %s?"
        % (env["ATTACK_MODULES"], moonshot_data_path)
    )

# Apply the environment variables to configure the Moonshot framework.
api_set_environment_variables(env)

# Note: you might see a warning that IProgress was not found. It can be ignored for now.

  from .autonotebook import tqdm as notebook_tqdm
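
The configuration cell above only verifies that the `ATTACK_MODULES` folder exists. If you want a broader check, a minimal sketch along these lines could list every configured path that is missing. `find_missing_dirs` is a hypothetical helper written for this tutorial, not part of the Moonshot API, and the demo uses a temporary folder as a stand-in for `moonshot-data`. Note that some folders (for example those under `generated-outputs`) may legitimately not exist before the first run.

```python
import os
import tempfile


def find_missing_dirs(env):
    """Return the entries of env whose paths are not existing directories."""
    return {key: path for key, path in env.items() if not os.path.isdir(path)}


# Demo with a throwaway directory standing in for moonshot-data.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "datasets"))
    demo_env = {
        "DATASETS": os.path.join(root, "datasets"),
        "RECIPES": os.path.join(root, "recipes"),  # deliberately not created
    }
    missing = find_missing_dirs(demo_env)

print(sorted(missing))
```

In the notebook you could call `find_missing_dirs(env)` on the real `env` dictionary and review the returned keys before continuing.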


## Prepare the Dataset

In this section, we show how to prepare a Moonshot dataset.

Suppose you have a list of "fruits" questions to ask your chatbot. You need to arrange them in a data schema that is compatible with Moonshot:

- `name` (str): name of the data
- `description` (str): description of the dataset
- `license` (str): license of the data
- `reference` (str): a link/reference to where the dataset is from (or author of the dataset)
- `examples` (list): A list of dictionaries, each containing the prompt (`input`) and ground truth (`target`).<br>
A `target` can be left blank.

In [3]:
test_dataset = {
    "name": "Fruits Dataset",
    "description":"Measures whether the model knows what is a fruit",
    "license": "MIT license",
    "reference": "",
    "examples": [
        {
            "input": "Is Lemon a Fruit? Answer Yes or No.",
            "target": "Yes."
        },
        {
            "input": "Is Apple a Fruit? Answer Yes or No.",
            "target": "Yes."
        },
        {
            "input": "Is Bak Choy a Fruit? Answer Yes or No.",
            "target": "No."
        },
        {
            "input": "Is Bak Kwa a Fruit? Answer Yes or No.",
            "target": "No."
        },
        {
            "input": "Is Dragonfruit a Fruit? Answer Yes or No.",
            "target": "Yes."
        },
        {
            "input": "Is Orange a Fruit? Answer Yes or No.",
            "target": "Yes."
        },
        {
            "input": "Is Coke Zero a Fruit? Answer Yes or No.",
            "target": "No."
        }
    ]
}

in_file = f"{moonshot_data_path}/datasets/test-dataset.json"
# Use a context manager so the file is flushed and closed before we check for it
with open(in_file, "w") as f:
    json.dump(test_dataset, f, indent=2)
if os.path.exists(in_file):
    print("Dataset 'test-dataset' has been created.")

Dataset 'test-dataset' has been created.
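
Before writing a hand-made dataset to disk, it can help to check it against the schema listed above. The following is a minimal sketch of such a check. `validate_dataset` is a hypothetical helper, not a Moonshot function, and it only encodes the fields described in this tutorial (string metadata fields, a list of examples, a required `input`, an optional `target`).

```python
def validate_dataset(dataset):
    """Return a list of problems; an empty list means the dict matches the schema above."""
    problems = []
    for key in ("name", "description", "license", "reference"):
        if not isinstance(dataset.get(key), str):
            problems.append(f"'{key}' should be a string")

    examples = dataset.get("examples")
    if not isinstance(examples, list):
        problems.append("'examples' should be a list")
        return problems

    for i, example in enumerate(examples):
        if not isinstance(example, dict) or not isinstance(example.get("input"), str):
            problems.append(f"examples[{i}] needs an 'input' string")
        elif "target" in example and not isinstance(example["target"], str):
            # The target may be left blank, but if present it should be text.
            problems.append(f"examples[{i}] has a non-string 'target'")
    return problems


good = {
    "name": "Fruits Dataset",
    "description": "Measures whether the model knows what is a fruit",
    "license": "MIT license",
    "reference": "",
    "examples": [{"input": "Is Lemon a Fruit? Answer Yes or No.", "target": "Yes."}],
}
bad = {"name": "Fruits Dataset", "examples": [{"target": "Yes."}]}

print(validate_dataset(good))
print(validate_dataset(bad))
```

Calling `validate_dataset(test_dataset)` before `json.dump` would surface missing fields early, instead of at benchmark time.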


### Convert CSV dataset to Moonshot dataset

You can also convert a CSV file to a Moonshot dataset.

This can be useful if you have data in CSV format and want to use it for benchmarking in Moonshot.

By converting the CSV to the required JSON format, you can easily integrate your data into the Moonshot framework and run your tests seamlessly.

Let's take a look at a sample CSV file (`jupyter-assets-csv-file.csv`) that contains our dataset.

In [4]:
with open("assets/jupyter-assets-csv-file.csv", "r") as file:
    csv_content = file.read()
print(csv_content)

input,target
"Is Lemon a Fruit? Answer Yes or No.","Yes."
"Is Apple a Fruit? Answer Yes or No.","Yes."
"Is Bak Choy a Fruit? Answer Yes or No.","No."
"Is Bak Kwa a Fruit? Answer Yes or No.","No."
"Is Dragonfruit a Fruit? Answer Yes or No.","Yes."
"Is Orange a Fruit? Answer Yes or No.","Yes."
"Is Coke Zero a Fruit? Answer Yes or No.","No."


We will use the provided `api_convert_dataset` API function, which takes the following arguments:

- `name` (str): name of the dataset
- `description` (str): description of the dataset
- `reference` (str): a link/reference to where the dataset is from (or author of the dataset)
- `license` (str): license of the dataset
- `file_path` (str): path to the CSV file to convert. As in the sample above, the file has an `input` column for the prompt and a `target` column for the ground truth.<br>
A `target` can be left blank.

In [6]:
api_convert_dataset(
    name="Fruits CSV Dataset",
    description="Measures whether the model knows what is a fruit",
    reference="",
    license="MIT license",
    file_path="assets/jupyter-assets-csv-file.csv"
)

'moonshot-data/datasets/fruits-csv-dataset.json'

Let's take a look at what the converted dataset `fruits-csv-dataset.json` contains:

In [7]:
with open("moonshot-data/datasets/fruits-csv-dataset.json", "r") as json_file:
    json_content = json_file.read()
print(json_content)


{
  "id": "fruits-csv-dataset",
  "name": "Fruits CSV Dataset",
  "description": "Measures whether the model knows what is a fruit",
  "reference": "",
  "license": "MIT license",
  "examples": [
    {"input": "Is Lemon a Fruit? Answer Yes or No.", "target": "Yes."},
    {"input": "Is Apple a Fruit? Answer Yes or No.", "target": "Yes."},
    {"input": "Is Bak Choy a Fruit? Answer Yes or No.", "target": "No."},
    {"input": "Is Bak Kwa a Fruit? Answer Yes or No.", "target": "No."},
    {"input": "Is Dragonfruit a Fruit? Answer Yes or No.", "target": "Yes."},
    {"input": "Is Orange a Fruit? Answer Yes or No.", "target": "Yes."},
    {"input": "Is Coke Zero a Fruit? Answer Yes or No.", "target": "No."}
  ]
}
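
To make the mapping between the CSV rows and the JSON above concrete, here is a rough sketch of an equivalent conversion using only the Python standard library. It is not Moonshot's implementation of `api_convert_dataset`. The helper names `slugify_name` and `csv_to_dataset` are hypothetical, and the rule used to derive the `id` from the name is an assumption inferred from the output shown above.

```python
import csv
import io
import json
import re


def slugify_name(name):
    """Lowercase the name and replace runs of non-alphanumerics with hyphens (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def csv_to_dataset(csv_text, name, description, reference, license_text):
    """Build a Moonshot-style dataset dict from CSV text with 'input' and 'target' columns."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {
        "id": slugify_name(name),
        "name": name,
        "description": description,
        "reference": reference,
        "license": license_text,
        "examples": [{"input": row["input"], "target": row["target"]} for row in rows],
    }


sample_csv = (
    "input,target\n"
    '"Is Lemon a Fruit? Answer Yes or No.","Yes."\n'
    '"Is Bak Choy a Fruit? Answer Yes or No.","No."\n'
)
dataset = csv_to_dataset(
    sample_csv,
    "Fruits CSV Dataset",
    "Measures whether the model knows what is a fruit",
    "",
    "MIT license",
)
print(json.dumps(dataset, indent=2))
```

In practice, prefer `api_convert_dataset`, which also writes the file into the configured datasets folder.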



### Download Hugging Face dataset to Moonshot dataset

You can easily download a dataset from Hugging Face and convert it into a Moonshot dataset. 

This allows you to leverage the extensive collection of datasets available on Hugging Face for your benchmarking needs in Moonshot.

By integrating Hugging Face datasets, you can expand the variety and scope of your tests, ensuring comprehensive evaluation of your models.

Let's take a look at an example Hugging Face dataset:<br>
https://huggingface.co/datasets/cais/mmlu

You may use the full-screen viewer to have a closer look at the dataset:<br>
https://huggingface.co/datasets/cais/mmlu/viewer/college_biology/dev

The dataset has multiple subsets (abstract_algebra, all, college_biology, ...), splits (test, validation, dev), and the following columns to choose from:
1. question (string)
2. subject (string)
3. choices (list)
4. answer (string)

We will use the provided `api_download_dataset` API function, which takes the following arguments:

- `name` (str): name of the dataset
- `description` (str): description of the dataset
- `reference` (str): a link/reference to where the dataset is from (or author of the dataset)
- `license` (str): license of the dataset
- `kwargs`: additional keyword arguments for downloading the dataset from Hugging Face. The call below passes `dataset_name`, `dataset_config`, `split`, `input_col` and `target_col`.

In [8]:
api_download_dataset(
    name="College Biology Dataset",
    description="Multiple-choice questions from various branches of knowledge",
    reference="",
    license="MIT license",
    dataset_name='cais/mmlu',
    dataset_config='college_biology',
    split='dev',
    input_col=['question', 'choices'],
    target_col='answer'
)

'moonshot-data/datasets/college-biology-dataset.json'

Let's take a look at what the converted dataset `college-biology-dataset.json` contains:

In [9]:
with open("moonshot-data/datasets/college-biology-dataset.json", "r") as json_file:
    json_content = json_file.read()
print(json_content)

{
  "id": "college-biology-dataset",
  "name": "College Biology Dataset",
  "description": "Multiple-choice questions from various branches of knowledge",
  "reference": "",
  "license": "MIT license",
  "examples": [
    {"input": "Which of the following represents an accurate statement concerning arthropods? ['They possess an exoskeleton composed primarily of peptidoglycan.', 'They possess an open circulatory system with a dorsal heart.', 'They are members of a biologically unsuccessful phylum incapable of exploiting diverse habitats and nutrition sources.', 'They lack paired, jointed appendages.']", "target": "1"},
    {"input": "In a given population, 1 out of every 400 people has a cancer caused by a completely recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is the expected proportion of individuals who carry the b allele but are not expected to develop the cancer? ['1/400', '19/400', '20/400', '38/400']", "target": "3"},
    {
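
Judging from the output above, the columns listed in `input_col` appear to be joined into a single `input` string (the question followed by the printed list of choices), and the `target_col` value is stored as text. The sketch below reproduces that shape for one row. It is an assumption based on the printed output, not Moonshot's actual code. `build_example` is a hypothetical helper and the row is made up for illustration.

```python
def build_example(row, input_cols, target_col):
    """Join the chosen input columns with spaces and stringify the target column."""
    return {
        "input": " ".join(str(row[col]) for col in input_cols),
        "target": str(row[target_col]),
    }


# Made-up row with the same columns as cais/mmlu (question, choices, answer).
row = {
    "question": "Which fruit is yellow?",
    "choices": ["Lemon", "Plum"],
    "answer": 0,
}
example = build_example(row, ["question", "choices"], "answer")
print(example)
```

Note that the target in the real output is a number such as `"1"`, which looks like a zero-based index into `choices`. A model that replies with the answer text or a letter would then not match the target verbatim, so you may want a prompt template or a different `target_col` for this kind of dataset.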

## Create a new recipe

To run these datasets, you need to create two new <b>recipes</b>.

A <b>recipe</b> contains all the details required to run a benchmark.<br>
A <b>recipe</b> guides Moonshot on what data to use, and how to evaluate the model's responses.

To create a new <b>recipe</b>, you need the following elements:

1. **Name**: A unique name for the recipe.
2. **Description**: An explanation of what the recipe does and what it's for.
3. **Tags**: Keywords that categorise the recipe, making it easier to find and group with similar recipes.
4. **Categories**: Broader classifications that help organise recipes into collections.
5. **Datasets**: The data that will be used when running the recipe. This could be a set of prompts, questions, or any input that the model will respond to.
6. **Prompt Templates**: Static pre-prompt or post-prompt text that will be prepended or appended to the prompt.
7. **Metrics**: Criteria or measurements used to evaluate the model's responses, such as accuracy, fluency, or adherence to a prompt.
8. **Grading Scale**: A set of thresholds or criteria used to grade or score the model's performance.
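
To illustrate the grading scale, the sketch below maps a numeric score to a letter grade using the same `[low, high]` bands that the recipes in this tutorial use. `grade_for` is a hypothetical helper written for illustration. How Moonshot itself rounds scores or treats values that fall between bands is not covered here.

```python
GRADING_SCALE = {
    "A": [80, 100],
    "B": [60, 79],
    "C": [40, 59],
    "D": [20, 39],
    "E": [0, 19],
}


def grade_for(score, scale):
    """Return the first grade whose inclusive [low, high] band contains score, else None."""
    for grade, (low, high) in scale.items():
        if low <= score <= high:
            return grade
    return None


print(grade_for(85, GRADING_SCALE))    # A
print(grade_for(59, GRADING_SCALE))    # C
print(grade_for(79.5, GRADING_SCALE))  # None: 79.5 sits in the gap between B and A
```

The last call shows why integer bands can leave gaps for fractional scores. If your metric returns decimals, consider how you want such values to be graded.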

In [10]:
test_recipe = api_create_recipe(
    "Fruit Questions", # name (mandatory)
    "This recipe is created to test model's ability in answering fruits question.", # description (mandatory)
    ["chatbot"], # tags (optional)
    ["capability"], # category (optional)
    ["test-dataset", "fruits-csv-dataset"], # filename of the dataset (mandatory)
    [], # prompt templates (optional)
    ["exactstrmatch", "bertscore" ], # metrics (mandatory)
    { # grading scale (optional)
        "A": [
            80,
            100
        ],
        "B": [
            60,
            79
        ],
        "C": [
            40,
            59
        ],
        "D": [
            20,
            39
        ],
        "E": [
            0,
            19
        ]
    }
)

print(f"Recipe '{test_recipe}' has been created.")


Recipe 'fruit-questions' has been created.


In [11]:
test_college_recipe = api_create_recipe(
    "Biology Questions", # name (mandatory)
    "This recipe is created to test model's ability in answering college biology question.", # description (mandatory)
    ["chatbot"], # tags (optional)
    ["capability"], # category (optional)
    ["college-biology-dataset"], # filename of the dataset (mandatory)
    [], # prompt templates (optional)
    ["exactstrmatch", "bertscore"], # metrics (mandatory)
    { # grading scale (optional)
        "A": [
            80,
            100
        ],
        "B": [
            60,
            79
        ],
        "C": [
            40,
            59
        ],
        "D": [
            20,
            39
        ],
        "E": [
            0,
            19
        ]
    }
)

print(f"Recipe '{test_college_recipe}' has been created.")

Recipe 'biology-questions' has been created.


## Run your new recipe

You can now run these new recipes against your `connector endpoint`. We will run them on the endpoint `my-openai-endpoint`.

In [12]:
from slugify import slugify
from moonshot.api import api_get_all_run, api_create_runner, api_get_all_runner_name

name = "my new recipe runner"  # Name of the runner
recipes = ["fruit-questions", "biology-questions"]  # Test two recipes. You can add more recipes to this list
endpoints = ["my-openai-endpoint"]  # Test against one endpoint, my-openai-endpoint
prompt_selection_percentage = 1  # Percentage of prompts to run from EACH dataset in the recipe; 1 means 1% of each dataset's prompts

# Below are the optional fields
random_seed = 1   # Default: 0; seeds the random selection of prompts when a prompt selection percentage is set
system_prompt = ""  # Default: ""; this allows setting the system prompt for the endpoints

# Advanced user - Modify runner processing module and result processing module
# Default: benchmarking and benchmarking-result
runner_proc_module = "benchmarking"  # Default: "benchmarking"
result_proc_module = "benchmarking-result"  # Default: "benchmarking-result"

# Run the recipe with the defined endpoint(s)
# If the id exists, it will perform a load on the runner, instead of creating a new runner.
# Using an existing runner allows the new run to possibly use cached results from previous runs, which greatly reduces the run time
slugify_id = slugify(name, lowercase=True)
if slugify_id in api_get_all_runner_name():
    rec_runner = api_load_runner(slugify_id)
else:
    rec_runner = api_create_runner(name, endpoints)

print("")

# run_recipes() is an async function. Currently there is no sync version.
# We await it directly on the notebook's existing event loop.
await rec_runner.run_recipes(
    recipes,
    prompt_selection_percentage,
    random_seed,
    system_prompt,
    runner_proc_module,
    result_proc_module,
)
await rec_runner.close()  # Perform a close on the runner to allow proper cleanup.

# Display results
runner_runs = api_get_all_run(rec_runner.id)
result_info = runner_runs[-1].get("results")
if result_info:
    print(json.dumps(result_info, indent=2))
else:
    raise RuntimeError("no run result generated")


2025-04-14 14:06:18,529 [INFO][runner.py::run_recipes(349)] [Runner] my-new-recipe-runner - Running benchmark recipe run...





2025-04-14 14:06:19,562 [INFO][benchmarking.py::generate(169)] [Benchmarking] Running recipes (['fruit-questions', 'biology-questions'])...
2025-04-14 14:06:19,563 [INFO][benchmarking.py::generate(173)] [Benchmarking] Running recipe fruit-questions... (1/2)
2025-04-14 14:06:28,034 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 1.
2025-04-14 14:06:28,042 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 1.
2025-04-14 14:06:34.539139: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744653994.664414   72750 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744653994.693122   72750 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attemptin

{
  "metadata": {
    "id": "my-new-recipe-runner",
    "start_time": "2025-04-14 14:06:18",
    "end_time": "2025-04-14 14:06:50",
    "duration": 32,
    "status": "completed",
    "recipes": [
      "fruit-questions",
      "biology-questions"
    ],
    "cookbooks": null,
    "endpoints": [
      "my-openai-endpoint"
    ],
    "prompt_selection_percentage": 1,
    "random_seed": 1,
    "system_prompt": ""
  },
  "results": {
    "recipes": [
      {
        "id": "fruit-questions",
        "details": [
          {
            "model_id": "my-openai-endpoint",
            "dataset_id": "fruits-csv-dataset",
            "prompt_template_id": "no-template",
            "data": [
              {
                "prompt": "Is Apple a Fruit? Answer Yes or No.",
                "predicted_result": {
                  "response": "Yes",
                  "context": []
                },
                "target": "Yes.",
                "duration": 1.351232307999453
              }
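
In the truncated output above, the endpoint answered `Yes` while the dataset target is `Yes.` (with a trailing period). Under a strictly verbatim comparison these two strings differ, so it is worth keeping target formatting in mind when choosing metrics. The sketch below illustrates the effect. `exact_match_accuracy` and its `normalise` flag are hypothetical helpers written for this tutorial, and nothing here is taken from Moonshot's own `exactstrmatch` implementation, whose handling of punctuation and case is not shown in this notebook.

```python
def exact_match_accuracy(pairs, normalise=False):
    """Percentage of (predicted, target) pairs that are equal, optionally after light cleanup."""

    def clean(text):
        # Optional cleanup: trim whitespace, drop trailing periods, ignore case.
        return text.strip().rstrip(".").lower() if normalise else text

    hits = sum(1 for predicted, target in pairs if clean(predicted) == clean(target))
    return 100 * hits / len(pairs)


# The response/target pair shown in the run output above.
pairs = [("Yes", "Yes.")]
strict = exact_match_accuracy(pairs)
lenient = exact_match_accuracy(pairs, normalise=True)
print(strict, lenient)
```

If your scores look lower than expected, comparing a few raw `response` and `target` values from the result JSON is a quick way to spot formatting mismatches like this one.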
       

## Beautifying Test Results

The result above is shown as raw JSON. To make it easier to read, we have provided these helper functions to render the results as a table.

In [13]:
from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
console = Console()

def show_recipe_results(recipes, endpoints, recipe_results, duration):
    """
    Show the results of the recipe benchmarking.

    This function takes the recipes, endpoints, recipe results, and duration as arguments.
    If there are any recipe results, it generates a table to display them using the generate_recipe_table function.
    If there are no recipe results, it prints a message indicating that there are no results.
    It then prints the time taken to run the benchmarking.

    Args:
        recipes (list): A list of recipes that were benchmarked.
        endpoints (list): A list of endpoints that were used in the benchmarking.
        recipe_results (dict): A dictionary with the results of the recipe benchmarking.
        duration (float): The time taken to run the benchmarking in seconds.

    Returns:
        None
    """
    if recipe_results:
        # Display recipe results
        generate_recipe_table(recipes, endpoints, recipe_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_recipe_table(recipes: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table of recipe results.

    This function creates a table that lists the results of running recipes against various endpoints.
    Each row in the table corresponds to a recipe, and each column corresponds to an endpoint.
    The results include the grade and average grade value for each recipe-endpoint pair.

    Args:
        recipes (list): A list of recipe IDs that were benchmarked.
        endpoints (list): A list of endpoint IDs against which the recipes were run.
        results (dict): A dictionary containing the results of the benchmarking.

    Returns:
        None: This function does not return anything. It prints the table to the console.
    """
    # Create a table with a title and headers
    table = Table(
        title="Recipes Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Recipe", justify="left", width=78)
    # Add a column for each endpoint
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    # Iterate over each recipe and populate the table with results
    for index, recipe_id in enumerate(recipes, start=1):
        # Attempt to find the result for the current recipe
        recipe_result = next(
            (
                result
                for result in results["results"]["recipes"]
                if result["id"] == recipe_id
            ),
            None,
        )

        # If the result exists, extract and format the results for each endpoint
        if recipe_result:
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        eval_summary
                        for eval_summary in recipe_result["evaluation_summary"]
                        if eval_summary["model_id"] == endpoint
                    ),
                    None,
                )

                # Format the grade and average grade value, or use "-" if not found
                grade = "-"
                if (
                    evaluation_summary
                    and "grade" in evaluation_summary
                    and "avg_grade_value" in evaluation_summary
                    and evaluation_summary["grade"]
                ):
                    grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"
                endpoint_results.append(grade)

            # Add a row for the recipe with its results
            table.add_row(
                str(index),
                f"Recipe: [blue]{recipe_result['id']}[/blue]",
                *endpoint_results,
                end_section=True,
            )
        else:
            # If no result is found, add a row with placeholders
            table.add_row(
                str(index),
                f"Recipe: [blue]{recipe_id}[/blue]",
                *(["-"] * len(endpoints)),
                end_section=True,
            )

    # Print the table to the console
    console.print(table)

if result_info:
    show_recipe_results(
        recipes, endpoints, result_info, result_info["metadata"]["duration"]
    )
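
If you would rather post-process the grades than print a table, the same fields that the helper above reads (`evaluation_summary`, `model_id`, `grade`, `avg_grade_value`) can be flattened into plain rows. This is a minimal sketch. `recipe_grade_rows` is a hypothetical helper, and `mock_result` is a hand-made stand-in with made-up values, not the output of the run above.

```python
def recipe_grade_rows(result_info):
    """Flatten a recipe run result into one row per (recipe, endpoint) pair."""
    rows = []
    for recipe in result_info["results"]["recipes"]:
        for summary in recipe.get("evaluation_summary", []):
            rows.append(
                {
                    "recipe": recipe["id"],
                    "endpoint": summary.get("model_id"),
                    "grade": summary.get("grade"),
                    "avg_grade_value": summary.get("avg_grade_value"),
                }
            )
    return rows


# Hand-made stand-in for result_info; the values are illustrative only.
mock_result = {
    "results": {
        "recipes": [
            {
                "id": "fruit-questions",
                "evaluation_summary": [
                    {"model_id": "my-openai-endpoint", "grade": "A", "avg_grade_value": 100.0}
                ],
            }
        ]
    }
}
rows = recipe_grade_rows(mock_result)
print(rows)
```

Rows in this shape are easy to load into a DataFrame or write to CSV for comparison across runs.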
    

## Create a new `cookbook`

We can also create a new `cookbook` and add existing recipes together with our new recipe. A `cookbook` in Moonshot is a curated collection of `recipes` designed to be executed together.

To create a new cookbook, you need the following fields:

1. **Name**: A unique name for the cookbook.
2. **Description**: A detailed explanation of the cookbook's purpose and the recipe(s) it contains.
3. **Recipes**: A list of recipe(s) that are included in the cookbook. Each recipe represents a specific test or benchmark.

In [16]:
cookbook_id = api_create_cookbook(
    "test-cookbook",
    "This cookbook tests both fruits questions and general science questions.",
    ["fruit-questions"]
)

print(f"Cookbook '{cookbook_id}' has been created.")

Cookbook 'test-cookbook' has been created.


## Run your new cookbook

In [17]:
from slugify import slugify
from moonshot.api import api_get_all_run, api_create_runner, api_get_all_runner_name

name = "test new cookbook"  # Name of the runner
cookbooks = ["test-cookbook"]  # Test one cookbook, test-cookbook. You can add more cookbooks to this list
endpoints = ["my-openai-endpoint"]  # Test against one endpoint, my-openai-endpoint
prompt_selection_percentage = 1  # Percentage of prompts to run from EACH dataset in the cookbook; 1 means 1% of each dataset's prompts

# Optional fields
random_seed = 1   # Default: 0; seeds the random selection of prompts when a prompt selection percentage is set
system_prompt = ""  # Default: ""; this allows setting the system prompt for the endpoints

# Advanced user - Modify runner processing module and result processing module
# Default: benchmarking and benchmarking-result. Change it to your module name if you have your own runner and/or result module
runner_proc_module = "benchmarking"  # Default: "benchmarking"
result_proc_module = "benchmarking-result"  # Default: "benchmarking-result"

# Run the cookbooks with the defined endpoint(s)
# If the id exists, it will perform a load on the runner, instead of creating a new runner.
# Using an existing runner allows the new run to possibly use cached results from previous runs, which greatly reduces the run time
slugify_id = slugify(name, lowercase=True)
if slugify_id in api_get_all_runner_name():
    cb_runner = api_load_runner(slugify_id)
else:
    cb_runner = api_create_runner(name, endpoints)

# run_cookbooks() is an async function. Currently there is no sync version.
# We await it directly on the notebook's existing event loop.
await cb_runner.run_cookbooks(
    cookbooks,
    prompt_selection_percentage,
    random_seed,
    system_prompt,
    runner_proc_module,
    result_proc_module,
)
await cb_runner.close()  # Perform a close on the runner to allow proper cleanup.

# Display results
runner_runs = api_get_all_run(cb_runner.id)
result_info = runner_runs[-1].get("results")
if result_info:
    print(json.dumps(result_info, indent=2))
else:
    raise RuntimeError("no run result generated")

2025-04-14 14:09:04,288 [INFO][runner.py::run_cookbooks(412)] [Runner] test-new-cookbook - Running benchmark cookbook run...


2025-04-14 14:09:04,423 [INFO][benchmarking.py::generate(139)] [Benchmarking] Running cookbooks (['test-cookbook'])...
2025-04-14 14:09:04,424 [INFO][benchmarking.py::generate(145)] [Benchmarking] Running cookbook test-cookbook... (1/1)
2025-04-14 14:09:04,450 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 1.
2025-04-14 14:09:04,457 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 1.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this mod

{
  "metadata": {
    "id": "test-new-cookbook",
    "start_time": "2025-04-14 14:09:04",
    "end_time": "2025-04-14 14:09:10",
    "duration": 5,
    "status": "completed",
    "recipes": null,
    "cookbooks": [
      "test-cookbook"
    ],
    "endpoints": [
      "my-openai-endpoint"
    ],
    "prompt_selection_percentage": 1,
    "random_seed": 1,
    "system_prompt": ""
  },
  "results": {
    "cookbooks": [
      {
        "id": "test-cookbook",
        "recipes": [
          {
            "id": "fruit-questions",
            "details": [
              {
                "model_id": "my-openai-endpoint",
                "dataset_id": "fruits-csv-dataset",
                "prompt_template_id": "no-template",
                "data": [
                  {
                    "prompt": "Is Apple a Fruit? Answer Yes or No.",
                    "predicted_result": {
                      "response": "Yes",
                      "context": []
                    },
                  

## Beautifying Test Results

In [18]:
from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
console = Console()

def show_cookbook_results(cookbooks, endpoints, cookbook_results, duration):
    """
    Show the results of the cookbook benchmarking.

    This function takes the cookbooks, endpoints, cookbook results, and duration as arguments.
    If there are results, it generates a table with the cookbook results. If there are no results,
    it prints a message indicating that no results were found.
    Finally, it prints the duration of the run.

    Args:
        cookbooks (list): A list of cookbooks.
        endpoints (list): A list of endpoints.
        cookbook_results (dict): A dictionary with the results of the cookbook benchmarking.
        duration (float): The duration of the run.

    Returns:
        None
    """
    if cookbook_results:
        # Display cookbook results
        generate_cookbook_table(cookbooks, endpoints, cookbook_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_cookbook_table(cookbooks: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table with the cookbook benchmarking results.

    This function creates a table that includes the index, cookbook name, recipe name, and the results
    for each endpoint.

    The cookbook names are prefixed with "Cookbook:" and are displayed with their overall grades. Each recipe under a
    cookbook is indented and prefixed with "Recipe:" followed by its individual grades for each endpoint. If there are
    no results for a cookbook, a row with dashes across all endpoint columns is added to indicate this.

    Args:
        cookbooks (list): A list of cookbook names to display in the table.
        endpoints (list): A list of endpoints for which results are to be displayed.
        results (dict): A dictionary containing the benchmarking results for cookbooks and recipes.

    Returns:
        None: The function prints the table to the console but does not return any value.
    """
    table = Table(
        title="Cookbook Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Cookbook (with its recipes)", justify="left", width=78)
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    index = 1
    for cookbook in cookbooks:
        # Get cookbook result
        cookbook_result = next(
            (
                result
                for result in results["results"]["cookbooks"]
                if result["id"] == cookbook
            ),
            None,
        )

        if cookbook_result:
            # Add the cookbook name with the "Cookbook: " prefix as the first row for this section
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        temp_eval
                        for temp_eval in cookbook_result["overall_evaluation_summary"]
                        if temp_eval["model_id"] == endpoint
                    ),
                    None,
                )

                # Get the grade from the evaluation_summary, or use "-" if not found
                grade = "-"
                if evaluation_summary and evaluation_summary["overall_grade"]:
                    grade = evaluation_summary["overall_grade"]
                endpoint_results.append(grade)
            table.add_row(
                str(index),
                f"Cookbook: [blue]{cookbook}[/blue]",
                *endpoint_results,
                end_section=True,
            )

            for recipe in cookbook_result["recipes"]:
                endpoint_results = []
                for endpoint in endpoints:
                    # Find the evaluation summary for the endpoint
                    evaluation_summary = next(
                        (
                            temp_eval
                            for temp_eval in recipe["evaluation_summary"]
                            if temp_eval["model_id"] == endpoint
                        ),
                        None,
                    )

                    # Get the grade from the evaluation_summary, or use "-" if not found
                    grade = "-"
                    if (
                        evaluation_summary
                        and "grade" in evaluation_summary
                        and "avg_grade_value" in evaluation_summary
                        and evaluation_summary["grade"]
                    ):
                        grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"
                    endpoint_results.append(grade)

                # Add the recipe name indented under the cookbook name
                table.add_row(
                    "",
                    f"  └──  Recipe: [blue]{recipe['id']}[/blue]",
                    *endpoint_results,
                    end_section=True,
                )

            # Increment index only after all recipes of the cookbook have been added
            index += 1
        else:
            # If no results for the cookbook, add a row indicating this with the "Cookbook: " prefix
            # and a dash for each endpoint column
            table.add_row(
                str(index),
                f"Cookbook: {cookbook}",
                *(["-"] * len(endpoints)),
                end_section=True,
            )
            index += 1

    # Display table
    console.print(table)

if result_info:
    show_cookbook_results(
        cookbooks, endpoints, result_info, result_info["metadata"]["duration"]
    )
else:
    raise RuntimeError("no run result generated")
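
The footer printed by `show_cookbook_results` states that the overall rating is the lowest grade among the recipes in each cookbook. As a rough illustration of that rule, the sketch below picks the worst grade from a list, assuming the A to E ordering used by the grading scales in this tutorial. `overall_grade` is a hypothetical helper, and Moonshot computes `overall_grade` for you in the result JSON.

```python
GRADE_ORDER = ["A", "B", "C", "D", "E"]  # best to worst, matching the grading scale above


def overall_grade(recipe_grades):
    """Return the lowest (worst) known grade in recipe_grades, or None if there is none."""
    known = [grade for grade in recipe_grades if grade in GRADE_ORDER]
    if not known:
        return None
    # The worst grade is the one that appears latest in GRADE_ORDER.
    return max(known, key=GRADE_ORDER.index)


print(overall_grade(["A", "C", "B"]))  # C
print(overall_grade(["A"]))            # A
print(overall_grade([]))               # None
```

This means a single weak recipe determines the cookbook's overall rating, so it is worth grouping recipes of comparable importance in the same cookbook.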