# Tutorial 1 - Basic Workflow - Execute Existing Tests 

**Scenario**: You are a model developer tasked with deploying a system that uses a Large Language Model (LLM). You are unsure which model performs best for your use case, so you want to assess candidate models' capabilities using the pre-built benchmarks in Moonshot. How can you do this?

In this tutorial, you will learn how to:

- Add your own `connector_endpoints` into Moonshot
- List and run an existing `cookbook` in Moonshot

**Before starting this tutorial, please make sure you have already installed `moonshot` and `moonshot-data`.** Otherwise, please follow [this tutorial](https://aiverify-foundation.github.io/moonshot/getting_started/quick_install) to install and configure Moonshot first.
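
If you are unsure whether the package is available in the current environment, a quick check like the one below can help. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example and is not part of Moonshot.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("moonshot"):
    print("moonshot is not importable; install it before continuing.")
```

Note that `moonshot-data` is a folder of data files rather than an importable package, so it is checked separately in the next section.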

## Import and configure Moonshot

In this section, we prepare our Jupyter notebook environment by importing the libraries needed to execute an existing benchmark.

> ⚠️ **Check** that `moonshot_data_path` below matches the location where you installed `moonshot-data`, and edit the code to match your location if needed.

In [7]:
# Python built-ins:
import os
import json
import asyncio
import sys

# If you're running this notebook from the moonshot/examples/jupyter-notebook folder, the line
# below lets you import moonshot from the local source code. If you installed moonshot
# from pip, you can remove it:
sys.path.insert(0, '../../')

# Import moonshot utilities:
from moonshot.api import (
    api_create_endpoint,
    api_get_all_endpoint,
    api_get_all_cookbook,
    api_load_runner,
    api_read_result,
    api_set_environment_variables,
)

# Configure moonshot data location:
moonshot_data_path = "moonshot-data"
env = {
    "ATTACK_MODULES": os.path.join(moonshot_data_path, "attack-modules"),
    "CONNECTORS": os.path.join(moonshot_data_path, "connectors"),
    "CONNECTORS_ENDPOINTS": os.path.join(moonshot_data_path, "connectors-endpoints"),
    "CONTEXT_STRATEGY": os.path.join(moonshot_data_path, "context-strategy"),
    "COOKBOOKS": os.path.join(moonshot_data_path, "cookbooks"),
    "DATABASES": os.path.join(moonshot_data_path, "generated-outputs/databases"),
    "DATABASES_MODULES": os.path.join(moonshot_data_path, "databases-modules"),
    "DATASETS": os.path.join(moonshot_data_path, "datasets"),
    "IO_MODULES": os.path.join(moonshot_data_path, "io-modules"),
    "METRICS": os.path.join(moonshot_data_path, "metrics"),
    "PROMPT_TEMPLATES": os.path.join(moonshot_data_path, "prompt-templates"),
    "RECIPES": os.path.join(moonshot_data_path, "recipes"),
    "RESULTS": os.path.join(moonshot_data_path, "generated-outputs/results"),
    "RESULTS_MODULES": os.path.join(moonshot_data_path, "results-modules"),
    "RUNNERS": os.path.join(moonshot_data_path, "generated-outputs/runners"),
    "RUNNERS_MODULES": os.path.join(moonshot_data_path, "runners-modules"),
}

# Check user has set moonshot_data_path correctly:
if not os.path.isdir(env["ATTACK_MODULES"]):
    raise ValueError(
        "Configured path %s does not exist. Is moonshot-data installed at %s?"
        % (env["ATTACK_MODULES"], moonshot_data_path)
    )

# Apply the environment variables to configure the Moonshot framework.
api_set_environment_variables(env)
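
The cell above only verifies the `ATTACK_MODULES` folder. If you would like to see every configured path that is missing, something like the following sketch could be used. The helper `find_missing_dirs` and the `example_env` values are hypothetical; in the notebook you would pass the `env` dictionary defined above.

```python
import os


def find_missing_dirs(env: dict) -> list:
    """Return the (name, path) pairs in `env` whose path is not an existing directory."""
    return [
        (name, path)
        for name, path in sorted(env.items())
        if not os.path.isdir(path)
    ]


# Hypothetical example values; replace with the `env` dict from the cell above.
example_env = {
    "COOKBOOKS": os.path.join("moonshot-data", "cookbooks"),
    "RECIPES": os.path.join("moonshot-data", "recipes"),
}
for name, path in find_missing_dirs(example_env):
    print(f"{name}: {path} not found")
```

Some folders, such as those under `generated-outputs`, may legitimately be absent before a first run, so treat the list as a hint rather than a hard error.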

## Define the target model endpoint / API

Moonshot provides [connectors](https://aiverify-foundation.github.io/moonshot/api_reference/api_connector/) to a range of different LLM hosting providers - such as OpenAI (direct or Azure), Hugging Face, Amazon Bedrock, and Google Gemini.

There are some [example endpoint configurations](https://github.com/aiverify-foundation/moonshot-data/tree/main/connectors-endpoints) provided in `moonshot-data`, but they don't include API keys or other credentials, so you'll usually need to edit these configurations, or add your own, to connect to your target LLM.

You can register new Moonshot endpoints directly from Python, as shown below.

▶️ **TODO: Edit the cell below to configure your own LLM.**

> If you're using OpenAI, you'll just need to replace `ADD_YOUR_TOKEN_HERE` below with your own OpenAI token.
>
> If you're using a different provider, check out the [list of connector IDs](https://github.com/aiverify-foundation/moonshot-data/tree/main/connectors) provided by `moonshot-data`. Different connectors have different required parameters. For example, the `amazon-bedrock-connector` can automatically pick up credentials configured in the AWS CLI - so you'll usually leave `token` blank for this connector type.



In [8]:
endpoint_id = api_create_endpoint(
    "my-openai-endpoint",  # name: Assign a unique name to identify this endpoint later.
    "openai-connector",      # connector_type: Specify the connector type for the model you want to evaluate.
    "",                      # uri: OpenAI connectors handle this automatically, some others may need it.
    "ADD_YOUR_TOKEN_HERE",   # token: Insert your API token here, if using a connector type that needs one.
    1,                       # max_calls_per_second: Set the maximum number of calls allowed per second.
    1,                       # max_concurrency: Set the maximum number of concurrent calls.
    {
        "timeout": 300,            # Define the timeout for API calls in seconds.
        "allow_retries": True,     # Specify whether to allow retries on failed calls.
        "num_of_retries": 3,       # Set the number of retries if allowed.
        "temperature": 0.5,        # Set the temperature for response variability.
        "model": "gpt-3.5-turbo",  # Define the model version to use.
    }  # params: Include any additional parameters required for this model.
)
print(f"The newly created endpoint id: {endpoint_id}")

The newly created endpoint id: my-openai-endpoint


Running the cell above creates a new configuration file in your Moonshot data `CONNECTORS_ENDPOINTS` folder.

These stored endpoint IDs are what we'll reference when running tests in Moonshot.
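
To keep the token out of the notebook itself, you could read it from an environment variable and pass the result as the `token` argument of `api_create_endpoint`. The sketch below assumes the variable is called `OPENAI_API_KEY`, which is the name the OpenAI SDK conventionally reads; the helper `get_api_token` is made up for this example.

```python
import os


def get_api_token(
    var_name: str = "OPENAI_API_KEY",       # assumed variable name; use whatever you export
    default: str = "ADD_YOUR_TOKEN_HERE",   # same placeholder as in the cell above
) -> str:
    """Read an API token from the environment, falling back to a placeholder."""
    return os.environ.get(var_name, default)


token = get_api_token()
if token == "ADD_YOUR_TOKEN_HERE":
    print("No token found in the environment; the placeholder will be used.")
```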

## Run a test using our predefined `cookbook`

Moonshot comes with a list of `cookbooks` and `recipes`. A `recipe` contains one or more benchmark datasets and evaluation metrics. A `cookbook` contains one or more `recipes`. To execute an existing test, we can select either a `recipe` or `cookbook`.

In this tutorial, we will run a `cookbook` called `leaderboard-cookbook`. This cookbook contains a set of popular benchmarks (e.g., `mmlu`) that can be used to assess the capability of the model. 

*For the purpose of this tutorial, we will configure our `runner` to run 1 prompt from every recipe in this cookbook on the endpoint we created.*
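
The cell below sets `num_of_prompts` and `random_seed`. To build intuition for why a fixed seed matters, here is an illustrative sketch of reproducible sampling. It is not Moonshot's actual selection logic, and the function `pick_prompt_indices` is hypothetical; it only shows that the same seed yields the same subset.

```python
import random


def pick_prompt_indices(dataset_size: int, num_of_prompts: int, random_seed: int) -> list:
    """Illustrative only: choose which prompt indices to run, reproducibly."""
    if num_of_prompts <= 0 or num_of_prompts >= dataset_size:
        return list(range(dataset_size))  # 0 means "use every prompt"
    rng = random.Random(random_seed)  # private generator; global random state is untouched
    return sorted(rng.sample(range(dataset_size), num_of_prompts))


first = pick_prompt_indices(483, 1, random_seed=0)
second = pick_prompt_indices(483, 1, random_seed=0)
print(first == second)  # the same seed selects the same prompt
```

Changing the seed would typically select a different prompt, which is useful when you want a second small sample without running the full dataset.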

In [9]:
from slugify import slugify
from moonshot.api import api_get_all_run, api_create_runner, api_get_all_runner_name

name = "sample-cookbook-runner"  # Name of the runner
cookbooks = ["leaderboard-cookbook"]  # Test against 1 cookbook, leaderboard-cookbook
endpoints = ["my-openai-endpoint"]  # Test against 1 endpoint, my-openai-endpoint
num_of_prompts = 1  # Use a small number to try things out; 0 means use all prompts in the dataset

# Below are the optional fields
random_seed = 0   # Default: 0; seeds the random selection of prompts when num_of_prompts is set
system_prompt = ""  # Default: ""; this allows setting the system prompt for the endpoints

# Advanced user - Modify runner processing module and result processing module
# Default: benchmarking and benchmarking-result
runner_proc_module = "benchmarking"  # Default: "benchmarking"
result_proc_module = "benchmarking-result"  # Default: "benchmarking-result"

# Run the cookbooks with the defined endpoints.
# If a runner with this id already exists, load it instead of creating a new one.
# This lets the new run reuse any cached results from previous runs, which can greatly reduce the run time.
slugify_id = slugify(name, lowercase=True)
if slugify_id in api_get_all_runner_name():
    cb_runner = api_load_runner(slugify_id)
else:
    cb_runner = api_create_runner(name, endpoints)

# run_cookbooks is an async function. Currently there is no sync version.
# Jupyter already runs an event loop, so we can await the coroutine directly.
await cb_runner.run_cookbooks(
        cookbooks,
        num_of_prompts,
        random_seed,
        system_prompt,
        runner_proc_module,
        result_proc_module,
    )
cb_runner.close()  # Perform a close on the runner to allow proper cleanup.

# Display results in JSON
runner_runs = api_get_all_run(cb_runner.id)
result_info = runner_runs[-1].get("results")
if result_info:
    print(json.dumps(result_info, indent=2))
else:
    raise RuntimeError("no run result generated")

Established connection to database (data/generated-outputs/databases/sample-cookbook-runner.db)
[Runner] sample-cookbook-runner - Running benchmark cookbook run...
[Run] Part 0: Initialising run...
[Run] Initialise run took 0.0014s
[Run] Part 1: Loading asyncio running loop...
[Run] Part 2: Loading modules...
[Run] Module loading took 0.0105s
[Run] Part 3: Running runner processing module...
[Benchmarking] Load recipe connectors took 0.0239s
[Benchmarking] Set connectors system prompt took 0.0000s
[Benchmarking] Part 1: Running cookbooks (['leaderboard-cookbook'])...
[Benchmarking] Running cookbook leaderboard-cookbook... (1/1)
[Benchmarking] Load required instances...
[Benchmarking] Load cookbook instance took 0.0005s
[Benchmarking] Running cookbook recipes...
[Benchmarking] Running recipe mmlu... (1/6)
[Benchmarking] Load required instances...
[Benchmarking] Load recipe instance took 0.0046s
[Benchmarking] Load recipe metrics took 0.0021s
[Benchmarking] Build and execute generator pi

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[Prompt 12624] took 0.8368s
[Benchmarking] Predicting prompts for recipe [mmlu] took 0.9142s
[Benchmarking] Sorting the recipe predictions into groups
[Benchmarking] Sorted the recipe predictions into groups for recipe [mmlu] took 0.0000s
[Benchmarking] Performing metrics calculation
[Benchmarking] Running metrics for conn_id (my-openai-endpoint), recipe_id (mmlu), dataset_id (mmlu-all), prompt_template_id (mmlu)
[exactstrmatch] Running [get_results] took 0.0000s
[Benchmarking] Performing metrics calculation for recipe [mmlu] took 0.0000s
[Benchmarking] Running recipe truthfulqa-mcq... (2/6)
[Benchmarking] Load required instances...
[Benchmarking] Load recipe instance took 0.0023s
[Benchmarking] Load recipe metrics took 0.0009s
[Benchmarking] Build and execute generator pipeline...
[Benchmarking] Dataset truthfulqa-mcq, using 1 of 483 prompts.
Predicting prompt 433 [my-openai-endpoint]


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[Prompt 433] took 0.8910s
[Benchmarking] Predicting prompts for recipe [truthfulqa-mcq] took 0.9001s
[Benchmarking] Sorting the recipe predictions into groups
[Benchmarking] Sorted the recipe predictions into groups for recipe [truthfulqa-mcq] took 0.0000s
[Benchmarking] Performing metrics calculation
[Benchmarking] Running metrics for conn_id (my-openai-endpoint), recipe_id (truthfulqa-mcq), dataset_id (truthfulqa-mcq), prompt_template_id (mcq-template)
[exactstrmatch] Running [get_results] took 0.0000s
[Benchmarking] Performing metrics calculation for recipe [truthfulqa-mcq] took 0.0000s
[Benchmarking] Running recipe winogrande... (3/6)
[Benchmarking] Load required instances...
[Benchmarking] Load recipe instance took 0.0055s
[Benchmarking] Load recipe metrics took 0.0010s
[Benchmarking] Build and execute generator pipeline...
[Benchmarking] Dataset winogrande, using 1 of 41665 prompts.
Predicting prompt 25248 [my-openai-endpoint]


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[Prompt 25248] took 0.8217s
[Benchmarking] Predicting prompts for recipe [winogrande] took 0.9427s
[Benchmarking] Sorting the recipe predictions into groups
[Benchmarking] Sorted the recipe predictions into groups for recipe [winogrande] took 0.0000s
[Benchmarking] Performing metrics calculation
[Benchmarking] Running metrics for conn_id (my-openai-endpoint), recipe_id (winogrande), dataset_id (winogrande), prompt_template_id (mcq-template)
[exactstrmatch] Running [get_results] took 0.0000s
[Benchmarking] Performing metrics calculation for recipe [winogrande] took 0.0000s
[Benchmarking] Running recipe hellaswag... (4/6)
[Benchmarking] Load required instances...
[Benchmarking] Load recipe instance took 0.0263s
[Benchmarking] Load recipe metrics took 0.0008s
[Benchmarking] Build and execute generator pipeline...
[Benchmarking] Dataset hellaswag, using 1 of 49947 prompts.
Predicting prompt 25248 [my-openai-endpoint]


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[Prompt 25248] took 1.3476s
[Benchmarking] Predicting prompts for recipe [hellaswag] took 1.6041s
[Benchmarking] Sorting the recipe predictions into groups
[Benchmarking] Sorted the recipe predictions into groups for recipe [hellaswag] took 0.0000s
[Benchmarking] Performing metrics calculation
[Benchmarking] Running metrics for conn_id (my-openai-endpoint), recipe_id (hellaswag), dataset_id (hellaswag), prompt_template_id (mcq-template)
[exactstrmatch] Running [get_results] took 0.0000s
[Benchmarking] Performing metrics calculation for recipe [hellaswag] took 0.0000s
[Benchmarking] Running recipe arc... (5/6)
[Benchmarking] Load required instances...
[Benchmarking] Load recipe instance took 0.0055s
[Benchmarking] Load recipe metrics took 0.0010s
[Benchmarking] Build and execute generator pipeline...
[Benchmarking] Dataset arc-challenge, using 1 of 2590 prompts.
[Benchmarking] Dataset arc-easy, using 1 of 5197 prompts.
Predicting prompt 1578 [my-openai-endpoint]
Predicting prompt 3156 [

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[Prompt 1578] took 0.6399s


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[Prompt 3156] took 1.4563s
[Benchmarking] Predicting prompts for recipe [arc] took 1.5036s
[Benchmarking] Sorting the recipe predictions into groups
[Benchmarking] Sorted the recipe predictions into groups for recipe [arc] took 0.0000s
[Benchmarking] Performing metrics calculation
[Benchmarking] Running metrics for conn_id (my-openai-endpoint), recipe_id (arc), dataset_id (arc-challenge), prompt_template_id (mcq-template)
[exactstrmatch] Running [get_results] took 0.0000s
[Benchmarking] Running metrics for conn_id (my-openai-endpoint), recipe_id (arc), dataset_id (arc-easy), prompt_template_id (mcq-template)
[exactstrmatch] Running [get_results] took 0.0000s
[Benchmarking] Performing metrics calculation for recipe [arc] took 0.0000s
[Benchmarking] Running recipe gsm8k... (6/6)
[Benchmarking] Load required instances...
[Benchmarking] Load recipe instance took 0.0036s
[Benchmarking] Load recipe metrics took 0.0008s
[Benchmarking] Build and execute generator pipeline...
[Benchmarking] Dat

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


[Prompt 6312] took 0.9611s
[Benchmarking] Predicting prompts for recipe [gsm8k] took 0.9967s
[Benchmarking] Sorting the recipe predictions into groups
[Benchmarking] Sorted the recipe predictions into groups for recipe [gsm8k] took 0.0000s
[Benchmarking] Performing metrics calculation
[Benchmarking] Running metrics for conn_id (my-openai-endpoint), recipe_id (gsm8k), dataset_id (gsm8k), prompt_template_id (mcq-template)
[exactstrmatch] Running [get_results] took 0.0000s
[Benchmarking] Performing metrics calculation for recipe [gsm8k] took 0.0000s
[Benchmarking] Running cookbook [leaderboard-cookbook] took 6.9240s
[Benchmarking] Run took 6.9262s
[Benchmarking] Updating completion status...
[Benchmarking] Preparing results...
[Benchmarking] Preparing results took 0.0000s
[Run] Running runner processing module took 6.9512s
[Run] Part 4: Running result processing module...
[BenchmarkingResult] Generate results took 0.0322s
[Run] Running result processing module took 0.0333s
[Run] Part 5: W
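
The cell above relies on Jupyter's support for top-level `await`. In a plain Python script there is no running event loop, so the coroutine needs to be driven explicitly, for example with `asyncio.run`. The sketch below uses a stand-in coroutine, `run_benchmark_stub`, in place of `cb_runner.run_cookbooks(...)` so that it runs without Moonshot installed.

```python
import asyncio


async def run_benchmark_stub(cookbooks, num_of_prompts):
    """Stand-in for an async call such as `cb_runner.run_cookbooks(...)`."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return {"cookbooks": cookbooks, "num_of_prompts": num_of_prompts}


def main():
    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and closes the loop afterwards.
    return asyncio.run(run_benchmark_stub(["leaderboard-cookbook"], 1))


if __name__ == "__main__":
    print(main())
```

`asyncio.run` cannot be called while another event loop is already running in the same thread, so inside a notebook keep using `await` as shown in the cell above.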

## Beautifying the results

The result above is shown as raw JSON. To make it easier to read, you can use the `rich` library to render it as a table.

In [11]:
from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
console = Console()

def show_cookbook_results(cookbooks, endpoints, cookbook_results, duration):
    """
    Show the results of the cookbook benchmarking.

    This function takes the cookbooks, endpoints, cookbook results, and duration as arguments.
    If there are results, it generates a table with the cookbook results. If there are no results,
    it prints a message indicating that no results were found.
    Finally, it prints the duration of the run.

    Args:
        cookbooks (list): A list of cookbooks.
        endpoints (list): A list of endpoints.
        cookbook_results (dict): A dictionary with the results of the cookbook benchmarking.
        duration (float): The duration of the run.

    Returns:
        None
    """
    if cookbook_results:
        # Display recipe results
        generate_cookbook_table(cookbooks, endpoints, cookbook_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_cookbook_table(cookbooks: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table with the cookbook benchmarking results.

    This function creates a table that includes the index, cookbook name, recipe name, and the results
    for each endpoint.

    The cookbook names are prefixed with "Cookbook:" and are displayed with their overall grades. Each recipe under a
    cookbook is indented and prefixed with "Recipe:" followed by its individual grades for each endpoint. If there are
    no results for a cookbook, a row with dashes across all endpoint columns is added to indicate this.

    Args:
        cookbooks (list): A list of cookbook names to display in the table.
        endpoints (list): A list of endpoints for which results are to be displayed.
        results (dict): A dictionary containing the benchmarking results for cookbooks and recipes.

    Returns:
        None: The function prints the table to the console but does not return any value.
    """
    table = Table(
        title="Cookbook Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Cookbook (with its recipes)", justify="left", width=78)
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    index = 1
    for cookbook in cookbooks:
        # Get cookbook result
        cookbook_result = next(
            (
                result
                for result in results["results"]["cookbooks"]
                if result["id"] == cookbook
            ),
            None,
        )

        if cookbook_result:
            # Add the cookbook name with the "Cookbook: " prefix as the first row for this section
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        temp_eval
                        for temp_eval in cookbook_result["overall_evaluation_summary"]
                        if temp_eval["model_id"] == endpoint
                    ),
                    None,
                )

                # Get the grade from the evaluation_summary, or use "-" if not found
                grade = "-"
                if evaluation_summary and evaluation_summary["overall_grade"]:
                    grade = evaluation_summary["overall_grade"]
                endpoint_results.append(grade)
            table.add_row(
                str(index),
                f"Cookbook: [blue]{cookbook}[/blue]",
                *endpoint_results,
                end_section=True,
            )

            for recipe in cookbook_result["recipes"]:
                endpoint_results = []
                for endpoint in endpoints:
                    # Find the evaluation summary for the endpoint
                    evaluation_summary = next(
                        (
                            temp_eval
                            for temp_eval in recipe["evaluation_summary"]
                            if temp_eval["model_id"] == endpoint
                        ),
                        None,
                    )

                    # Get the grade from the evaluation_summary, or use "-" if not found
                    grade = "-"
                    if (
                        evaluation_summary
                        and "grade" in evaluation_summary
                        and "avg_grade_value" in evaluation_summary
                        and evaluation_summary["grade"]
                    ):
                        grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"
                    endpoint_results.append(grade)

                # Add the recipe name indented under the cookbook name
                table.add_row(
                    "",
                    f"  └──  Recipe: [blue]{recipe['id']}[/blue]",
                    *endpoint_results,
                    end_section=True,
                )

            # Increment index only after all recipes of the cookbook have been added
            index += 1
        else:
            # If no results for the cookbook, add a row indicating this with the "Cookbook: " prefix
            # and a dash for each endpoint column
            table.add_row(
                str(index),
                f"Cookbook: {cookbook}",
                *(["-"] * len(endpoints)),
                end_section=True,
            )
            index += 1

    # Display table
    console.print(table)

if result_info:
    show_cookbook_results(
        cookbooks, endpoints, result_info, result_info["metadata"]["duration"]
    )
else:
    raise RuntimeError("no run result generated")
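
If you would rather work with the grades as plain data, for instance to write them to a CSV file, you can flatten the same nested structure into rows. This is a minimal sketch that assumes the result layout read by `generate_cookbook_table` above; `flatten_grades` is a hypothetical helper, and the sample values are placeholders, not real benchmark results.

```python
def flatten_grades(results: dict) -> list:
    """Return (cookbook_id, recipe_id, model_id, grade, avg_grade_value) rows."""
    rows = []
    for cookbook in results["results"]["cookbooks"]:
        for recipe in cookbook["recipes"]:
            for summary in recipe["evaluation_summary"]:
                rows.append(
                    (
                        cookbook["id"],
                        recipe["id"],
                        summary["model_id"],
                        summary.get("grade"),            # None if the recipe has no grade
                        summary.get("avg_grade_value"),  # None if no numeric value exists
                    )
                )
    return rows


# Made-up sample in the same shape; the grade values are placeholders.
sample = {
    "results": {
        "cookbooks": [
            {
                "id": "leaderboard-cookbook",
                "recipes": [
                    {
                        "id": "mmlu",
                        "evaluation_summary": [
                            {
                                "model_id": "my-openai-endpoint",
                                "grade": "A",
                                "avg_grade_value": 100.0,
                            }
                        ],
                    }
                ],
            }
        ]
    }
}
for row in flatten_grades(sample):
    print(row)
```

In the notebook you would call `flatten_grades(result_info)` instead of passing the sample.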

## List all the cookbooks

If you are curious about which other cookbooks are available, you can use `api_get_all_cookbook()`.

The output looks like the example below. To run one of these cookbooks, replace `leaderboard-cookbook` with its cookbook ID, or append more cookbook IDs to the `cookbooks` list in the cell that runs the cookbook.
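
Once you have the list, ordinary Python is enough to search it. The sketch below finds the cookbooks that include a given recipe, assuming each entry is a dictionary with `id` and `recipes` keys as in the output shown below. The helper `cookbooks_containing` is hypothetical, and the sample is a hand-trimmed version of that output.

```python
def cookbooks_containing(cookbook_list: list, recipe_id: str) -> list:
    """Return the IDs of cookbooks whose `recipes` list includes `recipe_id`."""
    return [cb["id"] for cb in cookbook_list if recipe_id in cb.get("recipes", [])]


# Hand-trimmed sample; in the notebook, pass the result of api_get_all_cookbook().
sample_cookbooks = [
    {"id": "common-risk-easy", "recipes": ["bbq", "commonsense-morality-easy"]},
    {"id": "common-risk-hard", "recipes": ["bbq", "commonsense-morality-hard"]},
]
print(cookbooks_containing(sample_cookbooks, "bbq"))
```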

In [12]:
cookbook_ids = api_get_all_cookbook()
print("Total number of cookbooks: {0}".format(len(cookbook_ids)))
print("Showing the first three cookbooks below...")
print(json.dumps(cookbook_ids[0:3], indent=2))

Total number of cookbooks: 9
Showing the first three cookbooks below...
[
  {
    "id": "common-risk-easy",
    "name": "Easy test sets for Common Risks",
    "description": "This is a cookbook that consists (easy) test sets for common risks. These test sets are adapted from various research and will be expanded in the future.",
    "recipes": [
      "uciadult",
      "bbq",
      "winobias",
      "challenging-toxicity-prompts-completion",
      "realtime-qa",
      "commonsense-morality-easy",
      "jailbreak-dan",
      "advglue"
    ]
  },
  {
    "id": "common-risk-hard",
    "name": "Hard test sets for Common Risks",
    "description": "This is a cookbook that consists (hard) test sets for common risks. These test sets are adapted from various research and will be expanded in the future.",
    "recipes": [
      "uciadult",
      "bbq",
      "winobias",
      "challenging-toxicity-prompts-completion",
      "realtime-qa",
      "commonsense-morality-hard",
      "jailbreak-dan