# Tutorial 1 - Basic Workflow - Execute Existing Benchmark Tests 

**Scenario**: 

You are a model developer and you are told to deploy a system that uses a Large Language Model.

However, you are uncertain which model performs best for your use case and you want to assess potential models' capabilities using the pre-built benchmarks in Moonshot.<br>

How can you do this? 

In this tutorial, you will learn how to:

- Connect to the models you want to test by setting up their connector endpoints in Moonshot
- List and run a curated set of benchmarks via a Moonshot `cookbook`

Prerequisites:

1. Moonshot is installed (refer to "<b>Moonshot - Pre-Req - Setup.ipynb</b>" for instructions)
2. A copy of `moonshot-data` (refer to "<b>Moonshot - Pre-Req - Setup.ipynb</b>" for instructions). You will set the path to `moonshot-data` in the `moonshot_data_path` variable later
3. Any access tokens or URIs needed for the model you want to test, which you will provide later where the placeholder `ADD_YOUR_TOKEN_HERE` appears



## Import and configure Moonshot

In this section, we prepare our Jupyter notebook environment by importing the libraries required to execute an existing benchmark.

> ⚠️ **Note:** Check that `moonshot_data_path` below matches the location where you installed `moonshot-data` and edit the code to match your location if needed.

In [1]:
# Python built-ins:
import os
import json
import sys

# If you're running this notebook from the moonshot/examples/jupyter-notebook folder, the below
# line will enable you to import moonshot from the local source code. If you installed moonshot
# from pip, you can remove this:
sys.path.insert(0, '../../')

# Import moonshot utilities:
from moonshot.api import (
    api_create_endpoint,
    api_get_all_cookbook,
    api_load_runner,
    api_set_environment_variables,
)

# Environment Configuration
# Here we set up the environment variables for the Moonshot framework.
# These variables define the paths to various modules and components used by Moonshot,
# organizing the framework's structure and access points.

# modify moonshot_data_path to point to your own copy of moonshot-data
moonshot_data_path = "./moonshot-data"
env = {
    "ATTACK_MODULES": os.path.join(moonshot_data_path, "attack-modules"),
    "BOOKMARKS": os.path.join(moonshot_data_path, "generated-outputs/bookmarks"),
    "CONNECTORS": os.path.join(moonshot_data_path, "connectors"),
    "CONNECTORS_ENDPOINTS": os.path.join(moonshot_data_path, "connectors-endpoints"),
    "CONTEXT_STRATEGY": os.path.join(moonshot_data_path, "context-strategy"),
    "COOKBOOKS": os.path.join(moonshot_data_path, "cookbooks"),
    "DATABASES": os.path.join(moonshot_data_path, "generated-outputs/databases"),
    "DATABASES_MODULES": os.path.join(moonshot_data_path, "databases-modules"),
    "DATASETS": os.path.join(moonshot_data_path, "datasets"),
    "IO_MODULES": os.path.join(moonshot_data_path, "io-modules"),
    "METRICS": os.path.join(moonshot_data_path, "metrics"),
    "PROMPT_TEMPLATES": os.path.join(moonshot_data_path, "prompt-templates"),
    "RECIPES": os.path.join(moonshot_data_path, "recipes"),
    "RESULTS": os.path.join(moonshot_data_path, "generated-outputs/results"),
    "RESULTS_MODULES": os.path.join(moonshot_data_path, "results-modules"),
    "RUNNERS": os.path.join(moonshot_data_path, "generated-outputs/runners"),
    "RUNNERS_MODULES": os.path.join(moonshot_data_path, "runners-modules"),
}

# Check user has set moonshot_data_path correctly:
if not os.path.isdir(env["ATTACK_MODULES"]):
    raise ValueError(
        "Configured path %s does not exist. Is moonshot-data installed at %s?"
        % (env["ATTACK_MODULES"], moonshot_data_path)
    )

# Apply the environment variables to configure the Moonshot framework.
api_set_environment_variables(env)

# Note: You may see a warning that IProgress was not found. It can be ignored for now.
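
The check in the cell above only tests the `ATTACK_MODULES` folder. A minimal sketch of a broader check, assuming you want to know every configured folder that is missing (the helper name `find_missing_paths` is hypothetical and not part of Moonshot):

```python
import os


def find_missing_paths(env: dict) -> list:
    """Return the names of env entries whose directory does not exist."""
    return sorted(name for name, path in env.items() if not os.path.isdir(path))


# Example with a made-up mapping; in the notebook you would pass the `env` dict instead.
example_env = {
    "HERE": os.getcwd(),
    "NOWHERE": os.path.join(os.getcwd(), "no-such-folder-xyz"),
}
print(find_missing_paths(example_env))
```

Some folders, such as those under `generated-outputs`, might only be created on first use, so treat the result as a hint rather than an error.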

## Define the target model endpoint / API

Moonshot provides [connectors](https://aiverify-foundation.github.io/moonshot/api_reference/api_connector/) to a range of different LLM hosting providers - such as OpenAI (direct or Azure), Hugging Face, Amazon Bedrock, and Google Gemini.

There are some [example endpoint configurations](https://github.com/aiverify-foundation/moonshot-data/tree/main/connectors-endpoints) provided in `moonshot-data`, but they don't include API keys or other credentials, so you'll usually need to edit these configurations, or add your own, to connect to your target LLM.

You can register new Moonshot endpoints directly from Python, as shown below.

▶️ **TODO: Edit the cell below to configure your own LLM.**

> If you're using OpenAI, you'll just need to replace `ADD_YOUR_TOKEN_HERE` below with your own OpenAI token.
>
> If you're using a different provider, check out the [list of connector IDs](https://github.com/aiverify-foundation/moonshot-data/tree/main/connectors) provided by `moonshot-data`. Different connectors have different required parameters. For example, the `amazon-bedrock-connector` can automatically pick up credentials configured in the AWS CLI - so you'll usually leave `token` blank for this connector type.
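
To avoid pasting a secret into the notebook, one option is to read the token from an environment variable. A minimal sketch, assuming you export it as `OPENAI_API_KEY` (the variable name and the `read_token` helper are illustrative choices, not something Moonshot requires):

```python
import os


def read_token(var_name: str = "OPENAI_API_KEY", placeholder: str = "ADD_YOUR_TOKEN_HERE") -> str:
    """Return the token stored in the environment, or the placeholder if it is unset."""
    return os.environ.get(var_name, placeholder)


token = read_token()
if token == "ADD_YOUR_TOKEN_HERE":
    print("No token found in the environment; set one before running benchmarks.")
```

You could then pass `token` in place of the `"ADD_YOUR_TOKEN_HERE"` string in the cell below.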



In [2]:
# Create an OpenAI endpoint that uses the gpt-3.5-turbo model.

endpoint_id = api_create_endpoint(
    "my-openai-endpoint",    # name: Assign a unique name to identify this endpoint later.
    "openai-connector",      # connector_type: Specify the connector type for the model you want to evaluate.
    "",                      # uri: Leave blank as the OpenAI library handles the connection.
    "ADD_YOUR_TOKEN_HERE",   # token: Insert your OpenAI API token here.
    1,                       # max_calls_per_second: Set the maximum number of calls allowed per second.
    1,                       # max_concurrency: Set the maximum number of concurrent calls.
    "gpt-3.5-turbo",         # model: Define the model version to use.
    # params: Include any additional parameters required for this model.
    {
        "timeout": 300,      # timeout: Set the timeout for API calls in seconds.
        "max_attempts": 3,   # max_attempts: Set the max number of retry attempts. 
        "temperature": 0.5,  # temperature: Set the temperature for response variability.
    }  
)
print(f"The newly created endpoint id: {endpoint_id}")

The newly created endpoint id: my-openai-endpoint


Running the cell above creates a new configuration file in the `CONNECTORS_ENDPOINTS` folder of your `moonshot-data` copy.

These stored endpoint IDs are what we'll reference when running tests in Moonshot.
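
To confirm what was written, you can look inside that folder. A minimal sketch, assuming each endpoint is stored as a `.json` file named after its ID (this is an assumption about the on-disk layout, and `list_endpoint_ids` is a hypothetical helper, not a Moonshot API):

```python
import os


def list_endpoint_ids(endpoints_dir: str) -> list:
    """Return the base names of the .json files found in endpoints_dir."""
    if not os.path.isdir(endpoints_dir):
        return []
    return sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(endpoints_dir)
        if name.endswith(".json")
    )


# In the notebook you would call: list_endpoint_ids(env["CONNECTORS_ENDPOINTS"])
print(list_endpoint_ids("./moonshot-data/connectors-endpoints"))
```

Because the stored configuration may contain your token, it is worth keeping that folder out of version control.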

## Run a test using our predefined cookbook

Moonshot comes with a list of [cookbooks](https://github.com/aiverify-foundation/moonshot-data/tree/main/cookbooks) and [recipes](https://github.com/aiverify-foundation/moonshot-data/tree/main/recipes).

A <b>recipe</b> contains one or more <b>benchmark datasets</b> and <b>evaluation metrics</b>. A <b>cookbook</b> contains one or more <b>recipes</b>. To execute an existing test, we can select either a <b>recipe</b> or <b>cookbook</b>.

In this tutorial, we will run a <b>cookbook</b> called `leaderboard-cookbook`. This <b>cookbook</b> contains a set of popular [benchmark datasets](https://github.com/aiverify-foundation/moonshot-data/tree/main/datasets) (e.g., [mmlu](https://github.com/aiverify-foundation/moonshot-data/blob/main/datasets/mmlu-all.json)) that can be used to assess the capability of the model. 
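
The containment between cookbooks, recipes, datasets and metrics can be pictured with a small data model. This is only an illustration of the relationship described above: the class and field names are hypothetical, the IDs are made up, and none of it mirrors Moonshot's actual classes or file schema.

```python
from dataclasses import dataclass, field


@dataclass
class Recipe:
    id: str
    datasets: list
    metrics: list


@dataclass
class Cookbook:
    id: str
    recipes: list = field(default_factory=list)

    def dataset_ids(self) -> list:
        """Collect every dataset referenced by the cookbook's recipes, without duplicates."""
        seen = []
        for recipe in self.recipes:
            for dataset in recipe.datasets:
                if dataset not in seen:
                    seen.append(dataset)
        return seen


# Illustrative values only.
demo = Cookbook(
    id="demo-cookbook",
    recipes=[
        Recipe(id="recipe-a", datasets=["dataset-1"], metrics=["metric-x"]),
        Recipe(id="recipe-b", datasets=["dataset-1", "dataset-2"], metrics=["metric-y"]),
    ],
)
print(demo.dataset_ids())
```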

*For the purpose of this tutorial, we will configure our `runner` to run 1% of the prompts from every dataset in this cookbook, on the endpoint we created.*
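
To get a feel for what a percentage-based selection means, here is a rough sketch of seeded sampling. It assumes the count is rounded down with a minimum of one prompt; Moonshot's actual selection logic may differ, and `sample_prompt_indices` is a hypothetical helper.

```python
import random


def sample_prompt_indices(total_prompts: int, percentage: int, seed: int) -> list:
    """Pick a reproducible subset of prompt indices covering roughly `percentage` percent."""
    if total_prompts <= 0:
        return []
    # Assumption: round down, but always keep at least one prompt.
    count = max(1, total_prompts * percentage // 100)
    rng = random.Random(seed)  # the same seed always yields the same subset
    return sorted(rng.sample(range(total_prompts), count))


indices = sample_prompt_indices(total_prompts=500, percentage=1, seed=1)
print(len(indices))
```

The point of the seed is reproducibility: rerunning with the same seed picks the same prompts, which makes runs comparable.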

In [3]:
from slugify import slugify
from moonshot.api import api_get_all_run, api_create_runner, api_get_all_runner_name

name = "sample-cookbook-runner"  # Name of the runner
cookbooks = ["leaderboard-cookbook"]  # Test one cookbook, leaderboard-cookbook. Add more cookbook IDs to this list to test them as well
endpoints = ["my-openai-endpoint"]  # Test against one endpoint, my-openai-endpoint
prompt_selection_percentage = 1  # Percentage of prompts to run from EACH dataset in the cookbook; 1 means 1% of each dataset's prompts

# Below are the optional fields
random_seed = 1   # Default: 0; seeds the random selection of prompts when a prompt selection percentage is set
system_prompt = ""  # Default: ""; sets the system prompt for the endpoints

# Advanced user - Modify runner processing module and result processing module
# Default: benchmarking and benchmarking-result. Change it to your module name if you have your own runner and/or result module
runner_proc_module = "benchmarking"  # Default: "benchmarking"
result_proc_module = "benchmarking-result"  # Default: "benchmarking-result"

# Run the cookbooks with the defined endpoint(s)
# If the id exists, it will perform a load on the runner, instead of creating a new runner
# Using an existing runner allows the new run to possibly use cached results from previous runs, which greatly reduces the run time
slugify_id = slugify(name, lowercase=True)
if slugify_id in api_get_all_runner_name():
    cb_runner = api_load_runner(slugify_id)
else:
    cb_runner = api_create_runner(name, endpoints)

# run_cookbooks() is an async function; there is currently no sync version.
# Jupyter already runs an event loop, so we can await the call directly in this cell.
await cb_runner.run_cookbooks(
        cookbooks,
        prompt_selection_percentage,
        random_seed,
        system_prompt,
        runner_proc_module,
        result_proc_module,
    )
await cb_runner.close()  # Perform a close on the runner to allow proper cleanup.

# Display results in JSON
runner_runs = api_get_all_run(cb_runner.id)
result_info = runner_runs[-1].get("results")
if result_info:
    print(json.dumps(result_info, indent=2))
else:
    raise RuntimeError("no run result generated")

2024-12-19 22:58:26,178 [INFO][runner.py::run_cookbooks(412)] [Runner] sample-cookbook-runner - Running benchmark cookbook run...
2024-12-19 22:58:29,072 [INFO][benchmarking.py::generate(139)] [Benchmarking] Running cookbooks (['leaderboard-cookbook'])...
2024-12-19 22:58:29,072 [INFO][benchmarking.py::generate(145)] [Benchmarking] Running cookbook leaderboard-cookbook... (1/1)
2024-12-19 22:58:29,178 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 51.
2024-12-19 22:58:29,206 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 69.
2024-12-19 22:58:29,206 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 187.
2024-12-19 22:58:29,206 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 301.
2024-12-19 22:58:29,206 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] 

{
  "metadata": {
    "id": "sample-cookbook-runner",
    "start_time": "2024-12-19 22:58:26",
    "end_time": "2024-12-19 23:22:05",
    "duration": 1419,
    "status": "completed",
    "recipes": null,
    "cookbooks": [
      "leaderboard-cookbook"
    ],
    "endpoints": [
      "my-openai-endpoint"
    ],
    "prompt_selection_percentage": 1,
    "random_seed": 1,
    "system_prompt": ""
  },
  "results": {
    "cookbooks": [
      {
        "id": "leaderboard-cookbook",
        "recipes": [
          {
            "id": "mmlu",
            "details": [
              {
                "model_id": "my-openai-endpoint",
                "dataset_id": "mmlu-all",
                "prompt_template_id": "mmlu",
                "data": [
                  {
                    "prompt": "Question:\nJames is going to the baseball field with his friend Tommy.  James has to practice because baseball season starts in a week.  He wants to be a good player when the season starts.  James has bee

## Beautifying the results

The result above is shown as raw JSON. To make it easier to read, you can use the `rich` library to render it as a table.

In [4]:
from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
console = Console()

def show_cookbook_results(cookbooks, endpoints, cookbook_results, duration):
    """
    Show the results of the cookbook benchmarking.

    This function takes the cookbooks, endpoints, cookbook results, and duration as arguments.
    If there are results, it renders them as a table. If there are no results, it prints a message
    indicating that no results were found.
    Finally, it prints the duration of the run.

    Args:
        cookbooks (list): A list of cookbooks.
        endpoints (list): A list of endpoints.
        cookbook_results (dict): A dictionary with the results of the cookbook benchmarking.
        duration (float): The duration of the run.

    Returns:
        None
    """
    if cookbook_results:
        # Display recipe results
        generate_cookbook_table(cookbooks, endpoints, cookbook_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_cookbook_table(cookbooks: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table with the cookbook benchmarking results.

    This function creates a table that includes the index, cookbook name, recipe name, and the results
    for each endpoint.

    The cookbook names are prefixed with "Cookbook:" and are displayed with their overall grades. Each recipe under a
    cookbook is indented and prefixed with "Recipe:" followed by its individual grades for each endpoint. If there are
    no results for a cookbook, a row with dashes across all endpoint columns is added to indicate this.

    Args:
        cookbooks (list): A list of cookbook names to display in the table.
        endpoints (list): A list of endpoints for which results are to be displayed.
        results (dict): A dictionary containing the benchmarking results for cookbooks and recipes.

    Returns:
        None: The function prints the table to the console but does not return any value.
    """
    table = Table(
        title="Cookbook Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Cookbook (with its recipes)", justify="left", width=78)
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    index = 1
    for cookbook in cookbooks:
        # Get cookbook result
        cookbook_result = next(
            (
                result
                for result in results["results"]["cookbooks"]
                if result["id"] == cookbook
            ),
            None,
        )

        if cookbook_result:
            # Add the cookbook name with the "Cookbook: " prefix as the first row for this section
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        temp_eval
                        for temp_eval in cookbook_result["overall_evaluation_summary"]
                        if temp_eval["model_id"] == endpoint
                    ),
                    None,
                )

                # Get the grade from the evaluation_summary, or use "-" if not found
                grade = "-"
                if evaluation_summary and evaluation_summary["overall_grade"]:
                    grade = evaluation_summary["overall_grade"]
                endpoint_results.append(grade)
            table.add_row(
                str(index),
                f"Cookbook: [blue]{cookbook}[/blue]",
                *endpoint_results,
                end_section=True,
            )

            for recipe in cookbook_result["recipes"]:
                endpoint_results = []
                for endpoint in endpoints:
                    # Find the evaluation summary for the endpoint
                    evaluation_summary = next(
                        (
                            temp_eval
                            for temp_eval in recipe["evaluation_summary"]
                            if temp_eval["model_id"] == endpoint
                        ),
                        None,
                    )

                    # Get the grade from the evaluation_summary, or use "-" if not found
                    grade = "-"
                    if (
                        evaluation_summary
                        and "grade" in evaluation_summary
                        and "avg_grade_value" in evaluation_summary
                        and evaluation_summary["grade"]
                    ):
                        grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"
                    endpoint_results.append(grade)

                # Add the recipe name indented under the cookbook name
                table.add_row(
                    "",
                    f"  └──  Recipe: [blue]{recipe['id']}[/blue]",
                    *endpoint_results,
                    end_section=True,
                )

            # Increment index only after all recipes of the cookbook have been added
            index += 1
        else:
            # If no results for the cookbook, add a row indicating this with the "Cookbook: " prefix
            # and a dash for each endpoint column
            table.add_row(
                str(index),
                f"Cookbook: {cookbook}",
                *(["-"] * len(endpoints)),
                end_section=True,
            )
            index += 1

    # Display table
    console.print(table)

if result_info:
    show_cookbook_results(
        cookbooks, endpoints, result_info, result_info["metadata"]["duration"]
    )
else:
    raise RuntimeError("no run result generated")
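
If you prefer plain rows over a rendered table (for example to write a CSV), the same fields can be flattened. A minimal sketch that relies only on the keys read by `generate_cookbook_table` above; `flatten_grades` is a hypothetical helper and the sample data is made up.

```python
def flatten_grades(results: dict) -> list:
    """Return (cookbook_id, recipe_id, model_id, grade, avg_grade_value) tuples."""
    rows = []
    for cookbook in results.get("results", {}).get("cookbooks", []):
        for recipe in cookbook.get("recipes", []):
            for summary in recipe.get("evaluation_summary", []):
                rows.append(
                    (
                        cookbook.get("id"),
                        recipe.get("id"),
                        summary.get("model_id"),
                        summary.get("grade"),
                        summary.get("avg_grade_value"),
                    )
                )
    return rows


# Made-up data in the shape used by the table code; real grades will differ.
sample = {
    "results": {
        "cookbooks": [
            {
                "id": "demo-cookbook",
                "recipes": [
                    {
                        "id": "demo-recipe",
                        "evaluation_summary": [
                            {"model_id": "my-openai-endpoint", "grade": "B", "avg_grade_value": 70.0}
                        ],
                    }
                ],
            }
        ]
    }
}
for row in flatten_grades(sample):
    print(row)
```

In the notebook, you would call `flatten_grades(result_info)` on the result loaded earlier.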

## List all the cookbooks

If you are curious about which other cookbooks are available, you can use `api_get_all_cookbook()`.

The output of the cell below shows what the result looks like.

To run these cookbooks, just replace `leaderboard-cookbook` with one of the cookbook IDs or you can append more cookbook IDs to the list in the previous cell.
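
Each entry returned by `api_get_all_cookbook()` is a dictionary with `id`, `name`, `description`, and `recipes` keys, as the output further down shows. A small sketch of how you might search that list for cookbooks containing a given recipe; `cookbooks_with_recipe` is a hypothetical helper and the sample list is a trimmed stand-in for the real output.

```python
def cookbooks_with_recipe(cookbook_list: list, recipe_id: str) -> list:
    """Return the IDs of cookbooks whose recipe list includes recipe_id."""
    return [cb["id"] for cb in cookbook_list if recipe_id in cb.get("recipes", [])]


# Trimmed sample modelled on the output format; in the notebook, pass api_get_all_cookbook() instead.
sample_cookbooks = [
    {"id": "common-risk-easy", "recipes": ["bbq", "commonsense-morality-easy"]},
    {"id": "common-risk-hard", "recipes": ["bbq", "commonsense-morality-hard"]},
]
print(cookbooks_with_recipe(sample_cookbooks, "commonsense-morality-hard"))
```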

In [5]:
cookbook_ids = api_get_all_cookbook()
print("Total number of cookbooks: {0}".format(len(cookbook_ids)))
print("Showing the first three cookbooks below...")
print(json.dumps(cookbook_ids[0:3], indent=2))

Total number of cookbooks: 10
Showing the first three cookbooks below...
[
  {
    "id": "common-risk-easy",
    "name": "Easy test sets for Common Risks",
    "description": "This is a cookbook that consists (easy) test sets for common risks. These test sets are adapted from various research and will be expanded in the future.",
    "recipes": [
      "uciadult",
      "bbq",
      "winobias",
      "challenging-toxicity-prompts-completion",
      "realtime-qa",
      "commonsense-morality-easy",
      "jailbreak-dan",
      "advglue"
    ]
  },
  {
    "id": "common-risk-hard",
    "name": "Hard test sets for Common Risks",
    "description": "This is a cookbook that consists (hard) test sets for common risks. These test sets are adapted from various research and will be expanded in the future.",
    "recipes": [
      "uciadult",
      "bbq",
      "winobias",
      "challenging-toxicity-prompts-completion",
      "realtime-qa",
      "commonsense-morality-hard",
      "jailbreak-da