# Introduction

Welcome to this Jupyter notebook, where we will explore how you can use the benchmarking features in Moonshot to evaluate your LLMs and LLM applications.

## Benchmarking within Moonshot
To gauge the performance of AI models, we'll use Moonshot's benchmarking capabilities. Running a series of tests (recipes and cookbooks) against a model shows how it performs across different tasks, which helps you understand its strengths and weaknesses.

## Establishing a Connection with AI Models
Our first step is to create connector endpoints for the LLMs that you want to evaluate. We'll walk you through setting up an endpoint, which serves as a conduit between Moonshot and the AI model hosted on the provider's servers, so that Moonshot can send prompts to the model and receive its responses during testing.

## Pre-requisites

Check that you have created a new virtual environment and selected the preferred kernel before starting this notebook.

You may refer to the `Moonshot - Pre-Req - Setup.ipynb` in the jupyter-notebook folder for more information.

# Import and Environment Variables

In this section, we prepare our Jupyter notebook environment by importing necessary libraries and setting up environment variables.

In [1]:
!pwd

/home/dtejada/Dev/ai_rator_tools/aigs_moonshot/moonshot/examples/jupyter-notebook


In [2]:
# Python built-ins:
import os
import json
import sys

# If you're running this notebook from the moonshot/examples/jupyter-notebook folder, the below
# line will enable you to import moonshot from the local source code. If you installed moonshot
# from pip, you can remove this:
sys.path.insert(0, '../../')

# Import moonshot utilities:
from moonshot.api import (
    api_create_recipe,
    api_create_cookbook,
    api_create_endpoint,
    api_get_all_connector_type,
    api_get_all_endpoint,
    api_get_all_cookbook,
    api_get_all_recipe,
    api_get_all_prompt_template_detail,
    api_load_runner,
    api_set_environment_variables,
)

# Environment Configuration
# Here we set up the environment variables for the Moonshot framework.
# These variables define the paths to various modules and components used by Moonshot,
# organizing the framework's structure and access points.

# modify moonshot_data_path to point to your own copy of moonshot-data
moonshot_data_path = "./moonshot-data"
env = {
    "ATTACK_MODULES": os.path.join(moonshot_data_path, "attack-modules"),
    "BOOKMARKS": os.path.join(moonshot_data_path, "generated-outputs/bookmarks"),
    "CONNECTORS": os.path.join(moonshot_data_path, "connectors"),
    "CONNECTORS_ENDPOINTS": os.path.join(moonshot_data_path, "connectors-endpoints"),
    "CONTEXT_STRATEGY": os.path.join(moonshot_data_path, "context-strategy"),
    "COOKBOOKS": os.path.join(moonshot_data_path, "cookbooks"),
    "DATABASES": os.path.join(moonshot_data_path, "generated-outputs/databases"),
    "DATABASES_MODULES": os.path.join(moonshot_data_path, "databases-modules"),
    "DATASETS": os.path.join(moonshot_data_path, "datasets"),
    "IO_MODULES": os.path.join(moonshot_data_path, "io-modules"),
    "METRICS": os.path.join(moonshot_data_path, "metrics"),
    "PROMPT_TEMPLATES": os.path.join(moonshot_data_path, "prompt-templates"),
    "RECIPES": os.path.join(moonshot_data_path, "recipes"),
    "RESULTS": os.path.join(moonshot_data_path, "generated-outputs/results"),
    "RESULTS_MODULES": os.path.join(moonshot_data_path, "results-modules"),
    "RUNNERS": os.path.join(moonshot_data_path, "generated-outputs/runners"),
    "RUNNERS_MODULES": os.path.join(moonshot_data_path, "runners-modules"),
}

# Check user has set moonshot_data_path correctly:
if not os.path.isdir(env["ATTACK_MODULES"]):
    raise ValueError(
        "Configured path %s does not exist. Is moonshot-data installed at %s?"
        % (env["ATTACK_MODULES"], moonshot_data_path)
    )

# Apply the environment variables to configure the Moonshot framework.
api_set_environment_variables(env)

# Note: you may see a warning that IProgress was not found. It can be ignored for now.

  from .autonotebook import tqdm as notebook_tqdm
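
The check in the cell above only tests the `ATTACK_MODULES` folder. A broader check, sketched below, reports every configured path that is not an existing directory. It assumes each value in `env` should point to a directory; `find_missing_paths` is a hypothetical helper, not part of Moonshot, and the `generated-outputs` folders may legitimately be absent before a first run.

```python
import os
import tempfile


def find_missing_paths(env: dict) -> dict:
    """Return the subset of env whose paths are not existing directories."""
    return {name: path for name, path in env.items() if not os.path.isdir(path)}


# Demonstration with a throwaway directory standing in for moonshot-data.
with tempfile.TemporaryDirectory() as data_path:
    os.makedirs(os.path.join(data_path, "recipes"))
    demo_env = {
        "RECIPES": os.path.join(data_path, "recipes"),
        "COOKBOOKS": os.path.join(data_path, "cookbooks"),  # never created
    }
    missing = find_missing_paths(demo_env)
    print(sorted(missing))
```

In the notebook you could call `find_missing_paths(env)` before `api_set_environment_variables(env)` and inspect the returned names.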


## Results Display Enhancement Functions

These functions improve the presentation of results obtained from Moonshot libraries and APIs. Using the `rich` library, they turn plain text outputs into structured tables that are easier to read and analyze. The functions below display connector types, endpoints, recipes, cookbooks, prompt templates, runners, runs, and benchmarking results in tabular form. Each function has a docstring and prints a short message when there is nothing to display.

In [3]:
# Display Enhancements
# These imports are for improving the visual presentation of outputs in the notebook.
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Rich Library Imports
# The 'rich' library is used to create visually appealing tables, panels, and console outputs.
# This enhances the readability and presentation of data in the notebook.
from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel
from rich.table import Table

# Initialize the global console for rich text display, which will be used throughout the notebook.
console = Console()

In [4]:
from rich.markup import escape
from moonshot.integrations.cli.benchmark.recipe import _display_view_grading_scale_format, _display_view_statistics_format
from moonshot.integrations.cli.common.display_helper import display_view_list_format, display_view_str_format


def display_connector_types(connector_types):
    """
    Display a list of connector types.

    This function takes a list of connector types and displays them in a table format. If the list is empty, it prints a
    message indicating that no connector types were found.

    Args:
        connector_types (list): A list of connector types.

    Returns:
        None
    """
    if connector_types:
        table = Table(
            title="List of Connector Types",
            show_lines=True,
            expand=True,
            header_style="bold",
        )
        table.add_column("No.", width=2)
        table.add_column("Connector Type", justify="left", width=78)
        for connector_id, connector_type in enumerate(connector_types, 1):
            table.add_section()
            table.add_row(str(connector_id), connector_type)
        console.print(table)
    else:
        console.print("[red]There are no connector types found.[/red]")

def display_endpoints(endpoints_list, attributes_list):
    """
    Display a list of endpoints.

    This function takes a list of endpoints and displays the selected attributes in a table format. If the list is
    empty, it prints a message indicating that no endpoints were found.

    Args:
        endpoints_list (list): A list of endpoints. Each endpoint is a dictionary with keys 'id', 'name',
        'connector_type', 'uri', 'token', 'max_calls_per_second', 'max_concurrency', 'model', 'params',
        and 'created_date'.
        attributes_list (list): The endpoint keys to show, one table column per key, in the given order.

    Returns:
        None
    """
    if endpoints_list:
        table = Table(
            title="List of Connector Endpoints",
            show_lines=True,
            expand=True,
            header_style="bold",
        )
        for attr in attributes_list:
            table.add_column(attr, justify="left")

        for endpoint_id, endpoint in enumerate(endpoints_list, 1):
            (
                id,
                name,
                connector_type,
                uri,
                token,
                max_calls_per_second,
                max_concurrency,
                model,
                params,
                created_date,
            ) = endpoint.values()
            table.add_section()

            attr_values_list = []
            for attr in attributes_list:
                if attr == "id":
                    attr_values_list.append(id)
                elif attr == "name":
                    attr_values_list.append(name)
                elif attr == "connector_type":
                    attr_values_list.append(connector_type)
                elif attr == "uri":
                    attr_values_list.append(uri)
                elif attr == "token":
                    attr_values_list.append(token)
                elif attr == "max_calls_per_second":
                    attr_values_list.append(str(max_calls_per_second))
                elif attr == "max_concurrency":
                    attr_values_list.append(str(max_concurrency))
                elif attr == "model":
                    attr_values_list.append(str(model))
                elif attr == "params":
                    attr_values_list.append(escape(str(params)))
                elif attr == "created_date":
                    attr_values_list.append(created_date)
            table.add_row(*attr_values_list)
        console.print(table)
    else:
        console.print("[red]There are no endpoints found.[/red]")

def display_recipes(recipes_list: list) -> None:
    """
    Display the list of recipes in a tabular format.

    This function takes a list of recipe dictionaries and displays each recipe's details in a table.
    The table includes the recipe's ID, name, description, and associated details such as tags, categories,
    datasets, prompt templates, metrics, grading scale, and statistics. If the list is empty,
    it prints a message indicating that no recipes are found.

    Args:
        recipes_list (list): A list of dictionaries, where each dictionary contains the details of a recipe.
    """
    if recipes_list:
        table = Table(
            title="List of Recipes", show_lines=True, expand=True, header_style="bold"
        )
        table.add_column("No.", width=2)
        table.add_column("Recipe", justify="left", width=78)
        table.add_column("Contains", justify="left", width=20, overflow="fold")
        for recipe_id, recipe in enumerate(recipes_list, 1):
            (
                id,
                name,
                description,
                tags,
                categories,
                datasets,
                prompt_templates,
                metrics,
                grading_scale,
                stats,
            ) = recipe.values()

            tags_info = display_view_list_format("Tags", tags)
            categories_info = display_view_list_format("Categories", categories)
            datasets_info = display_view_list_format("Datasets", datasets)
            prompt_templates_info = display_view_list_format(
                "Prompt Templates", prompt_templates
            )
            metrics_info = display_view_list_format("Metrics", metrics)
            grading_scale_info = _display_view_grading_scale_format(
                "Grading Scale", grading_scale
            )
            stats_info = _display_view_statistics_format("Statistics", stats)

            recipe_info = (
                f"[red]id: {id}[/red]\n\n[blue]{name}[/blue]\n{description}\n\n"
                f"{tags_info}\n\n{categories_info}\n\n{grading_scale_info}\n\n{stats_info}"
            )
            contains_info = f"{datasets_info}\n\n{prompt_templates_info}\n\n{metrics_info}"

            table.add_section()
            table.add_row(str(recipe_id), recipe_info, contains_info)
        console.print(table)
    else:
        console.print("[red]There are no recipes found.[/red]")

def display_cookbooks(cookbooks_list):
    """
    Display the list of cookbooks in a tabular format.

    This function takes a list of cookbook dictionaries and displays each cookbook's details in a table.
    The table includes the cookbook's ID, name, description, and associated recipes. If the list is empty,
    it prints a message indicating that no cookbooks are found.

    Args:
        cookbooks_list (list): A list of dictionaries, where each dictionary contains the details of a cookbook.
    """
    if cookbooks_list:
        table = Table(
            title="List of Cookbooks", show_lines=True, expand=True, header_style="bold"
        )
        table.add_column("No.", width=2)
        table.add_column("Cookbook", justify="left", width=78)
        table.add_column("Contains", justify="left", width=20, overflow="fold")
        for cookbook_id, cookbook in enumerate(cookbooks_list, 1):
            id, name, tags, categories, description, recipes, = cookbook.values()
            cookbook_info = f"[red]ID: {id}[/red]\n\n[blue]{name}[/blue]\n{description}"
            cookbook_info += (f"\n\n[blue]Tags: {tags}[/blue]\n[blue]Categories: {categories}[/blue]\n")
            recipes_info = display_view_list_format("Recipes", recipes)
            table.add_section()
            table.add_row(str(cookbook_id), cookbook_info, recipes_info)
        console.print(table)
    else:
        console.print("[red]There are no cookbooks found.[/red]")

def display_prompt_templates(prompt_templates) -> None:
    """
    Display the list of prompt templates in a formatted table.

    This function takes a list of prompt templates and displays them in a formatted table.
    Each row in the table represents a prompt template with its ID, name, description, and contents.
    If the list of prompt templates is empty, it prints a message indicating that no prompt templates were found.

    Args:
        prompt_templates (list): A list of dictionaries, each representing a prompt template.
    """
    table = Table(
        title="List of Prompt Templates",
        show_lines=True,
        expand=True,
        header_style="bold",
    )
    table.add_column("No.", width=2)
    table.add_column("Prompt Template", justify="left", width=50)
    table.add_column("Contains", justify="left", width=48, overflow="fold")
    if prompt_templates:
        for prompt_index, prompt_template in enumerate(prompt_templates, 1):
            (
                id,
                name,
                description,
                contents,
            ) = prompt_template.values()

            prompt_info = f"[red]id: {id}[/red]\n\n[blue]{name}[/blue]\n{description}"
            table.add_section()
            table.add_row(str(prompt_index), prompt_info, contents)
        console.print(table)
    else:
        console.print("[red]There are no prompt templates found.[/red]")

def show_cookbook_results(cookbooks, endpoints, cookbook_results, duration):
    """
    Show the results of the cookbook benchmarking.

    This function takes the cookbooks, endpoints, cookbook results, and duration as arguments.
    If there are results, it generates a table with the cookbook results. If there are no results, it prints
    a message indicating that no results were found.
    Finally, it prints the duration of the run.

    Args:
        cookbooks (list): A list of cookbooks.
        endpoints (list): A list of endpoints.
        cookbook_results (dict): A dictionary with the results of the cookbook benchmarking.
        duration (float): The duration of the run.

    Returns:
        None
    """
    if cookbook_results:
        # Display cookbook results
        generate_cookbook_table(cookbooks, endpoints, cookbook_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_cookbook_table(cookbooks: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table with the cookbook benchmarking results.

    This function creates a table that includes the index, cookbook name, recipe name, and the results
    for each endpoint.

    The cookbook names are prefixed with "Cookbook:" and are displayed with their overall grades. Each recipe under a
    cookbook is indented and prefixed with "Recipe:" followed by its individual grades for each endpoint. If there are
    no results for a cookbook, a row with dashes across all endpoint columns is added to indicate this.

    Args:
        cookbooks (list): A list of cookbook names to display in the table.
        endpoints (list): A list of endpoints for which results are to be displayed.
        results (dict): A dictionary containing the benchmarking results for cookbooks and recipes.

    Returns:
        None: The function prints the table to the console but does not return any value.
    """
    table = Table(
        title="Cookbook Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Cookbook (with its recipes)", justify="left", width=78)
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    index = 1
    for cookbook in cookbooks:
        # Get cookbook result
        cookbook_result = next(
            (
                result
                for result in results["results"]["cookbooks"]
                if result["id"] == cookbook
            ),
            None,
        )

        if cookbook_result:
            # Add the cookbook name with the "Cookbook: " prefix as the first row for this section
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        temp_eval
                        for temp_eval in cookbook_result["overall_evaluation_summary"]
                        if temp_eval["model_id"] == endpoint
                    ),
                    None,
                )

                # Get the grade from the evaluation_summary, or use "-" if not found
                grade = "-"
                if evaluation_summary and evaluation_summary["overall_grade"]:
                    grade = evaluation_summary["overall_grade"]
                endpoint_results.append(grade)
            table.add_row(
                str(index),
                f"Cookbook: [blue]{cookbook}[/blue]",
                *endpoint_results,
                end_section=True,
            )

            for recipe in cookbook_result["recipes"]:
                endpoint_results = []
                for endpoint in endpoints:
                    # Find the evaluation summary for the endpoint
                    evaluation_summary = next(
                        (
                            temp_eval
                            for temp_eval in recipe["evaluation_summary"]
                            if temp_eval["model_id"] == endpoint
                        ),
                        None,
                    )

                    # Get the grade from the evaluation_summary, or use "-" if not found
                    grade = "-"
                    if (
                        evaluation_summary
                        and "grade" in evaluation_summary
                        and "avg_grade_value" in evaluation_summary
                        and evaluation_summary["grade"]
                    ):
                        grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"
                    endpoint_results.append(grade)

                # Add the recipe name indented under the cookbook name
                table.add_row(
                    "",
                    f"  └──  Recipe: [blue]{recipe['id']}[/blue]",
                    *endpoint_results,
                    end_section=True,
                )

            # Increment index only after all recipes of the cookbook have been added
            index += 1
        else:
            # If no results for the cookbook, add a row indicating this with the "Cookbook: " prefix
            # and a dash for each endpoint column
            table.add_row(
                str(index),
                f"Cookbook: {cookbook}",
                *(["-"] * len(endpoints)),
                end_section=True,
            )
            index += 1

    # Display table
    console.print(table)

def show_recipe_results(recipes, endpoints, recipe_results, duration):
    """
    Show the results of the recipe benchmarking.

    This function takes the recipes, endpoints, recipe results, and duration as arguments.
    If there are any recipe results, it generates a table to display them using the generate_recipe_table function.
    If there are no recipe results, it prints a message indicating that there are no results.
    Finally, it prints the time taken to run the benchmarking.

    Args:
        recipes (list): A list of recipes that were benchmarked.
        endpoints (list): A list of endpoints that were used in the benchmarking.
        recipe_results (dict): A dictionary with the results of the recipe benchmarking.
        duration (float): The time taken to run the benchmarking in seconds.

    Returns:
        None
    """
    if recipe_results:
        # Display recipe results
        generate_recipe_table(recipes, endpoints, recipe_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_recipe_table(recipes: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table of recipe results.

    This function creates a table that lists the results of running recipes against various endpoints.
    Each row in the table corresponds to a recipe, and each column corresponds to an endpoint.
    The results include the grade and average grade value for each recipe-endpoint pair.

    Args:
        recipes (list): A list of recipe IDs that were benchmarked.
        endpoints (list): A list of endpoint IDs against which the recipes were run.
        results (dict): A dictionary containing the results of the benchmarking.

    Returns:
        None: This function does not return anything. It prints the table to the console.
    """
    # Create a table with a title and headers
    table = Table(
        title="Recipes Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Recipe", justify="left", width=78)
    # Add a column for each endpoint
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    # Iterate over each recipe and populate the table with results
    for index, recipe_id in enumerate(recipes, start=1):
        # Attempt to find the result for the current recipe
        recipe_result = next(
            (
                result
                for result in results["results"]["recipes"]
                if result["id"] == recipe_id
            ),
            None,
        )

        # If the result exists, extract and format the results for each endpoint
        if recipe_result:
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        eval_summary
                        for eval_summary in recipe_result["evaluation_summary"]
                        if eval_summary["model_id"] == endpoint
                    ),
                    None,
                )

                # Format the grade and average grade value, or use "-" if not found
                grade = "-"
                if (
                    evaluation_summary
                    and "grade" in evaluation_summary
                    and "avg_grade_value" in evaluation_summary
                    and evaluation_summary["grade"]
                ):
                    grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"
                endpoint_results.append(grade)

            # Add a row for the recipe with its results
            table.add_row(
                str(index),
                f"Recipe: [blue]{recipe_result['id']}[/blue]",
                *endpoint_results,
                end_section=True,
            )
        else:
            # If no result is found, add a row with placeholders
            table.add_row(
                str(index),
                f"Recipe: [blue]{recipe_id}[/blue]",
                *(["-"] * len(endpoints)),
                end_section=True,
            )

    # Print the table to the console
    console.print(table)

def display_runners(
    runner_list: list, runner_run_info_list: list, runner_session_info_list: list
) -> None:
    """
    Display runners in a table format.

    This function takes lists of runner information, run information, and session information, then displays them in a
    table format on the command line interface. Each runner is listed with details such as the runner's ID, name,
    description, number of runs, number of sessions, database file, and endpoints.

    Args:
        runner_list: A list of dictionaries, where each dictionary contains information about a runner.

        runner_run_info_list: A list of dictionaries, where each dictionary contains information about a run
        associated with a runner.

        runner_session_info_list: A list of dictionaries, where each dictionary contains information about a session
        associated with a runner.

    Returns:
        None
    """
    if runner_list:
        table = Table(
            title="List of Runners", show_lines=True, expand=True, header_style="bold"
        )
        table.add_column("No.", width=2)
        table.add_column("Runner", justify="left", width=78)
        table.add_column("Contains", justify="left", width=20, overflow="fold")
        for runner_id, runner in enumerate(runner_list, 1):
            (id, name, db_file, endpoints, description) = runner.values()

            db_info = display_view_str_format("Database", db_file)
            endpoints_info = display_view_list_format("Endpoints", endpoints)

            runs_count = sum(
                run_info["runner_id"] == id for run_info in runner_run_info_list
            )
            # Handle the case where session_info can be None
            sessions_count = sum(
                session_info is not None and session_info["session_id"] == id
                for session_info in runner_session_info_list
            )

            runner_info = (
                f"[red]id: {id}[/red]\n\n[blue]{name}[/blue]\n{description}\n"
                f"[blue]Number of Runs:[/blue] {runs_count}\n"
                f"[blue]Number of Sessions:[/blue] {sessions_count}"
            )
            contains_info = f"{db_info}\n\n{endpoints_info}"

            table.add_section()
            table.add_row(str(runner_id), runner_info, contains_info)
        console.print(table)
    else:
        console.print("[red]There are no runners found.[/red]")

def display_runs(runs_list: list):
    """
    Display a list of runs in a table format.

    This function takes a list of run information and displays it in a table format using the rich library's
    Table object.

    Each run's details are formatted and added as a row in the table.
    If there are no runs to display, a message is printed to indicate that no results were found.

    Args:
        runs_list (list): A list of dictionaries, where each dictionary contains details of a run.

    Returns:
        None
    """
    if runs_list:
        table = Table(
            title="List of Runs", show_lines=True, expand=True, header_style="bold"
        )
        table.add_column("No.", width=2)
        table.add_column("Run", justify="left", width=78)
        table.add_column("Contains", justify="left", width=20, overflow="fold")
        for run_number, run in enumerate(runs_list, 1):
            (
                run_id,
                runner_id,
                runner_type,
                runner_args,
                endpoints,
                results_file,
                start_time,
                end_time,
                duration,
                error_messages,
                raw_results,
                results,
                status,
            ) = run.values()

            duration_info = (
                f"[blue]Period:[/blue] {start_time} - {end_time} ({duration}s)"
            )
            run_id = display_view_str_format("Run ID", run_id)
            runner_id = display_view_str_format("Runner ID", runner_id)
            runner_type = display_view_str_format("Runner Type", runner_type)
            runner_args = display_view_str_format("Runner Args", runner_args)
            status_info = display_view_str_format("Status", status)
            results_info = display_view_str_format("Results File", results_file)
            endpoints_info = display_view_list_format("Endpoints", endpoints)
            error_messages_info = display_view_list_format(
                "Error Messages", error_messages
            )

            has_raw_results = bool(raw_results)
            has_results = bool(results)

            result_info = f"[red]{runner_id}[/red]\n\n{run_id}\n\n{duration_info}\n\n{status_info}"
            contains_info = (
                f"{results_info}\n\n{error_messages_info}\n\n{endpoints_info}\n\n"
                f"[blue]Has Raw Results: {has_raw_results}[/blue]\n\n"
                f"[blue]Has Results: {has_results}[/blue]"
            )

            table.add_section()
            table.add_row(str(run_number), result_info, contains_info)
        console.print(table)
    else:
        console.print("[red]There are no results found.[/red]")
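
The table builders above read a nested results dictionary. The sketch below shows the shape they expect, inferred only from the keys those functions access (`results`, `cookbooks`, `overall_evaluation_summary`, `model_id`, `overall_grade`, `recipes`, `evaluation_summary`, `grade`, `avg_grade_value`). The ids and grades are made-up sample values, and `lookup_grade` is a hypothetical helper rather than a Moonshot function.

```python
sample_results = {
    "results": {
        "cookbooks": [
            {
                "id": "sample-cookbook",
                "overall_evaluation_summary": [
                    {"model_id": "sample-endpoint", "overall_grade": "B"},
                ],
                "recipes": [
                    {
                        "id": "sample-recipe",
                        "evaluation_summary": [
                            {
                                "model_id": "sample-endpoint",
                                "grade": "B",
                                "avg_grade_value": 71.5,
                            },
                        ],
                    }
                ],
            }
        ]
    }
}


def lookup_grade(summaries: list, endpoint: str, key: str) -> str:
    """Return the grade stored under key for the endpoint, or "-" if absent."""
    summary = next((s for s in summaries if s["model_id"] == endpoint), None)
    if summary and summary.get(key):
        return str(summary[key])
    return "-"


cookbook = sample_results["results"]["cookbooks"][0]
overall = lookup_grade(cookbook["overall_evaluation_summary"], "sample-endpoint", "overall_grade")
missing = lookup_grade(cookbook["overall_evaluation_summary"], "other-endpoint", "overall_grade")
print(overall, missing)
```

An endpoint with no matching summary falls back to `"-"`, which mirrors the placeholder the table functions print.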

## Connectors in Moonshot

A `connector` in the Moonshot framework acts as an interface between the framework itself and an external AI model, such as OpenAI's GPT-3.5. It is responsible for two primary functions:

1. **Communication**
2. **Response Processing**
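
To make the two responsibilities concrete, here is a minimal sketch of a connector-like class. It is illustrative only: `EchoConnector`, `send`, `process_response`, and `get_response` are hypothetical names, not Moonshot's actual connector interface, and the "model" is a local stand-in so the example runs without network access.

```python
class EchoConnector:
    """Toy connector: sends a prompt to a stand-in model and cleans up the reply."""

    def send(self, prompt: str) -> dict:
        # Communication: a real connector would call the provider's API here.
        return {"choices": [{"text": f"  echo: {prompt}  "}]}

    def process_response(self, raw: dict) -> str:
        # Response processing: extract the text that the framework needs.
        return raw["choices"][0]["text"].strip()

    def get_response(self, prompt: str) -> str:
        # Chain the two steps: send the prompt, then process the raw reply.
        return self.process_response(self.send(prompt))


reply = EchoConnector().get_response("hello")
print(reply)
```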

In [5]:
connection_types = api_get_all_connector_type()
display_connector_types(connection_types)

### Role of an Endpoint

Within the Moonshot framework, an endpoint represents the configured access point that facilitates communication between Moonshot and an AI model. It is the practical implementation of a connector, operationalizing the communication and response processing logic encapsulated in the connector's code.
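
One way to picture the relationship: a connector type is reusable code, while each endpoint is a set of configuration values that points that code at one model. The sketch below models this with a plain dataclass. `EndpointConfig` is a hypothetical stand-in whose field names mirror a subset of the keys Moonshot returns for an endpoint, and all values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointConfig:
    name: str
    connector_type: str
    model: str
    max_calls_per_second: int = 1
    max_concurrency: int = 1
    params: dict = field(default_factory=dict)


# Two endpoints that share one connector type but target different models.
endpoints = [
    EndpointConfig("endpoint-a", "sample-connector", "model-a"),
    EndpointConfig("endpoint-b", "sample-connector", "model-b", params={"temperature": 0.5}),
]
connector_types = {e.connector_type for e in endpoints}
print(connector_types, [e.model for e in endpoints])
```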

#### Retrieving Existing Endpoints
Use `api_get_all_endpoint()` to retrieve the endpoints that are already configured.

In [6]:
# Get the list of endpoints
endpoints_list = api_get_all_endpoint()

In [7]:
# Display the information that we can retrieve from endpoints
print("Total number of endpoints found: ", len(endpoints_list))
print("Information of each endpoint:")
print(list(endpoints_list[0].keys()))

Total number of endpoints found:  51
Information of each endpoint:
['id', 'name', 'connector_type', 'uri', 'token', 'max_calls_per_second', 'max_concurrency', 'model', 'params', 'created_date']


In [8]:
# Display a few key fields for each endpoint:
# id, name, model, params, created_date
display_endpoints(endpoints_list, ["id", "name", "model", "params", "created_date"])

## Create an AzureOpenAI endpoint

In [9]:
# Create a new endpoint for interacting with a gpt-4o model deployed on Azure OpenAI.
# The URI and API key are read from the AZURE_OPENAI_ENDPOINT2 and AZURE_OPENAI_API_KEY environment variables.
endpoint_id = api_create_endpoint(
    "test-azure-openai-endpoint",  # name: Assign a unique name to identify this endpoint later.
    "azure-openai-connector",      # connector_type: Specify the connector type for the model you want to evaluate.
    f"{os.getenv('AZURE_OPENAI_ENDPOINT2')}",  # uri: The URI of your Azure OpenAI resource.
    f"{os.getenv('AZURE_OPENAI_API_KEY')}",   # token: Your Azure OpenAI API key.
    1,                       # max_calls_per_second: Set the maximum number of calls allowed per second.
    1,                       # max_concurrency: Set the maximum number of concurrent calls.
    "gpt-4o",         # model: Define the model version to use.
    # params: Include any additional parameters required for this model.
    {
        "timeout": 300,      # timeout: Set the timeout for API calls in seconds.
        "max_attempts": 3,   # max_attempts: Set the max number of retry attempts. 
        "temperature": 0.5,  # temperature: Set the temperature for response variability.
    }
)
print(f"The newly created endpoint id: {endpoint_id}")

The newly created endpoint id: test-azure-openai-endpoint


In [10]:
# Retrieve and display the list of all configured endpoints to verify the addition of the new endpoint.
endpoints_list = api_get_all_endpoint()

# Display the information that we can retrieve from endpoints
print("Total number of endpoints found: ", len(endpoints_list))

# Check whether the newly created endpoint id is in the list
is_exist = False
new_endpoint = None
for endpoint in endpoints_list:
    if "test-azure-openai-endpoint" == endpoint["id"]:
        is_exist = True
        new_endpoint = endpoint
print("The newly created endpoint is in the list? ", is_exist)

Total number of endpoints found:  52
The newly created endpoint is in the list?  True


In [20]:
# Display a few key fields for the new endpoint only:
# id, name, model, params, created_date
display_endpoints([new_endpoint], ["id", "name", "model", "params", "created_date"])

## Recipe to test

In [21]:
# Retrieve all the recipes and look for the recipe to be tested
recipes_list = api_get_all_recipe()

# Display the information that we can retrieve from recipes
print("Total number of recipes found: ", len(recipes_list))

# Check whether the recipe to be tested is in the list of recipes
is_exist = False
testing_recipe = None
for recipe in recipes_list:
    if "item-category" == recipe["id"]:
        is_exist = True
        testing_recipe = recipe
print("The newly created recipe is in the list? ", is_exist)


Total number of recipes found:  108
The newly created recipe is in the list?  True


In [22]:
# We display the new recipe only
display_recipes([testing_recipe])

## Cookbook with the recipe

In [23]:
# Retrieve all the cookbooks
cookbooks_list = api_get_all_cookbook()

# Display the information that we can retrieve from cookbooks
print("Total number of cookbooks found: ", len(cookbooks_list))

# Check whether the newly created cookbook is in the list
is_exist = False
testing_cookbook = None
for cookbook in cookbooks_list:
    if "test-category-cookbook" == cookbook["id"]:
        is_exist = True
        testing_cookbook = cookbook
print("The newly created cookbook is in the list? ", is_exist)

Total number of cookbooks found:  26
The newly created cookbook is in the list?  True


In [24]:
# We display the new cookbook only
display_cookbooks([testing_cookbook])

## Run a test

Moonshot comes with a list of [cookbooks](https://github.com/aiverify-foundation/moonshot-data/tree/main/cookbooks) and [recipes](https://github.com/aiverify-foundation/moonshot-data/tree/main/recipes).

A <b>recipe</b> contains one or more <b>benchmark datasets</b> and <b>evaluation metrics</b>. A <b>cookbook</b> contains one or more <b>recipes</b>. To execute an existing test, we can select either a <b>recipe</b> or <b>cookbook</b>.
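
The containment described above can be sketched with two small data classes. These classes are hypothetical and exist only to show the relationship; Moonshot stores recipes and cookbooks in its own format. The recipe, cookbook, and dataset ids are the ones used in this notebook, while the metric name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Recipe:
    """Hypothetical model: a recipe pairs datasets with evaluation metrics."""

    id: str
    datasets: List[str]
    metrics: List[str]


@dataclass
class Cookbook:
    """Hypothetical model: a cookbook groups one or more recipes."""

    id: str
    recipes: List[Recipe] = field(default_factory=list)

    def dataset_ids(self) -> List[str]:
        # Every dataset reachable through the cookbook's recipes.
        return [d for recipe in self.recipes for d in recipe.datasets]


recipe = Recipe(
    id="item-category",
    datasets=["test-dataset"],
    metrics=["placeholder-metric"],  # placeholder, not a real metric id
)
cookbook = Cookbook(id="test-category-cookbook", recipes=[recipe])
print(cookbook.dataset_ids())
```

Running a cookbook therefore means running each of its recipes, and running a recipe means evaluating each of its datasets with its metrics.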

In [25]:
from slugify import slugify
from moonshot.api import api_get_all_run, api_create_runner, api_get_all_runner_name, api_load_runner

name = "01-sample-cookbook-runner" # Indicate the name
# recipes = ["item-category", "bbq"] # Test recipes, item-category and bbq
cookbooks = ["test-category-cookbook"] # Test one cookbook, test-category-cookbook. You can add more cookbooks to the list to test them as well
endpoints = ["test-azure-openai-endpoint"] # Test against 1 endpoint, the newly created test-azure-openai-endpoint
prompt_selection_percentage = 1 # The percentage of prompts to run from EACH dataset in the cookbook; 1 means 1% of each dataset's prompts.

# Below are the optional fields
random_seed = 0   # Default: 0; seeds the random selection of prompts when a prompt selection percentage is set
system_prompt = ""  # Default: ""; this allows setting the system prompt for the endpoints

# Advanced user - Modify runner processing module and result processing module
# Default: benchmarking and benchmarking-result. Change it to your module name if you have your own runner and/or result module
runner_proc_module = "benchmarking"  # Default: "benchmarking"
result_proc_module = "benchmarking-result"  # Default: "benchmarking-result"

# Run the cookbooks with the defined endpoint(s)
# If the id exists, it will perform a load on the runner, instead of creating a new runner
# Using an existing runner allows the new run to possibly use cached results from previous runs, which greatly reduces the run time
slugify_id = slugify(name, lowercase=True)
if slugify_id in api_get_all_runner_name():
    cb_runner = api_load_runner(slugify_id)
else:
    cb_runner = api_create_runner(name, endpoints)

# run_cookbooks() is an async function; there is currently no sync version.
# Jupyter already runs an event loop, so we can await the call directly in the cell.
await cb_runner.run_cookbooks(
        cookbooks,
        prompt_selection_percentage,
        random_seed,
        system_prompt,
        runner_proc_module,
        result_proc_module,
    )
await cb_runner.close()  # Perform a close on the runner to allow proper cleanup.

# Display results in JSON
runner_runs = api_get_all_run(cb_runner.id)
result_info = runner_runs[-1].get("results")
if result_info:
    print(json.dumps(result_info, indent=2))
else:
    raise RuntimeError("no run result generated")

2025-04-14 14:27:24,817 [INFO][runner.py::run_cookbooks(412)] [Runner] 01-sample-cookbook-runner - Running benchmark cookbook run...


2025-04-14 14:27:24,979 [INFO][benchmarking.py::generate(139)] [Benchmarking] Running cookbooks (['test-category-cookbook'])...
2025-04-14 14:27:24,981 [INFO][benchmarking.py::generate(145)] [Benchmarking] Running cookbook test-category-cookbook... (1/1)
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2025-04-14 14:27:26,477 [INFO][connector.py::get_prediction(348)] [Connector ID: test-azure-openai-endpoint] Predicting Prompt Index 82.
2025-04-14 14:27:26,488 [INFO][connector.py::get_prediction(348)] [Connector ID: test-azure-openai-endpoint] Predicting Prompt Index 447.
2025-04-14 14:27:26,489 [INFO][connector.py::get_prediction(348)] [Connector ID: test-azure-openai-endpoint] Predicting Prompt Index 530.
2025-04-14 14:27:26,491 [INFO][connector

{
  "metadata": {
    "id": "01-sample-cookbook-runner",
    "start_time": "2025-04-14 14:27:24",
    "end_time": "2025-04-14 14:37:27",
    "duration": 603,
    "status": "completed",
    "recipes": null,
    "cookbooks": [
      "test-category-cookbook"
    ],
    "endpoints": [
      "test-azure-openai-endpoint"
    ],
    "prompt_selection_percentage": 1,
    "random_seed": 0,
    "system_prompt": ""
  },
  "results": {
    "cookbooks": [
      {
        "id": "test-category-cookbook",
        "recipes": [
          {
            "id": "item-category",
            "details": [
              {
                "model_id": "test-azure-openai-endpoint",
                "dataset_id": "test-dataset",
                "prompt_template_id": "test-prompt-template",
                "data": [
                  {
                    "prompt": "Answer this question:\nIs Coke Zero a Fruit? Answer Yes or No.\n with one word. A:",
                    "predicted_result": {
                      "res

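The run above used `prompt_selection_percentage = 1` and `random_seed = 0`, and the log shows non-consecutive prompt indexes being sent to the endpoint. The sketch below shows one plausible way a percentage and a seed can produce a reproducible random subset. The function name, the rounding rule, and the use of Python's `random` module are assumptions for illustration; Moonshot's own selection logic may differ.

```python
import math
import random


def select_prompt_indexes(total_prompts, percentage, seed):
    """Pick roughly `percentage` percent of prompt indexes, reproducibly per seed."""
    if total_prompts == 0:
        return []
    # Assumed rounding rule: round up and always keep at least one prompt.
    count = min(total_prompts, max(1, math.ceil(total_prompts * percentage / 100)))
    rng = random.Random(seed)  # a dedicated generator keeps the choice repeatable
    return sorted(rng.sample(range(total_prompts), count))


picked = select_prompt_indexes(total_prompts=1000, percentage=1, seed=0)
print(len(picked))  # 10
```

Because the generator is seeded, calling the function again with the same arguments returns the same indexes, which is what makes a partial run comparable across endpoints.
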
## Beautifying the results

The result above is the raw JSON produced by the run. To make it easier to read, you can use the `rich` library to render it as a table.
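
Before a table can be rendered, the nested result has to be flattened into rows. The sketch below does this with the standard library only, using a trimmed stand-in for `result_info` that contains just the keys visible in the output above. The `flatten_results` helper is hypothetical and separate from the notebook's own display helpers; each resulting row could then be passed to a `rich` table via `Table.add_row`.

```python
# Trimmed stand-in for `result_info`; only keys visible in the raw output are used.
sample_result = {
    "metadata": {"id": "01-sample-cookbook-runner", "duration": 603},
    "results": {
        "cookbooks": [
            {
                "id": "test-category-cookbook",
                "recipes": [
                    {
                        "id": "item-category",
                        "details": [
                            {
                                "model_id": "test-azure-openai-endpoint",
                                "dataset_id": "test-dataset",
                                "prompt_template_id": "test-prompt-template",
                            }
                        ],
                    }
                ],
            }
        ]
    },
}


def flatten_results(result):
    """Return one (cookbook, recipe, endpoint, dataset) row per result detail."""
    rows = []
    for cookbook in result["results"]["cookbooks"]:
        for recipe in cookbook["recipes"]:
            for detail in recipe["details"]:
                rows.append(
                    (cookbook["id"], recipe["id"], detail["model_id"], detail["dataset_id"])
                )
    return rows


rows = flatten_results(sample_result)
header = ("cookbook", "recipe", "endpoint", "dataset")
for row in [header, *rows]:
    print(" | ".join(row))
```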

In [26]:
if result_info:
    show_cookbook_results(
        cookbooks, endpoints, result_info, result_info["metadata"]["duration"]
    )
else:
    raise RuntimeError("no run result generated")

## List all the Cookbooks

In [27]:
cookbook_ids = api_get_all_cookbook()
print("Total number of cookbooks: {0}".format(len(cookbook_ids)))
print("Showing the first three cookbooks below...")
print(json.dumps(cookbook_ids[0:3], indent=2))

Total number of cookbooks: 26
Showing the first three cookbooks below...
[
  {
    "id": "rag-evaluation-cookbook",
    "name": "RAG Evaluation Cookbook",
    "tags": [
      "rag"
    ],
    "categories": [
      "Capability",
      "Trust & Safety"
    ],
    "description": "This cookbook assesses how well Retrieval-Augmented Generation systems perform relative to a custom test dataset using LLM-based metrics from Ragas.",
    "recipes": [
      "ragas-evaluation"
    ]
  },
  {
    "id": "tamil-language-cookbook",
    "name": "Tamil Language",
    "tags": [
      "Tamil comprehension",
      "Tamil generation",
      "Tamil literature"
    ],
    "categories": [
      "Capability"
    ],
    "description": "This is a cookbook that consists of datasets related to the Tamil Language.",
    "recipes": [
      "tamil-kural-classification",
      "tamil-tamilnews-classification",
      "tamil-tanglish-tweets"
    ]
  },
  {
    "id": "common-risk-easy",
    "name": "Easy test sets for Co