# Automated LLM Evaluation

This notebook provides an automated evaluation framework for Large Language Models (LLMs) using a set of predefined prompts and rubrics. The responses from the models are classified based on the rubrics, and the results are visualized in a styled DataFrame.

## Requirements

- **Python Packages**: `marvin`, `openai`, `pandas`, `numpy`, `pydantic`, `litellm`, `groq` (see the `requirements.txt` file). `enum` and `os` are also imported, but they ship with the Python standard library.
- **API Keys**: You need API keys for each of the LLM providers you intend to use. These keys should be stored as environment variables in the format `PROVIDER_KEY`. For example, for OpenAI, you would set an environment variable `OPENAI_KEY` with your API key.
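
A missing key otherwise only shows up later as a failed model call, so it can help to check the environment up front. The sketch below is a minimal, hypothetical helper (`missing_provider_keys` is not part of the notebook) that assumes the `PROVIDER_KEY` naming convention described above.

```python
import os

def missing_provider_keys(providers, env=None):
    """Return the providers whose PROVIDER_KEY variable is unset or empty."""
    env = os.environ if env is None else env
    return [p for p in providers if not env.get(f"{p.upper()}_KEY")]

# Example with an explicit mapping instead of the real environment.
fake_env = {"OPENAI_KEY": "sk-placeholder"}
print(missing_provider_keys(["OPENAI", "GROQ"], fake_env))  # ['GROQ']
```

Called without the second argument, the helper reads from `os.environ`, so it could be run against the keys of `providers_models` before any model is queried.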

## Workflow

1. **Set Up Configuration**: Initialize settings for Marvin and Pandas display options.
2. **Define Prompts and Rubrics**: Create a list of prompts and their corresponding rubrics for classification.
3. **Process Responses**: For each prompt, call the LLMs, retrieve responses, and classify them based on the rubrics.
4. **Visualize Results**: Compute the share of PASS classifications for each prompt-model pair, pivot the result so that prompts are rows and models are columns, and apply styling to visualize it.
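
The pivot in step 4 can be sketched on toy data as follows. The model names, prompt label, and categories here are made up for illustration. Unlike the notebook's `calculate_proportions`, this simplified version drops ERROR rows before grouping, so a pair with only ERROR rows would come out as missing rather than as the string `'ERROR'`.

```python
import pandas as pd

# Toy classified responses; names and values are illustrative only.
toy = pd.DataFrame({
    "model": ["model-a", "model-a", "model-b", "model-b", "model-b"],
    "prompt": ["q1"] * 5,
    "category": ["PASS", "FAIL", "PASS", "PASS", "ERROR"],
})

# Ignore ERROR rows, then average a boolean "passed" flag per prompt/model pair.
valid = toy[toy["category"] != "ERROR"]
pass_rate = (
    valid.assign(passed=valid["category"].eq("PASS"))
    .groupby(["prompt", "model"])["passed"]
    .mean()
    .unstack()  # prompts become rows, models become columns
)
print(pass_rate)
```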

## Outputs

- `final_classified_df`: A DataFrame containing the model, prompt, response, and classification category for each prompt-model pair and repetition.
- `styled_df`: A styled DataFrame with prompts as rows and models as columns. Each cell shows the percentage of PASS classifications and is colored on a red-to-green gradient; a cell is grey if every response for that pair was classified as ERROR.

## Usage

To use this framework, you will need to:

1. Set your API keys as environment variables.
2. Define your prompts and rubrics in the `prompt_rubric_pairs` list.
3. Run the notebook to generate and visualize the classification results.

## Customization

You can customize this framework by:

- Adding or modifying prompts and rubrics in the `prompt_rubric_pairs` list.
- Adjusting the styling in the `gradient_color` function and `styled_df` creation to fit your preferences.
- Changing the LLM providers and models in the `providers_models` dictionary.
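
As one example of restyling, the sketch below shows a drop-in alternative to `gradient_color` with a softer palette. The name `pastel_gradient` and the specific colors are assumptions made for illustration; the function also treats missing values like `'ERROR'`, which the notebook's version does not need to do.

```python
import math

def pastel_gradient(value):
    """Map a pass proportion in [0, 1] to a pastel red-to-green CSS background."""
    # Strings such as 'ERROR', None, and NaN all get a neutral background.
    if isinstance(value, str) or value is None:
        return "background-color: lightgrey"
    if isinstance(value, float) and math.isnan(value):
        return "background-color: lightgrey"
    p = min(max(float(value), 0.0), 1.0)  # clamp to [0, 1]
    red = round(255 - 105 * p)
    green = round(150 + 105 * p)
    return f"background-color: rgb({red}, {green}, 150)"

print(pastel_gradient(0.0))  # background-color: rgb(255, 150, 150)
print(pastel_gradient(1.0))  # background-color: rgb(150, 255, 150)
```

It would be passed to the Styler in place of `gradient_color`.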

## Notes

- Ensure that your API keys are correctly set up and have the necessary permissions for accessing the LLMs.
- The classification results are dependent on the rubrics defined and the responses from the LLMs.
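
Because the grades are only as good as the rubrics, it may be worth checking a rubric's expected answer programmatically where that is possible. The sketch below does so for two of the prompts used in this notebook. The helper names are hypothetical, and `copenhagen_time` assumes that Python 3.9+ and a system time zone database are available.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def count_letter(word, letter):
    """Count case-insensitive occurrences of a letter in a word."""
    return word.lower().count(letter.lower())

def copenhagen_time(year):
    """Noon on March 26 in Boston, converted to Copenhagen local time."""
    boston_noon = datetime(year, 3, 26, 12, 0, tzinfo=ZoneInfo("America/New_York"))
    return boston_noon.astimezone(ZoneInfo("Europe/Copenhagen"))

print(count_letter("carryforward", "r"))  # 4
```

Calling `copenhagen_time` for several years is one way to see whether the expected answer depends on the year, since Europe and the United States switch to daylight saving time on different dates and the prompt does not name a year.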


In [2]:
from enum import Enum
import marvin
import os
import openai
import pandas as pd
from pydantic import BaseModel
import litellm
from groq import Groq
import numpy as np


marvin.settings.llm_temperature = 0.0  # temperature 0 keeps the grader's classifications as repeatable as possible
marvin.settings.llm_max_tokens = 200
llm_max_context_tokens = 600  # plain variable, not attached to marvin.settings; unused below
openai.api_key = os.environ.get("OPENAI_KEY")  # this is for marvin, not litellm
marvin.settings.llm_model = 'openai/gpt-4'
pd.set_option('display.max_colwidth', None)

def create_response_df(providers_models: dict, prompts: list, n: int):
    """Builds a DataFrame of model responses, querying every model with every prompt n times."""
    responses = [
        {"model": model, "prompt": prompt, "response": get_model_response(model, prompt, provider)}
        for _ in range(n)
        for provider, models in providers_models.items()
        for model in models
        for prompt in prompts
    ]
    return pd.DataFrame(responses)
    
def calculate_proportions(df):
    """Calculates the proportion of 'PASS' responses for each prompt/model combination, ignoring 'ERROR' responses."""
    def proportion_pass(x):
        valid_responses = x[x != 'ERROR']
        if len(valid_responses) == 0:
            return 'ERROR'
        return np.mean(valid_responses == 'PASS')

    grouped = df.groupby(['prompt', 'model'])
    proportions = grouped['category'].apply(proportion_pass)
    return proportions.unstack()

def gradient_color(proportion):
    """Maps a proportion to a color gradient between green and red, or returns grey for 'ERROR'."""
    if proportion == 'ERROR':
        return 'background-color: grey'
    green_component = int(255 * proportion)
    red_component = 255 - green_component
    return f'background-color: rgb({red_component}, {green_component}, 0)'

    
def call_model(model: str, message: dict, temperature: float, api_key: str):
    """Calls the specified model with given parameters."""
    try:
        response = litellm.completion(model=model, messages=[message], temperature=temperature, api_key=api_key)
        return response
    except Exception as e:
        print(f"Error calling model: {e}")
        return None


def get_model_response(model, prompt, provider):
    """Helper function to clean up response retrieval and formatting."""
    response = call_model(model=model, message={"role": "user", "content": prompt}, temperature=0.0,
                          api_key=os.environ.get(f"{provider.upper()}_KEY"))
    return response["choices"][0]["message"]["content"].replace('\n', ', ').strip() if response else "Error or no response"

def classify_responses(df, prompt_rubric_pairs):
    """Classifies each response with the rubric registered for its prompt; rows without a rubric get 'ERROR'."""
    prompt_to_rubric = {pair["prompt"]: pair["rubric"] for pair in prompt_rubric_pairs}

    def apply_rubric(row):
        rubric = prompt_to_rubric.get(row['prompt'])
        if rubric:
            return marvin.classify(row['response'], rubric).name
        return "ERROR"

    df['category'] = df.apply(apply_rubric, axis=1)
    return df

@marvin.classifier
class NumberofRs(Enum):
    PASS = """Says that the letter r is in the word carryforward four times"""
    FAIL = """Says something else, including that the letter r is in the word carryforward some number of times that is not four"""

@marvin.classifier
class TimeDiff(Enum):
    PASS = """Says that the time in Copenhagen is 6 PM"""
    FAIL = """Says something else"""
    
@marvin.classifier
class LetterLoc(Enum):
    PASS = """Says that 'y' is the letter"""
    FAIL = """Says something else"""

@marvin.classifier
class WeightQuestion(Enum):
    PASS = """Says that the feathers are heavier. Does not at any point say they weigh the same amount"""
    FAIL = """Says that the bricks and feathers are the same weight, that the bricks are heavier, or is ambiguous or confused"""

@marvin.classifier
class LogicQuestion(Enum):
    PASS = """Contains the following steps in this order: 
    1) Teleport with the Cacodemon
    2) Teleport with the Bunny
    3) Return with the Cacodemon
    4) Teleport with the Scientist
    5) Teleport with the Cacodemon
    May also include 'teleport alone' steps"""
    FAIL = """Says something else"""


prompt_rubric_pairs = [
    {
        "prompt": "How many times is the letter 'r' in the word 'carryforward'?",
        "rubric": NumberofRs,
    },
    {
        "prompt": "If it's 12 PM on March 26th in Boston, what time is that in Copenhagen?",
        "rubric": TimeDiff,
    },
    {
        "prompt": (
            "I'm looking at the letter \"b\" on the keyboard. "
            "I now go up to the row above, and slightly to the right, to a key that is adjacent to b. "
            "Then I go up again, and just slightly to the left, to a key that is adjacent to that letter. "
            "So now I'm two rows up from the initial \"b\" letter. What letter am I touching?"
        ),
        "rubric": LetterLoc,
    },
    {
        "prompt": "Which is heavier, 2 tons of feathers or 1 ton of bricks?",
        "rubric": WeightQuestion,
    },

    {
        "prompt": """Doom Slayer needs to teleport from Phobos to Deimos. 
        He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. 
        The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, 
        the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, 
        the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?""",
        "rubric": LogicQuestion,
    },
]

# You need API keys for each of your providers saved as environment variables in the format PROVIDER_KEY
providers_models = {
    "GROQ": ["groq/mixtral-8x7b-32768"],
    "GEMINI": ["gemini/gemini-pro"],
    "OPENAI": ["openai/gpt-3.5-turbo", "openai/gpt-4"],
    "ANTHROPIC": ["anthropic/claude-2"],
    "MISTRAL": ["mistral/mistral-large-latest"],
}

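# Override: limit this run to the OpenAI models. Remove this block to evaluate every provider listed above.
providers_models = {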
    "OPENAI": ["openai/gpt-3.5-turbo", "openai/gpt-4"],
}

n_repetitions = 2

responses_df = create_response_df(providers_models, [pair["prompt"] for pair in prompt_rubric_pairs], n_repetitions)

final_classified_df = classify_responses(responses_df, prompt_rubric_pairs)


In [3]:
final_classified_df

Unnamed: 0,model,prompt,response,category
0,openai/gpt-3.5-turbo,How many times is the letter 'r' in the word 'carryforward'?,There are three times the letter 'r' appears in the word 'carryforward'.,FAIL
1,openai/gpt-3.5-turbo,"If it's 12 PM on March 26th in Boston, what time is that in Copenhagen?",It would be 5 PM on March 26th in Copenhagen. Copenhagen is 5 hours ahead of Boston.,FAIL
2,openai/gpt-3.5-turbo,"I'm looking at the letter ""b"" on the keyboard. I now go up to the row above, and slightly to the right, to a key that is adjacent to b. Then I go up again, and just slightly to the left, to a key that is adjacent to that letter. So now I'm two rows up from the initial ""b"" letter. What letter am I touching?","The letter you are touching is ""Y"".",PASS
3,openai/gpt-3.5-turbo,"Which is heavier, 2 tons of feathers or 1 ton of bricks?","They are both the same weight, as 2 tons of feathers and 1 ton of bricks both weigh 2 tons.",FAIL
4,openai/gpt-3.5-turbo,"Doom Slayer needs to teleport from Phobos to Deimos. \n He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. \n The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, \n the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, \n the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?","The Doom Slayer should first teleport himself to Deimos, leaving the bunny and the cacodemon on Phobos. Then, he should teleport back to Phobos and take the bunny with him to Deimos. Finally, he should teleport back to Phobos, take the cacodemon with him, and bring the scientist to Deimos. This way, all companions are safely transported without any harm coming to them.",FAIL
5,openai/gpt-4,How many times is the letter 'r' in the word 'carryforward'?,The letter 'r' appears 3 times in the word 'carryforward'.,FAIL
6,openai/gpt-4,"If it's 12 PM on March 26th in Boston, what time is that in Copenhagen?",That would be 6 PM on March 26th in Copenhagen. Copenhagen is 6 hours ahead of Boston.,PASS
7,openai/gpt-4,"I'm looking at the letter ""b"" on the keyboard. I now go up to the row above, and slightly to the right, to a key that is adjacent to b. Then I go up again, and just slightly to the left, to a key that is adjacent to that letter. So now I'm two rows up from the initial ""b"" letter. What letter am I touching?","You are touching the letter ""R"".",FAIL
8,openai/gpt-4,"Which is heavier, 2 tons of feathers or 1 ton of bricks?",2 tons of feathers is heavier.,PASS
9,openai/gpt-4,"Doom Slayer needs to teleport from Phobos to Deimos. \n He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. \n The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, \n the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, \n the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?","1. First, Doom Slayer should teleport with the cacodemon to Deimos. , 2. He then leaves the cacodemon on Deimos and teleports back to Phobos alone., 3. On Phobos, he takes the bunny with him and teleports back to Deimos., 4. He leaves the bunny on Deimos, but takes the cacodemon back with him to Phobos., 5. On Phobos, he leaves the cacodemon and takes the UAC scientist with him to Deimos., 6. He leaves the UAC scientist on Deimos and teleports back to Phobos one last time., 7. Finally, he takes the cacodemon with him to Deimos., , Now, all of them are safely on Deimos.",PASS


## Create Figure

In [4]:
# Calculate proportions of 'PASS' responses
proportions_df = calculate_proportions(final_classified_df)

# Apply gradient coloring (Styler.applymap is deprecated as of pandas 2.1 in favor of Styler.map)
styled_df = proportions_df.style.applymap(gradient_color)

# Apply percentage formatting
styled_df = styled_df.format(lambda x: 'ERROR' if x == 'ERROR' else f'{round(x * 100)}%').set_properties(**{
    'border': '1px solid black',
    'text-align': 'center',
    'font-size': '14px'
}).set_table_styles([{
    'selector': 'th',
    'props': [
        ('background-color', '#F0F8FF'),
        ('text-align', 'center'),
        ('border', '1px solid black')
    ]
}])

# Display the styled DataFrame
styled_df


prompt,openai/gpt-3.5-turbo,openai/gpt-4
"Doom Slayer needs to teleport from Phobos to Deimos. He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?",0%,100%
How many times is the letter 'r' in the word 'carryforward'?,0%,0%
"I'm looking at the letter ""b"" on the keyboard. I now go up to the row above, and slightly to the right, to a key that is adjacent to b. Then I go up again, and just slightly to the left, to a key that is adjacent to that letter. So now I'm two rows up from the initial ""b"" letter. What letter am I touching?",100%,0%
"If it's 12 PM on March 26th in Boston, what time is that in Copenhagen?",0%,100%
"Which is heavier, 2 tons of feathers or 1 ton of bricks?",0%,100%

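
For a single headline number per model, the per-prompt grades can be averaged. The sketch below is a minimal illustration that copies the `(model, category)` pairs from the classified responses shown earlier in this notebook; the helper name `overall_pass_rate` is hypothetical, and ERROR rows are skipped in the same spirit as `calculate_proportions`.

```python
from collections import Counter

# (model, category) pairs copied from the classified responses in this notebook.
rows = [
    ("openai/gpt-3.5-turbo", "FAIL"), ("openai/gpt-3.5-turbo", "FAIL"),
    ("openai/gpt-3.5-turbo", "PASS"), ("openai/gpt-3.5-turbo", "FAIL"),
    ("openai/gpt-3.5-turbo", "FAIL"),
    ("openai/gpt-4", "FAIL"), ("openai/gpt-4", "PASS"), ("openai/gpt-4", "FAIL"),
    ("openai/gpt-4", "PASS"), ("openai/gpt-4", "PASS"),
]

def overall_pass_rate(rows):
    """Share of PASS grades per model, ignoring ERROR rows."""
    totals, passes = Counter(), Counter()
    for model, category in rows:
        if category == "ERROR":
            continue
        totals[model] += 1
        passes[model] += category == "PASS"
    return {model: passes[model] / totals[model] for model in totals}

print(overall_pass_rate(rows))
# {'openai/gpt-3.5-turbo': 0.2, 'openai/gpt-4': 0.6}
```

With only five prompts per model, these averages are a rough summary rather than a reliable ranking.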

## Feather test: should fail

In [21]:
text = """ 2 tons of feathers and 1 ton of bricks both weigh the same amount. 
Weight is determined by mass, and a "ton" is a unit of mass equal to 2,000 pounds., * So 2 tons of feathers is equal to 4,000 pounds., * 
And 1 ton of bricks is equal to 2,000 pounds., * Therefore, 2 tons of feathers and 1 ton of bricks both weigh the same amount., , 
The confusing part of this question is that feathers are less dense than bricks. A ton of feathers would take up much more volume and space than a ton of bricks. But the actual weight is only determined by the mass, 
which is the same for both. So the feathers weight twice as much."""
rubric = WeightQuestion  # the rubric for the feather prompt; replace with the rubric you want to spot-check

classification_result = marvin.classify(text, rubric).name
print(f"Classification result: {classification_result}")

Classification result: FAIL
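
Spot checks like the one above can be collected into a small regression suite for the grader. The sketch below is one possible shape for that, not part of the notebook: `spot_check` takes any callable that maps a text to a label, and `keyword_grader` is a deliberately crude, hypothetical stand-in so that the example runs without API access.

```python
def spot_check(cases, classify):
    """Return (text, expected, actual) for every case the grader gets wrong."""
    mismatches = []
    for text, expected in cases:
        actual = classify(text)
        if actual != expected:
            mismatches.append((text, expected, actual))
    return mismatches

# Hypothetical stand-in for the LLM grader so the sketch runs offline.
def keyword_grader(text):
    return "FAIL" if "same" in text.lower() else "PASS"

cases = [
    ("2 tons of feathers is heavier.", "PASS"),
    ("They both weigh the same amount.", "FAIL"),
]
print(spot_check(cases, keyword_grader))  # []
```

In the notebook itself, the grader could be supplied as `lambda t: marvin.classify(t, WeightQuestion).name`, which reuses the call shown in the cell above.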
