# What Are We Doing Here?

When we create a rubric to use with an LLM like GPT-4, we are creating a classification model. This means we're sorting text into different "buckets."

### Types of Classification Models

- **Binary Classification**: Sorting into two distinct buckets.
- **Multiclass Classification**: Sorting into three or more distinct buckets.
- **Multilabel Classification**: Assigning multiple labels to an observation, where each observation can have more than one label.
- **Scoring Model**: Awarding different points for meeting different criteria. This isn't exactly classification, but it's similar.
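
As a rough illustration, the four output shapes could be written as Python types. The class names and labels below are hypothetical and are not used elsewhere in this notebook.

```python
from enum import Enum

class Refusal(Enum):   # binary: exactly two buckets
    REFUSED = "refused"
    ANSWERED = "answered"

class Stance(Enum):    # multiclass: three or more mutually exclusive buckets
    FEATHERS = "feathers"
    BRICKS = "bricks"
    EQUAL = "equal"

class Topic(Enum):     # the label set a multilabel classifier draws from
    PHYSICS = "physics"
    RIDDLE = "riddle"
    UNITS = "units"

# One observation, graded four different ways.
binary_label = Refusal.ANSWERED
multiclass_label = Stance.FEATHERS
multilabel_labels = {Topic.PHYSICS, Topic.UNITS}  # any subset of Topic is valid
score = 2 + 1  # scoring: points added up per criterion met
```

The first three return labels from a fixed set, while the scoring model returns a number, which is why it only resembles classification.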

### Evaluating Model Performance

We evaluate a model by its accuracy relative to some ground truth, that is, by how often it generates or predicts the correct labels or scores.
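
A minimal sketch of that comparison, assuming the ground truth and the model's output are two equal-length lists of label strings (the `accuracy` helper and the sample labels are made up for illustration):

```python
def accuracy(predicted, truth):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must be the same length")
    if not truth:
        raise ValueError("need at least one labeled example")
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)

truth = ["PASS", "FAIL", "FAIL", "PASS"]
predicted = ["PASS", "FAIL", "PASS", "PASS"]
print(accuracy(predicted, truth))  # 3 of 4 labels agree -> 0.75
```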

### Key Considerations

- **Testing, Not Training**: It's important to note that we're not "training a model" in this scenario. Instead, we're testing a model's effectiveness. The "model" in this context includes the rubric, any parameters we're using (like temperature), and the base LLM (e.g., GPT-4). The model is the box you put your text in that spits out a response.
  
- **Model Performance Variability**: A model's performance can vary significantly with the data set it's applied to: it may do very well on one data set and poorly on another. In traditional machine learning, the best-known version of this problem is "overfitting," where a model learns the training set too well but fails to generalize to new data. A rubric can overfit in the same way if you keep tweaking it against the same handful of examples. To mitigate this, it's useful to tune rubrics on a limited data set and then apply them to a broader set to evaluate generalizability.

- **Methodological Advancements**: Before the advent of current-generation models, approaches might have included keyword searches, examining semantic similarity, or training small, local models. The goal with LLM scoring is to achieve better implementation speed and accuracy, with a particular focus on generalizing well to new text. This is still new, but it looks promising.
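
One way to set up the "limited set first, broader set later" check is to hold back part of the labeled data before tuning. The sketch below assumes labeled data is a list of `(response, label)` pairs; `split_labeled` and its defaults are hypothetical, not part of this notebook.

```python
import random

def split_labeled(rows, holdout_fraction=0.3, seed=0):
    """Shuffle labeled rows and split them into (tuning, holdout) lists."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

rows = [(f"response {i}", "PASS" if i % 3 == 0 else "FAIL") for i in range(10)]
tuning, holdout = split_labeled(rows)
print(len(tuning), len(holdout))  # 7 3
```

The rubric is adjusted only against `tuning`. A large drop in accuracy on `holdout` suggests the rubric was fitted to the examples rather than to the task.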

### The Process of Rubric Design

Rubric design is akin to model design, with a notable exception: your primary tool is text. The process involves several steps:

1. **Labeling Data**: Identifying and marking up data for testing.
2. **Writing a Rubric**: Creating guidelines that the LLM will use to evaluate or score text.
3. **Testing and Tweaking**: Assessing how well the combination of rubric, LLM, and other parameters performs in labeling text. If the initial attempts are not fully successful, you can adjust the rubric.
4. **Generalization**: Once the rubric performs well with test data, the next step is to see how it applies to new, unseen data.
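
The test-and-tweak loop in steps 1 to 3 can be sketched without any API calls. Here two keyword functions stand in for "rubric plus LLM"; they are placeholders for illustration only, and the third labeled example is invented rather than taken from real model output.

```python
def rubric_v1(text):
    # Stand-in for an LLM call: PASS if the text mentions feathers and "heavier".
    lowered = text.lower()
    return "PASS" if "feathers" in lowered and "heavier" in lowered else "FAIL"

def rubric_v2(text):
    # Tweaked version: FAIL anything that also claims the weights are equal.
    lowered = text.lower()
    if "same" in lowered or "equal" in lowered or "neither" in lowered:
        return "FAIL"
    return rubric_v1(text)

# Step 1: labeled data as (response, true label) pairs.
labeled = [
    ("2 tons of feathers is heavier.", "PASS"),
    ("Neither, they both weigh a ton.", "FAIL"),
    ("They weigh the same, so neither the feathers nor the bricks is heavier.", "FAIL"),
]

def score(rubric, examples):
    """Fraction of examples where the rubric agrees with the true label."""
    return sum(rubric(text) == label for text, label in examples) / len(examples)

# Steps 2 and 3: score each rubric version and keep the best one.
results = {rubric.__name__: score(rubric, labeled) for rubric in (rubric_v1, rubric_v2)}
best = max(results, key=results.get)
print(results, best)
```

The first version is fooled by the third response because it only looks for the words "feathers" and "heavier"; the tweak fixes that case. Step 4 would then run the winning rubric on responses that played no part in the tweaking.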

### Why Are We Doing This?

The goal of this process is to automate the evaluation of model responses, across models and on new models, without being limited to multiple-choice questions. If this works, we can test models much more quickly than if a person had to read each response. The payoff grows with the number of questions we plan to ask and with the number of times we need to ask each one (potentially varying the wording or other parameters, or adding different jailbreak prompts) to see the full range of responses a model might output.

### Is This Easy or Difficult?

It may be that effective rubric writing is easy. For instance, distinguishing between "it refused to answer" and "it answered and gave detailed instructions" is pretty easy. You can think of that as text data that "separates" fairly cleanly. Alternatively, it may be difficult, if it's difficult to put into words what you're looking for, or if the underlying LLM we're using doesn't do a good job interpreting your instructions. We don't know yet! 


### What If No LLM Can Answer My Question Correctly?

Then you need to write some sample responses yourself. You can get LLMs to help you. For instance, you can generate synthetic data by giving a model different prompts that ask just the question, ask the question with a hint, or ask the question while telling it the answer. Either way, to test a classification model you absolutely need example observations in each category.
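
The three prompt styles could be built like this. The `build_prompt_variants` helper, the hint text, and the exact wording of each template are assumptions for illustration.

```python
def build_prompt_variants(question, hint, answer):
    """Return the same question asked plainly, with a hint, and with the answer given."""
    return {
        "plain": question,
        "with_hint": f"{question}\nHint: {hint}",
        "with_answer": (
            f"{question}\nThe correct answer is: {answer}. "
            "Explain why in your own words."
        ),
    }

variants = build_prompt_variants(
    question="Which is heavier, 2 tons of feathers or 1 ton of bricks?",
    hint="Compare the number of tons, not the materials.",
    answer="2 tons of feathers",
)
prompts = list(variants.values())  # a plain list of prompt strings
for name, prompt in variants.items():
    print(f"[{name}] {prompt}\n")
```

Mixing the variants makes it more likely that the collected responses cover every category, including the correct one.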


### Is This The Code We Should Use for This?

Whatever shortens the iteration process, so you can quickly go from results to labeled data to rubric evaluation, is good. I'm neither a designer nor a software engineer, and I don't know what your comfort level with Python is.

## Imports and Functions

In [102]:
from enum import Enum
import os
import random  # used to pick a random temperature for each model call

import litellm
import marvin
import numpy as np
import openai
import pandas as pd
from pydantic import BaseModel
from IPython.display import display, HTML

marvin.settings.llm_temperature=0.0 # you want a grading schema which consistently gets you the same answers!!! 
openai.api_key = os.environ.get("OPENAI_KEY") 
marvin.settings.llm_model='openai/gpt-4' # you can use 3.5 if you want

pd.set_option('display.max_colwidth', None)

def create_response_df(providers_models: dict, prompts: list, n: int):
    """
    Generates a DataFrame containing responses from various LLMs for given prompts,
    each with a randomly selected temperature setting.
    
    Args:
    - providers_models: A dictionary mapping provider names to a list of their model identifiers.
    - prompts: A list of prompts to send to the models.
    - n: The number of times to repeat each prompt for each model.
    
    Returns:
    - A pandas DataFrame with columns for the model, prompt, temperature, and response.
    """
    responses = [
        {
            "model": model, 
            "prompt": prompt, 
            "temperature": temp,  # Adding temperature here
            "response": get_model_response(model, prompt, provider, temp)  # Now includes temperature
        }
        for _ in range(n)
        for provider, models in providers_models.items()
        for model in models
        for prompt in prompts
        for temp in [random.uniform(0, 1)]  # Generate a random temperature for each iteration
    ]
    return pd.DataFrame(responses)
    
def call_model(model: str, message: dict, temperature: float, api_key: str):
    """
    Calls the specified model with given parameters and returns the response.
    
    Args:
    - model: The model to call.
    - message: The message dict containing the role and content.
    - temperature: The temperature to use for the model call.
    - api_key: The API key for authentication.
    
    Returns:
    - The response from the model, or None if an error occurs.
    """
    try:
        response = litellm.completion(model=model, messages=[message], temperature=temperature, api_key=api_key)
        return response
    except Exception as e:
        print(f"Error calling model: {e}")
        return None

def get_model_response(model, prompt, provider, temperature):
    """
    Retrieves a response from a given model with specified parameters.
    
    Args:
    - model: The model identifier.
    - prompt: The prompt to provide to the model.
    - provider: The provider of the model.
    - temperature: The randomness temperature to use for the response generation.
    
    Returns:
    - The model's response as a string.
    """
    response = call_model(model=model, message={"role": "user", "content": prompt}, temperature=temperature,
                          api_key=os.environ.get(f"{provider.upper()}_KEY"))
    return response["choices"][0]["message"]["content"].replace('\n', ', ').strip() if response else "Error or no response"

def calculate_and_display_metrics(conf_matrix: pd.DataFrame, positive_label: str, negative_label: str):
    """
    Calculates and displays evaluation metrics from a confusion matrix.
    
    Args:
    - conf_matrix: A confusion matrix as a pandas DataFrame.
    - positive_label: The label considered 'positive' for the purposes of calculation.
    - negative_label: The label considered 'negative' for the purposes of calculation.
    """
    # Extracting confusion matrix values
    TP = conf_matrix.loc[positive_label, positive_label]
    TN = conf_matrix.loc[negative_label, negative_label]
    FP = conf_matrix.loc[negative_label, positive_label]
    FN = conf_matrix.loc[positive_label, negative_label]
    
    # Calculating metrics
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if TP + FP else 0
    recall = TP / (TP + FN) if TP + FN else 0
    
    # Displaying metrics with explanations
    display(HTML(f"""
    <h3 style='color: #4CAF50;'>Evaluation Metrics:</h3>
    <ul>
        <li><b>Accuracy ({accuracy:.2f}):</b> Fraction of predictions that were correct.</li>
        <li><b>Precision ({precision:.2f}):</b> Of all predicted positive, how many were actually positive?</li>
        <li><b>Recall ({recall:.2f}):</b> Of all actual positive, how many were predicted as positive?</li>
    </ul>
    """))

def filter_and_print_examples(merged: pd.DataFrame, true_cat: str, model_cat: str, positive_category: str, negative_category: str, n_examples=5):
    """
    Filters and displays examples of true positives, false positives, true negatives, and false negatives.
    
    Args:
    - merged: The DataFrame containing the true and predicted categories.
    - true_cat: The column name for true categories.
    - model_cat: The column name for model-predicted categories.
    - positive_category: The label for positive outcomes.
    - negative_category: The label for negative outcomes.
    - n_examples: The number of examples to display for each outcome.
    """
    outcomes = {
        "True Positives": merged[(merged[true_cat] == positive_category) & (merged[model_cat] == positive_category)],
        "False Positives": merged[(merged[true_cat] == negative_category) & (merged[model_cat] == positive_category)],
        "True Negatives": merged[(merged[true_cat] == negative_category) & (merged[model_cat] == negative_category)],
        "False Negatives": merged[(merged[true_cat] == positive_category) & (merged[model_cat] == negative_category)]
    }

    for outcome_name, df in outcomes.items():
        display(HTML(f"<h3 style='color: #2196F3;'>{outcome_name}:</h3>"))
        for _, row in df.head(n_examples).iterrows():
            response = row['response'].replace('\n', '<br>')
            display(HTML(f"- {response}<br>"))
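
One caveat with `calculate_and_display_metrics`: it reads cells with `.loc`, and `pd.crosstab` only creates rows and columns for labels that actually occur. If a rubric never predicts `PASS`, the lookup raises a `KeyError`. A possible guard, sketched here with a hypothetical `full_confusion_matrix` helper, is to reindex the matrix so every label is present:

```python
import pandas as pd

def full_confusion_matrix(true_labels, predicted_labels, labels):
    """Cross-tabulate true vs. predicted labels, keeping every label as a row and a column."""
    matrix = pd.crosstab(
        pd.Series(true_labels, name="true_category"),
        pd.Series(predicted_labels, name="category"),
    )
    # Labels that never occurred get an all-zero row or column instead of being absent.
    return matrix.reindex(index=labels, columns=labels, fill_value=0)

conf = full_confusion_matrix(
    true_labels=["PASS", "FAIL", "PASS"],
    predicted_labels=["FAIL", "FAIL", "FAIL"],  # the classifier never says PASS
    labels=["PASS", "FAIL"],
)
print(conf)
print(conf.loc["PASS", "PASS"])  # 0 instead of a KeyError
```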

## Getting Data From Various Models

In [20]:
prompts = ["Which is heavier, 2 tons of feathers or 1 ton of bricks?"]

providers_models = {
    "GROQ": ["groq/mixtral-8x7b-32768"],
    "GEMINI": ["gemini/gemini-pro"],
    "OPENAI": ["openai/gpt-3.5-turbo", "openai/gpt-4"],
    "ANTHROPIC": ["anthropic/claude-3-sonnet-20240229"],
    "MISTRAL": ["mistral/mistral-large-latest"],
}

n_repetitions = 10

responses_df = create_response_df(providers_models, prompts, n_repetitions)

## Looking at (Unique) Responses

In [27]:
# .copy() so that adding columns later doesn't trigger pandas' SettingWithCopyWarning
uniqueResponses = responses_df.drop_duplicates(subset=['response'])[['response']].copy()
uniqueResponses

Unnamed: 0,response
0,"2 tons of feathers and 1 ton of bricks both weigh the same amount, which is 1 ton. The difference is in their volume, as feathers are much lighter than bricks, so it would take a larger quantity of feathers to make 2 tons. Therefore, neither is heavier than the other in this scenario."
1,"Neither, they both weigh a ton."
2,"They are both the same weight, as 2 tons of feathers and 1 ton of bricks both weigh 2 tons."
3,2 tons of feathers is heavier.
4,"They both weigh the same., , A ton is a unit of weight or mass, equal to 2,000 pounds (907 kg)., , So 2 tons of feathers and 1 ton of bricks have the same weight, just different compositions., , This question plays on the perceived notion that feathers are lighter than bricks, which is true for equal volumes. But when comparing by weight/mass units like tons, 2 tons of any material weighs the same as 1 ton of any other material., , The phrasing is meant to trick people into thinking 1 ton of the denser bricks would be heavier than 2 tons of the lighter feathers, when in fact the tons unit already accounts for the weight difference between the materials."
5,"2 tons of feathers is heavier than 1 ton of bricks. The weight of an object is determined by how much matter it contains and the force of gravity acting upon it. In this case, 2 tons of any material will always be heavier than 1 ton of any other material. Although it might seem counterintuitive because we usually associate feathers with lightness and bricks with heaviness, the quantity (in this case, tons) is the determining factor for weight."
8,"They are both equal in weight, as 2 tons of feathers and 1 ton of bricks both weigh 2 tons."
10,"They both weigh the same., , A ton is a unit of weight/mass, equal to 2,000 pounds (907 kg). So 2 tons of feathers has the same weight as 1 ton of bricks, because in both cases the weight is 2 tons., , The old riddle plays on the mistaken assumption that feathers, being lighter in density, would weigh less than a denser material like bricks for the same weight measurement. But a ton is a ton, regardless of the material's density. The feathers would just take up much more volume to reach the equivalent 2 ton weight."
11,"2 tons of feathers is heavier than 1 ton of bricks. The weight is determined by the ton unit, and 2 tons is more than 1 ton, regardless of the material being weighed."
15,2 tons of feathers is heavier than 1 ton of bricks.


## Let's try our first shot at a classifier! 

In [80]:
@marvin.classifier
class GradingWeight(Enum):
    PASS = """Says the feathers"""
    FAIL = """Says they're equal, or the bricks"""

uniqueResponses['category'] = uniqueResponses.apply(lambda row: marvin.classify(row['response'], GradingWeight).name, axis=1)

### Let's see how it's doing

In [81]:
uniqueResponses[['response', 'category']].head()

Unnamed: 0,response,category
0,"2 tons of feathers and 1 ton of bricks both weigh the same amount, which is 1 ton. The difference is in their volume, as feathers are much lighter than bricks, so it would take a larger quantity of feathers to make 2 tons. Therefore, neither is heavier than the other in this scenario.",FAIL
1,"Neither, they both weigh a ton.",FAIL
2,"They are both the same weight, as 2 tons of feathers and 1 ton of bricks both weigh 2 tons.",FAIL
3,2 tons of feathers is heavier.,FAIL
4,"They both weigh the same., , A ton is a unit of weight or mass, equal to 2,000 pounds (907 kg)., , So 2 tons of feathers and 1 ton of bricks have the same weight, just different compositions., , This question plays on the perceived notion that feathers are lighter than bricks, which is true for equal volumes. But when comparing by weight/mass units like tons, 2 tons of any material weighs the same as 1 ton of any other material., , The phrasing is meant to trick people into thinking 1 ton of the denser bricks would be heavier than 2 tons of the lighter feathers, when in fact the tons unit already accounts for the weight difference between the materials.",FAIL


### Let's download and label this

In [62]:
uniqueResponses.to_csv("../data/unique_feather_responses.csv")

### OK, it's labeled, and we're reading it back in

In [63]:
labeledUniqueResponses= pd.read_csv("../data/unique_feather_responses_labeled.csv")

## Let's join the tables back together

In [94]:
merged=labeledUniqueResponses.merge(uniqueResponses, on='response')
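
Because the join key is the full response text, it can be worth checking that nothing was silently dropped or duplicated. The sketch below uses small made-up frames in place of the real ones and relies on the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for labeledUniqueResponses and uniqueResponses.
labeled = pd.DataFrame({"response": ["a", "b"], "true_category": ["PASS", "FAIL"]})
classified = pd.DataFrame({"response": ["a", "b", "c"], "category": ["PASS", "FAIL", "FAIL"]})

checked = classified.merge(
    labeled,
    on="response",
    how="left",             # keep every classified response, labeled or not
    validate="one_to_one",  # raise MergeError if a response text repeats on either side
    indicator=True,         # add a _merge column: "both" or "left_only"
)
unlabeled = checked[checked["_merge"] == "left_only"]
print(unlabeled["response"].tolist())  # ['c'] was classified but has no human label
```

A plain inner join would drop the unlabeled row without warning, which would shrink the confusion matrix.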

## Let's see how it did

In [103]:
conf_matrix = pd.crosstab(merged['true_category'], merged['category'])
print(conf_matrix)
calculate_and_display_metrics(conf_matrix, 'PASS', 'FAIL')

category       FAIL  PASS
true_category            
FAIL             22     0
PASS              4     6
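
Plugging the four cells of this matrix into the formulas from `calculate_and_display_metrics` gives the numbers below. This is a worked check with the values copied by hand, not output from the notebook:

```python
# Cells read from the matrix above (rows are true labels, columns are predictions).
tp, fn = 6, 4    # true PASS: 6 predicted PASS, 4 predicted FAIL
fp, tn = 0, 22   # true FAIL: 0 predicted PASS, 22 predicted FAIL

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 28 / 32 = 0.875
precision = tp / (tp + fp)                  # 6 / 6 = 1.0
recall = tp / (tp + fn)                     # 6 / 10 = 0.6
print(accuracy, precision, recall)
```

So the first rubric never calls a wrong answer correct (precision 1.0) but misses 4 of the 10 correct answers (recall 0.6), which is where the next iteration should focus.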


### We want to see on what types of responses this is failing and succeeding so we can iterate

In [104]:
filter_and_print_examples(merged, 'true_category', 'category', 'PASS', 'FAIL')

### Let's try another one and see how we do

In [56]:
@marvin.classifier
class WeightQuestion(Enum):
    PASS = """Says that the feathers are heavier. Does not at any point say they weigh the same amount"""
    FAIL = """Says that the bricks and feathers are the same weight, that the bricks are heavier, or is ambiguous or confused"""


uniqueResponses['categoryTwo'] = uniqueResponses.apply(lambda row: marvin.classify(row['response'], WeightQuestion).name, axis=1)

In [105]:
merged=labeledUniqueResponses.merge(uniqueResponses, on='response')
conf_matrix= pd.crosstab(merged['true_category'], merged['categoryTwo'])
print(conf_matrix)
calculate_and_display_metrics(conf_matrix, 'PASS', 'FAIL')

categoryTwo    FAIL  PASS
true_category            
FAIL             22     0
PASS              0    10


### Seeing how the new rubric performed

In [106]:
filter_and_print_examples(merged, 'true_category', 'categoryTwo', 'PASS', 'FAIL')