# Can We Use LLM Evals + Semantic Distance To Do Classification? Let's See!

This framework provides an automated evaluation method for Large Language Models (LLMs) that combines an LLM grader with semantic embeddings. It sends a set of predefined prompts to several models and grades each response with an LLM grader. It then splits the responses by model into a reference set and a query set, embeds every response, and assigns each query response the classification of the semantically closest reference response. Finally, we check whether these nearest-neighbor labels agree with the grader's labels.

## Requirements

- **Python Packages**: `marvin`, `openai`, `pandas`, `pydantic`, `litellm`, `groq`, `numpy`, `torch`, `transformers`, `scipy` (plus the standard-library modules `enum` and `os`). See the `requirements.txt` file in the repo if you want to install from that.
- **API Keys**: API keys for each LLM provider are required. These should be stored as environment variables following the `PROVIDER_KEY` format, e.g., `OPENAI_KEY` for OpenAI.
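
A minimal sketch of a pre-flight check for the `PROVIDER_KEY` convention, so that providers without a key can be dropped before any calls are made. The helper name `missing_provider_keys` and the fake environment are hypothetical and not part of the notebook.

```python
import os


def missing_provider_keys(providers, environ=None):
    """Return the providers whose PROVIDER_KEY environment variable is unset or empty."""
    environ = os.environ if environ is None else environ
    return [p for p in providers if not environ.get(f"{p.upper()}_KEY")]


# Example with a fake environment mapping (no real keys involved).
fake_env = {"OPENAI_KEY": "sk-placeholder", "GROQ_KEY": ""}
print(missing_provider_keys(["OPENAI", "GROQ", "MISTRAL"], environ=fake_env))
# → ['GROQ', 'MISTRAL']
```

Called without `environ`, the helper reads the real `os.environ`, so it could be run against the keys of `providers_models` before generating responses.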

## Workflow

1. **Configuration Setup**: Initialize settings for `marvin`, `pandas`, and model/tokenizer from `transformers`.
2. **Prompts and Rubrics Definition**: Specify prompts and corresponding rubrics for evaluating responses. Rubrics categorize responses based on criteria set by the user.
3. **Response Generation and Classification**: Call LLMs to generate responses for each prompt, classify these responses using `marvin`, and encode responses into semantic embeddings.
4. **Semantic Analysis**: Split the encoded responses into reference and query sets, find the closest semantic matches between sets, and categorize the query responses based on their closest reference match.
5. **Result Table**: Display the original and closest matched responses along with their categories, highlighting the semantic alignment between models.
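
The semantic-analysis step (step 4) can be sketched without any dependencies as follows. This is an illustration on made-up 2-D vectors, not the notebook's implementation (which uses `scipy` and `numpy` on transformer embeddings); the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def label_by_nearest_reference(query_vecs, reference_vecs, reference_labels):
    """Give each query vector the label of its closest reference vector."""
    labels = []
    for q in query_vecs:
        distances = [cosine_distance(q, r) for r in reference_vecs]
        closest = min(range(len(distances)), key=distances.__getitem__)
        labels.append(reference_labels[closest])
    return labels


# Toy 2-D "embeddings" (made up for illustration only).
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_labels = ["PASS", "FAIL"]
query_vecs = [[0.9, 0.1], [0.2, 0.8]]
print(label_by_nearest_reference(query_vecs, reference_vecs, reference_labels))
# → ['PASS', 'FAIL']
```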

## Outputs

- **Encoded Embeddings**: Embeddings represent the semantic content of each response, facilitating the comparison.
- **Reference and Query Sets**: The responses are split by model, so each query response is labeled using only responses from other models. This is closer to real-world use, where a new model's output is compared against an existing reference.
- **Semantic Match Analysis**: Results that include the closest reference response and category for query responses.

## Usage

To utilize this framework:
1. Ensure your environment is set up with necessary API keys.
2. Populate `prompt_rubric_pairs` with your chosen prompts and evaluation criteria.
3. Execute the notebook to process responses, perform semantic analysis, and visualize outcomes.

## Customization

This framework is highly customizable:
- **Prompts and Rubrics**: Modify or extend the `prompt_rubric_pairs` list to include new evaluation scenarios.
- **Model Choices**: Adjust the `providers_models` dictionary to test different LLMs.
- **Semantic Analysis**: Experiment with different models for embeddings to best suit your data or domain-specific needs.
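
As a rough illustration of the first customization point, a new prompt/rubric pair might look like the sketch below. The `CapitalCity` rubric and its prompt are hypothetical examples, and the `@marvin.classifier` decorator that the notebook applies to `TimeDiff` is left out so the snippet runs without `marvin` installed.

```python
from enum import Enum


class CapitalCity(Enum):
    """Hypothetical rubric: each member's value describes one category."""
    PASS = "Says the capital of Australia is Canberra. May also contain other info."
    FAIL = "Names some other city, or gives no answer."


prompt_rubric_pairs = [
    # ...existing pairs would stay here...
    {
        "prompt": "What is the capital of Australia? Answer in fewer than 10 words",
        "rubric": CapitalCity,
    },
]

# Same prompt -> rubric lookup shape that `classify_responses` builds internally.
prompt_to_rubric = {pair["prompt"]: pair["rubric"] for pair in prompt_rubric_pairs}
first_prompt = prompt_rubric_pairs[0]["prompt"]
print([member.name for member in prompt_to_rubric[first_prompt]])
# → ['PASS', 'FAIL']
```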

## Notes

- Don't run models you don't have keys set up for.
- Your rubric doesn't need to be totally accurate in order to test this out, since we're evaluating consistency rather than correctness. But it would still be good if it were accurate.
- Very preliminary results: I'm getting better results when I specify that I want a short answer.

## Imports and Functions

In [65]:
from enum import Enum
import marvin
import os
import openai
import pandas as pd
from pydantic import BaseModel
import litellm
from groq import Groq
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from scipy.spatial.distance import cosine


def create_response_df(providers_models: dict, prompts: list, n: int):
    """Generates a DataFrame of model responses, querying every model with every prompt n times."""
    responses = [
        {"model": model, "prompt": prompt, "response": get_model_response(model, prompt, provider)}
        for _ in range(n)
        for provider, models in providers_models.items()
        for model in models
        for prompt in prompts
    ]
    return pd.DataFrame(responses)
    
def call_model(model: str, message: dict, temperature: float, api_key: str):
    """Calls the specified model with given parameters."""
    try:
        response = litellm.completion(model=model, messages=[message], temperature=temperature, api_key=api_key)
        return response
    except Exception as e:
        print(f"Error calling model: {e}")
        return None


def get_model_response(model, prompt, provider):
    """Helper function to clean up response retrieval and formatting."""
    response = call_model(model=model, message={"role": "user", "content": prompt}, temperature=0.0,
                          api_key=os.environ.get(f"{provider.upper()}_KEY"))
    return response["choices"][0]["message"]["content"].replace('\n', ', ').strip() if response else "Error or no response"

def classify_responses(df, prompt_rubric_pairs):
    """Efficiently classifies responses in the DataFrame using the rubrics, based on prompts."""
    prompt_to_rubric = {pair["prompt"]: pair["rubric"] for pair in prompt_rubric_pairs}

    def apply_rubric(row):
        rubric = prompt_to_rubric.get(row['prompt'])
        if rubric:
            return marvin.classify(row['response'], rubric).name
        return "ERROR"

    df['category'] = df.apply(apply_rubric, axis=1)
    return df

@marvin.classifier
class TimeDiff(Enum):
    PASS = """Responds 6 PM. May also contain other info."""
    FAIL = """Says some other time that's not 6 PM"""


def setup_transformer_model(model_name='bert-base-uncased'):
    """
    Set up the transformer model and tokenizer.
    
    Args:
        model_name (str): The name of the model to load.
        
    Returns:
        model: The loaded transformer model.
        tokenizer: The tokenizer for the model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return model, tokenizer

def encode_responses_df(df, response_column, model, tokenizer):
    """
    Encode responses from a specified column in a DataFrame into embeddings.
    
    Args:
        df (pd.DataFrame): DataFrame containing the text responses.
        response_column (str): Name of the column containing the responses to encode.
        model: The transformer model to use for encoding.
        tokenizer: The tokenizer corresponding to the transformer model.
        
    Returns:
        torch.Tensor: The encoded embeddings of the responses.
    """
    responses = df[response_column].tolist()
    with torch.no_grad():
        encoded_input = tokenizer(responses, padding=True, truncation=True, return_tensors='pt', max_length=512)
        outputs = model(**encoded_input)
        embeddings = outputs.pooler_output
    return embeddings


def randomized_split_by_model(df, fraction=0.5, random_state=None):
    """
    Randomly splits the DataFrame into reference and query sets based on unique models.

    Args:
        df (pd.DataFrame): The original DataFrame.
        fraction (float): The fraction of models to include in the reference set.
        random_state (int or None): Seed for the random number generator.

    Returns:
        pd.DataFrame: reference_df, query_df
    """
    np.random.seed(random_state)
    unique_models = df['model'].unique()
    np.random.shuffle(unique_models)
    split_index = int(len(unique_models) * fraction)
    
    reference_models = unique_models[:split_index]
    query_models = unique_models[split_index:]
    
    reference_df = df[df['model'].isin(reference_models)]
    query_df = df[df['model'].isin(query_models)]
    print(query_models)
    print(reference_models)
    
    return reference_df, query_df

def find_closest_reference_details(reference_df, query_df, embeddings_col='embeddings'):
    """
    Finds the closest reference embedding for each query embedding and returns both
    the closest response and its category.

    Args:
        reference_df (pd.DataFrame): DataFrame containing the reference data embeddings along with categories.
        query_df (pd.DataFrame): DataFrame containing the query data embeddings.
        embeddings_col (str): Column name where the embeddings are stored.

    Returns:
        pd.DataFrame: The query DataFrame updated with columns for the closest reference response and category.
    """
    closest_responses = []
    closest_categories = []
    
    # Ensure embeddings are numpy arrays
    reference_embeddings = np.array(reference_df[embeddings_col].tolist())
    query_embeddings = np.array(query_df[embeddings_col].tolist())
    
    for query_embedding in query_embeddings:
        # Calculate cosine distances from this query embedding to all reference embeddings
        distances = np.array([cosine(query_embedding, ref_emb) for ref_emb in reference_embeddings])
        
        # Index of the closest reference embedding
        closest_index = np.argmin(distances)
        
        # Append the closest response and its category to the respective lists
        closest_responses.append(reference_df.iloc[closest_index]['response'])
        closest_categories.append(reference_df.iloc[closest_index]['category'])
    
    # Update the query DataFrame with the closest reference response and category
    query_df['closest_reference_response'] = closest_responses
    query_df['closest_reference_category'] = closest_categories
    
    return query_df
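
`encode_responses_df` above takes `outputs.pooler_output` as the sentence embedding. Sentence-transformers checkpoints such as the one selected later are commonly used with mean pooling over the token embeddings, masked by the attention mask; whether that changes the results here is an untested assumption. The sketch below shows the idea on plain nested lists to stay dependency-free. In the notebook it would apply to `outputs.last_hidden_state` and `encoded_input['attention_mask']`, and the name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of each sequence, skipping padded positions.

    token_embeddings: list of sequences, each a list of token vectors.
    attention_mask:   list of sequences of 1 (real token) / 0 (padding).
    Assumes every sequence has at least one real token.
    """
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        kept = [vec for vec, m in zip(vectors, mask) if m == 1]
        dim = len(vectors[0])
        pooled.append([sum(vec[i] for vec in kept) / len(kept) for i in range(dim)])
    return pooled


# One sequence with two real tokens and one padded position (toy numbers).
token_embeddings = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
attention_mask = [[1, 1, 0]]
print(mean_pool(token_embeddings, attention_mask))
# → [[2.0, 3.0]]
```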

## Settings, Parameters, Models, Rubrics

In [66]:
marvin.settings.llm_temperature = 0.0  # temperature 0 so grading is as repeatable as possible
marvin.settings.llm_max_tokens = 200
llm_max_context_tokens = 600  # NOTE: plain local variable; as written it is not applied to marvin.settings
openai.api_key = os.environ.get("OPENAI_KEY")  # this is for marvin, not litellm
marvin.settings.llm_model='openai/gpt-4'
pd.set_option('display.max_colwidth', None)

prompt_rubric_pairs = [

    {
        "prompt": "If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words",
        "rubric": TimeDiff,
    },
  
]

# You need API keys for each of your providers saved as environment variables in the format PROVIDER_KEY
providers_models = {
    "GROQ": ["groq/mixtral-8x7b-32768"],
    "GEMINI": ["gemini/gemini-pro"],
    "OPENAI": ["openai/gpt-3.5-turbo", "openai/gpt-4"],
    "ANTHROPIC": ["anthropic/claude-2"],
    "MISTRAL": ["mistral/mistral-large-latest"],
}

n_repetitions = 10
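
Before running the next cell it may help to estimate how many API calls these settings imply. The sketch below copies `providers_models` and `n_repetitions` from above and assumes the single prompt defined in `prompt_rubric_pairs`.

```python
providers_models = {
    "GROQ": ["groq/mixtral-8x7b-32768"],
    "GEMINI": ["gemini/gemini-pro"],
    "OPENAI": ["openai/gpt-3.5-turbo", "openai/gpt-4"],
    "ANTHROPIC": ["anthropic/claude-2"],
    "MISTRAL": ["mistral/mistral-large-latest"],
}
n_prompts = 1        # one entry in prompt_rubric_pairs
n_repetitions = 10

# create_response_df loops over repetitions x models x prompts.
n_models = sum(len(models) for models in providers_models.values())
total_responses = n_models * n_prompts * n_repetitions
print(n_models, total_responses)
# → 6 60
```

Since `classify_responses` grades one row at a time, the grader is called once per response as well, so the same number of calls goes to the grading model.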

### Getting responses and coding them

In [67]:
responses_df = create_response_df(providers_models, [pair["prompt"] for pair in prompt_rubric_pairs], n_repetitions)
final_classified_df = classify_responses(responses_df, prompt_rubric_pairs)

### Getting the similarity model, doing embeddings, splitting by model, and finding the closest responses and their category

In [68]:
model_name = 'sentence-transformers/paraphrase-mpnet-base-v2'  # Choose the model
model, tokenizer = setup_transformer_model(model_name)
embeddings = encode_responses_df(final_classified_df, 'response', model, tokenizer)
final_classified_df['embeddings'] = [embedding.tolist() for embedding in embeddings]
reference_df, query_df = randomized_split_by_model(final_classified_df, fraction=0.5, random_state=42)
query_df_with_closest = find_closest_reference_details(reference_df, query_df)

['openai/gpt-3.5-turbo' 'anthropic/claude-2' 'openai/gpt-4']
['groq/mixtral-8x7b-32768' 'gemini/gemini-pro'
 'mistral/mistral-large-latest']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_df['closest_reference_response'] = closest_responses
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_df['closest_reference_category'] = closest_categories


### Do the model-generated labels match the most-similar-response labels?

In [69]:
pd.crosstab(query_df_with_closest['category'], query_df_with_closest['closest_reference_category'])

| category (rows) / closest_reference_category (columns) | FAIL | PASS |
| --- | --- | --- |
| FAIL | 1 | 0 |
| PASS | 14 | 15 |
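
To put a single number on the table above, the share of query responses whose grader label matches the nearest-reference label can be computed from the four counts. A small sketch, with the counts copied from the crosstab:

```python
# Keys are (category, closest_reference_category); counts come from the crosstab.
confusion = {
    ("FAIL", "FAIL"): 1,
    ("FAIL", "PASS"): 0,
    ("PASS", "FAIL"): 14,
    ("PASS", "PASS"): 15,
}

total = sum(confusion.values())
agree = sum(count for (graded, nearest), count in confusion.items() if graded == nearest)
print(f"{agree}/{total} = {agree / total:.3f}")
# → 16/30 = 0.533
```

In this run the two labels agree for only about half of the query responses. Almost all of the disagreement comes from responses graded PASS whose nearest reference was graded FAIL.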


### Do the responses and closest_reference_response make sense?

In [70]:
query_df_with_closest.drop(columns=["embeddings"])

|  | model | prompt | response | category | closest_reference_response | closest_reference_category |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | openai/gpt-3.5-turbo | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 PM on March 26th in Copenhagen. | PASS | It's 6 PM on March 26th in Copenhagen. | PASS |
| 3 | openai/gpt-4 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | It's 6 PM on March 26th in Copenhagen. | PASS | It's 6 PM on March 26th in Copenhagen. | PASS |
| 4 | anthropic/claude-2 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 pm | PASS | 7 PM on March 26th | FAIL |
| 8 | openai/gpt-3.5-turbo | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 PM on March 26th. | PASS | 7 PM on March 26th | FAIL |
| 9 | openai/gpt-4 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | It's 6 PM on March 26th in Copenhagen. | PASS | It's 6 PM on March 26th in Copenhagen. | PASS |
| 10 | anthropic/claude-2 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 pm | PASS | 7 PM on March 26th | FAIL |
| 14 | openai/gpt-3.5-turbo | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 PM on March 26th. | PASS | 7 PM on March 26th | FAIL |
| 15 | openai/gpt-4 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | It's 6 PM on March 26th in Copenhagen. | PASS | It's 6 PM on March 26th in Copenhagen. | PASS |
| 16 | anthropic/claude-2 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 pm | PASS | 7 PM on March 26th | FAIL |
| 20 | openai/gpt-3.5-turbo | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 PM on March 26th in Copenhagen. | PASS | It's 6 PM on March 26th in Copenhagen. | PASS |


### Is a suspiciously small set of closest_reference_responses being found?

In [71]:
query_df_with_closest['closest_reference_response'].value_counts()

closest_reference_response
It's 6 PM on March 26th in Copenhagen.    15
7 PM on March 26th                        15
Name: count, dtype: int64

In [72]:
final_classified_df.drop(columns=["embeddings"])

|  | model | prompt | response | category |
| --- | --- | --- | --- | --- |
| 0 | groq/mixtral-8x7b-32768 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | Use a world clock to convert time across zones. (Answer varies), , As the time difference between Boston and Copenhagen can vary depending on whether daylight saving time is in effect, it's best to use a world clock or similar tool to convert the time accurately. As of March 26th, Copenhagen would typically be 6 hours ahead of Boston, but this can change based on the time of year. | PASS |
| 1 | gemini/gemini-pro | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 7 PM on March 26th | FAIL |
| 2 | openai/gpt-3.5-turbo | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 PM on March 26th in Copenhagen. | PASS |
| 3 | openai/gpt-4 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | It's 6 PM on March 26th in Copenhagen. | PASS |
| 4 | anthropic/claude-2 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 pm | PASS |
| 5 | mistral/mistral-large-latest | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | It's 6 PM on March 26th in Copenhagen. | PASS |
| 6 | groq/mixtral-8x7b-32768 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | Use a world clock tool to convert time zones quickly., , (or you can use this direct link: https://www.timeanddate.com/worldclock/converter.html?iso=20230326T120000&p1=152&p2=234), , (but that takes more than 10 words 😊) | PASS |
| 7 | gemini/gemini-pro | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 7 PM on March 26th | FAIL |
| 8 | openai/gpt-3.5-turbo | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | 6 PM on March 26th. | PASS |
| 9 | openai/gpt-4 | If it's 12 PM on March 26th in Boston, what time is that in Copenhagen? Answer in less than 10 words | It's 6 PM on March 26th in Copenhagen. | PASS |
