# LLM Eval Playground: GPT-4 vs Mistral

This notebook compares outputs from two LLMs (GPT-4 and Mistral) on three prompts. The responses and scores used below are mocked placeholders, so the charts illustrate the workflow rather than measured results.

Scoring follows the [LLM Evaluation Toolkit](https://github.com/epaunova/LLM-Evaluation-Toolkit), which rates each output on factuality, clarity, and verbosity.

In [None]:
# 1. Install / import dependencies
# %pip installs into the environment of the running kernel
%pip install openai pandas matplotlib

import openai  # not used by the mocked cells below
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# 2. Define your prompts
prompts = [
    "Explain quantum computing in simple terms.",
    "Summarize the latest IPCC climate report.",
    "What are the pros and cons of remote work?"
]

In [None]:
# 3. (Mocked) Example responses for each model
responses = {
    "gpt-4": [
        "Quantum computing uses quantum bits...",
        "The IPCC report says global warming is accelerating...",
        "Remote work offers flexibility but reduces team cohesion..."
    ],
    "mistral": [
        "Quantum computers rely on qubits and superposition...",
        "Climate change is progressing rapidly, warns IPCC...",
        "Working remotely improves productivity but creates isolation..."
    ]
}
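Because `responses` relies on each list lining up with `prompts` by position, a quick consistency check can catch a missing or extra response early. The following is a minimal sketch, not part of the toolkit: it repeats the mocked data so it runs on its own, and the name `responses_df` is hypothetical.

```python
import pandas as pd

# Mocked prompts and responses, copied from the cells above so this block is self-contained.
prompts = [
    "Explain quantum computing in simple terms.",
    "Summarize the latest IPCC climate report.",
    "What are the pros and cons of remote work?",
]
responses = {
    "gpt-4": [
        "Quantum computing uses quantum bits...",
        "The IPCC report says global warming is accelerating...",
        "Remote work offers flexibility but reduces team cohesion...",
    ],
    "mistral": [
        "Quantum computers rely on qubits and superposition...",
        "Climate change is progressing rapidly, warns IPCC...",
        "Working remotely improves productivity but creates isolation...",
    ],
}

# Fail early if any model does not have exactly one response per prompt.
for model, outputs in responses.items():
    if len(outputs) != len(prompts):
        raise ValueError(
            f"{model}: expected {len(prompts)} responses, got {len(outputs)}"
        )

# Long format: one row per (prompt, model) pair. `responses_df` is a hypothetical name.
rows = [
    {"Prompt": prompt, "Model": model, "Response": output}
    for model, outputs in responses.items()
    for prompt, output in zip(prompts, outputs)
]
responses_df = pd.DataFrame(rows)
```

Keeping the prompt, model, and response on the same row avoids relying on list order when score columns are added later.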

In [None]:
# 4. Example evaluation scores (mocked)
data = {
    'Prompt': prompts * 2,
    'Model': ['gpt-4']*3 + ['mistral']*3,
    'Factuality': [0.95, 0.92, 0.88, 0.91, 0.87, 0.86],
    'Clarity': [0.97, 0.94, 0.91, 0.92, 0.89, 0.88],
    'Verbosity': [0.85, 0.88, 0.82, 0.84, 0.83, 0.81],
}
df = pd.DataFrame(data)
df
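Before plotting, it can help to look at the numbers directly. The sketch below rebuilds the same mocked table so it runs standalone, then computes per-model means, the difference between the two models, and a side-by-side view of one metric per prompt. The names `summary`, `gap`, and `per_prompt` are illustrative, and since the scores are mocked, the resulting differences carry no real meaning.

```python
import pandas as pd

# Mocked scores, copied from the cell above so this block is self-contained.
prompts = [
    "Explain quantum computing in simple terms.",
    "Summarize the latest IPCC climate report.",
    "What are the pros and cons of remote work?",
]
df = pd.DataFrame({
    "Prompt": prompts * 2,
    "Model": ["gpt-4"] * 3 + ["mistral"] * 3,
    "Factuality": [0.95, 0.92, 0.88, 0.91, 0.87, 0.86],
    "Clarity": [0.97, 0.94, 0.91, 0.92, 0.89, 0.88],
    "Verbosity": [0.85, 0.88, 0.82, 0.84, 0.83, 0.81],
})
metrics = ["Factuality", "Clarity", "Verbosity"]

# Mean score per model for every metric (rows: models, columns: metrics).
summary = df.groupby("Model")[metrics].mean()

# Per-metric difference between the two models' mean scores.
gap = summary.loc["gpt-4"] - summary.loc["mistral"]

# Side-by-side factuality scores: one row per prompt, one column per model.
per_prompt = df.pivot(index="Prompt", columns="Model", values="Factuality")

print(summary.round(3))
print(gap.round(3))
print(per_prompt)
```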

In [None]:
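# 5. Visualize evaluation comparison: one bar chart of per-model means for each metric
metrics = ['Factuality', 'Clarity', 'Verbosity']

for metric in metrics:
    fig, ax = plt.subplots()  # fresh figure so successive charts never share axes
    df.groupby('Model')[metric].mean().plot(kind='bar', ax=ax, rot=0, title=f'Average {metric}')
    ax.set_ylabel(metric)
    plt.show()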
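As an alternative to three separate figures, the per-model means can be drawn in a single grouped bar chart, which makes the models easier to compare across metrics. This is a sketch under two assumptions: the scores lie in the range 0 to 1 (as the mocked values do), and the data is rebuilt here only so the block runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Mocked scores, copied from the earlier cell so this block is self-contained.
df = pd.DataFrame({
    "Model": ["gpt-4"] * 3 + ["mistral"] * 3,
    "Factuality": [0.95, 0.92, 0.88, 0.91, 0.87, 0.86],
    "Clarity": [0.97, 0.94, 0.91, 0.92, 0.89, 0.88],
    "Verbosity": [0.85, 0.88, 0.82, 0.84, 0.83, 0.81],
})
metrics = ["Factuality", "Clarity", "Verbosity"]

# Rows: models, columns: metrics.
summary = df.groupby("Model")[metrics].mean()

# Transpose so each metric becomes a group on the x-axis with one bar per model.
ax = summary.T.plot(kind="bar", rot=0, figsize=(7, 4), title="Average score by metric")
ax.set_ylabel("Mean score")
ax.set_ylim(0, 1)  # assumes scores are normalized to [0, 1]
ax.legend(title="Model")
plt.tight_layout()
plt.show()
```

Fixing the y-axis to the full score range keeps small differences between the models from looking larger than they are.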