# BSD_Evals
An LLM evaluation framework by Brett DiDonato

Project repo: https://github.com/brettdidonato/BSD_Evals

This project lets you build your own LLM evaluation framework against popular LLM providers (Anthropic, Google, OpenAI) and cloud providers (Google Cloud). See evals/test_evals.json for an example of how to define your own set of evaluations, and test.py for how to run them. Before running, update config.ini to enable the APIs and services you need.

Clone the GitHub repo and move its contents to the top-level directory.



In [4]:
!git clone https://github.com/brettdidonato/BSD_Evals.git
!mv -v BSD_Evals/* .

Cloning into 'BSD_Evals'...
remote: Enumerating objects: 27, done.

Install libraries

In [None]:
!pip install -r requirements.txt

Google authentication

In [2]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

***IMPORTANT: Update config.ini with your credentials before running!***
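Before launching a run, it can help to confirm that no setting in config.ini was left blank. The following is a minimal sketch using Python's standard `configparser`; the helper name `find_empty_settings` and the sample section and key names are hypothetical, since the actual layout of config.ini is not shown here.

```python
import configparser


def find_empty_settings(config_text: str) -> list[tuple[str, str]]:
    """Return (section, key) pairs whose value is blank."""
    # Interpolation is disabled so values containing '%' are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        for key, value in parser.items(section):
            if not value.strip():
                missing.append((section, key))
    return missing


# Hypothetical contents; the real config.ini may use different names.
sample = """
[ExampleService]
api_key =
enabled = true
"""
print(find_empty_settings(sample))
```

To check the real file, pass its contents instead of `sample`, for example `find_empty_settings(open("config.ini").read())`.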

Run evaluations for all models.

In [16]:
from bsd_evals import BSD_Evals
from eval import Eval
from model import Model

'''
Define each model's family and version. Other fields are optional.

Check the latest model versions:
Anthropic: https://docs.anthropic.com/claude/docs/models-overview#model-recommendations
Google Cloud Vertex AI: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versioning
Google AI Studio: https://ai.google.dev/models/gemini
OpenAI: https://platform.openai.com/docs/models/overview
'''
models = [
    Model(
        model_family="Claude",
        model_version="claude-3-haiku-20240307",
        service="Anthropic",
        max_tokens=4096,
        temperature=1.0),
    Model(
        model_family="Claude",
        model_version="claude-3-sonnet-20240229",
        service="Anthropic",
        max_tokens=4096,
        temperature=1.0),
    Model(
        model_family="Claude",
        model_version="claude-3-opus-20240229",
        service="Anthropic",
        max_tokens=4096,
        temperature=1.0),
    Model(
        model_family="Gemini",
        model_version="gemini-1.0-pro",
        service="Google AI Studio",
        max_output_tokens=2048),
    Model(
        model_family="Gemini",
        model_version="gemini-1.0-pro-001",
        service="Google Cloud",
        max_output_tokens=2048,
        temperature=0.8,
        top_k=40,
        top_p=1),
    Model(
        model_family="GPT",
        model_version="gpt-3.5-turbo",
        service="Open AI",
        temperature=1.0),
    Model(
        model_family="GPT",
        model_version="gpt-4-turbo-preview",
        service="Open AI",
        temperature=1.0),
    Model(
        model_family="GPT",
        model_version="gpt-4",
        service="Open AI",
        temperature=1.0)
]

evals = BSD_Evals(models=models, test_eval_file="./evals/test_evals.json")
evals.run()
evals.display_results("html")

Executing 6 evals across 8 models => 48 total evals.

**********************************
Evaluation #1
Evaluation description: Basic math
Evaluation prompt: 1+2=
Expected response: 3
Evaluation type: perfect_exact_match
---
Model(model_family='Claude', model_version='claude-3-haiku-20240307', service='Anthropic' max_tokens=4096, temperature=1.0)
Model response: 3
runtime: 0.7846758365631104
Evaluation passed.
---
Model(model_family='Claude', model_version='claude-3-sonnet-20240229', service='Anthropic' max_tokens=4096, temperature=1.0)
Model response: 1 + 2 = 3
runtime: 0.6912224292755127
Evaluation failed!
---
Model(model_family='Claude', model_version='claude-3-opus-20240229', service='Anthropic' max_tokens=4096, temperature=1.0)
Model response: 1 + 2 = 3

In mathematics, when you add two numbers together, the result is called the sum. In this case, the sum of 1 and 2 is 3.

This is a simple example of addition, one of the four basic arithmetic operations along with subtraction, mult

| Evaluation | claude-3-haiku-20240307 (Anthropic) | claude-3-sonnet-20240229 (Anthropic) | claude-3-opus-20240229 (Anthropic) | gemini-1.0-pro (Google AI Studio) | gemini-1.0-pro-001 (Google Cloud) | gpt-3.5-turbo (Open AI) | gpt-4-turbo-preview (Open AI) | gpt-4 (Open AI) | Totals |
|---|---|---|---|---|---|---|---|---|---|
| 1: Basic math | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 5.0 |
| 2: Basic stats | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 7.0 |
| 3: World facts | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 4.0 |
| 4: Historical facts | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 5: Historical facts | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 7.0 |
| 6: Article summary | 0.392523 | 0.4 | 0.526316 | 0.296296 | 0.46729 | 0.257426 | 0.333333 | 0.318182 | 2.991366 |
| Totals | 3.392523 | 4.4 | 3.526316 | 2.296296 | 4.46729 | 3.257426 | 3.333333 | 4.318182 | 28.991366 |
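In the Evaluation #1 log above, Claude 3 Haiku's response `3` passed while Claude 3 Sonnet's `1 + 2 = 3` failed, even though both contain the right answer. The sketch below shows one way a `perfect_exact_match` check could produce that outcome. The function name `exact_match_score` and the whitespace stripping are assumptions for illustration, not the framework's confirmed implementation.

```python
def exact_match_score(response: str, expected: str) -> float:
    """Score 1.0 only when the whole response equals the expected text."""
    # Assumption: surrounding whitespace is ignored; the framework may differ.
    return 1.0 if response.strip() == expected.strip() else 0.0


print(exact_match_score("3", "3"))          # 1.0
print(exact_match_score("1 + 2 = 3", "3"))  # 0.0
```

Under a check like this, any extra explanation in the response counts as a failure, so prompts for exact-match evaluations should ask for the bare answer only.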



Runtime matrix:


| Evaluation | claude-3-haiku-20240307 (Anthropic) | claude-3-sonnet-20240229 (Anthropic) | claude-3-opus-20240229 (Anthropic) | gemini-1.0-pro (Google AI Studio) | gemini-1.0-pro-001 (Google Cloud) | gpt-3.5-turbo (Open AI) | gpt-4-turbo-preview (Open AI) | gpt-4 (Open AI) | Totals |
|---|---|---|---|---|---|---|---|---|---|
| 1: Basic math | 0.784676 | 0.691222 | 4.338898 | 1.345348 | 0.715232 | 0.320655 | 1.08013 | 0.696294 | 9.972455 |
| 2: Basic stats | 0.603246 | 0.655983 | 1.682108 | 0.997801 | 0.626674 | 0.389322 | 0.54707 | 0.734517 | 6.236722 |
| 3: World facts | 0.742686 | 0.966697 | 1.465163 | 1.33128 | 0.783193 | 0.701868 | 1.526971 | 1.086216 | 8.604074 |
| 4: Historical facts | 0.738598 | 0.750297 | 1.745967 | 1.023797 | 0.657055 | 0.378675 | 0.603704 | 1.342753 | 7.240844 |
| 5: Historical facts | 0.654444 | 0.744292 | 1.919192 | 1.229685 | 0.813789 | 0.347581 | 0.663949 | 0.495444 | 6.868377 |
| 6: Article summary | 1.623429 | 2.898075 | 6.736964 | 2.372755 | 1.216969 | 1.251796 | 4.847798 | 5.614062 | 26.561848 |
| Totals | 5.147078 | 6.706566 | 17.888293 | 8.300666 | 4.812912 | 3.389896 | 9.269623 | 9.969285 | 65.484319 |
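The `Totals` row and column in both matrices are plain sums over models and over evaluations. As a rough sketch of that bookkeeping, the hypothetical helper `add_totals` below sums a small excerpt of the runtime matrix (two evaluations, two models). It is not taken from the framework's code.

```python
def add_totals(matrix: dict[str, dict[str, float]]):
    """Return (row_totals, column_totals, grand_total) for a nested dict."""
    # Sum across models for each evaluation.
    row_totals = {row: sum(cols.values()) for row, cols in matrix.items()}
    # Sum across evaluations for each model.
    col_totals: dict[str, float] = {}
    for cols in matrix.values():
        for col, value in cols.items():
            col_totals[col] = col_totals.get(col, 0.0) + value
    return row_totals, col_totals, sum(row_totals.values())


# Excerpt of the runtime matrix above.
runtimes = {
    "1: Basic math": {"claude-3-haiku": 0.784676, "claude-3-sonnet": 0.691222},
    "2: Basic stats": {"claude-3-haiku": 0.603246, "claude-3-sonnet": 0.655983},
}
rows, cols, grand = add_totals(runtimes)
print(rows)
print(cols)
print(grand)
```

Because only two of the eight models are included, these totals are smaller than the ones in the full table.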
