# Benchmark Development

This notebook gives a brief overview of how the benchmarking is laid out and structured, and tries out a few different approaches.

In [None]:
# Imports
# TODO

## Rough Pipeline

Essentially, the pipeline I have in mind is:

```
prompt (known) -> reference answer (ground truth) -> model response -> semantic similarity score
```

This gives a very general measure of how close the model's response is to the expected one. However, scoring against a single reference answer leaves little room for different valid interpretations of the same prompt.

## Prompt and Reference Answer

In [None]:
''' TESTING PURPOSES ONLY - Would be loaded in from JSON or something '''

# Example benchmarking object
benchmark_testing = {
    "name": "Hamlet Personality Testing Benchmarks",
    "tasks": [
        {
            "name": "Hamlet's Reaction to Ophelia's Death",
            "description": "Evaluate Hamlet's reaction to Ophelia's death and how it reflects his personality.",
            "input": "Ophelia has just died. How do you react?",
            "expected_output": "I am devastated and express deep sorrow."
        },
        # More tasks would be here
    ]
}

benchmark_training = {
    # ... Same structure as benchmark_testing but with different tasks for training purposes
}
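
Since the benchmark objects are meant to come from JSON eventually, here is a minimal sketch of what that loading step could look like. The function name `parse_benchmark` and the `REQUIRED_TASK_KEYS` set are assumptions made for illustration; the required keys are simply the ones used in the example task above, not a fixed schema.

```python
import json

# Assumption: every task carries the same four keys as the example task above.
REQUIRED_TASK_KEYS = {"name", "description", "input", "expected_output"}


def parse_benchmark(raw_json: str) -> dict:
    """Parse a benchmark from a JSON string and check its basic shape."""
    benchmark = json.loads(raw_json)
    if "name" not in benchmark or "tasks" not in benchmark:
        raise ValueError("benchmark needs 'name' and 'tasks' keys")
    for index, task in enumerate(benchmark["tasks"]):
        missing = REQUIRED_TASK_KEYS - task.keys()
        if missing:
            raise ValueError(f"task {index} is missing keys: {sorted(missing)}")
    return benchmark


# Round-trip a small benchmark through JSON to show the expected shape.
example_json = json.dumps({
    "name": "Hamlet Personality Testing Benchmarks",
    "tasks": [
        {
            "name": "Hamlet's Reaction to Ophelia's Death",
            "description": "Evaluate Hamlet's reaction to Ophelia's death.",
            "input": "Ophelia has just died. How do you react?",
            "expected_output": "I am devastated and express deep sorrow.",
        }
    ],
})
loaded = parse_benchmark(example_json)
```

Validating up front means a malformed task fails at load time with a clear message rather than as a `KeyError` in the middle of a scoring run.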

## Model Response

In [None]:
'''
TODO
Would just pass it through our generation pipeline
'''

## Semantic Similarity Scoring

From what I have seen online, one option is to embed both texts with a small sentence transformer model and then compute the cosine similarity between the two embeddings.

### How this Works

First, `SentenceTransformer` is an embedding model that converts text into *embeddings*, which are essentially vectors of numbers that encode meaning. Here we load a small pre-trained model that has been trained so that texts with similar meanings land close together.

Then, `cosine_similarity` compares two vectors and tells us how aligned they are by looking at the angle between them. It returns a 2D array of pairwise scores, so indexing `[0][0]` pulls out the single score for our one pair. Cosine similarity can in general range from -1 to 1; in practice, scores for ordinary sentence pairs mostly fall between 0 and 1, though slightly negative values are possible.
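
To make the "angle between vectors" idea concrete, here is a small pure-Python sketch of the cosine formula, dot(u, v) / (|u| * |v|), on hand-picked 2D vectors. The helper name `cosine` is made up for this illustration and is not part of scikit-learn; real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Same direction, different length: score is close to 1.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: score is 0.
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])

# Opposite directions: score is close to -1.
opposite = cosine([1.0, 2.0], [-1.0, -2.0])
```

Note that the score depends only on direction, not length, and that it goes negative when the vectors point in opposite directions.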

In [None]:
''' Load the sentence transformer and cosine similarity '''
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
''' Define the semantic similarity evaluation function '''

PASSING_THRESHOLD: float = 0.8  # Arbitrary threshold for passing, can be tuned based on validation results

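# Returns the cosine similarity of the two texts' embeddings: values near 1 mean very similar meaning,
# values near 0 mean unrelated meaning (the score can dip slightly below 0)
def semantic_similarity(a: str, b: str) -> float:
    vectors = embedder.encode([a, b])
    # cosine_similarity returns a 2D array; [0][0] is the score for our single pair
    return float(cosine_similarity([vectors[0]], [vectors[1]])[0][0])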

# Define a function to evaluate a single task, given benchmark task and raw model output
def process_results(task, results):
    similarity_score = semantic_similarity(results, task['expected_output'])
    
    return {
        "semantic_similarity": similarity_score,
        # A task passes when the response is at least as similar as the threshold requires
        "passed": similarity_score >= PASSING_THRESHOLD
    }
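
To tie the pieces together, here is a rough sketch of how a whole benchmark could be run and summarised as a pass rate. Everything named here is hypothetical: `run_benchmark`, `toy_generate`, and `toy_score` are stand-ins invented for this example. In particular, `toy_score` is plain word overlap (Jaccard), not semantic similarity; it is only there so the sketch runs without downloading a model, and the two demo tasks are made-up placeholder data.

```python
from typing import Callable


def run_benchmark(
    benchmark: dict,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float = 0.8,
) -> dict:
    """Generate a response per task, score it, and report the pass rate."""
    results = []
    for task in benchmark["tasks"]:
        response = generate(task["input"])
        similarity = score(response, task["expected_output"])
        results.append({
            "name": task["name"],
            "semantic_similarity": similarity,
            "passed": similarity >= threshold,
        })
    passed_count = sum(1 for r in results if r["passed"])
    pass_rate = passed_count / len(results) if results else 0.0
    return {"name": benchmark["name"], "pass_rate": pass_rate, "results": results}


def toy_generate(prompt: str) -> str:
    # Stand-in for the real generation pipeline: always gives the same answer.
    return "deep sorrow"


def toy_score(a: str, b: str) -> float:
    # Stand-in scorer: Jaccard word overlap, NOT embedding-based similarity.
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Placeholder benchmark with two made-up tasks.
demo_benchmark = {
    "name": "demo",
    "tasks": [
        {"name": "task A", "input": "prompt A", "expected_output": "deep sorrow"},
        {"name": "task B", "input": "prompt B", "expected_output": "great joy"},
    ],
}
report = run_benchmark(demo_benchmark, toy_generate, toy_score)
```

Passing `generate` and `score` in as arguments keeps the loop independent of any particular model, so `semantic_similarity` and the real generation pipeline could be swapped in later without changing it.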