## Benchmark Prompt Scoring Example
6/12/2025, Dave Sisk, https://github.com/davidcsisk, https://www.linkedin.com/in/davesisk-doctordatabase/

#### Scoring Approaches

The following scoring approaches are used to evaluate the similarity between prompts and responses:

1. **Cosine Similarity**:
   - **Description**: Measures the cosine of the angle between two vectors in an embedding space. It evaluates the semantic similarity between the prompt and response.
   - **Range**: -1 to 1 in general; sentence embeddings of related text mostly score between 0 and 1
   - **Better Score**: Higher scores indicate greater similarity.

2. **BERTScore (F1)**:
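   - **Description**: Uses contextual embeddings from a BERT-family model (the run output below shows `roberta-large` being loaded for English) to compute precision, recall, and F1 scores for text similarity. The F1 score is used here.
   - **Range**: Roughly 0 to 1 in practice; without baseline rescaling, scores tend to cluster in a narrow, high band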
   - **Better Score**: Higher scores indicate better alignment between the prompt and response.

3. **ROUGE-L**:
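   - **Description**: Scores the longest common subsequence (LCS) of words shared by the prompt and response, so word order matters but the matching words need not be adjacent. The F-measure is used here. ROUGE-L is commonly used for evaluating text summarization.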
   - **Range**: 0 to 1
   - **Better Score**: Higher scores indicate better overlap between the prompt and response.

4. **BLEU**:
   - **Description**: Evaluates the n-gram overlap between the prompt and response. Sentence-level BLEU with smoothing is used here. BLEU is commonly used for machine translation tasks.
   - **Range**: 0 to 1
   - **Better Score**: Higher scores indicate better n-gram overlap.

Each metric offers a different view of how closely a response tracks its prompt: cosine similarity and BERTScore compare meaning through embeddings, while ROUGE-L and BLEU compare surface wording. Higher is better for all four.
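
To make the definitions above concrete, here is a minimal pure-Python sketch of three of the underlying calculations. It is illustrative only and is not what the libraries used later in this notebook run internally: the function names (`cosine_similarity`, `lcs_length`, `rouge_l_f1`, `modified_ngram_precision`) are hypothetical, tokenization is a plain whitespace split, and the BLEU part covers only the clipped n-gram precision, without the brevity penalty or smoothing.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference_tokens, candidate_tokens):
    """F-measure built from LCS-based precision and recall."""
    lcs = lcs_length(reference_tokens, candidate_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate_tokens)
    recall = lcs / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def modified_ngram_precision(reference_tokens, candidate_tokens, n=1):
    """Clipped n-gram precision, the building block of BLEU."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total
```

Clipping is the reason a response that repeats a single matching word cannot earn a high BLEU precision, and the LCS requirement is the reason ROUGE-L rewards shared word order rather than shared vocabulary alone.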

In [None]:
# Install dependencies (safe to re-run)
#!pip install pandas sentence-transformers bert_score nltk rouge-score --quiet


In [1]:

import pandas as pd
import os
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from IPython.display import display
from pathlib import Path

# Download NLTK resources
import nltk
nltk.download('punkt', quiet=True)



True

In [2]:
# Score each prompt/response pair with four metrics and write the results to a new CSV file
def process_file(file_path):
    """Score each row of the CSV and save the results, with no UI interaction."""
    # Load file
    df = pd.read_csv(file_path)
    assert "Prompt" in df.columns and "Response" in df.columns, "❌ CSV must contain 'Prompt' and 'Response' columns"

    # Load models
    embed_model = SentenceTransformer("all-MiniLM-L6-v2")
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    smooth = SmoothingFunction()

    cosine_scores = []
    bert_f1_scores = []
    rouge_l_scores = []
    bleu_scores = []

    print("Scoring in progress...")

    for idx, row in df.iterrows():
        prompt, response = row["Prompt"], row["Response"]

        # Cosine similarity (embeddings)
        try:
            p_vec = embed_model.encode(prompt, convert_to_tensor=True)
            r_vec = embed_model.encode(response, convert_to_tensor=True)
            cosine = util.pytorch_cos_sim(p_vec, r_vec).item()
        except Exception:  # fall back to 0.0 if embedding or comparison fails
            cosine = 0.0

        # BERTScore
        try:
            _, _, F1 = bert_score([response], [prompt], lang="en", verbose=False)
            bert_f1 = F1[0].item()
        except Exception:  # fall back to 0.0 if BERTScore fails
            bert_f1 = 0.0

        # ROUGE-L
        try:
            rouge_l = rouge.score(prompt, response)["rougeL"].fmeasure
        except Exception:  # fall back to 0.0 if ROUGE scoring fails
            rouge_l = 0.0

        # BLEU
        try:
            prompt_tokens = nltk.word_tokenize(prompt)
            response_tokens = nltk.word_tokenize(response)
            bleu = sentence_bleu([prompt_tokens], response_tokens, smoothing_function=smooth.method1)
        except Exception:  # fall back to 0.0 if tokenization or BLEU fails
            bleu = 0.0

        cosine_scores.append(cosine)
        bert_f1_scores.append(bert_f1)
        rouge_l_scores.append(rouge_l)
        bleu_scores.append(bleu)

    df["CosineSimilarity"] = cosine_scores
    df["BERTScore_F1"] = bert_f1_scores
    df["ROUGE_L"] = rouge_l_scores
    df["BLEU"] = bleu_scores

    # Save results to file
    output_file = Path(file_path).stem + "_scored.csv"
    df.to_csv(output_file, index=False)
    print(f"✅ Scores written to: {output_file}")

# Example usage:
process_file("benchmark-prompt-scoring_chatgpt-test.csv")


Scoring in progress...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

✅ Scores written to: benchmark-prompt-scoring_chatgpt-test_scored.csv
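
Calling `bert_score(...)` inside the row loop appears to load the underlying model on every call, which would explain why the model-initialization message shows up in the output above and why scoring is slow. A possible alternative, sketched below under that assumption, is to build one `BERTScorer` from the `bert_score` package and score all rows in a single batch. The helper name `batch_bert_f1` and its `scorer` parameter are hypothetical, and the argument order (responses as candidates, prompts as references) mirrors the call in `process_file`.

```python
def batch_bert_f1(responses, prompts, scorer=None):
    """Return one BERTScore F1 value per (response, prompt) pair.

    `scorer` is any object with a `score(candidates, references)` method that
    returns (precision, recall, f1). When omitted, a BERTScorer is created
    once here, so the model is loaded a single time for the whole batch.
    """
    if scorer is None:
        # Imported lazily so the helper can be defined without the package installed.
        from bert_score import BERTScorer
        scorer = BERTScorer(lang="en")
    _, _, f1 = scorer.score(list(responses), list(prompts))
    return [float(value) for value in f1]
```

Inside `process_file`, this could replace the per-row call with something like `df["BERTScore_F1"] = batch_bert_f1(df["Response"], df["Prompt"])`, at the cost of losing the per-row fallback to 0.0 when a single pair fails.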


In [8]:
# Load the scored CSV and display selected columns in a nice table
scored_df = pd.read_csv("benchmark-prompt-scoring_chatgpt-test_scored.csv")

# Ensure the full content of the Prompt column is displayed
pd.set_option('display.max_colwidth', None)  # Show full content of text columns
pd.set_option('display.max_rows', None)      # Show all rows if needed

display(scored_df[["Prompt", "CosineSimilarity", "BERTScore_F1", "ROUGE_L", "BLEU"]])

| | Prompt | CosineSimilarity | BERTScore_F1 | ROUGE_L | BLEU |
|---|---|---|---|---|---|
| 0 | Write a Python script to process log files and identify anomalies based on time gaps (Prompt 1). | 0.875589 | 0.801565 | 0.107692 | 0.021988 |
| 1 | Create an SPL query to detect potential data exfiltration via large outbound transfers (Prompt 2). | 0.267022 | 0.787284 | 0.0 | 0.007426 |
| 2 | Write a SQL query to retrieve the top 5 users with the most failed logins in the past 24 hours (Prompt 3). | 0.809982 | 0.84336 | 0.206349 | 0.061723 |
| 3 | Use Python to call the Splunk REST API, execute a search, and visualize login attempts per hour (Prompt 4). | 0.812615 | 0.809521 | 0.057495 | 0.0158 |
| 4 | Write a Python script that connects to a PostgreSQL database, inserts network event logs, and runs a summary query (Prompt 5). | 0.73672 | 0.797282 | 0.119205 | 0.049994 |
| 5 | Write a Python script to process log files and identify anomalies based on time gaps (Prompt 6). | 0.873184 | 0.792551 | 0.087248 | 0.012537 |
| 6 | Create an SPL query to detect potential data exfiltration via large outbound transfers (Prompt 7). | 0.291735 | 0.794012 | 0.0 | 0.009849 |
| 7 | Write a SQL query to retrieve the top 5 users with the most failed logins in the past 24 hours (Prompt 8). | 0.800837 | 0.843803 | 0.206349 | 0.05744 |
| 8 | Use Python to call the Splunk REST API, execute a search, and visualize login attempts per hour (Prompt 9). | 0.808533 | 0.812193 | 0.071429 | 0.01886 |
| 9 | Write a Python script that connects to a PostgreSQL database, inserts network event logs, and runs a summary query (Prompt 10). | 0.788432 | 0.796967 | 0.138996 | 0.05333 |


Open the scored CSV file noted above in a spreadsheet program or CSV viewer/editor to examine the prompts, responses, and scores in more detail.
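
As a starting point for that review, the sketch below pulls out rows whose score in one column falls under a cutoff, using only the standard library. The helper names (`load_rows`, `rows_below`) and the 0.5 threshold are assumptions made for illustration; the two sample rows copy the prompt and cosine values from the first two rows of the table above.

```python
import csv


def load_rows(csv_path):
    """Read a scored CSV into a list of dicts (all values arrive as strings)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def rows_below(rows, column="CosineSimilarity", threshold=0.5):
    """Return the rows whose score in `column` is below `threshold`."""
    return [row for row in rows if float(row[column]) < threshold]


# Two rows copied from the results table, in the shape csv.DictReader produces.
sample_rows = [
    {"Prompt": "Write a Python script to process log files and identify "
               "anomalies based on time gaps (Prompt 1).",
     "CosineSimilarity": "0.875589"},
    {"Prompt": "Create an SPL query to detect potential data exfiltration "
               "via large outbound transfers (Prompt 2).",
     "CosineSimilarity": "0.267022"},
]

flagged = rows_below(sample_rows)
for row in flagged:
    print(row["CosineSimilarity"], row["Prompt"])
```

For the full file, `rows_below(load_rows("benchmark-prompt-scoring_chatgpt-test_scored.csv"))` applies the same filter. In the table above, a 0.5 cutoff would single out the two SPL prompts, whose cosine similarity is under 0.3 even though their BERTScore F1 stays near 0.79.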