In [12]:
!pip install "numpy<2.0" bert-score sentence-transformers -q

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
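
The warning above names the environment variable that controls this behavior. A minimal sketch, assuming it is acceptable to turn tokenizer parallelism off for this notebook, is to set the variable before any Hugging Face tokenizer is used (for example in the first cell):

```python
import os

# Assumption: disabling parallelism is fine for this small evaluation run.
# The warning lists "true" and "false" as the accepted values.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting it in the notebook process means child processes started with `!` should inherit it, so the message should no longer appear on later installs.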


In [13]:
# 📥 Import Libraries
from bert_score import score
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [14]:
# 🔧 Initialize SentenceTransformer Model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [15]:
# 🧠 Sample Data
examples = [
    {
        "prompt": "What is the capital of France?",
        "expected_answer": "Paris is the capital of France.",
        "generated_answer": "The capital of France is Paris."
    },
    {
        "prompt": "Explain the process of photosynthesis.",
        "expected_answer": "Photosynthesis is the process by which green plants convert sunlight into energy.",
        "generated_answer": "Plants make energy using sunlight, a process called photosynthesis."
    },
    {
        "prompt": "Who wrote 'Pride and Prejudice'?",
        "expected_answer": "Jane Austen wrote 'Pride and Prejudice'.",
        "generated_answer": "The author of Pride and Prejudice is Jane Austen."
    }
]

In [16]:
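# 🧪 BERTScore Evaluation Function
def evaluate_bertscore(candidate, reference):
    """Return the BERTScore F1 between one candidate and one reference string."""
    # score() takes parallel lists and returns (precision, recall, F1) tensors.
    # It loads the model on every call; for many pairs, reuse a bert_score.BERTScorer.
    _, _, f1 = score([candidate], [reference], lang="en", verbose=False)
    return f1[0].item()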
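
To make the F1 value returned above less of a black box, here is a toy sketch of the greedy matching idea behind BERTScore: every token is paired with its most similar token on the other side, and the averaged similarities give precision and recall. The function names and the 2-D "token embeddings" are hypothetical, made up for illustration; the real library uses contextual embeddings from a transformer and optionally applies IDF weighting and baseline rescaling, none of which is modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_match_f1(cand_vecs, ref_vecs):
    """Toy BERTScore-style precision, recall, and F1 from per-token vectors."""
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical vectors: the reference has one token the candidate only partly covers.
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_match_f1(candidate_tokens, reference_tokens)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

In this toy case precision is perfect because every candidate token has an exact match, while recall drops because the third reference token is only partially covered.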

In [17]:
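# 📐 Cosine Similarity Evaluation Function
def evaluate_cosine_similarity(candidate, reference):
    """Return the cosine similarity between the two sentence embeddings."""
    # encode() returns one embedding row per input sentence.
    embeddings = model.encode([candidate, reference])
    cos_sim = cosine_similarity([embeddings[0]], [embeddings[1]])
    # cosine_similarity returns a 1x1 matrix here; unwrap it to a plain float.
    return float(cos_sim[0][0])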
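
For reference, the quantity computed above is the dot product of the two embeddings divided by the product of their lengths. A minimal pure-Python sketch of that formula follows; `cosine_similarity_1d` is a hypothetical helper name and the short vectors stand in for real sentence embeddings.

```python
import math


def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing the same way score about 1 (up to floating-point rounding).
parallel = cosine_similarity_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Vectors at a right angle score 0.
orthogonal = cosine_similarity_1d([1.0, 0.0], [0.0, 1.0])
print(parallel, orthogonal)
```

Because only the direction of the vectors matters, scaling an embedding does not change the score.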

In [18]:
# 📊 Combined Evaluation Function
def evaluate(prompt, expected, generated):
    """Print the prompt, both answers, and both similarity scores."""
    print(f"Prompt: {prompt}")
    print(f"Expected Answer: {expected}")
    print(f"Generated Answer: {generated}")

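    # Both metrics compare the generated answer (candidate) against the expected one (reference).
    # Without rescale_with_baseline=True, raw BERTScore values tend to sit in a narrow high range.
    bertscore = evaluate_bertscore(generated, expected)
    cos_sim = evaluate_cosine_similarity(generated, expected)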

    print(f"\nBERTScore F1: {bertscore:.4f}")
    print("A BERTScore F1 score close to 1 indicates high semantic similarity.")
    print(f"Cosine Similarity: {cos_sim:.4f}")
    print("A cosine similarity close to 1 indicates high semantic similarity.")
    print("-" * 50)

In [19]:
# 🚀 Evaluate All Examples
for example in examples:
    evaluate(example["prompt"], example["expected_answer"], example["generated_answer"])


Prompt: What is the capital of France?
Expected Answer: Paris is the capital of France.
Generated Answer: The capital of France is Paris.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



BERTScore F1: 0.9337
A BERTScore F1 score close to 1 indicates high semantic similarity.
Cosine Similarity: 0.9894
A cosine similarity close to 1 indicates high semantic similarity.
--------------------------------------------------
Prompt: Explain the process of photosynthesis.
Expected Answer: Photosynthesis is the process by which green plants convert sunlight into energy.
Generated Answer: Plants make energy using sunlight, a process called photosynthesis.


BERTScore F1: 0.9019
A BERTScore F1 score close to 1 indicates high semantic similarity.
Cosine Similarity: 0.8675
A cosine similarity close to 1 indicates high semantic similarity.
--------------------------------------------------
Prompt: Who wrote 'Pride and Prejudice'?
Expected Answer: Jane Austen wrote 'Pride and Prejudice'.
Generated Answer: The author of Pride and Prejudice is Jane Austen.


BERTScore F1: 0.9278
A BERTScore F1 score close to 1 indicates high semantic similarity.
Cosine Similarity: 0.9122
A cosine similarity close to 1 indicates high semantic similarity.
--------------------------------------------------
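
The per-example scores above can also be summarized rather than read one by one. A minimal sketch, with the values copied from the output of this run and hypothetical key names (`bertscore_f1`, `cosine`):

```python
from statistics import mean

# Scores copied from the run above, in example order.
results = [
    {"prompt": "What is the capital of France?", "bertscore_f1": 0.9337, "cosine": 0.9894},
    {"prompt": "Explain the process of photosynthesis.", "bertscore_f1": 0.9019, "cosine": 0.8675},
    {"prompt": "Who wrote 'Pride and Prejudice'?", "bertscore_f1": 0.9278, "cosine": 0.9122},
]

mean_f1 = mean(r["bertscore_f1"] for r in results)
mean_cos = mean(r["cosine"] for r in results)

print(f"Mean BERTScore F1: {mean_f1:.4f}")
print(f"Mean cosine similarity: {mean_cos:.4f}")
```

With only three examples the averages are illustrative at best; the same pattern applies unchanged to a larger evaluation set.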
