<a href="https://colab.research.google.com/github/anshupandey/Generative-AI-for-Professionals/blob/main/EY2024/C11_LLM_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install nltk rouge-score --quiet

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone


# Evaluation Metrics for Language Models

## Perplexity

**Definition**: Perplexity is a measure of how well a language model predicts a sample. A lower perplexity indicates that the model is better at predicting the sample.

**Interpretation**:
- **Low Perplexity**: Indicates better performance (e.g., 10 or lower).
- **High Perplexity**: Indicates worse performance.

**Benchmark**: For modern language models, perplexity values typically range between 10 and 50 for well-formed English text.

---




In [2]:
import math
import nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu

In [41]:
# Sample data
predictions = ["the cat is on the the mat in house in the city", "the dog is running towards cat"]
references = [["the cat is on the the mat in city"], ["the dog is running towards cat"]]

In [42]:
from collections import Counter

In [43]:
def calculate_word_probabilities(prediction, reference):
    prediction_tokens = prediction.split()
    reference_tokens = reference.split()

    # Count word frequencies in the prediction
    prediction_counts = Counter(prediction_tokens)
    total_words = sum(prediction_counts.values())

    # Calculate probability of each reference word being in the prediction
    probabilities = []
    for word in reference_tokens:
        # Frequency of word in prediction / total words in prediction
        probabilities.append(prediction_counts.get(word, 0) / total_words)

    return probabilities

def calculate_perplexity(probabilities):
    N = len(probabilities)  # Total number of words (or tokens)

    # Ensure no probability is zero to avoid log(0) issue
    probabilities = [p if p > 0 else 1e-10 for p in probabilities]

    log_sum = sum([math.log2(p) for p in probabilities])
    perplexity = 2 ** (-1/N * log_sum)

    return perplexity

perplexities = [calculate_perplexity(calculate_word_probabilities(pred, ref[0])) for pred, ref in zip(predictions, references)]
average_perplexity = sum(perplexities) / len(perplexities)
print(f"Average Perplexity: {average_perplexity}")

Average Perplexity: 6.499587118728348


In [44]:
# Sample data
predictions = ["the cat is on the mat", "there is a cat on the the mat in the garage"]
references = [["the cat is on the mat"], ["there is a cat on the mat"]]

perplexities = [calculate_perplexity(calculate_word_probabilities(pred, ref[0])) for pred, ref in zip(predictions, references)]
average_perplexity = sum(perplexities) / len(perplexities)
print(f"Average Perplexity: {average_perplexity}")

Average Perplexity: 7.082234277441635


## BLEU (Bilingual Evaluation Understudy)

**Definition**: BLEU is a metric for evaluating a generated sentence to a reference sentence. It measures the n-gram precision with a penalty for overly short sentences.

**Interpretation**:
- **High BLEU Score**: Indicates good performance (e.g., scores above 0.5 or 50%).
- **Low BLEU Score**: Indicates poor performance.

**Benchmark**: For machine translation tasks, a BLEU score above 0.3 (30%) is considered reasonable, while scores above 0.5 (50%) are considered good.

---



In [None]:
# BLEU Score Calculation
def calculate_bleu(predicted_sentence, reference_sentence):
    return sentence_bleu([reference_sentence.split()], predicted_sentence.split())

bleu_scores = [calculate_bleu(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_bleu = sum(bleu_scores) / len(bleu_scores)

print(f"Average BLEU Score: {average_bleu}")

Average BLEU Score: 0.751128696864156


## ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

**Definition**: ROUGE measures the overlap of n-grams between the generated sentence and the reference sentence, focusing on recall.

**Types**:
- **ROUGE-1**: Measures the overlap of unigrams.
- **ROUGE-2**: Measures the overlap of bigrams.
- **ROUGE-L**: Measures the longest common subsequence.

**Interpretation**:
- **High ROUGE Score**: Indicates good performance.
- **Low ROUGE Score**: Indicates poor performance.

**Benchmark**: For summarization tasks, ROUGE scores of 0.5 (50%) or higher are considered good.

---



In [None]:
# ROUGE Score Calculation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def calculate_rouge(predicted_sentence, reference_sentence):
    scores = scorer.score(reference_sentence, predicted_sentence)
    return scores

rouge_scores = [calculate_rouge(pred, ref[0]) for pred, ref in zip(predictions, references)]

average_rouge = {
    'rouge1': sum([score['rouge1'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rouge2': sum([score['rouge2'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rougeL': sum([score['rougeL'].fmeasure for score in rouge_scores]) / len(rouge_scores),
}

print(f" Average ROUGE Scores: {average_rouge}")

 Average ROUGE Scores: {'rouge1': 0.8888888888888888, 'rouge2': 0.875, 'rougeL': 0.8888888888888888}


In [None]:
# Sample data
inputs = [
    "Translate the following English text to French: 'Hello, how are you?'",
    "Summarize the following text wihtout loosing context: 'The quick brown fox jumps over the lazy dog.'"
]
references = [
    ["Bonjour, comment ça va?"],
    ["The quick brown fox jumps over the lazy dog."]
]

### Install Vertex AI SDK for Python


In [None]:
! pip3 install --upgrade --user --quiet openai

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/5.1 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.2/5.1 MB[0m [31m6.9 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m2.6/5.1 MB[0m [31m37.2 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m5.1/5.1 MB[0m [31m51.3 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m5.1/5.1 MB[0m [31m51.3 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.1/5.1 MB[0m [31m33.0 MB/s[0m eta [36m0:00:00[0m
[0m

In [None]:
api_key = "xxxxxxxxxxxxxxxx"
api_version = "2023-07-01-preview" # "2023-05-15"
azure_endpoint = "https://xxxxxxxxx.openai.azure.com/"
model_name = "gpt-4o"

from openai import AzureOpenAI

# gets the API Key from environment variable AZURE_OPENAI_API_KEY
client = AzureOpenAI(
    # https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
    api_version=api_version,
    # https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
    azure_endpoint=azure_endpoint,
    api_key = api_key,

)

In [None]:

def generate_response(prompt,temp=0.0):
  response = client.chat.completions.create(
      messages=[{"role":"system",'content':"You are an expert programmer, you follow standard best practices for answering coding questions."},
            {"role":"user",'content':prompt}],
      model = model_name,
      temperature=temp,
  )
  return response.choices[0].message.content


In [None]:
# Get predictions from Gemini
predictions = [generate_response(input_text) for input_text in inputs]

In [None]:
for i in range(len(references)):
  print(f"Reference {i+1}: {references[i][0]}")
  print(f"Prediction {i+1}: {predictions[i]}")
  print()

Reference 1: Bonjour, comment ça va?
Prediction 1: Bonjour, comment allez-vous ? 


Reference 2: The quick brown fox jumps over the lazy dog.
Prediction 2: A swift, brown fox leaps over a lethargic dog. 




In [None]:
# Perplexity
perplexities = [calculate_perplexity(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_perplexity = sum(perplexities) / len(perplexities)
print(f"Average Perplexity: {average_perplexity}")

Average Perplexity: 0.8916905184686361


In [None]:
 # BLEU Score Calculation
bleu_scores = [calculate_bleu(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_bleu = sum(bleu_scores) / len(bleu_scores)

print(f"Average BLEU Score: {average_bleu}")

Average BLEU Score: 8.386418434906823e-155


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [None]:
# ROUGE Score
rouge_scores = [calculate_rouge(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_rouge = {
    'rouge1': sum([score['rouge1'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rouge2': sum([score['rouge2'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rougeL': sum([score['rougeL'].fmeasure for score in rouge_scores]) / len(rouge_scores),
}
print(f"ROUGE 1 Score: {average_rouge['rouge1']}")
print(f"ROUGE 2 Score: {average_rouge['rouge2']}")
print(f"ROUGE L Score: {average_rouge['rougeL']}")

ROUGE 1 Score: 0.4722222222222222
ROUGE 2 Score: 0.22916666666666666
ROUGE L Score: 0.4722222222222222


### Summary

Here’s a quick summary of the metrics:

| Metric      | Definition                                                                                       | Interpretation                                   | Benchmark                                 |
|-------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------|-------------------------------------------|
| **Perplexity** | Measures how well a model predicts a sample. Lower values are better.                            | Low = Better (e.g., 10 or lower), High = Worse  | 10-50 for well-formed English text         |
| **BLEU**      | Evaluates generated sentence against reference. Measures n-gram precision with penalty for short sentences. | High = Better (e.g., > 0.5), Low = Worse         | > 0.3 is reasonable, > 0.5 is good         |
| **ROUGE**     | Measures n-gram overlap between generated and reference sentences. Focuses on recall.            | High = Better (e.g., > 0.5), Low = Worse         | > 0.5 for summarization tasks              |

---