<a href="https://colab.research.google.com/github/anshupandey/Generative-AI-opensource/blob/main/LLM_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install nltk rouge-score --quiet

# Evaluation Metrics for Language Models

## Perplexity

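**Definition**: Perplexity measures how well a language model predicts a sample of text. Formally, it is the exponential of the average negative log-likelihood the model assigns to each token, so a lower perplexity means the model finds the sample less surprising.

**Interpretation**:
- **Low Perplexity**: Indicates better performance (e.g., 10 or lower).
- **High Perplexity**: Indicates worse performance.

**Benchmark**: For modern language models, perplexity values typically range between 10 and 50 for well-formed English text. Treat this as a rough guide: the value depends on the tokenizer, the dataset, and the model, so perplexities are only comparable when the tokenizer and dataset match.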
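As a minimal sketch of the definition, perplexity can be computed as the exponential of the average negative log-probability assigned to each token. The function name `perplexity_from_probs` and the probabilities below are illustrative assumptions, not values taken from a real model.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(-mean(log p_i)) over per-token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that gives every token probability 0.25 is as uncertain
# as a uniform choice among 4 options, so its perplexity is 4.
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])

# Higher per-token probabilities (made-up values) give a lower perplexity.
confident = perplexity_from_probs([0.9, 0.8, 0.95])

print(f"uniform: {uniform:.3f}, confident: {confident:.3f}")
```

A perplexity of 1 would mean the model assigned probability 1 to every token, which is why values closer to 1 are better.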

---




In [2]:
import math
import nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu

In [3]:
# Sample data
predictions = ["the cat is on the the mat in house in the city", "there is a cat on the the mat in the garage"]
references = [["the cat is on the mat"], ["there is a cat on the mat"]]

In [4]:
# Worked example on the 6-word sentence "the cat is on the mat":
# "the" occurs 2 times: 1 / (2/6) = 1 / (1/3) = 3   |   1 / 2^3 = 1/8
# "cat" occurs 1 time:  1 / (1/6) = 6               |   1 / 2^6 = 1/64

In [5]:
references[0][0].split()

['the', 'cat', 'is', 'on', 'the', 'mat']

In [6]:
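# Perplexity Calculation
# Note: this is a toy unigram proxy, not a model-based perplexity. Each reference
# word is scored by its relative frequency in the predicted sentence.
def calculate_perplexity(predicted_sentence, reference_sentence):
    pred_words = predicted_sentence.split()
    ref_words = reference_sentence.split()
    ref_len = len(ref_words)  # total number of words in the reference
    log_prob_sum = 0
    for word in ref_words:
        if word in pred_words:
            # log of the inverse relative frequency of the word in the prediction
            log_prob_sum += math.log(1 / (pred_words.count(word) / len(pred_words)))
        else:
            # fallback for words missing from the prediction; this term is negative
            log_prob_sum += math.log(1 / len(pred_words))
    return math.exp(log_prob_sum / ref_len)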

perplexities = [calculate_perplexity(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_perplexity = sum(perplexities) / len(perplexities)
print(f"Average Perplexity: {average_perplexity}")

Average Perplexity: 8.480895849173958
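The `else` branch of `calculate_perplexity` adds `log(1 / N)`, a negative term, so a reference word that is missing from the prediction lowers the score instead of raising it. One possible variant is sketched below, assuming that add-one (Laplace) smoothing over the joint vocabulary is an acceptable fallback. It keeps every word probability positive and the result at or above 1. The name `smoothed_proxy_perplexity` is hypothetical, and like the original this remains a toy word-overlap proxy rather than a model-based perplexity.

```python
import math
from collections import Counter

def smoothed_proxy_perplexity(predicted_sentence, reference_sentence):
    """Unigram proxy with add-one smoothing (an assumed fallback, not the notebook's method)."""
    pred_words = predicted_sentence.split()
    ref_words = reference_sentence.split()
    if not ref_words:
        raise ValueError("reference sentence must contain at least one word")

    counts = Counter(pred_words)
    # Vocabulary = every distinct word seen in either sentence
    vocab_size = len(set(pred_words) | set(ref_words))
    total = len(pred_words) + vocab_size

    neg_log_prob_sum = 0.0
    for word in ref_words:
        # Add-one smoothing: unseen words get probability 1 / total instead of 0
        prob = (counts[word] + 1) / total
        neg_log_prob_sum -= math.log(prob)
    return math.exp(neg_log_prob_sum / len(ref_words))

matching = smoothed_proxy_perplexity("a b", "a b")   # every word has probability 2/4
disjoint = smoothed_proxy_perplexity("a a", "b b")   # every word has probability 1/4
print(matching, disjoint)
```

With this variant a prediction that shares no words with the reference scores worse (higher) than one that matches it, which is the ordering a perplexity-style metric is expected to produce.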


In [7]:
# Sample data
predictions = ["the cat is on the mat", "there is a cat on the the mat in the garage"]
references = [["the cat is on the mat"], ["there is a cat on the mat"]]

## BLEU (Bilingual Evaluation Understudy)

**Definition**: BLEU is a metric for evaluating a generated sentence against one or more reference sentences. It measures n-gram precision and applies a brevity penalty to candidates that are shorter than the reference.

**Interpretation**:
- **High BLEU Score**: Indicates good performance (e.g., scores above 0.5 or 50%).
- **Low BLEU Score**: Indicates poor performance.

**Benchmark**: For machine translation tasks, a BLEU score above 0.3 (30%) is considered reasonable, while scores above 0.5 (50%) are considered good.
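The two ingredients named in the definition can be sketched in a few lines. The helper names `clipped_precision` and `brevity_penalty` are illustrative assumptions. The code follows the standard BLEU formulation for a single reference and is not NLTK's implementation.

```python
import math
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Clipping: an n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def brevity_penalty(candidate_len, reference_len):
    """1.0 for candidates longer than the reference, exp(1 - r/c) otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

reference = "the cat is on the mat".split()
repetitive = "the the the the the the the".split()
short = "the cat".split()

p_rep = clipped_precision(repetitive, reference)        # 2/7: "the" is clipped to 2
bp_short = brevity_penalty(len(short), len(reference))  # exp(1 - 6/2) = exp(-2)
print(p_rep, bp_short)
```

Clipping stops a candidate from scoring well by repeating a common word, and the brevity penalty stops it from scoring well by emitting only a few safe words.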

---



In [8]:
# BLEU Score Calculation
def calculate_bleu(predicted_sentence, reference_sentence):
    return sentence_bleu([reference_sentence.split()], predicted_sentence.split())

bleu_scores = [calculate_bleu(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_bleu = sum(bleu_scores) / len(bleu_scores)

print(f"Average BLEU Score: {average_bleu}")

Average BLEU Score: 0.751128696864156


## ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

**Definition**: ROUGE measures the overlap of n-grams between the generated sentence and the reference sentence, focusing on recall, i.e., the fraction of the reference's n-grams that also appear in the generated text.

**Types**:
- **ROUGE-1**: Measures the overlap of unigrams.
- **ROUGE-2**: Measures the overlap of bigrams.
- **ROUGE-L**: Measures the longest common subsequence.

**Interpretation**:
- **High ROUGE Score**: Indicates good performance.
- **Low ROUGE Score**: Indicates poor performance.

**Benchmark**: For summarization tasks, ROUGE scores of 0.5 (50%) or higher are considered good.
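The three variants can be sketched in plain Python to make the definitions concrete. The helper names below are illustrative assumptions. The `rouge_score` package used in this notebook applies its own tokenization and optional stemming, so its numbers can differ from this sketch on other inputs. The sketch returns precision, recall, and F-measure, and the notebook code reports the F-measure.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prf(overlap, pred_total, ref_total):
    """Return (precision, recall, f1) from an overlap count and the two totals."""
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def rouge_n(prediction, reference, n=1):
    pred, ref = ngrams(prediction, n), ngrams(reference, n)
    overlap = sum((pred & ref).values())  # multiset intersection of n-grams
    return prf(overlap, sum(pred.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    return prf(lcs_length(prediction, reference), len(prediction), len(reference))

reference = "there is a cat on the mat".split()
prediction = "there is a cat on the the mat in the garage".split()
r1 = rouge_n(prediction, reference, 1)
r2 = rouge_n(prediction, reference, 2)
rl = rouge_l(prediction, reference)
print(r1, r2, rl)
```

For this pair the recall is 1.0 in all three cases because every reference n-gram appears in the prediction, while the extra words in the prediction pull precision, and therefore the F-measure, down.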

---



In [9]:
# ROUGE Score Calculation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def calculate_rouge(predicted_sentence, reference_sentence):
    scores = scorer.score(reference_sentence, predicted_sentence)
    return scores

rouge_scores = [calculate_rouge(pred, ref[0]) for pred, ref in zip(predictions, references)]

average_rouge = {
    'rouge1': sum([score['rouge1'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rouge2': sum([score['rouge2'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rougeL': sum([score['rougeL'].fmeasure for score in rouge_scores]) / len(rouge_scores),
}

print(f" Average ROUGE Scores: {average_rouge}")

 Average ROUGE Scores: {'rouge1': 0.8888888888888888, 'rouge2': 0.875, 'rougeL': 0.8888888888888888}


In [10]:
# Sample data
inputs = [
    "Translate the following English text to French: 'Hello, how are you?'",
    "Summarize the following text wihtout loosing context: 'The quick brown fox jumps over the lazy dog.'"
]
references = [
    ["Bonjour, comment ça va?"],
    ["The quick brown fox jumps over the lazy dog."]
]

## Evaluating the Gemma 2B model

Note: If you hit a memory-related error, restart the runtime and run all cells again.

In [11]:
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Write a poem on city Dubai in 1000 words?"},
]
# device=0 places the model on the first GPU
pipe = pipeline("text-generation", model="google/gemma-2b-it", device=0)
response = pipe(messages, max_length=2000)
response

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



In [12]:
print(response[0]['generated_text'][1]['content'])

A mirage in the desert's embrace,
Dubai, a city that never sleeps.
Towers pierce the sky, a symphony of steel,
A testament to ambition, a beacon of wealth.

Glittering skyscrapers, reaching for the sky,
A canvas of glass, where dreams can fly.
From Burj Khalifa's spire, a crown upon the land,
To Dubai Fountain's dance, a mesmerizing stand.

A city of contrasts, old and new,
A blend of tradition and modern lore.
Palm Jumeirah's crescent, a marvel to behold,
A haven of luxury, a story to be told.

The Dubai souks, a vibrant display,
Where treasures from every land come to play.
Spice and spices, textiles so fine,
A cultural tapestry, a vibrant line.

The Dubai Mall, a shopper's delight,
Where luxury brands ignite the night.
The Burj Al Arab, a haven of grace,
A timeless landmark, a timeless space.

The Dubai Fountain show, a symphony of light,
A spectacle that captivates the night.
The Dubai Miracle Garden, a floral delight,
Where nature's wonders illuminate the night.

A city of ambitio

In [18]:
print(inputs)

["Translate the following English text to French: 'Hello, how are you?'", "Summarize the following text wihtout loosing context: 'The quick brown fox jumps over the lazy dog.'"]


In [19]:

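def generate_response(prompt, pipe=pipe):
    """Send a single user prompt through the pipeline and return the assistant's reply."""
    messages = [{"role": "user", "content": prompt}]
    response = pipe(messages, max_length=2000)
    # generated_text holds the chat history: [user message, assistant reply]
    return response[0]['generated_text'][1]['content']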

In [20]:
# Get predictions from Gemma
predictions = [generate_response(input_text) for input_text in inputs]

In [21]:
predictions

['Sure, here is the French translation of the English text "Hello, how are you?":\n\n**French:** Salut, comment allez-vous?',
 "Sure, here's a summary of the text without losing context:\n\nThe quick brown fox jumps over the lazy dog."]

In [22]:
for i in range(len(references)):
  print(f"Reference {i+1}: {references[i][0]}")
  print(f"Prediction {i+1}: {predictions[i]}")
  print()

Reference 1: Bonjour, comment ça va?
Prediction 1: Sure, here is the French translation of the English text "Hello, how are you?":

**French:** Salut, comment allez-vous?

Reference 2: The quick brown fox jumps over the lazy dog.
Prediction 2: Sure, here's a summary of the text without losing context:

The quick brown fox jumps over the lazy dog.
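Both predictions wrap the actual answer in a conversational preamble, and the overlap metrics computed in the following cells count those extra tokens as mismatches. One possible way to isolate the answer before scoring is sketched here. `strip_preamble` is a hypothetical helper, and its keep-the-last-paragraph heuristic is an assumption that fits these two outputs rather than a general solution.

```python
import re

def strip_preamble(generated_text):
    """Heuristic: keep the last non-empty paragraph and drop a leading bold label."""
    paragraphs = [p.strip() for p in generated_text.split("\n\n") if p.strip()]
    if not paragraphs:
        return ""
    answer = paragraphs[-1]
    # Remove a leading Markdown label such as "**French:**"
    return re.sub(r"^\*\*[^*]+\*\*\s*", "", answer)

# The two model outputs shown above, copied verbatim
raw_predictions = [
    'Sure, here is the French translation of the English text "Hello, how are you?":\n\n**French:** Salut, comment allez-vous?',
    "Sure, here's a summary of the text without losing context:\n\nThe quick brown fox jumps over the lazy dog.",
]

cleaned = [strip_preamble(p) for p in raw_predictions]
for text in cleaned:
    print(text)
```

Scoring `cleaned` instead of the raw outputs would measure the answer itself rather than the chat framing around it. Whether that is the right comparison depends on what the evaluation is meant to capture.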



In [23]:
# Perplexity
perplexities = [calculate_perplexity(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_perplexity = sum(perplexities) / len(perplexities)
print(f"Average Perplexity: {average_perplexity}")

Average Perplexity: 8.913660896927015


In [24]:
# BLEU Score Calculation
bleu_scores = [calculate_bleu(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_bleu = sum(bleu_scores) / len(bleu_scores)

print(f"Average BLEU Score: {average_bleu}")

Average BLEU Score: 0.21230816589401721


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
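The warnings above appear because at least one prediction shares no 2-, 3-, or 4-grams with its reference, which drives the geometric mean of the n-gram precisions toward zero. Smoothing replaces those zero counts with a small positive value. The sketch below illustrates the idea in plain Python. `smoothed_bleu` and its `epsilon=0.1` default are assumptions made for illustration, not NLTK's implementation. In NLTK itself, the warning points to `SmoothingFunction`, whose methods can be passed to `sentence_bleu` through the `smoothing_function` argument.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_bleu(candidate, reference, max_n=4, epsilon=0.1):
    """Single-reference BLEU sketch where zero n-gram matches are replaced by epsilon."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = max(sum(cand.values()), 1)
        # Smoothing step: avoid log(0) when no n-gram of this order matches
        numerator = matches if matches > 0 else epsilon
        log_precision_sum += math.log(numerator / total) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)

reference = "Bonjour, comment ça va?".split()
candidate = "Salut, comment allez-vous?".split()

identical = smoothed_bleu(reference, reference)
partial = smoothed_bleu(candidate, reference)
print(identical, partial)
```

Here the candidate shares only the unigram `comment` with the reference, so without smoothing the higher-order precisions would be zero. With smoothing the score stays small but positive, which keeps the average over sentences meaningful.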


In [25]:
# ROUGE Score
rouge_scores = [calculate_rouge(pred, ref[0]) for pred, ref in zip(predictions, references)]
average_rouge = {
    'rouge1': sum([score['rouge1'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rouge2': sum([score['rouge2'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rougeL': sum([score['rougeL'].fmeasure for score in rouge_scores]) / len(rouge_scores),
}
print(f"ROUGE 1 Score: {average_rouge['rouge1']}")
print(f"ROUGE 2 Score: {average_rouge['rouge2']}")
print(f"ROUGE L Score: {average_rouge['rougeL']}")

ROUGE 1 Score: 0.35382308845577215
ROUGE 2 Score: 0.2962962962962963
ROUGE L Score: 0.35382308845577215


### Summary

Here’s a quick summary of the metrics:

| Metric      | Definition                                                                                       | Interpretation                                   | Benchmark                                 |
|-------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------|-------------------------------------------|
| **Perplexity** | Measures how well a model predicts a sample. Lower values are better.                            | Low = Better (e.g., 10 or lower), High = Worse  | 10-50 for well-formed English text         |
| **BLEU**      | Evaluates generated sentence against reference. Measures n-gram precision with penalty for short sentences. | High = Better (e.g., > 0.5), Low = Worse         | > 0.3 is reasonable, > 0.5 is good         |
| **ROUGE**     | Measures n-gram overlap between generated and reference sentences. Focuses on recall.            | High = Better (e.g., > 0.5), Low = Worse         | > 0.5 for summarization tasks              |

---