# Using **GenAIResultsComparator**: A Simple BLEU Example

This notebook demonstrates how to use the GenAIResultsComparator library to evaluate generated text using the BLEU (Bilingual Evaluation Understudy) score. BLEU is one of the most popular metrics for evaluating machine-generated text against reference text.

## 1. Installation and Setup

This notebook assumes the library is already installed in your environment. First, import the necessary modules:

In [None]:
from llm_metrics import BLEU
import numpy as np

## 2. Understanding BLEU Score

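BLEU measures how similar machine-generated text is to one or more reference texts. It computes a modified (clipped) n-gram precision for each n-gram order, combines those precisions as a geometric mean, and applies a brevity penalty. The score ranges from 0 to 1, where:
- 1 indicates the generated text is identical to a reference
- 0 indicates no n-gram overlap (without smoothing, a score of 0 also results when any single n-gram order has no matches)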

Key characteristics:
- BLEU captures word order only locally, through matching n-grams
- It can score generated text against multiple references
- It applies a brevity penalty to generated text that is shorter than the reference
- It typically uses n-grams up to length 4 (BLEU-4)
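The characteristics above can be sketched in plain Python. This is a minimal, unsmoothed illustration of the standard BLEU definition, not the library's implementation; the function names (`ngrams`, `modified_precision`, `brevity_penalty`, `simple_bleu`) are hypothetical and the inputs are assumed to be pre-tokenized lists of words.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing: one empty order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Two references for one candidate (illustrative sentences only)
candidate = "the cat is on the mat".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(f"{simple_bleu(candidate, references):.4f}")
```

Production implementations usually add smoothing and a more careful tokenizer, so their scores can differ from this sketch, especially on short sentences.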

## 3. Basic Usage

Let's start with a simple example:

In [None]:
# Initialize the BLEU metric calculator
bleu = BLEU(n=4)  # Using up to 4-grams

# Example texts
reference = "The cat sits on the mat."
generated = "The cat is sitting on the mat."

# Calculate BLEU score
score = bleu.calculate(generated, reference)

print(f"BLEU Score: {score:.4f}")
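To see where the score for this pair comes from, it can help to count the matching n-grams by hand. The sketch below does that with the standard library only. The `tokenize` helper (lowercasing, dropping periods, splitting on whitespace) is an assumption made for illustration; the library's own tokenization may differ, and so may its score.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, drop periods, split on spaces."""
    return text.lower().replace(".", "").split()


def clipped_matches(candidate, reference, n):
    """Return (matched, total) n-gram counts for the candidate,
    where matches are clipped by the reference counts."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return matched, sum(cand.values())


reference_tokens = tokenize("The cat sits on the mat.")
generated_tokens = tokenize("The cat is sitting on the mat.")

overlap = {n: clipped_matches(generated_tokens, reference_tokens, n)
           for n in range(1, 5)}
for n, (matched, total) in overlap.items():
    print(f"{n}-gram: {matched}/{total} matched")
```

Under this tokenization, no 4-gram of the generated sentence appears in the reference, so an unsmoothed BLEU-4 would be 0. A nonzero score from a BLEU implementation on this pair would indicate that some form of smoothing is applied.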

## 4. Batch Processing

The library also supports batch processing for multiple text pairs:

In [None]:
# Multiple examples
references = [
    "The cat sits on the mat.",
    "The weather is beautiful today.",
    "I love reading interesting books."
]

generated_texts = [
    "The cat is sitting on the mat.",
    "Today the weather is very nice.",
    "I enjoy reading fascinating books."
]

# Calculate batch scores
scores = bleu.batch_calculate(generated_texts, references)

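# Print individual scores (a separate loop variable avoids overwriting `score` from above)
for i, batch_score in enumerate(scores, start=1):
    print(f"Example {i} BLEU Score: {batch_score:.4f}")

# Print the mean of the per-example scores
# (note: this is not the same as a corpus-level BLEU, which pools n-gram counts)
print(f"\nAverage BLEU Score: {np.mean(scores):.4f}")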
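Averaging per-example scores is not the same as computing a corpus-level score, which pools n-gram counts across all examples before dividing. The sketch below shows the difference for unigram precision only, using the same three pairs. It is a simplified illustration with a hypothetical `tokenize` helper and does not call the library.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, drop periods, split on spaces."""
    return text.lower().replace(".", "").split()


def unigram_matches(generated, reference):
    """Return (matched, total) clipped unigram counts for one pair."""
    gen_counts = Counter(tokenize(generated))
    ref_counts = Counter(tokenize(reference))
    return sum((gen_counts & ref_counts).values()), sum(gen_counts.values())


example_references = [
    "The cat sits on the mat.",
    "The weather is beautiful today.",
    "I love reading interesting books.",
]
example_generated = [
    "The cat is sitting on the mat.",
    "Today the weather is very nice.",
    "I enjoy reading fascinating books.",
]

pairs = [unigram_matches(g, r)
         for g, r in zip(example_generated, example_references)]

# Mean of per-example ratios: every example weighs the same
mean_of_ratios = sum(m / t for m, t in pairs) / len(pairs)
# Pooled ratio: longer examples contribute more counts
pooled_ratio = sum(m for m, _ in pairs) / sum(t for _, t in pairs)

print(f"Mean of per-example precisions: {mean_of_ratios:.4f}")
print(f"Pooled (corpus-level) precision: {pooled_ratio:.4f}")
```

The two numbers differ because pooling weights each example by its length. When reporting a single figure for a dataset, it is worth stating which of the two aggregations was used.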

## 5. Advanced Usage: Customizing BLEU Calculation

BLEU can be customized, for example by changing how much each n-gram order contributes to the score:

In [None]:
# Using different n-gram weights
bleu_custom = BLEU(
    n=4,
    additional_params={
        'weights': (0.4, 0.3, 0.2, 0.1)  # Custom weights for 1-gram to 4-gram
    }
)

# Compare scores with default and custom weights
score_default = bleu.calculate(generated, reference)
score_custom = bleu_custom.calculate(generated, reference)

print(f"Default weights BLEU Score: {score_default:.4f}")
print(f"Custom weights BLEU Score: {score_custom:.4f}")
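In the standard BLEU definition, the weights control a weighted geometric mean of the per-order precisions, exp(sum(w_n * log(p_n))). The sketch below illustrates that formula in isolation. It assumes the library's `weights` parameter follows this standard definition, and the precision values are hypothetical numbers chosen only for illustration.

```python
import math


def weighted_geometric_mean(precisions, weights):
    """Combine per-order n-gram precisions as exp(sum(w_n * log(p_n)))."""
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; unsmoothed BLEU treats this as 0
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))


# Hypothetical precisions for 1-grams to 4-grams (not computed from real text)
precisions = [0.8, 0.6, 0.4, 0.2]

uniform = weighted_geometric_mean(precisions, (0.25, 0.25, 0.25, 0.25))
custom = weighted_geometric_mean(precisions, (0.4, 0.3, 0.2, 0.1))

print(f"Uniform weights: {uniform:.4f}")
print(f"Custom weights:  {custom:.4f}")
```

Because lower-order precisions are usually higher, shifting weight toward 1-grams and 2-grams tends to raise the score, while shifting it toward 4-grams rewards longer exact matches and tends to lower it.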

## 6. Conclusion

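BLEU is a useful metric for measuring n-gram overlap between generated and reference text, particularly in machine translation. However, it works best:
- In combination with other metrics, since it rewards surface overlap and gives no credit for synonyms or paraphrases
- With multiple references when possible
- On longer texts or whole corpora, since short texts often have no matching higher-order n-grams
- As part of a broader evaluation strategy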

Remember that no single metric is perfect, and human evaluation is often necessary for comprehensive assessment of generated text quality.