<div style="background: linear-gradient(to right, #4b6cb7, #182848); padding: 20px; border-radius: 10px; text-align: center; box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
    <h1 style="color: white; margin: 0; font-size: 2.5em; font-weight: 700;">GenAIResultsComparator</h1>
    <p style="color:hsl(0, 0.00%, 87.80%); margin-top: 10px; font-style: italic; font-size: 1.2em; text-align: center;">A Simple BLEU Example</p>
</div>
<br>

Welcome to the <b>GenAIResultsComparator</b> library! This notebook demonstrates how to evaluate generated text with the library's **BLEU** (Bilingual Evaluation Understudy) score.

BLEU is one of the most popular metrics for evaluating machine-generated text against reference text.

### Understanding BLEU Score

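The BLEU score measures how similar machine-generated text is to one or more reference texts. It computes a modified (clipped) n-gram precision for each n-gram length, combines these precisions in a geometric mean, and multiplies the result by a brevity penalty. The score ranges from 0 to 1, where: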

- 1 indicates a perfect match
- 0 indicates no match

Key characteristics:

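- BLEU captures local word order, because n-grams longer than one word only match when the words appear in the same sequence
- It can score a candidate against multiple references
- It applies a brevity penalty to candidates that are shorter than the reference
- It typically uses n-grams up to length 4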
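
To make these characteristics concrete, here is a minimal plain-Python sketch of the two building blocks of BLEU: clipped n-gram precision and the brevity penalty. The helper names (`ngrams`, `clipped_precision`, `brevity_penalty`) and the whitespace tokenization are our own illustrative choices, not part of the `llm_metrics` API; the formulas follow the standard BLEU definition.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate, references):
    """1.0 if the candidate is at least as long as the closest reference,
    exp(1 - r/c) otherwise."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c >= r else math.exp(1 - r / c)


# Illustrative data with naive whitespace tokenization.
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
repeated = "the the the the the the the".split()
scrambled = "mat the on cat the".split()

print(clipped_precision(repeated, references, 1))   # only 2 of the 7 "the" are credited
print(clipped_precision(scrambled, references, 1))  # 1.0: every word occurs in a reference
print(clipped_precision(scrambled, references, 2))  # 0.0: no bigram survives the reordering
print(brevity_penalty("the cat".split(), references))  # exp(-2): candidate is far too short
```

Clipping stops a candidate from gaining credit by repeating a common word, the bigram result shows how word order enters the score, and the brevity penalty offsets the fact that precision alone would favor very short outputs.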


### 1. Installation and Setup

For this notebook, we'll assume the library is installed or made accessible via the path modification below.

In [1]:
import sys
import os

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Construct the path to the project root (one level up)
project_root = os.path.abspath(os.path.join(notebook_dir, os.pardir))

# Add project root to the system path if it's not already there
if project_root not in sys.path:
    sys.path.insert(0, project_root)
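
Before importing, it can help to check that Python can actually locate the package. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_importable` is our own and not part of the library.

```python
import importlib.util


def is_importable(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# "llm_metrics" is the package name used in this notebook.
if is_importable("llm_metrics"):
    print("llm_metrics found")
else:
    print("llm_metrics not found; check the project root path or install the package")
```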

Now that the project root is on `sys.path`, the `llm_metrics` package should be importable. Let's continue by importing the necessary modules from the library and other common packages.


In [2]:
from llm_metrics import BLEU
import numpy as np

### 2. Basic Usage

Let's start with a simple example:

In [3]:
# Initialize the BLEU metric calculator
bleu = BLEU(n=4)  # Using up to 4-grams

# Example texts
reference = "The cat sits on the mat."
generated = "The cat is sitting on the mat."

# Calculate BLEU score
score = bleu.calculate(generated, reference)

print(f"BLEU Score: {score:.4f}")

BLEU Score: 0.2056


### 3. Batch Processing

The library also supports batch processing for multiple text pairs:


In [4]:
# Multiple examples
references = [
    "The cat sits on the mat.",
    "The weather is beautiful today.",
    "I love reading interesting books.",
]

generated_texts = [
    "The cat is sitting on the mat.",
    "Today the weather is very nice.",
    "I enjoy reading fascinating books.",
]

# Score each pair separately: use_corpus_bleu=False requests sentence-level BLEU
# (one score per pair) instead of a single corpus-level score
scores = bleu.calculate(generated_texts, references, use_corpus_bleu=False)

# Print individual scores
for i, score in enumerate(scores):
    print(f"Example {i + 1} BLEU Score: {score:.4f}")

# Print average score
print(f"\nAverage BLEU Score: {np.mean(scores):.4f}")

Example 1 BLEU Score: 0.2056
Example 2 BLEU Score: 0.0863
Example 3 BLEU Score: 0.0707

Average BLEU Score: 0.1209
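
The difference between sentence-level and corpus-level BLEU can be sketched in plain Python. Sentence-level BLEU scores each pair on its own, and the scores are averaged afterwards; corpus-level BLEU pools the clipped n-gram counts over all pairs before taking precisions. The code below is an illustrative, unsmoothed implementation with whitespace tokenization and n-grams up to length 3 (chosen so that the pooled counts are non-zero). It is not the library's implementation, `bleu_from_pairs` is a hypothetical helper, and its numbers are not expected to match the output above.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_from_pairs(pairs, max_n=3):
    """Unsmoothed BLEU over (candidate, reference) token-list pairs,
    pooling n-gram statistics across all pairs."""
    cand_len = sum(len(cand) for cand, _ in pairs)
    ref_len = sum(len(ref) for _, ref in pairs)
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in pairs:
            cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
            matched += sum(min(k, ref_counts[g]) for g, k in cand_counts.items())
            total += sum(cand_counts.values())
        if matched == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_sum += math.log(matched / total) / max_n
    bp = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_sum)


references = [
    "The cat sits on the mat.",
    "The weather is beautiful today.",
    "I love reading interesting books.",
]
generated_texts = [
    "The cat is sitting on the mat.",
    "Today the weather is very nice.",
    "I enjoy reading fascinating books.",
]
pairs = [(g.split(), r.split()) for g, r in zip(generated_texts, references)]

# Sentence level: score each pair alone, then average.
sentence_scores = [bleu_from_pairs([pair]) for pair in pairs]
mean_sentence_score = sum(sentence_scores) / len(sentence_scores)

# Corpus level: pool counts over all pairs, then score once.
corpus_score = bleu_from_pairs(pairs)

print(f"Mean of sentence scores: {mean_sentence_score:.4f}")
print(f"Corpus-level score:      {corpus_score:.4f}")
```

In this sketch the second and third pairs score zero on their own because they share no trigram with their reference, whereas the pooled counts still yield a non-zero corpus score. This is one reason the two modes can give noticeably different numbers on short texts.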


### 4. Advanced Usage: Customizing BLEU Calculation

BLEU can be customized, for example by changing the weight given to each n-gram length:


In [5]:
# Using different n-gram weights
additional_params = {
    "weights": (0.4, 0.3, 0.2, 0.1)  # Custom weights for 1-gram to 4-gram
}

# Compare scores with default and custom weights
score_default = bleu.calculate(generated, reference)
score_custom = bleu.calculate(generated, reference, **additional_params)

print(f"Default weights BLEU Score: {score_default:.4f}")
print(f"Custom weights BLEU Score: {score_custom:.4f}")

Default weights BLEU Score: 0.2056
Custom weights BLEU Score: 0.3558
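
To see how the weights act, recall that BLEU is the brevity penalty times a weighted geometric mean of the n-gram precisions, `exp(sum(w_n * log(p_n)))`, where the standard choice for 4-grams is a uniform weight of 0.25. The sketch below recomputes both scores by hand under three assumptions that the library does not confirm here: whitespace tokenization, a brevity penalty of 1 (the generated sentence is longer than the reference), and replacing a zero match count with 0.1, as NLTK's smoothing method 1 does. The helper names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_precisions(candidate, reference, max_n=4, epsilon=0.1):
    """Clipped n-gram precisions for n = 1..max_n.
    Assumes the candidate has at least max_n tokens."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(candidate, n), ngram_counts(reference, n)
        matched = sum(min(k, ref_counts[g]) for g, k in cand_counts.items())
        total = sum(cand_counts.values())
        # Assumed smoothing: a zero match count is replaced by epsilon.
        precisions.append((matched if matched > 0 else epsilon) / total)
    return precisions


def weighted_geometric_mean(precisions, weights):
    """exp of the weighted sum of log precisions."""
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))


reference_tokens = "The cat sits on the mat.".split()
generated_tokens = "The cat is sitting on the mat.".split()

precisions = smoothed_precisions(generated_tokens, reference_tokens)
print("Precisions:", [round(p, 3) for p in precisions])

default_score = weighted_geometric_mean(precisions, (0.25, 0.25, 0.25, 0.25))
custom_score = weighted_geometric_mean(precisions, (0.4, 0.3, 0.2, 0.1))
print(f"Uniform weights: {default_score:.4f}")
print(f"Custom weights:  {custom_score:.4f}")
```

Under these assumptions the sketch agrees with the two scores printed above. The custom weights raise the score because they shift weight from the 4-gram precision, which is tiny here (no 4-gram matches), to the much higher unigram precision of 5/7.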


### 5. Conclusion

The BLEU score is a useful metric for evaluating text similarity, particularly in machine translation tasks. However, it's best used:

- In combination with other metrics
- With multiple references when possible
- On longer texts or whole corpora, where higher-order n-gram matches are less likely to be zero
- As part of a broader evaluation strategy

Remember that no single metric is perfect, and human evaluation is often necessary for comprehensive assessment of generated text quality.
