<div style="background: linear-gradient(to right, #4b6cb7, #182848); padding: 20px; border-radius: 10px; text-align: center; box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
    <h1 style="color: white; margin: 0; font-size: 2.5em; font-weight: 700;">GenAIResultsComparator</h1>
    <p style="color: #e0e0e0; margin-top: 10px; font-style: italic; font-size: 1.2em; text-align: center;">Quickstart Tutorial</p>
</div>
<br>

Welcome to the <b>GenAIResultsComparator</b> library! This notebook will guide you through its features and demonstrate how to use it for evaluating text generated by Large Language Models (LLMs) against reference texts.

This example is designed for technical users who want to:

- Compare generated text strings with ground truth versions.
- Utilize a range of reference-based evaluation metrics.
- Process single pairs or batches of text efficiently.

### 1. Import Required Libraries

For this notebook, we'll assume the library is installed or made accessible via the path modification below.


In [2]:
import sys
import os

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Construct the path to the project root (one level up)
project_root = os.path.abspath(os.path.join(notebook_dir, os.pardir))

# Add project root to the system path if it's not already there
if project_root not in sys.path:
    sys.path.insert(0, project_root)

In [14]:
from llm_metrics.ngram_metrics import ROUGE
from pprint import pprint

# Initialize the metric
r = ROUGE()

# BERT evaluation
score = r.calculate(
    generated_texts=[
        "To make a pasta, cook the pasta.",
        "To make a pasta, cook the pasta.",
    ],  # Generated text from LLM
    reference_texts=[
        "Boil water, add pasta, cook for 10 minutes, and drain.",
        #   "To make a pasta, cook the pasta."
    ],  # Expected output
)

print("BERT score:")
pprint(score)

BERT score:
[{'rouge1': 0.23529411764705882,
  'rouge2': 0.13333333333333333,
  'rougeL': 0.23529411764705882}]


Now, that the `llm_metrics` package is confirmed importable, let's continue by importing necessary modules from the library and other common packages.


In [2]:
# Core library imports
from llm_metrics.ngram_metrics import BLEU, ROUGE
from llm_metrics.semantic_similarity_metrics import BERTScore

from pprint import pformat

### 2. Single Text Pair Processing

Let's start with a simple example of comparing a generated text with a reference text using the `GenAIResultsComparator` class.

In [3]:
# Initialize metrics
bleu = BLEU()
rouge = ROUGE()
bertscore = BERTScore()

In [4]:
# Example texts
sentence_1 = "The quick brown fox jumps over the lazy dog"
sentence_2 = "A fast brown fox leaps over a sleepy canine"

# Calculate scores
bleu_score = bleu.calculate(sentence_1, sentence_2)
rouge_score = rouge.calculate(sentence_1, sentence_2)
bert_score = bertscore.calculate(sentence_1, sentence_2)

print(f"BLEU score:\n{bleu_score}")
print(f"\nROUGE scores:\n{pformat(rouge_score, width=100)}")
print(f"\nBERTScore:\n{pformat(bert_score, width=100)}")

BLEU score:
0.056122223243057295

ROUGE scores:
{'rouge1': 0.3333333333333333, 'rouge2': 0.125, 'rougeL': 0.3333333333333333}

BERTScore:
{'f1': 0.8371803164482117, 'precision': 0.8371802568435669, 'recall': 0.8371802568435669}


### 3. Batch Processing

Now, let's move to batch processing, which allows you to handle multiple text pairs efficiently. This is particularly useful when you have a large dataset of generated texts and references.

For simplicity, we will use lists of length 2 for both generated texts and references. In practice, these could be much longer lists.

In [5]:
generated_texts = ["The quick brown fox jumps over the lazy dog", "The cat chases the mouse"]
reference_texts = ["A fast brown fox leaps over a sleepy canine", "A feline pursues a rodent"]

# Batch calculate scores
bleu_scores = bleu.calculate(generated_texts, reference_texts)
rouge_scores = rouge.calculate(generated_texts, reference_texts)
bert_scores = bertscore.calculate(generated_texts, reference_texts)

print(f"\nBatch BLEU scores:\n{bleu_scores}")
print(f"\nBatch ROUGE scores:\n{pformat(rouge_scores, width=100)}")
print(f"\nBatch BERTScores:\n{pformat(bert_scores, width=100)}")


Batch BLEU scores:
0.03865275878469728

Batch ROUGE scores:
[{'rouge1': 0.3333333333333333, 'rouge2': 0.125, 'rougeL': 0.3333333333333333},
 {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}]

Batch BERTScores:
[{'f1': 0.8371803164482117, 'precision': 0.8371802568435669, 'recall': 0.8371802568435669},
 {'f1': 0.6456650495529175, 'precision': 0.6858802437782288, 'recall': 0.6099045276641846}]


### 4. Metric Customization

You can customize the metrics used for evaluation by passing in various custom parameters to the metrics. This allows you to focus on specific aspects of text quality that are most relevant to your application.

_Note:_ Each metric has it's own customization options. Refer to the documentation for details on available parameters for each metric.

In [6]:
# Customize BLEU
bleu = BLEU(n=3)  # Use 3-grams instead of default 4-grams

# Customize ROUGE
rouge = ROUGE(rouge_types=["rouge1", "rouge2"], use_stemmer=True)

# Customize BERTScore
bert_score = BERTScore(model_type="bert-base-uncased", num_layers=8)

In [None]:
# Recalculate scores with custom settings
bleu_score_custom = bleu.calculate(sentence_1, sentence_2)
rouge_score_custom = rouge.calculate(sentence_1, sentence_2)
bert_score_custom = bertscore.calculate(sentence_1, sentence_2)

print(f"\nCustom BLEU score:\n{bleu_score_custom}")
print(f"\nCustom ROUGE scores:\n{pformat(rouge_score_custom, width=100)}")
print(f"\nCustom BERTScore:\n{pformat(bert_score_custom, width=100)}")


Custom BLEU score:
0.056122223243057295

Custom ROUGE scores:
{'rouge1': 0.3333333333333333, 'rouge2': 0.125}

Custom BERTScore:
{'f1': 0.8371803164482117, 'precision': 0.8371802568435669, 'recall': 0.8371802568435669}


### 5. Conclusion

This notebook has demonstrated the core functionalities of the `GenAIResultsComparator` library. You've learned how to:

The library is designed to be extensible, so you can also create your own custom metrics by inheriting from `BaseMetric`. For more advanced use cases, such as prompt-aware evaluation, check the `examples` folder in the library's repository.

We encourage you to explore the library further with your own datasets and LLM outputs. Happy evaluating!