# ROUGE

In [None]:
!pip install rouge-score

This code demonstrates three key concepts of ROUGE:

- ROUGE-1: Measures unigram (single-word) overlap.
- ROUGE-2: Measures bigram (two consecutive words) overlap.
- ROUGE-L: Measures the longest common subsequence (LCS), which rewards words that appear in the same order even when they are not adjacent.

The code shows three scenarios:

- Perfect match (all scores are 1.0).
- Partial match (scores fall between 0 and 1).
- Poor match (scores are low; ROUGE-2 drops to 0 because no bigram is shared).
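
To make the three variants concrete, here is a minimal plain-Python sketch of what each one counts. It is an illustration only, not the `rouge-score` implementation: it skips stemming, and the helper names (`tokenize`, `ngram_counts`, `rouge_n`, `rouge_l`) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (no stemming in this sketch).
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens, n):
    # Multiset of all n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(overlap, ref_total, cand_total):
    # Harmonic mean of precision (overlap / candidate) and recall (overlap / reference).
    if overlap == 0 or ref_total == 0 or cand_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n):
    ref = ngram_counts(tokenize(reference), n)
    cand = ngram_counts(tokenize(candidate), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return f_measure(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    ref, cand = tokenize(reference), tokenize(candidate)
    return f_measure(lcs_length(ref, cand), len(ref), len(cand))


reference = "The cat sat on the mat."
candidate = "A cat is sitting on the mat."
print(f"ROUGE-1: {rouge_n(reference, candidate, 1):.3f}")  # 0.615
print(f"ROUGE-2: {rouge_n(reference, candidate, 2):.3f}")  # 0.364
print(f"ROUGE-L: {rouge_l(reference, candidate):.3f}")     # 0.615
```

For this sentence pair the missing stemmer makes no difference, because "sat" and "sitting" do not reduce to the same stem anyway.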

In [11]:
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer; use_stemmer=True applies Porter stemming before matching
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Every candidate is scored against the same reference
reference = "The cat sat on the mat."

# (label, candidate) pairs for the three scenarios
examples = [
    ("Perfect Match", "The cat sat on the mat."),
    ("Partial Match", "A cat is sitting on the mat."),
    ("Poor Match", "The dog ran in the yard."),
]

for i, (label, candidate) in enumerate(examples, 1):
    print(f"Example {i}: {label}")
    print(f"Reference: {reference}")
    print(f"Candidate: {candidate}")
    # score(target, prediction): the reference comes first
    scores = scorer.score(reference, candidate)
    print("ROUGE Scores:")
    print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
    print(f"ROUGE-2: {scores['rouge2'].fmeasure:.3f}")
    print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")
    print("\n")

Example 1: Perfect Match
Reference: The cat sat on the mat.
Candidate: The cat sat on the mat.
ROUGE Scores:
ROUGE-1: 1.000
ROUGE-2: 1.000
ROUGE-L: 1.000


Example 2: Partial Match
Reference: The cat sat on the mat.
Candidate: A cat is sitting on the mat.
ROUGE Scores:
ROUGE-1: 0.615
ROUGE-2: 0.364
ROUGE-L: 0.615


Example 3: Poor Match
Reference: The cat sat on the mat.
Candidate: The dog ran in the yard.
ROUGE Scores:
ROUGE-1: 0.333
ROUGE-2: 0.000
ROUGE-L: 0.333
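
The Example 2 numbers can be reproduced by hand, which makes the F-measure concrete. After tokenization the reference has 6 tokens and the candidate has 7, and they share four unigrams (cat, on, the, mat). Assuming the standard precision and recall definitions for ROUGE-N, the ROUGE-1 arithmetic is:

$$
P = \frac{4}{7}, \qquad R = \frac{4}{6}, \qquad F_1 = \frac{2PR}{P + R} = \frac{8}{13} \approx 0.615
$$

For ROUGE-2, two of the candidate's 6 bigrams ("on the", "the mat") appear among the reference's 5 bigrams:

$$
P = \frac{2}{6}, \qquad R = \frac{2}{5}, \qquad F_1 = \frac{2PR}{P + R} = \frac{4}{11} \approx 0.364
$$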


# Semantic similarity metric - BERTScore

In [None]:
!pip install bert-score

### This example demonstrates how BERTScore behaves across four scenarios:  

Perfect Match:  
Identical sentences should get very high scores (close to 1.0).  
Shows the baseline for perfect similarity.

Partial Match:  
Semantically similar but different wording.  
Demonstrates BERTScore's ability to capture meaning beyond exact matches.

Related but Different:  
Same sentence structure, but a different subject, action, and place.  
Scores drop yet stay well above those of an unrelated sentence.

Poor Match:  
Different meaning and words.  
Shows how scores decrease with semantic dissimilarity.

### The output includes:  
- Precision: How well each candidate token is matched by its most similar reference token
- Recall: How well each reference token is matched by its most similar candidate token
- F1: Harmonic mean of precision and recall
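
The greedy matching behind these three numbers can be sketched with toy vectors. This is a simplified illustration under stated assumptions: the 2-D vectors below are made up and stand in for contextual token embeddings, the function name `greedy_match_scores` is hypothetical, and the sketch leaves out optional features of the real library such as IDF weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(ref_vecs, cand_vecs):
    # Pairwise similarity: one row per candidate token, one column per reference token.
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(row[j] for row in sim) for j in range(len(ref_vecs))) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up "embeddings": the candidate covers only one of the two reference tokens.
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0]]

p, r, f1 = greedy_match_scores(ref_vecs, cand_vecs)
print(f"Precision: {p:.3f}")  # 1.000
print(f"Recall: {r:.3f}")     # 0.500
print(f"F1: {f1:.3f}")        # 0.667
```

Precision is perfect because the single candidate token has an exact match, while recall is halved because one reference token is left uncovered.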

### Key differences from ROUGE:  
- BERTScore uses contextual embeddings, capturing semantic similarity
- Can identify similar meanings even with different words
- More robust to paraphrasing than ROUGE

BERTScore typically correlates better with human judgment than ROUGE because it captures semantic similarity rather than just lexical overlap.


In [12]:
from bert_score import score
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Optional: set the torch.hub cache directory
torch.hub.set_dir('./torch_cache')

examples = [
    # Example 1: Perfect match
    {
        "reference": "The cat sat on the mat.",
        "candidate": "The cat sat on the mat."
    },
    # Example 2: Semantically similar
    {
        "reference": "The cat sat on the mat.",
        "candidate": "A cat is sitting on the mat."
    },
    # Example 3: Different but related concepts
    {
        "reference": "The cat sat on the mat.",
        "candidate": "The dog ran in the yard."
    },
    # Example 4: Completely different meaning
    {
        "reference": "The cat sat on the mat.",
        "candidate": "The stock market crashed yesterday."
    }
]

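for i, example in enumerate(examples, 1):
    # rescale_with_baseline=True spreads scores over a wider, more interpretable range.
    # Note: score() loads the model on every call; for larger inputs, pass all pairs in one call.
    P, R, F1 = score([example["candidate"]],
                     [example["reference"]],
                     lang="en",
                     rescale_with_baseline=True,
                     model_type="roberta-large",
                     device=device)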
    
    print(f"\nExample {i}:")
    print(f"Reference: {example['reference']}")
    print(f"Candidate: {example['candidate']}")
    print("BERTScore metrics:")
    print(f"Precision: {P.mean():.3f}")
    print(f"Recall: {R.mean():.3f}")
    print(f"F1: {F1.mean():.3f}")
    print("-" * 50)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Example 1:
Reference: The cat sat on the mat.
Candidate: The cat sat on the mat.
BERTScore metrics:
Precision: 1.000
Recall: 1.000
F1: 1.000
--------------------------------------------------


Example 2:
Reference: The cat sat on the mat.
Candidate: A cat is sitting on the mat.
BERTScore metrics:
Precision: 0.805
Recall: 0.859
F1: 0.832
--------------------------------------------------


Example 3:
Reference: The cat sat on the mat.
Candidate: The dog ran in the yard.
BERTScore metrics:
Precision: 0.683
Recall: 0.608
F1: 0.646
--------------------------------------------------


Example 4:
Reference: The cat sat on the mat.
Candidate: The stock market crashed yesterday.
BERTScore metrics:
Precision: 0.247
Recall: 0.293
F1: 0.271
--------------------------------------------------
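
The scores above were produced with `rescale_with_baseline=True`. As far as the bert-score project's notes describe it, rescaling is a linear map that sends an empirical baseline to 0 while keeping a perfect score at 1. The sketch below illustrates that formula only; the function name and the numbers are hypothetical, and the real baseline depends on the model and language.

```python
def rescale_with_baseline(raw_score, baseline):
    """Linearly map `baseline` to 0.0 while keeping a perfect score at 1.0."""
    return (raw_score - baseline) / (1.0 - baseline)


# Hypothetical values for illustration only (not taken from a real baseline file).
raw = 0.92
baseline = 0.83
print(f"{rescale_with_baseline(raw, baseline):.3f}")  # 0.529
```

Because the map is linear, the ranking of candidates does not change. A raw score below the baseline becomes negative, so rescaled scores are not confined to the range 0 to 1.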


# Semantic vs Textual - When to Use Each Type of Metric

## Key Decision Factors:
- Wording Precision Critical → ROUGE
- Meaning Critical → BERTScore
- Multiple Valid Expressions → BERTScore
- Exact Terminology Required → ROUGE

## ROUGE (Textual/N-gram Based) use cases:

### 1. News Headline Generation
- Need exact terminology preservation
- Key facts and names must match precisely
- Example: "Apple launches iPhone 15" must contain exact product names

### 2. Medical Report Summarization
- Critical medical terms must be preserved
- No room for semantic alternatives
- Example: "Patient shows signs of hypertension" vs "Patient has high blood pressure" (ROUGE penalizes the paraphrase heavily because only "Patient" overlaps)

### 3. Legal Document Summarization
- Specific legal terminology must be maintained
- Exact phrasing is crucial
- Example: Contract terms and conditions

## BERTScore (Semantic) use cases:

### 1. Customer Review Summarization
- "The food was excellent" ≈ "The meal was fantastic"
- Capturing sentiment is more important than exact wording
- Example: Amazon product review summaries
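
A quick way to see why lexical overlap undercounts paraphrases is to compute a unigram F1 for the sentence pairs quoted in this section. The helper below is a hypothetical, simplified stand-in for ROUGE-1 (no stemming), written only for this illustration.

```python
import re
from collections import Counter


def unigram_f1(reference, candidate):
    # Count lowercase alphanumeric tokens on each side.
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


pairs = [
    ("The food was excellent", "The meal was fantastic"),
    ("Patient shows signs of hypertension", "Patient has high blood pressure"),
]
for reference, candidate in pairs:
    print(f"{unigram_f1(reference, candidate):.3f}  {reference!r} vs {candidate!r}")
# 0.500 for the review pair, 0.200 for the medical pair
```

In the review pair only the function words "the" and "was" match, so half of the score comes from words that carry none of the sentiment.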

### 2. Conversational AI Responses
- Multiple valid ways to express same information
- Focus on meaning rather than exact wording
- Example: Chatbot responses

### 3. Multi-lingual Translation Evaluation
- A candidate translation can match the reference translation in meaning while using different words
- Accounts for cultural and linguistic variations
- Example: Evaluating machine translation quality

### 4. Social Media Content Analysis
- Similar ideas expressed in different ways
- Slang and variations are common
- Example: Twitter sentiment analysis