<a href="https://colab.research.google.com/github/dimitarpg13/agentic_architectures_and_design_patterns/blob/main/notebooks/model_evaluation/mlflow_rouge_metric_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLflow ROUGE Metric Demonstration

This notebook demonstrates how to compute ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores as custom MLflow metrics, built with `make_metric` on top of the `rouge-score` package, for evaluating text generation and summarization models.

## What is ROUGE?

ROUGE is a set of metrics commonly used for evaluating automatic summarization and machine translation tasks. It compares an automatically produced summary or translation against reference summaries (typically human-produced).

### ROUGE Variants:
- **ROUGE-N**: Overlap of n-grams between the system and reference summaries
- **ROUGE-L**: Longest Common Subsequence (LCS) based statistics
- **ROUGE-W**: Weighted LCS-based statistics
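- **ROUGE-S**: Skip-bigram based co-occurrence statistics

The `rouge-score` package used in this notebook implements ROUGE-N, ROUGE-L, and ROUGE-Lsum. ROUGE-W and ROUGE-S are listed for completeness and are not computed here.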
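To make the variants above concrete, here is a minimal pure-Python sketch of ROUGE-N, ROUGE-L, and ROUGE-S. It is an illustration only, not the `rouge-score` implementation: the helper names (`tokenize`, `ngrams`, `prf`, `rouge_n`, `rouge_l`, `rouge_s`) are hypothetical, tokenization is a plain lowercase whitespace split with no stemming, and the skip-bigram variant assumes an unlimited skip distance.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase and split on whitespace (assumption: no stemming)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(matches, pred_total, ref_total):
    """Turn a match count into (precision, recall, F1)."""
    precision = matches / pred_total if pred_total else 0.0
    recall = matches / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(prediction, reference, n=1):
    """ROUGE-N: clipped n-gram overlap."""
    pred = ngrams(tokenize(prediction), n)
    ref = ngrams(tokenize(reference), n)
    matches = sum((pred & ref).values())  # Counter intersection clips counts
    return prf(matches, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L: LCS length relative to each text's length."""
    pred, ref = tokenize(prediction), tokenize(reference)
    return prf(lcs_length(pred, ref), len(pred), len(ref))


def rouge_s(prediction, reference):
    """ROUGE-S: overlap of in-order word pairs (unlimited skip distance)."""
    pred = Counter(combinations(tokenize(prediction), 2))
    ref = Counter(combinations(tokenize(reference), 2))
    matches = sum((pred & ref).values())
    return prf(matches, sum(pred.values()), sum(ref.values()))


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(prediction, reference, 1))
print("ROUGE-2:", rouge_n(prediction, reference, 2))
print("ROUGE-L:", rouge_l(prediction, reference))
print("ROUGE-S:", rouge_s(prediction, reference))
```

For this pair, five of the six unigrams match, so ROUGE-1 precision and recall are both 5/6. Only three of the five bigrams survive the single word substitution, which is why ROUGE-2 is lower than ROUGE-1.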

## 1. Installation and Setup

In [None]:
# Install required packages
!pip install -q mlflow rouge-score transformers torch pandas numpy matplotlib seaborn

In [None]:
import mlflow
import mlflow.metrics
from mlflow.metrics import make_metric, MetricValue
import pandas as pd
import numpy as np
from rouge_score import rouge_scorer
import json
import warnings
warnings.filterwarnings('ignore')

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

print(f"MLflow version: {mlflow.__version__}")

## 2. Creating Custom ROUGE Metrics with MLflow

In [None]:
def create_rouge_metric(rouge_type='rouge1', score_type='fmeasure'):
    """
    Creates a custom ROUGE metric for MLflow evaluation.

    Args:
        rouge_type: Type of ROUGE metric ('rouge1', 'rouge2', 'rougeL', 'rougeLsum')
        score_type: Type of score ('precision', 'recall', 'fmeasure')
    """
    def rouge_metric(predictions, targets):
        scorer = rouge_scorer.RougeScorer([rouge_type], use_stemmer=True)
        scores = []

        for pred, target in zip(predictions, targets):
            score = scorer.score(target, pred)
            scores.append(getattr(score[rouge_type], score_type))

        return MetricValue(
            aggregate_results={f"{rouge_type}_{score_type}": np.mean(scores)},
            scores=scores
        )

    return make_metric(
        eval_fn=rouge_metric,
        greater_is_better=True,
        name=f"{rouge_type}_{score_type}"
    )

# Create different ROUGE metrics
rouge1_f = create_rouge_metric('rouge1', 'fmeasure')
rouge1_precision = create_rouge_metric('rouge1', 'precision')
rouge1_recall = create_rouge_metric('rouge1', 'recall')
rouge2_f = create_rouge_metric('rouge2', 'fmeasure')
rougeL_f = create_rouge_metric('rougeL', 'fmeasure')
rougeLsum_f = create_rouge_metric('rougeLsum', 'fmeasure')

print("Custom ROUGE metrics created successfully!")

## 3. Sample Data Preparation

In [None]:
# Create sample summarization data
sample_data = [
    {
        "original": """Machine learning is a subset of artificial intelligence that enables
        computers to learn from data without being explicitly programmed. It uses algorithms
        that iteratively learn from data to improve their accuracy. Deep learning is a subset
        of machine learning that uses neural networks with multiple layers.""",

        "reference_summary": "Machine learning allows computers to learn from data using algorithms. Deep learning uses multi-layer neural networks.",

        "model_summary_good": "Machine learning enables computers to learn from data through algorithms. Deep learning employs neural networks with multiple layers.",

        "model_summary_poor": "Artificial intelligence is about computers. Neural networks exist."
    },
    {
        "original": """Natural language processing (NLP) is a field of AI that focuses on the
        interaction between computers and human language. It enables machines to understand,
        interpret, and generate human language. Applications include translation, sentiment
        analysis, and chatbots.""",

        "reference_summary": "NLP enables computers to understand and generate human language, with applications in translation and sentiment analysis.",

        "model_summary_good": "Natural language processing allows machines to comprehend and produce human language, used in translation and sentiment analysis.",

        "model_summary_poor": "Computers can process text. AI has many uses."
    },
    {
        "original": """Climate change refers to long-term shifts in global temperatures and
        weather patterns. While climate variations are natural, human activities have been
        the dominant driver since the 1900s, primarily through burning fossil fuels which
        produces greenhouse gases.""",

        "reference_summary": "Climate change involves long-term temperature shifts, driven mainly by human fossil fuel use since 1900s.",

        "model_summary_good": "Climate change represents long-term temperature changes, primarily caused by human burning of fossil fuels since the 1900s.",

        "model_summary_poor": "Weather changes over time. Humans impact the environment."
    }
]

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(sample_data)
print(f"Created {len(df)} sample summarization examples")
df.head()

## 4. Evaluating Models with MLflow ROUGE Metrics

In [None]:
# Set up MLflow experiment
mlflow.set_experiment("rouge-metrics-demo")

def evaluate_summarization_model(model_name, predictions, references, extra_metrics=None):
    """
    Evaluate a summarization model using ROUGE metrics and log to MLflow.
    """
    with mlflow.start_run(run_name=model_name):
        # Log model parameters
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("num_samples", len(predictions))

        # Create evaluation dataframe
        eval_df = pd.DataFrame({
            "predictions": predictions,
            "targets": references
        })

        # Evaluate with all ROUGE metrics
        metrics = [rouge1_f, rouge1_precision, rouge1_recall, rouge2_f, rougeL_f, rougeLsum_f]

        results = mlflow.evaluate(
            data=eval_df,
            targets="targets",
            predictions="predictions",
            extra_metrics=metrics,
            evaluators="default"
        )

        # Log additional custom metrics if provided
        if extra_metrics:
            for key, value in extra_metrics.items():
                mlflow.log_metric(key, value)

        # Extract and return scores
        scores = {}
        for metric in metrics:
            metric_name = metric.name
            if metric_name in results.metrics:
                scores[metric_name] = results.metrics[metric_name]

        return scores, results

# Evaluate "good" model
good_scores, good_results = evaluate_summarization_model(
    "good_summarizer",
    df["model_summary_good"].tolist(),
    df["reference_summary"].tolist(),
    extra_metrics={"model_quality": 0.9}
)

# Evaluate "poor" model
poor_scores, poor_results = evaluate_summarization_model(
    "poor_summarizer",
    df["model_summary_poor"].tolist(),
    df["reference_summary"].tolist(),
    extra_metrics={"model_quality": 0.3}
)

print("Evaluation completed!")

## 5. Comparing Model Performance

In [None]:
# Create comparison DataFrame
comparison_data = {
    "Metric": list(good_scores.keys()),
    "Good Model": list(good_scores.values()),
    "Poor Model": list(poor_scores.values())
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df["Improvement"] = comparison_df["Good Model"] - comparison_df["Poor Model"]
# Guard against division by zero: a 0.0 baseline becomes NaN instead of inf
comparison_df["Improvement %"] = (comparison_df["Improvement"] / comparison_df["Poor Model"].replace(0, np.nan) * 100).round(2)

print("\n📊 Model Comparison Results:")
print("="*60)
print(comparison_df.to_string(index=False))

## 6. Visualizing ROUGE Scores

In [None]:
# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Bar comparison of all metrics
ax1 = axes[0, 0]
metrics_plot = comparison_df[["Metric", "Good Model", "Poor Model"]].set_index("Metric")
metrics_plot.plot(kind="bar", ax=ax1, color=["#2ecc71", "#e74c3c"])
ax1.set_title("ROUGE Metrics Comparison", fontsize=14, fontweight='bold')
ax1.set_ylabel("Score", fontsize=12)
ax1.set_xlabel("Metric", fontsize=12)
ax1.legend(title="Model", loc="upper right")
ax1.grid(axis='y', alpha=0.3)
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right')

# Plot 2: ROUGE-1 breakdown (Precision, Recall, F-measure)
ax2 = axes[0, 1]
rouge1_metrics = comparison_df[comparison_df["Metric"].str.contains("rouge1")]
rouge1_data = rouge1_metrics[["Good Model", "Poor Model"]].values.T
rouge1_labels = [m.split('_')[1] for m in rouge1_metrics["Metric"]]

x = np.arange(len(rouge1_labels))
width = 0.35

bars1 = ax2.bar(x - width/2, rouge1_data[0], width, label='Good Model', color='#2ecc71')
bars2 = ax2.bar(x + width/2, rouge1_data[1], width, label='Poor Model', color='#e74c3c')

ax2.set_title("ROUGE-1 Detailed Breakdown", fontsize=14, fontweight='bold')
ax2.set_ylabel("Score", fontsize=12)
ax2.set_xticks(x)
ax2.set_xticklabels(rouge1_labels)
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

# Plot 3: Improvement percentages
ax3 = axes[1, 0]
improvements = comparison_df[comparison_df["Improvement %"].notna()]
colors = ['#3498db' if x > 0 else '#e74c3c' for x in improvements["Improvement %"]]
bars = ax3.bar(range(len(improvements)), improvements["Improvement %"], color=colors)
ax3.set_title("Improvement of Good Model over Poor Model (%)", fontsize=14, fontweight='bold')
ax3.set_ylabel("Improvement (%)", fontsize=12)
ax3.set_xlabel("Metric", fontsize=12)
ax3.set_xticks(range(len(improvements)))
ax3.set_xticklabels(improvements["Metric"], rotation=45, ha='right')
ax3.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
ax3.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, val in zip(bars, improvements["Improvement %"]):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + (5 if height > 0 else -15),
             f'{val:.1f}%', ha='center', va='bottom' if height > 0 else 'top', fontsize=10)

# Plot 4: Sample-wise comparison
ax4 = axes[1, 1]
sample_scores_good = []
sample_scores_poor = []

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
for i in range(len(df)):
    good_score = scorer.score(df.iloc[i]["reference_summary"], df.iloc[i]["model_summary_good"])
    poor_score = scorer.score(df.iloc[i]["reference_summary"], df.iloc[i]["model_summary_poor"])
    sample_scores_good.append(good_score['rouge1'].fmeasure)
    sample_scores_poor.append(poor_score['rouge1'].fmeasure)

x_samples = range(1, len(df) + 1)
ax4.plot(x_samples, sample_scores_good, 'o-', color='#2ecc71', label='Good Model', linewidth=2, markersize=8)
ax4.plot(x_samples, sample_scores_poor, 's-', color='#e74c3c', label='Poor Model', linewidth=2, markersize=8)
ax4.set_title("Sample-wise ROUGE-1 F-measure", fontsize=14, fontweight='bold')
ax4.set_xlabel("Sample", fontsize=12)
ax4.set_ylabel("ROUGE-1 F-measure", fontsize=12)
ax4.set_xticks(x_samples)
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Advanced: Custom Evaluation with Multiple ROUGE Variants

In [None]:
def comprehensive_rouge_evaluation(predictions, references):
    """
    Perform comprehensive ROUGE evaluation with all available metrics.
    """
    # Initialize scorer with all ROUGE types
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rouge3', 'rouge4', 'rougeL', 'rougeLsum'],
        use_stemmer=True
    )

    all_scores = []

    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)

        score_dict = {}
        for rouge_type, score in scores.items():
            score_dict[f"{rouge_type}_precision"] = score.precision
            score_dict[f"{rouge_type}_recall"] = score.recall
            score_dict[f"{rouge_type}_fmeasure"] = score.fmeasure

        all_scores.append(score_dict)

    # Calculate averages
    avg_scores = {}
    for key in all_scores[0].keys():
        avg_scores[key] = np.mean([s[key] for s in all_scores])

    return pd.DataFrame(all_scores), avg_scores

# Evaluate both models comprehensively
good_detailed, good_avg = comprehensive_rouge_evaluation(
    df["model_summary_good"].tolist(),
    df["reference_summary"].tolist()
)

poor_detailed, poor_avg = comprehensive_rouge_evaluation(
    df["model_summary_poor"].tolist(),
    df["reference_summary"].tolist()
)

# Create comprehensive comparison
comprehensive_comparison = pd.DataFrame({
    "Metric": list(good_avg.keys()),
    "Good Model": list(good_avg.values()),
    "Poor Model": list(poor_avg.values())
})
comprehensive_comparison["Difference"] = comprehensive_comparison["Good Model"] - comprehensive_comparison["Poor Model"]

# Display top performing metrics for good model
print("\n🏆 Top 10 Metrics (Good Model):")
print("="*50)
top_metrics = comprehensive_comparison.nlargest(10, "Good Model")[["Metric", "Good Model"]]
for _, row in top_metrics.iterrows():
    print(f"{row['Metric']:30s}: {row['Good Model']:.4f}")

print("\n📈 Largest Improvements:")
print("="*50)
improvements = comprehensive_comparison.nlargest(10, "Difference")[["Metric", "Difference"]]
for _, row in improvements.iterrows():
    print(f"{row['Metric']:30s}: {row['Difference']:+.4f}")

## 8. Batch Evaluation with MLflow Tracking

In [None]:
def batch_evaluate_models(model_configs, test_data):
    """
    Evaluate multiple model configurations and track with MLflow.
    """
    results = []

    for config in model_configs:
        with mlflow.start_run(run_name=config['name']):
            # Log configuration
            mlflow.log_params(config['params'])

            # Simulate model predictions (in practice, these would come from your model)
            if config['quality'] == 'high':
                predictions = test_data['model_summary_good'].tolist()
            elif config['quality'] == 'medium':
                # Simulate medium quality by mixing good and poor
                predictions = [
                    good if i % 2 == 0 else poor
                    for i, (good, poor) in enumerate(zip(
                        test_data['model_summary_good'].tolist(),
                        test_data['model_summary_poor'].tolist()
                    ))
                ]
            else:
                predictions = test_data['model_summary_poor'].tolist()

            references = test_data['reference_summary'].tolist()

            # Calculate ROUGE scores
            scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
            scores = []
            for pred, ref in zip(predictions, references):
                score = scorer.score(ref, pred)
                scores.append({
                    'rouge1_f': score['rouge1'].fmeasure,
                    'rouge2_f': score['rouge2'].fmeasure,
                    'rougeL_f': score['rougeL'].fmeasure
                })

            # Calculate and log average scores
            avg_scores = {k: np.mean([s[k] for s in scores]) for k in scores[0].keys()}
            mlflow.log_metrics(avg_scores)

            # Log the configuration as a JSON artifact (in practice, you'd also save model files)
            mlflow.log_dict(config, "config.json")

            # Store results
            results.append({
                'model': config['name'],
                **avg_scores,
                **config['params']
            })

    return pd.DataFrame(results)

# Define model configurations to test
model_configs = [
    {
        'name': 'transformer_base',
        'quality': 'high',
        'params': {'architecture': 'transformer', 'layers': 6, 'learning_rate': 0.001}
    },
    {
        'name': 'transformer_large',
        'quality': 'high',
        'params': {'architecture': 'transformer', 'layers': 12, 'learning_rate': 0.0005}
    },
    {
        'name': 'lstm_model',
        'quality': 'medium',
        'params': {'architecture': 'lstm', 'layers': 2, 'learning_rate': 0.01}
    },
    {
        'name': 'baseline_extractive',
        'quality': 'low',
        'params': {'architecture': 'extractive', 'layers': 1, 'learning_rate': 0.1}
    }
]

# Run batch evaluation
batch_results = batch_evaluate_models(model_configs, df)

print("\n🔄 Batch Evaluation Results:")
print("="*80)
print(batch_results.to_string(index=False))

# Find best model
best_model_idx = batch_results['rouge1_f'].idxmax()
best_model = batch_results.loc[best_model_idx]
print(f"\n🥇 Best Model: {best_model['model']}")
print(f"   ROUGE-1 F1: {best_model['rouge1_f']:.4f}")
print(f"   ROUGE-2 F1: {best_model['rouge2_f']:.4f}")
print(f"   ROUGE-L F1: {best_model['rougeL_f']:.4f}")

## 9. Creating a Reusable ROUGE Evaluation Pipeline

In [None]:
class ROUGEEvaluator:
    """
    A reusable ROUGE evaluation pipeline with MLflow integration.
    """

    def __init__(self, experiment_name="rouge-evaluation", use_stemmer=True):
        self.experiment_name = experiment_name
        self.use_stemmer = use_stemmer
        self.scorer = rouge_scorer.RougeScorer(
            ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
            use_stemmer=use_stemmer
        )
        mlflow.set_experiment(experiment_name)

    def evaluate_single(self, prediction, reference):
        """Evaluate a single prediction-reference pair."""
        return self.scorer.score(reference, prediction)

    def evaluate_batch(self, predictions, references, model_name=None, log_to_mlflow=True):
        """
        Evaluate a batch of predictions.

        Args:
            predictions: List of predicted summaries
            references: List of reference summaries
            model_name: Name for MLflow logging
            log_to_mlflow: Whether to log results to MLflow
        """
        if len(predictions) != len(references):
            raise ValueError("Predictions and references must have the same length")

        all_scores = []
        for pred, ref in zip(predictions, references):
            scores = self.evaluate_single(pred, ref)
            all_scores.append(scores)

        # Calculate aggregate metrics
        aggregate_scores = self._aggregate_scores(all_scores)

        # Log to MLflow if requested
        if log_to_mlflow:
            self._log_to_mlflow(aggregate_scores, model_name, len(predictions))

        return aggregate_scores, all_scores

    def _aggregate_scores(self, all_scores):
        """Aggregate individual scores into summary statistics."""
        aggregate = {}

        for rouge_type in all_scores[0].keys():
            precisions = [s[rouge_type].precision for s in all_scores]
            recalls = [s[rouge_type].recall for s in all_scores]
            fmeasures = [s[rouge_type].fmeasure for s in all_scores]

            aggregate[rouge_type] = {
                'precision': {
                    'mean': np.mean(precisions),
                    'std': np.std(precisions),
                    'min': np.min(precisions),
                    'max': np.max(precisions)
                },
                'recall': {
                    'mean': np.mean(recalls),
                    'std': np.std(recalls),
                    'min': np.min(recalls),
                    'max': np.max(recalls)
                },
                'fmeasure': {
                    'mean': np.mean(fmeasures),
                    'std': np.std(fmeasures),
                    'min': np.min(fmeasures),
                    'max': np.max(fmeasures)
                }
            }

        return aggregate

    def _log_to_mlflow(self, aggregate_scores, model_name, num_samples):
        """Log evaluation results to MLflow."""
        with mlflow.start_run(run_name=model_name or "rouge-evaluation"):
            # Log parameters
            mlflow.log_param("num_samples", num_samples)
            mlflow.log_param("use_stemmer", self.use_stemmer)
            if model_name:
                mlflow.log_param("model_name", model_name)

            # Log metrics
            for rouge_type, scores in aggregate_scores.items():
                for metric_type, values in scores.items():
                    for stat_name, stat_value in values.items():
                        metric_name = f"{rouge_type}_{metric_type}_{stat_name}"
                        mlflow.log_metric(metric_name, stat_value)

    def compare_models(self, model_results):
        """
        Compare multiple model evaluation results.

        Args:
            model_results: Dict of {model_name: (predictions, references)}
        """
        comparison_data = []

        for model_name, (predictions, references) in model_results.items():
            aggregate_scores, _ = self.evaluate_batch(
                predictions, references, model_name, log_to_mlflow=True
            )

            # Extract key metrics for comparison
            row = {'model': model_name}
            for rouge_type in ['rouge1', 'rouge2', 'rougeL']:
                row[f"{rouge_type}_f1"] = aggregate_scores[rouge_type]['fmeasure']['mean']
                row[f"{rouge_type}_precision"] = aggregate_scores[rouge_type]['precision']['mean']
                row[f"{rouge_type}_recall"] = aggregate_scores[rouge_type]['recall']['mean']

            comparison_data.append(row)

        return pd.DataFrame(comparison_data)

# Demonstrate the pipeline
evaluator = ROUGEEvaluator(experiment_name="rouge-pipeline-demo")

# Prepare model results for comparison
model_results = {
    "High-Quality Model": (df["model_summary_good"].tolist(), df["reference_summary"].tolist()),
    "Low-Quality Model": (df["model_summary_poor"].tolist(), df["reference_summary"].tolist())
}

# Compare models
comparison = evaluator.compare_models(model_results)

print("\n🎯 Pipeline Evaluation Results:")
print("="*80)
print(comparison.round(4).to_string(index=False))

## 10. Best Practices and Tips

### Key Insights from this Demo:

1. **ROUGE Metric Selection**:
   - **ROUGE-1**: Good for measuring unigram overlap (individual word matches)
   - **ROUGE-2**: Captures bigram overlap (phrase-level similarity)
   - **ROUGE-L**: Based on longest common subsequence (structural similarity)
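   - **ROUGE-Lsum**: Summary-level LCS for multi-sentence summaries; `rouge-score` splits sentences on newlines by default, so for single-line texts like the samples in this notebook it typically matches ROUGE-L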

2. **MLflow Integration Benefits**:
   - Automatic experiment tracking
   - Easy model comparison
   - Reproducible evaluations
   - Metric visualization in MLflow UI

3. **Evaluation Considerations**:
   - Always use multiple ROUGE variants for comprehensive evaluation
   - Consider both precision and recall, not just F-measure
   - Use stemming for more robust matching
   - Evaluate on diverse test sets

4. **Production Tips**:
   - Create reusable evaluation pipelines
   - Log all hyperparameters and configurations
   - Track both aggregate and sample-level metrics
   - Set up automated evaluation for model updates
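
The advice to consider both precision and recall can be illustrated with a small sketch. It assumes a simplified unigram overlap (lowercase whitespace tokens, no stemming) in a hypothetical helper named `rouge1`, so the numbers will not exactly match `rouge-score` output. The two example summaries are made up for illustration.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram precision, recall, and F1 on lowercased whitespace tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((pred & ref).values())  # clipped unigram matches
    precision = matches / sum(pred.values()) if pred else 0.0
    recall = matches / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1


reference = "machine learning lets computers learn from data"

# A terse summary: everything it says is in the reference, but it covers little.
terse = "machine learning"

# A padded summary: it covers the whole reference, plus unrelated filler.
padded = ("machine learning lets computers learn from data "
          "and also many other unrelated things")

for name, summary in [("terse", terse), ("padded", padded)]:
    p, r, f = rouge1(summary, reference)
    print(f"{name:7s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The terse summary reaches perfect precision with low recall, while the padded one reaches perfect recall with reduced precision. Looking only at the F-measure would hide which of the two failure modes a model exhibits.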

In [None]:
# View MLflow UI command (run in terminal)
print("\n📊 To view the MLflow UI with all logged experiments:")
print("Run this command in your terminal:")
print("\n    mlflow ui --port 5000\n")
print("Then open: http://localhost:5000")
print("\n✅ Demo completed successfully!")