# Agentic Workflow KPI Evaluation using Embeddings

This notebook walks through an embedding-based KPI evaluation system for agentic workflows: setup, single-sample and batch scoring, analysis, advanced metrics, a pipeline wrapper, and report export.

## Key Features:
- **Accuracy Measurement**: Semantic similarity between outputs and ground truth
- **Faithfulness Evaluation**: How well responses align with source context
- **Relevance Scoring**: Query-response alignment measurement
- **No LLM-as-judge**: Pure embedding-based metrics for efficiency and determinism
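
As a rough illustration of what an embedding-based score means, the sketch below computes cosine similarity between two vectors and applies a pass threshold. It uses plain Python so it runs without the notebook's dependencies. The vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `binary_accuracy` are hypothetical helpers, not functions from `agentic_kpi_embeddings`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def binary_accuracy(score: float, threshold: float = 0.7) -> bool:
    """Turn a continuous similarity score into a pass/fail decision."""
    return score >= threshold


# Toy 3-dimensional "embeddings" (illustrative values only).
response_vec = [1.0, 2.0, 3.0]
truth_vec = [2.0, 4.0, 6.0]       # same direction as response_vec
unrelated_vec = [3.0, 0.0, -1.0]  # orthogonal to response_vec

same_direction = cosine_similarity(response_vec, truth_vec)
orthogonal = cosine_similarity(response_vec, unrelated_vec)

print(f"same direction: {same_direction:.4f}, pass={binary_accuracy(same_direction)}")
print(f"orthogonal:     {orthogonal:.4f}, pass={binary_accuracy(orthogonal)}")
```

Cosine similarity depends only on direction, so the two parallel vectors score close to 1 even though one is twice as long, while the orthogonal pair scores 0.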

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install sentence-transformers numpy scipy pandas matplotlib seaborn tqdm -q

# Optional: Install for other embedding providers
# !pip install openai cohere -q

In [None]:
import sys
# Adjust this path to the directory that contains the agentic_kpi_embeddings module
sys.path.append('/home/claude')

from agentic_kpi_embeddings import (
    AgenticKPIEvaluator,
    SentenceTransformerProvider,
    EvaluationSample,
    SimilarityMetric,
    KPIAnalyzer,
    AdvancedMetrics,
    SimilarityCalculator
)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
from typing import List, Dict, Any
from tqdm import tqdm

# Set up visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Setup complete!")

## 2. Initialize Embedding Providers

In [None]:
# Initialize different embedding providers

# Option 1: Sentence Transformers (recommended for local usage)
provider_fast = SentenceTransformerProvider("all-MiniLM-L6-v2")  # Fast, 384 dimensions
print(f"Fast provider dimension: {provider_fast.get_dimension()}")

# Option 2: Higher quality model
provider_quality = SentenceTransformerProvider("all-mpnet-base-v2")  # Better quality, 768 dimensions
print(f"Quality provider dimension: {provider_quality.get_dimension()}")

# Option 3: OpenAI (requires API key)
# from agentic_kpi_embeddings import OpenAIProvider
# provider_openai = OpenAIProvider(api_key="your-api-key", model="text-embedding-3-small")

## 3. Basic KPI Evaluation
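
The evaluator in this section is created with `use_cache=True`, but the notebook does not show how the library caches embeddings. The sketch below is one minimal way such a cache could work, assuming entries are keyed by a hash of the model name and the text. `EmbeddingCache` and `fake_embed` are hypothetical names used for illustration only.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory cache that avoids re-embedding text it has already seen."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model_name: str):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model name so vectors from different models never collide.
        raw = f"{self._model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real model call: text length and vowel count.
    return [float(len(text)), float(sum(text.count(v) for v in "aeiou"))]


cache = EmbeddingCache(fake_embed, model_name="toy-model")
first = cache.get("The capital of France is Paris.")
second = cache.get("The capital of France is Paris.")  # served from the cache

print(f"hits={cache.hits}, misses={cache.misses}")
```

Ground-truth strings and context passages tend to repeat across samples, so a cache of this kind mainly pays off in batch evaluation.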

In [None]:
# Create evaluator with caching enabled
evaluator = AgenticKPIEvaluator(
    embedding_provider=provider_fast,
    similarity_metric=SimilarityMetric.COSINE,
    use_cache=True,
    cache_dir=Path("/tmp/kpi_cache"),
    batch_size=32
)

print("Evaluator initialized with:")
print(f"  - Similarity metric: {evaluator.similarity_metric.value}")
print(f"  - Batch size: {evaluator.batch_size}")
print(f"  - Cache enabled: {evaluator.cache is not None}")

In [None]:
# Example 1: Simple accuracy evaluation
response = "Paris is the capital and largest city of France, located in the northern part of the country."
ground_truth = "The capital of France is Paris."

accuracy_result = evaluator.calculate_accuracy(response, ground_truth, threshold=0.7)

print("ACCURACY EVALUATION:")
print(f"Score: {accuracy_result.score:.4f}")
print(f"Binary accuracy (threshold={accuracy_result.details['threshold']}): {accuracy_result.details['binary_accuracy']}")
print(f"Similarity metric: {accuracy_result.details['similarity_metric']}")

In [None]:
# Example 2: Faithfulness evaluation
response = "Machine learning models can be trained using gradient descent to minimize loss functions."
context = [
    "Machine learning is a subset of artificial intelligence.",
    "Gradient descent is an optimization algorithm used to minimize functions.",
    "Loss functions measure the difference between predictions and actual values.",
    "Neural networks use backpropagation to calculate gradients."
]

faithfulness_result = evaluator.calculate_faithfulness(response, context, aggregation="weighted")

print("FAITHFULNESS EVALUATION:")
print(f"Score: {faithfulness_result.score:.4f}")
print(f"Max similarity: {faithfulness_result.details['max_similarity']:.4f}")
print(f"Min similarity: {faithfulness_result.details['min_similarity']:.4f}")
print(f"Std deviation: {faithfulness_result.details['std_similarity']:.4f}")
print("\nIndividual similarities:")
for i, sim in enumerate(faithfulness_result.details['individual_similarities']):
    print(f"  Context {i+1}: {sim:.4f}")

In [None]:
# Example 3: Relevance evaluation
query = "How does photosynthesis work?"
response = "Photosynthesis is the process by which plants convert light energy into chemical energy, using carbon dioxide and water to produce glucose and oxygen."
context = [
    "Plants are autotrophs that produce their own food.",
    "Chlorophyll is the green pigment that captures light energy."
]

relevance_result = evaluator.calculate_relevance(query, response, context)

print("RELEVANCE EVALUATION:")
print(f"Score: {relevance_result.score:.4f}")
print(f"Query-Response similarity: {relevance_result.details['query_response_similarity']:.4f}")
if 'context_similarity' in relevance_result.details:
    print(f"Context similarity: {relevance_result.details['context_similarity']:.4f}")
    print(f"Combined score: {relevance_result.details['combined_score']:.4f}")

## 4. Batch Evaluation of Agentic Workflows

In [None]:
# Create sample agentic workflow data
agentic_samples = [
    # RAG System Examples
    EvaluationSample(
        query="What are the main features of transformers in NLP?",
        response="Transformers use self-attention mechanisms to process sequences in parallel, enabling better long-range dependencies capture than RNNs.",
        context=[
            "Transformers were introduced in the 'Attention is All You Need' paper.",
            "Self-attention allows models to weigh the importance of different positions.",
            "Transformers eliminate the need for recurrence and convolutions."
        ],
        ground_truth="Transformers are neural network architectures that use self-attention mechanisms for sequence processing."
    ),
    
    # Code Generation Agent
    EvaluationSample(
        query="Write a function to calculate factorial",
        response="def factorial(n): return 1 if n <= 1 else n * factorial(n-1)",
        context=[
            "Factorial is the product of all positive integers up to n.",
            "Recursive solutions are elegant for factorial calculation."
        ],
        ground_truth="def factorial(n):\n    if n == 0 or n == 1:\n        return 1\n    else:\n        return n * factorial(n - 1)"
    ),
    
    # Question Answering Agent
    EvaluationSample(
        query="What is the speed of light?",
        response="The speed of light in vacuum is approximately 299,792,458 meters per second.",
        context=[
            "Light travels at different speeds in different media.",
            "The speed of light in vacuum is a fundamental physical constant denoted by c."
        ],
        ground_truth="The speed of light in vacuum is 299,792,458 m/s."
    ),
    
    # Summarization Agent
    EvaluationSample(
        query="Summarize the key points about climate change",
        response="Climate change refers to long-term shifts in global temperatures and weather patterns, primarily driven by human activities since the 1800s.",
        context=[
            "Global average temperature has risen by about 1.1°C since pre-industrial times.",
            "Greenhouse gas emissions from burning fossil fuels are the main driver.",
            "Climate change impacts include rising sea levels, extreme weather events, and ecosystem disruption."
        ],
        ground_truth="Climate change is the long-term alteration of Earth's climate patterns, mainly caused by human activities."
    ),
    
    # Multi-turn Conversation Agent
    EvaluationSample(
        query="What did we discuss about Python?",
        response="We discussed Python's versatility as a programming language, its use in data science, and its simple syntax.",
        context=[
            "Python is a high-level, interpreted programming language.",
            "Python is widely used in data science, web development, and automation.",
            "Python's syntax emphasizes readability with significant whitespace."
        ],
        ground_truth="We talked about Python being a versatile programming language used in various domains."
    )
]

print(f"Created {len(agentic_samples)} evaluation samples for different agent types")

In [None]:
# Evaluate all samples
print("Evaluating samples...")
results_df = evaluator.evaluate_batch(agentic_samples, parallel=True, n_workers=4)

print(f"\nEvaluation complete! Generated {len(results_df)} metric measurements")
print("\nSample of results:")
results_df.head(10)

## 5. Analysis and Visualization
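
The outlier cell in this section calls `identify_outliers(threshold_std=1.5)`. Its implementation is not shown here; a plausible reading is a z-score rule that flags scores lying more than `threshold_std` standard deviations from the mean. The sketch below implements that rule on made-up scores. `find_outliers` is a hypothetical helper, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def find_outliers(scores: List[float], threshold_std: float = 1.5) -> List[Tuple[int, float, float]]:
    """Return (index, score, z_score) for scores far from the mean."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation (assumption)
    if sigma == 0:
        return []  # all scores identical, nothing can be an outlier
    outliers = []
    for idx, score in enumerate(scores):
        z = (score - mu) / sigma
        if abs(z) > threshold_std:
            outliers.append((idx, score, z))
    return outliers


# Illustrative scores for one metric: four similar values and one low value.
scores = [0.80, 0.82, 0.79, 0.81, 0.30]
outliers = find_outliers(scores, threshold_std=1.5)

for idx, score, z in outliers:
    print(f"Sample {idx}: score={score:.2f}, deviation={z:.2f} sigma")
```

With n scores, no value can lie more than sqrt(n - 1) population standard deviations from the mean, so with five samples per metric the largest possible deviation is 2. A threshold well below that bound, such as 1.5, is needed for the rule to fire at all on a set this small.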

In [None]:
# Initialize analyzer
analyzer = KPIAnalyzer(results_df)

# Get summary statistics
summary_stats = analyzer.get_summary_statistics()
print("SUMMARY STATISTICS BY METRIC:")
print("=" * 50)
print(summary_stats)

# Visualize distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for idx, metric in enumerate(['accuracy', 'faithfulness', 'relevance']):
    metric_data = results_df[results_df['metric'] == metric]['score']
    
    axes[idx].hist(metric_data, bins=20, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{metric.capitalize()} Distribution')
    axes[idx].set_xlabel('Score')
    axes[idx].set_ylabel('Frequency')
    axes[idx].axvline(metric_data.mean(), color='red', linestyle='--', label=f'Mean: {metric_data.mean():.3f}')
    axes[idx].legend()

plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis
correlation_matrix = analyzer.get_correlation_matrix()

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Metric Correlation Matrix')
plt.show()

print("\nKey Insights:")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        metric1 = correlation_matrix.columns[i]
        metric2 = correlation_matrix.columns[j]
        corr = correlation_matrix.iloc[i, j]
        if abs(corr) > 0.5:
            print(f"  - Strong correlation ({corr:.3f}) between {metric1} and {metric2}")

In [None]:
# Identify outliers
outliers_df = analyzer.identify_outliers(threshold_std=1.5)

if not outliers_df.empty:
    print("OUTLIER SAMPLES DETECTED:")
    print("=" * 50)
    for _, outlier in outliers_df.iterrows():
        print(f"Sample {outlier['sample_idx']}: {outlier['metric']} = {outlier['score']:.4f} (deviation: {outlier['deviation']:.2f}σ)")
else:
    print("No significant outliers detected")

## 6. Advanced Metrics
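
The cells in this section call `AdvancedMetrics` without showing how the scores are derived. One plausible reading, and only an assumption about the library's internals, is that consistency is the mean pairwise cosine similarity between responses and coverage is the fraction of key concepts whose similarity to the response meets a threshold. The sketch below implements those two definitions on toy vectors; `mean_pairwise_similarity` and `coverage_fraction` are hypothetical names.

```python
import math
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(vectors: List[Vector]) -> float:
    """Consistency sketch: average cosine similarity over all unordered pairs."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent with itself
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def coverage_fraction(response: Vector, concepts: List[Vector], threshold: float = 0.6) -> float:
    """Coverage sketch: share of concepts whose similarity meets the threshold."""
    if not concepts:
        return 0.0
    covered = sum(1 for c in concepts if cosine(response, c) >= threshold)
    return covered / len(concepts)


# Toy 2-dimensional vectors standing in for response and concept embeddings.
aligned = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
concepts = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]

consistency_aligned = mean_pairwise_similarity(aligned)
consistency_scattered = mean_pairwise_similarity(scattered)
coverage = coverage_fraction([1.0, 0.0], concepts, threshold=0.6)

print(f"aligned:   {consistency_aligned:.4f}")
print(f"scattered: {consistency_scattered:.4f}")
print(f"coverage:  {coverage:.2f}")
```

Under these definitions the coverage score times the number of concepts is a whole number, which is what the "Covered concepts" line in the coverage cell relies on.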

In [None]:
# Test consistency across multiple responses
responses_consistent = [
    "Python is a high-level programming language known for its simplicity.",
    "Python is a versatile high-level language with simple syntax.",
    "Python is an easy-to-learn high-level programming language."
]

responses_inconsistent = [
    "Python is a high-level programming language.",
    "JavaScript is primarily used for web development.",
    "Machine learning requires large amounts of data."
]

consistency_score_1 = AdvancedMetrics.calculate_consistency(
    responses_consistent, 
    provider_fast,
    SimilarityCalculator(),
    SimilarityMetric.COSINE
)

consistency_score_2 = AdvancedMetrics.calculate_consistency(
    responses_inconsistent,
    provider_fast,
    SimilarityCalculator(),
    SimilarityMetric.COSINE
)

print("CONSISTENCY ANALYSIS:")
print(f"Consistent responses score: {consistency_score_1:.4f}")
print(f"Inconsistent responses score: {consistency_score_2:.4f}")

In [None]:
# Test coverage of key concepts
response = """Machine learning is a branch of artificial intelligence that enables 
computers to learn from data without explicit programming. It uses algorithms 
to identify patterns and make predictions based on training data."""

key_concepts = [
    "artificial intelligence",
    "learning from data",
    "algorithms",
    "patterns",
    "predictions",
    "neural networks"  # Not covered
]

coverage_score = AdvancedMetrics.calculate_coverage(
    response,
    key_concepts,
    provider_fast,
    SimilarityCalculator(),
    SimilarityMetric.COSINE,
    threshold=0.6
)

print("COVERAGE ANALYSIS:")
print(f"Coverage score: {coverage_score:.4f}")
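# round() rather than int(): float error can turn e.g. 5.0 into 4.999..., which int() truncates to 4
print(f"Covered concepts: {round(coverage_score * len(key_concepts))}/{len(key_concepts)}")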

In [None]:
# Test specificity (non-generic responses)
specific_response = "The transformer architecture uses multi-head self-attention with 12 layers and 768 hidden dimensions in BERT-base."
generic_response = "This is an interesting topic that has many aspects to consider."

generic_templates = [
    "This is a complex topic with many considerations.",
    "There are various aspects to this question.",
    "This is an interesting area of study.",
    "Many factors contribute to this.",
    "This depends on various circumstances."
]

specificity_score_1 = AdvancedMetrics.calculate_specificity(
    specific_response,
    generic_templates,
    provider_fast,
    SimilarityCalculator(),
    SimilarityMetric.COSINE
)

specificity_score_2 = AdvancedMetrics.calculate_specificity(
    generic_response,
    generic_templates,
    provider_fast,
    SimilarityCalculator(),
    SimilarityMetric.COSINE
)

print("SPECIFICITY ANALYSIS:")
print(f"Specific response score: {specificity_score_1:.4f}")
print(f"Generic response score: {specificity_score_2:.4f}")

## 7. Comparing Different Similarity Metrics
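
When reading the comparison in this section, it helps to remember that similarity metrics live on different scales. The sketch below contrasts three common definitions on a pair of toy vectors. Which members `SimilarityMetric` actually contains, and how the library maps distances to similarities, is not shown in this notebook, so the `1 / (1 + distance)` mapping is just one common convention and the function names are hypothetical.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Map a distance in [0, inf) to a similarity in (0, 1].
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

dot_score = dot_product(a, b)
cos_score = cosine_similarity(a, b)
euc_score = euclidean_similarity(a, b)

print(f"dot product: {dot_score:.4f}")
print(f"cosine:      {cos_score:.4f}")
print(f"euclidean:   {euc_score:.4f}")
```

The same pair of vectors looks identical under cosine similarity, far apart under the Euclidean mapping, and unbounded under the raw dot product. A single threshold or a fixed 0 to 1 plot axis is therefore only meaningful for metrics that share that range.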

In [None]:
# Compare different similarity metrics
metrics_comparison = []

test_response = "Deep learning uses neural networks with multiple layers to learn representations."
test_ground_truth = "Deep learning is a subset of machine learning using multi-layer neural networks."

for metric in SimilarityMetric:
    evaluator_metric = AgenticKPIEvaluator(
        embedding_provider=provider_fast,
        similarity_metric=metric,
        use_cache=True
    )
    
    result = evaluator_metric.calculate_accuracy(test_response, test_ground_truth)
    metrics_comparison.append({
        'metric': metric.value,
        'score': result.score
    })

metrics_df = pd.DataFrame(metrics_comparison)

# Visualize
plt.figure(figsize=(10, 6))
bars = plt.bar(metrics_df['metric'], metrics_df['score'], color='skyblue', edgecolor='navy')
plt.title('Comparison of Different Similarity Metrics')
plt.xlabel('Similarity Metric')
plt.ylabel('Score')
plt.ylim(0, 1)

# Add value labels on bars
for bar, score in zip(bars, metrics_df['score']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{score:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\nMetric Comparison Results:")
print(metrics_df.sort_values('score', ascending=False))

## 8. Production Pipeline Example

In [None]:
class AgenticWorkflowPipeline:
    """Production pipeline for evaluating agentic workflows."""
    
    def __init__(self, 
                 embedding_provider,
                 thresholds: Dict[str, float] = None):
        """
        Initialize pipeline with configurable thresholds.
        
        Args:
            embedding_provider: Provider for embeddings
            thresholds: Minimum scores for each metric
        """
        self.evaluator = AgenticKPIEvaluator(
            embedding_provider=embedding_provider,
            similarity_metric=SimilarityMetric.COSINE,
            use_cache=True
        )
        
        self.thresholds = thresholds or {
            'accuracy': 0.7,
            'faithfulness': 0.75,
            'relevance': 0.8
        }
    
    def evaluate_agent_response(self, 
                               query: str,
                               response: str,
                               context: List[str] = None,
                               ground_truth: str = None) -> Dict[str, Any]:
        """Evaluate a single agent response."""
        
        sample = EvaluationSample(
            query=query,
            response=response,
            context=context,
            ground_truth=ground_truth
        )
        
        results = self.evaluator.evaluate_sample(sample)
        
        # Determine pass/fail for each metric
        evaluation = {
            'scores': {},
            'passed': {},
            'overall_pass': True
        }
        
        for metric_name, result in results.items():
            score = result.score
            threshold = self.thresholds.get(metric_name, 0.5)
            passed = score >= threshold
            
            evaluation['scores'][metric_name] = score
            evaluation['passed'][metric_name] = passed
            
            if not passed:
                evaluation['overall_pass'] = False
        
        return evaluation
    
    def batch_evaluate(self, 
                      agent_outputs: List[Dict[str, Any]],
                      return_dataframe: bool = True):
        """Evaluate multiple agent outputs."""
        
        all_evaluations = []
        
        for output in tqdm(agent_outputs, desc="Evaluating agents"):
            evaluation = self.evaluate_agent_response(
                query=output['query'],
                response=output['response'],
                context=output.get('context'),
                ground_truth=output.get('ground_truth')
            )
            
            evaluation['agent_id'] = output.get('agent_id', 'unknown')
            evaluation['timestamp'] = output.get('timestamp', None)
            all_evaluations.append(evaluation)
        
        if return_dataframe:
            # Convert to DataFrame for analysis
            rows = []
            for eval_result in all_evaluations:
                row = {
                    'agent_id': eval_result['agent_id'],
                    'overall_pass': eval_result['overall_pass']
                }
                
                for metric, score in eval_result['scores'].items():
                    row[f'{metric}_score'] = score
                    row[f'{metric}_pass'] = eval_result['passed'][metric]
                
                rows.append(row)
            
            return pd.DataFrame(rows)
        
        return all_evaluations


# Example usage
pipeline = AgenticWorkflowPipeline(
    embedding_provider=provider_fast,
    thresholds={
        'accuracy': 0.75,
        'faithfulness': 0.8,
        'relevance': 0.85
    }
)

# Test single evaluation
test_evaluation = pipeline.evaluate_agent_response(
    query="What is gradient descent?",
    response="Gradient descent is an optimization algorithm that iteratively adjusts parameters to minimize a loss function.",
    context=["Gradient descent uses derivatives to find minima.", "It's widely used in machine learning."],
    ground_truth="Gradient descent is an iterative optimization algorithm for finding minima of functions."
)

print("PIPELINE EVALUATION RESULT:")
print(f"Overall Pass: {test_evaluation['overall_pass']}")
print("\nScores:")
for metric, score in test_evaluation['scores'].items():
    passed = "✅" if test_evaluation['passed'][metric] else "❌"
    print(f"  {metric}: {score:.4f} {passed}")

In [None]:
# Simulate multiple agent outputs for batch evaluation
simulated_agent_outputs = [
    {
        'agent_id': 'rag_agent_v1',
        'query': 'What is reinforcement learning?',
        'response': 'Reinforcement learning is a type of machine learning where agents learn through trial and error.',
        'context': ['RL uses rewards and punishments.', 'Agents interact with environments.'],
        'ground_truth': 'Reinforcement learning is ML paradigm where agents learn optimal behavior through rewards.'
    },
    {
        'agent_id': 'rag_agent_v2',
        'query': 'What is reinforcement learning?',
        'response': 'Reinforcement learning enables agents to learn optimal policies by maximizing cumulative rewards in an environment.',
        'context': ['RL uses rewards and punishments.', 'Agents interact with environments.'],
        'ground_truth': 'Reinforcement learning is ML paradigm where agents learn optimal behavior through rewards.'
    },
    {
        'agent_id': 'qa_agent',
        'query': 'What is the capital of Japan?',
        'response': 'Tokyo is the capital city of Japan.',
        'context': ['Japan is an island nation.', 'Tokyo is the largest city in Japan.'],
        'ground_truth': 'The capital of Japan is Tokyo.'
    }
]

# Batch evaluate
batch_results_df = pipeline.batch_evaluate(simulated_agent_outputs)

print("\nBATCH EVALUATION RESULTS:")
print(batch_results_df)

# Summary statistics
print("\nAGENT PERFORMANCE SUMMARY:")
print(f"Overall pass rate: {batch_results_df['overall_pass'].mean():.1%}")
print("\nBy Metric:")
for metric in ['accuracy', 'faithfulness', 'relevance']:
    if f'{metric}_score' in batch_results_df.columns:
        avg_score = batch_results_df[f'{metric}_score'].mean()
        pass_rate = batch_results_df[f'{metric}_pass'].mean()
        print(f"  {metric}: avg={avg_score:.3f}, pass_rate={pass_rate:.1%}")

## 9. Export and Reporting

In [None]:
# Export comprehensive report
output_dir = Path("/tmp/kpi_reports")
output_dir.mkdir(parents=True, exist_ok=True)

# Export to different formats
analyzer.export_report(output_dir / "kpi_report.json", format="json")
analyzer.export_report(output_dir / "kpi_results.csv", format="csv")
analyzer.export_report(output_dir / "kpi_report.html", format="html")

print(f"Reports exported to {output_dir}")

# Create custom summary report
summary_report = {
    'evaluation_date': pd.Timestamp.now().isoformat(),
    'num_samples': len(agentic_samples),
    'embedding_model': type(provider_fast).__name__,
    'similarity_metric': SimilarityMetric.COSINE.value,
    'summary_statistics': summary_stats.to_dict(),
    'correlations': correlation_matrix.to_dict(),
    'pass_rates': {
        # Use >= so these rates match the pipeline's `score >= threshold` rule (Section 8 defaults)
        'accuracy': (results_df[results_df['metric'] == 'accuracy']['score'] >= 0.7).mean(),
        'faithfulness': (results_df[results_df['metric'] == 'faithfulness']['score'] >= 0.75).mean(),
        'relevance': (results_df[results_df['metric'] == 'relevance']['score'] >= 0.8).mean()
    }
}

with open(output_dir / "summary_report.json", 'w') as f:
    json.dump(summary_report, f, indent=2, default=str)

print("\nSUMMARY REPORT:")
print(json.dumps(summary_report, indent=2, default=str)[:500] + "...")

## 10. Performance Comparison: Embeddings vs LLM-as-Judge

In [None]:
import time

# Benchmark embedding-based evaluation
num_test_samples = 100

# Generate test samples
test_samples = [
    EvaluationSample(
        query=f"Query {i}",
        response=f"Response {i} with some content",
        context=[f"Context {i}.1", f"Context {i}.2"],
        ground_truth=f"Ground truth {i}"
    )
    for i in range(num_test_samples)
]

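# Time embedding-based evaluation with a monotonic, high-resolution clock.
# The evaluator was created with use_cache=True, so re-running this cell may be faster.
start_time = time.perf_counter()
embedding_results = evaluator.evaluate_batch(test_samples, parallel=True)
embedding_time = time.perf_counter() - start_time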

print("PERFORMANCE COMPARISON:")
print("=" * 50)
print(f"\nEmbedding-based Evaluation:")
print(f"  Samples evaluated: {num_test_samples}")
print(f"  Total time: {embedding_time:.2f} seconds")
print(f"  Time per sample: {embedding_time/num_test_samples:.3f} seconds")
print(f"  Samples per second: {num_test_samples/embedding_time:.1f}")

# Compare with an assumed LLM-as-judge latency (not measured in this notebook)
typical_llm_time_per_sample = 2.0  # seconds; assumed value, adjust for your judge model
llm_total_time = typical_llm_time_per_sample * num_test_samples

print("\nTypical LLM-as-Judge (estimated):")
print(f"  Time per sample: ~{typical_llm_time_per_sample} seconds")
print(f"  Total time: ~{llm_total_time:.0f} seconds")
print(f"  Samples per second: ~{1/typical_llm_time_per_sample:.1f}")

print(f"\nSpeedup: {llm_total_time/embedding_time:.1f}x faster with embeddings")
# This ratio compares wall-clock time only; it is not a monetary cost estimate
print(f"Time reduction: ~{(1 - embedding_time/llm_total_time)*100:.1f}%")

# Additional benefits
print("\n✅ Additional Benefits of Embedding-based KPIs:")
print("  - Deterministic results (no randomness)")
print("  - No API rate limits")
print("  - Works offline after initial model download")
print("  - Consistent evaluation criteria")
print("  - Lower computational cost")
print("  - Easy to parallelize")

## Conclusion

This notebook demonstrated a comprehensive embedding-based KPI evaluation system for agentic workflows that:

1. **Eliminates LLM-as-Judge dependency**: Uses deterministic embedding comparisons
2. **Provides fast evaluation**: Section 10 measures throughput on 100 synthetic samples and compares it with an assumed 2 seconds per sample for LLM-based scoring
3. **Supports multiple metrics**: Accuracy, faithfulness, relevance, consistency, coverage, specificity
4. **Enables production deployment**: With caching, batch processing, and parallel evaluation
5. **Offers comprehensive analysis**: Statistical summaries, correlations, outlier detection
6. **Reduces costs**: No API calls for scoring, works offline

The pipeline class from Section 8 can be integrated into CI/CD pipelines for continuous agent evaluation, using its per-metric thresholds as pass/fail gates.
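
As a minimal sketch of what a CI/CD gate could look like, the code below takes evaluation dictionaries shaped like the ones `AgenticWorkflowPipeline.batch_evaluate(..., return_dataframe=False)` returns in Section 8 and converts the overall pass rate into a process exit code. `pass_rate`, `ci_gate`, and the `min_pass_rate` value are hypothetical choices for illustration, and the sample data is made up.

```python
from typing import Any, Dict, List


def pass_rate(evaluations: List[Dict[str, Any]]) -> float:
    """Fraction of evaluations whose 'overall_pass' flag is true."""
    if not evaluations:
        return 0.0
    return sum(1 for e in evaluations if e["overall_pass"]) / len(evaluations)


def ci_gate(evaluations: List[Dict[str, Any]], min_pass_rate: float = 0.9) -> int:
    """Return 0 if the pass rate meets the bar, 1 otherwise (shell exit codes)."""
    rate = pass_rate(evaluations)
    status = "OK" if rate >= min_pass_rate else "FAILED"
    print(f"Pass rate {rate:.1%} (required {min_pass_rate:.1%}): {status}")
    for e in evaluations:
        if not e["overall_pass"]:
            print(f"  failing agent: {e.get('agent_id', 'unknown')}")
    return 0 if rate >= min_pass_rate else 1


# Made-up results in the shape produced by the Section 8 pipeline.
evaluations = [
    {"agent_id": "rag_agent_v1", "overall_pass": False},
    {"agent_id": "rag_agent_v2", "overall_pass": True},
    {"agent_id": "qa_agent", "overall_pass": True},
]

strict_exit = ci_gate(evaluations, min_pass_rate=0.75)
lenient_exit = ci_gate(evaluations, min_pass_rate=0.6)

# In a CI job the script would typically end with: sys.exit(strict_exit)
```

Returning an exit code instead of raising keeps the gate easy to call from a shell step, and printing the failing agent IDs gives the build log enough detail to start debugging.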