# Self-Critique Chain Pipeline: Comprehensive Demonstration

This notebook provides a complete walkthrough of the Self-Critique Chain Pipeline for automated research paper summarization using Claude AI. The demonstration covers initialization, execution, monitoring, and analysis of the three-stage pipeline that implements summarization with automatic critique and revision.

The pipeline addresses common challenges in automated summarization by implementing an iterative refinement process. Single-shot approaches often produce outputs with inconsistencies, missing details, or misrepresented findings. The pipeline addresses these problems with a Chain-of-Verification pattern in which Claude AI evaluates and improves its own output across four quality dimensions: accuracy, completeness, clarity, and coherence.

## Learning Objectives

By completing this demonstration, you will understand how to initialize and configure the pipeline with appropriate parameters for your use case. You will learn to execute the three-stage workflow and interpret the generated outputs including summaries, critiques, and revisions. The notebook demonstrates how to collect and analyze performance metrics for optimization and troubleshooting. You will explore monitoring capabilities for detecting performance anomalies and quality degradation. Finally, you will learn to integrate the pipeline with MLflow for experiment tracking and reproducibility.

## Prerequisites

This demonstration requires Python version 3.10 or higher with all project dependencies installed. You must have a valid Anthropic API key configured in your environment variables. Basic familiarity with machine learning operations concepts and practices will help you understand the architectural decisions. Knowledge of RESTful API patterns is beneficial for understanding the service layer integration.

## Section 1: Environment Setup and Configuration

The first step involves preparing the execution environment with all necessary dependencies and configuration. This section establishes the foundation for reliable pipeline execution by verifying that all required components are available and properly configured.

In [None]:
import sys
import os
from pathlib import Path

project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json

from src.pipeline import SelfCritiquePipeline
from src.monitoring import PromptMonitor
from src.utils import extract_xml_content, parse_self_assessment

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Environment setup completed successfully")
print(f"Python version: {sys.version}")
print(f"Working directory: {Path.cwd()}")
print(f"Project root: {project_root}")

### Configuration and API Key Verification

The pipeline requires a valid Anthropic API key to access Claude AI. This cell loads environment variables from the project's `.env` file and checks that `ANTHROPIC_API_KEY` is set. The check only confirms that a key is present and prints a masked copy. It does not validate the key's format or make an API call, so an invalid key will not be detected until the first pipeline request.

In [None]:
from dotenv import load_dotenv

load_dotenv(project_root / ".env")

api_key = os.getenv("ANTHROPIC_API_KEY")

if not api_key:
    print("WARNING: ANTHROPIC_API_KEY not found in environment variables")
    print("Please set your API key in the .env file before proceeding")
    print("Example: ANTHROPIC_API_KEY=sk-ant-your-key-here")
else:
    masked_key = api_key[:10] + "..." + api_key[-4:]
    print(f"API key loaded successfully: {masked_key}")
    print("Configuration verified and ready for pipeline execution")
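
If a lightweight format check is wanted before any API call is made, it could look like the following sketch. The `sk-ant-` prefix is taken from the example string printed by the cell above. The minimum length and the helper names `looks_like_anthropic_key` and `mask_key` are assumptions for illustration and are not part of the project.

```python
def looks_like_anthropic_key(key: str, min_length: int = 20) -> bool:
    """Heuristic check only; a key that passes may still be rejected by the API."""
    # Assumption: keys start with "sk-ant-", as in the example string above.
    return key.startswith("sk-ant-") and len(key) >= min_length


def mask_key(key: str, visible: int = 4) -> str:
    """Show only the last few characters, and nothing at all for very short keys."""
    if len(key) <= visible * 2:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

Masking from the end only avoids printing the first ten characters of the key, which matters if notebook output is shared.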

## Section 2: Sample Research Paper Preparation

This section prepares the input data for pipeline execution by loading an excerpt (the abstract and introduction) of a sample research paper. The example uses the influential "Attention Is All You Need" paper, which introduced the Transformer architecture. This paper is a good demonstration case because it contains clear technical contributions, quantitative results, and stated limitations of earlier approaches that the pipeline must capture accurately.

In [None]:
sample_paper = """
Title: Attention Is All You Need

Abstract:

The dominant sequence transduction models are based on complex recurrent or convolutional 
neural networks that include an encoder and a decoder. The best performing models also 
connect the encoder and decoder through an attention mechanism. We propose a new simple 
network architecture, the Transformer, based solely on attention mechanisms, dispensing 
with recurrence and convolutions entirely. Experiments on two machine translation tasks 
show these models to be superior in quality while being more parallelizable and requiring 
significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German 
translation task, improving over the existing best results, including ensembles, by over 2 BLEU. 
On the WMT 2014 English-to-French translation task, our model establishes a new single-model 
state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction 
of the training costs of the best models from the literature.

Introduction:

Recurrent neural networks, long short-term memory and gated recurrent neural networks in 
particular, have been firmly established as state of the art approaches in sequence modeling 
and transduction problems such as language modeling and machine translation. Numerous efforts 
have since continued to push the boundaries of recurrent language models and encoder-decoder 
architectures.

Recurrent models typically factor computation along the symbol positions of the input and output 
sequences. Aligning the positions to steps in computation time, they generate a sequence of 
hidden states as a function of the previous hidden state and the input for position t. This 
inherently sequential nature precludes parallelization within training examples, which becomes 
critical at longer sequence lengths, as memory constraints limit batching across examples. Recent 
work has achieved significant improvements in computational efficiency through factorization 
tricks and conditional computation, while also improving model performance in case of the latter. 
The fundamental constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction 
models in various tasks, allowing modeling of dependencies without regard to their distance in 
the input or output sequences. In all but a few cases however, such attention mechanisms are used 
in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead 
relying entirely on an attention mechanism to draw global dependencies between input and output. 
The Transformer allows for significantly more parallelization and can reach a new state of the 
art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
"""

print("Sample paper loaded successfully")
print(f"Paper length: {len(sample_paper)} characters")
print(f"Approximate tokens: {len(sample_paper) // 4}")
print("\nPaper preview:")
print(sample_paper[:300] + "...")

## Section 3: Pipeline Initialization and Configuration

Pipeline initialization requires careful configuration of model parameters to balance quality and performance. This section demonstrates how to instantiate the pipeline with production-appropriate settings including model selection, token limits, and monitoring integration. The configuration choices reflect best practices for production deployments where reproducibility and observability are critical requirements.

In [None]:
pipeline = SelfCritiquePipeline(
    api_key=api_key,
    model="claude-sonnet-4-20250514",
    max_tokens=4096
)

monitor = PromptMonitor(
    baseline_latency=2.0,
    anomaly_threshold_multiplier=2.0,
    satisfaction_threshold=3.0
)

print("Pipeline initialized successfully")
print(f"Model: {pipeline.model}")
print(f"Max tokens per request: {pipeline.max_tokens}")
print(f"Monitoring baseline latency: {monitor.baseline_latency} seconds")
print("\nReady for pipeline execution")

## Section 4: Complete Pipeline Execution

This section executes the full three-stage Self-Critique Chain pipeline on the sample research paper. An initial summary is followed by a systematic critique and then a targeted revision. The summary and revision stages run at temperature 0.3 to stay close to the source text, while the critique stage runs at 0.5 to allow more varied analysis. The pipeline collects token and latency metrics at every stage to support performance monitoring and optimization.

In [None]:
print("="*80)
print("EXECUTING SELF-CRITIQUE CHAIN PIPELINE")
print("="*80)
print("\nThis will execute three stages:")
print("1. Generate initial summary (temperature=0.3)")
print("2. Critique the summary (temperature=0.5)")
print("3. Revise based on critique (temperature=0.3)")
print("\nEstimated execution time: 10-15 seconds\n")

results = pipeline.run_pipeline(
    paper_text=sample_paper,
    mlflow_tracking=False
)

# Map each stage number to the results key that holds its output text.
stage_outputs = {1: "summary", 2: "critique", 3: "revised_summary"}

for stage_num, output_key in stage_outputs.items():
    stage_key = f"stage{stage_num}_metrics"
    if stage_key in results:
        monitor.log_request(
            prompt=f"Stage {stage_num} execution",
            response=results.get(output_key, ""),
            metrics=results[stage_key]
        )

print("\n" + "="*80)
print("PIPELINE EXECUTION COMPLETED SUCCESSFULLY")
print("="*80)
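
The three stages can be sketched as a plain function chain. This is a minimal illustration, not the implementation of `SelfCritiquePipeline.run_pipeline`: the prompts, the `run_self_critique_chain` name, and the `call_model` callable are assumptions. Only the stage order and the temperatures come from the cell above.

```python
from typing import Callable, Dict

# Temperatures taken from the stage descriptions printed above.
STAGE_TEMPERATURES = {"summary": 0.3, "critique": 0.5, "revision": 0.3}

# Hypothetical signature: (prompt, temperature) -> model response text.
ModelCall = Callable[[str, float], str]


def run_self_critique_chain(paper_text: str, call_model: ModelCall) -> Dict[str, str]:
    """Summarize, critique the summary, then revise using the critique."""
    summary = call_model(
        f"Summarize this research paper:\n\n{paper_text}",
        STAGE_TEMPERATURES["summary"],
    )
    critique = call_model(
        "Critique this summary against the paper for accuracy, completeness, "
        f"clarity, and coherence.\n\nPaper:\n{paper_text}\n\nSummary:\n{summary}",
        STAGE_TEMPERATURES["critique"],
    )
    revised = call_model(
        "Revise the summary to address the critique.\n\n"
        f"Paper:\n{paper_text}\n\nSummary:\n{summary}\n\nCritique:\n{critique}",
        STAGE_TEMPERATURES["revision"],
    )
    return {"summary": summary, "critique": critique, "revised_summary": revised}
```

Passing the model call in as a function keeps the chain independent of any particular client, so it can be exercised with a stub that returns canned text instead of making API requests.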

## Section 5: Output Analysis and Comparison

This section presents the outputs from each pipeline stage to demonstrate the iterative improvement process. The analysis compares the initial summary against the revised version to highlight specific enhancements made during the revision stage. Understanding these differences provides insights into how the self-critique mechanism identifies and addresses quality issues systematically.

In [None]:
print("\n" + "="*80)
print("STAGE 1: INITIAL SUMMARY")
print("="*80)
print(results["summary"])

print("\n\n" + "="*80)
print("STAGE 2: CRITIQUE ANALYSIS (First 500 characters)")
print("="*80)
print(results["critique"][:500] + "...")

print("\n\n" + "="*80)
print("STAGE 3: REVISED SUMMARY")
print("="*80)
print(results["revised_summary"])

print("\n\n" + "="*80)
print("REFLECTION ON CHANGES (First 500 characters)")
print("="*80)
print(results["reflection"][:500] + "...")
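
Printing both summaries leaves the comparison to the reader. One way to surface the changes directly is a line-level diff with the standard library's `difflib`. The helper names `summary_diff` and `similarity_ratio` are illustrative and not part of the project.

```python
import difflib


def summary_diff(initial: str, revised: str) -> str:
    """Return a unified diff of the two summaries, compared line by line."""
    diff = difflib.unified_diff(
        initial.splitlines(),
        revised.splitlines(),
        fromfile="initial_summary",
        tofile="revised_summary",
        lineterm="",
    )
    return "\n".join(diff)


def similarity_ratio(initial: str, revised: str) -> float:
    """Rough 0-1 similarity; 1.0 means the revision changed nothing."""
    return difflib.SequenceMatcher(None, initial, revised).ratio()
```

In this notebook the call would be `print(summary_diff(results["summary"], results["revised_summary"]))`. A ratio close to 1.0 suggests the revision stage made few changes.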

## Section 6: Performance Metrics Analysis

Performance metrics provide critical insights into resource utilization and execution efficiency. This section analyzes token consumption, latency patterns, and computational costs across all pipeline stages. Understanding these metrics enables optimization decisions for production deployments where cost efficiency and response time are key operational requirements.

In [None]:
total_metrics = results["total_metrics"]

print("AGGREGATE PERFORMANCE METRICS")
print("="*80)
print(f"Total Input Tokens: {total_metrics['total_input_tokens']:,}")
print(f"Total Output Tokens: {total_metrics['total_output_tokens']:,}")
print(f"Total Tokens Consumed: {total_metrics['total_tokens']:,}")
print(f"Total Execution Latency: {total_metrics['total_latency_seconds']:.2f} seconds")
print(f"Average Latency Per Stage: {total_metrics['average_latency_per_stage']:.2f} seconds")
print(f"Stages Completed: {total_metrics['stage_count']}")

stage_data = []
for stage_num in range(1, 4):
    stage_key = f"stage{stage_num}_metrics"
    if stage_key in results:
        metrics = results[stage_key]
        stage_data.append({
            "Stage": f"Stage {stage_num}",
            "Input Tokens": metrics["input_tokens"],
            "Output Tokens": metrics["output_tokens"],
            "Total Tokens": metrics["total_tokens"],
            "Latency (s)": metrics["latency_seconds"],
            "Temperature": metrics["temperature"]
        })

df_stages = pd.DataFrame(stage_data)
print("\n\nPER-STAGE BREAKDOWN")
print("="*80)
print(df_stages.to_string(index=False))
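
The metrics above report tokens and latency but not monetary cost. A cost estimate can be derived from the token totals, as in the sketch below. The function name `estimate_cost_usd` is hypothetical, and the prices are required arguments on purpose: rates differ by model and change over time, so they should be looked up rather than hard-coded.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate cost from token counts and per-million-token prices."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
```

With the totals from this run, the inputs would be `total_metrics["total_input_tokens"]` and `total_metrics["total_output_tokens"]`, together with the current prices for the configured model.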

## Section 7: Performance Visualization

Visual representations of performance data facilitate rapid identification of patterns and anomalies. This section generates comprehensive visualizations showing token consumption patterns and latency distributions across pipeline stages. These visualizations support both operational monitoring and capacity planning decisions for production deployments.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(df_stages["Stage"], df_stages["Input Tokens"], label="Input Tokens", alpha=0.7)
axes[0].bar(df_stages["Stage"], df_stages["Output Tokens"], bottom=df_stages["Input Tokens"], label="Output Tokens", alpha=0.7)
axes[0].set_xlabel("Pipeline Stage")
axes[0].set_ylabel("Token Count")
axes[0].set_title("Token Consumption by Stage")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].bar(df_stages["Stage"], df_stages["Latency (s)"], color="coral", alpha=0.7)
axes[1].set_xlabel("Pipeline Stage")
axes[1].set_ylabel("Latency (seconds)")
axes[1].set_title("Execution Latency by Stage")
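# Draw the baseline from the monitor configuration so the plot tracks any change to it.
axes[1].axhline(y=monitor.baseline_latency, color='r', linestyle='--', label=f'Baseline ({monitor.baseline_latency:g}s)')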
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Performance visualizations generated successfully")

## Section 8: Monitoring and Anomaly Detection

Production systems require continuous monitoring to detect performance degradation or quality issues before they impact users. This section demonstrates the monitoring capabilities including anomaly detection algorithms that identify latency spikes, declining satisfaction scores, and resource consumption anomalies. The monitoring system operates on configurable thresholds that can be adjusted based on operational requirements.

In [None]:
summary_stats = monitor.get_summary_stats()

print("MONITORING SUMMARY STATISTICS")
print("="*80)
print(f"Total Requests Logged: {summary_stats['total_requests']}")
print(f"\nLatency Statistics:")
print(f"  Average: {summary_stats['latency']['average_seconds']:.3f} seconds")
print(f"  Minimum: {summary_stats['latency']['min_seconds']:.3f} seconds")
print(f"  Maximum: {summary_stats['latency']['max_seconds']:.3f} seconds")
print(f"  Median: {summary_stats['latency']['median_seconds']:.3f} seconds")
print(f"\nToken Usage Statistics:")
print(f"  Average per Request: {summary_stats['tokens']['average_per_request']:.1f} tokens")
print(f"  Total Consumed: {summary_stats['tokens']['total_consumed']:,} tokens")
print(f"  Minimum per Request: {summary_stats['tokens']['min_per_request']} tokens")
print(f"  Maximum per Request: {summary_stats['tokens']['max_per_request']} tokens")

anomalies = monitor.detect_anomalies(window_size=10)

print("\n\nANOMALY DETECTION RESULTS")
print("="*80)
if anomalies:
    for anomaly in anomalies:
        severity_symbol = "🚨" if anomaly["severity"] == "critical" else "⚠️"
        print(f"\n{severity_symbol} {anomaly['type'].upper()}")
        print(f"   Severity: {anomaly['severity'].upper()}")
        print(f"   Message: {anomaly['message']}")
        print(f"   Metric Value: {anomaly['metric']:.2f}")
        print(f"   Threshold: {anomaly['threshold']:.2f}")
else:
    print("✅ No anomalies detected in current execution")
    print("All performance metrics within acceptable thresholds")
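
The threshold logic behind a latency check can be sketched as follows. This is an assumption about how `baseline_latency`, `anomaly_threshold_multiplier`, and `window_size` might interact, not the actual `PromptMonitor.detect_anomalies` implementation. The function name, the `latency_spike` type, and the rule for critical severity are hypothetical. The returned keys mirror the ones read by the cell above.

```python
from statistics import mean
from typing import Dict, List


def detect_latency_anomaly(
    latencies: List[float],
    baseline_latency: float = 2.0,
    anomaly_threshold_multiplier: float = 2.0,
    window_size: int = 10,
) -> List[Dict[str, object]]:
    """Flag the recent window if its mean latency exceeds baseline * multiplier."""
    window = latencies[-window_size:]
    if not window:
        return []
    threshold = baseline_latency * anomaly_threshold_multiplier
    average = mean(window)
    if average <= threshold:
        return []
    return [{
        "type": "latency_spike",
        # Assumption: more than twice the threshold counts as critical.
        "severity": "critical" if average > 2 * threshold else "warning",
        "message": f"Mean latency {average:.2f}s exceeds {threshold:.2f}s",
        "metric": average,
        "threshold": threshold,
    }]
```

With the defaults used in this notebook, the threshold is 2.0 × 2.0 = 4.0 seconds. Because this run logs only three requests, a window of ten simply covers all of them.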

## Section 9: Quality Assessment Analysis

The critique stage generates quantitative quality assessments across four dimensions (accuracy, completeness, clarity, and coherence) plus an overall quality score. This section extracts and analyzes these self-assessment scores to show how the pipeline evaluates its own outputs. Tracking quality scores over time helps identify when prompts or configurations need adjustment.

In [None]:
critique_text = results["critique"]
scores = parse_self_assessment(critique_text)

if scores:
    print("QUALITY ASSESSMENT SCORES")
    print("="*80)
    
    dimensions = ["accuracy", "completeness", "clarity", "coherence", "overall_quality"]
    for dimension in dimensions:
        if dimension in scores:
            score = scores[dimension]
            bar = "█" * int(score) + "░" * (10 - int(score))
            print(f"{dimension.replace('_', ' ').title():20s}: {score:.1f}/10 [{bar}]")
    
    # Plot every extracted score as a horizontal bar chart.
    fig, ax = plt.subplots(figsize=(10, 6))
    dimensions_display = [d.replace('_', ' ').title() for d in scores.keys()]
    values = list(scores.values())

    bars = ax.barh(dimensions_display, values, color='steelblue', alpha=0.7)
    ax.set_xlabel('Score (0-10 scale)')
    ax.set_title('Self-Assessment Quality Scores')
    ax.set_xlim(0, 10)
    ax.axvline(x=8.0, color='green', linestyle='--', label='Quality Threshold (8.0)')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Annotate each bar with its numeric score.
    for bar, value in zip(bars, values):
        ax.text(value + 0.1, bar.get_y() + bar.get_height()/2, f'{value:.1f}',
                va='center', fontweight='bold')

    plt.tight_layout()
    plt.show()
else:
    print("Quality scores could not be extracted from critique")
    print("This may indicate the critique format needs adjustment")

## Section 10: Export and Persistence

Production workflows require persistent storage of results for audit trails, analysis, and compliance requirements. This section demonstrates how to export pipeline outputs and monitoring data to various formats including JSON for structured storage and CSV for data analysis tools. The exported data maintains full fidelity with all metadata and performance metrics intact for downstream processing.

In [None]:
output_dir = project_root / "data" / "pipeline_outputs"
output_dir.mkdir(parents=True, exist_ok=True)

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

results_file = output_dir / f"pipeline_results_{timestamp}.json"
with open(results_file, "w", encoding="utf-8") as f:
    export_data = {
        "execution_timestamp": timestamp,
        "model": results["model"],
        "summary": results["summary"],
        "critique": results["critique"],
        "revised_summary": results["revised_summary"],
        "reflection": results["reflection"],
        "total_metrics": results["total_metrics"]
    }
    json.dump(export_data, f, indent=2, ensure_ascii=False)

print(f"Pipeline results exported to: {results_file}")

monitoring_file = output_dir / f"monitoring_logs_{timestamp}.json"
monitor.export_to_json(str(monitoring_file))
print(f"Monitoring logs exported to: {monitoring_file}")

df_monitor = monitor.export_to_dataframe()
csv_file = output_dir / f"monitoring_data_{timestamp}.csv"
df_monitor.to_csv(csv_file, index=False)
print(f"Monitoring data exported to: {csv_file}")

print("\nAll outputs exported successfully")
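
To confirm that an export round-trips without loss, the file can be reloaded and compared with the in-memory data. The helper below is a sketch, and `verify_export` is a hypothetical name. Note that JSON has no tuple type and requires string keys, so data containing tuples or non-string keys would not compare equal after reloading.

```python
import json
from pathlib import Path


def verify_export(path: Path, expected: dict) -> bool:
    """Reload an exported JSON file and compare it with the in-memory data."""
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    return reloaded == expected
```

Calling this check on `results_file` requires keeping a reference to the exported dictionary (`export_data` in the cell above).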

## Section 11: Summary and Next Steps

This demonstration has shown the complete workflow for executing the Self-Critique Chain Pipeline, including initialization, execution, monitoring, and analysis. The pipeline implements an iterative refinement process that is designed to produce higher-quality summaries than single-shot approaches through systematic self-evaluation and targeted revision. This notebook does not measure that improvement directly; the self-assessment scores and the comparison of initial and revised summaries are the evidence it provides.

### Key Takeaways

The three-stage architecture with optimized temperature settings balances factual accuracy with creative analysis. Comprehensive metrics collection at each stage enables performance monitoring and optimization decisions. The self-critique mechanism provides quantitative quality assessments across multiple dimensions. Anomaly detection capabilities support proactive identification of performance issues. Export functionality preserves full audit trails for compliance and analysis requirements.

### Recommended Next Steps

For production deployment, integrate the pipeline with MLflow for comprehensive experiment tracking and model versioning. Implement the FastAPI endpoints to expose functionality through RESTful interfaces with proper authentication. Configure continuous monitoring with alert routing to appropriate channels for operational awareness. Establish baseline performance metrics and quality thresholds based on your specific use case requirements. Develop automated testing procedures for prompt templates and pipeline configuration changes.

### Additional Resources

The project repository contains comprehensive documentation including API endpoint specifications, configuration options, and deployment guides. Example scripts demonstrate integration patterns for common workflows and use cases. The test suite provides examples of proper mocking and assertion strategies. Configuration templates offer starting points for different deployment scenarios.

For questions or issues, consult the project documentation or open an issue on the GitHub repository with detailed information about your environment and the observed behavior.