# DSPy Benchmark: CLI Summarization

This notebook benchmarks two small language models (SLMs), `phi3:mini` and `llama3:8b`, for a command-line interface (CLI) text summarization task. 

We will perform a grid search to evaluate the models across different temperature settings, measuring both the quality (accuracy) and speed (latency) of their summaries.
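
As a rough illustration of what the grid looks like, the sketch below enumerates every model and temperature pair with `itertools.product`. The lists here are stand-ins that mirror the values configured later in the notebook; the variable names are illustrative only.

```python
from itertools import product

# Stand-in lists mirroring the models and temperatures used in this notebook.
model_names = ["phi3:mini", "llama3:8b"]
temperatures = [0.0, 0.5, 1.0]

# Each (model, temperature) pair is one configuration in the grid.
grid = list(product(model_names, temperatures))

for model_name, temperature in grid:
    print(f"{model_name} @ temperature={temperature}")
```

With two models and three temperatures, the grid contains six configurations.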

**Objective:** Identify the optimal model and temperature configuration for generating concise and faithful summaries of long text passages.

In [None]:
import dspy
import pandas as pd
import time
import warnings

# Suppress verbose warnings from libraries
warnings.filterwarnings('ignore')

## 1. Configure Language Models

We define the two local models to be benchmarked using `dspy.OllamaLocal`. This assumes you have Ollama running and have pulled both `phi3:mini` and `llama3:8b`.

We also configure a more powerful model (e.g., from OpenAI) to act as the judge for our quality metric. **Remember to set your `OPENAI_API_KEY` environment variable for the evaluator to work.**
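
Whether constructing the OpenAI client fails when the key is missing may depend on the DSPy version, so an explicit check before configuring the evaluator can be useful. The sketch below is one way to do that; `has_openai_key` is a hypothetical helper, not part of DSPy.

```python
import os

def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty."""
    # `env` defaults to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not has_openai_key():
    print("OPENAI_API_KEY is not set; consider using the local fallback evaluator.")
```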

In [None]:
# Configure the local models for the grid search
phi3_mini = dspy.OllamaLocal(model='phi3:mini', max_tokens=1024)
llama3_8b = dspy.OllamaLocal(model='llama3:8b', max_tokens=1024)

# Configure a more powerful model to act as the evaluator for our metric
# Make sure your OPENAI_API_KEY is set in your environment variables
try:
    evaluator_lm = dspy.OpenAI(model='gpt-4o-mini', max_tokens=2048, model_type='chat')
    dspy.configure(lm=llama3_8b, rm=None, evaluator=evaluator_lm)
    print("Successfully configured models. Using Llama-3-8B as default and GPT-4o-mini as evaluator.")
except Exception as e:
    print(f"Error configuring OpenAI model: {e}")
    print("Please ensure your OPENAI_API_KEY is set as an environment variable.")
    # Fallback for evaluator if OpenAI key is not set
    evaluator_lm = llama3_8b 
    dspy.configure(lm=llama3_8b, rm=None, evaluator=evaluator_lm)
    print("Using Llama-3-8B as a fallback evaluator. Accuracy metric may be less reliable.")

models_to_test = {
    "phi3:mini": phi3_mini,
    "llama3:8b": llama3_8b
}

temperature_values = [0.0, 0.5, 1.0]

## 2. Define Data & Summarization Signature

We'll create a simple dataset of long-form text passages. In a real-world scenario, you would load a more extensive and representative dataset. 

The `SummarizationSignature` defines the input (`document`) and output (`summary`) for our task.

In [None]:
# Define the DSPy Signature for the summarization task
class SummarizationSignature(dspy.Signature):
    """Summarize the given document into a short, concise paragraph."""
    document = dspy.InputField(desc="A long text document.")
    summary = dspy.OutputField(desc="A concise summary of the document.")

# Create a small example dataset
trainset = [
    dspy.Example(document="Quantum computing is a revolutionary type of computing that leverages the principles of quantum mechanics to process information. Unlike classical computers, which use bits to represent information as either a 0 or a 1, quantum computers use qubits. Qubits can exist in a superposition of both 0 and 1 simultaneously, and they can be entangled, meaning their fates are linked even when physically separated. This allows quantum computers to perform complex calculations at speeds unattainable by classical computers. Key applications include drug discovery, materials science, financial modeling, and breaking cryptographic codes. However, building and maintaining stable quantum computers is a massive engineering challenge due to qubit decoherence, where qubits lose their quantum properties due to interaction with the environment.").with_inputs('document'),
    dspy.Example(document="The Roman Republic was a period of ancient Roman civilization that began with the overthrow of the Roman Kingdom in 509 BC and ended in 27 BC with the establishment of the Roman Empire. It was characterized by a republican form of government, where annual magistrates were elected by the citizens. The cornerstone of the Republic was the Senate, a body of elder statesmen that advised the magistrates. Society was divided into patricians (the aristocracy) and plebeians (the common citizens). For centuries, the Republic expanded its territory through conquest and alliances, eventually controlling the entire Mediterranean basin. However, internal tensions, civil wars, and the rise of powerful military leaders like Julius Caesar ultimately led to its collapse and the rise of Augustus as the first Roman Emperor.").with_inputs('document'),
    dspy.Example(document="Artificial neural networks (ANNs) are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it. The 'signal' at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. These networks are trained on large datasets, adjusting the connection weights to perform tasks like image recognition, natural language processing, and decision making, forming the foundation of modern deep learning.").with_inputs('document')
]

## 3. Define Evaluation Metrics

We need two key metrics:

1.  **Latency:** A simple wrapper to measure the wall-clock time for generating a summary.
2.  **Accuracy (Quality):** A more complex, LLM-based metric. We ask our `evaluator_lm` to rate the summary on two criteria: **faithfulness** (does it accurately reflect the source?) and **conciseness** (is it brief and to the point?). The scores are combined for a final quality rating.
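
To make the combined score concrete, here is a minimal sketch of how the two ratings could be pulled out of a judge response and averaged. It uses regular expressions rather than string splitting so that trailing punctuation or a `4/5` style answer still parses. `parse_scores` and `combined_quality` are hypothetical helper names, and the expected response format is an assumption based on the prompt used in this notebook.

```python
import re

def parse_scores(text):
    """Return (faithfulness, conciseness) as ints, or None if either is missing."""
    # Allow optional Markdown bold markers and whitespace between label and digit.
    faith = re.findall(r"Faithfulness:\s*\**\s*([1-5])\b", text)
    conc = re.findall(r"Conciseness:\s*\**\s*([1-5])\b", text)
    if not faith or not conc:
        return None
    # Use the last occurrence, since reasoning text may mention the labels earlier.
    return int(faith[-1]), int(conc[-1])

def combined_quality(text):
    """Average the two scores, or return None when parsing fails."""
    scores = parse_scores(text)
    return None if scores is None else sum(scores) / 2.0

print(combined_quality("Faithfulness: 4, Conciseness: 5"))
```

Returning `None` on a parse failure, instead of a default score, lets the caller decide whether to skip the example or substitute a neutral value.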

In [None]:
def latency_metric(model, data_point):
    """Measures the wall-clock time taken by a model to make a prediction."""
    # perf_counter is monotonic and higher resolution than time.time for intervals
    start_time = time.perf_counter()
    model(document=data_point.document)
    end_time = time.perf_counter()
    return end_time - start_time

def accuracy_metric(gold, pred, trace=None):
    """Uses an LLM to evaluate the quality of a generated summary."""
    document = gold.document
    summary = pred.summary
    
    # Ask for step-by-step reasoning inside the prompt to get a more reasoned evaluation
    eval_prompt = f"""Evaluate the quality of the summary for the given document.
    
    **Document:**
    {document}
    
    **Summary:**
    {summary}
    
    **Instructions:** Please rate the summary on two criteria:
    1. **Faithfulness:** How accurately does the summary reflect the main points of the document? (1-5, 5 is best)
    2. **Conciseness:** How brief and to-the-point is the summary? (1-5, 5 is best)
    
    First, think step-by-step about the faithfulness and conciseness. Then, provide your scores in the format 'Faithfulness: <score>, Conciseness: <score>'."""

    with dspy.settings.context(lm=evaluator_lm):
        response = dspy.Predict(dspy.Signature("evaluation_prompt -> evaluation_output"))(evaluation_prompt=eval_prompt).evaluation_output
    
    try:
        # Extract scores from the response
        faithfulness_score = int(response.split("Faithfulness:")[-1].split(",")[0].strip())
        conciseness_score = int(response.split("Conciseness:")[-1].strip())
        # Average the scores for a final quality rating
        return (faithfulness_score + conciseness_score) / 2.0
    except (ValueError, IndexError):
        # If parsing fails, return the midpoint of the 1-5 scale as a neutral score
        return 3.0

## 4. Run Grid Search

This is the core of the notebook. We'll iterate through each model and temperature setting, configure DSPy accordingly, and run our evaluation metrics on the dataset. The results are stored in a list for later analysis.
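
One design point worth considering: if latency is measured on one call and quality on a second call, the summary that was timed is not necessarily the one that gets scored. A small helper that returns both the result and the elapsed time avoids this. The sketch below is an assumption about how that could look; `timed_call` and `fake_summarizer` are hypothetical names, and the stand-in function takes the place of the DSPy module.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in for a summarizer; a real run would pass the DSPy module instead.
def fake_summarizer(document):
    return document[:20]

summary, seconds = timed_call(fake_summarizer, document="A long passage of text to shorten.")
print(summary, f"({seconds:.6f}s)")
```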

In [None]:
results = []
total_configs = len(models_to_test) * len(temperature_values)
current_config = 0

print(f"Starting grid search across {total_configs} configurations...")

# Define a simple summarization module
summarizer_module = dspy.Predict(SummarizationSignature)

for model_name, model_lm in models_to_test.items():
    for temp in temperature_values:
        current_config += 1
        print(f"\n--- [ {current_config}/{total_configs} ] ---")
        print(f"Testing Model: {model_name}, Temperature: {temp}")
        
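        # Set the temperature on the LM itself; a `temperature` key passed to
        # dspy.settings is not known to reach the model. This assumes the client
        # exposes a mutable `kwargs` dict, as the older DSPy LM classes do.
        model_lm.kwargs['temperature'] = temp
        dspy.settings.configure(lm=model_lm)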

        total_latency = 0
        total_accuracy = 0
        num_examples = len(trainset)

        for i, example in enumerate(trainset):
            print(f"  Processing example {i+1}/{num_examples}...", end='')
            # Measure latency
            total_latency += latency_metric(summarizer_module, example)

            # Measure accuracy
            prediction = summarizer_module(document=example.document)
            total_accuracy += accuracy_metric(example, prediction)
        
        avg_latency = total_latency / num_examples
        avg_accuracy = total_accuracy / num_examples

        print(f"\n  Avg Latency: {avg_latency:.4f} seconds")
        print(f"  Avg Accuracy: {avg_accuracy:.4f} / 5.0")
        
        results.append({
            'model': model_name,
            'temperature': temp,
            'avg_accuracy': avg_accuracy,
            'avg_latency_s': avg_latency
        })

print("\nGrid search complete!")

## 5. Save and Display Ranked Results

Finally, we'll convert our results into a Pandas DataFrame, rank them (prioritizing higher accuracy, then lower latency), and save the output to `benchmark_results.csv`.
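
The ranking rule can also be expressed without pandas, which makes the tie-breaking explicit. The rows below are made-up values for illustration only, not benchmark results, and the model names are placeholders.

```python
# Made-up rows for illustration only; these are not measured results.
rows = [
    {"model": "model-x", "temperature": 0.0, "avg_accuracy": 4.0, "avg_latency_s": 2.0},
    {"model": "model-y", "temperature": 0.0, "avg_accuracy": 4.5, "avg_latency_s": 3.0},
    {"model": "model-x", "temperature": 0.5, "avg_accuracy": 4.0, "avg_latency_s": 1.5},
]

# Negate accuracy so higher is ranked first; latency breaks ties in ascending order.
ranked = sorted(rows, key=lambda r: (-r["avg_accuracy"], r["avg_latency_s"]))

for rank, row in enumerate(ranked, start=1):
    print(rank, row["model"], row["temperature"], row["avg_accuracy"], row["avg_latency_s"])
```

Here the most accurate configuration comes first, and the two configurations tied on accuracy are ordered by the faster one.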

In [None]:
# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Rank the results: higher accuracy is better, lower latency is better
ranked_df = results_df.sort_values(by=['avg_accuracy', 'avg_latency_s'], ascending=[False, True])

# Reset index for clean presentation
ranked_df.reset_index(drop=True, inplace=True)
ranked_df.index += 1 # Start index at 1 for ranking

# Save to CSV
output_filename = 'benchmark_results.csv'
ranked_df.to_csv(output_filename, index_label='rank')

print(f"Results saved to {output_filename}")

# Display the final ranked table
print("\n--- Final Ranked Results ---")
# Note: DataFrame.to_markdown requires the optional `tabulate` package
print(ranked_df.to_markdown(index=True))