# Comparative Analysis of Prompt Engineering Techniques

This notebook compares different prompt engineering techniques on the same tasks to demonstrate their relative effectiveness.

In [None]:
import json
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display

# This would typically import the API for whatever LLM service you're using
# For example: import openai

## 1. Define Test Tasks

We'll define several tasks that we want to test with different prompting techniques.

In [None]:
test_tasks = {
    "classification": {
        "description": "Classify the sentiment of product reviews",
        "content": [
            "This camera exceeded all my expectations. The image quality is outstanding and battery life is excellent.",
            "Decent product for the price, but I've seen better. Delivery was quick though.",
            "Complete waste of money. Broke after two uses and customer service was unhelpful."
        ],
        "expected": ["positive", "neutral", "negative"]
    },
    "summarization": {
        "description": "Summarize a technical article about quantum computing",
        "content": "Quantum computing leverages the principles of quantum mechanics to process information in ways that classical computers cannot. Unlike classical bits that exist as either 0 or 1, quantum bits or 'qubits' can exist in a superposition of states. This property allows quantum computers to explore multiple solutions simultaneously, potentially solving certain problems exponentially faster than classical computers. Recent advancements include Google's claim of quantum supremacy in 2019, where their 53-qubit Sycamore processor performed a specific task in 200 seconds that would take the world's most powerful supercomputer approximately 10,000 years. However, challenges remain, including quantum decoherence—the loss of quantum information due to environmental interaction—and high error rates. Researchers are developing quantum error correction methods and more stable qubit designs. Potential applications include cryptography, optimization problems, drug discovery, and material science simulations. While practical, fault-tolerant quantum computers are likely years away, investments from major technology companies and governments worldwide signal strong confidence in the technology's future impact.",
        "criteria": ["accuracy", "conciseness", "completeness"]
    },
    "reasoning": {
        "description": "Solve a multi-step logical reasoning problem",
        "content": "In a race with 5 people (Alice, Bob, Charlie, David, and Emma), we know the following facts:\n1. Alice finished before Bob\n2. Charlie finished after David\n3. Emma finished after Alice but before Bob\n4. Bob finished before David\nWhat is the order in which they finished the race?",
        "expected": "Alice, Emma, Bob, David, Charlie"
    }
}
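
A reasoning puzzle only works as a benchmark if exactly one order satisfies its constraints, and that is easy to check by brute force. The sketch below is illustrative and not part of the original notebook: `valid_orders` is a hypothetical helper that keeps constraints 1, 2, and 4 fixed and varies the third, comparing "Emma between Alice and Charlie" with the tighter "Emma between Alice and Bob".

```python
from itertools import permutations

RUNNERS = ["Alice", "Bob", "Charlie", "David", "Emma"]


def valid_orders(between_pair):
    """Return every finishing order that satisfies the race constraints.

    `between_pair` names the two runners Emma must finish between, so the
    third constraint can be varied while the other three stay fixed.
    """
    solutions = []
    for order in permutations(RUNNERS):
        pos = {name: i for i, name in enumerate(order)}
        lo, hi = sorted(pos[name] for name in between_pair)
        if (
            pos["Alice"] < pos["Bob"]          # 1. Alice before Bob
            and pos["David"] < pos["Charlie"]  # 2. Charlie after David
            and lo < pos["Emma"] < hi          # 3. Emma between the pair
            and pos["Bob"] < pos["David"]      # 4. Bob before David
        ):
            solutions.append(", ".join(order))
    return solutions


loose = valid_orders(("Alice", "Charlie"))
tight = valid_orders(("Alice", "Bob"))
print(len(loose), len(tight))  # prints: 3 1
```

With Emma only required to finish between Alice and Charlie, three orders are valid. Requiring her to finish between Alice and Bob leaves only "Alice, Emma, Bob, David, Charlie".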

## 2. Define Different Prompt Techniques

For each task, we'll create several different prompt formats using various techniques.

In [None]:
prompt_techniques = {
    "classification": {
        "zero_shot": "Classify the sentiment of the following product review as positive, negative, or neutral:\n{content}",
        
        "few_shot": """Classify the sentiment of the following product review as positive, negative, or neutral.

Examples:
Review: "I love this product! It works perfectly and arrived ahead of schedule."
Sentiment: positive

Review: "It's fine but nothing special. Does the job but the design could be better."
Sentiment: neutral

Review: "Terrible quality and arrived damaged. Don't waste your money."
Sentiment: negative

Now classify this review:
{content}""",
        
        "chain_of_thought": """Classify the sentiment of the following product review as positive, negative, or neutral.

Let's analyze this step by step:
1. First, identify any positive aspects mentioned in the review
2. Next, identify any negative aspects mentioned in the review
3. Consider the overall tone and the balance between positive and negative points
4. Based on this analysis, determine if the sentiment is positive, negative, or neutral

Review: {content}""",
        
        "role": """You are an expert data analyst specializing in sentiment analysis of customer feedback with 15 years of experience. Your analyses are known for their accuracy and nuance.

Please analyze the sentiment of the following product review, classifying it as positive, negative, or neutral. Consider both explicit statements and implicit tone.

Review: {content}"""
    },
    
    "summarization": {
        "basic": "Summarize the following article about quantum computing:\n{content}",
        
        "length_controlled": "Provide a 2-3 sentence summary of the following article about quantum computing:\n{content}",
        
        "structured": """Summarize the following article about quantum computing in a structured format with these sections:
1. Core concept (1 sentence)
2. Recent advancements (1 sentence)
3. Current challenges (1 sentence)
4. Future applications (1 sentence)

Article: {content}""",
        
        "audience_specific": """Summarize the following article about quantum computing for a high school student with no background in physics or computer science. Use simple language and helpful analogies.

Article: {content}"""
    },
    
    "reasoning": {
        "direct": "Solve this problem:\n{content}",
        
        "chain_of_thought": """Solve this problem by reasoning step by step:
{content}""",
        
        "structured": """Let's solve this logical reasoning problem systematically:

Problem: {content}

1. First, let's identify all the constraints in the problem
2. Next, let's determine what these constraints tell us about relative positions
3. Let's try different arrangements until we find one that satisfies all constraints
4. Finally, let's verify our answer by checking it against each original constraint

Solution:""",
        
        "visual": """Solve this problem by creating a visual representation:

Problem: {content}

First, imagine a race track with 5 positions (1st through 5th place).
Let's represent each constraint visually, and then combine these to find the only valid arrangement.
Draw out the possibilities and eliminate invalid arrangements until you reach the correct solution."""
    }
}
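
Every template above is filled with `str.format(content=...)`, so a misspelled or missing placeholder only surfaces when the loop runs. A minimal sketch of a pre-flight check follows, assuming each template should use exactly one `{content}` field. `placeholders`, `check_templates`, and the `demo` dictionary are hypothetical and only illustrate the idea.

```python
from string import Formatter


def placeholders(template):
    """Return the set of named fields that str.format would fill in."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_templates(techniques, required="content"):
    """List (task, technique) pairs whose template does not use exactly `required`."""
    problems = []
    for task, templates in techniques.items():
        for name, template in templates.items():
            if placeholders(template) != {required}:
                problems.append((task, name))
    return problems


# Hypothetical templates standing in for prompt_techniques
demo = {
    "classification": {
        "zero_shot": "Classify this review:\n{content}",
        "broken": "Classify this review:\n{review}",
    }
}
print(check_templates(demo))  # prints: [('classification', 'broken')]
```

Running `check_templates(prompt_techniques)` before any API call would flag a template with a wrong field name without spending tokens.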

## 3. Run Tests

Now we'll test each technique on its corresponding task. In a real implementation, this would call an LLM API.

In [None]:
def call_llm_api(prompt):
    """This would call the actual LLM API with the given prompt.
    For this notebook, we'll just return a placeholder."""
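    # In a real implementation, this would be something like the following
    # (openai>=1.0 client, which reads OPENAI_API_KEY from the environment):
    # client = openai.OpenAI()
    # response = client.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}]
    # )
    # return response.choices[0].message.content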
    
    return "[This is where the LLM response would appear. In a real implementation, this function would call an actual LLM API.]"

# Results will be stored in this dictionary
results = {}

# For each task and corresponding techniques
for task_name, task_info in test_tasks.items():
    results[task_name] = {}
    
    # For each prompt technique for this task
    for technique_name, prompt_template in prompt_techniques[task_name].items():
        
        # For classification, we'll test each content example
        if task_name == "classification":
            technique_results = []
            for content in task_info["content"]:
                prompt = prompt_template.format(content=content)
                response = call_llm_api(prompt)
                technique_results.append(response)
            results[task_name][technique_name] = technique_results
        else:
            # For other tasks, just one content
            prompt = prompt_template.format(content=task_info["content"])
            response = call_llm_api(prompt)
            results[task_name][technique_name] = response
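
The loop above special-cases classification because it is the only task with a list of inputs. One possible alternative, sketched below under stated assumptions, wraps single inputs in a one-item list so every task shares one code path, and takes the model call as a parameter so a stub can stand in for the API. `run_techniques`, `fake_llm`, and the demo dictionaries are hypothetical. Unlike the loop above, this version returns a list of responses for every task, so downstream code written for the original shape would need adjusting.

```python
def run_techniques(tasks, techniques, llm):
    """Run every technique on every input and return lists of responses.

    `llm` is any callable that maps a prompt string to a response string,
    so a stub can be swapped for a real API client.
    """
    results = {}
    for task_name, task_info in tasks.items():
        content = task_info["content"]
        # Treat a single string as a one-item list so all tasks share one path
        inputs = content if isinstance(content, list) else [content]
        results[task_name] = {
            technique: [llm(template.format(content=item)) for item in inputs]
            for technique, template in techniques[task_name].items()
        }
    return results


# Hypothetical stub that reports the prompt length instead of calling a model
def fake_llm(prompt):
    return f"{len(prompt)} chars"


demo_tasks = {
    "classification": {"content": ["Great!", "Awful."]},
    "reasoning": {"content": "Who won?"},
}
demo_techniques = {
    "classification": {"zero_shot": "Classify: {content}"},
    "reasoning": {"direct": "Solve: {content}"},
}
demo_results = run_techniques(demo_tasks, demo_techniques, fake_llm)
print(demo_results)
```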

## 4. Evaluate Results

In a complete implementation, we would evaluate each response against our expected outcomes.

In [None]:
def evaluate_response(task_name, technique_name, response, expected):
    """Return placeholder scores for a response.

    The values are random, so they say nothing about how the techniques
    actually compare. A real implementation would score `response`
    against `expected`."""
    # Placeholder only: replace with task-specific scoring
    import random
    return {
        "accuracy": random.uniform(0.7, 1.0),
        "relevance": random.uniform(0.6, 1.0),
        "completeness": random.uniform(0.5, 1.0)
    }

evaluation_results = {}

for task_name, task_results in results.items():
    evaluation_results[task_name] = {}
    
    for technique_name, response in task_results.items():
        if task_name == "classification":
            # For classification, evaluate each example
            scores = []
            for i, resp in enumerate(response):
                score = evaluate_response(task_name, technique_name, resp, test_tasks[task_name]["expected"][i])
                scores.append(score)
            # Average the scores
            avg_scores = {}
            for metric in scores[0].keys():
                avg_scores[metric] = sum(score[metric] for score in scores) / len(scores)
            evaluation_results[task_name][technique_name] = avg_scores
        else:
            # For other tasks
            expected = test_tasks[task_name].get("expected")
            evaluation_results[task_name][technique_name] = evaluate_response(task_name, technique_name, response, expected)
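
For the two tasks that have an `expected` value, scoring can be done without a judge model. The sketch below is one heuristic approach, not the notebook's method: it takes the last sentiment label mentioned in a response as the prediction, and treats a reasoning answer as correct when the final mentions of the runners' names match the expected order. `extract_label`, `classification_accuracy`, `reasoning_correct`, and the sample responses are hypothetical.

```python
import re

LABELS = ("positive", "negative", "neutral")
LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b")


def extract_label(response):
    """Return the last sentiment label mentioned in the response, or None."""
    matches = LABEL_RE.findall(response.lower())
    return matches[-1] if matches else None


def classification_accuracy(responses, expected):
    """Fraction of responses whose extracted label equals the expected label."""
    hits = sum(extract_label(r) == e for r, e in zip(responses, expected))
    return hits / len(expected)


def reasoning_correct(response, expected_order):
    """True if the final name mentions in the response match the expected order."""
    names = [n.strip() for n in expected_order.split(",")]
    pattern = r"\b(" + "|".join(map(re.escape, names)) + r")\b"
    found = re.findall(pattern, response)
    # Only the last len(names) mentions count, where the answer usually sits
    return found[-len(names):] == names


# Hypothetical model responses
sample = [
    "Sentiment: positive",
    "There are positive and negative points, so overall: neutral",
    "I cannot tell.",
]
acc = classification_accuracy(sample, ["positive", "neutral", "negative"])
ok = reasoning_correct(
    "Alice is before Bob. Final order: Alice, Emma, Bob, David, Charlie",
    "Alice, Emma, Bob, David, Charlie",
)
print(f"{acc:.2f}", ok)
```

Taking the last label matters for chain-of-thought responses, which often mention several labels before settling on one. The heuristic can misfire if the response adds commentary after the answer, so a stricter output format in the prompt makes it more reliable.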

## 5. Visualize Results

Let's create some visualizations to compare the effectiveness of different techniques.

In [None]:
# Grouped bar chart: one group per technique, one bar per metric

def plot_task_results(task_name, evaluation_results):
    """Plot the evaluation results for a specific task."""
    task_eval = evaluation_results[task_name]
    
    # Prepare data for plotting
    techniques = list(task_eval.keys())
    metrics = list(task_eval[techniques[0]].keys())
    
    data = []
    for technique in techniques:
        for metric in metrics:
            data.append({
                'Technique': technique,
                'Metric': metric,
                'Score': task_eval[technique][metric]
            })
    
    df = pd.DataFrame(data)
    
    # Create plot
    plt.figure(figsize=(12, 6))
    sns.barplot(x='Technique', y='Score', hue='Metric', data=df)
    plt.title(f'Comparison of Prompt Techniques for {task_name.capitalize()} Task')
    plt.ylim(0, 1)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Plot results for each task
for task_name in evaluation_results.keys():
    plot_task_results(task_name, evaluation_results)
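
Alongside the charts, a plain numeric ranking can make small differences easier to read. A minimal sketch follows, assuming each technique maps to a dictionary of metric scores as in `evaluation_results[task_name]`. `rank_techniques` and the `demo_eval` scores are hypothetical.

```python
from statistics import mean


def rank_techniques(task_eval):
    """Return (technique, mean score) pairs sorted from best to worst."""
    averages = {name: mean(scores.values()) for name, scores in task_eval.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores in the same shape as evaluation_results[task_name]
demo_eval = {
    "zero_shot": {"accuracy": 0.5, "relevance": 0.75, "completeness": 0.25},
    "few_shot": {"accuracy": 1.0, "relevance": 0.75, "completeness": 0.5},
}
for name, score in rank_techniques(demo_eval):
    print(f"{name:<12} {score:.2f}")
```

An unweighted mean treats all metrics as equally important. A weighted average may suit tasks where, for example, accuracy matters more than completeness.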

## 6. Key Insights

The scores in this notebook are random placeholders, so the points below are hypotheses to test once real model responses and task-specific scoring are in place, not findings from this run:

### Classification
- **Few-shot prompting** is expected to outperform zero-shot for classification tasks
- **Chain-of-thought** may help with more nuanced or borderline cases
- **Role prompting** can improve accuracy when domain expertise is relevant

### Summarization
- **Structured prompts** tend to produce more comprehensive summaries
- **Length-controlled prompts** improve conciseness but may sacrifice some details
- **Audience-specific prompts** can improve clarity for target readers

### Reasoning
- **Chain-of-thought** is expected to improve accuracy on multi-step reasoning problems
- **Visual framing** may help with spatial or sequential reasoning tasks
- **Structured approaches** that break down complex problems should lead to more reliable solutions

### General Patterns
- More detailed and structured prompts generally lead to better results
- Tailoring the prompt technique to the specific task type can yield significant improvements
- Explicit step-by-step reasoning tends to improve performance on complex tasks
- Adding examples (few-shot approach) tends to help with classification and categorization

## 7. Best Practices

The table below suggests starting points for different task types. Content creation and data analysis were not among the tasks tested above.

| Task Type | Recommended Techniques | Key Elements to Include |
|-----------|------------------------|------------------------|
| Classification | Few-shot, Chain-of-thought | Examples for each category, Evaluation criteria |
| Summarization | Structured, Length-controlled | Specific format, Target length, Key aspects to include |
| Reasoning | Chain-of-thought, Visual framing | Step-by-step approach, Visualization hints |
| Content Creation | Role, Format specification | Target audience, Tone, Structure outline |
| Data Analysis | Structured, Chain-of-thought | Specific questions to answer, Analysis steps |

Remember that the best technique often depends on the specific context and complexity of your task!