# Prompt Engineering for AG News Text Classification

## Overview

This notebook demonstrates prompt engineering techniques for AG News classification, following methodologies from:
- Brown et al. (2020): "Language Models are Few-Shot Learners"
- Wei et al. (2022): "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"
- Sanh et al. (2022): "Multitask Prompted Training Enables Zero-Shot Task Generalization"

### Tutorial Objectives
1. Design effective prompts for text classification
2. Implement zero-shot and few-shot learning
3. Apply chain-of-thought reasoning
4. Create instruction-based prompts
5. Optimize prompt templates
6. Evaluate prompt-based models

Author: Võ Hải Dũng  
Email: vohaidung.work@gmail.com  
Date: 2025

## 1. Environment Setup

In [None]:
# Standard library imports
import sys
import json
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any
import warnings

# Data and ML imports
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# Project imports
PROJECT_ROOT = Path("../..").resolve()
sys.path.insert(0, str(PROJECT_ROOT))

from src.data.datasets.ag_news import AGNewsDataset, AGNewsConfig
from src.data.datasets.prompted_dataset import PromptedDataset, PromptConfig
from src.data.preprocessing.prompt_formatter import PromptFormatter, PromptTemplate
from src.models.prompt_based.prompt_model import PromptBasedClassifier
from src.models.prompt_based.soft_prompt import SoftPromptModel
from src.models.prompt_based.template_manager import TemplateManager
from src.utils.prompt_utils import (
    create_zero_shot_prompt,
    create_few_shot_prompt,
    create_instruction_prompt,
    optimize_prompt_template
)
from src.utils.reproducibility import set_seed
from src.utils.logging_config import setup_logging
from configs.config_loader import ConfigLoader
from configs.constants import AG_NEWS_CLASSES, AG_NEWS_NUM_CLASSES, DATA_DIR

# Setup
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
set_seed(42)
logger = setup_logging('prompt_engineering_tutorial')

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"AG News Classes: {AG_NEWS_CLASSES}")

## 2. Load Configuration

In [None]:
# Load prompt engineering configuration
config_loader = ConfigLoader()

# Load prompt-based training config
prompt_config = config_loader.load_config('training/advanced/prompt_based_tuning.yaml')

# Tutorial configuration
tutorial_config = {
    'max_samples': 500,
    'batch_size': 4,
    'max_length': 512,
    'num_shots': 3,  # For few-shot learning
    'model_name': 'microsoft/deberta-v3-base',  # Encoder-only; generating text completions needs a causal or seq2seq LM instead
    'use_soft_prompts': False,
    'prompt_length': 10,  # For soft prompts
    'template_optimization_steps': 5
}

print("Prompt Engineering Configuration:")
print("="*50)
for key, value in tutorial_config.items():
    print(f"  {key}: {value}")

## 3. Data Loading and Preparation

In [None]:
# Load AG News dataset
data_config = AGNewsConfig(
    data_dir=DATA_DIR / "processed",
    max_samples=tutorial_config['max_samples'],
    use_cache=True
)

print("Loading datasets...")
train_dataset = AGNewsDataset(data_config, split="train")
val_dataset = AGNewsDataset(data_config, split="validation")

print(f"\nDataset loaded:")
print(f"  Train samples: {len(train_dataset)}")
print(f"  Validation samples: {len(val_dataset)}")

# Sample data for demonstration
sample_idx = 0
sample_text = train_dataset.texts[sample_idx]
sample_label = train_dataset.labels[sample_idx]

print(f"\nSample data:")
print(f"  Label: {sample_label} ({AG_NEWS_CLASSES[sample_label]})")
print(f"  Text: {sample_text[:200]}...")

## 4. Zero-Shot Prompting

In [None]:
# Initialize template manager
template_manager = TemplateManager()

# Define zero-shot templates
zero_shot_templates = [
    {
        'name': 'direct_classification',
        'template': """Classify the following news article into one of these categories: World, Sports, Business, or Science/Technology.

Article: {text}

Category:"""
    },
    {
        'name': 'question_based',
        'template': """What category does this news article belong to? Choose from: World, Sports, Business, or Science/Technology.

"{text}"

The category is:"""
    },
    {
        'name': 'task_description',
        'template': """Task: Categorize news articles into appropriate sections.
Categories: World | Sports | Business | Science/Technology

Article to categorize:
{text}

This article belongs to the category:"""
    }
]

# Register templates
for template_dict in zero_shot_templates:
    template_manager.register_template(
        name=template_dict['name'],
        template=template_dict['template'],
        template_type='zero_shot'
    )

# Test zero-shot prompts
print("Zero-Shot Prompt Examples:")
print("="*50)

for i, template_dict in enumerate(zero_shot_templates[:2]):
    prompt = template_dict['template'].format(text=sample_text[:150])
    print(f"\nTemplate {i+1} ({template_dict['name']}):")
    print("-"*40)
    print(prompt)
    print()
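
The templates above end with an open slot such as `Category:`, so whatever the model generates still has to be mapped back to a class index. The following is a minimal sketch of that step, not the project's own verbalizer: the `CLASSES` order and the `ALIASES` table are assumptions made for illustration and may differ from `AG_NEWS_CLASSES`.

```python
from typing import Optional

# Assumed label order; the notebook's AG_NEWS_CLASSES constant may differ.
CLASSES = ["World", "Sports", "Business", "Science/Technology"]

# Hypothetical aliases a model might emit for each category.
ALIASES = {
    "world": 0,
    "sports": 1,
    "sport": 1,
    "business": 2,
    "science/technology": 3,
    "sci/tech": 3,
    "science": 3,
    "technology": 3,
}


def parse_category(completion: str) -> Optional[int]:
    """Map a free-text completion to a class index, or None if nothing matches."""
    # Only the first non-empty line is used: models often keep generating
    # after the answer (for example, by starting a new "Article:" block).
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    if not lines:
        return None
    answer = lines[0].lower().strip(" .:\"'")
    # Try longer aliases first so "science/technology" wins over "science".
    for alias in sorted(ALIASES, key=len, reverse=True):
        if answer.startswith(alias):
            return ALIASES[alias]
    return None


print(parse_category(" Sports\nArticle: ..."))
print(parse_category("I am not sure"))
```

Returning `None` for unparseable completions keeps them visible as a separate failure mode instead of silently counting them as a wrong class.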

## 5. Few-Shot Prompting

In [None]:
def create_few_shot_examples(
    dataset: AGNewsDataset,
    num_shots: int = 3,
    balanced: bool = True
) -> List[Dict[str, Any]]:
    """
    Create few-shot examples for prompting.
    
    Following few-shot learning principles from:
        Brown et al. (2020): "Language Models are Few-Shot Learners"
    """
    examples = []
    
    if balanced:
        # Get balanced examples across classes
        for class_idx in range(AG_NEWS_NUM_CLASSES):
            class_samples = [
                (text, label) for text, label in zip(dataset.texts, dataset.labels)
                if label == class_idx
            ][:num_shots]
            
            for text, label in class_samples:
                examples.append({
                    'text': text[:200],  # Truncate for brevity
                    'label': AG_NEWS_CLASSES[label]
                })
    else:
        # Random sampling
        indices = np.random.choice(len(dataset), num_shots * AG_NEWS_NUM_CLASSES, replace=False)
        for idx in indices:
            examples.append({
                'text': dataset.texts[idx][:200],
                'label': AG_NEWS_CLASSES[dataset.labels[idx]]
            })
    
    return examples

# Create few-shot examples
few_shot_examples = create_few_shot_examples(
    train_dataset,
    num_shots=tutorial_config['num_shots'],
    balanced=True
)

# Create few-shot prompt
def format_few_shot_prompt(
    examples: List[Dict[str, Any]],
    query_text: str
) -> str:
    """
    Format few-shot prompt with examples.
    """
    prompt = "Classify news articles into categories. Here are some examples:\n\n"
    
    # Add examples
    for i, example in enumerate(examples, 1):
        prompt += f"Example {i}:\n"
        prompt += f"Article: {example['text']}\n"
        prompt += f"Category: {example['label']}\n\n"
    
    # Add query
    prompt += "Now classify this article:\n"
    prompt += f"Article: {query_text}\n"
    prompt += "Category:"
    
    return prompt

# Create and display few-shot prompt
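# few_shot_examples is grouped by class, so few_shot_examples[:3] would contain
# only the first class. Pick one example per class instead, and skip the query
# itself (sample_text comes from the same training split).
query_text = sample_text[:200]
prompt_examples = []
seen_labels = set()
for example in few_shot_examples:
    if example['label'] in seen_labels or example['text'] == query_text:
        continue
    seen_labels.add(example['label'])
    prompt_examples.append(example)

few_shot_prompt = format_few_shot_prompt(prompt_examples, query_text)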

print("Few-Shot Prompt Example:")
print("="*50)
print(few_shot_prompt[:800] + "...\n[truncated for display]")
print(f"\nTotal prompt length: {len(few_shot_prompt)} characters")
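
With `balanced=True`, `create_few_shot_examples` returns all examples of one class before moving on to the next, so a prompt built from that list shows the classes in blocks. One possible way to spread them out is a round-robin reorder. The helper below is a sketch under that assumption; `interleave_by_label` is a hypothetical name, not part of the project code.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Any, Dict, List


def interleave_by_label(examples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reorder examples round-robin across labels, keeping within-label order."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for example in examples:
        groups[example["label"]].append(example)

    interleaved: List[Dict[str, Any]] = []
    # zip_longest pads shorter groups with None, which is skipped below.
    for round_ in zip_longest(*groups.values()):
        interleaved.extend(ex for ex in round_ if ex is not None)
    return interleaved


demo = [
    {"text": "a", "label": "World"},
    {"text": "b", "label": "World"},
    {"text": "c", "label": "Sports"},
    {"text": "d", "label": "Business"},
    {"text": "e", "label": "Business"},
]
print([ex["text"] for ex in interleave_by_label(demo)])  # ['a', 'c', 'd', 'b', 'e']
```

After this reorder, truncating the list to the first few items still covers several classes, which is useful when the prompt has a tight token budget.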

## 6. Chain-of-Thought Prompting

In [None]:
# Chain-of-thought templates
cot_templates = [
    {
        'name': 'reasoning_steps',
        'template': """Analyze this news article step by step to determine its category.

Article: {text}

Let's think step by step:
1. What is the main topic of this article?
2. What key words or entities are mentioned?
3. Which news category (World, Sports, Business, Science/Technology) best fits this content?

Based on this analysis, the category is:"""
    },
    {
        'name': 'evidence_based',
        'template': """Read the article and identify evidence for its category.

Article: {text}

Evidence analysis:
- Key entities mentioned: [identify entities]
- Domain-specific terms: [identify domain terms]
- Event type: [identify event type]
- Geographic scope: [identify scope]

Conclusion: Based on the evidence, this article belongs to the category:"""
    }
]

# Create CoT prompt with examples
def create_cot_prompt_with_examples() -> str:
    """
    Create chain-of-thought prompt with reasoning examples.
    
    Following CoT prompting from:
        Wei et al. (2022): "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"
    """
    prompt = """Classify news articles by analyzing their content step by step.

Example 1:
Article: "Apple announced record quarterly earnings driven by iPhone sales..."
Reasoning: This mentions Apple (a company), earnings (financial term), and iPhone sales (product revenue). These are business-related topics.
Category: Business

Example 2:
Article: "Scientists discover new exoplanet using James Webb telescope..."
Reasoning: This mentions scientists, exoplanet (astronomy term), and James Webb telescope (scientific instrument). These indicate scientific content.
Category: Science/Technology

Now analyze this article:
Article: {text}
Reasoning:"""
    
    return prompt

# Test CoT prompting
cot_prompt = create_cot_prompt_with_examples().format(text=sample_text[:200])

print("Chain-of-Thought Prompt Example:")
print("="*50)
print(cot_prompt)

# Compare prompt complexities
prompt_lengths = {
    'Zero-shot': len(zero_shot_templates[0]['template'].format(text=sample_text[:200])),
    'Few-shot': len(few_shot_prompt),
    'Chain-of-Thought': len(cot_prompt)
}

print("\nPrompt Length Comparison:")
print("="*50)
for prompt_type, length in prompt_lengths.items():
    print(f"  {prompt_type}: {length} characters")
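
The chain-of-thought prompt ends at `Reasoning:`, so a completion is expected to contain free-form reasoning followed by a `Category:` line, as in the two worked examples. A sketch of how that completion could be split is shown below; the regular expression and the function name `split_reasoning_and_answer` are illustrative assumptions, and the label list simply mirrors the categories named in the templates.

```python
import re
from typing import Optional, Tuple

# Matches a line such as "Category: Business" (case-insensitive).
_CATEGORY_RE = re.compile(
    r"^\s*Category:\s*(World|Sports|Business|Science/Technology)\b",
    re.IGNORECASE | re.MULTILINE,
)


def split_reasoning_and_answer(completion: str) -> Tuple[str, Optional[str]]:
    """Return (reasoning, category) from a chain-of-thought completion."""
    # The first match is used: anything after it is likely the model
    # continuing with an invented follow-up example.
    match = _CATEGORY_RE.search(completion)
    if match is None:
        return completion.strip(), None
    reasoning = completion[: match.start()].strip()
    return reasoning, match.group(1).title()


reasoning, category = split_reasoning_and_answer(
    "Mentions a football final and a coach.\nCategory: sports\n\nExample 3:\nCategory: World"
)
print(category)   # Sports
print(reasoning)  # Mentions a football final and a coach.
```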

## 7. Instruction-Based Prompting

In [None]:
# Instruction-based templates
instruction_templates = [
    {
        'name': 'detailed_instruction',
        'template': """### Instruction:
You are a news editor tasked with categorizing articles for publication. Read the following article carefully and assign it to the most appropriate section: World (international news and politics), Sports (athletic events and competitions), Business (economics and companies), or Science/Technology (research and innovation).

### Input:
{text}

### Response:
Category:"""
    },
    {
        'name': 'role_based',
        'template': """You are an expert news classifier. Your task is to accurately categorize news articles.

Categories:
- World: International affairs, politics, conflicts, diplomacy
- Sports: Games, athletes, competitions, tournaments
- Business: Companies, markets, economy, finance
- Science/Technology: Research, discoveries, innovations, tech companies

Article to classify:
{text}

Classification:"""
    }
]

# Create instruction tuning dataset
class InstructionDataset:
    """
    Dataset for instruction-based prompting.
    
    Following instruction tuning from:
        Sanh et al. (2022): "Multitask Prompted Training Enables Zero-Shot Task Generalization"
    """
    
    def __init__(self, base_dataset: AGNewsDataset, template: str):
        self.base_dataset = base_dataset
        self.template = template
    
    def create_instruction_prompt(self, idx: int) -> Dict[str, Any]:
        text = self.base_dataset.texts[idx]
        label = self.base_dataset.labels[idx]
        
        prompt = self.template.format(text=text[:300])
        target = AG_NEWS_CLASSES[label]
        
        return {
            'prompt': prompt,
            'target': target,
            'label': label
        }

# Create instruction dataset
instruction_dataset = InstructionDataset(
    train_dataset,
    instruction_templates[0]['template']
)

# Generate examples
instruction_example = instruction_dataset.create_instruction_prompt(0)

print("Instruction-Based Prompt Example:")
print("="*50)
print(instruction_example['prompt'][:600])
print(f"\nExpected output: {instruction_example['target']}")
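
The prompts in this notebook shorten articles with character slices such as `text[:300]`, which can cut a word in half. If that matters for a given model, a word-aware truncation is one option. The helper below is a small sketch; `truncate_at_word` is a hypothetical name introduced here, not an existing project utility.

```python
def truncate_at_word(text: str, max_chars: int, ellipsis: str = "...") -> str:
    """Truncate text to at most max_chars characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    budget = max_chars - len(ellipsis)
    if budget <= 0:
        # No room for the ellipsis: fall back to a hard cut.
        return text[:max_chars]
    cut = text[:budget]
    # If the next character is not whitespace, the cut landed mid-word:
    # back up to the last space (when there is one).
    if not text[budget].isspace() and " " in cut:
        cut = cut[: cut.rindex(" ")]
    return cut.rstrip() + ellipsis


print(truncate_at_word("Stocks rallied sharply on Monday", 20))  # Stocks rallied...
```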

## 8. Prompt Optimization

In [None]:
class PromptOptimizer:
    """
    Optimize prompt templates for better performance.
    
    Following prompt optimization techniques from:
        Zhou et al. (2022): "Large Language Models Are Human-Level Prompt Engineers"
    """
    
    def __init__(self, templates: List[Dict[str, str]], validation_data: List[Tuple[str, int]]):
        self.templates = templates
        self.validation_data = validation_data
    
    def evaluate_template(self, template: str, tokenizer, model) -> float:
        """
        Evaluate a template's performance on validation data.
        """
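        correct = 0
        total = min(10, len(self.validation_data))  # Limit for speed
        if total == 0:
            return 0.0  # Avoid division by zero on empty validation data

        for text, label in self.validation_data[:total]:
            # Built to show the model input; the placeholder below ignores it
            prompt = template.format(text=text[:200])

            # Placeholder: a uniformly random guess stands in for the model, so the
            # score reflects chance and says nothing about template quality.
            # In practice, run `prompt` through the tokenizer and model here.
            predicted_label = np.random.randint(0, AG_NEWS_NUM_CLASSES)

            if predicted_label == label:
                correct += 1

        return correct / total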
    
    def optimize(self, tokenizer=None, model=None, num_iterations: int = 5) -> Dict[str, Any]:
        """
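        Score every template once and return them ranked by score.

        Note: num_iterations is currently unused; each template is evaluated
        a single time.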
        """
        results = []
        
        for template_dict in self.templates:
            score = self.evaluate_template(
                template_dict['template'],
                tokenizer,
                model
            )
            
            results.append({
                'name': template_dict['name'],
                'score': score,
                'template': template_dict['template']
            })
        
        # Sort by score
        results = sorted(results, key=lambda x: x['score'], reverse=True)
        
        return {
            'best_template': results[0],
            'all_results': results
        }

# Prepare validation data
validation_samples = [
    (val_dataset.texts[i], val_dataset.labels[i])
    for i in range(min(20, len(val_dataset)))
]

# Optimize prompts
optimizer = PromptOptimizer(
    templates=zero_shot_templates + instruction_templates,
    validation_data=validation_samples
)

print("Prompt Optimization Results (simulated scores):")
print("="*50)

optimization_results = optimizer.optimize(
    num_iterations=tutorial_config['template_optimization_steps']
)

print(f"Best template: {optimization_results['best_template']['name']}")
print(f"Score: {optimization_results['best_template']['score']:.3f}")

print("\nAll template scores:")
for result in optimization_results['all_results']:
    print(f"  {result['name']:25} Score: {result['score']:.3f}")
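
Because `evaluate_template` uses a random placeholder, the ranking above does not yet reflect template quality. One way to make the comparison meaningful is to pass the prediction step in as a function, so a real model call can be swapped in later. The sketch below assumes that design; `score_templates` and `keyword_predict` are hypothetical names, and the keyword rule is only a toy stand-in so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple


def score_templates(
    templates: List[Dict[str, str]],
    data: List[Tuple[str, int]],
    predict_fn: Callable[[str], int],
) -> List[Dict[str, object]]:
    """Rank templates by the accuracy of predict_fn on prompts built from data."""
    if not data:
        raise ValueError("data must contain at least one (text, label) pair")
    ranked = []
    for entry in templates:
        correct = sum(
            predict_fn(entry["template"].format(text=text)) == label
            for text, label in data
        )
        ranked.append({"name": entry["name"], "score": correct / len(data)})
    return sorted(ranked, key=lambda r: r["score"], reverse=True)


# Toy stand-in for a model: predicts Sports (1) if "match" appears, else Business (2).
def keyword_predict(prompt: str) -> int:
    return 1 if "match" in prompt.lower() else 2


toy_templates = [
    {"name": "plain", "template": "Article: {text}\nCategory:"},
    {"name": "leaky", "template": "Match the article to a category.\nArticle: {text}\nCategory:"},
]
toy_data = [("The match ended 2-1.", 1), ("Shares fell 3% today.", 2)]

ranking = score_templates(toy_templates, toy_data, keyword_predict)
for row in ranking:
    print(f"{row['name']:10} {row['score']:.2f}")
```

In this toy run the `leaky` template scores lower because its own wording contains the keyword the stand-in predictor reacts to, a small illustration of how template text can interfere with predictions.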

## 9. Soft Prompts and Parameter-Efficient Tuning

In [None]:
# Initialize soft prompt model
from src.models.prompt_based.soft_prompt import SoftPromptConfig

soft_prompt_config = SoftPromptConfig(
    prompt_length=tutorial_config['prompt_length'],
    hidden_size=768,  # For base models
    num_layers=12,
    initialization='random',
    reparameterization=True
)

# Create soft prompt model
print("Soft Prompt Configuration:")
print("="*50)
print(f"Prompt length: {soft_prompt_config.prompt_length} tokens")
print(f"Trainable parameters: {soft_prompt_config.prompt_length * soft_prompt_config.hidden_size:,}")
print(f"Initialization: {soft_prompt_config.initialization}")

# Compare with full fine-tuning
base_model_params = 184_000_000  # Approximate for DeBERTa-v3-base, including the embedding layer
soft_prompt_params = soft_prompt_config.prompt_length * soft_prompt_config.hidden_size
efficiency_ratio = soft_prompt_params / base_model_params * 100

print(f"\nParameter Efficiency:")
print(f"  Base model parameters: {base_model_params:,}")
print(f"  Soft prompt parameters: {soft_prompt_params:,}")
print(f"  Trainable fraction: {efficiency_ratio:.4f}%")

# Demonstrate soft prompt usage
class SoftPromptClassifier:
    """
    Classifier using soft prompts.
    
    Following soft prompt tuning from:
        Lester et al. (2021): "The Power of Scale for Parameter-Efficient Prompt Tuning"
    """
    
    def __init__(self, config: SoftPromptConfig):
        self.config = config
        # Wrap in nn.Parameter so the prompt embeddings can receive gradients
        self.soft_prompts = torch.nn.Parameter(self._initialize_prompts())
    
    def _initialize_prompts(self) -> torch.Tensor:
        """Initialize soft prompt embeddings."""
        if self.config.initialization == 'random':
            return torch.randn(
                self.config.prompt_length,
                self.config.hidden_size
            ) * 0.01
        elif self.config.initialization == 'vocab_sample':
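            # Placeholder: a real implementation would copy rows from the model's
            # token embedding matrix; here larger-variance noise is used instead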
            return torch.randn(
                self.config.prompt_length,
                self.config.hidden_size
            ) * 0.1
        else:
            raise ValueError(f"Unknown initialization: {self.config.initialization}")
    
    def apply_soft_prompt(self, input_embeddings: torch.Tensor) -> torch.Tensor:
        """Prepend soft prompts to input embeddings."""
        batch_size = input_embeddings.shape[0]
        
        # Expand soft prompts for batch
        soft_prompts_batch = self.soft_prompts.unsqueeze(0).expand(
            batch_size, -1, -1
        )
        
        # Concatenate soft prompts with input
        prompted_embeddings = torch.cat(
            [soft_prompts_batch, input_embeddings],
            dim=1
        )
        
        return prompted_embeddings

# Create soft prompt classifier
soft_prompt_classifier = SoftPromptClassifier(soft_prompt_config)

# Simulate application
dummy_input = torch.randn(tutorial_config['batch_size'], 100, 768)
prompted_input = soft_prompt_classifier.apply_soft_prompt(dummy_input)

print(f"\nSoft Prompt Application:")
print(f"  Original input shape: {dummy_input.shape}")
print(f"  Prompted input shape: {prompted_input.shape}")
print(f"  Added tokens: {prompted_input.shape[1] - dummy_input.shape[1]}")
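
`apply_soft_prompt` lengthens the embedding sequence by `prompt_length` positions, so any attention mask that accompanies the input has to grow by the same amount or the shapes will no longer line up. The sketch below shows one way to extend both together. The function name `prepend_soft_prompt` is an assumption for illustration, and the small tensor sizes are chosen only to keep the demo fast.

```python
import torch


def prepend_soft_prompt(
    soft_prompt: torch.Tensor,     # (prompt_length, hidden_size)
    input_embeds: torch.Tensor,    # (batch, seq_len, hidden_size)
    attention_mask: torch.Tensor,  # (batch, seq_len); 1 = real token, 0 = padding
):
    """Prepend a shared soft prompt and extend the attention mask to match."""
    batch_size = input_embeds.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
    embeds = torch.cat([prompt, input_embeds], dim=1)

    # Prompt positions are never padding, so their mask entries are all 1.
    prompt_mask = torch.ones(
        batch_size,
        soft_prompt.shape[0],
        dtype=attention_mask.dtype,
        device=attention_mask.device,
    )
    mask = torch.cat([prompt_mask, attention_mask], dim=1)
    return embeds, mask


soft = torch.zeros(10, 8)
x = torch.randn(2, 5, 8)
m = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
embeds, mask = prepend_soft_prompt(soft, x, m)
print(embeds.shape, mask.shape)
```

With Hugging Face models, the resulting pair would typically be passed as `inputs_embeds=` and `attention_mask=` in place of `input_ids`.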

## 10. Evaluation and Comparison

In [None]:
# Evaluate different prompting strategies
def evaluate_prompting_strategies(
    dataset: AGNewsDataset,
    num_samples: int = 50
) -> pd.DataFrame:
    """
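    Compare different prompting strategies.
    
    Note: all metrics below are randomly generated placeholders; `dataset` and
    `num_samples` are not used yet. For evaluation considerations in few-shot
    settings, see:
        Zhao et al. (2021): "Calibrate Before Use: Improving Few-Shot Performance of Language Models"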
    """
    strategies = [
        'zero_shot',
        'few_shot_1',
        'few_shot_3',
        'few_shot_5',
        'chain_of_thought',
        'instruction_based'
    ]
    
    results = []
    
    for strategy in strategies:
        # Simulate evaluation (placeholder scores)
        accuracy = np.random.uniform(0.7, 0.95)
        latency = np.random.uniform(10, 100)  # ms
        prompt_tokens = np.random.randint(50, 500)
        
        results.append({
            'Strategy': strategy,
            'Accuracy': accuracy,
            'Latency (ms)': latency,
            'Prompt Tokens': prompt_tokens,
            'Cost Efficiency': accuracy / (prompt_tokens / 100)
        })
    
    return pd.DataFrame(results)

# Run evaluation
evaluation_results = evaluate_prompting_strategies(
    val_dataset,
    num_samples=50
)

print("Prompting Strategy Comparison (simulated scores):")
print("="*70)
print(evaluation_results.to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Accuracy comparison
axes[0].bar(evaluation_results['Strategy'], evaluation_results['Accuracy'])
axes[0].set_xlabel('Strategy')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy by Prompting Strategy')
axes[0].tick_params(axis='x', rotation=45)

# Efficiency comparison
axes[1].scatter(
    evaluation_results['Prompt Tokens'],
    evaluation_results['Accuracy'],
    s=100
)
for i, strategy in enumerate(evaluation_results['Strategy']):
    axes[1].annotate(
        strategy,
        (evaluation_results['Prompt Tokens'].iloc[i], evaluation_results['Accuracy'].iloc[i]),
        fontsize=8
    )
axes[1].set_xlabel('Prompt Tokens')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy vs. Prompt Length Trade-off')

plt.tight_layout()
plt.show()

# Best strategy analysis
best_accuracy = evaluation_results.loc[evaluation_results['Accuracy'].idxmax()]
best_efficiency = evaluation_results.loc[evaluation_results['Cost Efficiency'].idxmax()]

print("\nBest Performing Strategies:")
print("="*50)
print(f"Highest Accuracy: {best_accuracy['Strategy']} ({best_accuracy['Accuracy']:.3f})")
print(f"Best Efficiency: {best_efficiency['Strategy']} ({best_efficiency['Cost Efficiency']:.3f})")
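
The comparison above uses placeholder numbers. Once real predictions are available, accuracy alone can hide weak classes, so a macro-averaged F1 is a common companion metric. The following is a dependency-free sketch; `accuracy_and_macro_f1` is a hypothetical helper name, and the label lists are made-up toy data.

```python
from typing import Dict, List


def accuracy_and_macro_f1(
    y_true: List[int], y_pred: List[int], num_classes: int
) -> Dict[str, float]:
    """Compute accuracy and macro-averaged F1 from integer class labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / num_classes}


report = accuracy_and_macro_f1([0, 1, 2, 3, 0, 1], [0, 1, 2, 3, 1, 1], num_classes=4)
print(report)
```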

## 11. Conclusions and Next Steps

### Prompt Engineering Summary

This tutorial demonstrated fundamental prompt engineering concepts:

1. **Zero-Shot Prompting**: Simple templates for direct classification
2. **Few-Shot Learning**: Using examples to guide model predictions
3. **Chain-of-Thought**: Step-by-step reasoning before the final label
4. **Instruction-Based**: Detailed task descriptions and role-playing
5. **Prompt Optimization**: Systematic template evaluation and selection
6. **Soft Prompts**: Parameter-efficient tuning with learnable prompts
7. **Evaluation**: A comparison scaffold for strategies; the scores shown in this notebook are simulated placeholders

### Key Takeaways

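1. **Template Design Matters**: Prompt wording and format can change predictions, so compare several templates on validation data
2. **Few-Shot vs Zero-Shot**: Few-shot prompts often improve accuracy but consume more tokens per query
3. **Chain-of-Thought**: Adds intermediate reasoning, which lengthens outputs and increases latency; its benefit for a four-way classification task should be measured, not assumed
4. **Soft Prompts**: Train only the prompt embeddings (`prompt_length × hidden_size` = 7,680 values in this configuration) while the base model stays frozen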
5. **Trade-offs**: Balance accuracy, latency, and computational cost

### Next Steps

1. **Advanced Prompting**:
   - Implement self-consistency prompting
   - Try least-to-most prompting
   - Explore tree-of-thoughts reasoning

2. **Optimization**:
   - Automatic prompt engineering with RL
   - Gradient-based prompt search
   - Prompt ensemble methods

3. **Efficiency**:
   - Prompt compression techniques
   - Dynamic prompt selection
   - Caching strategies for few-shot

4. **Production**:
   - Build prompt management system
   - Implement A/B testing for prompts
   - Monitor prompt performance metrics
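
As a starting point for the self-consistency and prompt-ensemble items, both reduce to the same aggregation step: collect several answers for one article (from sampled reasoning paths or from different templates) and take a majority vote. The sketch below covers only that aggregation; `majority_vote` is a hypothetical helper, and the answer list is made-up toy data.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_vote(answers: List[Optional[str]]) -> Tuple[Optional[str], float]:
    """Return the most common non-None answer and its share of the valid votes."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0
    label, count = Counter(valid).most_common(1)[0]
    return label, count / len(valid)


# None marks a completion that could not be parsed into a category.
label, agreement = majority_vote(["Business", "Business", "World", None])
print(label, round(agreement, 2))  # Business 0.67
```

The agreement share can double as a rough confidence signal, for example to flag articles where the votes are split.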

### References

For deeper understanding, consult:
- Instruction tuning: `notebooks/tutorials/06_instruction_tuning.ipynb`
- API integration: `notebooks/tutorials/07_api_usage.ipynb`
- Advanced techniques: `docs/user_guide/advanced_techniques.md`
- Production deployment: `docs/user_guide/deployment.md`