# Prompt Optimization

This notebook demonstrates methods and intuitions for prompt optimization.

## Prompt Optimization Approaches

### 1. Bayesian Optimization (e.g., DSPy)
Proposal Generation:
- Uses an LLM to generate (input, LM_output) pairs from the (input, output) dataset.
- Uses program structure, dataset summaries, and Bayesian optimization to generate proposals.

Proposal Selection:
- Uses a probabilistic Bayesian model to rank proposals.

Proposal Testing:
- Evaluates prompts on subsets of the data.

### 2. Evolutionary Algorithms (e.g., EvoPrompt)
Proposal Generation:
- Given a population of prompts, uses LLM-based mutation and crossover operators to generate proposals.

Proposal Selection:
- Samples a subset of prompts, evaluates them on a "development set" (i.e., a subset of the data), and picks the top performers by "fitness" (i.e., your chosen metrics).

Proposal Testing:
- Runs proposal selection and returns the best-performing prompt.
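
The generate/select/return cycle above can be sketched as a small elitist loop. This is a minimal illustration, not EvoPrompt's implementation: `evolve`, `toy_crossover`, `toy_mutate`, and `toy_fitness` are hypothetical names, the word-level operators stand in for the LLM-based mutation and crossover calls a real system would make, and the toy fitness scores every prompt rather than a sampled subset.

```python
import random

# Made-up keywords and vocabulary for the toy fitness and mutation operator.
KEYWORDS = {"concise", "step", "cite"}
VOCAB = ["concise", "step", "cite", "please", "answer"]


def toy_fitness(prompt):
    """Stand-in for a dev-set metric: count distinct keywords in the prompt."""
    return len(KEYWORDS & set(prompt.lower().split()))


def toy_crossover(a, b, rng):
    """Stand-in for LLM crossover: first half of `a` plus second half of `b`."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


def toy_mutate(prompt, rng):
    """Stand-in for LLM mutation: append one random vocabulary word."""
    return f"{prompt} {rng.choice(VOCAB)}".strip()


def evolve(population, fitness, crossover, mutate, generations=5, keep=2, seed=0):
    """Keep the `keep` fittest prompts each generation and refill with children."""
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[:keep]
        children = []
        while len(parents) + len(children) < len(pop):
            a, b = rng.sample(parents, 2)  # requires keep >= 2
            children.append(mutate(crossover(a, b, rng), rng))
        pop = parents + children
    return max(pop, key=fitness)


seeds = ["answer the question", "please be concise", "think step by step"]
best = evolve(seeds, toy_fitness, toy_crossover, toy_mutate)
```

Because the fittest parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.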

### 3. "Gradient"-based Methods (e.g., ProTeGi, APE)
Proposal Generation:
- Uses an LLM, textual feedback, and the previous prompt to generate future proposals.

Proposal Selection:
- Samples a subset of prompts, evaluates them on a subset of the data, and uses the evaluations either to update the prompt manually or to pick the next prompt.

Proposal Testing:
- Samples a subset of the data and evaluates all prompts on it with the given metric, returning the best-performing one.
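
One "gradient" step, as described above, could look roughly like the sketch below: collect the examples the current prompt gets wrong, turn them into textual feedback, and rewrite the prompt with that feedback. All names here (`refine_once`, `toy_model`, `toy_critique`, `toy_rewrite`) are hypothetical, and the three toy callables are placeholders for LLM calls; this is not the ProTeGi or APE implementation.

```python
def refine_once(prompt, batch, run_model, critique, rewrite):
    """One refinement step: failures -> textual feedback -> rewritten prompt."""
    failures = []
    for x, y in batch:
        out = run_model(prompt, x)
        if out != y:
            failures.append((x, y, out))
    if not failures:
        return prompt  # nothing to fix on this batch
    feedback = critique(prompt, failures)
    return rewrite(prompt, feedback)


# Toy stand-ins for the LLM calls (assumptions, for illustration only).
def toy_model(prompt, x):
    return x.upper() if "uppercase" in prompt.lower() else x


def toy_critique(prompt, failures):
    return "Answer in uppercase."


def toy_rewrite(prompt, feedback):
    return f"{prompt} {feedback}"


batch = [("a", "A"), ("b", "B")]
p1 = refine_once("Echo the input.", batch, toy_model, toy_critique, toy_rewrite)
p2 = refine_once(p1, batch, toy_model, toy_critique, toy_rewrite)
```

In this toy run the first step appends the feedback to the prompt, and the second step leaves it unchanged because no failures remain.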

### 4. Human-interactive Methods (e.g., iPrOp)
Proposal Generation:
- Uses an LLM to generate variations and presents them to a human in a multi-armed bandit interface, learning a reward signal that guides the generation of new proposals.

Proposal Selection:
- Uses human feedback to filter proposals.

Proposal Testing:
- Evaluates the filtered prompts on a subset of the data with the given metric.
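
To make the multi-armed bandit idea concrete, here is a minimal sketch that treats each prompt variation as an arm and each human judgment as a binary reward. It assumes Thompson sampling with Beta posteriors, which is one common bandit strategy and not necessarily the one iPrOp uses; the class name `BetaBandit` and its methods are hypothetical.

```python
import random


class BetaBandit:
    """Thompson sampling over prompt variations with like/dislike feedback."""

    def __init__(self, n_arms, seed=0):
        # Beta(1, 1) prior per arm, i.e. uniform over the unknown like-rate.
        self.likes = [1.0] * n_arms
        self.dislikes = [1.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        """Sample a plausible like-rate per arm and show the highest one."""
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.likes, self.dislikes)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, liked):
        """Record one human judgment for the prompt variation `arm`."""
        if liked:
            self.likes[arm] += 1
        else:
            self.dislikes[arm] += 1

    def mean(self, arm):
        """Posterior mean like-rate, usable as a learned reward estimate."""
        return self.likes[arm] / (self.likes[arm] + self.dislikes[arm])


bandit = BetaBandit(n_arms=3)
for _ in range(3):
    bandit.update(0, liked=True)   # the human liked variation 0 three times
bandit.update(1, liked=False)      # and disliked variation 1 once
next_arm = bandit.select()         # index of the variation to show next
```

The posterior means can then serve as the filter in proposal selection, for example by dropping variations whose estimated like-rate falls below a chosen threshold.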

### 5. Translation-based Methods (e.g., BPO)
Proposal Generation:
- Uses an LLM to generate critiques, which are then used to produce "optimal" (input, LM_output) pairs.
- Trains a seq2seq model to decode "optimal" prompts.

Proposal Selection:
- The (input, optimal LM_output) pairs can be viewed as the means of selection.

Proposal Testing:
- Evaluates the generated prompt against a baseline and compares the results.

## Prompt Optimization Loop

In [None]:
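class AdaptiveOptimizationLoop:
    """Skeleton of an iterative prompt-optimization loop."""

    def __init__(self, optimizers, validators, generator, n_iter=1000):
        self.optimizers = optimizers
        self.validators = validators
        # Callable: generator(data) or generator(data, feedback) -> list of prompts
        self.generator = generator

        self.history = []
        self.data = []

        self.n_iter = n_iter

    def generate_proposals(self, feedback=None):
        # Condition generation on feedback when any is available.
        if feedback:
            return self.generator(self.data, feedback)
        return self.generator(self.data)

    # The three hooks below are not defined anywhere in this notebook;
    # they are placeholders that a concrete optimizer must implement.
    def evaluate_prompt(self, prompt):
        raise NotImplementedError

    def analyze_errors(self, eval_results):
        raise NotImplementedError

    def validate_proposals(self, proposals):
        raise NotImplementedError

    def run_iteration(self, current_prompt):
        # 1. Evaluation
        eval_results = self.evaluate_prompt(current_prompt)

        # 2. Error Analysis
        error_patterns = self.analyze_errors(eval_results)

        # 3. Optimization
        proposals = self.generate_proposals(error_patterns)

        # 4. Validation
        best_proposal = self.validate_proposals(proposals)

        return best_proposal

    def main(self):
        # 1. Initialization
        proposals = self.generate_proposals(feedback=None)
        best_proposal = proposals[0]

        # 2. Iteration: refine the current best prompt, not the full proposal list
        for _ in range(self.n_iter):
            best_proposal = self.run_iteration(best_proposal)

        return best_proposal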

Given (input, output) pairs and, in some cases, an initial prompt, prompt optimization does the following:

1. Use an LLM to generate prompt candidates.
2. Iteratively use an LLM to filter or critique the prompt candidates (i.e., update them) and/or generate additional (input, output) pairs.
3. Use an LLM to evaluate the output of the iteration loop.

## Trade-offs

### 1. Bayesian Optimization (e.g., DSPy)

Best For: Multi-stage pipelines with multiple chained prompts (e.g., RAG, code generation)

Limitations: Overkill for simple tasks; requires defining the program in DSPy, which can be unnecessary overhead

### 2. Evolutionary Algorithms (e.g., EvoPrompt)
Best For: Open-ended creativity and exploring new solutions (e.g., story generation, brainstorming)

Limitations: High computational cost; not compatible with multi-stage problems

### 3. "Gradient"-based Methods (e.g., ProTeGi, APE)
Best For: General NLP tasks with clear input-output pairs (e.g., translation) that can be optimized through incremental refinement (e.g., stylistic alignment)

Limitations: Poorly suited to open-ended, creative tasks; exploration is generally low

### 4. Human-interactive Methods (e.g., iPrOp)
Best For: Domain-specific tuning (e.g., medical or legal fields)

Limitations: Limited scalability because experts are needed; convergence can also be fickle, depending on how consistent the expert feedback is

### 5. Translation-based Methods (e.g., BPO)
Best For: Human-preference alignment (e.g., chatbot tuning, content moderation)

Limitations: Usually requires large datasets; objectives are inflexible, and multi-objective problems require multiple models

## Systems for Prompt Optimization

### 1. Optimization Trajectories:
- Apply each optimization method (1-5) and show its convergence pattern over iterations. iPrOp requires human supervision; the others are automatic.
- Explore ensembling of prompts based on proposal selection/filtering during the optimization process.

### 2. Prompt Inference:
- Map the top prompts into an embedding space to visualize how prompt similarity relates to performance.
- Run inference with the top prompts on a held-out dataset.

### 3. Multi-Objective Evaluation:
- Use a parallel coordinates plot to visualize trade-offs between different quantitative evaluation metrics (e.g., accuracy, latency).
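
Before plotting, it can help to narrow the candidates to those that are not beaten on every metric at once. The sketch below computes such a Pareto front for two metrics, assuming higher accuracy and lower latency are better. The function name `pareto_front` is hypothetical and the numbers are made-up placeholders, not measured results.

```python
def pareto_front(results):
    """Return the names of prompts not dominated on (accuracy, latency).

    `results` maps a prompt name to (accuracy, latency_seconds).
    """
    def dominates(a, b):
        # `a` is at least as good on both metrics and differs on at least one.
        return a[0] >= b[0] and a[1] <= b[1] and a != b

    return sorted(
        name for name, r in results.items()
        if not any(dominates(other, r) for other in results.values())
    )


# Illustrative values only.
results = {
    "prompt_a": (0.90, 2.0),  # most accurate, slowest
    "prompt_b": (0.85, 1.0),  # slightly less accurate, fastest
    "prompt_c": (0.80, 1.5),  # worse than prompt_b on both metrics
}
front = pareto_front(results)
```

Here `prompt_c` drops out because `prompt_b` is both more accurate and faster, leaving a genuine trade-off between the other two.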

### 4. LM + Human Evaluation:
- Visualize the distribution of the different error types identified by LLM judges in the simulations (e.g., as a sunburst chart).

### 5. Performance Drift:
- Repeat steps 2-4 regularly to monitor performance over time, restarting from step 1 on failure.
- Show performance drift over time in an interactive plot with daily variations and a trend line.
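
The trend line and the "restart from step 1" decision can be sketched with an ordinary least-squares slope over the daily scores. The function names and the `min_slope` threshold below are assumptions for illustration; a suitable threshold depends on the metric and how noisy it is.

```python
def trend_slope(scores):
    """Least-squares slope of scores against their index (one score per day)."""
    n = len(scores)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(scores))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def should_reoptimize(scores, min_slope=-0.01):
    """Flag a downward trend steeper than `min_slope` per day (arbitrary default)."""
    return trend_slope(scores) < min_slope


declining = [0.9, 0.8, 0.7]   # illustrative daily accuracies
stable = [0.8, 0.8, 0.8]
```

With these illustrative series, `should_reoptimize(declining)` is true and `should_reoptimize(stable)` is false.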

## Future directions/approaches
- Investigate the use of prompt ensembling: (1) merging prompts into one and (2) combining the answers produced by each prompt.

- Design a meta-learning system where a learned subnetwork identifies which optimization method will yield the optimal prompt under different conditions.

- Understand how prompt optimization methods and their performance depend on the model (e.g., architecture, model size).
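
As a starting point for the second ensembling variant in the first bullet above (combining answers), a majority vote over the outputs of several prompts is probably the simplest option. The names `ensemble_answer` and `toy_model` are hypothetical, and `toy_model` is a stand-in for a real LLM call.

```python
from collections import Counter


def ensemble_answer(prompts, run_model, x):
    """Run every prompt on input `x` and return the most common answer."""
    answers = [run_model(p, x) for p in prompts]
    return Counter(answers).most_common(1)[0][0]


def toy_model(prompt, x):
    # Hypothetical stand-in for an LLM call: one prompt variant gets this input wrong.
    return "negative" if prompt == "P3" else "positive"


answer = ensemble_answer(["P1", "P2", "P3"], toy_model, "great movie")
```

Majority voting only applies when answers are directly comparable (e.g., class labels); free-form outputs would need a different way of combining them.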