# 📘 Promptimus Prime: LLM-AutoDiff Reproduction

## 🤖 **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**

Welcome to **Promptimus Prime**! This notebook reproduces the experiments from the paper *"LLM-AutoDiff: Auto-Differentiate Any LLM Workflow"*.

We use **Textual Gradient Descent (TGD)** to optimize the system prompt of a Large Language Model automatically. Instead of engineering the prompt by hand, we treat it as a trainable parameter: a "Teacher" LLM reads the "Student" LLM's errors and returns textual feedback, which plays the role of a gradient.

### 🧮 **The Task: GSM8K (Grade School Math)**
*   **Goal:** Solve multi-step mathematical reasoning problems.
*   **Student Model:** `Qwen2.5-1.5B-Instruct` (Lightweight, efficient).
*   **Teacher Model:** `Qwen2.5-7B-Instruct` (Stronger reasoning capabilities).
*   **Optimization:** We optimize the system prompt to improve the Student's Chain-of-Thought reasoning.

### 🛠️ **Architecture**
1.  **Forward Pass:** Student attempts to solve a math problem.
2.  **Evaluation:** We check if the final answer matches the Ground Truth.
3.  **Backward Pass:** If incorrect, the Teacher analyzes the error and generates a "Textual Gradient".
4.  **Update:** The Optimizer refines the system prompt to fix the error.
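
The four steps above can be sketched as a plain Python loop. This is a minimal illustration, not the AdalFlow implementation used by this repository: `textual_gradient_descent` and the `toy_*` callables are hypothetical stand-ins for the real Student, Teacher, and Optimizer, and the sketch omits the validation check that decides whether a new prompt is kept.

```python
from typing import Callable, Iterable, Tuple


def textual_gradient_descent(
    prompt: str,
    examples: Iterable[Tuple[str, str]],
    student: Callable[[str, str], str],
    teacher: Callable[[str, str, str, str], str],
    optimizer: Callable[[str, str], str],
) -> str:
    """Make one pass over `examples`, rewriting the prompt after each failure."""
    for question, truth in examples:
        # 1. Forward pass: the student answers under the current prompt.
        answer = student(prompt, question)
        # 2. Evaluation: compare the final answer with the ground truth.
        if answer.strip() == truth.strip():
            continue
        # 3. Backward pass: the teacher critiques the failure (the textual gradient).
        gradient = teacher(prompt, question, answer, truth)
        # 4. Update: the optimizer rewrites the prompt using that feedback.
        prompt = optimizer(prompt, gradient)
    return prompt


# Toy stand-ins for the models (hypothetical, for illustration only).
def toy_student(prompt: str, question: str) -> str:
    return "4" if "step by step" in prompt else "5"


def toy_teacher(prompt: str, question: str, answer: str, truth: str) -> str:
    return "The student guessed; ask it to reason step by step."


def toy_optimizer(prompt: str, gradient: str) -> str:
    return prompt + " Think step by step."


final_prompt = textual_gradient_descent(
    "You are a math tutor.",
    [("What is 2 + 2?", "4")] * 2,
    toy_student,
    toy_teacher,
    toy_optimizer,
)
print(final_prompt)
```

In this toy run the first example fails and triggers one prompt update; the second example then passes, so the prompt is left unchanged.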

### 🚀 **Step 1: Setup & Installation**

We start by cloning the **Promptimus Prime** repository. Then, we install all necessary dependencies defined in `requirements.txt` to ensure our environment matches the project specifications.

**Note:** Ensure you are connected to a **GPU Runtime** (T4 is sufficient) before running this cell.
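
If you are unsure whether a GPU is attached, a quick check along these lines may help. It is a sketch: `gpu_available` is a hypothetical helper, and the `nvidia-smi` fallback is only a heuristic for environments where PyTorch is not installed.

```python
import shutil


def gpu_available() -> bool:
    """Return True if a CUDA GPU appears to be usable."""
    try:
        import torch  # optional dependency; skipped if not installed

        return bool(torch.cuda.is_available())
    except ImportError:
        # Fallback heuristic: NVIDIA drivers ship the nvidia-smi tool.
        return shutil.which("nvidia-smi") is not None


print("GPU detected" if gpu_available() else "No GPU detected: switch the runtime type")
```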

In [None]:
# 1. Clone the repository
!git clone https://github.com/antonisbaro/promptimus-prime.git

# 2. Enter the project directory
%cd promptimus-prime

# 3. Install dependencies from requirements.txt
!pip install -q -r requirements.txt

We add the repository to the system path to allow direct imports. We also configure logging to suppress verbose output from libraries, ensuring that progress bars (tqdm) render correctly in Colab.

In [None]:
import sys
import logging
import transformers

# Add the repository to Python path
repo_path = "/content/promptimus-prime"
if repo_path not in sys.path:
    sys.path.append(repo_path)

# Configure Global Logging (Silence the noise)
# Force re-configuration to override Colab defaults
logging.basicConfig(level=logging.INFO, force=True)

# Suppress specific library noise
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("adalflow").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.ERROR)
transformers.logging.set_verbosity_error()

print("✅ Environment configured for interactive execution.")

### 🔑 **Step 2: Hugging Face Login (Optional)**

Log in to Hugging Face if you plan to use gated models or want to avoid download rate limits. The `Qwen2.5` models used here can typically be downloaded without authentication, so this step is optional but recommended.

In [None]:
from google.colab import userdata
from huggingface_hub import login

try:
    # Requires an 'HF_TOKEN' entry in your Colab Secrets
    token = userdata.get('HF_TOKEN')
    login(token=token)
    print("✅ Successfully logged in to Hugging Face!")
except Exception:
    # Catch Exception (not a bare except) so that interrupting the cell still works
    print("⚠️ Hugging Face login skipped (HF_TOKEN missing or invalid). Continuing without authentication (some models may not work).")

### 🧠 **Step 3: Run Training (Optimization Loop)**

We will now start the **Textual Gradient Descent** loop.
*   **Train Split:** Used to generate gradients.
*   **Validation Split:** Used to check whether each new prompt actually improves on the previous one.

The module `src.tasks.gsm8k.train` handles the entire pipeline: loading the models, slicing the data, and running the AdalFlow trainer.
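
As a rough picture of the data-slicing step, the sketch below draws disjoint train and validation subsets with a fixed seed. The function name `slice_splits` and the subset sizes are assumptions made for illustration; the repository's actual slicing logic and sizes may differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def slice_splits(
    data: Sequence[T], n_train: int, n_val: int, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Return disjoint, reproducible train and validation subsets of `data`."""
    if n_train + n_val > len(data):
        raise ValueError("Requested more examples than the dataset contains.")
    indices = list(range(len(data)))
    # A seeded shuffle keeps the subsets identical across runs.
    random.Random(seed).shuffle(indices)
    train = [data[i] for i in indices[:n_train]]
    val = [data[i] for i in indices[n_train:n_train + n_val]]
    return train, val


# Hypothetical sizes, chosen only to keep the example small.
problems = [f"problem-{i}" for i in range(100)]
train_set, val_set = slice_splits(problems, n_train=20, n_val=10)
print(len(train_set), len(val_set))
```

Keeping the validation subset disjoint from the training subset matters here, because the same examples that produced the feedback should not also judge the resulting prompt.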

In [None]:
# We import the main execution function and run it directly
# This will load the models (4-bit), run the optimization steps, and save the result.
from src.tasks.gsm8k.train import run_training # pyright: ignore[reportMissingImports]

# Execute the training pipeline
run_training()

### 📊 **Step 4: Final Evaluation**

Now that we have an **Optimized Prompt**, let's compare it against the **Baseline (Zero-shot)** prompt on a held-out **Test Set**.

This script will:
1.  Run inference using the default prompt.
2.  Run inference using the optimized prompt found in Step 3.
3.  Report the accuracy improvement.
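
The comparison boils down to exact-match accuracy on the final number. GSM8K reference solutions end with a line of the form `#### <answer>`, so one simple approach is to compare the last number in the model output with the last number in the reference. The helpers `final_number` and `accuracy` below are a hypothetical sketch, not the repository's evaluation code, and the sample outputs are made up rather than results from this notebook.

```python
import re
from typing import List, Optional

# Matches integers and decimals, allowing thousands separators (e.g. 1,200).
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text: str) -> Optional[str]:
    """Return the last number in `text` with thousands separators removed."""
    matches = NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose final number equals the reference's."""
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = final_number(pred), final_number(ref)
        if p is not None and r is not None and float(p) == float(r):
            correct += 1
    return correct / len(references)


# Made-up examples for illustration only.
references = ["48 + 24 = 72\n#### 72", "100 * 12 = 1200\n#### 1200"]
baseline_outputs = ["The answer is 70.", "She earns 1,200 dollars."]
optimized_outputs = ["48 + 24 = 72", "She earns $1,200."]

baseline_acc = accuracy(baseline_outputs, references)
optimized_acc = accuracy(optimized_outputs, references)
print(f"baseline={baseline_acc:.2f} optimized={optimized_acc:.2f} "
      f"improvement={optimized_acc - baseline_acc:+.2f}")
```

Taking the last number is a heuristic: it can misfire when a response continues with extra arithmetic after stating its answer, which is one reason to prompt the model for a fixed answer format.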

In [None]:
# We import the evaluation function and run it directly
from src.tasks.gsm8k.evaluate import run_evaluation # pyright: ignore[reportMissingImports]

# Execute the evaluation
run_evaluation()

### 📝 **Step 5: Inspect the Optimized Prompt**

Let's see what the "Teacher" taught the "Student". Here is the final system prompt that yielded the best results.

In [None]:
import os

prompt_path = "outputs/gsm8k/optimized_prompt.txt"

if os.path.exists(prompt_path):
    print("\n✨ \033[1mFINAL OPTIMIZED PROMPT:\033[0m\n" + "="*40)
    with open(prompt_path, "r") as f:
        print(f.read())
    print("="*40)
else:
    print("❌ Prompt file not found. Did the training finish successfully?")

### 📈 **Step 6: Visualization & Analysis**

We now visualize the improvements (success stories) and the evolution of the prompt.
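
Independently of the repository's own plots, one lightweight way to inspect how a prompt changed is a line-level diff from the standard library. The helper `prompt_diff` and the two prompt strings below are hypothetical examples, not output from this notebook.

```python
import difflib


def prompt_diff(old: str, new: str) -> str:
    """Return a unified diff between two prompt versions (empty if identical)."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical prompt versions, for illustration only.
old_prompt = "You are a math tutor."
new_prompt = "You are a math tutor.\nThink step by step."
print(prompt_diff(old_prompt, new_prompt))
```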

In [None]:
from src.tasks.gsm8k.visualize import run_visualization

run_visualization()