# LLM Feedback Generation

This notebook covers the `LLMFeedbackGenerator` class, which uses an LLM to turn verification results into diagnostic feedback for GEPA's reflective optimization. It shows how to:

1. Initialize the feedback generator
2. Generate single trajectory feedback
3. Perform differential analysis (success vs failure)
4. Generate rubric-specific feedback
5. Combine all feedback types

---

## Setup

In [None]:
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent.parent.parent / "src"))

from karenina import Benchmark
from karenina.integrations.gepa import (
    KareninaDataInst,
    KareninaTrajectory,
    LLMFeedbackGenerator,
)
from karenina.schemas import ModelConfig, VerificationConfig

# Load benchmark
benchmark_path = Path.home() / "Projects/karenina-monorepo/local_data/data/checkpoints/aime_2025.jsonld"
benchmark = Benchmark.load(benchmark_path)
print(f"Loaded: {benchmark.name}")

---

## Initializing LLMFeedbackGenerator

The generator uses an LLM to analyze verification failures and provide actionable feedback.

In [None]:
# Create a ModelConfig for the feedback LLM
feedback_model_config = ModelConfig(
    id="feedback-haiku",
    model_provider="anthropic",
    model_name="claude-haiku-4-5",
    temperature=0.7,  # Some creativity for suggestions
    interface="langchain",
)

# Initialize the generator
generator = LLMFeedbackGenerator(feedback_model_config)

print("LLMFeedbackGenerator initialized")
print(f"  Model: {feedback_model_config.model_name}")
print(f"  Temperature: {feedback_model_config.temperature}")

---

## Creating Sample Trajectories

To demonstrate feedback generation, we first build a data instance by hand, then run verification on three benchmark questions and convert the results into trajectories.

In [None]:
# Get a question from the benchmark
question_ids = benchmark.get_question_ids()
question = benchmark.get_question(question_ids[0])
template_code = benchmark.get_template(question_ids[0])

print(f"Sample question: {question['question'][:80]}...")
print(f"Expected answer: {question['raw_answer']}")

In [None]:
# Create a data instance
data_inst = KareninaDataInst(
    question_id=question_ids[0],
    question_text=question["question"],
    raw_answer=question["raw_answer"],
    template_code=template_code or "",
)

print(f"Created KareninaDataInst for: {data_inst.question_id[:50]}...")

In [None]:
# Run actual verification to get real trajectories
config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-haiku",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
            system_prompt="You are a math expert. Provide only the final integer answer.",
        )
    ],
    parsing_models=[
        ModelConfig(
            id="parser",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
        )
    ],
    evaluation_mode="template_only",
    replicate_count=1,
)

print("Running verification on 3 questions...")
results = benchmark.run_verification(config, question_ids=question_ids[:3])
print(f"Got {len(results.results)} results")

In [None]:
# Convert results to trajectories
from karenina.integrations.gepa import compute_single_score, extract_failed_fields, questions_to_data_insts

data_insts = questions_to_data_insts(benchmark, question_ids[:3])

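trajectories = []
# NOTE: zip() pairs items by position, so this assumes results.results comes back
# in the same order as question_ids[:3]. With strict=False, a length mismatch is
# silently truncated rather than raising.
for inst, result in zip(data_insts, results.results, strict=False):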
    traj = KareninaTrajectory(
        data_inst=inst,
        model_name=result.metadata.answering_model or "claude-haiku-4-5",
        model_config=config.answering_models[0],
        optimized_components={"answering_system_prompt": config.answering_models[0].system_prompt},
        verification_result=result,
        score=compute_single_score(result, template_weight=1.0, rubric_weight=0.0),
        raw_llm_response=result.template.raw_llm_response if result.template else None,
        parsing_error=result.metadata.error,
        failed_fields=extract_failed_fields(result),
        rubric_scores=None,
    )
    trajectories.append(traj)

print(f"Created {len(trajectories)} trajectories:")
for t in trajectories:
    status = "PASS" if t.passed() else "FAIL"
    print(f"  {status}: Score {t.score:.2f} | {t.data_inst.question_text[:50]}...")
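
The `template_weight` and `rubric_weight` arguments above suggest a weighted combination of the two evaluation signals. The sketch below shows one plausible reading, a normalized weighted average. It is an assumption for illustration only: `combine_scores` is a hypothetical helper, not karenina's `compute_single_score`, whose internals are not shown in this notebook.

```python
from typing import Optional


def combine_scores(
    template_passed: bool,
    rubric_score: Optional[float] = None,
    template_weight: float = 1.0,
    rubric_weight: float = 0.0,
) -> float:
    """Hypothetical weighted average of a template result and a rubric score.

    Illustration only: this is an assumption about what the two weights mean,
    not karenina's actual scoring implementation.
    """
    total = template_weight + rubric_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    template_part = 1.0 if template_passed else 0.0
    # Treat a missing rubric score as 0.0 so it cannot inflate the result.
    rubric_part = rubric_score if rubric_score is not None else 0.0
    return (template_weight * template_part + rubric_weight * rubric_part) / total


# With weights (1.0, 0.0), as used above, only the template result matters.
print(combine_scores(True))
print(combine_scores(False, 0.8, 0.5, 0.5))
```

Under this reading, `template_weight=1.0, rubric_weight=0.0` reduces the score to a pass/fail indicator, which matches the `template_only` evaluation mode configured earlier.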

---

## generate_single_feedback(): Analyzing One Failure

Analyze a single failed trajectory to understand why it failed.

In [None]:
# Find a failed trajectory
failed_trajs = [t for t in trajectories if not t.passed()]
passed_trajs = [t for t in trajectories if t.passed()]

print(f"Failed trajectories: {len(failed_trajs)}")
print(f"Passed trajectories: {len(passed_trajs)}")

In [None]:
if failed_trajs:
    failed_traj = failed_trajs[0]

    print("Analyzing failed trajectory...")
    print(f"  Question: {failed_traj.data_inst.question_text[:60]}...")
    print(f"  Expected: {failed_traj.data_inst.raw_answer}")
    print(f"  Response: {failed_traj.raw_llm_response[:100] if failed_traj.raw_llm_response else 'None'}...")

    # Generate single feedback
    feedback = generator.generate_single_feedback(failed_traj)

    print("\n=== LLM Feedback ===")
    print(feedback)
else:
    print("No failed trajectories to analyze.")
    print("Using a passed trajectory for demonstration...")

    if passed_trajs:
        demo_traj = passed_trajs[0]
        print(f"  Question: {demo_traj.data_inst.question_text[:60]}...")
        print(f"  Answer: {demo_traj.data_inst.raw_answer}")
        print(f"  Response: {demo_traj.raw_llm_response[:100] if demo_traj.raw_llm_response else 'None'}...")
        print(f"  Score: {demo_traj.score:.2f}")
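
The cell above repeats the pattern `text[:100] if text else 'None'` to print a possibly missing response. A small helper keeps that logic in one place. `preview` is a hypothetical name introduced here, not part of karenina.

```python
from typing import Optional


def preview(text: Optional[str], limit: int = 100) -> str:
    """Return the first `limit` characters of `text`, or 'None' when it is missing or empty."""
    if not text:
        return "None"
    # Only add an ellipsis when something was actually cut off.
    suffix = "..." if len(text) > limit else ""
    return text[:limit] + suffix


print(preview(None))
print(preview("The final answer is 42."))
print(preview("x" * 150, limit=10))
```

With this helper, a line such as `print(f"  Response: {preview(failed_traj.raw_llm_response)}")` would replace the inline conditional.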

---

## generate_differential_feedback(): Comparing Success vs Failure

Compare a failed trajectory against successful ones to identify what works.

In [None]:
if failed_trajs and passed_trajs:
    failed_traj = failed_trajs[0]

    print("Performing differential analysis...")
    print(f"  Failed: {failed_traj.model_name}")
    print(f"  Comparing against {len(passed_trajs)} successful trajectories")

    # Generate differential feedback
    diff_feedback = generator.generate_differential_feedback(
        failed_trajectory=failed_traj,
        successful_trajectories=passed_trajs,
    )

    print("\n=== Differential Feedback ===")
    print(diff_feedback)
else:
    print("Need both failed and passed trajectories for differential analysis.")
    print("\nDifferential feedback compares:")
    print("  - What successful models did differently")
    print("  - The specific failure mode")
    print("  - Concrete prompt improvements")
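
To make the idea of differential analysis concrete, the sketch below assembles a comparison prompt from one failed response and a few passing ones. It is a minimal illustration under stated assumptions: `TrajectoryStub` and `build_differential_prompt` are hypothetical, the stub flattens the `data_inst` fields for brevity, and the prompt wording is invented rather than taken from `LLMFeedbackGenerator`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrajectoryStub:
    """Hypothetical stand-in carrying a few KareninaTrajectory-like fields."""

    model_name: str
    question_text: str
    raw_answer: str
    raw_llm_response: Optional[str]


def build_differential_prompt(
    failed: TrajectoryStub,
    successes: List[TrajectoryStub],
    max_successes: int = 3,
    max_chars: int = 500,
) -> str:
    """Assemble a prompt contrasting one failed response with passing ones."""
    if not successes:
        raise ValueError("differential analysis needs at least one successful trajectory")

    def clip(text: Optional[str]) -> str:
        # Bound prompt size and handle missing responses.
        return (text or "<no response>")[:max_chars]

    lines = [
        f"Question: {failed.question_text}",
        f"Expected answer: {failed.raw_answer}",
        "",
        f"FAILED response ({failed.model_name}):",
        clip(failed.raw_llm_response),
    ]
    for i, traj in enumerate(successes[:max_successes], start=1):
        lines += ["", f"SUCCESSFUL response {i} ({traj.model_name}):", clip(traj.raw_llm_response)]
    lines += ["", "Explain what the successful responses did differently and suggest a prompt change."]
    return "\n".join(lines)


failed = TrajectoryStub("model-a", "What is 2 + 2?", "4", "The answer is probably five.")
passed = TrajectoryStub("model-b", "What is 2 + 2?", "4", "4")
print(build_differential_prompt(failed, [passed]))
```

Capping both the number of successful examples and the length of each response keeps the prompt within a predictable size regardless of how many trajectories pass.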

---

## generate_rubric_feedback(): Rubric-Specific Analysis

Analyze why specific rubric traits failed or scored low.

In [None]:
# Simulate rubric scores
sample_rubric_scores = {
    "Conciseness": 3,  # 3/5 - moderate
    "ShowsWork": False,  # Failed - didn't show reasoning
    "CorrectFormat": True,  # Passed - answer format was correct
    "Accuracy": 0.4,  # Low score
}

if trajectories:
    traj = trajectories[0]

    print("Generating rubric feedback...")
    print(f"  Rubric scores: {sample_rubric_scores}")

    rubric_feedback = generator.generate_rubric_feedback(
        trajectory=traj,
        rubric_scores=sample_rubric_scores,
    )

    print("\n=== Rubric Feedback ===")
    print(rubric_feedback)
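
The sample scores above mix three value types: booleans, integers on a 5-point scale, and floats. Before deciding which traits deserve attention, it can help to put them on a common scale. The sketch below is one way to do that. The helpers are hypothetical, and the assumptions that integers are out of 5 and floats already lie in [0, 1] come from the comments in the cell above, not from karenina itself.

```python
from typing import Dict, List, Union

RubricValue = Union[bool, int, float]


def normalize_rubric_score(value: RubricValue, scale_max: int = 5) -> float:
    """Map a rubric value to [0, 1] under the assumptions stated above."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, int):
        return value / scale_max
    if isinstance(value, float):
        return value
    raise TypeError(f"unsupported rubric value: {value!r}")


def low_scoring_traits(scores: Dict[str, RubricValue], threshold: float = 0.5) -> List[str]:
    """Return trait names whose normalized score falls below `threshold`."""
    return sorted(
        name for name, value in scores.items()
        if normalize_rubric_score(value) < threshold
    )


sample = {"Conciseness": 3, "ShowsWork": False, "CorrectFormat": True, "Accuracy": 0.4}
print(low_scoring_traits(sample))  # ['Accuracy', 'ShowsWork']
```

Checking `bool` first matters: without it, `True` would be treated as the integer 1 and normalized to 0.2 instead of 1.0.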

---

## generate_complete_feedback(): Combined Analysis

This is the main entry point. It combines template verification feedback with rubric feedback in a single call.

In [None]:
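if trajectories:
    # Prefer a trajectory that actually failed; fall back to the first one for demonstration
    traj = failed_trajs[0] if failed_trajs else trajectories[0]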

    print("Generating complete feedback...")

    complete_feedback = generator.generate_complete_feedback(
        failed_trajectory=traj,
        successful_trajectories=passed_trajs if passed_trajs else None,
        rubric_scores=sample_rubric_scores,
    )

    print("\n" + "=" * 50)
    print("COMPLETE FEEDBACK")
    print("=" * 50)
    print(complete_feedback)
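
Conceptually, combined feedback is a set of optional sections joined into one report. The sketch below shows that idea in isolation. `combine_feedback` and its section titles are hypothetical and are not claimed to match the output format of `generate_complete_feedback()`.

```python
from typing import Optional


def combine_feedback(
    template_feedback: Optional[str],
    differential_feedback: Optional[str] = None,
    rubric_feedback: Optional[str] = None,
) -> str:
    """Join the available feedback sections, skipping any that are missing or blank."""
    sections = [
        ("Template feedback", template_feedback),
        ("Differential analysis", differential_feedback),
        ("Rubric feedback", rubric_feedback),
    ]
    parts = [
        f"=== {title} ===\n{body.strip()}"
        for title, body in sections
        if body and body.strip()
    ]
    return "\n\n".join(parts)


print(combine_feedback("Answer was 41, expected 42.", None, "ShowsWork failed."))
```

Skipping empty sections means the report degrades gracefully when, for example, no successful trajectories exist and differential analysis is unavailable.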

---

## Feedback System Prompts

The generator uses specialized system prompts for each feedback type.

In [None]:
from karenina.integrations.gepa.feedback import (
    DIFFERENTIAL_FEEDBACK_SYSTEM_PROMPT,
    RUBRIC_FEEDBACK_SYSTEM_PROMPT,
    SINGLE_FEEDBACK_SYSTEM_PROMPT,
)

print("=== Single Feedback System Prompt ===")
print(SINGLE_FEEDBACK_SYSTEM_PROMPT)
print()

In [None]:
print("=== Differential Feedback System Prompt ===")
print(DIFFERENTIAL_FEEDBACK_SYSTEM_PROMPT)
print()

In [None]:
print("=== Rubric Feedback System Prompt ===")
print(RUBRIC_FEEDBACK_SYSTEM_PROMPT)

---

## Integration with KareninaAdapter

The feedback generator integrates with the adapter via `make_reflective_dataset()`.
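
A reflective dataset maps each component being optimized to a list of per-example records. The sketch below builds such a mapping from plain dictionaries. The record keys (`Inputs`, `Generated Outputs`, `Feedback`) are assumed here, following the convention suggested in GEPA's adapter documentation; `build_reflective_records` is a hypothetical helper and not part of `KareninaAdapter`.

```python
from typing import Any, Dict, List


def build_reflective_records(
    component: str,
    examples: List[Dict[str, str]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Wrap (question, response, feedback) triples in an assumed reflective-dataset shape."""
    records = [
        {
            "Inputs": {"question": ex["question"]},
            "Generated Outputs": ex["response"],
            "Feedback": ex["feedback"],  # e.g. text produced by a feedback LLM
        }
        for ex in examples
    ]
    return {component: records}


dataset = build_reflective_records(
    "answering_system_prompt",
    [{"question": "What is 2 + 2?", "response": "5", "feedback": "Arithmetic error; expected 4."}],
)
print(dataset)
```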

In [None]:
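# Conceptual integration: printed as text rather than executed, because
# `verification_config`, `batch`, and `candidate` are not defined in this notebook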
print("""
# Integration with KareninaAdapter

from karenina.integrations.gepa import KareninaAdapter, OptimizationTarget

# Create adapter with feedback generator
adapter = KareninaAdapter(
    benchmark=benchmark,
    base_config=verification_config,
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    feedback_model_config=ModelConfig(
        model_provider="anthropic",
        model_name="claude-haiku-4-5",
        temperature=0.7,
        interface="langchain",
    ),
    enable_differential_analysis=True,
)

# Evaluate candidate
eval_batch = adapter.evaluate(batch, candidate, capture_traces=True)

# Build reflective dataset (uses LLMFeedbackGenerator internally)
reflective_data = adapter.make_reflective_dataset(
    candidate=candidate,
    eval_batch=eval_batch,
    components_to_update=["answering_system_prompt"],
)

# The reflective_data now contains LLM-generated feedback
# for GEPA's reflection LLM to use
""")

---

## Summary

| Method | Purpose | When to Use |
|--------|---------|-------------|
| `generate_single_feedback()` | Analyze one failure | Only one model is evaluated and it fails |
| `generate_differential_feedback()` | Compare success vs failure | Several models are evaluated and at least one passes |
| `generate_rubric_feedback()` | Analyze rubric scores | Rubric (quality trait) evaluation is enabled |
| `generate_complete_feedback()` | Combined analysis | Full optimization runs that need all feedback types |

## Key Benefits

1. **Rich diagnostics**: LLM provides human-readable explanations
2. **Differential analysis**: Learn from successful models
3. **Rubric integration**: Understand quality trait failures
4. **Actionable suggestions**: Concrete prompt improvements

## Next Steps

- [07_tracking_runs.ipynb](07_tracking_runs.ipynb) - Optimization run tracking