# Lesson 5: Implementing LLM Feedback Loops

## Code Generation and Debugging Assistant

In this hands-on exercise, you will implement iterative feedback loops where an AI generates, tests, and revises Python code snippets based on test results and feedback.

We will use an LLM feedback loop to create and iteratively improve a Python function called `process_data`, which is best described by the following examples:

```python
process_data([1, 2, 3, 4, 5], mode='average')  # Should return 3.0
process_data([1, 2, 'a', 3], mode='sum')  # Should return 6
```


### Outline:

- Setup
- Define Task and Test Cases
- Initial Generation
- Expand the Test Cases
- First Iteration with Feedback
- Create Feedback Loop
- Reflection

## 1. Setup

Import the necessary libraries and define helper functions: an LLM client wrapper, a code execution environment, and a test runner.

In [1]:
# Import necessary libraries
# No changes needed in this cell
from openai import OpenAI
from IPython.display import Markdown, display
import traceback
import io
import os
from contextlib import redirect_stdout, redirect_stderr
from enum import Enum
from dotenv import load_dotenv

In [2]:
load_dotenv()  # Load environment variables from .env file

True

In [None]:
# Set up LLM credentials

client = OpenAI(
    base_url="https://openai.vocareum.com/v1",
    # Uncomment one of the following
    # api_key="**********",  # <--- TODO: Fill in your Vocareum API key here
    # api_key=os.getenv(
    #     "OPENAI_API_KEY"
    # ),  # <-- Alternately, set as an environment variable
)

# If using OpenAI's API endpoint
# client = OpenAI()

In [4]:
# Define helper functions
# No changes needed in this cell


class OpenAIModels(str, Enum):
    GPT_4O_MINI = "gpt-4o-mini"
    GPT_41_MINI = "gpt-4.1-mini"
    GPT_41_NANO = "gpt-4.1-nano"


MODEL = OpenAIModels.GPT_41_NANO


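def get_completion(messages=None, system_prompt=None, user_prompt=None, model=MODEL):
    """
    Function to get a completion from the OpenAI API.
    Args:
        messages: Optional list of chat messages (dicts with "role" and "content")
        system_prompt: The system prompt
        user_prompt: The user prompt
        model: The model to use (defaults to MODEL, currently gpt-4.1-nano)
    Returns:
        The completion text
    """

    # Copy the list so the caller's messages are not mutated; tolerate messages=None
    messages = list(messages) if messages else []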
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})
    if user_prompt:
        messages.append({"role": "user", "content": user_prompt})
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content


def execute_code(code, test_cases):
    """
    Executes Python code and returns the results of test cases.
    Args:
        code: String containing Python code
        test_cases: List of dictionaries with inputs and expected outputs
    Returns:
        Dictionary containing execution results and test outcomes
    """
    results = {"execution_error": None, "test_results": [], "passed": 0, "failed": 0}

    # Create a namespace for execution
    namespace = {}

    # Capture stdout and stderr
    output_buffer = io.StringIO()

    try:
        with redirect_stdout(output_buffer), redirect_stderr(output_buffer):
            exec(code, namespace)

        # Run test cases
        for i, test in enumerate(test_cases):
            inputs = test["inputs"]
            expected = test["expected"]

            # Execute the function with test inputs
            try:
                if isinstance(inputs, dict):
                    actual = namespace["process_data"](**inputs)
                else:
                    actual = namespace["process_data"](*inputs)

                passed = actual == expected

                if passed:
                    results["passed"] += 1
                else:
                    results["failed"] += 1

                results["test_results"].append(
                    {
                        "test_id": i + 1,
                        "inputs": inputs,
                        "expected": expected,
                        "actual": actual,
                        "passed": passed,
                    }
                )
            except Exception as e:
                # If the error is the expected type, mark as passed
                passed = isinstance(expected, type) and isinstance(e, expected)
                results["test_results"].append(
                    {
                        "test_id": i + 1,
                        "inputs": inputs,
                        "expected": expected,
                        "error": str(e),
                        "passed": passed,
                    }
                )
                if passed:
                    results["passed"] += 1
                else:
                    results["failed"] += 1

    except Exception as e:
        results["execution_error"] = {
            "error_type": type(e).__name__,
            "error_message": str(e),
            "traceback": traceback.format_exc(),
        }

    results["stdout"] = output_buffer.getvalue()
    return results


# Function to format test results as feedback for the model
def format_feedback(results):
    """
    Formats test results into a clear feedback string for the model.
    Args:
        results: Dictionary containing execution results
    Returns:
        Formatted feedback string
    """
    feedback = []

    if results["execution_error"]:
        feedback.append(
            f"ERROR: Code execution failed with {results['execution_error']['error_type']}"
        )
        feedback.append(f"Message: {results['execution_error']['error_message']}")
        feedback.append("Traceback:")
        feedback.append(results["execution_error"]["traceback"])
        feedback.append("\nPlease fix the syntax or runtime errors in the code.")
        return "\n".join(feedback)

    feedback.append(
        f"Test Results: {results['passed']} passed, {results['failed']} failed"
    )

    if results["stdout"]:
        feedback.append(f"\nStandard output:\n{results['stdout']}")

    if results["failed"] > 0:
        feedback.append("\nFailed Test Cases:")
        for test in results["test_results"]:
            if not test.get("passed"):
                feedback.append(f"\nTest #{test['test_id']}:")
                feedback.append(f"  Inputs: {test['inputs']}")
                feedback.append(f"  Expected: {test['expected']}")
                if "actual" in test:
                    feedback.append(f"  Actual: {test['actual']}")
                if "error" in test:
                    feedback.append(f"  Error: {test['error']}")

    return "\n".join(feedback)
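
To make the test-case format concrete: each entry passed to `execute_code` is a dictionary with `inputs` (a tuple of positional arguments or a dict of keyword arguments) and `expected` (a return value, or an exception class when the call is supposed to raise). The sketch below mirrors that matching rule in isolation. The names `check_case` and `toy` are hypothetical and exist only for this illustration.

```python
def check_case(func, test):
    """Return True if func satisfies one test case (same rule as execute_code)."""
    inputs, expected = test["inputs"], test["expected"]
    try:
        # Dict inputs are passed as keyword arguments, anything else positionally
        actual = func(**inputs) if isinstance(inputs, dict) else func(*inputs)
    except Exception as exc:
        # A raised exception only counts as a pass when the test expects that type
        return isinstance(expected, type) and isinstance(exc, expected)
    return actual == expected


def toy(data, mode="sum"):
    """Hypothetical stand-in for a generated function."""
    if mode != "sum":
        raise ValueError(f"unsupported mode: {mode}")
    return sum(data)


cases = [
    {"inputs": ([1, 2, 3], "sum"), "expected": 6},
    {"inputs": {"data": [4, 5], "mode": "sum"}, "expected": 9},
    {"inputs": ([1, 2, 3], "bogus"), "expected": ValueError},
]
case_outcomes = [check_case(toy, case) for case in cases]
print(case_outcomes)
```

All three cases pass here: two by value comparison and one because the raised `ValueError` matches the expected exception class.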

## 2. Define Task and Test Cases

We will create a Python function called `process_data` that analyzes numerical data with the following (possibly incomplete) set of requirements:

1. The function should accept a list of numbers and an optional parameter 'mode' that can be 'sum' or 'average' (default should be 'average').
2. If mode is 'sum', return the sum of all numbers.
3. If mode is 'average', return the average (mean) of all numbers.

Example:
```python
process_data([1, 2, 3, 4, 5], mode='average')  # Should return 3.0
process_data([1, 2, 'a', 3], mode='sum')  # Should return 6
```

In [None]:
# COMPLETED: Clear description of the task for the LLM
# This description defines exactly what the process_data function must do
task_description = """
Create a Python function called `process_data` that processes a list of values with different modes:

Requirements:
1. Accept a list as the first parameter and a 'mode' parameter (default: 'average')
2. Support three modes:
   - 'sum': Return the sum of all numeric values
   - 'average': Return the average (mean) of all numeric values
   - 'median': Return the median of all numeric values
3. Filter out non-numeric values (ignore strings, None, etc.)
4. Return None if the list is empty or contains no numeric values
5. Raise ValueError if an invalid mode is provided

Examples:
- process_data([1, 2, 3, 4, 5], mode='average') should return 3.0
- process_data([1, 2, 'a', 3], mode='sum') should return 6 (ignoring 'a')
- process_data([], mode='sum') should return None
"""

In [None]:
# COMPLETED: Initial (simple) test cases to validate the function
# We start with basic cases so the LLM generates a first version
test_cases = [
    # Basic sum test
    {"inputs": ([1, 2, 3, 4, 5], "sum"), "expected": 15},
    # Basic average test
    {"inputs": ([1, 2, 3, 4, 5], "average"), "expected": 3.0},
    # Sum test with different numbers
    {"inputs": ([10, 20, 30], "sum"), "expected": 60},
    # Average test with different numbers
    {"inputs": ([2, 4, 6, 8], "average"), "expected": 5.0},
]

## 3. Initial Generation

Let's start with a basic prompt to generate an initial solution to our problem.

In [None]:
# COMPLETED: Initial prompt asking the LLM to generate the first version of the code
# The prompt is clear and specific, requesting only the function with no explanations
initial_prompt = f"""
You are an expert Python developer.

{task_description}

Write only the function surrounded by ```python and ``` without any additional explanations or examples.

Example format:

```python
def process_data(data, mode='average'):
    # Your implementation here
    pass
```
"""

# Get initial completion
messages = [{"role": "user", "content": initial_prompt}]
initial_response = get_completion(messages)


def extract_code(code):
    """Extract the Python code from the LLM response (between ```python and ```)."""
    lines = code.split("\n")
    start = lines.index("```python") + 1
    end = lines.index("```", start)
    return "\n".join(lines[start:end])


# Extract the generated code
initial_code = extract_code(initial_response)

print("Initial Generated Code:")
print(initial_code)

# Execute and test the initial code
initial_results = execute_code(initial_code, test_cases)
initial_feedback = format_feedback(initial_results)

print("\nTest Results:")
print(initial_feedback)
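
The `extract_code` helper above relies on exact line matches, so it raises `ValueError` when the model adds trailing spaces to the fence or omits the fence entirely. A more lenient variant could look like the following sketch. The name `extract_code_lenient` is hypothetical, and treating an unfenced reply as bare code is an assumption that may not suit every prompt.

```python
import re

# An opening fence with an optional language tag, then everything up to the closing fence
CODE_BLOCK = re.compile(
    r"`{3}[ \t]*(?:python|py)?[ \t]*\r?\n(.*?)`{3}",
    re.DOTALL | re.IGNORECASE,
)


def extract_code_lenient(response):
    """Return the first fenced code block, or the stripped response if none is found."""
    match = CODE_BLOCK.search(response)
    if match:
        return match.group(1).rstrip()
    # Assumption: a reply without fences is treated as bare code
    return response.strip()


fence = "`" * 3  # built at runtime so the sample does not close this example's own fence
sample = (
    f"Here you go:\n{fence}python  \n"
    "def process_data(data, mode='average'):\n    pass\n"
    f"{fence}\nDone."
)
extracted = extract_code_lenient(sample)
print(extracted)
```

The sample reply has surrounding chatter and trailing spaces after the language tag, both of which the line-based version would not tolerate.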

## 4. Expand the Test Cases

Now, pretend that you've used this code in a production setting and have received feedback. The first version of your generated code worked marvelously, and now you are seeking to expand the capabilities of your function.

Unfortunately, your product manager is on vacation, but you know your function needs to:
1) support a new mode, "median"
2) ignore non-numeric values
3) handle empty lists, returning None

So, following test-driven development practices, you update your tests:


In [None]:
# These are the new test cases. No updates needed.
test_cases = [
    {"inputs": ([1, 2, 3, 4, 5], "sum"), "expected": 15},
    {"inputs": ([1, 2, 3, 4, 5], "average"), "expected": 3.0},
    {"inputs": ([11, 12, 13, 14, 15], "sum"), "expected": 65},
    {"inputs": ([11, 12, 13, 14, 15], "average"), "expected": 13.0},
    {"inputs": ([], "sum"), "expected": None},
    {"inputs": ([1, 3, 4], "median"), "expected": 3},
    {"inputs": ([1, 2, 3, 5], "median"), "expected": 2.5},
    {"inputs": ([1, 2, "a", 3], "sum"), "expected": 6},
    {"inputs": ([1, 2, None, 3, "b", 4], "average"), "expected": 2.5},
    {"inputs": ([10], "median"), "expected": 10},
    {"inputs": ([], "median"), "expected": None},
    {"inputs": ([1, 2, 3, 4, 5], "invalid_mode"), "expected": ValueError},
]
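
For comparison with whatever the model produces, here is a hand-written sketch that should satisfy the expanded test cases. It is one possible implementation, not the expected LLM output. The name `process_data_reference` is hypothetical, and excluding booleans from the numeric values is an assumption that the requirements do not address.

```python
from statistics import median


def process_data_reference(data, mode="average"):
    """Hand-written sketch: sum, average, or median of the numeric items in data."""
    if mode not in ("sum", "average", "median"):
        raise ValueError(f"Invalid mode: {mode!r}")
    # Keep ints and floats only; bool is excluded although it subclasses int (assumption)
    numbers = [
        x for x in data
        if isinstance(x, (int, float)) and not isinstance(x, bool)
    ]
    if not numbers:
        return None
    if mode == "sum":
        return sum(numbers)
    if mode == "average":
        return sum(numbers) / len(numbers)
    return median(numbers)


print(process_data_reference([1, 2, "a", 3], mode="sum"))
print(process_data_reference([1, 2, 3, 5], mode="median"))
```

Validating the mode before filtering means an invalid mode raises `ValueError` even for an empty list, which is one reasonable reading of the requirements.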

In [None]:
# Re-test the code
# No updates are needed in this cell
print("Initial Generated Code:")
print(initial_code)

# Execute and test the initial code
initial_results = execute_code(initial_code, test_cases)
initial_feedback = format_feedback(initial_results)

print("\nTest Results:")
print(initial_feedback)

## 5. First Iteration with Feedback
Now, let's feed the test results back to the model and ask for an improved version.

In [None]:
# COMPLETED: First iteration with feedback
# Here we show the LLM the test results and ask it to improve the code
feedback_prompt = f"""
You are an expert Python developer. You wrote a function based on these requirements:

{task_description}

Here is your current implementation:
```python
{initial_code}
```

I've tested your code and here are the results:
{initial_feedback}

Please improve your code to fix any issues and make all tests pass.
Write only the improved function surrounded by ```python and ``` without any explanations.
"""

messages = [{"role": "user", "content": feedback_prompt}]

# Get improved code
improved_response = get_completion(messages)

# Extract the improved code
improved_code = extract_code(improved_response)

print("\nImproved Code:")
print(improved_code)

# Execute and test the improved code
improved_results = execute_code(improved_code, test_cases)
improved_feedback = format_feedback(improved_results)
print("\nTest Results for Improved Code:")
print(improved_feedback)
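
Because the same feedback prompt is needed again on every later iteration, it can help to assemble it in one place. The following is a minimal sketch of such a helper. The name `build_feedback_prompt` is hypothetical, and the wording simply copies the prompt used in this section.

```python
def build_feedback_prompt(task_description, code, feedback):
    """Assemble the revision prompt from the task, the current code, and the test feedback."""
    fence = "`" * 3  # built at runtime to avoid nesting literal fences in this example
    return "\n".join([
        "You are an expert Python developer. You wrote a function based on these requirements:",
        "",
        task_description.strip(),
        "",
        "Here is your current implementation:",
        f"{fence}python",
        code.strip(),
        fence,
        "",
        "I've tested your code and here are the results:",
        feedback.strip(),
        "",
        "Please improve your code to fix any issues and make all tests pass.",
        f"Write only the improved function surrounded by {fence}python and {fence} "
        "without any explanations.",
    ])


demo_prompt = build_feedback_prompt(
    "Create a function called process_data.",
    "def process_data(data, mode='average'):\n    pass",
    "Test Results: 0 passed, 4 failed",
)
print(demo_prompt)
```

In the notebook, the three arguments would be `task_description`, the most recent code, and the output of `format_feedback`.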

## 6. Create Feedback Loop

We may want to give the LLM more than one chance to generate the correct code. We may even want to introduce test cases gradually, so that it has the opportunity to fix errors one at a time.

Let's build a loop that starts from scratch and runs until the code passes every test or a maximum number of iterations is reached.

In [None]:
# COMPLETED: Full feedback loop - iterates until all tests pass
# This is the feedback loop pattern: generate → test → feed back → improve
from pprint import pprint

iterations = []

# ====================
# STEP 1: Initial generation
# ====================
# Get initial completion and extract code
messages = [{"role": "user", "content": initial_prompt}]
initial_response = get_completion(messages)
initial_code = extract_code(initial_response)

# Execute and test the initial code
initial_results = execute_code(initial_code, test_cases)
initial_feedback = format_feedback(initial_results)

# Store the initial iteration
iterations.append(
    {
        "iteration": 0,
        "code": initial_code,
        "test_results": {
            "passed": initial_results["passed"],
            "failed": initial_results["failed"],
        },
    }
)

print("=== ITERATION 0 (Initial Generation) ===")
pprint(iterations[-1]["test_results"])

current_code = initial_code
current_feedback = initial_feedback

# ====================
# STEP 2: Iterative improvement loop
# ====================
# Loop to improve the code based on feedback
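for i in range(3):  # At most 3 improvement iterations
    # If every test passes, exit the loop. Comparing against len(test_cases)
    # (rather than failed == 0) avoids reporting success when the code failed
    # to execute at all, in which case both counters stay at 0.
    if iterations[-1]["test_results"]["passed"] == len(test_cases):
        print("\n✅ Success! All tests passed.")
        break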
    
    print(f"\n=== ITERATION {i+1} (Improvement) ===")
    
    # Build the feedback prompt from the current code and the test results
    feedback_prompt = f"""
You are an expert Python developer. You wrote a function based on these requirements:

{task_description}

Here is your current implementation:
```python
{current_code}
```

I've tested your code and here are the results:
{current_feedback}

Please improve your code to fix any issues and make sure it passes all test cases.
Write only the improved function surrounded by ```python and ``` without any explanation.
"""

    # Get improved code from the LLM
    messages = [{"role": "user", "content": feedback_prompt}]
    improved_response = get_completion(messages)
    improved_code = extract_code(improved_response)

    # Execute and test the improved code
    improved_results = execute_code(improved_code, test_cases)
    improved_feedback = format_feedback(improved_results)
    
    # Store this iteration
    iterations.append(
        {
            "iteration": i + 1,
            "code": improved_code,
            "test_results": {
                "passed": improved_results["passed"],
                "failed": improved_results["failed"],
            },
        }
    )
    pprint(iterations[-1]["test_results"])

    # Update state for the next iteration
    current_code = improved_code
    current_feedback = improved_feedback

print("\n" + "="*60)
print("FEEDBACK LOOP COMPLETED")
print("="*60)
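
The idea of introducing test cases gradually, mentioned at the start of this section, is not implemented in the loop above. One way to approach it is to feed the loop growing prefixes of the test list and advance only once the current prefix passes. The sketch below covers just the staging part. The name `cumulative_stages` and the `stage_size` parameter are hypothetical.

```python
def cumulative_stages(test_cases, stage_size):
    """Yield growing prefixes of test_cases: the first stage_size cases, then twice as many, and so on."""
    if stage_size < 1:
        raise ValueError("stage_size must be at least 1")
    for end in range(stage_size, len(test_cases) + stage_size, stage_size):
        # Slicing past the end is safe, so the last stage always contains every case
        yield test_cases[:end]


# Made-up cases, used only to show how the stages grow
demo_cases = [{"inputs": ([n],), "expected": n} for n in range(5)]
stage_lengths = [len(stage) for stage in cumulative_stages(demo_cases, stage_size=2)]
print(stage_lengths)
```

Each stage could then be passed to `execute_code` in place of the full `test_cases` list. Keeping earlier cases in every later stage guards against regressions.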

In [None]:
# View a summary of the different iterations
from pprint import pprint
pprint(iterations, width=200)

In [None]:
# Print the final code
print(iterations[-1]["code"])
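
The last iteration is not guaranteed to be the best one, since a revision can break tests that previously passed. A small sketch for picking the strongest version from the recorded history follows. The helper name `best_iteration` is hypothetical, and the sample records are made up to show the shape of the `iterations` list.

```python
def best_iteration(iterations):
    """Return the record with the most passing tests; ties go to the earliest iteration."""
    return max(iterations, key=lambda record: record["test_results"]["passed"])


# Made-up history in the same shape as the notebook's `iterations` list
demo_iterations = [
    {"iteration": 0, "code": "v0", "test_results": {"passed": 4, "failed": 8}},
    {"iteration": 1, "code": "v1", "test_results": {"passed": 11, "failed": 1}},
    {"iteration": 2, "code": "v2", "test_results": {"passed": 9, "failed": 3}},
]
best = best_iteration(demo_iterations)
print(best["iteration"], best["test_results"])
```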


## 7. Reflection & Transfer

### 📊 Analysis of Improvements by Iteration

**Observations about the iterations:**

1. **Correctness**:
   - Did the number of passing tests increase with each iteration?
   - The feedback loop lets the LLM correct its specific errors

2. **Error handling**:
   - First iteration: basic code that may fail on edge cases
   - Later iterations: add validation and exception handling

3. **Edge cases**:
   - Empty list → must return None
   - Non-numeric values → must be filtered out
   - Invalid mode → must raise ValueError

4. **Readability**:
   - The code becomes more robust and better documented
   - Clear validation checks are added

---

### 🔄 Effectiveness of the Feedback Loop

**Advantages of the approach:**

- ✅ **Iterative and specific**: The LLM receives detailed feedback about what failed
- ✅ **Progressive refinement**: Each iteration starts from the previous version and its test results
- ✅ **Automatable**: The pattern can be applied to other code generation tasks
- ✅ **Test-driven**: Uses TDD (Test-Driven Development) as a guide

**Types of problems solved per iteration:**
- Iteration 0: Basic implementation (sum/average)
- Iteration 1: Add 'median' mode, filter non-numeric values
- Iteration 2: Handle empty lists, validate invalid mode
- Iteration 3: Final refinements

**Persistent problems:**
- The LLM sometimes overcomplicates the solution
- Some very specific edge cases may require more iterations

---

### 💡 Key Lessons for Reuse

**Feedback loop pattern for code generation:**

```python
# Pattern outline only: llm, execute_and_test, all_tests_passed, and
# max_iterations are placeholders, not objects defined in this notebook.

# 1. DEFINE TASK + TEST CASES
task_description = "..."
test_cases = [...]

# 2. GENERATE INITIAL CODE
code = llm.generate(task_description)

# 3. IMPROVEMENT LOOP
for iteration in range(max_iterations):
    # 3a. Execute the code and collect results
    results = execute_and_test(code, test_cases)

    # 3b. If every test passes, stop
    if all_tests_passed(results):
        break

    # 3c. Build structured feedback
    feedback = format_feedback(results)

    # 3d. Ask the LLM to improve the code
    code = llm.improve(code, feedback)
```
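
The outline above can be turned into a small runnable function by passing the three steps in as callables. The sketch below does this under stated assumptions: `run_feedback_loop`, `generate`, `revise`, and `evaluate` are hypothetical names, and the "LLM" is a canned list of answers so that no API call is made.

```python
def run_feedback_loop(generate, revise, evaluate, max_iterations=3):
    """Generic generate/test/revise loop.

    generate() returns code, revise(code, feedback) returns new code,
    and evaluate(code) returns a (all_passed, feedback) pair.
    """
    code = generate()
    history = []
    for iteration in range(max_iterations + 1):
        all_passed, feedback = evaluate(code)
        history.append({"iteration": iteration, "code": code, "passed": all_passed})
        # Stop on success, or once the revision budget is used up
        if all_passed or iteration == max_iterations:
            break
        code = revise(code, feedback)
    return code, history


# Stand-in "LLM": canned answers returned in order
canned = iter(["def f(x): return x", "def f(x): return x + 1"])


def evaluate(code):
    """Toy evaluator: the generated f must map 1 to 2."""
    namespace = {}
    exec(code, namespace)
    actual = namespace["f"](1)
    return actual == 2, f"f(1) returned {actual}, expected 2"


final_code, history = run_feedback_loop(
    generate=lambda: next(canned),
    revise=lambda code, feedback: next(canned),
    evaluate=evaluate,
)
print(final_code)
print([step["passed"] for step in history])
```

Evaluating inside the loop before deciding whether to revise means the final revision is always tested, so a success on the last allowed attempt is still recorded.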

**Potential improvements:**
1. Add more context to the feedback (e.g., the full traceback)
2. Try different temperatures (lower = more deterministic)
3. Implement "reflection", where the LLM explains what changed and why
4. Save every version of the code for analysis
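
Since each version of the code is already stored in `iterations`, one lightweight way to analyze what changed between versions is a unified diff from the standard library's `difflib`. The sketch below is illustrative only: `diff_versions` is a hypothetical helper, and the two code strings are made up.

```python
import difflib


def diff_versions(old_code, new_code, old_label="previous", new_label="revised"):
    """Return a unified diff between two versions of generated code."""
    diff_lines = difflib.unified_diff(
        old_code.splitlines(),
        new_code.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(diff_lines)


# Made-up versions standing in for iterations[i]["code"] and iterations[i + 1]["code"]
v0 = "def process_data(data, mode='average'):\n    return sum(data)"
v1 = "def process_data(data, mode='average'):\n    return sum(data) / len(data)"
version_diff = diff_versions(v0, v1)
print(version_diff)
```

A diff like this could also be included in the next prompt so the model sees which of its edits did not help.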

---

### 🆚 Comparison with Traditional Approaches

| Aspect | Traditional | LLM Feedback Loop |
|--------|-------------|-------------------|
| **Debugging** | Manual, line by line | Automated, test-based |
| **Iterations** | Slow (a human writes the code) | Fast (the LLM generates in seconds) |
| **Coverage** | Depends on the developer | Systematic (all test cases) |
| **Learning** | Accumulated experience | Immediate, specific feedback |
| **Scalability** | Limited by human time | High (multiple tasks in parallel) |

**When to use feedback loops:**
- ✅ Well-defined tasks with clear tests
- ✅ Problems that require multiple iterations
- ✅ Automating debugging and refinement
- ✅ Code generation with complex requirements

**When NOT to use them:**
- ❌ Creative tasks without clear success criteria
- ❌ Problems that require deep human context
- ❌ When the feedback cannot be automated

## Summary

In this exercise, we explored how LLM feedback loops can be used to iteratively improve code generation.

By providing structured feedback about test failures, we enabled the model to focus on specific issues and incrementally improve its solution.

The key insight is that well-structured feedback loops can significantly enhance the quality and correctness of AI-generated code, especially for complex tasks with multiple (possibly incomplete) requirements and edge cases.


Congratulations on completing this exercise! Give yourself a hand! 🤗🤗