# Week 13: Software Engineering Evaluation
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand how to evaluate LLM code generation and debugging capabilities
2. Use a synthetic bug-fix dataset for evaluation
3. Implement pass/fail metrics using unit tests
4. Automatically score model outputs using BenchRight's engine
5. Analyze which types of bugs are hardest for models to fix

---

## 🧠 Why Code Evaluation is Different

### The Challenge

Unlike natural language tasks, code has a **strict correctness criterion**:

| Aspect | Natural Language | Code |
|--------|------------------|------|
| Correctness | Multiple phrasings OK | Must pass all tests |
| Evaluation | Human judgment needed | Automated testing possible |
| Errors | Graceful degradation | Syntax error = complete failure |
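
The last row of the table can be illustrated with a short sketch. It uses the standard-library `ast.parse` to check whether a candidate parses at all before any test runs. The helper name `has_valid_syntax` and the two sample strings are assumptions made for illustration; they are not part of BenchRight.

```python
import ast


def has_valid_syntax(source: str) -> bool:
    """Return True if `source` parses as Python, False on a SyntaxError."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# A fix with a single unclosed parenthesis cannot pass any test.
good = "def is_even(n):\n    return n % 2 == 0\n"
broken = "def is_even(n:\n    return n % 2 == 0\n"

print(has_valid_syntax(good))    # True
print(has_valid_syntax(broken))  # False
```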

### What We Evaluate

1. **Functional Correctness:** Does the code pass unit tests?
2. **Bug Detection:** Can the model identify what's wrong?
3. **Fix Quality:** Does the fix resolve the issue without side effects?
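
As a rough illustration of the first criterion, the sketch below loads a candidate function with `exec` and compares its output against a few expected values in the current process. The helper `passes_tests` and the sample cases are assumptions for illustration only. Running untrusted model output with `exec` is unsafe outside a sandbox, which is one reason the evaluator later in this notebook uses a separate process.

```python
from typing import Any, List, Tuple


def passes_tests(source: str, func_name: str, cases: List[Tuple[tuple, Any]]) -> bool:
    """Return True if the function defined in `source` matches every case."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # Unsafe for untrusted code; demo only.
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing names, and runtime errors all count as failures.
        return False


cases = [((5,), 15), ((1,), 1), ((10,), 55)]
buggy = "def sum_to_n(n):\n    return sum(range(n))\n"
fixed = "def sum_to_n(n):\n    return sum(range(1, n + 1))\n"

print(passes_tests(buggy, "sum_to_n", cases))  # False
print(passes_tests(fixed, "sum_to_n", cases))  # True
```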

---

## 🛠️ Step 1: Setup & Dependencies

In [None]:
# Standard library imports
import sys
import json
import subprocess
import tempfile
import os
from typing import Dict, List, Any, Callable, Iterator, Tuple

# Add the current directory to the import path (useful when running in Colab)
sys.path.insert(0, '.')

# For data display
try:
    from IPython.display import display, HTML
except ImportError:
    display = print

print("✅ Setup complete!")

---

## 🐛 Step 2: Define the Synthetic Bug-Fix Dataset

In [None]:
# Define a comprehensive synthetic bug-fix dataset
# Each entry contains: description, buggy code, expected fix, and unit tests

BUG_FIX_DATASET = [
    {
        "name": "Off-by-One Error",
        "description": "Calculate the sum of all numbers from 1 to n (inclusive).",
        "buggy_code": '''def sum_to_n(n):
    total = 0
    for i in range(n):  # Bug: should be range(1, n+1)
        total += i
    return total''',
        "expected_fix": '''def sum_to_n(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total''',
        "test_code": '''def test_sum_to_n():
    assert sum_to_n(5) == 15, "sum_to_n(5) should be 15"
    assert sum_to_n(1) == 1, "sum_to_n(1) should be 1"
    assert sum_to_n(10) == 55, "sum_to_n(10) should be 55"''',
        "bug_type": "off-by-one",
    },
    {
        "name": "Wrong Operator",
        "description": "Check if a number is even.",
        "buggy_code": '''def is_even(n):
    return n % 2 == 1  # Bug: should check == 0, not == 1''',
        "expected_fix": '''def is_even(n):
    return n % 2 == 0''',
        "test_code": '''def test_is_even():
    assert is_even(2) == True, "2 should be even"
    assert is_even(3) == False, "3 should be odd"
    assert is_even(0) == True, "0 should be even"
    assert is_even(-4) == True, "-4 should be even"''',
        "bug_type": "wrong-operator",
    },
    {
        "name": "Missing Return Statement",
        "description": "Find the maximum value in a list.",
        "buggy_code": '''def find_max(lst):
    if not lst:
        return None
    max_val = lst[0]
    for item in lst:
        if item > max_val:
            max_val = item
    # Bug: missing return statement''',
        "expected_fix": '''def find_max(lst):
    if not lst:
        return None
    max_val = lst[0]
    for item in lst:
        if item > max_val:
            max_val = item
    return max_val''',
        "test_code": '''def test_find_max():
    assert find_max([1, 5, 3, 9, 2]) == 9, "max of [1,5,3,9,2] should be 9"
    assert find_max([42]) == 42, "max of [42] should be 42"
    assert find_max([]) == None, "max of [] should be None"
    assert find_max([-1, -5, -2]) == -1, "max of negative list"''',
        "bug_type": "missing-return",
    },
    {
        "name": "Wrong Comparison Direction",
        "description": "Return a list of numbers greater than a threshold.",
        "buggy_code": '''def filter_greater_than(numbers, threshold):
    result = []
    for n in numbers:
        if n < threshold:  # Bug: should be > not <
            result.append(n)
    return result''',
        "expected_fix": '''def filter_greater_than(numbers, threshold):
    result = []
    for n in numbers:
        if n > threshold:
            result.append(n)
    return result''',
        "test_code": '''def test_filter_greater_than():
    assert filter_greater_than([1, 5, 10, 3], 4) == [5, 10]
    assert filter_greater_than([1, 2, 3], 10) == []
    assert filter_greater_than([], 5) == []''',
        "bug_type": "wrong-comparison",
    },
    {
        "name": "String Concatenation Order",
        "description": "Reverse a string.",
        "buggy_code": '''def reverse_string(s):
    result = ""
    for char in s:
        result = result + char  # Bug: should prepend, not append
    return result''',
        "expected_fix": '''def reverse_string(s):
    result = ""
    for char in s:
        result = char + result
    return result''',
        "test_code": '''def test_reverse_string():
    assert reverse_string("hello") == "olleh"
    assert reverse_string("a") == "a"
    assert reverse_string("") == ""
    assert reverse_string("ab") == "ba"''',
        "bug_type": "string-order",
    },
]

print(f"📊 Defined {len(BUG_FIX_DATASET)} bug-fix test cases:")
for i, tc in enumerate(BUG_FIX_DATASET, 1):
    print(f"   {i}. {tc['name']} (type: {tc['bug_type']})")
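
Because every later step indexes into these dictionaries by key, it can help to check the schema before running anything. The following is a minimal sketch; the helper `find_schema_problems`, the `REQUIRED_KEYS` constant, and the two-entry `sample` list are hypothetical and only mirror the six keys used in the dataset above.

```python
from typing import Dict, List

REQUIRED_KEYS = {"name", "description", "buggy_code", "expected_fix", "test_code", "bug_type"}


def find_schema_problems(dataset: List[Dict[str, str]]) -> List[str]:
    """Return a human-readable problem list; an empty list means the schema is OK."""
    problems = []
    for index, entry in enumerate(dataset):
        present = set(entry)
        missing = REQUIRED_KEYS - present
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
        empty = [k for k in REQUIRED_KEYS & present if not str(entry[k]).strip()]
        if empty:
            problems.append(f"entry {index}: empty fields {sorted(empty)}")
    return problems


sample = [
    {"name": "Wrong Operator", "description": "Check if a number is even.",
     "buggy_code": "def is_even(n):\n    return n % 2 == 1",
     "expected_fix": "def is_even(n):\n    return n % 2 == 0",
     "test_code": "def test_is_even():\n    assert is_even(2)",
     "bug_type": "wrong-operator"},
    {"name": "Incomplete entry", "description": "", "bug_type": "demo"},
]
for problem in find_schema_problems(sample):
    print(problem)
```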

---

## 🧪 Step 3: Implement the CodeEvaluator Class

In [None]:
class CodeEvaluator:
    """
    Evaluator for code generation and bug-fix tasks.
    
    Uses unit tests to determine pass/fail for generated code.
    """
    
    def __init__(self, timeout_seconds: int = 5):
        """
        Initialize the CodeEvaluator.
        
        Args:
            timeout_seconds: Maximum time allowed for test execution
        """
        self.timeout_seconds = timeout_seconds
    
    def evaluate_code(
        self,
        generated_code: str,
        test_code: str,
    ) -> Dict[str, Any]:
        """
        Evaluate generated code using unit tests.
        
        Args:
            generated_code: The code to evaluate
            test_code: Unit test code to verify correctness
            
        Returns:
            Dictionary with:
            - passed: bool indicating if all tests passed
            - error: str with error message if failed
            - output: str with test output
        """
        # Combine generated code with tests
        full_code = f"{generated_code}\n\n{test_code}\n\n"
        full_code += "if __name__ == '__main__':\n"
        full_code += "    import sys\n"
        full_code += "    # Run all test functions\n"
        full_code += "    test_functions = [name for name in dir() if name.startswith('test_')]\n"
        full_code += "    all_passed = True\n"
        full_code += "    for test_name in test_functions:\n"
        full_code += "        try:\n"
        full_code += "            globals()[test_name]()\n"
        full_code += "            print(f'✓ {test_name} passed')\n"
        full_code += "        except AssertionError as e:\n"
        full_code += "            print(f'✗ {test_name} failed: {e}')\n"
        full_code += "            all_passed = False\n"
        full_code += "        except Exception as e:\n"
        full_code += "            print(f'✗ {test_name} error: {e}')\n"
        full_code += "            all_passed = False\n"
        full_code += "    if all_passed:\n"
        full_code += "        print('All tests passed!')\n"
        full_code += "    else:\n"
        full_code += "        sys.exit(1)\n"
        
        temp_path = None
        try:
            # Write code to temporary file
            with tempfile.NamedTemporaryFile(
                mode='w',
                suffix='.py',
                delete=False,
                encoding='utf-8'
            ) as f:
                f.write(full_code)
                temp_path = f.name

            # Execute the code with the same interpreter that runs this notebook
            result = subprocess.run(
                [sys.executable, temp_path],
                capture_output=True,
                text=True,
                timeout=self.timeout_seconds
            )
            
            # Check result
            if result.returncode == 0:
                return {
                    "passed": True,
                    "error": None,
                    "output": result.stdout,
                }
            else:
                return {
                    "passed": False,
                    "error": result.stderr or result.stdout,
                    "output": result.stdout,
                }
                
        except subprocess.TimeoutExpired:
            return {
                "passed": False,
                "error": f"Execution timed out after {self.timeout_seconds} seconds",
                "output": "",
            }
        # Note: a SyntaxError in the generated code is raised inside the child
        # process, so it is reported through the non-zero return code and
        # stderr above rather than as an exception in this process.
        except Exception as e:
            return {
                "passed": False,
                "error": f"Execution error: {str(e)}",
                "output": "",
            }
        finally:
            # Clean up temporary file
            if temp_path and os.path.exists(temp_path):
                os.unlink(temp_path)
    
    def compute_pass_rate(self, results: List[Dict[str, Any]]) -> float:
        """
        Compute the pass rate across multiple evaluations.
        
        Args:
            results: List of evaluation results
            
        Returns:
            Pass rate as a float between 0.0 and 1.0
        """
        if not results:
            return 0.0
        passed = sum(1 for r in results if r["passed"])
        return passed / len(results)


print("✅ CodeEvaluator class defined!")
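
The core of the subprocess approach can be sketched in isolation. The helper `run_snippet` below is hypothetical: it passes the code to a child interpreter with `sys.executable -c` instead of a temporary file, and it shows the three outcomes that `evaluate_code` distinguishes (success, a non-zero exit, and a timeout).

```python
import subprocess
import sys
from typing import Tuple


def run_snippet(code: str, timeout_seconds: float = 5.0) -> Tuple[str, str]:
    """Run `code` in a child Python process and return (status, detail)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return "timeout", f"no result after {timeout_seconds} seconds"
    if result.returncode == 0:
        return "passed", result.stdout
    # Failed assertions and syntax errors both surface here via stderr.
    return "failed", result.stderr


print(run_snippet("assert 1 + 1 == 2")[0])
print(run_snippet("assert 1 + 1 == 3")[0])
print(run_snippet("while True: pass", timeout_seconds=1)[0])
```

A process boundary keeps crashes and infinite loops in the candidate code from taking down the notebook kernel, but it is not a security sandbox: the child still has the same file and network access as the parent.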

---

## 🏃 Step 4: Initialize the Evaluator

In [None]:
# Create the CodeEvaluator
evaluator = CodeEvaluator(timeout_seconds=5)

print("✅ CodeEvaluator initialized!")
print(f"   Timeout: {evaluator.timeout_seconds} seconds")

---

## ✓ Step 5: Verify Expected Fixes Pass All Tests

First, let's verify that our expected fixes actually pass the unit tests.

In [None]:
print("🔍 Verifying Expected Fixes...")
print("=" * 70)

verification_results = []
for tc in BUG_FIX_DATASET:
    result = evaluator.evaluate_code(
        tc["expected_fix"],
        tc["test_code"]
    )
    
    verification_results.append({
        "name": tc["name"],
        "bug_type": tc["bug_type"],
        "passed": result["passed"],
        "error": result["error"],
    })
    
    status = "✅ PASS" if result["passed"] else "❌ FAIL"
    print(f"\n{status} {tc['name']}")
    if result["output"]:
        for line in result["output"].strip().split('\n'):
            print(f"   {line}")

# Summary
pass_rate = evaluator.compute_pass_rate(verification_results)
print("\n" + "=" * 70)
print(f"📊 Verification Pass Rate: {pass_rate:.0%}")
if pass_rate == 1.0:
    print("✅ All expected fixes pass their tests!")

---

## ✗ Step 6: Verify Buggy Code Fails Tests

Now let's verify that the buggy code actually fails the tests.

In [None]:
print("🐛 Verifying Buggy Code Fails...")
print("=" * 70)

buggy_results = []
for tc in BUG_FIX_DATASET:
    result = evaluator.evaluate_code(
        tc["buggy_code"],
        tc["test_code"]
    )
    
    buggy_results.append({
        "name": tc["name"],
        "bug_type": tc["bug_type"],
        "passed": result["passed"],
    })
    
    # For buggy code, we WANT it to fail
    status = "✅ Correctly fails" if not result["passed"] else "⚠️ Unexpectedly passes"
    print(f"\n{status}: {tc['name']}")
    print(f"   Bug type: {tc['bug_type']}")

# Summary - for buggy code, pass rate should be 0%
fail_rate = 1.0 - evaluator.compute_pass_rate(buggy_results)
print("\n" + "=" * 70)
print(f"📊 Buggy Code Fail Rate: {fail_rate:.0%}")
if fail_rate == 1.0:
    print("✅ All buggy code correctly fails tests!")

---

## 🤖 Step 7: Define Mock Model for Demonstration

In [None]:
class MockCodeFixModel:
    """
    Mock model that simulates LLM code fix responses.
    
    For demonstration, it alternates between a predefined "incorrect"
    fix and a predefined "correct" fix on successive calls.
    """
    
    def __init__(self, success_rate: float = 0.6):
        """
        Initialize the mock model.
        
        Args:
            success_rate: Nominal probability of returning a correct fix.
                Currently unused: generate_fix alternates deterministically.
        """
        self.success_rate = success_rate
        self.call_count = 0
        
        # Predefined responses - some correct, some incorrect
        self.responses = {
            "sum_to_n": {
                "correct": '''def sum_to_n(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total''',
                "incorrect": '''def sum_to_n(n):
    total = 0
    for i in range(1, n):
        total += i
    return total''',  # Still wrong - stops at n - 1
            },
            "is_even": {
                "correct": '''def is_even(n):
    return n % 2 == 0''',
                "incorrect": '''def is_even(n):
    return n / 2 == 0''',  # Wrong - uses division
            },
            "find_max": {
                "correct": '''def find_max(lst):
    if not lst:
        return None
    max_val = lst[0]
    for item in lst:
        if item > max_val:
            max_val = item
    return max_val''',
                "incorrect": '''def find_max(lst):
    if not lst:
        return None
    max_val = lst[0]
    for item in lst:
        if item > max_val:
            max_val = item''',  # Still missing return
            },
            "filter_greater_than": {
                "correct": '''def filter_greater_than(numbers, threshold):
    result = []
    for n in numbers:
        if n > threshold:
            result.append(n)
    return result''',
                "incorrect": '''def filter_greater_than(numbers, threshold):
    result = []
    for n in numbers:
        if n >= threshold:
            result.append(n)
    return result''',  # Wrong - uses >= instead of >
            },
            "reverse_string": {
                "correct": '''def reverse_string(s):
    result = ""
    for char in s:
        result = char + result
    return result''',
                "incorrect": '''def reverse_string(s):
    return s[::-1]''',  # Stored under "incorrect", but this is a valid alternative fix and passes the tests
            },
        }
    
    def generate_fix(self, prompt: str) -> str:
        """
        Generate a code fix based on the prompt.
        
        Args:
            prompt: Bug fix prompt containing description and buggy code
            
        Returns:
            Generated code fix
        """
        self.call_count += 1
        
        # Determine which function we're fixing
        for func_name in self.responses:
            if func_name in prompt:
                # Alternate between correct and incorrect based on call count
                # to simulate varying model performance
                if self.call_count % 2 == 0:
                    return self.responses[func_name]["correct"]
                else:
                    return self.responses[func_name]["incorrect"]
        
        # Default: return a syntax error
        return "def broken_code(\n    # This won't work"


# Create mock model
mock_model = MockCodeFixModel()
print("✅ Mock code fix model created!")
print("   (Simulates varying model performance for demonstration)")
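
`MockCodeFixModel` accepts a `success_rate` argument, but `generate_fix` alternates deterministically and never reads it. One possible way to honor the parameter while keeping runs reproducible is to draw from a seeded `random.Random`. The class `SeededMockModel` below is a hypothetical sketch, not the notebook's implementation, and its canned responses are placeholders.

```python
import random


class SeededMockModel:
    """Hypothetical mock that returns the correct fix with a fixed probability."""

    def __init__(self, correct: str, incorrect: str, success_rate: float = 0.6, seed: int = 0):
        if not 0.0 <= success_rate <= 1.0:
            raise ValueError("success_rate must be between 0.0 and 1.0")
        self.correct = correct
        self.incorrect = incorrect
        self.success_rate = success_rate
        self._rng = random.Random(seed)  # Private generator: same seed, same sequence.

    def generate_fix(self, prompt: str) -> str:
        """Ignore the prompt and pick one of the two canned responses at random."""
        if self._rng.random() < self.success_rate:
            return self.correct
        return self.incorrect


model = SeededMockModel("return n % 2 == 0", "return n % 2 == 1", success_rate=0.6, seed=42)
answers = [model.generate_fix("fix is_even") for _ in range(1000)]
print(f"Share of correct fixes: {answers.count(model.correct) / len(answers):.2f}")
```

Using a private `random.Random` instance rather than the module-level functions keeps the mock's sequence independent of any other random calls in the notebook.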

---

## 📝 Step 8: Define Bug-Fix Prompt Template

In [None]:
BUG_FIX_PROMPT_TEMPLATE = """You are a skilled Python programmer. Your task is to fix the bug in the following code.

## Function Description
{description}

## Buggy Code
```python
{buggy_code}
```

## Instructions
1. Identify the bug in the code above
2. Fix the bug while maintaining the same function signature
3. Return ONLY the fixed Python code, no explanations

## Fixed Code
```python
"""


def create_bug_fix_prompt(description: str, buggy_code: str) -> str:
    """Create a prompt for bug-fix tasks."""
    return BUG_FIX_PROMPT_TEMPLATE.format(
        description=description,
        buggy_code=buggy_code
    )


# Show an example prompt
example_prompt = create_bug_fix_prompt(
    BUG_FIX_DATASET[0]["description"],
    BUG_FIX_DATASET[0]["buggy_code"]
)
print("📝 Example Bug-Fix Prompt:")
print("=" * 60)
print(example_prompt)
print("=" * 60)
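
Because the template ends with an opening code fence, a real model's reply may include a closing fence, a complete fenced block, or trailing commentary, none of which is valid Python. The sketch below shows one way to clean such a reply before passing it to the evaluator. The helper `extract_code` and both sample replies are assumptions for illustration; the mock model in this notebook returns bare code and does not need this step.

```python
import re

FENCE = "`" * 3  # Three backticks, built this way so the example can sit inside a fenced block.
_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced block if present, else the text before a stray closing fence."""
    match = _BLOCK.search(response)
    if match:
        return match.group(1).strip("\n")
    # The prompt already opened a fence, so the reply may contain only the closing one.
    return response.split(FENCE)[0].strip("\n")


wrapped = f"Here is the fix:\n{FENCE}python\ndef is_even(n):\n    return n % 2 == 0\n{FENCE}\nDone."
continued = f"def is_even(n):\n    return n % 2 == 0\n{FENCE}\nThe bug was the comparison."
print(extract_code(wrapped))
print(extract_code(continued))
```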

---

## 🧪 Step 9: Run Bug-Fix Evaluation

In [None]:
print("🧪 Running Bug-Fix Evaluation...")
print("=" * 70)

evaluation_results = []

for tc in BUG_FIX_DATASET:
    # Create prompt
    prompt = create_bug_fix_prompt(
        tc["description"],
        tc["buggy_code"]
    )
    
    # Get model's fix attempt
    generated_fix = mock_model.generate_fix(prompt)
    
    # Evaluate the fix
    result = evaluator.evaluate_code(
        generated_fix,
        tc["test_code"]
    )
    
    evaluation_results.append({
        "name": tc["name"],
        "bug_type": tc["bug_type"],
        "passed": result["passed"],
        "error": result["error"],
        "generated_fix": generated_fix,
    })
    
    status = "✅ PASS" if result["passed"] else "❌ FAIL"
    print(f"\n{status} {tc['name']}")
    print(f"   Bug Type: {tc['bug_type']}")
    if not result["passed"] and result["error"]:
        # Show only the first line of the error, truncated to 60 characters
        first_line = result["error"].strip().split('\n')[0]
        suffix = "..." if len(first_line) > 60 else ""
        print(f"   Error: {first_line[:60]}{suffix}")

# Summary
pass_rate = evaluator.compute_pass_rate(evaluation_results)
print("\n" + "=" * 70)
print(f"📊 Model Pass Rate: {pass_rate:.0%}")

---

## 📊 Step 10: Display Results Summary

In [None]:
print("📊 Bug-Fix Evaluation Results Summary")
print("=" * 80)
print(f"{'Test Case':<35} {'Bug Type':<20} {'Result':<10}")
print("-" * 80)

for r in evaluation_results:
    status = "✅ Pass" if r["passed"] else "❌ Fail"
    print(f"{r['name']:<35} {r['bug_type']:<20} {status:<10}")

print("-" * 80)
print(f"{'TOTAL':<35} {'':<20} {pass_rate:.0%} pass rate")

---

## 📈 Step 11: Analyze Results by Bug Type

In [None]:
print("📈 Results by Bug Type")
print("=" * 60)

# Group results by bug type
bug_type_results = {}
for r in evaluation_results:
    bug_type = r["bug_type"]
    if bug_type not in bug_type_results:
        bug_type_results[bug_type] = {"passed": 0, "total": 0}
    bug_type_results[bug_type]["total"] += 1
    if r["passed"]:
        bug_type_results[bug_type]["passed"] += 1

print(f"\n{'Bug Type':<25} {'Passed':<10} {'Total':<10} {'Rate':<10}")
print("-" * 60)

for bug_type, stats in sorted(bug_type_results.items()):
    rate = stats["passed"] / stats["total"] if stats["total"] > 0 else 0
    status = "✅" if rate == 1.0 else "❌" if rate == 0.0 else "⚠️"
    print(f"{status} {bug_type:<23} {stats['passed']:<10} {stats['total']:<10} {rate:.0%}")

print("\n📋 Analysis:")
# Find hardest bug types
sorted_types = sorted(bug_type_results.items(), key=lambda x: x[1]["passed"]/x[1]["total"])
print(f"   Hardest bug type: {sorted_types[0][0]}")
print(f"   Easiest bug type: {sorted_types[-1][0]}")
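
With only one test case per bug type, several types will usually tie at 0% or 100%, and indexing `sorted_types[0]` reports just one of them. The sketch below groups results with `collections.defaultdict` and returns every type that shares the lowest rate. The function names and the `sample_results` list are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def pass_rate_by_type(results: List[dict]) -> Dict[str, float]:
    """Map each bug type to its pass rate."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [passed, total]
    for r in results:
        counts[r["bug_type"]][0] += int(r["passed"])
        counts[r["bug_type"]][1] += 1
    return {bug_type: passed / total for bug_type, (passed, total) in counts.items()}


def hardest_types(rates: Dict[str, float]) -> List[str]:
    """Return every bug type that shares the lowest pass rate, sorted by name."""
    lowest = min(rates.values())
    return sorted(t for t, rate in rates.items() if rate == lowest)


sample_results = [  # Made-up results for illustration only.
    {"bug_type": "off-by-one", "passed": False},
    {"bug_type": "off-by-one", "passed": True},
    {"bug_type": "missing-return", "passed": False},
    {"bug_type": "wrong-operator", "passed": True},
    {"bug_type": "string-order", "passed": False},
]
rates = pass_rate_by_type(sample_results)
print(rates)
print("Hardest:", hardest_types(rates))
```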

---

## 🔍 Step 12: View Failed Fix Attempts

In [None]:
print("🔍 Failed Fix Attempts")
print("=" * 70)

failed_attempts = [r for r in evaluation_results if not r["passed"]]

if not failed_attempts:
    print("\n✅ No failures - all fixes passed!")
else:
    for r in failed_attempts:
        print(f"\n❌ {r['name']} ({r['bug_type']})")
        print("-" * 60)
        print("Generated Fix (incorrect):")
        print(r["generated_fix"])
        if r["error"]:
            print(f"\nError: {r['error'][:200]}")

---

## 🔧 Step 13: Integration with BenchRight Engine

Here's how to integrate the code evaluator with BenchRight's benchmark engine.

In [None]:
def code_pass_fail_metric(
    generated_code: str,
    test_code: str,
) -> float:
    """
    Metric function that returns 1.0 if code passes all tests, 0.0 otherwise.
    
    This can be used with BenchRight's run_benchmark function.
    
    Args:
        generated_code: The generated code to evaluate
        test_code: Unit tests for verification
        
    Returns:
        1.0 if all tests pass, 0.0 otherwise
    """
    eval_instance = CodeEvaluator(timeout_seconds=5)
    result = eval_instance.evaluate_code(generated_code, test_code)
    return 1.0 if result["passed"] else 0.0


# Demonstrate the metric
print("📊 Demonstrating pass/fail metric:")
print("=" * 50)

# Test with correct code
correct_code = BUG_FIX_DATASET[0]["expected_fix"]
test_code = BUG_FIX_DATASET[0]["test_code"]
score = code_pass_fail_metric(correct_code, test_code)
print(f"Correct code score: {score} {'✅' if score == 1.0 else '❌'}")

# Test with buggy code
buggy_code = BUG_FIX_DATASET[0]["buggy_code"]
score = code_pass_fail_metric(buggy_code, test_code)
print(f"Buggy code score: {score} {'✅' if score == 1.0 else '❌'}")
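
The signature of BenchRight's `run_benchmark` is not shown in this notebook, so the sketch below does not call it. Instead it illustrates the general pattern an engine would follow: generate an output per example, score it with a metric of the form `(generated_code, test_code) -> float`, and average the scores. The names `mean_score`, `toy_metric`, `examples`, and `canned` are all hypothetical, and `toy_metric` is an in-process stand-in for `code_pass_fail_metric`.

```python
from statistics import mean
from typing import Callable, Dict, List


def mean_score(
    examples: List[Dict[str, str]],
    generate: Callable[[str], str],
    metric: Callable[[str, str], float],
) -> float:
    """Average `metric(generated_code, test_code)` over all examples."""
    if not examples:
        return 0.0
    return mean(metric(generate(ex["prompt"]), ex["test_code"]) for ex in examples)


def toy_metric(generated_code: str, test_code: str) -> float:
    """Stand-in metric: exec the code and its tests in-process (demo only, unsafe)."""
    namespace: dict = {}
    try:
        exec(generated_code + "\n" + test_code, namespace)
        return 1.0
    except Exception:
        return 0.0


examples = [
    {"prompt": "fix is_even", "test_code": "assert is_even(2) and not is_even(3)"},
    {"prompt": "fix double", "test_code": "assert double(4) == 8"},
]
canned = {  # Canned "model" outputs: the first is right, the second is wrong.
    "fix is_even": "def is_even(n):\n    return n % 2 == 0",
    "fix double": "def double(x):\n    return x + 2",
}
print(mean_score(examples, lambda prompt: canned[prompt], toy_metric))  # 0.5
```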

---

## 🎓 Mini-Project: Your Bug-Fix Evaluation

### Task

Create your own bug-fix test case and evaluate it.

### Template

In [None]:
# Define your own bug-fix test case
my_test_case = {
    "name": "Your Bug Name",
    "description": "# What should the function do?",
    "buggy_code": '''# Your buggy code here
def my_function(x):
    # Contains a bug
    pass''',
    "expected_fix": '''# Your fixed code here
def my_function(x):
    # Bug is fixed
    pass''',
    "test_code": '''# Your unit tests here
def test_my_function():
    # assert my_function(input) == expected_output
    pass''',
    "bug_type": "your-bug-type",
}

# Evaluate your fix (uncomment to run)
# result = evaluator.evaluate_code(
#     my_test_case["expected_fix"],
#     my_test_case["test_code"]
# )
# print(f"Your fix passed: {result['passed']}")

---

## 🤔 Paul-Elder Critical Thinking Questions

Reflect on these questions as you complete the exercises:

### Question 1: TEST COVERAGE
**What are the risks if an LLM-generated code fix passes all provided unit tests but contains a subtle bug that wasn't covered by the tests?**

*Consider: Test coverage vs. correctness, edge cases not tested, the difference between "passes tests" and "is correct," and how comprehensive test suites should be.*

### Question 2: PROMPT ENGINEERING
**How does the quality of the bug description in the prompt affect the LLM's ability to fix the bug correctly?**

*Consider: The importance of function specifications, the role of examples, whether showing the expected output helps, and the tradeoff between detailed prompts and model generalization.*

### Question 3: SECURITY IMPLICATIONS
**Should LLM-generated code be trusted in production systems? What safeguards should be in place?**

*Consider: Code review requirements, automated security scanning, testing requirements, the potential for introducing vulnerabilities, and the difference between "works" and "safe."*

---

## ⚠️ Limitations and Risks

### What This Evaluation DOESN'T Cover

1. **Security Vulnerabilities:** Code may pass tests but have security flaws
2. **Performance Issues:** Correct but inefficient code
3. **Edge Cases:** Tests may not cover all scenarios
4. **Readability:** Code quality beyond correctness
5. **Maintainability:** Long-term code health
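
The edge-case limitation shows up in this notebook's own data. The mock model's "incorrect" response for `filter_greater_than` in Step 7 uses `>=` instead of `>`, yet the three assertions in Step 2 never include a value equal to the threshold. The sketch below reproduces that situation; `original_tests` and `boundary_test` are hypothetical helper names, and the boundary check is an added example rather than part of the dataset.

```python
def filter_greater_than(numbers, threshold):
    """Candidate fix with a subtle bug: uses >= where the description says 'greater than'."""
    return [n for n in numbers if n >= threshold]


def original_tests() -> bool:
    """The three checks from Step 2; none uses a value equal to the threshold."""
    return (
        filter_greater_than([1, 5, 10, 3], 4) == [5, 10]
        and filter_greater_than([1, 2, 3], 10) == []
        and filter_greater_than([], 5) == []
    )


def boundary_test() -> bool:
    """An extra check where one element equals the threshold."""
    return filter_greater_than([4, 5], 4) == [5]


print("Original tests pass:", original_tests())  # True
print("Boundary test passes:", boundary_test())  # False
```

A pass/fail score is therefore only as strong as the test suite behind it: the candidate above would be scored 1.0 even though it does not match the function description.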

### Required Safeguards for Production

- **Human Code Review:** Always review generated code
- **Security Scanning:** Use automated vulnerability detection
- **Comprehensive Testing:** Include edge cases, error handling
- **Performance Testing:** Benchmark critical code paths
- **Documentation:** Require clear comments and docstrings
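
As one small example of an automated check that could run before generated code is executed, the sketch below walks the syntax tree with the standard-library `ast` module and reports imports from a deny-list. The `BLOCKED_MODULES` set and the helper `find_blocked_imports` are hypothetical. This is not a substitute for a real security scanner or a sandbox, since dynamic imports such as `__import__("os")` are not detected.

```python
import ast
from typing import List, Set

# Hypothetical deny-list for illustration; a real policy would be broader.
BLOCKED_MODULES: Set[str] = {"os", "subprocess", "socket", "shutil"}


def find_blocked_imports(source: str) -> List[str]:
    """Return the blocked top-level modules that `source` imports, sorted."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in BLOCKED_MODULES:
                found.add(root)
    return sorted(found)


suspicious = "import os\nfrom subprocess import run\n\ndef f():\n    return os.getcwd()\n"
harmless = "import math\n\ndef area(r):\n    return math.pi * r * r\n"
print(find_blocked_imports(suspicious))  # ['os', 'subprocess']
print(find_blocked_imports(harmless))    # []
```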

---

## ✅ Knowledge Mastery Checklist

Before moving to Week 14, ensure you can check all boxes:

- [ ] I understand why code generation evaluation requires unit tests rather than semantic comparison
- [ ] I can design a synthetic bug-fix dataset with descriptions, buggy code, expected fixes, and tests
- [ ] I can use the CodeEvaluator to automatically run unit tests on generated code
- [ ] I understand the pass/fail metric and how to compute pass rates
- [ ] I know how BenchRight's engine can integrate with code evaluation
- [ ] I can identify different bug types and analyze which are hardest to fix
- [ ] I understand the security implications of LLM-generated code
- [ ] I can articulate the limitations of unit test-based evaluation

---

**Week 13 Complete!** 🎉

**Next:** *Week 14: Data Analytics Use Cases*