# AI Code-Fixing Agent: Task, Solution, and Demonstration

## Task Description

The goal of this project was to implement an LLM-based AI agent that can fix buggy Python code. The requirements were to:
- Use an agentic framework (like LangGraph).
- Employ a ReAct-style agent with a sandboxed code interpreter.
- Use a small, open-source LLM.
- Evaluate the agent on the HumanEvalFix benchmark using the pass@1 metric.

## Solution Overview

This notebook demonstrates a solution using a LangGraph-based agent that follows a "generate, test, reflect" loop. The agent uses the `Qwen/Qwen2.5-0.5B-Instruct` model to generate code fixes, which are then tested in a secure, sandboxed environment. If a fix fails, the agent reflects on the error and retries.

This notebook provides:
1.  A step-by-step demonstration of the agent fixing a single function.
2.  A small-scale evaluation of the agent's performance on the HumanEvalFix benchmark.
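
The "test" step of the loop can be sketched as follows. This is a minimal illustration, not the project's actual sandbox: it assumes a candidate program is written to a temporary file and run in a separate Python process with a timeout. The name `run_in_subprocess` and its parameters are hypothetical. A separate process only limits crashes and hangs; it is not a security boundary on its own.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_subprocess(program: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Run `program` in a fresh Python process and return (passed, stderr)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        script = Path(tmp_dir) / "prog.py"
        script.write_text(program, encoding="utf-8")
        try:
            completed = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,  # keep stdout/stderr for the reflection step
                text=True,
                timeout=timeout_s,    # stop runaway candidates (e.g. infinite loops)
                cwd=tmp_dir,          # relative file writes land in the temp dir
            )
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout_s} seconds"
    # A non-zero exit code covers failed asserts and any uncaught exception.
    return completed.returncode == 0, completed.stderr


# A failing assert is reported as not passed, with the traceback in stderr.
passed, stderr = run_in_subprocess("assert 1 + 1 == 3\n")
print(passed)
```

The captured `stderr` is what a reflection step would feed back to the model on the next attempt.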

## 1. Install Dependencies

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.





## 2. Load the Language Model

In [2]:
from evaluation.humaneval_eval import load_model

model = load_model()



🔹 Loading model: Qwen/Qwen2.5-0.5B-Instruct


## 3. Demonstrate the Agent on a Single Task

### The Buggy Function
The function `find_closest_elements` is supposed to return the `k` numbers closest to `x` in a sorted array `arr`. The buggy version sorts by the signed difference `val - x` instead of the absolute difference from `x`. (Source: https://huggingface.co/datasets/bigcode/humanevalpack)

In [3]:
buggy_code = """\
from typing import List

def find_closest_elements(arr: List[int], k: int, x: int) -> List[int]:
    \"\"\"
    Given a sorted integer array arr, two integers k and x, return the k closest integers to x in the array.
    The result should also be sorted in ascending order.
    \"\"\"
    # The bug is in the lambda function: it should sort by absolute difference from x.
    return sorted(arr, key=lambda val: val - x)[:k]
"""

tests = """\
assert find_closest_elements([1, 2, 3, 4, 5], 4, 3) == [1, 2, 3, 4]
assert find_closest_elements([1, 2, 3, 4, 5], 4, -1) == [1, 2, 3, 4]
"""

print("Buggy function body:\n" + buggy_code)

Buggy function body:
from typing import List

def find_closest_elements(arr: List[int], k: int, x: int) -> List[int]:
    """
    Given a sorted integer array arr, two integers k and x, return the k closest integers to x in the array.
    The result should also be sorted in ascending order.
    """
    # The bug is in the lambda function: it should sort by absolute difference from x.
    return sorted(arr, key=lambda val: val - x)[:k]



### Running the Agent

The agent will reflect on its mistakes and retry up to 3 times if the initial fix doesn't pass the tests.
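
The retry behaviour described above can be sketched in plain Python. This is an illustration of the generate, test, reflect pattern, not the agent's LangGraph implementation: `fix_with_retries`, `fake_generate`, and `in_process_tests` are hypothetical names, the "model" is a stub, and `max_retries` is assumed to mean retries after one initial attempt.

```python
from typing import Callable


def fix_with_retries(
    buggy_program: str,
    generate: Callable[[str, str], str],
    run_tests: Callable[[str], tuple[bool, str]],
    max_retries: int = 3,
) -> dict:
    """Generate a fix, test it, and feed the error back until it passes or retries run out."""
    feedback = ""
    candidate = buggy_program
    for attempt in range(1 + max_retries):  # one initial attempt plus the retries
        candidate = generate(buggy_program, feedback)
        passed, error = run_tests(candidate)
        if passed:
            return {"passed": True, "program": candidate, "attempts": attempt + 1}
        feedback = error  # "reflect": the next generation sees the failure message
    return {"passed": False, "program": candidate, "attempts": 1 + max_retries}


def fake_generate(program: str, feedback: str) -> str:
    # Stand-in for the LLM: it only produces the right fix after seeing an error.
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def in_process_tests(program: str) -> tuple[bool, str]:
    # Not sandboxed; acceptable only for this trusted toy example.
    namespace: dict = {}
    try:
        exec(program + "\nassert add(2, 3) == 5\n", namespace)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


result = fix_with_retries("def add(a, b):\n    return a * b\n", fake_generate, in_process_tests)
print(result["passed"], result["attempts"])
```

In this toy run the first candidate fails its assert, the error string becomes the feedback, and the second candidate passes.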

In [4]:
from agent.agent import run_agent, create_agent

agent = create_agent(model=model)
result = run_agent(
    agent,
    imports="from typing import List",
    buggy_body='return sorted(arr, key=lambda val: val - x)[:k]',
    declaration='def find_closest_elements(arr: List[int], k: int, x: int) -> List[int]:',
    entry_point="find_closest_elements",
    tests=tests,
    max_retries=3,
)

print(f"Passed: {result['passed']}\n")
print("Corrected function:\n")
print(result["program"])

Passed: True

Corrected function:

from typing import List

def find_closest_elements(arr: List[int], k: int, x: int) -> List[int]:
    arr.sort(key=lambda val: val - x)
    return arr[:k]
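
Note that the returned program still sorts by `val - x`, so `Passed: True` only shows that the two asserts are satisfied. Both asserts take the first four elements of `[1, 2, 3, 4, 5]`, which the buggy ordering also produces. The sketch below checks this. The `reference` function and the extra test case are assumptions added for illustration; they are not part of the notebook or the benchmark.

```python
from typing import List


def buggy(arr: List[int], k: int, x: int) -> List[int]:
    # Same expression as the buggy function shown earlier in this notebook.
    return sorted(arr, key=lambda val: val - x)[:k]


def reference(arr: List[int], k: int, x: int) -> List[int]:
    # Hypothetical reference fix (not the agent's output): rank by distance to x,
    # break ties toward the smaller value, keep k elements, then sort ascending.
    closest = sorted(arr, key=lambda val: (abs(val - x), val))[:k]
    return sorted(closest)


# The two cases used in this notebook.
notebook_cases = [
    (([1, 2, 3, 4, 5], 4, 3), [1, 2, 3, 4]),
    (([1, 2, 3, 4, 5], 4, -1), [1, 2, 3, 4]),
]
# An extra case (assumption, not one of the notebook's tests) with x at the right edge.
extra_case = (([1, 2, 3, 4, 5], 2, 5), [4, 5])

for name, func in [("buggy", buggy), ("reference", reference)]:
    notebook_ok = all(func(*args) == expected for args, expected in notebook_cases)
    extra_ok = func(*extra_case[0]) == extra_case[1]
    print(f"{name}: notebook tests pass={notebook_ok}, extra test pass={extra_ok}")
```

A test such as the extra case, where the closest elements are not a prefix of the array, is what separates the two implementations.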


## 4. Evaluate Agent Performance on HumanEvalFix

Finally, we'll run a small-scale evaluation on the first 10 tasks of the HumanEvalFix benchmark to get a pass@1 score. This gives a rough indication of how the agent handles a variety of problems.

To run on the full dataset, you can execute `python main.py` from your terminal.

In [5]:
from evaluation.humaneval_eval import evaluate_on_humanevalfix

evaluate_on_humanevalfix(limit=10)

Dataset downloaded.
🔹 Loading model: Qwen/Qwen2.5-0.5B-Instruct
############################################################
Task: Python/0
############################################################
Task: Python/0
############################################################
Task: Python/0 | Passed: False
Error:
 Traceback (most recent call last):
  File "C:\Users\Bartus\AppData\Local\Temp\prog_i8ijy9ke.py", line 32, in <module>
    check(has_close_elements)
  File "C:\Users\Bartus\AppData\Local\Temp\prog_i8ijy9ke.py", line 25, in check
    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
AssertionError

Model code:
 from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    This function checks if there exists at least one pair of elements in the list 'numbers'
    such that their difference is less than the given 'threshold'. It returns True if such a pair
    exists, otherwise False.
    
    :param numbers: 

0.5

## 5. Experiment: Evaluating with a Different Model (Qwen2.5-1.5B)

To see how the choice of model impacts performance, let's run the same evaluation with a larger small, open-source model from the same family: `Qwen/Qwen2.5-1.5B-Instruct`, a 1.54B-parameter model. This will help us understand whether the agent's results depend on model size or whether the agentic framework works comparably across different LLMs.

Now, let's run the same evaluation on the first 10 tasks of HumanEvalFix with our new `Qwen2.5-1.5B`-powered agent.

In [6]:
from evaluation.humaneval_eval import evaluate_on_humanevalfix

print("--- Evaluating Qwen2.5-1.5B Agent ---")
evaluate_on_humanevalfix(model_name="Qwen/Qwen2.5-1.5B-Instruct", limit=10, retries=1)

--- Evaluating Qwen2.5-1.5B Agent ---
Dataset downloaded.
🔹 Loading model: Qwen/Qwen2.5-1.5B-Instruct


Some parameters are on the meta device because they were offloaded to the cpu.


############################################################
Task: Python/0
############################################################
Task: Python/0 | Passed: True
Model code:
 from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False
############################################################
[1] Python/0 – passed 1/1
############################################################
Task: Python/1
############################################################
Task: Python/0 | Passed: True
Model code:
 from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx 

0.6

## Summary: Model Size vs. Retries

Even with a smaller retry budget (`retries=1`), the **1.5B** model outperforms the **0.5B** model on the first 10 HumanEvalFix (Python) tasks. In our runs:

- **Qwen2.5-0.5B-Instruct** at `retries=4` achieved **pass@1 = 50.0%** on the 10-task subset (**5/10** solved).
- **Qwen2.5-1.5B-Instruct** at `retries=1` achieved **pass@1 = 60.0%** on the same subset (**6/10** solved). It also solved **Python/0**, which the 0.5B model failed at `retries=1`.

| Model | Params | Retries | pass@1 | Subset size |
|---|---:|---:|---:|---:|
| Qwen2.5-0.5B-Instruct | 0.5B | 1 | **16.5%** | 164 |
| Qwen2.5-0.5B-Instruct | 0.5B | 4 | **50.0%** | 10 |
| Qwen2.5-1.5B-Instruct | 1.5B | 1 | **60.0%** | 10 |

**Takeaway.** Capacity matters on this subset: at `retries=1`, the 1.5B agent solved 6 of 10 tasks, more than the 0.5B agent solved at `retries=4` (5 of 10). This suggests that, for this task, scaling the base model can be more impactful than increasing the retry count, although a 10-task subset is too small to support a firm conclusion.
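
The pass@1 values in the table can be reproduced from pass/fail counts. The sketch below uses the standard unbiased pass@k estimator, which reduces to the fraction of solved tasks when each task contributes one final answer. The function names are hypothetical, and the count of 27 solved tasks out of 164 is inferred from the printed score `0.16463414634146342`, not read from the logs.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(passed_flags: list[bool]) -> float:
    """With one final answer per task, pass@1 is the fraction of tasks solved."""
    per_task = [pass_at_k(1, int(flag), 1) for flag in passed_flags]
    return sum(per_task) / len(per_task)


# (solved, total) pairs matching the three table rows; 27/164 is an inferred count.
for solved, total in [(27, 164), (5, 10), (6, 10)]:
    flags = [True] * solved + [False] * (total - solved)
    print(f"{solved}/{total}: pass@1 = {benchmark_pass_at_1(flags):.1%}")
```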


In [8]:
evaluate_on_humanevalfix(retries=1)

Dataset downloaded.
🔹 Loading model: Qwen/Qwen2.5-0.5B-Instruct
############################################################
Task: Python/0
############################################################
Task: Python/0
############################################################
Task: Python/0 | Passed: False
Error:
 Traceback (most recent call last):
  File "C:\Users\Bartus\AppData\Local\Temp\prog_rm463stt.py", line 33, in <module>
    check(has_close_elements)
  File "C:\Users\Bartus\AppData\Local\Temp\prog_rm463stt.py", line 26, in check
    assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
AssertionError

Model code:
 from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if there exists at least one pair of elements from two lists whose difference is less than a given threshold.

    Args:
    numbers: A list of floating-point numbers.
    threshold: A float representing the minimum acceptable differen

0.16463414634146342