# Run SWE-Bench evaluation
This notebook shows how to reproduce the [Moatless Tools](https://github.com/aorwall/moatless-tools) evaluation that was submitted to the [SWE Bench Leaderboard](https://www.swebench.com/).

In [14]:
import os
import sys
sys.path.append(os.path.abspath(os.path.join('..')))

## Download index files

Moatless Tools uses a vector index for semantic code search. To avoid indexing every repository yourself, you can use this shared folder of index files, embedded with voyage-code-2, which covers all SWE-Bench Lite instances: https://drive.google.com/drive/folders/1RhG1w_VVY938ogHRhZ7K5Tapnvs5b-PW?usp=sharing

If you add a shortcut to "20240522-voyage-code-2" in "My Drive", you can mount Google Drive on `/content/drive` and find the index at `/content/drive/MyDrive/20240522-voyage-code-2`.
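The index location differs between Colab and a local checkout, so it can help to probe both. The following is a minimal sketch, not part of Moatless Tools: `resolve_index_store_dir` is a hypothetical helper, and the two candidate paths are simply the ones mentioned in this notebook.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_index_store_dir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that is an existing directory, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return str(path)
    return None


# In Colab, mount Drive first (only works inside a Colab runtime):
#   from google.colab import drive
#   drive.mount("/content/drive")
index_store_dir = resolve_index_store_dir([
    "/content/drive/MyDrive/20240522-voyage-code-2",  # mounted Drive shortcut
    "../20240522-voyage-code-2",                       # local download
])
print(index_store_dir)
```

If neither path exists the helper returns `None`, which makes a missing download visible before any instance is evaluated.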

In [15]:
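import json

# Path to a JSON file holding the API keys used throughout this notebook.
keys_dir = "../keys.json"
with open(keys_dir) as f:
    keys = json.load(f)  # the `with` block closes the file automatically

# Directory containing the downloaded voyage-code-2 index files.
index_store_dir = "../20240522-voyage-code-2"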

To use `voyage-code-2` embeddings, you also need an API key from [Voyage AI](https://www.voyageai.com/). Add it to your secrets; the next cell reads it from the `VOYAGE_API_KEY` entry in `keys.json`.

In [16]:
from moatless.index import CodeIndex, IndexSettings
import os

os.environ["VOYAGE_API_KEY"] = keys["VOYAGE_API_KEY"]
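A missing key otherwise only surfaces once the first embedding request fails, so it can be worth checking up front. This is a sketch under assumptions: `load_api_key` is a hypothetical helper, and the environment-first lookup order is a design choice made here, not something Moatless Tools prescribes.

```python
import json
import os
from typing import Optional


def load_api_key(name: str, keys_path: str = "../keys.json") -> str:
    """Look up `name` in the environment first, then in the JSON keys file."""
    value: Optional[str] = os.environ.get(name)
    if not value and os.path.exists(keys_path):
        with open(keys_path) as f:
            value = json.load(f).get(name)
    if not value:
        # Fail early with a message that names the missing key.
        raise KeyError(f"{name} not found in environment or {keys_path}")
    return value
```

Usage would look like `os.environ["VOYAGE_API_KEY"] = load_api_key("VOYAGE_API_KEY")`.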

## Stage for evaluation
LiteLLM is used to send requests to LLMs. Use the model names that LiteLLM expects and add the API key to *Secrets*.

`model=gpt-4` with `temperature=0.2` was used in the latest submission to the SWE-Bench Lite leaderboard.

`max_cost` is set to limit how much each run is allowed to cost.

In [17]:
os.environ["OPENAI_API_KEY"] = keys['OPENAI_API_KEY']

model = "gpt-4o"
temperature = 0.2

max_cost = 0.5
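The per-run cap also bounds the cost of a full benchmark run. The arithmetic below is a rough sketch that assumes `max_cost` is expressed in US dollars and that every run stops at the cap; a run may overshoot slightly by the cost of its final request. The instance count of 300 matches the progress bar shown later in this notebook.

```python
max_cost = 0.5      # per-instance cap, as set in the cell above
n_instances = 300   # number of instances in the SWE-Bench Lite test split

# Upper bound on total spend if every instance uses its full budget.
worst_case_total = max_cost * n_instances
print(f"Worst-case spend: {worst_case_total:.2f}")
```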

Enter an evaluation name and specify the directories where predictions and trajectories are saved.

In [18]:
import datetime
import os

evaluations_dir = "../evaluations"
evaluation_name = "20240617_moatless_gpt4o_demo"
evaluation_dir = f"{evaluations_dir}/{evaluation_name}"
trajectory_dir = f"{evaluation_dir}/trajs"
predictions_path = f"{evaluation_dir}/all_preds.jsonl"

# Create the trajectory directory (and its parents) if it does not exist yet.
os.makedirs(trajectory_dir, exist_ok=True)

print(evaluation_dir)

../evaluations/20240617_moatless_gpt4o_demo
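The evaluation name above follows a `<date>_moatless_<model>_<suffix>` pattern. If you prefer to derive it instead of typing it, something like the following could work; `build_evaluation_name` is a hypothetical helper, and the pattern is inferred only from the single example name in this notebook.

```python
import datetime
from typing import Optional


def build_evaluation_name(model: str, suffix: str,
                          day: Optional[datetime.date] = None) -> str:
    """Build a name such as 20240617_moatless_gpt4o_demo."""
    day = day or datetime.date.today()
    # Drop characters that read poorly in directory names.
    model_slug = model.replace("-", "").replace(".", "")
    return f"{day:%Y%m%d}_moatless_{model_slug}_{suffix}"


print(build_evaluation_name("gpt-4o", "demo", datetime.date(2024, 6, 17)))
```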


(Optional) Set up tracing, for example with [Langfuse](https://langfuse.com/).

In [19]:
import litellm

os.environ["LANGFUSE_PUBLIC_KEY"] = keys['LANGFUSE_PUBLIC_KEY']
os.environ["LANGFUSE_SECRET_KEY"] = keys['LANGFUSE_SECRET_KEY']

litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]
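Because tracing is optional, the cell above raises a `KeyError` when the Langfuse keys are absent from `keys.json`. One way to make it truly optional is sketched below; `tracing_callbacks` is a hypothetical helper, and only the `"langfuse"` callback name and the two key names come from this notebook.

```python
from typing import Dict, List


def tracing_callbacks(keys: Dict[str, str]) -> List[str]:
    """Return the LiteLLM callback names to enable for the given keys."""
    required = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")
    # Enable Langfuse only when both keys are present and non-empty.
    return ["langfuse"] if all(keys.get(name) for name in required) else []


# Intended use, mirroring the cell above:
#   callbacks = tracing_callbacks(keys)
#   litellm.success_callback = callbacks
#   litellm.failure_callback = callbacks
```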

Define the evaluation function.

In [24]:
from moatless.loop import AgenticLoop
from moatless.transitions import search_and_code_transitions
from moatless.workspace import Workspace
from moatless.benchmark.utils import trace_metadata

import json
import logging
import subprocess
import time
import traceback

def evaluate(instance):
    instance_id = instance["instance_id"]
    trajectory_path = os.path.join(trajectory_dir, f"{instance_id}.json")

    repo_dir = setup_swebench_repo(instance)
    persist_dir = os.path.join(
        index_store_dir, get_repo_dir_name(instance_id)
    )
    workspace = Workspace.from_dirs(repo_dir=repo_dir, index_dir=persist_dir)

    # # To resume an interrupted evaluation, reuse trajectories that already finished.
    # if os.path.exists(trajectory_path):
    #     with open(trajectory_path) as file:
    #         trajectory = json.load(file)
    #     info = trajectory.get("info", {})
    #     if info.get("submission") or "error" in info:
    #         return trajectory

    problem_statement = instance["problem_statement"]

    metadata = trace_metadata(instance_id=instance_id, session_id=evaluation_name, trace_name="search_and_code")
    transitions = search_and_code_transitions(global_params={"model": model, "temperature": temperature})

    loop = AgenticLoop(transitions=transitions,
                       workspace=workspace,
                       metadata=metadata,
                       trajectory_path=trajectory_path,
                       max_cost=max_cost)

    info = {
        "evaluation_name": evaluation_name,
        "instance_id": instance["instance_id"]
    }

    start_time = time.time()
    try:
        response = loop.run(problem_statement)
    except Exception:
        # Record the failure in the trajectory info and keep going.
        info["error"] = traceback.format_exc()
        logging.exception(f"Error in evaluation of {instance_id}")

    info["duration"] = time.time() - start_time
    info["total_cost"] = loop.trajectory.total_cost()

    workspace.save()

    output = subprocess.run(
        ["git", "diff"],
        capture_output=True,
        text=True,
        cwd=repo_dir,
    )

    info["submission"] = output.stdout

    loop.trajectory.save_info(info)
    trajectory = loop.trajectory.to_dict()

    return trajectory

instance_whitelist = []
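The commented-out resume logic in `evaluate` treats a saved trajectory as finished when it holds either a submission or an error. That check can be isolated as a small predicate; `is_finished` is a hypothetical name, and the `info`, `submission` and `error` fields are the ones written by `evaluate` above.

```python
def is_finished(trajectory: dict) -> bool:
    """True if a saved trajectory already has a submission or an error."""
    info = trajectory.get("info") or {}
    # A non-empty patch or any recorded error means there is nothing to rerun.
    return bool(info.get("submission")) or "error" in info
```

Reading `info` once also avoids a `KeyError` when a trajectory file was written without an `info` block.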

## Run the evaluation

First check that the evaluation works on a small subset. The cell below lists five instances and then narrows the whitelist to the first one. Remove the whitelist to run the full benchmark.

In [21]:
instance_whitelist = ["pytest-dev__pytest-5227", "django__django-16139", "sympy__sympy-24152", "django__django-16379", "django__django-16527"]
instance_whitelist = [instance_whitelist[0]]

Run the evaluation. Each prediction is appended to `all_preds.jsonl` as soon as its instance finishes.

In [25]:
from moatless.benchmark.swebench import get_repo_dir_name, sorted_instances, setup_swebench_repo
from tqdm.notebook import tqdm

def run_evaluation(dataset: str = "princeton-nlp/SWE-bench_Lite", split="test"):
    instances = sorted_instances(dataset, split)

    count = 0
    generated = 0
    error = 0

    sum_duration = 0
    sum_total_cost = 0

    with open(predictions_path, "w") as file:
        file.write("")

    if instance_whitelist:
        instances = [instance for instance in instances if instance["instance_id"] in instance_whitelist]

    stats = {}
    pbar = tqdm(instances)
    for instance in pbar:
        trajectory = evaluate(instance)
        if not trajectory:
            error += 1
            continue

        sum_duration += trajectory["info"]["duration"]
        sum_total_cost += trajectory["info"]["total_cost"]

        if trajectory["info"].get("error"):
            error += 1

        if trajectory["info"].get("submission"):
            generated += 1

        count += 1

        if sum_duration > 0:
            stats["avg_duration"] = sum_duration / count

        if sum_total_cost > 0:
            stats["avg_cost"] = sum_total_cost / count
            stats["total_cost"] = sum_total_cost

        if generated > 0:
            success_rate = (generated / count) * 100
            stats["generated"] = f"{success_rate:.2f}%"

        stats["error"] = error

        pbar.set_postfix(stats)

        prediction = {
            "model_name_or_path": evaluation_name,
            "instance_id": instance["instance_id"],
            "model_patch": trajectory["info"].get("submission", ""),
        }

        with open(predictions_path, "a") as file:
            json_string = json.dumps(prediction)
            file.write(json_string + "\n")


run_evaluation()
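After a run, the predictions file can be checked independently of the progress bar. The sketch below assumes only the JSONL layout written by `run_evaluation` (one object per line with a `model_patch` field); `summarize_predictions` is a hypothetical helper.

```python
import json


def summarize_predictions(path: str) -> dict:
    """Count predictions and how many of them contain a non-empty patch."""
    total = 0
    with_patch = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            prediction = json.loads(line)
            total += 1
            if prediction.get("model_patch"):
                with_patch += 1
    return {"total": total, "with_patch": with_patch}
```

Calling `summarize_predictions(predictions_path)` gives a quick count of how many instances produced a patch at all, which is separate from whether the patch resolves the issue.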

  0%|          | 0/300 [00:00<?, ?it/s]

ACTION: thoughts='To address the issue with the hooks argument in the Request class, I need to locate the Request class and its __init__ method. This will allow me to understand how the hooks argument is currently being parsed and modify it to accept lists as its values.' file_pattern=None query=None code_snippet=None class_name='Request' function_name='__init__'
ACTION: thoughts='Since the previous search did not yield any results, I will broaden the search to look for the Request class in general. This will help me locate the file where the Request class is defined, and from there, I can find the __init__ method.' file_pattern=None query=None code_snippet=None class_name='Request' function_name=None




ACTION: thoughts='The search results contain the relevant code for the Request class and its methods, including the __init__ method and the register_hook method. These are crucial for modifying the hooks argument to accept lists. Therefore, the relevant spans have been identified.' identified_spans=[FileWithSpans(file_path='requests/models.py', span_ids=['Request', 'Request.__init__', 'Request.register_hook'])] complete=True
ACTION: thoughts='The first step is to modify the `__init__` method of the `Request` class to handle lists of hooks properly. This involves checking if the hook is a list and, if so, iterating through the list to register each hook individually.' action=ApplyChange(instructions='Update the `__init__` method of the `Request` class to handle lists of hooks properly. Specifically, modify the loop that registers hooks to check if the hook is a list and, if so, iterate through the list to register each hook individually.', file_path='requests/models.py', span_id='Reques

ERROR:root:Error in evaluation of psf__requests-2317 
Traceback (most recent call last):
  File "/tmp/ipykernel_2346473/2240678423.py", line 47, in evaluate
    response = loop.run(problem_statement)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/loop.py", line 144, in run
    self._run()
  File "/share/edc/home/antonis/moatless-tools/moatless/loop.py", line 300, in _run
    response = self.state.handle_action(action)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/find/search.py", line 224, in handle_action
    search_result = self.workspace.code_index.search(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/index/code_index.py", line 155, in search
    return self.semantic_search(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/index/code_index.py", line 210, in semantic_search

ACTION: thoughts="The user is experiencing an issue with the line 'method = builtin_str(method)' in the requests/sessions.py file. This line converts a binary string method to a literal string, which causes a 404 Not Found response. The user is using python3.4 and python-neutronclient (2.3.9) with requests (2.4.3). The issue does not occur in python2.6. To investigate, I need to locate the 'method = builtin_str(method)' line in the requests/sessions.py file." file_pattern='requests/sessions.py' query=None code_snippet='method = builtin_str(method)' class_name=None function_name=None
ACTION: thoughts="To locate where urllib3 exceptions are passing through the requests API, I should look for references to 'requests.packages.urllib3.exceptions.DecodeError' and 'requests.packages.urllib3.exceptions.TimeoutError' within the codebase." file_pattern='**/*.py' query='requests.packages.urllib3.exceptions.DecodeError or requests.packages.urllib3.exceptions.TimeoutError' code_snippet=None class_n

ERROR:root:Error in evaluation of psf__requests-2674 
Traceback (most recent call last):
  File "/tmp/ipykernel_2346473/2240678423.py", line 47, in evaluate
    response = loop.run(problem_statement)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/loop.py", line 144, in run
    self._run()
  File "/share/edc/home/antonis/moatless-tools/moatless/loop.py", line 300, in _run
    response = self.state.handle_action(action)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/find/search.py", line 224, in handle_action
    search_result = self.workspace.code_index.search(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/index/code_index.py", line 155, in search
    return self.semantic_search(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/index/code_index.py", line 210, in semantic_search

ACTION: thoughts="To address the user's concern, I need to locate the usage of `iter_content` and `r.text` in the codebase. This will help understand how these methods are being used and why they might be returning different types." file_pattern='**/*.py' query='usage of iter_content and r.text in the codebase' code_snippet=None class_name=None function_name=None
ACTION: thoughts='The user is experiencing an issue with the `ccode` function when trying to generate C code for the `sinc` function. The `ccode` function is likely part of a symbolic mathematics library, possibly SymPy. I will search for the `ccode` function definition to understand how it handles the `sinc` function.' file_pattern=None query=None code_snippet=None class_name=None function_name='ccode'
ACTION: thoughts="The search results contain the definition of the `ccode` function in the `sympy/printing/ccode.py` file. This is relevant to the user's instructions, which mention the `ccode` function and its behavior with th



ACTION: thoughts='The search results include several files related to LaTeX and pretty printing. The relevant code spans are found in `sympy/printing/latex.py` and `sympy/printing/pretty/pretty.py`. These files contain the `LatexPrinter` and `PrettyPrinter` classes, respectively, which are responsible for the LaTeX and pretty printing functionalities. These are the areas where inconsistencies might be addressed.' identified_spans=[FileWithSpans(file_path='sympy/printing/latex.py', span_ids=['LatexPrinter', 'LatexPrinter.parenthesize', 'LatexPrinter._print_MatrixBase']), FileWithSpans(file_path='sympy/printing/pretty/pretty.py', span_ids=['PrettyPrinter', 'PrettyPrinter._print_MatrixBase', 'PrettyPrinter:27', 'PrettyPrinter._print_Trace'])] complete=False


ERROR:root:Error in evaluation of sympy__sympy-11897 
Traceback (most recent call last):
  File "/tmp/ipykernel_2346473/2240678423.py", line 47, in evaluate
    response = loop.run(problem_statement)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/loop.py", line 144, in run
    self._run()
  File "/share/edc/home/antonis/moatless-tools/moatless/loop.py", line 300, in _run
    response = self.state.handle_action(action)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/find/search.py", line 224, in handle_action
    search_result = self.workspace.code_index.search(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/index/code_index.py", line 155, in search
    return self.semantic_search(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/index/code_index.py", line 210, in semantic_search

ACTION: thoughts='The relevant files for LaTeX and pretty printing have been identified as `sympy/printing/latex.py` and `sympy/printing/pretty/pretty.py`. To address the inconsistencies, I need to search for specific functions within these files that handle the printing of expressions, particularly those involving fractions and logarithms.' file_pattern='sympy/printing/latex.py' query='fraction and logarithm printing' code_snippet=None class_name=None function_name=None
ACTION: thoughts='To address the issue with the mathematica code printer not handling floats and derivatives correctly, I need to locate the MCodePrinter class. This class likely contains methods related to printing or formatting code. I will search for the class definition first.' file_pattern=None query=None code_snippet=None class_name='MCodePrinter' function_name=None
ACTION: thoughts='The search results contain the definition of the MCodePrinter class and several of its methods. However, the methods _print_Derivat

ERROR:root:Error in evaluation of sympy__sympy-12171 
Traceback (most recent call last):
  File "/tmp/ipykernel_2346473/2240678423.py", line 47, in evaluate
    response = loop.run(problem_statement)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/loop.py", line 144, in run
    self._run()
  File "/share/edc/home/antonis/moatless-tools/moatless/loop.py", line 300, in _run
    response = self.state.handle_action(action)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/find/search.py", line 224, in handle_action
    search_result = self.workspace.code_index.search(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/index/code_index.py", line 155, in search
    return self.semantic_search(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/share/edc/home/antonis/moatless-tools/moatless/index/code_index.py", line 210, in semantic_search

ACTION: thoughts="I have located the MCodePrinter class in the sympy/printing/mathematica.py file. Now, I need to search within this file for the methods related to printing derivatives and floats. I will search for methods starting with '_print_' in this file." file_pattern='sympy/printing/mathematica.py' query='def _print_' code_snippet=None class_name=None function_name=None
ACTION: thoughts="The user is experiencing an issue with the 'apart' function in the sympy library. We need to locate the implementation of the 'apart' function in the sympy library to investigate the issue." file_pattern='**/sympy/**/*.py' query=None code_snippet=None class_name=None function_name='apart'
ACTION: thoughts="The user's issue seems to be related to the `apart` function in SymPy, specifically when used in certain contexts. The relevant code spans for this issue are found in `sympy/core/expr.py` and `sympy/polys/partfrac.py`. The `apart` method in `Expr` class and the `apart` function in `partfrac.p



ACTION: thoughts='The method to calculate the sum of all elements in the Identity class has been added. Now, we need to update the code to use this method when calculating the total sum of elements in the identity matrix.' action=ApplyChange(instructions='Update the code to use the _eval_sum_elements method when calculating the total sum of elements in the identity matrix.', file_path='user_code.py', span_id='sum_elements')




ACTION: thoughts='The method to calculate the sum of all elements in the Identity class has been added. Now, we need to ensure that the sum of elements is correctly calculated using this method in the existing code.' action=ApplyChange(instructions='Update the code to use the _eval_sum_elements method when calculating the total sum of elements in the identity matrix.', file_path='sympy/matrices/expressions/matexpr.py', span_id='sum_elements')
ACTION: thoughts="The method to calculate the sum of all elements in the Identity class has been added. The user's code should now correctly calculate the sum of elements in the identity matrix using this method. No further changes are needed in the provided file context." action=Finish(message="The method to calculate the sum of all elements in the Identity class has been added. The user's code should now correctly calculate the sum of elements in the identity matrix using this method.", action='Finish')
ACTION: thoughts='The issue is with the Ma



ACTION: thoughts="The relevant code for the Mod function's eval logic is found in sympy/core/mod.py, specifically in the span with id 'Mod.eval'. This span contains the logic that needs to be updated to check if the base is an integer." identified_spans=[FileWithSpans(file_path='sympy/core/mod.py', span_ids=['Mod.eval'])] complete=True
ACTION: thoughts="The user's request is to update the eval logic of the Mod function to check if the base is an integer. This requires modifying the condition inside the doit function to include a check for p.base being an integer when p is a power." action=ApplyChange(instructions='Update the eval logic of the Mod function to check if the base is an integer when p is a power.', file_path='sympy/core/mod.py', span_id='Mod.eval')
ACTION: thoughts='The eval logic of the Mod function is defined within the eval method. The specific lines that need to be modified are those that check if p is a power and its base is equal to q. We need to add an additional che



ACTION: thoughts="The relevant code for the bell function is located in the file sympy/functions/combinatorial/numbers.py. The spans related to the bell function are: 'bell', 'bell._bell', 'bell._bell_poly', 'bell._bell_incomplete_poly', 'bell.eval', and 'bell._eval_rewrite_as_Sum'. These spans contain the implementation details of the bell function, which is what the user is looking to modify to address the limit issue." identified_spans=[FileWithSpans(file_path='sympy/functions/combinatorial/numbers.py', span_ids=['bell', 'bell._bell', 'bell._bell_poly', 'bell._bell_incomplete_poly', 'bell.eval', 'bell._eval_rewrite_as_Sum'])] complete=False
ACTION: thoughts='We have located the implementation of the bell function in sympy/functions/combinatorial/numbers.py. Now, we need to search for the limit handling for the bell function, which might be in the same file or related files handling limits.' file_pattern='**/*.py' query='bell function limit handling' code_snippet=None class_name=None



ACTION: thoughts='The relevant code for handling the limit of the bell function is likely within the `bell` class in `sympy/functions/combinatorial/numbers.py`. The spans related to the `bell` class and its methods are identified as potentially relevant. The search task is not complete as the specific handling for limits is not yet identified.' identified_spans=[FileWithSpans(file_path='sympy/functions/combinatorial/numbers.py', span_ids=['bell', 'bell._bell', 'bell._bell_poly', 'bell.eval', 'bell._eval_rewrite_as_Sum'])] complete=False
ACTION: thoughts='The implementation of the bell function is located in sympy/functions/combinatorial/numbers.py. To address the limit handling, we should search for the limit method or related limit handling in the same file or related files.' file_pattern='**/*.py' query='def limit' code_snippet=None class_name=None function_name=None
ACTION: thoughts="The search results contain two functions related to limits: `limit` in `sympy/series/limits.py` and 