# Autofix Evaluation
This preliminary, high-level evaluation of Autofix runs on a dataset that pairs Sentry issues with GitHub commits.

Each prediction is graded by sending the expected diff and the predicted diff to n GPTs, with a prompt asking whether the predicted diff is a good fix.

The result is the average of the GPT scores, a float between 0 and 1.
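
The panel grading described above amounts to averaging n independent grader calls. A minimal sketch, assuming a hypothetical `grade` callable that stands in for one GPT call and returns a score in [0, 1]; `panel_score` and `exact_match_grader` are illustrative names, not part of seer:

```python
from typing import Callable


def panel_score(
    grade: Callable[[str, str], float],
    expected_diff: str,
    predicted_diff: str,
    n: int = 3,
) -> float:
    """Average n independent grader scores and round to two decimals."""
    scores = [grade(expected_diff, predicted_diff) for _ in range(n)]
    return round(sum(scores) / n, 2)


# Hypothetical stand-in grader: 1.0 for an exact match, 0.0 otherwise.
def exact_match_grader(expected: str, predicted: str) -> float:
    return 1.0 if expected == predicted else 0.0
```

With a sampling temperature above zero, each call can return a different score, so the average smooths out single-call noise.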

Install the seer requirements:

In [1]:
!pip install -r ../requirements.txt

Collecting certifi==2023.7.22 (from -r ../requirements.txt (line 1))
  Downloading certifi-2023.7.22-py3-none-any.whl.metadata (2.2 kB)
Collecting charset-normalizer==2.0.12 (from -r ../requirements.txt (line 2))
  Downloading charset_normalizer-2.0.12-py3-none-any.whl.metadata (11 kB)
Collecting click==8.1.7 (from -r ../requirements.txt (line 3))
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting contourpy==1.1.0 (from -r ../requirements.txt (line 4))
  Downloading contourpy-1.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting convertdate==2.4.0 (from -r ../requirements.txt (line 5))
  Downloading convertdate-2.4.0-py3-none-any.whl.metadata (8.3 kB)
Collecting cycler==0.11.0 (from -r ../requirements.txt (line 6))
  Downloading cycler-0.11.0-py3-none-any.whl.metadata (785 bytes)
Collecting Cython==3.0.2 (from -r ../requirements.txt (line 7))
  Downloading Cython-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.wh

A couple more libraries are needed for running the eval:

In [2]:
!pip install python-dotenv 'psycopg[binary,pool]' langchain langchain-openai

Collecting langchain
  Downloading langchain-0.1.13-py3-none-any.whl.metadata (13 kB)
Collecting langchain-openai
  Downloading langchain_openai-0.1.1-py3-none-any.whl.metadata (2.5 kB)
Collecting psycopg-binary==3.1.18 (from psycopg[binary,pool])
  Downloading psycopg_binary-3.1.18-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.8 kB)
Collecting psycopg-pool (from psycopg[binary,pool])
  Downloading psycopg_pool-3.2.1-py3-none-any.whl.metadata (2.6 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community<0.1,>=0.0.29 (from langchain)
  Downloading langchain_community-0.0.29-py3-none-any.whl.metadata (8.3 kB)
Collecting langchain-core<0.2.0,>=0.1.33 (from langchain)
  Downloading langchain_core-0.1.36-py3-none-any.whl.metadata (6.0 kB)
Collecti

In [1]:
import os
os.environ['DATABASE_URL'] = "postgresql+psycopg://root:seer@localhost:5433/seer"
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = "https://api.smith.langchain.com"
os.environ['LANGCHAIN_PROJECT'] = "ai-autofix-evals"

from dotenv import load_dotenv
load_dotenv('../.env')

import sys
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../src')))

import logging

logger = logging.getLogger('autofix')
logger.setLevel(logging.DEBUG)
logger.handlers = []
logger.addHandler(logging.StreamHandler())

from github import Github
from github.Auth import Token

github = Github(auth=Token(token=os.environ.get('GITHUB_TOKEN')))
repo = github.get_repo('getsentry/sentry')

from seer.bootup import bootup

bootup(__name__)

  from .autonotebook import tqdm as notebook_tqdm


<Flask '__main__'>

In [2]:
from typing import Any

from github.Commit import Commit
from pydantic import BaseModel, ConfigDict, field_serializer, field_validator

from seer.automation.autofix.models import IssueDetails, EventDetails

class EvalItem(BaseModel):
    raw_data: dict[str, Any]
    commit: Commit
    issue: IssueDetails
    event: EventDetails

    model_config = ConfigDict(
        arbitrary_types_allowed=True
    )

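    # Serialize the PyGithub commit as its SHA so the item stays JSON-friendly.
    @field_serializer('commit')
    def serialize_commit(self, commit: Commit, _info):
        return commit.sha

    # Accept either a Commit object or a SHA string; a SHA is resolved through the global `repo`.
    @field_validator('commit', mode="before")
    @classmethod
    def validate_commit(cls, commit: Commit | str):
        return commit if isinstance(commit, Commit) else repo.get_commit(commit)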
    
class EvalItemWithDiff(EvalItem):
    diff: str

Create a predict function to be called during the eval:

In [6]:
from seer.automation.autofix.autofix import Autofix
from seer.automation.autofix.tasks import ContinuationState
from seer.rpc import DummyRpcClient
from seer.automation.autofix.models import (
    AutofixContinuation,
    AutofixRequest,
    RepoDefinition,
)
from sentence_transformers import SentenceTransformer
from seer.automation.autofix.autofix_context import AutofixContext
from seer.automation.autofix.event_manager import AutofixEventManager

embedding_model = SentenceTransformer("../models/autofix_embeddings_v0", trust_remote_code=True)
embedding_model.max_seq_length = 4096

def predict_result(input_: dict) -> dict:
    run_item = EvalItem.model_validate(input_)

    # Initializes the rpc client in DRY RUN mode
    rpc_client = DummyRpcClient()
    rpc_client.dry_run = True

    request = AutofixRequest(
        organization_id=1,
        project_id=1,
        repos=[RepoDefinition(provider="github", owner="getsentry", name="sentry")],
        base_commit_sha=run_item.commit.parents[0].sha,
        issue=run_item.issue,
    )

    state = ContinuationState(
        val=AutofixContinuation(request=AutofixRequest.model_validate(request)), rpc_client=rpc_client
    )

    event_manager = AutofixEventManager(state)
    context = AutofixContext(
        organization_id=request.organization_id,
        project_id=request.project_id,
        repos=request.repos,
        event_manager=event_manager,
        state=state,
        embedding_model=embedding_model,
        sha=run_item.commit.parents[0].sha,
    )
    context.commit_changes = False
    autofix = Autofix(context)

    response = autofix.invoke(request)

    if response is None:
        return {"output": None}

    return {"output": response['outputs'][0]}

Create the scoring prompt:

In [7]:
from langsmith import traceable
from langchain_openai import ChatOpenAI
from github.Commit import Commit
from github.File import File
from xml.etree import ElementTree as ET

from seer.automation.autofix.models import AutofixOutput
from seer.automation.autofix.prompts import format_exceptions
from seer.automation.autofix.utils import extract_xml_element_text, escape_multi_xml

n_panel = 3
model = ChatOpenAI(model_name="gpt-4-0125-preview", temperature=0.8)

def score_fix_single_it(eval_item: EvalItemWithDiff, predicted_output: AutofixOutput) -> float:
    completion = model.invoke(f"""<issue>
<error_message>
{eval_item.event.title}
</error_message>
<exceptions>
{format_exceptions(eval_item.event.exceptions)}
</exceptions>
</issue>

Given the above issue, we know the correct fix is:

<expected_solution>
<description>
{eval_item.commit.commit.message}
</description>
<changes>
{eval_item.diff}
</changes>
</expected_solution>

The model outputted the following solution:

<predicted_solution>
{predicted_output.diff_str}
</predicted_solution>

Score how well the predicted solution matches the expected solution with a float score from 0 to 1, where 1 means the solution fully fixes the issue and 0 means the solution does not fix the issue at all.
- Consider the context of the issue and the diff
- Consider that there are multiple ways to fix an issue

Think step-by-step inside a <thoughts> tag before giving a score.
Return the score inside a <score> tag.""")
    tree = ET.fromstring(f"<root>{escape_multi_xml(completion.content, ['score'])}</root>")
    score_str = extract_xml_element_text(tree, 'score')
    score = float(score_str) if score_str else 0

    return score

@traceable(name="Score 1 item", run_type="chain")
def score_one(eval_item: EvalItemWithDiff, predicted_output: AutofixOutput) -> float:
    return round(sum([score_fix_single_it(eval_item, predicted_output) for _ in range(n_panel)]) / n_panel, 2)
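
The scoring cell above depends on the grader replying with a well-formed `<score>` tag. As a rough illustration of what that parsing step has to handle, here is a standard-library sketch; `parse_score` is a hypothetical helper and is not how seer's `extract_xml_element_text` or `escape_multi_xml` are implemented:

```python
import re


def parse_score(completion_text: str, default: float = 0.0) -> float:
    """Pull the first <score>...</score> value out of a model reply.

    Falls back to `default` when the tag is missing or not a number,
    and clamps the result to the [0, 1] range the prompt asks for.
    """
    match = re.search(r"<score>\s*(.*?)\s*</score>", completion_text, re.DOTALL)
    if match is None:
        return default
    try:
        value = float(match.group(1))
    except ValueError:
        return default
    return min(1.0, max(0.0, value))
```

Defaulting to 0 for an unparseable reply matches the behaviour of the cell above, which scores a missing `<score>` tag as 0.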

Run the eval:

In [8]:
from langsmith import Client
from langsmith.evaluation import EvaluationResult, run_evaluator
from langsmith.schemas import Example, Run
from langchain.smith import RunEvalConfig

@run_evaluator
def gpt_panel(run: Run, example: Example | None = None):
    eval_item = EvalItem.model_validate(run.inputs)
    with_diff = EvalItemWithDiff.model_validate(dict(**dict(eval_item), diff=example.outputs.get('diff')))
    score = score_one(with_diff, AutofixOutput.model_validate(run.outputs.get('output')))
    return EvaluationResult(key="diff_gpt_panel_n3_score", score=score)

langsmith_client = Client()
dataset_name = "Autofix Eval Full 240314"

eval_config = RunEvalConfig(
    custom_evaluators=[gpt_panel]
)

ds = langsmith_client.read_dataset(dataset_name=dataset_name)

langsmith_client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=predict_result,
    evaluation=eval_config,
    verbose=True,
    project_name="Autofix v2 chroma-test2",
    concurrency_level=4,
)

View the evaluation results for project 'Autofix v2 chroma-test2' at:
https://smith.langchain.com/o/69969f08-f971-4321-84d8-85f6f3475112/datasets/ed9565ba-da35-4002-8054-b77ed9aa2bac/compare?selectedSessions=3372adea-f9ff-4727-865a-7d18c9700e25

View all tests for Dataset Autofix Eval Full 240314 at:
https://smith.langchain.com/o/69969f08-f971-4321-84d8-85f6f3475112/datasets/ed9565ba-da35-4002-8054-b77ed9aa2bac
[>                                                 ] 0/36

Loaded codebase index for getsentry/sentry, with existing data
Beginning autofix for issue 4736975025
on_autofix_step_update invoking...
on_autofix_step_update done
on_autofix_step_update invoking...
on_autofix_step_update done
----[GptAgent] Running Agent----
Previous messages: 
system: You are an exceptional principal engineer that is amazing at finding the root cause of any issue.

You have tools to search a codebase to find the root cause of an issue. Please use the tools as many times as you want to find the root cause of the issue. It is very important to be very detailed and clear in your output.

You must follow the below XML format in your output:
<potential_root_causes>
<potential_cause likelihood="0.8" actionability="1.0">
<title>
The foo() function is returning the wrong value
</title>
<description>
The root cause of the issue is that the foo() function is returning the wrong value
</description>
<suggested_fix>
<title>
Fix the foo() function by returning 'bar'
</title>
<de

[>                                                 ] 1/36

Loaded codebase index for getsentry/sentry, with existing data
Beginning autofix for issue 4742425761
on_autofix_step_update invoking...
on_autofix_step_update done
on_autofix_step_update invoking...
on_autofix_step_update done
Cleaned up workspace for namespace 1
----[GptAgent] Running Agent----
Previous messages: 
system: You are an exceptional principal engineer that is amazing at finding the root cause of any issue.

You have tools to search a codebase to find the root cause of an issue. Please use the tools as many times as you want to find the root cause of the issue. It is very important to be very detailed and clear in your output.

You must follow the below XML format in your output:
<potential_root_causes>
<potential_cause likelihood="0.8" actionability="1.0">
<title>
The foo() function is returning the wrong value
</title>
<description>
The root cause of the issue is that the foo() function is returning the wrong value
</description>
<suggested_fix>
<title>
Fix the foo() fun

In [None]:

import json
from seer.automation.codebase.models import CodebaseNamespace, RepositoryInfo
from seer.db import DbCodebaseNamespace, DbRepositoryInfo, Session


def get_namespace_dumps():
    with Session() as session:
        repository_info = session.query(DbRepositoryInfo).all()
        codebase_namespaces = session.query(DbCodebaseNamespace).all()
        
        repo_infos = [RepositoryInfo.from_db(repo_info).model_dump_json() for repo_info in repository_info]
        namespaces = [CodebaseNamespace.from_db(codebase_namespace).model_dump_json() for codebase_namespace in codebase_namespaces]

    return repo_infos, namespaces

repo_infos, namespaces = get_namespace_dumps()

with open('repo_infos.json', 'w') as f:
    f.write(json.dumps(repo_infos))

with open('namespaces.json', 'w') as f:
    f.write(json.dumps(namespaces))
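
Because `model_dump_json()` already returns a JSON string, the files written above hold a JSON array of JSON-encoded strings, so reading them back takes two decoding passes. A minimal standard-library sketch, where the `id` and `name` fields are made-up placeholders rather than the real `RepositoryInfo` schema:

```python
import json

# Stand-ins for RepositoryInfo.model_dump_json(): each entry is already a JSON string.
records = [{"id": 1, "name": "sentry"}, {"id": 2, "name": "seer"}]
dumped = [json.dumps(record) for record in records]

# Mirrors the cell above: the list of JSON strings is encoded a second time.
file_contents = json.dumps(dumped)

# First pass yields a list of strings; second pass decodes each string.
restored = [json.loads(item) for item in json.loads(file_contents)]
```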
