Adding Fix-Code Microbenchmark #159
Conversation
… agent debugging, to experiment with prompts and tool design for how the agent can fix issues in generated code
Force-pushed from da57b29 to 6a76899
@claude review this PR.
Claude finished @kovtcharov-amd's task.
Reviewing PR #159: Adding Fix-Code Microbenchmark
Summary
This PR adds a well-designed testbench for evaluating code-fixing prompts against LLM models. The implementation is clean, follows GAIA patterns, and includes good documentation with real-world examples.
Positive Aspects
Issues and Suggestions
High Priority
Medium Priority
Low Priority / Suggestions
Overall Assessment
This is a solid contribution that adds valuable tooling for prompt engineering experimentation. The design follows GAIA patterns and the documentation is thorough. Once the minor issues above are addressed (especially the copyright years and markdown syntax), this is ready to merge.
@claude review this PR.
Claude finished @kovtcharov-amd's task.
Reviewing PR #159: Adding Fix-Code Microbenchmark
Summary
This is a well-designed testbench for evaluating code-fixing prompts against LLM models. The implementation follows GAIA patterns and includes thorough documentation with real-world examples.
Issues to Address
🔴 High Priority
🟡 Medium Priority
🟢 Low Priority
Positive Aspects
Overall Assessment
Solid contribution adding valuable tooling for prompt engineering experimentation. Once the high-priority issues (missing …
@claude review this PR.
Claude finished @eddierichter-amd's task.
Reviewing PR #159: Adding Fix-Code Microbenchmark
Summary
This PR adds a well-designed testbench for evaluating code-fixing prompts against LLM models (local or Claude). The implementation is clean, follows GAIA patterns, and includes thorough documentation with real-world examples.
Status of Previously Reported Issues
Remaining Issues
🔴 High Priority
🟡 Medium Priority
🟢 Low Priority
Positive Aspects
Overall Assessment
Solid contribution adding valuable tooling for prompt engineering experimentation. The critical issues from previous reviews (missing …
@claude I will fix the headers in another PR. Review this PR; I believe I have fixed all of the high- and medium-priority issues.
Claude finished @eddierichter-amd's task.
Reviewing PR #159: Adding Fix-Code Microbenchmark
Summary
Good progress on addressing the previously reported issues! Most items have been fixed.
Status of Previously Reported Issues
Remaining Issue
🟡 CLI Example Path Mismatch (…
@claude good catch. I just fixed cli.py. Anything else that I missed?
Claude finished @eddierichter-amd's task.
Final Review of PR #159
All High and Medium Priority Issues Addressed ✅
Remaining Low Priority Item
🟢 Potential None return (…
Ready to Merge
All high and medium priority issues have been addressed. The PR is ready to merge. Nice work!
@kovtcharov-amd this passed Claude's review. Let me know what you think.
Added a gaia eval fix-code CLI helper to run the new fix-code testbench (local model or Claude) with prompt-engineering, edit_file, line-range, and tuning options wired through the handler in src/gaia/cli.py. Introduced the src/gaia/eval/fix_code_testbench module plus a README and sample bug fixtures to experiment with and iterate on prompt designs for automated code fixes, writing patched output and showing diffs.
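For context on what a fix-code testbench loop of this kind typically does, here is a minimal sketch of the load → prompt → patch → diff cycle described above. It is illustrative only and rests on assumptions: call_model, build_prompt, and run_fixture are hypothetical names, not the actual fix_code_testbench API, and the real CLI additionally wires through the prompt-engineering, edit_file, line-range, and tuning options mentioned in the PR description.

```python
# Illustrative sketch only -- the real testbench lives in
# src/gaia/eval/fix_code_testbench and its API may differ.
import difflib
from pathlib import Path


def call_model(prompt: str) -> str:
    """Hypothetical model call; stands in for a local model or Claude client."""
    raise NotImplementedError("wire in an LLM client here")


def build_prompt(source: str, error_message: str) -> str:
    """Assemble a code-fixing prompt from the buggy source and its error."""
    return (
        "The following code fails with this error:\n"
        f"{error_message}\n\n"
        "Code:\n"
        f"{source}\n\n"
        "Return the corrected file contents only."
    )


def run_fixture(bug_path: Path, error_message: str, out_dir: Path) -> str:
    """Run one sample bug fixture: prompt the model, write the patched file,
    and return a unified diff of the proposed fix."""
    original = bug_path.read_text()
    fixed = call_model(build_prompt(original, error_message))

    out_dir.mkdir(parents=True, exist_ok=True)
    patched_path = out_dir / bug_path.name
    patched_path.write_text(fixed)

    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=str(bug_path),
        tofile=str(patched_path),
    )
    return "".join(diff)
```

In the PR itself, this kind of loop is driven from the gaia eval fix-code command through the handler in src/gaia/cli.py, so the sample bug fixtures can be rerun quickly while iterating on prompt designs.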