feat: Add Python code verifiers for Mellea-generated code #171
Conversation
I just added a design rationale.

Design Rationale

Evaluated three approaches for LLM-based code verification. Direct LLM judgment (querying the model with "does this code satisfy the specification?") is trivially inconsistent and incurs a per-validation LLM cost during rejection sampling, so it would be prohibitively expensive. I then perused the web and found an interesting paper on Clover, Stanford's multi-consistency verification approach (http://ai.stanford.edu/blog/clover/), which generates N independent implementations and validates through consensus on test inputs. The main problem I see with this approach is that it multiplies generation costs by N per validation, which isn't great for our purposes. I also read into Python's doctest framework (https://docs.python.org/3/library/doctest.html) as a model for specification-driven testing. The core idea of embedding executable test cases in specifications is sound, but doctest's rigid format doesn't mesh well with flexible LLM outputs, so I scrapped that idea.

So herein is a sort of hybrid mish-mash, if you will: LLM-based test generation with deterministic validation. This architecture amortizes the expensive LLM call (one-time test case generation) across multiple cheap validation passes (deterministic code execution during rejection sampling).

Implementation

Implemented two-phase verification. Phase one (generate_tests) produces the test suite with a single LLM call; phase two executes candidate code against those tests deterministically.

The temperature parameter controls test generation stochasticity. I went with 0.3 as a good balance, but these were my general findings: low temperature (0.2) produces deterministic, consistent test suites, but unadventurous ones; medium (0.5) balances coverage and diversity; high (0.8) increases exploration of edge cases and the parameter space but can be too much. Validated with Granite 3.3 8B; YMMV with different models.

Usage

verifier = PythonMatchesDocstring("Calculate factorial of n", num_tests=5)
verifier.generate_tests(temperature=0.3)  # Note the one-time LLM call
result = session.instruct(
    "Write a factorial function",
    requirements=[verifier]
)

Problems

The LLM still occasionally doesn't use the correct debug params. This might be a bigger issue, but it happened only once for me in the final iteration. Worth testing, though.
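To make the amortization concrete, here is a minimal, self-contained sketch of the two-phase idea. This is not Mellea's actual API: the helper name run_candidate_against_tests, the hard-coded test list, and the sample candidate are assumptions for illustration only. In the real flow, phase one would be the single generate_tests LLM call and phase two would run inside rejection sampling.

import subprocess
import sys
import tempfile
from pathlib import Path

# Illustrative sketch: phase one produces a list of executable assertions once;
# phase two runs each candidate implementation against them deterministically.
def run_candidate_against_tests(code: str, test_cases: list[str], timeout: int = 5) -> bool:
    """Append the generated assertions to the candidate and execute the result."""
    program = code + "\n\n" + "\n".join(test_cases) + "\n"
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(program)
        temp_file = f.name
    try:
        proc = subprocess.run(
            [sys.executable, temp_file], capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        Path(temp_file).unlink(missing_ok=True)

# Phase one would normally be a single LLM call; the suite below is hard-coded
# to keep the sketch runnable.
tests = ["assert factorial(0) == 1", "assert factorial(5) == 120"]
candidate = "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)"
print(run_candidate_against_tests(candidate, tests))  # True for this candidate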
LGTM other than the unsafe code execution.
mellea/stdlib/reqlib/python.py (Outdated)
# Create temporary file and execute
with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
    f.write(code)
    temp_file = f.name

try:
    result = subprocess.run(
        [sys.executable, temp_file], capture_output=True, text=True, timeout=timeout
    )

    if result.returncode == 0:
        return ValidationResult(result=True, reason="Code executed successfully")
    else:
        return ValidationResult(
            result=False,
            reason=f"Execution failed with error: {result.stderr[:200]}",
        )
except subprocess.TimeoutExpired:
    return ValidationResult(
        result=False, reason=f"Execution timed out after {timeout} seconds"
    )
except Exception as e:
    return ValidationResult(result=False, reason=f"Execution error: {e!s}")
finally:
    # Clean up temp file
    try:
        Path(temp_file).unlink()
    except Exception:
        pass
We should use ibm/guardx or something similar here. There are now good WASM solutions that might prove to be a better option than containers long term.
@0xCUB3 Thanks for the contribution! LGTM other than sandboxing code execution. Can you please make that update, then comment here indicating it's ready for review?
That's a good point. It is unsafe, so I've done a little bit of research on some options.
What direction do you think we should follow?
I added the unsafe execution backend. A full implementation would include something like GuardX as a sandboxed option. Another alternative could be llm-sandbox, though using an in-house solution is probably the best idea.
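For illustration, here is a rough sketch of what a swappable execution-backend interface could look like. The class names follow the commit description below (SafeEnvironment, UnsafeEnvironment), but the execute() method and its (ok, message) return value are assumptions, not the PR's actual signatures.

import ast
import subprocess
import sys

class ExecutionEnvironment:
    """Sketch of a common interface so verifiers stay agnostic to execution strategy."""
    def execute(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        raise NotImplementedError

class SafeEnvironment(ExecutionEnvironment):
    """Static validation only: never runs the code."""
    def execute(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            ast.parse(code)
            return True, "Syntax OK (code not executed)"
        except SyntaxError as e:
            return False, f"Syntax error: {e}"

class UnsafeEnvironment(ExecutionEnvironment):
    """Direct execution in a subprocess; no sandboxing."""
    def execute(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, text=True, timeout=timeout,
            )
            if proc.returncode == 0:
                return True, "Executed successfully"
            return False, proc.stderr[:200]
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout}s"

# A sandboxed backend (GuardX, llm-sandbox, or a WASM runtime) would implement
# the same interface, so the verifiers above would not need to change.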
Please rebase & squash the changes into several commits for review.
- Add SafeEnvironment for static validation
- Add UnsafeEnvironment for direct execution
- Add LLMSandboxEnvironment for sandboxed execution
- Add Python code validation requirements (PythonMatchesDocstring, etc.)
- Support multiple execution backends for different security needs
- Update backend tests to use async methods
- Add tests for Python code verifiers
- Update test fixtures for new execution backends
- Adjust tests for removed tools module
- Update all examples to use async backend methods
- Remove references to deprecated tools module
- Clean up import statements
- Remove mellea/stdlib/tools module (functionality moved to reqlib/python.py)
- Remove mellea/stdlib/reqlib/tools.py
- Update pyproject.toml dependencies
- Update CHANGELOG and LICENSE formatting
- Update uv.lock with dependency changes
Force-pushed 9700ecb to fcc2e7e.
Added Verifiers
Since the "good" code isn't always at the beginning or the end of the model's output, I handle common LLM patterns (function + tests, bad -> good examples, code evolution) with a basic scoring algorithm based on length, structure, test indicators, and context cues. This worked well for about 50 code samples of varying complexity.
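As a rough illustration of that heuristic, here is a self-contained sketch. The weights, indicator keywords, and function names (score_block, pick_best_block) are made up for the example and are not the PR's actual values.

import re

# Sketch of scoring-based extraction: given candidate code blocks pulled from an
# LLM response, prefer the block that looks like the real implementation.
def score_block(block: str) -> float:
    score = 0.0
    score += min(len(block.splitlines()), 40) * 0.5                 # length, capped
    if re.search(r"^\s*def\s+\w+\(", block, re.MULTILINE):          # structure: defines a function
        score += 10
    if '"""' in block or "'''" in block:                            # structure: has a docstring
        score += 3
    if re.search(r"\bassert\b|unittest|pytest", block):             # test indicators
        score += 2
    if re.search(r"#\s*(bad|wrong|broken)", block, re.IGNORECASE):  # context cues
        score -= 8                                                  # penalize "bad example" blocks
    return score

def pick_best_block(blocks: list[str]) -> str:
    return max(blocks, key=score_block)

blocks = [
    "# bad example\ndef factorial(n):\n    return n",
    'def factorial(n):\n    """Return n!."""\n    return 1 if n <= 1 else n * factorial(n - 1)',
]
print(pick_best_block(blocks).splitlines()[0])  # the corrected implementation wins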
Validation
Tested with Ollama + Granite 3.3 8B. Tests covered extraction edge cases, syntax validation, execution, and integration scenarios.
Note: Some test cases were generated with AI assistance and reviewed/validated by me.
Review Before Merging