feat: Add Python code verifiers for Mellea-generated code #171
Conversation
I just added a design rationale.

Design Rationale

Evaluated three approaches for LLM-based code verification. Direct LLM judgment (querying the model with "does this code satisfy the specification?") is trivially inconsistent and incurs a per-validation LLM cost during rejection sampling, so it would be prohibitively expensive. I then perused the web and found an interesting paper on Clover, Stanford's multi-consistency verification approach (http://ai.stanford.edu/blog/clover/), which generates N independent implementations and validates through consensus on test inputs. The main problem I see with this approach is that it multiplies generation costs by N per validation, which isn't great for our purposes. I also read into Python's doctest framework (https://docs.python.org/3/library/doctest.html) as a model for specification-driven testing. The core idea of embedding executable test cases in specifications is sound, but doctest's rigid format doesn't mesh well with flexible LLM outputs, so I scrapped that idea.

So herein is a sort of hybrid mish-mash, if you will: LLM-based test generation with deterministic validation. This architecture amortizes the expensive LLM call (one-time test case generation) across multiple cheap validation passes (deterministic code execution during rejection sampling).

Implementation

Implemented two-phase verification. Phase one (generate_tests) produces the test suite with a single LLM call; phase two executes candidate code against those tests deterministically.

The temperature parameter controls test generation stochasticity. I went with 0.3 as a good balance, but these were my general findings: low temperature (0.2) produces deterministic, consistent test suites, but unadventurous ones; medium (0.5) balances coverage and diversity; high (0.8) increases exploration of edge cases and the parameter space but can be too much. Validated with Granite 3.3 8B; YMMV with different models.

Usage

verifier = PythonMatchesDocstring("Calculate factorial of n", num_tests=5)
verifier.generate_tests(temperature=0.3)  # Note the one-time LLM call
result = session.instruct(
    "Write a factorial function",
    requirements=[verifier]
)

Problems

The LLM still occasionally doesn't use the correct debug params. This might be a bigger issue, but it happened only once for me in the final iteration. Worth testing, though.
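To make the amortization concrete, here is a minimal, self-contained sketch of the two-phase idea. This is not Mellea's actual API: the helper name run_candidate_against_tests, the hard-coded test list, and the sample candidate are assumptions for illustration only. In the real flow, phase one would be the single generate_tests LLM call and phase two would run inside rejection sampling.

import subprocess
import sys
import tempfile
from pathlib import Path

# Illustrative sketch: phase one produces a list of executable assertions once;
# phase two runs each candidate implementation against them deterministically.
def run_candidate_against_tests(code: str, test_cases: list[str], timeout: int = 5) -> bool:
    """Append the generated assertions to the candidate and execute the result."""
    program = code + "\n\n" + "\n".join(test_cases) + "\n"
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(program)
        temp_file = f.name
    try:
        proc = subprocess.run(
            [sys.executable, temp_file], capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        Path(temp_file).unlink(missing_ok=True)

# Phase one would normally be a single LLM call; the suite below is hard-coded
# to keep the sketch runnable.
tests = ["assert factorial(0) == 1", "assert factorial(5) == 120"]
candidate = "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)"
print(run_candidate_against_tests(candidate, tests))  # True for this candidate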
LGTM other than the unsafe code execution.
mellea/stdlib/reqlib/python.py (Outdated)
# Create temporary file and execute
with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
    f.write(code)
    temp_file = f.name

try:
    result = subprocess.run(
        [sys.executable, temp_file], capture_output=True, text=True, timeout=timeout
    )

    if result.returncode == 0:
        return ValidationResult(result=True, reason="Code executed successfully")
    else:
        return ValidationResult(
            result=False,
            reason=f"Execution failed with error: {result.stderr[:200]}",
        )
except subprocess.TimeoutExpired:
    return ValidationResult(
        result=False, reason=f"Execution timed out after {timeout} seconds"
    )
except Exception as e:
    return ValidationResult(result=False, reason=f"Execution error: {e!s}")
finally:
    # Clean up temp file
    try:
        Path(temp_file).unlink()
    except Exception:
        pass
We should use ibm/guardx or something similar here. There are now good WASM solutions that might prove to be a better option than containers long term.
@0xCUB3 Thanks for the contribution! LGTM other than sandboxing code execution. Can you please make that update, then comment here indicating it's ready for review?
That's a good point. It is unsafe, so I've done a little bit of research on some options.
What direction do you think we should follow?
I added the unsafe execution backend. A full implementation would include something like GuardX as a sandboxed option. Another alternative could be llm-sandbox, though using an in-house solution is probably the best idea.
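For illustration, here is a rough sketch of what a swappable execution-backend interface could look like. The class names follow the commit description below (SafeEnvironment, UnsafeEnvironment), but the execute() method and its (ok, message) return value are assumptions, not the PR's actual signatures.

import ast
import subprocess
import sys

class ExecutionEnvironment:
    """Sketch of a common interface so verifiers stay agnostic to execution strategy."""
    def execute(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        raise NotImplementedError

class SafeEnvironment(ExecutionEnvironment):
    """Static validation only: never runs the code."""
    def execute(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            ast.parse(code)
            return True, "Syntax OK (code not executed)"
        except SyntaxError as e:
            return False, f"Syntax error: {e}"

class UnsafeEnvironment(ExecutionEnvironment):
    """Direct execution in a subprocess; no sandboxing."""
    def execute(self, code: str, timeout: int = 5) -> tuple[bool, str]:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, text=True, timeout=timeout,
            )
            if proc.returncode == 0:
                return True, "Executed successfully"
            return False, proc.stderr[:200]
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout}s"

# A sandboxed backend (GuardX, llm-sandbox, or a WASM runtime) would implement
# the same interface, so the verifiers above would not need to change.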
Please rebase & squash the changes into several commits for review.
- Add SafeEnvironment for static validation
- Add UnsafeEnvironment for direct execution
- Add LLMSandboxEnvironment for sandboxed execution
- Add Python code validation requirements (PythonMatchesDocstring, etc.)
- Support multiple execution backends for different security needs
- Update backend tests to use async methods
- Add tests for Python code verifiers
- Update test fixtures for new execution backends
- Adjust tests for removed tools module
- Update all examples to use async backend methods
- Remove references to deprecated tools module
- Clean up import statements
- Remove mellea/stdlib/tools module (functionality moved to reqlib/python.py)
- Remove mellea/stdlib/reqlib/tools.py
- Update pyproject.toml dependencies
- Update CHANGELOG and LICENSE formatting
- Update uv.lock with dependency changes
Force-pushed 9700ecb to fcc2e7e.
Added Verifiers
Since the "good" code isn't always at the beginning or the end of the model's output, I handle common LLM patterns (function + tests, bad -> good examples, code evolution) with a basic scoring algorithm based on length, structure, test indicators, and context cues. This worked well for about 50 code samples of varying complexity.
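As a rough illustration of that heuristic, here is a self-contained sketch. The weights, indicator keywords, and function names (score_block, pick_best_block) are made up for the example and are not the PR's actual values.

import re

# Sketch of scoring-based extraction: given candidate code blocks pulled from an
# LLM response, prefer the block that looks like the real implementation.
def score_block(block: str) -> float:
    score = 0.0
    score += min(len(block.splitlines()), 40) * 0.5                 # length, capped
    if re.search(r"^\s*def\s+\w+\(", block, re.MULTILINE):          # structure: defines a function
        score += 10
    if '"""' in block or "'''" in block:                            # structure: has a docstring
        score += 3
    if re.search(r"\bassert\b|unittest|pytest", block):             # test indicators
        score += 2
    if re.search(r"#\s*(bad|wrong|broken)", block, re.IGNORECASE):  # context cues
        score -= 8                                                  # penalize "bad example" blocks
    return score

def pick_best_block(blocks: list[str]) -> str:
    return max(blocks, key=score_block)

blocks = [
    "# bad example\ndef factorial(n):\n    return n",
    'def factorial(n):\n    """Return n!."""\n    return 1 if n <= 1 else n * factorial(n - 1)',
]
print(pick_best_block(blocks).splitlines()[0])  # the corrected implementation wins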
Validation
Tested with Ollama + Granite 3.3 8B. Tests covered extraction edge cases, syntax validation, execution, and integration scenarios.
Note: Some test cases were generated with AI assistance and reviewed/validated by me.
Review Before Merging