feat: Add secure Python code execution with llm-sandbox support #217

0xCUB3 · 2025-10-25T00:15:46Z

Flexible Python code execution validation system with three execution modes:

Safe mode (default): Syntax validation only
Unsafe execution: Direct subprocess execution with warnings
Sandbox execution: Secure Docker-based execution via llm-sandbox

Elements

Abstract backend architecture with pluggable execution strategies
Import restrictions as an additional security layer (AST-based analysis)
Configurable timeouts for both unsafe and sandbox execution
Safety warnings when using unsafe execution mode
Fallbacks when llm-sandbox is not available
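
For readers who want a picture of the pluggable design listed above, here is a minimal sketch. It is not the code from this PR: the method name `run`, the helper `select_backend`, the default timeout, and the message strings are all assumptions made for illustration, and the Docker-backed strategy is left out because it depends on llm-sandbox.

```python
import ast
import subprocess
import sys
import warnings
from abc import ABC, abstractmethod
from typing import Tuple


class ExecutionBackend(ABC):
    """One execution strategy; subclasses decide how (or whether) code runs."""

    def __init__(self, timeout: int = 5) -> None:
        # Default value is an assumption for this sketch.
        self.timeout = timeout

    @abstractmethod
    def run(self, code: str) -> Tuple[bool, str]:
        """Return (success, message) for the given source."""


class SafeBackend(ExecutionBackend):
    """Default mode: parse the code, never execute it."""

    def run(self, code: str) -> Tuple[bool, str]:
        try:
            ast.parse(code)
        except SyntaxError as exc:
            return False, f"Syntax error on line {exc.lineno}: {exc.msg}"
        return True, "Syntax is valid"


class UnsafeBackend(ExecutionBackend):
    """Opt-in mode: run in a subprocess, guarded only by a timeout."""

    def run(self, code: str) -> Tuple[bool, str]:
        warnings.warn("Running code without a sandbox", RuntimeWarning, stacklevel=2)
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True,
                text=True,
                timeout=self.timeout,
            )
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {self.timeout}s"
        if proc.returncode != 0:
            return False, proc.stderr.strip()
        return True, "Executed without error"


def select_backend(allow_unsafe_execution: bool = False, timeout: int = 5) -> ExecutionBackend:
    """Pick a strategy from the flags; unsafe execution requires explicit opt-in."""
    backend_cls = UnsafeBackend if allow_unsafe_execution else SafeBackend
    return backend_cls(timeout=timeout)
```

Keeping each mode behind the same `run` signature means the requirement class only has to choose a strategy once, and a sandboxed strategy can be added later without touching the callers.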

API

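# Import path assumed from the module touched in this PR (mellea/stdlib/reqlib/python.py)
from mellea.stdlib.reqlib.python import PythonExecutesWithoutError

# Safe mode (default) - validation only
req = PythonExecutesWithoutError()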

# Unsafe execution with warning
req = PythonExecutesWithoutError(allow_unsafe_execution=True, timeout=10)

# Secure sandbox execution
req = PythonExecutesWithoutError(use_sandbox=True, timeout=10)

# With import restrictions
req = PythonExecutesWithoutError(
    use_sandbox=True,
    allowed_imports=["os", "sys", "json"],
    timeout=10
)

Dependencies

Adds llm-sandbox[docker]
Requires Docker for sandbox functionality

Testing

# Run core tests (no Docker required)
python -m pytest test/stdlib_basics/test_reqlib_python.py -k "not sandbox"

# Run all tests including sandbox (requires Docker)
python -m pytest test/stdlib_basics/test_reqlib_python.py

No breaking changes. Existing code using PythonExecutesWithoutError() continues to work with safe validation mode.

TODO: Documentation, perhaps more rigorous testing on larger code chunks

mergify · 2025-10-25T00:16:20Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:
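
To see what the rule above accepts, the pattern can be tried against a few titles. This is only a sketch: `is_conventional` is a hypothetical helper, and it assumes the `~=` operator behaves like a regular-expression search anchored by the leading `^`.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a recognized type, optional scope, and a colon."""
    return CONVENTIONAL_TITLE.search(title) is not None


for title in (
    "feat: Add secure Python code execution with llm-sandbox support",
    "refactor(reqlib): rename a class",
    "Add mypy ignores.",
):
    print(f"{is_conventional(title)!s:5}  {title}")
```

The first two titles satisfy the pattern; the third has no type prefix and does not.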

jakelorocco

Looks good! I left a few minor comments but the core of it looks nice. Please feel free to push back on anything you disagree with / let me know if you already talked to Nathan about any choices I've commented on.

Also, can you please rebase/merge main and resolve the uv.lock conflict as well?

mellea/stdlib/reqlib/python.py

test/stdlib_basics/test_reqlib_python.py

- Add PythonExecutesWithoutError requirement with three execution backends:
  - SafeBackend: Validates syntax and imports without execution (default)
  - UnsafeBackend: Direct subprocess execution with warnings
  - LLMSandboxBackend: Docker-based execution using llm-sandbox
- Implement allow_unsafe_execution flag with explicit opt-in and warnings
- Add import restriction support for defense-in-depth security
- Support use_sandbox flag for secure Docker-based execution
- Include comprehensive test suite with 21 test cases
- Maintain backward compatibility while defaulting to safe mode
- Add llm-sandbox[docker] dependency for optional sandbox functionality

Improves code formatting and readability in python.py by splitting long lines, adding whitespace, and updating argument formatting. Also updates test import order in test_reqlib_python.py for consistency.

As suggested in PR review, the class name now includes 'Req' to better align with naming conventions for Requirement classes.

All subclasses had identical __init__ methods, so this reduces code duplication by implementing it once in the base class.

- Renamed ExecutionBackend to ExecutionEnvironment and all subclasses
- Updated documentation to clarify that allowed_imports=None means any import is allowed
- This avoids confusion with the main Backend classes used for LLM generation

Clarified why blocks with fewer than 2 non-trivial lines are penalized and added a TODO for future improvements using comment-to-code ratio.

- Enhanced docstring to explain code extraction and 'best block' selection
- Preserve underlying extraction failure reasons for better debugging
- Error messages now include specific details about why extraction failed
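
One way to keep the failure reason alongside the extraction result is sketched below. It assumes the code arrives inside Markdown-style fenced blocks in a model response, which this thread does not confirm; `extract_code_blocks` and the reason strings are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Built indirectly so the example can itself sit inside a fenced block.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:python|py)?[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code_blocks(text: str) -> Tuple[List[str], Optional[str]]:
    """Return (blocks, failure_reason); failure_reason is None on success."""
    if not text.strip():
        return [], "response was empty"
    raw = FENCE.findall(text)
    if not raw:
        return [], "no fenced code block found in response"
    blocks = [b.strip() for b in raw if b.strip()]
    if not blocks:
        return [], f"found {len(raw)} fenced block(s), but all were empty"
    return blocks, None


blocks, reason = extract_code_blocks("Here is some prose with no code.")
if reason is not None:
    # The specific reason is carried into the user-facing message.
    print(f"Could not extract Python code: {reason}")
```

Returning the reason instead of a bare empty result lets the caller report why validation could not start, rather than a generic "no code found".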

Removed the unused context_text parameter from _score_code_block function and updated all call sites to match the simplified signature.

- Added _get_unauthorized_imports function to return specific unauthorized import names
- Enhanced all import restriction error messages to show which imports are unauthorized
- Improved debugging experience by providing actionable error details
- Maintained backward compatibility with existing tests
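
A rough idea of how AST-based import checking can report the offending names is shown here. The function below is a stand-in written for this note, not the PR's `_get_unauthorized_imports`; its signature and its handling of relative imports are assumptions.

```python
import ast
from typing import List, Optional


def get_unauthorized_imports(code: str, allowed_imports: Optional[List[str]]) -> List[str]:
    """Return top-level module names imported by `code` that are not in the allow-list.

    allowed_imports=None is treated as "any import is allowed".
    """
    if allowed_imports is None:
        return []
    allowed = set(allowed_imports)
    unauthorized: List[str] = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (`from . import x`) have module=None and are skipped here.
            names = [node.module] if node.module else []
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # `urllib.request` is checked as `urllib`
            if root not in allowed and root not in unauthorized:
                unauthorized.append(root)
    return unauthorized


bad = get_unauthorized_imports(
    "import os\nimport socket\nfrom urllib.request import urlopen\n",
    allowed_imports=["os", "sys", "json"],
)
if bad:
    print(f"Unauthorized imports: {', '.join(sorted(bad))}")
```

Static analysis of this kind only sees `import` statements; dynamic imports through `__import__` or `importlib` are not detected, which is why it works as an extra layer rather than a replacement for the sandbox.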

Added comment explaining that sys.executable uses the same Python interpreter and environment as the current process, ensuring access to all installed packages and dependencies.

- Added stdout output to success messages for both subprocess and sandbox execution
- Provides valuable debugging information and execution feedback
- Only includes output when present, keeps messages clean when no output
- Helps users understand what their code actually did during execution
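
The subprocess behavior described in the two commit notes above can be pictured roughly as follows. `run_in_subprocess`, the use of `-c`, and the message wording are assumptions for illustration; the PR may pass the code to the interpreter differently.

```python
import subprocess
import sys
from typing import Tuple


def run_in_subprocess(code: str, timeout: int = 5) -> Tuple[bool, str]:
    """Run `code` with the current interpreter and return (success, message)."""
    try:
        # sys.executable is the interpreter running this process, so the child
        # sees the same environment and installed packages.
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"Execution timed out after {timeout} seconds"
    if proc.returncode != 0:
        return False, f"Execution failed: {proc.stderr.strip()}"
    message = "Code executed successfully"
    stdout = proc.stdout.strip()
    if stdout:
        # Output is appended only when there is something to show.
        message += f"\nOutput: {stdout}"
    return True, message
```

Calling `run_in_subprocess("print('hi')")` yields a success message ending in `Output: hi`, while code that prints nothing yields the bare success message.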

- Added detailed logging for unknown sandbox errors
- Include exit code and available attributes for better debugging
- Provide more informative error messages when stderr is not available
- Helps diagnose sandbox execution issues more effectively
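
As an illustration of the fallback described above, the sketch below builds an error message from whatever a result object exposes. The result object here is a stand-in: the attribute names `stderr` and `exit_code` are assumptions, not a statement about what llm-sandbox returns.

```python
from types import SimpleNamespace


def describe_failure(result: object) -> str:
    """Build an error message, falling back to exit code and attribute names."""
    stderr = getattr(result, "stderr", None)
    if stderr:
        return f"Sandbox execution failed: {stderr.strip()}"
    exit_code = getattr(result, "exit_code", "unknown")
    attrs = sorted(name for name in dir(result) if not name.startswith("_"))
    return (
        f"Sandbox execution failed with exit code {exit_code}; "
        f"no stderr available (result attributes: {', '.join(attrs)})"
    )


# Stand-in result with no stderr attribute at all.
print(describe_failure(SimpleNamespace(exit_code=137, stdout="")))
```

Listing the available attributes in the message gives whoever is debugging a starting point when the result shape differs from what the caller expected.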

Documented all scoring criteria including length bonus, function/class detection, control flow analysis, and non-trivial content filtering to help developers understand code block prioritization logic.
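
To make the scoring criteria concrete, here is a small heuristic in the same spirit. Every weight and threshold is invented for this example, and `score_code_block` below is a simplified stand-in rather than the PR's `_score_code_block`.

```python
from typing import List


def score_code_block(code: str) -> int:
    """Heuristic score; higher means more likely to be the main block."""
    # Non-trivial lines: neither blank nor a pure comment.
    non_trivial = [
        ln.strip()
        for ln in code.splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    score = min(len(non_trivial), 20)  # length bonus, capped
    if any(ln.startswith(("def ", "class ")) for ln in non_trivial):
        score += 10  # defines a function or class
    if any(ln.startswith(("if ", "for ", "while ", "try:", "with ")) for ln in non_trivial):
        score += 5  # contains control flow
    if len(non_trivial) < 2:
        score -= 10  # near-empty snippets are probably asides, not the answer
    return score


def pick_best_block(blocks: List[str]) -> str:
    """Return the highest-scoring block."""
    return max(blocks, key=score_code_block)


candidates = [
    "import os",
    "def add(a, b):\n    return a + b\n\nfor i in range(3):\n    print(add(i, i))",
]
print(pick_best_block(candidates))
```

With these made-up weights the one-line snippet is penalized and the block that defines and calls a function is selected.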

- Replaced hard-coded skip with runtime Docker availability detection
- Added llm_sandbox import check and Docker connectivity test
- Sandbox tests now run when Docker is available, skip gracefully when not
- All 21 tests now pass when Docker is running
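
A runtime check along these lines might look like the sketch below. It is an assumption-laden illustration: `sandbox_available` is a hypothetical helper, and it probes the daemon with the `docker info` CLI command, whereas the PR's tests may use a different connectivity test.

```python
import importlib.util
import shutil
import subprocess


def sandbox_available() -> bool:
    """Return True only if llm_sandbox is importable and a Docker daemon responds."""
    if importlib.util.find_spec("llm_sandbox") is None:
        return False
    if shutil.which("docker") is None:
        return False
    try:
        # `docker info` exits non-zero when the daemon cannot be reached.
        result = subprocess.run(["docker", "info"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# In a pytest module this could gate the sandbox tests, for example:
#   requires_sandbox = pytest.mark.skipif(
#       not sandbox_available(), reason="llm-sandbox or Docker not available"
#   )
print("sandbox tests enabled:", sandbox_available())
```

Evaluating availability at collection time lets the same test file pass on machines without Docker while still exercising the sandbox path where it exists.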

- Fixed ruff formatting across all files
- Added explicit type annotation for unauthorized list in _get_unauthorized_imports
- Resolved MyPy type annotation error in python.py

0xCUB3 · 2025-11-11T16:35:08Z

I pushed fixes for pre-commit errors relating to my code. There are still more errors but they seem to be unrelated to my changes. I would be happy to fix those too, but everything on my end should be good.

jakelorocco · 2025-11-11T17:15:06Z

@0xCUB3, yeah, it looks like we likely haven't touched the folder you're editing since we apparently forced those mypy changes in. Could you please just add # type: ignore to those lines? That way we can make sure your tests run on the GitHub runner and merge your changes. Everything looks good to me!

0xCUB3 · 2025-11-11T19:44:01Z

I added the ignores. The checks should hopefully pass now.

0xCUB3 · 2025-11-11T19:58:42Z

Sorry about that... ran the linter again to fix the one remaining error.

jakelorocco · 2025-11-11T20:40:11Z

Looks like a test not related to your stuff failed. Let me take a look to confirm, but we should be good to merge your PR.

jakelorocco · 2025-11-11T20:57:14Z

Okay, confirmed that it's a separate issue with an update to the litellm package we use. I will approve and merge this.

nrfulton self-requested a review October 29, 2025 23:05

0xCUB3 mentioned this pull request Nov 7, 2025

feat: Convert legacy verifiers to mellea reqlib generative-computing/mellea-contribs#8

Open

nrfulton requested review from jakelorocco and removed request for nrfulton November 11, 2025 00:51

jakelorocco reviewed Nov 11, 2025

0xCUB3 added 2 commits November 11, 2025 10:19

Refactor Python execution backends and formatting

5aad5cb

0xCUB3 force-pushed the feat/llm-sandbox-execution branch from c38e456 to 5aad5cb Compare November 11, 2025 15:19

0xCUB3 and others added 13 commits November 11, 2025 10:20

Merge branch 'main' into feat/llm-sandbox-execution

00592f7

refactor: rename PythonExecutesWithoutError to PythonExecutionReq

127681d

refactor: move duplicate __init__ to ExecutionBackend base class

34cb9fc

docs: add explanatory comment for code scoring logic

6b1d1d3

docs: improve error handling and documentation

3aa0feb

refactor: remove unused context_text parameter

a69e377

docs: clarify sys.executable usage in subprocess execution

aaf79f9

docs: add detailed scoring metrics to _score_code_block docstring

18cd3dc

0xCUB3 requested a review from jakelorocco November 11, 2025 16:12

fix: address pre-commit issues

5792791

Add mypy ignores.

60e9092

Linting

f20e744

jakelorocco approved these changes Nov 11, 2025

jakelorocco merged commit 9d12458 into generative-computing:main Nov 11, 2025
1 of 4 checks passed
