Final polish: non-root Dockerfile, /reset task_id API support, logging cleanup, new tests #2
Merged
…sive tests
- Add confidentiality/NDA (medium+), termination (hard++), data protection (expert) tasks
- Add opponent simulation with contextual counterparty responses per action type
- Enhance grading with semantic similarity (cosine+Jaccard) and clause completeness scoring
- Add 3 new task-specific graders: grade_medium_plus, grade_hard_plus2, grade_expert
- Add required_elements field for completeness scoring per task
- Update reward formula: 35% correctness + 25% improvement + 25% risk_alignment + 10% semantic + 5% completeness
- Improve inference script with adaptive multi-turn strategy for all 8 tasks
- Add 21 new tests (42 total, all passing)
- Update openenv.yaml, README, and all exports
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/e3c399ed-eac8-40b6-9c61-196e32f44385
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
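
The reward formula in this commit is a weighted sum over five dimensions. A minimal sketch of that arithmetic follows; the weights come from the commit message, while the function name, the dictionary keys, and the assumption that each component score lies in [0, 1] are illustrative and may not match the project's graders.

```python
# Weights as listed in the commit message; they sum to 1.0.
REWARD_WEIGHTS = {
    "correctness": 0.35,
    "improvement": 0.25,
    "risk_alignment": 0.25,
    "semantic": 0.10,
    "completeness": 0.05,
}


def combined_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (hypothetical helper).

    Each score is assumed to be in [0, 1], so the result is too.
    """
    missing = REWARD_WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing reward components: {sorted(missing)}")
    return sum(REWARD_WEIGHTS[name] * components[name] for name in REWARD_WEIGHTS)
```

With these weights, a step that is fully correct but scores zero on every other dimension would earn 0.35, so correctness alone cannot dominate the total.
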
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/e3c399ed-eac8-40b6-9c61-196e32f44385
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…d action_space
- Fix effective_risk_high() to cover all tasks with trap_markers (was only HARD/HARD_PLUS)
- Fix observation_risk_float() to boost risk for all trap-bearing tasks (was only HARD)
- Add opponent stance parsing (_parse_opponent_stance) for concession/firmness detection
- Add opponent-aware action adjustment in inference _choose() function
- Add HTTP-client mode (--mode api) for Docker API evaluation
- Add per-task scoring summary in benchmark output
- Add action_space section to openenv.yaml
- Remove vestigial server/app.py and test_endpoints.py
- Add 7 new tests (49 total): trap coverage, accept blocking, opponent parsing
- Update README with new features and ENV_SERVER_URL
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/c8dd2642-d749-42d1-9fad-91bc78d8d379
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…, clarify comments
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/c8dd2642-d749-42d1-9fad-91bc78d8d379
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…ce pydantic warnings
- Track specific opponent concessions per topic (cap, liability, IP, etc.)
- Feed concession summary to LLM for richer negotiation context
- Add --retry-low THRESHOLD flag to re-run low-scoring tasks
- Add smart ACCEPT gate: block acceptance when contract hasn't improved
- Improve MODERATE strategy to front-load PROPOSE_COUNTER
- Sort per-task summary by score (worst first) with best/mean stats
- Silence FastAPI/pydantic internal deprecation warnings in pytest config
- Add 2 new tests for concession tracking (51 total, all passing)
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/8526d5cb-221b-4b88-9ca5-4c8679367d2f
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…tants
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/8526d5cb-221b-4b88-9ca5-4c8679367d2f
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…tests, input length guards
- Make python-dotenv import optional so tests pass without LLM deps
- Add Literal types for risk_level (HIGH/MODERATE/LOW) and clause_type
- Add Pydantic validator for opponent_responses keys (must be valid ActionType)
- Add content length validation in ContractEnv (max_content_length=50_000)
- Add /evaluate-quality input length guard (100_000 chars max)
- Consolidate NUM_GRADED_TASKS to single definition in graders.py
- Add 8 new edge-case tests: content length, max steps, done state, unicode, empty risk_keywords, opponent_response key validation, API max length
- Update README: --retry-low docs, CORS_ORIGINS env var, test count (59)
- All 59 tests pass
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/028275d3-079a-4945-8a77-3e3dcdf5d12a
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
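
The Literal type and the content length guard from this commit can be illustrated with the standard library alone. This is a sketch under assumptions: the project uses Pydantic models, whereas the helper names below (check_risk_level, check_content) are hypothetical; only the three risk levels and the 50_000 limit are taken from the commit message.

```python
from typing import Literal, get_args

RiskLevel = Literal["HIGH", "MODERATE", "LOW"]  # values listed in the commit message
MAX_CONTENT_LENGTH = 50_000  # limit quoted in the commit message


def check_risk_level(value: str) -> str:
    """Runtime check mirroring the Literal annotation (hypothetical helper)."""
    allowed = get_args(RiskLevel)
    if value not in allowed:
        raise ValueError(f"risk_level must be one of {allowed}, got {value!r}")
    return value


def check_content(content: str, max_len: int = MAX_CONTENT_LENGTH) -> str:
    """Reject oversized content before any further processing (hypothetical helper)."""
    if len(content) > max_len:
        raise ValueError(f"content length {len(content)} exceeds limit {max_len}")
    return content
```

Checking the length first keeps an oversized payload from ever reaching the more expensive grading code.
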
…iteral in models.py
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/028275d3-079a-4945-8a77-3e3dcdf5d12a
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…ion leak, pin deps
- Fix negation-aware keyword matching: risk keywords in negation context (e.g. "no party is liable for consequential damages") no longer falsely trigger effective_risk_high(), which was blocking ACCEPT on the easy task even after the agent submitted the correct safe edit
- Fix retry logic: --retry-low now passes task_id to reset() so it actually retries the same low-scoring task instead of cycling to the next one
- Add reset(task_id=) parameter to ContractEnv for targeted task selection
- Remove stack trace exposure in production error handler (app.py)
- Add close()/context manager to _HTTPEnvClient for proper session cleanup
- Pin requirements.txt versions to match pyproject.toml
- Add 7 new tests: negation matching, safe edit acceptance, full edit+accept flow, reset with task_id, original contracts flagged correctly
- 66 tests pass
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/5b50d4d5-393e-4df1-ad7e-50ef984ec409
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
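
One way to make keyword matching negation-aware, as the first fix describes, is to ignore a keyword when a negation cue appears shortly before it in the same sentence. The sketch below is only an illustration of that idea: the cue list, the window size, and the function name keyword_is_active are assumptions, and the project's actual matching logic is not shown on this page.

```python
import re

# Hypothetical negation cues; the project's real list may differ.
NEGATION_CUES = ("no", "not", "neither", "never", "without")


def keyword_is_active(text: str, keyword: str, window: int = 6) -> bool:
    """Return True if `keyword` occurs at least once with no negation cue
    among the `window` words that precede it in the same sentence."""
    kw = keyword.lower()
    for sentence in re.split(r"[.;!?]", text.lower()):
        start = sentence.find(kw)
        while start != -1:
            # Words immediately before this occurrence of the keyword.
            preceding = re.findall(r"[a-z']+", sentence[:start])[-window:]
            if not any(cue in preceding for cue in NEGATION_CUES):
                return True
            start = sentence.find(kw, start + 1)
    return False
```

Under this heuristic, the commit's example sentence "no party is liable for consequential damages" does not count as a risky mention of "consequential damages", while "Supplier is liable for consequential damages" does.
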
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/5b50d4d5-393e-4df1-ad7e-50ef984ec409
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…, docstrings, new API tests
- Replace 3 assertions in environment.py with RuntimeError guards that survive python -O
- Replace assert in graders.py with explicit ValueError
- Add Pydantic EvaluateQualityRequest model to app.py replacing raw dict
- Remove dead _MAX_EVALUATE_TEXT_LEN constant and unused Any import
- Add comprehensive docstrings to ContractEnv class and methods
- Add docstring to evaluate_action() explaining 5-dimension rubric
- Fix fragile __code__.co_varnames introspection in inference.py with try/except
- Remove redundant TASKS import in environment.py tasks property
- Normalize "belongs to Supplier" → "belongs to supplier" in tasks.py
- Add 6 new API tests: schema, root, evaluate-quality missing/empty/success, invalid action
- Update README test count from 66 → 72
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/5ac284f3-516c-4d77-ba64-14f8ee06778c
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
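
The first two items rest on a Python detail: assert statements are stripped when the interpreter runs with -O, so a state check written as an assert silently disappears. A minimal sketch of the replacement pattern follows; EnvSketch is a hypothetical stand-in, not the project's ContractEnv.

```python
class EnvSketch:
    """Hypothetical stand-in for an environment with a reset/step lifecycle."""

    def __init__(self) -> None:
        self._task = None

    def reset(self, task: str = "easy") -> dict:
        self._task = task
        return {"task": task}

    def step(self, action: str) -> dict:
        # An `assert self._task is not None` here would be removed under
        # `python -O`; an explicit raise is always executed.
        if self._task is None:
            raise RuntimeError("step() called before reset()")
        return {"task": self._task, "action": action}
```

Callers can then rely on catching RuntimeError for a misuse of the lifecycle, regardless of how the interpreter was started.
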
… 3 new tests
Security: global exception handler no longer leaks internal error details
Robustness: _HTTPEnvClient now uses 30s request timeouts
Robustness: opponent_responses validator rejects empty response lists
Code quality: added docstrings to 10 undocumented public functions in graders.py
Code quality: cleaned up dev-note comments in models.py (✅/❌ markers)
Code quality: removed unused StepResponse model from models.py
Code quality: added proper __init__.py to tests/ directory
Test coverage: added test_step_before_reset_raises (RuntimeError check)
Test coverage: added test_step_after_reset_specific_task_keeps_task
Test coverage: added test_opponent_responses_non_empty_validation
README: updated test count 72 → 75
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/08a54a1a-0cc0-4519-a9c8-096c1ff8f4b8
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…x, 3 new tests
- Dockerfile: add non-root user for container security
- app.py: move logging import to module level, add /reset task_id body param
- graders.py: clean stale comments, add grade_action docstring
- tests: add 3 new API tests (reset with task_id, invalid task_id, evaluate before reset)
- README: update test count 75 → 78
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/e910db62-1682-4ee6-9dc6-e8639ab081ea
Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
Copilot created this pull request from a session on behalf of AbeerChaturvedi, April 11, 2026 10:31
Comprehensive hardening pass addressing container security, an API feature gap, code quality, and test coverage.
Security
- Dockerfile: container runs as non-root appuser instead of root
API
- /reset now accepts optional task_id: The environment already supported reset(task_id=...), but the API never exposed it. The endpoint now accepts a {"task_id": "expert_data_protection"} body and returns 400 on invalid IDs.
Code quality
- app.py: Move import logging from inside the exception handler to module level; the import statement was previously executed on every 500 response
- graders.py: Remove stale "✅ STRICT RANGE FIX" implementation comments; add a docstring to grade_action
Test coverage (75 → 78)
- /reset with valid task_id returns the correct task
- /reset with invalid task_id returns 400
- /evaluate-quality before any /reset returns 400
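
The /reset behaviour described above (an optional task_id in the body, 400 on an unknown ID) can be sketched without a web framework as a function that maps a request body to a status code and payload. Everything here is an assumption for illustration: the function name handle_reset, the registry contents apart from "expert_data_protection", the payload shape, and the fallback used when no task_id is given.

```python
from typing import Optional

# Hypothetical registry; only "expert_data_protection" is named in the PR description.
KNOWN_TASK_IDS = ("easy_example", "expert_data_protection")


def handle_reset(body: Optional[dict] = None) -> tuple[int, dict]:
    """Return (status_code, payload) for a /reset-style request body."""
    task_id = (body or {}).get("task_id")
    if task_id is None:
        # Assumption: without a task_id the server chooses a task itself;
        # this sketch simply takes the first registered one.
        task_id = KNOWN_TASK_IDS[0]
    elif task_id not in KNOWN_TASK_IDS:
        return 400, {"detail": f"unknown task_id: {task_id!r}"}
    return 200, {"task_id": task_id}
```

Keeping task_id optional means existing clients that send an empty body keep working, while targeted retries can name the task they want.
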