
Final polish: non-root Dockerfile, /reset task_id API support, logging cleanup, new tests #2

Merged
bigturtle679 merged 13 commits into main from copilot/improve-inference-script-again
Apr 11, 2026

Copilot AI commented Apr 11, 2026

Comprehensive hardening pass addressing container security, an API feature gap, code quality, and test coverage.

Security

  • Dockerfile: Run as non-root appuser instead of root

API

  • /reset now accepts optional task_id: The environment already supported reset(task_id=...) but the API never exposed it. Now accepts {"task_id": "expert_data_protection"} body, returns 400 on invalid IDs.
# Before: only sequential cycling
POST /reset                                          →  next task in rotation

# After: targeted reset supported
POST /reset  {"task_id": "expert_data_protection"}  →  specific task
POST /reset  {"task_id": "bad_id"}                   →  400
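
The behaviour above could be handled roughly as in the following framework-agnostic sketch. `StubEnv`, `handle_reset`, the response shape, and the `easy` task ID are hypothetical. Only the optional `task_id` body field, the `reset(task_id=...)` call, and the 400 on invalid IDs come from this PR. The sketch also assumes the environment signals an unknown ID with `ValueError`.

```python
from typing import Any, Optional


class StubEnv:
    """Hypothetical stand-in for the environment; only reset() is modelled."""

    def __init__(self, task_ids: list[str]) -> None:
        self._task_ids = task_ids
        self._cursor = -1  # index of the task served most recently

    def reset(self, task_id: Optional[str] = None) -> dict[str, Any]:
        if task_id is None:
            # No ID given: cycle to the next task in rotation.
            self._cursor = (self._cursor + 1) % len(self._task_ids)
        elif task_id in self._task_ids:
            self._cursor = self._task_ids.index(task_id)
        else:
            raise ValueError(f"unknown task_id: {task_id!r}")
        return {"task_id": self._task_ids[self._cursor]}


def handle_reset(env: StubEnv, body: Optional[dict[str, Any]]) -> tuple[int, dict[str, Any]]:
    """Return (status_code, payload) for a POST /reset request body."""
    task_id = (body or {}).get("task_id")
    try:
        return 200, env.reset(task_id=task_id)
    except ValueError as exc:
        # An invalid ID is a client error, so it maps to 400 rather than 500.
        return 400, {"detail": str(exc)}


env = StubEnv(["easy", "expert_data_protection"])
print(handle_reset(env, None))
print(handle_reset(env, {"task_id": "expert_data_protection"}))
print(handle_reset(env, {"task_id": "bad_id"}))
```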

Code quality

  • app.py: Move import logging from inside exception handler to module level — was re-importing on every 500
  • graders.py: Remove stale ✅ STRICT RANGE FIX implementation comments; add docstring to grade_action
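
A minimal sketch of the module-level logging pattern mentioned for app.py, written without any web framework. The function name `handle_unexpected_error` and the payload text are assumptions for illustration, not the project's actual handler.

```python
import logging  # module level: bound once when the module is first loaded

logger = logging.getLogger(__name__)


def handle_unexpected_error(exc: Exception) -> tuple[int, dict[str, str]]:
    """Log the full exception server-side and return a generic 500 payload."""
    # exc_info attaches the traceback to the log record only; the client
    # receives a fixed message with no internal details.
    logger.error("Unhandled exception", exc_info=exc)
    return 500, {"detail": "Internal server error"}


status, payload = handle_unexpected_error(RuntimeError("connection string leaked"))
print(status, payload)
```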

Test coverage (75 → 78)

  • /reset with valid task_id returns correct task
  • /reset with invalid task_id returns 400
  • /evaluate-quality before any /reset returns 400

Copilot AI and others added 13 commits April 11, 2026 07:41
…sive tests

- Add confidentiality/NDA (medium+), termination (hard++), data protection (expert) tasks
- Add opponent simulation with contextual counterparty responses per action type
- Enhance grading with semantic similarity (cosine+Jaccard) and clause completeness scoring
- Add 3 new task-specific graders: grade_medium_plus, grade_hard_plus2, grade_expert
- Add required_elements field for completeness scoring per task
- Update reward formula: 35% correctness + 25% improvement + 25% risk_alignment + 10% semantic + 5% completeness
- Improve inference script with adaptive multi-turn strategy for all 8 tasks
- Add 21 new tests (42 total, all passing)
- Update openenv.yaml, README, and all exports
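
The reward weighting and the cosine+Jaccard similarity named in this commit could be sketched as follows. The weights are the ones quoted above. Everything else is an assumption: whitespace tokenisation, bag-of-words cosine, an equal-weight blend of the two measures, and the function names.

```python
import math
from collections import Counter

# Weights quoted in the commit message; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.35,
    "improvement": 0.25,
    "risk_alignment": 0.25,
    "semantic": 0.10,
    "completeness": 0.05,
}


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_similarity(a: str, b: str) -> float:
    """Assumption: equal-weight blend of the two measures."""
    return 0.5 * cosine(a, b) + 0.5 * jaccard(a, b)


def combined_reward(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```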

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/e3c399ed-eac8-40b6-9c61-196e32f44385

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…d action_space

- Fix effective_risk_high() to cover all tasks with trap_markers (was only HARD/HARD_PLUS)
- Fix observation_risk_float() to boost risk for all trap-bearing tasks (was only HARD)
- Add opponent stance parsing (_parse_opponent_stance) for concession/firmness detection
- Add opponent-aware action adjustment in inference _choose() function
- Add HTTP-client mode (--mode api) for Docker API evaluation
- Add per-task scoring summary in benchmark output
- Add action_space section to openenv.yaml
- Remove vestigial server/app.py and test_endpoints.py
- Add 7 new tests (49 total): trap coverage, accept blocking, opponent parsing
- Update README with new features and ENV_SERVER_URL

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/c8dd2642-d749-42d1-9fad-91bc78d8d379

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…, clarify comments

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/c8dd2642-d749-42d1-9fad-91bc78d8d379

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…ce pydantic warnings

- Track specific opponent concessions per topic (cap, liability, IP, etc.)
- Feed concession summary to LLM for richer negotiation context
- Add --retry-low THRESHOLD flag to re-run low-scoring tasks
- Add smart ACCEPT gate: block acceptance when contract hasn't improved
- Improve MODERATE strategy to front-load PROPOSE_COUNTER
- Sort per-task summary by score (worst first) with best/mean stats
- Silence FastAPI/pydantic internal deprecation warnings in pytest config
- Add 2 new tests for concession tracking (51 total, all passing)
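
As an illustration of the `--retry-low THRESHOLD` flag, here is a sketch using `argparse`. The flag name comes from the commit message. `build_parser`, `run_with_retry`, the single-retry policy, and keeping the better of the two scores are assumptions.

```python
import argparse
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="benchmark runner (sketch)")
    parser.add_argument(
        "--retry-low",
        type=float,
        default=None,
        metavar="THRESHOLD",
        help="re-run any task whose score falls below THRESHOLD",
    )
    return parser


def run_with_retry(
    task_ids: list[str],
    run_task: Callable[[str], float],
    threshold: Optional[float],
) -> dict[str, float]:
    """Run each task once; retry low scorers once and keep the better score."""
    scores = {tid: run_task(tid) for tid in task_ids}
    if threshold is None:
        return scores
    for tid, score in list(scores.items()):
        if score < threshold:
            # Retry the same task by ID rather than whatever comes next.
            scores[tid] = max(score, run_task(tid))
    return scores


args = build_parser().parse_args(["--retry-low", "0.5"])
print(args.retry_low)
```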

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/8526d5cb-221b-4b88-9ca5-4c8679367d2f

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…tests, input length guards

- Make python-dotenv import optional so tests pass without LLM deps
- Add Literal types for risk_level (HIGH/MODERATE/LOW) and clause_type
- Add Pydantic validator for opponent_responses keys (must be valid ActionType)
- Add content length validation in ContractEnv (max_content_length=50_000)
- Add /evaluate-quality input length guard (100_000 chars max)
- Consolidate NUM_GRADED_TASKS to single definition in graders.py
- Add 8 new edge-case tests: content length, max steps, done state, unicode,
  empty risk_keywords, opponent_response key validation, API max length
- Update README: --retry-low docs, CORS_ORIGINS env var, test count (59)
- All 59 tests pass
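
The `Literal` typing and the content length guard from this commit might look like the standard-library sketch below. The project uses Pydantic for validation; these plain functions are a stand-in, and the function names and the inclusive limit are assumptions.

```python
from typing import Literal, get_args

RiskLevel = Literal["HIGH", "MODERATE", "LOW"]
MAX_CONTENT_LENGTH = 50_000  # value of max_content_length in the commit message


def validate_risk_level(value: str) -> str:
    """Reject anything outside the three literal values."""
    allowed = get_args(RiskLevel)
    if value not in allowed:
        raise ValueError(f"risk_level must be one of {allowed}, got {value!r}")
    return value


def validate_content(content: str, max_length: int = MAX_CONTENT_LENGTH) -> str:
    """Reject oversized payloads before any grading work is done."""
    if len(content) > max_length:
        raise ValueError(f"content is {len(content)} chars; limit is {max_length}")
    return content
```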

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/028275d3-079a-4945-8a77-3e3dcdf5d12a

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…iteral in models.py

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/028275d3-079a-4945-8a77-3e3dcdf5d12a

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…ion leak, pin deps

- Fix negation-aware keyword matching: risk keywords in negation context
  (e.g. "no party is liable for consequential damages") no longer falsely
  trigger effective_risk_high(), which was blocking ACCEPT on the easy task
  even after the agent submitted the correct safe edit
- Fix retry logic: --retry-low now passes task_id to reset() so it actually
  retries the same low-scoring task instead of cycling to the next one
- Add reset(task_id=) parameter to ContractEnv for targeted task selection
- Remove stack trace exposure in production error handler (app.py)
- Add close()/context manager to _HTTPEnvClient for proper session cleanup
- Pin requirements.txt versions to match pyproject.toml
- Add 7 new tests: negation matching, safe edit acceptance, full edit+accept
  flow, reset with task_id, original contracts flagged correctly
- 66 tests pass
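
One way to make keyword matching negation-aware, as described in the first bullet of this commit, is to check a short window of preceding tokens for a negation cue. This is a sketch under stated assumptions: the cue list, the window size, and the function name are hypothetical, and the project's actual matching logic is not shown in this PR.

```python
import re

# Hypothetical negation cues and window size.
NEGATION_CUES = ("no", "not", "never", "neither", "without")
WINDOW = 6  # number of preceding tokens inspected for a cue


def has_unnegated_keyword(text: str, keyword: str) -> bool:
    """True if `keyword` occurs at least once outside a negation context."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kw = keyword.lower().split()
    if not kw:
        return False
    for i in range(len(tokens) - len(kw) + 1):
        if tokens[i : i + len(kw)] != kw:
            continue
        preceding = tokens[max(0, i - WINDOW) : i]
        if not any(cue in preceding for cue in NEGATION_CUES):
            return True  # a plain, unnegated mention
    return False


safe = "No party is liable for consequential damages"
risky = "Supplier is liable for consequential damages"
print(has_unnegated_keyword(safe, "consequential damages"))
print(has_unnegated_keyword(risky, "consequential damages"))
```

The window ignores sentence boundaries, so a cue in the previous sentence can mask a keyword; a fuller version would split on sentences first.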

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/5b50d4d5-393e-4df1-ad7e-50ef984ec409

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/5b50d4d5-393e-4df1-ad7e-50ef984ec409

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…, docstrings, new API tests

- Replace 3 assertions in environment.py with RuntimeError guards that
  survive python -O
- Replace assert in graders.py with explicit ValueError
- Add Pydantic EvaluateQualityRequest model to app.py replacing raw dict
- Remove dead _MAX_EVALUATE_TEXT_LEN constant and unused Any import
- Add comprehensive docstrings to ContractEnv class and methods
- Add docstring to evaluate_action() explaining 5-dimension rubric
- Fix fragile __code__.co_varnames introspection in inference.py with try/except
- Remove redundant TASKS import in environment.py tasks property
- Normalize "belongs to Supplier" → "belongs to supplier" in tasks.py
- Add 6 new API tests: schema, root, evaluate-quality missing/empty/success, invalid action
- Update README test count from 66 → 72
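
The reason for replacing assertions with explicit guards is that `assert` statements are removed when Python runs with `-O`. A small sketch of the pattern follows; `EnvSketch` and its `easy` default are hypothetical and are not the project's `ContractEnv`.

```python
from typing import Optional


class EnvSketch:
    """Hypothetical miniature of an environment with a reset/step lifecycle."""

    def __init__(self) -> None:
        self._task: Optional[str] = None

    def reset(self, task_id: str = "easy") -> str:
        self._task = task_id
        return self._task

    def step(self, action: str) -> str:
        # `assert self._task is not None` would be stripped under `python -O`;
        # an explicit raise is always executed.
        if self._task is None:
            raise RuntimeError("step() called before reset()")
        return f"{self._task}:{action}"


env = EnvSketch()
try:
    env.step("ACCEPT")
except RuntimeError as exc:
    print("guard fired:", exc)
```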

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/5ac284f3-516c-4d77-ba64-14f8ee06778c

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
… 3 new tests

Security: global exception handler no longer leaks internal error details
Robustness: _HTTPEnvClient now uses 30s request timeouts
Robustness: opponent_responses validator rejects empty response lists
Code quality: added docstrings to 10 undocumented public functions in graders.py
Code quality: cleaned up dev-note comments in models.py (✅/❌ markers)
Code quality: removed unused StepResponse model from models.py
Code quality: added proper __init__.py to tests/ directory
Test coverage: added test_step_before_reset_raises (RuntimeError check)
Test coverage: added test_step_after_reset_specific_task_keeps_task
Test coverage: added test_opponent_responses_non_empty_validation
README: updated test count 72 → 75
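
The two `opponent_responses` checks mentioned in this PR (keys must be valid action types, response lists must be non-empty) could be expressed as below. This is a plain-function stand-in for the project's Pydantic validator. The `ActionType` members shown are an assumed subset based on the action names that appear in the commit messages.

```python
from enum import Enum


class ActionType(str, Enum):
    """Assumed subset of action types; the real enum is not shown in this PR."""

    ACCEPT = "ACCEPT"
    PROPOSE_COUNTER = "PROPOSE_COUNTER"


def validate_opponent_responses(responses: dict[str, list[str]]) -> dict[str, list[str]]:
    """Check that every key is a known action type with at least one reply."""
    valid = {action.value for action in ActionType}
    for key, replies in responses.items():
        if key not in valid:
            raise ValueError(f"unknown action type: {key!r}")
        if not replies:
            raise ValueError(f"response list for {key!r} must not be empty")
    return responses
```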

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/08a54a1a-0cc0-4519-a9c8-096c1ff8f4b8

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
…x, 3 new tests

- Dockerfile: add non-root user for container security
- app.py: move logging import to module level, add /reset task_id body param
- graders.py: clean stale comments, add grade_action docstring
- tests: add 3 new API tests (reset with task_id, invalid task_id, evaluate before reset)
- README: update test count 75 → 78

Agent-Logs-Url: https://github.com/bigturtle679/Contract-Negotiation-Environment/sessions/e910db62-1682-4ee6-9dc6-e8639ab081ea

Co-authored-by: AbeerChaturvedi <171315954+AbeerChaturvedi@users.noreply.github.com>
bigturtle679 marked this pull request as ready for review April 11, 2026 10:31
bigturtle679 merged commit 3fdae1e into main Apr 11, 2026
1 check failed