[codex] Slim watchlist skill runtime #10
Code Review
This pull request refactors the `watchlist-md` skill by moving detailed lifecycle and safety documentation into dedicated reference files and by introducing a new validation script (`validate_watchlist.py`). It updates the item ID format to `WL-YYYYMMDD-NNN` across the skill definition, prompts, and evaluation cases. It also updates the repository layout to ignore local `.watchlist` files and to provide clearer examples. One piece of feedback concerns the `scan_safety` function in the new validation script, which is currently unused (dead code).
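The PR switches item IDs to the `WL-YYYYMMDD-NNN` format. The following is a minimal sketch of how a validator might check that format, assuming (hypothetically) that the date part must be a real calendar date and that `NNN` is any three-digit sequence number. `ITEM_ID_RE` and `is_valid_item_id` are illustrative names, not the actual contents of `validate_watchlist.py`.

```python
import re
from datetime import date

# Hypothetical pattern for WL-YYYYMMDD-NNN; [0-9] (not \d) keeps it ASCII-only.
ITEM_ID_RE = re.compile(r"WL-([0-9]{4})([0-9]{2})([0-9]{2})-([0-9]{3})")


def is_valid_item_id(item_id: str) -> bool:
    """Return True if item_id is WL-YYYYMMDD-NNN and the date part is a real date."""
    match = ITEM_ID_RE.fullmatch(item_id)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.group(1, 2, 3))
    try:
        # date() raises ValueError for impossible dates such as 2025-02-30 or month 13.
        date(year, month, day)
    except ValueError:
        return False
    return True
```

For example, `is_valid_item_id("WL-20250114-001")` returns `True`, while `"WL-20250230-001"` (no February 30) and `"WL-2025-01-14-001"` (dashed date) both return `False`. Checking the date with `datetime.date` rather than a stricter regex keeps leap-year handling out of the pattern.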
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 019d2da6d3
> Add an item only when the user explicitly asks to record a future, time-gated, or
> event-gated check, or has opted into pre-authorized watchlist recording. If the
Preserve implicit deferred-check adds
For the existing semantic case `evals/cases/add-kr-02.json`, the prompt `배포가 방금 시작됐어. 30분 뒤에 에러 로그 확인해야 해.` ("The deployment just started. I need to check the error logs in 30 minutes.") is still expected to trigger `add_item`, but it never explicitly asks to "record" anything. The new rule ("only when the user explicitly asks to record") tells agents to skip this supported event-gated deferred-check flow, so users who say that a deployment or log check must be done later will no longer get a watchlist entry unless they know the exact WATCHLIST wording.
Summary
- `watchlist-md` runtime instructions and move lifecycle/safety detail into progressive-disclosure references.
- `examples/`, ignore generated `.watchlist/WATCHLIST.md`, and expand negative trigger eval coverage.

Validation
- `python3 -m unittest discover -s evals -p 'test_*.py'`
- `python3 evals/check_release_metadata.py`
- `python3 evals/check_policy_markers.py`
- `python3 evals/check_semantic_cases.py`
- `python3 evals/check_watchlist.py examples/WATCHLIST.example.md --strict-format --strict-safety --require-archive-section`
- `python3 evals/check_watchlist.py .agents/skills/watchlist-md/assets/WATCHLIST.template.md --strict-format --strict-safety --require-archive-section`
- `python3 .agents/skills/watchlist-md/scripts/validate_watchlist.py .agents/skills/watchlist-md/assets/WATCHLIST.template.md --strict-format --strict-safety --require-archive-section`
- `git diff --check`
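The validation commands listed for this PR can be chained into a single local check. This is a hedged sketch, not a script from the PR: `VALIDATION_COMMANDS` and `run_all` are hypothetical names, and it assumes it is run from the repository root with `python3` and `git` on `PATH`. It keeps going after a failure so one run reports every broken check.

```python
import shlex
import subprocess
import sys

# Commands copied from the PR's Validation list; assumed to run from the repo root.
VALIDATION_COMMANDS = [
    "python3 -m unittest discover -s evals -p 'test_*.py'",
    "python3 evals/check_release_metadata.py",
    "python3 evals/check_policy_markers.py",
    "python3 evals/check_semantic_cases.py",
    "python3 evals/check_watchlist.py examples/WATCHLIST.example.md"
    " --strict-format --strict-safety --require-archive-section",
    "python3 evals/check_watchlist.py .agents/skills/watchlist-md/assets/WATCHLIST.template.md"
    " --strict-format --strict-safety --require-archive-section",
    "python3 .agents/skills/watchlist-md/scripts/validate_watchlist.py"
    " .agents/skills/watchlist-md/assets/WATCHLIST.template.md"
    " --strict-format --strict-safety --require-archive-section",
    "git diff --check",
]


def run_all(commands):
    """Run every command in order; return the commands that failed or could not start."""
    failures = []
    for command in commands:
        print(f"$ {command}", flush=True)
        try:
            # shlex.split (no shell) passes 'test_*.py' to unittest as a literal pattern.
            result = subprocess.run(shlex.split(command), check=False)
        except FileNotFoundError:
            # The executable (for example python3 or git) is not on PATH.
            failures.append(command)
            continue
        if result.returncode != 0:
            failures.append(command)
    return failures
```

From the repository root, `sys.exit(1 if run_all(VALIDATION_COMMANDS) else 0)` turns the result into a CI-friendly exit status. Avoiding `shell=True` means no shell expands the quoted glob before `unittest` sees it.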