Claude-based linter for reviewing quality of new tasks #554
Claude finished @bai-uipath's task in 3m 6s
Summary
This PR adds a
Change-by-Change Review
1.
| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | No skill SKILL.md changed |
| E2E Tests | Issue | Test fixture must be removed before merge |
| Skill Body | N/A | No skill content changed |
| References & Assets | OK | No broken links in changed files |
| Repo Hygiene | Issue | Test fixture committed with "delete before merge" instruction; no CODEOWNERS for .claude/commands/ |
Issues for Manual Review
- Rubric calibration: The rubric axes and severity mappings are based on one audit. The PR description acknowledges this ("calibrated for the audit findings, not battle-tested across every skill"). Manual spot-checking against a few real tasks on `main` would build confidence before iterating on the rubric.
- Theme-aware downgrade consistency: The theme logic in `lint-task.md` (lines ~195-221) is complex. Different LLM invocations may apply the downgrade/suppress rules inconsistently. Consider running the slash command on 5-10 real tasks and comparing outputs to validate consistency.
- Workflow permissions: The workflow requests `pull-requests: write`, which is correct for posting comments. Verify this doesn't conflict with any branch protection rules or required-review policies.
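One way to quantify the consistency concern above is to run the linter several times on the same tasks and measure how often the verdicts agree. The sketch below assumes each run has already been reduced to a mapping from task path to its worst severity; that structure, the function name `agreement_by_task`, and the sample paths are all hypothetical and not part of this PR.

```python
from collections import Counter


def agreement_by_task(runs):
    """Per task, return the fraction of runs that match the most common verdict.

    `runs` is a list of dicts (task path -> severity string). This input shape
    is an assumption; the real /lint-task report format may differ.
    """
    tasks = set().union(*(run.keys() for run in runs))
    result = {}
    for task in sorted(tasks):
        verdicts = [run[task] for run in runs if task in run]
        most_common_count = Counter(verdicts).most_common(1)[0][1]
        result[task] = most_common_count / len(verdicts)
    return result


# Four hypothetical lint runs over the same two tasks.
runs = [
    {"tests/tasks/a.yaml": "High", "tests/tasks/b.yaml": "OK"},
    {"tests/tasks/a.yaml": "High", "tests/tasks/b.yaml": "Medium"},
    {"tests/tasks/a.yaml": "Medium", "tests/tasks/b.yaml": "OK"},
    {"tests/tasks/a.yaml": "High", "tests/tasks/b.yaml": "OK"},
]
scores = agreement_by_task(runs)

# Tasks whose verdict changed between runs deserve a closer look at the rubric wording.
unstable = [task for task, score in scores.items() if score < 1.0]
```

A task with a score below 1.0 received different verdicts across runs, which would point at the rubric sections (for example the theme downgrade rules) that are worth tightening.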
Conclusion
The rubric design is strong — the six axes target real, documented quality gaps, and the carve-outs for false positives show careful thought. The workflow is consistent with existing CI patterns.
One blocking issue: the _DELETE_ME_lint_test/bad_self_report.yaml test fixture must be removed before merge. The file itself documents this requirement.
One medium concern: consider whether the hardcoded model in the workflow should be extracted, or accept the consistency-with-existing-workflows trade-off.
Claude finished @bai-uipath's task in 2m 7s
Coder-eval task lint (advisory)
1 task YAML changed; verdict counts: 2 Critical, 0 High, 0 Medium, 0 Low, 0 OK. Rubric: `.claude/commands/lint-task.md`. This check is advisory and never blocks merge.
Evidence of passing run
❌ High: PR body does not claim the changed task has been run and passed. This is moot for this PR: the task is an intentional bad-example fixture (
Per-task lint
Summary
Adds a `/lint-task` slash command and a non-blocking PR-time GitHub Action that scores newly added or modified coder-eval task YAMLs against a quality rubric. Purely advisory: it never fails CI.
Why
A recent quality audit of the coder-eval tasks in this repo found that roughly half score below 7/10. The dominant patterns:
- `report.json` (or similar), criteria grade that file. The agent grades its own homework; deterministic checks are bypassed; hallucination-prone.
- `initial_prompt` prescribes the procedure ("Walk the discovery hierarchy", "Use `--output json`"), so the skill itself is never actually exercised.
- `file_exists` / loose `command_executed` patterns; a dummy implementation passes.
- `flow validate` only checks JSON shape; without `flow debug` the test cannot tell whether the flow actually produces the right output.

Reviewers don't catch these consistently in PR review, so they keep accumulating. This change adds a lightweight automated reviewer that catches them at PR time and on demand.
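To make the patterns above concrete, here is a rough deterministic pre-check over an already-parsed task. It is not how this PR works (the PR applies an LLM-read rubric); it is a simplified heuristic for illustration only. The task shape is an assumption: the keys `success_criteria`, `type`, and `path`, and the function name `flag_patterns`, are hypothetical, and only `initial_prompt`, `file_exists`, `command_executed`, and `report.json` come from the description above.

```python
# Criterion types that only prove something exists or was invoked.
WEAK_TYPES = {"file_exists", "command_executed"}


def flag_patterns(task):
    """Return coarse warnings for one parsed task dict (hypothetical schema)."""
    warnings = []
    criteria = task.get("success_criteria", [])
    types = {c.get("type") for c in criteria}

    # Weak verification: every criterion is an existence or invocation check.
    if criteria and types <= WEAK_TYPES:
        warnings.append("only existence/command checks; a dummy implementation may pass")

    # Self-report: a criterion targets a report file the agent wrote itself.
    if any("report.json" in str(c.get("path", "")) for c in criteria):
        warnings.append("criteria grade an agent-written report file")

    # Prescriptive prompt: a crude string match on one known phrase.
    if "--output json" in task.get("initial_prompt", ""):
        warnings.append("prompt prescribes the procedure")

    return warnings


bad_task = {
    "initial_prompt": "Create the flow. Use --output json.",
    "success_criteria": [{"type": "file_exists", "path": "report.json"}],
}
found = flag_patterns(bad_task)
```

String matching like this misses paraphrases and cannot judge intent, which is a plausible reason to prefer a rubric-driven reviewer for the real check.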
What's added
- `.claude/commands/lint-task.md`: slash command. `/lint-task <path|glob|dir>` reads the task YAML(s) plus up to 5 nearest siblings (for duplicate detection) and emits a per-task report with severity-tagged issues (Critical / High / Medium / Low / OK) and concrete suggested fixes. Source of truth for the rubric.
- `.github/workflows/lint-tasks.yml`: advisory PR bot. Triggers on PRs touching `tests/tasks/**/*.yaml`, runs only when the PR is out of draft state, and applies the same rubric to changed files only. It also flags within-PR duplicates and checks the PR body for a verbal claim that the new/changed tasks have been run and passed at least once. Posts a single comment, never fails the check.

Both files share the rubric: the workflow reads `.claude/commands/lint-task.md` at runtime, so iterating on the rubric only requires editing one file.
Design choices
`anthropics/claude-code-action@v1` and `ANTHROPIC_API_KEY` secret as `claude-pr-review.yml`. No new secrets, no new workflow dependencies.
Out of scope (follow-ups)
`main` (intentionally excluded from this PR, since that's a separate cleanup pass; expect to tag low-scoring tests `[draft]` and exclude them from daily metrics).