Skip to content

Claude-based linter for reviewing quality of new tasks#554

Merged
bai-uipath merged 7 commits intomainfrom
bai/lint-new-tasks
May 5, 2026
Merged

Claude-based linter for reviewing quality of new tasks#554
bai-uipath merged 7 commits intomainfrom
bai/lint-new-tasks

Conversation

@bai-uipath
Copy link
Copy Markdown
Contributor

@bai-uipath bai-uipath commented May 4, 2026

Summary

Adds a /lint-task slash command and a non-blocking PR-time GitHub Action that scores newly added or modified coder-eval task YAMLs against a quality rubric. Purely advisory — never fails CI.

Why

A recent quality audit of the coder-eval tasks in this repo found that roughly half score below 7/10. The dominant patterns:

  • Self-report anti-pattern — prompt asks the agent to write report.json (or similar), criteria grade that file. The agent grades its own homework; deterministic checks are bypassed; hallucination-prone.
  • Prompt over-specificationinitial_prompt prescribes the procedure ("Walk the discovery hierarchy", "Use --output json"), so the skill itself is never actually exercised.
  • Trivially-passing criteria — only file_exists / loose command_executed patterns; a dummy implementation passes.
  • Near-duplicate tests — same shape, only surface-level naming differences; pure infra cost with no marginal coverage.
  • Validate-only flow testsflow validate only checks JSON shape; without flow debug the test cannot tell whether the flow actually produces the right output.
  • No record of ever passing — tests merged that have never passed, surfacing only on the next nightly run.

Reviewers don't catch these consistently in PR review, so they keep accumulating. This change adds a lightweight automated reviewer that catches them at PR time and on demand.

What's added

  • .claude/commands/lint-task.md — slash command. /lint-task <path|glob|dir> reads the task YAML(s) plus up to 5 nearest siblings (for duplicate detection) and emits a per-task report with severity-tagged issues (Critical / High / Medium / Low / OK) and concrete suggested fixes. Source of truth for the rubric.
  • .github/workflows/lint-tasks.yml — advisory PR bot. Triggers on PRs touching tests/tasks/**/*.yaml, runs only when the PR is out of draft state, applies the same rubric to changed files only, also flags within-PR duplicates and checks the PR body for a verbal claim that the new/changed tasks have been run and passed at least once. Posts a single comment, never fails the check.

Both files share the rubric — the workflow reads .claude/commands/lint-task.md at runtime, so iterating on the rubric only requires editing one file.

Design choices

  • Advisory, not blocking. The rubric is calibrated for the audit findings, not battle-tested across every skill. Hard-gating at PR time would create false-positive friction. Severities communicate priority; reviewers and authors decide how to act.
  • Honor system on passing-run evidence. The bot checks the PR body for a verbal claim ("ran X locally and it passed") but does not verify against blob storage or the dashboard. Trust assumption is cheap to relax later if it breaks.
  • Within-skill duplicate detection, not embeddings. Compares each changed task to up to 5 nearby siblings (same folder, then same skill if folder is sparse). Catches ~95% of real-world duplicates without infrastructure.
  • Reuses existing infra. Same anthropics/claude-code-action@v1 and ANTHROPIC_API_KEY secret as claude-pr-review.yml. No new secrets, no new workflow dependencies.

Out of scope (follow-ups)

  • Backfilling lint over all existing tasks in main (intentionally excluded from this PR — that's a separate cleanup pass; expect to tag low-scoring tests [draft] and exclude them from daily metrics).
  • Cross-repo dup detection (across all skills, not just within-skill).
  • Hard CI gating after the rubric stabilizes.
  • Verifying passing-run claims against blob-storage run JSON.

@bai-uipath bai-uipath marked this pull request as ready for review May 4, 2026 23:00
@bai-uipath bai-uipath requested a review from uipreliga as a code owner May 4, 2026 23:00
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Claude finished @bai-uipath's task in 3m 6s —— View job


Summary

This PR adds a /lint-task slash command and an advisory CI workflow that scores new/modified coder-eval task YAMLs against a quality rubric. The rubric covers six axes: self-report anti-pattern, prompt over-specification, meaningful coverage, gameability, near-duplicates, and validate-only flow tests. A test fixture (bad_self_report.yaml) is included to exercise the workflow.

Change-by-Change Review

1. .claude/commands/lint-task.md — OK with minor notes

Severity: Low

Well-structured rubric with clear axes, severity levels, and actionable output format. Follows the existing command file convention (generate-tasks.md, test-coverage.md). Key observations:

  • Good: The carve-outs (ground-truth anchors in axis B, scaffold reuse in axis E, description-rationale carve-out in axis F) are well-designed and will prevent common false positives.
  • Good: Theme-aware downgrade logic is clever — avoids inflating severity when many tasks share the same systemic issue.
  • Low: The theme-aware suppression logic (lines ~200-221 in the diff) is sophisticated but may produce inconsistent results across different LLM invocations. The rules for "suppress issues block when fully theme-captured" vs "keep issues block but drop theme-referenced ones" are nuanced. Consider simplifying to always show per-task issues with a (see Theme N) suffix, and let the theme entry carry the systemic severity. This is a polish concern — the current design is workable.
  • Low: Line 11 references the workflow path (.github/workflows/lint-tasks.yml) — this is a forward reference that's correct but creates a coupling. If the workflow moves, this line goes stale. Minor.

2. .github/workflows/lint-tasks.yml — OK with one medium note

Severity: Medium

The workflow is well-structured: correct paths trigger, draft-skip logic, concurrency group, fork exclusion, and reuse of the existing ANTHROPIC_API_KEY secret. Prompt is clear with explicit steps and output template.

  • Good: Consistent with claude-pr-review.yml — same action version, same model flag, same secret, same timeout-minutes: 15.
  • Good: concurrency.cancel-in-progress: true prevents stale lint comments from racing.
  • Good: The fork exclusion (head.repo.full_name == github.repository) prevents leaking the API key.
  • Medium (line 45): claude_args: "--model claude-opus-4-6" hardcodes the model. The existing claude-pr-review.yml does the same, so this is consistent with current practice. However, if the team wants to update the model across all workflows, each file must be edited independently. Consider extracting this to a repo-level variable or .env — but since the existing workflow does the same thing, this is a consistency-preserving choice, not a regression.
  • The prompt (lines 46-194) is ~150 lines. This is large but well-organized. The explicit output template (lines 127-175) ensures consistent comment formatting. No issues with the YAML block scalar — markdown fences inside prompt: | are preserved correctly.

3. tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/bad_self_report.yaml — High

Severity: High

This is a deliberately broken test fixture, and the file's own description (lines 2-7) says: "Delete this file and the enclosing _DELETE_ME_lint_test/ folder before merging the lint workflow PR."

  • High: This file must be removed before merge. If merged, it becomes a permanently broken task in the test suite. The _DELETE_ME_ naming convention is not recognized by any CI exclusion — it will show up in tests/tasks/ listings, confuse future task audits, and potentially trigger the nightly e2e runner if the experiment config includes glob patterns.
  • The file correctly exercises multiple rubric axes (self-report, over-specification, weak coverage, gameable criteria, validate-only e2e) — it's a good test fixture for development, but it has served its purpose.
  • The tag lifecycle:generate (line 8) is not a standard test-type tag (smoke, integration, e2e), but it appears alongside e2e which is valid. The lifecycle: prefix appears to be a category tag, not a test-type violation.

What's Missing

  • Test fixture cleanup: _DELETE_ME_lint_test/bad_self_report.yaml must be deleted before merge (the file itself says so). Fix this →
  • .claude/commands/lint-task.md is deleted in the working tree: git status shows D .claude/commands/lint-task.md. The file is in the branch commits (verified via git show HEAD), but the working tree deletion is anomalous. Verify this is a CI checkout artifact and not an uncommitted deletion that will be pushed. If the author ran git rm locally without committing, the file will be missing from the branch on next push.
  • No CODEOWNERS entry for .claude/commands/: The existing /.github/ path is covered by @uipreliga, but .claude/commands/ has no CODEOWNERS entry. The existing commands (generate-tasks.md, skill-compare.md, test-coverage.md) also lack coverage — this isn't a regression, but worth adding as a follow-up.

Area Ratings

Area Status Notes
Frontmatter N/A No skill SKILL.md changed
E2E Tests Issue Test fixture must be removed before merge
Skill Body N/A No skill content changed
References & Assets OK No broken links in changed files
Repo Hygiene Issue Test fixture committed with "delete before merge" instruction; no CODEOWNERS for .claude/commands/

Issues for Manual Review

  • Rubric calibration: The rubric axes and severity mappings are based on one audit. The PR description acknowledges this ("calibrated for the audit findings, not battle-tested across every skill"). Manual spot-checking against a few real tasks on main would build confidence before iterating on the rubric.
  • Theme-aware downgrade consistency: The theme logic in lint-task.md (lines ~195-221) is complex. Different LLM invocations may apply the downgrade/suppress rules inconsistently. Consider running the slash command on 5-10 real tasks and comparing outputs to validate consistency.
  • Workflow permissions: The workflow requests pull-requests: write which is correct for posting comments. Verify this doesn't conflict with any branch protection rules or required-review policies.

Conclusion

The rubric design is strong — the six axes target real, documented quality gaps, and the carve-outs for false positives show careful thought. The workflow is consistent with existing CI patterns.

One blocking issue: the _DELETE_ME_lint_test/bad_self_report.yaml test fixture must be removed before merge. The file itself documents this requirement.

One medium concern: consider whether the hardcoded model in the workflow should be extracted, or accept the consistency-with-existing-workflows trade-off.


@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Claude finished @bai-uipath's task in 2m 7s —— View job


Coder-eval task lint (advisory)

  • Read the lint rubric (.claude/commands/lint-task.md)
  • Identify changed task YAMLs
  • Check PR body for evidence of passing run
  • Apply rubric to each changed task
  • Check for within-PR duplicates
  • Post final lint report

Coder-eval task lint (advisory)

1 task YAML changed; verdict counts: 2 Critical, 0 High, 0 Medium, 0 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed task has been run and passed. This is moot for this PR: the task is an intentional bad-example fixture (description lines 2–7 say "Delete this file and the enclosing _DELETE_ME_lint_test/ folder before merging"). Please delete it before merge, or — if kept for CI self-testing — move it outside tests/tasks/.

Per-task lint

tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/bad_self_report.yaml — verdict: Critical

Note: The task's own description (lines 2–7) states it is deliberately bad and temporary. The issues below confirm the rubric catches what was intended.

Issues:

  • [Critical] Self-report anti-pattern (lines 21, 31–56): initial_prompt instructs agent to write report.json (line 21: "Save a summary of what you did to report.json"); all three success_criteria grade that same self-written file (file_exists line 33, json_check lines 38–46 and 48–56). Agent grades its own homework.
  • [Critical] Could pass for the wrong reason (lines 31–56): A dummy agent can skip every CLI command, write {"solution_created": true, "status": "success"} to report.json, and pass all three criteria. The skill is never exercised.
  • [High] Prompt over-specification (lines 14–29): initial_prompt prescribes 6 numbered steps with exact CLI commands (uip solution new DeleteMe, uip maestro flow init DeleteMe/DeleteMe --output json, etc.). This is a recipe, not a goal — any agent follows it without invoking the skill.
  • [High] Meaningful coverage (lines 31–56): All criteria check the self-report file. No command_executed, run_command, or file_contains validates actual CLI usage or real artifacts.
  • [High] Validate-only flow tests (line 8, lines 31–56): Tags include e2e but no criterion checks flow debug. flow validate only checks JSON shape; without flow debug an e2e task cannot verify runtime correctness.
  • [Medium] Near-duplicate (line 8, cf. smoke/init_validate.yaml): Operations tested (solution new → flow init → flow validate) overlap substantially with smoke/init_validate.yaml, which tests the same lifecycle with proper command_executed criteria.

Suggested fixes:

  • Delete this file and its _DELETE_ME_lint_test/ folder before merging (as the description already instructs).
  • If kept for CI self-testing of the lint bot, move it to a path outside tests/tasks/ (e.g. tests/fixtures/lint/bad_self_report.yaml) so it is excluded from coder-eval runs and the lint workflow's tests/tasks/**/*.yaml glob.

Within-PR duplicates

No duplicate clusters detected (only 1 task YAML in this PR).

Conclusion

⚠️ 1 task has issues, max severity Critical. This is expected — the file is an intentional bad-example fixture and its description says to delete it before merge. Advisory only — not blocking merge.

Action item: Please delete tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/ before merging, or move the fixture outside tests/tasks/ if it needs to persist for CI self-testing.

@UiPath UiPath deleted a comment from github-actions Bot May 5, 2026
Copy link
Copy Markdown
Collaborator

@uipreliga uipreliga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@bai-uipath bai-uipath merged commit 8533b4d into main May 5, 2026
2 checks passed
@bai-uipath bai-uipath deleted the bai/lint-new-tasks branch May 5, 2026 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants