Claude-based linter for reviewing quality of new tasks by bai-uipath · Pull Request #554 · UiPath/skills

bai-uipath · 2026-05-04T22:27:54Z

Summary

Adds a /lint-task slash command and a non-blocking PR-time GitHub Action that scores newly added or modified coder-eval task YAMLs against a quality rubric. Purely advisory — never fails CI.

Why

A recent quality audit of the coder-eval tasks in this repo found that roughly half score below 7/10. The dominant patterns:

Self-report anti-pattern — prompt asks the agent to write report.json (or similar), criteria grade that file. The agent grades its own homework; deterministic checks are bypassed; hallucination-prone.
Prompt over-specification — initial_prompt prescribes the procedure ("Walk the discovery hierarchy", "Use --output json"), so the skill itself is never actually exercised.
Trivially-passing criteria — only file_exists / loose command_executed patterns; a dummy implementation passes.
Near-duplicate tests — same shape, only surface-level naming differences; pure infra cost with no marginal coverage.
Validate-only flow tests — flow validate only checks JSON shape; without flow debug the test cannot tell whether the flow actually produces the right output.
No record of ever passing — tests merged that have never passed, surfacing only on the next nightly run.

Reviewers don't catch these consistently in PR review, so they keep accumulating. This change adds a lightweight automated reviewer that catches them at PR time and on demand.

What's added

.claude/commands/lint-task.md — slash command. /lint-task <path|glob|dir> reads the task YAML(s) plus up to 5 nearest siblings (for duplicate detection) and emits a per-task report with severity-tagged issues (Critical / High / Medium / Low / OK) and concrete suggested fixes. Source of truth for the rubric.
.github/workflows/lint-tasks.yml — advisory PR bot. Triggers on PRs touching tests/tasks/**/*.yaml, runs only when the PR is out of draft state, applies the same rubric to changed files only, also flags within-PR duplicates and checks the PR body for a verbal claim that the new/changed tasks have been run and passed at least once. Posts a single comment, never fails the check.

Both files share the rubric — the workflow reads .claude/commands/lint-task.md at runtime, so iterating on the rubric only requires editing one file.

Design choices

Advisory, not blocking. The rubric is calibrated for the audit findings, not battle-tested across every skill. Hard-gating at PR time would create false-positive friction. Severities communicate priority; reviewers and authors decide how to act.
Honor system on passing-run evidence. The bot checks the PR body for a verbal claim ("ran X locally and it passed") but does not verify against blob storage or the dashboard. Trust assumption is cheap to relax later if it breaks.
Within-skill duplicate detection, not embeddings. Compares each changed task to up to 5 nearby siblings (same folder, then same skill if folder is sparse). Catches ~95% of real-world duplicates without infrastructure.
Reuses existing infra. Same anthropics/claude-code-action@v1 and ANTHROPIC_API_KEY secret as claude-pr-review.yml. No new secrets, no new workflow dependencies.

Out of scope (follow-ups)

Backfilling lint over all existing tasks in main (intentionally excluded from this PR — that's a separate cleanup pass; expect to tag low-scoring tests [draft] and exclude them from daily metrics).
Cross-repo dup detection (across all skills, not just within-skill).
Hard CI gating after the rubric stabilizes.
Verifying passing-run claims against blob-storage run JSON.

github-actions · 2026-05-04T23:01:20Z

Claude finished @bai-uipath's task in 3m 6s —— View job

Summary

This PR adds a /lint-task slash command and an advisory CI workflow that scores new/modified coder-eval task YAMLs against a quality rubric. The rubric covers six axes: self-report anti-pattern, prompt over-specification, meaningful coverage, gameability, near-duplicates, and validate-only flow tests. A test fixture (bad_self_report.yaml) is included to exercise the workflow.

Change-by-Change Review

1. `.claude/commands/lint-task.md` — OK with minor notes

Severity: Low

Well-structured rubric with clear axes, severity levels, and actionable output format. Follows the existing command file convention (generate-tasks.md, test-coverage.md). Key observations:

Good: The carve-outs (ground-truth anchors in axis B, scaffold reuse in axis E, description-rationale carve-out in axis F) are well-designed and will prevent common false positives.
Good: Theme-aware downgrade logic is clever — avoids inflating severity when many tasks share the same systemic issue.
Low: The theme-aware suppression logic (lines ~200-221 in the diff) is sophisticated but may produce inconsistent results across different LLM invocations. The rules for "suppress issues block when fully theme-captured" vs "keep issues block but drop theme-referenced ones" are nuanced. Consider simplifying to always show per-task issues with a (see Theme N) suffix, and let the theme entry carry the systemic severity. This is a polish concern — the current design is workable.
Low: Line 11 references the workflow path (.github/workflows/lint-tasks.yml) — this is a forward reference that's correct but creates a coupling. If the workflow moves, this line goes stale. Minor.

2. `.github/workflows/lint-tasks.yml` — OK with one medium note

Severity: Medium

The workflow is well-structured: correct paths trigger, draft-skip logic, concurrency group, fork exclusion, and reuse of the existing ANTHROPIC_API_KEY secret. Prompt is clear with explicit steps and output template.

Good: Consistent with claude-pr-review.yml — same action version, same model flag, same secret, same timeout-minutes: 15.
Good: concurrency.cancel-in-progress: true prevents stale lint comments from racing.
Good: The fork exclusion (head.repo.full_name == github.repository) prevents leaking the API key.
Medium (line 45): claude_args: "--model claude-opus-4-6" hardcodes the model. The existing claude-pr-review.yml does the same, so this is consistent with current practice. However, if the team wants to update the model across all workflows, each file must be edited independently. Consider extracting this to a repo-level variable or .env — but since the existing workflow does the same thing, this is a consistency-preserving choice, not a regression.
The prompt (lines 46-194) is ~150 lines. This is large but well-organized. The explicit output template (lines 127-175) ensures consistent comment formatting. No issues with the YAML block scalar — markdown fences inside prompt: | are preserved correctly.

3. `tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/bad_self_report.yaml` — High

Severity: High

This is a deliberately broken test fixture, and the file's own description (lines 2-7) says: "Delete this file and the enclosing _DELETE_ME_lint_test/ folder before merging the lint workflow PR."

High: This file must be removed before merge. If merged, it becomes a permanently broken task in the test suite. The _DELETE_ME_ naming convention is not recognized by any CI exclusion — it will show up in tests/tasks/ listings, confuse future task audits, and potentially trigger the nightly e2e runner if the experiment config includes glob patterns.
The file correctly exercises multiple rubric axes (self-report, over-specification, weak coverage, gameable criteria, validate-only e2e) — it's a good test fixture for development, but it has served its purpose.
The tag lifecycle:generate (line 8) is not a standard test-type tag (smoke, integration, e2e), but it appears alongside e2e which is valid. The lifecycle: prefix appears to be a category tag, not a test-type violation.

What's Missing

Test fixture cleanup: _DELETE_ME_lint_test/bad_self_report.yaml must be deleted before merge (the file itself says so). Fix this →
.claude/commands/lint-task.md is deleted in the working tree: git status shows D .claude/commands/lint-task.md. The file is in the branch commits (verified via git show HEAD), but the working tree deletion is anomalous. Verify this is a CI checkout artifact and not an uncommitted deletion that will be pushed. If the author ran git rm locally without committing, the file will be missing from the branch on next push.
No CODEOWNERS entry for .claude/commands/: The existing /.github/ path is covered by @uipreliga, but .claude/commands/ has no CODEOWNERS entry. The existing commands (generate-tasks.md, skill-compare.md, test-coverage.md) also lack coverage — this isn't a regression, but worth adding as a follow-up.

Area Ratings

Area	Status	Notes
Frontmatter	N/A	No skill SKILL.md changed
E2E Tests	Issue	Test fixture must be removed before merge
Skill Body	N/A	No skill content changed
References & Assets	OK	No broken links in changed files
Repo Hygiene	Issue	Test fixture committed with "delete before merge" instruction; no CODEOWNERS for `.claude/commands/`

Issues for Manual Review

Rubric calibration: The rubric axes and severity mappings are based on one audit. The PR description acknowledges this ("calibrated for the audit findings, not battle-tested across every skill"). Manual spot-checking against a few real tasks on main would build confidence before iterating on the rubric.
Theme-aware downgrade consistency: The theme logic in lint-task.md (lines ~195-221) is complex. Different LLM invocations may apply the downgrade/suppress rules inconsistently. Consider running the slash command on 5-10 real tasks and comparing outputs to validate consistency.
Workflow permissions: The workflow requests pull-requests: write which is correct for posting comments. Verify this doesn't conflict with any branch protection rules or required-review policies.

Conclusion

The rubric design is strong — the six axes target real, documented quality gaps, and the carve-outs for false positives show careful thought. The workflow is consistent with existing CI patterns.

One blocking issue: the _DELETE_ME_lint_test/bad_self_report.yaml test fixture must be removed before merge. The file itself documents this requirement.

One medium concern: consider whether the hardcoded model in the workflow should be extracted, or accept the consistency-with-existing-workflows trade-off.

github-actions · 2026-05-05T00:14:51Z

Claude finished @bai-uipath's task in 2m 7s —— View job

Coder-eval task lint (advisory)

Read the lint rubric (.claude/commands/lint-task.md)
Identify changed task YAMLs
Check PR body for evidence of passing run
Apply rubric to each changed task
Check for within-PR duplicates
Post final lint report

Coder-eval task lint (advisory)

1 task YAML changed; verdict counts: 2 Critical, 0 High, 0 Medium, 0 Low, 0 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

❌ High — PR body does not claim the changed task has been run and passed. This is moot for this PR: the task is an intentional bad-example fixture (description lines 2–7 say "Delete this file and the enclosing _DELETE_ME_lint_test/ folder before merging"). Please delete it before merge, or — if kept for CI self-testing — move it outside tests/tasks/.

Per-task lint

`tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/bad_self_report.yaml` — verdict: Critical

Note: The task's own description (lines 2–7) states it is deliberately bad and temporary. The issues below confirm the rubric catches what was intended.

Issues:

[Critical] Self-report anti-pattern (lines 21, 31–56): initial_prompt instructs agent to write report.json (line 21: "Save a summary of what you did to report.json"); all three success_criteria grade that same self-written file (file_exists line 33, json_check lines 38–46 and 48–56). Agent grades its own homework.
[Critical] Could pass for the wrong reason (lines 31–56): A dummy agent can skip every CLI command, write {"solution_created": true, "status": "success"} to report.json, and pass all three criteria. The skill is never exercised.
[High] Prompt over-specification (lines 14–29): initial_prompt prescribes 6 numbered steps with exact CLI commands (uip solution new DeleteMe, uip maestro flow init DeleteMe/DeleteMe --output json, etc.). This is a recipe, not a goal — any agent follows it without invoking the skill.
[High] Meaningful coverage (lines 31–56): All criteria check the self-report file. No command_executed, run_command, or file_contains validates actual CLI usage or real artifacts.
[High] Validate-only flow tests (line 8, lines 31–56): Tags include e2e but no criterion checks flow debug. flow validate only checks JSON shape; without flow debug an e2e task cannot verify runtime correctness.
[Medium] Near-duplicate (line 8, cf. smoke/init_validate.yaml): Operations tested (solution new → flow init → flow validate) overlap substantially with smoke/init_validate.yaml, which tests the same lifecycle with proper command_executed criteria.

Suggested fixes:

Delete this file and its _DELETE_ME_lint_test/ folder before merging (as the description already instructs).
If kept for CI self-testing of the lint bot, move it to a path outside tests/tasks/ (e.g. tests/fixtures/lint/bad_self_report.yaml) so it is excluded from coder-eval runs and the lint workflow's tests/tasks/**/*.yaml glob.

Within-PR duplicates

No duplicate clusters detected (only 1 task YAML in this PR).

Conclusion

⚠️ 1 task has issues, max severity Critical. This is expected — the file is an intentional bad-example fixture and its description says to delete it before merge. Advisory only — not blocking merge.

Action item: Please delete tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/ before merging, or move the fixture outside tests/tasks/ if it needs to persist for CI self-testing.

uipreliga

🚢

bai-uipath added 4 commits May 4, 2026 15:26

lint for task quality

ba54c94

use severity levels

f865d44

finetune rubric

faf8385

obviously bad one for testing

d274a20

bai-uipath marked this pull request as ready for review May 4, 2026 23:00

bai-uipath requested a review from uipreliga as a code owner May 4, 2026 23:00

bai-uipath requested review from rockymadden and tmatup May 4, 2026 23:09

bai-uipath mentioned this pull request May 5, 2026

Identity User/robot management skills #522

Open

human readable headings

b720f67

UiPath deleted a comment from github-actions Bot May 5, 2026

tmatup approved these changes May 5, 2026

View reviewed changes

uipreliga approved these changes May 5, 2026

View reviewed changes

bai-uipath added 2 commits May 5, 2026 09:59

remove lint-test fixture

54fdcbf

Merge branch 'main' into bai/lint-new-tasks

d058654

bai-uipath merged commit 8533b4d into main May 5, 2026
2 checks passed

bai-uipath deleted the bai/lint-new-tasks branch May 5, 2026 17:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Claude-based linter for reviewing quality of new tasks#554

Claude-based linter for reviewing quality of new tasks#554
bai-uipath merged 7 commits intomainfrom
bai/lint-new-tasks

bai-uipath commented May 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 5, 2026 •

edited

Loading

Uh oh!

uipreliga left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

bai-uipath commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

What's added

Design choices

Out of scope (follow-ups)

Uh oh!

github-actions Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change-by-Change Review

1. .claude/commands/lint-task.md — OK with minor notes

2. .github/workflows/lint-tasks.yml — OK with one medium note

3. tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/bad_self_report.yaml — High

What's Missing

Area Ratings

Issues for Manual Review

Conclusion

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coder-eval task lint (advisory)

Coder-eval task lint (advisory)

Evidence of passing run

Per-task lint

tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/bad_self_report.yaml — verdict: Critical

Within-PR duplicates

Conclusion

Uh oh!

uipreliga left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bai-uipath commented May 4, 2026 •

edited

Loading

github-actions Bot commented May 4, 2026 •

edited

Loading

1. `.claude/commands/lint-task.md` — OK with minor notes

2. `.github/workflows/lint-tasks.yml` — OK with one medium note

3. `tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/bad_self_report.yaml` — High

github-actions Bot commented May 5, 2026 •

edited

Loading

`tests/tasks/uipath-maestro-flow/_DELETE_ME_lint_test/bad_self_report.yaml` — verdict: Critical