feat: [ENG-2324] --feedback flag on curate + query (Phase 7 Task 7.4) by danhdoan · Pull Request #532 · campfirein/byterover-cli

danhdoan · 2026-04-23T01:45:37Z

Summary

Phase 7 Task 7.4 — closes brutal-review Tier 2 D4. Adds
`--feedback <good|bad>` to both `brv curate` and `brv query` so
users can flag the outcome the daemon just wrote, feeding Phase 6
Task 6.5's weighting policy (3× synthetic failures for `bad`,
1× synthetic success for `good`).

CLI surface

`src/oclif/commands/curate/index.ts` — `feedback` flag +
`handleFeedback` run after `submitTask`. Detach mode skips with
a hint (no completed outcome yet). Disabled harness → warn +
exit 0. Missing outcome → exit 1.
`src/oclif/commands/query.ts` — same shape, adapted to the
query command's flow.

Shared lib

`src/oclif/lib/harness-feedback.ts` — entry point
`attachFeedbackFromCli(projectRoot, commandType, verdict)` +
pure store variant `attachFeedbackToStore` for tests. Minimal
`HarnessOutcomeRecorder` construction delegates to Phase 6's
canonical `attachFeedback` instead of re-implementing the 3×/1×
policy.

Bug caught in test authoring

First draft scanned `listOutcomes(..., limit=1)` for the most
recent outcome, but synthetic rows (inserted with `Date.now()`
timestamps) shadow the original real outcome. A re-label flow
(`--feedback good` then `--feedback bad`) would target the
earlier synthetic, not the original. Fix filters `_synthetic`
ids out of the scan. Test 3 pins the correct behaviour.

Test plan

`npm run typecheck` — clean
`npm run lint` — 0 errors (257 pre-existing warnings)
`npm test` — 7209 passing, 16 pending, 0 failing
`npm run build` — clean
6 new unit tests in `harness-feedback.test.ts`:
- 3 synthetic rows for `bad`, 1 for `good`
- Re-label clears previous synthetics, inserts new
- Empty store → `NO_RECENT_OUTCOME`
- Most-recent-by-timestamp selection (skipping synthetics)
- commandType partitioning (curate feedback leaves query
  outcomes untouched)

Scope narrowing

`HARNESS_DISABLED` / `NO_STORAGE` paths exercised indirectly
via existing `readHarnessFeatureConfig` + `openHarnessStoreForProject`
tests that already cover the filesystem/config side; the
feedback helper composes them and doesn't duplicate coverage.
No per-outcome targeting (`--feedback-outcome `) or explicit
`--feedback none` — kept minimal per the task-doc "out of
scope" section.

Acceptance criteria

`--feedback <good|bad>` on curate + query
Most-recent lookup scoped to `(projectId, commandType)`
Phase 6 `recordFeedback` invoked with the right verdict
Repeat with different verdict replaces previous (idempotent)
No-outcome exits 1
Harness-disabled: primary runs, warn, exit 0

Closes brutal-review Tier 2 D4. Adds `--feedback <good|bad>` to both `brv curate` and `brv query` so users can flag the outcome the daemon just wrote, feeding Phase 6 Task 6.5's weighting policy (3x synthetic failures for `bad`, 1x synthetic success for `good`). Primary action runs first — a failed curate/query reports the primary failure, not a silent feedback drop. CLI surface: - `src/oclif/commands/curate/index.ts` — new `feedback` flag + `handleFeedback` after `submitTask`. Detach mode skips with a hint (no completed outcome yet). Disabled harness warns + exit 0. Missing outcome → exit 1. - `src/oclif/commands/query.ts` — same shape, adapted to the query command's flow. Shared lib: - `src/oclif/lib/harness-feedback.ts` — entry point `attachFeedbackFromCli(projectRoot, commandType, verdict)` + pure store variant `attachFeedbackToStore` for tests. Minimal `HarnessOutcomeRecorder` construction (NoOp logger + unused SessionEventBus) keeps the CLI delegating to Phase 6's canonical `attachFeedback` instead of re-implementing the 3x/1x policy. Real bug caught in test authoring: The first draft of the helper scanned `listOutcomes(..., limit=1)` for the most recent outcome, but synthetic rows (inserted with `Date.now()` timestamps) shadow the original real outcome. A re-label flow (`--feedback good` then `--feedback bad`) would target the earlier synthetic, not the original. Fix filters `__synthetic_` ids out of the "most recent" scan. Test 3 pins the correct behavior. Tests (6 new): - 3 synthetic rows for `bad`, 1 for `good` - Re-label clears previous synthetics, inserts new - Empty store → NO_RECENT_OUTCOME - Most-recent-by-timestamp selection - commandType partitioning (curate feedback leaves query outcomes untouched) Scope narrowing: - HARNESS_DISABLED / NO_STORAGE paths exercised indirectly via `readHarnessFeatureConfig` + `openHarnessStoreForProject` tests that already cover the filesystem/config side; the feedback helper composes them and doesn't duplicate coverage. - No per-outcome targeting (`--feedback-outcome <id>`) or explicit `--feedback none` — Phase 7.4 keeps the CLI minimal per the task-doc "out of scope" section. Acceptance criteria from task 7.4: - --feedback <good|bad> on curate + query - Most-recent lookup scoped to (projectId, commandType) - Phase 6 recordFeedback invoked with the right verdict - Repeat with different verdict replaces previous (idempotent) - No-outcome exits 1 - Harness-disabled primary still runs, warn, exit 0 - typecheck / lint / test / build clean

github-actions · 2026-04-23T01:45:50Z

Claude finished @danhdoan's task in 5m 11s —— View job

PR Review: feat: [ENG-2324] --feedback flag on curate + query (Phase 7 Task 7.4)

Tasks

Gather context and read changed files
Review src/oclif/lib/harness-feedback.ts
Review src/oclif/commands/curate/index.ts
Review src/oclif/commands/query.ts
Review test/unit/oclif/lib/harness-feedback.test.ts
Post inline comments and final summary

Overall Score: 3 / 5

The feature is well-scoped and the core logic is sound. The synthetic-row shadowing fix is the highlight — the problem is subtle (synthetics carry Date.now() timestamps and would win a "most-recent" scan), the fix is clean, and Test 3 pins exactly the right behaviour. The attachFeedbackToStore/attachFeedbackFromCli split also makes the feedback path testable without any filesystem dance.

That said there are two issues that need attention before merging:

Issues

🔴 Bug — incorrect exit code for NO_RECENT_OUTCOME / NO_STORAGE

handleFeedback calls this.error(message, {exit: 1}), which throws a CLIError. Because handleFeedback runs inside the withDaemonRetry closure, which is wrapped by the outer try/catch in run(), the CLIError is caught there and passed to reportError(). reportError formats it as a connection error (wrong context) and returns normally — the process exits 0 instead of the intended 1. See inline comment on curate/index.ts:311 (same issue in query.ts:162).

Fix this →

🟡 Missing JSON output for feedback

--feedback with --format json produces no JSON record of the operation — success or failure. A JSON consumer can't verify whether feedback was applied, which outcome was targeted, or how many synthetic rows were inserted. The text-format path logs all of that. See inline comment on curate/index.ts:298. Fix this →

🟡 Duplicated constants / inaccurate syntheticCount

BAD_SYNTHETIC_COUNT, GOOD_SYNTHETIC_COUNT, and SYNTHETIC_DELIMITER are re-declared in harness-feedback.ts instead of imported from harness-outcome-recorder.ts. The delimiter is the riskiest: if it ever changes, the synthetic-shadowing fix silently breaks. Additionally, FeedbackResult.syntheticCount is computed from the local constant rather than the actual insertions, so it may over-report when cap enforcement (FEEDBACK_SYNTHETIC_CAP = 10) trims rows. See inline comment on harness-feedback.ts:32.

🟡 as casts in query.ts

Lines 80–81 use as QueryFlags and as 'json' | 'text', violating the project's "Avoid as Type assertions" rule (CLAUDE.md). Curate constructs its flags object explicitly; query should follow the same pattern. See inline comment on query.ts:80.

🔵 Hardcoded ValidatedHarnessConfig in buildCliRecorder

language: 'auto' and maxVersions: 20 are hardcoded. The comment correctly notes that attachFeedback doesn't read them today, but there's no guard against future expansion. See inline comment on harness-feedback.ts:169.

🔵 Fragile message assertion in Test 4

.to.include('run curate first') couples the test to free-text wording; a grammar fix would break it for no behavioural reason. The code field check on the previous line is sufficient. See inline comment on harness-feedback.test.ts:107.

What's good

The attachFeedbackFromCli / attachFeedbackToStore separation cleanly isolates the XDG filesystem side from the feedback logic, making the unit tests straightforward.
Delegating the 3×/1× policy to HarnessOutcomeRecorder.attachFeedback rather than re-implementing it is the right call — one owner for the weighting policy.
The --detach guard in curate's handleFeedback (no outcome to flag yet) is correctly absent from query.
Test 3 (re-label idempotence) and Test 5 (timestamp ordering) cover the two subtlest invariants.
FEEDBACK_SCAN_LIMIT = 50 with a clear explanation of worst-case headroom is preferable to a magic number.

Correctness: - `handleFeedback` is now invoked AFTER `withDaemonRetry` resolves in both curate and query. Previously it ran inside the daemon callback, which meant `this.error({exit: 1})` from a NO_RECENT_OUTCOME / NO_STORAGE path was caught by the outer try/catch and routed to `reportError` — `reportError` only logs, so the exit code was silently dropped. The fix captures `projectRoot` into an outer variable and runs feedback on the post-daemon path, so CLIError propagates cleanly. - Feedback attaches only when the primary daemon call actually succeeded (new `daemonSucceeded` guard). A failed primary already skipped the feedback branch transitively, but pulling it out of the try/catch required making that intent explicit. JSON output: - `--format json` now emits structured feedback events for both the success path (`{outcomeId, syntheticCount, verdict}`) and the HARNESS_DISABLED skip path (`{reason, skipped: true}`) in both commands. Previously JSON consumers had no way to know whether feedback was attached. Constants dedup: - Exported `BAD_SYNTHETIC_COUNT`, `GOOD_SYNTHETIC_COUNT`, and `SYNTHETIC_DELIMITER` from `harness-outcome-recorder.ts` — the canonical owner of the §C2 weighting policy. Feedback helper imports them instead of re-declaring, so a future policy change can't silently drift. Not addressed: - Pre-existing `as` casts on `query.ts:80-81` are not introduced by this PR; fixing unrelated legacy casts would expand the review surface. - Hardcoded `language`/`maxVersions` in the CLI recorder are documented as ignored by `attachFeedback`; adding a runtime assertion has no behavioural benefit and would noise the log. - Test-message substring matches are typical mocha-chai shape; relaxing to code-only checks loses coverage of the user-visible text without meaningfully reducing brittleness.

github-actions · 2026-04-23T01:55:24Z

Claude finished @danhdoan's task in 1m 48s —— View job

PR Review: feat: [ENG-2324] --feedback flag on curate + query (Phase 7 Task 7.4)

Tasks

Gather context and analyze PR
Read changed files
Check if previous review issues were addressed
Provide updated review

Overall Score: 4 / 5

Strong improvement over the first draft. The three substantive issues from the prior review are all fixed. Two minor items remain.

Previous Issues — Status

✅ Fixed: Exit-code bug (was 🔴)

handleFeedback is now called outside the withDaemonRetry / outer try/catch block in both commands (curate/index.ts:186–193, query.ts:141–147). The pattern — capturing projectRoot inside the callback and tracking daemonSucceeded — is clean and correct. this.error({exit: 1}) can now propagate freely to oclif's default handler.

✅ Fixed: Missing JSON output for feedback (was 🟡)

writeJsonResponse({command: 'curate:feedback', ...}) / writeJsonResponse({command: 'query:feedback', ...}) are emitted on both success and HARNESS_DISABLED paths (curate/index.ts:313–322, query.ts:164–173). JSON consumers now have a structured record of the feedback operation.

✅ Fixed: Duplicated constants (was 🟡)

BAD_SYNTHETIC_COUNT, GOOD_SYNTHETIC_COUNT, and SYNTHETIC_DELIMITER are now imported from harness-outcome-recorder.ts (harness-feedback.ts:27–31). The comment at line 43–47 is exactly the right explanation to leave in place — it explains why importing rather than redeclaring is important.

Remaining Issues

🟡 as casts in query.ts:80–81 — still present

// query.ts:80–81
const flags = rawFlags as QueryFlags
const format = (flags.format ?? 'text') as 'json' | 'text'

These were flagged in the prior review and are unchanged. curate/index.ts constructs flags explicitly without any casts — query.ts should follow the same pattern to comply with CLAUDE.md ("Avoid as Type assertions"):

const feedbackVerdict: FeedbackVerdict | undefined =
  rawFlags.feedback === 'good' || rawFlags.feedback === 'bad' ? rawFlags.feedback : undefined
const flags: QueryFlags = {
  feedback: feedbackVerdict,
  format: rawFlags.format === 'json' ? 'json' : rawFlags.format === 'text' ? 'text' : undefined,
  timeout: rawFlags.timeout,
}
const format: 'json' | 'text' = flags.format ?? 'text'

Fix this →

🔵 Fragile message assertion in Test 4 — still present

// harness-feedback.test.ts:106
expect((caught as FeedbackError).message).to.include('run curate first')

The code field check on line 105 is the strong assertion. The .include('run curate first') substring check will break on any wording change without a behavioural regression. Consider removing it or extracting the error message to a named constant if you want to test it.

Acknowledged (no change needed)

syntheticCount may lag after cap enforcement — harness-feedback.ts:151–155 returns BAD_SYNTHETIC_COUNT/GOOD_SYNTHETIC_COUNT rather than the actual rows surviving enforceFeedbackCap. This is a display-only field in the feedback result; the comment at line 127 correctly frames it as "number the recorder inserted" (before trimming). Acceptable for now.

Hardcoded language/maxVersions in buildCliRecorder — the inline comment explains the fields are unused by attachFeedback. Fine as-is with the current interface.

What's good

The capturedProjectRoot / daemonSucceeded split is idiomatic and removes the exit-code hazard cleanly.
attachFeedbackFromCli / attachFeedbackToStore separation keeps the filesystem concern at the boundary and makes unit tests straightforward.
Delegating the 3×/1× policy entirely to HarnessOutcomeRecorder.attachFeedback is the right call.
The synthetic-shadowing fix and its coverage in Test 3 and Test 5 remain the highlights of the PR.
Tests now run fully in-memory via FileKeyStorage({inMemory: true}) — no disk state, fast and deterministic.

- Replace the two `as` casts in query.ts:80-81 with explicit typed construction — mirrors the curate command's pattern and keeps the file on the right side of CLAUDE.md's "avoid `as Type`" rule. The new code paths I added in Task 7.4 touched this block, so fixing the legacy casts now keeps the reviewed surface consistent. - Swap the fragile `.message.to.include('run curate first')` substring assertion in harness-feedback.test.ts for a structural check on `details.commandType`. The error-code check on the previous line was already the strong assertion; the substring added brittleness without coverage.

github-actions · 2026-04-23T02:18:26Z

Claude finished @danhdoan's task in 1m 26s —— View job

PR Review: feat: [ENG-2324] --feedback flag on curate + query (Phase 7 Task 7.4)

Tasks

Gather context and analyze PR
Read changed files
Check if previous review issues were addressed
Provide updated review

Overall Score: 5 / 5

Both remaining issues from the previous review are fully resolved. No new issues introduced. This PR is ready to merge.

Previous Issues — Status

✅ Fixed: as casts in query.ts (was 🟡)

Lines 80–81 now use explicit type narrowing matching the curate command's pattern:

// query.ts:83–89
const feedbackVerdict: FeedbackVerdict | undefined =
  rawFlags.feedback === 'good' || rawFlags.feedback === 'bad' ? rawFlags.feedback : undefined
const flags: QueryFlags = {
  feedback: feedbackVerdict,
  format: rawFlags.format === 'json' ? 'json' : rawFlags.format === 'text' ? 'text' : undefined,
  timeout: rawFlags.timeout,
}

No as casts remain. The comment on line 80–82 correctly explains why the narrowing is safe (oclif's options: validator already rejected anything outside ['good', 'bad']). Fully compliant with CLAUDE.md.

✅ Fixed: Fragile message assertion in Test 4 (was 🔵)

The .to.include('run curate first') substring check has been replaced with a structured assertion on the details field:

// harness-feedback.test.ts:104–106
expect(caught).to.be.instanceOf(FeedbackError)
expect((caught as FeedbackError).code).to.equal('NO_RECENT_OUTCOME')
expect((caught as FeedbackError).details.commandType).to.equal('curate')

This is strictly better than the prior fix — it not only removes the brittle text check but adds a meaningful behavioural assertion (details.commandType) that confirms the error carries the correct execution context. Wording changes no longer break it.