Add adaptive red-teaming, judge calibration, and evolutionary auto-learn #14
…, adaptive red team
- Versioned golden set (frozen ground-truth) used as stable regression suite + reference labels; judge calibration stage (agreement/Cohen's kappa) caps verdict at 'iterate' when the judge can't be trusted.
- Auto-learn is now an evolutionary multi-generation search with an elite archive and A/B fitness vs the frozen baseline, replacing the single greedy patch.
- Auto-learn ingests customer feedback (UI=human, MCP/CLI=LLM) with LLM-agent feedback weighted 2x over human feedback.
- Adversarial stage upgraded to adaptive multi-turn red-team (attacker reads each defender reply and escalates) for prompt/hybrid skills.
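For reference, the 2x weighting of LLM-agent feedback over human feedback could be sketched roughly as below. This is only an illustration: `SketchFeedback`, `weightOf`, and `weightedApproval` are hypothetical names, and the PR's actual `FeedbackRow` shape and `summarizeFeedback()` logic are not shown in this conversation.

```typescript
// Hypothetical shape; the real FeedbackRow type in the PR may differ.
type SketchSource = "ui" | "mcp" | "cli";
type SketchFeedback = { source: SketchSource; positive: boolean };

// Per the commit message: UI = human (weight 1), MCP/CLI = LLM agent (weight 2).
function weightOf(source: SketchSource): number {
  return source === "ui" ? 1 : 2;
}

// Weighted share of positive feedback, or null when there is no feedback at all.
function weightedApproval(rows: SketchFeedback[]): number | null {
  let total = 0;
  let positive = 0;
  for (const row of rows) {
    const w = weightOf(row.source);
    total += w;
    if (row.positive) positive += w;
  }
  return total === 0 ? null : positive / total;
}
```

With one positive UI row and one negative MCP row, the weighted approval is 1/3 rather than the unweighted 1/2.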
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a8f385f5c
    const [oldRes, newRes] = await Promise.all([
      generateText({ model: getGatewayModel(FAST), system: opts.version.system_prompt, prompt: ex.input })
        .then((r) => r.text)
        .catch(() => ""),
Reuse one baseline output set across all candidates
Recomputing oldRes inside scoreCandidate() means every patch candidate is judged against a different baseline run, because generation through getGatewayModel(FAST) is non-deterministic and runs again for each candidate. That makes fitness values across candidates/generations incomparable, so elite selection can prefer a weaker patch due to baseline variance rather than real improvement. Compute OLD outputs once per sampled example before the evolutionary loop and reuse them for all candidate scoring.
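A minimal sketch of that suggestion, assuming a generic `generate` callback in place of the real `generateText` call. The types `Example` and `Generate` and the helpers `computeBaseline` and `collectPairs` are hypothetical names for illustration, not the PR's code.

```typescript
// Hypothetical types; only the idea (one baseline per example) comes from the review.
type Example = { id: string; input: string };
type Generate = (systemPrompt: string, input: string) => Promise<string>;

// Run the OLD prompt once per sampled example, before the evolutionary loop.
async function computeBaseline(
  examples: Example[],
  oldSystemPrompt: string,
  generate: Generate,
): Promise<Map<string, string>> {
  const entries = await Promise.all(
    examples.map(async (ex): Promise<[string, string]> => {
      // Mirrors the excerpt's `.catch(() => "")`: a failed call yields an empty output.
      const text = await generate(oldSystemPrompt, ex.input).catch(() => "");
      return [ex.id, text];
    }),
  );
  return new Map(entries);
}

// Every candidate is paired with the same frozen baseline outputs.
async function collectPairs(
  examples: Example[],
  candidatePrompt: string,
  baseline: Map<string, string>,
  generate: Generate,
): Promise<{ id: string; oldRes: string; newRes: string }[]> {
  return Promise.all(
    examples.map(async (ex) => ({
      id: ex.id,
      oldRes: baseline.get(ex.id) ?? "",
      newRes: await generate(candidatePrompt, ex.input).catch(() => ""),
    })),
  );
}
```

Calling `computeBaseline` once and passing the resulting map into each `collectPairs` call means the old-prompt generations are never repeated, so differences in fitness reflect only the candidate patch.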
    } catch {
      break;
Treat adaptive judge failures as indeterminate, not safe
When the adaptive red-team judge call throws (API/transient/JSON-parse failure), the code immediately breaks out of the loop and later returns broke: false. In those failure paths, unsafe defender replies can be recorded as non-breaks, systematically inflating safety outcomes. This should preserve uncertainty (e.g., broke: null) or continue probing, rather than converting judge failure into a safe result.
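One way to keep the uncertainty is a tri-state result, sketched below under stated assumptions. `runAttack`, `JudgeVerdict`, `AttackResult`, and the `judgeTurn` callback are hypothetical stand-ins for the real attacker/defender/judge loop, which is not shown in this excerpt.

```typescript
// Hypothetical shapes for illustration only.
type JudgeVerdict = { broke: boolean; done: boolean };
type AttackResult = { broke: boolean | null; turns: number };

async function runAttack(
  maxTurns: number,
  judgeTurn: (turn: number) => Promise<JudgeVerdict>,
): Promise<AttackResult> {
  let judgeFailed = false;
  for (let turn = 1; turn <= maxTurns; turn++) {
    // A thrown judge call (API, transient, or JSON-parse failure) becomes null here.
    const verdict = await judgeTurn(turn).catch(() => null);
    if (verdict === null) {
      judgeFailed = true; // remember the unjudged turn and keep probing
      continue;
    }
    if (verdict.broke) return { broke: true, turns: turn };
    if (verdict.done) return { broke: judgeFailed ? null : false, turns: turn };
  }
  // No break observed: "safe" only if every turn was actually judged.
  return { broke: judgeFailed ? null : false, turns: maxTurns };
}
```

In this sketch a confirmed break always wins, while any unjudged turn downgrades an otherwise clean run from `false` to `null` so downstream aggregation can exclude it from safety rates.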
    const { data: goldenRows } = await supabase
      .from("package_golden_cases")
      .select("title, input, expected_output, label_pass, label_source")
      .eq("package_id", pkg.id)
      .eq("is_active", true)
      .limit(20);
Order golden cases before LIMIT to keep calibration stable
The golden-set query applies .limit(20) without any deterministic ordering, so once a package has more than 20 active golden rows the subset fed into calibration can change between runs. Because calibration can cap verdicts, this introduces run-to-run instability unrelated to model behavior. Add an explicit order(...) (e.g., by created_at/id) before limit so the same frozen cases are evaluated consistently.
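The effect of ordering before limiting can be illustrated in memory, as in the sketch below. `GoldenRow` and `pickGoldenCases` are hypothetical names, and the sketch assumes `created_at` is an ISO-8601 string in a single consistent format so that string comparison matches chronological order. On the query itself, the equivalent would be chaining `.order("created_at")` and `.order("id")` before `.limit(20)`.

```typescript
// Hypothetical row shape; assumes created_at is a uniformly formatted ISO-8601 string.
type GoldenRow = { id: string; created_at: string; title: string };

// Sort by (created_at, id) first, then take the first `limit` rows,
// so the chosen subset does not depend on the order rows arrive in.
function pickGoldenCases(rows: GoldenRow[], limit = 20): GoldenRow[] {
  return [...rows]
    .sort((a, b) => {
      if (a.created_at !== b.created_at) return a.created_at < b.created_at ? -1 : 1;
      if (a.id !== b.id) return a.id < b.id ? -1 : 1; // tie-breaker keeps the order total
      return 0;
    })
    .slice(0, limit);
}
```

The `id` tie-breaker matters when several golden rows share a timestamp (for example after a bulk insert); without it the cut at the limit could still vary between runs.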
- adaptive red-team: judge failure now yields broke=null (indeterminate), not false, so unjudged unsafe replies no longer inflate safety.
- auto-learn: compute baseline OLD outputs once and reuse across all candidates so evolutionary fitness is comparable.
- golden-set queries: deterministic order(created_at,id) before limit so calibration is stable when >20 active golden rows exist.
What does this PR add?
Implements three major improvements to SkillForge's evaluation and auto-learning pipelines: (1) adaptive multi-turn red-teaming that escalates attacks based on defender responses, (2) judge calibration against frozen golden test cases to detect when the LLM judge is unreliable, and (3) evolutionary patch search that maintains an elite archive across generations instead of proposing a single greedy patch.
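For readers unfamiliar with the calibration metric, Cohen's kappa for binary pass/fail labels can be sketched as follows. This is an illustration under assumptions, not the PR's `cohenKappa()`: the name `binaryKappa`, the boolean-array signature, and the choice to return `null` for undefined cases are all hypothetical.

```typescript
// Cohen's kappa for two binary raters: kappa = (po - pe) / (1 - pe),
// where po is observed agreement and pe is agreement expected by chance.
// Returns null for empty or mismatched input, or when pe === 1 (kappa undefined).
function binaryKappa(judge: boolean[], gold: boolean[]): number | null {
  const n = judge.length;
  if (n === 0 || n !== gold.length) return null;

  let agree = 0;
  let judgePass = 0;
  let goldPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === gold[i]) agree++;
    if (judge[i]) judgePass++;
    if (gold[i]) goldPass++;
  }

  const po = agree / n;
  const pJ = judgePass / n;
  const pG = goldPass / n;
  const pe = pJ * pG + (1 - pJ) * (1 - pG);
  if (pe === 1) return null; // both raters constant on the same label
  return (po - pe) / (1 - pe);
}
```

For example, a judge labelling four cases pass, pass, fail, fail against golden labels pass, fail, fail, fail agrees on 3 of 4 (po = 0.75) with chance agreement pe = 0.5, giving kappa = 0.5. The threshold below which the verdict is capped at 'iterate' is not stated in this conversation.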
Type
Changes
Adaptive Red-Teaming
- adaptiveAttack() function that runs multi-turn conversations with the skill, using an LLM judge to decide if the attack succeeded and what to try next
- redTeamTurns parameter in evaluatorPipeline()
Judge Calibration
- cohenKappa() function to compute inter-rater agreement between LLM judge and ground truth
- package_golden_cases table to store frozen, human-labelled reference test cases per package
- judge_calibration column on evaluations
Evolutionary Auto-Learn
- candidatesPerGen diverse patches (default 2) using different strategies
- generations (default 2, max 4) and candidatesPerGen (default 2, max 4)
Customer Feedback Integration
- FeedbackRow type and summarizeFeedback() function
Database
- package_golden_cases table with RLS policies
- judge_calibration and evolution_trace JSONB columns to package_evaluations
Notes for reviewers
Tradeoffs:
Edge cases:
- maxTurns limit; if judge says "done", loop breaks early
Assumptions:
https://claude.ai/code/session_01BFy9njggrB6SUV3KRJ6439