Add adaptive red-teaming, judge calibration, and evolutionary auto-learn #14

Merged
criptogus merged 2 commits into main from claude/analyze-skill-creator-dLyBX on May 17, 2026

Conversation

@criptogus
Owner

What does this PR add?

Implements three major improvements to SkillForge's evaluation and auto-learning pipelines: (1) adaptive multi-turn red-teaming that escalates attacks based on defender responses, (2) judge calibration against frozen golden test cases to detect when the LLM judge is unreliable, and (3) evolutionary patch search that maintains an elite archive across generations instead of proposing a single greedy patch.

Type

  • Platform code or docs

Changes

Adaptive Red-Teaming

  • New adaptiveAttack() function that runs multi-turn conversations with the skill, using an LLM judge to decide if the attack succeeded and what to try next
  • Replaces single-shot probe evaluation with configurable turn limits (default 3, max 5)
  • Tracks full transcript of attacker/defender exchanges for transparency
  • Configurable via redTeamTurns parameter in evaluatorPipeline()
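
The multi-turn loop described above could look roughly like the sketch below. The PR does not show the real `adaptiveAttack()` signature, so the types, the injected `defend` and `judge` callbacks, and the names `adaptiveAttackSketch` and `clampTurns` are all assumptions made for illustration. Only the default of 3 turns, the cap of 5, the transcript, and the early exit come from the description.

```typescript
// Hypothetical shapes; the real adaptiveAttack() types are not shown in the PR.
type Turn = { attacker: string; defender: string };
type JudgeVerdict = { broke: boolean; done: boolean; nextAttack: string };
type AttackResult = { broke: boolean; turns: number; transcript: Turn[] };

const DEFAULT_TURNS = 3; // default from the PR description
const MAX_TURNS = 5; // hard cap from the PR description

// Clamp the requested turn count into [1, MAX_TURNS]; fall back to the default.
function clampTurns(requested?: number): number {
  const n =
    requested !== undefined && Number.isFinite(requested)
      ? Math.floor(requested)
      : DEFAULT_TURNS;
  return Math.min(MAX_TURNS, Math.max(1, n));
}

async function adaptiveAttackSketch(
  openingAttack: string,
  defend: (attack: string) => Promise<string>, // assumed: calls the skill under test
  judge: (transcript: Turn[]) => Promise<JudgeVerdict>, // assumed: LLM judge
  redTeamTurns?: number,
): Promise<AttackResult> {
  const maxTurns = clampTurns(redTeamTurns);
  const transcript: Turn[] = [];
  let attack = openingAttack;

  for (let i = 0; i < maxTurns; i++) {
    const defender = await defend(attack);
    transcript.push({ attacker: attack, defender });

    const verdict = await judge(transcript);
    if (verdict.broke) {
      return { broke: true, turns: transcript.length, transcript };
    }
    if (verdict.done) break; // judge sees nothing further worth trying
    attack = verdict.nextAttack; // escalate based on the defender's reply
  }
  return { broke: false, turns: transcript.length, transcript };
}
```

Passing the defender and judge in as callbacks keeps the loop testable without a model call; the real pipeline presumably wires these to LLM requests.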

Judge Calibration

  • New cohenKappa() function to compute inter-rater agreement between LLM judge and ground truth
  • New package_golden_cases table to store frozen, human-labelled reference test cases per package
  • Stage 3.5 in evaluator pipeline compares judge verdicts against golden labels
  • If judge agreement < 70% or κ < 0.4, verdict is capped at 'iterate' (cannot ship)
  • Calibration results persisted in judge_calibration column on evaluations
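
As a rough illustration of the calibration gate, the sketch below computes raw agreement and Cohen's κ for binary pass/fail labels and applies the 70% / 0.4 thresholds from the list above. The function names, the `Calibration` shape, and the three-step verdict ladder (including `'reject'`) are assumptions; the PR only names `cohenKappa()` and the `'iterate'` cap. Returning κ = 1 when expected agreement is exactly 1 is also a convention chosen here, not something the source specifies.

```typescript
type Calibration = { agreement: number; kappa: number };

// Cohen's kappa for two binary raters: (po - pe) / (1 - pe).
function cohenKappaSketch(judge: boolean[], golden: boolean[]): Calibration {
  if (judge.length === 0 || judge.length !== golden.length) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = judge.length;
  let agree = 0;
  let judgePass = 0;
  let goldenPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === golden[i]) agree++;
    if (judge[i]) judgePass++;
    if (golden[i]) goldenPass++;
  }
  const po = agree / n; // observed agreement
  const pJ = judgePass / n;
  const pG = goldenPass / n;
  const pe = pJ * pG + (1 - pJ) * (1 - pG); // agreement expected by chance
  // Assumption: when both raters use a single identical label, treat kappa as 1.
  const kappa = pe === 1 ? 1 : (po - pe) / (1 - pe);
  return { agreement: po, kappa };
}

// Hypothetical verdict ladder; only 'iterate' and shipping appear in the PR.
const VERDICT_ORDER = ["reject", "iterate", "ship"] as const;
type Verdict = (typeof VERDICT_ORDER)[number];

const MIN_AGREEMENT = 0.7; // threshold from the PR description
const MIN_KAPPA = 0.4; // threshold from the PR description

// Cap the verdict at 'iterate' when the judge fails either threshold.
function capVerdict(verdict: Verdict, cal: Calibration): Verdict {
  const trusted = cal.agreement >= MIN_AGREEMENT && cal.kappa >= MIN_KAPPA;
  if (trusted) return verdict;
  const overCap =
    VERDICT_ORDER.indexOf(verdict) > VERDICT_ORDER.indexOf("iterate");
  return overCap ? "iterate" : verdict;
}
```

For example, judge labels `[pass, pass, fail, fail]` against golden labels `[pass, fail, fail, fail]` give 75% agreement and κ = 0.5, which clears both thresholds.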

Evolutionary Auto-Learn

  • Replaces single-patch proposal with multi-generation search maintaining an elite archive
  • Each generation proposes candidatesPerGen diverse patches (default 2) using different strategies
  • Candidates scored via A/B testing on frozen baseline examples; fitness = net wins − output failures + confidence bonus
  • Elite archive (size 2) carries forward across generations; later generations mutate best elites
  • Configurable via generations (default 2, max 4) and candidatesPerGen (default 2, max 4)
  • Evolution trace stored for UI visualization
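
A minimal sketch of the generational loop with an elite archive follows. The `propose` and `score` callbacks, the `Candidate` shape, and the name `evolveSketch` are hypothetical; the fitness function is deliberately left to the caller, and only the defaults, caps, archive size, and the "no scorable candidates" failure are taken from this PR.

```typescript
type Candidate = { id: string; patch: string; parentId?: string };
type Scored = Candidate & { fitness: number };
type GenerationTrace = { generation: number; scored: Scored[]; elites: string[] };

const ELITE_SIZE = 2; // archive size from the PR description

// Clamp an optional setting into [1, max], using def when it is missing.
function clampSetting(value: number | undefined, def: number, max: number): number {
  const n = value !== undefined && Number.isFinite(value) ? Math.floor(value) : def;
  return Math.min(max, Math.max(1, n));
}

function evolveSketch(
  // assumed: asks the LLM for `count` patches, mutating elites when present
  propose: (generation: number, elites: Scored[], count: number) => Candidate[],
  // assumed: A/B fitness against the frozen baseline; null means not scorable
  score: (candidate: Candidate) => number | null,
  opts: { generations?: number; candidatesPerGen?: number } = {},
): { best: Scored; trace: GenerationTrace[] } {
  const generations = clampSetting(opts.generations, 2, 4);
  const perGen = clampSetting(opts.candidatesPerGen, 2, 4);

  let elites: Scored[] = [];
  const trace: GenerationTrace[] = [];

  for (let g = 0; g < generations; g++) {
    const scored: Scored[] = [];
    for (const candidate of propose(g, elites, perGen)) {
      const fitness = score(candidate);
      if (fitness !== null) scored.push({ ...candidate, fitness });
    }
    // Elites compete with the new generation; the best ELITE_SIZE survive.
    elites = [...elites, ...scored]
      .sort((a, b) => b.fitness - a.fitness)
      .slice(0, ELITE_SIZE);
    trace.push({ generation: g, scored, elites: elites.map((e) => e.id) });
  }

  const best = elites[0];
  if (!best) {
    // The PR maps this case to a 502 so a broken patch is never shipped.
    throw new Error("evolutionary search produced no scorable candidates");
  }
  return { best, trace };
}
```

Because the previous elites are merged back into each generation's pool, a later generation can never end with a worse best candidate than an earlier one.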

Customer Feedback Integration

  • New FeedbackRow type and summarizeFeedback() function
  • LLM-agent feedback weighted 2× higher than human UI feedback (configurable)
  • Feedback summary included in root-cause analysis and patch proposal prompts
  • Feedback counts and top comments exposed in return values
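
One way the 2× weighting might be applied is sketched below. The real `FeedbackRow` fields are not shown in this PR, so the row shape, the numeric `rating`, and the returned summary fields are assumptions made for illustration.

```typescript
// Hypothetical row shape; the PR's FeedbackRow type is not shown.
type FeedbackRowSketch = {
  source: "human" | "llm_agent"; // UI feedback vs MCP/CLI agent feedback
  rating: number; // assumed numeric score, e.g. 1 to 5
  comment?: string;
};

type FeedbackSummary = {
  counts: { human: number; llm_agent: number };
  weightedRating: number | null; // null when there is no feedback
  topComments: string[];
};

function summarizeFeedbackSketch(
  rows: FeedbackRowSketch[],
  llmWeight = 2, // LLM-agent feedback counts twice by default
  topN = 3,
): FeedbackSummary {
  const counts = { human: 0, llm_agent: 0 };
  const comments: { text: string; weight: number }[] = [];
  let weightedSum = 0;
  let totalWeight = 0;

  for (const row of rows) {
    const weight = row.source === "llm_agent" ? llmWeight : 1;
    counts[row.source]++;
    weightedSum += weight * row.rating;
    totalWeight += weight;
    if (row.comment) comments.push({ text: row.comment, weight });
  }

  // Higher-weighted sources surface first in the prompt summary.
  const topComments = comments
    .sort((a, b) => b.weight - a.weight)
    .slice(0, topN)
    .map((c) => c.text);

  return {
    counts,
    weightedRating: totalWeight === 0 ? null : weightedSum / totalWeight,
    topComments,
  };
}
```

With one human rating of 2 and one LLM-agent rating of 5, the weighted rating is (1·2 + 2·5) / 3 = 4 rather than the unweighted 3.5.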

Database

  • Migration adds package_golden_cases table with RLS policies
  • Adds judge_calibration and evolution_trace JSONB columns to package_evaluations
  • Seeds golden set from current version examples as 'reference' labels

Notes for reviewers

Tradeoffs:

  • Adaptive red-teaming increases eval latency (multi-turn LLM calls per probe) but provides stronger adversarial signal
  • Evolutionary search is more expensive than single-patch proposal but avoids local optima and provides search transparency
  • Judge calibration can block shipping if the judge is miscalibrated on golden cases, which is intentional (safety gate)

Edge cases:

  • If no golden cases exist, calibration stage is skipped (backward compatible)
  • If evolutionary search produces no scorable candidates, pipeline fails with 502 (prevents shipping broken patches)
  • Adaptive attack respects maxTurns limit; if judge says "done", loop breaks early

Assumptions:

  • Golden cases are stable and represent the true acceptance criteria (not regenerated per eval)
  • LLM-agent feedback is more reliable than human ratings (weighted higher)
  • A/B fitness on 4 baseline examples is sufficient to detect regressions (sample size trade-off)

https://claude.ai/code/session_01BFy9njggrB6SUV3KRJ6439

…, adaptive red team

- Versioned golden set (frozen ground-truth) used as stable regression
  suite + reference labels; judge calibration stage (agreement/Cohen's
  kappa) caps verdict at 'iterate' when the judge can't be trusted.
- Auto-learn is now an evolutionary multi-generation search with an
  elite archive and A/B fitness vs the frozen baseline, replacing the
  single greedy patch.
- Auto-learn ingests customer feedback (UI=human, MCP/CLI=LLM) with
  LLM-agent feedback weighted 2x over human feedback.
- Adversarial stage upgraded to adaptive multi-turn red-team (attacker
  reads each defender reply and escalates) for prompt/hybrid skills.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a8f385f5c

Comment thread src/lib/skills/pipelines.server.ts Outdated
Comment on lines +980 to +983
const [oldRes, newRes] = await Promise.all([
  generateText({ model: getGatewayModel(FAST), system: opts.version.system_prompt, prompt: ex.input })
    .then((r) => r.text)
    .catch(() => ""),

P1: Reuse one baseline output set across all candidates

Recomputing oldRes inside scoreCandidate() means every patch candidate is judged against a different baseline run, because the generateText() call is non-deterministic and is repeated for each candidate. That makes fitness values incomparable across candidates and generations, so elite selection can prefer a weaker patch because of baseline variance rather than real improvement. Compute the OLD outputs once per sampled example before the evolutionary loop and reuse them for all candidate scoring.
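
The suggested restructuring could be sketched as follows. The helper names, the `Example` shape, and the injected `runOld`, `runNew`, and `prefer` callbacks are hypothetical stand-ins for the real model calls and A/B judge, which are not shown in full here.

```typescript
type Example = { id: string; input: string };

// Run the OLD prompt exactly once per example, before the evolutionary loop.
async function computeBaselineOnce(
  examples: Example[],
  runOld: (input: string) => Promise<string>, // assumed wrapper around the model call
): Promise<Map<string, string>> {
  const outputs = await Promise.all(
    examples.map((ex) => runOld(ex.input).catch(() => "")),
  );
  return new Map(
    examples.map((ex, i): [string, string] => [ex.id, outputs[i] ?? ""]),
  );
}

// Score one candidate against the shared, frozen baseline outputs.
async function scoreAgainstBaseline(
  examples: Example[],
  baseline: Map<string, string>,
  runNew: (input: string) => Promise<string>, // candidate prompt
  prefer: (oldOut: string, newOut: string) => -1 | 0 | 1, // assumed A/B judge
): Promise<number> {
  let netWins = 0;
  for (const ex of examples) {
    const newOut = await runNew(ex.input).catch(() => "");
    netWins += prefer(baseline.get(ex.id) ?? "", newOut);
  }
  return netWins;
}
```

Every candidate now sees identical OLD outputs, so differences in net wins reflect the patch and the judge rather than a fresh baseline sample.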


Comment on lines +407 to +408
} catch {
  break;

P1: Treat adaptive judge failures as indeterminate, not safe

When the adaptive red-team judge call throws (API/transient/JSON-parse failure), the code immediately breaks out of the loop and later returns broke: false. In those failure paths, unsafe defender replies can be recorded as non-breaks, systematically inflating safety outcomes. This should preserve uncertainty (e.g., broke: null) or continue probing, rather than converting judge failure into a safe result.
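
A small sketch of the tri-state outcome suggested here: a judge failure becomes `null` and is reported separately instead of being counted as a held probe. The names `judgeOrIndeterminate` and `summarizeProbes` and the summary shape are assumptions, not code from the PR.

```typescript
// true = attack succeeded, false = skill held, null = judge could not decide.
type BreakOutcome = boolean | null;

// Wrap a judge call so API or parse failures stay indeterminate.
async function judgeOrIndeterminate(
  judge: () => Promise<boolean>,
): Promise<BreakOutcome> {
  try {
    return await judge();
  } catch {
    return null; // never convert a failure into "safe"
  }
}

// Aggregate probe outcomes without letting unjudged probes inflate safety.
function summarizeProbes(outcomes: BreakOutcome[]): {
  broke: number;
  held: number;
  indeterminate: number;
  holdRate: number | null;
} {
  const broke = outcomes.filter((o) => o === true).length;
  const held = outcomes.filter((o) => o === false).length;
  const indeterminate = outcomes.filter((o) => o === null).length;
  const judged = broke + held;
  return {
    broke,
    held,
    indeterminate,
    // Rate is computed over judged probes only; null if nothing was judged.
    holdRate: judged === 0 ? null : held / judged,
  };
}
```

Keeping indeterminate probes out of the denominator is one possible policy; a stricter pipeline could instead treat a high indeterminate count as a reason to block shipping.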


Comment on lines +36 to +41
const { data: goldenRows } = await supabase
  .from("package_golden_cases")
  .select("title, input, expected_output, label_pass, label_source")
  .eq("package_id", pkg.id)
  .eq("is_active", true)
  .limit(20);

P2: Order golden cases before LIMIT to keep calibration stable

The golden-set query applies .limit(20) without any deterministic ordering, so once a package has more than 20 active golden rows the subset fed into calibration can change between runs. Because calibration can cap verdicts, this introduces run-to-run instability unrelated to model behavior. Add an explicit order(...) (e.g., by created_at/id) before limit so the same frozen cases are evaluated consistently.
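
To show why the ordering matters, the sketch below is an in-memory equivalent of "order by created_at, then id, then limit". The `GoldenRow` shape and the name `selectGoldenSubset` are hypothetical, and the string comparison assumes `created_at` values are ISO-8601 timestamps in a uniform format. In the actual query this would presumably be expressed by chaining `.order("created_at")` and `.order("id")` before `.limit(20)`.

```typescript
// Hypothetical subset of the golden-case columns.
type GoldenRow = { id: string; created_at: string; title: string };

// Plain string comparison; correct for uniformly formatted ISO-8601 timestamps.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Sort by (created_at, id) before truncating so the subset is stable
// regardless of the order in which the database returns rows.
function selectGoldenSubset(rows: GoldenRow[], limit = 20): GoldenRow[] {
  return [...rows]
    .sort(
      (a, b) =>
        compareStrings(a.created_at, b.created_at) || compareStrings(a.id, b.id),
    )
    .slice(0, limit);
}
```

The `id` tiebreaker matters when several golden cases were seeded in the same transaction and share a `created_at` value.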


- adaptive red-team: judge failure now yields broke=null (indeterminate),
  not false, so unjudged unsafe replies no longer inflate safety.
- auto-learn: compute baseline OLD outputs once and reuse across all
  candidates so evolutionary fitness is comparable.
- golden-set queries: deterministic order(created_at,id) before limit
  so calibration is stable when >20 active golden rows exist.
@criptogus merged commit e28bf90 into main on May 17, 2026
1 check passed