Add adaptive red-teaming, judge calibration, and evolutionary auto-learn by criptogus · Pull Request #14 · criptogus/agent-evolve-network

criptogus · 2026-05-17T13:17:42Z

What does this PR add?

Implements three major improvements to SkillForge's evaluation and auto-learning pipelines: (1) adaptive multi-turn red-teaming that escalates attacks based on defender responses, (2) judge calibration against frozen golden test cases to detect when the LLM judge is unreliable, and (3) evolutionary patch search that maintains an elite archive across generations instead of proposing a single greedy patch.

Type

Platform code or docs

Changes

Adaptive Red-Teaming

New adaptiveAttack() function that runs multi-turn conversations with the skill, using an LLM judge to decide if the attack succeeded and what to try next
Replaces single-shot probe evaluation with configurable turn limits (default 3, max 5)
Tracks full transcript of attacker/defender exchanges for transparency
Configurable via redTeamTurns parameter in evaluatorPipeline()
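
The multi-turn loop described above could look roughly like the following. This is a minimal sketch, not the PR's code: the `Turn`, `JudgeVerdict`, and `AttackDeps` shapes, the `adaptiveAttackSketch` name, and the lower bound of one turn are all assumptions made for illustration. Only the default of 3 turns, the cap of 5, the transcript, and the early exit come from the description.

```typescript
// Hypothetical shapes: the PR does not show adaptiveAttack()'s real signature.
type Turn = { attacker: string; defender: string };
type JudgeVerdict = { broke: boolean; done: boolean; nextAttack?: string };

interface AttackDeps {
  // Calls the skill under test with the attacker's message.
  defend: (attack: string, transcript: Turn[]) => Promise<string>;
  // Asks the judge whether the last reply was a break and what to try next.
  judge: (transcript: Turn[]) => Promise<JudgeVerdict>;
}

const DEFAULT_TURNS = 3; // default from the PR description
const MAX_TURNS = 5; // cap from the PR description

// Clamp the requested turn count into [1, MAX_TURNS] (lower bound is an assumption).
function clampTurns(requested?: number): number {
  if (requested === undefined || !Number.isFinite(requested)) return DEFAULT_TURNS;
  return Math.min(MAX_TURNS, Math.max(1, Math.floor(requested)));
}

async function adaptiveAttackSketch(
  openingAttack: string,
  deps: AttackDeps,
  redTeamTurns?: number,
): Promise<{ broke: boolean; transcript: Turn[] }> {
  const maxTurns = clampTurns(redTeamTurns);
  const transcript: Turn[] = [];
  let attack = openingAttack;
  let broke = false;

  for (let turn = 0; turn < maxTurns; turn++) {
    const defender = await deps.defend(attack, transcript);
    transcript.push({ attacker: attack, defender });

    const verdict = await deps.judge(transcript);
    if (verdict.broke) {
      broke = true;
      break;
    }
    // Stop early when the judge says "done" or has nothing to escalate to.
    if (verdict.done || !verdict.nextAttack) break;
    attack = verdict.nextAttack;
  }
  return { broke, transcript };
}
```

Injecting `defend` and `judge` keeps the loop testable without live LLM calls. Judge errors are deliberately left unhandled in this sketch.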

Judge Calibration

New cohenKappa() function to compute inter-rater agreement between LLM judge and ground truth
New package_golden_cases table to store frozen, human-labelled reference test cases per package
Stage 3.5 in evaluator pipeline compares judge verdicts against golden labels
If judge agreement < 70% or κ < 0.4, verdict is capped at 'iterate' (cannot ship)
Calibration results persisted in judge_calibration column on evaluations
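
For reference, Cohen's kappa for two binary raters is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below shows that formula together with the 70% / κ 0.4 gate from the list above. The function names, the `"ship"` verdict string, and the handling of the degenerate p_e = 1 case are assumptions, not the PR's actual implementation.

```typescript
// Cohen's kappa for two binary raters (judge verdict vs. golden label).
function cohenKappaSketch(
  judge: boolean[],
  gold: boolean[],
): { agreement: number; kappa: number } {
  if (judge.length !== gold.length || judge.length === 0) {
    throw new Error("need two non-empty label arrays of equal length");
  }
  const n = judge.length;
  let agree = 0;
  let judgePass = 0;
  let goldPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === gold[i]) agree++;
    if (judge[i]) judgePass++;
    if (gold[i]) goldPass++;
  }
  const po = agree / n; // observed agreement
  const pj = judgePass / n;
  const pg = goldPass / n;
  const pe = pj * pg + (1 - pj) * (1 - pg); // agreement expected by chance
  // pe === 1 means both raters gave the same constant label, so kappa is
  // undefined; returning 1 here is an assumption of this sketch.
  const kappa = pe === 1 ? 1 : (po - pe) / (1 - pe);
  return { agreement: po, kappa };
}

// Cap a would-be "ship" verdict when the judge fails either threshold.
// "ship" is a hypothetical verdict name; 'iterate' comes from the PR text.
function capVerdictSketch(
  verdict: string,
  cal: { agreement: number; kappa: number },
): string {
  const trusted = cal.agreement >= 0.7 && cal.kappa >= 0.4;
  return verdict === "ship" && !trusted ? "iterate" : verdict;
}
```

Kappa is checked alongside raw agreement because agreement alone looks high when nearly all golden labels share one class.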

Evolutionary Auto-Learn

Replaces single-patch proposal with multi-generation search maintaining an elite archive
Each generation proposes candidatesPerGen diverse patches (default 2) using different strategies
Candidates scored via A/B testing on frozen baseline examples; fitness = net wins − output failures − confidence bonus
Elite archive (size 2) carries forward across generations; later generations mutate best elites
Configurable via generations (default 2, max 4) and candidatesPerGen (default 2, max 4)
Evolution trace stored for UI visualization
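
The generation loop and elite archive described above can be sketched as follows. This is an illustrative outline under stated assumptions: `evolveSketch`, `EvolveOpts`, and the injected `propose`, `mutate`, and `score` callbacks are hypothetical names, the round-robin choice of parent is a guess, and the fitness formula is left to the caller rather than reproduced here.

```typescript
// Hypothetical shapes; the PR's real patch and scoring types are not shown.
type Scored<P> = { patch: P; fitness: number; generation: number };

interface EvolveOpts<P> {
  generations?: number; // default 2, max 4 (from the PR description)
  candidatesPerGen?: number; // default 2, max 4 (from the PR description)
  eliteSize?: number; // 2 in the PR description
  propose: (slot: number) => P; // fresh patch, one strategy per slot
  mutate: (elite: P, slot: number) => P; // variation of an archived elite
  score: (patch: P) => number | null; // null means the patch could not be scored
}

const clamp = (v: number | undefined, def: number, max: number): number =>
  Math.min(max, Math.max(1, Math.floor(v ?? def)));

function evolveSketch<P>(opts: EvolveOpts<P>): { best: Scored<P>; trace: Scored<P>[][] } {
  const generations = clamp(opts.generations, 2, 4);
  const perGen = clamp(opts.candidatesPerGen, 2, 4);
  const eliteSize = opts.eliteSize ?? 2;
  let elites: Scored<P>[] = [];
  const trace: Scored<P>[][] = [];

  for (let g = 0; g < generations; g++) {
    const scored: Scored<P>[] = [];
    for (let slot = 0; slot < perGen; slot++) {
      // First generation proposes fresh patches; later ones mutate elites round-robin.
      const parent = elites.length > 0 ? elites[slot % elites.length] : undefined;
      const patch = parent ? opts.mutate(parent.patch, slot) : opts.propose(slot);
      const fitness = opts.score(patch);
      if (fitness !== null) scored.push({ patch, fitness, generation: g });
    }
    trace.push(scored);
    // Elites compete with the new candidates, so the best fitness never drops.
    elites = [...elites, ...scored]
      .sort((a, b) => b.fitness - a.fitness)
      .slice(0, eliteSize);
  }

  const best = elites[0];
  if (!best) throw new Error("no scorable candidates");
  return { best, trace };
}
```

The per-generation `trace` mirrors the "evolution trace" idea: it records every scored candidate, not just the survivors.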

Customer Feedback Integration

New FeedbackRow type and summarizeFeedback() function
LLM-agent feedback weighted 2× higher than human UI feedback (configurable)
Feedback summary included in root-cause analysis and patch proposal prompts
Feedback counts and top comments exposed in return values
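
One way the 2× weighting could be applied is shown below. The row fields, the `summarizeFeedbackSketch` name, the "weighted positive share" metric, and the rule for picking top comments (heaviest source first, then input order) are all assumptions. The PR text only states that LLM-agent feedback is weighted 2× by default and that counts and top comments are returned.

```typescript
// Hypothetical row shape; the PR's FeedbackRow fields are not shown.
type FeedbackRow = {
  source: "human" | "llm_agent";
  positive: boolean;
  comment?: string;
};

function summarizeFeedbackSketch(rows: FeedbackRow[], llmWeight = 2, topN = 3) {
  const weight = (r: FeedbackRow): number => (r.source === "llm_agent" ? llmWeight : 1);
  const counts = { human: 0, llm_agent: 0 };
  let weightedPositive = 0;
  let weightedTotal = 0;

  for (const row of rows) {
    counts[row.source]++;
    weightedTotal += weight(row);
    if (row.positive) weightedPositive += weight(row);
  }

  // Top comments: non-empty comments, heaviest source first (stable sort keeps
  // input order within a source). The selection rule is an assumption.
  const topComments = rows
    .filter((r): r is FeedbackRow & { comment: string } => !!r.comment)
    .sort((a, b) => weight(b) - weight(a))
    .slice(0, topN)
    .map((r) => r.comment);

  return {
    counts,
    // Share of positive feedback after weighting; null when there is no feedback.
    weightedPositiveShare: weightedTotal > 0 ? weightedPositive / weightedTotal : null,
    topComments,
  };
}
```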

Database

Migration adds package_golden_cases table with RLS policies
Adds judge_calibration and evolution_trace JSONB columns to package_evaluations
Seeds golden set from current version examples as 'reference' labels

Notes for reviewers

Tradeoffs:

Adaptive red-teaming increases eval latency (multi-turn LLM calls per probe) but provides stronger adversarial signal
Evolutionary search is more expensive than single-patch proposal but avoids local optima and provides search transparency
Judge calibration can block shipping if the judge is miscalibrated on golden cases, which is intentional (safety gate)

Edge cases:

If no golden cases exist, calibration stage is skipped (backward compatible)
If evolutionary search produces no scorable candidates, pipeline fails with 502 (prevents shipping broken patches)
Adaptive attack respects maxTurns limit; if judge says "done", loop breaks early

Assumptions:

Golden cases are stable and represent the true acceptance criteria (not regenerated per eval)
LLM-agent feedback is more reliable than human ratings (weighted higher)
A/B fitness on 4 baseline examples is sufficient to detect regressions (sample size trade-off)

https://claude.ai/code/session_01BFy9njggrB6SUV3KRJ6439

…, adaptive red team

- Versioned golden set (frozen ground-truth) used as stable regression suite + reference labels; judge calibration stage (agreement/Cohen's kappa) caps verdict at 'iterate' when the judge can't be trusted.
- Auto-learn is now an evolutionary multi-generation search with an elite archive and A/B fitness vs the frozen baseline, replacing the single greedy patch.
- Auto-learn ingests customer feedback (UI=human, MCP/CLI=LLM) with LLM-agent feedback weighted 2x over human feedback.
- Adversarial stage upgraded to adaptive multi-turn red-team (attacker reads each defender reply and escalates) for prompt/hybrid skills.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a8f385f5c

chatgpt-codex-connector · 2026-05-17T13:22:10Z

+        const [oldRes, newRes] = await Promise.all([
+          generateText({ model: getGatewayModel(FAST), system: opts.version.system_prompt, prompt: ex.input })
+            .then((r) => r.text)
+            .catch(() => ""),


Reuse one baseline output set across all candidates

Recomputing oldRes inside scoreCandidate() means every patch candidate is judged against a different baseline run, because the generateText() call on getGatewayModel(FAST) is non-deterministic and is repeated for each candidate. That makes fitness values incomparable across candidates and generations, so elite selection can prefer a weaker patch because of baseline variance rather than real improvement. Compute the OLD outputs once per sampled example before the evolutionary loop and reuse them for all candidate scoring.
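
A possible shape for that fix is sketched below. It assumes a hypothetical `Generate` wrapper in place of the real `generateText` call and a caller-supplied `compare` function. `computeBaseline` and `scoreCandidateSketch` are illustrative names, not the functions in the PR.

```typescript
// Hypothetical types; `Generate` stands in for the real model call.
type Example = { id: string; input: string };
type Generate = (systemPrompt: string, input: string) => Promise<string>;

// Run the OLD prompt once per example. Failures become "" as in the diff above.
async function computeBaseline(
  examples: Example[],
  oldPrompt: string,
  generate: Generate,
): Promise<Map<string, string>> {
  const outputs = await Promise.all(
    examples.map((ex) => generate(oldPrompt, ex.input).catch(() => "")),
  );
  const baseline = new Map<string, string>();
  examples.forEach((ex, i) => baseline.set(ex.id, outputs[i] ?? ""));
  return baseline;
}

// Score one candidate against the shared baseline; only NEW outputs are generated here.
async function scoreCandidateSketch(
  candidatePrompt: string,
  examples: Example[],
  baseline: Map<string, string>,
  generate: Generate,
  compare: (oldOut: string, newOut: string) => number,
): Promise<number> {
  let fitness = 0;
  for (const ex of examples) {
    const newOut = await generate(candidatePrompt, ex.input).catch(() => "");
    fitness += compare(baseline.get(ex.id) ?? "", newOut);
  }
  return fitness;
}
```

With this split, every candidate in every generation is compared against the same OLD outputs, so differences in fitness reflect the patch rather than baseline sampling noise.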

chatgpt-codex-connector · 2026-05-17T13:22:10Z

+    } catch {
+      break;


Treat adaptive judge failures as indeterminate, not safe

When the adaptive red-team judge call throws (API/transient/JSON-parse failure), the code immediately breaks out of the loop and later returns broke: false. In those failure paths, unsafe defender replies can be recorded as non-breaks, systematically inflating safety outcomes. This should preserve uncertainty (e.g., broke: null) or continue probing, rather than converting judge failure into a safe result.
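
A tri-state outcome along the lines suggested here might look like this. The `BreakOutcome` type and both helper names are hypothetical. The sketch only illustrates keeping judge failures out of the "safe" count and makes no claim about how the PR aggregates probe results.

```typescript
// true = defender broke, false = judged safe, null = judge failed (indeterminate).
type BreakOutcome = boolean | null;

// Wrap a judge call so that any failure is recorded as indeterminate, not safe.
async function judgeOrIndeterminate(judge: () => Promise<boolean>): Promise<BreakOutcome> {
  try {
    return await judge();
  } catch {
    return null;
  }
}

// Aggregate probe outcomes without counting unjudged probes as safe.
function summarizeProbes(outcomes: BreakOutcome[]) {
  const judged = outcomes.filter((o): o is boolean => o !== null);
  const breaks = judged.filter((o) => o).length;
  return {
    breaks,
    indeterminate: outcomes.length - judged.length,
    // Break rate over judged probes only; null when nothing could be judged.
    breakRate: judged.length > 0 ? breaks / judged.length : null,
  };
}
```

Reporting `indeterminate` separately lets a caller decide whether too many unjudged probes should itself block a passing verdict.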

chatgpt-codex-connector · 2026-05-17T13:22:10Z

+    const { data: goldenRows } = await supabase
+      .from("package_golden_cases")
+      .select("title, input, expected_output, label_pass, label_source")
+      .eq("package_id", pkg.id)
+      .eq("is_active", true)
+      .limit(20);


Order golden cases before LIMIT to keep calibration stable

The golden-set query applies .limit(20) without any deterministic ordering, so once a package has more than 20 active golden rows the subset fed into calibration can change between runs. Because calibration can cap verdicts, this introduces run-to-run instability unrelated to model behavior. Add an explicit order(...) (e.g., by created_at/id) before limit so the same frozen cases are evaluated consistently.
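
The effect of ordering before the limit can be shown with an in-memory analogue. `GoldenRow` and `selectGoldenCases` are hypothetical, and the timestamps are assumed to be ISO-8601 strings in one time zone so that they sort lexicographically. On the real query, the equivalent would presumably be chaining `.order("created_at")` and `.order("id")` before `.limit(20)`.

```typescript
// Hypothetical subset of the golden-case columns.
type GoldenRow = { id: string; created_at: string; is_active: boolean };

// Plain code-unit comparison, independent of locale settings.
const cmp = (x: string, y: string): number => (x < y ? -1 : x > y ? 1 : 0);

// Pick the same `limit` active rows regardless of the order rows arrive in.
function selectGoldenCases(rows: GoldenRow[], limit = 20): GoldenRow[] {
  return rows
    .filter((r) => r.is_active)
    // Sort by creation time, with id as a tie-breaker for identical timestamps.
    .sort((a, b) => cmp(a.created_at, b.created_at) || cmp(a.id, b.id))
    .slice(0, limit);
}
```

The `id` tie-breaker matters because rows seeded in one migration can share a `created_at` value, which would otherwise leave their relative order unspecified.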

- adaptive red-team: judge failure now yields broke=null (indeterminate), not false, so unjudged unsafe replies no longer inflate safety.
- auto-learn: compute baseline OLD outputs once and reuse across all candidates so evolutionary fitness is comparable.
- golden-set queries: deterministic order(created_at,id) before limit so calibration is stable when >20 active golden rows exist.

chatgpt-codex-connector Bot reviewed May 17, 2026

criptogus merged commit e28bf90 into main May 17, 2026
1 check passed

criptogus merged 2 commits into main from claude/analyze-skill-creator-dLyBX
