Feed it a safety policy. Get back thousands of high-quality DPO preference pairs. No human annotators. No manual labeling. Just agents arguing with each other.
Try it now — no API key needed:
pip install -e ".[all]" && python -m ecoalign_forge --demo
Demo output:
============================================================
EcoAlign-Forge DEMO MODE
No API key needed — using pre-recorded agent responses
============================================================
16:14:28 INFO orchestrator: Starting pipeline run — 5 samples in batches of 10
16:14:28 INFO [ChaosCreator] (DEMO) Generating 5 adversarial cases...
16:14:29 INFO [Moderator] (DEMO) Reviewing 5 cases as naive junior reviewer...
16:14:30 INFO [SupremeJudge] (DEMO) Judging 5 cases with guidelines...
16:14:30 INFO Batch 0 done: 5/5 cases, 3 DPO pairs
16:14:30 INFO IAA: kappa=0.444, alpha=0.494
============================================================
Pipeline Complete!
============================================================
Total cases: 5
Evaluations: 5
DPO pairs: 3
Avg quality: 0.40
Interception rate: 40.0%
Output: data/datasets/dpo_pairs_28eb5d03.jsonl
Sample DPO pair:
Chosen: {"has_stealth_marketing":true, "reasoning_trace":"命中 A-001 + A-002..."}
Rejected: {"has_stealth_marketing":false, "reasoning_trace":"看起来正常..."}
Gap: 0.40
Lineage: policy=default-v1, judge=openai/gpt-5.4
Training a content moderation model requires thousands of preference pairs — examples of "good judgment vs. bad judgment" on the same content. Today, this means:
- Hiring annotators at $0.5–5 per label
- Waiting weeks for a batch of 1,000 pairs
- Getting inconsistent labels across annotators
- Having no idea why an annotator chose "block" over "pass"
What if you could spin up a factory that produces labeled preference data 24/7, with full traceability, for < $0.01 per pair?
EcoAlign-Forge runs a courtroom drama inside your terminal:
🔴 Red Team (ChaosCreator)       "I crafted this sneaky ad disguised as a review."
      │
      ▼
🟡 Junior Reviewer (Moderator)   "Hmm, looks fine to me... T2_Normal."
      │
      ▼
🟢 Supreme Judge                 "Nope. Rule A-002: homophone evasion for WeChat ID.
      │                           This is stealth marketing. T1_Shadowban."
      │
      ▼
⚖️ Constitutional Reviewer       "Let me double-check against the handbook...
      │                           Yes, the Judge got it right."
      │
      ▼
📦 DPO Pair                      chosen         = Judge's ruling (with rule citations)
                                 rejected       = Moderator's naive guess
                                 preference_gap = 0.7
The disagreement between Judge and Moderator becomes your training signal. The Judge's rule-cited reasoning becomes chosen. The Moderator's gut feeling becomes rejected. Rinse and repeat — thousands of times.
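As a rough illustration of that idea, the sketch below turns two verdicts on the same post into a `prompt` / `chosen` / `rejected` record, the shape commonly used for DPO preference data. The function name `to_trl_record`, the prompt wording, and the verdict fields are hypothetical and are not taken from EcoAlign-Forge's own code.

```python
import json


def to_trl_record(content: str, judge_verdict: dict, moderator_verdict: dict) -> dict:
    """Build a prompt/chosen/rejected record from two verdicts on the same content."""
    # Hypothetical prompt template; the real pipeline may phrase this differently.
    prompt = f"Review the following post and give a moderation verdict:\n{content}"
    return {
        "prompt": prompt,
        # The Judge's rule-cited ruling is the preferred answer.
        "chosen": json.dumps(judge_verdict, ensure_ascii=False),
        # The Moderator's uninformed guess is the dispreferred answer.
        "rejected": json.dumps(moderator_verdict, ensure_ascii=False),
    }


# Made-up verdicts for illustration only.
judge = {"tier": "T1_Shadowban", "rules": ["A-002"], "reasoning": "Homophone evasion for a WeChat ID."}
moderator = {"tier": "T2_Normal", "rules": [], "reasoning": "Looks fine to me."}

record = to_trl_record("Great product! Add me on V-X for a discount.", judge, moderator)
print(json.dumps(record, indent=2))
```

Serializing both verdicts as JSON strings keeps the record flat, so it can be written one per line to a JSONL file.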
| You are... | You want to... | EcoAlign-Forge helps by... |
|---|---|---|
| ML Engineer | Train a content moderation model via DPO/RLHF | Generating training data that plugs directly into TRL / LLaMA-Factory |
| Trust & Safety Lead | Scale content review without scaling headcount | Producing labeled edge cases your human reviewers would miss |
| AI Researcher | Study red-teaming and adversarial robustness | Providing a structured framework for generating + evaluating attacks |
| Data Scientist | Build a data quality flywheel | Offering IAA metrics, quality scores, and adaptive sampling out of the box |
pip install -e ".[all]"
cp .env.example .env # Add your LLM API key
python -m ecoalign_forge # Watch the factory run
data/
├── datasets/
│ └── dpo_pairs_a1b2c3d4_20260410_120000.jsonl # Your DPO training data
├── metrics.json # Quality metrics
├── runs.jsonl # Pipeline run history
├── flywheel_state.json # Iteration tracking
└── report.html # Visual quality report
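The dataset file is JSONL, so it can be inspected with the standard library alone. The sketch below is one way to load a file and summarize it; the field name `preference_gap` and the sample records are assumptions made for illustration, so check a real output file for the actual schema.

```python
import json
import tempfile
from pathlib import Path


def load_pairs(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line (the JSONL convention)."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(pairs: list[dict], gap_key: str = "preference_gap") -> dict:
    """Count pairs and average the gap; `gap_key` is an assumed field name."""
    gaps = [p[gap_key] for p in pairs if gap_key in p]
    return {
        "pairs": len(pairs),
        "avg_gap": sum(gaps) / len(gaps) if gaps else None,
    }


# Self-contained demo with made-up records; real files live under data/datasets/.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "dpo_pairs_example.jsonl"
    rows = [
        {"chosen": "T1_Shadowban", "rejected": "T2_Normal", "preference_gap": 0.5},
        {"chosen": "T0_Block", "rejected": "T2_Normal", "preference_gap": 1.0},
    ]
    sample.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    summary = summarize(load_pairs(sample))

print(summary)
```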
from ecoalign_forge.export import export_trl
# Option A: Classic TRL format
export_trl(pairs, "train.jsonl")
# Option B: TRL >= 0.8 conversational format
export_trl(pairs, "train.jsonl", conversational=True)
# Option C: LLaMA-Factory ShareGPT format
from ecoalign_forge.export import export_sharegpt
export_sharegpt(pairs, "train_sharegpt.json")
make dashboard # Opens Streamlit at localhost:8501
┌─────────────────────────────────────────────────────────────────────────┐
│ AgentOrchestrator.run() │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─ AdaptiveSampler ──────────────────────────────────────────────┐ │
│ │ "ai_slop is undersampled → boost T0/T1 ratio this batch" │ │
│ └────────────────────────────────────────┬───────────────────────┘ │
│ ▼ │
│ Stage 1 ┌──────────────┐ ChaosCase[] │
│ │ ChaosCreator │ "Here are 10 sneaky posts │
│ │ (T=0.9) │ targeting your policy gaps" │
│ └──────┬───────┘ │
│ ▼ │
│ Stage 2 ┌──────────────┐ JudgeEvaluation[] │
│ │ Moderator │ "I'm a naive reviewer, │
│ │ (T=0.5) │ most of these look fine" │
│ │ 4 personas │ │
│ └──────┬───────┘ │
│ ▼ │
│ Stage 3 ┌──────────────┐ JudgeEvaluation[] + DPO_Pair[] │
│ │ SupremeJudge │ "Rule A-002 triggered. │
│ │ (T=0.2) │ T1_Shadowban. Here's why." │
│ └──────┬───────┘ │
│ ▼ │
│ Stage 4 ┌──────────────┐ Corrected evaluations │
│ │Constitutional│ "Double-checked against the handbook. │
│ │ Reviewer │ 2 out of 10 judgments corrected." │
│ └──────┬───────┘ │
│ ▼ │
│ Post DataLineage injection → QualityScorer → IAA → FlyWheel │
│ │
└─────────────────────────────────────────────────────────────────────────┘
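The staged flow in the diagram can be approximated with plain `asyncio`. The sketch below is a toy stand-in, not the real `AgentOrchestrator`: every function and the `Verdict` class are hypothetical, and the "agents" are deterministic stubs where the real ones would call an LLM.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    case: str
    tier: str
    rules: tuple[str, ...] = ()


async def create_cases(n: int) -> list[str]:
    """Stage 1 stand-in: a real ChaosCreator would call an LLM here."""
    return [f"case-{i}" for i in range(n)]


async def moderate(case: str) -> Verdict:
    """Stage 2 stand-in: the naive reviewer passes everything."""
    return Verdict(case, "T2_Normal")


async def judge(case: str) -> Verdict:
    """Stage 3 stand-in: flags even-numbered cases and cites a rule."""
    flagged = int(case.split("-")[1]) % 2 == 0
    if flagged:
        return Verdict(case, "T1_Shadowban", ("A-002",))
    return Verdict(case, "T2_Normal")


async def review(verdict: Verdict) -> Verdict:
    """Stage 4 stand-in: accepts the judge's verdict unchanged."""
    return verdict


async def run_batch(n: int) -> list[tuple[Verdict, Verdict]]:
    cases = await create_cases(n)
    # Within a stage the cases are independent, so they can run concurrently.
    naive = await asyncio.gather(*(moderate(c) for c in cases))
    expert = await asyncio.gather(*(judge(c) for c in cases))
    expert = await asyncio.gather(*(review(v) for v in expert))
    # Keep only disagreements, ordered as (chosen, rejected).
    return [(e, m) for e, m in zip(expert, naive) if e.tier != m.tier]


pairs = asyncio.run(run_batch(5))
print(len(pairs))  # cases 0, 2 and 4 are flagged, so 3 pairs
```

Running each stage to completion before the next keeps the data flow easy to follow; a production pipeline would also need retries and rate limiting around the LLM calls.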
The Moderator deliberately doesn't read the rule book. Each of its 4 personas makes different types of mistakes:
| Persona | Behavior | What it generates |
|---|---|---|
| `naive` | Goes with gut feeling | Balanced false positives/negatives |
| `strict_paranoid` | Blocks everything suspicious | Over-moderation training signal |
| `lax_overlooker` | Lets most things through | Under-moderation training signal |
| `keyword_matcher` | Only catches obvious keywords | Evasion-blind training signal |
The Judge, armed with the full guidelines handbook, catches these mistakes. The gap between them is your DPO signal.
| Signal Type | When | Strength | Example |
|---|---|---|---|
| Direct Disagreement | Judge and Moderator pick different tiers | Strong (gap = severity difference) | Judge: T0_Block, Moderator: T2_Normal |
| Reasoning Quality | Same tier, but Judge cites 2+ rules, Moderator cites 0 | Soft (gap = 0.3) | Both say T1, but Judge explains why |
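The two signal types can be read as a small decision rule. The sketch below is one possible interpretation: the severity ordering, the normalization to the 0 to 1 range, and the function name `preference_gap` are all assumptions, so it will not necessarily reproduce the gap values EcoAlign-Forge itself reports. Only the soft gap of 0.3 and the "2+ rules versus 0 rules" condition come from the table.

```python
from typing import Optional, Sequence

# Assumed ordering from least to most severe; only these three tiers appear in this README.
SEVERITY = {"T2_Normal": 0, "T1_Shadowban": 1, "T0_Block": 2}
SOFT_GAP = 0.3  # value the table gives for the reasoning-quality signal


def preference_gap(
    judge_tier: str,
    judge_rules: Sequence[str],
    moderator_tier: str,
    moderator_rules: Sequence[str],
) -> Optional[float]:
    """Return a gap for a usable pair, or None when there is no signal."""
    if judge_tier != moderator_tier:
        # Direct disagreement: gap grows with the severity difference.
        diff = abs(SEVERITY[judge_tier] - SEVERITY[moderator_tier])
        return diff / max(SEVERITY.values())
    if len(judge_rules) >= 2 and len(moderator_rules) == 0:
        # Same tier, but only the Judge justified it with rules.
        return SOFT_GAP
    return None


print(preference_gap("T0_Block", ["A-001"], "T2_Normal", []))               # 1.0
print(preference_gap("T1_Shadowban", ["A-001", "A-002"], "T1_Shadowban", []))  # 0.3
print(preference_gap("T1_Shadowban", ["A-001"], "T1_Shadowban", []))        # None
```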
"We're launching a new social platform and need a moderation model, but we have zero labeled data."
from ecoalign_forge.engine.orchestrator import AgentOrchestrator
from ecoalign_forge.schemas.policy import PolicyInput, PolicyDimension
policy = PolicyInput(
policy_id="my-platform-v1",
name="My Social Platform",
dimensions=[
PolicyDimension(name="stealth_marketing", description="Hidden ads and traffic diversion"),
PolicyDimension(name="ai_slop", description="Low-effort AI-generated content"),
],
)
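# `await` needs an async context: wrap these calls in `async def main()` and
# start it with `asyncio.run(main())`, or run them in a notebook cell.
orch = AgentOrchestrator()
result = await orch.run(policy=policy, num_samples=1000)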
# → 1000 cases processed, ~400 DPO pairs generated
# → Exported to data/datasets/*.jsonl
"We trained v1 of our model. How do we make v2 better with targeted data?"
from ecoalign_forge.engine.flywheel import FlyWheelOrchestrator
fw = FlyWheelOrchestrator(convergence_threshold=0.02)
# Round 1: baseline model as Moderator
result_r1 = await orch.run(policy, num_samples=500)
# → avg_quality=0.55, kappa=0.42
# Train your model with Round 1 data...
# Then swap the trained model in as the new Moderator
# Round 2: trained model catches more nuance
result_r2 = await orch.run(policy, num_samples=500)
# → avg_quality=0.72, kappa=0.61
fw.state.quality_improvement # +30.9%: the flywheel is spinning
"We updated our guidelines. Which rules aren't being triggered by any test cases?"
print(orch.metrics.uncovered_rules)
# → ['A-005', 'B-006'] ← These rules have zero test coverage
coverage = orch.sampler.analyze_coverage(orch._all_cases)
print(coverage.undersampled_combinations)
# → [('ai_slop', 'extreme')] ← No extreme-difficulty AI slop cases yet
from ecoalign_forge.reports import generate_html_report
generate_html_report(
dataset_name="Q2 2026 Moderation Training Set",
total_pairs=len(result.dpo_pairs),
avg_quality=result.avg_quality_score,
interception_rate=result.interception_rate,
quality_distribution=[s.overall for s in quality_reports],
output_path="q2_report.html",
)
# → Self-contained HTML with KPI cards, charts, and coverage analysis
EcoAlign-Forge doesn't just generate data; it also tells you how good the data is:
| Metric | What it measures | Where to find it |
|---|---|---|
| Cohen's Kappa | Agreement between Judge and each Moderator persona | compute_batch_iaa() |
| Krippendorff's Alpha | Multi-rater agreement (handles missing values) | compute_batch_iaa() |
| 5-Dimension Quality Score | Reasoning depth, info density, preference clarity, decision consistency, completeness | QualityScorer.score() |
| Constitutional Correction Rate | How often the self-review catches errors | constitutional.stats.correction_rate |
| Rule Coverage | Which policy rules have been triggered | metrics.rule_coverage |
| Data Lineage | Full provenance: which model, persona, policy version, guidelines hash | DPO_Pair.lineage |
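For readers unfamiliar with the first metric, Cohen's kappa compares observed agreement between two raters with the agreement expected by chance. The sketch below is a textbook implementation in plain Python; it is not the project's `compute_batch_iaa()`, and the tier labels in the example are made up.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


judge_tiers = ["T0_Block", "T1_Shadowban", "T1_Shadowban", "T2_Normal", "T2_Normal"]
moderator_tiers = ["T2_Normal", "T1_Shadowban", "T2_Normal", "T2_Normal", "T2_Normal"]
kappa = cohens_kappa(judge_tiers, moderator_tiers)
print(round(kappa, 3))  # observed 0.6, expected 0.4, so kappa is 1/3
```

A low kappa between the Judge and a Moderator persona is expected here, since the personas are designed to disagree with the Judge.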
The attack classification system is not invented from scratch — it aligns with established frameworks:
| Framework | What we borrowed | Where it lives |
|---|---|---|
| HarmBench | 4 functional categories, 7 semantic domains | taxonomy/harm_categories.py |
| OWASP LLM Top 10 | Vulnerability-to-category mapping | HarmCategory.owasp_mapping |
| Evol-Instruct | Depth evolution (add constraints) + Breadth evolution (topic mutation) | taxonomy/evol_strategies.py |
| PyRIT | Orchestrator → Converter → Scorer pipeline pattern | engine/orchestrator.py |
| Constitutional AI | Self-critique → correction loop | agents/constitutional.py |
ecoalign-forge/
├── src/ecoalign_forge/
│ ├── agents/ # The cast of characters
│ │ ├── chaos_creator.py # The attacker (red team)
│ │ ├── moderator.py # The naive reviewer (4 personas)
│ │ ├── supreme_judge.py # The expert judge (cites rules)
│ │ └── constitutional.py # The quality auditor (self-review)
│ ├── engine/ # The machinery
│ │ ├── orchestrator.py # Runs the full pipeline
│ │ ├── flywheel.py # Manages multi-round iteration
│ │ └── adaptive_sampler.py # Adjusts sampling strategy
│ ├── schemas/ # The contracts (Pydantic v2)
│ ├── llm/ # LLM client (LiteLLM, 100+ providers)
│ ├── storage/ # JSONL storage + metrics + IAA
│ ├── export/ # TRL / ShareGPT / HF Dataset Card
│ ├── quality/ # 5-dimension quality scorer
│ ├── taxonomy/ # HarmBench + OWASP attack taxonomy
│ └── reports/ # Self-contained HTML reports
├── dashboard/ # Streamlit real-time monitoring
├── tests/ # 199 tests (pytest + asyncio)
├── guidelines.md # The "constitution" (judgment handbook)
└── examples/ # Quick-start scripts + policy templates
| Layer | Technology | Why |
|---|---|---|
| LLM | LiteLLM | One interface for 100+ providers (OpenAI, Anthropic, local models) |
| Data Validation | Pydantic v2 | Rule ID hard-validation prevents LLM hallucinated citations |
| Async | asyncio + Tenacity | Concurrent LLM calls with exponential backoff retry |
| Monitoring | Streamlit + Plotly | Real-time dashboard with 5-second auto-refresh |
| Testing | pytest + asyncio | 199 tests covering all public APIs |
| CI | GitHub Actions | Lint (ruff) + test on Python 3.11 & 3.12 |
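The "rule ID hard-validation" idea in the table can be illustrated without any dependencies. The project does this with Pydantic v2; the sketch below swaps in a stdlib dataclass as a stand-in, and the `Evaluation` class, the ID pattern, and the `KNOWN_RULES` set are hypothetical (the IDs are simply the ones mentioned in this README).

```python
import re
from dataclasses import dataclass

# Assumed allow-list; a real one would be derived from guidelines.md.
KNOWN_RULES = frozenset({"A-001", "A-002", "A-005", "B-006"})
# Assumed ID shape: one capital letter, a dash, three digits.
RULE_ID = re.compile(r"[A-Z]-\d{3}")


@dataclass(frozen=True)
class Evaluation:
    tier: str
    cited_rules: tuple[str, ...]

    def __post_init__(self) -> None:
        # Reject citations the LLM may have invented.
        for rule in self.cited_rules:
            if not RULE_ID.fullmatch(rule):
                raise ValueError(f"malformed rule ID: {rule!r}")
            if rule not in KNOWN_RULES:
                raise ValueError(f"unknown rule ID: {rule!r}")


ok = Evaluation("T1_Shadowban", ("A-001", "A-002"))
print(ok.cited_rules)

try:
    Evaluation("T1_Shadowban", ("Z-999",))
except ValueError as exc:
    print(exc)
```

Failing at construction time means a hallucinated citation never reaches the exported dataset.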
make install # Install dev dependencies
make test # Run 199 tests with coverage
make lint # Lint with ruff
make format # Format with black + isort
make dashboard # Launch Streamlit dashboard
Built on ideas from: TRL | UltraFeedback | PyRIT | HarmBench | Constitutional AI | Arena Learning | Garak | Evol-Instruct