Turning failed agent trajectories into high-quality training data
Quickstart • How It Works • Architecture • Usage • Related projects • Citation
In LLM agent training, failed tool-use trajectories are routinely discarded. This is wasteful — a trajectory that fails Goal A may be a perfect success for Goal B.
AgentHER borrows the core insight from Hindsight Experience Replay (HER) in reinforcement learning: instead of discarding failures, we relabel the goal to match what was actually achieved, creating valid training data from every trajectory.
| | Original (Failed) | Hindsight (Success) |
|---|---|---|
| Prompt | "Find copper wire under $5/kg" | "Find copper wire suppliers and compare pricing" |
| Trajectory | Searched 7 suppliers, best found at $5.30/kg | (same trajectory) |
| Label | ❌ Failure | ✅ Success |
The agent's work was thorough and correct — it just didn't meet an arbitrary price constraint. AgentHER recovers this data.
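The relabeling idea can be shown with a self-contained sketch using plain dicts (illustrative only — this is not AgentHER's API; field names here are made up for the example):

```python
# A trajectory that failed its original goal.
failed = {
    "prompt": "Find copper wire under $5/kg",
    "trajectory": "Searched 7 suppliers; best price found: $5.30/kg",
    "label": "failure",  # the $5/kg constraint was not met
}

def relabel(example: dict, achieved_goal: str) -> dict:
    """Relabel a failed example against the goal it actually achieved."""
    return {
        "prompt": achieved_goal,                  # goal rewritten in hindsight
        "trajectory": example["trajectory"],      # trajectory reused verbatim
        "label": "success",                       # it satisfies the new goal
    }

hindsight = relabel(failed, "Find copper wire suppliers and compare pricing")
assert hindsight["label"] == "success"
assert hindsight["trajectory"] == failed["trajectory"]
```

The trajectory itself is never modified; only the goal and the success label change.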
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐     ┌────────────────┐
│ 1. Failure      │────▸│ 2. Outcome       │────▸│ 3. Prompt       │────▸│ 4. Data        │
│    Detector     │     │    Extractor     │     │    Relabeler    │     │    Augmenter   │
│                 │     │                  │     │                 │     │                │
│ Is this really  │     │ What did the     │     │ Reverse-        │     │ Package as     │
│ a failure?      │     │ agent achieve?   │     │ engineer a new  │     │ SFT / DPO /    │
│ Recoverable?    │     │                  │     │ matching prompt │     │ ShareGPT       │
└─────────────────┘     └──────────────────┘     └─────────────────┘     └────────────────┘
```
Stage 1 — Failure Detector: Validates whether the trajectory truly fails, classifies the failure type (constraint violation, wrong result, tool error, etc.), and assesses recoverability. Supports rule-based (free) or LLM-judge modes.
Stage 2 — Outcome Extractor: Analyzes observations to build a factual summary of what the agent actually accomplished, ignoring the original goal entirely.
Stage 3 — Prompt Relabeler: Uses an LLM to craft a natural, human-like prompt that the trajectory perfectly satisfies. Includes confidence scoring and retry logic.
Stage 4 — Data Augmenter: Packages the new (prompt, trajectory) pair into standard training formats: SFT, DPO (with chosen/rejected pairs), or ShareGPT multi-turn.
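To make the Stage 4 targets concrete, here are minimal examples of the three output shapes as plain dicts (field names are illustrative and follow common SFT/DPO/ShareGPT conventions; see `data_augmenter.py` for the actual schema):

```python
hindsight_prompt = "Search for flights to Tokyo and compare prices across airlines"
trajectory_text = ("Thought: Searching for flights\n"
                   "Action: flight_search(...)\nObservation: ...")

# SFT: a single (prompt, completion) pair.
sft_record = {"prompt": hindsight_prompt, "completion": trajectory_text}

# DPO: the trajectory under the goal it satisfies (chosen) contrasted
# with a response it should be preferred over (rejected).
dpo_record = {
    "prompt": hindsight_prompt,
    "chosen": trajectory_text,
    "rejected": "I could not complete this task.",
}

# ShareGPT-style multi-turn conversation.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": hindsight_prompt},
        {"from": "gpt", "value": trajectory_text},
    ]
}
```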
```
agenther/
├── models.py            # Pydantic data models (AgentStep, FailedTrajectory, etc.)
├── constants.py         # Shared thresholds (min observation length, truncation, etc.)
├── llm_client.py        # OpenAI-compatible LLM client with structured output
├── prompts.py           # Jinja2 prompt templates + steps_for_prompt()
├── failure_detector.py  # Stage 1: rule-based + LLM failure classification
├── outcome_extractor.py # Stage 2: extract actual achievements
├── prompt_relabeler.py  # Stage 3: reverse-engineer hindsight prompts
├── data_augmenter.py    # Stage 4: SFT/DPO/ShareGPT formatting
├── pipeline.py          # End-to-end pipeline orchestrator
└── cli.py               # Command-line interface
```
```bash
# Recommended: use a virtual environment
python -m venv .venv && source .venv/bin/activate   # Linux/macOS
# or: .venv\Scripts\activate                        # Windows

pip install -e .

# Optional, for running tests: pip install -e ".[dev]"
```

Run the bundled example:

```bash
python examples/run_example.py --rule-based
```

Set your API key for LLM-backed stages:

```bash
export OPENAI_API_KEY="your-key"
```
```bash
# Process failed trajectories → SFT data
agenther run examples/example_trajectories.json -f sft -o outputs/sft_data.jsonl

# Generate DPO pairs
agenther run examples/example_trajectories.json -f dpo -o outputs/dpo_data.jsonl

# Validate input format
agenther validate examples/example_trajectories.json

# With vLLM / Ollama / any OpenAI-compatible endpoint
agenther run data.json --model "llama3" --base-url "http://localhost:8000/v1"
```

```python
from agenther import AgentHERPipeline, PipelineConfig
from agenther.models import FailedTrajectory, AgentStep, OutputFormat

# Define a failed trajectory
trajectory = FailedTrajectory(
    original_prompt="Find flights to Tokyo under $500",
    steps=[
        AgentStep(
            thought="Searching for flights",
            action_name="flight_search",
            action_input={"destination": "Tokyo", "max_price": 500},
            observation="Found: ANA $680, JAL $720, United $590",
        ),
    ],
    final_answer="No flights under $500 found.",
    failure_reason="All flights exceed $500 budget",
)

# Run the pipeline
config = PipelineConfig(model="gpt-4o", output_format=OutputFormat.SFT)
pipeline = AgentHERPipeline(config)
result = pipeline.process(trajectory)

if result.success:
    print(f"Hindsight prompt: {result.relabeled.hindsight_prompt}")
    # e.g., "Search for flights to Tokyo and compare prices across airlines"
```

Provide failed trajectories as JSON or JSONL; `steps` must contain at least one step.
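As a rough standalone illustration of these documented requirements (this is not `agenther validate`, which may check more; the schema it checks against is shown just below), a minimal validator might look like:

```python
import json

# Required top-level fields per the documented input schema.
REQUIRED = {"original_prompt", "steps", "final_answer", "failure_reason"}

def validate_trajectory(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = [f"missing field: {k}" for k in REQUIRED - record.keys()]
    steps = record.get("steps") or []
    if not steps:
        problems.append("steps must contain at least one step")
    for i, step in enumerate(steps):
        for key in ("thought", "action_name", "action_input", "observation"):
            if key not in step:
                problems.append(f"step {i} missing {key}")
    return problems

record = json.loads('''{
  "original_prompt": "Find flights under $500",
  "steps": [{"thought": "search", "action_name": "flight_search",
             "action_input": {}, "observation": "none under $500"}],
  "final_answer": "No flights found.",
  "failure_reason": "budget constraint"
}''')
assert validate_trajectory(record) == []
assert any("at least one step" in p
           for p in validate_trajectory({**record, "steps": []}))
```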
```json
{
  "trajectory_id": "optional_id",
  "original_prompt": "The user's original request",
  "steps": [
    {
      "thought": "Agent's reasoning",
      "action_name": "tool_name",
      "action_input": {"key": "value"},
      "observation": "Tool output"
    }
  ],
  "final_answer": "Agent's final response",
  "failure_reason": "Why this is considered a failure"
}
```

CLI options override defaults; there is no config file loading. For reference, `configs/default.yaml` documents the same options (use it as a template; pass values via the CLI or `PipelineConfig` in code):
```yaml
llm:
  model: "gpt-4o"
  temperature: 0.3
pipeline:
  use_llm_detector: false  # Rule-based is faster and free
  use_llm_extractor: true  # LLM gives better outcome extraction
  output_format: "sft"     # sft | dpo | sharegpt
  min_confidence: 0.5      # Quality threshold for relabeling
```

To run the test suite:

```bash
pip install -e ".[dev]"
pytest -v
```

Known limitations:

- Batch processing is sequential — no parallelism; large batches may be slow.
- No config file loading — options are passed via the CLI or `PipelineConfig` in code.
- Rule-based stages are heuristics — for best quality, enable the LLM detector/extractor when cost allows.
Issues and pull requests are welcome on GitHub.
- AdaRubrics — Adaptive dynamic rubric evaluator for agent trajectories: generates task-specific dimensions and scores runs for filtering/RLHF. Use it to score or filter relabeled data from AgentHER.
- AgentSynth — Synthetic agent data pipeline (forward + back-translation, execution-based reject sampling). AgentHER can relabel failed or low-quality synthetic runs into valid SFT/DPO data.
- trajectory_tokenization — ReAct with trajectory tokenization: compresses long (Thought, Action, Observation) history so long-horizon runs fit in context. Addresses context length; AgentHER addresses reuse of failed trajectories.
```bibtex
@software{agenther2025,
  title  = {AgentHER: Hindsight Experience Replay for LLM Agents},
  author = {Ding, Liang},
  year   = {2025},
  url    = {https://github.com/alphadl/AgentHER},
}
```

Apache 2.0
