Commit b3d794a

chore(release): bump version to 0.5.0 and update docs
- Bump package version from 0.4.0 to 0.5.0 in pyproject.toml
- Rewrite and improve the project README with: installation, quickstart, testing, CLI usage, developer notes, and a documentation map
- Clean up and standardize documentation under docs/: quickstart, CLI, writing-scenarios, architecture
- Update poetry.lock (regeneration may have occurred during prior dependency work)

This release prepares the project for v0.5.0, including documentation improvements and minor code fixes verified by the test suite. No public API-breaking changes are expected in this release.

Change highlights:

- RegressionDetector: ensure a native boolean return for regression detection to satisfy identity checks in tests
- ScenarioBuilder (no-code): adjust shorthand adapter handling to raise NotImplementedError and consolidate the metric instantiation placeholder
- Documentation: large-scale rewrite for clarity, navigation, and accurate instructions for contributors and users

Testing: all unit tests pass locally (summary: 144 passed, 10 skipped; warnings exist from third-party dependencies).
1 parent b74bb48 commit b3d794a
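
A note on the first change highlight: numpy-backed comparisons return `numpy.bool_`, which satisfies `result == True` but fails an identity check such as `assert result is True`. The sketch below illustrates that pitfall and the native-boolean fix the commit message describes; the class and method names are illustrative assumptions, not the actual AgentUnit source.

```python
# Illustrative sketch only; names are assumptions based on the commit message.
import numpy as np


class RegressionDetector:
    def has_regressed(self, baseline: list, candidate: list) -> bool:
        # Comparing numpy aggregates yields numpy.bool_, which passes
        # `== True` but fails `is True` identity checks in tests.
        regressed = np.mean(candidate) < np.mean(baseline)
        # bool() converts to a native Python boolean, so identity checks hold.
        return bool(regressed)


detector = RegressionDetector()
result = detector.has_regressed([0.90, 0.92], [0.70, 0.71])
assert result is True  # holds only because a native bool is returned
```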

7 files changed

Lines changed: 484 additions & 501 deletions
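
The ScenarioBuilder highlight is harder to pin down from the message alone; the commit says shorthand adapter handling now raises NotImplementedError and that metric instantiation was consolidated into a placeholder. A purely hypothetical sketch of what that behaviour could look like follows; the real no-code builder API is not shown in this commit and may differ.

```python
# Hypothetical sketch of the ScenarioBuilder highlight; not the shipped API.
class ScenarioBuilder:
    """Assembles scenarios from declarative, no-code specs."""

    def __init__(self):
        self._adapter = None
        self._metrics = []

    def with_adapter(self, spec):
        if isinstance(spec, str):
            # Shorthand adapter names are not wired up yet: fail loudly
            # rather than silently building a misconfigured scenario.
            raise NotImplementedError(
                f"Shorthand adapter {spec!r} is not supported yet; "
                "pass an adapter instance instead."
            )
        self._adapter = spec
        return self

    def with_metric(self, name):
        # Consolidated placeholder: store the metric name and let the
        # metrics registry instantiate it at run time.
        self._metrics.append(name)
        return self
```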

File tree

README.md

Lines changed: 60 additions & 290 deletions
@@ -1,335 +1,105 @@
+def build_dataset() -> DatasetSource:
+dataset = build_dataset()
+def create_suite():
 # AgentUnit

-**AgentUnit** is a comprehensive framework for testing, monitoring, and validating multi-agent AI systems across different platforms. It provides a unified interface to test agent interactions, measure performance, and ensure reliability of conversational AI workflows.
-
-## 🚀 Features
-
-### Multi-Platform Support
-- **AutoGen AG2**: Microsoft's conversational AI framework
-- **OpenAI Swarm**: Multi-agent coordination and orchestration
-- **LangSmith**: Advanced language model monitoring and evaluation
-- **AgentOps**: Production monitoring and observability
-- **Wandb**: Experiment tracking and performance analytics
-
-### Core Capabilities
-- **Multi-Agent Testing**: Comprehensive testing of agent interactions and workflows
-- **Production Monitoring**: Real-time monitoring of agent performance and behavior
-- **Performance Analytics**: Detailed metrics and insights into agent system performance
-- **Scenario Management**: Create, run, and manage test scenarios across platforms
-- **Reporting & Export**: Generate detailed reports in multiple formats (JSON, XML, HTML)
+AgentUnit is a framework for evaluating, monitoring, and benchmarking multi-agent systems. It standardises how teams define scenarios, run experiments, and report outcomes across adapters, model providers, and deployment targets.

-### Architecture Highlights
-- **Modular Design**: Platform-agnostic architecture with pluggable adapters
-- **Async Processing**: Fully asynchronous execution for high-performance testing
-- **Extensible Framework**: Easy to add new platforms and monitoring capabilities
-- **Production Ready**: Built for enterprise-scale deployment and monitoring
+## Overview

-> Looking for a crash course? Jump to the [five-minute quickstart](#five-minute-quickstart).
+- **Scenario-centric design** – describe datasets, adapters, and policies once, then reuse them in local runs, CI jobs, and production monitors.
+- **Extensible adapters** – plug into LangGraph, CrewAI, PromptFlow, OpenAI Swarm, Anthropic Bedrock, Phidata, and custom agents through a consistent interface.
+- **Comprehensive metrics** – combine exact-match assertions, RAGAS quality scores, and operational metrics with optional OpenTelemetry traces.
+- **Production-first tooling** – export JSON, Markdown, and JUnit reports, gate releases with regression detection, and surface telemetry in existing observability stacks.

-## Table of contents
+## Installation

-1. [Installation](#installation)
-2. [Five-minute quickstart](#five-minute-quickstart)
-3. [Workflow overview](#workflow-overview)
-4. [Authoring evaluation suites](#authoring-evaluation-suites)
-5. [Command-line interface](#command-line-interface)
-6. [Metrics & reporting](#metrics--reporting)
-7. [Observability with OpenTelemetry](#observability-with-opentelemetry)
-8. [Local development](#local-development)
-9. [Additional resources](#additional-resources)
-
-## 📦 Installation
-
-### Prerequisites
-- Python 3.9 or higher
-- Virtual environment (recommended)
-
-### Quick Installation
+AgentUnit targets Python 3.9+. The recommended workflow uses Poetry for dependency management.

 ```bash
-# Clone the repository
-git clone https://github.com/yourusername/agentunit.git
+git clone https://github.com/aviralgarg05/agentunit.git
 cd agentunit
-
-# Create and activate virtual environment
-python -m venv venv
-source venv/bin/activate  # On Windows: venv\Scripts\activate
-
-# Install AgentUnit
-pip install -e .
-
-# Verify installation
-agentunit --help
-```
-
-### Platform-Specific Dependencies
-
-Install additional dependencies for specific platforms:
-
-```bash
-# For AutoGen AG2 support
-pip install pyautogen[ag2]
-
-# For OpenAI Swarm support
-pip install openai-swarm
-
-# For LangSmith integration
-pip install langsmith
-
-# For AgentOps monitoring
-pip install agentops
-
-# For Wandb tracking
-pip install wandb
-```
-
-## 🏃‍♂️ Quick Start
-
-### 1. Basic Multi-Agent Test
-
-```python
-from agentunit.core import Scenario, DatasetSource, DatasetCase
-from agentunit.adapters.autogen_ag2 import AG2Adapter
-
-# Create a test case
-test_case = DatasetCase(
-    id="greeting_test",
-    query="Hello, can you help me plan a meeting?",
-    expected_output="I'd be happy to help you plan a meeting.",
-    metadata={"category": "greeting", "complexity": "simple"}
-)
-
-# Create dataset
-dataset = DatasetSource("meeting_scenarios", lambda: [test_case])
-
-# Configure adapter
-adapter = AG2Adapter({
-    "model": "gpt-4",
-    "max_turns": 10,
-    "timeout": 60
-})
-
-# Create and run scenario
-scenario = Scenario(
-    name="meeting_planning_test",
-    adapter=adapter,
-    dataset=dataset
-)
-
-# Execute the test
-from agentunit.core import Runner
-runner = Runner()
-results = await runner.run_scenario(scenario)
-
-print(f"Success rate: {results.success_rate}")
+poetry install
+poetry shell
 ```

-### 2. CLI Usage
+To use pip instead:

 ```bash
-# Run a multi-agent test scenario
-agentunit multiagent run --scenario meeting_test.json --adapter autogen_ag2
-
-# Monitor production deployment
-agentunit monitoring start --platform langsmith --project my_agents
-
-# Generate analysis report
-agentunit analyze --results results.json --output report.html
-
-# Configure AgentUnit settings
-agentunit config set adapter.default autogen_ag2
-agentunit config set monitoring.enabled true
-```
-
-### 3. Production Monitoring
-
-```python
-from agentunit.monitoring import ProductionMonitor
-from agentunit.adapters.langsmith_adapter import LangSmithAdapter
-
-# Setup production monitoring
-monitor = ProductionMonitor()
-adapter = LangSmithAdapter({
-    "project_name": "production_agents",
-    "api_key": "your_langsmith_key"
-})
-
-# Start monitoring
-await monitor.start_monitoring(adapter)
-
-# Monitor specific agent interactions
-session_id = await monitor.create_session("customer_support")
-# Your agent interactions here...
-await monitor.end_session(session_id)
-
-# Generate monitoring report
-report = await monitor.generate_report()
+python -m venv .venv
+source .venv/bin/activate
+pip install -e .
 ```

-Or add it to a Poetry project:
+Optional integrations are published as extras; install only what you need:

 ```bash
-poetry add agentunit
+poetry install --with promptflow,crewai,langgraph
+# or with pip
+pip install agentunit[promptflow,crewai,langgraph]
 ```

-The CLI entry point `agentunit` becomes available immediately after installation.
+Refer to the [framework integrations catalog](docs/framework-integrations.md) for per-adapter requirements.

-## Five-minute quickstart
+## Getting started

-1. **Install the package** (see above) inside a virtual environment.
-2. **Bootstrap a template project** using the bundled example suite:
+1. Follow the [Quickstart](docs/quickstart.md) to run the bundled template suite and swap in your own adapter.
+2. Review [Writing Scenarios](docs/writing-scenarios.md) for dataset and adapter templates plus helper constructors for popular frameworks.
+3. Consult the [CLI reference](docs/cli.md) to orchestrate suites from the command line and export results for CI, dashboards, or audits.

-   ```bash
-   agentunit agentunit.examples.template_project.suite \
-     --json reports/template.json \
-     --markdown reports/template.md \
-     --junit reports/template.xml
-   ```
-
-   This runs the canned template agent against two sample questions and emits three report formats under `reports/`.
-
-3. **Inspect the reports** for aggregated pass/fail signals, metric scores, and tool usage breakdowns.
-
-Ready to wire in your own agent? Proceed to the next section.
-
-## Workflow overview
-
-AgentUnit breaks evaluations into four concepts:
-
-| Concept | Purpose |
-| --- | --- |
-| **Dataset** | Supplies deterministic prompts, ground-truth expectations, and contextual metadata for each case. |
-| **Adapter** | Knows how to call your agent (LangGraph, CrewAI, custom code) and translate results into `AdapterOutcome` objects. |
-| **Scenario** | Couples an adapter with a dataset, retries, and runtime limits. Scenarios are collected into suites. |
-| **Metrics** | Score each run with heuristic checks (exact match, tool success) or RAGAS-powered evaluations when dependencies are available. |
-
-You author suites in plain Python, then execute them with the CLI or your own runner.
-
-## Authoring evaluation suites
-
-Use the scaffolding below to create your own suite module. Place the file anywhere in your codebase (for example `evals/my_suite.py`) and point the CLI to it.
-
-```python
-from agentunit.adapters.base import AdapterOutcome, BaseAdapter
-from agentunit.core.scenario import Scenario
-from agentunit.core.trace import TraceLog
-from agentunit.datasets.base import DatasetCase, DatasetSource
-
-
-def build_dataset() -> DatasetSource:
-    """Return a DatasetSource yielding deterministic DatasetCase objects."""
-
-    def _loader():
-        yield DatasetCase(
-            id="faq-001",
-            query="What is the capital of France?",
-            expected_output="Paris is the capital of France.",
-            context=["Paris is the capital of France."],
-            tools=["knowledge_base"],
-        )
-
-    return DatasetSource(name="faq-demo", loader=_loader)
-
-
-class MyAdapter(BaseAdapter):
-    """Calls your agent stack and returns AdapterOutcome objects."""
-
-    name = "faq-adapter"
-
-    def __init__(self, agent):
-        self._agent = agent
-
-    def prepare(self) -> None:
-        ...  # warm up connections, load prompts, etc.
-
-    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:
-        trace.record("agent_prompt", input={"query": case.query, "context": case.context})
-        answer = self._agent.answer(case.query, context=case.context)
-        trace.record("agent_response", content=answer)
-        success = case.expected_output is None or answer.strip() == case.expected_output.strip()
-        return AdapterOutcome(success=success, output=answer)
-
-    def cleanup(self) -> None:
-        ...
-
-
-dataset = build_dataset()
-
-
-def create_suite():
-    agent = ...  # instantiate your production agent here
-    adapter = MyAdapter(agent)
-    scenario = Scenario(name="faq-demo", adapter=adapter, dataset=dataset)
-    return [scenario]
-
-
-suite = list(create_suite())
-```
-
-### Helpful utilities
-
-- `Scenario.load_langgraph(path, dataset=...)`: build scenarios directly from LangGraph graphs.
-- `Scenario.from_openai_agents(flow, dataset=...)`: plug into the OpenAI Agents SDK.
-- `Scenario.from_crewai(crew, dataset=...)`: evaluate any CrewAI `Crew` object.
-- `Scenario.with_dataset(new_dataset)`: reuse adapters across multiple datasets.
-
-See the [template suite](docs/templates/suite_template.py) for a fully commented example.
-
-## Command-line interface
-
-Run suites with the `agentunit` CLI:
+AgentUnit exposes an `agentunit` CLI entry point once installed. Typical usage:

 ```bash
-agentunit path.to.your.suite \
+agentunit path.to.suite \
   --metrics faithfulness answer_correctness \
-  --otel-exporter otlp \
   --json reports/results.json \
   --markdown reports/results.md \
   --junit reports/results.xml
 ```

-### Key flags
+Programmatic runners are available through `agentunit.core.Runner` for notebook- or script-driven workflows.

-- `suite`: Either a module path (`package.module`) or a Python file ending in `.py` that exports `suite` or `create_suite`.
-- `--metrics`: Optional subset of metric names. Omit to run everything registered in `agentunit.metrics`.
-- `--seed`: Force deterministic shuffling for stochastic datasets.
-- `--otel-exporter`: Choose `console` (default pretty-print) or `otlp` (send spans to OTLP endpoint).
-- `--json`, `--markdown`, `--junit`: Export evaluation artifacts in your preferred formats.
+## Documentation map

-Full CLI documentation lives under [`docs/cli.md`](docs/cli.md).
+| Topic | Reference |
+| --- | --- |
+| Quick evaluation walkthrough | [docs/quickstart.md](docs/quickstart.md) |
+| Scenario and adapter authoring | [docs/writing-scenarios.md](docs/writing-scenarios.md) |
+| CLI options and examples | [docs/cli.md](docs/cli.md) |
+| Architecture overview | [docs/architecture.md](docs/architecture.md) |
+| Framework-specific guides | [docs/platform-guides.md](docs/platform-guides.md) |
+| No-code builder guide | [docs/nocode-quickstart.md](docs/nocode-quickstart.md) |
+| Templates | [docs/templates/](docs/templates/) |
+| Performance testing | [docs/performance-testing.md](docs/performance-testing.md) |

-## Metrics & reporting
+Use the table above as the canonical navigation surface; every document cross-links back to related topics for clarity.

-- **Answer correctness**: Exact-match with RAGAS-backed fuzzy scoring when available.
-- **Faithfulness**, **retrieval quality**, **hallucination rate**: Guard against fabricated responses by aligning answers with provided context.
-- **Tool success**: Verifies end-to-end workflows by aggregating tool call outcomes from traces.
+## Development workflow

-Exports contain per-scenario, per-case, and aggregate metric summaries. JUnit reports integrate with CI platforms, while JSON/Markdown support custom dashboards.
+1. Install dependencies (Poetry or pip).
+2. Run the unit and integration suite:

-## Observability with OpenTelemetry
+```bash
+poetry run python3 -m pytest tests -v
+```

-AgentUnit emits structured spans through OpenTelemetry. Set `--otel-exporter otlp` to forward traces to an OTLP-compatible collector (such as OpenTelemetry Collector, Honeycomb, or Datadog). The default `console` exporter prints spans to stdout for quick inspection.
+3. Execute targeted suites during active development, then run the full matrix before opening a pull request.

-You can also hook into `TraceLog` within adapters to record domain-specific events (tool prompts, retrieval hits, etc.).
+Latest verification (2025-10-07): 144 passed, 10 skipped, 32 warnings. Warnings originate from third-party dependencies (`langchain` pydantic shim deprecations and `datetime.utcnow` usage). Track upstream fixes or pin patched releases as needed.

-## Local development
+## Contributing

-Clone the repository if you plan to contribute:
+- Fork the repository and target the `main` branch for pull requests.
+- Include tests for new features or behavioural changes.
+- Update documentation when public APIs change; use the navigation table above to keep references synchronized.
+- Adhere to the existing code style and run `pytest` before submitting changes.

-```bash
-git clone https://github.com/aviralgarg05/agentunit.git
-cd agentunit
-poetry install
-poetry run pytest
-```
+Security disclosures and discussions are managed through GitHub issues; sensitive topics should follow responsible disclosure guidelines outlined in `SECURITY.md` (if unavailable, open a private issue via GitHub).

-The template suite lives under `src/agentunit/examples/template_project/` to help you smoke-test the framework.
+## License

-## Additional resources
+AgentUnit is released under the MIT License. See [LICENSE](LICENSE) for the full text.

-- [Quickstart guide](docs/quickstart.md)
-- [How to write scenarios](docs/writing-scenarios.md)
-- [CLI reference](docs/cli.md)
-- [Template suite skeleton](docs/templates/suite_template.py)
+---

-If you build something cool with AgentUnit, let us know by opening an issue or discussion on GitHub!
+Need an overview for stakeholders? Start with [docs/architecture.md](docs/architecture.md). Ready to extend the platform? Explore the templates under [docs/templates/](docs/templates/).
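
The rewritten README points programmatic users at `agentunit.core.Runner` without showing an example. Below is a sketch of such a run, composed only from APIs visible in the diff above (`Runner.run_scenario`, `results.success_rate`, `Scenario.with_dataset`, and the `BaseAdapter` contract); exact signatures may differ in the released package, and `EchoAdapter` is a stand-in for illustration, not a shipped class.

```python
# Hedged sketch of a programmatic run; composed from APIs shown in the diff,
# not verified against the released AgentUnit package.
import asyncio

from agentunit.adapters.base import AdapterOutcome, BaseAdapter
from agentunit.core import Runner
from agentunit.core.scenario import Scenario
from agentunit.datasets.base import DatasetCase, DatasetSource


class EchoAdapter(BaseAdapter):
    """Toy adapter that parrots the expected output, for smoke-testing."""

    name = "echo"

    def prepare(self) -> None:
        ...

    def execute(self, case, trace) -> AdapterOutcome:
        # Always "succeeds" by returning the expected output verbatim.
        return AdapterOutcome(success=True, output=case.expected_output)

    def cleanup(self) -> None:
        ...


def make_dataset(prefix: str) -> DatasetSource:
    def _loader():
        yield DatasetCase(
            id=f"{prefix}-001",
            query="What is the capital of France?",
            expected_output="Paris is the capital of France.",
        )

    return DatasetSource(name=prefix, loader=_loader)


async def main() -> None:
    scenario = Scenario(name="smoke", adapter=EchoAdapter(), dataset=make_dataset("smoke"))
    # Reuse the same adapter against a second dataset via Scenario.with_dataset.
    variant = scenario.with_dataset(make_dataset("regression"))

    runner = Runner()
    for s in (scenario, variant):
        results = await runner.run_scenario(s)
        print(s.name, results.success_rate)


if __name__ == "__main__":
    asyncio.run(main())
```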
