|
1 | 4 | # AgentUnit |
2 | 5 |
|
3 | | -**AgentUnit** is a comprehensive framework for testing, monitoring, and validating multi-agent AI systems across different platforms. It provides a unified interface to test agent interactions, measure performance, and ensure reliability of conversational AI workflows. |
4 | | - |
5 | | -## 🚀 Features |
6 | | - |
7 | | -### Multi-Platform Support |
8 | | -- **AutoGen AG2**: Microsoft's conversational AI framework |
9 | | -- **OpenAI Swarm**: Multi-agent coordination and orchestration |
10 | | -- **LangSmith**: Advanced language model monitoring and evaluation |
11 | | -- **AgentOps**: Production monitoring and observability |
12 | | -- **Wandb**: Experiment tracking and performance analytics |
13 | | - |
14 | | -### Core Capabilities |
15 | | -- **Multi-Agent Testing**: Comprehensive testing of agent interactions and workflows |
16 | | -- **Production Monitoring**: Real-time monitoring of agent performance and behavior |
17 | | -- **Performance Analytics**: Detailed metrics and insights into agent system performance |
18 | | -- **Scenario Management**: Create, run, and manage test scenarios across platforms |
19 | | -- **Reporting & Export**: Generate detailed reports in multiple formats (JSON, XML, HTML) |
| 6 | +AgentUnit is a framework for evaluating, monitoring, and benchmarking multi-agent systems. It standardises how teams define scenarios, run experiments, and report outcomes across adapters, model providers, and deployment targets. |
20 | 7 |
|
21 | | -### Architecture Highlights |
22 | | -- **Modular Design**: Platform-agnostic architecture with pluggable adapters |
23 | | -- **Async Processing**: Fully asynchronous execution for high-performance testing |
24 | | -- **Extensible Framework**: Easy to add new platforms and monitoring capabilities |
25 | | -- **Production Ready**: Built for enterprise-scale deployment and monitoring |
| 8 | +## Overview |
26 | 9 |
|
27 | | -> Looking for a crash course? Jump to the [five-minute quickstart](#five-minute-quickstart). |
| 10 | +- **Scenario-centric design** – describe datasets, adapters, and policies once, then reuse them in local runs, CI jobs, and production monitors. |
| 11 | +- **Extensible adapters** – plug into LangGraph, CrewAI, PromptFlow, OpenAI Swarm, Anthropic Bedrock, Phidata, and custom agents through a consistent interface. |
| 12 | +- **Comprehensive metrics** – combine exact-match assertions, RAGAS quality scores, and operational metrics with optional OpenTelemetry traces. |
| 13 | +- **Production-first tooling** – export JSON, Markdown, and JUnit reports, gate releases with regression detection, and surface telemetry in existing observability stacks. |
28 | 14 |
|
29 | | -## Table of contents |
| 15 | +## Installation |
30 | 16 |
|
31 | | -1. [Installation](#installation) |
32 | | -2. [Five-minute quickstart](#five-minute-quickstart) |
33 | | -3. [Workflow overview](#workflow-overview) |
34 | | -4. [Authoring evaluation suites](#authoring-evaluation-suites) |
35 | | -5. [Command-line interface](#command-line-interface) |
36 | | -6. [Metrics & reporting](#metrics--reporting) |
37 | | -7. [Observability with OpenTelemetry](#observability-with-opentelemetry) |
38 | | -8. [Local development](#local-development) |
39 | | -9. [Additional resources](#additional-resources) |
40 | | - |
41 | | -## 📦 Installation |
42 | | - |
43 | | -### Prerequisites |
44 | | -- Python 3.9 or higher |
45 | | -- Virtual environment (recommended) |
46 | | - |
47 | | -### Quick Installation |
| 17 | +AgentUnit targets Python 3.9+. The recommended workflow uses Poetry for dependency management. |
48 | 18 |
|
49 | 19 | ```bash |
50 | | -# Clone the repository |
51 | | -git clone https://github.com/yourusername/agentunit.git |
| 20 | +git clone https://github.com/aviralgarg05/agentunit.git |
52 | 21 | cd agentunit |
53 | | - |
54 | | -# Create and activate virtual environment |
55 | | -python -m venv venv |
56 | | -source venv/bin/activate # On Windows: venv\Scripts\activate |
57 | | - |
58 | | -# Install AgentUnit |
59 | | -pip install -e . |
60 | | - |
61 | | -# Verify installation |
62 | | -agentunit --help |
63 | | -``` |
64 | | - |
65 | | -### Platform-Specific Dependencies |
66 | | - |
67 | | -Install additional dependencies for specific platforms: |
68 | | - |
69 | | -```bash |
70 | | -# For AutoGen AG2 support |
71 | | -pip install pyautogen[ag2] |
72 | | - |
73 | | -# For OpenAI Swarm support |
74 | | -pip install openai-swarm |
75 | | - |
76 | | -# For LangSmith integration |
77 | | -pip install langsmith |
78 | | - |
79 | | -# For AgentOps monitoring |
80 | | -pip install agentops |
81 | | - |
82 | | -# For Wandb tracking |
83 | | -pip install wandb |
84 | | -``` |
85 | | - |
86 | | -## 🏃‍♂️ Quick Start |
87 | | - |
88 | | -### 1. Basic Multi-Agent Test |
89 | | - |
90 | | -```python |
91 | | -from agentunit.core import Scenario, DatasetSource, DatasetCase |
92 | | -from agentunit.adapters.autogen_ag2 import AG2Adapter |
93 | | - |
94 | | -# Create a test case |
95 | | -test_case = DatasetCase( |
96 | | - id="greeting_test", |
97 | | - query="Hello, can you help me plan a meeting?", |
98 | | - expected_output="I'd be happy to help you plan a meeting.", |
99 | | - metadata={"category": "greeting", "complexity": "simple"} |
100 | | -) |
101 | | - |
102 | | -# Create dataset |
103 | | -dataset = DatasetSource("meeting_scenarios", lambda: [test_case]) |
104 | | - |
105 | | -# Configure adapter |
106 | | -adapter = AG2Adapter({ |
107 | | - "model": "gpt-4", |
108 | | - "max_turns": 10, |
109 | | - "timeout": 60 |
110 | | -}) |
111 | | - |
112 | | -# Create and run scenario |
113 | | -scenario = Scenario( |
114 | | - name="meeting_planning_test", |
115 | | - adapter=adapter, |
116 | | - dataset=dataset |
117 | | -) |
118 | | - |
119 | | -# Execute the test |
120 | | -from agentunit.core import Runner |
121 | | -runner = Runner() |
122 | | -results = await runner.run_scenario(scenario) |
123 | | - |
124 | | -print(f"Success rate: {results.success_rate}") |
| 22 | +poetry install |
| 23 | +poetry shell |
125 | 24 | ``` |
126 | 25 |
|
127 | | -### 2. CLI Usage |
| 26 | +To use pip instead: |
128 | 27 |
|
129 | 28 | ```bash |
130 | | -# Run a multi-agent test scenario |
131 | | -agentunit multiagent run --scenario meeting_test.json --adapter autogen_ag2 |
132 | | - |
133 | | -# Monitor production deployment |
134 | | -agentunit monitoring start --platform langsmith --project my_agents |
135 | | - |
136 | | -# Generate analysis report |
137 | | -agentunit analyze --results results.json --output report.html |
138 | | - |
139 | | -# Configure AgentUnit settings |
140 | | -agentunit config set adapter.default autogen_ag2 |
141 | | -agentunit config set monitoring.enabled true |
142 | | -``` |
143 | | - |
144 | | -### 3. Production Monitoring |
145 | | - |
146 | | -```python |
147 | | -from agentunit.monitoring import ProductionMonitor |
148 | | -from agentunit.adapters.langsmith_adapter import LangSmithAdapter |
149 | | - |
150 | | -# Setup production monitoring |
151 | | -monitor = ProductionMonitor() |
152 | | -adapter = LangSmithAdapter({ |
153 | | - "project_name": "production_agents", |
154 | | - "api_key": "your_langsmith_key" |
155 | | -}) |
156 | | - |
157 | | -# Start monitoring |
158 | | -await monitor.start_monitoring(adapter) |
159 | | - |
160 | | -# Monitor specific agent interactions |
161 | | -session_id = await monitor.create_session("customer_support") |
162 | | -# Your agent interactions here... |
163 | | -await monitor.end_session(session_id) |
164 | | - |
165 | | -# Generate monitoring report |
166 | | -report = await monitor.generate_report() |
| 29 | +python -m venv .venv |
| 30 | +source .venv/bin/activate |
| 31 | +pip install -e . |
167 | 32 | ``` |
168 | 33 |
|
169 | | -Or add it to a Poetry project: |
| 34 | +Optional integrations are published as extras; install only what you need: |
170 | 35 |
|
171 | 36 | ```bash |
172 | | -poetry add agentunit |
| 37 | +poetry install --extras "promptflow crewai langgraph" |
| 38 | +# or with pip |
| 39 | +pip install agentunit[promptflow,crewai,langgraph] |
173 | 40 | ``` |
174 | 41 |
|
175 | | -The CLI entry point `agentunit` becomes available immediately after installation. |
| 42 | +Refer to the [framework integrations catalog](docs/framework-integrations.md) for per-adapter requirements. |
176 | 43 |
|
177 | | -## Five-minute quickstart |
| 44 | +## Getting started |
178 | 45 |
|
179 | | -1. **Install the package** (see above) inside a virtual environment. |
180 | | -2. **Bootstrap a template project** using the bundled example suite: |
| 46 | +1. Follow the [Quickstart](docs/quickstart.md) to run the bundled template suite and swap in your own adapter. |
| 47 | +2. Review [Writing Scenarios](docs/writing-scenarios.md) for dataset and adapter templates plus helper constructors for popular frameworks. |
| 48 | +3. Consult the [CLI reference](docs/cli.md) to orchestrate suites from the command line and export results for CI, dashboards, or audits. |
181 | 49 |
|
182 | | - ```bash |
183 | | - agentunit agentunit.examples.template_project.suite \ |
184 | | - --json reports/template.json \ |
185 | | - --markdown reports/template.md \ |
186 | | - --junit reports/template.xml |
187 | | - ``` |
188 | | - |
189 | | - This runs the canned template agent against two sample questions and emits three report formats under `reports/`. |
190 | | - |
191 | | -3. **Inspect the reports** for aggregated pass/fail signals, metric scores, and tool usage breakdowns. |
192 | | - |
193 | | -Ready to wire in your own agent? Proceed to the next section. |
194 | | - |
195 | | -## Workflow overview |
196 | | - |
197 | | -AgentUnit breaks evaluations into four concepts: |
198 | | - |
199 | | -| Concept | Purpose | |
200 | | -| --- | --- | |
201 | | -| **Dataset** | Supplies deterministic prompts, ground-truth expectations, and contextual metadata for each case. | |
202 | | -| **Adapter** | Knows how to call your agent (LangGraph, CrewAI, custom code) and translate results into `AdapterOutcome` objects. | |
203 | | -| **Scenario** | Couples an adapter with a dataset, retries, and runtime limits. Scenarios are collected into suites. | |
204 | | -| **Metrics** | Score each run with heuristic checks (exact match, tool success) or RAGAS-powered evaluations when dependencies are available. | |
205 | | - |
206 | | -You author suites in plain Python, then execute them with the CLI or your own runner. |
207 | | - |
208 | | -## Authoring evaluation suites |
209 | | - |
210 | | -Use the scaffolding below to create your own suite module. Place the file anywhere in your codebase (for example `evals/my_suite.py`) and point the CLI to it. |
211 | | - |
212 | | -```python |
213 | | -from agentunit.adapters.base import AdapterOutcome, BaseAdapter |
214 | | -from agentunit.core.scenario import Scenario |
215 | | -from agentunit.core.trace import TraceLog |
216 | | -from agentunit.datasets.base import DatasetCase, DatasetSource |
217 | | - |
218 | | - |
219 | | -def build_dataset() -> DatasetSource: |
220 | | - """Return a DatasetSource yielding deterministic DatasetCase objects.""" |
221 | | - |
222 | | - def _loader(): |
223 | | - yield DatasetCase( |
224 | | - id="faq-001", |
225 | | - query="What is the capital of France?", |
226 | | - expected_output="Paris is the capital of France.", |
227 | | - context=["Paris is the capital of France."], |
228 | | - tools=["knowledge_base"], |
229 | | - ) |
230 | | - |
231 | | - return DatasetSource(name="faq-demo", loader=_loader) |
232 | | - |
233 | | - |
234 | | -class MyAdapter(BaseAdapter): |
235 | | - """Calls your agent stack and returns AdapterOutcome objects.""" |
236 | | - |
237 | | - name = "faq-adapter" |
238 | | - |
239 | | - def __init__(self, agent): |
240 | | - self._agent = agent |
241 | | - |
242 | | - def prepare(self) -> None: |
243 | | - ... # warm up connections, load prompts, etc. |
244 | | - |
245 | | - def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome: |
246 | | - trace.record("agent_prompt", input={"query": case.query, "context": case.context}) |
247 | | - answer = self._agent.answer(case.query, context=case.context) |
248 | | - trace.record("agent_response", content=answer) |
249 | | - success = case.expected_output is None or answer.strip() == case.expected_output.strip() |
250 | | - return AdapterOutcome(success=success, output=answer) |
251 | | - |
252 | | - def cleanup(self) -> None: |
253 | | - ... |
254 | | - |
255 | | - |
256 | | -dataset = build_dataset() |
257 | | - |
258 | | - |
259 | | -def create_suite(): |
260 | | - agent = ... # instantiate your production agent here |
261 | | - adapter = MyAdapter(agent) |
262 | | - scenario = Scenario(name="faq-demo", adapter=adapter, dataset=dataset) |
263 | | - return [scenario] |
264 | | - |
265 | | - |
266 | | -suite = list(create_suite()) |
267 | | -``` |
268 | | - |
269 | | -### Helpful utilities |
270 | | - |
271 | | -- `Scenario.load_langgraph(path, dataset=...)`: build scenarios directly from LangGraph graphs. |
272 | | -- `Scenario.from_openai_agents(flow, dataset=...)`: plug into the OpenAI Agents SDK. |
273 | | -- `Scenario.from_crewai(crew, dataset=...)`: evaluate any CrewAI `Crew` object. |
274 | | -- `Scenario.with_dataset(new_dataset)`: reuse adapters across multiple datasets. |
275 | | - |
276 | | -See the [template suite](docs/templates/suite_template.py) for a fully commented example. |
277 | | - |
278 | | -## Command-line interface |
279 | | - |
280 | | -Run suites with the `agentunit` CLI: |
| 50 | +AgentUnit exposes an `agentunit` CLI entry point once installed. Typical usage: |
281 | 51 |
|
282 | 52 | ```bash |
283 | | -agentunit path.to.your.suite \ |
| 53 | +agentunit path.to.suite \ |
284 | 54 | --metrics faithfulness answer_correctness \ |
285 | | - --otel-exporter otlp \ |
286 | 55 | --json reports/results.json \ |
287 | 56 | --markdown reports/results.md \ |
288 | 57 | --junit reports/results.xml |
289 | 58 | ``` |
290 | 59 |
|
291 | | -### Key flags |
| 60 | +Programmatic runners are available through `agentunit.core.Runner` for notebook- or script-driven workflows. |
292 | 61 |
|
293 | | -- `suite`: Either a module path (`package.module`) or a Python file ending in `.py` that exports `suite` or `create_suite`. |
294 | | -- `--metrics`: Optional subset of metric names. Omit to run everything registered in `agentunit.metrics`. |
295 | | -- `--seed`: Force deterministic shuffling for stochastic datasets. |
296 | | -- `--otel-exporter`: Choose `console` (default pretty-print) or `otlp` (send spans to OTLP endpoint). |
297 | | -- `--json`, `--markdown`, `--junit`: Export evaluation artifacts in your preferred formats. |
| 62 | +## Documentation map |
298 | 63 |
|
299 | | -Full CLI documentation lives under [`docs/cli.md`](docs/cli.md). |
| 64 | +| Topic | Reference | |
| 65 | +| --- | --- | |
| 66 | +| Quick evaluation walkthrough | [docs/quickstart.md](docs/quickstart.md) | |
| 67 | +| Scenario and adapter authoring | [docs/writing-scenarios.md](docs/writing-scenarios.md) | |
| 68 | +| CLI options and examples | [docs/cli.md](docs/cli.md) | |
| 69 | +| Architecture overview | [docs/architecture.md](docs/architecture.md) | |
| 70 | +| Framework-specific guides | [docs/platform-guides.md](docs/platform-guides.md) | |
| 71 | +| No-code builder guide | [docs/nocode-quickstart.md](docs/nocode-quickstart.md) | |
| 72 | +| Templates | [docs/templates/](docs/templates/) | |
| 73 | +| Performance testing | [docs/performance-testing.md](docs/performance-testing.md) | |
300 | 74 |
|
301 | | -## Metrics & reporting |
| 75 | +Use the table above as the canonical navigation map; each document cross-links to related topics. |
302 | 76 |
|
303 | | -- **Answer correctness**: Exact-match with RAGAS-backed fuzzy scoring when available. |
304 | | -- **Faithfulness**, **retrieval quality**, **hallucination rate**: Guard against fabricated responses by aligning answers with provided context. |
305 | | -- **Tool success**: Verifies end-to-end workflows by aggregating tool call outcomes from traces. |
| 77 | +## Development workflow |
306 | 78 |
|
307 | | -Exports contain per-scenario, per-case, and aggregate metric summaries. JUnit reports integrate with CI platforms, while JSON/Markdown support custom dashboards. |
| 79 | +1. Install dependencies (Poetry or pip). |
| 80 | +2. Run the unit and integration suite: |
308 | 81 |
|
309 | | -## Observability with OpenTelemetry |
| 82 | +```bash |
| 83 | +poetry run python3 -m pytest tests -v |
| 84 | +``` |
310 | 85 |
|
311 | | -AgentUnit emits structured spans through OpenTelemetry. Set `--otel-exporter otlp` to forward traces to an OTLP-compatible collector (such as OpenTelemetry Collector, Honeycomb, or Datadog). The default `console` exporter prints spans to stdout for quick inspection. |
| 86 | +3. Execute targeted suites during active development, then run the full matrix before opening a pull request. |
312 | 87 |
|
313 | | -You can also hook into `TraceLog` within adapters to record domain-specific events (tool prompts, retrieval hits, etc.). |
| 88 | +Latest verification (2025-10-07): 144 passed, 10 skipped, 32 warnings. Warnings originate from third-party dependencies (`langchain` pydantic shim deprecations and `datetime.utcnow` usage). Track upstream fixes or pin patched releases as needed. |
314 | 89 |
|
315 | | -## Local development |
| 90 | +## Contributing |
316 | 91 |
|
317 | | -Clone the repository if you plan to contribute: |
| 92 | +- Fork the repository and target the `main` branch for pull requests. |
| 93 | +- Include tests for new features or behavioural changes. |
| 94 | +- Update documentation when public APIs change; use the navigation table above to keep references synchronized. |
| 95 | +- Adhere to the existing code style and run `pytest` before submitting changes. |
318 | 96 |
|
319 | | -```bash |
320 | | -git clone https://github.com/aviralgarg05/agentunit.git |
321 | | -cd agentunit |
322 | | -poetry install |
323 | | -poetry run pytest |
324 | | -``` |
| 97 | +General discussion happens through GitHub issues; report vulnerabilities privately, following the responsible disclosure guidelines in `SECURITY.md` (or GitHub's private vulnerability reporting if that file is absent). |
325 | 98 |
|
326 | | -The template suite lives under `src/agentunit/examples/template_project/` to help you smoke-test the framework. |
| 99 | +## License |
327 | 100 |
|
328 | | -## Additional resources |
| 101 | +AgentUnit is released under the MIT License. See [LICENSE](LICENSE) for the full text. |
329 | 102 |
|
330 | | -- [Quickstart guide](docs/quickstart.md) |
331 | | -- [How to write scenarios](docs/writing-scenarios.md) |
332 | | -- [CLI reference](docs/cli.md) |
333 | | -- [Template suite skeleton](docs/templates/suite_template.py) |
| 103 | +--- |
334 | 104 |
|
335 | | -If you build something cool with AgentUnit, let us know by opening an issue or discussion on GitHub! |
| 105 | +Need an overview for stakeholders? Start with [docs/architecture.md](docs/architecture.md). Ready to extend the platform? Explore the templates under [docs/templates/](docs/templates/). |