cognis-digital/evalbench

EVALBENCH

Offline LLM / agent eval harness with regression gates

AI Agents & LLMOps — build, route, evaluate, and secure agents.

pip install cognis-evalbench
evalbench scan .            # → prioritized findings in seconds

Why evalbench?

CI for agents

evalbench is single-purpose, scriptable, and self-hostable: point it at a target, get prioritized results in the format your workflow already speaks (table · JSON · SARIF), gate CI on it, and let agents drive it over MCP.

Features

  • ✅ Load Suite
  • ✅ Run Suite
  • ✅ Compare Baseline
  • ✅ Runs on Linux/macOS/Windows · Docker · devcontainer
  • ✅ Ports in Python, JavaScript, Go, and Rust (ports/)
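
This README does not spell out how "Compare Baseline" or the regression gates work. As a rough sketch only, assuming each run reduces to a mapping from case name to a score between 0 and 1, a regression check might look like the following. The function name, the `tolerance` parameter, and the data shape are all hypothetical and are not evalbench's actual API.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return names of cases whose score dropped by more than `tolerance`.

    Hypothetical helper: evalbench's real baseline format is not documented
    in this README. Cases missing from `current` count as regressions.
    """
    regressed = []
    for name, old_score in sorted(baseline.items()):
        new_score = current.get(name)
        if new_score is None or old_score - new_score > tolerance:
            regressed.append(name)
    return regressed


# Made-up scores for illustration only.
baseline = {"summarize": 0.90, "extract": 0.80, "route": 0.75}
current = {"summarize": 0.91, "extract": 0.70}

regressed = find_regressions(baseline, current)
print(regressed)  # a CI gate could exit non-zero when this list is non-empty
```

Treating a missing case as a regression is a deliberate choice in this sketch: silently dropping a test should not make a gate pass.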

Quick start

pip install cognis-evalbench           # from PyPI once published; see Install below for the git URL
evalbench --version
evalbench scan .                       # scan current project
evalbench scan . --format json         # machine-readable
evalbench scan . --fail-on high        # CI gate (non-zero exit)
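
The README does not define how `--fail-on` chooses its threshold. One plausible reading, sketched here purely as an assumption, is "exit non-zero if any finding is at or above the named severity". The severity ladder and the `exit_code` helper below are hypothetical; only HIGH and MEDIUM appear in this document.

```python
from typing import Iterable

# Assumed severity ladder, lowest to highest (not confirmed by the README).
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def exit_code(severities: Iterable[str], fail_on: str) -> int:
    """Return 1 if any severity is at or above `fail_on`, else 0."""
    threshold = SEVERITY_ORDER.index(fail_on.lower())
    for severity in severities:
        if SEVERITY_ORDER.index(severity.lower()) >= threshold:
            return 1
    return 0


print(exit_code(["HIGH", "MEDIUM"], "high"))  # a HIGH finding trips the gate
print(exit_code(["MEDIUM"], "high"))          # nothing at or above HIGH
```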

Example

$ evalbench scan .
  [HIGH    ] EVA-001  example finding             (./src/app.py)
  [MEDIUM  ] EVA-002  another signal              (./config.yaml)

  2 findings · risk score 5 · 38ms

Architecture

flowchart LR
  A[Input: file / dir / API] --> B[Collectors]
  B --> C[Rules / Analyzers]
  C --> D[Scorer]
  D --> E{Reporters}
  E --> F[Table]
  E --> G[JSON / SARIF]
  E --> H[MCP tool]
  H -. drives .-> I[AI agents]
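
To make the flow in the diagram concrete, here is a minimal sketch of a collect, analyze, score, report pipeline. It is not evalbench's implementation: the `Finding` type, the stage signatures, and the severity weights are all assumptions. The weights were chosen only so the toy run reproduces the "risk score 5" line from the example; the README does not say how that score is computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Finding:
    rule_id: str
    severity: str
    path: str


# Hypothetical stage signatures mirroring the boxes in the flowchart.
Collector = Callable[[str], Iterable[str]]
Analyzer = Callable[[str], List[Finding]]

# Assumed weights; the real scoring scheme is not documented here.
WEIGHTS: Dict[str, int] = {"HIGH": 3, "MEDIUM": 2, "LOW": 1}


def run_pipeline(target: str, collect: Collector, analyzers: List[Analyzer]) -> dict:
    """Collect inputs, run every analyzer on each, then score the findings."""
    findings: List[Finding] = []
    for item in collect(target):
        for analyze in analyzers:
            findings.extend(analyze(item))
    score = sum(WEIGHTS.get(f.severity, 0) for f in findings)
    return {"findings": findings, "risk_score": score}


def report_table(result: dict) -> str:
    """Render one line per finding, followed by a summary line."""
    lines = [f"[{f.severity:<8}] {f.rule_id}  ({f.path})" for f in result["findings"]]
    lines.append(f"{len(result['findings'])} findings · risk score {result['risk_score']}")
    return "\n".join(lines)


# Toy stand-ins so the sketch runs without touching the filesystem.
def collect(target: str) -> List[str]:
    return [f"{target}/src/app.py", f"{target}/config.yaml"]


def flag_python(path: str) -> List[Finding]:
    return [Finding("EVA-001", "HIGH", path)] if path.endswith(".py") else []


def flag_yaml(path: str) -> List[Finding]:
    return [Finding("EVA-002", "MEDIUM", path)] if path.endswith(".yaml") else []


result = run_pipeline(".", collect, [flag_python, flag_yaml])
print(report_table(result))
```

Keeping each stage a plain callable means a JSON or SARIF reporter could be swapped in for `report_table` without touching the earlier stages.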

Use it from any AI stack

evalbench plugs into the common ways of working with AI tooling:

  • MCP server: evalbench mcp (Claude Desktop, Cursor, Cognis.Studio, uncensored-fleet)
  • OpenAI-compatible / JSON — pipe evalbench scan . --format json into any agent or LLM
  • LangChain · CrewAI · AutoGen · LlamaIndex — wrap the CLI/JSON as a tool in one line
  • CI / scripts — exit codes + SARIF for non-AI pipelines
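
As an illustration of wrapping the CLI as a tool, the sketch below shells out to the `evalbench scan . --format json` command shown in this README and returns the parsed output. The `scan_tool` function and its `command` parameter are hypothetical names introduced here, and no JSON schema is assumed; whatever the CLI prints is passed through. An agent framework could register a callable like this as a tool, though the exact registration call depends on the framework.

```python
import json
import subprocess
import sys
from typing import List, Optional


def scan_tool(target: str = ".", command: Optional[List[str]] = None) -> dict:
    """Run a scan and return its parsed JSON output plus the exit code.

    `command` defaults to the CLI invocation shown in this README; pass a
    different argv list to stub the CLI out in tests.
    """
    argv = command or ["evalbench", "scan", target, "--format", "json"]
    proc = subprocess.run(argv, capture_output=True, text=True)
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        # Fall back to raw text if the output is not valid JSON.
        payload = {"raw": proc.stdout}
    return {"exit_code": proc.returncode, "output": payload}


# Stub command standing in for the real CLI so the sketch runs anywhere.
fake_cli = [sys.executable, "-c", "print('{\"findings\": []}')"]
result = scan_tool(command=fake_cli)
print(result)
```

Returning the exit code alongside the payload lets the caller tell "scan ran and found nothing" from "scan failed", which matters when an agent decides what to do next.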

How it compares

| | Cognis evalbench | promptfoo |
|---|---|---|
| Self-hostable, no account | | varies |
| Single command, zero config | | ⚠️ |
| JSON + SARIF for CI | | varies |
| MCP-native (AI agents) | | |
| Polyglot ports (JS/Go/Rust) | | |
| Open license | ✅ COCL | varies |

Built in the spirit of promptfoo / deepeval, re-framed the Cognis way. Missing a credit? Open a PR.

Integrations

Pipes into your stack: SARIF for code-scanning, JSON for anything, an MCP server (evalbench mcp) for AI agents, and a webhook forwarder for SIEM/Slack/Jira. See docs/INTEGRATIONS.md.
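
For readers who have not met SARIF, here is a minimal sketch of what a SARIF 2.1.0 log for two findings could look like when built by hand. The input dictionaries (`id`, `severity`, `title`, `path`) and the severity-to-level mapping are assumptions for illustration; evalbench's own finding schema and SARIF output are not shown in this README and may differ.

```python
import json
from typing import Dict, List

# Assumed mapping from the severities shown in this README to SARIF levels.
LEVELS: Dict[str, str] = {"HIGH": "error", "MEDIUM": "warning", "LOW": "note"}


def to_sarif(findings: List[dict], tool_name: str = "evalbench") -> dict:
    """Wrap findings in a minimal SARIF 2.1.0 log with a single run."""
    results = []
    for f in findings:
        results.append({
            "ruleId": f["id"],
            "level": LEVELS.get(f["severity"], "warning"),
            "message": {"text": f["title"]},
            "locations": [{
                "physicalLocation": {"artifactLocation": {"uri": f["path"]}}
            }],
        })
    return {
        "version": "2.1.0",
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }


# Hypothetical findings modeled on the example output earlier in this README.
findings = [
    {"id": "EVA-001", "severity": "HIGH", "title": "example finding", "path": "src/app.py"},
    {"id": "EVA-002", "severity": "MEDIUM", "title": "another signal", "path": "config.yaml"},
]
sarif = to_sarif(findings)
print(json.dumps(sarif, indent=2))
```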

Install — every way, every platform

pip install "git+https://github.com/cognis-digital/evalbench.git"    # pip (works today)
pipx install "git+https://github.com/cognis-digital/evalbench.git"   # isolated CLI
uv tool install "git+https://github.com/cognis-digital/evalbench.git" # uv
pip install cognis-evalbench                                          # PyPI (when published)
docker run --rm ghcr.io/cognis-digital/evalbench:latest --help        # Docker
brew install cognis-digital/tap/evalbench                             # Homebrew tap
curl -fsSL https://raw.githubusercontent.com/cognis-digital/evalbench/main/install.sh | sh

| Linux | macOS | Windows | Docker | Cloud |
|---|---|---|---|---|
| scripts/setup-linux.sh | scripts/setup-macos.sh | scripts/setup-windows.ps1 | docker run ghcr.io/cognis-digital/evalbench | DEPLOY.md (AWS/Azure/GCP/k8s) |

Related Cognis tools

  • agentsmith — Config-first scaffolding and orchestration for multi-agent workflows
  • skillhub — Local skill registry and installer for AI agents
  • toolguard — Runtime allowlist and policy for agent tool-calls
  • ragkit — Batteries-included local RAG pipeline — ingest, index, serve
  • memorybank — Portable long-term memory store for agents, exposed over MCP
  • promptpack — Versioned prompt / template registry with A/B and rollbacks

Explore the suite → 🗂️ all 170+ tools · ⭐ awesome-cognis · 🔗 cognis-sources · 🤖 uncensored-fleet · 🧠 hermes

Contributing

PRs, new rules, and demo scenarios are welcome under the collaboration-pull model — see CONTRIBUTING.md and SECURITY.md.

⭐ If evalbench saved you time, star it — it genuinely helps others find it.

License

Source-available under the Cognis Open Collaboration License (COCL) v1.0 — free for personal, internal-evaluation, research, and educational use; commercial / production use requires a license (licensing@cognis.digital). See LICENSE.


Cognis Digital · one of 170+ tools in the Cognis Neural Suite · Making Tomorrow Better Today
