Releases: auth0/auth0-evals
1.1.0-beta.0
Improved Eval Score v1.1
v1.1 expands evaluation coverage from 12 to 13 frameworks with the addition of Flask and adds support for evaluating newer models, including Claude Opus 4.8, Claude Haiku 4.5, Gemini 3.5 Flash, and GPT-5.4 Mini.
Using Flywheel, we identified opportunities to improve our Skills. With the latest Skills and MCP enhancements, the agent-assisted eval score rose from 93.0 in v1 to 96.9 in v1.1 (+3.9 points). The increase reflects continued gains in Auth0 SDK integration task completion across frameworks and models, and it further improves the agentic experience for our customers.
Live score: https://auth0.com/agent-experience
1.0.0
Built an eval framework to answer that: a system that runs real AI agents through real Auth0 integration tasks, then scores the output across 7 dimensions, including correctness, security, and hallucination. We measured the agent experience across 5 models, 12 frameworks, and 60 configurations, and made the results public at https://auth0.com/agent-experience.
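As a rough illustration of how the 5 models, 12 frameworks, and 60 configurations could relate, the sketch below assumes one configuration per (model, framework) pair. That pairing is an assumption, not something these notes state, and the model and framework names are hypothetical placeholders because the v1.0.0 list is not given here.

```python
from itertools import product

# Hypothetical placeholder names: the release notes do not list the
# v1.0.0 models or frameworks.
models = [f"model-{i}" for i in range(1, 6)]            # 5 models
frameworks = [f"framework-{i}" for i in range(1, 13)]   # 12 frameworks

# Assumption: one configuration per (model, framework) pair.
configurations = list(product(models, frameworks))

print(len(configurations))  # 60
```

If configurations are defined differently in the actual framework (for example, with and without agent assistance), the count would be derived another way.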