@dimavrem22 dimavrem22 commented Jan 12, 2026

Resolved:

Benchmark Flow:

  • Run python web_hacker/scripts/run_benchmarks.py -v
  • The script has a list of S3 benchmark fixtures
  • It downloads one benchmark at a time and unzips it (each archive contains CDP captures, a routine discovery task, deterministic tests, and LLM tests)
  • It runs the discovery agent, then the deterministic tests, then the LLM tests, and produces output like the screenshot below
[Screenshot 2026-01-13 at 10 00 25 AM: benchmark output]
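
The flow above can be sketched roughly as follows. This is a minimal illustration, not the code in run_benchmarks.py: the names `run_benchmarks`, `BenchmarkResult`, `download`, and `stages` are all hypothetical, and the S3 download and the three test stages are passed in as plain callables so the sketch stays self-contained.

```python
import tempfile
import zipfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Every name in this sketch is hypothetical; none is taken from run_benchmarks.py.

# A stage receives the directory of the unzipped fixture and reports pass/fail.
Stage = Callable[[Path], bool]


@dataclass
class BenchmarkResult:
    fixture: str
    stages: Dict[str, bool]  # stage name -> whether it passed


def run_benchmarks(
    fixtures: Iterable[str],
    download: Callable[[str, Path], Path],
    stages: Dict[str, Stage],
) -> List[BenchmarkResult]:
    """Process fixtures one at a time: download, unzip, run each stage in order."""
    results: List[BenchmarkResult] = []
    for key in fixtures:
        # A fresh temporary directory per fixture keeps only one benchmark on disk.
        with tempfile.TemporaryDirectory() as tmp:
            workdir = Path(tmp)
            archive = download(key, workdir)  # assumed to return the zip's path
            extracted = workdir / "fixture"
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(extracted)
            # Dicts preserve insertion order, so stages run in the order given,
            # e.g. discovery, then deterministic tests, then LLM tests.
            outcome = {name: stage(extracted) for name, stage in stages.items()}
        results.append(BenchmarkResult(fixture=key, stages=outcome))
    return results
```

In a real run, `download` would wrap whatever S3 client the script uses, and `stages` would map "discovery", "deterministic", and "llm" to the corresponding runners, in that order.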

@dimavrem22 dimavrem22 merged commit b4e2f0d into main Jan 13, 2026
2 checks passed
@dimavrem22 dimavrem22 deleted the agent-benchmarks branch January 13, 2026 20:32