VibeSearchBench

Hardest — vague multi-turn proactive search in the wild.
Verifiable — schema-free knowledge graph evaluation.
Long-horizon — persona-driven progressive disclosure.


Leaderboard

Browse the full leaderboard and individual task trajectories at vibebench.github.io/VibeSearchBench.github.io.

Evaluation:

  • Primary metric: Triplet F1. Predicted knowledge graphs are matched against ground truth via LLM-as-judge node alignment and triplet semantic equivalence.
  • Frameworks: ReAct and OpenClaw, evaluated on VibeSearch-Pro and VibeSearch-Daily.
  • Best reported score: 30.3 triplet F1 (Claude Opus 4.6, OpenClaw).
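The precision, recall, and F1 arithmetic behind a triplet score can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the benchmark uses an LLM judge for node alignment and triplet equivalence, whereas this sketch substitutes a case-insensitive exact match, and it assumes a greedy one-to-one matching. The names `triplet_prf`, `exact_match`, and `normalize` are hypothetical.

```python
from typing import Callable, Iterable, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and trim each element so trivial variants compare equal."""
    return tuple(part.strip().lower() for part in triplet)


def exact_match(pred: Triplet, gold: Triplet) -> bool:
    """Stand-in for the LLM judge (assumption): exact match after normalization."""
    return normalize(pred) == normalize(gold)


def triplet_prf(
    predicted: Iterable[Triplet],
    gold: Iterable[Triplet],
    is_equivalent: Callable[[Triplet, Triplet], bool] = exact_match,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under greedy one-to-one triplet matching."""
    predicted = list(predicted)
    gold = list(gold)
    unmatched_gold = list(gold)
    true_positives = 0
    for pred in predicted:
        for i, candidate in enumerate(unmatched_gold):
            if is_equivalent(pred, candidate):
                true_positives += 1
                del unmatched_gold[i]  # each gold triplet can be matched only once
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    pred = [("A", "founded", "B"), ("A", "located in", "C"), ("X", "is", "Y")]
    gold = [("a", "founded", "b"), ("A", "located in", "C"),
            ("A", "acquired", "D"), ("A", "employs", "E")]
    print(triplet_prf(pred, gold))
```

Passing a different `is_equivalent` callable is where a semantic judge would plug in; the scoring arithmetic stays the same.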

Explore: Leaderboard · Task trajectories · Paper

Tasks

200 tasks across 2 subsets and 20 domains. Each task pairs a vague initial query with a ground-truth knowledge graph and a persona simulator.

Split | Count | Description
pro   | 100   | Professional research: literature reviews, market analysis, technical due diligence
daily | 100   | Daily-life search: shopping, travel, and lifestyle with evolving preferences

Real users rarely specify full intent upfront. VibeSearch captures bidirectional convergence: agents interleave partial results with follow-up questions while users progressively disclose needs. VibeSearchBench evaluates schema-free knowledge graphs via graph matching (Precision / Recall / F1).

Dataset

Available on Hugging Face: VibeSearchBench/VibeSearchBench


Live site

https://vibebench.github.io/VibeSearchBench.github.io/

Static project website for VibeSearchBench. This repo is under the VibeBench org as a project site.

Enable GitHub Pages (required once)

The Publish site to gh-pages workflow builds the site and pushes the gh-pages branch. Then:

  1. Open Settings → Pages: https://github.com/VibeBench/VibeSearchBench.github.io/settings/pages
  2. Build and deployment → Source → Deploy from a branch
  3. Branch gh-pages, folder / (root) → Save
  4. Wait 2–5 min → https://vibebench.github.io/VibeSearchBench.github.io/

If Actions cannot push, enable Settings → Actions → General → Workflow permissions → Read and write permissions.

Update from the main benchmark repo

cd /path/to/VibeSearchBench
bash scripts/publish_github_io.sh

Or build only:

SITE_DIR=../VibeSearchBench.github.io bash scripts/build_website.sh
cd ../VibeSearchBench.github.io && git add -A && git commit -m "Update site" && git push

Trajectory layout

  • Pro source (jsonl): data/trajs/pro/*.jsonl → viewer JSON via scripts/convert_pro_trajs.py
  • Daily source (jsonl): data/trajs/claude-opus-4.6_custom_serper_simulated/trajs_reextract/
  • Viewer (json): data/trajs/pro/ (001.json …), data/trajs/daily/ (task_*.json)

python3 scripts/convert_pro_trajs.py
python3 scripts/build_final_extractions.py
python3 scripts/build_tasks_index.py
python3 scripts/fetch_ground_truth.py

Then commit and push this repository.
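For orientation, the jsonl-to-viewer-JSON step can be pictured with the sketch below. It is not the contents of scripts/convert_pro_trajs.py. It assumes that each non-empty line of a .jsonl file holds one trajectory and that viewer files are numbered with three digits (001.json, 002.json, and so on). The function name `split_jsonl_to_json` is hypothetical, and any field-level transformation the real script performs is omitted.

```python
import json
from pathlib import Path


def split_jsonl_to_json(src: Path, out_dir: Path, start: int = 1) -> list[Path]:
    """Write each non-empty line of a .jsonl file to its own numbered .json file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    index = start
    with src.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            target = out_dir / f"{index:03d}.json"  # 001.json, 002.json, ...
            target.write_text(
                json.dumps(record, ensure_ascii=False, indent=2),
                encoding="utf-8",
            )
            written.append(target)
            index += 1
    return written
```

Writing one file per trajectory lets a static viewer fetch a single task on demand instead of downloading the whole jsonl file.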


VibeSearchBench · Rednote-Hilab & Unipat AI
