brighterr/SurveyReview

SurveyReview

A Reviewer-Aligned Benchmark for Survey Evaluators

Project Page | Hugging Face Dataset

SurveyReview is a reviewer-aligned benchmark for evaluating survey papers. It converts real peer-review reports into multidimensional evaluation labels and rationales, allowing models to be tested against how human reviewers judge survey quality.

The benchmark focuses on four survey-review dimensions: Readability, Criticalness, Comprehensiveness, and Structure. It provides standardized train/test splits, article metadata, prompt files, and an API-based evaluation pipeline.

Quick Start

Create an environment and install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Configure the API client:

cp .env.example .env

Then edit .env:

API_KEY=your-api-key-here
BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-5.2
JUDGE_MODEL=gpt-5.2
EVALUATE_REASONS=True
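
As an illustration of how these settings could be consumed, here is a minimal sketch that parses the `.env` text shown above into a dictionary and converts `EVALUATE_REASONS` to a boolean. The helpers `parse_env` and `as_bool` are hypothetical names used only for this example; the actual loading logic in `src/api_base_evaluate.py` may differ.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def as_bool(value: str) -> bool:
    """Treat common truthy spellings as True (an assumption of this sketch)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


example = """\
API_KEY=your-api-key-here
BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-5.2
JUDGE_MODEL=gpt-5.2
EVALUATE_REASONS=True
"""

config = parse_env(example)
evaluate_reasons = as_bool(config.get("EVALUATE_REASONS", "False"))
print(config["MODEL_NAME"], evaluate_reasons)
```

Keeping the boolean conversion explicit avoids the common pitfall where the non-empty string "False" is treated as truthy.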

Run the default test-set evaluation:

python src/api_base_evaluate.py

Run on the train split:

EVAL_SPLIT=train python src/api_base_evaluate.py
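
The split switch above can be pictured as an environment-variable lookup with `test` as the fallback. The following is a sketch under that assumption; `resolve_split` and `VALID_SPLITS` are hypothetical names, and the real script may validate the value differently.

```python
import os

# The benchmark ships train/test splits; any other value is rejected in this sketch.
VALID_SPLITS = ("train", "test")


def resolve_split(environ=None) -> str:
    """Return the requested split, defaulting to 'test' when EVAL_SPLIT is unset."""
    env = os.environ if environ is None else environ
    split = env.get("EVAL_SPLIT", "test").strip().lower()
    if split not in VALID_SPLITS:
        raise ValueError(f"EVAL_SPLIT must be one of {VALID_SPLITS}, got {split!r}")
    return split


print(resolve_split({}))                       # falls back to the default split
print(resolve_split({"EVAL_SPLIT": "train"}))  # explicit override
```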

Outputs are written to result/<timestamp>/:

| File | Description |
| --- | --- |
| results.csv | MSE, MAE, accuracy, and sample counts for each dimension. |
| predictions_<dimension>.jsonl | Per-sample prediction records. |
| run_config.json | Runtime configuration and split statistics. |
| rqs_<dimension>.json | Rationale quality results when EVALUATE_REASONS=True. |
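
For readers unfamiliar with the metrics reported in results.csv, the sketch below computes MSE, MAE, accuracy, and the sample count for one dimension from paired scores. It assumes integer scores and defines accuracy as exact agreement between prediction and reviewer label; both are assumptions of this example, not confirmed details of the pipeline, and `score_metrics` is a hypothetical helper.

```python
def score_metrics(predicted: list[int], reference: list[int]) -> dict[str, float]:
    """Compute MSE, MAE, exact-match accuracy, and sample count for one dimension."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("predicted and reference must be non-empty and equal in length")
    n = len(predicted)
    errors = [p - r for p, r in zip(predicted, reference)]
    return {
        "mse": sum(e * e for e in errors) / n,         # mean squared error
        "mae": sum(abs(e) for e in errors) / n,        # mean absolute error
        "accuracy": sum(e == 0 for e in errors) / n,   # share of exact matches
        "count": n,
    }


# Illustrative scores only, not taken from the benchmark.
metrics = score_metrics(predicted=[4, 3, 5, 2], reference=[4, 4, 3, 2])
print(metrics)
```

MSE penalizes large disagreements more heavily than MAE, which is why the two can rank models differently.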

Leaderboard

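Lower MSE and MAE are better; higher SSR and RQS are better. Read., Crit., Comp., and Stru. abbreviate Readability, Criticalness, Comprehensiveness, and Structure.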

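| Rank | Model | SSR | Read. MSE | Read. MAE | Crit. MSE | Crit. MAE | Comp. MSE | Comp. MAE | Stru. MSE | Stru. MAE | Average MSE | Average MAE | RQS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | SurveyReviewer | 0.74 | 1.43 | 0.72 | 1.52 | 0.82 | 1.26 | 0.56 | 1.29 | 0.65 | 1.38 | 0.69 | 0.36 |
| 2 | GPT-5.2 | 0.68 | 2.13 | 1.07 | 1.97 | 0.97 | 2.04 | 1.08 | 2.98 | 1.47 | 2.28 | 1.15 | 0.42 |
| 3 | Claude-Opus-4.5 | 0.68 | 2.91 | 1.29 | 1.88 | 0.88 | 2.66 | 1.23 | 3.65 | 1.58 | 2.77 | 1.25 | 0.48 |
| 4 | Qwen3-32B | 0.61 | 3.05 | 1.45 | 3.24 | 1.51 | 3.22 | 1.54 | 3.35 | 1.53 | 3.21 | 1.51 | 0.36 |
| 5 | GLM-4.7 | 0.60 | 3.43 | 1.50 | 2.58 | 1.21 | 3.66 | 1.57 | 4.83 | 1.95 | 3.62 | 1.56 | 0.37 |
| 6 | gemini-3-pro | 0.58 | 3.84 | 1.52 | 2.25 | 1.00 | 3.91 | 1.49 | 5.76 | 2.11 | 3.94 | 1.53 | 0.29 |
| 7 | DeepSeek-v3.2 | 0.58 | 4.78 | 1.88 | 2.49 | 1.15 | 4.59 | 1.82 | 4.02 | 1.76 | 3.97 | 1.65 | 0.37 |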
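
The Average columns appear to be the unweighted mean of the four per-dimension values. That reading is an inference from the numbers, not something the README states. The sketch below checks it against every leaderboard row, allowing a small tolerance because the published figures are rounded to two decimals.

```python
# (per-dimension MSEs, per-dimension MAEs, reported average MSE, reported average MAE)
# Dimension order: Readability, Criticalness, Comprehensiveness, Structure.
rows = {
    "SurveyReviewer":  ([1.43, 1.52, 1.26, 1.29], [0.72, 0.82, 0.56, 0.65], 1.38, 0.69),
    "GPT-5.2":         ([2.13, 1.97, 2.04, 2.98], [1.07, 0.97, 1.08, 1.47], 2.28, 1.15),
    "Claude-Opus-4.5": ([2.91, 1.88, 2.66, 3.65], [1.29, 0.88, 1.23, 1.58], 2.77, 1.25),
    "Qwen3-32B":       ([3.05, 3.24, 3.22, 3.35], [1.45, 1.51, 1.54, 1.53], 3.21, 1.51),
    "GLM-4.7":         ([3.43, 2.58, 3.66, 4.83], [1.50, 1.21, 1.57, 1.95], 3.62, 1.56),
    "gemini-3-pro":    ([3.84, 2.25, 3.91, 5.76], [1.52, 1.00, 1.49, 2.11], 3.94, 1.53),
    "DeepSeek-v3.2":   ([4.78, 2.49, 4.59, 4.02], [1.88, 1.15, 1.82, 1.76], 3.97, 1.65),
}

TOLERANCE = 0.01  # slack for two-decimal rounding of the published values


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def is_consistent(mse: list[float], mae: list[float], avg_mse: float, avg_mae: float) -> bool:
    """True when both reported averages match the mean of the four dimensions."""
    return (abs(mean(mse) - avg_mse) <= TOLERANCE
            and abs(mean(mae) - avg_mae) <= TOLERANCE)


for model, row in rows.items():
    print(f"{model}: consistent={is_consistent(*row)}")
```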

Notes

  • articles/ is split into multiple JSON shards to stay within GitHub file-size limits.
  • v1.0-paper is intended for paper reproduction.
  • If you only want to verify the pipeline, set EVALUATE_REASONS=False to skip the judge-model stage.
