SurveyReview

A Reviewer-Aligned Benchmark for Survey Evaluators

SurveyReview is a reviewer-aligned benchmark for evaluating survey papers. It converts real peer-review reports into multidimensional evaluation labels and rationales, allowing models to be tested against how human reviewers judge survey quality.

The benchmark focuses on four survey-review dimensions: Readability, Criticalness, Comprehensiveness, and Structure. It provides standardized train/test splits, article metadata, prompt files, and an API-based evaluation pipeline.

Quick Start

Create an environment and install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Configure the API client:

cp .env.example .env

Then edit .env:

API_KEY=your-api-key-here
BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-5.2
JUDGE_MODEL=gpt-5.2
EVALUATE_REASONS=True
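
For orientation, here is a minimal sketch of how these settings could be read from the environment in Python. It is not the code in src/api_base_evaluate.py. The helper names `parse_bool` and `load_settings` are hypothetical, and the fallback defaults simply mirror the example values above.

```python
import os


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings such as 'True', 'true', '1', 'yes'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None) -> dict:
    """Collect the variables named in .env into a plain dict.

    Assumption: the fallback values below copy the example .env shown above;
    the real pipeline may use different defaults or none at all.
    """
    env = os.environ if env is None else env
    return {
        "api_key": env.get("API_KEY", ""),
        "base_url": env.get("BASE_URL", "https://api.openai.com/v1"),
        "model_name": env.get("MODEL_NAME", "gpt-5.2"),
        "judge_model": env.get("JUDGE_MODEL", "gpt-5.2"),
        "evaluate_reasons": parse_bool(env.get("EVALUATE_REASONS", "True")),
    }
```

Passing a plain dict instead of `os.environ` makes the parsing easy to check, for example `load_settings({"EVALUATE_REASONS": "False"})["evaluate_reasons"]` evaluates to `False`.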

Run the default test-set evaluation:

python src/api_base_evaluate.py

Run on the train split:

EVAL_SPLIT=train python src/api_base_evaluate.py

Outputs are written to result/<timestamp>/:

File	Description
`results.csv`	MSE, MAE, accuracy, and sample counts for each dimension.
`predictions_<dimension>.jsonl`	Per-sample prediction records.
`run_config.json`	Runtime configuration and split statistics.
`rqs_<dimension>.json`	Rationale quality results when `EVALUATE_REASONS=True`.
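
To sanity-check the numbers in `results.csv`, the per-sample records can be re-scored by hand. The sketch below is an illustration under assumptions: the field names `prediction` and `label` are hypothetical (check an actual `predictions_<dimension>.jsonl` file for the real keys), and accuracy is taken to mean an exact match between predicted and reference score, which the source does not define.

```python
import json


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def score(records, pred_key="prediction", label_key="label"):
    """Return MSE, MAE, exact-match accuracy, and the sample count.

    Assumption: pred_key and label_key are placeholder field names.
    """
    pairs = [(float(r[pred_key]), float(r[label_key])) for r in records]
    n = len(pairs)
    if n == 0:
        raise ValueError("no records to score")
    errors = [p - y for p, y in pairs]
    return {
        "mse": sum(e * e for e in errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "accuracy": sum(e == 0 for e in errors) / n,
        "n": n,
    }
```

For two records with predictions 4 and 3 against labels 5 and 3, `score` returns an MSE of 0.5, an MAE of 0.5, and an accuracy of 0.5.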

Leaderboard

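Lower MSE/MAE is better. Higher SSR/RQS is better. Read., Crit., Comp., and Stru. abbreviate Readability, Criticalness, Comprehensiveness, and Structure.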

Rank	Model	SSR	Read. MSE	Read. MAE	Crit. MSE	Crit. MAE	Comp. MSE	Comp. MAE	Stru. MSE	Stru. MAE	Avg. MSE	Avg. MAE	RQS
1	SurveyReviewer	0.74	1.43	0.72	1.52	0.82	1.26	0.56	1.29	0.65	1.38	0.69	0.36
2	GPT-5.2	0.68	2.13	1.07	1.97	0.97	2.04	1.08	2.98	1.47	2.28	1.15	0.42
3	Claude-Opus-4.5	0.68	2.91	1.29	1.88	0.88	2.66	1.23	3.65	1.58	2.77	1.25	0.48
4	Qwen3-32B	0.61	3.05	1.45	3.24	1.51	3.22	1.54	3.35	1.53	3.21	1.51	0.36
5	GLM-4.7	0.60	3.43	1.50	2.58	1.21	3.66	1.57	4.83	1.95	3.62	1.56	0.37
6	gemini-3-pro	0.58	3.84	1.52	2.25	1.00	3.91	1.49	5.76	2.11	3.94	1.53	0.29
7	DeepSeek-v3.2	0.58	4.78	1.88	2.49	1.15	4.59	1.82	4.02	1.76	3.97	1.65	0.37
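
The average columns appear to be the unweighted mean of the four per-dimension values. This is inferred from the numbers and not stated by the source. The sketch below copies the table rows and checks that reading, allowing a tolerance of 0.01 because the published values are rounded to two decimals.

```python
# (model, MSE per dimension [Read., Crit., Comp., Stru.], MAE per dimension,
#  reported average MSE, reported average MAE), copied from the table above.
LEADERBOARD = [
    ("SurveyReviewer", [1.43, 1.52, 1.26, 1.29], [0.72, 0.82, 0.56, 0.65], 1.38, 0.69),
    ("GPT-5.2", [2.13, 1.97, 2.04, 2.98], [1.07, 0.97, 1.08, 1.47], 2.28, 1.15),
    ("Claude-Opus-4.5", [2.91, 1.88, 2.66, 3.65], [1.29, 0.88, 1.23, 1.58], 2.77, 1.25),
    ("Qwen3-32B", [3.05, 3.24, 3.22, 3.35], [1.45, 1.51, 1.54, 1.53], 3.21, 1.51),
    ("GLM-4.7", [3.43, 2.58, 3.66, 4.83], [1.50, 1.21, 1.57, 1.95], 3.62, 1.56),
    ("gemini-3-pro", [3.84, 2.25, 3.91, 5.76], [1.52, 1.00, 1.49, 2.11], 3.94, 1.53),
    ("DeepSeek-v3.2", [4.78, 2.49, 4.59, 4.02], [1.88, 1.15, 1.82, 1.76], 3.97, 1.65),
]


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def check_averages(rows, tol=0.01):
    """Return models whose reported averages differ from the plain mean by more than tol."""
    mismatches = []
    for model, mse, mae, avg_mse, avg_mae in rows:
        if abs(mean(mse) - avg_mse) > tol or abs(mean(mae) - avg_mae) > tol:
            mismatches.append(model)
    return mismatches


print(check_averages(LEADERBOARD))
```

An empty list means every reported average is within rounding distance of the plain mean.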

Notes

articles/ is split into multiple JSON shards to stay within GitHub file-size limits.
v1.0-paper is intended for paper reproduction.
If you only want to verify the pipeline, set EVALUATE_REASONS=False to skip the judge-model stage.
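
Because the article metadata is sharded, anything that reads it outside the provided pipeline has to merge the shards first. The following is a rough sketch under assumptions: the function name `load_article_shards` is hypothetical, the default directory should be adjusted to wherever articles/ lives in your checkout, and each shard is assumed to be either a JSON list of records or a JSON object keyed by article id. The actual shard layout is not described here.

```python
import json
from pathlib import Path


def load_article_shards(directory="articles"):
    """Merge every *.json shard in `directory` into one list of article records."""
    articles = []
    # Sort so the merge order is deterministic across runs.
    for shard in sorted(Path(directory).glob("*.json")):
        with shard.open(encoding="utf-8") as fh:
            payload = json.load(fh)
        # Assumption: a shard is a list of records or a dict keyed by article id.
        if isinstance(payload, dict):
            articles.extend(payload.values())
        else:
            articles.extend(payload)
    return articles
```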
