A Reviewer-Aligned Benchmark for Survey Evaluators
SurveyReview is a reviewer-aligned benchmark for evaluating survey papers. It converts real peer-review reports into multidimensional evaluation labels and rationales, allowing models to be tested against how human reviewers judge survey quality.
The benchmark focuses on four survey-review dimensions: Readability, Criticalness, Comprehensiveness, and Structure. It provides standardized train/test splits, article metadata, prompt files, and an API-based evaluation pipeline.
Create an environment and install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Configure the API client:
cp .env.example .env
Then edit `.env`:
API_KEY=your-api-key-here
BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-5.2
JUDGE_MODEL=gpt-5.2
EVALUATE_REASONS=True
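To make the role of these settings concrete, here is a minimal sketch of how `KEY=VALUE` lines like the ones above could be parsed with the Python standard library. It is not the repository's loader, which is not shown here and may well rely on a library such as python-dotenv; the helper names `parse_env_text` and `as_bool` are hypothetical.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that contain no '=' at all
        settings[key.strip()] = value.strip()
    return settings


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings (assumed convention, not confirmed)."""
    return value.strip().lower() in {"1", "true", "yes"}


# Same keys as the example .env above; the API key is a placeholder.
EXAMPLE_ENV = """
API_KEY=your-api-key-here
BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-5.2
JUDGE_MODEL=gpt-5.2
EVALUATE_REASONS=True
"""

settings = parse_env_text(EXAMPLE_ENV)
evaluate_reasons = as_bool(settings.get("EVALUATE_REASONS", "False"))
print(settings["MODEL_NAME"], settings["JUDGE_MODEL"], evaluate_reasons)
```

Partitioning on the first `=` keeps values that themselves contain `=` or `:` (such as URLs) intact.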
Run the default test-set evaluation:
python src/api_base_evaluate.py
Run on the train split:
EVAL_SPLIT=train python src/api_base_evaluate.py
Outputs are written to `result/<timestamp>/`:
| File | Description |
|---|---|
| `results.csv` | MSE, MAE, accuracy, and sample counts for each dimension. |
| `predictions_<dimension>.jsonl` | Per-sample prediction records. |
| `run_config.json` | Runtime configuration and split statistics. |
| `rqs_<dimension>.json` | Rationale quality results when `EVALUATE_REASONS=True`. |
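As a rough illustration of the per-dimension metrics listed for `results.csv`, the sketch below computes MSE, MAE, and accuracy for one dimension. The scores are made-up integers on a hypothetical scale, and treating accuracy as an exact match between predicted and reviewer scores is an assumption; the pipeline may define it differently.

```python
def mse(pred: list[float], gold: list[float]) -> float:
    """Mean squared error between predicted and reviewer scores."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


def mae(pred: list[float], gold: list[float]) -> float:
    """Mean absolute error between predicted and reviewer scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def exact_accuracy(pred: list[float], gold: list[float]) -> float:
    """Share of samples where the prediction equals the reviewer score
    (assumed definition of accuracy)."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


# Hypothetical scores for a single dimension, e.g. Readability.
gold_scores = [4, 3, 5, 2]
pred_scores = [5, 3, 4, 4]

print("MSE:", mse(pred_scores, gold_scores))
print("MAE:", mae(pred_scores, gold_scores))
print("Accuracy:", exact_accuracy(pred_scores, gold_scores))
print("Samples:", len(gold_scores))
```

Because MSE squares each error, the single two-point miss in this toy example contributes more to MSE than to MAE, which is why the two metrics can rank models differently.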
Lower MSE/MAE is better. Higher SSR/RQS is better.
| Rank | Model | SSR | Read. MSE | Read. MAE | Crit. MSE | Crit. MAE | Comp. MSE | Comp. MAE | Stru. MSE | Stru. MAE | Average MSE | Average MAE | RQS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SurveyReviewer | 0.74 | 1.43 | 0.72 | 1.52 | 0.82 | 1.26 | 0.56 | 1.29 | 0.65 | 1.38 | 0.69 | 0.36 |
| 2 | GPT-5.2 | 0.68 | 2.13 | 1.07 | 1.97 | 0.97 | 2.04 | 1.08 | 2.98 | 1.47 | 2.28 | 1.15 | 0.42 |
| 3 | Claude-Opus-4.5 | 0.68 | 2.91 | 1.29 | 1.88 | 0.88 | 2.66 | 1.23 | 3.65 | 1.58 | 2.77 | 1.25 | 0.48 |
| 4 | Qwen3-32B | 0.61 | 3.05 | 1.45 | 3.24 | 1.51 | 3.22 | 1.54 | 3.35 | 1.53 | 3.21 | 1.51 | 0.36 |
| 5 | GLM-4.7 | 0.60 | 3.43 | 1.50 | 2.58 | 1.21 | 3.66 | 1.57 | 4.83 | 1.95 | 3.62 | 1.56 | 0.37 |
| 6 | gemini-3-pro | 0.58 | 3.84 | 1.52 | 2.25 | 1.00 | 3.91 | 1.49 | 5.76 | 2.11 | 3.94 | 1.53 | 0.29 |
| 7 | DeepSeek-v3.2 | 0.58 | 4.78 | 1.88 | 2.49 | 1.15 | 4.59 | 1.82 | 4.02 | 1.76 | 3.97 | 1.65 | 0.37 |
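The Average columns in the table above appear to be the unweighted mean of the four per-dimension values. That is an observation from the reported numbers, not something the benchmark states, and the short check below tests it on two rows copied from the table. Since every published value is rounded to two decimals, the comparison allows a tolerance of 0.01.

```python
# Values copied from the table; dimension order is Read., Crit., Comp., Stru.
ROWS = {
    "SurveyReviewer": {
        "mse": [1.43, 1.52, 1.26, 1.29],
        "mae": [0.72, 0.82, 0.56, 0.65],
        "avg_mse": 1.38,
        "avg_mae": 0.69,
    },
    "GPT-5.2": {
        "mse": [2.13, 1.97, 2.04, 2.98],
        "mae": [1.07, 0.97, 1.08, 1.47],
        "avg_mse": 2.28,
        "avg_mae": 1.15,
    },
}

# Inputs and reported averages are each rounded to two decimals,
# so the recomputed mean may differ from the reported one by up to 0.01.
TOLERANCE = 0.01 + 1e-9


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def matches_reported(row: dict) -> bool:
    """True if both recomputed means agree with the reported averages."""
    return (
        abs(mean(row["mse"]) - row["avg_mse"]) <= TOLERANCE
        and abs(mean(row["mae"]) - row["avg_mae"]) <= TOLERANCE
    )


for model, row in ROWS.items():
    print(
        f"{model}: mean MSE {mean(row['mse']):.4f} (reported {row['avg_mse']}), "
        f"mean MAE {mean(row['mae']):.4f} (reported {row['avg_mae']}), "
        f"consistent={matches_reported(row)}"
    )
```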
- `articles/` is split into multiple JSON shards to stay within GitHub file-size limits.
- `v1.0-paper` is intended for paper reproduction.
- If you only want to verify the pipeline, set `EVALUATE_REASONS=False` to skip the judge-model stage.
