Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? PreScience evaluates models on a curated dataset of 98,000 AI-related arXiv papers (Oct 2023 -- Oct 2025) with disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of 502,000 total papers. It decomposes the prediction of a single scientific advance into four interdependent tasks:
| Task | Description | Metrics |
|---|---|---|
| Collaborator Prediction | Predict the remaining authors on a future paper given a seed author | nDCG, R-Precision |
| Prior Work Selection | Predict the key references of a future paper given its authors | nDCG, R-Precision |
| Contribution Generation | Generate a paper's title and abstract given its authors and key references | LACERScore, ROUGE-L, BERTScore |
| Impact Prediction | Predict a paper's 12-month cumulative citation count | MAE, Pearson, Spearman |
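The first two tasks are ranking problems scored with nDCG and R-Precision. As a point of reference, here is a minimal sketch of these two metrics under binary relevance; the function names are ours and are not the repo's evaluation API:

```python
import math

def ndcg(ranked, relevant):
    """Binary-relevance nDCG: each relevant item contributes gain 1,
    discounted by log2 of its (1-based) rank + 1."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, x in enumerate(ranked) if x in rel)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), len(ranked))))
    return dcg / ideal if ideal else 0.0

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    rel = set(relevant)
    r = len(rel)
    return sum(1 for x in ranked[:r] if x in rel) / r if r else 0.0
```

For example, a ranking that places both true collaborators first scores nDCG 1.0, while one that interleaves a wrong candidate loses both metrics proportionally.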
The dataset is hosted on HuggingFace and contains 98k target papers across train (Oct 2023--2024) and test (Oct 2024--2025) splits, along with 400k+ companion papers (references and author publication histories). See the dataset card for full schema and statistics.
git clone https://github.com/allenai/prescience.git
cd prescience
pip install -r requirements.txt

Prerequisites:
- Python 3.10+
- GPU with 24 GB+ VRAM for GritLM-7B embeddings and 8B model fine-tuning (BF16, no quantization); CUDA 12 required (faiss-gpu-cu12)
- Embedding-based task baselines and multiturn/simulate.py require precomputed embeddings (see Computing Embeddings)
- export OPENAI_API_KEY=<your_key> — required for GPT baselines, LACERScore evaluation, and gold paraphrase generation
- export ANTHROPIC_API_KEY=<your_key> — required for Claude baselines
- SPECTER2 has dependency conflicts with the main environment; use requirements_specter2.txt in a separate environment
prescience/
├── task_coauthor_prediction/ # Collaborator prediction baselines, evaluation and analyses
├── task_priorwork_prediction/ # Prior work selection baselines, evaluation and analyses
├── task_followup_prediction/
│   ├── generate/             # Contribution generation baselines (GPT, Claude, OLMo, LLaMA)
│   ├── evaluate/             # LACERScore, BERTScore, ROUGE-L evaluation
│   └── analysis/
├── task_impact_prediction/ # Impact prediction baselines, evaluation and analyses
├── multiturn/
│ ├── simulate.py # End-to-end multi-turn simulation
│ └── analysis/ # Simulation analysis scripts
├── dataset/
│ ├── corpus/ # 7-stage corpus creation pipeline
│ ├── embeddings/ # Embedding computation
│ └── s2and_prep/ # S2AND author disambiguation features
├── utils.py # Shared utilities (I/O, embeddings, FAISS, S2 API)
└── requirements.txt
All baseline scripts save predictions to data/<task_dir>/<split>/predictions/ and generation scripts save to data/task_followup_prediction/<split>/generations/, which are the inputs to the evaluation scripts.
# Frequency
python3 -m task_coauthor_prediction.baseline_frequency --split test
# Rank Fusion (requires precomputed embeddings)
python3 -m task_coauthor_prediction.baseline_rank_fusion --split test --embeddings_dir data/corpus/test --embedding_type grit
# Embedding Fusion
python3 -m task_coauthor_prediction.baseline_embedding_fusion --split test --embeddings_dir data/corpus/test --embedding_type grit
# Hierarchical Clustering
python3 -m task_coauthor_prediction.baseline_hierarchical --split test --embeddings_dir data/corpus/test --embedding_type grit
# Embedding Fusion Projected (train projection, then run inference)
python3 -m task_coauthor_prediction.train_projection --split train --embeddings_dir data/corpus/train --embedding_type grit --output_dir data/task_coauthor_prediction/train/projection_models
python3 -m task_coauthor_prediction.baseline_mean_pooling_projected --split test --embeddings_dir data/corpus/test --embedding_type grit --checkpoint data/task_coauthor_prediction/train/projection_models/projection.grit.pt
# Evaluate
python3 -m task_coauthor_prediction.evaluate --predictions_path data/task_coauthor_prediction/test/predictions/predictions.frequency.one_shot.first.json

Table 2 results are reported across --embedding_type values: gtr, specter2, grit. The --seed_author_type flag defaults to first (matching Table 2); other options (last, random, highest_h_index) are explored in Appendix C.1.
The task_coauthor_prediction/analysis/ directory contains the scripts used to generate the paper's analysis plots (e.g., nDCG vs. experience, R-Precision by familiarity). These expect the full set of baseline predictions with grit embeddings and may not run with partial inputs.
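The embedding-based baselines above share one core idea: represent an author as the mean of their papers' embeddings and rank candidates by similarity to the seed author. A minimal numpy sketch of that idea, assuming nothing about the repo's actual interfaces (function and variable names here are illustrative only):

```python
import numpy as np

def author_profile(paper_embs):
    """Represent an author as the mean of their papers' embeddings
    (the 'mean pooling' step of the Embedding Fusion baseline)."""
    return np.mean(paper_embs, axis=0)

def rank_candidates(seed_profile, candidate_profiles):
    """Rank candidate authors by cosine similarity to the seed author."""
    seed = seed_profile / np.linalg.norm(seed_profile)
    names, sims = [], []
    for name, prof in candidate_profiles.items():
        names.append(name)
        sims.append(float(seed @ (prof / np.linalg.norm(prof))))
    order = np.argsort(sims)[::-1]  # most similar first
    return [names[i] for i in order]
```

The actual baselines add candidate filtering, FAISS-backed retrieval, and (for the projected variant) a learned projection on top of the pooled profile.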
# Frequency
python3 -m task_priorwork_prediction.baseline_frequency --split test
# Embedding Fusion (mean pool of authored papers; "Emb. Fusion" in Table 2)
python3 -m task_priorwork_prediction.baseline_mean_pooling --split test --embeddings_dir data/corpus/test --embedding_type grit
# Embedding Fusion Refs (mean pool of cited references; "Emb. Fusion Refs" in Table 2)
python3 -m task_priorwork_prediction.baseline_embedding_fusion --split test --embeddings_dir data/corpus/test --embedding_type grit
# Rank Fusion
python3 -m task_priorwork_prediction.baseline_rank_fusion --split test --embeddings_dir data/corpus/test --embedding_type grit
# Hierarchical Clustering
python3 -m task_priorwork_prediction.baseline_hierarchical --split test --embeddings_dir data/corpus/test --embedding_type grit
# Embedding Fusion Projected (train projection, then run inference)
python3 -m task_priorwork_prediction.train_projection --split train --embeddings_dir data/corpus/train --embedding_type grit --output_dir data/task_priorwork_prediction/train/projection_models
python3 -m task_priorwork_prediction.baseline_mean_pooling_projected --split test --embeddings_dir data/corpus/test --embedding_type grit --checkpoint data/task_priorwork_prediction/train/projection_models/projection.grit.pt
# Evaluate
python3 -m task_priorwork_prediction.evaluate --predictions_path data/task_priorwork_prediction/test/predictions/predictions.frequency.json

Table 2 results are reported across --embedding_type values: gtr, specter2, grit.
The task_priorwork_prediction/analysis/ directory contains the scripts used to generate the paper's analysis plots (e.g., nDCG vs. experience, R-Precision by familiarity). These expect the full set of baseline predictions with grit embeddings and may not run with partial inputs.
# GPT models (requires OPENAI_API_KEY)
python3 -m task_followup_prediction.generate.baseline_gpt_parallel --split test --model gpt-4o-2024-11-20
# Claude models (requires ANTHROPIC_API_KEY)
python3 -m task_followup_prediction.generate.baseline_claude_parallel --split test --model claude-sonnet-4-5-20250929
# Local models — vanilla
python3 -m task_followup_prediction.generate.baseline_local --split test --model llama3.1-8b
# Local models — fine-tuned (LoRA)
python3 -m task_followup_prediction.generate.baseline_local --split test --model llama3.1-8b --adapter_path data/task_followup_prediction/train/lora_models/llama3.1-8b/final/adapter
# Same-topic baseline
python3 -m task_followup_prediction.generate.baseline_same_topic --split test
# Key reference baseline
python3 -m task_followup_prediction.generate.baseline_key_reference --split test
# Gold paraphrase (requires OPENAI_API_KEY)
python3 -m task_followup_prediction.generate.baseline_paraphrased_gold --split test --model gpt-5-2025-08-07

Model name mapping (paper name → --model value):
| Paper Name | --model | Script |
|---|---|---|
| GPT 4o | gpt-4o-2024-11-20 | baseline_gpt_parallel |
| GPT 4.1 | gpt-4.1-2025-04-14 | baseline_gpt_parallel |
| GPT o3 | o3-2025-04-16 | baseline_gpt_parallel |
| GPT 5 | gpt-5-2025-08-07 | baseline_gpt_parallel |
| GPT 5.1 | gpt-5.1-chat-latest | baseline_gpt_parallel |
| GPT 5.2 | gpt-5.2-2025-12-11 | baseline_gpt_parallel |
| Claude Sonnet 4.5 | claude-sonnet-4-5-20250929 | baseline_claude_parallel |
| Claude Opus 4.5 | claude-opus-4-5-20251101 | baseline_claude_parallel |
| LLaMA 3.1 8B | llama3.1-8b | baseline_local |
| OLMo 3 7B | olmo3-7b | baseline_local |
Table 3 reports only fine-tuned (FT) results for LLaMA and OLMo.
LoRA fine-tuning:
# Hyperparameter search + final training
python3 -m task_followup_prediction.generate.finetune_lora --input_dir data/corpus/train --output_dir data/task_followup_prediction/train/lora_models --model llama3.1-8b --n_trials 20 --val_ratio 0.15 --num_train_epochs 3
# Skip HP search with known-good params
python3 -m task_followup_prediction.generate.finetune_lora --input_dir data/corpus/train --output_dir data/task_followup_prediction/train/lora_models --model llama3.1-8b --skip_hp_search --learning_rate 2e-4 --lora_r 16

Evaluate:
# LACERScore (default judge: gpt-5-2025-08-07; also supports Anthropic via --judge anthropic)
python3 -m task_followup_prediction.evaluate.lacer_score --input_path <generations_file>
# BERTScore
python3 -m task_followup_prediction.evaluate.bert_score --input_path <generations_file>
# ROUGE-L
python3 -m task_followup_prediction.evaluate.rouge_score --input_path <generations_file>

The task_followup_prediction/analysis/ directory contains the scripts used to generate the paper's analysis plots (e.g., LACER by category, LACER vs. citations). These require LACER-scored generation files as input.
Feature-to-flag mapping (for understanding Table 4 feature set combinations):
| Paper Feature Set | CLI Flags |
|---|---|
| Target Text | --use_followup_work_paper |
| Context Text | --use_author_papers --use_prior_work_papers |
| Bibliometrics | --use_author_numbers --use_prior_work_numbers |
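Each flag toggles a feature group that is concatenated into one input vector before regression. A hedged sketch of the general pattern, using closed-form ridge regression as a stand-in (the actual baselines use scikit-learn/XGBoost, and all names below are illustrative):

```python
import numpy as np

def build_features(paper_emb, author_stats, ref_stats):
    """Concatenate a text embedding with bibliometric features,
    mirroring a Target Text + Bibliometrics flag combination."""
    return np.concatenate([paper_emb, author_stats, ref_stats])

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def predict(X, w):
    return X @ w
```

Dropping a flag simply removes that block of columns from the feature matrix, which is what makes the Table 4 ablations directly comparable.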
# Ridge — Target Text (GRIT)
python3 -m task_impact_prediction.baseline_embedding_ridge --split test --train_embeddings_dir data/corpus/train --test_embeddings_dir data/corpus/test --embedding_type grit --use_followup_work_paper
# XGBoost — Target Text only (GRIT)
python3 -m task_impact_prediction.baseline_xgboost_regressor --split test --train_embeddings_dir data/corpus/train --test_embeddings_dir data/corpus/test --embedding_type grit --use_followup_work_paper
# XGBoost — Target + Context Text
python3 -m task_impact_prediction.baseline_xgboost_regressor --split test --train_embeddings_dir data/corpus/train --test_embeddings_dir data/corpus/test --embedding_type grit --use_followup_work_paper --use_author_papers --use_prior_work_papers
# XGBoost — Target + Context + Bibliometrics (full model)
python3 -m task_impact_prediction.baseline_xgboost_regressor --split test --train_embeddings_dir data/corpus/train --test_embeddings_dir data/corpus/test --embedding_type grit --use_followup_work_paper --use_author_papers --use_prior_work_papers --use_author_numbers --use_prior_work_numbers
# XGBoost — Bibliometrics only
python3 -m task_impact_prediction.baseline_xgboost_regressor --split test --train_embeddings_dir data/corpus/train --test_embeddings_dir data/corpus/test --embedding_type grit --use_author_numbers --use_prior_work_numbers
# Evaluate
python3 -m task_impact_prediction.evaluate --predictions_path data/task_impact_prediction/test/predictions/predictions.ridge_grit.json

Replace grit with gtr or specter2 to reproduce other embedding rows in Table 4.
The task_impact_prediction/analysis/ directory contains the scripts used to generate the paper's analysis plots (e.g., prediction scatter, SHAP analysis). These require evaluation output files as input.
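The evaluate script reports MAE, Pearson, and Spearman over predicted vs. true 12-month citation counts. For reference, a self-contained numpy sketch of the three metrics (the repo's implementation may differ, e.g. in tie handling for Spearman):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted citation counts."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def pearson(x, y):
    """Pearson correlation: covariance normalized by the std devs."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    """Spearman = Pearson computed on ranks (ties not handled here)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))
```

Spearman is the most forgiving of the three for this task, since it rewards getting the relative ordering of papers right even when absolute citation counts are off.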
Embedding-based baselines require precomputed embeddings. Compute them for each split you need:
python3 -m dataset.embeddings.compute_paper_embeddings --split test --embedding_type grit --output_dir data/corpus/test
python3 -m dataset.embeddings.compute_paper_embeddings --split train --embedding_type grit --output_dir data/corpus/train

| Type | Model | Dimension |
|---|---|---|
| gtr | sentence-transformers/gtr-t5-large | 768 |
| specter2 | allenai/specter2_base | 768 |
| grit | GritLM/GritLM-7B | 4096 |
SPECTER2 embeddings require a separate environment; install with pip install -r requirements_specter2.txt.
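Downstream baselines retrieve neighbors over these vectors with FAISS inner-product search (see utils.py). Conceptually, with L2-normalized rows that is just cosine similarity; a numpy equivalent, with illustrative names rather than the repo's API:

```python
import numpy as np

def build_index(embs):
    """L2-normalize rows so that inner product equals cosine similarity
    (the same trick used with FAISS IndexFlatIP)."""
    embs = np.asarray(embs, dtype=np.float32)
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

def search(index, query, k=5):
    """Return indices of the k papers most similar to the query."""
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    sims = index @ q
    return np.argsort(sims)[::-1][:k]
```

The real pipeline does the same thing at 98k-paper scale, which is why a GPU FAISS build (faiss-gpu-cu12) is listed in the prerequisites.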
The simulation composes collaborator prediction, prior work selection, and contribution generation into a pipeline that generates a synthetic corpus day-by-day over a specified time horizon.
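Stripped of retrieval and API details, one simulated day chains the three tasks per seed author. A stub-level sketch of that control flow (this is our simplification of multiturn/simulate.py, not its actual code):

```python
def simulate_day(seed_authors, predict_collaborators,
                 select_prior_work, generate_contribution):
    """One simulated day: for each seed author, compose collaborator
    prediction, prior work selection, and contribution generation
    into a synthetic paper record."""
    papers = []
    for seed in seed_authors:
        authors = [seed] + predict_collaborators(seed)      # Task 1
        refs = select_prior_work(authors)                   # Task 2
        title, abstract = generate_contribution(authors, refs)  # Task 3
        papers.append({"authors": authors, "key_references": refs,
                       "title": title, "abstract": abstract})
    return papers
```

Generated papers are appended to the corpus, so later days can cite and build on earlier synthetic output.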
python3 -m multiturn.simulate --calibration_split train --embeddings_dir data/corpus/train \
--author_embedding_type grit --paper_embedding_type grit \
--coauthor_baseline embedding_fusion --priorwork_baseline embedding_fusion \
--generation_backend openai --generation_model gpt-5-2025-08-07 \
    --output_dir data/multiturn/simulated --depth 365 --seed 42

Key arguments:
- --coauthor_baseline: frequency, embedding_fusion, rank_fusion
- --priorwork_baseline: frequency, embedding_fusion, rank_fusion
- --generation_backend: openai, anthropic
- --generation_model: any supported model name (e.g., gpt-5-2025-08-07)
- --seed: random seed for reproducibility
The paper uses embedding_fusion for both collaborator and prior work prediction with GPT-5 for generation, runs the simulation 6 times with different --seed values, and reports 95% confidence intervals.
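Aggregating the 6 seeded runs into a mean with a 95% confidence interval can be sketched as follows (our helper, not a repo script; the t critical value 2.571 is for 5 degrees of freedom, i.e. n=6 runs):

```python
import numpy as np

def mean_ci95(values):
    """Mean and 95% CI half-width over a small number of seeded runs,
    using the Student-t critical value (2.571 for n=6)."""
    v = np.asarray(values, dtype=float)
    n = len(v)
    t = {6: 2.571}[n]  # extend this table for other run counts
    half = t * v.std(ddof=1) / np.sqrt(n)
    return v.mean(), half
```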
Analysis. The analysis pipeline has two stages: compute metrics, then plot. For example, to analyze author diversity:
# Compute diversity for natural and synthetic corpora
python3 -m multiturn.analysis.compute_author_diversity --compute_on natural --role target --output_path data/multiturn/analysis/author_diversity_natural.json
python3 -m multiturn.analysis.compute_author_diversity --compute_on synthetic --synthetic_dir data/multiturn/simulated --role synthetic --output_path data/multiturn/analysis/author_diversity_synthetic.json
# Plot
python3 -m multiturn.analysis.plot_author_diversity --natural_path data/multiturn/analysis/author_diversity_natural.json --synthetic_paths data/multiturn/analysis/author_diversity_synthetic.json

Other available analysis scripts follow the same compute→plot pattern:
| Analysis | Compute Script | Plot Script |
|---|---|---|
| Author diversity | compute_author_diversity | plot_author_diversity |
| Key reference diversity | compute_key_references_diversity | plot_key_references_diversity |
| LACER diversity | compute_diversity_neighbors → compute_lacer_diversity | plot_lacer_diversity |
| LACER novelty | compute_novelty_neighbors → compute_lacer_novelty | plot_lacer_novelty |
| Topic distribution | classify_synthetic_primary_categories | plot_topic_distribution |
| Paper Reference | Tasks / Section | How to Reproduce |
|---|---|---|
| Table 2 | Collaborator Prediction + Prior Work Selection | Run all embedding baselines (Frequency, Rank Fusion, Emb. Fusion, Hier. Clustering, Projection) × 3 embedding types (gtr, specter2, grit) |
| Table 3 | Contribution Generation | Run all LLM baselines (see model mapping table) and evaluate with LACERScore, BERTScore, and ROUGE-L |
| Table 4 | Impact Prediction | Run XGBoost with feature combinations (Target Text, Context Text, Bibliometrics) × 3 embedding types |
| Figure 4 | Multi-turn Simulation | Run multiturn.simulate with --coauthor_baseline embedding_fusion --priorwork_baseline embedding_fusion --generation_model gpt-5-2025-08-07, 6 seeds |
Most users should use the HuggingFace dataset directly. The dataset we released was built using internal versions of the Semantic Scholar API for efficiency; the scripts below use the equivalent public APIs, but results may vary slightly. Due to public API rate limits, these scripts may take a very long time to complete. If you are serious about building a similar dataset, please contact us.
To rebuild from scratch, you need:
- S2 API key: Get one at https://www.semanticscholar.org/product/api#api-key and set export S2_API_KEY=<your_key>
- arXiv snapshot: Download from Kaggle to data/arxiv_snapshot/arxiv-metadata-oai-snapshot.json
- S2AND: For author disambiguation (Stage 4), follow S2AND setup
Full pipeline commands
Run each stage for both train and test splits. Commands below show the train split; for test, replace 2023-10-01/2024-10-01 with 2024-10-01/2025-10-01 and train with test.
# Stage 1: Download target papers from arXiv snapshot + S2 API
python3 -m dataset.corpus.download_target_papers --start_date 2023-10-01 --end_date 2024-10-01 --output_dir data/corpus/train
# Stage 2: Add key (highly influential) references
python3 -m dataset.corpus.add_key_references --input_dir data/corpus/train --output_dir data/corpus/train
# Stage 3: Add author rosters and publication histories
python3 -m dataset.corpus.add_authors --input_dir data/corpus/train --output_dir data/corpus/train
# Stage 4a: Prepare S2AND input features
python3 -m dataset.s2and_prep.download_s2and_features --input_dirs data/corpus/train data/corpus/test --output_dir data/corpus/s2and_prescience
# Stage 4b: Run S2AND clustering (see https://github.com/allenai/S2AND)
# Stage 4c: Merge disambiguated author identities into corpus
python3 -m dataset.corpus.merge_authors_in_corpus --input_dir data/corpus/train --output_dir data/corpus/train --s2and_data_dir data/corpus/s2and_prescience
# Stage 5: Add citation metadata (citation counts, h-index, trajectories)
python3 -m dataset.corpus.add_citation_metadata --input_dir data/corpus/train --output_dir data/corpus/train
# Stage 6: Replace titles/abstracts with official arXiv versions
python3 -m dataset.corpus.replace_title_abstracts_using_snapshot --input_dir data/corpus/train --output_dir data/corpus/train
# Stage 7: Compute embeddings
python3 -m dataset.embeddings.compute_paper_embeddings --split train --embedding_type gtr --output_dir data/corpus/train
python3 -m dataset.embeddings.compute_paper_embeddings --split train --embedding_type specter2 --output_dir data/corpus/train
python3 -m dataset.embeddings.compute_paper_embeddings --split train --embedding_type grit --output_dir data/corpus/train

@misc{ajith2026presciencebenchmarkforecastingscientific,
title={PreScience: A Benchmark for Forecasting Scientific Contributions},
author={Anirudh Ajith and Amanpreet Singh and Jay DeYoung and Nadav Kunievsky and Austin C. Kozlowski and Oyvind Tafjord and James Evans and Daniel S. Weld and Tom Hope and Doug Downey},
year={2026},
eprint={2602.20459},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.20459},
}

Code is licensed under Apache 2.0 — see LICENSE for details.
The dataset is released under ODC-BY 1.0.
We welcome bug fixes and improvements — please submit a pull request. For questions or suggestions, open an issue on GitHub. For other inquiries, contact anirudha@allenai.org.