# SLDBench — Scaling Law Discovery Benchmark

## Introduction

**SLDBench** is a benchmark for discovering scaling laws, originally introduced in the paper [*Can Language Models Discover Their Own Scaling Laws?*](https://arxiv.org/abs/2507.21184) by Lin et al. It aggregates over **5,000 LLM training experiments** from recent scaling-law literature into a unified dataset, hosted on the Hugging Face Hub at [`pkuHaowei/sldbench`](https://huggingface.co/datasets/pkuHaowei/sldbench).

See also this [blog post](https://algorithmicsuperintelligence.ai/blog/openevolve-sldagent/) for a quick overview of how OpenEvolve and SLDBench fit together.
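
For a quick look at the raw data, you can load the dataset directly with the `datasets` library. The snippet below is a minimal sketch: the subset name `sft_scaling_law` is an illustrative assumption about how the Hub repository is organized, and `data_loader.py` in this example handles the exact per-task layout.

```python
# Minimal sketch of inspecting the raw benchmark data.
# Assumption: "sft_scaling_law" is an illustrative subset name; see
# data_loader.py for the loading logic the evaluator actually uses.
from datasets import load_dataset

ds = load_dataset("pkuHaowei/sldbench", "sft_scaling_law", split="train")
print(ds)     # schema and number of rows
print(ds[0])  # one experiment record (input scales and the measured loss)
```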

## Overview

SLDBench focuses on **discovery** rather than simple curve fitting. The agent must identify:

- A **symbolic law** $f_\theta(x)$ (the functional form).
- A **parameter-fitting routine** that generalizes across multiple training scenarios.

**Key Features:**

- **Data Source:** All task data is pulled dynamically from the Hugging Face dataset.
- **Extrapolation Evaluation:** Laws are fitted on smaller-scale runs and strictly evaluated on held-out, larger-scale configurations to test predictive capability.
- **Evolutionary Loop:** OpenEvolve iteratively mutates and evaluates candidate implementations of `scaling_law_func(...)` (the symbolic law) and `fit_scaling_law(...)` (the optimizer).

------

## SLDBench Tasks

There are currently 7 core scaling-law discovery tasks, each derived from real-world LLM experiments. Configuration files for these tasks are located in `examples/sldbench/configs/`.

| **Task Name (Config)** | **Scenario** | **Inputs (X)** | **Target (y)** |
| -------------------------------- | -------------------------------------------- | ---------------------------------------------------- | ------------------------------ |
| **parallel_scaling_law** | Parallel / Best-of-N inference scaling. | Model size $N$, Parallelism $P$ | Loss $L(N, P)$ |
| **vocab_scaling_law** | Vocabulary size vs. model/data scaling. | Non-vocab size $N$, Vocab size $V$, Dataset size $D$ | Unigram-normalized loss $L$ |
| **sft_scaling_law** | Supervised Fine-Tuning (SFT). | SFT dataset size $D$ | Fine-tuning loss $L(D)$ |
| **domain_mixture_scaling_law** | Multi-domain pre-training mixtures. | Domain mixture proportions $r$ | Per-domain losses $\{L_i(r)\}$ |
| **moe_scaling_law** | Mixture-of-Experts (MoE) scaling. | Network size $N$, Experts $E$ | Pre-training loss $L(N, E)$ |
| **data_constrained_scaling_law** | Data-constrained pre-training regimes. | Model size $N$, Dataset size $D$, Unique tokens $U$ | Loss $L(N, D, U)$ |
| **lr_bsz_scaling_law** | Joint Learning Rate / Batch Size (Step Law). | LR $l$, Batch size $b$, Dataset $D$, Model $N$ | Loss $L(l, b, D, N)$ & Optima |

> **Note:** A task named `easy_question_scaling_law` is also included for U-shape scaling studies, though it is not covered by the referenced paper.

------

## File Structure

- `configs/` — YAML configuration files defining data splits, features, targets, and evaluation settings for each task.
- `data_loader.py` — Unified data loader for the `pkuHaowei/sldbench` Hugging Face dataset.
- `evaluator.py` — The evaluation framework; handles data splitting (train/extrapolate) and metric computation.
- `init_program.py` — The seed implementation (a power-law–style baseline) to jumpstart the evolutionary search.

------

## Usage

### Configuration Prerequisites

Before running any tasks, ensure the API key environment variable for your provider is set (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).

### Running Individual Tasks

To run the evolutionary process for a specific task:

```bash
python openevolve-run.py \
  examples/sldbench/init_program.py \
  examples/sldbench/evaluator.py \
  --config examples/sldbench/configs/sft_scaling_law.yaml \
  --api-base "https://api.openai.com/v1" \
  --iterations 50
```

To switch tasks, simply point the `--config` argument to a different YAML file in `examples/sldbench/configs/`.

### Automated Benchmark Script

A complete benchmark script, `run.sh`, is provided for running multiple tasks across different models in parallel:

```bash
cd examples/sldbench
chmod +x run.sh
./run.sh 4  # Run with a parallelism degree of 4 per model
```

**Note:** Configure the `API_BASE` variable in the script (it defaults to `https://api.openai.com/v1`) and ensure your API key environment variable is set.

The script handles evolution and evaluation automatically, storing results in `./results/`.

### Important: Test Set Evaluation

**Note:** The `openevolve-run.py` command only evaluates programs on the **training set** during evolution. To compute final metrics on the **test set**, you must explicitly run:

```bash
python evaluator.py "path/to/generated_program.py"
```

When run as a script (`__main__` mode), `evaluator.py` computes metrics on the held-out extrapolation test set, which is the proper way to assess the predictive capability of the discovered scaling laws.

------

## Data Format & Evaluation

Each task is formulated as a scaling-law discovery problem containing:

1. **Features ($X$):** Input variables (e.g., $N, D, \text{LR}, \text{Batch Size}$).
2. **Targets ($y$):** Performance metrics (typically training or validation loss).
3. **Groups:** Control indices representing distinct experimental settings (e.g., different model architectures) that share the law *form* but require distinct fitted *parameters*, as in the sketch below.
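
To make the grouping concrete, here is a hedged sketch of fitting one shared functional form with separate parameters per group. The toy arrays and the use of `scipy.optimize.curve_fit` are illustrative assumptions, not the benchmark's actual interfaces; the real data layout is defined by `data_loader.py`.

```python
# Illustrative sketch: one law *form*, separately fitted *parameters* per group.
# Assumption: toy data layout (N = model size, y = loss, g = group index).
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, a, b, c):
    # Shared symbolic form: L(N) = a * N^(-b) + c
    return a * np.power(N, -b) + c

N = np.array([1e7, 1e8, 1e9, 1e7, 1e8, 1e9])
y = np.array([3.9, 3.1, 2.6, 4.2, 3.4, 2.9])
g = np.array([0, 0, 0, 1, 1, 1])  # two distinct experimental settings

params = {}
for group in np.unique(g):
    mask = g == group
    popt, _ = curve_fit(power_law, N[mask], y[mask], p0=[10.0, 0.1, 2.0], maxfev=10000)
    params[group] = popt  # same functional form, distinct fitted parameters
```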

### The Evaluation Process

1. **Splitting:** The evaluator partitions data into **training** and **extrapolation test** sets. The largest models or datasets are explicitly held out to mirror real-world forecasting needs.
2. **Fitting:** The `fit_scaling_law` function optimizes parameters on the training portion for each group.
3. **Scoring:** The fitted law is applied to the test set to compute the following metrics (sketched below):

- **NMSE:** Normalized Mean Squared Error
- **NMAE:** Normalized Mean Absolute Error
- **$R^2$:** Coefficient of Determination
- **Combined Score:** A single scalar summary (currently equivalent to $R^2$)

*Higher combined scores indicate superior extrapolation quality.*

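The exact normalizations live in `evaluator.py`; the sketch below uses one common convention (variance for NMSE, mean absolute deviation for NMAE), which is an assumption rather than the benchmark's definitive formulas.

```python
# Hedged sketch of the extrapolation metrics. The normalization choices here
# are common conventions and may differ from evaluator.py's exact definitions.
import numpy as np

def extrapolation_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    resid = y_true - y_pred
    nmse = np.mean(resid**2) / np.var(y_true)
    nmae = np.mean(np.abs(resid)) / np.mean(np.abs(y_true - y_true.mean()))
    r2 = 1.0 - np.sum(resid**2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"nmse": nmse, "nmae": nmae, "r2": r2, "combined_score": r2}
```
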
------

## Evolution Markers

OpenEvolve modifies code explicitly wrapped in evolution blocks. The agent evolves the symbolic form and the optimizer simultaneously:

```python
# EVOLVE-BLOCK-START
def scaling_law_func(data_points, params):
    # Returns predicted values given inputs and parameters
    pass

def fit_scaling_law(data_points, loss_values):
    # Optimizes parameters to fit the scaling law
    pass
# EVOLVE-BLOCK-END
```

The system mutates these blocks, evaluates them via `evaluator.py`, and maintains a database of the highest-performing implementations.
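
For intuition, a minimal filled-in version of this block might look like the following. This is an illustrative sketch, not the contents of `init_program.py`: it assumes `data_points` is an `(n_samples, n_features)` array of positive scale variables and fits a sum of per-feature power laws by minimizing mean squared error.

```python
# Illustrative sketch of a concrete evolve block (not the actual seed program).
# Assumption: data_points is an (n_samples, n_features) array of scale variables.
import numpy as np
from scipy.optimize import minimize

def scaling_law_func(data_points, params):
    X = np.atleast_2d(data_points)
    k = X.shape[1]
    a, b, c = np.asarray(params[:k]), np.asarray(params[k:2 * k]), params[-1]
    # L(x) = sum_i a_i * x_i^(-b_i) + c  (power laws plus an irreducible floor)
    return np.sum(a * np.power(X, -b), axis=1) + c

def fit_scaling_law(data_points, loss_values):
    k = np.atleast_2d(data_points).shape[1]
    init = np.concatenate([np.ones(k), 0.1 * np.ones(k), [float(np.min(loss_values))]])
    mse = lambda p: np.mean((scaling_law_func(data_points, p) - np.asarray(loss_values)) ** 2)
    result = minimize(mse, init, method="Nelder-Mead", options={"maxiter": 20000})
    return result.x
```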

## Requirements

```bash
pip install datasets numpy scipy
# Ensure the latest version of openevolve is installed
```

## Citation

If you use SLDBench, this example, or derived results in your work, please cite the original paper:

```bibtex
@article{lin2025sldbench,
  title   = {Can Language Models Discover Scaling Laws?},
  author  = {Lin, Haowei and Ye, Haotian and Feng, Wenzheng and Huang, Quzhe and
             Li, Yujun and Lim, Hubert and Li, Zhengrui and Wang, Xiangyu and
             Ma, Jianzhu and Liang, Yitao and Zou, James},
  journal = {arXiv preprint arXiv:2507.21184},
  year    = {2025}
}
```