This repository contains scripts for scaling experiments across different model scales and compression rates (i.e., patch size).
We release all the results presented in the paper "Compute Optimal Tokenization" (blog post) and the code for fitting scaling laws and producing visualizations.
- Experiments: Raw results used in scaling laws and visualizations.
- Scaling Laws: Power laws fit for optimal bytes-to-parameter ratio and optimal compression rate.
- Visualizations: Process results to generate visualizations.
We trained LLMs at multiple compute budgets, data sizes, parameter counts, and compression rates.
Raw results are stored as CSVs in result_csv/. Filenames encode the tokenizer (blt_entropy, blt_static, isotropic), an optional language tag (e.g. rus_Cyrl), and the compute budget exponent (e.g. 1e20).
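The naming scheme above can be decoded programmatically. The following is a minimal sketch, not code from this repository: the `_results.csv` suffix is taken from the working examples in this README, while the position of the language tag (between tokenizer and budget) and the helper names `FILENAME_RE`, `ResultFile`, and `parse_result_filename` are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <tokenizer>[_<language tag>]_<budget>_results.csv
# The placement of the language tag is a guess, not confirmed by the repository.
FILENAME_RE = re.compile(
    r"^(?P<tokenizer>blt_entropy|blt_static|isotropic)"
    r"(?:_(?P<language>[a-z]{3}_[A-Z][a-z]{3}))?"
    r"_(?P<budget>\d+(?:\.\d+)?e\d+)"
    r"_results\.csv$"
)


class ResultFile(NamedTuple):
    tokenizer: str
    language: Optional[str]
    compute_budget: float


def parse_result_filename(name: str) -> ResultFile:
    """Split a result filename into tokenizer, optional language tag, and budget."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized result filename: {name}")
    return ResultFile(
        tokenizer=match["tokenizer"],
        language=match["language"],  # None when no language tag is present
        compute_budget=float(match["budget"]),
    )


print(parse_result_filename("blt_entropy_1e20_results.csv"))
# Hypothetical multilingual filename, used only to exercise the optional tag.
print(parse_result_filename("blt_entropy_rus_Cyrl_1e20_results.csv"))
```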
Training/evaluation code is not included here. BLT runs can be reproduced with the BLT repository.
Both fitting scripts auto-discover the CSVs in result_csv/ whose names match --prefix. Each writes a human-readable *.txt report and a machine-readable *.json file of fitted coefficients to power_laws/, next to the script.
File: power_laws/fit_power_law_data_parameters.py
Fits the two power laws
B*(C, T) = B_0 · C^alpha · T^beta
N*(C, T) = N_0 · C^alpha_N · T^beta_N
where C is the compute budget, T is the compression rate, B* is the optimal data size (bytes), and N* is the optimal parameter count. Per-(C, T) optima are taken from the polynomial paraboloid fit to each isoFLOP CSV (see plot_utils.load_optimal_results). Fitting is done in log space with BFGS, using a 2197-point grid of starting points (multi-start), with standard errors and 95% CIs derived from the Hessian.
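To make the log-space formulation concrete: taking logs turns B*(C, T) = B_0 · C^alpha · T^beta into a model that is linear in (log B_0, alpha, beta). The sketch below fits that linear model with ordinary least squares in NumPy. This is a deliberately simpler stand-in for the script's BFGS multi-start procedure, and it does not compute standard errors or CIs. The function name `fit_power_law` and all numbers are made up for illustration.

```python
import numpy as np


def fit_power_law(C, T, y):
    """Least-squares fit of y = y0 * C**alpha * T**beta in log space."""
    C, T, y = (np.asarray(v, dtype=float) for v in (C, T, y))
    # Design matrix for: log y = log y0 + alpha * log C + beta * log T
    X = np.column_stack([np.ones_like(C), np.log(C), np.log(T)])
    coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    log_y0, alpha, beta = coef
    return float(np.exp(log_y0)), float(alpha), float(beta)


# Synthetic optima generated from made-up coefficients (not paper values).
budgets = np.array([1e19, 1e20, 1e21])
rates = np.array([2.0, 4.0, 8.0])
C, T = (g.ravel() for g in np.meshgrid(budgets, rates))
B_opt = 3.0 * C**0.5 * T**0.2

B_0, alpha, beta = fit_power_law(C, T, B_opt)
print(f"B_0={B_0:.3f}, alpha={alpha:.3f}, beta={beta:.3f}")
```

The same function applies unchanged to N*(C, T) by passing the optimal parameter counts as `y`.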
Usage:
python power_laws/fit_power_law_data_parameters.py [csv_files ...] \
--prefix blt_entropy \
--consider-global-parameters \
[--approximated-compute] [--dataset c4]
Working example (auto-discovers all blt_entropy_*e*_results.csv):
python power_laws/fit_power_law_data_parameters.py --consider-global-parameters
Arguments:
- csv_files: explicit list of CSVs. If omitted, files matching --prefix are auto-discovered.
- --prefix: CSV prefix used for auto-discovery (default blt_entropy; use isotropic for BPE).
- --consider-global-parameters: count parameters of the global (i.e., excluding encoder, decoder, embeddings) transformer only. Used for all paper results.
- --approximated-compute: replaces the file budget C with C* = 6·B·N/T (requires --consider-global-parameters). Useful for sanity-checking the budget assumption.
- --dataset: evaluation column to use for selecting optima (default c4).
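As a worked example of the approximated compute C* = 6·B·N/T: if T is read as bytes per patch, then B/T is the number of patches the global transformer processes, and the expression has the shape of the familiar 6 · parameters · tokens estimate. That reading is an interpretation, not something this README states. The function name and the numbers below are illustrative only.

```python
def approximated_compute(num_bytes: float, num_params: float, compression_rate: float) -> float:
    """C* = 6 * B * N / T, with B in bytes, N in parameters, T the compression rate."""
    return 6.0 * num_bytes * num_params / compression_rate


# Illustrative numbers only: 4e11 training bytes, 1e9 global parameters, T = 4.
c_star = approximated_compute(4e11, 1e9, 4.0)
print(f"C* = {c_star:.1e} FLOPs")
```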
Outputs (in power_laws/):
- <prefix>_power_law_fit_report[_approximated_global_compute].txt: fit summary, CIs, R².
- <prefix>_power_law_params[_approximated_global_compute].json: {B_optimal_data: {B_0, alpha, beta}, N_optimal_params: {N_0, alpha, beta}}.
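A short sketch of how the JSON coefficients could be consumed downstream. It assumes the file follows exactly the key layout listed above. The JSON string is a hand-written placeholder rather than real script output, and `evaluate_power_law` is a hypothetical helper.

```python
import json

# Placeholder contents mirroring the documented key layout;
# the numbers are not fitted values from the paper.
example_json = """
{
  "B_optimal_data": {"B_0": 3.0, "alpha": 0.5, "beta": 0.2},
  "N_optimal_params": {"N_0": 0.1, "alpha": 0.5, "beta": -0.1}
}
"""


def evaluate_power_law(coeffs: dict, prefactor_key: str, C: float, T: float) -> float:
    """Evaluate prefactor * C**alpha * T**beta for one entry of the JSON."""
    return coeffs[prefactor_key] * C ** coeffs["alpha"] * T ** coeffs["beta"]


params = json.loads(example_json)
C, T = 1e20, 4.0
B_star = evaluate_power_law(params["B_optimal_data"], "B_0", C, T)
N_star = evaluate_power_law(params["N_optimal_params"], "N_0", C, T)
print(f"B* = {B_star:.3e} bytes, N* = {N_star:.3e} parameters, B*/N* = {B_star / N_star:.1f}")
```

With a real output file, `json.loads(example_json)` would be replaced by `json.load` on the opened file.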
File: power_laws/fit_power_law_compression_rate.py
Jointly fits a scaling law for loss given compression and compute budget. Three residual variants are selectable:
- compute-dependent-t (default): L(C,T) = G·C^gamma + F·log(C^delta · T / T_0)^2 + E
- constant-t: L(C,T) = G·C^gamma + F·log(T / T_0)^2 + E
- intercept: L(C,T) = G·C^gamma + E
The optional --two-stage variant first fits G·C^gamma with rate-specific intercepts F_T, then fits the residual model to the residuals.
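The three residual variants can be written as plain functions. The sketch below assumes the exponent 2 applies to the logarithm as a whole, i.e. (log(...))^2. Under that reading and for F > 0, the compute-dependent variant is minimized over T where the logarithm vanishes, at T* = T_0 · C^(-delta). Function names and coefficient values are placeholders, not fitted results.

```python
import math


def loss_compute_dependent_t(C, T, G, gamma, F, delta, T_0, E):
    """L(C, T) = G*C**gamma + F*(log(C**delta * T / T_0))**2 + E."""
    return G * C**gamma + F * math.log(C**delta * T / T_0) ** 2 + E


def loss_constant_t(C, T, G, gamma, F, T_0, E):
    """L(C, T) = G*C**gamma + F*(log(T / T_0))**2 + E."""
    return G * C**gamma + F * math.log(T / T_0) ** 2 + E


def loss_intercept(C, T, G, gamma, E):
    """L(C, T) = G*C**gamma + E (no dependence on T)."""
    return G * C**gamma + E


def optimal_rate_compute_dependent(C, delta, T_0):
    """For F > 0 the squared log term is zero at T* = T_0 * C**(-delta)."""
    return T_0 * C ** (-delta)


# Placeholder coefficients for illustration only.
coeffs = dict(G=50.0, gamma=-0.1, F=0.05, delta=-0.02, T_0=1.5, E=0.6)
C = 1e20
T_star = optimal_rate_compute_dependent(C, coeffs["delta"], coeffs["T_0"])
for T in (0.5 * T_star, T_star, 2.0 * T_star):
    print(f"T={T:.2f}  L={loss_compute_dependent_t(C, T, **coeffs):.4f}")
```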
Usage:
python power_laws/fit_power_law_compression_rate.py [csv_files ...] \
--prefix blt_entropy \
[--two-stage] \
[--residual-model {compute-dependent-t,constant-t,intercept}] \
[--dataset c4]
Working example:
python power_laws/fit_power_law_compression_rate.py --two-stage
Arguments:
- csv_files, --prefix, --dataset: same as for power_laws/fit_power_law_data_parameters.py.
- --two-stage: two-stage fit producing per-rate F_T intercepts.
- --residual-model: selects which of the loss formulas above is fit.
Use plots/plot_compression_heatmap.py to create a heatmap (or contour plot) of validation loss over compression rate (x-axis) and one of bytes-per-parameter, data, or parameters (y-axis), or all three side by side. The script fits a paraboloid to the results and saves a PDF.
Usage:
python plots/plot_compression_heatmap.py <results_csv> --consider-global-parameters --contour [--fit-params --fit-data --fit-all]
Working example:
python plots/plot_compression_heatmap.py blt_entropy_1e20_results.csv --consider-global-parameters --contour
Arguments:
- <results_csv>: the CSV file with saved results [E.G. blt_entropy_1e20_results.csv]
- --x-scale and --y-scale: set a log or linear scale for each axis. The choice also affects the paraboloid fit. [DEFAULT: log]
- --contour: use a contour plot instead of a heatmap to draw the paraboloid (i.e., isolosses) [USED]
- --consider-global-parameters: if used, parameters of the global model are considered [USED]
- --fit-params: if used, plot parameters on the y-axis [NOT USED]
- --fit-data: if used, plot data on the y-axis [NOT USED]
- --fit-all: if used, 1x3 figure with all 3 options: parameters, data, and bytes-per-parameter
- --dataset: evaluation column (default c4)
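The paraboloid fit mentioned for the heatmap can be illustrated with a small NumPy sketch. It assumes a full quadratic surface in log-scaled axes, including a cross term; the script's exact parameterization may differ. The helper names `fit_paraboloid` and `paraboloid_minimum` and the synthetic data are made up for this example.

```python
import numpy as np


def fit_paraboloid(x, y, z):
    """Least-squares fit of z = a + b*x + c*y + d*x**2 + e*x*y + f*y**2."""
    X = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    return coef


def paraboloid_minimum(coef):
    """Stationary point of the surface (a minimum if the Hessian is positive definite)."""
    _, b, c, d, e, f = coef
    hessian = np.array([[2 * d, e], [e, 2 * f]])
    return np.linalg.solve(hessian, -np.array([b, c]))


# Synthetic losses on a grid with a known minimum at rate 4, ratio 100 (made-up values).
rates, ratios = np.meshgrid([2.0, 3.0, 4.0, 6.0, 8.0], [25.0, 50.0, 100.0, 200.0, 400.0])
x, y = np.log(rates).ravel(), np.log(ratios).ravel()
loss = 1.0 + 0.05 * (x - np.log(4.0)) ** 2 + 0.02 * (y - np.log(100.0)) ** 2

x_min, y_min = paraboloid_minimum(fit_paraboloid(x, y, loss))
print(f"optimal compression rate ~ {np.exp(x_min):.2f}, optimal bytes-per-parameter ~ {np.exp(y_min):.1f}")
```

Fitting in log coordinates mirrors the default log scales of --x-scale and --y-scale; with linear scales, the `np.log` calls would be dropped.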
Use plots/plot_compression_loss_curves.py to create scaling curves of loss vs. one of bytes-per-parameter, data, or parameters (or all three side by side), with one curve per compression rate.
Usage:
python plots/plot_compression_loss_curves.py <results_csv> --consider-global-parameters [--joint-fit] [--fit-params --fit-data --fit-all]
Working example:
python plots/plot_compression_loss_curves.py blt_entropy_1e20_results.csv --consider-global-parameters
Arguments:
- <results_csv>: the CSV file with saved results [E.G. blt_entropy_1e20_results.csv]
- --consider-global-parameters: if used, parameters of the global model are considered [USED]
- --joint-fit: if used, one polynomial is fitted to all data (as in plots/plot_compression_heatmap.py). Otherwise a separate polynomial is fitted for each compression rate. [NOT USED]
- --fit-params: if used, plot parameters on the x-axis [NOT USED]
- --fit-data: if used, plot data on the x-axis [NOT USED]
- --fit-all: if used, 1x3 figure with all 3 options: parameters, data, and bytes-per-parameter
- --dataset: evaluation column (default c4)
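When --joint-fit is not set, each compression rate gets its own polynomial. A minimal sketch of that per-rate variant follows, assuming a degree-2 polynomial in the log of the x-axis quantity; the degree and the log scale are assumptions. The helper `per_rate_optimum` and the data points are invented for illustration.

```python
import numpy as np


def per_rate_optimum(log_x, loss):
    """Fit loss = a*log_x**2 + b*log_x + c and return the vertex (argmin, min loss)."""
    a, b, c = np.polyfit(log_x, loss, deg=2)
    x_opt = -b / (2 * a)
    return x_opt, np.polyval([a, b, c], x_opt)


# Made-up isoFLOP points: loss vs. bytes-per-parameter for two compression rates.
ratios = np.array([25.0, 50.0, 100.0, 200.0, 400.0])
curves = {
    4.0: 0.90 + 0.03 * (np.log(ratios) - np.log(100.0)) ** 2,
    8.0: 0.92 + 0.03 * (np.log(ratios) - np.log(200.0)) ** 2,
}

optima = {}
for rate, losses in curves.items():
    x_opt, loss_opt = per_rate_optimum(np.log(ratios), losses)
    optima[rate] = (float(np.exp(x_opt)), float(loss_opt))
    print(f"T={rate}: optimal ratio ~ {optima[rate][0]:.1f}, loss ~ {optima[rate][1]:.3f}")
```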
Use plots/plot_compression_3d.py to create an interactive Plotly 3D scatter (HTML) of loss vs. compression rate and bytes-per-parameter for a single CSV. Per-rate optima from the paraboloid fit are overlaid as red-bordered diamonds.
Usage:
python plots/plot_compression_3d.py <results_csv> [--consider-global-parameters --dataset <dataset> --output-dir <dir>]
Working example:
python plots/plot_compression_3d.py blt_entropy_1e20_results.csv --consider-global-parameters
Arguments:
- <results_csv>: the CSV file with saved results [E.G. blt_entropy_1e20_results.csv]
- --consider-global-parameters: if used, parameters of the global model are considered [USED]
- --dataset: evaluation column (default c4)
- --output-dir: directory to write the HTML (default .)
Use plots/plot_optimal_configs.py to plot a 3D visualization of the optimal data, parameters, bytes-per-parameter, or loss vs. compression rate and compute budget.
Usage:
python plots/plot_optimal_configs.py <results_csv_0> ... <results_csv_n> --consider-global-parameters [--dataset <dataset> --prefix <prefix> --fit <fit> --interactive]
Working example:
python plots/plot_optimal_configs.py --consider-global-parameters
Arguments:
- <results_csv_0> ... <results_csv_n>: the list of CSVs with results for the corresponding compute budgets. [SKIPPED WHEN PREFIX IS USED]
- --consider-global-parameters: if used, parameters of the global model are considered [USED]
- --dataset: the dataset to use for the plot. [DEFAULT: "c4"]
- --prefix: the tokenization method used in the plot. [DEFAULT: "blt_entropy", CHANGE TO: "isotropic" for BPE]
- --fit: what config to plot. Choices: data, params, bn, loss, all. all produces a 1x4 panel (or a separate file per mode in interactive mode). [DEFAULT: "all"]
- --interactive: plots interactive figures (HTML and GIF) instead of PDFs.
- --output-dir: directory to write outputs (default .)
Use plots/plot_loss_profile.py to plot loss vs. compression rate and overlay interpolation curves from the pre-computed power law.
Usage:
python plots/plot_loss_profile.py --json-file <json_file> --prefix <prefix>
Working example:
python plots/plot_loss_profile.py --json-file blt_entropy_compression_rate_params.json
Arguments:
- --csv-files <results_csv_0> ... <results_csv_n>: the list of CSVs with results for the corresponding compute budgets. [SKIPPED WHEN PREFIX IS USED]
- --json-file: the JSON file with pre-computed power law coefficients. [USED]
- --prefix: the tokenization method used in the plot. [DEFAULT: "blt_entropy", CHANGE TO: "isotropic" for BPE]
- --dataset: evaluation column (default c4)
- --output-dir: directory to write the PNG (default .)
Use plots/plot_multilingual_optimals.py to plot per-language optimal compression rate, bytes-per-parameter, BPB, and 1/BPB against byte parity, color-coded by script.
Usage:
python plots/plot_multilingual_optimals.py [--csv <csv_file> --output-dir <dir>]
Working example:
python plots/plot_multilingual_optimals.py --csv multilingual_results.csv
Arguments:
- --csv: the multilingual results CSV in result_csv/ [DEFAULT: "multilingual_results.csv"]
- --output-dir: directory to write the PDFs (default: this script's directory)
Use plots/plot_endtask.py to plot task performance vs. inference FLOPs for models on one training IsoFLOP but with different compression rates and model sizes.
Usage:
python plots/plot_endtask.py <csv_file> --dataset <dataset>
Working example:
python plots/plot_endtask.py endtask_results_2e21.csv --dataset "Hellaswag"
Arguments:
- <csv_file>: the CSV with end-task results [USED: endtask_results_2e21.csv]
- --dataset: the dataset for which to plot the results [E.G. Hellaswag]
- --output-dir: directory to write the PDF (default .)
@article{limisiewicz2026cotok,
title={Compute Optimal Tokenization},
author={Limisiewicz, Tomasz and Pagnoni, Artidoro and Iyer, Srini and Lewis, Mike and Mehta, Sachin and Liu, Alisa and Li, Margaret and Ghosh, Gargi and Zettlemoyer, Luke},
year={2026},
eprint={2605.01188},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.01188},
}

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. The full legal text is in the LICENSE file.






