Compute Optimal Tokenization

This repository contains scripts for scaling experiments across different model scales and compression rates (i.e., patch sizes).

We release all the results presented in the paper "Compute Optimal Tokenization" (blog post) and the code for fitting scaling laws and producing visualizations.

Repository Overview

Experiments: Raw results used in scaling laws and visualizations.
Scaling Laws: Power laws fit for optimal bytes-to-parameter ratio and optimal compression rate.
Visualizations: Process results to generate visualizations.

1. Experiments (`result_csv/`)

We trained LLMs at multiple compute budgets, data sizes, parameter counts, and compression rates. Raw results are stored as CSVs in result_csv/. Filenames encode the tokenizer (blt_entropy, blt_static, isotropic), an optional language tag (e.g. rus_Cyrl), and the compute budget exponent (e.g. 1e20).
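As a rough illustration of the naming scheme above, the sketch below splits a result filename into its parts. The helper `parse_result_filename`, the `ResultFile` tuple, and the position of the language tag (between the tokenizer name and the budget) are assumptions made for this example. They are not part of the repository, so check them against the actual files in result_csv/.

```python
import re
from typing import NamedTuple, Optional


class ResultFile(NamedTuple):
    tokenizer: str
    language: Optional[str]
    budget: float


# Assumed layout: <tokenizer>[_<language>]_<budget>_results.csv
_PATTERN = re.compile(
    r"^(?P<tokenizer>blt_entropy|blt_static|isotropic)"
    r"(?:_(?P<language>[A-Za-z]{3}_[A-Za-z]{4}))?"  # e.g. rus_Cyrl (optional)
    r"_(?P<budget>\d+(?:\.\d+)?e\d+)_results\.csv$"  # e.g. 1e20
)


def parse_result_filename(name: str) -> ResultFile:
    """Hypothetical helper: split a result CSV name into its encoded fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized result filename: {name}")
    return ResultFile(match["tokenizer"], match["language"], float(match["budget"]))
```

For example, `parse_result_filename("blt_entropy_1e20_results.csv")` yields the tokenizer `blt_entropy`, no language tag, and a budget of `1e20`.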

Training/evaluation code is not included here. BLT runs can be reproduced with the BLT repository.

2. Power Laws (`power_laws/`)

Both fitting scripts auto-discover the CSVs in result_csv/ whose names match --prefix. Each writes a human-readable *.txt report and a machine-readable *.json file of fitted coefficients to power_laws/, alongside the script.
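The auto-discovery step can be approximated with a glob over result_csv/. This is a minimal sketch, assuming the `<prefix>_*e*_results.csv` pattern quoted in the working example of §2.1. The function `discover_csvs` is a hypothetical name and may not match how the scripts implement discovery.

```python
from pathlib import Path


def discover_csvs(prefix: str, result_dir: str = "result_csv") -> list[Path]:
    """Hypothetical helper: list result CSVs for one tokenizer prefix, sorted by name."""
    pattern = f"{prefix}_*e*_results.csv"
    return sorted(Path(result_dir).glob(pattern))
```

Calling `discover_csvs("isotropic")` would then collect the BPE result files, if the directory follows that naming.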

2.1 Power Law I: Optimal Data and Parameters

File: power_laws/fit_power_law_data_parameters.py

Fits the two power laws

B*(C, T) = B_0 · C^alpha   · T^beta
N*(C, T) = N_0 · C^alpha_N · T^beta_N

where C is the compute budget, T is the compression rate, B* is the optimal amount of data (in bytes), and N* is the optimal number of parameters. Per-(C, T) optima are taken from the paraboloid fit to each isoFLOP CSV (see plot_utils.load_optimal_results). Fitting is done in log space with BFGS, multi-started from a 2197-point grid; standard errors and 95% CIs are derived from the Hessian.
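Taking logs turns each power law into a linear model, log B* = log B_0 + alpha·log C + beta·log T. The sketch below illustrates that idea with ordinary least squares on synthetic data. It is a simplified stand-in for the script's BFGS multi-start fit, and it does not compute standard errors or CIs. The function name `fit_power_law` and all numbers are made up for illustration.

```python
import numpy as np


def fit_power_law(C, T, Y):
    """Fit Y = Y_0 * C**alpha * T**beta by linear least squares in log space."""
    C, T, Y = (np.asarray(v, dtype=float) for v in (C, T, Y))
    design = np.column_stack([np.ones_like(C), np.log(C), np.log(T)])
    coef, *_ = np.linalg.lstsq(design, np.log(Y), rcond=None)
    log_y0, alpha, beta = coef
    return float(np.exp(log_y0)), float(alpha), float(beta)


# Synthetic optima generated from made-up coefficients (not values from the paper).
C = np.array([1e19, 1e19, 1e20, 1e20, 1e21, 1e21])
T = np.array([2.0, 8.0, 2.0, 8.0, 2.0, 8.0])
B = 3.0 * C**0.5 * T**0.25

B_0, alpha, beta = fit_power_law(C, T, B)
```

On noise-free data like this, the fit recovers the generating coefficients up to floating-point error.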

Usage:

python power_laws/fit_power_law_data_parameters.py [csv_files ...] \
    --prefix blt_entropy \
    --consider-global-parameters \
    [--approximated-compute] [--dataset c4]

Working example (auto-discovers all blt_entropy_*e*_results.csv):

python power_laws/fit_power_law_data_parameters.py --consider-global-parameters

Arguments:

csv_files — explicit list of CSVs. If omitted, files matching --prefix are auto-discovered.
--prefix — CSV prefix used for auto-discovery (default blt_entropy; use isotropic for BPE).
--consider-global-parameters — count only the parameters of the global transformer (i.e., excluding the encoder, decoder, and embeddings). Used for all paper results.
--approximated-compute — replaces the per-file compute budget C with the approximation C* = 6·B·N/T (requires --consider-global-parameters). Useful for sanity-checking the budget assumption.
--dataset — evaluation column to use for selecting optima (default c4).
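The approximation behind --approximated-compute is simple enough to write out. The sketch below assumes that B is the number of training bytes, N is the global parameter count, and T is the compression rate in bytes per patch. Under that reading, B/T is the number of patches processed, and the formula mirrors the common 6·N·D FLOPs estimate. The function name is hypothetical.

```python
def approximated_compute(num_bytes: float, num_params: float, compression_rate: float) -> float:
    """C* = 6 * B * N / T, where B / T is read as the number of patches seen in training."""
    return 6.0 * num_bytes * num_params / compression_rate
```

Comparing this value with the nominal budget of a run gives a quick check of how closely the run matches its isoFLOP.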

Outputs (in power_laws/):

<prefix>_power_law_fit_report[ _approximated_global_compute].txt — fit summary, CIs, R².
<prefix>_power_law_params[ _approximated_global_compute].json — {B_optimal_data: {B_0, alpha, beta}, N_optimal_params: {N_0, alpha, beta}}.
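Once the JSON is written, the fitted laws can be evaluated at any compute budget and compression rate. The sketch below assumes the nested key layout listed above and uses made-up coefficients. The real file may contain extra fields, and `predict_optima` is a hypothetical helper, not part of the repository.

```python
import json

# Made-up coefficients in the documented layout; real values come from the JSON file.
fitted = json.loads("""
{"B_optimal_data": {"B_0": 2.0, "alpha": 0.5, "beta": 0.25},
 "N_optimal_params": {"N_0": 0.5, "alpha": 0.5, "beta": -0.25}}
""")


def predict_optima(fitted: dict, C: float, T: float) -> tuple[float, float]:
    """Evaluate B*(C, T) and N*(C, T) from a dict of fitted coefficients."""
    b = fitted["B_optimal_data"]
    n = fitted["N_optimal_params"]
    B_star = b["B_0"] * C ** b["alpha"] * T ** b["beta"]
    N_star = n["N_0"] * C ** n["alpha"] * T ** n["beta"]
    return B_star, N_star


B_star, N_star = predict_optima(fitted, C=1e20, T=16.0)
```

To use real coefficients, replace the inline string with `json.load` on the params file produced by the script.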

2.2 Power Law II: Optimal Loss vs. Compression

File: power_laws/fit_power_law_compression_rate.py

Jointly fits a scaling law for loss given compression and compute budget. Three residual variants are selectable:

compute-dependent-t (default): L(C,T) = G·C^gamma + F·log(C^delta · T / T_0)^2 + E
constant-t: L(C,T) = G·C^gamma + F·log(T / T_0)^2 + E
intercept: L(C,T) = G·C^gamma + E
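The three variants differ only in the compression-dependent term, which the sketch below makes explicit. It assumes that `log(...)^2` means the square of the natural logarithm. The source formulas do not state the log base or the grouping, so treat that reading as an assumption. The function `predicted_loss` and its keyword names are illustrative and are not the script's API.

```python
import math


def predicted_loss(C, T, *, G, gamma, E, F=0.0, T_0=1.0, delta=0.0,
                   variant="compute-dependent-t"):
    """Evaluate one of the three loss formulas; log(.)^2 is read as (ln(.))**2."""
    base = G * C**gamma + E
    if variant == "intercept":
        return base
    if variant == "constant-t":
        return base + F * math.log(T / T_0) ** 2
    if variant == "compute-dependent-t":
        # The loss-minimizing rate shifts with compute: T_opt = T_0 / C**delta.
        return base + F * math.log(C**delta * T / T_0) ** 2
    raise ValueError(f"unknown residual model: {variant}")
```

With `delta=0` the compute-dependent variant reduces to `constant-t`, and with `F=0` both reduce to `intercept`.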

The optional --two-stage variant first fits G·C^gamma with rate-specific intercepts F_T, then fits the residual model to the residuals.

Usage:

python power_laws/fit_power_law_compression_rate.py [csv_files ...] \
    --prefix blt_entropy \
    [--two-stage] \
    [--residual-model {compute-dependent-t,constant-t,intercept}] \
    [--dataset c4]

Working example:

python power_laws/fit_power_law_compression_rate.py --two-stage

Arguments:

csv_files, --prefix, --dataset — as in §2.1.
--two-stage — two-stage fit producing per-rate F_T intercepts.
--residual-model — selects the optimal-loss formula above.

3. Visualizations (`plots/`)

3.1 Plot Heatmap (Loss vs. _ and Compression)

Use plots/plot_compression_heatmap.py to create a heatmap (or contour plot) of validation loss over compression rate (x-axis) and _ (bytes-per-parameter, data, parameters, or all of them) on the y-axis. The script fits a paraboloid to the results and saves the figure as a PDF.

Usage:

python plots/plot_compression_heatmap.py <results_csv> --consider-global-parameters --contour [--fit-params --fit-data --fit-all]

Working Example:

python plots/plot_compression_heatmap.py blt_entropy_1e20_results.csv --consider-global-parameters --contour

Arguments:

<results_csv> the CSV file with saved results [E.G. blt_entropy_1e20_results.csv]
--x-scale and --y-scale set a log or linear scale for each axis. The choice also affects the paraboloid fit. [DEFAULT: log]
--contour use contour instead of heatmap to plot paraboloid (i.e., isolosses) [USED]
--consider-global-parameters if used, parameters of the global model are considered [USED]
--fit-params if used, plot parameters on y-axis [NOT USED]
--fit-data if used, plot data on y-axis [NOT USED]
--fit-all if used, 1x3 figure with all 3 options: parameters, data, and bytes-per-parameter
--dataset evaluation column (default c4)
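For readers unfamiliar with the paraboloid fit, the sketch below shows one way such a fit can work: a full second-degree polynomial in (log x, log y), fitted by least squares, with the minimum obtained by solving a 2x2 linear system. Whether the script uses exactly this parameterization is an assumption. The function names and the synthetic data are made up for illustration.

```python
import numpy as np


def fit_paraboloid(x, y, loss):
    """Fit loss = a0 + a1*u + a2*v + a3*u^2 + a4*u*v + a5*v^2 with u = log x, v = log y."""
    u = np.log(np.asarray(x, dtype=float))
    v = np.log(np.asarray(y, dtype=float))
    design = np.column_stack([np.ones_like(u), u, v, u**2, u * v, v**2])
    coef, *_ = np.linalg.lstsq(design, np.asarray(loss, dtype=float), rcond=None)
    return coef


def paraboloid_minimum(coef):
    """Solve grad = 0; this is a minimum only if the quadratic part is positive definite."""
    _, a1, a2, a3, a4, a5 = coef
    hessian = np.array([[2 * a3, a4], [a4, 2 * a5]])
    u_min, v_min = np.linalg.solve(hessian, [-a1, -a2])
    return float(np.exp(u_min)), float(np.exp(v_min))


# Synthetic losses with a known minimum at rate 4 and 40 bytes per parameter.
rates, ratios = np.meshgrid([2.0, 4.0, 8.0, 16.0], [10.0, 20.0, 40.0, 80.0, 160.0])
rates, ratios = rates.ravel(), ratios.ravel()
loss = 2.0 + np.log(rates / 4.0) ** 2 + 0.5 * np.log(ratios / 40.0) ** 2

best_rate, best_ratio = paraboloid_minimum(fit_paraboloid(rates, ratios, loss))
```

Fitting in log coordinates corresponds to the default log setting of --x-scale and --y-scale.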

3.2 Plot Loss Curves (Loss vs. _)

Use plots/plot_compression_loss_curves.py to create scaling curves showing loss vs. _ (bytes-per-parameter, data, parameters, or all of them) for different compression rates.

Usage:

python plots/plot_compression_loss_curves.py <results_csv> --consider-global-parameters [--joint-fit] [--fit-params --fit-data --fit-all]

Working Example:

python plots/plot_compression_loss_curves.py blt_entropy_1e20_results.csv --consider-global-parameters

Arguments:

<results_csv> the CSV file with saved results [E.G. blt_entropy_1e20_results.csv]
--consider-global-parameters if used, parameters of the global model are considered [USED]
--joint-fit if used, one polynomial is fitted for all data (as in plots/plot_compression_heatmap.py). Otherwise fit a polynomial for each compression rate separately. [NOT USED]
--fit-params if used, plot parameters on x-axis [NOT USED]
--fit-data if used, plot data on x-axis [NOT USED]
--fit-all if used, 1x3 figure with all 3 options: parameters, data, and bytes-per-parameter
--dataset evaluation column (default c4)
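Without --joint-fit, each compression rate gets its own polynomial. The sketch below shows what that could look like: a separate quadratic in log x per rate, with the optimum read off the vertex. The degree-2 choice, the log scale, and the name `per_rate_optima` are assumptions for illustration, and the data is synthetic.

```python
import numpy as np


def per_rate_optima(rates, x, loss):
    """Fit loss = a*u^2 + b*u + c (u = log x) per compression rate; return each vertex."""
    rates, x, loss = (np.asarray(v, dtype=float) for v in (rates, x, loss))
    optima = {}
    for rate in np.unique(rates):
        mask = rates == rate
        a, b, _ = np.polyfit(np.log(x[mask]), loss[mask], deg=2)
        optima[float(rate)] = float(np.exp(-b / (2 * a)))  # vertex of the parabola
    return optima


# Synthetic curves: rate 2 is best at x = 20, rate 8 is best at x = 80.
xs = np.array([5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 320.0])
rates = np.concatenate([np.full_like(xs, 2.0), np.full_like(xs, 8.0)])
x = np.concatenate([xs, xs])
loss = np.concatenate([1.0 + np.log(xs / 20.0) ** 2, 1.2 + np.log(xs / 80.0) ** 2])

optima = per_rate_optima(rates, x, loss)
```

A joint fit would instead share one surface across all rates, as in the heatmap script.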

3.3 Plot 3D Loss Surface for a Single IsoFLOP

Use plots/plot_compression_3d.py to create an interactive Plotly 3D scatter (HTML) of loss vs. compression rate and bytes-per-parameter for a single CSV. Per-rate optima from the paraboloid fit are overlaid as red-bordered diamonds.

Usage:

python plots/plot_compression_3d.py <results_csv> [--consider-global-parameters --dataset <dataset> --output-dir <dir>]

Working Example:

python plots/plot_compression_3d.py blt_entropy_1e20_results.csv --consider-global-parameters

Arguments:

<results_csv> the CSV file with results saved [E.G. blt_entropy_1e20_results.csv]
--consider-global-parameters if used, parameters of the global model are considered [USED]
--dataset evaluation column (default c4)
--output-dir directory to write the HTML (default .)

3.4 Plot 3D Optimal _ vs. Compression and Compute

Use plots/plot_optimal_configs.py to plot a 3D visualization of optimal _ (data, parameters, bytes-per-parameter, or loss) vs. compression rate and compute budget.

Usage:

python plots/plot_optimal_configs.py <results_csv_0> ... <results_csv_n> --consider-global-parameters [--dataset <dataset> --prefix <prefix> --fit <fit> --interactive]

Working Example:

python plots/plot_optimal_configs.py --consider-global-parameters

Arguments:

<results_csv_0> ... <results_csv_n> the list of CSVs with results for corresponding compute budgets. [SKIPPED WHEN PREFIX IS USED]
--consider-global-parameters if used, parameters of the global model are considered [USED]
--dataset the dataset to use for the plot. [DEFAULT: "c4"]
--prefix the tokenization method used in the plot. [DEFAULT: "blt_entropy", CHANGE TO: "isotropic" for BPE]
--fit what config to plot. Choices: data, params, bn, loss, all. all produces a 1x4 panel (or a separate file per mode in interactive mode). [DEFAULT: "all"]
--interactive plots interactive figures (HTML and GIF) instead of PDFs.
--output-dir directory to write outputs (default .)

3.5 Plot Scaling Laws (Loss vs. Compression)

Use plots/plot_loss_profile.py to plot Loss vs. Compression and add interpolation curves from the pre-computed power law.

Usage:

python plots/plot_loss_profile.py --json-file <json_file> --prefix <prefix>

Working Example:

python plots/plot_loss_profile.py --json-file blt_entropy_compression_rate_params.json

Arguments:

--csv-files <results_csv_0> ... <results_csv_n> the list of CSVs with results for corresponding compute budgets. [SKIPPED WHEN PREFIX IS USED]
--json-file the json file with pre-computed power law coefficients. [USED]
--prefix the tokenization method used in the plot. [DEFAULT: "blt_entropy", CHANGE TO: "isotropic" for BPE]
--dataset evaluation column (default c4)
--output-dir directory to write the PNG (default .)

3.6 Plot Multilingual Optimals vs. Byte Parity

Use plots/plot_multilingual_optimals.py to plot per-language optimal compression rate, bytes-per-parameter, BPB, and 1/BPB against byte parity, color-coded by script.

Usage:

python plots/plot_multilingual_optimals.py [--csv <csv_file> --output-dir <dir>]

Working Example:

python plots/plot_multilingual_optimals.py --csv multilingual_results.csv

Arguments:

--csv the multilingual results CSV in result_csv/ [DEFAULT: "multilingual_results.csv"]
--output-dir directory to write the PDFs (default: this script's directory)

3.7 Plot Task Performance vs. Inference FLOPs

Use plots/plot_endtask.py to plot task performance vs. inference FLOPs for models on one training IsoFLOP but with different compression rates and model sizes.

Usage:

python plots/plot_endtask.py <csv_file> --dataset <dataset>

Working Example:

python plots/plot_endtask.py endtask_results_2e21.csv --dataset "Hellaswag"

Arguments:

<csv_file> the csv with endtask results [USED: endtask_results_2e21.csv]
--dataset the dataset for which to plot the results [E.G. Hellaswag]
--output-dir directory to write the PDF (default .)

Citation

@article{limisiewicz2026cotok,
  title={Compute Optimal Tokenization},
  author={Limisiewicz, Tomasz and Pagnoni, Artidoro and Iyer, Srini and Lewis, Mike and Mehta, Sachin and Liu, Alisa and Li, Margaret and Ghosh, Gargi and Zettlemoyer, Luke},
  year={2026},
  eprint={2605.01188},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.01188},
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. The full legal text is in the LICENSE file.
