facebookresearch/compute-optimal-tokenization

Compute Optimal Tokenization

This repository contains scripts for scaling experiments across different model scales and compression rates (i.e., patch size).

We release all the results presented in the paper "Compute Optimal Tokenization" (blog post) and the code for fitting scaling laws and producing visualizations.

Repository Overview

  1. Experiments: Raw results used to fit the power laws and produce the visualizations.
  2. Power Laws: Power laws fitted for the optimal bytes-to-parameter ratio and the optimal compression rate.
  3. Visualizations: Scripts that process the results and generate figures.

1. Experiments (result_csv/)

We trained LLMs at multiple compute budgets, data sizes, parameter counts, and compression rates. Raw results are stored as CSVs in result_csv/. Filenames encode the tokenizer (blt_entropy, blt_static, isotropic), an optional language tag (e.g. rus_Cyrl), and the compute budget in scientific notation (e.g. 1e20).
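As a rough illustration of the naming scheme, the sketch below splits a result filename into its parts. The exact layout `<tokenizer>[_<language>]_<budget>_results.csv` is an assumption inferred from names such as blt_entropy_1e20_results.csv used in this README, and `parse_result_filename` is a hypothetical helper, not part of the repository.

```python
import re
from typing import NamedTuple, Optional

# Tokenizer names listed in this README.
TOKENIZERS = ("blt_entropy", "blt_static", "isotropic")

# Assumed layout: <tokenizer>[_<language>]_<budget>_results.csv
_PATTERN = re.compile(
    r"^(?P<tokenizer>" + "|".join(TOKENIZERS) + r")"
    r"(?:_(?P<language>[a-z]{3}_[A-Z][a-z]{3}))?"   # e.g. rus_Cyrl (optional)
    r"_(?P<budget>\d+(?:\.\d+)?e\d+)_results\.csv$"  # e.g. 1e20
)


class ResultFile(NamedTuple):
    tokenizer: str
    language: Optional[str]
    compute_budget: float


def parse_result_filename(name: str) -> ResultFile:
    """Hypothetical helper: split a result CSV name into its encoded fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized result filename: {name}")
    return ResultFile(match["tokenizer"], match["language"], float(match["budget"]))


print(parse_result_filename("blt_entropy_1e20_results.csv"))
```

Files that do not follow this assumed layout (for example multilingual_results.csv) raise a ValueError here rather than being guessed at.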

Training/evaluation code is not included here. BLT runs can be reproduced with the BLT repository.


2. Power Laws (power_laws/)

Both fitting scripts auto-discover the CSVs in result_csv/ whose names match --prefix. Each script writes a human-readable *.txt report and a machine-readable *.json file of fitted coefficients to power_laws/, alongside the script itself.
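The discovery step can be pictured as a glob over filenames. The sketch below uses the pattern quoted in §2.1 (blt_entropy_*e*_results.csv) and generalizes it to any prefix; whether the scripts use `fnmatch`, `glob`, or something else internally is an assumption, and `select_result_csvs` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_result_csvs(filenames: Iterable[str], prefix: str = "blt_entropy") -> List[str]:
    """Return the names matching <prefix>_*e*_results.csv, sorted."""
    pattern = f"{prefix}_*e*_results.csv"
    return sorted(name for name in filenames if fnmatchcase(name, pattern))


# Hypothetical directory listing, for illustration only.
listing = [
    "blt_entropy_1e20_results.csv",
    "blt_entropy_1e21_results.csv",
    "blt_static_1e20_results.csv",
    "isotropic_1e20_results.csv",
    "multilingual_results.csv",
]
print(select_result_csvs(listing, prefix="blt_entropy"))
```

Note that a pattern of this shape would also match language-tagged files with the same prefix, so passing explicit csv_files is the safer choice when only a subset is wanted.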

2.1 Power Law I: Optimal Data and Parameters

File: power_laws/fit_power_law_data_parameters.py

Fits the two power laws

B*(C, T) = B_0 · C^alpha   · T^beta
N*(C, T) = N_0 · C^alpha_N · T^beta_N

where C is the compute budget, T is the compression rate, B* is the optimal amount of data (in bytes), and N* is the optimal number of parameters. Per-(C, T) optima are taken from the polynomial (paraboloid) fit to each isoFLOP CSV (see plot_utils.load_optimal_results). The power laws are fitted in log space with BFGS, multi-started from a 2197-point grid of initial values; standard errors and 95% CIs are derived from the Hessian.
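To see why log space is convenient, note that log B* = log B_0 + alpha·log C + beta·log T is linear in the unknowns. The sketch below is a simplified stand-in for the script: it solves that linear problem with closed-form ordinary least squares rather than BFGS with multi-start, and it produces no confidence intervals. The data points are synthetic and the coefficients are made up, not values from the paper.

```python
import math


def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_power_law(points):
    """Fit y = y_0 * C**alpha * T**beta to (C, T, y) triples in log space."""
    rows = [(1.0, math.log(c), math.log(t)) for c, t, _ in points]
    targets = [math.log(y) for _, _, y in points]
    # Normal equations (X^T X) w = X^T y for the linear model in log space.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(3)]
    log_y0, alpha, beta = solve_linear_system(xtx, xty)
    return math.exp(log_y0), alpha, beta


# Synthetic per-(C, T) optima generated from made-up coefficients.
true_b0, true_alpha, true_beta = 2.0, 0.5, 0.3
points = [
    (c, t, true_b0 * c**true_alpha * t**true_beta)
    for c in (1e18, 1e19, 1e20, 1e21)
    for t in (2.0, 4.0, 8.0)
]
b0, alpha, beta = fit_power_law(points)
print(f"B_0 = {b0:.4f}, alpha = {alpha:.4f}, beta = {beta:.4f}")
```

On noise-free synthetic data the closed-form solution recovers the generating coefficients; an iterative optimizer such as BFGS becomes relevant when the objective is not a plain least-squares problem or when Hessian-based uncertainty estimates are needed.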

Usage:

python power_laws/fit_power_law_data_parameters.py [csv_files ...] \
    --prefix blt_entropy \
    --consider-global-parameters \
    [--approximated-compute] [--dataset c4]

Working example (auto-discovers all blt_entropy_*e*_results.csv):

python power_laws/fit_power_law_data_parameters.py --consider-global-parameters

Arguments:

  • csv_files — explicit list of CSVs. If omitted, files matching --prefix are auto-discovered.
  • --prefix — CSV prefix used for auto-discovery (default blt_entropy; use isotropic for BPE).
  • --consider-global-parameters — count parameters of the global (i.e., excluding encoder, decoder, embeddings) transformer only. Used for all paper results.
  • --approximated-compute — replaces the file-budget C with C* = 6·B·N/T (requires --consider-global-parameters). Useful for sanity-checking the budget assumption.
  • --dataset — evaluation column to use for selecting optima (default c4).
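The approximation behind --approximated-compute can be checked by hand. Assuming T is measured in bytes per patch, B/T is the number of patches processed, so C* = 6·B·N/T mirrors the common estimate of 6 × parameters × tokens. The sketch below uses hypothetical function names and made-up numbers to compare C* with a nominal file budget.

```python
def approximated_compute(data_bytes: float, global_params: float, compression_rate: float) -> float:
    """C* = 6 * B * N / T, with B in bytes, N global parameters, T compression rate."""
    return 6.0 * data_bytes * global_params / compression_rate


def relative_gap(file_budget: float, approx: float) -> float:
    """Signed relative difference between the approximation and the file budget."""
    return (approx - file_budget) / file_budget


# Made-up configuration, not a run from the paper.
c_star = approximated_compute(data_bytes=4e10, global_params=1e9, compression_rate=4.0)
print(f"C* = {c_star:.2e}, gap vs. a 1e20 budget = {relative_gap(1e20, c_star):+.0%}")
```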

Outputs (in power_laws/):

  • <prefix>_power_law_fit_report[ _approximated_global_compute].txt — fit summary, CIs, R².
  • <prefix>_power_law_params[ _approximated_global_compute].json, with coefficients stored as {B_optimal_data: {B_0, alpha, beta}, N_optimal_params: {N_0, alpha, beta}}.
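A fitted coefficient file can be turned into predictions with a few lines. The sketch below assumes the JSON has exactly the keys listed above; the inline coefficients are made up for illustration, and `predict_optima` is a hypothetical helper rather than a function from the repository.

```python
import json

# Inline stand-in for a *_power_law_params.json file (made-up coefficients).
example_json = """
{
  "B_optimal_data":   {"B_0": 10.0, "alpha": 0.5, "beta": 0.25},
  "N_optimal_params": {"N_0": 0.1,  "alpha": 0.5, "beta": 0.5}
}
"""


def predict_optima(params, compute, compression_rate):
    """Evaluate B*(C, T) and N*(C, T) from a coefficient dictionary."""
    b = params["B_optimal_data"]
    n = params["N_optimal_params"]
    optimal_bytes = b["B_0"] * compute ** b["alpha"] * compression_rate ** b["beta"]
    optimal_params = n["N_0"] * compute ** n["alpha"] * compression_rate ** n["beta"]
    return optimal_bytes, optimal_params


params = json.loads(example_json)
optimal_bytes, optimal_params = predict_optima(params, compute=1e20, compression_rate=4.0)
print(
    f"B* = {optimal_bytes:.3e} bytes, N* = {optimal_params:.3e} params, "
    f"B*/N* = {optimal_bytes / optimal_params:.1f}"
)
```

Dividing the two predictions gives the bytes-per-parameter ratio implied by the fit at a given compute budget and compression rate.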

2.2 Power Law II: Optimal Loss vs. Compression

File: power_laws/fit_power_law_compression_rate.py

Jointly fits a scaling law for loss given compression and compute budget. Three residual variants are selectable:

  • compute-dependent-t (default): L(C,T) = G·C^gamma + F·log(C^delta · T / T_0)^2 + E
  • constant-t: L(C,T) = G·C^gamma + F·log(T / T_0)^2 + E
  • intercept: L(C,T) = G·C^gamma + E

The optional --two-stage variant first fits G·C^gamma with rate-specific intercepts F_T, then fits the residual model to the residuals.
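The three residual variants differ only in the term added to G·C^gamma + E, which the following sketch makes explicit. It reads log(...)^2 as the square of the natural logarithm, which is an assumption about the notation, and all coefficient values are made up. Under that reading and with F > 0, the loss is minimized where the log term vanishes, i.e. at T* = T_0·C^(-delta).

```python
import math

RESIDUAL_MODELS = ("compute-dependent-t", "constant-t", "intercept")


def predicted_loss(compute, rate, *, G, gamma, E, F=0.0, T_0=1.0, delta=0.0,
                   residual_model="compute-dependent-t"):
    """Evaluate L(C, T) for one of the three residual variants."""
    base = G * compute ** gamma + E
    if residual_model == "intercept":
        return base
    if residual_model == "constant-t":
        log_term = math.log(rate / T_0)
    elif residual_model == "compute-dependent-t":
        log_term = math.log(compute ** delta * rate / T_0)
    else:
        raise ValueError(f"unknown residual model: {residual_model}")
    # Assumption: log(...)^2 means the square of the natural log.
    return base + F * log_term ** 2


def optimal_rate(compute, *, T_0, delta=0.0):
    """Rate at which the log term vanishes: T* = T_0 * C**(-delta).

    This minimizes the loss when F > 0; delta = 0 gives the constant-t case.
    """
    return T_0 * compute ** (-delta)


# Made-up coefficients for illustration only (not fitted values from the paper).
coeffs = dict(G=50.0, gamma=-0.1, E=0.5, F=0.02, T_0=40.0, delta=0.05)
compute = 1e20
best = optimal_rate(compute, T_0=coeffs["T_0"], delta=coeffs["delta"])
for rate in (best / 2, best, best * 2):
    print(f"T = {rate:.2f}  L = {predicted_loss(compute, rate, **coeffs):.4f}")
```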

Usage:

python power_laws/fit_power_law_compression_rate.py [csv_files ...] \
    --prefix blt_entropy \
    [--two-stage] \
    [--residual-model {compute-dependent-t,constant-t,intercept}] \
    [--dataset c4]

Working example:

python power_laws/fit_power_law_compression_rate.py --two-stage

Arguments:

  • csv_files, --prefix, --dataset — as in §2.1.
  • --two-stage — two-stage fit producing per-rate F_T intercepts.
  • --residual-model — selects the optimal-loss formula above.

3. Visualizations (plots/)

3.1 Plot Heatmap (Loss vs. _ and Compression)

Use plots/plot_compression_heatmap.py to create a heatmap (or contour plot) of validation loss over compression rate (x-axis) and _ (bytes-per-parameter, data, parameters, or all of them) on the y-axis. The script fits a paraboloid to the results and saves the figure as a PDF.

Compression contour example

Usage:

python plots/plot_compression_heatmap.py <results_csv> --consider-global-parameters --contour [--fit-params --fit-data --fit-all]

Working Example:

python plots/plot_compression_heatmap.py blt_entropy_1e20_results.csv --consider-global-parameters --contour

Arguments:

  • <results_csv> the CSV file with saved results [E.G. blt_entropy_1e20_results.csv]
  • --x-scale and --y-scale set a log or linear scale for each axis; the chosen scales also matter for the paraboloid fit. [DEFAULT: log]
  • --contour use contour instead of heatmap to plot paraboloid (i.e., isolosses) [USED]
  • --consider-global-parameters if used, parameters of the global model are considered [USED]
  • --fit-params if used, plot parameters on y-axis [NOT USED]
  • --fit-data if used, plot data on y-axis [NOT USED]
  • --fit-all if used, 1x3 figure with all 3 options: parameters, data, and bytes-per-parameter
  • --dataset evaluation column (default c4)

3.2 Plot Loss Curves (Loss vs. _)

Use plots/plot_compression_loss_curves.py to create scaling curves showing loss vs. _ (bytes-per-parameter, data, parameters, or all of them) for different compression rates.

Loss curves example

Usage:

python plots/plot_compression_loss_curves.py <results_csv> --consider-global-parameters [--joint-fit] [--fit-params --fit-data --fit-all]

Working Example:

python plots/plot_compression_loss_curves.py blt_entropy_1e20_results.csv --consider-global-parameters

Arguments:

  • <results_csv> the CSV file with saved results [E.G. blt_entropy_1e20_results.csv]
  • --consider-global-parameters if used, parameters of the global model are considered [USED]
  • --joint-fit if used, a single polynomial is fitted to all data (as in plots/plot_compression_heatmap.py); otherwise a separate polynomial is fitted for each compression rate. [NOT USED]
  • --fit-params if used, plot parameters on x-axis [NOT USED]
  • --fit-data if used, plot data on x-axis [NOT USED]
  • --fit-all if used, 1x3 figure with all 3 options: parameters, data, and bytes-per-parameter
  • --dataset evaluation column (default c4)
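For the per-rate case, the idea of reading an optimum off a fitted curve can be sketched as follows. This assumes a degree-2 polynomial in the logarithm of the x-axis quantity, which this README does not state explicitly (it only says "polynomial"), and it uses synthetic losses. The helper names are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c; returns (a, b, c)."""
    # Power sums s[k] = sum(x**k) and moments t[k] = sum(y * x**k).
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Normal equations with unknowns ordered (a, b, c), solved by Cramer's rule.
    lhs = [[s[4], s[3], s[2]],
           [s[3], s[2], s[1]],
           [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(lhs)
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in lhs]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


def optimum_from_log_parabola(values, losses):
    """Fit loss against log(value) and return (value at minimum, minimum loss)."""
    xs = [math.log(v) for v in values]
    mean = sum(xs) / len(xs)  # center the inputs to keep the system well conditioned
    a, b, c = fit_parabola([x - mean for x in xs], losses)
    if a <= 0:
        raise ValueError("fitted parabola has no minimum")
    return math.exp(mean - b / (2 * a)), c - b * b / (4 * a)


# Synthetic losses for one compression rate, with a minimum at 40 bytes per parameter.
ratios = [10.0, 20.0, 40.0, 80.0, 160.0]
losses = [0.05 * (math.log(r) - math.log(40.0)) ** 2 + 1.0 for r in ratios]
best_ratio, best_loss = optimum_from_log_parabola(ratios, losses)
print(f"optimal bytes-per-parameter = {best_ratio:.1f}, loss = {best_loss:.3f}")
```

Repeating this for each compression rate gives one optimum per curve, whereas a joint fit shares a single surface across all rates.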

3.3 Plot 3D Loss Surface for a Single IsoFLOP

Use plots/plot_compression_3d.py to create an interactive Plotly 3D scatter (HTML) of loss vs. compression rate and bytes-per-parameter for a single CSV. Per-rate optima from the paraboloid fit are overlaid as red-bordered diamonds.

3D loss surface example

Usage:

python plots/plot_compression_3d.py <results_csv> [--consider-global-parameters --dataset <dataset> --output-dir <dir>]

Working Example:

python plots/plot_compression_3d.py blt_entropy_1e20_results.csv --consider-global-parameters

Arguments:

  • <results_csv> the CSV file with results saved [E.G. blt_entropy_1e20_results.csv]
  • --consider-global-parameters if used, parameters of the global model are considered [USED]
  • --dataset evaluation column (default c4)
  • --output-dir directory to write the HTML (default .)

3.4 Plot 3D Optimal _ vs. Compression and Compute

Use plots/plot_optimal_configs.py to plot a 3D visualization of optimal _ (data, parameters, bytes-per-parameter, or loss) vs. compression rate and compute budget.

Optimal configs example

Usage:

python plots/plot_optimal_configs.py <results_csv_0> ... <results_csv_n> --consider-global-parameters [--dataset <dataset> --prefix <prefix> --fit <fit> --interactive]

Working Example:

python plots/plot_optimal_configs.py --consider-global-parameters

Arguments:

  • <results_csv_0> ... <results_csv_n> the list of CSVs with results for corresponding compute budgets. [SKIPPED WHEN PREFIX IS USED]
  • --consider-global-parameters if used, parameters of the global model are considered [USED]
  • --dataset the dataset to use for the plot. [DEFAULT: "c4"]
  • --prefix the tokenization method used in the plot. [DEFAULT: "blt_entropy", CHANGE TO: "isotropic" for BPE]
  • --fit what config to plot. Choices: data, params, bn, loss, all. all produces a 1x4 panel (or a separate file per mode in interactive mode). [DEFAULT: "all"]
  • --interactive plots interactive figures (HTML and GIF) instead of PDFs.
  • --output-dir directory to write outputs (default .)

3.5 Plot Scaling Laws (Loss vs. Compression)

Use plots/plot_loss_profile.py to plot loss vs. compression rate and overlay interpolation curves from the pre-computed power law.

Loss profile example

Usage:

python plots/plot_loss_profile.py --json-file <json_file> --prefix <prefix>

Working Example:

python plots/plot_loss_profile.py --json-file blt_entropy_compression_rate_params.json

Arguments:

  • --csv-files <results_csv_0> ... <results_csv_n> the list of CSVs with results for corresponding compute budgets. [SKIPPED WHEN PREFIX IS USED]
  • --json-file the json file with pre-computed power law coefficients. [USED]
  • --prefix the tokenization method used in the plot. [DEFAULT: "blt_entropy", CHANGE TO: "isotropic" for BPE]
  • --dataset evaluation column (default c4)
  • --output-dir directory to write the PNG (default .)

3.6 Plot Multilingual Optimals vs. Byte Parity

Use plots/plot_multilingual_optimals.py to plot per-language optimal compression rate, bytes-per-parameter, BPB, and 1/BPB against byte parity, color-coded by script.

Multilingual optimals example

Usage:

python plots/plot_multilingual_optimals.py [--csv <csv_file> --output-dir <dir>]

Working Example:

python plots/plot_multilingual_optimals.py --csv multilingual_results.csv

Arguments:

  • --csv the multilingual results CSV in result_csv/ [DEFAULT: "multilingual_results.csv"]
  • --output-dir directory to write the PDFs (default: this script's directory)

3.7 Plot Task Performance vs. Inference FLOPs

Use plots/plot_endtask.py to plot task performance vs. inference FLOPs for models trained at the same compute budget (a single training isoFLOP) but with different compression rates and model sizes.

End-task example

Usage:

python plots/plot_endtask.py <csv_file> --dataset <dataset>

Working Example:

python plots/plot_endtask.py endtask_results_2e21.csv --dataset "Hellaswag"

Arguments:

  • <csv_file> the CSV with end-task results [USED: endtask_results_2e21.csv]
  • --dataset the dataset for which to plot the results [E.G. Hellaswag]
  • --output-dir directory to write the PDF (default .)

Citation

@article{limisiewicz2026cotok,
  title={Compute Optimal Tokenization},
  author={Limisiewicz, Tomasz and Pagnoni, Artidoro and Iyer, Srini and Lewis, Mike and Mehta, Sachin and Liu, Alisa and Li, Margaret and Ghosh, Gargi and Zettlemoyer, Luke},
  year={2026},
  eprint={2605.01188},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.01188},
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. The full legal text is in the LICENSE file.
