tu is a small command-line tool for counting tokens in files and directories.
It is intentionally shaped like `du`, but instead of disk usage it reports the number of tokens produced by a tokenizer.
By default, tu:

- uses OpenAI `o200k_base`
- respects `.gitignore`, `.ignore`, and git exclude rules
- excludes `.git` directories for detected git repositories
- skips binary files with a warning
When working with LLM prompts, repositories, or corpora, byte size is often less useful than token count. tu answers questions like:
- "How many tokens are in this prompt file?"
- "How large is this repository in `o200k_base` terms?"
- "Which subtree is contributing the most tokens?"
- "What would this look like under a HuggingFace tokenizer?"
Build from source:
```shell
cargo build --release
```

The binary will be available at:

```
./target/release/tu
```

Or install it into Cargo's binary directory:

```shell
cargo install --path .
```

Count tokens in the current directory:

```shell
tu
```

Count tokens in a single file:

```shell
tu prompt.txt
```

Read from stdin:

```shell
cat prompt.txt | tu
```

Show every file and directory aggregate:

```shell
tu --all .
```

Use human-readable units:

```shell
tu --human .
```

Emit JSON for scripts:

```shell
tu --json .
```

- If no path is provided and stdin is a TTY, `tu` scans `.`.
- If no path is provided and stdin is piped, `tu` reads stdin.
- You can pass `-` explicitly to read stdin alongside file paths.
- Symbolic links are not followed unless `--follow-links` is enabled.
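The TTY check behind these rules is the standard Unix one; as an illustrative sketch (not tu's actual code), the equivalent decision in plain shell looks like this:

```shell
# Mirror tu's documented stdin rules: scan . on a TTY, read stdin when piped.
# `[ -t 0 ]` is the POSIX test for "file descriptor 0 is a terminal".
if [ -t 0 ]; then
  echo "stdin is a TTY: scan ."
else
  echo "stdin is piped: read stdin"
fi
```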
By default, tu respects:
- `.gitignore`
- `.ignore`
- `.git/info/exclude`
- global git ignore rules

`.git` directories for detected repositories are skipped by default.

Disable this behavior with:

```shell
tu --no-ignore .
```

Add extra exclusions with repeatable glob patterns:

```shell
tu --exclude '*.min.js' --exclude 'dist/**' .
```

Binary handling is controlled with `--binary`:

- `skip` (default): skip binary or non-UTF-8 input and print a warning
- `lossy`: decode with UTF-8 lossy conversion and still count tokens
- `error`: fail the command when binary or non-UTF-8 input is encountered
Examples:
```shell
tu --binary skip .
tu --binary lossy archive.dat
tu --binary error .
```

The default backend is `openai` with `o200k_base`:

```shell
tu .
```

You can switch builtin tokenizers:

```shell
tu --encoding cl100k_base .
tu --encoding p50k_base .
tu --encoding r50k_base .
tu --encoding qwen3 .
tu --encoding deepseek_v3_2 .
tu --encoding glm5 .
```

Compare multiple tokenizers in one run:

```shell
tu --compare o200k_base --compare cl100k_base .
tu --compare o200k_base --compare qwen3 .
tu --compare qwen3 --compare file:./tokenizer.json .
```

Available builtin tokenizer ids:

- `o200k_base`
- `cl100k_base`
- `p50k_base`
- `r50k_base`
- `qwen3`
- `deepseek_v3_2`
- `glm5`
Builtin HuggingFace tokenizers are selected through `--encoding`:

```shell
tu --encoding qwen3 .
tu --encoding deepseek_v3_2 .
tu --encoding glm5 .
```

Or point tu at a local `tokenizer.json` without setting `--tokenizer`:

```shell
tu --tokenizer-file ./tokenizer.json .
```

Builtin families are embedded into the binary for offline use. A local tokenizer file remains useful when you want counts for a model-specific tokenizer outside the bundled set.
The default text format is:
```
<tokens>\t<path>
```

Examples:

```shell
tu README.md
tu --all src
tu --total packages/a packages/b
```

Use `--json` when you need machine-readable output:

```shell
tu --json .
```

When `--compare` is used, text output becomes a TSV table with a path column plus one column per tokenizer, and JSON output switches to a `tokenizers` + `results` structure.

The JSON payload includes:

- `tokenizer`: the tokenizer configuration used for the run
- `entries`: emitted entries
- `total`: summed totals across roots
- `had_errors`: whether any execution error occurred
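Because each line of the default text output is just `<tokens>\t<path>`, it composes with standard Unix tools. A sketch summing the token column with awk, where `printf` stands in for real tu output and the counts are made up:

```shell
# Sum the first (token-count) column of tab-separated tu-style output.
printf '1200\tsrc/main.rs\n340\tsrc/lib.rs\n' \
  | awk -F '\t' '{ sum += $1 } END { print sum }'
```

This prints `1540`, the sum of the two sample counts.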
```
Usage: tu [OPTIONS] [PATH]...

Arguments:
  [PATH]...  Files or directories to scan. Use `-` to read stdin

Options:
  -a, --all                    Output every file and directory aggregate
  -s, --summarize              Output only the summary for each root input
  -d, --max-depth <N>          Limit displayed depth. Deeper descendants are still counted in aggregates
      --tokenizer <TOKENIZER>  Select the tokenizer backend [possible values: openai, hf]
      --encoding <ENCODING>    Select a builtin tokenizer family [possible values: o200k_base, cl100k_base, p50k_base, r50k_base, qwen3, deepseek_v3_2, glm5]
      --tokenizer-file <PATH>  Path to a HuggingFace tokenizer.json
      --binary <BINARY>        Binary file handling policy [default: skip] [possible values: skip, lossy, error]
      --no-ignore              Disable .gitignore, .ignore, and git exclude rules
      --exclude <GLOB>         Exclude matching paths. Repeatable
  -L, --follow-links           Follow symbolic links
  -H, --human                  Print human-readable token units
      --json                   Emit JSON instead of text output
      --total                  Print a total row when multiple roots are provided
  -h, --help                   Print help
  -V, --version                Print version
```
Show only a top-level summary for the current repository:
```shell
tu .
```

Inspect a subtree in detail:

```shell
tu --all --max-depth 2 src
```

Compare two directories and print a combined total:

```shell
tu --total app server
```

Count a prompt under a different tokenizer:

```shell
tu --encoding cl100k_base prompt.md
tu --encoding qwen3 prompt.md
```

Count with a HuggingFace tokenizer:

```shell
tu --encoding qwen3 docs
tu --tokenizer-file ./tokenizer.json docs
```

Compare OpenAI and HuggingFace tokenizers:

```shell
tu --compare o200k_base --compare qwen3 docs
tu --compare qwen3 --compare file:./tokenizer.json docs
```

Use in shell pipelines:

```shell
git show HEAD~1:README.md | tu
```

Process JSON output with jq:

```shell
tu --json . | jq '.total.tokens'
```

Exit codes:

- `0`: success
- `1`: the command completed but at least one scan/read/count error occurred
- `2`: invalid configuration or startup failure
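Scripts can branch on these exit codes to separate partial failures from configuration errors. A sketch of the pattern, where `describe_tu_status` is a hypothetical helper and not part of tu itself:

```shell
# Map tu's documented exit codes onto human-readable outcomes.
# describe_tu_status is a hypothetical helper, not part of tu.
describe_tu_status() {
  case "$1" in
    0) echo "success" ;;
    1) echo "completed, but at least one scan/read/count error occurred" ;;
    2) echo "invalid configuration or startup failure" ;;
    *) echo "unknown status: $1" ;;
  esac
}

describe_tu_status 1
```

In a wrapper you would call it right after a run, e.g. `tu --json . > counts.json; describe_tu_status "$?"`.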
Warnings, such as skipped binary files, are written to stderr.
- Token counts depend on the selected tokenizer. Different backends or encodings will produce different numbers.
- `--tokenizer` is now optional for builtin tokenizers and mostly acts as an explicit compatibility constraint.
- `--max-depth` only affects displayed entries. Deeper files still contribute to ancestor aggregates.
- `--summarize` exists for clarity, but summary-only output is already the default.
Use `just` to set up the local environment and run checks:

```shell
just ready
```

Useful recipes:

```shell
just venv   # create .venv if it does not exist
just lock   # refresh uv.lock
just sync   # install Python dev dependencies into .venv
just test   # run the full Rust test suite
```

`just ready` is the recommended entrypoint for contributors. It ensures the project virtualenv exists, synchronizes the Python test dependency used by integration tests, and then runs `cargo test`.
Preview a release without changing the repository:
```shell
just release-plan <version>
```

Create the local release commit, tag, and changelog:

```shell
just release <version>
```

Push the branch and tag to GitHub:

```shell
just publish-release <version>
```

These commands run `just ready` first, update the version, regenerate `CHANGELOG.md`, create a `release: vX.Y.Z` commit, and create an annotated `vX.Y.Z` tag. After the tag is pushed, GitHub Actions creates the GitHub Release and uploads the platform archives.