fengkx/tu

tu

tu is a small command-line tool for counting tokens in files and directories.

It is intentionally shaped like du, but instead of disk usage it reports the number of tokens produced by a tokenizer.

By default, tu:

  • uses the OpenAI o200k_base encoding
  • respects .gitignore, .ignore, and git exclude rules
  • skips the .git directory of detected git repositories
  • skips binary files with a warning

Why

When working with LLM prompts, repositories, or corpora, byte size is often less useful than token count. tu answers questions like:

  • "How many tokens are in this prompt file?"
  • "How large is this repository in o200k_base terms?"
  • "Which subtree is contributing the most tokens?"
  • "What would this look like under a HuggingFace tokenizer?"

Installation

Build from source:

cargo build --release

The binary will be available at:

./target/release/tu

Or install it into Cargo's binary directory:

cargo install --path .

Quick Start

Count tokens in the current directory:

tu

Count tokens in a single file:

tu prompt.txt

Read from stdin:

cat prompt.txt | tu

Show every file and directory aggregate:

tu --all .

Use human-readable units:

tu --human .

Emit JSON for scripts:

tu --json .

Behavior

Traversal

  • If no path is provided and stdin is a TTY, tu scans the current directory (`.`).
  • If no path is provided and stdin is piped, tu reads from stdin.
  • Pass `-` explicitly to read stdin alongside file paths.
  • Symbolic links are not followed unless --follow-links is enabled.
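The input-resolution rules above can be sketched as a small decision function. This is an illustrative sketch only, not tu's actual implementation; the function name and signature are made up for the example.

```python
def resolve_inputs(args, stdin_is_tty):
    """Mirror tu's traversal rules: what should be scanned?

    args          -- positional paths from the command line (may include "-")
    stdin_is_tty  -- whether stdin is attached to a terminal
    """
    if not args:
        # No path given: scan "." on a TTY, otherwise read piped stdin.
        return ["."] if stdin_is_tty else ["-"]
    # Explicit paths are used as-is; "-" may appear alongside regular paths.
    return list(args)
```

For example, `resolve_inputs([], True)` yields `["."]`, while a piped invocation with no arguments yields `["-"]`.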

Ignore rules

By default, tu respects:

  • .gitignore
  • .ignore
  • .git/info/exclude
  • global git ignore rules

It also skips the .git directory of detected repositories.

Disable this behavior with:

tu --no-ignore .

Add extra exclusions with repeatable glob patterns:

tu --exclude '*.min.js' --exclude 'dist/**' .
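The effect of repeatable exclusion patterns can be approximated with Python's fnmatch. This is only a rough illustration: tu's actual matcher follows gitignore-style glob semantics, which differ from fnmatch in edge cases (anchoring, `**` handling, directory-only patterns).

```python
from fnmatch import fnmatch

excludes = ["*.min.js", "dist/**"]

def is_excluded(path, patterns):
    # A path is dropped from the scan when any pattern matches it.
    # Note: fnmatch lets "*" cross "/" boundaries, unlike gitignore globs.
    return any(fnmatch(path, p) for p in patterns)

is_excluded("app/vendor.min.js", excludes)   # True
is_excluded("src/main.rs", excludes)         # False
```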

Binary files

Binary handling is controlled with --binary:

  • skip (default): skip binary or non-UTF-8 input and print a warning
  • lossy: decode with UTF-8 lossy conversion and still count tokens
  • error: fail the command when binary or non-UTF-8 input is encountered

Examples:

tu --binary skip .
tu --binary lossy archive.dat
tu --binary error .
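The three policies amount to three ways of handling a failed UTF-8 decode. A minimal sketch of that logic (not tu's actual code, which is written in Rust):

```python
def decode_with_policy(data, policy):
    """Sketch of the --binary policies: skip, lossy, or error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        if policy == "skip":
            return None  # caller skips the file and prints a warning
        if policy == "lossy":
            # Replace undecodable bytes with U+FFFD and count anyway.
            return data.decode("utf-8", errors="replace")
        raise  # policy == "error": fail the command
```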

Tokenizers

Default OpenAI backend

The default backend is openai with o200k_base:

tu .

You can switch builtin tokenizers:

tu --encoding cl100k_base .
tu --encoding p50k_base .
tu --encoding r50k_base .
tu --encoding qwen3 .
tu --encoding deepseek_v3_2 .
tu --encoding glm5 .

Compare multiple tokenizers in one run:

tu --compare o200k_base --compare cl100k_base .
tu --compare o200k_base --compare qwen3 .
tu --compare qwen3 --compare file:./tokenizer.json .

Available builtin tokenizer ids:

  • o200k_base
  • cl100k_base
  • p50k_base
  • r50k_base
  • qwen3
  • deepseek_v3_2
  • glm5

HuggingFace backend

Builtin HuggingFace tokenizers are selected through --encoding:

tu --encoding qwen3 .
tu --encoding deepseek_v3_2 .
tu --encoding glm5 .

Or point tu at a local tokenizer.json without setting --tokenizer:

tu --tokenizer-file ./tokenizer.json .

Builtin families are embedded into the binary for offline use. A local tokenizer file remains useful when you want counts for a model-specific tokenizer outside the bundled set.

Output

Default text output

The default text format is:

<tokens>\t<path>

Examples:

tu README.md
tu --all src
tu --total packages/a packages/b
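Because each line is a tab-separated `<tokens>\t<path>` pair, the text output is easy to post-process. A small parsing sketch (the sample lines below are made up):

```python
# Hypothetical captured output from `tu --all`.
sample = "1234\tsrc/main.rs\n87\tREADME.md\n"

entries = []
for line in sample.splitlines():
    tokens, path = line.split("\t", 1)  # split once: paths may contain tabs? no, but be safe
    entries.append((path, int(tokens)))

total = sum(t for _, t in entries)  # 1321
```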

JSON output

Use --json when you need machine-readable output:

tu --json .

When --compare is used, text output becomes a TSV table with path plus one column per tokenizer, and JSON output switches to a tokenizers + results structure.

The JSON payload includes:

  • tokenizer: the tokenizer configuration used for the run
  • entries: emitted entries
  • total: summed totals across roots
  • had_errors: whether any execution error occurred
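A script can pull the run total straight out of that payload. The sample document below is hypothetical: it only mirrors the top-level fields listed above, and the inner field names (`backend`, `encoding`, `path`, `tokens`) are assumptions for illustration.

```python
import json

# Hypothetical payload shaped like the fields described above.
payload = json.loads("""
{
  "tokenizer": {"backend": "openai", "encoding": "o200k_base"},
  "entries": [
    {"path": "README.md", "tokens": 87},
    {"path": "src", "tokens": 1234}
  ],
  "total": {"tokens": 1321},
  "had_errors": false
}
""")

assert not payload["had_errors"]
print(payload["total"]["tokens"])  # prints 1321
```

This mirrors the jq pipeline shown later (`jq '.total.tokens'`) for environments where jq is unavailable.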

Options

Usage: tu [OPTIONS] [PATH]...

Arguments:
  [PATH]...  Files or directories to scan. Use `-` to read stdin

Options:
  -a, --all                    Output every file and directory aggregate
  -s, --summarize              Output only the summary for each root input
  -d, --max-depth <N>          Limit displayed depth. Deeper descendants are still counted in aggregates
      --tokenizer <TOKENIZER>  Select the tokenizer backend [possible values: openai, hf]
      --encoding <ENCODING>    Select a builtin tokenizer family [possible values: o200k_base, cl100k_base, p50k_base, r50k_base, qwen3, deepseek_v3_2, glm5]
      --tokenizer-file <PATH>  Path to a HuggingFace tokenizer.json
      --compare <TOKENIZER>    Count with an additional tokenizer in the same run. Repeatable. Accepts builtin ids or file:<path>
      --binary <BINARY>        Binary file handling policy [default: skip] [possible values: skip, lossy, error]
      --no-ignore              Disable .gitignore, .ignore, and git exclude rules
      --exclude <GLOB>         Exclude matching paths. Repeatable
  -L, --follow-links           Follow symbolic links
  -H, --human                  Print human-readable token units
      --json                   Emit JSON instead of text output
      --total                  Print a total row when multiple roots are provided
  -h, --help                   Print help
  -V, --version                Print version

Examples

Show only a top-level summary for the current repository:

tu .

Inspect a subtree in detail:

tu --all --max-depth 2 src

Compare two directories and print a combined total:

tu --total app server

Count a prompt under a different tokenizer:

tu --encoding cl100k_base prompt.md
tu --encoding qwen3 prompt.md

Count with a HuggingFace tokenizer:

tu --encoding qwen3 docs
tu --tokenizer-file ./tokenizer.json docs

Compare OpenAI and HuggingFace tokenizers:

tu --compare o200k_base --compare qwen3 docs
tu --compare qwen3 --compare file:./tokenizer.json docs

Use in shell pipelines:

git show HEAD~1:README.md | tu

Process JSON output with jq:

tu --json . | jq '.total.tokens'

Exit Codes

  • 0: success
  • 1: the command completed but at least one scan/read/count error occurred
  • 2: invalid configuration or startup failure

Warnings, such as skipped binary files, are written to stderr.

Notes

  • Token counts depend on the selected tokenizer. Different backends or encodings will produce different numbers.
  • --tokenizer is optional for builtin tokenizers; when given, it acts as an explicit constraint on which backend is used.
  • --max-depth only affects displayed entries. Deeper files still contribute to ancestor aggregates.
  • --summarize exists for clarity, but summary-only output is already the default.
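The --max-depth note can be made concrete: aggregation and display are separate steps, so truncating the display never changes the numbers. A sketch with hypothetical per-file counts (not tu's implementation):

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical per-file token counts.
files = {"src/a.rs": 100, "src/deep/nested/b.rs": 50, "README.md": 10}

# Every file contributes to each ancestor directory's aggregate...
totals = defaultdict(int)
for path, tokens in files.items():
    for parent in PurePosixPath(path).parents:
        totals[str(parent)] += tokens
    totals[path] = tokens

# ...but a display-depth limit only filters which entries are shown.
max_depth = 1
shown = {p: t for p, t in totals.items()
         if len(PurePosixPath(p).parts) <= max_depth}
```

With max_depth = 1, `src/deep` is hidden from the output, yet `src` still reports 150 tokens because the deeply nested file was counted into its aggregate.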

Contributing

Use just to set up the local environment and run checks:

just ready

Useful recipes:

just venv   # create .venv if it does not exist
just lock   # refresh uv.lock
just sync   # install Python dev dependencies into .venv
just test   # run the full Rust test suite

just ready is the recommended entrypoint for contributors. It ensures the project virtualenv exists, synchronizes the Python test dependency used by integration tests, and then runs cargo test.

Release

Preview a release without changing the repository:

just release-plan <version>

Create the local release commit, tag, and changelog:

just release <version>

Push the branch and tag to GitHub:

just publish-release <version>

These commands run `just ready` first, update the version, regenerate CHANGELOG.md, create a `release: vX.Y.Z` commit, and create an annotated `vX.Y.Z` tag. After the tag is pushed, GitHub Actions creates the GitHub Release and uploads the platform archives.

About

A du-like CLI for counting tokens
