cllm-tokens

A lightweight CLI utility for counting tokens in text and files.

Purpose

cllm-tokens is a command-line tool that helps you quickly count the number of tokens in text. Whether you're analyzing prompts for language models, understanding token consumption, or debugging text processing, cllm-tokens provides a simple, fast way to count tokens from stdin or files.

Vision and Goals

Vision: Build a simple, intuitive CLI that makes token counting instant and transparent.

Goals:

  • Provide fast, accurate token counting from stdin or files
  • Support multiple input methods (piping, file paths)
  • Display clear, actionable token count results
  • Offer a lightweight alternative to heavier token analysis tools
  • Document decision-making processes through ADRs to guide AI-assisted development

Getting Started

This project uses Vibe ADR (Architecture Decision Records) to document key decisions and guide development. See docs/decisions/ for decision records.

Prerequisites

  • Python 3.12 or later
  • uv package manager

Installation

Install uv (if not already installed):

curl -LsSf https://astral.sh/uv/install.sh | sh

Then install cllm-tokens:

uv sync
uv pip install -e .

Or in one command:

uv sync && uv pip install -e .

Usage

Basic Usage

Count tokens from stdin:

echo "Hello, world!" | cllm-tokens
# Output: 5

Count tokens from a file:

cllm-tokens /path/to/file.txt
# Output: 1,234

Count tokens from multiple files:

cllm-tokens file1.txt file2.txt file3.txt
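Conceptually, each of these invocations reduces to encoding the input and taking the length of the resulting token list. A self-contained sketch of that flow in Python — using a naive word/punctuation regex as a stand-in for the real tiktoken encoder, so the counts here are illustrative and will not match the tool's:

```python
import re
import sys

# Stand-in tokenizer: one token per word or punctuation mark. The real
# tool uses tiktoken's BPE encodings; this regex is only illustrative.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def count_tokens(text: str) -> int:
    """Return a rough token count for `text`."""
    return len(_TOKEN_RE.findall(text))

def count_file(path: str) -> int:
    """Read a file and count its tokens."""
    with open(path, encoding="utf-8") as f:
        return count_tokens(f.read())

def main(argv: list[str]) -> None:
    if argv:
        # One count per file argument, mirroring `cllm-tokens f1 f2 ...`
        for path in argv:
            print(count_file(path))
    else:
        # No arguments: read stdin, mirroring `echo ... | cllm-tokens`
        print(count_tokens(sys.stdin.read()))
```

The stdin-or-files branching in `main` is the whole input model: if any file paths are given, each is counted separately; otherwise the tool consumes standard input.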

Options

  • --verbose or -v: Show detailed output including file names
  • --quiet or -q: Show only the token count (no formatting)
  • --total or -t: Show total count when processing multiple files
  • --encoding: Specify tokenization encoding (default: cl100k_base)

Supported encodings:

  • cl100k_base – GPT-3.5, GPT-4 (default)
  • p50k_base – Code models, older GPT-3 variants
  • r50k_base – Legacy models
  • o200k_base – GPT-4o and newer models
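For illustration, here is how that option surface could be declared. The real tool uses Click (see ADR 0003 below); stdlib argparse stands in here, and only the flag names, defaults, and encoding choices are taken from the lists above:

```python
import argparse

# Encoding names taken from the supported-encodings list above.
SUPPORTED_ENCODINGS = ("cl100k_base", "p50k_base", "r50k_base", "o200k_base")

def build_parser() -> argparse.ArgumentParser:
    # argparse stands in for Click here; the real CLI uses Click decorators.
    parser = argparse.ArgumentParser(prog="cllm-tokens")
    parser.add_argument("files", nargs="*",
                        help="files to count; reads stdin if omitted")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="show detailed output including file names")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="show only the token count (no formatting)")
    parser.add_argument("-t", "--total", action="store_true",
                        help="show total count for multiple files")
    parser.add_argument("--encoding", choices=SUPPORTED_ENCODINGS,
                        default="cl100k_base",
                        help="tokenization encoding")
    return parser
```

With this parser, `cllm-tokens --encoding p50k_base -t a.txt b.txt` yields `encoding="p50k_base"`, `total=True`, and `files=["a.txt", "b.txt"]`; invalid encoding names are rejected up front by `choices`.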

Examples

Analyze a prompt:

echo "Write a poem about programming" | cllm-tokens
# Output: 6

Check token usage of a document:

cllm-tokens myessay.txt
# Output: 1,234

Count tokens using a different encoding:

cllm-tokens --encoding p50k_base document.txt

Get detailed output for a file:

cllm-tokens --verbose document.txt
# Output: document.txt: 1,234 tokens

Count tokens from multiple files with a total:

cllm-tokens --total file1.txt file2.txt file3.txt
# Output:
# 100
# 200
# 300
#
# Total: 600 tokens

Count tokens from piped content with quiet mode:

cat document.txt | cllm-tokens --quiet
# Output: 1234

Development

Project Structure

  • src/cllm_tokens/ – Main package
    • counter.py – Token counting logic using tiktoken
    • cli.py – Command-line interface using Click
    • __init__.py – Package exports
  • tests/ – Test suite
    • test_counter.py – Token counter tests (20+ tests)
    • test_cli.py – CLI interface tests (27+ tests)

Running Tests

Install the package in development mode and run tests:

uv sync
uv pip install -e .
python -m pytest tests/ -v

Code Quality

Format code with black and lint with ruff:

uv run --with black black src/ tests/
uv run --with ruff ruff check src/ tests/ --fix

Architecture

The project is built on two main ADRs:

  • ADR 0003: CLI interface using Click framework

    • Simple decorator-based API
    • Supports stdin, files, and multiple inputs
    • Built-in help, verbose, and quiet modes
    • Proper error handling and exit codes
  • ADR 0004: Token counting using tiktoken

    • OpenAI's official tokenizer for GPT models
    • Support for multiple encodings (cl100k_base, p50k_base, etc.)
    • Fast Rust-based implementation
    • Accurate token counts matching API usage
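ADR 0003's "proper error handling and exit codes" point can be sketched as: readable files print their counts, unreadable files report to stderr, and the process exit code reflects whether anything failed. The exit-code convention and message format below are assumptions for illustration, not the tool's documented behavior, and the whitespace split again stands in for tiktoken:

```python
import sys

def run(paths: list[str]) -> int:
    """Count whitespace-separated chunks per file; return an exit code.

    Convention assumed here: 0 = all files counted, 1 = any file
    unreadable. The real CLI's codes and messages may differ.
    """
    exit_code = 0
    for path in paths:
        try:
            with open(path, encoding="utf-8") as f:
                print(len(f.read().split()))
        except OSError as err:
            # Report the failure but keep processing remaining files.
            print(f"cllm-tokens: {path}: {err.strerror}", file=sys.stderr)
            exit_code = 1
    return exit_code
```

Continuing past a bad file (rather than aborting) lets a multi-file invocation still report counts for the files it could read, while the non-zero exit code keeps the failure visible to scripts.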

Both modules are thoroughly tested with 54+ passing tests covering:

  • Simple and complex text inputs
  • File I/O and error handling
  • Multiple encodings
  • Unicode and special characters
  • CLI options and flags

Decision Documentation

All significant architectural and implementation decisions are recorded in the docs/decisions/ directory as Architecture Decision Records (ADRs). See 0001-adopt-vibe-adr.md for details on how we structure decisions.

Contributing

When contributing new features or making architectural decisions:

  1. Read the relevant ADRs in docs/decisions/
  2. Create a new ADR for significant decisions using templates/VIBE_ADR_TEMPLATE.md
  3. Reference ADR IDs in commit messages (e.g., "ref: 0002-token-counting")
  4. Keep decisions linked to implementation commits for traceability

Guiding Principles

  • Lightweight: Keep abstractions simple and dependencies minimal
  • Transparent: Make token counting logic clear and auditable
  • AI-Collaborative: Use ADRs to guide AI agents in understanding project intent
  • Intentional: Document the "why" behind decisions, not just the "what"
