A CLI tool that counts tokens in various file types (txt, md, pdf, html) for different LLM models.
- Calculate token counts for various file types (Text, Markdown, PDF, HTML)
- Support for multiple LLM models (configurable via config.json)
- Display token usage as percentage of context window
- Powered by the HuggingFace `tokenizers` library
Clone the repository and build the project:
```bash
git clone https://github.com/deltartificial/tokenizer.git
cd tokenizer
cargo build --release
```
```bash
# Count tokens in a file using the default config.json
./target/release/tokenizer count path/to/your/file.txt

# Count tokens using a custom config file
./target/release/tokenizer count path/to/your/file.txt -c custom-config.json

# Count tokens using a specific tokenizer model
./target/release/tokenizer count path/to/your/file.html -t roberta-base
```
The tool uses a `config.json` file to define models and their context lengths. The default file includes configurations for various models:
```json
{
  "models": [
    {
      "name": "gpt-3.5-turbo",
      "context_length": 16385,
      "encoding": "tiktoken"
    },
    {
      "name": "gpt-4",
      "context_length": 8192,
      "encoding": "tiktoken"
    },
    {
      "name": "bert-base",
      "context_length": 512,
      "encoding": "bert"
    },
    ...
  ]
}
```
You can customize this file to add or modify models as needed.
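To add a model, append an entry with `name`, `context_length`, and `encoding` to the `models` array. As a rough sketch of how such a file could be parsed in Rust (the struct names and the use of `serde`/`serde_json` here are assumptions for illustration, not necessarily the project's actual code):

```rust
use serde::Deserialize;

// Hypothetical structs mirroring the config.json schema shown above.
#[derive(Debug, Deserialize)]
struct Config {
    models: Vec<ModelConfig>,
}

#[derive(Debug, Deserialize)]
struct ModelConfig {
    name: String,
    context_length: usize,
    encoding: String,
}

// Read and deserialize the config file from disk.
fn load_config(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&raw)?)
}
```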
This tool uses HuggingFace's `tokenizers` library, which provides high-performance implementations of various tokenization algorithms. The default tokenizer is BERT, but the architecture is designed to be easily extended to support other tokenizers.
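For orientation, a minimal token-counting sketch built on the `tokenizers` crate might look like this. Loading from a local `tokenizer.json` file is an assumption for the example; the real tool resolves tokenizers through its config:

```rust
use tokenizers::Tokenizer;

/// Count tokens in `text` and report usage as a percentage of a model's
/// context window. Loading "tokenizer.json" from disk is an assumption
/// for this sketch, not the tool's actual lookup strategy.
fn count_tokens(
    text: &str,
    context_length: usize,
) -> Result<(usize, f64), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    let encoding = tokenizer.encode(text, false)?;
    let count = encoding.get_ids().len();
    let usage = count as f64 / context_length as f64 * 100.0;
    Ok((count, usage))
}
```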
Supported file types:
- `.txt` - Plain text files
- `.md` - Markdown files
- `.pdf` - PDF documents (basic implementation)
- `.html` / `.htm` - HTML files (tags are stripped for token counting)
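One simple way to handle the HTML case, not necessarily what this project does, is to strip tags with a regular expression before tokenizing:

```rust
use regex::Regex;
use std::path::Path;

/// Read a file and, for .html/.htm inputs, strip tags before counting.
/// The regex-based stripping here is a simplification for illustration.
fn read_for_counting(path: &Path) -> std::io::Result<String> {
    let raw = std::fs::read_to_string(path)?;
    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    if ext.eq_ignore_ascii_case("html") || ext.eq_ignore_ascii_case("htm") {
        let tags = Regex::new(r"<[^>]*>").expect("valid regex");
        Ok(tags.replace_all(&raw, " ").into_owned())
    } else {
        Ok(raw)
    }
}
```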
The project follows a clean architecture approach:
- `domain`: Core business logic and entities
- `application`: Use cases that orchestrate the domain logic
- `infrastructure`: External services implementation (file reading, tokenization)
- `presentation`: User interface (CLI)
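A sketch of how these layers can fit together; the trait and type names below are hypothetical and only illustrate the dependency direction (outer layers implement ports defined by the domain):

```rust
// domain: a port (trait) the rest of the code depends on.
trait TokenCounter {
    fn count(&self, text: &str) -> usize;
}

// application: a use case that depends only on the domain trait.
fn context_usage(counter: &dyn TokenCounter, text: &str, context_length: usize) -> f64 {
    counter.count(text) as f64 / context_length as f64 * 100.0
}

// infrastructure: an adapter backed by the HuggingFace tokenizers crate.
// The presentation layer (CLI) would wire this adapter into the use case.
struct HfTokenCounter {
    tokenizer: tokenizers::Tokenizer,
}

impl TokenCounter for HfTokenCounter {
    fn count(&self, text: &str) -> usize {
        self.tokenizer
            .encode(text, false)
            .map(|enc| enc.get_ids().len())
            .unwrap_or(0)
    }
}
```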
License: MIT