# Krishna Tokenizer — Counting and Inspecting Tokens

This notebook shows how to tokenize messy text (emojis, punctuation, mixed case, repeated characters), compute a frontend digit (0–9) for each token, and generate backend numbers (a full-size identity number and a scaled-down ID). It follows the spirit of the tiktoken token-counting guide.

Reference: [How_to_count_tokens_with_tiktoken.ipynb](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)

## What you’ll learn
- Preserve-all tokenization (space/word/char/grammar/subword/byte)
- Frontend digits (0–9) per token
- Backend numbers (full M and scaled small-IDs)
- Digit-only streams (flattened 0–9)
- Determinism and validation (checksums, lengths)
- DEV vs USER vs JSON output styles



## Quick Start

Run the CLI for interactive exploration:

```bash
python krishna_tokenizer.py
```

- Choose input: type text or provide a file path
- Choose output mode: DEV (full), USER (summary), JSON (compact)
- Optionally save per-stream JSONL files



## Example Input

```text
mother Fucker :), @nIpuKuLLO Naaaaaaa mOOOOOOOdAAAAAAAA.
```

This input includes spaces, punctuation, symbols, mixed case, and repeated characters. The tokenizer keeps every original character intact for display. Numbers are computed on a separate math view, which is lowercased and repeat-aware, so normalization never alters the displayed text.
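
The split between the display view and the math view can be sketched as follows. This is only an illustration: the function name `math_view` is hypothetical, and it assumes that "repeat-aware" means run-length encoding of consecutive identical characters, which the tokenizer itself may handle differently.

```python
from itertools import groupby

def math_view(text):
    """Hypothetical math view: lowercase, then run-length encode repeats.

    Assumption: "repeat-aware" is modeled here as (character, run length)
    pairs. The real tokenizer may normalize repeats in another way.
    """
    lowered = text.lower()
    return [(ch, len(list(run))) for ch, run in groupby(lowered)]

original = "mOOd"
display_view = original          # kept exactly as typed
numbers_view = math_view(original)
print(display_view)
print(numbers_view)
```

The original string is never modified, so the display side stays lossless regardless of how aggressive the normalization is.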



## Tokenization Levels

- Space tokens: split on spaces
- Word tokens: alphanumerics grouped; punctuation separated
- Grammar tokens: words and individual punctuation/emojis
- Character tokens: one token per character
- Subword tokens: fixed n-grams (3 chars)
- Byte tokens: the decimal digits of each character's code point (a fallback that does not depend on UTF-8 encoding)
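
A minimal sketch of several of these levels, written so that joining the tokens reproduces the input. The function names and regular expressions are assumptions for illustration, not the tokenizer's actual code. Grammar tokens are omitted because their exact difference from word tokens is not specified here, and the subword split assumes non-overlapping 3-character chunks.

```python
import re

def space_tokens(text):
    # Runs of spaces and runs of non-spaces, so nothing is dropped.
    return re.findall(r" +|[^ ]+", text)

def word_tokens(text):
    # Alphanumeric runs grouped; every other character is its own token.
    return re.findall(r"[^\W_]+|[\W_]", text)

def char_tokens(text):
    # One token per character.
    return list(text)

def subword_tokens(word, n=3):
    # Assumption: fixed, non-overlapping chunks of n characters.
    return [word[i:i + n] for i in range(0, len(word), n)]

def byte_tokens(text):
    # Decimal digits of each character's code point, one digit per token.
    return [digit for ch in text for digit in str(ord(ch))]

sample = "Hi :)"
print(space_tokens(sample))
print(word_tokens(sample))
print(char_tokens(sample))
print(subword_tokens("mother"))
print(byte_tokens("A"))
```

Because every level keeps whitespace and punctuation as tokens, `"".join(tokens)` returns the original text for the space, word, character, and subword levels.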



## Frontend and Backend

For each token:
- frontend: a single digit, 0–9 (the Krishna digit)
- backend_huge: full identity number M (deterministic, large)
- backend_scaled: small readable ID (0..99999)
- backend_digits: flattened 0–9 digit stream from backend_scaled
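
The relationship between the scaled IDs and the flattened digit stream can be sketched like this. The helper names are hypothetical, and reducing M with a modulo is purely an assumption to obtain a value in 0..99999. How the tokenizer actually derives `backend_huge` and `backend_scaled`, and whether it pads leading zeros, is not described here.

```python
def scale_id(backend_huge, modulus=100_000):
    # Assumption: modulo reduction stands in for the real scaling rule.
    return backend_huge % modulus

def flatten_digits(scaled_ids):
    # Concatenate the decimal digits of each scaled ID into one 0-9 stream.
    return [int(d) for sid in scaled_ids for d in str(sid)]

scaled = [scale_id(m) for m in (1234567890123, 407)]
print(scaled)
print(flatten_digits(scaled))
```

Since both steps are pure functions of their input, running them twice on the same tokens yields the same stream, which is the property the determinism checks rely on.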



## DEV vs USER vs JSON Modes

- DEV: full debug (all token lists, digits, backends, IDs)
- USER: concise summary (word tokens, frontend digits, backend_digits)
- JSON: compact machine-readable summary
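
As a rough sketch of how one result could be rendered in the three modes: the `render` function, the dictionary keys, and the sample values below are all placeholders chosen for illustration, not the tokenizer's real output or schema.

```python
import json

SUMMARY_KEYS = ("word_tokens", "frontend", "backend_digits")

def render(result, mode):
    """Render a result dict in DEV, USER, or JSON style (hypothetical)."""
    if mode == "DEV":
        # Everything, pretty-printed for debugging.
        return json.dumps(result, indent=2, ensure_ascii=False)
    summary = {key: result[key] for key in SUMMARY_KEYS}
    if mode == "USER":
        # Short human-readable lines.
        return "\n".join(f"{key}: {value}" for key, value in summary.items())
    if mode == "JSON":
        # Compact, single-line, machine-readable.
        return json.dumps(summary, separators=(",", ":"), ensure_ascii=False)
    raise ValueError(f"unknown mode: {mode}")

# Placeholder values only; these are not computed by the tokenizer.
sample_result = {
    "word_tokens": ["hi", "!"],
    "frontend": [1, 2],
    "backend_scaled": [10, 20],
    "backend_digits": [1, 0, 2, 0],
}
print(render(sample_result, "USER"))
print(render(sample_result, "JSON"))
```

Keeping one result dictionary and deriving every mode from it means the three outputs cannot drift apart.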

These modes mirror the presentation style of the tiktoken guide, while the frontend and backend math remains specific to Krishna.



## References

- tiktoken counting tutorial: [How_to_count_tokens_with_tiktoken.ipynb](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)
- The Krishna tokenizer preserves every original character; the math view normalizes text only so that the computed numbers stay stable.

