# Tokenizer Demo
This notebook demonstrates tokenization using two popular libraries:
- **Hugging Face Transformers** – for BERT‑style tokenizers
- **tiktoken** – OpenAI's fast tokenizer used by GPT models

## Installation
You need to install the following Python packages before running the examples:
```bash
pip install transformers tiktoken
```
The command above works on most platforms (Linux, macOS, Windows).

### Detailed tiktoken installation notes
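* **Python version**: Recent tiktoken releases require Python 3.9 or newer (3.8 was supported only by older releases). Check the project's PyPI page for the minimum that applies to the version you install.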
* **Binary wheels**: On Linux, macOS, and Windows, pip normally downloads a pre-compiled wheel (a `manylinux` wheel on Linux), so no compiler is needed.
* **Building from source**: If no wheel is available for your platform, pip attempts to build from source. This requires a Rust toolchain (`rustc` and `cargo`), because tiktoken's core is a Rust extension, plus a C compiler and linker (e.g., `gcc` on Linux, `clang` on macOS). Install Rust via `curl https://sh.rustup.rs -sSf | sh` if needed.
* **Conda users**: You can also install via conda-forge: `conda install -c conda-forge tiktoken`.
* **Optional dependencies**: No additional system libraries are required for the basic encoder. The encoding files themselves are not bundled with the package: tiktoken downloads an encoding's vocabulary the first time it is requested (for example `cl100k_base`, used by `gpt-3.5-turbo` and `gpt-4`) and caches it locally, so the first call needs internet access.
* **Verification**: After installation, run `python -c "import tiktoken; print(tiktoken.__version__)"` to confirm that the package imports and to see which version is installed.
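
As an alternative to the one-liner in the verification note, package versions can be read from the installed metadata without importing the packages themselves. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages this notebook depends on.
for name in ("transformers", "tiktoken"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```

Reading metadata this way avoids the import cost of `transformers` and reports a missing package instead of raising an exception.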

You can also install the packages directly from this notebook using the magic command below (run the cell).

In [2]:
%pip install transformers tiktoken


Collecting transformers
  Using cached transformers-4.57.6-py3-none-any.whl.metadata (43 kB)
Collecting tiktoken
  Downloading tiktoken-0.12.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.7 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Using cached tokenizers-0.22.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.3 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Using cached safetensors-0.7.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface-hub<1.0,>=0.34.0->transformers)
  Using cached hf_xet-1.2.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Using cached transformers-4.57.6-py3-none-any.whl (12.0 MB)
Downloading tiktoken-0.12.0-cp312-cp312-manylinux_2_28_x86_64.whl (1.2 MB)

## Example 1: Hugging Face Transformers tokenizer
We load the `bert-base-uncased` tokenizer and tokenize a short sentence.

In [3]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "Hello, world! This is a tokenizer demo."
print('Transformers tokens:', tokenizer.tokenize(text))


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Transformers tokens: ['hello', ',', 'world', '!', 'this', 'is', 'a', 'token', '##izer', 'demo', '.']
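
In the output above, `##izer` is a WordPiece continuation piece: the `##` prefix means the piece attaches to the previous token, so `token` followed by `##izer` spells `tokenizer`. As an illustration, the sketch below rejoins such pieces in plain Python. `merge_wordpieces` is a hypothetical helper written for this notebook, not a Transformers API; the library's own `tokenizer.convert_tokens_to_string` covers the same need when you want a single string back.

```python
from typing import List


def merge_wordpieces(tokens: List[str]) -> List[str]:
    """Glue '##'-prefixed continuation pieces onto the preceding token."""
    words: List[str] = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the '##' marker and append
        else:
            words.append(tok)
    return words


# Token list copied from the cell output above.
tokens = ['hello', ',', 'world', '!', 'this', 'is', 'a',
          'token', '##izer', 'demo', '.']
print(merge_wordpieces(tokens))
# ['hello', ',', 'world', '!', 'this', 'is', 'a', 'tokenizer', 'demo', '.']
```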


## Example 2: tiktoken
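tiktoken provides fast byte-pair-encoding (BPE) tokenization for OpenAI models.
We use the `cl100k_base` encoding (used by `gpt-3.5-turbo` and `gpt-4`). Unlike `tokenizer.tokenize` in Example 1, `enc.encode` returns integer token IDs rather than token strings.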

In [4]:
import tiktoken

# `text` is defined in the Example 1 cell above
enc = tiktoken.get_encoding('cl100k_base')
token_ids = enc.encode(text)  # encode once and reuse the result
print('tiktoken tokens:', token_ids)
# Show token count
print('Number of tokens:', len(token_ids))


tiktoken tokens: [9906, 11, 1917, 0, 1115, 374, 264, 47058, 17074, 13]
Number of tokens: 10
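
The integer IDs above are hard to read on their own. A minimal sketch for mapping each ID back to the bytes it represents follows. `token_pieces` is a hypothetical helper, and it assumes only that the encoder object offers `encode` and `decode_single_token_bytes`, both of which tiktoken's `Encoding` class provides.

```python
from typing import Any, List, Tuple


def token_pieces(enc: Any, text: str) -> List[Tuple[int, bytes]]:
    """Pair each token ID with the raw bytes it decodes to."""
    return [(tid, enc.decode_single_token_bytes(tid)) for tid in enc.encode(text)]


# In the notebook, with `enc` and `text` from the cells above:
# for token_id, piece in token_pieces(enc, text):
#     print(token_id, piece)
```

The pieces are kept as bytes rather than strings because a single BPE token may hold only part of a multi-byte UTF-8 character, so decoding one token to `str` in isolation is not always possible.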
