# Tokenizer Demo
This notebook demonstrates tokenization using two popular libraries:
- **Hugging Face Transformers** – for BERT-style tokenizers
- **tiktoken** – OpenAI's fast tokenizer library, used by GPT models

## Installation
You need to install the following Python packages before running the examples:
```bash
pip install transformers tiktoken
```
The command above works on most platforms (Linux, macOS, Windows).

### Detailed tiktoken installation notes
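* **Python version**: tiktoken requires Python >= 3.8. Newer releases may raise this minimum, so check the package's PyPI page if installation fails on an older interpreter.
* **Binary wheels**: On Linux, macOS, and Windows, pip normally downloads a pre-compiled wheel (manylinux on Linux), so no compiler is needed.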
* **Building from source**: If a wheel is not available for your platform, pip will attempt to build from source. This requires a C compiler (e.g., `gcc` on Linux, `clang` on macOS) and `rustc` because tiktoken includes Rust extensions. Install Rust via `curl https://sh.rustup.rs -sSf | sh` if needed.
* **Conda users**: You can also install via conda‑forge: `conda install -c conda-forge tiktoken`.
* **Optional dependencies**: No additional system libraries are required for the basic encoder. The encoding files themselves, such as `cl100k_base` (used by `gpt-3.5-turbo` and `gpt-4`), are not bundled with the package: tiktoken downloads them on first use and caches them locally, so the first call to `tiktoken.get_encoding` needs internet access.
* **Verification**: After installation, run `python -c "import tiktoken; from importlib.metadata import version; print(version('tiktoken'))"` to confirm that the package imports and to print its installed version.
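
The verification step can also be scripted so that a missing package is reported instead of raising an error. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of either library.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two packages this notebook relies on.
for name in ("transformers", "tiktoken"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

This only checks the installed package metadata; it does not confirm that tiktoken's compiled extension loads, which is what the `import tiktoken` check is for.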

You can also install the packages directly from this notebook using the magic command below (run the cell).

In [1]:
%pip install transformers tiktoken

Note: you may need to restart the kernel to use updated packages.


## Example 1: Hugging Face Transformers tokenizer
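We load the `bert-base-uncased` tokenizer and tokenize a short sentence. Because this model is uncased, the tokenizer lowercases its input, so the decoded text comes back in lowercase rather than matching the original exactly.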

In [3]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "Hello, world! This is a tokenizer demo."
print('Transformers tokens:', tokenizer.tokenize(text))
# Encode to ids and decode back
ids = tokenizer.encode(text, add_special_tokens=False)
print('Encoded ids:', ids)
decoded = tokenizer.decode(ids)
print('Decoded text:', decoded)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Transformers tokens: ['hello', ',', 'world', '!', 'this', 'is', 'a', 'token', '##izer', 'demo', '.']
Encoded ids: [7592, 1010, 2088, 999, 2023, 2003, 1037, 19204, 17629, 9703, 1012]
Decoded text: hello, world! this is a tokenizer demo.
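
The `'token', '##izer'` split in the output comes from WordPiece, the subword scheme BERT uses: a word missing from the vocabulary is broken into the longest pieces that are present, and continuation pieces carry a `##` prefix. The sketch below illustrates that greedy longest-match-first idea on a toy vocabulary. It is not the Transformers implementation, and the function name `wordpiece` and the vocabulary are made up for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only; the real bert-base-uncased vocabulary has about 30,000 entries.
toy_vocab = {"token", "##izer", "##s", "demo", "hello"}

print(wordpiece("tokenizer", toy_vocab))  # ['token', '##izer']
print(wordpiece("tokens", toy_vocab))     # ['token', '##s']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```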


## Example 2: tiktoken
tiktoken provides fast byte-pair-encoding (BPE) tokenization for OpenAI models.
We use the `cl100k_base` encoding (used by gpt-3.5-turbo and gpt-4) and reuse the `text` variable defined in Example 1.

In [4]:
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
# Encode once and reuse the result; encode() returns integer token ids
token_ids = enc.encode(text)
print('tiktoken tokens:', token_ids)
print('Number of tokens:', len(token_ids))
# Decode back to string
decoded_tiktoken = enc.decode(token_ids)
print('Decoded with tiktoken:', decoded_tiktoken)

tiktoken tokens: [9906, 11, 1917, 0, 1115, 374, 264, 47058, 17074, 13]
Number of tokens: 10
Decoded with tiktoken: Hello, world! This is a tokenizer demo.
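
tiktoken is a byte-pair-encoding (BPE) tokenizer: it starts from the raw UTF-8 bytes of the text and repeatedly merges adjacent pieces according to a ranked merge table. The sketch below shows that general idea in pure Python. It is not tiktoken's implementation, and the function name `bpe_merge` and the merge table `toy_ranks` are hypothetical.

```python
def bpe_merge(word, ranks):
    """Merge adjacent byte pieces, always picking the pair with the lowest rank."""
    parts = [bytes([b]) for b in word.encode("utf-8")]
    while len(parts) > 1:
        # Find the adjacent pair whose concatenation has the lowest (best) rank.
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no adjacent pair is in the table, so merging stops
        parts[best_index:best_index + 2] = [parts[best_index] + parts[best_index + 1]]
    return parts


# Hypothetical merge table for illustration; real encodings such as cl100k_base hold roughly 100,000 entries.
toy_ranks = {b"to": 0, b"ke": 1, b"ken": 2, b"token": 3}

print(bpe_merge("token", toy_ranks))   # [b'token']
print(bpe_merge("tokens", toy_ranks))  # [b'token', b's']
```

A real encoding then maps each final piece to an integer id, which is what `enc.encode` returns.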
