
# 🔤 Hugging Face Tokenization Lab

Welcome to the Hugging Face Tokenization lab! This lab will help you explore the core concepts of tokenization in NLP using the `transformers` and `tokenizers` libraries from Hugging Face.

We will go from basic theory to complex customizations, including using pre-trained models and training your own tokenizer from scratch.


In [None]:
!pip install transformers tokenizers datasets --quiet


## 📘 1. What is Tokenization?

Tokenization is the process of converting a text input (e.g., a sentence or paragraph) into a sequence of tokens that a model can understand.

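Tokens are typically words, subwords, or characters. Each token is then mapped to an integer ID from the tokenizer's vocabulary, and those IDs are what the model actually receives.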

Tokenization is the **first step** in most NLP pipelines.
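
To make the text → tokens → IDs round trip concrete, here is a minimal sketch that uses plain whitespace splitting and a tiny made-up vocabulary. The names `toy_vocab`, `encode_text`, and `decode_ids` are hypothetical and are not part of any Hugging Face API.

```python
# Hypothetical toy vocabulary: token string -> integer ID (illustration only).
toy_vocab = {"[UNK]": 0, "tokenization": 1, "is": 2, "the": 3, "first": 4, "step": 5}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode_text(text):
    """Lowercase, split on whitespace, and map each token to its ID."""
    tokens = text.lower().split()
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]


def decode_ids(ids):
    """Map IDs back to token strings and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode_text("Tokenization is the first step")
print(ids)                                 # [1, 2, 3, 4, 5]
print(decode_ids(ids))                     # tokenization is the first step
print(encode_text("Tokenization is fun"))  # [1, 2, 0]
```

Note how the unseen word `fun` falls back to the `[UNK]` ID. Real tokenizers try hard to avoid that, which is one motivation for subword vocabularies.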



## 🔍 2. Why Tokenization is Important

Transformer models operate on sequences of integer token IDs, not on raw text, so every input must be tokenized first.

A good tokenizer ensures:

- Efficient encoding of text
- Handling of rare words
- Preservation of meaning
- Compatibility with model architecture



## ✂️ 3. Types of Tokenization

1. **Character-level tokenization**
2. **Word-level tokenization**
3. **Subword tokenization** (most common for Transformers)

Subword tokenizers break rare words into more frequent pieces. The exact split depends on the learned vocabulary, but an illustrative example is:
- `"unhappiness"` → `["un", "happi", "ness"]`

Popular subword algorithms:
- **BPE** (Byte-Pair Encoding)
- **WordPiece**
- **Unigram**
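
The trade-off between the first two approaches is easy to see in plain Python. The sketch below is only an illustration (the sentence is arbitrary): character-level tokenization needs a tiny vocabulary but produces long sequences, while word-level tokenization produces short sequences but needs one vocabulary entry per distinct word. Subword tokenization sits between the two.

```python
sentence = "Tokenizers split text"

# Character-level: every character (including spaces) becomes a token.
char_tokens = list(sentence)

# Word-level: split on whitespace; each distinct word needs its own vocabulary entry.
word_tokens = sentence.split()

print(len(char_tokens), char_tokens[:5])  # 21 ['T', 'o', 'k', 'e', 'n']
print(len(word_tokens), word_tokens)      # 3 ['Tokenizers', 'split', 'text']
```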


## 🤖 4. Using a Pretrained Tokenizer

In [None]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is the first step in NLP!"
tokens = tokenizer.tokenize(text)
tokens


## 📥 5. Tokenizer Outputs

In [None]:

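# encode() returns token IDs and adds special tokens ([CLS] ... [SEP] for BERT) by default
encoded = tokenizer.encode(text)
print(encoded)

# decode() maps IDs back to text; pass skip_special_tokens=True to drop [CLS] and [SEP]
print(tokenizer.decode(encoded))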


## 🔐 6. Special Tokens

In [None]:
tokenizer.special_tokens_map

## 🔓 7. Decoding Tokens

In [None]:

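# Decode one sequence (print explicitly: a notebook cell only displays its last expression)
print(tokenizer.decode(encoded))

# Decode multiple sequences (returns a list of strings)
print(tokenizer.batch_decode([encoded]))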


## ⚙️ 8. Tokenizer Configuration

In [None]:

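# Pad or truncate to exactly 12 tokens; the result holds input_ids, token_type_ids, and attention_mask
encoded = tokenizer(text, add_special_tokens=True, padding="max_length", truncation=True, max_length=12)
encoded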


## 🧠 9. Attention Masks

In [None]:

print(encoded['input_ids'])
print(encoded['attention_mask'])  # 1 = real token, 0 = padding the model should ignore
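
To see where the mask comes from, here is a minimal pure-Python sketch of fixed-length padding and truncation. The helper `pad_or_truncate` and the default `pad_id=0` are assumptions made for illustration. The sketch also simplifies truncation: it just cuts the list, whereas a real tokenizer keeps the closing special token in place.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]          # truncate if too long
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


ids, mask = pad_or_truncate([2, 15, 27, 3], max_length=6)
print(ids)   # [2, 15, 27, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```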


## 🔁 10. Batch Encoding

In [None]:

# padding=True pads to the longest sequence in the batch; return_tensors="pt" requires PyTorch
batch = tokenizer(["Hello world!", "Tokenization is awesome."], padding=True, truncation=True, return_tensors="pt")
batch
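
Padding a batch to its longest sequence can be sketched in a few lines of plain Python. The helper name `pad_batch`, the example IDs, and `pad_id=0` are all hypothetical; the point is only that shorter rows are extended and their masks record which positions are padding.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[2, 15, 3], [2, 15, 27, 31, 3]])
print(padded["input_ids"])       # [[2, 15, 3, 0, 0], [2, 15, 27, 31, 3]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because every row now has the same length, the batch can be stacked into a single rectangular tensor.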


## 🧪 11. Training a Custom Tokenizer

In [None]:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a BPE model; unk_token tells it which token to emit for unseen characters
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])

# Load a small training corpus (the first 1% of the AG News training split)
from datasets import load_dataset
dataset = load_dataset("ag_news", split="train[:1%]")
texts = [x["text"] for x in dataset]

tokenizer.train_from_iterator(texts, trainer)
tokenizer.save("custom-tokenizer.json")
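
If you are curious what BPE training does conceptually, the sketch below implements the textbook merge loop in plain Python: count adjacent symbol pairs, merge the most frequent pair, repeat. It is an assumption-laden illustration, not the `tokenizers` library's actual implementation, and the function name `learn_bpe_merges` and the word counts are made up.

```python
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Learn up to num_merges BPE merge rules from a {word: frequency} mapping."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges


toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each learned merge adds one new symbol to the vocabulary, so a `vocab_size` setting effectively controls how many merge rounds are run.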


## 💾 12. Using a Custom Tokenizer with Transformers

In [None]:

from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="custom-tokenizer.json", unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]")

fast_tokenizer("Let's test our custom tokenizer!")


## 🐢 13. Fast vs Slow Tokenizers

In [None]:

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.is_fast  # True when the tokenizer is backed by the Rust `tokenizers` library



## 🔍 14. Subword Algorithms

The Hugging Face `tokenizers` library supports three main subword algorithms:

- **BPE** (e.g., GPT-2)
- **WordPiece** (e.g., BERT)
- **Unigram** (e.g., XLNet)

Each has different strategies for subword splitting and vocabulary learning.
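
As one example of a splitting strategy, WordPiece tokenizes a word at inference time by greedy longest-match-first lookup, marking continuation pieces with a `##` prefix. The sketch below is a simplified illustration with a hypothetical vocabulary; `wordpiece_tokenize` is not a library function.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary chosen so the example splits cleanly.
vocab = {"un", "##happi", "##ness", "happy", "##s"}
print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("xyz", vocab))          # ['[UNK]']
```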


## 🧩 15. Pre-tokenizers and Normalizers

In [None]:

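from tokenizers import Tokenizer
from tokenizers.normalizers import Lowercase, NFD, StripAccents, Sequence

# `tokenizer` is the BERT wrapper after section 13, so reload the custom tokenizer from disk
custom_tokenizer = Tokenizer.from_file("custom-tokenizer.json")
custom_tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])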
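
To get a feel for what an NFD, lowercase, strip-accents chain does to text, here is a rough standard-library approximation. It is a sketch under the assumption that stripping accents means dropping Unicode combining marks (category `Mn`) after decomposition; the function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text):
    """Apply NFD decomposition, lowercasing, then remove combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")


print(normalize("Héllò Wörld"))  # hello world
```

NFD has to run first: it splits `é` into `e` plus a separate combining accent, which is what makes the accent removable.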


## 🧹 16. Post-processing

In [None]:

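from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

# Load the custom tokenizer saved in section 11 and look up the real special-token IDs
custom_tokenizer = Tokenizer.from_file("custom-tokenizer.json")
custom_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", custom_tokenizer.token_to_id("[CLS]")),
        ("[SEP]", custom_tokenizer.token_to_id("[SEP]")),
    ],
)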
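
The template above can be read as a recipe for wrapping token IDs in special tokens and assigning a type ID to each position. A minimal pure-Python sketch of that recipe follows. The helper `apply_template` and its default `cls_id` and `sep_id` values are placeholders chosen for illustration, not IDs taken from a real vocabulary.

```python
def apply_template(ids_a, ids_b=None, cls_id=2, sep_id=3):
    """Build input IDs and token type IDs for a single sequence or a pair."""
    # Single: [CLS] A [SEP], all with type ID 0.
    input_ids = [cls_id] + ids_a + [sep_id]
    type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # Pair: append B [SEP], both with type ID 1.
        input_ids += ids_b + [sep_id]
        type_ids += [1] * (len(ids_b) + 1)
    return input_ids, type_ids


print(apply_template([10, 11]))        # ([2, 10, 11, 3], [0, 0, 0, 0])
print(apply_template([10, 11], [20]))  # ([2, 10, 11, 3, 20, 3], [0, 0, 0, 0, 1, 1])
```

The `:1` suffixes in the `pair` template correspond to the type ID 1 assigned to the second segment here.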



## ✅ Conclusion

In this lab, you learned:

- What tokenization is and why it matters
- How to use Hugging Face's pretrained tokenizers
- How to decode, pad, truncate, and batch encode
- How to train a custom tokenizer from scratch
- How to apply normalizers, pre-tokenizers, and post-processors

Now you're ready to build NLP pipelines with precise control over tokenization.
