# Tokenization Deep Dive
Explore how Hugging Face Transformers tokenizers turn raw text into model inputs: subword splitting, ID lookup, special tokens, attention masks, padding, and truncation.

## Install Transformers

In [None]:
# torch is needed later for return_tensors="pt"
!pip install transformers torch --quiet

## Load Tokenizer

In [None]:
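from transformers import AutoTokenizer

# Download (or load from cache) the WordPiece tokenizer that matches bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Playing with transformers is amazing!"
# tokenize() lowercases the text and splits it into subword tokens (no IDs, no special tokens)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)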

## Rare Word Tokenization Example

In [None]:
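# A made-up word that is not in the vocabulary is split into known subword pieces;
# pieces that continue a word are prefixed with "##"
rare_word = "transformerscape"
rare_tokens = tokenizer.tokenize(rare_word)
print("Rare word tokens:", rare_tokens)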
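
The splitting above can be sketched as a greedy longest-match-first search over a vocabulary. This is a simplified illustration of the WordPiece idea, not the library's implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the vocabulary is invented for the example, so the real tokenizer may split the same word differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary.
toy_vocab = {"transformers", "transformer", "##s", "##cape", "play", "##ing"}

print(wordpiece_tokenize("transformerscape", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because the search always takes the longest matching prefix first, `transformers` wins over `transformer` in this toy vocabulary.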

## Convert Tokens to IDs

In [None]:
# Look up each token's integer ID in the tokenizer's vocabulary
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)
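
Conceptually, the conversion is a dictionary lookup with a fallback for unknown tokens. The sketch below uses a hypothetical five-entry vocabulary (`toy_id_vocab`) and hypothetical helper names; real IDs come from the vocabulary file shipped with the pretrained tokenizer.

```python
# Hypothetical toy vocabulary; the IDs are invented for illustration.
toy_id_vocab = {"[PAD]": 0, "[UNK]": 1, "playing": 2, "with": 3, "transformers": 4}
toy_id_to_token = {idx: tok for tok, idx in toy_id_vocab.items()}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

def ids_to_tokens(ids, inverse_vocab):
    """Map IDs back to tokens (the reverse lookup)."""
    return [inverse_vocab[i] for i in ids]

ids = tokens_to_ids(["playing", "with", "dragons"], toy_id_vocab)
print(ids)
print(ids_to_tokens(ids, toy_id_to_token))
```

Note that the round trip is lossy for out-of-vocabulary tokens: `dragons` comes back as `[UNK]`.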

## Add Special Tokens ([CLS], [SEP])

In [None]:
# Wrap the IDs from the previous cell as [CLS] ... [SEP], the layout BERT expects for one sequence
tokens_with_special = tokenizer.build_inputs_with_special_tokens(token_ids)
print("Token IDs with special tokens:", tokens_with_special)
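
For BERT-style models, the step above amounts to wrapping the IDs in `[CLS] ... [SEP]`, with a second `[SEP]` closing a sentence pair. The sketch below is a minimal illustration with a hypothetical helper name; the defaults 101 and 102 are assumed to match `bert-base-uncased`, and in practice you would read them from `tokenizer.cls_token_id` and `tokenizer.sep_token_id` rather than hard-coding them.

```python
def add_special_tokens(ids_a, ids_b=None, cls_id=101, sep_id=102):
    """Build [CLS] A [SEP] or [CLS] A [SEP] B [SEP] from plain ID lists."""
    result = [cls_id] + list(ids_a) + [sep_id]
    if ids_b is not None:
        result += list(ids_b) + [sep_id]
    return result

# The content IDs 7, 8, 9 are placeholders, not real vocabulary entries.
print(add_special_tokens([7, 8]))
print(add_special_tokens([7, 8], [9]))
```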

## Full Encoding with Attention Mask

In [None]:
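# Calling the tokenizer directly tokenizes, adds special tokens, and maps to IDs in one step.
# return_tensors="pt" returns PyTorch tensors, so torch must be installed.
inputs = tokenizer(text, return_tensors="pt")
print("Input IDs:", inputs['input_ids'])
# The attention mask is 1 for real tokens and 0 for padding (all 1s here, since nothing is padded)
print("Attention Mask:", inputs['attention_mask'])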

## Compare Tokenizers: BERT vs RoBERTa

In [None]:
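# BERT uses WordPiece; RoBERTa uses byte-level BPE
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

sentence = "Let's explore Hugging Face!"

# BERT lowercases and marks word continuations with "##";
# RoBERTa keeps case and marks a preceding space with "Ġ"
print("BERT:", bert_tokenizer.tokenize(sentence))
print("RoBERTa:", roberta_tokenizer.tokenize(sentence))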
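
One visible difference in the output above is how each tokenizer marks word boundaries. The sketch below renders the same subword split in both conventions; the split itself (`hug` + `ging`) and the helper names are hypothetical, chosen only to show the markers, and the real tokenizers learn different vocabularies and may split the words differently.

```python
def bert_style(words):
    """WordPiece convention: pieces that continue a word get a '##' prefix."""
    out = []
    for pieces in words:
        out.append(pieces[0])
        out.extend("##" + p for p in pieces[1:])
    return out

def roberta_style(words):
    """Byte-level BPE convention: a piece preceded by a space gets a 'Ġ' prefix."""
    out = []
    for i, pieces in enumerate(words):
        first = pieces[0] if i == 0 else "Ġ" + pieces[0]
        out.append(first)
        out.extend(pieces[1:])
    return out

# Each inner list is one word, already split into hypothetical subword pieces.
words = [["hug", "ging"], ["face"]]
print(bert_style(words))
print(roberta_style(words))
```

In other words, BERT marks where a word continues, while RoBERTa marks where a new word starts.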

## Handle Emojis, Accents, Mixed Languages

In [None]:
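emoji_text = "I love AI 🤖!"
accent_text = "Café au lait"
multi_lang = "مرحبا Hello こんにちは"

# Characters missing from the vocabulary (often emojis) typically become [UNK]
print("Emoji:", tokenizer.tokenize(emoji_text))
# bert-base-uncased lowercases and strips accents, so "Café" is tokenized as "cafe"
print("Accents:", tokenizer.tokenize(accent_text))
# Scripts with little vocabulary coverage may split into single characters or [UNK]
print("Mixed:", tokenizer.tokenize(multi_lang))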
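
The accent handling seen above can be approximated with the standard library: lowercase the text, decompose it with Unicode NFD normalization, and drop the combining marks. This is a sketch of the normalization that uncased BERT tokenizers apply before subword splitting, with a hypothetical helper name, not the library's own code.

```python
import unicodedata

def lowercase_and_strip_accents(text):
    """Lowercase, then remove combining accent marks (Unicode category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(lowercase_and_strip_accents("Café au lait"))
print(lowercase_and_strip_accents("Naïve"))
```

Cased checkpoints skip this step, which is one reason the same sentence can tokenize differently across models.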

## Padding & Truncation Example

In [None]:
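sentences = ["This is short.", "This is a much longer sentence that may need truncation."]
# padding=True pads to the longest sequence in the batch.
# truncation=True alone only cuts at the model's maximum length, so max_length=8
# is set here to make the truncation visible on the second sentence.
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=8, return_tensors="pt")
print("Padded input IDs:", encoded['input_ids'])
# 0s in the mask mark the padding positions the model should ignore
print("Attention mask:", encoded['attention_mask'])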
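
The batch logic can be sketched in plain Python: cut each sequence to a maximum length, pad the shorter ones, and build a mask that is 1 for real tokens and 0 for padding. This is a simplified illustration with a hypothetical helper name. It assumes a padding ID of 0 (as in `bert-base-uncased`) and slices naively, whereas the real tokenizer keeps the closing `[SEP]` when it truncates.

```python
def pad_and_truncate(batch, max_length=None, pad_id=0):
    """Return (input_ids, attention_mask) as equal-length lists of lists."""
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]  # naive right-side truncation
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Placeholder IDs: 101 and 102 stand in for [CLS] and [SEP]; 7, 8, 9 are arbitrary.
batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
padded_ids, mask = pad_and_truncate(batch)
print(padded_ids)
print(mask)
```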

## Summary
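- Tokenizers split text into subword pieces, so rare words are built from known pieces instead of being dropped.
- Each token maps to an integer ID in the tokenizer's vocabulary; models consume these IDs.
- Special tokens ([CLS], [SEP]) mark sequence boundaries, padding equalizes lengths within a batch, and the attention mask tells the model which positions are padding.
- Tokenizers differ between models (e.g., BERT's WordPiece vs RoBERTa's byte-level BPE), so use the tokenizer that matches the model.
- Transformer models process sequences of token IDs, not raw words.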