# 🔤 Tokenization in Hugging Face

---

## 📚 What You'll Learn

Tokenization is the **foundational step** in Natural Language Processing (NLP) that converts human-readable text into a format that machine learning models can understand. In this notebook, you'll learn:

1. **What is Tokenization?** - Breaking text into smaller, meaningful units
2. **Types of Tokenization** - Word-level, Subword-level, and Character-level
3. **Hugging Face Tokenizers** - Using AutoTokenizer for various models
4. **Understanding Token Components** - input_ids, attention_mask, token_type_ids
5. **Practical Dataset Tokenization** - Preparing data for model training

---

## 🎯 Why Tokenization Matters

| Purpose | Description |
|---------|-------------|
| **Text Preprocessing** | Simplifies handling of punctuation, casing, and special characters |
| **Numerical Representation** | Converts text to token IDs for model consumption |
| **Memory Efficiency** | Subword tokenization reduces vocabulary size while maintaining coverage |
| **Foundation for NLP Tasks** | Essential for NER, translation, summarization, and more |
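
To make the "Numerical Representation" row concrete, here is a minimal sketch of a token-to-ID lookup. The vocabulary `toy_vocab` and the helper `encode` are hypothetical names invented for this illustration; real tokenizers ship vocabularies with tens of thousands of entries and use more careful splitting rules.

```python
# Hypothetical toy vocabulary; real models ship far larger ones.
toy_vocab = {"[UNK]": 0, "deep": 1, "learning": 2, "is": 3, "fun": 4}

def encode(text, vocab):
    """Lowercase, split on whitespace, and map each token to its ID."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in text.lower().split()]

print(encode("Deep learning is fun", toy_vocab))   # [1, 2, 3, 4]
print(encode("Deep learning is hard", toy_vocab))  # [1, 2, 3, 0]
```

The unseen word "hard" falls back to the `[UNK]` ID, which is the out-of-vocabulary problem discussed in Part 1.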

## 🛠️ Setup and Installation

In [1]:
# Install required libraries (uncomment if needed)
# !pip install transformers datasets torch -q

In [2]:
# Import essential libraries
from transformers import AutoTokenizer
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


---

## 📖 Part 1: Types of Tokenization Methods

There are three main approaches to breaking text into tokens, each with its own strengths and use cases.

### 1️⃣ Word-Level Tokenization

The simplest approach - splits text by whitespace and punctuation into complete words.

**Pros:**
- Simple and intuitive
- Each token represents a complete word

**Cons:**
- Struggles with out-of-vocabulary (OOV) words
- Requires large vocabulary for diverse languages
- Doesn't capture word morphology (prefixes, suffixes)

**Models using this:** Word2Vec, GloVe

In [3]:
# ✨ Example: Simple Word-Level Tokenization

def simple_word_tokenize(text):
    """Basic word-level tokenization using split."""
    return text.split()

# Let's try it with a tech-related sentence
sample_text = "Artificial Intelligence is transforming healthcare and education"
word_tokens = simple_word_tokenize(sample_text)

print(f"üìù Original Text: '{sample_text}'")
print(f"üî¢ Number of Tokens: {len(word_tokens)}")
print(f"üìã Tokens: {word_tokens}")

üìù Original Text: 'Artificial Intelligence is transforming healthcare and education'
üî¢ Number of Tokens: 7
üìã Tokens: ['Artificial', 'Intelligence', 'is', 'transforming', 'healthcare', 'and', 'education']
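
Note that `str.split()` separates on whitespace only, so punctuation stays attached to the neighbouring word. Below is a minimal sketch of a tokenizer that also splits off punctuation, assuming a simple regular expression is adequate for the text at hand. The helper name `regex_word_tokenize` is made up for this example.

```python
import re

def regex_word_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text_with_punct = "AI is transforming healthcare, education, and more!"
print(text_with_punct.split())
# ['AI', 'is', 'transforming', 'healthcare,', 'education,', 'and', 'more!']
print(regex_word_tokenize(text_with_punct))
# ['AI', 'is', 'transforming', 'healthcare', ',', 'education', ',', 'and', 'more', '!']
```

With plain `split()`, "healthcare," and "healthcare" would count as two different vocabulary entries; separating the comma avoids that.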


In [4]:
# 🚫 Problem with Word-Level: Out-of-Vocabulary Words

# Imagine a fixed vocabulary
vocabulary = {"artificial", "intelligence", "is", "good", "for", "society"}

text = "Artificial Intelligence is revolutionizing biotechnology"
tokens = simple_word_tokenize(text.lower())

print("üîç Checking tokens against vocabulary:")
for token in tokens:
    status = "‚úÖ Known" if token in vocabulary else "‚ùå OOV (Unknown)"
    print(f"   '{token}': {status}")

print("\nüí° Words like 'revolutionizing' and 'biotechnology' would be unknown!")

üîç Checking tokens against vocabulary:
   'artificial': ‚úÖ Known
   'intelligence': ‚úÖ Known
   'is': ‚úÖ Known
   'revolutionizing': ‚ùå OOV (Unknown)
   'biotechnology': ‚ùå OOV (Unknown)

üí° Words like 'revolutionizing' and 'biotechnology' would be unknown!


### 2️⃣ Subword-Level Tokenization

The **de facto standard** for modern NLP: words are broken into smaller, meaningful units.

**Pros:**
- ✅ Handles OOV words by breaking them into known subwords
- ✅ Captures morphology (prefixes, suffixes, roots)
- ✅ Compact vocabulary with wide coverage
- ✅ Works well across languages

**Algorithms:**
- **BPE (Byte-Pair Encoding)**: Used by GPT-2, GPT-3, RoBERTa
- **WordPiece**: Used by BERT, DistilBERT
- **SentencePiece** (a library that applies Unigram or BPE directly to raw text): Used by T5, ALBERT, XLNet

**Models using this:** BERT, GPT, T5, RoBERTa, and most modern transformers
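
To give a feel for how BPE builds its vocabulary, here is a minimal sketch of the merge loop: count adjacent symbol pairs, merge the most frequent pair, repeat. It is a simplified illustration under stated assumptions, not the implementation used by any particular library. The corpus, its frequencies, and the helper names `get_pair_counts` and `merge_pair` are invented, and details such as end-of-word markers and byte-level handling are left out.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters with a made-up frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair_counts = get_pair_counts(corpus)
    best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    merges.append(best)

print("Learned merges:", merges)
print("Segmented words:", list(corpus))
```

After a few merges, the frequent suffix "est" becomes a single symbol shared by "newest" and "widest", which is how subword vocabularies end up capturing common word parts.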

In [5]:
# ✨ Example: Subword Tokenization with BERT

# Load BERT's WordPiece tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Let's see how it handles complex words
complex_words = [
    "cryptocurrency",
    "unbelievable",
    "internationalization",
    "neuroscientist"
]

print("üî¨ Subword Tokenization Examples:")
print("=" * 50)

for word in complex_words:
    tokens = bert_tokenizer.tokenize(word)
    print(f"\nüìù Word: '{word}'")
    print(f"   Subwords: {tokens}")
    print(f"   Count: {len(tokens)} tokens")

🔬 Subword Tokenization Examples:

📝 Word: 'cryptocurrency'
   Subwords: ['crypt', '##oc', '##ur', '##ren', '##cy']
   Count: 5 tokens

📝 Word: 'unbelievable'
   Subwords: ['unbelievable']
   Count: 1 tokens

📝 Word: 'internationalization'
   Subwords: ['international', '##ization']
   Count: 2 tokens

📝 Word: 'neuroscientist'
   Subwords: ['ne', '##uro', '##sc', '##ient', '##ist']
   Count: 5 tokens


In [6]:
# 💡 Understanding the '##' Symbol in WordPiece

text = "The microprocessor revolutionized computing"
tokens = bert_tokenizer.tokenize(text)

print(f"üìù Original: '{text}'")
print(f"\nüîç Tokens: {tokens}")

print("\nüìö Explanation:")
print("   ‚Ä¢ '##' prefix indicates the token is a CONTINUATION of the previous word")
print("   ‚Ä¢ Tokens without '##' are either standalone words or word beginnings")
print("   ‚Ä¢ This helps the model understand word boundaries")

üìù Original: 'The microprocessor revolutionized computing'

üîç Tokens: ['the', 'micro', '##pro', '##ces', '##sor', 'revolution', '##ized', 'computing']

üìö Explanation:
   ‚Ä¢ '##' prefix indicates the token is a CONTINUATION of the previous word
   ‚Ä¢ Tokens without '##' are either standalone words or word beginnings
   ‚Ä¢ This helps the model understand word boundaries
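
The continuation rule can be checked by going in the opposite direction. The sketch below glues `##` pieces back onto the preceding token to recover whole words. The helper name `merge_wordpieces` is hypothetical and the token list is copied from the output above.

```python
def merge_wordpieces(tokens):
    """Rebuild whole words from WordPiece tokens by joining '##' continuations."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation onto the previous word
        else:
            words.append(token)     # standalone word or word beginning
    return words

wordpieces = ['the', 'micro', '##pro', '##ces', '##sor', 'revolution', '##ized', 'computing']
print(merge_wordpieces(wordpieces))
# ['the', 'microprocessor', 'revolutionized', 'computing']
```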


### 3️⃣ Character-Level Tokenization

Breaks text into individual characters. Especially useful for:
- Languages without clear word boundaries (Chinese, Japanese, Korean)
- Handling typos and misspellings
- Very fine-grained text analysis

**Pros:**
- No OOV problem (limited character set)
- Works well for logographic scripts, where a single character often carries meaning

**Cons:**
- Very long sequences
- Harder for models to learn word meanings
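
To put a number on the "very long sequences" drawback, the short sketch below compares word-level and character-level token counts for the same sentence, using plain Python only.

```python
sentence = "Artificial Intelligence is transforming healthcare and education"

n_word_tokens = len(sentence.split())  # whitespace-separated words
n_char_tokens = len(list(sentence))    # every character, spaces included

print(n_word_tokens)  # 7
print(n_char_tokens)  # 64
print(f"Character-level is about {n_char_tokens / n_word_tokens:.1f}x longer")
```

Since the cost of standard self-attention grows quadratically with sequence length, a roughly ninefold longer input is considerably more expensive to process.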

In [7]:
# ✨ Example: Character-Level Tokenization

def char_tokenize(text):
    """Character-level tokenization."""
    return list(text)

# English example
english_text = "Hello AI"
english_chars = char_tokenize(english_text)

print("üî§ Character-Level Tokenization:")
print(f"\nüìù English: '{english_text}'")
print(f"   Characters: {english_chars}")
print(f"   Count: {len(english_chars)} characters")

# Chinese example (if your system supports it)
chinese_text = "‰∫∫Â∑•Êô∫ËÉΩ"  # "Artificial Intelligence" in Chinese
chinese_chars = char_tokenize(chinese_text)

print(f"\nüìù Chinese: '{chinese_text}'")
print(f"   Characters: {chinese_chars}")
print(f"   Count: {len(chinese_chars)} characters")
print("\nüí° Each Chinese character is a meaningful unit!")

üî§ Character-Level Tokenization:

üìù English: 'Hello AI'
   Characters: ['H', 'e', 'l', 'l', 'o', ' ', 'A', 'I']
   Count: 8 characters

üìù Chinese: '‰∫∫Â∑•Êô∫ËÉΩ'
   Characters: ['‰∫∫', 'Â∑•', 'Êô∫', 'ËÉΩ']
   Count: 4 characters

üí° Each Chinese character is a meaningful unit!


### 📊 Comparison Summary

| Method | Vocabulary Size | OOV Handling | Sequence Length | Best For |
|--------|----------------|--------------|-----------------|----------|
| Word-Level | Very Large | Poor | Short | Simple tasks, older models |
| **Subword** | Medium | **Excellent** | Medium | **Modern Transformers** |
| Character | Small | Not needed (no OOV) | Very Long | CJK languages, typo handling |

---

## 📖 Part 2: Hugging Face AutoTokenizer

The `AutoTokenizer` class automatically loads the correct tokenizer for any pre-trained model. This is the **recommended way** to work with tokenizers in Hugging Face.

In [11]:
# 🔄 Loading Different Tokenizers

# BERT tokenizer (WordPiece)
bert_tok = AutoTokenizer.from_pretrained('bert-base-uncased')

# GPT-2 tokenizer (BPE)
gpt2_tok = AutoTokenizer.from_pretrained('gpt2')

print("✅ Loaded tokenizers:")
print(f"   • BERT: {type(bert_tok).__name__}")
print(f"   • GPT-2: {type(gpt2_tok).__name__}")

✅ Loaded tokenizers:
   • BERT: BertTokenizerFast
   • GPT-2: GPT2TokenizerFast


In [13]:
# 🔍 Comparing How Different Tokenizers Handle the Same Text

sample = "Machine learning enables computers to learn from data"

print(f"üìù Input: '{sample}'\n")
print("üî¨ Tokenization Comparison:")
print("=" * 60)

# BERT
bert_tokens = bert_tok.tokenize(sample)
print(f"\nü§ñ BERT (WordPiece):")
print(f"   {bert_tokens}")
print(f"   Token count: {len(bert_tokens)}")

# GPT-2
gpt2_tokens = gpt2_tok.tokenize(sample)
print(f"\nü§ñ GPT-2 (BPE):")
print(f"   {gpt2_tokens}")
print(f"   Token count: {len(gpt2_tokens)}")

üìù Input: 'Machine learning enables computers to learn from data'

üî¨ Tokenization Comparison:

ü§ñ BERT (WordPiece):
   ['machine', 'learning', 'enables', 'computers', 'to', 'learn', 'from', 'data']
   Token count: 8

ü§ñ GPT-2 (BPE):
   ['Machine', 'ƒ†learning', 'ƒ†enables', 'ƒ†computers', 'ƒ†to', 'ƒ†learn', 'ƒ†from', 'ƒ†data']
   Token count: 8
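
Two differences stand out in this comparison: the `bert-base-uncased` tokenizer lowercases its input while GPT-2 preserves case, and GPT-2 tokens carry a `Ġ` prefix. In GPT-2's byte-level BPE, `Ġ` stands for a space that preceded the token. The sketch below reverses that convention for plain ASCII text only. The helper name `detokenize_gpt2_style` is invented for illustration, and non-ASCII input would need the full byte-to-character mapping, which is not shown here.

```python
def detokenize_gpt2_style(tokens):
    """Join tokens and turn each 'Ġ' marker back into a space (ASCII text only)."""
    return "".join(tokens).replace("Ġ", " ")

gpt2_style_tokens = ['Machine', 'Ġlearning', 'Ġenables', 'Ġcomputers']
print(detokenize_gpt2_style(gpt2_style_tokens))
# Machine learning enables computers
```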


---

## 📖 Part 3: Understanding Tokenizer Outputs

When you tokenize text with Hugging Face, the result contains up to three components, depending on the model:

1. **`input_ids`** - Numerical IDs representing each token
2. **`attention_mask`** - Binary mask showing real tokens (1) vs padding (0)
3. **`token_type_ids`** - Segment IDs for sentence pairs (for BERT-like models)
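
Before looking at the real tokenizer, it can help to build the three lists by hand. The following is a minimal sketch for a single sentence under simplifying assumptions: the vocabulary `toy_vocab` and the function `encode_with_padding` are hypothetical, and input is assumed to be already split into tokens.

```python
def encode_with_padding(tokens, vocab, max_length):
    """Add [CLS]/[SEP], truncate, pad, and build the three parallel lists."""
    body = tokens[: max_length - 2]               # leave room for [CLS] and [SEP]
    pieces = ["[CLS]"] + body + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in pieces]
    attention_mask = [1] * len(input_ids)         # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad                 # 0 = padding
    token_type_ids = [0] * max_length             # single segment: all zeros
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "deep": 4, "learning": 5}
toy_encoded = encode_with_padding(["deep", "learning", "rocks"], toy_vocab, max_length=8)

print(toy_encoded["input_ids"])       # [2, 4, 5, 1, 3, 0, 0, 0]
print(toy_encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(toy_encoded["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists always have the same length, so position `i` in one list describes the same token as position `i` in the others.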

In [14]:
# 📊 Full Tokenization Example

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Deep learning powers modern AI applications"

# Full tokenization with all outputs
encoded = tokenizer(
    text,
    padding='max_length',    # Pad to max_length
    max_length=15,           # Maximum sequence length
    truncation=True,         # Truncate if too long
    return_tensors=None      # Return Python lists
)

print(f"üìù Original Text: '{text}'")
print("\n" + "=" * 60)
print("\nüì¶ Tokenizer Output:")
print(f"\n1Ô∏è‚É£ input_ids (Token IDs):")
print(f"   {encoded['input_ids']}")
print(f"\n2Ô∏è‚É£ attention_mask:")
print(f"   {encoded['attention_mask']}")
print(f"\n3Ô∏è‚É£ token_type_ids:")
print(f"   {encoded['token_type_ids']}")

üìù Original Text: 'Deep learning powers modern AI applications'


üì¶ Tokenizer Output:

1Ô∏è‚É£ input_ids (Token IDs):
   [101, 2784, 4083, 4204, 2715, 9932, 5097, 102, 0, 0, 0, 0, 0, 0, 0]

2Ô∏è‚É£ attention_mask:
   [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

3Ô∏è‚É£ token_type_ids:
   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [15]:
# 🔍 Decoding: Converting Token IDs Back to Tokens

# Get the tokens from IDs
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])

print("üîÑ Token ID to Token Mapping:\n")
print(f"{'Token ID':<12} {'Token':<15} {'Attention':<12} {'Description'}")
print("-" * 60)

for token_id, token, attn in zip(encoded['input_ids'], tokens, encoded['attention_mask']):
    if token == '[CLS]':
        desc = "← Start token"
    elif token == '[SEP]':
        desc = "← Separator/End token"
    elif token == '[PAD]':
        desc = "← Padding token"
    elif token.startswith('##'):
        desc = "← Subword continuation"
    else:
        desc = ""
    
    print(f"{token_id:<12} {token:<15} {attn:<12} {desc}")

🔄 Token ID to Token Mapping:

Token ID     Token           Attention    Description
------------------------------------------------------------
101          [CLS]           1            ← Start token
2784         deep            1
4083         learning        1
4204         powers          1
2715         modern          1
9932         ai              1
5097         applications    1
102          [SEP]           1            ← Separator/End token
0            [PAD]           0            ← Padding token
0            [PAD]           0            ← Padding token
0            [PAD]           0            ← Padding token
0            [PAD]           0            ← Padding token
0            [PAD]           0            ← Padding token
0            [PAD]           0            ← Padding token
0            [PAD]           0            ← Padding token


In [16]:
# 💡 Understanding Special Tokens

print("🏷️ Special Tokens in BERT Tokenizer:\n")

special_tokens = {
    '[CLS]': 'Classification token - marks the beginning of input',
    '[SEP]': 'Separator token - marks end of sentence or separates sentence pairs',
    '[PAD]': 'Padding token - fills sequences to uniform length',
    '[UNK]': 'Unknown token - represents out-of-vocabulary tokens',
    '[MASK]': 'Mask token - used for masked language modeling'
}

for token, description in special_tokens.items():
    token_id = tokenizer.convert_tokens_to_ids(token)
    print(f"{token} (ID: {token_id})")
    print(f"   └── {description}\n")

🏷️ Special Tokens in BERT Tokenizer:

[CLS] (ID: 101)
   └── Classification token - marks the beginning of input

[SEP] (ID: 102)
   └── Separator token - marks end of sentence or separates sentence pairs

[PAD] (ID: 0)
   └── Padding token - fills sequences to uniform length

[UNK] (ID: 100)
   └── Unknown token - represents out-of-vocabulary tokens

[MASK] (ID: 103)
   └── Mask token - used for masked language modeling



### 🧠 Understanding Attention Mask

The attention mask tells the model:
- **1** = "Pay attention to this token" (real content)
- **0** = "Ignore this token" (padding)

This is crucial for:
- Batch processing with variable-length sequences
- Preventing the model from learning from padding tokens
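
How does a 0 in the mask make the model "ignore" a token? One common approach is to set the attention scores of padded positions to negative infinity before the softmax, so they receive zero weight. The sketch below shows that idea in plain Python. The function name `masked_softmax` and the scores are made up, the code assumes at least one position has mask 1, and actual model implementations differ in detail.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over scores, giving zero weight to positions where mask == 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

made_up_scores = [2.0, 1.0, 0.5, 3.0, 3.0]  # last two positions are padding
mask = [1, 1, 1, 0, 0]
weights = masked_softmax(made_up_scores, mask)

print([round(w, 3) for w in weights])
print(sum(weights))
```

Even though the padded positions had the highest raw scores, their weights are exactly zero and the remaining weights still sum to 1.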

In [17]:
# 📊 Visualizing Attention Mask

short_text = "AI is amazing"
long_text = "Artificial intelligence transforms how we interact with technology daily"

# Tokenize both with same max_length
short_encoded = tokenizer(short_text, padding='max_length', max_length=20, truncation=True)
long_encoded = tokenizer(long_text, padding='max_length', max_length=20, truncation=True)

print("üìä Attention Mask Comparison:\n")

print(f"Short text: '{short_text}'")
print(f"Tokens:     {tokenizer.convert_ids_to_tokens(short_encoded['input_ids'])}")
print(f"Attention:  {short_encoded['attention_mask']}")
print(f"Real tokens: {sum(short_encoded['attention_mask'])} | Padding: {len(short_encoded['attention_mask']) - sum(short_encoded['attention_mask'])}")

print(f"\nLong text: '{long_text}'")
print(f"Tokens:     {tokenizer.convert_ids_to_tokens(long_encoded['input_ids'])}")
print(f"Attention:  {long_encoded['attention_mask']}")
print(f"Real tokens: {sum(long_encoded['attention_mask'])} | Padding: {len(long_encoded['attention_mask']) - sum(long_encoded['attention_mask'])}")

📊 Attention Mask Comparison:

Short text: 'AI is amazing'
Tokens:     ['[CLS]', 'ai', 'is', 'amazing', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Attention:  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Real tokens: 5 | Padding: 15

Long text: 'Artificial intelligence transforms how we interact with technology daily'
Tokens:     ['[CLS]', 'artificial', 'intelligence', 'transforms', 'how', 'we', 'interact', 'with', 'technology', 'daily', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Attention:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Real tokens: 11 | Padding: 9


### 🔗 Understanding Token Type IDs

Token type IDs are used when the input consists of **multiple segments** (e.g., question-answering tasks):
- **0** = First segment (e.g., question)
- **1** = Second segment (e.g., context/answer)

For single-sentence inputs, all token type IDs are 0.
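
The segment layout for a pair can be sketched by hand: BERT-style inputs follow the pattern `[CLS] A [SEP] B [SEP]`, with the first `[SEP]` counted as part of segment 0 and the final one as part of segment 1. The helper `build_pair` and the token lists below are hypothetical, chosen only to illustrate that layout.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

pair_tokens, pair_types = build_pair(["what", "is", "ai", "?"], ["a", "field", "of", "study"])
for tok, seg in zip(pair_tokens, pair_types):
    print(f"{tok:<8} {seg}")
```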

In [18]:
# 🔗 Token Type IDs with Sentence Pairs

question = "What is machine learning?"
answer = "Machine learning is a subset of AI that enables computers to learn from data."

# Tokenize as a sentence pair
pair_encoded = tokenizer(
    question, 
    answer, 
    padding='max_length', 
    max_length=35,
    truncation=True
)

tokens = tokenizer.convert_ids_to_tokens(pair_encoded['input_ids'])

print("üîó Sentence Pair Tokenization:\n")
print(f"Question: '{question}'")
print(f"Answer: '{answer}'\n")

print(f"{'Token':<20} {'Type ID':<10} {'Segment'}")
print("-" * 45)

for token, type_id in zip(tokens, pair_encoded['token_type_ids']):
    if token == '[PAD]':
        continue
    segment = "Question" if type_id == 0 else "Answer"
    if token in ['[CLS]', '[SEP]']:
        segment = f"Special ({segment})"
    print(f"{token:<20} {type_id:<10} {segment}")

🔗 Sentence Pair Tokenization:

Question: 'What is machine learning?'
Answer: 'Machine learning is a subset of AI that enables computers to learn from data.'

Token                Type ID    Segment
---------------------------------------------
[CLS]                0          Special (Question)
what                 0          Question
is                   0          Question
machine              0          Question
learning             0          Question
?                    0          Question
[SEP]                0          Special (Question)
machine              1          Answer
learning             1          Answer
is                   1          Answer
a                    1          Answer
subset               1          Answer
of                   1          Answer
ai                   1          Answer
that                 1          Answer
enables              1          Answer
computers            1          Answer
to                   1          Answer
learn            

---

## 📖 Part 4: Tokenizing Datasets for Training

Let's apply everything we've learned to tokenize a real dataset, preparing it for model training!

In [19]:
# 📥 Load a Sample Dataset

# Using the 'emotion' dataset - classifying text into emotions
dataset = load_dataset('dair-ai/emotion', split='train[:1000]')  # First 1000 samples

print("üìä Dataset Info:")
print(f"   ‚Ä¢ Number of samples: {len(dataset)}")
print(f"   ‚Ä¢ Features: {dataset.features}")
print(f"\nüìù Sample entries:")
for i in range(3):
    print(f"   {i+1}. '{dataset[i]['text'][:60]}...' ‚Üí Label: {dataset[i]['label']}")

📊 Dataset Info:
   • Number of samples: 1000
   • Features: {'text': Value('string'), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])}

📝 Sample entries:
   1. 'i didnt feel humiliated...' → Label: 0
   2. 'i can go from feeling so hopeless to so damned hopeful just ...' → Label: 0
   3. 'im grabbing a minute to post i feel greedy wrong...' → Label: 3


In [20]:
# 🔧 Define Tokenization Function

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_function(examples):
    """Tokenize a batch of examples."""
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=128
    )

print("✅ Tokenization function defined!")
print("\n📋 Parameters:")
print("   • truncation=True: Cut sequences longer than max_length")
print("   • padding='max_length': Pad shorter sequences")
print("   • max_length=128: Maximum sequence length")

✅ Tokenization function defined!

📋 Parameters:
   • truncation=True: Cut sequences longer than max_length
   • padding='max_length': Pad shorter sequences
   • max_length=128: Maximum sequence length


In [21]:
# ⚡ Apply Tokenization to Dataset

# Using the map() function for efficient batch processing
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,           # Process in batches for efficiency
    remove_columns=['text'] # Remove original text column
)

print("✅ Dataset tokenized successfully!")
print(f"\n📊 Tokenized Dataset Features:")
print(f"   {tokenized_dataset.features}")
print(f"\n📋 New columns added:")
print("   • input_ids: Token IDs")
print("   • attention_mask: Real token indicators")

✅ Dataset tokenized successfully!

📊 Tokenized Dataset Features:
   {'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']), 'input_ids': List(Value('int32')), 'attention_mask': List(Value('int8'))}

📋 New columns added:
   • input_ids: Token IDs
   • attention_mask: Real token indicators


In [22]:
# 🔍 Examine a Tokenized Sample

sample_idx = 0
sample = tokenized_dataset[sample_idx]

print(f"üìã Tokenized Sample #{sample_idx}:\n")
print(f"Label: {sample['label']}")
print(f"\nInput IDs (first 30): {sample['input_ids'][:30]}")
print(f"Attention Mask (first 30): {sample['attention_mask'][:30]}")

# Decode back to text
decoded_text = tokenizer.decode(sample['input_ids'], skip_special_tokens=True)
print(f"\nüîÑ Decoded text: '{decoded_text}'")

📋 Tokenized Sample #0:

Label: 0

Input IDs (first 30): [101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Attention Mask (first 30): [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

🔄 Decoded text: 'i didnt feel humiliated'


In [23]:
# 📊 Analyze Token Distribution

# Count actual tokens (excluding padding)
token_counts = [sum(sample['attention_mask']) for sample in tokenized_dataset]

avg_tokens = sum(token_counts) / len(token_counts)
max_tokens = max(token_counts)
min_tokens = min(token_counts)

print("üìä Token Distribution Statistics:\n")
print(f"   ‚Ä¢ Average tokens per sample: {avg_tokens:.1f}")
print(f"   ‚Ä¢ Maximum tokens: {max_tokens}")
print(f"   ‚Ä¢ Minimum tokens: {min_tokens}")
print(f"   ‚Ä¢ Max length setting: 128")
print(f"\nüí° Insight: {'Most samples use full length!' if avg_tokens > 100 else 'Samples are shorter than max_length - consider reducing max_length for efficiency.'}") 

üìä Token Distribution Statistics:

   ‚Ä¢ Average tokens per sample: 22.9
   ‚Ä¢ Maximum tokens: 70
   ‚Ä¢ Minimum tokens: 5
   ‚Ä¢ Max length setting: 128

üí° Insight: Samples are shorter than max_length - consider reducing max_length for efficiency.
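
One way to act on this insight is to pick `max_length` from a percentile of the observed token counts rather than a fixed guess. The sketch below uses the nearest-rank method on made-up counts. The helper `percentile_length` is a hypothetical name, and the right percentile depends on how much truncation your task can tolerate.

```python
def percentile_length(token_counts, pct=95):
    """Smallest length covering at least `pct` percent of samples (nearest-rank)."""
    ordered = sorted(token_counts)
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

made_up_counts = [5, 8, 12, 15, 18, 20, 22, 25, 30, 70]
print(percentile_length(made_up_counts, pct=90))   # 30
print(percentile_length(made_up_counts, pct=100))  # 70
```

Here a `max_length` of 30 would cover nine of the ten samples without truncation, while the single outlier of 70 tokens would be cut.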


---

## 📖 Part 5: Advanced Tokenization Techniques

In [24]:
# 🎯 Dynamic Padding (More Efficient)

# Instead of padding to max_length, pad to longest in batch
texts = [
    "Short text",
    "This is a medium length sentence",
    "This is a much longer sentence that contains more words and information"
]

# With dynamic padding
dynamic_encoded = tokenizer(
    texts,
    padding=True,  # Pad to longest in batch
    truncation=True,
    return_tensors='pt'  # Return PyTorch tensors
)

print("üéØ Dynamic Padding Results:\n")
print(f"Shape of input_ids: {dynamic_encoded['input_ids'].shape}")
print(f"(batch_size=3, sequence_length={dynamic_encoded['input_ids'].shape[1]})")
print("\nüí° Sequence length is determined by the longest text in the batch!")

üéØ Dynamic Padding Results:

Shape of input_ids: torch.Size([3, 14])
(batch_size=3, sequence_length=14)

üí° Sequence length is determined by the longest text in the batch!
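
What dynamic padding does can be sketched in a few lines of plain Python: find the longest sequence in the batch and pad the others up to it. The helper `pad_batch` and the token IDs below are invented for illustration. When training with Transformers, `DataCollatorWithPadding` is the usual tool for applying this per batch.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Made-up token IDs of different lengths.
toy_batch = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 7, 8, 9, 10, 11, 102]]
padded_ids, padded_mask = pad_batch(toy_batch)

dynamic_total = len(toy_batch) * len(padded_ids[0])
fixed_total = len(toy_batch) * 128
print(dynamic_total, "vs", fixed_total)  # 21 vs 384
```

For this toy batch, padding to the longest sequence stores 21 positions instead of the 384 needed when every sample is padded to 128.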


In [27]:
# 🔀 Handling Multiple Text Pairs

# Useful for NLI, QA, and sentence similarity tasks
premises = [
    "The weather is sunny today",
    "Python is a programming language"
]

hypotheses = [
    "It's a beautiful day outside",
    "Python is used for web development"
]

# Use BERT tokenizer for this example (it returns token_type_ids)
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

pair_encoded = bert_tokenizer(
    premises,
    hypotheses,
    padding=True,
    truncation=True,
    return_tensors='pt'
)

print("üîÄ Batch Sentence Pair Encoding:\n")
print(f"Input IDs shape: {pair_encoded['input_ids'].shape}")
print(f"   → (batch_size={pair_encoded['input_ids'].shape[0]}, sequence_length={pair_encoded['input_ids'].shape[1]})")

# Show first pair details
print(f"\nüìù First Pair:")
print(f"   Premise: '{premises[0]}'")
print(f"   Hypothesis: '{hypotheses[0]}'")
print(f"\n   Tokens: {bert_tokenizer.convert_ids_to_tokens(pair_encoded['input_ids'][0].tolist())}")

# Show token_type_ids to distinguish premise from hypothesis
print(f"\n   Token Type IDs: {pair_encoded['token_type_ids'][0].tolist()}")
print("   → 0 = Premise tokens, 1 = Hypothesis tokens")

# Show second pair for comparison
print(f"\nüìù Second Pair:")
print(f"   Premise: '{premises[1]}'")
print(f"   Hypothesis: '{hypotheses[1]}'")
print(f"\n   Tokens: {bert_tokenizer.convert_ids_to_tokens(pair_encoded['input_ids'][1].tolist())}")

🔀 Batch Sentence Pair Encoding:

Input IDs shape: torch.Size([2, 15])
   → (batch_size=2, sequence_length=15)

📝 First Pair:
   Premise: 'The weather is sunny today'
   Hypothesis: 'It's a beautiful day outside'

   Tokens: ['[CLS]', 'the', 'weather', 'is', 'sunny', 'today', '[SEP]', 'it', "'", 's', 'a', 'beautiful', 'day', 'outside', '[SEP]']

   Token Type IDs: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
   → 0 = Premise tokens, 1 = Hypothesis tokens

📝 Second Pair:
   Premise: 'Python is a programming language'
   Hypothesis: 'Python is used for web development'

   Tokens: ['[CLS]', 'python', 'is', 'a', 'programming', 'language', '[SEP]', 'python', 'is', 'used', 'for', 'web', 'development', '[SEP]', '[PAD]']


---

## 🎓 Key Takeaways

### 1. Tokenization Types
- **Word-level**: Simple, but needs a large vocabulary and handles OOV words poorly
- **Subword-level** (BPE, WordPiece): Compact vocabulary with good OOV coverage; used by most modern transformers
- **Character-level**: Good for CJK languages and typo handling

### 2. Tokenizer Outputs
- **`input_ids`**: Numerical token representations
- **`attention_mask`**: 1 for real tokens, 0 for padding
- **`token_type_ids`**: Segment identifiers for sentence pairs

### 3. Special Tokens
- **[CLS]**: Start of sequence (classification)
- **[SEP]**: Separator/End of sequence
- **[PAD]**: Padding for batch processing
- **[MASK]**: For masked language modeling

### 4. Best Practices
- Use `AutoTokenizer` for model-compatible tokenization
- Apply `truncation=True` and set an appropriate `max_length`
- Use `batched=True` with `dataset.map()` for efficiency
- Consider dynamic padding for variable-length sequences

---

## 📚 Further Reading

- [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/tokenizers/)
- [Understanding WordPiece](https://huggingface.co/learn/nlp-course/chapter6/6)
- [BPE Algorithm Explained](https://huggingface.co/learn/nlp-course/chapter6/5)

---