
## Tokenization in Transformers

**Name:** Hemanth Kandimalla  
**Date:** 3-1-2024

---




The Hugging Face library offers a wide range of transformer models for natural language processing tasks. If you have mastered the basics and now want to dive into more advanced topics, here are some suggestions:

- Fine-Tuning Models: Learn how to fine-tune pre-trained models on your specific task. This will allow you to leverage the power of transformers for your specific needs.

- Multilingual Models: Hugging Face offers several multilingual models. You can learn how to train a single model on multiple languages.

- Custom Models: You can learn how to create and train your own transformer models from scratch.

- Pipeline Creation: Hugging Face provides a pipeline feature to easily process data and make predictions. You can learn how to create custom pipelines.

- Training on Large Datasets: Handling and training models on large datasets is a common challenge in NLP. You can learn different strategies to handle this, such as gradient accumulation.

- Distributed Training: You can learn how to use multiple GPUs to train your models faster.

- Optimization Techniques: You can dive into various optimization techniques, like learning rate scheduling, weight decay, etc.

- Advanced Tokenization Techniques: There are different tokenization techniques available, such as Byte Pair Encoding (BPE), SentencePiece, etc. Each has its own advantages and use-cases.

- Interpretability of Transformer Models: Understanding why a model made a particular prediction is an active field of research. You can dive into this topic to understand your models better.

- Use of Callbacks: Callbacks can be used to customize the training process, like saving the model at different stages, changing the learning rate, etc.

You can find many tutorials and resources on these topics in the Hugging Face model hub and their documentation. Additionally, there are many courses and books available online that delve into these topics.

# Tokenization



- Text preprocessing - Common preprocessing steps such as lowercasing, accent removal, and punctuation handling, all of which affect tokenization.

- Tokenization algorithms - Popular algorithms used by models, such as WordPiece and byte-pair encoding (BPE), and their pros and cons.

- Subword tokenization - How models split text into subword units to keep the vocabulary at a manageable size. Concepts like unified/non-unified tokenization.

- Language-specific tokenizers - Differences in tokenizing patterns across languages like character-based for Chinese/Japanese.

- Custom tokenizers - Building custom tokenizers for domain-specific texts using regex, lists etc.

- Added token types - Special tokens used by models, such as [CLS], [SEP], and [PAD], and their roles.

- Contextual tokenizers - How contextual word embeddings capture subword/character level variations.

- Benefits of subword tokens - How they handle out-of-vocabulary words, compound words effectively.

- Embeddings for tokens - How token embeddings are learned during pretraining and their semantic relationships.

- Tokenization for different tasks - Differences in tokenization for tasks like NER, QA, Summarization etc. based on needs.

- Tokenization in pipelines - Integrating tokenization as first step in NLP pipelines for tasks.

- Tokenization hyperparameters - Tuning settings such as vocabulary size, minimum token frequency, and maximum sequence length for specific problems.

- Analyzing tokenized texts - Visualization, analysis of effect of tokenization parameters qualitatively and quantitatively.
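
To make the text preprocessing item above concrete, here is a minimal, library-free sketch of lowercasing and accent removal, the kind of normalization an uncased model applies before tokenizing. The function name `normalize_uncased` is a hypothetical helper chosen for this illustration, not part of any library.

```python
import unicodedata

def normalize_uncased(text: str) -> str:
    """Lowercase `text` and strip accents (hypothetical helper for illustration)."""
    lowered = text.lower()
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" (nonspacing mark) covers the combining accents; drop them.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(normalize_uncased("Héllo, Wörld!"))  # hello, world!
```

Because normalization happens before the vocabulary lookup, "Héllo" and "hello" end up as the same tokens under this scheme, which is convenient for search-like tasks but discards information that a cased model would keep.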



In [None]:
!pip install -q -U transformers

In [None]:
from transformers import BertTokenizerFast
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_len):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        text = self.texts[index]
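        # Calling the tokenizer directly is the recommended replacement for encode_plus.
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,  # without this, texts longer than max_len keep their full length and cannot be batched
            return_token_type_ids=False,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt',
        )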
        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten()
        }

# Initialize the tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Assuming texts is your large dataset
texts = ["Here's a sample text"] * 180000 # replace with your own dataset

# Create dataset and dataloader
dataset = TextDataset(texts, tokenizer, max_len=512)
dataloader = DataLoader(dataset, batch_size=16)

# Then you can use this dataloader in your training loop with a model.

# Load Tokenizer

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a Single Sentence

In [None]:
input_ids = tokenizer.encode("Hello, I am a single sentence!", add_special_tokens=True)
print(input_ids)

# Tokenize a Batch of Sentences

In [None]:
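batch_sentences = ["Hello I am a batch sentence!", "This is another one."]
# The call returns a dict-like encoding (input_ids, token_type_ids, attention_mask), not just the ids.
batch_encoding = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch_encoding)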

# Tokenize Large Text into Chunks

In [None]:
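def chunk_text(text, chunk_size):
    # Split the text into consecutive slices of chunk_size characters (characters, not tokens).
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

text = "your large text here..."*10000
chunks = chunk_text(text, 1800)
# truncation=True keeps every chunk within the model's 512-token limit even if 1800 characters tokenize to more.
encoded_chunks = [tokenizer.encode(chunk, add_special_tokens=True, truncation=True, max_length=512) for chunk in chunks]
print(encoded_chunks)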
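
Slicing by character count can cut a word in half at a chunk boundary, which changes how the pieces on either side are tokenized. As one possible alternative, the sketch below chunks on whitespace instead. The name `chunk_words` and the `max_words` parameter are assumptions made for this example.

```python
def chunk_words(text: str, max_words: int) -> list[str]:
    """Split `text` into chunks of at most `max_words` whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    # Rejoin each slice of words so every chunk is again a plain string.
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

print(chunk_words("the quick brown fox jumps", 2))
# ['the quick', 'brown fox', 'jumps']
```

A word count is still only a proxy for the token count, so truncation (or a smaller `max_words`) is still needed when a hard token limit applies.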

# Tokenize Large Dataset

In [None]:
from transformers import BertTokenizer

# Initialize tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to chunk text
def chunk_text(text, chunk_size):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Function to tokenize large dataset
def tokenize_large_dataset(dataset, tokenizer, chunk_size=1800):
    tokenized_dataset = []
    for text in dataset:
        chunks = chunk_text(text, chunk_size)
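        # truncation=True guards against a chunk that tokenizes to more than the model's 512-token limit.
        tokenized_chunks = [tokenizer.encode(chunk, add_special_tokens=True, truncation=True, max_length=512) for chunk in chunks]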
        tokenized_dataset.append(tokenized_chunks)
    return tokenized_dataset

# Create a dataset (a list of texts)
dataset = ["your large text here..."*10000]

# Tokenize the dataset
tokenized_dataset = tokenize_large_dataset(dataset, tokenizer)

In [None]:
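# Print a summary instead of every id: number of texts, chunks in the first text, and the first few ids.
print(len(tokenized_dataset), len(tokenized_dataset[0]), tokenized_dataset[0][0][:20])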

# Tokenize with Attention Masks

In [None]:
encoding = tokenizer("Hello, I am a single sentence!", add_special_tokens=True, truncation=True, max_length=50, padding='max_length', return_tensors="pt")
input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
print(input_ids, attention_mask)
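
To show what the attention mask encodes, here is a library-free sketch of padding a batch of token-id lists by hand. `pad_batch` is a hypothetical helper, and the ids in the example are illustrative only (apart from 101, 102, and 0, which are the `[CLS]`, `[SEP]`, and `[PAD]` ids in `bert-base-uncased`).

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and, if needed, cut) id lists to one length and build attention masks."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        # Naive cut: a real tokenizer would keep the closing special token.
        seq = seq[:target]
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(ids)   # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```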

# Handle Overflowing Tokens from Long Sequences

In [None]:
text="""

The Hugging Face library offers a wide range of transformer models for natural language processing tasks. If you have mastered the basics and now want to dive into more advanced topics, here are some suggestions:

- Fine-Tuning Models: Learn how to fine-tune pre-trained models on your specific task. This will allow you to leverage the power of transformers for your specific needs.

- Multilingual Models: Hugging Face offers several multilingual models. You can learn how to train a single model on multiple languages.

- Custom Models: You can learn how to create and train your own transformer models from scratch.

- Pipeline Creation: Hugging Face provides a pipeline feature to easily process data and make predictions. You can learn how to create custom pipelines.

- Training on Large Datasets: Handling and training models on large datasets is a common challenge in NLP. You can learn different strategies to handle this, such as gradient accumulation.

- Distributed Training: You can learn how to use multiple GPUs to train your models faster.

- Optimization Techniques: You can dive into various optimization techniques, like learning rate scheduling, weight decay, etc.

- Advanced Tokenization Techniques: There are different tokenization techniques available, such as Byte Pair Encoding (BPE), SentencePiece, etc. Each has its own advantages and use-cases.

- Interpretability of Transformer Models: Understanding why a model made a particular prediction is an active field of research. You can dive into this topic to understand your models better.

- Use of Callbacks: Callbacks can be used to customize the training process, like saving the model at different stages, changing the learning rate, etc.

You can find many tutorials and resources on these topics in the Hugging Face model hub and their documentation. Additionally, there are many courses and books available online that delve into these topics.

"""
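# The sample text is probably shorter than 512 tokens, so nothing would overflow at max_length=512.
# A smaller window makes the overflow visible; stride is the number of tokens shared by consecutive windows.
# With the slow BertTokenizer the cut-off ids come back under "overflowing_tokens";
# BertTokenizerFast instead returns one input_ids entry per window.
encoding = tokenizer(text, max_length=128, stride=32, return_overflowing_tokens=True, truncation=True)
print(encoding)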
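
The windowing idea behind `stride` can be sketched without the library. The function below is a simplified illustration under stated assumptions: `sliding_windows` is a hypothetical name, it works on a plain list of ids, and it ignores the special tokens that a real tokenizer adds to every window.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split `token_ids` into windows of `max_length` that overlap by `stride` ids."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not token_ids:
        return []
    step = max_length - stride  # how far each new window advances
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += step
    return windows

print(sliding_windows(list(range(10)), max_length=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap means a phrase that straddles a window boundary still appears whole in at least one window, at the cost of processing some tokens twice.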

# Encode Plus Method for Pair of Sentences

In [None]:
# encode_plus encodes a single sentence pair; batch_encode_plus (next section) takes a list of pairs.
encoding = tokenizer.encode_plus("Hello, I am sentence 1.", "Hello, I am sentence 2.", return_tensors="pt")
print(encoding)

# Batch Encoding for Pair of Sentences

In [None]:
pairs = [("Hello, I am sentence 1.", "Hello, I am sentence 2."), ("Sentence 1.", "Sentence 2.")]
encoding = tokenizer.batch_encode_plus(pairs, padding=True, truncation=True, return_tensors="pt")
print(encoding)
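
For a BERT-style model, a sentence pair is laid out as `[CLS] A [SEP] B [SEP]`, with `token_type_ids` marking which segment each position belongs to. The sketch below builds that layout by hand on already-split tokens. `build_pair_inputs` is a hypothetical helper for illustration, and the example token lists are made up rather than real tokenizer output.

```python
def build_pair_inputs(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Return the BERT-style token layout and segment ids for a sentence pair."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, segments = build_pair_inputs(["sentence", "1", "."], ["sentence", "2", "."])
print(tokens)    # ['[CLS]', 'sentence', '1', '.', '[SEP]', 'sentence', '2', '.', '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```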

Byte-level Byte Pair Encoding (Byte-level BPE) is a subword tokenization method, derived from the byte pair encoding compression algorithm, that is used by models such as GPT-2 (Generative Pre-trained Transformer). It operates on the UTF-8 bytes of the text and iteratively merges the most frequent pair of adjacent symbols in a dataset, building a vocabulary of byte-level subwords. Here's a high-level explanation of the Byte-level BPE process:

1. **Initialization:**
   - Start with a vocabulary containing all unique bytes present in the dataset. Implementations such as GPT-2's start from all 256 possible byte values, so any input can be encoded without an unknown token.

2. **Byte Pair Merging:**
   - Count how often each pair of adjacent symbols occurs in the dataset under the current vocabulary.
   - Merge the most frequent pair into a new subword and add it to the vocabulary.

3. **Iterative Merging:**
   - Repeat the merging step for a predefined number of iterations or until a specified vocabulary size is reached.

4. **Final Vocabulary:**
   - The final vocabulary consists of the initial bytes plus the subwords created by the merges.

5. **Tokenization:**
   - Use the learned merges, applied in the order they were learned, to tokenize input text into a sequence of byte-level subwords.
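
The training steps above can be sketched in plain Python. This is a simplified illustration, not the implementation used by any particular library: `train_byte_level_bpe` is a hypothetical name, the vocabulary starts from all 256 byte values, and the sketch merges across the whole byte stream, whereas production tokenizers usually pre-split text into words first.

```python
from collections import Counter

def train_byte_level_bpe(corpus: str, num_merges: int):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `corpus`."""
    # Every symbol is a bytes object; start from single bytes.
    seq = [bytes([b]) for b in corpus.encode("utf-8")]
    vocab = {bytes([i]) for i in range(256)}
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols in the current segmentation.
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges add nothing useful
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        # Rewrite the sequence, replacing the chosen pair left to right.
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                new_seq.append(merged)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return vocab, merges, seq

vocab, merges, seq = train_byte_level_bpe("aaabdaaabac", num_merges=3)
print(merges[0])   # (b'a', b'a')
print(len(vocab))  # 259
```

Keeping `merges` as an ordered list matters: tokenizing new text later replays the merges in exactly this order.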

Mathematically, let's denote the dataset as D, the initial vocabulary as V₀, and the vocabulary after i merges as Vᵢ. The number of times the symbols a and b appear next to each other in the dataset is denoted as freq(a, b).

The algorithm can be represented as follows:

- Initialize: V₀ = {all unique bytes in D}
- For i = 1 to N (where N is the number of merges, each adding one subword to the vocabulary):
  - Segment D with the symbols of Vᵢ₋₁ and calculate freq(a, b) for every pair of adjacent symbols.
  - Find the most frequent pair (a, b).
  - Merge (a, b) to create a new subword ab.
  - Update the vocabulary: Vᵢ = Vᵢ₋₁ ∪ {ab}, where ab is the merged subword.

The tokenization process using the final vocabulary Vₙ starts from the bytes of the input text and applies the learned merges in the order they were learned, so that sequences of bytes are replaced with their corresponding subwords from Vₙ.
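
Tokenizing new text with a trained byte-level BPE model amounts to replaying the merges in order. The sketch below assumes the merges are available as an ordered list of byte pairs; `bpe_encode` and `example_merges` are hypothetical names, and the merge list is hand-written for illustration rather than learned from data.

```python
def bpe_encode(text: str, merges):
    """Tokenize `text` by applying `merges` in the order they were learned."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for a, b in merges:
        merged, out, i = a + b, [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# Hand-written merge list, standing in for the output of a trainer.
example_merges = [(b"l", b"o"), (b"lo", b"w"), (b"e", b"r")]
print(bpe_encode("lower", example_merges))  # [b'low', b'er']
print(bpe_encode("é", example_merges))      # [b'\xc3', b'\xa9']
```

The second call shows why the byte-level variant never needs an unknown token: a character outside the merges simply falls back to its raw UTF-8 bytes.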

Byte-level BPE is effective in handling rare words and reducing vocabulary size while maintaining flexibility in handling a wide range of input languages and characters. The process is particularly useful in the context of models like GPT, where a compact yet expressive vocabulary is crucial for efficient language representation.

# Mathematical Explanation of Byte-level BPE


In Byte-level Byte Pair Encoding (Byte-level BPE), the process involves iteratively merging the most frequent byte pairs in a given dataset to create a vocabulary of byte-level subwords.

Let's define the variables and steps mathematically:

- Let $ D $ represent the dataset consisting of bytes.
- $ V_i $ represents the vocabulary after $ i $ iterations.
- $ \text{freq}(a, b) $ denotes the frequency of the byte pair $ (a, b) $ in the dataset $ D $.

The algorithm iterates as follows:

1. **Initialization**:
   - $ V_0 $ is initialized with all unique bytes in $ D $.

2. **Byte Pair Merging**:
   - For $ i = 1 $ to $ N $ iterations (or until a specified vocabulary size is reached):
     - Calculate the frequency $ \text{freq}(a, b) $ of each pair of adjacent symbols $ a, b \in V_{i-1} $ in the dataset $ D $.
     - Let $ (a^*, b^*) $ be the most frequent pair.
     - Merge $ (a^*, b^*) $ to create a new subword $ a^*b^* $.
     - Update the vocabulary: $ V_i = V_{i-1} \cup \{a^*b^*\} $.

This process continues until the desired number of iterations $ N $ is reached or until a specific vocabulary size criterion is met.

Mathematically, the algorithm can be represented as a set of equations:

- **Initialization**:
  $$
  V_0 = \{ \text{all unique bytes in } D \}
  $$

- **Byte Pair Merging**:
  $$
  \text{For } i = 1 \text{ to } N:
  $$
  $$
  \begin{aligned}
  &\text{Calculate } \text{freq}(a, b) \text{ for all adjacent pairs } (a, b),\ a, b \in V_{i-1}, \text{ in dataset } D \\
  &(a^*, b^*) = \arg\max_{(a, b)} \text{freq}(a, b) \\
  &a^*b^* = \text{merge}(a^*, b^*) \\
  &V_i = V_{i-1} \cup \{a^*b^*\}
  \end{aligned}
  $$

This iterative process effectively builds a vocabulary $ V_N $ consisting of byte-level subwords, merging the most frequent byte pairs from the previous vocabulary in each iteration.
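
As a small worked example of these equations, take the toy dataset $ D = \text{abababc} $ (chosen purely for illustration) and run two iterations, writing the current segmentation of $ D $ with spaces between symbols:

$$
\begin{aligned}
&V_0 = \{\text{a}, \text{b}, \text{c}\}, \quad D = \text{a b a b a b c} \\
&\text{freq}(\text{a}, \text{b}) = 3, \quad \text{freq}(\text{b}, \text{a}) = 2, \quad \text{freq}(\text{b}, \text{c}) = 1 \\
&(a^*, b^*) = (\text{a}, \text{b}), \quad V_1 = V_0 \cup \{\text{ab}\}, \quad D = \text{ab ab ab c} \\
&\text{freq}(\text{ab}, \text{ab}) = 2, \quad \text{freq}(\text{ab}, \text{c}) = 1 \\
&(a^*, b^*) = (\text{ab}, \text{ab}), \quad V_2 = V_1 \cup \{\text{abab}\}, \quad D = \text{abab ab c}
\end{aligned}
$$

After two merges the seven original bytes are represented by three symbols, and the vocabulary has grown from three entries to five.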

# WordPiece

WordPiece is a tokenization technique used in language models like BERT (Bidirectional Encoder Representations from Transformers). It operates by breaking down words into smaller subword units called tokens. This method allows the model to handle out-of-vocabulary words and increases the model's ability to generalize.

Mathematically, WordPiece training, as it is commonly described (for example in the Hugging Face documentation; the original training code was not released), involves several steps:

1. **Initialization of Vocabulary:** Start with a vocabulary containing the individual characters of the corpus, with characters that continue a word marked by a `##` prefix.

2. **Frequency Analysis:** Count how often each subword unit, and each pair of adjacent units, occurs in the training corpus.

3. **Merging the best-scoring pair:** Unlike BPE, which merges the most frequent pair, WordPiece merges the pair with the highest score, defined as the pair frequency divided by the product of the frequencies of its two parts. This favors pairs whose parts rarely occur apart.

4. **Stop Criteria:** Repeat the merging process until reaching a predefined vocabulary size.

The algorithm can be represented mathematically as follows:

- Let $ C $ represent the corpus.
- $ V_0 $ represents the initial vocabulary and $ V_i $ the vocabulary after $ i $ merges.
- $ \text{freq}(x) $ denotes the frequency of the unit $ x $ in the corpus, and $ \text{freq}(x, y) $ the frequency of the adjacent pair $ (x, y) $.
- $ \text{score}(x, y) = \dfrac{\text{freq}(x, y)}{\text{freq}(x) \cdot \text{freq}(y)} $.

The training algorithm can be described using equations:

1. **Initialization:**
   $$
   V_0 = \{\text{individual characters in } C\}
   $$

2. **Scoring and Merging:**
   $$
   \text{For } i = 1 \text{ to } N \text{ iterations or until the target vocabulary size is reached}:
   $$
   $$
   \begin{aligned}
   &\text{Calculate } \text{score}(x, y) \text{ for all adjacent pairs } (x, y),\ x, y \in V_{i-1}, \text{ in corpus } C \\
   &(x^*, y^*) = \arg\max_{(x, y)} \text{score}(x, y) \\
   &x^*y^* = \text{merge}(x^*, y^*) \\
   &V_i = V_{i-1} \cup \{x^*y^*\}
   \end{aligned}
   $$

This iterative process continues until the desired vocabulary size is reached. The resulting vocabulary consists of characters and merged subword units, which form the WordPiece tokens used to tokenize words in the language model.
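
WordPiece training is usually described as ranking candidate pairs by a normalized score, the pair frequency divided by the product of the two unit frequencies, rather than by raw pair frequency. The sketch below contrasts the two rankings on a toy word-frequency table. The function names and the table are assumptions made for illustration, and the `##` prefix marks units that continue a word.

```python
from collections import Counter

def count_units_and_pairs(word_freqs):
    """Count unit and adjacent-pair frequencies over character-split words."""
    unit_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        # First character stays bare; later characters get the ## prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for s in symbols:
            unit_freq[s] += freq
        for x, y in zip(symbols, symbols[1:]):
            pair_freq[(x, y)] += freq
    return unit_freq, pair_freq

def wordpiece_scores(unit_freq, pair_freq):
    """score(x, y) = freq(x, y) / (freq(x) * freq(y))."""
    return {(x, y): f / (unit_freq[x] * unit_freq[y]) for (x, y), f in pair_freq.items()}

toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
unit_freq, pair_freq = count_units_and_pairs(toy_word_freqs)
scores = wordpiece_scores(unit_freq, pair_freq)

most_frequent = max(pair_freq, key=pair_freq.get)
best_scored = max(scores, key=scores.get)
print(most_frequent)  # ('##u', '##g')
print(best_scored)    # ('##g', '##s')
```

The most frequent pair contains `##u`, which appears in every word, so its score is diluted. The pair `('##g', '##s')` is rarer, but `##s` never occurs without a preceding `##g`, which is what the score rewards.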

WordPiece tokenization is effective in handling rare words, morphological variations, and languages with complex word structures, contributing to the robustness and flexibility of language models like BERT.
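
Once a vocabulary exists, BERT-style WordPiece tokenizes each word greedily, taking the longest matching piece first and marking continuation pieces with `##`. The sketch below illustrates that lookup on a tiny hand-made vocabulary. `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the vocabulary is not taken from any real model.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"token", "##ization", "##ize", "un", "##related", "[UNK]"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unrelated", toy_vocab))     # ['un', '##related']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```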