# TOKENIZATION

## WHAT?

Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, characters, subwords, or symbols, depending on the specific tokenization method used. Tokenization is a fundamental step in natural language processing (NLP) and is crucial for various language-related tasks.

## WHEN?

Tokenization is typically performed in the following sequence during the preprocessing stage of an NLP pipeline:

1. **After Text Cleaning**: 
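   - Once the raw text has been cleaned (for example by lowercasing and by stripping markup, stray special characters, or other noise), tokenization comes next. Whether punctuation is removed at this stage depends on the task, since many tokenizers treat punctuation marks as tokens in their own right.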
   
2. **Before Vector Creation**: 
   - Tokenization happens right before converting text into numerical representations. After tokenization, techniques like Bag-of-Words, TF-IDF, or embeddings (Word2Vec, GloVe, transformers) are applied to convert tokens into vectors that models can process.

So, tokenization is performed **after text cleaning** and **before vectorization** in the NLP pipeline.
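The ordering can be illustrated with a minimal, dependency-free sketch. The helper names `clean`, `tokenize`, and `vectorize` are hypothetical, and the cleaning rule (lowercase, drop everything except letters, digits, and whitespace) is just one possible choice:

```python
import re
from collections import Counter

def clean(text: str) -> str:
    """Lowercase, then replace anything that is not a letter, digit, or whitespace with a space."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def tokenize(text: str) -> list[str]:
    """Split cleaned text on runs of whitespace."""
    return text.split()

def vectorize(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Bag-of-Words: count how often each vocabulary entry occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

raw = "The cat sat. The cat slept!"
tokens = tokenize(clean(raw))          # step 1: clean, step 2: tokenize
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)  # step 3: vectorize

print(tokens)      # ['the', 'cat', 'sat', 'the', 'cat', 'slept']
print(vocabulary)  # ['cat', 'sat', 'slept', 'the']
print(vector)      # [2, 1, 1, 2]
```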

## WHY?

Tokenization is essential for several reasons:

**Input Preparation:** Most NLP models require discrete input units. Tokenization converts raw text into these units.

**Granularity Control:** It allows control over the level of granularity in text analysis (e.g., word-level vs. character-level).

**Vocabulary Management:** It helps in creating and managing vocabularies for language models.

**Feature Extraction:** Tokens serve as features for various NLP tasks.

**Dimensionality Control:** Mapping text onto a finite vocabulary bounds the size of the input space. Each token receives an integer ID, and a smaller vocabulary (for example, a subword vocabulary) means smaller one-hot vectors and embedding tables.
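As a rough illustration of vocabulary management, the sketch below builds a token-to-ID mapping and encodes a new sentence. The function names and the choice of reserving ID 0 for an `[UNK]` token are assumptions made for this example:

```python
def build_vocab(tokenized_corpus, unk_token="[UNK]"):
    """Assign an integer ID to every distinct token; ID 0 is reserved for unknown tokens."""
    vocab = {unk_token: 0}
    for tokens in tokenized_corpus:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab, unk_token="[UNK]"):
    """Replace each token with its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus)
ids = encode(["the", "bird", "sat"], vocab)

print(vocab)  # {'[UNK]': 0, 'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print(ids)    # [1, 0, 3]  ("bird" was never seen, so it maps to [UNK])
```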

## HOW?

The process of tokenization can vary depending on the chosen method, but generally involves these steps:

1. **Text Normalization**: This may include converting text to lowercase, removing punctuation, or handling special characters.
2. **Boundary Detection**: Identifying where one token ends and another begins. This could be based on whitespace, punctuation, or more complex rules.
3. **Token Extraction**: Separating the identified tokens from the original text.
4. **Token Processing**: This may involve further processing like stemming, lemmatization, or subword tokenization.

Example:

Let's take a simple sentence: "The quick brown fox jumps over the lazy dog."

1. Word-level tokenization (punctuation is dropped in this illustration; many tokenizers emit "." as its own token):
   ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

2. Character-level tokenization:
   ["T", "h", "e", " ", "q", "u", "i", "c", "k", " ", "b", "r", "o", "w", "n", " ", "f", "o", "x", " ", "j", "u", "m", "p", "s", " ", "o", "v", "e", "r", " ", "t", "h", "e", " ", "l", "a", "z", "y", " ", "d", "o", "g", "."]

3. Subword tokenization (WordPiece-style notation, where "##" marks a piece that continues a word; the actual splits depend on the learned vocabulary):
   ["The", "quick", "brown", "fox", "jump", "##s", "over", "the", "lazy", "dog"]
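The first three steps listed above (normalization, boundary detection, token extraction) can be sketched with a single regular expression. This is only one possible rule set; the function name `tokenize_with_spans` and the pattern are illustrative choices, and step 4 (stemming, lemmatization, subword splitting) is left out:

```python
import re

def tokenize_with_spans(text):
    # 1. Normalization: lowercase only, so character offsets still match the input.
    normalized = text.lower()
    # 2. Boundary detection: each regex match marks where a token starts and ends.
    #    \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    matches = re.finditer(r"\w+|[^\w\s]", normalized)
    # 3. Token extraction: collect each token together with its (start, end) span.
    return [(m.group(), m.start(), m.end()) for m in matches]

sentence = "The quick brown fox jumps over the lazy dog."
spans = tokenize_with_spans(sentence)
word_tokens = [token for token, _, _ in spans]
char_tokens = list(sentence)

print(spans[:3])    # [('the', 0, 3), ('quick', 4, 9), ('brown', 10, 15)]
print(word_tokens)
print(len(word_tokens), len(char_tokens))  # 10 44
```

The same sentence yields 10 word-level tokens (including the final period) but 44 character-level tokens, which is the granularity trade-off in practice.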

## SIGNIFICANCE

Tokenization plays a crucial role in NLP for several reasons:

1. **Language Understanding**: By breaking text into meaningful units, tokenization helps machines better understand and process human language.

2. **Vocabulary Size Management**: It allows for control over the vocabulary size, which is crucial for model efficiency and handling out-of-vocabulary words.

3. **Cross-lingual Applications**: Some tokenization methods (like BPE) can work across multiple languages, facilitating multilingual NLP models.

4. **Handling of Rare Words**: Subword tokenization methods can effectively handle rare words by breaking them into more common subword units.

5. **Improved Model Performance**: Proper tokenization can lead to better model performance by providing more meaningful input representations.

6. **Consistency**: It ensures consistency in how text is processed, which is essential for reproducible results in NLP tasks.

7. **Efficiency**: Once tokens are mapped to integer IDs, large amounts of text can be stored and processed efficiently as arrays of numbers.

8. **Feature Engineering**: Tokens serve as the basis for many feature engineering techniques in NLP, such as bag-of-words, TF-IDF, and n-grams.
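To make the feature-engineering point concrete, here is a small sketch that derives n-gram features from a token list. The helper `ngrams` is written for this example and is not taken from any library:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)    # usable as sparse features
unigram_counts = Counter(tokens)    # equivalent to bag-of-words counts

print(bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
print(unigram_counts["the"])  # 2
```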

# TOKENIZER TYPES

## Character Tokenizer



#### Working Explanation:
A character tokenizer breaks down text into individual characters. Each character, including spaces and punctuation marks, becomes a separate token.

#### Example:
Input: "Hello, World!"
Output: ["H", "e", "l", "l", "o", ",", " ", "W", "o", "r", "l", "d", "!"]

#### Advantages:
1. Simple and straightforward implementation
2. Virtually no out-of-vocabulary issues, because the set of characters is small and closed
3. Useful for tasks that require character-level analysis

#### Disadvantages:
1. Loses word-level semantics
2. Results in very long sequences, which can be computationally expensive
3. May not capture higher-level language structures effectively

```python
text = "Hello, world!"
# list() yields one token per character, including the space and punctuation.
character_tokens = list(text)
print(character_tokens)
```

```
['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']
```
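The trade-off described above (tiny vocabulary, long sequences) can be seen by building a character vocabulary and encoding the same string. This is a minimal sketch; sorting the characters is just a way to make the IDs deterministic:

```python
text = "Hello, world!"
char_tokens = list(text)

# Character vocabulary: one ID per distinct character, sorted for determinism.
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

# Decoding is a plain reverse lookup, so nothing is lost.
id_to_char = {idx: ch for ch, idx in char_vocab.items()}
decoded = "".join(id_to_char[i] for i in char_ids)

word_tokens = text.split()
print(len(char_vocab))                     # 10 distinct characters
print(len(char_ids), len(word_tokens))     # 13 character tokens vs. 2 whitespace tokens
print(decoded == text)                     # True
```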


## Word Level Tokenizer

#### Working Explanation:
A word-level tokenizer splits text into individual words. It typically uses spaces and punctuation as delimiters to identify word boundaries.

#### Example:
Input: "The quick brown fox jumps over the lazy dog."
Output: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

#### Advantages:
1. Preserves word-level semantics
2. Intuitive and easy to interpret
3. Works well for many NLP tasks

#### Disadvantages:
1. Large vocabulary size, especially for morphologically rich languages
2. Cannot represent out-of-vocabulary words, which are typically all mapped to a single unknown token
3. May struggle with compound words or unconventional spellings

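```python
# word_tokenize needs NLTK's tokenizer data, e.g. nltk.download("punkt")
# (recent NLTK releases ask for "punkt_tab" instead).
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)

print(tokens)
```

```
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```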
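For comparison, a dependency-free approximation of word-level tokenization can be written with a regular expression. The pattern below is an illustrative assumption, not NLTK's algorithm: it keeps contractions and hyphenated compounds together, whereas `word_tokenize` splits contractions such as "Don't" into separate pieces.

```python
import re

# Letters, optionally joined by apostrophes or hyphens; or a run of digits;
# or a single punctuation character.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+|[^\w\s]")

def simple_word_tokenize(text):
    return WORD_PATTERN.findall(text)

tokens = simple_word_tokenize("Don't over-think the state-of-the-art model.")
print(tokens)
# ["Don't", 'over-think', 'the', 'state-of-the-art', 'model', '.']
```

How compounds and contractions should be split is a design decision; either choice changes the resulting vocabulary.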


## Whitespace Tokenizer

#### Working Explanation:
A whitespace tokenizer simply splits text on whitespace characters (spaces, tabs, newlines). It's one of the simplest forms of tokenization.

#### Example:
Input: "The quick brown fox\njumps over the lazy dog."
Output: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog."]

#### Advantages:
1. Very simple and fast
2. Works well for languages with clear word boundaries

#### Disadvantages:
1. Doesn't handle punctuation well
2. May not work properly for languages without clear word boundaries (e.g., Chinese)
3. Can't handle contractions or hyphenated words effectively

```python
from nltk.tokenize import WhitespaceTokenizer

text = "Hello, world! This is an example text."

tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize(text)

print(tokens)
```

```
['Hello,', 'world!', 'This', 'is', 'an', 'example', 'text.']
```
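Python's built-in `str.split()` with no arguments behaves the same way for plain text: it splits on any run of spaces, tabs, or newlines. The short sketch below contrasts it with `split(" ")`, which splits on single spaces only:

```python
text = "The quick brown fox\njumps over\tthe lazy  dog."

# No argument: split on any run of whitespace and drop empty strings.
tokens = text.split()

# Explicit " ": newlines and tabs stay inside tokens, and the double space
# produces an empty string.
naive_tokens = text.split(" ")

print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
print(naive_tokens)
# ['The', 'quick', 'brown', 'fox\njumps', 'over\tthe', 'lazy', '', 'dog.']
```

Note that the final token is still `'dog.'`: punctuation stays attached to the neighbouring word.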


## SubWord Tokenization

Subword tokenization breaks words into smaller meaningful units. This helps handle out-of-vocabulary words, keeps the vocabulary size manageable, and captures some morphological information.

### **Subword Tokenization Methods**

This section covers three subword algorithms, Byte-Pair Encoding (BPE), WordPiece, and Unigram, along with SentencePiece, a framework that applies BPE or Unigram directly to raw text.

#### **1. Byte-Pair Encoding (BPE)**

##### **Working Explanation**

BPE starts with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs of characters or subwords. The process continues until a desired vocabulary size is reached.

1. Initialize the vocabulary with individual characters.
2. Count the frequency of adjacent symbol pairs (characters or previously merged subwords) in the corpus.
3. Merge the most frequent pair and add it to the vocabulary.
4. Repeat steps 2-3 until the desired vocabulary size is reached.

Example:
```
Initial words: ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
Base vocabulary: ["b", "g", "h", "n", "p", "s", "u"]
After three merges ("ug", "un", "hug"): ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
```
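The training loop can be sketched in plain Python and run on the toy corpus from the example. This is a simplified illustration, not the Hugging Face implementation: the function name `train_bpe` is made up here, training stops after a fixed number of merges rather than at a target vocabulary size, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # Represent each word as a tuple of symbols, starting from single characters.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = tuple(merged)
    return merges, splits

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = train_bpe(word_freqs, num_merges=3)
base_vocab = sorted({ch for word in word_freqs for ch in word})
vocab = base_vocab + ["".join(pair) for pair in merges]

print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(vocab)           # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
print(splits["hugs"])  # ('hug', 's')
```

The pair counts explain the order: "u"+"g" occurs 20 times, then "u"+"n" 16 times, then "h"+"ug" 15 times.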

##### **Advantages**
1. Effective balance between vocabulary size and token expressiveness
2. Handles rare words and OOV words well
3. Can capture subword semantics
4. Works well for multilingual models

##### **Disadvantages**
1. Can produce unintuitive splits for some words
2. Requires training on a corpus
3. May not always capture morphological structures effectively

##### **Variants**

**Byte-level BPE**

Used by GPT-2, this variant uses bytes as the base vocabulary, ensuring a fixed base vocabulary size of 256 while being able to tokenize any text without an unknown token.
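The reason a 256-entry base vocabulary covers any text is that every string can be encoded as UTF-8 bytes. The sketch below shows only this base layer, before any merges are learned; the sample string is arbitrary:

```python
text = "h\u00e9llo \U0001F642"   # "héllo" followed by a space and an emoji

# Base tokens are the UTF-8 bytes of the text, each an integer in 0..255.
byte_tokens = list(text.encode("utf-8"))

print(len(text))         # 7 Unicode code points
print(len(byte_tokens))  # 11 bytes: "é" takes 2 bytes, the emoji takes 4
print(all(0 <= b < 256 for b in byte_tokens))  # True

# The mapping is lossless, so no unknown token is ever needed.
restored = bytes(byte_tokens).decode("utf-8")
print(restored == text)  # True
```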


```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

class BPETokenizer:
    """
    Byte Pair Encoding (BPE) tokenizer using HuggingFace Tokenizers library.
    """
    def __init__(self, vocab_size=1000):
        self.tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
        self.trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocab_size)
        self.tokenizer.pre_tokenizer = Whitespace()

    def train(self, files):
        """
        Train the BPE tokenizer on a given corpus.
        """
        self.tokenizer.train(files, self.trainer)

    def tokenize(self, text):
        """
        Tokenize the input text using the trained BPE tokenizer.
        """
        return self.tokenizer.encode(text).tokens

# Example usage
bpe_tokenizer = BPETokenizer(vocab_size=2000)
corpus = ["res/corpus.txt"]  # Provide your corpus file here
bpe_tokenizer.train(corpus)
tokens = bpe_tokenizer.tokenize("This is a sample sentence.")
print(tokens)
```

```
['This', 'is', 'a', 's', 'a', 'm', 'ple', 'sent', 'ence', '.']
```


#### **2. WordPiece**


##### **Working Explanation**

WordPiece is similar to BPE but uses a different criterion for merging tokens. Instead of choosing the most frequent pair, it selects the pair that maximizes the likelihood of the training data when added to the vocabulary.

1. Initialize the vocabulary with individual characters.
2. For each possible merge, calculate the increase in likelihood of the training data.
3. Choose the merge that results in the highest increase in likelihood.
4. Repeat steps 2-3 until the desired vocabulary size is reached.
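One scoring rule that is often used to explain this criterion divides a pair's frequency by the product of its parts' frequencies: `score = freq(pair) / (freq(first) * freq(second))`. Treat this formula as an assumption for illustration, not as a confirmed description of any particular WordPiece implementation. The sketch below applies it to a toy corpus and compares the result with BPE's pure frequency criterion:

```python
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def wordpiece_symbols(word):
    """Split a word into characters, marking non-initial pieces with '##'."""
    return [word[0]] + ["##" + ch for ch in word[1:]]

symbol_counts, pair_counts = Counter(), Counter()
for word, freq in word_freqs.items():
    symbols = wordpiece_symbols(word)
    for symbol in symbols:
        symbol_counts[symbol] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

# Assumed score: pair frequency normalised by the frequencies of its two parts.
scores = {
    pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
    for pair, count in pair_counts.items()
}

best_by_score = max(scores, key=scores.get)
best_by_frequency = max(pair_counts, key=pair_counts.get)

print(best_by_score)      # ('##g', '##s')
print(best_by_frequency)  # ('##u', '##g')
```

Under this rule the rare but tightly bound pair `##g` + `##s` wins, because `##u` is so frequent on its own that any pair containing it is penalised.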

##### **Advantages**
1. Often produces more meaningful subword units
2. Balances frequency and usefulness of tokens
3. Effective for languages with rich morphology

##### **Disadvantages**
1. Can be computationally more expensive than BPE
2. Still requires a pre-tokenization step for most implementations

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

class WordPieceTokenizerHF:
    """
    WordPiece tokenizer using HuggingFace Tokenizers library.
    """
    def __init__(self, vocab_size=1000):
        self.tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
        self.trainer = WordPieceTrainer(vocab_size=vocab_size, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
        self.tokenizer.pre_tokenizer = Whitespace()

    def train(self, files):
        """
        Train the WordPiece tokenizer on a given corpus.
        """
        self.tokenizer.train(files, self.trainer)

    def tokenize(self, text):
        """
        Tokenize the input text using the trained WordPiece tokenizer.
        """
        return self.tokenizer.encode(text).tokens

# Example usage
wordpiece_tokenizer = WordPieceTokenizerHF(vocab_size=2000)
corpus = ["res/corpus.txt"]  # Provide your corpus file here
wordpiece_tokenizer.train(corpus)
tokens = wordpiece_tokenizer.tokenize("This is a sample sentence.")
print(tokens)
```

```
['This', 'is', 'a', 's', '##a', '##mp', '##le', 'sent', '##ence', '.']
```
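Once a vocabulary exists, BERT-style WordPiece encodes each word greedily, taking the longest matching piece first. The sketch below illustrates that lookup with a small hand-written vocabulary; the vocabulary contents and the function name `wordpiece_encode` are hypothetical and chosen only to mirror the pieces in the output above:

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration only.
vocab = {"jump", "##s", "##ing", "sent", "##ence", "s", "##a", "##mp", "##le"}

print(wordpiece_encode("jumps", vocab))     # ['jump', '##s']
print(wordpiece_encode("sample", vocab))    # ['s', '##a', '##mp', '##le']
print(wordpiece_encode("sentence", vocab))  # ['sent', '##ence']
print(wordpiece_encode("xyz", vocab))       # ['[UNK]']
```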


#### **3. Unigram**



##### **Working Explanation**

Unigram starts with a large vocabulary and iteratively removes tokens to reach the desired vocabulary size.

1. Initialize with a large vocabulary (e.g., all pre-tokenized words and common substrings).
2. Define a loss function over the training data given the current vocabulary.
3. For each symbol, calculate the loss increase if it were removed.
4. Remove a percentage of symbols with the lowest loss increase.
5. Repeat steps 2-4 until the desired vocabulary size is reached.
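The loss in step 2 is typically the negative log-likelihood of the corpus under a unigram model, in which the probability of a segmentation is the product of its token probabilities. Evaluating it requires the most probable segmentation of each word, which can be found with the Viterbi algorithm. The sketch below shows that search for a single word; the token probabilities are invented for illustration and are not normalised or learned from data:

```python
import math

def viterbi_segment(word, token_probs):
    """Return the most probable segmentation of `word` and its log-probability."""
    # best[i] = (best log-probability of word[:i], start index of its last token)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None, -math.inf  # the word cannot be built from the vocabulary
    # Walk the back-pointers from the end of the word to recover the tokens.
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[-1][0]

# Hypothetical token probabilities, for illustration only.
token_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1,
               "hu": 0.1, "ug": 0.2, "gs": 0.05, "hug": 0.3}

tokens, log_prob = viterbi_segment("hugs", token_probs)
print(tokens)              # ['hug', 's']
print(math.exp(log_prob))  # approximately 0.03 (0.3 * 0.1)
```

Removing a token from `token_probs` and re-running the segmentation over the corpus shows how much the loss would increase, which is the quantity step 3 ranks tokens by.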

##### **Advantages**
1. Allows for multiple tokenization possibilities, which can improve robustness
2. Optimizes the vocabulary for a given size under an explicit likelihood objective
3. Works well with SentencePiece for language-agnostic tokenization

##### **Disadvantages**
1. More complex implementation compared to BPE or WordPiece
2. May require more computational resources during training

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace

class UnigramTokenizerHF:
    """
    Unigram tokenizer using HuggingFace Tokenizers library.
    """
    def __init__(self, vocab_size=1000):
        self.tokenizer = Tokenizer(Unigram())
        # unk_token tells the trained model which special token stands in for
        # characters it has never seen; without it, such input cannot be encoded.
        self.trainer = UnigramTrainer(vocab_size=vocab_size, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], unk_token="[UNK]")
        self.tokenizer.pre_tokenizer = Whitespace()

    def train(self, files):
        """
        Train the Unigram tokenizer on a given corpus.
        """
        self.tokenizer.train(files, self.trainer)

    def tokenize(self, text):
        """
        Tokenize the input text using the trained Unigram tokenizer.
        """
        return self.tokenizer.encode(text).tokens

# Example usage
unigram_tokenizer = UnigramTokenizerHF(vocab_size=2000)
corpus = ["res/corpus.txt"]  # Provide your corpus file here
unigram_tokenizer.train(corpus)
tokens = unigram_tokenizer.tokenize("This is a sample sentence.")
print(tokens)
```

```
['T', 'hi', 's', 'i', 's', 'a', 's', 'a', 'm', 'p', 'le', 'sent', 'ence', '.']
```


#### **4. SentencePiece**


##### **Working Explanation**

SentencePiece is not a tokenization algorithm itself, but rather a framework that can use BPE or Unigram algorithms. Its key feature is treating the input as a raw stream, including spaces as part of the token set.

1. Treat the input text as a raw stream of characters, including spaces.
2. Apply either BPE or Unigram algorithm to this stream.
3. Learn a vocabulary in which whitespace is an ordinary symbol, so tokens can carry a leading space marker.

Example:
```
Input: "Hello world"
Tokenized: ["▁Hello", "▁world"]
```
(Note: "▁" (U+2581) is the marker that SentencePiece substitutes for a space.)
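The space-marker convention, and the reversibility it provides, can be imitated without the SentencePiece library. The sketch below handles only the whitespace step: a trained model would split these pieces further using BPE or Unigram. The function names are hypothetical, and prepending a marker to the start of the text is an assumption modelled on SentencePiece's usual behaviour:

```python
SPACE_MARKER = "\u2581"  # the "▁" character

def to_pieces(text):
    """Replace spaces with the marker and start a new piece at every marker."""
    marked = SPACE_MARKER + text.replace(" ", SPACE_MARKER)
    pieces = []
    for ch in marked:
        if ch == SPACE_MARKER:
            pieces.append(ch)   # a marker always opens a new piece
        else:
            pieces[-1] += ch    # other characters extend the current piece
    return pieces

def from_pieces(pieces):
    """Concatenate the pieces, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(SPACE_MARKER, " ")[1:]

pieces = to_pieces("Hello world")
print(pieces)               # ['▁Hello', '▁world']
print(from_pieces(pieces))  # Hello world
```

Because spaces are stored inside the tokens, detokenization is plain concatenation and needs no language-specific rules.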

##### **Advantages**
1. Language-agnostic: works well for languages without clear word boundaries
2. Reversible tokenization: easy to reconstruct the original text
3. Consistent tokenization across languages in multilingual models

##### **Disadvantages**
1. May produce tokens that don't align with linguistic units in some languages
2. Can be less intuitive for debugging or analysis compared to word-based tokenizers

### **Comparison and Use Cases**

1. **BPE**: Good general-purpose algorithm, widely used (e.g., GPT models)
2. **WordPiece**: Effective for morphologically rich languages, used in BERT and related models
3. **Unigram**: Offers probabilistic tokenization, good for handling ambiguity
4. **SentencePiece**: Excellent for multilingual models and languages without clear word boundaries

When choosing a subword tokenization method, consider factors such as the language(s) you're working with, the size of your corpus, computational resources, and the specific requirements of your NLP task.