- **Author:** **Hemanth K**
- **✉** **speechcodehemanth2@gmail.com**
- AI Researcher


# Word Structure and Subword Models in Natural Language Processing

This exposition covers word structure and subword models, which are foundational to modern Natural Language Processing (NLP) systems, particularly Large Language Models (LLMs) and Generative AI (Gen-AI). These models are crucial for handling linguistic diversity and out-of-vocabulary (OOV) words, and for learning representations efficiently. We cover definitions, mathematical formulations, and detailed explanations of each algorithm, aimed at AI researchers and scientists.

---

## 1. Definition of Word Structure and Subword Models

### 1.1 Word Structure
Word structure refers to the morphological composition of words, encompassing their roots, prefixes, suffixes, and other subcomponents. In linguistics, words are often decomposed into smaller meaningful units called *morphemes*.
- For example, the word "unhappiness" can be broken down into "un-" (prefix), "happy" (root), and "-ness" (suffix). Understanding word structure is critical for tasks like part-of-speech tagging, machine translation, and semantic understanding.

### 1.2 Subword Models
Subword models are techniques in NLP that represent words as sequences of smaller, linguistically or statistically derived units, rather than treating words as atomic entities. `These models aim to balance the trade-off between vocabulary size and generalization`, especially for morphologically rich languages (e.g., German, Finnish) or rare words. Subword models are widely used in modern LLMs, such as BERT, GPT, and T5, to tokenize and encode text efficiently.

---

## 2. Importance of Subword Models in NLP

**Subword models address several challenges in traditional word-based or character-based representations**:
- **Out-of-Vocabulary (OOV) Words**: Traditional word-based models struggle with unseen words, requiring large vocabularies. Subword models mitigate this by breaking words into smaller, reusable units.
- **Morphological Richness**: Languages with complex morphology (e.g., agglutinative languages) benefit from subword units that capture shared morphemes.
- **Efficiency**: Subword models reduce vocabulary size while maintaining expressiveness, optimizing memory and computation during training and inference.
- **Generalization**: Subword units enable models to generalize to unseen words by learning patterns in subword composition.

---

## 3. Mathematical Foundations of Subword Models

Subword models rely on `probabilistic, statistical, or optimization-based techniques` to decompose words into subword units. Below, we outline the mathematical principles underlying key subword algorithms, such as Byte-Pair Encoding (BPE) and WordPiece.

### 3.1 Tokenization as an Optimization Problem
The goal of subword tokenization is to represent a corpus $C$ as a sequence of tokens $T = \{t_1, t_2, \dots, t_n\}$, where each token $t_i$ is a word, a subword, or a character. **The objective is to minimize the vocabulary size $|V|$ while maximizing the coverage of the corpus, subject to constraints on model complexity**.

Mathematically, this can be framed as an optimization problem:
$$
\text{minimize } |V| \text{ subject to } P(C|T, V) \geq \epsilon
$$
where:
- $ |V| $ is the size of the vocabulary.
- $ P(C|T, V) $ is the probability of reconstructing the corpus $C$ given the token sequence $T$ and vocabulary $V$.
- $ \epsilon $ is a threshold for acceptable coverage.

### 3.2 Probability of a Corpus
The probability of a corpus $C$ given a tokenization $T$ can be modeled using a language model. For a sequence of tokens $T = \{t_1, t_2, \dots, t_n\}$, the joint probability is:
$$
P(T) = \prod_{i=1}^n P(t_i | t_1, t_2, \dots, t_{i-1})
$$
Subword models aim to maximize this probability by selecting a vocabulary $V$ that captures frequent and meaningful patterns in the data.
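
The chain-rule factorization above can be illustrated with a minimal sketch. The function name `sequence_log_prob`, the toy training tokens, and the unigram estimate (which deliberately ignores the history) are assumptions made for illustration; a real language model would condition on the preceding tokens.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def sequence_log_prob(tokens: Sequence[str],
                      cond_prob: Callable[[str, Sequence[str]], float]) -> float:
    """Chain rule in log space: sum of log P(t_i | t_1, ..., t_{i-1})."""
    return sum(math.log(cond_prob(tok, tokens[:i]))
               for i, tok in enumerate(tokens))


# Toy (hypothetical) training data, already split into subword tokens.
train_tokens = ["low", "est", "low", "er", "new", "est"]
counts = Counter(train_tokens)
total = sum(counts.values())


def unigram_prob(token: str, history: Sequence[str]) -> float:
    """Simplest possible conditional model: ignore the history entirely."""
    return counts[token] / total


log_p = sequence_log_prob(["low", "est"], unigram_prob)
print(f"log P(T) = {log_p:.4f}, P(T) = {math.exp(log_p):.4f}")
```

Working in log space avoids numerical underflow when the product runs over many tokens.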

---

## 4. Subword Model Algorithms

Subword models employ various algorithms to generate subword units. Below, we discuss the most widely used approaches: Byte-Pair Encoding (BPE), WordPiece, and Unigram Language Model.

### 4.1 Byte-Pair Encoding (BPE)

#### 4.1.1 Definition
BPE is a data compression technique adapted for NLP to iteratively merge frequent pairs of characters or subword units into a single token. It starts with a character-level representation of the corpus and builds a vocabulary of subword units.

#### 4.1.2 Algorithm
1. **Initialization**: Start with the corpus $C$ represented as a sequence of characters, with each word separated by a special end-of-word symbol (e.g., `_`).
2. **Frequency Counting**: Compute the frequency of all adjacent pairs of tokens in the corpus.
3. **Merging**: Identify the most frequent pair $ (a, b) $ and merge it into a single token $ ab $.
4. **Iteration**: Repeat steps 2–3 for a fixed number of iterations or until a desired vocabulary size is reached.
5. **Tokenization**: Tokenize new text by splitting each word into characters and applying the learned merges in the order in which they were learned.

#### 4.1.3 Mathematical Formulation
Let $ F(a, b) $ denote the frequency of the pair $ (a, b) $ in the corpus. At each iteration, BPE selects the pair that maximizes:
$$
(a^*, b^*) = \arg \max_{a, b} F(a, b)
$$
The merged token $ ab $ is added to the vocabulary, and all occurrences of $ (a, b) $ in the corpus are replaced with $ ab $.

#### 4.1.4 Example
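Consider the corpus: `low`, `lowest`, `new`, `newer`.  
- Initial representation: `l o w _`, `l o w e s t _`, `n e w _`, `n e w e r _`. The initial vocabulary is the set of characters `{l, o, w, e, s, t, n, r, _}`.
- Step 1: Six pairs tie at frequency 2 (`l o`, `o w`, `w _`, `w e`, `n e`, `e w`). Breaking the tie in favor of `e w` adds `ew` to the vocabulary.
- Step 2: Merge `l o` (one of the pairs still at frequency 2) → the vocabulary now also contains `lo`.
- Continue until the desired vocabulary size is reached.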
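
The merge loop can be sketched in a few lines on the same toy corpus. This is a minimal illustration, not a reference implementation: the function name `learn_bpe` is hypothetical, and ties are broken by taking the first pair encountered, so the merge order it produces can differ from an example that breaks ties another way.

```python
from collections import Counter
from typing import List, Tuple


def learn_bpe(words: List[str], num_merges: int):
    """Learn BPE merges; each word ends with the end-of-word marker '_'."""
    corpus = Counter(tuple(w) + ("_",) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_freq = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_freq[(a, b)] += freq
        if not pair_freq:
            break
        # Assumption: ties go to the pair that was counted first.
        best = max(pair_freq, key=pair_freq.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus


merges, segmented = learn_bpe(["low", "lowest", "new", "newer"], num_merges=2)
print(merges)
print(list(segmented))
```

Comparing whole symbols, rather than doing a string replace on a space-joined word, keeps a merge from matching across symbol boundaries.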

### 4.2 WordPiece

#### 4.2.1 Definition
WordPiece is a subword tokenization algorithm that selects subword units based on their likelihood of improving the language model's performance. It is used in models like BERT.

#### 4.2.2 Algorithm
1. **Initialization**: Start with a character-level vocabulary.
2. **Scoring**: For each potential merge of two tokens $ (a, b) $ into $ ab $, compute the increase in the likelihood of the corpus under a language model.
3. **Merging**: Select the merge that maximizes the likelihood and add the new token $ ab $ to the vocabulary.
4. **Iteration**: Repeat steps 2–3 until the desired vocabulary size is reached.
5. **Tokenization**: Use the final vocabulary to tokenize text by greedily matching the longest possible subword units.

#### 4.2.3 Mathematical Formulation
WordPiece maximizes the likelihood of the corpus $C$ under a unigram language model. For a token $t$, its probability is:
$$
P(t) = \frac{\text{freq}(t)}{\sum_{t' \in V} \text{freq}(t')}
$$
The likelihood of the corpus is the product of the probabilities of its tokens. At each iteration, WordPiece selects the merge $ (a, b) \rightarrow ab $ that maximizes the increase in likelihood:
$$
\Delta L = L(C|V \cup \{ab\}) - L(C|V)
$$
where $ L(C|V) $ is the log-likelihood of the corpus given the current vocabulary $V$.
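
As a rough sketch of the $\Delta L$ criterion, the snippet below computes the unigram log-likelihood of a toy corpus before and after a candidate merge. The helper names (`log_likelihood`, `apply_merge`, `delta_l`) are hypothetical, the `##` continuation prefix is omitted for brevity, and token probabilities are re-estimated from raw counts after each merge, which is a simplifying assumption.

```python
import math
from collections import Counter
from typing import List, Tuple


def log_likelihood(corpus: List[List[str]]) -> float:
    """Unigram log-likelihood: sum over tokens of log(freq(t) / N)."""
    counts = Counter(sym for word in corpus for sym in word)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())


def apply_merge(corpus: List[List[str]], pair: Tuple[str, str]) -> List[List[str]]:
    """Return a copy of the corpus with every occurrence of `pair` merged."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged


def delta_l(corpus: List[List[str]], pair: Tuple[str, str]) -> float:
    """Change in log-likelihood caused by merging `pair`."""
    return log_likelihood(apply_merge(corpus, pair)) - log_likelihood(corpus)


corpus = [list("low"), list("lowest"), list("new"), list("newer")]
candidates = sorted({(a, b) for w in corpus for a, b in zip(w, w[1:])})
best = max(candidates, key=lambda p: delta_l(corpus, p))
for pair in candidates:
    print(pair, round(delta_l(corpus, pair), 3))
print("selected merge:", best)
```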

#### 4.2.4 Example
For the same corpus as above, WordPiece might prioritize merges like `lo` or `est` if they improve the language model's likelihood more than pairs like `ew`.

### 4.3 Unigram Language Model

#### 4.3.1 Definition
The Unigram Language Model is a probabilistic approach to subword tokenization, implemented in the SentencePiece toolkit and used by models such as T5, ALBERT, and XLNet. It assumes that tokens are generated independently and selects a vocabulary that maximizes the likelihood of the corpus.

#### 4.3.2 Algorithm
1. **Initialization**: Start with a large seed vocabulary (e.g., all possible character sequences up to a certain length).
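2. **Scoring**: Estimate the unigram probability of each token in the vocabulary (typically with the EM algorithm), then compute for each token how much the corpus likelihood would drop if that token were removed.
3. **Pruning**: Remove the tokens whose removal reduces the likelihood the least, keeping a fixed fraction of the vocabulary. Single characters are always kept so that every word remains tokenizable.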
4. **Iteration**: Repeat steps 2–3 until the desired vocabulary size is reached.
5. **Tokenization**: Use the Viterbi algorithm to find the most probable tokenization of a word given the final vocabulary.

#### 4.3.3 Mathematical Formulation
The probability of a word $w$ being tokenized as a sequence of subword units $ T = \{t_1, t_2, \dots, t_m\} $ is:
$$
P(w|T) = \prod_{i=1}^m P(t_i)
$$
The goal is to find the vocabulary $V$ and tokenization $T$ that maximize the likelihood of the corpus $C$:
$$
V^* = \arg \max_V \prod_{w \in C} \max_T P(w|T, V)
$$

#### 4.3.4 Example
For the word `lowest`, the Unigram model might consider tokenizations like `low est`, `lo west`, or `l o w e s t`, and select the one with the highest probability based on the learned token probabilities.
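
The Viterbi search over tokenizations can be sketched with dynamic programming over character positions. The token probabilities below are made-up values chosen for illustration, and `viterbi_segment` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Tuple


def viterbi_segment(word: str, log_probs: Dict[str, float]) -> Tuple[List[str], float]:
    """Return the highest-scoring segmentation of `word` and its log-probability."""
    n = len(word)
    best_score = [-math.inf] * (n + 1)  # best_score[i]: best log-prob of word[:i]
    back = [0] * (n + 1)                # back[i]: start index of the last piece
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        return [], -math.inf            # no segmentation exists in this vocabulary
    pieces, end = [], n
    while end > 0:                      # follow back-pointers from the end
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best_score[n]


# Hypothetical unigram probabilities, for illustration only.
probs = {"low": 0.1, "est": 0.08, "lo": 0.05, "west": 0.01,
         "l": 0.02, "o": 0.03, "w": 0.02, "e": 0.05, "s": 0.03, "t": 0.03}
log_probs = {tok: math.log(p) for tok, p in probs.items()}

pieces, score = viterbi_segment("lowest", log_probs)
print(pieces, math.exp(score))
```

With these values, `low est` scores 0.1 × 0.08 = 0.008, which beats `lo west` (0.05 × 0.01 = 0.0005) and the character-by-character split.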

---

## 5. Subword Models in Practice

### 5.1 Integration with Neural Networks
Subword models are typically used as a preprocessing step in NLP pipelines. The tokenized subword units are mapped to embeddings, which are then fed into neural networks (e.g., Transformers). The embedding of a token $t$ is represented as a vector $ \mathbf{e}_t \in \mathbb{R}^d $, where $d$ is the embedding dimension.

### 5.2 Handling OOV Words
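For a new word $w$, subword models tokenize it into a sequence of subword units $ T = \{t_1, t_2, \dots, t_m\} $. Transformer models consume the subword embeddings directly, one per position. When a single vector per word is needed, it is often computed as the sum or average of the embeddings of its subword units: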
$$
\mathbf{e}_w = \frac{1}{m} \sum_{i=1}^m \mathbf{e}_{t_i}
$$
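
A minimal sketch of the averaging formula, using plain Python lists so that it has no dependencies. The embedding table and its values are invented for illustration, as is the helper name `word_embedding`.

```python
from typing import Dict, List


def word_embedding(subwords: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Average the embeddings of a word's subword units, dimension by dimension."""
    vectors = [table[t] for t in subwords]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Hypothetical 3-dimensional embeddings for three subword units.
table = {
    "un": [1.0, 0.0, 2.0],
    "happi": [0.0, 3.0, 1.0],
    "ness": [2.0, 0.0, 0.0],
}

e_w = word_embedding(["un", "happi", "ness"], table)
print(e_w)  # [1.0, 1.0, 1.0]
```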

### 5.3 Evaluation Metrics
Subword models are evaluated based on:
- **Vocabulary Size**: Smaller vocabularies are preferred for efficiency.
- **Coverage**: The percentage of the corpus that can be tokenized without OOV tokens.
- **Perplexity**: The perplexity of a language model trained on the tokenized corpus.
- **Downstream Task Performance**: Accuracy, F1-score, or BLEU score on tasks like machine translation or text classification.
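
Two of these metrics are simple enough to sketch directly. The helpers below are hypothetical: coverage is computed as the fraction of tokens found in the vocabulary, and perplexity as the exponential of the negative mean per-token log-probability.

```python
import math
from typing import Iterable, List, Set


def coverage(tokens: List[str], vocab: Set[str]) -> float:
    """Fraction of tokens that are present in the vocabulary."""
    return sum(t in vocab for t in tokens) / len(tokens)


def perplexity(token_log_probs: Iterable[float]) -> float:
    """exp of the negative mean log-probability (natural log) per token."""
    log_probs = list(token_log_probs)
    return math.exp(-sum(log_probs) / len(log_probs))


cov = coverage(["low", "est", "zzz", "er"], {"low", "est", "er"})
ppl = perplexity([math.log(0.25)] * 4)
print(cov, ppl)
```

Per-token perplexities are not directly comparable across tokenizers that split the same text into different numbers of tokens, so comparisons are usually normalized per word or per character.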

---

## 6. Challenges and Limitations

- **Language Dependence**: Subword models may struggle with languages that have irregular morphology or lack clear morpheme boundaries.
- **Over-Segmentation**: Rare words may be tokenized into too many subword units, leading to loss of meaning.
- **Under-Segmentation**: Frequent words may not be decomposed, missing morphological insights.
- **Computational Cost**: Training subword models on large corpora requires significant computational resources.

---

## 7. Advanced Topics in Subword Models

### 7.1 Multilingual Subword Models
For multilingual models, subword vocabularies are shared across languages. The challenge is to balance the representation of high-resource and low-resource languages. Techniques like temperature sampling are used to adjust the frequency of tokens during vocabulary construction:
$$
P(t) \propto \text{freq}(t)^{1/\tau}
$$
where $ \tau $ is the temperature parameter.
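
The effect of the temperature can be seen in a small sketch. The frequencies are made up, and `temperature_probs` is a hypothetical helper that normalizes $\text{freq}(t)^{1/\tau}$ into a distribution.

```python
from typing import Dict


def temperature_probs(freqs: Dict[str, float], tau: float) -> Dict[str, float]:
    """Normalize freq ** (1 / tau) so that the values sum to 1."""
    weights = {key: f ** (1.0 / tau) for key, f in freqs.items()}
    z = sum(weights.values())
    return {key: w / z for key, w in weights.items()}


# Hypothetical counts: one very frequent item and one rare item.
freqs = {"frequent": 900.0, "rare": 100.0}

p_tau1 = temperature_probs(freqs, tau=1.0)  # proportional to raw frequency
p_tau2 = temperature_probs(freqs, tau=2.0)  # flatter: square roots 30 and 10
print(p_tau1)
print(p_tau2)
```

With $\tau = 1$ the rare item keeps its raw share of 0.1. With $\tau = 2$ its share rises to 0.25, and larger $\tau$ pushes the distribution further toward uniform.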

### 7.2 Subword Regularization
To improve robustness, subword regularization introduces randomness into the tokenization process during training. For a word $w$, multiple tokenizations $T_1, T_2, \dots, T_k$ are sampled, and the model is trained on these variations to improve generalization.
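
One way to sketch the sampling step is to enumerate every segmentation the vocabulary allows and draw one in proportion to its probability. Exhaustive enumeration is only practical for short words, the token probabilities are invented, and the smoothing exponent `alpha` is included as an assumption about how the sampling distribution might be flattened.

```python
import math
import random
from typing import Dict, List


def all_segmentations(word: str, vocab) -> List[List[str]]:
    """Enumerate every way to split `word` into pieces found in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in all_segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def sample_segmentation(word: str, probs: Dict[str, float],
                        alpha: float = 1.0, rng=random) -> List[str]:
    """Sample a segmentation with weight P(T) ** alpha (smaller alpha is flatter)."""
    candidates = all_segmentations(word, probs)
    weights = [math.prod(probs[p] for p in seg) ** alpha for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical token probabilities, for illustration only.
probs = {"l": 0.02, "o": 0.03, "w": 0.02, "lo": 0.05, "low": 0.1}
rng = random.Random(0)
samples = [sample_segmentation("low", probs, alpha=0.5, rng=rng) for _ in range(5)]
print(all_segmentations("low", probs))
print(samples)
```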

### 7.3 Subword Models in Speech and Audio Processing
In speech processing, subword models are used to tokenize phonetic transcriptions or align audio features with text. For example, in automatic speech recognition (ASR), subword units help handle pronunciation variations and OOV words in spoken language.

---

## 8. Conclusion

Subword models are a cornerstone of modern NLP, enabling efficient, scalable, and generalizable representations of text. By decomposing words into smaller units, these models address challenges like OOV words, morphological richness, and vocabulary size. Algorithms like BPE, WordPiece, and Unigram Language Model provide robust frameworks for subword tokenization, each with unique strengths and trade-offs. Understanding the mathematical foundations and practical applications of these models is essential for researchers and AI scientists working on cutting-edge NLP systems.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
# -*- coding: utf-8 -*-
"""
Word Structure and Subword Models in NLP

This module explains and implements various tokenization strategies
with a focus on subword models used in modern NLP systems.
"""

import re
import collections
from typing import Dict, List, Tuple, Set, Optional


# ============================================================================ #
#                             TOKENIZATION CONCEPTS                            #
# ============================================================================ #
"""
WORD STRUCTURE AND TOKENIZATION

In Natural Language Processing (NLP), tokenization is the process of breaking
text into smaller units called tokens. Traditionally, this meant splitting text
into words, but modern approaches use more sophisticated methods:

1. Word-level tokenization: Splits text into words
   - Simple but has issues with out-of-vocabulary (OOV) words
   - Example: "I love programming" → ["I", "love", "programming"]

2. Character-level tokenization: Splits text into individual characters
   - No OOV issues but loses word-level semantics
   - Example: "Hello" → ["H", "e", "l", "l", "o"]

3. Subword tokenization: A middle ground approach
   - Breaks words into meaningful subword units
   - Handles rare words and morphologically rich languages better
   - Example: "unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"]

Subword models are crucial for modern NLP architectures like BERT, GPT, and
T5, as they balance vocabulary size and semantic representation.
"""


# ============================================================================ #
#                               WORD TOKENIZATION                              #
# ============================================================================ #

def word_tokenizer(text: str) -> List[str]:
    """
    A basic word-level tokenizer that splits text on whitespace and removes punctuation.

    Args:
        text: Input text string

    Returns:
        List of word tokens

    Examples:
        >>> word_tokenizer("Hello, world! How are you?")
        ['Hello', 'world', 'How', 'are', 'you']
    """
    # Remove punctuation and split on whitespace
    words = re.sub(r'[^\w\s]', '', text).split()
    return words


def word_tokenizer_with_punctuation(text: str) -> List[str]:
    """
    A word tokenizer that preserves punctuation as separate tokens.

    Args:
        text: Input text string

    Returns:
        List of word and punctuation tokens

    Examples:
        >>> word_tokenizer_with_punctuation("Hello, world! How are you?")
        ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']
    """
    # This regex splits on word boundaries, keeping punctuation as tokens
    tokens = re.findall(r'\b\w+\b|[^\w\s]', text)
    return tokens


# ============================================================================ #
#                            CHARACTER TOKENIZATION                            #
# ============================================================================ #

def char_tokenizer(text: str) -> List[str]:
    """
    A character-level tokenizer that splits text into individual characters.

    Args:
        text: Input text string

    Returns:
        List of character tokens

    Examples:
        >>> char_tokenizer("Hello")
        ['H', 'e', 'l', 'l', 'o']
    """
    return list(text)


def char_tokenizer_with_whitespace(text: str) -> List[str]:
    """
    A character-level tokenizer that keeps whitespace as separate tokens.

    Args:
        text: Input text string

    Returns:
        List of character tokens including whitespace

    Examples:
        >>> char_tokenizer_with_whitespace("Hi there")
        ['H', 'i', ' ', 't', 'h', 'e', 'r', 'e']
    """
    return list(text)


# ============================================================================ #
#                              SUBWORD TOKENIZATION                            #
# ============================================================================ #

"""
SUBWORD MODELS

Subword models break words into smaller meaningful units, offering a balance
between word and character tokenization. The main approaches are:

1. Byte Pair Encoding (BPE):
   - Iteratively merges the most frequent pair of bytes or characters
   - Used in GPT models and RoBERTa

2. WordPiece:
   - Similar to BPE but uses a different selection criterion based on likelihood
   - Used in BERT and DistilBERT

3. Unigram Language Model:
   - Probabilistic approach that starts with a large vocabulary and prunes it
   - Used in XLNet and ALBERT

4. SentencePiece:
   - Language-agnostic toolkit that treats the input as a raw stream of Unicode
     characters and implements both BPE and Unigram
   - Used in multilingual models like XLM-RoBERTa and mT5

These methods have revolutionized NLP by reducing vocabulary size while effectively
handling unseen words, morphologically rich languages, and multilingual scenarios.
"""


# ============================================================================ #
#                           BYTE PAIR ENCODING (BPE)                           #
# ============================================================================ #

class BytePairEncoder:
    """
    Implementation of the Byte Pair Encoding (BPE) algorithm for subword tokenization.

    BPE is a data compression algorithm that iteratively replaces the most frequent
    pair of consecutive bytes (or characters) with a single, unused byte (or a new symbol).
    In NLP, it's used to learn subword units from a corpus.

    Attributes:
        vocab_size: The target vocabulary size
        merges: List of character/subword merges in order of frequency
        vocab: Dictionary mapping tokens to IDs
    """

    def __init__(self, vocab_size: int = 10000):
        """
        Initialize the BPE tokenizer.

        Args:
            vocab_size: Target size of the vocabulary
        """
        self.vocab_size = vocab_size
        self.merges = []
        self.vocab = {}

    def _get_stats(self, vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:
        """
        Count frequency of adjacent symbol pairs in the vocabulary.

        Args:
            vocab: Dictionary mapping sequences to their frequency

        Returns:
            Dictionary of adjacent symbol pairs and their frequencies
        """
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def _merge_vocab(self, vocab: Dict[str, int], pair: Tuple[str, str]) -> Dict[str, int]:
        """
        Merge all occurrences of a symbol pair in the vocabulary.

        Args:
            vocab: Dictionary mapping sequences to their frequency
            pair: The pair of symbols to merge

        Returns:
            Updated vocabulary with the pair merged
        """
        new_vocab = {}
        bigram = ' '.join(pair)
        replacement = ''.join(pair)

        for word, freq in vocab.items():
            # Split the word into symbols (parts)
            parts = word.split()

            # Keep track of where we are in the word
            i = 0
            new_parts = []

            # Iterate through the parts to find and merge the pair
            while i < len(parts):
                # If we found the pair and we're not at the end
                if i < len(parts) - 1 and parts[i] == pair[0] and parts[i + 1] == pair[1]:
                    new_parts.append(replacement)
                    i += 2
                else:
                    new_parts.append(parts[i])
                    i += 1

            # Create the new word and add it to the vocabulary
            new_word = ' '.join(new_parts)
            new_vocab[new_word] = freq

        return new_vocab

    def fit(self, texts: List[str], num_merges: Optional[int] = None) -> None:
        """
        Learn BPE merges from a list of texts.

        Args:
            texts: List of text samples to learn from
            num_merges: Number of merge operations (if None, uses vocab_size)
        """
        # Initialize vocabulary with character-split words
        vocab = {}
        for text in texts:
            # Simple word tokenization
            words = re.findall(r'\b\w+\b', text.lower())
            for word in words:
                # Split each word into characters with spaces between
                char_word = ' '.join(list(word))
                if char_word in vocab:
                    vocab[char_word] += 1
                else:
                    vocab[char_word] = 1

        # Determine number of merges
        if num_merges is None:
            # Calculate initial vocab size (unique characters)
            unique_chars = set()
            for word in vocab:
                unique_chars.update(word.split())

            # Number of merges needed to reach target vocab size
            # (clamped at 0 when the characters alone exceed the target)
            num_merges = max(0, min(self.vocab_size - len(unique_chars), 10000))

        # Perform merge operations (discard merges from any earlier fit)
        self.merges = []
        for i in range(num_merges):
            pairs = self._get_stats(vocab)
            if not pairs:
                break

            # Find the most frequent pair
            best_pair = max(pairs, key=pairs.get)
            self.merges.append(best_pair)

            # Merge the pair in the vocabulary
            vocab = self._merge_vocab(vocab, best_pair)

            if i % 100 == 0:
                print(f"Merge {i}: {best_pair} -> {''.join(best_pair)}")

        # Build the final vocabulary
        self._build_vocab(vocab)

    def _build_vocab(self, vocab: Dict[str, int]) -> None:
        """
        Build the vocabulary from learned merges.

        Args:
            vocab: The current vocabulary after merges
        """
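        # Collect the tokens that survive in the segmented corpus
        tokens = set()
        for word in vocab:
            tokens.update(word.split())

        # Also keep every merge result and its two parts, so that base characters
        # and intermediate subwords that tokenize() can emit receive IDs too
        for first, second in self.merges:
            tokens.update((first, second, first + second))

        # Assign IDs to tokens (rebuilt from scratch so refitting starts clean)
        self.vocab = {token: i for i, token in enumerate(sorted(tokens))}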

    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text using learned BPE merges.

        Args:
            text: Text to tokenize

        Returns:
            List of subword tokens
        """
        # Simple word tokenization
        words = re.findall(r'\b\w+\b', text.lower())
        result = []

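        for word in words:
            # Start with a list of characters
            symbols = list(word)

            # Apply merges in the learned order, comparing whole symbols
            # (a plain str.replace could match across symbol boundaries)
            for first, second in self.merges:
                merged = []
                i = 0
                while i < len(symbols):
                    if i < len(symbols) - 1 and symbols[i] == first and symbols[i + 1] == second:
                        merged.append(first + second)
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                symbols = merged

            # Add resulting tokens to output
            result.extend(symbols)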

        return result

    def encode(self, text: str) -> List[int]:
        """
        Encode text into token IDs.

        Args:
            text: Text to encode

        Returns:
            List of token IDs
        """
        tokens = self.tokenize(text)
        return [self.vocab.get(token, self.vocab.get('<unk>', len(self.vocab))) for token in tokens]

    def decode(self, ids: List[int]) -> str:
        """
        Decode token IDs back to text.

        Args:
            ids: List of token IDs

        Returns:
            Decoded text
        """
        # Create a reverse mapping from IDs to tokens
        id_to_token = {v: k for k, v in self.vocab.items()}

        # Convert IDs to tokens
        tokens = [id_to_token.get(id, '<unk>') for id in ids]

        # Join tokens (this simplistic join doesn't handle spacing properly)
        return ''.join(tokens)


# ============================================================================ #
#                           WORDPIECE IMPLEMENTATION                           #
# ============================================================================ #

class WordPiece:
    """
    Implementation of WordPiece tokenization algorithm.

    WordPiece is similar to BPE but uses a different selection criterion.
    It chooses the pair that maximizes the likelihood of the training data
    after the merge.

    WordPiece also marks subword units differently from BPE: subwords that
    continue a word (every piece except the first) are prefixed with "##"
    to indicate that they do not start a word.

    Attributes:
        vocab_size: The target vocabulary size
        vocab: Dictionary mapping tokens to IDs
    """

    def __init__(self, vocab_size: int = 10000):
        """
        Initialize the WordPiece tokenizer.

        Args:
            vocab_size: Target size of the vocabulary
        """
        self.vocab_size = vocab_size
        self.vocab = {}
        # Special tokens
        self.special_tokens = {
            '<unk>': 0,  # Unknown token
            '<s>': 1,    # Start of sequence
            '</s>': 2,   # End of sequence
            '<pad>': 3,  # Padding token
        }

    def _get_word_counts(self, texts: List[str]) -> Dict[str, int]:
        """
        Count word frequencies in the corpus.

        Args:
            texts: List of text samples

        Returns:
            Dictionary of words and their frequencies
        """
        word_counts = collections.Counter()
        for text in texts:
            # Simple word tokenization
            words = re.findall(r'\b\w+\b', text.lower())
            word_counts.update(words)
        return word_counts

    def _split_word_to_chars(self, word: str) -> str:
        """
        Split a word into character level representation for WordPiece.

        Args:
            word: Input word

        Returns:
            WordPiece character representation with ## prefix
        """
        chars = list(word)
        # First character doesn't get ## prefix
        result = [chars[0]]
        # Rest get ## prefix
        result.extend([f"##{c}" for c in chars[1:]])
        return ' '.join(result)

    def _compute_pair_scores(self, vocab: Dict[str, int]) -> Dict[Tuple[str, str], float]:
        """
        Compute scores for each pair based on the WordPiece scoring function.

        Args:
            vocab: Dictionary mapping sequences to their frequency

        Returns:
            Dictionary of adjacent symbol pairs and their scores
        """
        # Count frequencies of adjacent pairs
        pair_counts = collections.defaultdict(int)
        total_pairs = 0

        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pair = (symbols[i], symbols[i + 1])
                pair_counts[pair] += freq
                total_pairs += freq

        # Calculate scores (simplified version of WordPiece scoring)
        scores = {}
        for pair, count in pair_counts.items():
            # WordPiece proper uses a likelihood-based score. This simplified
            # stand-in ranks pairs by relative frequency, with a small bonus
            # for pairs of continuation ("##") tokens.
            a, b = pair
            if a.startswith('##') and b.startswith('##'):
                # Encourage merging subword units
                scores[pair] = count * 1.1 / total_pairs
            else:
                scores[pair] = count / total_pairs

        return scores

    def fit(self, texts: List[str], num_merges: Optional[int] = None) -> None:
        """
        Learn WordPiece vocabulary from a list of texts.

        Args:
            texts: List of text samples to learn from
            num_merges: Number of merge operations (if None, uses vocab_size)
        """
        # Get word counts
        word_counts = self._get_word_counts(texts)

        # Initialize vocabulary with character-split words
        vocab = {}
        for word, count in word_counts.items():
            char_word = self._split_word_to_chars(word)
            vocab[char_word] = count

        # Determine number of merges
        if num_merges is None:
            # Calculate initial vocab size (unique characters and special tokens)
            unique_tokens = set()
            for word in vocab:
                unique_tokens.update(word.split())

            # Number of merges needed to reach target vocab size
            num_merges = min(self.vocab_size - len(unique_tokens) - len(self.special_tokens), 10000)

        # Perform merge operations
        merges = []
        for i in range(num_merges):
            # Calculate scores for all pairs
            scores = self._compute_pair_scores(vocab)

            if not scores:
                break

            # Find the best pair to merge
            best_pair = max(scores, key=scores.get)
            merges.append(best_pair)

            # Merge the pair in all words
            vocab = self._merge_vocab(vocab, best_pair)

            if i % 100 == 0:
                print(f"Merge {i}: {best_pair} -> {''.join(best_pair)}")

        # Build final vocabulary
        self._build_vocab(vocab, merges)

    def _merge_vocab(self, vocab: Dict[str, int], pair: Tuple[str, str]) -> Dict[str, int]:
        """
        Merge all occurrences of a symbol pair in the vocabulary.

        Args:
            vocab: Dictionary mapping sequences to their frequency
            pair: The pair of symbols to merge

        Returns:
            Updated vocabulary with the pair merged
        """
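        new_vocab = {}

        # Handle special case for ## prefix
        if pair[1].startswith('##'):
            # When merging with a ## token, remove the ## from the merged result
            replacement = pair[0] + pair[1][2:]
        else:
            replacement = ''.join(pair)

        for word, freq in vocab.items():
            # Merge on whole symbols: a plain str.replace could match across
            # symbol boundaries (e.g. "a ##b" inside "##xa ##b")
            symbols = word.split()
            new_symbols = []
            i = 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    new_symbols.append(replacement)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            new_vocab[' '.join(new_symbols)] = freq

        return new_vocab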

    def _build_vocab(self, vocab: Dict[str, int], merges: List[Tuple[str, str]]) -> None:
        """
        Build the vocabulary from learned merges.

        Args:
            vocab: The current vocabulary after merges
            merges: List of merge operations
        """
        # Start with special tokens
        final_vocab = self.special_tokens.copy()
        next_id = len(final_vocab)

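        # Get all unique tokens
        tokens = set()
        for word in vocab:
            tokens.update(word.split())

        # Also keep each merge result and its parts: tokenize() rebuilds long
        # tokens one merge at a time and needs the intermediate ones in the vocab
        for first, second in merges:
            merged = first + (second[2:] if second.startswith('##') else second)
            tokens.update((first, second, merged))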

        # Add tokens to vocabulary
        for token in sorted(tokens):
            final_vocab[token] = next_id
            next_id += 1

        self.vocab = final_vocab

    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text using learned WordPiece vocabulary.

        Args:
            text: Text to tokenize

        Returns:
            List of subword tokens
        """
        # Simple word tokenization
        words = re.findall(r'\b\w+\b', text.lower())
        result = []

        for word in words:
            # Start with character-level representation
            current = self._split_word_to_chars(word)
            current_tokens = current.split()

            # Try to merge tokens greedily
            changed = True
            while changed:
                changed = False
                for i in range(len(current_tokens) - 1):
                    # Check if this pair can be merged
                    pair = (current_tokens[i], current_tokens[i + 1])
                    merged = current_tokens[i]

                    # Handle ## prefix when merging
                    if current_tokens[i + 1].startswith('##'):
                        merged += current_tokens[i + 1][2:]
                    else:
                        merged += current_tokens[i + 1]

                    # If the merged token is in our vocabulary, merge it
                    if merged in self.vocab:
                        current_tokens[i] = merged
                        current_tokens.pop(i + 1)
                        changed = True
                        break

            # Add resulting tokens to output
            result.extend(current_tokens)

        return result

    def encode(self, text: str) -> List[int]:
        """
        Encode text into token IDs.

        Args:
            text: Text to encode

        Returns:
            List of token IDs
        """
        tokens = self.tokenize(text)
        return [self.vocab.get(token, self.vocab['<unk>']) for token in tokens]

    def decode(self, ids: List[int]) -> str:
        """
        Decode token IDs back to text.

        Args:
            ids: List of token IDs

        Returns:
            Decoded text
        """
        # Create a reverse mapping from IDs to tokens
        id_to_token = {v: k for k, v in self.vocab.items()}

        # Convert IDs to tokens
        tokens = [id_to_token.get(token_id, '<unk>') for token_id in ids]

        # Join tokens, handling ## prefix
        text = ''
        for token in tokens:
            if token.startswith('##'):
                # Remove ## and append without space
                text += token[2:]
            elif token in self.special_tokens.keys():
                # Skip special tokens
                continue
            else:
                # Add space before regular tokens (except at the beginning)
                if text:
                    text += ' '
                text += token

        return text


# ============================================================================ #
#                          UNIGRAM LANGUAGE MODEL                              #
# ============================================================================ #

class UnigramLM:
    """
    Implementation of the Unigram Language Model for subword tokenization.

    The Unigram model is a probabilistic approach that starts with a large
    vocabulary and iteratively removes tokens to maximize the likelihood
    of the training data.

    Attributes:
        vocab_size: The target vocabulary size
        vocab: Dictionary mapping tokens to IDs
        token_probs: Probability for each token
    """

    def __init__(self, vocab_size: int = 10000):
        """
        Initialize the Unigram Language Model tokenizer.

        Args:
            vocab_size: Target size of the vocabulary
        """
        self.vocab_size = vocab_size
        self.vocab = {}
        self.token_probs = {}
        # Special tokens
        self.special_tokens = {
            '<unk>': 0,
            '<s>': 1,
            '</s>': 2,
            '<pad>': 3,
        }

    def _initialize_vocab(self, texts: List[str], initial_vocab_size: int = 50000) -> Dict[str, float]:
        """
        Initialize a large vocabulary for pruning.

        Args:
            texts: List of text samples
            initial_vocab_size: Initial size of the vocabulary before pruning

        Returns:
            Dictionary mapping tokens to their probabilities
        """
        # Split into words first: tokenize() segments text word by word, so
        # substrings that span whitespace or punctuation could never be used
        words = []
        for text in texts:
            words.extend(re.findall(r'\b\w+\b', text.lower()))

        # Start with character-level tokenization
        char_vocab = collections.Counter()
        for word in words:
            char_vocab.update(word)

        # Add all characters to the vocabulary with their frequencies
        vocab = dict(char_vocab)

        # Generate all substrings of length 2-8 within each word
        substrings = collections.Counter()
        for word in words:
            for length in range(2, 9):
                for i in range(len(word) - length + 1):
                    substrings[word[i:i+length]] += 1

        # Add most common substrings to initial vocabulary
        max_substrs = max(0, initial_vocab_size - len(vocab) - len(self.special_tokens))
        for substr, count in substrings.most_common(max_substrs):
            vocab[substr] = count

        # Convert frequencies to probabilities
        total = sum(vocab.values())
        vocab_probs = {token: count / total for token, count in vocab.items()}

        return vocab_probs

    def _viterbi_segment(self, word: str, token_probs: Dict[str, float]) -> List[str]:
        """
        Segment a word into the most likely sequence of subwords using Viterbi algorithm.

        Args:
            word: Word to segment
            token_probs: Dictionary of token probabilities

        Returns:
            List of subword tokens
        """
        # Dynamic programming approach
        # best_score[i] = best score for segmenting word[:i]
        best_score = [0.0] * (len(word) + 1)
        best_score[0] = 1.0

        # best_edge[i] = best previous position for segmenting word[:i]
        best_edge = [0] * (len(word) + 1)

        # Tiny fallback probability for single characters missing from the
        # vocabulary, so one unseen character does not collapse the whole word
        # into a single unknown token
        unk_prob = 1e-10

        for i in range(1, len(word) + 1):
            for j in range(i):
                # Try segmenting at each possible position
                substr = word[j:i]
                if substr in token_probs:
                    prob = token_probs[substr]
                elif i - j == 1:
                    prob = unk_prob
                else:
                    continue
                # Score is the product of probabilities (log domain would be better)
                score = best_score[j] * prob
                if score > best_score[i]:
                    best_score[i] = score
                    best_edge[i] = j

        # Backtrack to get the segmentation
        tokens = []
        current = len(word)
        while current > 0:
            prev = best_edge[current]
            tokens.append(word[prev:current])
            current = prev

        # Reverse to get tokens in the correct order
        return tokens[::-1]

    def fit(self, texts: List[str]) -> None:
        """
        Learn the Unigram vocabulary from a list of texts.

        Args:
            texts: List of text samples to learn from
        """
        # Initialize with a large vocabulary
        token_probs = self._initialize_vocab(texts)

        # Prune the vocabulary down to the target size
        excess = len(token_probs) + len(self.special_tokens) - self.vocab_size
        if excess > 0:
            # Approximate each token's contribution as prob * length. This is a
            # simplified version of the actual algorithm, which measures the
            # change in corpus likelihood when a token is removed.
            # Single-character tokens are never pruned (to avoid degenerate
            # solutions where a word can no longer be segmented).
            candidates = sorted(
                (token for token in token_probs if len(token) > 1),
                key=lambda token: token_probs[token] * len(token),
            )

            # Remove the tokens that contribute least to the model. Renormalizing
            # rescales every score by the same factor, so ranking once is
            # equivalent to re-ranking after each removal.
            for token in candidates[:excess]:
                del token_probs[token]

            # Adjust probabilities to sum to 1
            total = sum(token_probs.values())
            token_probs = {token: prob / total for token, prob in token_probs.items()}

        self.token_probs = token_probs

        # Build final vocabulary
        self._build_vocab()

    def _build_vocab(self) -> None:
        """Build the vocabulary from token probabilities."""
        # Start with special tokens
        vocab = self.special_tokens.copy()
        next_id = len(vocab)

        # Add tokens sorted by probability
        for token, _ in sorted(self.token_probs.items(), key=lambda x: x[1], reverse=True):
            vocab[token] = next_id
            next_id += 1

        self.vocab = vocab

    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text using the Unigram model.

        Args:
            text: Text to tokenize

        Returns:
            List of subword tokens
        """
        result = []
        # Simple word tokenization first
        words = re.findall(r'\b\w+\b', text.lower())

        for word in words:
            # Use Viterbi algorithm to segment each word
            tokens = self._viterbi_segment(word, self.token_probs)
            result.extend(tokens)

        return result

    def encode(self, text: str) -> List[int]:
        """
        Encode text into token IDs.

        Args:
            text: Text to encode

        Returns:
            List of token IDs
        """
        tokens = self.tokenize(text)
        return [self.vocab.get(token, self.vocab['<unk>']) for token in tokens]

    def decode(self, ids: List[int]) -> str:
        """
        Decode token IDs back to text.

        Args:
            ids: List of token IDs

        Returns:
            Decoded text
        """
        # Create a reverse mapping from IDs to tokens
        id_to_token = {v: k for k, v in self.vocab.items()}

        # Convert IDs to tokens
        tokens = [id_to_token.get(token_id, '<unk>') for token_id in ids]

        # Simply join tokens (this is a simplification)
        return ''.join(tokens)


# ============================================================================ #
#                        EXAMPLES AND DEMONSTRATIONS                           #
# ============================================================================ #

def demonstrate_word_tokenization() -> None:
    """Demonstrate word-level tokenization."""
    text = "Hello world! This is an example of word-level tokenization. It doesn't handle contractions like don't very well."

    print("\nWORD TOKENIZATION EXAMPLE:")
    print("Original text:", text)
    print("Simple word tokenization:", word_tokenizer(text))
    print("Word tokenization with punctuation:", word_tokenizer_with_punctuation(text))


def demonstrate_char_tokenization() -> None:
    """Demonstrate character-level tokenization."""
    text = "Hello world!"

    print("\nCHARACTER TOKENIZATION EXAMPLE:")
    print("Original text:", text)
    print("Character tokens:", char_tokenizer(text))
    print("Character tokens with whitespace:", char_tokenizer_with_whitespace(text))


def demonstrate_bpe() -> None:
    """Demonstrate Byte Pair Encoding tokenization."""
    # Training corpus
    corpus = [
        "I love to program in Python",
        "Python programming is fun",
        "The Python programming language is versatile",
        "I enjoy learning new programming languages",
        "Programming helps solve complex problems",
        "Python has many programming libraries",
        "Learning to program is a valuable skill"
    ]

    # Test text
    test_text = "I love programming in Python language"

    print("\nBYTE PAIR ENCODING EXAMPLE:")
    print("Training corpus size:", len(corpus))

    # Train BPE
    bpe = BytePairEncoder(vocab_size=100)
    bpe.fit(corpus, num_merges=50)

    # Tokenize and encode test text
    tokens = bpe.tokenize(test_text)
    ids = bpe.encode(test_text)

    print(f"BPE tokens for '{test_text}': {tokens}")
    print("BPE token IDs:", ids)

    # Show how BPE handles unseen words
    unseen_text = "programmer programs programmable"
    unseen_tokens = bpe.tokenize(unseen_text)
    print(f"BPE tokens for unseen text '{unseen_text}': {unseen_tokens}")


def demonstrate_wordpiece() -> None:
    """Demonstrate WordPiece tokenization."""
    # Training corpus
    corpus = [
        "I love to program in Python",
        "Python programming is fun",
        "The Python programming language is versatile",
        "I enjoy learning new programming languages",
        "Programming helps solve complex problems",
        "Python has many programming libraries",
        "Learning to program is a valuable skill"
    ]

    # Test text
    test_text = "I love programming in Python language"

    print("\nWORDPIECE EXAMPLE:")

    # Train WordPiece
    wp = WordPiece(vocab_size=100)
    wp.fit(corpus, num_merges=50)

    # Tokenize and encode test text
    tokens = wp.tokenize(test_text)
    ids = wp.encode(test_text)

    print(f"WordPiece tokens for '{test_text}': {tokens}")
    print("WordPiece token IDs:", ids)

    # Show how WordPiece handles unseen words
    unseen_text = "programmer programs programmable"
    unseen_tokens = wp.tokenize(unseen_text)
    print(f"WordPiece tokens for unseen text '{unseen_text}': {unseen_tokens}")


def demonstrate_unigram() -> None:
    """Demonstrate Unigram Language Model tokenization."""
    # Training corpus
    corpus = [
        "I love to program in Python",
        "Python programming is fun",
        "The Python programming language is versatile",
        "I enjoy learning new programming languages",
        "Programming helps solve complex problems",
        "Python has many programming libraries",
        "Learning to program is a valuable skill"
    ]

    # Test text
    test_text = "I love programming in Python language"

    print("\nUNIGRAM LANGUAGE MODEL EXAMPLE:")

    # Train Unigram model
    unigram = UnigramLM(vocab_size=100)
    unigram.fit(corpus)

    # Tokenize and encode test text
    tokens = unigram.tokenize(test_text)
    ids = unigram.encode(test_text)

    print(f"Unigram tokens for '{test_text}': {tokens}")
    print("Unigram token IDs:", ids)

    # Show how Unigram handles unseen words
    unseen_text = "programmer programs programmable"
    unseen_tokens = unigram.tokenize(unseen_text)
    print(f"Unigram tokens for unseen text '{unseen_text}': {unseen_tokens}")


def compare_models() -> None:
    """Compare different tokenization approaches on the same text."""
    text = "The transformer architecture revolutionized natural language processing."

    print("\nCOMPARISON OF TOKENIZATION APPROACHES:")
    print("Original text:", text)

    # Word tokenization
    print("Word tokens:", word_tokenizer(text))

    # Character tokenization
    print("Character tokens:", char_tokenizer(text))

    # Train simple models on a small corpus
    corpus = [
        "The transformer architecture revolutionized natural language processing.",
        "Natural language processing models use transformers.",
        "Transformers process text effectively.",
        "Language models help with text generation.",
        "Processing natural language is complex."
    ]

    # BPE
    bpe = BytePairEncoder(vocab_size=50)
    bpe.fit(corpus, num_merges=20)
    print("BPE tokens:", bpe.tokenize(text))

    # WordPiece
    wp = WordPiece(vocab_size=50)
    wp.fit(corpus, num_merges=20)
    print("WordPiece tokens:", wp.tokenize(text))

    # Unigram
    unigram = UnigramLM(vocab_size=50)
    unigram.fit(corpus)
    print("Unigram tokens:", unigram.tokenize(text))


def real_world_examples() -> None:
    """Show how subword models are used in real-world scenarios."""
    print("\nREAL-WORLD APPLICATIONS OF SUBWORD MODELS:")

    print("\n1. Handling Out-of-Vocabulary Words:")
    print("   - Traditional word-level tokenization: 'unhappiness' -> ['<unk>'] (if not in vocab)")
    print("   - BPE tokenization: 'unhappiness' -> e.g. ['un', 'happiness'] (exact split depends on the learned merges)")

    print("\n2. Morphologically Rich Languages:")
    print("   - Finnish word 'epäjärjestelmällistyttämättömyydelläänsäkäänköhän'")
    print("   - Would be broken into meaningful subwords rather than a single OOV token")

    print("\n3. Code-Switching and Multilingual Text:")
    print("   - Text: 'I love machine learning. C'est très intéressant.'")
    print("   - Handles multiple languages with the same vocabulary")

    print("\n4. Emojis and Special Characters:")
    print("   - Text with emojis: 'I love 🐱 and 🐶'")
    print("   - Byte-level vocabularies can encode emojis without falling back to <unk>")

    print("\n5. Technical Terms and Domain-Specific Vocabulary:")
    print("   - Medical terms like 'electroencephalography'")
    print("   - Ideally split into units such as 'electro', 'encephalo', 'graphy' (actual splits depend on the training corpus)")


def advanced_use_cases() -> None:
    """Demonstrate some advanced use cases for subword tokenization."""
    print("\nADVANCED USE CASES:")

    print("\n1. Cross-Lingual Transfer Learning:")
    print("   - A subword vocabulary shared across languages helps a model trained on one language transfer to related ones")
    print("   - Example: English → German, Spanish → Portuguese")

    print("\n2. Handling Technical Vocabulary:")
    text = "The hyperparameter optimization of the convolutional neural network improved F1-score."
    print("   Original:", text)

    # Create small corpus with technical terms
    corpus = [
        "The hyperparameter optimization improved the model.",
        "Convolutional neural networks process images.",
        "F1-score is a metric for classification.",
        "Neural networks have hyperparameters.",
        "Optimization of models requires careful tuning."
    ]

    # Train BPE
    bpe = BytePairEncoder(vocab_size=50)
    bpe.fit(corpus, num_merges=20)
    print("   BPE tokens:", bpe.tokenize(text))

    print("\n3. Spelling Errors and Typos:")
    text_with_typo = "The transfomer architecture is powerful."  # Misspelled "transformer"
    print("   Original with typo:", text_with_typo)
    print("   BPE tokens:", bpe.tokenize(text_with_typo))

    print("\n4. Computational Efficiency:")
    print("   - Smaller vocabulary size (compared to word-level) leads to:")
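    print("     * Smaller embedding and output-layer matrices (reduced memory footprint)")
    print("     * A cheaper softmax over the vocabulary at each step")
    print("   - The trade-off is longer token sequences than word-level tokenization")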


# ============================================================================ #
#                               MAIN FUNCTION                                  #
# ============================================================================ #

def main() -> None:
    """
    Main function demonstrating word structure and subword models in NLP.
    """
    print("WORD STRUCTURE AND SUBWORD MODELS IN NLP\n")

    # Demonstrate different tokenization methods
    demonstrate_word_tokenization()
    demonstrate_char_tokenization()
    demonstrate_bpe()
    demonstrate_wordpiece()
    demonstrate_unigram()

    # Compare models on the same text
    compare_models()

    # Show real-world examples
    real_world_examples()

    # Advanced use cases
    advanced_use_cases()


if __name__ == "__main__":
    main()
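
The comment inside `UnigramLM._viterbi_segment` notes that scoring in the log domain would be better than multiplying raw probabilities. The block below is a minimal sketch of that idea, not part of the class above. The function name `viterbi_segment_log` and the `unk_logprob` penalty for unseen characters are assumptions made for illustration. Summing log-probabilities avoids the floating-point underflow that a product of many small probabilities runs into on long inputs.

```python
import math
from typing import Dict, List


def viterbi_segment_log(word: str, token_probs: Dict[str, float],
                        unk_logprob: float = -20.0) -> List[str]:
    """Most likely segmentation of `word`, scored with summed log-probabilities."""
    n = len(word)
    # best_score[i] = best log-probability of any segmentation of word[:i]
    best_score = [-math.inf] * (n + 1)
    best_score[0] = 0.0
    # best_edge[i] = start index of the last token in that best segmentation
    best_edge = [0] * (n + 1)

    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            prob = token_probs.get(piece, 0.0)
            if prob > 0:
                logp = math.log(prob)
            elif i - j == 1:
                # Assumed fallback: an unseen character gets a fixed penalty
                logp = unk_logprob
            else:
                continue
            score = best_score[j] + logp
            if score > best_score[i]:
                best_score[i] = score
                best_edge[i] = j

    # Walk the back-pointers from the end of the word to the start
    tokens = []
    current = n
    while current > 0:
        prev = best_edge[current]
        tokens.append(word[prev:current])
        current = prev
    return tokens[::-1]


# Toy probabilities chosen by hand for the example (not learned from data)
probs = {"un": 0.1, "happy": 0.05, "u": 0.01, "n": 0.01, "h": 0.01,
         "a": 0.01, "p": 0.01, "y": 0.01}
print(viterbi_segment_log("unhappy", probs))  # ['un', 'happy']
```

With these toy probabilities, the two-token path scores log(0.1) + log(0.05), which is far higher than the seven-character path, so the longer pieces win. A character that is absent from `token_probs` is still emitted as its own token, which an encoder can then map to `<unk>`.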

In [None]:
"""
Word Structure and Subword Models: A Comprehensive Guide

This file provides an in-depth exploration of word structure analysis and subword modeling techniques,
which are fundamental in Natural Language Processing (NLP), text processing, and machine learning.
We will cover word tokenization, stemming, lemmatization, and advanced subword modeling techniques
like Byte-Pair Encoding (BPE) and WordPiece. The code follows PEP 8, is organized into small
classes, and includes detailed explanations and examples.

Topics Covered:
1. Word Structure Basics
   - Tokenization
   - Stemming
   - Lemmatization
2. Subword Modeling
   - Byte-Pair Encoding (BPE)
   - WordPiece
3. Exception Handling and Edge Cases
4. Practical Examples and Use Cases

Each section includes detailed explanations, code implementations, and examples to ensure clarity.
"""

# Import necessary libraries
import re
from collections import Counter, defaultdict
from typing import List, Dict, Set, Tuple

# For stemming and lemmatization, we use NLTK (install with: pip install nltk)
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

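# Download required NLTK data (run once). nltk.download() normally reports
# failure by returning False rather than raising, so check both.
# Recent NLTK releases look up the tagger as 'averaged_perceptron_tagger_eng'.
for resource in ('wordnet', 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'):
    try:
        if not nltk.download(resource, quiet=True):
            print(f"Could not download NLTK resource: {resource}")
    except Exception as e:
        print(f"Error downloading NLTK data: {e}")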


# ============================================================================
# SECTION 1: WORD STRUCTURE BASICS
# ============================================================================

"""
Word Structure Basics:
Words are the fundamental units of language. Understanding their structure involves breaking them
into meaningful components (tokens, stems, lemmas, etc.). This section covers:
1. Tokenization: Splitting text into words or tokens.
2. Stemming: Reducing words to their root form (e.g., 'running' -> 'run').
3. Lemmatization: Reducing words to their dictionary form (e.g., 'better' -> 'good').
"""

class WordStructureAnalyzer:
    """
    A class to handle basic word structure analysis including tokenization,
    stemming, and lemmatization.
    """

    def __init__(self) -> None:
        """Initialize stemmer and lemmatizer."""
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()

    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text into words using regex.

        Args:
            text (str): Input text to tokenize.

        Returns:
            List[str]: List of tokens.

        Example:
            >>> analyzer = WordStructureAnalyzer()
            >>> analyzer.tokenize("Hello, world! How are you?")
            ['Hello', 'world', 'How', 'are', 'you']
        """
        # Match runs of word characters, keeping word-internal apostrophes
        # (e.g. "don't") but not trailing ones (e.g. "dogs'")
        tokens = re.findall(r"\w+(?:'\w+)*", text)
        return tokens

    def stem(self, word: str) -> str:
        """
        Apply stemming to a word.

        Args:
            word (str): Input word to stem.

        Returns:
            str: Stemmed word.

        Example:
            >>> analyzer = WordStructureAnalyzer()
            >>> analyzer.stem("running")
            'run'
        """
        return self.stemmer.stem(word)

    def lemmatize(self, word: str, pos: str = 'n') -> str:
        """
        Apply lemmatization to a word.

        Args:
            word (str): Input word to lemmatize.
            pos (str): Part of speech ('n' for noun, 'v' for verb, etc.).

        Returns:
            str: Lemmatized word.

        Example:
            >>> analyzer = WordStructureAnalyzer()
            >>> analyzer.lemmatize("better", pos='a')
            'good'
        """
        return self.lemmatizer.lemmatize(word, pos=pos)

    def get_wordnet_pos(self, word: str) -> str:
        """
        Helper function to map POS tag to WordNet POS tag for lemmatization.

        Args:
            word (str): Input word to tag.

        Returns:
            str: WordNet POS tag.
        """
        tag = nltk.pos_tag([word])[0][1][0].upper()
        tag_dict = {
            'J': wordnet.ADJ,
            'N': wordnet.NOUN,
            'V': wordnet.VERB,
            'R': wordnet.ADV
        }
        return tag_dict.get(tag, wordnet.NOUN)


# Example usage of WordStructureAnalyzer
def demonstrate_word_structure() -> None:
    """Demonstrate word structure analysis with examples."""
    analyzer = WordStructureAnalyzer()

    # Example text
    text = "Running faster is better than walking slowly!"

    # Tokenization
    print("\n=== Tokenization ===")
    tokens = analyzer.tokenize(text)
    print(f"Original text: {text}")
    print(f"Tokens: {tokens}")

    # Stemming
    print("\n=== Stemming ===")
    stemmed_words = [analyzer.stem(token) for token in tokens]
    print(f"Original tokens: {tokens}")
    print(f"Stemmed tokens: {stemmed_words}")

    # Lemmatization
    print("\n=== Lemmatization ===")
    lemmatized_words = [
        analyzer.lemmatize(token, analyzer.get_wordnet_pos(token))
        for token in tokens
    ]
    print(f"Original tokens: {tokens}")
    print(f"Lemmatized tokens: {lemmatized_words}")


# ============================================================================
# SECTION 2: SUBWORD MODELING
# ============================================================================

"""
Subword Modeling:
Subword modeling is a technique used in modern NLP models (e.g., BERT, GPT) to handle large
vocabularies and out-of-vocabulary (OOV) words. Instead of treating words as atomic units,
subword models break words into smaller, meaningful units (subwords). This section covers:
1. Byte-Pair Encoding (BPE): A data compression technique adapted for NLP.
2. WordPiece: A subword tokenization algorithm used in models like BERT.
"""

class SubwordModel:
    """
    A class to implement subword modeling techniques like BPE and WordPiece.
    """

    def __init__(self, vocab_size: int = 1000) -> None:
        """
        Initialize subword model.

        Args:
            vocab_size (int): Maximum size of the vocabulary.
        """
        self.vocab_size = vocab_size
        self.vocab: Set[str] = set()
        self.merges: List[Tuple[str, str]] = []

    def train_bpe(self, corpus: List[str]) -> None:
        """
        Train a Byte-Pair Encoding (BPE) model on a corpus.

        Args:
            corpus (List[str]): List of words in the corpus.

        Example:
            >>> model = SubwordModel(vocab_size=10)
            >>> corpus = ['low', 'lower', 'lowest', 'widest']
            >>> model.train_bpe(corpus)
        """
        # Reset any state left over from a previous training run
        self.vocab = set()
        self.merges = []

        # Step 1: Initialize vocabulary with character-level tokens
        word_freqs = Counter(corpus)
        word_splits: Dict[str, List[str]] = {}
        for word in word_freqs:
            chars = list(word) + ['</w>']  # Add end-of-word symbol
            word_splits[word] = chars
            self.vocab.update(chars)

        # Step 2: Perform BPE merges until vocab size is reached
        while len(self.vocab) < self.vocab_size:
            # Count pairs of adjacent symbols
            pair_freqs = defaultdict(int)
            for word, splits in word_splits.items():
                freq = word_freqs[word]
                for i in range(len(splits) - 1):
                    pair = (splits[i], splits[i + 1])
                    pair_freqs[pair] += freq

            if not pair_freqs:
                break

            # Find most frequent pair
            most_frequent_pair = max(pair_freqs, key=pair_freqs.get)
            self.merges.append(most_frequent_pair)

            # Merge the most frequent pair in the vocabulary
            new_subword = ''.join(most_frequent_pair)
            self.vocab.add(new_subword)

            # Update word splits by merging the pair
            for word in word_splits:
                splits = word_splits[word]
                new_splits = []
                i = 0
                while i < len(splits):
                    if (
                        i < len(splits) - 1
                        and (splits[i], splits[i + 1]) == most_frequent_pair
                    ):
                        new_splits.append(new_subword)
                        i += 2
                    else:
                        new_splits.append(splits[i])
                        i += 1
                word_splits[word] = new_splits

    def tokenize_bpe(self, word: str) -> List[str]:
        """
        Tokenize a word using trained BPE merges.

        Args:
            word (str): Input word to tokenize.

        Returns:
            List[str]: List of subword tokens.

        Example:
            >>> model = SubwordModel(vocab_size=10)
            >>> model.tokenize_bpe("lowest")  # untrained: falls back to characters
            ['l', 'o', 'w', 'e', 's', 't', '</w>']
        """
        if not self.merges:
            return list(word) + ['</w>']

        # Start with character-level split
        splits = list(word) + ['</w>']
        for pair in self.merges:
            new_subword = ''.join(pair)
            new_splits = []
            i = 0
            while i < len(splits):
                if (
                    i < len(splits) - 1
                    and (splits[i], splits[i + 1]) == pair
                ):
                    new_splits.append(new_subword)
                    i += 2
                else:
                    new_splits.append(splits[i])
                    i += 1
            splits = new_splits
        return splits

    def train_wordpiece(self, corpus: List[str]) -> None:
        """
        Train a WordPiece model on a corpus (simplified version).

        Args:
            corpus (List[str]): List of words in the corpus.
        """
        # A faithful WordPiece trainer scores candidate merges by how much they increase
        # the likelihood of the training data, rather than by raw pair frequency.
        # That scoring is not implemented here: this placeholder simply reuses BPE
        # training, so the learned merges are identical to those of train_bpe().
        self.train_bpe(corpus)  # Placeholder for actual WordPiece implementation

    def tokenize_wordpiece(self, word: str) -> List[str]:
        """
        Tokenize a word using trained WordPiece model.

        Args:
            word (str): Input word to tokenize.

        Returns:
            List[str]: List of subword tokens.
        """
        # Placeholder: actual WordPiece tokenizes by greedy longest-match-first
        # lookup in the vocabulary; here we reuse the BPE merge replay instead
        return self.tokenize_bpe(word)


# Example usage of SubwordModel
def demonstrate_subword_modeling() -> None:
    """Demonstrate subword modeling with examples."""
    model = SubwordModel(vocab_size=15)

    # Example corpus
    corpus = ['low', 'lower', 'lowest', 'widest', 'new', 'newer', 'newest']

    # Train BPE
    print("\n=== Byte-Pair Encoding (BPE) ===")
    model.train_bpe(corpus)
    print(f"Learned vocabulary: {model.vocab}")
    print(f"Merges: {model.merges}")

    # Tokenize a new word
    word = "lowest"
    tokens = model.tokenize_bpe(word)
    print(f"Word: {word}")
    print(f"BPE Tokens: {tokens}")

    # Train WordPiece (simplified)
    print("\n=== WordPiece (Simplified) ===")
    model.train_wordpiece(corpus)
    tokens = model.tokenize_wordpiece(word)
    print(f"Word: {word}")
    print(f"WordPiece Tokens: {tokens}")


# ============================================================================
# SECTION 3: EXCEPTION HANDLING AND EDGE CASES
# ============================================================================

"""
Exception Handling and Edge Cases:
When dealing with word structure and subword models, several edge cases must be handled:
1. Empty or invalid input text.
2. Special characters, numbers, or punctuation.
3. Out-of-vocabulary (OOV) words in subword models.
4. Language-specific nuances (e.g., contractions, compound words).
"""

class RobustWordAnalyzer:
    """
    A robust version of WordStructureAnalyzer and SubwordModel with exception handling.
    """

    def __init__(self, vocab_size: int = 1000) -> None:
        """Initialize analyzer with error handling."""
        self.analyzer = WordStructureAnalyzer()
        self.subword_model = SubwordModel(vocab_size=vocab_size)

    def tokenize_with_exceptions(self, text: str) -> List[str]:
        """
        Tokenize text with exception handling.

        Args:
            text (str): Input text to tokenize.

        Returns:
            List[str]: List of tokens, or an empty list if tokenization fails unexpectedly.

        Raises:
            ValueError: If input is not a string or is empty.
        """
        if not isinstance(text, str):
            raise ValueError("Input must be a string.")
        if not text.strip():
            raise ValueError("Input text cannot be empty.")
        try:
            return self.analyzer.tokenize(text)
        except Exception as e:
            print(f"Error during tokenization: {e}")
            return []

    def subword_tokenize_with_exceptions(self, word: str) -> List[str]:
        """
        Tokenize a word into subwords with exception handling.

        Args:
            word (str): Input word to tokenize.

        Returns:
            List[str]: List of subword tokens, or an empty list if tokenization fails unexpectedly.

        Raises:
            ValueError: If input is not a string or is empty.
        """
        if not isinstance(word, str):
            raise ValueError("Input must be a string.")
        if not word.strip():
            raise ValueError("Input word cannot be empty.")
        try:
            return self.subword_model.tokenize_bpe(word)
        except Exception as e:
            print(f"Error during subword tokenization: {e}")
            return []


# Example usage of RobustWordAnalyzer
def demonstrate_exception_handling() -> None:
    """Demonstrate exception handling with examples."""
    robust_analyzer = RobustWordAnalyzer(vocab_size=15)

    print("\n=== Exception Handling ===")

    # Valid case
    try:
        tokens = robust_analyzer.tokenize_with_exceptions("Hello, world!")
        print(f"Valid tokenization: {tokens}")
    except ValueError as e:
        print(f"Error: {e}")

    # Empty input
    try:
        tokens = robust_analyzer.tokenize_with_exceptions("")
        print(f"Tokens: {tokens}")
    except ValueError as e:
        print(f"Error: {e}")

    # Invalid input type
    try:
        tokens = robust_analyzer.tokenize_with_exceptions(123)
        print(f"Tokens: {tokens}")
    except ValueError as e:
        print(f"Error: {e}")

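    # Subword tokenization of an OOV word.
    # Note: train_bpe() is not called on this instance in this demo, so the
    # result reflects how tokenize_bpe behaves without learned merges.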
    try:
        tokens = robust_analyzer.subword_tokenize_with_exceptions("unseenword")
        print(f"Subword tokens for 'unseenword': {tokens}")
    except ValueError as e:
        print(f"Error: {e}")
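

# ----------------------------------------------------------------------------
# Illustrative sketch (not part of RobustWordAnalyzer).
# The wrapper methods above print the error and return an empty list when the
# tokenizer fails. One possible alternative, shown here under stated
# assumptions, is to log the failure and fall back to character-level tokens
# so that downstream code still receives a usable sequence. The names
# `sketch_safe_tokenize` and `_sketch_logger` are hypothetical, and raising
# TypeError for non-string input is a design choice of this sketch only; the
# class above raises ValueError in that case.
# ----------------------------------------------------------------------------
import logging
from typing import Callable, List

_sketch_logger = logging.getLogger("sketch_tokenizer")


def sketch_safe_tokenize(
    tokenize_fn: Callable[[str], List[str]], text: str
) -> List[str]:
    """Validate input, call tokenize_fn, and fall back to characters on error."""
    if not isinstance(text, str):
        raise TypeError("Input must be a string.")
    if not text.strip():
        raise ValueError("Input text cannot be empty.")
    try:
        return tokenize_fn(text)
    except Exception:
        # Record the full traceback instead of printing only the message.
        _sketch_logger.exception("Tokenizer failed; falling back to characters.")
        # Character-level fallback: every non-whitespace character is a token.
        return [ch for ch in text if not ch.isspace()]


# Usage (any callable mapping str -> List[str] can be passed in):
#     sketch_safe_tokenize(str.split, "Hello, world!")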


# ============================================================================
# SECTION 4: PRACTICAL EXAMPLES AND USE CASES
# ============================================================================

"""
Practical Examples and Use Cases:
This section provides real-world examples of how word structure analysis and subword modeling
can be applied in NLP tasks, such as text preprocessing, search engines, and machine translation.
"""

def practical_examples() -> None:
    """Demonstrate practical use cases with examples."""
    analyzer = WordStructureAnalyzer()
    subword_model = SubwordModel(vocab_size=20)

    print("\n=== Practical Examples ===")

    # Use Case 1: Text Preprocessing for Search Engine
    print("\nUse Case 1: Text Preprocessing for Search Engine")
    query = "Running faster is better than walking slowly!"
    tokens = analyzer.tokenize(query)
    stemmed_tokens = [analyzer.stem(token) for token in tokens]
    print(f"Query: {query}")
    print(f"Stemmed tokens for indexing: {stemmed_tokens}")

    # Use Case 2: Subword Modeling for Machine Translation
    print("\nUse Case 2: Subword Modeling for Machine Translation")
    corpus = ['play', 'playing', 'played', 'plays', 'player']
    subword_model.train_bpe(corpus)
    word = "players"
    subword_tokens = subword_model.tokenize_bpe(word)
    print(f"Word: {word}")
    print(f"Subword tokens for translation model: {subword_tokens}")

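    # Use Case 3: Handling Compound Words
    # The model was trained only on forms of "play", so "icecream" shares few
    # learned merges and will likely be split into small units.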
    print("\nUse Case 3: Handling Compound Words")
    compound_word = "icecream"
    subword_tokens = subword_model.tokenize_bpe(compound_word)
    print(f"Compound word: {compound_word}")
    print(f"Subword tokens: {subword_tokens}")
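

# ----------------------------------------------------------------------------
# Illustrative sketch (independent of SubwordModel, whose internals may differ).
# Use Cases 2 and 3 rely on BPE merges learned from a small corpus. The
# functions below are a minimal, self-contained version of that idea: count
# adjacent symbol pairs, merge the most frequent pair, repeat. Simplifying
# assumptions of this sketch: no end-of-word marker, every corpus word has
# frequency 1, and ties go to the pair encountered first. All `sketch_` names
# are hypothetical.
# ----------------------------------------------------------------------------
from collections import Counter
from typing import List, Tuple


def _sketch_merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def sketch_learn_bpe_merges(
    corpus: List[str], num_merges: int
) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    words = [list(word) for word in corpus]  # start from single characters
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols in words:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # nothing left to merge
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        words = [_sketch_merge_pair(symbols, best_pair) for symbols in words]
    return merges


def sketch_apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment `word` by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _sketch_merge_pair(symbols, pair)
    return symbols


# Usage with the corpus from Use Case 2 and three merges:
#     merges = sketch_learn_bpe_merges(
#         ['play', 'playing', 'played', 'plays', 'player'], num_merges=3)
#     sketch_apply_bpe("players", merges)   # -> ['play', 'e', 'r', 's']
#     sketch_apply_bpe("icecream", merges)  # -> single characters; no merge applies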


# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == "__main__":
    # Main execution block: run all demonstrations in order.
    print("=== Word Structure and Subword Models Demonstration ===")

    # Demonstrate word structure analysis
    demonstrate_word_structure()

    # Demonstrate subword modeling
    demonstrate_subword_modeling()

    # Demonstrate exception handling
    demonstrate_exception_handling()

    # Demonstrate practical examples
    practical_examples()

"""
Key Takeaways:
1. Word structure analysis (tokenization, stemming, lemmatization) is a core step in text preprocessing.
2. Subword modeling (BPE, WordPiece) keeps vocabularies compact while still representing rare and OOV words.
3. Input validation and exception handling let the pipeline reject bad input and recover from tokenizer errors.
4. The use cases above (search indexing, machine translation, compound words) show where these techniques apply.

This implementation aims to follow PEP 8, is organized into modular classes and functions,
and includes explanations and examples intended for learning.
"""