# Tokenization

## Introduction

**Tokenization** is one of the fundamental steps in natural language processing and in
deep learning applied to text. It systematically breaks a text sequence into smaller
units called "tokens", which can correspond to words, subwords, or even individual
characters, depending on the strategy employed.

The need for tokenization arises from a basic constraint of machine learning models:
they operate exclusively on numerical representations. While humans process language
naturally through linguistic symbols, neural network architectures require all
information to be encoded as numerical vectors. Tokenization therefore acts as a bridge
between the linguistic domain and the mathematical domain: it maps text to discrete
units that can be assigned numerical identifiers, allowing models to process, analyze,
and generate text.
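
To make the three granularities mentioned above concrete, here is a minimal sketch that tokenizes the same text at word, character, and subword level. The subword vocabulary `SUBWORD_VOCAB` and the helper `greedy_subwords` are hypothetical and chosen only for illustration; real subword tokenizers learn their vocabulary from data.

```python
text = "unbelievable results"

# Word level: split on whitespace.
word_tokens = text.split()

# Character level: every character (including the space) is a token.
char_tokens = list(text)

# Subword level: greedy longest-match against a small, hand-picked vocabulary.
# This vocabulary is an assumption made for the example, not a learned one.
SUBWORD_VOCAB = {"un", "believ", "able", "result", "s"}


def greedy_subwords(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


subword_tokens = [p for w in word_tokens for p in greedy_subwords(w, SUBWORD_VOCAB)]

print(word_tokens)      # ['unbelievable', 'results']
print(char_tokens[:6])  # ['u', 'n', 'b', 'e', 'l', 'i']
print(subword_tokens)   # ['un', 'believ', 'able', 'result', 's']
```

The finer the granularity, the smaller the vocabulary and the longer the resulting sequences.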

## Basic word-by-word tokenization

The most intuitive approach to tokenization splits text on whitespace, treating spaces
as the natural delimiters between words. Although simple, this method illustrates the
basic principles of tokenization and lays the foundation for more sophisticated
techniques.

```python
# Basic tokenization example
texto = "I like machine learning"

# Split on whitespace with Python's str.split()
tokens = texto.split()
print("Original text:", texto)
print("Tokens:", tokens)
print("Number of tokens:", len(tokens))
```


Expected output:

```
Original text: I like machine learning
Tokens: ['I', 'like', 'machine', 'learning']
Number of tokens: 4
```
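
One limitation of whitespace splitting becomes visible as soon as the text contains punctuation: the marks stay attached to the neighbouring words, so `world` and `world!` would count as different tokens. The sketch below contrasts `split()` with a simple regular expression; the pattern is one possible choice made for this example, not a standard.

```python
import re

text = "Hello, world! Tokenization isn't trivial."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# Alternative (an assumption for this example): runs of word characters,
# optionally with an internal apostrophe, and single punctuation marks
# each become separate tokens.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(whitespace_tokens)
# ['Hello,', 'world!', 'Tokenization', "isn't", 'trivial.']
print(regex_tokens)
# ['Hello', ',', 'world', '!', 'Tokenization', "isn't", 'trivial', '.']
```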

### Building a simple tokenizer

Going beyond simple text splitting requires a system that not only segments words but
also establishes a one-to-one correspondence between each unique word and a numerical
identifier. This mapping makes it possible to represent any text as a sequence of
numbers that machine learning models can process.

A basic tokenizer maintains two complementary data structures: a dictionary that maps
words to numbers and another that performs the inverse transformation. It also needs a
mechanism, here a simple counter, that assigns a unique identifier to each new word
encountered during training.

```python
class SimpleTokenizer:
    """
    A basic tokenizer that splits text into words
    and assigns them unique numbers.
    """

    def __init__(self):
        # Dictionary to store word -> number
        self.word_to_number = {}
        # Inverse dictionary: number -> word
        self.number_to_word = {}
        # Counter to assign numbers
        self.next_number = 0

    def train(self, texts):
        """
        Learns which words exist in the texts.

        Args:
            texts: List of strings with training texts
        """
        for text in texts:
            # Convert to lowercase and split into words
            words = text.lower().split()

            # For each word, if we haven't seen it, assign it a number
            for word in words:
                if word not in self.word_to_number:
                    self.word_to_number[word] = self.next_number
                    self.number_to_word[self.next_number] = word
                    self.next_number += 1

        print(f"Learned vocabulary: {len(self.word_to_number)} words")

    def encode(self, text):
        """
        Converts text into a list of numbers.
        """
        words = text.lower().split()
        numbers = []

        for word in words:
            if word in self.word_to_number:
                numbers.append(self.word_to_number[word])
            else:
                # If we don't know the word, use -1
                numbers.append(-1)

        return numbers

    def decode(self, numbers):
        """
        Converts a list of numbers back to text.
        """
        words = []

        for number in numbers:
            if number in self.number_to_word:
                words.append(self.number_to_word[number])
            else:
                words.append("[UNKNOWN]")

        return " ".join(words)

    def show_vocabulary(self):
        """Shows all words the tokenizer knows."""
        print("\nComplete vocabulary:")
        print("-" * 40)
        for word, number in sorted(self.word_to_number.items(), key=lambda x: x[1]):
            print(f"{number:3d} -> {word}")


# Usage example
print("=" * 50)
print("EXAMPLE 1: Simple Tokenizer")
print("=" * 50)

# Training texts
training_texts = ["i like programming", "i like learning", "programming is fun"]

# Create and train the tokenizer
tokenizer = SimpleTokenizer()
tokenizer.train(training_texts)

# Show the learned vocabulary
tokenizer.show_vocabulary()

# Test encoding
new_text = "i like learning programming"
print(f"\nText to encode: '{new_text}'")

encoded = tokenizer.encode(new_text)
print(f"Encoded text: {encoded}")

decoded = tokenizer.decode(encoded)
print(f"Decoded text: '{decoded}'")

# Test with unknown word
unknown_text = "i like cooking"
print(f"\nText with new word: '{unknown_text}'")
encoded_unk = tokenizer.encode(unknown_text)
print(f"Encoded: {encoded_unk}")
print("Note: -1 indicates unknown word")
```

Expected output:

```
==================================================
EXAMPLE 1: Simple Tokenizer
==================================================
Learned vocabulary: 6 words

Complete vocabulary:
----------------------------------------
  0 -> i
  1 -> like
  2 -> programming
  3 -> learning
  4 -> is
  5 -> fun

Text to encode: 'i like learning programming'
Encoded text: [0, 1, 3, 2]
Decoded text: 'i like learning programming'

Text with new word: 'i like cooking'
Encoded: [0, 1, -1]
Note: -1 indicates unknown word
```
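
Using `-1` for unknown words is convenient in a first version, but it becomes fragile once the IDs are used as indices into a table of vectors. The sketch below uses a hypothetical three-row `embedding_table` to show that plain Python indexing treats `-1` as "last element"; other libraries may instead raise an error. Either way, `-1` does not refer to a dedicated entry, which is one reason to reserve a real ID for unknown words.

```python
# A toy embedding table: one row (vector) per token ID, for a 3-word vocabulary.
# The values are arbitrary and only serve the illustration.
embedding_table = [
    [0.1, 0.2],  # ID 0
    [0.3, 0.4],  # ID 1
    [0.5, 0.6],  # ID 2
]

encoded = [0, 1, -1]  # -1 marks an unknown word, as in SimpleTokenizer

# Python interprets -1 as "last element", so the unknown word silently
# receives the vector of ID 2 instead of triggering an error.
vectors = [embedding_table[token_id] for token_id in encoded]

print(vectors[-1])                         # [0.5, 0.6]
print(vectors[-1] is embedding_table[2])   # True
```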


## Special tokens and vocabulary management

Tokenization in real-world natural language processing applications involves more than
simple word-to-number conversion. Several situations require special treatment: words
that did not appear during training, sequences of different lengths that must be
processed in batches, and the need to explicitly mark the beginning and end of
sequences.

To address these issues, modern tokenization systems incorporate special tokens with
specific functions. The padding token brings sequences to a uniform length, which
facilitates parallel processing. The unknown-word token provides a consistent
representation for terms not seen during training. The start-of-sequence and
end-of-sequence tokens let models explicitly identify the boundaries of each input,
which is especially relevant in text generation and machine translation tasks.

```python
class TokenizerWithSpecials:
    """
    Tokenizer that handles unknown words and padding.
    """

    def __init__(self):
        # Special tokens
        self.PAD = "[PAD]"  # For padding short sequences
        self.UNK = "[UNK]"  # For unknown words
        self.SOS = "[SOS]"  # Start of Sequence
        self.EOS = "[EOS]"  # End of Sequence

        # Initialize dictionaries with special tokens
        self.word_to_number = {self.PAD: 0, self.UNK: 1, self.SOS: 2, self.EOS: 3}

        self.number_to_word = {0: self.PAD, 1: self.UNK, 2: self.SOS, 3: self.EOS}

        self.next_number = 4

    def train(self, texts):
        """Learns the vocabulary from texts."""
        for text in texts:
            words = text.lower().split()

            for word in words:
                if word not in self.word_to_number:
                    self.word_to_number[word] = self.next_number
                    self.number_to_word[self.next_number] = word
                    self.next_number += 1

        print(f"Vocabulary: {len(self.word_to_number)} words")
        print("  - Special words: 4")
        print(f"  - Normal words: {len(self.word_to_number) - 4}")

    def encode(self, text, add_special=True, fixed_length=None):
        """
        Encodes text with advanced options.

        Args:
            text: Text to encode
            add_special: Whether to add [SOS] and [EOS]
            fixed_length: If specified, adjusts to this length
        """
        words = text.lower().split()

        # Convert words to numbers
        numbers = []
        for word in words:
            if word in self.word_to_number:
                numbers.append(self.word_to_number[word])
            else:
                numbers.append(self.word_to_number[self.UNK])

        # Add start and end tokens if requested
        if add_special:
            numbers = (
                [self.word_to_number[self.SOS]]
                + numbers
                + [self.word_to_number[self.EOS]]
            )

        # Adjust to fixed length if specified
        if fixed_length is not None:
            if len(numbers) < fixed_length:
                # Pad with PAD
                numbers = numbers + [self.word_to_number[self.PAD]] * (
                    fixed_length - len(numbers)
                )
            else:
                # Truncate
                numbers = numbers[:fixed_length]

        return numbers

    def decode(self, numbers, remove_special=True):
        """Decodes numbers to text."""
        words = []

        for number in numbers:
            if number in self.number_to_word:
                word = self.number_to_word[number]

                # Skip special tokens if requested
                if remove_special and word in [self.PAD, self.UNK, self.SOS, self.EOS]:
                    continue

                words.append(word)

        return " ".join(words)


# Usage examples
print("\n" + "=" * 50)
print("EXAMPLE 2: Special Tokens and Padding")
print("=" * 50)

# Train
texts = ["hello world", "python is great", "i like learning"]

tokenizer_v2 = TokenizerWithSpecials()
tokenizer_v2.train(texts)

# Encode without fixed length
text1 = "hello python"
print(f"\nText 1: '{text1}'")
cod1 = tokenizer_v2.encode(text1)
print(f"Encoded: {cod1}")
print(f"Length: {len(cod1)}")

# Encode with fixed length
text2 = "i like"
print(f"\nText 2: '{text2}'")
cod2 = tokenizer_v2.encode(text2, fixed_length=10)
print(f"Encoded (fixed length=10): {cod2}")
print(f"Length: {len(cod2)}")

# Decode
print(f"Decoded: '{tokenizer_v2.decode(cod2)}'")
```


Expected output:

```
==================================================
EXAMPLE 2: Special Tokens and Padding
==================================================
Vocabulary: 12 words
  - Special words: 4
  - Normal words: 8

Text 1: 'hello python'
Encoded: [2, 4, 6, 3]
Length: 4

Text 2: 'i like'
Encoded (fixed length=10): [2, 9, 10, 3, 0, 0, 0, 0, 0, 0]
Length: 10
Decoded: 'i like'
```
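
One detail of `encode` is worth noting: truncation happens after `[EOS]` has been appended, so an input longer than `fixed_length` loses its end marker. The sketch below shows one possible alternative that reserves the last slot for `[EOS]`. The function `fit_to_length` and its `keep_eos` parameter are hypothetical; the IDs mirror those assigned by `TokenizerWithSpecials`.

```python
PAD_ID, SOS_ID, EOS_ID = 0, 2, 3  # same IDs as in TokenizerWithSpecials


def fit_to_length(ids, fixed_length, keep_eos=True):
    """Pad or truncate a list of token IDs to exactly fixed_length items."""
    if len(ids) < fixed_length:
        return ids + [PAD_ID] * (fixed_length - len(ids))
    truncated = ids[:fixed_length]
    # If the input ended with [EOS], make sure the truncated copy does too.
    if keep_eos and fixed_length > 0 and ids and ids[-1] == EOS_ID:
        truncated[-1] = EOS_ID
    return truncated


long_ids = [SOS_ID, 9, 10, 11, 6, EOS_ID]  # "[SOS] i like learning python [EOS]"

print(fit_to_length(long_ids, 4, keep_eos=False))  # [2, 9, 10, 11]
print(fit_to_length(long_ids, 4))                  # [2, 9, 10, 3]
print(fit_to_length([SOS_ID, 4, EOS_ID], 5))       # [2, 4, 3, 0, 0]
```

Whether the end marker should survive truncation depends on the task; keeping it costs one content token per truncated sequence.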


## Visualization of the tokenization process

The tokenization process is easier to understand when the transformations applied to the
text are shown explicitly at each stage. Observing how a sequence of words is converted
into a sequence of numerical identifiers, and how those identifiers are decoded back
into text, helps identify potential problems and understand the system's behavior with
different inputs. Note that the round trip is not always lossless: words missing from
the vocabulary are encoded as `[UNK]` and, with `remove_special=True`, are dropped from
the decoded text.

```python
def visualize_tokenization(tokenizer, texts):
    """
    Visually shows how each text is tokenized.
    """
    print("\n" + "=" * 60)
    print("TOKENIZATION VISUALIZATION")
    print("=" * 60)

    for i, text in enumerate(texts, 1):
        print(f"\n{i}. Original text:")
        print(f"   '{text}'")

        # Encode
        encoded = tokenizer.encode(text, add_special=True)

        print("\n   Tokens (numbers):")
        print(f"   {encoded}")

        print("\n   Visual representation:")
        # Show each token with its word
        words = ["[SOS]"] + text.lower().split() + ["[EOS]"]
        for word, number in zip(words, encoded):
            print(f"   {word:15} -> {number:3}")

        print("\n   Decoded:")
        print(f"   '{tokenizer.decode(encoded)}'")
        print("-" * 60)


# Visualization example
example_texts = [
    "python is great",
    "i like programming",
    "hello artificial intelligence",
]

visualize_tokenization(tokenizer_v2, example_texts)
```


Expected output:

```
============================================================
TOKENIZATION VISUALIZATION
============================================================

1. Original text:
   'python is great'

   Tokens (numbers):
   [2, 6, 7, 8, 3]

   Visual representation:
   [SOS]           ->   2
   python          ->   6
   is              ->   7
   great           ->   8
   [EOS]           ->   3

   Decoded:
   'python is great'
------------------------------------------------------------

2. Original text:
   'i like programming'

   Tokens (numbers):
   [2, 9, 10, 1, 3]

   Visual representation:
   [SOS]           ->   2
   i               ->   9
   like            ->  10
   programming     ->   1
   [EOS]           ->   3

   Decoded:
   'i like'
------------------------------------------------------------

3. Original text:
   'hello artificial intelligence'

   Tokens (numbers):
   [2, 4, 1, 1, 3]

   Visual representation:
   [SOS]           ->   2
   hello           ->   4
   artificial      ->   1
   intelligence    ->   1
   [EOS]           ->   3

   Decoded:
   'hello'
------------------------------------------------------------
```
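
As the second and third examples show, decoding with `remove_special=True` silently drops `[UNK]`, so the decoded text gives no hint that words were lost. A possible variant is sketched below, under the assumption that structural tokens (`[PAD]`, `[SOS]`, `[EOS]`) should be removed while `[UNK]` stays visible. The function `decode_keep_unk` and the small `id_to_word` table are hypothetical stand-ins for the tokenizer's `number_to_word` mapping.

```python
# Structural tokens carry no content; [UNK] stands in for a real word.
STRUCTURAL = {"[PAD]", "[SOS]", "[EOS]"}

# Excerpt of the mapping learned by tokenizer_v2 (only the IDs used here).
id_to_word = {
    0: "[PAD]", 1: "[UNK]", 2: "[SOS]", 3: "[EOS]",
    4: "hello", 9: "i", 10: "like",
}


def decode_keep_unk(ids, id_to_word):
    """Decode IDs to text, dropping structural tokens but keeping [UNK]."""
    words = []
    for token_id in ids:
        # IDs missing from the table are also shown as [UNK].
        word = id_to_word.get(token_id, "[UNK]")
        if word not in STRUCTURAL:
            words.append(word)
    return " ".join(words)


print(decode_keep_unk([2, 9, 10, 1, 3], id_to_word))  # i like [UNK]
print(decode_keep_unk([2, 4, 1, 1, 3], id_to_word))   # hello [UNK] [UNK]
```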

## Length normalization through padding

One of the most relevant technical aspects of processing text sequences is handling
variable lengths. Natural texts vary considerably in length, from brief phrases of a few
words to long paragraphs with dozens or hundreds of tokens. However, neural network
architectures, especially when processing multiple examples simultaneously in batches,
require all input sequences to have uniform dimensions.

Padding is the standard solution to this problem. Shorter sequences are artificially
extended to a target length, typically that of the longest sequence in the batch, by
appending special padding tokens. These tokens carry no content and are meant to be
ignored during processing; in practice this is usually enforced with a mask that marks
which positions hold real tokens. Conversely, when a sequence exceeds the maximum
allowed length, truncation is applied, keeping only the first tokens up to the
established limit.

```python
def compare_lengths(tokenizer, texts):
    """
    Compares the lengths of different texts and shows
    how padding brings them to a uniform length.
    """
    print("\n" + "=" * 60)
    print("LENGTH COMPARISON")
    print("=" * 60)

    # Find maximum length
    lengths = []
    for text in texts:
        cod = tokenizer.encode(text, add_special=True)
        lengths.append(len(cod))

    max_length = max(lengths)

    print(f"\nMaximum length found: {max_length} tokens")
    print("\nComparison:")
    print("-" * 60)

    for text in texts:
        # Without padding
        without_padding = tokenizer.encode(text, add_special=True)

        # With padding
        with_padding = tokenizer.encode(text, add_special=True, fixed_length=max_length)

        print(f"\nText: '{text}'")
        print(f"Without padding (length {len(without_padding)}): {without_padding}")
        print(f"With padding (length {len(with_padding)}): {with_padding}")

        # Count how many PADs were added (difference in length,
        # which does not depend on the numeric ID of [PAD])
        num_pads = len(with_padding) - len(without_padding)
        print(f"PADs added: {num_pads}")


# Comparison example
different_texts = ["hello", "python is great", "i like learning programming in python"]

compare_lengths(tokenizer_v2, different_texts)
```


Expected output:

```
============================================================
LENGTH COMPARISON
============================================================

Maximum length found: 8 tokens

Comparison:
------------------------------------------------------------

Text: 'hello'
Without padding (length 3): [2, 4, 3]
With padding (length 8): [2, 4, 3, 0, 0, 0, 0, 0]
PADs added: 5

Text: 'python is great'
Without padding (length 5): [2, 6, 7, 8, 3]
With padding (length 8): [2, 6, 7, 8, 3, 0, 0, 0]
PADs added: 3

Text: 'i like learning programming in python'
Without padding (length 8): [2, 9, 10, 11, 1, 1, 6, 3]
With padding (length 8): [2, 9, 10, 11, 1, 1, 6, 3]
PADs added: 0
```
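
Padded batches are commonly accompanied by a mask that records which positions hold real tokens and which hold padding. The sketch below builds both in plain Python, assuming PAD ID 0 as in `TokenizerWithSpecials`. The helper `pad_batch` is hypothetical; deep learning frameworks typically provide their own equivalents.

```python
PAD_ID = 0  # same ID that TokenizerWithSpecials assigns to [PAD]


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build a matching mask."""
    max_length = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        num_pads = max_length - len(seq)
        padded.append(seq + [pad_id] * num_pads)
        # 1 marks a real token, 0 marks a padding position to be ignored.
        mask.append([1] * len(seq) + [0] * num_pads)
    return padded, mask


batch = [[2, 4, 3], [2, 6, 7, 8, 3]]  # "hello" and "python is great"
padded, mask = pad_batch(batch)

print(padded)  # [[2, 4, 3, 0, 0], [2, 6, 7, 8, 3]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Deriving the mask from the original lengths, rather than from the positions equal to `PAD_ID`, keeps it correct regardless of which numeric ID is used for padding.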


## Practical application: Review processing system

All the concepts presented so far come together in a complete system capable of
processing real-world text. A representative use case is product review analysis, where
the goal is to transform opinions expressed in natural language into numerical
representations that can then feed sentiment classification models or other analysis
tasks.

This system combines the tokenizer with special token management, length normalization,
and a vocabulary built from a training set. The resulting architecture processes new
reviews consistently, applying the same transformations that are used when training the
machine learning model.

```python
class ReviewSystem:
    """
    Complete system for processing product reviews.
    """

    def __init__(self):
        self.tokenizer = TokenizerWithSpecials()
        self.reviews = []
        self.labels = []  # 1 = positive, 0 = negative

    def add_review(self, text, is_positive):
        """Adds a review to the system."""
        self.reviews.append(text)
        self.labels.append(1 if is_positive else 0)

    def train(self):
        """Trains the tokenizer with all reviews."""
        print("Training tokenizer with reviews...")
        self.tokenizer.train(self.reviews)

    def process_review(self, text, length=15):
        """Processes a new review."""
        print(f"\nProcessing: '{text}'")
        print("-" * 50)

        # Tokenize
        tokens = text.lower().split()
        print(f"1. Split into words: {tokens}")

        # Encode
        encoded = self.tokenizer.encode(text, add_special=True, fixed_length=length)
        print(f"2. Convert to numbers: {encoded}")

        # Decode
        decoded = self.tokenizer.decode(encoded)
        print(f"3. Decode: '{decoded}'")

        return encoded

    def show_statistics(self):
        """Shows dataset statistics."""
        print("\n" + "=" * 60)
        print("SYSTEM STATISTICS")
        print("=" * 60)

        print(f"\nTotal reviews: {len(self.reviews)}")
        print(f"Positive reviews: {sum(self.labels)}")
        print(f"Negative reviews: {len(self.labels) - sum(self.labels)}")
        print(f"Vocabulary: {len(self.tokenizer.word_to_number)} words")

        # Lengths
        lengths = [len(r.split()) for r in self.reviews]
        print(f"\nAverage length: {sum(lengths)/len(lengths):.1f} words")
        print(f"Minimum length: {min(lengths)} words")
        print(f"Maximum length: {max(lengths)} words")


# Create the system
print("=" * 60)
print("MINI PROJECT: Review System")
print("=" * 60)

system = ReviewSystem()

# Add training reviews
training_reviews = [
    ("this product is excellent", True),
    ("very bad quality do not recommend", False),
    ("incredible i love it", True),
    ("terrible experience", False),
    ("perfect product arrived fast", True),
    ("does not work properly", False),
]

print("\nAdding reviews to the system...")
for text, is_positive in training_reviews:
    system.add_review(text, is_positive)
    sentiment = "POSITIVE" if is_positive else "NEGATIVE"
    print(f"  - [{sentiment}] {text}")

# Train
system.train()

# Show statistics
system.show_statistics()

# Process new reviews
print("\n" + "=" * 60)
print("PROCESSING NEW REVIEWS")
print("=" * 60)

new_reviews = [
    "excellent product very good",
    "bad experience terrible",
    "perfect recommend",
]

for review in new_reviews:
    system.process_review(review)
```

MINI PROJECT: Review System

Adding reviews to the system...
  - [POSITIVE] this product is excellent
  - [NEGATIVE] very bad quality do not recommend
  - [POSITIVE] incredible i love it
  - [NEGATIVE] terrible experience
  - [POSITIVE] perfect product arrived fast
  - [NEGATIVE] does not work properly
Training tokenizer with reviews...
Vocabulary: 26 words
  - Special words: 4
  - Normal words: 22

SYSTEM STATISTICS

Total reviews: 6
Positive reviews: 3
Negative reviews: 3
Vocabulary: 26 words

Average length: 4.0 words
Minimum length: 2 words
Maximum length: 6 words

PROCESSING NEW REVIEWS

Processing: 'excellent product very good'
--------------------------------------------------
1. Split into words: ['excellent', 'product', 'very', 'good']
2. Convert to numbers: [2, 7, 5, 8, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3. Decode: 'excellent product very'

Processing: 'bad experience terrible'
--------------------------------------------------
1. Split into words: ['bad', 'experience', 'terrib