
### Explanation of the BPETokenizer Class and Results

We've implemented a custom `BPETokenizer` class that encapsulates the BPE logic. Here's a breakdown of its key components and how it operates based on the executed code:

1.  **`__init__(self, text, num_merges)`**: The constructor initializes the tokenizer with the training text and the desired number of merges. It then calls the `train()` method.

2.  **`_get_initial_tokens()`**: This helper method prepares the input text by splitting it into words and converting each word into a list of characters, appending a special `</w>` token to mark word boundaries. This character-level representation forms the initial vocabulary.

3.  **`_get_stats(word_tokens)`**: This function iterates through the current list of word tokens (which can be individual characters or merged subword units) and counts the frequency of all adjacent pairs.

4.  **`_merge_pair(word_tokens, pair, new_token)`**: This function takes the current word tokens, the most frequent `pair` identified, and a `new_token` (which is the concatenation of the pair). It then replaces all occurrences of that `pair` in the `word_tokens` list with the `new_token`.

5.  **`train()`**: This is the core training method. It runs up to `num_merges` merge iterations, stopping early if no adjacent pairs remain. In each iteration:
    *   It calls `_get_stats()` to find the most frequent pair.
    *   It creates a `new_token` from this pair.
    *   It updates the `merges` dictionary to record this operation (`(pair): new_token`).
    *   It calls `_merge_pair()` to apply the merge across all current word tokens.
    *   The `new_token` is added to the `vocabulary`.
    
    The output `Learned merges` shows the sequence of pairs that were merged and their resulting tokens, e.g., `('a', 't'): 'at'`.

6.  **`tokenize(text_to_tokenize)`**: This method takes new text and applies every learned merge in the order it was discovered during training. It splits the text into words, breaks each word into character-level tokens (with `</w>` appended), and then, for each merge in turn, replaces every adjacent occurrence of that pair with its merged token. The `Encoded tokens` output demonstrates this: "cat" becomes the single token `cat</w>`, while "hat." becomes `hat` followed by `.</w>`, because none of the ten learned merges joins `hat` to the token after it.

7.  **`decode(tokens)`**: This method reverses the tokenization by joining the tokens into a single string, replacing each `</w>` marker with a space, and stripping the trailing space. The `Decoded text` output matches the input for both test sentences. Because `tokenize` splits on whitespace, the round trip is exact only for text whose words are separated by single spaces; runs of spaces, tabs, and newlines all come back as one space.
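The round-trip behaviour of `decode` can be checked in isolation. The sketch below is a minimal illustration, not the class itself: `to_char_tokens` is a hypothetical helper that mimics the character-level split with no merges applied, which is enough because merges never change the joined string.

```python
def to_char_tokens(text):
    """Split on whitespace and mark each word end with '</w>' (no merges applied)."""
    tokens = []
    for word in text.split():
        tokens.extend(list(word) + ["</w>"])
    return tokens


def decode(tokens):
    """Same logic as the decode method described above."""
    return "".join(tokens).replace("</w>", " ").strip()


single_spaced = "My cat has a hat."
irregular = "My  cat\nhas a hat. "

print(decode(to_char_tokens(single_spaced)) == single_spaced)  # True
print(decode(to_char_tokens(irregular)) == irregular)          # False
print(repr(decode(to_char_tokens(irregular))))                 # 'My cat has a hat.'
```

The second input loses its double space, newline, and trailing space, so decoding recovers a whitespace-normalized version of the text rather than the exact original.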

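This implementation illustrates how BPE incrementally builds a subword vocabulary and uses it to tokenize text. Words seen in training collapse into a few learned subword tokens, while unseen words such as "big" or "dog" fall back to individual characters, some of which (for example `b`, `d`, `g`, `o`) are not in the learned vocabulary at all.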
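To make the order-dependent replay concrete, here is a standalone sketch that applies the ten merges reported in the notebook's `Learned merges` output to single words. The names `LEARNED_MERGES` and `apply_merges` are introduced here for illustration and are not part of the class.

```python
# Merges copied from the "Learned merges" output of the notebook, in learned order.
LEARNED_MERGES = [
    (("a", "t"), "at"),
    (("a", "</w>"), "a</w>"),
    (("c", "at"), "cat"),
    ((".", "</w>"), ".</w>"),
    (("I", "</w>"), "I</w>"),
    (("h", "a"), "ha"),
    (("e", "</w>"), "e</w>"),
    (("y", "</w>"), "y</w>"),
    (("cat", "</w>"), "cat</w>"),
    (("h", "at"), "hat"),
]


def apply_merges(word, merges=LEARNED_MERGES):
    """Tokenize one word by replaying every merge in the order it was learned."""
    tokens = list(word) + ["</w>"]
    for pair, new_token in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


for word in ["cat", "hat.", "has", "big"]:
    print(word, "->", apply_merges(word))
# cat -> ['cat</w>']
# hat. -> ['hat', '.</w>']
# has -> ['ha', 's', '</w>']
# big -> ['b', 'i', 'g', '</w>']
```

Order matters: in "has", `('h', 'a')` fires and yields `ha`, whereas in "hat." the earlier merge `('a', 't')` consumes the `a` first, so `('h', 'a')` never matches and `('h', 'at')` produces `hat` instead.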

### What is Byte Pair Encoding (BPE)?

Byte Pair Encoding (BPE) is a data compression technique that is also widely used in natural language processing (NLP) for subword tokenization. It works by iteratively merging the most frequent pair of adjacent characters or character sequences into a new, single token.

**How it works:**

1.  **Initialize Vocabulary:** Start with a vocabulary of individual characters present in the training text.
2.  **Count Pairs:** Identify all adjacent pairs of tokens in the training data and count their frequencies.
3.  **Merge Most Frequent Pair:** Replace all occurrences of the most frequent pair with a new, combined token.
4.  **Repeat:** Go back to step 2 and repeat the process for a fixed number of merge operations or until the vocabulary reaches a desired size.

This process creates a vocabulary of common subword units. Unknown words can be broken down into known subwords, the vocabulary stays much smaller than a typical word-level vocabulary, and token sequences are shorter than with character-level tokenization.
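The four steps can be sketched compactly. The version below is a variant, not the notebook's implementation: it stores each distinct word once with its frequency, so pair counts are weighted by how often the word occurs. The function name `learn_bpe` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(text, num_merges):
    """Learn up to num_merges BPE merges from whitespace-split text."""
    # Step 1: each distinct word is stored once as a tuple of symbols, with its count.
    word_freqs = Counter(tuple(word) + ("</w>",) for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Step 3: merge the most frequent pair everywhere it occurs.
        best = max(pair_counts, key=pair_counts.get)
        merged_freqs = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_freqs[tuple(out)] += freq
        word_freqs = merged_freqs
        merges.append(best)  # Step 4: repeat with the updated symbols.
    return merges


print(learn_bpe("low low low lower lowest", num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```

Ties between equally frequent pairs are broken by first occurrence here, since `max` returns the first maximal key in the counter's insertion order; other implementations may break ties differently.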

In [1]:
# Step 1: Prepare the input text
text = """I have a cat. My cat has a hat. I like my cat with a hat."""

# For simplicity, we'll start with character-level tokens and add a special token for word boundaries
# This helps in distinguishing 'cat' from 'catch' for example.
# We'll also convert to a list of lists of characters for processing.

words = text.split()
initial_tokens = []
for word in words:
    # Add a special end-of-word token '</w>' to each word
    initial_tokens.append(list(word) + ['</w>'])

print("Initial tokens (character-level with </w>):")
for tokens in initial_tokens:
    print(tokens)

vocabulary = set(char for word_tokens in initial_tokens for char in word_tokens)
print(f"\nInitial vocabulary: {sorted(list(vocabulary))}")

# Helper function to convert a list of tokens back to a string for display
def tokens_to_string(tokens_list):
    return [''.join(tokens).replace('</w>', '_') for tokens in tokens_list]

print(f"\nInitial processed text representation: {tokens_to_string(initial_tokens)}")

Initial tokens (character-level with </w>):
['I', '</w>']
['h', 'a', 'v', 'e', '</w>']
['a', '</w>']
['c', 'a', 't', '.', '</w>']
['M', 'y', '</w>']
['c', 'a', 't', '</w>']
['h', 'a', 's', '</w>']
['a', '</w>']
['h', 'a', 't', '.', '</w>']
['I', '</w>']
['l', 'i', 'k', 'e', '</w>']
['m', 'y', '</w>']
['c', 'a', 't', '</w>']
['w', 'i', 't', 'h', '</w>']
['a', '</w>']
['h', 'a', 't', '.', '</w>']

Initial vocabulary: ['.', '</w>', 'I', 'M', 'a', 'c', 'e', 'h', 'i', 'k', 'l', 'm', 's', 't', 'v', 'w', 'y']

Initial processed text representation: ['I_', 'have_', 'a_', 'cat._', 'My_', 'cat_', 'has_', 'a_', 'hat._', 'I_', 'like_', 'my_', 'cat_', 'with_', 'a_', 'hat._']


In [2]:
# Step 2: Implement the core BPE functions

# Function to count frequencies of adjacent pairs
def get_stats(word_tokens):
    pairs = {}
    for word in word_tokens:
        for i in range(len(word) - 1):
            pair = (word[i], word[i+1])
            pairs[pair] = pairs.get(pair, 0) + 1
    return pairs

# Function to merge the most frequent pair
def merge_pair(word_tokens, pair, new_token):
    merged_word_tokens = []
    for word in word_tokens:
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i+1]) == pair:
                new_word.append(new_token)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged_word_tokens.append(new_word)
    return merged_word_tokens


# Let's run a few merges
current_word_tokens = initial_tokens.copy()
merges = {}
num_merges = 10 # Number of merge operations to perform

print("\n--- Performing BPE Merges ---")
for i in range(num_merges):
    pairs = get_stats(current_word_tokens)
    if not pairs:
        print("No more pairs to merge. Stopping.")
        break

    # Find the most frequent pair
    most_frequent_pair = max(pairs, key=pairs.get)
    new_token = ''.join(most_frequent_pair)
    merges[most_frequent_pair] = new_token

    print(f"\nMerge {i+1}: Merging {most_frequent_pair} into '{new_token}' (count: {pairs[most_frequent_pair]})")
    current_word_tokens = merge_pair(current_word_tokens, most_frequent_pair, new_token)
    print(f"Current tokens: {tokens_to_string(current_word_tokens)}")

print("\n--- BPE Merges Complete ---")
print(f"Final merges learned: {merges}")

# Update vocabulary with new tokens
final_vocabulary = set(token for word_tokens in current_word_tokens for token in word_tokens)
print(f"Final vocabulary size: {len(final_vocabulary)}")
print(f"Final vocabulary: {sorted(list(final_vocabulary))}")


--- Performing BPE Merges ---

Merge 1: Merging ('a', 't') into 'at' (count: 5)
Current tokens: ['I_', 'have_', 'a_', 'cat._', 'My_', 'cat_', 'has_', 'a_', 'hat._', 'I_', 'like_', 'my_', 'cat_', 'with_', 'a_', 'hat._']

Merge 2: Merging ('a', '</w>') into 'a</w>' (count: 3)
Current tokens: ['I_', 'have_', 'a_', 'cat._', 'My_', 'cat_', 'has_', 'a_', 'hat._', 'I_', 'like_', 'my_', 'cat_', 'with_', 'a_', 'hat._']

Merge 3: Merging ('c', 'at') into 'cat' (count: 3)
Current tokens: ['I_', 'have_', 'a_', 'cat._', 'My_', 'cat_', 'has_', 'a_', 'hat._', 'I_', 'like_', 'my_', 'cat_', 'with_', 'a_', 'hat._']

Merge 4: Merging ('.', '</w>') into '.</w>' (count: 3)
Current tokens: ['I_', 'have_', 'a_', 'cat._', 'My_', 'cat_', 'has_', 'a_', 'hat._', 'I_', 'like_', 'my_', 'cat_', 'with_', 'a_', 'hat._']

Merge 5: Merging ('I', '</w>') into 'I</w>' (count: 2)
Current tokens: ['I_', 'have_', 'a_', 'cat._', 'My_', 'cat_', 'has_', 'a_', 'hat._', 'I_', 'like_', 'my_', 'cat_', 'with_', 'a_', 'hat._']

[output truncated]
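The `Current tokens` lines above look identical after every merge because `tokens_to_string` joins tokens with no separator, which hides the token boundaries. A minimal sketch of an alternative display helper follows; `show_tokens` is a hypothetical name, not part of the notebook.

```python
def show_tokens(tokens_list):
    """Render each word with spaces between tokens and '_' for the '</w>' marker."""
    return [" ".join(tokens).replace("</w>", "_") for tokens in tokens_list]


# The word "cat" before any merge, after ('a', 't'), and after ('c', 'at').
print(show_tokens([["c", "a", "t", "</w>"]]))  # ['c a t _']
print(show_tokens([["c", "at", "</w>"]]))      # ['c at _']
print(show_tokens([["cat", "</w>"]]))          # ['cat _']
```

Swapping this helper into the merge loop would make the effect of each merge visible in the printed output.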

In [3]:
# Step 3: Create a BPE tokenizer class for encoding and decoding

class BPETokenizer:
    def __init__(self, text, num_merges):
        self.text = text
        self.num_merges = num_merges
        self.merges = {}
        self.vocabulary = set()
        # Note: this class works on characters; byte-level BPE would start from the 256 byte values instead.
        self.train()

    def _get_initial_tokens(self):
        words = self.text.split()
        initial_tokens_list = []
        for word in words:
            initial_tokens_list.append(list(word) + ['</w>'])
        return initial_tokens_list

    def _get_stats(self, word_tokens):
        pairs = {}
        for word in word_tokens:
            for i in range(len(word) - 1):
                pair = (word[i], word[i+1])
                pairs[pair] = pairs.get(pair, 0) + 1
        return pairs

    def _merge_pair(self, word_tokens, pair, new_token):
        merged_word_tokens = []
        for word in word_tokens:
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i+1]) == pair:
                    new_word.append(new_token)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged_word_tokens.append(new_word)
        return merged_word_tokens

    def train(self):
        current_word_tokens = self._get_initial_tokens()
        self.vocabulary = set(token for word_tokens in current_word_tokens for token in word_tokens)

        for _ in range(self.num_merges):
            pairs = self._get_stats(current_word_tokens)
            if not pairs:
                break
            most_frequent_pair = max(pairs, key=pairs.get)
            new_token = ''.join(most_frequent_pair)
            self.merges[most_frequent_pair] = new_token
            current_word_tokens = self._merge_pair(current_word_tokens, most_frequent_pair, new_token)
            self.vocabulary.add(new_token)

        # Map each merged token back to its component pair (kept for inspection; tokenize() and decode() do not use it)
        self.reversed_merges = {v: k for k, v in self.merges.items()}

    def tokenize(self, text_to_tokenize):
        # Start with character-level tokens for the new text
        words_to_tokenize = text_to_tokenize.split()
        encoded_tokens_list = []

        for word in words_to_tokenize:
            word_chars = list(word) + ['</w>']
            # Apply merges in the order they were learned
            for pair, new_token in self.merges.items():
                i = 0
                while i < len(word_chars) - 1:
                    if (word_chars[i], word_chars[i+1]) == pair:
                        word_chars[i:i+2] = [new_token]
                    else:
                        i += 1
            encoded_tokens_list.extend(word_chars)
        return encoded_tokens_list

    def decode(self, tokens):
        text = ''.join(tokens)
        text = text.replace('</w>', ' ')
        return text.strip()


# Instantiate and train the tokenizer
bpe_tokenizer = BPETokenizer(text, num_merges=10)

print(f"\nLearned merges: {bpe_tokenizer.merges}")
print(f"Final vocabulary: {sorted(list(bpe_tokenizer.vocabulary))}")

# Test encoding and decoding
input_sentence = "My cat has a big hat."
encoded_tokens = bpe_tokenizer.tokenize(input_sentence)
decoded_text = bpe_tokenizer.decode(encoded_tokens)

print(f"\nOriginal sentence: '{input_sentence}'")
print(f"Encoded tokens: {encoded_tokens}")
print(f"Decoded text: '{decoded_text}'")

# Example with a word not seen in training
new_sentence = "The big dog. The small cat."
encoded_new_tokens = bpe_tokenizer.tokenize(new_sentence)
decoded_new_text = bpe_tokenizer.decode(encoded_new_tokens)

print(f"\nNew sentence: '{new_sentence}'")
print(f"Encoded new tokens: {encoded_new_tokens}")
print(f"Decoded new text: '{decoded_new_text}'")


Learned merges: {('a', 't'): 'at', ('a', '</w>'): 'a</w>', ('c', 'at'): 'cat', ('.', '</w>'): '.</w>', ('I', '</w>'): 'I</w>', ('h', 'a'): 'ha', ('e', '</w>'): 'e</w>', ('y', '</w>'): 'y</w>', ('cat', '</w>'): 'cat</w>', ('h', 'at'): 'hat'}
Final vocabulary: ['.', '.</w>', '</w>', 'I', 'I</w>', 'M', 'a', 'a</w>', 'at', 'c', 'cat', 'cat</w>', 'e', 'e</w>', 'h', 'ha', 'hat', 'i', 'k', 'l', 'm', 's', 't', 'v', 'w', 'y', 'y</w>']

Original sentence: 'My cat has a big hat.'
Encoded tokens: ['M', 'y</w>', 'cat</w>', 'ha', 's', '</w>', 'a</w>', 'b', 'i', 'g', '</w>', 'hat', '.</w>']
Decoded text: 'My cat has a big hat.'

New sentence: 'The big dog. The small cat.'
Encoded new tokens: ['T', 'h', 'e</w>', 'b', 'i', 'g', '</w>', 'd', 'o', 'g', '.</w>', 'T', 'h', 'e</w>', 's', 'm', 'a', 'l', 'l', '</w>', 'cat', '.</w>']
Decoded new text: 'The big dog. The small cat.'
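The encoded output for the new sentence contains characters that never appeared in the training text. The sketch below checks this directly, using the vocabulary and token list copied from the output above. A production tokenizer would typically map such tokens to an unknown-token ID or start from a byte-level base vocabulary; this class does neither and passes them through unchanged.

```python
# Vocabulary and encoded tokens copied from the notebook output above.
vocabulary = {
    ".", ".</w>", "</w>", "I", "I</w>", "M", "a", "a</w>", "at", "c", "cat",
    "cat</w>", "e", "e</w>", "h", "ha", "hat", "i", "k", "l", "m", "s", "t",
    "v", "w", "y", "y</w>",
}
encoded_new_tokens = [
    "T", "h", "e</w>", "b", "i", "g", "</w>", "d", "o", "g", ".</w>",
    "T", "h", "e</w>", "s", "m", "a", "l", "l", "</w>", "cat", ".</w>",
]

# Collect every emitted token that is missing from the learned vocabulary.
unknown = sorted({tok for tok in encoded_new_tokens if tok not in vocabulary})
print(unknown)  # ['T', 'b', 'd', 'g', 'o']
```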


In [1]:
import tensorflow as tf
import numpy as np

# Example input: 'logits' from a neural network output
# Let's say we have 3 classes and a batch size of 1
logits = tf.constant([2.0, 1.0, 0.1])
print(f"Input logits: {logits.numpy()}")

# Apply the softmax function
probabilities = tf.nn.softmax(logits)

print(f"Softmax probabilities: {probabilities.numpy()}")
print(f"Sum of probabilities: {tf.reduce_sum(probabilities).numpy()}")

# Example with a batch of inputs
batch_logits = tf.constant([
    [2.0, 1.0, 0.1],  # Example 1
    [0.5, 2.5, 1.5]   # Example 2
])
print(f"\nInput batch logits:\n{batch_logits.numpy()}")

batch_probabilities = tf.nn.softmax(batch_logits)

print(f"Softmax batch probabilities:\n{batch_probabilities.numpy()}")
print(f"Sum of probabilities for each example in batch: {tf.reduce_sum(batch_probabilities, axis=1).numpy()}")

Input logits: [2.  1.  0.1]
Softmax probabilities: [0.6590012  0.24243298 0.09856589]
Sum of probabilities: 1.0000001192092896

Input batch logits:
[[2.  1.  0.1]
 [0.5 2.5 1.5]]
Softmax batch probabilities:
[[0.6590012  0.24243298 0.09856589]
 [0.09003057 0.6652409  0.24472845]]
Sum of probabilities for each example in batch: [1.0000001 0.9999999]


### Explanation of TensorFlow Softmax Implementation

In the Python code above, we demonstrated how to use TensorFlow's built-in `tf.nn.softmax` function.

1.  **`import tensorflow as tf`**: We start by importing the TensorFlow library.

2.  **`logits = tf.constant([2.0, 1.0, 0.1])`**: We define a `tf.constant` tensor named `logits`. It holds the raw, unnormalized scores for three classes: for example, a model might assign an input a score of 2.0 for class A, 1.0 for class B, and 0.1 for class C.

3.  **`probabilities = tf.nn.softmax(logits)`**: We pass the `logits` tensor to `tf.nn.softmax()`, which computes the softmax function $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$:
    *   It calculates $e^{z_i}$ for each element.
    *   It then normalizes these exponential values by dividing each by the sum of all exponential values.

4.  **`print(f"Softmax probabilities: {probabilities.numpy()}")`**: The output `probabilities` tensor contains values between 0 and 1, representing the probability distribution over the classes. The class with the highest logit will have the highest probability.

5.  **`print(f"Sum of probabilities: {tf.reduce_sum(probabilities).numpy()}")`**: This line checks that the probabilities for a single input sum to 1, a fundamental property of probability distributions. The printed value, `1.0000001192092896`, differs from 1 only by float32 rounding error.

6.  **Batch Example**: We also show how `tf.nn.softmax` handles a batch of inputs (`batch_logits`). By default it normalizes over the last axis (`axis=-1`), so for a 2D tensor it applies softmax independently to each row, where each row is one example in the batch. Passing `axis=1` to `tf.reduce_sum` sums across the classes for each example, confirming that each row's probabilities sum to 1 up to float32 rounding.
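The two steps in item 3 (exponentiate, then normalize) can be written out by hand. The sketch below is a pure-Python illustration rather than TensorFlow's implementation; the function names are introduced here, and the max-shift is a standard numerical-stability trick that the explanation above does not cover.

```python
import math


def softmax(logits):
    """Softmax of a 1-D list of floats, shifted by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_rows(batch):
    """Apply softmax independently to each row, like axis=-1 on a 2-D tensor."""
    return [softmax(row) for row in batch]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 4) for p in probs])  # [0.659, 0.2424, 0.0986]

# Without the shift, math.exp(1000.0) would raise OverflowError.
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```

Subtracting the maximum leaves the result unchanged because the common factor $e^{-\max(z)}$ cancels between numerator and denominator. The rounded values agree with the `tf.nn.softmax` output shown above.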