<a href="https://colab.research.google.com/github/arkeodev/nlp/blob/main/Hugging_Face/3-HF_Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers and Their Types

Tokenization is a fundamental step in natural language processing (NLP) tasks. It involves breaking down text into smaller units (tokens), which can be words, subwords, or characters. These tokens are then used by models to understand and process the text.

## Types of Tokenizers

### Word Tokenizers

**Definition:** Word tokenizers split text into individual words using spaces and punctuation as delimiters. They are simple and effective for languages with clear word boundaries marked by spaces.

In [None]:
# Simple word tokenization using Python's split method.
# Note: split() separates on whitespace only, so punctuation stays attached (e.g. "tasks.").
sentence = "Tokenization is essential for NLP tasks."
tokens = sentence.split()

print("Word Tokenization:", tokens)

Word Tokenization: ['Tokenization', 'is', 'essential', 'for', 'NLP', 'tasks.']
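
Because `split()` only separates on whitespace, the period stays attached to `tasks.` in the output above. The following is a minimal sketch of a word tokenizer that also treats punctuation as a delimiter. The function name `word_tokenize` and the regular expression are illustrative choices, not a standard.

```python
import re

def word_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("Tokenization is essential for NLP tasks.")
print("Word Tokenization:", tokens)
```

With this pattern the final period becomes its own token instead of being part of `tasks.`.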


### Subword Tokenizers

**Definition:** Subword tokenizers break down words into subwords or symbols. These subwords can represent common prefixes, suffixes, or roots. This approach allows the model to handle rare words better and improves its ability to generalize.

In [None]:
from transformers import RobertaTokenizer

# Load a pre-trained RoBERTa tokenizer that uses subword tokenization (Byte-level BPE)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Tokenize the sentence
tokens = tokenizer.tokenize("Tokenization is essential for NLP tasks.")

print("Subword Tokenization:", tokens)

Subword Tokenization: ['Token', 'ization', 'Ġis', 'Ġessential', 'Ġfor', 'ĠN', 'LP', 'Ġtasks', '.']


### Character Tokenizers

**Definition:** Character tokenizers decompose text into individual characters. This level of granularity is useful for character-level modeling or in languages without clear word delimiters.

In [None]:
# Character tokenization needs no trained model: list() splits the string into single characters
tokens = list("Tokenization is essential for NLP tasks.")

print("Character Tokenization:", tokens)

Character Tokenization: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 'e', 's', 's', 'e', 'n', 't', 'i', 'a', 'l', ' ', 'f', 'o', 'r', ' ', 'N', 'L', 'P', ' ', 't', 'a', 's', 'k', 's', '.']


### Byte Pair Encoding (BPE)

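**Definition:** BPE is a subword algorithm that sits between word-level and character-level tokenization. Training starts from a base vocabulary of individual characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols in a corpus, adding each merged symbol to the vocabulary until it reaches a target size. GPT-2 and RoBERTa use a byte-level variant of BPE.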

In [None]:
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Tokenize the sentence
tokens = tokenizer.encode("Tokenization is essential for NLP tasks.", add_special_tokens=False)

# Decode tokens into text
tokenized_text = [tokenizer.decode([tok]) for tok in tokens]

print("Byte Pair Encoding Tokenization:", tokenized_text)

Byte Pair Encoding Tokenization: ['Token', 'ization', ' is', ' essential', ' for', ' N', 'LP', ' tasks', '.']
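
To make the merge procedure concrete, here is a minimal sketch of BPE training on a toy corpus. The word frequencies, the helper names `get_pair_counts` and `merge_pair`, and the choice of three merges are assumptions made for illustration. Real tokenizers train on far larger corpora and handle details such as byte-level fallback that are omitted here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency (made up for illustration)
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}  # start from single characters

merges = []
for _ in range(3):  # learn three merges
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first one seen
    merges.append(best)
    words = merge_pair(best, words)

print("Learned merges:", merges)
print("Segmented words:", list(words))
```

Each merge adds one symbol to the vocabulary, so the number of merges directly controls the final vocabulary size.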


### Comparisons of Tokenizers

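The four approaches covered above (word tokenizers, subword tokenizers, character tokenizers, and Byte Pair Encoding) each serve a distinct purpose and suit different scenarios. Note that BPE is itself a subword algorithm, so the last entry overlaps with subword tokenizers in general; it is listed separately because it is so widely used.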

### Word Tokenizers:
- **Prefer for**: Languages with clear word boundaries and sufficient training data.
- **Advantages**: Simplicity and interpretability.
- **Limitations**: Struggles with out-of-vocabulary (OOV) words.
- **Use Case**: Good for high-resource languages where the vocabulary can be comprehensively captured.

### Subword Tokenizers:
- **Prefer for**: Handling OOV words and morphologically rich languages.
- **Advantages**: Balance between the granularity of characters and the context of words.
- **Limitations**: May split semantically linked parts of a word.
- **Use Case**: Useful in NLP tasks that benefit from understanding word parts, such as translation and text generation.

### Character Tokenizers:
- **Prefer for**: Languages without clear word delimiters, or for character-level tasks.
- **Advantages**: No OOV words, as all text can be broken down into characters.
- **Limitations**: Longer sequences to process, which may lead to higher computational costs.
- **Use Case**: Preferred for character-level models and certain types of text classification where morphology is less important.

### Byte Pair Encoding (BPE):
- **Prefer for**: Efficiently encoding the input data by capturing the most frequent subwords.
- **Advantages**: Reduces the vocabulary size without significant loss of information.
- **Limitations**: Can be suboptimal for languages with large character sets or highly inflectional morphology.
- **Use Case**: Often used in large-scale language models such as GPT-2 and RoBERTa for its efficiency and ability to handle a wide range of text types. (BERT uses WordPiece, a related subword algorithm.)

In summary:
- Choose **word tokenization** when working with well-resourced languages and the task doesn't require handling rare words.
- Opt for **subword tokenization** when dealing with languages that have rich morphology or when you want a good balance between handling unknown words and leveraging context.
- Select **character tokenization** for tasks that demand a character-level focus or when working with languages that have ambiguous word boundaries.
- Use **BPE** for large-scale language modeling where there's a need to efficiently handle a diverse vocabulary with an effective representation of subwords.
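
The sequence-length trade-off mentioned above can be checked directly on the example sentence. This small sketch uses only plain Python; for comparison, the RoBERTa subword output shown earlier contains 9 tokens.

```python
sentence = "Tokenization is essential for NLP tasks."

word_tokens = sentence.split()   # whitespace word tokenization
char_tokens = list(sentence)     # character tokenization

print("Word tokens:     ", len(word_tokens))
print("Character tokens:", len(char_tokens))
print("Distinct characters needed as vocabulary:", len(set(char_tokens)))
```

Character tokenization needs only a tiny vocabulary but produces a much longer sequence, while word tokenization does the opposite; subword methods fall in between.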

## Special Tokens

Special tokens such as SOS/EOS, UNK, and PAD are integral in the language understanding process of transformer-based models.

In [None]:
from transformers import BertTokenizer

# Load a pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example text
text = "Hugging Face is leading in the NLP domain."

# Add special tokens
encoded = tokenizer(
    text,
    add_special_tokens=True,  # Add [CLS] and [SEP]
    max_length=20,            # Target length for padding and truncation
    padding='max_length',     # Pad up to max_length (replaces the deprecated pad_to_max_length=True)
    truncation=True,          # Truncate inputs longer than max_length
    return_tensors='pt',      # Return PyTorch tensors
)

# Token IDs
print("Token IDs:", encoded['input_ids'])

# Decode to show special tokens
decoded_tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
print("Decoded tokens with special tokens:", decoded_tokens)

Token IDs: tensor([[  101, 17662,  2227,  2003,  2877,  1999,  1996, 17953,  2361,  5884,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0]])
Decoded tokens with special tokens: ['[CLS]', 'hugging', 'face', 'is', 'leading', 'in', 'the', 'nl', '##p', 'domain', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']


The output includes the special tokens added by the tokenizer: `[CLS]` for start of sequence, `[SEP]` for end of sequence, and `[PAD]` for padding.

- **Start-of-Sequence (SOS) / End-of-Sequence (EOS) Tokens:**
  - In BERT's case, the `[CLS]` token serves as a start of sequence token, and `[SEP]` serves as both a separator token for two sequences and an end-of-sequence token for single sequences.
  - These tokens are important because they provide the model with clear markers that indicate the beginning and end of a sequence. In tasks such as text classification, the representation of the `[CLS]` token is often used as the aggregate sequence representation for classification tasks.
  - For models that generate text, like GPT-2, EOS tokens signal to the model when to stop generating further tokens.

- **Unknown (UNK) Token:**
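  - The `[UNK]` token replaces input that the tokenizer cannot represent with its vocabulary. For a word-level tokenizer that means any word not in the vocabulary; for a subword tokenizer it means a word that cannot be broken into known subwords. This is important because models have a fixed vocabulary size and cannot accommodate all possible words, especially rare or domain-specific terms.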
  - By using an UNK token, the model can still process and make predictions about text containing unknown words, although it may lose some of the specific information that those words carry.

- **Padding (PAD) Token:**
  - The `[PAD]` token is used to fill up sequences so that all sequences in a batch have the same length. This is crucial for batching operations, as deep learning models usually require batched input to be of uniform size.
  - Padding allows models to handle input sequences of varying lengths and is necessary for models to efficiently perform computations on a batch of data.

In transformer models, the attention mechanism needs to know which positions are padded so it can disregard them when computing the attention scores. This is often achieved through an attention mask, which is an additional input indicating which tokens are padding and should not be attended to.
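
The roles of these tokens can be sketched with a toy word-level encoder. The vocabulary, its IDs, and the `encode` helper below are invented for illustration and do not correspond to BERT's real vocabulary; the point is only to show where `[CLS]`, `[SEP]`, `[UNK]`, and `[PAD]` end up and how the attention mask lines up with them.

```python
# Toy vocabulary, invented for illustration (not BERT's real vocabulary or IDs)
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "tokenizers": 4, "are": 5, "useful": 6, "fast": 7}

def encode(words, max_length):
    # Wrap the sequence in [CLS] ... [SEP] and map unknown words to [UNK]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # Real tokens get mask 1, padding positions get mask 0
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad

batch = [["tokenizers", "are", "useful"], ["tokenizers", "are", "blazingly", "fast"]]
max_length = max(len(words) for words in batch) + 2  # +2 for [CLS] and [SEP]

for words in batch:
    ids, mask = encode(words, max_length)
    print(ids, mask)
```

The shorter sequence is padded to the length of the longer one, and the word "blazingly", which is missing from the toy vocabulary, is mapped to the `[UNK]` ID.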

## Why We Use Masking

Masking is used to inform the model which parts of the input it should consider and which parts it should ignore during training or inference. This is important for:
- **Attention Mechanisms:** To prevent the model from paying attention to padding tokens.
- **Variable Sequence Lengths:** To handle inputs of variable lengths in a batch for efficient processing.

In [None]:
from transformers import BertTokenizer

# Load a pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example text with two sentences of different lengths
texts = [
    "Hugging Face is revolutionizing the field of NLP.",
    "Machine learning is fascinating."
]

# Tokenize the texts and add special tokens ([CLS] and [SEP])
# The tokenizer will pad the sequences to the length of the longest sequence
encoded_inputs = tokenizer(
    texts,
    padding=True,  # Pad the sequences to the length of the longest sequence
    truncation=True,  # Truncate the sequences to the model's max input length
    return_tensors='pt'  # Return PyTorch tensors
)

# Extract the attention mask
attention_masks = encoded_inputs['attention_mask']

print("Token IDs of the first sentence : ", encoded_inputs['input_ids'][0])
print("Token IDs of the second sentence: ", encoded_inputs['input_ids'][1])
print("Attention Masks of the first sentence : ", attention_masks[0])
print("Attention Masks of the second sentence: ", attention_masks[1])

Token IDs of the first sentence :  tensor([  101, 17662,  2227,  2003,  4329,  6026,  1996,  2492,  1997, 17953,
         2361,  1012,   102])
Token IDs of the second sentence:  tensor([  101,  3698,  4083,  2003, 17160,  1012,   102,     0,     0,     0,
            0,     0,     0])
Attention Masks of the first sentence :  tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Attention Masks of the second sentence:  tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
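
To see what the mask does inside the model, here is a simplified sketch of a masked softmax over the attention scores of a single query. The scores are made up, and `masked_softmax` is a hypothetical helper; real implementations work on tensors and typically add a large negative value to masked positions before the softmax. The sketch assumes at least one position is unmasked.

```python
import math

def masked_softmax(scores, attention_mask):
    # Positions with mask 0 get a score of -inf, so they receive exp(-inf) = 0 weight
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]   # made-up raw attention scores for one query
attention_mask = [1, 1, 1, 0]   # the last position is padding
weights = masked_softmax(scores, attention_mask)
print([round(w, 3) for w in weights])
```

Even though the padded position has the highest raw score, it ends up with zero attention weight, and the remaining weights still sum to one.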


## AutoTokenizer

`AutoTokenizer` is a class within the Hugging Face `transformers` library designed to automatically instantiate a tokenizer class based on the pre-trained model's name or path you provide. It serves as a wrapper that dynamically adjusts to the specific tokenizer class associated with the given pre-trained model, ensuring that you always use the correct tokenizer for your model without needing to manually specify the tokenizer class.

When you call `AutoTokenizer.from_pretrained("bert-base-uncased")`, the `AutoTokenizer`:

1. Resolves `bert-base-uncased` to a model repository on the Hugging Face Hub (or to a local directory, if you pass a path).
2. Identifies the tokenizer class that pairs with this model from its configuration, which in this case is the BERT tokenizer.
3. Downloads and caches the tokenizer files, such as the vocabulary and, for BPE-based tokenizers, the merges file.
4. Instantiates the BERT tokenizer with the pre-trained vocabulary, ready for use.

This abstraction is particularly useful because different models might use different tokenization algorithms (e.g., BERT uses WordPiece, GPT-2 uses byte-level BPE, and T5 uses SentencePiece). With `AutoTokenizer`, the user doesn't have to worry about these details; they just need to know the model they want to use, and the corresponding tokenizer is automatically selected and loaded.

This makes the `AutoTokenizer` class a powerful tool for users who need to switch between different models and tokenizers, simplifying the process of preparing inputs for a wide array of transformer-based models.
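
The underlying idea is a lookup from a model type to a tokenizer class. The sketch below illustrates that dispatch pattern only; the toy classes, the `TOKENIZER_REGISTRY` dictionary, and the `auto_tokenizer` function are hypothetical and are not the actual `transformers` implementation.

```python
# Hypothetical stand-ins; these are NOT the real transformers classes
class ToyWordPieceTokenizer:
    algorithm = "WordPiece"

class ToyByteLevelBPETokenizer:
    algorithm = "byte-level BPE"

# Registry keyed by a model-type string, as might be read from a model's config file
TOKENIZER_REGISTRY = {
    "bert": ToyWordPieceTokenizer,
    "gpt2": ToyByteLevelBPETokenizer,
}

def auto_tokenizer(config):
    """Pick and instantiate the tokenizer class that matches config["model_type"]."""
    model_type = config["model_type"]
    try:
        tokenizer_cls = TOKENIZER_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"No tokenizer registered for model type {model_type!r}")
    return tokenizer_cls()

tok = auto_tokenizer({"model_type": "gpt2"})
print(type(tok).__name__, "->", tok.algorithm)
```

The caller only supplies the model's configuration; the choice of class stays in one place, which is what makes switching between models cheap.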

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer associated with a pre-trained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Tokenize a sample text
text = "Hugging Face is revolutionizing AI."
tokens = tokenizer(text)

# Explore the tokens
print(tokens)

# All special tokens this tokenizer defines (not only the ones added to this input)
print("Special tokens defined:", tokenizer.all_special_tokens)

{'input_ids': [101, 17662, 2227, 2003, 4329, 6026, 9932, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Special tokens defined: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']


## Why Tokenizer Should Match with the Model

Using the tokenizer with the corresponding model ensures that the data flow from pre-processing to prediction is as smooth and accurate as possible, reproducing the conditions under which the model was originally trained and allowing it to perform at its best.

1. **Consistency in Vocabulary**: Pre-trained models are trained with a specific vocabulary, and using the corresponding tokenizer ensures that the tokens fed into the model are recognized. If you use a different tokenizer, words may be split into tokens differently, and some tokens may not be recognized at all by the model, leading to poor performance or even errors.

2. **Maintaining Tokenization Method**: Each model may be trained with a particular tokenization method (e.g., BERT uses WordPiece, GPT-2 uses byte-level BPE). The tokenizer must match this method to ensure that the model understands the input as intended during training. Using a different method would mean that the input is presented to the model in an unfamiliar way, which can degrade the model's ability to make predictions.

3. **Special Tokens**: Many models rely on special tokens for understanding sentence boundaries, segmenting sentence pairs, or other purposes. The correct tokenizer will insert these tokens as needed. For instance, BERT expects a `[CLS]` token at the beginning of an input and `[SEP]` tokens to separate segments.

4. **Pre-processing and Post-processing**: The tokenizer handles critical pre-processing steps like truncation and padding, ensuring inputs are of the correct length. Some tokenizers also perform post-processing to convert token ids back to words. This pre- and post-processing needs to be consistent with the model's training.

5. **Optimized Performance**: Matching tokenizers are often optimized to work with their models, providing better speed and efficiency. This is especially true for "fast" tokenizers in Hugging Face, which are implemented in Rust.

6. **End-to-End Coherence**: For end-to-end tasks, such as text-to-text tasks (e.g., translation, summarization), the tokenizer ensures that the input and output texts are handled consistently. This coherence is essential for the model to learn and generate accurate outputs.
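
The vocabulary point in particular is easy to demonstrate. In this sketch two made-up vocabularies stand in for two different pre-trained tokenizers; `to_ids` and `to_words` are hypothetical helpers. Text is encoded with one vocabulary and then interpreted with the other, which is effectively what happens when a model receives IDs from a mismatched tokenizer.

```python
# Two made-up vocabularies standing in for two different pre-trained tokenizers
vocab_a = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
vocab_b = {"[UNK]": 0, "sat": 1, "the": 2, "dog": 3}

def to_ids(words, vocab):
    return [vocab.get(w, vocab["[UNK]"]) for w in words]

def to_words(ids, vocab):
    inverse = {i: w for w, i in vocab.items()}
    return [inverse[i] for i in ids]

words = ["the", "cat", "sat"]
ids_from_b = to_ids(words, vocab_b)              # encoded with the wrong tokenizer
seen_by_model_a = to_words(ids_from_b, vocab_a)  # what a model trained with vocab_a "reads"

print("IDs from tokenizer B:", ids_from_b)
print("Meaning under vocabulary A:", seen_by_model_a)
```

The IDs are all valid, so nothing crashes, but the model trained with vocabulary A would see a different sentence from the one that was typed.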

## Training a New Tokenizer

Sometimes, you might need to train a tokenizer on a custom dataset, particularly if your text data is in a language or domain not covered by pre-trained models.

More generally, when a tokenizer was trained on a corpus that differs from the one you're working with (a new language, new characters, a new domain, or a new style), it might not perform well. Dissimilarities between the tokenizer's training data and your data can lead to poor tokenization quality, such as common domain terms being split into many small pieces.

In such cases, you might need to train a new tokenizer that is more aligned with your specific corpus.

For the demonstration, I'll use a simplified example to train a sample tokenizer with a small piece of text directly passed as input. However, note that real tokenizer training is generally done on a much larger dataset and saved to files. Here, we'll tokenize a sample text and then print out the tokenizer's attributes.

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Sample text to train the tokenizer
sample_texts = [
    "Hugging Face is revolutionizing AI.",
    "Their tokenizer is part of a larger suite of NLP tools.",
    "Tokenization is essential for NLP tasks."
]

# Initialize an untrained BPE tokenizer (plain BPE; a byte-level variant would use a ByteLevel pre-tokenizer)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Initialize a trainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Initialize a pre-tokenizer
tokenizer.pre_tokenizer = Whitespace()

# Train the tokenizer
tokenizer.train_from_iterator(sample_texts, trainer=trainer)

# Print tokenizer's attributes
print("Vocabulary size:", tokenizer.get_vocab_size())
print("Model attributes:", tokenizer.model)
print("Trainer attributes:", trainer)
print("Pre-tokenizer:", tokenizer.pre_tokenizer)

# You can also encode text to see how the tokenizer performs
encoding = tokenizer.encode("Tokenization is essential for every task.")
print("Tokens:", encoding.tokens)
print("Token IDs:", encoding.ids)





Vocabulary size: 97
Model attributes: <tokenizers.models.BPE object at 0x10f757890>
Trainer attributes: <tokenizers.trainers.BpeTrainer object at 0x10f8a5810>
Pre-tokenizer: <tokenizers.pre_tokenizers.Whitespace object at 0x10f8f60b0>
Tokens: ['Tokenization', 'is', 'essential', 'for', 'ev', 'er', '[UNK]', 'tas', 'k', '.']
Token IDs: [82, 33, 95, 84, 60, 39, 0, 73, 21, 5]


Here's a breakdown of the components:

- **`Tokenizer(BPE(unk_token="[UNK]"))`:** This creates a new tokenizer using the BPE (Byte-Pair Encoding) model with a specified unknown token `[UNK]`.
- **`BpeTrainer`:** The trainer is responsible for training the tokenizer model. We can specify special tokens that should be preserved during tokenization.
- **`Whitespace`:** The pre-tokenizer splits the input on whitespace and also separates punctuation from words, a simple and common pre-tokenization step that runs before BPE is applied.
- **`tokenizer.train_from_iterator()`:** This method trains the tokenizer on the list of sample texts provided.

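After training, we examine the tokenizer's attributes and test the tokenizer on a new sentence to observe the generated tokens and their corresponding IDs. In the output above, the `y` in "every" becomes `[UNK]` because that character never appears in the three training sentences, so it is missing from the learned vocabulary.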

## Pre-tokenization

Pre-tokenization is the step before the main tokenization process where the text is prepared for further processing. During pre-tokenization, text is typically split into words and symbols (like punctuation) based on whitespace and other simple rules. This initial segmentation makes the subsequent, more complex tokenization steps easier and more consistent.

**Why Pre-tokenization is Important**:
- **Simplicity**: It simplifies the complex tokenization process by handling the easy and obvious cases like splitting by spaces and punctuation.
- **Uniformity**: It ensures a level of uniformity before more sophisticated tokenization algorithms, such as subword tokenization, are applied.
- **Efficiency**: It can make the whole tokenization process more efficient, especially when dealing with large corpora.

Imagine a scenario where you're tokenizing the text: "HuggingFace's transformers: amazingly simple!". Without pre-tokenization, a naive tokenizer might treat "HuggingFace's" as a single token, which could be problematic if the model has never seen this sequence before.

With pre-tokenization, the text might be split into ["HuggingFace", "'", "s", "transformers", ":", "amazingly", "simple", "!"]. The subword tokenizer then processes each piece independently: pieces that happen to be in its vocabulary, such as "amazingly" or "simple", can stay whole, while an unfamiliar piece such as "HuggingFace" is broken into smaller known subwords. Because subword segmentation is applied within each piece, punctuation is never fused to a neighboring word.

If you don't use pre-tokenization and directly apply a subword tokenizer, you may end up with suboptimal tokenization, where composite words or words with affixes aren't recognized or split properly.

Here's a simple example of manual pre-tokenization using Python:

In [None]:
import re

# Our sample text
text = "HuggingFace's transformers: amazingly simple!"

# A simple pre-tokenization step that splits based on whitespace and punctuation
pre_tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

print("Pre-tokenization:", pre_tokens)

Pre-tokenization: ['HuggingFace', "'", 's', 'transformers', ':', 'amazingly', 'simple', '!']
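
Library pre-tokenizers usually also record where each piece came from, so that tokens can later be mapped back to spans in the original text. A minimal sketch of that idea, reusing the same regular expression (the function name `pre_tokenize_with_offsets` is an illustrative choice):

```python
import re

def pre_tokenize_with_offsets(text):
    # Each match yields the piece plus its (start, end) character span in the original text
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "HuggingFace's transformers: amazingly simple!"
for piece, (start, end) in pre_tokenize_with_offsets(text):
    assert text[start:end] == piece  # offsets point back into the source string
    print(f"{piece!r:15} {start:>2}-{end}")
```

Keeping offsets is what allows tasks such as question answering or entity recognition to highlight the exact characters a predicted token span refers to.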


## What Are Fast Tokenizers in Hugging Face?

Fast tokenizers in Hugging Face are tokenizers implemented in Rust for high performance. They are backed by the Hugging Face `tokenizers` library, which is designed to provide fast and versatile tokenization for both training and production environments. That speed matters when preprocessing large datasets or when low latency is required.

### Why We Need Fast Tokenizers:

1. **Performance**: Fast tokenizers are considerably faster than their pure-Python counterparts, especially when encoding large batches of text.

2. **Parallelism**: Rust's efficient memory management and concurrency models enable these tokenizers to effectively handle parallel tokenization over large batches of text, which is essential for leveraging modern multi-core processors.

3. **Consistency**: They ensure that tokenization is consistent with the pre-trained models since they use the same algorithms and byte-pair encoding (BPE) or WordPiece methodologies.

4. **Full-Fledged Features**: Despite their speed, these tokenizers do not compromise on features and support all the required tokenization steps, such as adding special tokens, handling padding, and providing attention masks.

5. **Ease of Use**: Fast tokenizers integrate seamlessly with Python, allowing users to benefit from their performance while working within the Python ecosystem, which is widely used in data science and machine learning.

### Demonstration of Fast Tokenizers:

Let's do a quick comparison between a standard tokenizer and a fast tokenizer. For this demonstration, we'll tokenize a large list of sentences and measure the time taken by each tokenizer.

In this demonstration, we use the `batch_encode_plus` method to process a batch of sentences. The `fast_tokenizer` is expected to perform the same operation significantly faster than the `standard_tokenizer`. The actual performance gain can vary based on the specifics of the system and the workload, but typically, you'll see a substantial reduction in processing time with the fast tokenizer.

In [None]:
import time
from transformers import BertTokenizer, BertTokenizerFast

# Initialize both the standard Python and the fast Rust BERT tokenizers
standard_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
fast_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Prepare a large list of sentences
sentences = ["The quick brown fox jumps over the lazy dog." for _ in range(100000)]

# Tokenize using the standard tokenizer
start_time = time.perf_counter()  # perf_counter is a monotonic, high-resolution clock suited to timing
standard_tokenizer.batch_encode_plus(sentences, padding=True, truncation=True)
standard_time = time.perf_counter() - start_time

# Tokenize using the fast tokenizer
start_time = time.perf_counter()
fast_tokenizer.batch_encode_plus(sentences, padding=True, truncation=True)
fast_time = time.perf_counter() - start_time

# Compare times
print(f"Standard Tokenizer took: {standard_time:.2f} seconds")
print(f"Fast Tokenizer took: {fast_time:.2f} seconds")

# Verifying the speed improvement
assert fast_time < standard_time, "The fast tokenizer is expected to be faster than the standard tokenizer!"

Standard Tokenizer took: 14.25 seconds
Fast Tokenizer took: 2.49 seconds


## What Is Sentence Pair Tokenization?

Sentence pair tokenization is a process used in NLP tasks where the model needs to understand and compare the relationship between two separate text segments. During tokenization, the two sentences are combined into a single input sequence with special tokens indicating the separation and order.

### When and Why We Use It

We use sentence pair tokenization primarily for tasks like:

- **Question Answering (QA)**: Understanding the context of a passage to answer a related question.
- **Semantic Similarity**: Assessing if two sentences are semantically similar or not.

### How It Is Used

With models like BERT, sentence pair tokenization typically involves:

- Adding a `[CLS]` token at the beginning of the first sentence.
- Adding a `[SEP]` token at the end of the first sentence and again at the end of the second sentence.
- Generating token type ids to indicate which tokens belong to the first sentence and which to the second.
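
The layout described in the list above can be sketched in plain Python. The helper `build_pair_input` is hypothetical and works on already-split tokens; it only illustrates how the special tokens and token type ids are arranged.

```python
def build_pair_input(tokens_a, tokens_b):
    # Layout: [CLS] A [SEP] B [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair_input(
    ["the", "weather", "is", "nice"],
    ["it", "is", "sunny"],
)
for tok, seg in zip(tokens, token_type_ids):
    print(f"{seg}  {tok}")
```

With a real BERT tokenizer you get this layout by passing both texts in one call, for example `tokenizer(sentence_a, sentence_b)`, which returns `input_ids` together with the matching `token_type_ids`.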

### Demo of Sentence Pair Tokenization

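The demonstration below takes a different route from the `[CLS] A [SEP] B [SEP]` layout described above. It uses a sentence-embedding model that encodes the two sentences of each pair separately (as a padded batch of two), then compares the two embeddings with cosine similarity to estimate how semantically similar the sentences are: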

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    """
    This collapses the token embeddings for each sentence into a single vector per
    sentence, using the attention mask to ignore padding tokens.
    """
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Define the sentence pairs whose similarity we want to compare
sentence_pairs = [
    ("The weather is nice today.", "It's a beautiful day."),
    ("I'm learning about NLP.", "I'm studying natural language processing."),
    ("The cat sits on the mat.", "The dog plays in the yard.")
]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Process each pair
for sentence1, sentence2 in sentence_pairs:
    print(f"Sentence pair: {sentence1} | {sentence2}")

    # Tokenize and encode sentences as a batch
    encoded_input = tokenizer([sentence1, sentence2], padding=True, truncation=True, return_tensors='pt')
    print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]), tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][1]))

    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    # Compute cosine similarity (a dot product, since both embeddings are L2-normalized)
    similarity = torch.dot(sentence_embeddings[0], sentence_embeddings[1])
    print(f"Cosine Similarity: {similarity.item()}\n")



Sentence pair: The weather is nice today. | It's a beautiful day.
['[CLS]', 'the', 'weather', 'is', 'nice', 'today', '.', '[SEP]', '[PAD]'] ['[CLS]', 'it', "'", 's', 'a', 'beautiful', 'day', '.', '[SEP]']
Cosine Similarity: 0.5240273475646973

Sentence pair: I'm learning about NLP. | I'm studying natural language processing.
['[CLS]', 'i', "'", 'm', 'learning', 'about', 'nl', '##p', '.', '[SEP]'] ['[CLS]', 'i', "'", 'm', 'studying', 'natural', 'language', 'processing', '.', '[SEP]']
Cosine Similarity: 0.741668164730072

Sentence pair: The cat sits on the mat. | The dog plays in the yard.
['[CLS]', 'the', 'cat', 'sits', 'on', 'the', 'mat', '.', '[SEP]'] ['[CLS]', 'the', 'dog', 'plays', 'in', 'the', 'yard', '.', '[SEP]']
Cosine Similarity: 0.19166551530361176



## Conclusion

Tokenizers are an essential part of NLP workflows, and understanding how to effectively use and customize them is crucial for tackling a wide range of tasks.