<a href="https://colab.research.google.com/github/Unisvet/haf_ai/blob/main/tokenizer_101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizer
Tokenizers are fundamental components in Natural Language Processing (NLP). They break down text into smaller units called tokens, which can be words, subwords, or characters. This process is crucial for various NLP tasks like text classification, machine translation, and sentiment analysis.
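
To make the idea of token granularity concrete, here is a minimal sketch using only the Python standard library. The sample sentence and the regular expression are illustrative assumptions, not part of any library's tokenizer; subword tokenizers fall between these two extremes.

```python
import re

text = "Tokenizers split text."

# Word-level: runs of word characters, with each punctuation mark as its own token
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces, is a token
char_tokens = list(text)

print("Word tokens:", word_tokens)
print("Number of character tokens:", len(char_tokens))
```

Word-level tokens keep sequences short but need a very large vocabulary, while character-level tokens need a tiny vocabulary but produce long sequences.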

Here's how you can get started with tokenizers in Python using the popular `transformers` library:

---

# 1. Installation

First, you need to install the `transformers` library. You can do this using pip:

In [4]:
!pip install transformers



# 2. Basic Usage


In this example, we load a pre-trained tokenizer for the BERT model. Then, we tokenize a sentence and convert the tokens into numerical IDs, which are essential for feeding into deep learning models.

**Steps:**

*   We use `AutoTokenizer` to automatically select the appropriate tokenizer based on the model name (`bert-base-uncased`).
*   `tokenize()` splits the text into tokens.
*   `convert_tokens_to_ids()` maps tokens to their corresponding numerical IDs.

**Additional Notes:**

*   The `transformers` library provides tokenizers for many models, including BERT, GPT-2, and RoBERTa.
*   You can explore pre-trained tokenizers and how they work in the Hugging Face [Tokenizers documentation](https://huggingface.co/docs/tokenizers/index).
*   Tokenization is a required step in preparing text data for NLP models.

In [5]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer (e.g., BERT)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a sentence
text = "This is an example sentence."
tokens = tokenizer.tokenize(text)

# Convert tokens to numerical IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Print the results
print("Tokens:", tokens)
print("Token IDs:", token_ids)

Tokens: ['this', 'is', 'an', 'example', 'sentence', '.']
Token IDs: [2023, 2003, 2019, 2742, 6251, 1012]
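
Conceptually, mapping tokens to IDs is a dictionary lookup against the tokenizer's vocabulary. The sketch below imitates that lookup with a tiny hand-written vocabulary. The six IDs are copied from the output above; the names `toy_vocab`, `toy_tokens_to_ids`, and `toy_ids_to_tokens` are hypothetical, and the `[UNK]` ID of 100 is assumed to match `bert-base-uncased`.

```python
# Toy vocabulary: IDs copied from the output above, plus an assumed [UNK] entry
toy_vocab = {
    "[UNK]": 100,
    "this": 2023,
    "is": 2003,
    "an": 2019,
    "example": 2742,
    "sentence": 6251,
    ".": 1012,
}


def toy_tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token; fall back to the unknown-token ID if it is missing."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


def toy_ids_to_tokens(ids, vocab):
    """Invert the vocabulary to map IDs back to tokens."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[idx] for idx in ids]


tokens = ["this", "is", "an", "example", "sentence", "."]
ids = toy_tokens_to_ids(tokens, toy_vocab)
print("Token IDs:", ids)
print("Round trip:", toy_ids_to_tokens(ids, toy_vocab))
print("Unknown token:", toy_tokens_to_ids(["zzz"], toy_vocab))
```

A real tokenizer's vocabulary has tens of thousands of entries, but the principle is the same: each token has exactly one ID, so the mapping can be reversed.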






---
# 3. Compare Tokenizers
We'll compare the following tokenizers, all from the `transformers` library:

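* **BERT** (`bert-base-uncased`): A popular transformer-based model. Its tokenizer uses WordPiece subwords, and the uncased checkpoint lowercases the text first.
* **GPT-2** (`gpt2`): A language model capable of generating human-like text. Its tokenizer uses byte-level Byte-Pair Encoding (BPE) and preserves case.
* **RoBERTa** (`roberta-base`): An optimized version of BERT. Its tokenizer uses the same byte-level BPE scheme as GPT-2.
* **SentencePiece** (used by XLNet, `xlnet-base-cased`): A subword tokenizer that works directly on raw text, so it can handle multiple languages.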


In [6]:
from transformers import AutoTokenizer

# Define a list of tokenizers to compare
tokenizers = [
    ("BERT", AutoTokenizer.from_pretrained("bert-base-uncased")),
    ("GPT-2", AutoTokenizer.from_pretrained("gpt2")),
    ("RoBERTa", AutoTokenizer.from_pretrained("roberta-base")),
    ("SentencePiece", AutoTokenizer.from_pretrained("xlnet-base-cased")),
]

# Define a sample text
text = "This is a test sentence for comparing tokenizers."

# Compare the tokenization results
for name, tokenizer in tokenizers:
    tokens = tokenizer.tokenize(text)
    print(f"{name}: {tokens}")

BERT: ['this', 'is', 'a', 'test', 'sentence', 'for', 'comparing', 'token', '##izer', '##s', '.']
GPT-2: ['This', 'Ġis', 'Ġa', 'Ġtest', 'Ġsentence', 'Ġfor', 'Ġcomparing', 'Ġtoken', 'izers', '.']
RoBERTa: ['This', 'Ġis', 'Ġa', 'Ġtest', 'Ġsentence', 'Ġfor', 'Ġcomparing', 'Ġtoken', 'izers', '.']
SentencePiece: ['▁This', '▁is', '▁a', '▁test', '▁sentence', '▁for', '▁comparing', '▁token', 'izer', 's', '.']
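
The `##` pieces in the BERT output come from WordPiece, which splits a word it does not know into the longest vocabulary pieces it can find, working left to right. The sketch below is a simplified, pure-Python version of that greedy longest-match idea. The function `toy_wordpiece` and the four-entry vocabulary `toy_pieces` are hypothetical and exist only for illustration; the real tokenizer also handles normalization, punctuation, and a much larger vocabulary.

```python
def toy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the examples below
toy_pieces = {"token", "##izer", "##s", "test"}

print(toy_wordpiece("tokenizers", toy_pieces))
print(toy_wordpiece("test", toy_pieces))
print(toy_wordpiece("xyz", toy_pieces))
```

With this toy vocabulary, `tokenizers` is split the same way as in the BERT output, while a word with no matching pieces becomes a single unknown token.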


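The code prints the tokens produced by each tokenizer for the same text so you can compare them:

1. We import `AutoTokenizer` from `transformers` for loading tokenizers.
2. We create a list of tuples, each containing a name and the loaded tokenizer instance.
3. We define a sample text for comparison.
4. We iterate through the list, tokenize the sample text with each tokenizer, and print the tokenizer name and the resulting tokens.

A few differences stand out in the output. BERT lowercases the text and marks word-internal pieces with `##`. GPT-2 and RoBERTa produce identical tokens here and use `Ġ` to mark a token that was preceded by a space. SentencePiece uses `▁` for the same purpose and splits `tokenizers` into three pieces instead of two.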
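
The marker characters in the output above (`##`, `Ġ`, `▁`) record where the original spaces were, which is what makes it possible to turn tokens back into text. The sketch below reverses each convention by hand on the printed token lists. The helper names `join_wordpiece` and `join_space_marked` are hypothetical; in practice the tokenizer's own `convert_tokens_to_string()` or `decode()` methods handle this, along with cases this sketch ignores.

```python
def join_wordpiece(tokens):
    """Rebuild text from WordPiece tokens, where '##' continues the previous token."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]      # glue onto the previous piece
        elif text:
            text += " " + token    # a new word starts here
        else:
            text = token
    return text


def join_space_marked(tokens, marker):
    """Rebuild text from tokens where `marker` stands for a preceding space."""
    return "".join(tokens).replace(marker, " ").strip()


# Token lists copied from the output above
bert_tokens = ['this', 'is', 'a', 'test', 'sentence', 'for', 'comparing',
               'token', '##izer', '##s', '.']
gpt2_tokens = ['This', 'Ġis', 'Ġa', 'Ġtest', 'Ġsentence', 'Ġfor', 'Ġcomparing',
               'Ġtoken', 'izers', '.']
xlnet_tokens = ['▁This', '▁is', '▁a', '▁test', '▁sentence', '▁for', '▁comparing',
                '▁token', 'izer', 's', '.']

print(join_wordpiece(bert_tokens))
print(join_space_marked(gpt2_tokens, "Ġ"))
print(join_space_marked(xlnet_tokens, "▁"))
```

The BERT result stays lowercase and keeps a space before the final period, because WordPiece tokens do not record the original casing or whether punctuation was attached to a word. The two space-marked schemes recover the sample sentence exactly.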

**Additional Notes:**

*   You can modify the `text` variable to test with different sentences.
*   Feel free to experiment with other tokenizers available in the `transformers` library. More examples are in the [preprocessing guide](https://huggingface.co/docs/transformers/en/preprocessing).
*   To train your own tokenizer, see the [tokenizer training notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb).
