In [1]:
!pip -q install transformers

# **Tokenization**

**Tokenization** is the process of breaking down text into smaller units, called **tokens**, which can be words, subwords, characters, or even bytes, depending on the tokenization technique used. In the context of Large Language Models (LLMs), tokenization is the preprocessing step that turns raw text into the sequence of discrete units the model actually operates on.

---

## **Why is Tokenization Important in LLMs?**

1. **Efficient Text Representation:**
   - Computers process numbers, not raw text. Tokenization splits text into tokens so that it can be converted into a numerical form the LLM can work with.
   - Each token is mapped to a unique integer ID via a vocabulary, and the model uses these IDs for its computations.

2. **Handling Language Complexity:**
   - Tokenization helps in breaking down complex, unstructured language into manageable parts.
   - Subword tokenization techniques (e.g., Byte Pair Encoding, WordPiece) allow LLMs to handle rare or unknown words effectively by splitting them into smaller, meaningful components.

3. **Reducing Vocabulary Size:**
   - Instead of storing every word in a language, subword-based tokenization builds tokens for frequent subword units. This shrinks the vocabulary, and with it the embedding table, making the model more memory-efficient.
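
The token-to-ID mapping described in the list above can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular library builds its vocabulary: the corpus, the `[UNK]` placeholder name, and the helpers `build_vocab` and `encode` are all hypothetical.

```python
# Toy sketch: assign each distinct token an integer ID, then look tokens up.
# The corpus, "[UNK]" and both helper functions are made up for illustration.

def build_vocab(tokens, unk_token="[UNK]"):
    """Give every distinct token the next free integer ID; ID 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, unk_token="[UNK]"):
    """Replace each token with its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

corpus_tokens = ["tokenization", "is", "fun", "and", "tokenization", "is", "useful"]
vocab = build_vocab(corpus_tokens)

print(vocab)
print(encode(["tokenization", "is", "hard"], vocab))  # "hard" was never seen, so it maps to ID 0
```

A word-level vocabulary like this one has to fall back to `[UNK]` for every unseen word, which is the problem subword tokenization is meant to reduce.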
  
## Examples of **tokenization**:

---

### **1. Word-Based Tokenization**
- Text: *"Tokenization is fun!"*  
- Tokens: `["Tokenization", "is", "fun", "!"]`
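
A minimal way to reproduce this split is a regular expression that keeps runs of word characters together and emits each punctuation mark on its own. The helper name `word_tokenize` is an assumption for this sketch; production tokenizers handle many more cases (contractions, URLs, and so on).

```python
import re

def word_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Tokenization is fun!"))
```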

---

### **2. Character-Based Tokenization**
- Text: *"Fun!"*  
- Tokens: `["F", "u", "n", "!"]`
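
In Python, character-level tokenization needs no extra tooling, since a string is already a sequence of characters. The function name below is only illustrative.

```python
def char_tokenize(text):
    # Each Unicode code point in the string becomes one token.
    return list(text)

print(char_tokenize("Fun!"))
```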

---

### **3. Subword-Based Tokenization (e.g., Byte Pair Encoding or WordPiece)**
- Text: *"Unbelievable!"*  
- Tokens: `["Un", "believable", "!"]`  
  *(The word is split into smaller subword units. The exact split depends on the vocabulary the tokenizer has learned.)*
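
The idea of applying a subword vocabulary can be sketched with a greedy longest-match-first loop, similar in spirit to how WordPiece applies its vocabulary. Everything here is a simplification under stated assumptions: `toy_vocab` is hand-written rather than learned, the function name is hypothetical, and the `##` continuation prefix that BERT-style tokenizers use is left out.

```python
def greedy_subword_tokenize(word, vocab, unk_token="[UNK]"):
    """Repeatedly take the longest vocabulary entry that matches at the current position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matches here, so give up on the whole word.
            return [unk_token]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hand-picked toy vocabulary (an assumption, not a learned one).
toy_vocab = {"Un", "believ", "able", "believable", "!"}

print(greedy_subword_tokenize("Unbelievable!", toy_vocab))
```

With a different vocabulary the same word would split differently; for example, removing `"believable"` from `toy_vocab` yields `"believ"` followed by `"able"`.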

---

### **4. Byte-Level Tokenization (Used in GPT Models)**
- Text: *"Hello 😊"*  
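- Bytes (UTF-8, hex): `48 65 6c 6c 6f 20 f0 9f 98 8a`  
  *(The text is first converted to bytes, and tokens are built from byte sequences. Any character, including an emoji, can therefore be represented without an unknown token, although a single emoji may span several tokens.)*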
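
To see what a byte-level tokenizer starts from, the sketch below converts the text to its raw UTF-8 bytes. It deliberately stops there: real byte-level BPE tokenizers go on to merge frequent byte sequences into larger tokens, which this sketch does not attempt.

```python
text = "Hello 😊"

# Encode to UTF-8 and view each byte as an integer in the range 0-255.
byte_values = list(text.encode("utf-8"))

print(byte_values)
print(len(text), "characters ->", len(byte_values), "bytes")

# The bytes carry all the information needed to reconstruct the original text.
restored = bytes(byte_values).decode("utf-8")
print(restored == text)
```

The emoji is a single character but occupies four bytes, which is why byte-level schemes never need an unknown token: every possible input reduces to values from a fixed set of 256.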

---

### **5. Sentence Tokenization**
- Text: *"I love coding. It's amazing!"*  
- Tokens: `["I love coding.", "It's amazing!"]`  
  *(Breaks text into sentences.)*
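
A naive sentence splitter can be written with a regular expression that splits on whitespace following `.`, `!`, or `?`. This is only a sketch under that assumption; it will mis-split abbreviations such as "Dr." and is not a substitute for a dedicated sentence tokenizer. The name `sentence_tokenize` is hypothetical.

```python
import re

def sentence_tokenize(text):
    # Split at whitespace that is immediately preceded by sentence-ending punctuation.
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(sentence_tokenize("I love coding. It's amazing!"))
```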

### [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification)

### [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)

### [Model](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

my_model = AutoModelForSequenceClassification.from_pretrained(model_name)
my_tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=my_model, tokenizer=my_tokenizer)
result = classifier("Not happy with the world")
print(result)

Device set to use cpu


[{'label': '2 stars', 'score': 0.5226433873176575}]


In [4]:
result2 = classifier("I am happy with the world")
print(result2)

[{'label': '5 stars', 'score': 0.6275884509086609}]


### Text Tokenization

In [5]:
# Example Text
text = "I am happy with the world"

# Tokenize the text (this uncased model lowercases the input first)
tokens = my_tokenizer.tokenize(text)
print("Token", tokens)

Token ['i', 'am', 'happy', 'with', 'the', 'world']


In [8]:
# Convert tokens to input IDs
input_ids = my_tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[151, 10345, 19308, 10171, 10103, 10228]


In [13]:
# Encode the text (tokenization + conversion to input IDs, with special tokens added at both ends)

encoded_input = my_tokenizer(text)
print("Encoded Text:", encoded_input)

Encoded Text: {'input_ids': [101, 151, 10345, 19308, 10171, 10103, 10228, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
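
The structure of this output can be reproduced by hand from the token IDs printed earlier. The sketch below is a plain-Python illustration, not the tokenizer's actual implementation: the helper `add_special_tokens` is hypothetical, and the values 101 and 102 are simply read off the output above as the IDs wrapped around the sequence.

```python
# IDs taken from this notebook's outputs; CLS_ID and SEP_ID are the two extra
# IDs that appear at the start and end of 'input_ids'.
CLS_ID, SEP_ID = 101, 102
token_ids = [151, 10345, 19308, 10171, 10103, 10228]

def add_special_tokens(ids):
    """Wrap the IDs in the start/end markers and build the two companion lists."""
    input_ids = [CLS_ID] + ids + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # one segment only, so every position is 0
        "attention_mask": [1] * len(input_ids),  # no padding, so every position is attended to
    }

encoded = add_special_tokens(token_ids)
print(encoded)
```

This explains why the encoded sequence has eight entries while the token list has six.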


In [16]:
# Decode the input IDs back into text (special tokens are included)

decoded_input = my_tokenizer.decode(encoded_input["input_ids"])
print("Decoded Text:", decoded_input)

Decoded Text: [CLS] i am happy with the world [SEP]
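
Decoding is essentially the reverse vocabulary lookup. The sketch below rebuilds the string above from a tiny ID-to-token table assembled from this notebook's own outputs. It is an assumption-laden simplification: `toy_decode` is a hypothetical helper, the table covers only these eight IDs, and a real decoder also has to merge subword pieces back into words.

```python
# ID -> token pairs collected from the outputs in this notebook.
id_to_token = {
    101: "[CLS]",
    151: "i",
    10345: "am",
    19308: "happy",
    10171: "with",
    10103: "the",
    10228: "world",
    102: "[SEP]",
}

def toy_decode(ids):
    """Look up each ID and join the resulting tokens with spaces."""
    return " ".join(id_to_token[i] for i in ids)

decoded = toy_decode([101, 151, 10345, 19308, 10171, 10103, 10228, 102])
print(decoded)
```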
