
Here's a detailed breakdown of **tokenization in NLP**, covering its purpose, types, challenges, techniques, and tools:

---

### **What is Tokenization?**
Tokenization is the process of splitting text into smaller, meaningful units called **tokens** (e.g., words, subwords, characters, or symbols). Tokens are the basic units for downstream NLP tasks such as parsing, sentiment analysis, and machine translation.

---

### **Why Tokenization Matters**
- Converts unstructured text into structured data for machines.
- Reduces complexity by breaking text into manageable units.
- Enables feature extraction (e.g., word frequency, n-grams).
- Handles language-specific nuances (e.g., compound words, contractions).

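As a small illustration of the feature-extraction point above, the sketch below counts word frequencies and bigrams once a text has been tokenized. The sample sentence is made up, and a plain whitespace split stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical sample text; str.split() stands in for a real tokenizer.
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()

# Word frequency: how often each token occurs.
word_freq = Counter(tokens)

# Bigrams (n-grams with n = 2): pairs of adjacent tokens.
bigrams = list(zip(tokens, tokens[1:]))
bigram_freq = Counter(bigrams)

print(word_freq.most_common(2))  # [('the', 3), ('sat', 2)]
print(len(bigrams))              # 10
```
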
---

### **Types of Tokenization**
1. **Word-Level Tokenization**  
   Splits text into words.  
   - Example: `"Don't panic!" → ["Don't", "panic", "!"]`  
   - **Challenges**:  
     - Contractions (`don't` vs. `do not`).  
     - Hyphenated words (`state-of-the-art`).  
     - Languages without spaces (e.g., Chinese, Japanese).

2. **Subword Tokenization**  
   Splits rare/unknown words into smaller units (e.g., prefixes, suffixes).  
   - Example: `"unhappiness" → ["un", "happiness"]`  
   - Used in modern models (BERT, GPT) to handle out-of-vocabulary (OOV) words.  
   - Popular methods: **Byte-Pair Encoding (BPE)**, **WordPiece**, **SentencePiece**.

3. **Character-Level Tokenization**  
   Splits text into individual characters.  
   - Example: `"cat" → ["c", "a", "t"]`  
   - Useful for languages written without spaces between words (e.g., Chinese) and robust to typos and unseen words.  
   - **Drawback**: Loses word-level meaning and produces much longer sequences.

4. **Sentence/Paragraph Tokenization**  
   Splits text into sentences or paragraphs.  
   - Example: `"Hello! How are you?" → ["Hello!", "How are you?"]`  

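Three of the granularities above (word, character, and sentence) can be approximated with the standard library alone. This is a rough sketch, not a production tokenizer: the regular expressions are simplifying assumptions, and the naive sentence rule would wrongly split after abbreviations such as `U.K.`.

```python
import re

text = "Don't panic! Tokenizers are fun."

# Word-level: keep apostrophes inside a word, emit other punctuation separately.
word_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)
print(word_tokens)
# ["Don't", 'panic', '!', 'Tokenizers', 'are', 'fun', '.']

# Character-level: every character becomes a token.
char_tokens = list("cat")
print(char_tokens)
# ['c', 'a', 't']

# Sentence-level: naive split on whitespace that follows '.', '!' or '?'.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)
print(sentence_tokens)
# ["Don't panic!", 'Tokenizers are fun.']
```
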
---

### **Common Tokenization Techniques**
1. **Whitespace Tokenization**  
   Splits text based on spaces.  
   - Simple but fails with punctuation (e.g., `"Hello,world"` → `["Hello,world"]`).

2. **Punctuation-Based Tokenization**  
   Splits on spaces and punctuation.  
   - Example: `"Can't!" → ["Can", "'", "t", "!"]` (oversplits the contraction).

3. **Regex-Based Tokenization**  
   Uses regular expressions to define token boundaries.  
   - Example: `re.findall(r"\w+|\S", text)` splits words and punctuation.

4. **Rule-Based Tokenization**  
   Uses language-specific rules (e.g., handling apostrophes in English).  
   - Example: NLTK's `word_tokenize` splits `"don't"` into `["do", "n't"]`.

5. **Machine Learning-Based Tokenization**  
   Trains models to identify token boundaries (common in subword tokenization).

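The first four techniques above can be compared side by side on one input. The sketch uses only the standard library; `rule_based_tokenize` is a hypothetical helper with a single toy rule (splitting `n't` off its stem, in the spirit of Treebank-style tokenizers), not a reimplementation of NLTK.

```python
import re

text = "Hello, world! Don't stop."

# 1. Whitespace: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()
# ['Hello,', 'world!', "Don't", 'stop.']

# 2. Punctuation-based: each punctuation mark becomes its own token.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)
# ['Hello', ',', 'world', '!', 'Don', "'", 't', 'stop', '.']

# 3. Regex-based: the pattern quoted in the list above.
regex_tokens = re.findall(r"\w+|\S", text)
# Same result as (2) on this input.

# 4. Rule-based: one hand-written English rule applied before splitting.
def rule_based_tokenize(s):
    s = re.sub(r"(\w+)(n't)\b", r"\1 \2", s)   # "Don't" -> "Do n't"
    return re.findall(r"n't|\w+|[^\w\s]", s)

rule_tokens = rule_based_tokenize(text)
# ['Hello', ',', 'world', '!', 'Do', "n't", 'stop', '.']

for name, toks in [("whitespace", whitespace_tokens), ("punctuation", punct_tokens),
                   ("regex", regex_tokens), ("rule-based", rule_tokens)]:
    print(f"{name:12} {toks}")
```
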
---

### **Challenges in Tokenization**
1. **Ambiguity**  
   - `"New York-based"` → `["New", "York-based"]` vs. `["New York", "based"]`.
   - `"gummy bears"` (candy) vs. `"gummy bears"` (stuffed animals).

2. **Language-Specific Issues**  
   - **German**: Compound words (`Donaudampfschifffahrtsgesellschaft`).  
   - **Chinese/Japanese**: No spaces between words.  
   - **Arabic**: Morphologically rich (prefixes/suffixes change meaning).

3. **Domain-Specific Text**  
   - Medical terms (`"COVID-19"`), code snippets, hashtags (`#ThrowbackThursday`).

4. **Handling OOV Words**  
   - Proper nouns, slang, or typos (e.g., `"goooood"`).

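To make the OOV problem concrete, the sketch below contrasts a word-level lookup, which collapses any unseen word to `[UNK]`, with a greedy longest-match-first subword split in the style of WordPiece inference. The vocabulary is a hypothetical toy set, the `##` prefix marks a piece that continues a word, and the function names are illustrative rather than taken from any library.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
vocab = {"[UNK]", "good", "un", "##happi", "##ness", "play", "##ing", "##ed"}

def word_level_lookup(word):
    """Word-level vocabularies map every unseen word to a single unknown token."""
    return word if word in vocab else "[UNK]"

def greedy_subword_split(word):
    """Repeatedly take the longest vocabulary piece that matches at the current position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(word_level_lookup("unhappiness"))     # [UNK]
print(greedy_subword_split("unhappiness"))  # ['un', '##happi', '##ness']
print(greedy_subword_split("playing"))      # ['play', '##ing']
print(greedy_subword_split("goooood"))      # ['[UNK]']
```
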
---

### **Popular Tokenization Tools**
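1. **NLTK** (Python)  
   Provides `word_tokenize` and `sent_tokenize` for word and sentence tokenization.

2. **spaCy** (Python)  
   Tokenizes text as the first step of its language pipelines (e.g., `en_core_web_sm`).

3. **Hugging Face Tokenizers**  
   Trains and runs subword tokenizers such as BPE.

4. **Stanford CoreNLP** (Java)  
   Robust for multilingual tokenization.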

---

### **Subword Tokenization Algorithms**
1. **Byte-Pair Encoding (BPE)**  
   - Merges frequent character pairs iteratively.  
   - Used in GPT-2, RoBERTa.

2. **WordPiece**  
   - Similar to BPE but prioritizes likelihood, not frequency.  
   - Used in BERT.

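3. **SentencePiece**  
   - Works directly on raw text (no pre-tokenization) and treats whitespace as an ordinary symbol.  
   - A toolkit rather than a separate algorithm: it implements BPE and the Unigram language model.  
   - Used in T5 and in multilingual models such as mT5 and XLM-R.

4. **Unigram Language Model**  
   - Starts with a large vocabulary and iteratively prunes the tokens that contribute least to the likelihood of the training data.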

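The BPE merge loop described above can be sketched on a toy corpus. This is a minimal illustration under stated assumptions: the word frequencies are made up, `</w>` is used as an end-of-word marker, and the helper names `get_pair_counts` and `merge_pair` are hypothetical. Production tokenizers (for example byte-level BPE) differ in many details.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (assumed frequencies): each word starts as a sequence of characters.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

merges = []
for _ in range(3):  # the number of merges controls the vocabulary size
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first one seen
    merges.append(best)
    corpus = merge_pair(best, corpus)

print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>')]
print(list(corpus))
# [('l', 'o', 'w', '</w>'), ('l', 'o', 'w', 'e', 'r', '</w>'),
#  ('n', 'e', 'w', 'est</w>'), ('w', 'i', 'd', 'est</w>')]
```
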
---

### **Evaluation of Tokenization**
- **Consistency**: Same word split the same way across contexts.  
- **Downstream Performance**: Impact on tasks like translation or classification.  
- **Vocabulary Size**: Balance between coverage and memory usage.  
- **Computational Efficiency**: Speed of tokenization.

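Some of these criteria are easy to measure directly. The sketch below reports vocabulary size, average tokens per whitespace-delimited word, wall-clock time, and optionally an OOV rate against a given vocabulary. The function `evaluate_tokenizer` and its report keys are hypothetical names chosen for illustration; consistency and downstream performance need task-specific experiments and are not covered here.

```python
import re
import time

def evaluate_tokenizer(tokenize, texts, vocab=None):
    """Collect simple intrinsic statistics for a tokenizer function."""
    start = time.perf_counter()
    tokenized = [tokenize(t) for t in texts]
    elapsed = time.perf_counter() - start

    all_tokens = [tok for toks in tokenized for tok in toks]
    n_words = sum(len(t.split()) for t in texts)
    report = {
        "vocab_size": len(set(all_tokens)),              # distinct tokens observed
        "tokens_per_word": len(all_tokens) / n_words,    # sequence-length overhead
        "seconds": elapsed,                              # computational cost
    }
    if vocab is not None:
        unknown = sum(tok not in vocab for tok in all_tokens)
        report["oov_rate"] = unknown / len(all_tokens)
    return report

texts = ["Hello, world!", "Hello again, world."]  # made-up sample data

word_report = evaluate_tokenizer(lambda t: re.findall(r"\w+|[^\w\s]", t), texts)
char_report = evaluate_tokenizer(list, texts)

print(word_report["vocab_size"], word_report["tokens_per_word"])  # 6 1.8
print(char_report["vocab_size"], char_report["tokens_per_word"])  # 15 6.4
```
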
---

### **Applications of Tokenization**
1. **Text Preprocessing** for NLP models.  
2. **Named Entity Recognition** (identifying entities like `[ORG] Apple`).  
3. **Machine Translation** (aligning source/target tokens).  
4. **Search Engines** (indexing keywords).  

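For the search-engine case, tokens are what the index is keyed on. A minimal inverted-index sketch follows; the documents are invented, and the names `docs`, `index`, and `search` are hypothetical. Real engines add stemming, stop-word handling, and ranking.

```python
import re
from collections import defaultdict

# Made-up documents keyed by ID.
docs = {
    1: "Tokenization splits text into tokens.",
    2: "Search engines index tokens.",
    3: "Text search is fast.",
}

def tokenize(text):
    """Lowercase and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

# Inverted index: token -> set of document IDs that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)

def search(query):
    """Return IDs of documents containing every query token."""
    postings = [index.get(tok, set()) for tok in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []

print(search("text"))           # [1, 3]
print(search("search tokens"))  # [2]
print(search("Text Search"))    # [3]
```
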
---

### **Best Practices**
1. Choose tokenization based on **language** and **task** (e.g., subword for translation).  
2. Handle **case sensitivity** (lowercasing vs. preserving case).  
3. Normalize text consistently (e.g., apply a Unicode normalization form such as NFC or NFKC; strip accents only if the task tolerates it).  
4. Use pre-trained tokenizers for transformer models (e.g., `BertTokenizer`).  

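The case and normalization choices above can be sketched with the standard library's `unicodedata` module. The `normalize` helper and its flags are hypothetical. Whether to lowercase or strip accents depends on the task, and pre-trained tokenizers usually expect the same normalization they were trained with.

```python
import unicodedata

def normalize(text, lowercase=True, strip_accents=True):
    """Apply Unicode normalization, then optional accent stripping and lowercasing."""
    # NFKC folds compatibility variants (e.g., full-width letters) into canonical forms.
    text = unicodedata.normalize("NFKC", text)
    if strip_accents:
        # Decompose, drop combining marks, then recompose what is left.
        decomposed = unicodedata.normalize("NFD", text)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", kept)
    if lowercase:
        text = text.lower()
    return text

print(normalize("Café Ünïcode"))                              # cafe unicode
print(normalize("Café", lowercase=False, strip_accents=False))  # Café
print(normalize("ＮＬＰ", lowercase=False))                    # NLP
```
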
---

### **Example: Custom Tokenization with Regex**
```python
import re

text = "Email: user@example.com, #NLP"
pattern = r"""
    \w+@\w+\.\w+      # Emails (simplified)
    |\#\w+            # Hashtags; '#' is escaped because re.VERBOSE treats a bare '#' as a comment
    |\w+              # Words
    |[^\s\w]          # Punctuation
"""

tokens = re.findall(pattern, text, re.VERBOSE)
# ['Email', ':', 'user@example.com', ',', '#NLP']
```

---

### **Recent Trends**
- **Multilingual Tokenization**: Single tokenizer for multiple languages.  
- **Subword Dominance**: BPE/WordPiece in transformer-based models.  
- **Dynamic Tokenization**: Adaptive to domain-specific text (e.g., biomedical).

---

Tokenization is the foundation of NLP pipelines, and choosing the right strategy significantly impacts model performance. Always align your tokenization method with your data and task!

### **Examples**

```python
# spaCy
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup.")
tokens = [token.text for token in doc]  # ['Apple', 'is', ...]
```

```python
# NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Both functions need the Punkt data; recent NLTK releases look it up as "punkt_tab".
nltk.download("punkt_tab")

text = "Hello, world! This is NLP."
print(word_tokenize(text))  # ['Hello', ',', 'world', '!', ...]
print(sent_tokenize(text))  # ['Hello, world!', 'This is NLP.']
```

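```python
# Hugging Face Tokenizers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

# Training options such as vocab_size belong to the trainer, not to train().
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(files=["text.txt"], trainer=trainer)  # text.txt: a plain-text training corpus
```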