<a href="https://colab.research.google.com/github/arkeodev/nlp/blob/main/Tokeniser/tokeniser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenisation

## Introduction

Tokenization in NLP is the process of breaking down a stream of textual data into smaller units called tokens. These tokens can be words, phrases, symbols, or other meaningful elements. The primary purpose of tokenization is to convert the unstructured form of text into a structured form, which is easier for computers to understand and process. By identifying the boundaries between words, sentences, or other textual elements, tokenization lays the groundwork for further text analysis and processing tasks such as parsing, indexing, and semantic analysis.


## How Does Tokenization Contribute to the Preprocessing of Textual Data?



Tokenization contributes to the preprocessing of textual data in several key ways:

1. **Normalization:** Tokenization often goes hand-in-hand with normalization processes such as converting all characters to lowercase, removing punctuation, and eliminating white spaces. This standardization is critical for ensuring that the text is in a uniform format, reducing complexity and variability in the data.

2. **Feature Extraction:** Tokens serve as the basic units for feature extraction, which is essential for training machine learning models. By breaking the text down into tokens, it's possible to quantify certain features of the text, such as the frequency of specific words or phrases, which can then be used as inputs for models.

3. **Improvement of Model Performance:** Proper tokenization can significantly impact the performance of NLP models. By accurately identifying tokens, models can better understand the semantics of the text, leading to more accurate predictions and analyses.

4. **Language Modeling**: Tokenization is crucial for language modeling tasks, where the goal is to predict the next word or sequence of words. A well-defined set of tokens allows models to better learn the structure and rules of a language.

5. **Adaptability to Different Languages**: Effective tokenization techniques can be adapted to different languages and scripts, some of which may not use spaces to separate words or have complex morphology. This adaptability is key for developing multilingual NLP systems.
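The first two points can be illustrated with a minimal sketch that uses only Python's standard library. The `normalize` and `tokenize` helpers and the sample sentence are illustrative assumptions, not part of any library discussed in this notebook; a real pipeline would normally use a dedicated tokenizer.

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

def tokenize(text: str) -> list[str]:
    """Split normalized text into word tokens."""
    return normalize(text).split()

# Hypothetical sample text
sample = "The cat sat.  The CAT slept!"
tokens = tokenize(sample)

# Feature extraction: raw term frequencies over the tokens
term_frequencies = Counter(tokens)

print(tokens)
print(term_frequencies.most_common(2))
```

After normalization, "The" and "CAT" collapse into the same tokens as "the" and "cat", so their counts accumulate instead of being split across variants.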

## Strategies for Overcoming Tokenization Challenges in Multilingual NLP

### The Tokenization Challenges and Complexities

- Preprocessing considerations and segmentation decisions.

- Performance and scalability for large datasets.

- Adaptability to new texts and contexts.

- The phenomenon where a word has multiple meanings (polysemy) or different words look the same (homographs) complicates tokenization.

- Technical jargon, slang, and newly coined terms (neologisms) are frequently updated and can vary widely across different communities and domains.

- Abbreviations, acronyms, and initialisms can be written in multiple ways (with or without periods, spaces, etc.), making it challenging for tokenizers to consistently identify and process them without specialized rules or contextual analysis.

- In multilingual texts or texts that include frequent code-switching (switching between languages), tokenizers must be able to recognize and handle multiple languages' syntax and grammar within the same sentence or document.

- Script variations and non-Latin characters:

  - **Non-Latin scripts:** Languages that use scripts other than Latin (e.g., Cyrillic, Arabic, Devanagari) may present additional challenges due to the specific characteristics of each script, such as right-to-left writing or the use of diacritics and ligatures.

  - **Complex writing systems:** Scripts like Chinese, where text is written in logograms (characters representing words or morphemes), demand tokenization methods that can discern individual characters and their combinations as words.

- Complex morphology:

  - **Agglutinative languages:** Turkish, Finnish, and Korean are agglutinative, meaning they form words by stringing together a base with multiple affixes, each adding additional meaning.

  - **Fusional languages:** Russian, German, and Arabic exhibit fusional morphology, where a single word form can convey several grammatical categories, e.g., case, gender, number, tense.
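Several of these challenges show up as soon as text is split on whitespace alone. The sketch below is only a demonstration of the failure modes; the sample sentences are made up for illustration.

```python
# Hypothetical samples, one per challenge
samples = {
    "abbreviation": "Dr. Smith moved to the U.S. in 2020.",
    "chinese": "我喜欢自然语言处理",                      # no spaces between words
    "code_switching": "Bugün meeting'e geç kaldım, sorry!",  # Turkish + English
}

for name, text in samples.items():
    tokens = text.split()  # naive whitespace tokenization
    print(f"{name}: {len(tokens)} token(s) -> {tokens}")

abbreviation_tokens = samples["abbreviation"].split()
chinese_tokens = samples["chinese"].split()
```

The Chinese sentence comes back as a single token because it contains no spaces, and the final period stays glued to `2020.`, which is indistinguishable on the surface from the periods in `Dr.` and `U.S.`.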


### Overcoming the Challenges

To overcome the challenges above, the following strategies can be applied:

- **Contextual Models:** Use machine learning and deep learning models, such as recurrent neural networks (RNNs) and transformers, that can understand the context around words or characters. This approach is particularly effective for languages without clear word boundaries and for handling polysemy and homography.

- **Subword Tokenization:** Implement subword tokenization algorithms like Byte Pair Encoding (BPE), WordPiece, or SentencePiece, which can dynamically adjust to the vocabulary of the corpus. These methods are useful for agglutinative languages and for handling neologisms and domain-specific language.

- **Dictionaries and Lexicons:** Utilize comprehensive dictionaries and lexicons that include words, phrases, idioms, and even slang. This is especially useful for languages with rich morphology or scripts without white space separation, as it aids in accurately identifying word boundaries.

- **Morphological Analysis:** For languages with complex morphology, employ morphological analyzers that can dissect words into their root forms and affixes, facilitating more granular tokenization and improving subsequent NLP tasks.

- **Regular Expressions:** Use regular expressions to create patterns that match specific tokenization needs, such as identifying abbreviations, numbers, or mixed-language content. This can be particularly effective for preprocessing and normalizing text.

- **Language-Agnostic Tokenization:** Adopt language-agnostic tokenization methods that treat text as a raw byte sequence, allowing for consistent tokenization across languages without relying on language-specific cues.

- **Multilingual Training:** Train tokenization models on multilingual corpora to enable them to handle text from multiple languages and scripts.

- **Online Learning:** Implement tokenization systems that can continuously learn and adapt from new text, allowing them to stay up-to-date with evolving language use, slang, and neologisms.

- **User Feedback:** Incorporate feedback mechanisms that allow users to correct tokenization errors. Over time, use this feedback to refine and improve the tokenization algorithms.
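Two of the strategies above, regular expressions and byte-level (language-agnostic) tokenization, are easy to sketch with the standard library. The pattern below is an illustrative assumption that covers only a few cases (dotted abbreviations, numbers, words with apostrophes, punctuation); it is not a complete tokenizer.

```python
import re

# Alternatives are tried in order, so the more specific ones come first.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}      # dotted abbreviations such as U.S.
    | \d+(?:[.,]\d+)*       # numbers such as 3.5 or 1,000
    | \w+(?:'\w+)?          # words, optionally with an apostrophe part
    | [^\w\s]               # any other single non-space symbol
    """,
    re.VERBOSE,
)

def regex_tokenize(text: str) -> list[str]:
    """Return all non-overlapping pattern matches, left to right."""
    return TOKEN_PATTERN.findall(text)

tokens = regex_tokenize("The U.S. economy grew 3.5% in Q2, didn't it?")
print(tokens)

# Byte-level view: every string maps to IDs in the range 0-255,
# regardless of language or script.
byte_ids = list("na\u00efve".encode("utf-8"))
print(byte_ids)
```

The byte-level view needs no language-specific rules, but a non-ASCII character such as "ï" expands to more than one ID, so sequences get longer.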


## Types of Tokenization

### Word Tokenization

Word tokenization involves splitting a piece of text into individual words using spaces and punctuation as delimiters. It's most effectively used in tasks requiring the analysis or manipulation of text at the word level, such as frequency analysis, word embeddings for machine learning models, and simple text classification tasks.

**Examples**
- NLTK's `word_tokenize`
- spaCy's `Tokenizer`
- Apache OpenNLP's `TokenizerME`

### Character Tokenization

Character tokenization is used in tasks where understanding or manipulating text at the most granular level is crucial. This includes certain types of text classification, language modeling where understanding of individual characters is important, and tasks in languages where the concept of a "word" is not straightforward.

The main limitations include a potential loss of semantic information (since characters carry less meaning than words or subwords) and increased computational complexity, as models may need to process significantly longer sequences compared to word or subword tokenization.

**Example**
- Custom character tokenizers built for specific tasks or languages
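A custom character tokenizer can be as small as the following sketch. The sample text and the way IDs are assigned (sorted order of distinct characters) are assumptions for illustration.

```python
text = "tokenization matters"

char_tokens = list(text)   # one token per character, spaces included
word_tokens = text.split()

# Map each distinct character to an integer ID
vocab = {ch: idx for idx, ch in enumerate(sorted(set(char_tokens)))}
inverse_vocab = {idx: ch for ch, idx in vocab.items()}

char_ids = [vocab[ch] for ch in char_tokens]
decoded = "".join(inverse_vocab[i] for i in char_ids)

print(f"words: {len(word_tokens)}, characters: {len(char_tokens)}, vocabulary: {len(vocab)}")
print(char_ids)
```

The vocabulary stays tiny, but the same text becomes a much longer sequence than its word-level form, which is the trade-off described above.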

### Subword Tokenization

Subword tokenization addresses challenges such as handling out-of-vocabulary (OOV) words by breaking down words into smaller, meaningful units (subwords) that are more likely to be seen during training. It also helps in representing a large vocabulary more compactly and efficiently, which is particularly useful in languages with rich morphology or agglutinative languages.

Subword tokenization balances the granularity of character tokenization (high granularity but low semantic information) and the semantic richness of word tokenization (high semantic information but poor handling of OOV words). It allows models to understand and generate text more effectively by leveraging the semantic meaning of subword units while mitigating the issues associated with large vocabularies and OOV words.

**Example**
- BERT's WordPiece
- GPT-2's BPE (Byte Pair Encoding)
- SentencePiece (supports both BPE and unigram language model)
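To make the OOV behaviour concrete, here is a minimal sketch of greedy longest-match segmentation, in the style used by WordPiece at inference time. The vocabulary is a hand-written toy, and the `##` continuation marker follows BERT's convention; trained tokenizers learn their vocabularies from data.

```python
def greedy_subword_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"token", "##ization", "##izer", "##s", "un", "##related"}

print(greedy_subword_tokenize("tokenization", toy_vocab))
print(greedy_subword_tokenize("tokenizers", toy_vocab))
print(greedy_subword_tokenize("xyz", toy_vocab))
```

Neither "tokenization" nor "tokenizers" is in the vocabulary as a whole word, yet both can be represented from known pieces; only a string with no matching pieces falls back to the unknown token.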

### Sentence Tokenization

Sentence tokenization splits text into individual sentences, taking into account punctuation marks that denote sentence boundaries, such as periods, exclamation marks, and question marks. It requires understanding the context to correctly interpret these punctuation marks (e.g., distinguishing between a period indicating an abbreviation and one ending a sentence).

Sentence tokenization is preferred in scenarios where the unit of analysis or manipulation is the sentence, such as in natural language generation, summarization, machine translation, or when sentences are used as input for sentence-level sentiment analysis.

**Example**
- NLTK's `sent_tokenize`
- spaCy's `Doc.sents` (the processed `Doc` object exposes its sentences as spans)
- Apache OpenNLP's `SentenceDetectorME`
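The abbreviation problem can be seen in a small standard-library sketch. The abbreviation list and the `split_sentences` helper are illustrative assumptions; the libraries above use trained models or much richer rules.

```python
import re

# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc."}

def split_sentences(text: str) -> list[str]:
    """Close a sentence at ., ! or ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Lee arrived late. Was the talk over? Not yet!"

naive = re.split(r"(?<=[.!?])\s+", text)  # splits after every ., ! or ?
aware = split_sentences(text)

print(naive)
print(aware)
```

The naive split treats "Dr." as a sentence of its own, while the abbreviation-aware version keeps it attached to the sentence it belongs to.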

## Traditional Tokenization Tools and Libraries

Python offers a wealth of libraries and tools designed to support various aspects of Natural Language Processing (NLP), including tokenization.

### 1. **NLTK (Natural Language Toolkit)**

- **Approach:** NLTK is one of the most widely-used libraries for NLP in Python and provides a comprehensive suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It offers multiple tokenizers with varying complexity, from simple whitespace tokenizers to sophisticated regular expression tokenizers.
- **Use Cases:** Ideal for educational purposes and prototyping due to its wide range of functionalities and ease of use.

```python
import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

text = "Hello there! How are you doing today?"
tokens = word_tokenize(text)
print(tokens)
```

### 2. **spaCy**

- **Approach:** spaCy is known for its speed and efficiency. It adopts an object-oriented approach and treats text as an object, which allows for more sophisticated processing. spaCy's tokenizer is highly optimized and can be extended with custom rules. It’s designed for production use and supports a wide range of NLP tasks.
- **Use Cases:** Best suited for production-grade applications requiring fast and efficient text processing.

```python
import spacy
nlp = spacy.load("en_core_web_sm")  # Load the English model

text = "Hello there! How are you doing today?"
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
```

### 3. **TextBlob**

- **Approach:** TextBlob is built on top of NLTK and Pattern. It's simpler and more user-friendly, providing an intuitive API for common NLP tasks, including tokenization. TextBlob's tokenization is more straightforward, primarily focusing on ease of use and accessibility.
- **Use Cases:** Great for beginners and those interested in quickly prototyping applications or scripts involving NLP.

```python
from textblob import TextBlob

text = "Hello there! How are you doing today?"
blob = TextBlob(text)
tokens = blob.words  # Word tokenization; punctuation is dropped (blob.tokens keeps it)
sentences = blob.sentences  # Sentence tokenization
print(tokens)
print(sentences)
```

### 4. **Gensim**

- **Approach:** Gensim is focused on topic modeling and document similarity. While not primarily known for tokenization, it does offer simple preprocessing capabilities that can tokenize and clean text. Gensim is optimized for handling large text corpora.
- **Use Cases:** Ideal for projects focusing on topic modeling, document indexing, and similarity retrieval.

```python
from gensim.utils import tokenize

text = "Hello there! How are you doing today?"
tokens = list(tokenize(text))
print(tokens)
```










### 5. **Keras**

- **Approach:** Keras, a deep learning library, provides utilities for text preprocessing, including tokenization, primarily aimed at preparing text data for neural network models. Its `Tokenizer` class offers a high-level API for vectorizing text into sequences or matrices.
- **Use Cases:** Best suited for deep learning projects requiring text data preprocessing for model training.

```python
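# Note: this Tokenizer is marked as deprecated in recent TensorFlow releases,
# which recommend tf.keras.layers.TextVectorization for new code.
from tensorflow.keras.preprocessing.text import Tokenizer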

texts = ["Hello there! How are you doing today?"]
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
```

## Advanced Tokenization Techniques

```python
# sentencepiece is required by the T5 tokenizer used later in this section
!pip install transformers sentencepiece tensorflow numpy -q
```

```python
# Sample dataset of customer reviews
reviews = [
    "Absolutely love this! Best purchase ever.",
    "Horrible, completely useless. 0/10",
]
```

### BPE

BPE is a subword tokenization method, often used in models like GPT.

BPE iteratively merges the most frequent pair of adjacent bytes or characters in the text data. It’s effective for keeping the vocabulary size under control and for handling unknown words, which are broken down into more frequent subwords or characters.

**How it Works:** Starts with character-level tokenization and progressively merges pairs of characters (or byte pairs) based on their frequency of occurrence in the corpus. This process continues until a predetermined number of merges (vocabulary size) is reached.

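**Example:**

- Corpus: aaabdaaabac

- Initial character frequency: a: 7, b: 2, d: 1, c: 1

1. Initial State: Each character is a token.
a a a b d a a a b a c
2. After Step 1: The most frequent pair is a a (4 occurrences, counting overlaps). Merge a a into A.
A a b d A a b a c
3. After Step 2: A a and a b are tied at 2 occurrences each; taking the first, merge A a into B.
B b d B b a c
4. After Step 3: B b is now the most frequent pair (2 occurrences). Merge B b into C.
C d C a c

5. Final Tokens: C, d, a, c

6. Final Corpus Representation: C d C a c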
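The merge loop can be sketched in a few lines of plain Python. This is a simplified illustration on the same toy corpus, not the GPT-2 implementation: it works on one string instead of a word-frequency table, counts overlapping pairs, and breaks ties by taking the pair that appears first. The function names are assumptions.

```python
from collections import Counter

def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text: str, num_merges: int) -> tuple[list[str], list[tuple[str, str]]]:
    """Learn up to `num_merges` merges, starting from single characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # max() returns the first pair with the highest count (first-seen tie-break)
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        tokens = merge_pair(tokens, best)
    return tokens, merges

tokens, merges = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(tokens)
```

Here the merged tokens are written out as the strings they stand for (`aa`, `aaa`, `aaab`) instead of placeholder letters, which makes it easy to see what each learned token covers.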


```python
from transformers import GPT2Tokenizer
import numpy as np

# Initialize the tokenizer
tokenizer_bpe = GPT2Tokenizer.from_pretrained("gpt2")

# Encode the reviews using BPE
sequences_bpe = [tokenizer_bpe.encode(review, add_special_tokens=True) for review in reviews]

# Pad to a rectangular array. GPT-2 defines no padding token, and ID 0 is also
# the "!" token in its vocabulary, so these zeros are for display only.
max_len = max(len(seq) for seq in sequences_bpe)
padded_seqs_bpe = np.array([seq + [0]*(max_len - len(seq)) for seq in sequences_bpe])

# Convert token IDs back to tokens
tokens_bpe = [tokenizer_bpe.convert_ids_to_tokens(seq) for seq in sequences_bpe]

print("\nBPE Subword Tokenization Result:")
print(padded_seqs_bpe)
print("\nTokens:")
for tokens in tokens_bpe:
    print(tokens)
```


Output:

```
BPE Subword Tokenization Result:
[[40501  1842   428     0  6705  5001  1683    13     0]
 [27991  5547    11  3190 13894    13   657    14   940]]

Tokens:
['Absolutely', 'Ġlove', 'Ġthis', '!', 'ĠBest', 'Ġpurchase', 'Ġever', '.']
['Hor', 'rible', ',', 'Ġcompletely', 'Ġuseless', '.', 'Ġ0', '/', '10']
```
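In the first row of the result above, ID 0 appears twice: once for the `!` token and once as padding. One common way to keep the two apart is to carry an attention mask alongside the padded IDs. The helper below is a plain-Python sketch with a hypothetical name (`pad_with_mask`); the ID sequences are copied from the output above.

```python
def pad_with_mask(sequences: list[list[int]], pad_id: int):
    """Right-pad sequences to equal length; return (input_ids, attention_mask)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return input_ids, attention_mask

# Token IDs copied from the BPE output above
sequences = [
    [40501, 1842, 428, 0, 6705, 5001, 1683, 13],
    [27991, 5547, 11, 3190, 13894, 13, 657, 14, 940],
]

input_ids, attention_mask = pad_with_mask(sequences, pad_id=0)
print(input_ids[0])
print(attention_mask[0])
```

With the mask, the `0` at position 3 (a real `!`) and the `0` at the end (padding) are no longer ambiguous, because only the latter has a mask value of 0.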


### WordPiece

WordPiece is another subword tokenization method, often used in models like BERT.

Similar to BPE, but instead of merging the most frequent pair, WordPiece adds the most beneficial token (the one that most increases the likelihood of the training data under the model) during each iteration.

**How it Works:** Begins with character-level tokens and incrementally creates a vocabulary of subwords based on their utility in representing the training data efficiently. The WordPiece algorithm specifically optimizes for the performance of the language model, rather than just the frequency of the subwords. The goal is to improve the model's understanding and generation of text.


**Example:**

- Corpus: aaabdaaabac

- Initial character frequency: a: 7, b: 2, d: 1, c: 1

1. Initial State: Each character is a token.
a a a b d a a a b a c

2. After Step 1: Unlike BPE, which directly merges the most frequent pair, WordPiece evaluates which merge would contribute most to modeling the language. If a a is still the best candidate, it merges a a → A.
A a b d A a b a c

3. After Step 2: WordPiece might find merging a and b into B beneficial for its language model criterion (even if it is not the most frequent pair).
A B d A B a c

4. After Step 3: Next, assume that merging A and B into C scores best.
C d C a c

5. Final Tokens: C, d, a, c

6. Final Corpus Representation: C d C a c
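The selection criterion can be made concrete with a commonly cited approximation of the WordPiece score: the pair count divided by the product of the counts of its two parts. The sketch below assumes that formulation and ignores the `##` continuation prefix for simplicity; it is not taken from BERT's original training code.

```python
from collections import Counter

def wordpiece_scores(tokens: list[str]) -> dict[tuple[str, str], float]:
    """Score each adjacent pair as count(pair) / (count(first) * count(second))."""
    unit_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

tokens = list("aaabdaaabac")
scores = wordpiece_scores(tokens)

best_pair = max(scores, key=scores.get)
most_frequent_pair = Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

print("Highest WordPiece-style score:", best_pair, scores[best_pair])
print("Most frequent pair (BPE's choice):", most_frequent_pair)
```

Under this score, a pair of rare symbols that almost always occur together can outrank a frequent pair of common symbols, which is why the two algorithms may choose different first merges on the same corpus.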

```python
from transformers import BertTokenizer
import numpy as np

# Initialize the tokenizer
tokenizer_wordpiece = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode the reviews using WordPiece and pad the sequences (ID 0 is BERT's [PAD] token)
sequences_wordpiece = [tokenizer_wordpiece.encode(review, add_special_tokens=True) for review in reviews]
max_len = max(len(seq) for seq in sequences_wordpiece)
padded_seqs_wordpiece = np.array([seq + [0]*(max_len - len(seq)) for seq in sequences_wordpiece])

# Convert token IDs back to tokens
tokens_wordpiece = [tokenizer_wordpiece.convert_ids_to_tokens(seq) for seq in sequences_wordpiece]

print("\nWordPiece Tokenization Result:")
print(padded_seqs_wordpiece)
print("\nTokens:")
for tokens in tokens_wordpiece:
    print(tokens)
```


Output:

```
WordPiece Tokenization Result:
[[  101  7078  2293  2023   999  2190  5309  2412  1012   102]
 [  101  9202  1010  3294 11809  1012  1014  1013  2184   102]]

Tokens:
['[CLS]', 'absolutely', 'love', 'this', '!', 'best', 'purchase', 'ever', '.', '[SEP]']
['[CLS]', 'horrible', ',', 'completely', 'useless', '.', '0', '/', '10', '[SEP]']
```


### SentencePiece

SentencePiece is a tokenization library that does not rely on whitespace for tokenization, making it suitable for languages without clear word boundaries.

Unlike typical BPE and WordPiece pipelines, which first split text on whitespace, SentencePiece tokenizes text into subwords directly from raw text (i.e., without whitespace tokenization as a first step). It’s especially useful for languages that don’t use spaces or use them inconsistently.

**How it Works:** Applies a similar subword tokenization technique but treats the input as a raw sequence of Unicode characters, allowing for a language-agnostic approach. SentencePiece can use either BPE or a unigram language model for its subword segmentation process.

**Example:**

- Corpus (Including Space Representation as _): aaa_bdaaabac

1. Initial State: Consider spaces as tokens too: a a a _ b d a a a b a c

2. After Step 1: SentencePiece might start by merging frequent pairs. For example, a a → A.
A a _ b d A a b a c

3. After Step 2: Further merges can involve the space symbol, for example _ b → B, which keeps the word boundary attached to the start of the following piece.
A a B d A a b a c

4. After Step 3: Then, perhaps A a → C, since that pair occurs twice.
C B d C b a c

5. Final Tokens: C, B, d, b, a, c

6. Final Corpus Representation: C B d C b a c
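Treating the space as an ordinary symbol is what makes the segmentation reversible. The sketch below illustrates only that idea: it marks spaces with `▁` (the symbol SentencePiece uses) and then cuts the string into fixed-length chunks, which is a stand-in assumption for the learned BPE or unigram segmentation. The helper names are hypothetical.

```python
SPACE_MARKER = "\u2581"  # "▁", the visible stand-in for a space

def to_pieces(text: str, piece_len: int = 3) -> list[str]:
    """Toy segmentation: mark spaces, then cut into fixed-length chunks."""
    marked = SPACE_MARKER + text.replace(" ", SPACE_MARKER)
    return [marked[i:i + piece_len] for i in range(0, len(marked), piece_len)]

def from_pieces(pieces: list[str]) -> str:
    """Join the pieces, restore spaces, and drop the leading marker."""
    return "".join(pieces).replace(SPACE_MARKER, " ")[1:]

text = "Best purchase ever"
pieces = to_pieces(text)
restored = from_pieces(pieces)

print(pieces)
print(restored)
```

Because the pieces carry the space information themselves, detokenization is a plain string join, with no language-specific rules about where spaces belong.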

```python
from transformers import T5Tokenizer
import numpy as np

# Initialize the tokenizer (T5Tokenizer is backed by SentencePiece and needs the sentencepiece package)
tokenizer_sentencepiece = T5Tokenizer.from_pretrained('t5-small')

# Encode the reviews using SentencePiece and pad the sequences (ID 0 is T5's padding token)
sequences_sentencepiece = [tokenizer_sentencepiece.encode(review, add_special_tokens=True) for review in reviews]
max_len = max(len(seq) for seq in sequences_sentencepiece)
padded_seqs_sentencepiece = np.array([seq + [0]*(max_len - len(seq)) for seq in sequences_sentencepiece])

# Convert token IDs back to tokens
tokens_sentencepiece = [tokenizer_sentencepiece.convert_ids_to_tokens(seq) for seq in sequences_sentencepiece]

print("\nSentencePiece Tokenization Result:")
print(padded_seqs_sentencepiece)
print("\nTokens:")
for tokens in tokens_sentencepiece:
    print(tokens)
```


Output:

```
SentencePiece Tokenization Result:
[[20510   333    48    55  1648  1242   664     5     1     0     0]
 [ 6766    52  2317     6  1551 19930     5     3   632 11476     1]]

Tokens:
['▁Absolutely', '▁love', '▁this', '!', '▁Best', '▁purchase', '▁ever', '.', '</s>']
['▁Hor', 'r', 'ible', ',', '▁completely', '▁useless', '.', '▁', '0', '/10', '</s>']
```


## Practical Example


## Conclusion

Tokenization is a critical preprocessing step in Natural Language Processing (NLP) that transforms raw text into structured formats usable by machine learning models.

Each tokenization method (word, sentence, subword, and character) addresses specific challenges and is suited to different NLP tasks.

The choice of tokenization method significantly impacts the performance of NLP applications, making it essential to match the tokenization strategy with the specific needs of each task and the characteristics of the language being processed.