Tokenization in Natural Language Processing (NLP)
---
This notebook will guide you through the concept of tokenization in NLP, its importance, and how it works with examples using Python.

## What is Tokenization?
Tokenization is the process of breaking down a text into smaller components called tokens. Tokens can be words, characters, or subwords, depending on the level of tokenization applied.
 
### Why is Tokenization Important?
1. **Foundation for NLP Tasks**: Most NLP tasks, such as text classification, machine translation, and sentiment analysis, require tokenized input.
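2. **Structured Representation**: Tokenization turns raw text into a sequence of discrete units that can be counted, indexed, and mapped to numeric IDs, which is the form algorithms can process and analyze.
3. **Handling Ambiguity**: A tokenizer decides consistently where one token ends and the next begins (for example, whether "Let's" is one token or two), so later processing steps see predictable input.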

## Types of Tokenization
1. **Word Tokenization**: Splits text into individual words.
2. **Character Tokenization**: Splits text into individual characters.
3. **Subword Tokenization**: Splits text into smaller meaningful units, often used in modern NLP models like BERT.


In [37]:
%%capture
!pip install nltk transformers

## Let's Explore Each Type of Tokenization

### 1. Word Tokenization


In [45]:
# Sample text
text = "Hello! How are you doing today? Let's understand tokenization."

print("Original Text:", text)

# Splitting by whitespace (basic word tokenization)
word_tokens = text.split()
print("Word Tokens:", word_tokens)

Original Text: Hello! How are you doing today? Let's understand tokenization.
Word Tokens: ['Hello!', 'How', 'are', 'you', 'doing', 'today?', "Let's", 'understand', 'tokenization.']
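
Splitting on whitespace leaves punctuation attached to the words (`'Hello!'`, `'today?'`). One way to separate it with only the standard library is a regular expression. The pattern below is an illustrative choice, not a standard, and it deliberately keeps contractions such as "Let's" as a single token.

```python
import re

text = "Hello! How are you doing today? Let's understand tokenization."

# \w+(?:'\w+)?  -> a word, optionally followed by an apostrophe part (keeps "Let's" whole)
# [^\w\s]       -> any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

regex_tokens = TOKEN_PATTERN.findall(text)
print("Regex Tokens:", regex_tokens)
```

Whether a contraction should stay whole or be split depends on the downstream task, and the NLTK tokenizers in the next section make different choices.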


### Using NLTK for Word Tokenization

In [46]:
import nltk
from nltk.tokenize import (word_tokenize, 
                           TreebankWordTokenizer, 
                           WordPunctTokenizer, 
                           WhitespaceTokenizer)

# Download required NLTK data
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to C:\Users\IT
[nltk_data]     Support\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [47]:
# Word Tokenize
# `word_tokenize` first splits the text into sentences with the Punkt sentence tokenizer, then splits each sentence
# into words and punctuation with a Treebank-style word tokenizer. Contractions become two tokens ("Let's" -> 'Let', "'s").
print("Word Tokenize:", word_tokenize(text))

# TreebankWordTokenizer
# The `TreebankWordTokenizer` splits text based on the Penn Treebank conventions, which are widely used for linguistic processing.
treebank_tokenizer = TreebankWordTokenizer()
print("TreebankWordTokenizer:", treebank_tokenizer.tokenize(text))

# WordPunctTokenizer
# The `WordPunctTokenizer` splits text into runs of word characters and runs of punctuation, so an apostrophe becomes its own token ("Let's" -> 'Let', "'", 's').
word_punct_tokenizer = WordPunctTokenizer()
print("WordPunctTokenizer:", word_punct_tokenizer.tokenize(text))

# WhitespaceTokenizer
# The `WhitespaceTokenizer` splits text into tokens based solely on whitespace, without considering punctuation or sentence structure.
whitespace_tokenizer = WhitespaceTokenizer()
print("WhitespaceTokenizer:", whitespace_tokenizer.tokenize(text))

Word Tokenize: ['Hello', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'Let', "'s", 'understand', 'tokenization', '.']
TreebankWordTokenizer: ['Hello', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'Let', "'s", 'understand', 'tokenization', '.']
WordPunctTokenizer: ['Hello', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'Let', "'", 's', 'understand', 'tokenization', '.']
WhitespaceTokenizer: ['Hello!', 'How', 'are', 'you', 'doing', 'today?', "Let's", 'understand', 'tokenization.']
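
The `WordPunctTokenizer` result can be reproduced with a single regular expression. The sketch below assumes the pattern `\w+|[^\w\s]+` that the NLTK documentation gives for this tokenizer. It shows why "Let's" becomes three tokens and why consecutive punctuation marks stay together as one token. The helper name `word_punct_tokenize` is hypothetical.

```python
import re

# Pattern that NLTK documents for WordPunctTokenizer (reproduced here as an assumption)
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(s):
    """Split s into runs of word characters and runs of punctuation."""
    return WORD_PUNCT_PATTERN.findall(s)

contraction_tokens = word_punct_tokenize("Let's understand tokenization.")
run_tokens = word_punct_tokenize("Wait... what?!")

print("Contraction:", contraction_tokens)
print("Punctuation runs:", run_tokens)
```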


### 2. Character Tokenization

In [48]:
# Splitting text into characters
char_tokens = list(text)
print("Character Tokens:", char_tokens)

Character Tokens: ['H', 'e', 'l', 'l', 'o', '!', ' ', 'H', 'o', 'w', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', ' ', 'd', 'o', 'i', 'n', 'g', ' ', 't', 'o', 'd', 'a', 'y', '?', ' ', 'L', 'e', 't', "'", 's', ' ', 'u', 'n', 'd', 'e', 'r', 's', 't', 'a', 'n', 'd', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '.']
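
Models consume numbers rather than characters, so character tokens are usually mapped to integer IDs through a vocabulary. The following is a minimal sketch of that mapping. The names `char_to_id` and `id_to_char` are illustrative, and the vocabulary is built from this one sentence only.

```python
text = "Hello! How are you doing today? Let's understand tokenization."

char_tokens = list(text)

# A sorted set of characters gives a deterministic vocabulary
vocab = sorted(set(char_tokens))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode the text as IDs, then decode the IDs back to text
ids = [char_to_id[ch] for ch in char_tokens]
decoded = "".join(id_to_char[i] for i in ids)

print("Vocabulary size:", len(vocab))
print("First 10 IDs:", ids[:10])
print("Round trip OK:", decoded == text)
```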


### 3. Subword Tokenization
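Subword tokenization is commonly used in transformer-based models. BERT uses the WordPiece algorithm; other common algorithms include Byte Pair Encoding (BPE) and the Unigram language model. Libraries such as `SentencePiece` and Hugging Face `tokenizers` implement these algorithms.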
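
To give a feel for how a subword vocabulary is learned, here is a toy sketch of the BPE training loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. The tiny corpus and its word frequencies are made up for illustration, and production implementations add details (end-of-word markers, byte-level handling) that are omitted here.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: each word is a tuple of characters with a made-up frequency
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair_counts = get_pair_counts(corpus)
    best = max(pair_counts, key=pair_counts.get)  # ties go to the pair seen first
    merges.append(best)
    corpus = merge_pair(corpus, best)

print("Learned merges:", merges)
print("Corpus after merges:", corpus)
```

Each learned merge becomes a vocabulary entry, so frequent character sequences such as "est" end up as single tokens while rare words remain split into smaller pieces.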


In [49]:
# Example using Hugging Face's Transformers library
from transformers import AutoTokenizer

# Using a pre-trained tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenizing the text
bert_tokens = bert_tokenizer.tokenize(text)
print("Subword Tokens (BERT):", bert_tokens)

Subword Tokens (BERT): ['hello', '!', 'how', 'are', 'you', 'doing', 'today', '?', 'let', "'", 's', 'understand', 'token', '##ization', '.']
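
The split of "tokenization" into `'token'` and `'##ization'` comes from a greedy longest-match-first lookup against the model's vocabulary. Below is a simplified sketch of that lookup for a single word. The vocabulary `toy_vocab` is hypothetical and tiny, and the real BERT tokenizer also lowercases the text and splits off punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than a real model's
toy_vocab = {"token", "##ization", "##ize", "##s", "understand"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("tokenizes", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because any word can fall back to smaller pieces, a subword vocabulary covers far more words than it has entries.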


## Hands-On Example

In [50]:
# Tokenizing a new text example:
exercise_text = "Natural Language Processing enables computers to understand human language."

print("\nOriginal Text:", exercise_text)

# Word Tokenization using split
exercise_word_tokens_split = exercise_text.split()
print("Word Tokens (split):", exercise_word_tokens_split)

# Word Tokenization using NLTK
exercise_word_tokens_nltk = word_tokenize(exercise_text)
print("Word Tokens (NLTK):", exercise_word_tokens_nltk)

# Character Tokenization
exercise_char_tokens = list(exercise_text)
print("Character Tokens:", exercise_char_tokens)

# Subword Tokenization using BERT tokenizer
exercise_subword_tokens = bert_tokenizer.tokenize(exercise_text)
print("Subword Tokens (BERT):", exercise_subword_tokens)


Original Text: Natural Language Processing enables computers to understand human language.
Word Tokens (split): ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language.']
Word Tokens (NLTK): ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', '.']
Character Tokens: ['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'L', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'P', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', 'e', 'n', 'a', 'b', 'l', 'e', 's', ' ', 'c', 'o', 'm', 'p', 'u', 't', 'e', 'r', 's', ' ', 't', 'o', ' ', 'u', 'n', 'd', 'e', 'r', 's', 't', 'a', 'n', 'd', ' ', 'h', 'u', 'm', 'a', 'n', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '.']
Subword Tokens (BERT): ['natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', '.']


### Shortcomings
- **Ambiguity**: There is often no single correct split. A contraction such as "I’m" may be kept as one token or split into "I" and "’m", and a period may end a sentence or belong to an abbreviation, so "U.S." can be split incorrectly.
- **Complexity in Languages**: Languages written without spaces between words (e.g., Chinese) cannot be tokenized by whitespace and typically need a dedicated word-segmentation step or character/subword tokenization.
- **Loss of Context**: Word-level tokenization can split multiword expressions whose meaning depends on the whole phrase (e.g., "New York" becomes "New" and "York").
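
The abbreviation and multiword problems can be reproduced with a few lines of standard-library Python. The sketch below first shows a naive pattern breaking "U.S." apart, then applies two illustrative workarounds: matching a list of known abbreviations first, and merging known phrases after tokenization. The abbreviation list, the phrase list, and the helper `merge_phrases` are all hypothetical. Real pipelines rely on much larger resources or learned models.

```python
import re

sentence = "She moved from New York to the U.S. capital."

# A naive pattern that splits off every punctuation mark breaks "U.S." apart
naive_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print("Naive:", naive_tokens)

# Workaround 1 (illustrative): try known abbreviations before the naive pattern
ABBREVIATIONS = ["U.S."]
abbrev_alt = "|".join(re.escape(a) for a in ABBREVIATIONS)
pattern = re.compile(rf"{abbrev_alt}|\w+|[^\w\s]")
tokens = pattern.findall(sentence)

# Workaround 2 (illustrative): merge known multiword expressions into one token
def merge_phrases(tokens, phrases):
    merged, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                merged.append(" ".join(phrase))
                i += n
                break
        else:
            # No phrase starts at position i, so keep the token unchanged
            merged.append(tokens[i])
            i += 1
    return merged

final_tokens = merge_phrases(tokens, [("New", "York")])
print("Improved:", final_tokens)
```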
