# Tokenization

Tokenization is the process of splitting text into smaller units called tokens. These tokens can be words, subwords, or characters. Tokenization is a crucial first step in many NLP tasks.

## Why is Tokenization Important?

Tokenization enables us to break down complex text into manageable pieces, making it easier to analyze and process. It serves as the foundation for many NLP tasks, such as parsing, part-of-speech tagging, and named entity recognition.

## Types of Tokenization

1. **Word Tokenization**:
   - Splitting text into individual words.
   
2. **Sentence Tokenization**:
   - Splitting text into individual sentences.

3. **Subword Tokenization**:
   - Splitting text into units smaller than words, as used by transformer models such as BERT (WordPiece) and GPT (byte-pair encoding).

4. **Character Tokenization**:
   - Splitting text into individual characters.
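
The two finer granularities in the list can be sketched in plain Python. This is a minimal illustration, not how any particular model tokenizes: the vocabulary `VOCAB` and the helper `subword_tokenize` are hypothetical names made up for this example, and the greedy longest-match strategy only resembles schemes such as WordPiece (it omits continuation markers and uses a hand-written vocabulary rather than a learned one).

```python
# Character tokenization: every character becomes its own token
chars = list("NLP!")

# Toy subword tokenization: greedy longest-match against a tiny,
# hypothetical vocabulary (real vocabularies are learned from data)
VOCAB = {"token", "ization", "un", "break", "able"}

def subword_tokenize(word, vocab=VOCAB, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matched at this position: mark the whole word unknown
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

print(chars)                             # ['N', 'L', 'P', '!']
print(subword_tokenize("tokenization"))  # ['token', 'ization']
print(subword_tokenize("unbreakable"))   # ['un', 'break', 'able']
print(subword_tokenize("xyz"))           # ['[UNK]']
```

The point of subword units is visible in the third call: a word the vocabulary has never seen whole can still be represented from known pieces.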

## Libraries for Tokenization

Several libraries can help with tokenization in Python:
- **NLTK**: Natural Language Toolkit
- **spaCy**: Industrial-strength NLP
- **Hugging Face Tokenizers**: Fast and efficient tokenizers used in transformer models

## Examples: Tokenizing Text in Python

We'll start with plain Python string methods as a baseline, then move on to NLTK, one of the most popular NLP libraries in Python, and finish with spaCy.

### Word Tokenization with Vanilla Python

In [1]:
# Define a sample text to be tokenized
text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP!"

# Use the split() method to tokenize the text into words
# By default, split() splits the text based on whitespace
words = text.split()

# Print the resulting list of word tokens
print("Word Tokens:", words)

Word Tokens: ['Hello,', 'I', 'am', 'Farzad', 'Asgari,', 'and', 'welcome', 'to', 'the', 'NLPy', 'course.', 'We', 'will', 'learn', 'a', 'lot', 'about', 'NLP!']
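
Notice that `split()` leaves punctuation attached to the neighbouring word (`'Hello,'`, `'course.'`, `'NLP!'`). One possible way to separate it while staying in the standard library is a regular expression. The pattern below is an assumption chosen for this example, not a general-purpose tokenizer: it would, for instance, break a contraction such as "don't" into three tokens.

```python
import re

text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP!"

# \w+ matches a run of letters, digits, or underscores (a word);
# [^\w\s] matches a single character that is neither word nor whitespace
words = re.findall(r"\w+|[^\w\s]", text)

print("Word Tokens:", words)
# Word Tokens: ['Hello', ',', 'I', 'am', 'Farzad', 'Asgari', ',', 'and', 'welcome', 'to', 'the', 'NLPy', 'course', '.', 'We', 'will', 'learn', 'a', 'lot', 'about', 'NLP', '!']
```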


### Sentence Tokenization with Vanilla Python

In [2]:
# Define a sample text to be tokenized into sentences
text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP!"

# Use the split() method to tokenize the text into sentences
# Split on a period followed by a single space
# Note: the separator is consumed, so the first sentence loses its final period
sentences = text.split('. ')

# Print the resulting list of sentence tokens
print("Sentence Tokens:", sentences)

Sentence Tokens: ['Hello, I am Farzad Asgari, and welcome to the NLPy course', 'We will learn a lot about NLP!']
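
Splitting on `'. '` drops the period from the first sentence and ignores sentences that end in `!` or `?`. A slightly more careful sketch, still using only the standard library, splits on whitespace that follows sentence-ending punctuation. The pattern is an assumption for illustration and is easy to fool: an abbreviation such as "Dr. Smith" would be split after "Dr.".

```python
import re

text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP!"

# Split at whitespace that follows '.', '!' or '?'; the lookbehind keeps
# the punctuation attached to the sentence it ends
sentences = re.split(r"(?<=[.!?])\s+", text)

print("Sentence Tokens:", sentences)
# Sentence Tokens: ['Hello, I am Farzad Asgari, and welcome to the NLPy course.', 'We will learn a lot about NLP!']
```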


### Word Tokenization with NLTK

In [3]:
# Import the necessary libraries from NLTK
import nltk

# Download the 'punkt' tokenizer models.
# 'punkt' is a pre-trained sentence tokenizer. word_tokenize also relies on it,
# because it splits the text into sentences before splitting each one into words.
# Note: recent NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')

# Import the word_tokenize function from NLTK
from nltk.tokenize import word_tokenize

# Define a sample text to be tokenized
text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP!"

# Use the word_tokenize function to split the text into individual words (tokens)
words = word_tokenize(text)

# Print the resulting list of word tokens
print("Word Tokens:", words)

Word Tokens: ['Hello', ',', 'I', 'am', 'Farzad', 'Asgari', ',', 'and', 'welcome', 'to', 'the', 'NLPy', 'course', '.', 'We', 'will', 'learn', 'a', 'lot', 'about', 'NLP', '!']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\free\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Sentence Tokenization with NLTK

In [4]:
# Import the sent_tokenize function from NLTK
from nltk.tokenize import sent_tokenize

# Define a sample text to be tokenized into sentences
text = "Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP!"

# Use the sent_tokenize function to split the text into individual sentences (tokens)
sentences = sent_tokenize(text)

# Print the resulting list of sentence tokens
print("Sentence Tokens:", sentences)

Sentence Tokens: ['Hello, I am Farzad Asgari, and welcome to the NLPy course.', 'We will learn a lot about NLP!']


### Word Tokenization with spaCy

#### Downloading spaCy's English Language Model

To use spaCy for natural language processing tasks, you need to download a language model. The command below downloads the small English language model (`en_core_web_sm`) for spaCy.

In [5]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

In [6]:
# Import the spaCy library
import spacy

# Load the English language model
# 'en_core_web_sm' is a small English language model provided by spaCy
nlp = spacy.load("en_core_web_sm")

# Process the text with the spaCy model to create a Doc object
# The Doc object is a container for the processed text and its annotations
doc = nlp("Hello, I am Farzad Asgari, and welcome to the NLPy course. We will learn a lot about NLP!")

# Extract the text of each token (words and punctuation marks) from the Doc object
tokens = [token.text for token in doc]

# Print the resulting list of tokens
print("spaCy Tokens:", tokens)

spaCy Tokens: ['Hello', ',', 'I', 'am', 'Farzad', 'Asgari', ',', 'and', 'welcome', 'to', 'the', 'NLPy', 'course', '.', 'We', 'will', 'learn', 'a', 'lot', 'about', 'NLP', '!']


### Sentence Tokenization with spaCy

In [7]:
# Continue using the Doc object from the previous example
# Extract sentences from the Doc object
sentences = [sent.text for sent in doc.sents]

# Print the resulting list of sentence tokens
# Each sentence is represented as a string
print("spaCy Sentence Tokens:", sentences)

spaCy Sentence Tokens: ['Hello, I am Farzad Asgari, and welcome to the NLPy course.', 'We will learn a lot about NLP!']


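## Conclusion

Tokenization is a fundamental step in NLP, breaking down text into manageable units. The examples above show the difference tooling makes: plain `split()` leaves punctuation attached to words and drops the period when splitting sentences, while NLTK and spaCy separate punctuation into its own tokens and keep each sentence intact. Which granularity to use (word, sentence, subword, or character) depends on the task and the model.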