# NLP Basics: Tokenization, Normalization, Lemmatization, and More


In this notebook, we will explore fundamental preprocessing techniques in NLP:
1. Basics of Tokenization
2. Space-Based Tokenization (Single and Multiple Languages)
3. Word Normalization
4. Case Folding
5. Lemmatization
6. Stemming
7. Porter Stemmer
8. Sentence Segmentation
    

In [1]:
# Importing required libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

# Download the required NLTK data (recent NLTK releases may also need the 'punkt_tab' resource for tokenization)
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jenni\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jenni\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jenni\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 1. Basics of Tokenization

In [2]:
# Tokenize a simple sentence with NLTK's word_tokenize; punctuation becomes its own token
text = "Natural Language Processing is fascinating!"
tokens = word_tokenize(text)
print(f"Tokens: {tokens}")

Tokens: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
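
To make the idea of a token concrete, here is a minimal regex-based sketch that reproduces the output above for this sentence. It is not how `word_tokenize` works internally; the helper name `regex_tokenize` is hypothetical, and the pattern simply treats runs of word characters and single punctuation marks as tokens.

```python
import re

def regex_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches one word; [^\w\s] matches one character that is
    # neither a word character nor whitespace (i.e. punctuation)
    return re.findall(r"\w+|[^\w\s]", text)

tokens = regex_tokenize("Natural Language Processing is fascinating!")
print(f"Tokens: {tokens}")
```

The two approaches diverge on harder input: this pattern breaks "don't" into `don`, `'`, `t`, whereas NLTK's tokenizer has dedicated rules for contractions.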


## 2. Space-Based Tokenization

In [3]:
# Space-based tokenization: split on single spaces; punctuation stays attached to the word ('words.')
text = "Tokenization splits text into words."
tokens = text.split(" ")
print(f"Tokens: {tokens}")

Tokens: ['Tokenization', 'splits', 'text', 'into', 'words.']
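
One detail worth seeing in isolation: `split(" ")` splits on every single space, while `split()` with no argument splits on runs of any whitespace. The input string below is made up for illustration.

```python
# Illustrative input with a double space and a tab
text = "Tokenization  splits\ttext into words."

by_space = text.split(" ")    # one split per space character
by_whitespace = text.split()  # splits on runs of spaces, tabs, newlines

print(f"split(' '): {by_space}")
print(f"split():    {by_whitespace}")
```

With `split(" ")` the double space yields an empty token and the tab is not treated as a separator, so `split()` is usually the safer default for whitespace tokenization.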


In [4]:
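# Space-based tokenization in multiple languages: works for Spanish, but Chinese is written without spaces, so the whole Chinese sentence stays one token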
text_multilang = "Este es un texto en español. 这是中文文本。"
tokens_multilang = text_multilang.split(" ")
print(f"Tokens (Multilingual): {tokens_multilang}")

Tokens (Multilingual): ['Este', 'es', 'un', 'texto', 'en', 'español.', '这是中文文本。']
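
Since the Chinese sentence comes back as one token, a crude fallback is to emit each Chinese character separately. The sketch below assumes the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) is enough for the example; `mixed_tokenize` is a hypothetical helper, and real Chinese word segmentation needs a dedicated segmenter rather than character splitting.

```python
import re

# Three alternatives, tried in order:
#   1. a single CJK ideograph (basic range only, an assumption of this sketch)
#   2. a run of word characters that are not CJK ideographs
#   3. a single punctuation character
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+|[^\w\s]")

def mixed_tokenize(text: str) -> list[str]:
    """Tokenize space-delimited words and individual CJK characters."""
    return TOKEN_RE.findall(text)

tokens_mixed = mixed_tokenize("Este es un texto en español. 这是中文文本。")
print(f"Tokens (character fallback): {tokens_mixed}")
```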


## 3. Word Normalization

In [5]:
# Normalize by removing punctuation and lowercasing; deleting the hyphen also fuses "Text-Processing" into "textprocessing"
text = "Normalization helps in Text-Processing tasks!"
normalized_text = re.sub(r'[^\w\s]', '', text).lower()
print(f"Normalized text: {normalized_text}")

Normalized text: normalization helps in textprocessing tasks
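
If fusing hyphenated words is not wanted, one alternative is to replace punctuation with a space and then collapse the extra whitespace. This is a sketch of that variant; the function name `normalize_keep_boundaries` is hypothetical.

```python
import re

def normalize_keep_boundaries(text: str) -> str:
    """Lowercase text, turning punctuation into word boundaries."""
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text.lower()

normalized = normalize_keep_boundaries("Normalization helps in Text-Processing tasks!")
print(f"Normalized text: {normalized}")
```

Which behavior is right depends on the task: a search index may prefer "text processing" as two terms, while other pipelines treat the hyphenated form as one unit.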


## 4. Case Folding

In [6]:
# Lowercase all text for uniformity (str.lower() is sufficient for this ASCII example)
text = "Case Folding Ensures Consistency."
case_folded_text = text.lower()
print(f"Case Folded Text: {case_folded_text}")

Case Folded Text: case folding ensures consistency.
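
For text beyond ASCII, Python also provides `str.casefold()`, which is more aggressive than `str.lower()` and is intended for caseless matching. The word list below is illustrative.

```python
# Three spellings of the same German word
words = ["Straße", "STRASSE", "strasse"]

lowered = {w.lower() for w in words}    # 'ß' is kept as-is by lower()
folded = {w.casefold() for w in words}  # casefold() maps 'ß' to 'ss'

print(f"lower():    {sorted(lowered)}")
print(f"casefold(): {sorted(folded)}")
```

After `casefold()` all three spellings compare equal, whereas `lower()` leaves two distinct forms.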


## 5. Lemmatization

In [7]:
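# Lemmatize every word as a verb (pos='v'); 'better' is unchanged because its lemma 'good' is only found when it is tagged as an adjective (pos='a')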
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "better"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(f"Lemmatized words: {lemmatized_words}")

Lemmatized words: ['run', 'fly', 'better']
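
The result for "better" shows that a lemma depends on the part of speech as well as the surface form. The sketch below illustrates that dependency with a tiny hand-written lookup table standing in for WordNet; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names and the entries are illustrative, not extracted from WordNet.

```python
# Toy lexicon keyed by (surface form, part-of-speech tag).
# Tags follow the WordNet convention: n=noun, v=verb, a=adjective, r=adverb.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("flies", "v"): "fly",
    ("flies", "n"): "fly",
    ("better", "a"): "good",
    ("better", "r"): "well",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

for pos in ("v", "a", "r"):
    print(f"better as '{pos}': {toy_lemmatize('better', pos)}")
```

In a real pipeline the tag would typically come from a part-of-speech tagger run on the full sentence rather than being fixed for every word.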


## 6. Stemming

In [8]:
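# Stemming with NLTK's Snowball English stemmer; stems need not be dictionary words ('fli'), and irregular forms such as 'ran' are left unchanged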
stemmer = SnowballStemmer("english")
words = ["running", "ran", "flies", "flying"]
stemmed_words = [stemmer.stem(word) for word in words]
print(f"Stemmed words: {stemmed_words}")

Stemmed words: ['run', 'ran', 'fli', 'fli']


## 7. Porter Stemmer

In [9]:
# Porter stemmer example; 'runner' and 'ran' are not reduced to 'run', and 'easily' becomes the non-word 'easili'
porter_stemmer = PorterStemmer()
words = ["running", "runner", "ran", "easily", "flies"]
porter_stemmed_words = [porter_stemmer.stem(word) for word in words]
print(f"Porter Stemmed Words: {porter_stemmed_words}")

Porter Stemmed Words: ['run', 'runner', 'ran', 'easili', 'fli']
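
To show where a stem like 'fli' comes from, here is a sketch of step 1a only of the Porter algorithm as originally published (the plural-suffix rules). It is not the full stemmer, `porter_step_1a` is a hypothetical helper, and library implementations such as NLTK's may apply extensions to these rules.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the original Porter algorithm."""
    # Rules are tried longest suffix first; the first match wins.
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (flies -> fli)
    if word.endswith("ss"):
        return word        # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (cats -> cat)
    return word

for w in ["flies", "caresses", "caress", "cats", "ran"]:
    print(f"{w} -> {porter_step_1a(w)}")
```

The rules are purely orthographic, which is why "flies" ends up as "fli" instead of "fly" and why "ran" is untouched.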


## 8. Sentence Segmentation

In [10]:
# Segment text into sentences with NLTK's sent_tokenize (a pre-trained Punkt model)
text = "Natural Language Processing is fascinating. It has many applications. Tokenization is a key step."
sentences = sent_tokenize(text)
print(f"Sentences: {sentences}")

Sentences: ['Natural Language Processing is fascinating.', 'It has many applications.', 'Tokenization is a key step.']
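
For comparison, a naive splitter that breaks after `.`, `!` or `?` followed by whitespace handles the text above but fails on abbreviations. This is a sketch for contrast; `naive_sent_split` is a hypothetical helper and the second input is made up.

```python
import re

def naive_sent_split(text: str) -> list[str]:
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

clean = "Natural Language Processing is fascinating. It has many applications. Tokenization is a key step."
tricky = "Dr. Smith arrived. He sat down."

print(naive_sent_split(clean))
print(naive_sent_split(tricky))  # 'Dr.' is wrongly split off as its own sentence
```

Handling cases like "Dr." is the main reason to prefer a trained segmenter such as Punkt over a punctuation rule.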
