# Lab 2: Text Preprocessing Pipeline

**Duration:** 1 hour

**Objectives:**
- Understand and implement tokenization techniques
- Apply stemming and lemmatization for text normalization
- Remove stop words and special characters
- Build a complete text preprocessing pipeline

---

## Instructions

1. Complete all the exercises marked with `# TODO`
2. Run each cell to verify your answers
3. Save your completed notebook
4. **Push your work to a Git repository and send the link to: yoroba93@gmail.com**

---

## Setup: Install and Import Libraries

In [None]:
# Install required libraries (uncomment if needed)
# !pip install nltk spacy
# Download the small English spaCy model (only needed once per environment)
!python -m spacy download en_core_web_sm

In [None]:
import nltk
import spacy
import re
import string

# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

# Load the spaCy model (downloaded in the previous cell)
nlp = spacy.load('en_core_web_sm')

print("Setup complete!")

---

# Part 1: Tokenization

Tokenization is the process of breaking text into smaller units (tokens), typically words or sentences.
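
As a rough illustration of what a tokenizer does, the sketch below uses only the standard library `re` module. The patterns and the helper names `simple_word_tokens` and `simple_sentences` are assumptions made for this example; they are not how NLTK or spaCy work internally, and real tokenizers handle abbreviations, contractions, and many other cases.

```python
import re

def simple_word_tokens(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentences(text):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    # Known limitation: this also splits after abbreviations such as "Dr.".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "NLP is fun! Is it hard?"
word_list = simple_word_tokens(sample)
sentence_list = simple_sentences(sample)
print(word_list)
print(sentence_list)
```

Unlike a plain `split()`, punctuation ends up as separate tokens here, which is the behavior the library tokenizers in this part also provide.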

## 1.1 Basic Tokenization with Python

In [None]:
# Simple tokenization using split()
text = "Natural Language Processing is fascinating!"

# Basic split on whitespace
tokens_basic = text.split()
print("Basic split():", tokens_basic)

# Problem: punctuation is attached to words!
print("Last token:", tokens_basic[-1])  # 'fascinating!' includes the !

## 1.2 Tokenization with NLTK

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello! How are you doing today? I'm learning NLP. It's really interesting."

# Word tokenization
word_tokens = word_tokenize(text)
print("Word tokens:", word_tokens)

# Sentence tokenization
sent_tokens = sent_tokenize(text)
print("\nSentence tokens:")
for i, sent in enumerate(sent_tokens):
    print(f"  {i+1}. {sent}")

In [None]:
# NLTK handles contractions and punctuation better
text = "I can't believe it's not butter! Don't you think so?"

print("Basic split:", text.split())
print("NLTK tokenize:", word_tokenize(text))

In [None]:
# TODO: Exercise 1.1
# Tokenize the following text into words using NLTK
# Count how many tokens are produced

text = "Dr. Smith's patients can't understand why they're feeling unwell. Is it the flu?"

tokens = # YOUR CODE HERE
num_tokens = # YOUR CODE HERE

print("Tokens:", tokens)
print("Number of tokens:", num_tokens)

assert num_tokens == 18, f"Expected 18 tokens, got {num_tokens}"

## 1.3 Tokenization with spaCy

In [None]:
# spaCy tokenization
text = "Apple is looking at buying U.K. startup for $1 billion."

doc = nlp(text)

# Get tokens
tokens = [token.text for token in doc]
print("Tokens:", tokens)

# spaCy provides additional information
print("\nDetailed token info:")
for token in doc:
    print(f"  {token.text:12} | POS: {token.pos_:6} | Is Stop: {token.is_stop}")

In [None]:
# TODO: Exercise 1.2
# Use spaCy to tokenize the text and extract only:
# 1. Tokens that are NOT punctuation
# 2. Tokens that are NOT spaces
# Hint: use token.is_punct and token.is_space

text = "The quick, brown fox jumps over the lazy dog! Isn't it amazing?"

doc = nlp(text)

clean_tokens = # YOUR CODE HERE (list comprehension)

print("Clean tokens:", clean_tokens)
assert len(clean_tokens) == 13, f"Expected 13 tokens, got {len(clean_tokens)}"

## 1.4 Different Tokenization Strategies

In [None]:
from nltk.tokenize import RegexpTokenizer, TweetTokenizer

# RegexpTokenizer - tokenize using a custom pattern
# \w+ matches word characters only (removes punctuation)
regexp_tokenizer = RegexpTokenizer(r'\w+')

text = "Hello! How's it going? #NLP @user123"
print("Regexp tokens:", regexp_tokenizer.tokenize(text))

# TweetTokenizer - designed for social media text
tweet_tokenizer = TweetTokenizer()
print("Tweet tokens:", tweet_tokenizer.tokenize(text))
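
The `\w+` pattern above drops the `#` and `@` prefixes. If those markers matter, one possible alternative is a pattern that keeps an optional prefix attached to the word. The sketch below uses the standard library `re` module; the pattern `[#@]?\w+` is an assumption chosen for illustration, and the same pattern string should also work when passed to `RegexpTokenizer`.

```python
import re

post = "Hello! How's it going? #NLP @user123"

# An optional leading '#' or '@' is kept attached to the word that follows it.
social_tokens = re.findall(r"[#@]?\w+", post)
print(social_tokens)
```

Note that the contraction `How's` is still split into `How` and `s`, because the apostrophe is not a word character.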

In [None]:
# TODO: Exercise 1.3
# Create a RegexpTokenizer that extracts only alphabetic words (no numbers, no punctuation)
# Pattern hint: [a-zA-Z]+ matches one or more letters

text = "I have 3 cats and 2 dogs! Their names are Max123 and Bella."

alpha_tokenizer = # YOUR CODE HERE
alpha_tokens = # YOUR CODE HERE

print("Alphabetic tokens:", alpha_tokens)
assert alpha_tokens == ['I', 'have', 'cats', 'and', 'dogs', 'Their', 'names', 'are', 'Max', 'and', 'Bella'], "Check your pattern!"

---

# Part 2: Stemming and Lemmatization

Both techniques reduce words to their base form, but they work differently:
- **Stemming**: Chops off word endings using rules (faster, cruder)
- **Lemmatization**: Uses vocabulary and morphological analysis (slower, more accurate)
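
To make the contrast concrete, here is a toy sketch of both ideas in plain Python. The suffix list `TOY_SUFFIXES`, the lookup table `TOY_LEMMA_TABLE`, and both functions are hypothetical and deliberately tiny; real stemmers such as Porter apply many ordered rules, and real lemmatizers consult a full lexicon such as WordNet.

```python
# Toy illustration only, not the Porter algorithm or WordNet.
TOY_SUFFIXES = ("ing", "ies", "es", "s")  # hypothetical rule list
TOY_LEMMA_TABLE = {"studies": "study", "feet": "foot", "ran": "run"}  # hypothetical lexicon

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a tiny dictionary; fall back to the word itself."""
    return TOY_LEMMA_TABLE.get(word, word)

for w in ["studies", "feet", "ran", "jumping"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based function produces the non-word `stud` and cannot handle irregular forms like `feet`, while the lookup handles irregular forms but only for words it knows about (`jumping` is returned unchanged).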

## 2.1 Stemming with NLTK

In [None]:
from nltk.stem import PorterStemmer, SnowballStemmer

# Porter Stemmer (most common)
porter = PorterStemmer()

words = ["running", "runs", "ran", "runner", "easily", "fairly"]

print("Porter Stemmer:")
for word in words:
    print(f"  {word:12} -> {porter.stem(word)}")

In [None]:
# Snowball Stemmer (supports multiple languages)
snowball = SnowballStemmer('english')

print("Snowball Stemmer:")
for word in words:
    print(f"  {word:12} -> {snowball.stem(word)}")

# Available languages
print("\nAvailable languages:", SnowballStemmer.languages)

In [None]:
# Stemming limitations - can produce non-words
problem_words = ["studies", "studying", "university", "universe", "beautiful", "beauty"]

print("Stemming can produce non-words:")
for word in problem_words:
    print(f"  {word:12} -> {porter.stem(word)}")

In [None]:
# TODO: Exercise 2.1
# Apply Porter stemming to all words in the sentence
# Return the stemmed tokens as a list

sentence = "The cats are running and jumping over the sleeping dogs"

# Step 1: Tokenize (use word_tokenize)
tokens = # YOUR CODE HERE

# Step 2: Apply stemming to each token
stemmed_tokens = # YOUR CODE HERE (list comprehension)

print("Original:", tokens)
print("Stemmed:", stemmed_tokens)

assert stemmed_tokens == ['the', 'cat', 'are', 'run', 'and', 'jump', 'over', 'the', 'sleep', 'dog'], "Check your stemming!"

## 2.2 Lemmatization with NLTK

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "ran", "better", "studies", "feet", "geese"]

print("Lemmatization (default - assumes nouns):")
for word in words:
    print(f"  {word:12} -> {lemmatizer.lemmatize(word)}")

In [None]:
# Lemmatization works better with POS tags
# pos: 'n' = noun, 'v' = verb, 'a' = adjective, 'r' = adverb

print("Lemmatization with POS tags:")
print(f"  running (verb):     {lemmatizer.lemmatize('running', pos='v')}")
print(f"  running (noun):     {lemmatizer.lemmatize('running', pos='n')}")
print(f"  better (adjective): {lemmatizer.lemmatize('better', pos='a')}")
print(f"  studies (verb):     {lemmatizer.lemmatize('studies', pos='v')}")
print(f"  studies (noun):     {lemmatizer.lemmatize('studies', pos='n')}")
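
In practice the POS tag usually comes from a tagger rather than being typed by hand. NLTK's `pos_tag` returns Penn Treebank tags (such as `VBG` or `JJ`), while `lemmatize()` expects the single letters listed above, so a small mapping is needed. The helper below is a minimal sketch; the name `penn_to_wordnet` is an assumption for this example, and falling back to noun for all other tags is a simplification.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else is treated as a noun (lemmatize()'s default)

# Intended use (not run here; needs the NLTK data downloaded in Setup):
#   tagged = nltk.pos_tag(word_tokenize("The cats were running"))
#   lemmas = [lemmatizer.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in tagged]

for tag in ["VBG", "JJR", "RB", "NNS"]:
    print(tag, "->", penn_to_wordnet(tag))
```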

In [None]:
# TODO: Exercise 2.2
# Lemmatize the following words using the correct POS tag
# Fill in the POS tag for each word

lemmatizer = WordNetLemmatizer()

# Format: (word, pos_tag)
words_with_pos = [
    ("flying", "v"),      # verb -> fly
    ("happily", # TODO),   # adverb -> happily (adverbs don't change much)
    ("worse", # TODO),     # adjective -> bad
    ("mice", # TODO),      # noun -> mouse
    ("are", # TODO),       # verb -> be
]

print("Lemmatization results:")
for word, pos in words_with_pos:
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"  {word:12} ({pos}) -> {lemma}")

# Verify your answers
expected = ['fly', 'happily', 'bad', 'mouse', 'be']
results = [lemmatizer.lemmatize(w, pos=p) for w, p in words_with_pos]
assert results == expected, f"Expected {expected}, got {results}"

## 2.3 Lemmatization with spaCy

In [None]:
# spaCy automatically determines POS and lemmatizes correctly
text = "The children are playing with their toys. They were running and jumping happily."

doc = nlp(text)

print("spaCy lemmatization (automatic POS detection):")
for token in doc:
    if token.text != token.lemma_:  # Only show words that change
        print(f"  {token.text:12} ({token.pos_:5}) -> {token.lemma_}")

In [None]:
# TODO: Exercise 2.3
# Use spaCy to extract the lemmas of all non-punctuation tokens
# Return as a list of lowercase lemmas

text = "The dogs were barking loudly at the cats who were climbing the trees."

doc = nlp(text)

lemmas = # YOUR CODE HERE (list comprehension: token.lemma_.lower() for non-punct tokens)

print("Lemmas:", lemmas)
assert lemmas == ['the', 'dog', 'be', 'bark', 'loudly', 'at', 'the', 'cat', 'who', 'be', 'climb', 'the', 'tree'], "Check your lemmatization!"

## 2.4 Stemming vs Lemmatization Comparison

In [None]:
# Compare the two approaches
words = ["studies", "studying", "better", "feet", "ran", "easily", "fairly", "wolves"]

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(f"{'Word':<12} {'Stemmed':<12} {'Lemmatized':<12}")
print("-" * 36)
for word in words:
    stemmed = porter.stem(word)
    # For comparison, we'll use noun as default
    lemmatized = lemmatizer.lemmatize(word)
    print(f"{word:<12} {stemmed:<12} {lemmatized:<12}")

**Key Differences:**
- Stemming is faster but may produce non-words ("studi", "easili")
- Lemmatization produces valid words but is slower
- Lemmatization requires POS information for best results
- Use stemming for speed (search engines), lemmatization for accuracy (chatbots, NLU)

---

# Part 3: Stop Words and Special Characters

Stop words are common words that usually don't carry much meaning (the, is, at, which, etc.).
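
Whether a word is safe to drop depends on the task. The sketch below uses a tiny hand-written stop list (an assumption for illustration; the NLTK and spaCy lists are much longer) to show how removing "not" changes the meaning of a sentence, and how an exception set can protect such words. The function name `remove_stop_words` and its `keep` parameter are hypothetical.

```python
# Tiny hand-written stop list for illustration only.
TOY_STOP_WORDS = {"the", "is", "at", "which", "a", "of", "not"}

def remove_stop_words(tokens, stop_words, keep=()):
    """Drop tokens found in stop_words, except those explicitly listed in keep."""
    effective = set(stop_words) - set(keep)
    return [t for t in tokens if t.lower() not in effective]

review = "the movie is not good".split()
without_negation = remove_stop_words(review, TOY_STOP_WORDS)
with_negation = remove_stop_words(review, TOY_STOP_WORDS, keep={"not"})
print(without_negation)
print(with_negation)
```

With the default list the review collapses to `movie good`, which reads as positive; keeping `not` preserves the negation.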

## 3.1 Stop Words with NLTK

In [None]:
from nltk.corpus import stopwords

# Get English stop words
stop_words_nltk = set(stopwords.words('english'))

print(f"Number of NLTK stop words: {len(stop_words_nltk)}")
print(f"\nSample stop words: {list(stop_words_nltk)[:20]}")

In [None]:
# Remove stop words from text
text = "The quick brown fox jumps over the lazy dog in the park"
tokens = word_tokenize(text.lower())

# Filter out stop words
filtered_tokens = [token for token in tokens if token not in stop_words_nltk]

print("Original tokens:", tokens)
print("Without stop words:", filtered_tokens)

In [None]:
# TODO: Exercise 3.1
# Remove stop words from the following text and return the remaining tokens
# Make sure to lowercase the text first!

text = "This is a sample sentence showing the removal of stop words from the text"

# Step 1: Lowercase and tokenize
tokens = # YOUR CODE HERE

# Step 2: Remove stop words
filtered = # YOUR CODE HERE

print("Filtered tokens:", filtered)
assert filtered == ['sample', 'sentence', 'showing', 'removal', 'stop', 'words', 'text'], "Check your filtering!"

## 3.2 Stop Words with spaCy

In [None]:
# spaCy has built-in stop word detection
text = "This is a sample sentence for demonstrating stop word removal."

doc = nlp(text)

print("Token analysis:")
for token in doc:
    print(f"  {token.text:<15} is_stop: {token.is_stop}")

In [None]:
# Filter using spaCy's is_stop attribute
content_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print("Content words:", content_words)

In [None]:
# Customize stop words in spaCy
# Add custom stop words
nlp.vocab["sample"].is_stop = True

# Remove words from stop list
nlp.vocab["not"].is_stop = False  # 'not' carries meaning!

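# Check the size of spaCy's default stop word list
# Note: the per-word flag changes above act on the vocabulary entries,
# so this count is not expected to reflect them
print(f"Number of spaCy stop words: {len(nlp.Defaults.stop_words)}")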

## 3.3 Removing Special Characters and Punctuation

In [None]:
import string

# Python's string.punctuation
print("Punctuation characters:", string.punctuation)

# Method 1: Using str.translate()
text = "Hello, World! How's it going? #NLP @user"
translator = str.maketrans('', '', string.punctuation)
clean_text = text.translate(translator)
print("\nUsing translate():", clean_text)

In [None]:
# Method 2: Using regex
import re

text = "Hello, World! How's it going? #NLP @user 123"

# Remove everything except ASCII letters and whitespace (digits are removed too)
clean_text = re.sub(r'[^a-zA-Z\s]', '', text)
print("Regex (letters only):", clean_text)

# Remove punctuation but keep numbers
clean_text2 = re.sub(r'[^\w\s]', '', text)
print("Regex (alphanumeric):", clean_text2)
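
The two patterns above differ in more than how they treat digits. For `str` patterns in Python 3, `\w` is Unicode-aware and also matches the underscore, whereas `[a-zA-Z]` covers ASCII letters only. The short sketch below illustrates this on a made-up string (`mixed` is an assumption for this example).

```python
import re

mixed = "café_menu: 2 crêpes, $5!"

# ASCII letters only: digits, underscores, and accented letters are removed.
ascii_letters_only = re.sub(r"[^a-zA-Z\s]", "", mixed)

# \w keeps letters (including accented ones), digits, and the underscore.
word_chars_only = re.sub(r"[^\w\s]", "", mixed)

print(repr(ascii_letters_only))
print(repr(word_chars_only))
```

For non-English text the ASCII-only pattern silently damages words, so the choice of pattern should follow the language of the data.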

In [None]:
# Method 3: Using spaCy token attributes
text = "Hello! This is @user's tweet about #NLP. Check https://example.com!"

doc = nlp(text)

# Filter tokens
clean_tokens = [
    token.text for token in doc 
    if not token.is_punct 
    and not token.is_space
    and not token.like_url
]

print("Clean tokens:", clean_tokens)

In [None]:
# TODO: Exercise 3.2
# Clean the following text by:
# 1. Removing URLs
# 2. Removing mentions (@user)
# 3. Removing hashtags (#topic)
# 4. Removing punctuation
# 5. Converting to lowercase
# Use regex for this exercise

text = "Check out @OpenAI's new model! https://openai.com #AI #MachineLearning It's amazing!!!"

# YOUR CODE HERE
# Hint: Apply multiple re.sub() operations
clean = text
clean = # Remove URLs
clean = # Remove mentions (match the whole token, e.g. "@OpenAI's", so no stray "s" is left)
clean = # Remove hashtags
clean = # Remove punctuation
clean = # Lowercase
clean = # Remove extra whitespace

print(f"Clean text: '{clean}'")
assert clean == "check out new model its amazing", f"Got: '{clean}'"

---

# Part 4: Complete Preprocessing Pipeline

Now let's combine everything into a complete text preprocessing pipeline.

## 4.1 Example Pipeline

In [None]:
def preprocess_pipeline_example(text):
    """
    Example preprocessing pipeline.
    
    Steps:
    1. Lowercase
    2. Remove URLs
    3. Remove special characters
    4. Tokenize
    5. Remove stop words
    6. Lemmatize
    """
    # Step 1: Lowercase
    text = text.lower()
    
    # Step 2: Remove URLs
    text = re.sub(r'https?://\S+', '', text)
    
    # Step 3: Remove special characters (keep only letters and spaces)
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Step 4: Tokenize with spaCy
    doc = nlp(text)
    
    # Step 5 & 6: Remove stop words and lemmatize
    tokens = [
        token.lemma_ 
        for token in doc 
        if not token.is_stop 
        and not token.is_punct 
        and not token.is_space
        and len(token.text) > 1  # Remove single characters
    ]
    
    return tokens

# Test the pipeline
sample_text = """
The quick brown foxes are jumping over the lazy dogs! 
Check out https://example.com for more information.
This is SO amazing!!! #NLP #Python @user123
"""

result = preprocess_pipeline_example(sample_text)
print("Preprocessed tokens:", result)

## 4.2 Final Challenge: Build Your Own Pipeline

In [None]:
# TODO: Exercise 4.1 (FINAL CHALLENGE)
# Create a complete preprocessing pipeline function that:
# 1. Converts text to lowercase
# 2. Removes URLs (http/https)
# 3. Removes email addresses
# 4. Removes mentions (@user) and hashtags (#topic)
# 5. Removes numbers
# 6. Removes punctuation and special characters
# 7. Tokenizes the text
# 8. Removes stop words
# 9. Applies lemmatization
# 10. Removes tokens with fewer than 2 characters
#
# The function should return a list of cleaned tokens

def preprocess_text(text):
    """
    Complete text preprocessing pipeline.
    
    Args:
        text (str): Raw input text
        
    Returns:
        list: List of preprocessed tokens
    """
    # YOUR CODE HERE
    
    # Step 1: Lowercase
    
    # Step 2: Remove URLs
    
    # Step 3: Remove emails
    
    # Step 4: Remove mentions and hashtags
    
    # Step 5: Remove numbers
    
    # Step 6: Remove punctuation/special characters
    
    # Step 7: Tokenize (use spaCy)
    
    # Step 8 & 9: Remove stop words and lemmatize
    
    # Step 10: Remove short tokens
    
    pass  # Remove this line and return your tokens

In [None]:
# Test your pipeline with this text
test_text = """
🚀 BREAKING NEWS!!! The researchers at @MIT have published 5 new papers 
about Natural Language Processing! Check out https://mit.edu/nlp for details.
Contact them at research@mit.edu for collaborations. #NLP #AI #Research

The experiments were conducted using state-of-the-art transformers models.
They achieved 95.5% accuracy on the benchmark datasets!!!
"""

result = preprocess_text(test_text)
print("Preprocessed tokens:")
print(result)

# Verify some expected tokens are in the result
expected_tokens = ['researcher', 'publish', 'paper', 'natural', 'language', 'processing']
for token in expected_tokens:
    assert token in result, f"Expected '{token}' in result"

# Verify unwanted elements are NOT in result
unwanted = ['@mit', 'https', 'mit.edu', '#nlp', '95.5', '!!!', 'the', 'a', 'at']
for item in unwanted:
    assert item not in result, f"'{item}' should not be in result"

print("\n✅ All tests passed!")

## 4.3 Applying Pipeline to Multiple Documents

In [None]:
# TODO: Exercise 4.2
# Apply your preprocessing pipeline to a list of documents
# Return a list of lists (one list of tokens per document)

documents = [
    "Machine learning is transforming the tech industry! @Google #ML",
    "I love programming in Python. It's so easy to learn! https://python.org",
    "The cats are sleeping on the couch. They're so lazy!",
    "Contact support@company.com for any questions about our AI products."
]

# Apply your pipeline to each document
processed_docs = # YOUR CODE HERE (list comprehension)

print("Processed documents:")
for i, doc in enumerate(processed_docs):
    print(f"  Doc {i+1}: {doc}")

---

## Summary

### Tokenization
- **Basic**: `str.split()` - simple but limited
- **NLTK**: `word_tokenize()`, `sent_tokenize()` - handles punctuation
- **spaCy**: `nlp(text)` - provides rich token information
- **Custom**: `RegexpTokenizer` - for specific patterns

### Normalization
- **Stemming**: Fast, rule-based (Porter, Snowball) - may produce non-words
- **Lemmatization**: Accurate, vocabulary-based - produces valid words
- **spaCy**: Automatic POS-aware lemmatization with `token.lemma_`

### Filtering
- **Stop words**: NLTK `stopwords.words()`, spaCy `token.is_stop`
- **Punctuation**: `string.punctuation`, spaCy `token.is_punct`
- **Special characters**: regex `re.sub()`

### Pipeline Best Practices
1. Order matters! (lowercase before regex, tokenize before lemmatize)
2. Choose stemming vs lemmatization based on your task
3. Consider what stop words to keep (e.g., "not" for sentiment)
4. Test your pipeline on sample data
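
The first point can be checked with a small experiment. The sketch below reuses the URL and letters-only patterns from the example pipeline in Part 4 and applies them in two different orders; the function names and the sample string are assumptions made for this illustration.

```python
import re

raw = "Read https://example.com/nlp NOW!"

def urls_then_punct(text):
    """Remove URLs first, then strip non-letters (the intended order)."""
    text = re.sub(r"https?://\S+", "", text.lower())
    text = re.sub(r"[^a-z\s]", "", text)
    return text.split()

def punct_then_urls(text):
    """Strip non-letters first; the URL pattern can then no longer match."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    text = re.sub(r"https?://\S+", "", text)
    return text.split()

good_order = urls_then_punct(raw)
bad_order = punct_then_urls(raw)
print(good_order)
print(bad_order)
```

In the second order the `://` has already been deleted, so the URL survives as the junk token `httpsexamplecomnlp`. Lowercasing first matters for the same reason: the `[^a-z\s]` pattern would otherwise delete every uppercase letter.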

---

## Submission

1. Make sure all exercises are completed
2. Save this notebook
3. Create a Git repository and push your work
4. **Send the repository link to: yoroba93@gmail.com**