
# Lab 02: Basic NLP Preprocessing Techniques

**Course:** ITAI 2373 - Natural Language Processing  
**Module:** 02 - Text Preprocessing  
**Duration:** 2-3 hours  
**Student Name:** ________________  
**Date:** ________________

---

## 🎯 Learning Objectives

By completing this lab, you will:
1. Understand the critical role of preprocessing in NLP pipelines
2. Master fundamental text preprocessing techniques
3. Compare different libraries and their approaches
4. Analyze the effects of preprocessing on text data
5. Build a complete preprocessing pipeline
6. Load and work with different types of text datasets

## 📖 Introduction to NLP Preprocessing

Natural Language Processing (NLP) preprocessing refers to the initial steps taken to clean and transform raw text data into a format that's more suitable for analysis by machine learning algorithms.

### Why is preprocessing crucial?

1. **Standardization:** Ensures consistent text format across your dataset
2. **Noise Reduction:** Removes irrelevant information that could confuse algorithms
3. **Complexity Reduction:** Simplifies text to focus on meaningful patterns
4. **Performance Enhancement:** Improves the efficiency and accuracy of downstream tasks

### Real-world Impact
Consider searching for "running shoes" vs "Running Shoes!" - without preprocessing, these might be treated as completely different queries. Preprocessing ensures they're recognized as equivalent.
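As a minimal sketch of that idea, the snippet below normalizes both queries by lowercasing and stripping ASCII punctuation, using only the Python standard library. The helper name `normalize_query` is hypothetical, and real search engines apply far more than these two steps.

```python
import string

def normalize_query(query: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    no_punct = query.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

raw_a = "running shoes"
raw_b = "Running Shoes!"

# The raw strings differ, but their normalized forms are identical.
print(raw_a == raw_b)
print(normalize_query(raw_a) == normalize_query(raw_b))
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes or emoji would need extra handling.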

### 🤔 Conceptual Question 1
**Before we start coding, think about your daily interactions with text processing systems (search engines, chatbots, translation apps). What challenges do you think these systems face when processing human language? List at least 3 specific challenges and explain why each is problematic.**

*Double-click this cell to write your answer:*

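**Challenge 1:** Ambiguity (polysemy and homonymy): a word such as "bank" can mean a financial institution or a riverside, so the system has to infer the intended sense from context.

**Challenge 2:** Nuance and contextual understanding (sarcasm, irony, idioms, cultural references): the literal words can say the opposite of what the writer means, which misleads systems that rely on surface patterns.

**Challenge 3:** Informal language (slang, emojis, misspellings, grammatical errors): non-standard forms are often missing from dictionaries and training data, so tokenization, tagging, and matching become less reliable.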

---

## 🛠️ Part 1: Environment Setup

We'll be working with two major NLP libraries:
- **NLTK (Natural Language Toolkit):** Comprehensive NLP library with extensive resources
- **spaCy:** Industrial-strength NLP with pre-trained models

**⚠️ Note:** Installation might take 2-3 minutes to complete.

In [1]:
# Step 1: Install Required Libraries
print("🔧 Installing NLP libraries...")

!pip install -q nltk spacy
!python -m spacy download en_core_web_sm

print("✅ Installation complete!")

🔧 Installing NLP libraries...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 55.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
✅ Installation complete!


### 🤔 Conceptual Question 2
**Why do you think we need to install a separate language model (en_core_web_sm) for spaCy? What components might this model contain that help with text processing? Think about what information a computer needs to understand English text.**

*Double-click this cell to write your answer:*

Installing a separate language model like en_core_web_sm for spaCy is crucial because language is diverse, and a single model can't effectively process all languages. It also allows for efficiency, as users only download the models they need, and specialization, enabling models to be optimized for specific domains or tasks.

To understand English text, a computer needs various linguistic components, which en_core_web_sm provides. These include:

- Tokenizer: Breaks text into words and punctuation using language-specific rules.
- Part-of-Speech (POS) Tagger: Assigns grammatical labels (noun, verb, adjective) to words.
- Lemmatizer: Reduces words to their base form (e.g., "running" to "run").
- Dependency Parser: Analyzes the grammatical relationships between words in a sentence.
- Named Entity Recognizer (NER): Identifies and categorizes entities like people, organizations, and locations.
- Sentence Segmenter (Senter): Detects sentence boundaries. It ships with the model but is disabled by default, because the dependency parser already provides sentence boundaries.
- tok2vec: Generates numerical token representations that downstream components such as the tagger and parser use as input features.

These components work together in a pipeline to provide a rich, structured view of the text, enabling various NLP tasks.

---

In [2]:
# Step 2: Import Libraries and Download NLTK Data
import nltk
import spacy
import string
import re
from collections import Counter

# Download essential NLTK data
print("📦 Downloading NLTK data packages...")
nltk.download('punkt')      # For tokenization
nltk.download('stopwords')  # For stop word removal
nltk.download('wordnet')    # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging
nltk.download('punkt_tab') # Download punkt_tab resource for tokenization


print("\n✅ All imports and downloads completed!")

📦 Downloading NLTK data packages...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

✅ All imports and downloads completed!


## 📂 Part 2: Sample Text Data

We'll work with different types of text to understand how preprocessing affects various text styles:
- Simple text
- Academic text (with citations, URLs)
- Social media text (with emojis, hashtags)
- News text (formal writing)
- Product reviews (informal, ratings)

In [3]:
# Step 3: Load Sample Texts
simple_text = "Natural Language Processing is a fascinating field of AI. It's amazing!"

academic_text = """
Dr. Smith's research on machine-learning algorithms is groundbreaking!
She published 3 papers in 2023, focusing on deep neural networks (DNNs).
The results were amazing - accuracy improved by 15.7%!
"This is revolutionary," said Prof. Johnson.
Visit https://example.com for more info. #NLP #AI @university
"""

social_text = "OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍"

news_text = """
The stock market experienced significant volatility today, with tech stocks leading the decline.
Apple Inc. (AAPL) dropped 3.2%, while Microsoft Corp. fell 2.8%.
"We're seeing a rotation out of growth stocks," said analyst Jane Doe from XYZ Capital.
"""

review_text = """
This laptop is absolutely fantastic! I've been using it for 6 months and it's still super fast.
The battery life is incredible - lasts 8-10 hours easily.
Only complaint: the keyboard could be better. Overall rating: 4.5/5 stars.
"""

# Store all texts
sample_texts = {
    "Simple": simple_text,
    "Academic": academic_text.strip(),
    "Social Media": social_text,
    "News": news_text.strip(),
    "Product Review": review_text.strip()
}

print("📄 Sample texts loaded successfully!")
for name, text in sample_texts.items():
    preview = text[:80] + "..." if len(text) > 80 else text
    print(f"\n🏷️ {name}: {preview}")

📄 Sample texts loaded successfully!

🏷️ Simple: Natural Language Processing is a fascinating field of AI. It's amazing!

🏷️ Academic: Dr. Smith's research on machine-learning algorithms is groundbreaking!
She publi...

🏷️ Social Media: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yu...

🏷️ News: The stock market experienced significant volatility today, with tech stocks lead...

🏷️ Product Review: This laptop is absolutely fantastic! I've been using it for 6 months and it's st...


### 🤔 Conceptual Question 3
**Looking at the different text types we've loaded, what preprocessing challenges do you anticipate for each type? For each text type below, identify at least 2 specific preprocessing challenges and explain why they might be problematic for NLP analysis.**

*Double-click this cell to write your answer:*

**Simple text challenges:**
1. Lack of Richness/Depth: Limited vocabulary and sentence structure hinder nuanced insight extraction.
2. Ambiguity in Short Sentences: Context can be lost, making tasks like coreference resolution difficult.

**Academic text challenges:**
1. Specialized Vocabulary and Jargon: Domain-specific terms and acronyms may not be understood by general NLP models.
2. Complex Sentence Structures and References: Long, intricate sentences and citations complicate parsing and information attribution.

**Social media text challenges:**
1. Informal Language and Non-Standard Grammar: Slang, abbreviations, and misspellings disrupt standard NLP tokenization and tagging.
2. Heavy Use of Emojis, Hashtags, and URLs: These non-textual elements carry meaning that NLP systems must interpret or handle appropriately.

**News text challenges:**
1. Named Entity Resolution and Disambiguation: Identifying and correctly categorizing numerous named entities, and distinguishing between homonyms, is complex.
2. Temporal Expressions and Event Extraction: Accurately identifying and sequencing time-related information and events is challenging.

**Product review challenges:**
1. Subjectivity and Sentiment Analysis: Identifying the nuanced sentiment and opinion within highly subjective text is difficult.
2. Feature Extraction and Aspect-Based Sentiment: Linking sentiment to specific product features (e.g., "screen" vs. "battery life") requires precise analysis.
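One common way to deal with URLs, hashtags, and mentions is to pull them out with regular expressions before tokenizing. The sketch below does this for one line of the academic sample text, using only the standard library. The helper names and the deliberately simple patterns are assumptions for illustration; production patterns are usually stricter.

```python
import re

# One line of the academic sample text from Step 3.
line = "Visit https://example.com for more info. #NLP #AI @university"

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")

def extract_special_elements(text: str) -> dict:
    """Collect URLs, hashtags, and mentions so they can be handled separately."""
    return {
        "urls": URL_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
    }

def strip_special_elements(text: str) -> str:
    """Remove the same elements (URLs first) and collapse leftover whitespace."""
    for pattern in (URL_RE, HASHTAG_RE, MENTION_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

found = extract_special_elements(line)
print(found)
print(strip_special_elements(line))
```

Whether to strip these elements or keep them as features depends on the task: for sentiment analysis the hashtags often carry signal and are worth keeping.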

---

## 🔤 Part 3: Tokenization

### What is Tokenization?
Tokenization is the process of breaking down text into smaller, meaningful units called **tokens**. These tokens are typically words, but can also be sentences, characters, or subwords.

### Why is it Important?
- Most NLP algorithms work with individual tokens, not entire texts
- It's the foundation for all subsequent preprocessing steps
- Different tokenization strategies can significantly impact results

### Common Challenges:
- **Contractions:** "don't" → "do" + "n't" or "don't"?
- **Punctuation:** Keep with words or separate?
- **Special characters:** How to handle @, #, URLs?
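To make these design choices concrete, here is a small regex tokenizer written with the standard library only. It is a sketch of one possible policy (keep contractions, hashtags, mentions, and URLs whole, split other punctuation), not how NLTK or spaCy tokenize. The name `simple_tokenize` and the pattern are assumptions for this example.

```python
import re

# Alternatives are tried left to right, so the specific patterns come first:
# URLs, then hashtags/mentions, then words with an optional contraction
# suffix, then any remaining single punctuation character.
TOKEN_RE = re.compile(r"https?://\S+|[#@]\w+|\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text: str) -> list:
    """Split text into tokens according to the policy encoded in TOKEN_RE."""
    return TOKEN_RE.findall(text)

sample = "Don't miss #NLP @university, see https://example.com now!"
tokens = simple_tokenize(sample)
print(tokens)
```

Changing a single alternative changes the policy. Dropping the `(?:'\w+)?` part, for instance, would split "Don't" into three tokens. The pattern only recognizes the ASCII apostrophe, so typographic apostrophes would need to be normalized first.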

In [4]:
# Step 4: Tokenization with NLTK
from nltk.tokenize import word_tokenize, sent_tokenize

# Test on simple text
print("🔍 NLTK Tokenization Results")
print("=" * 40)
print(f"Original: {simple_text}")

# Word tokenization
nltk_tokens = word_tokenize(simple_text)
print(f"\nWord tokens: {nltk_tokens}")
print(f"Number of tokens: {len(nltk_tokens)}")

# Sentence tokenization
sentences = sent_tokenize(simple_text)
print(f"\nSentences: {sentences}")
print(f"Number of sentences: {len(sentences)}")

🔍 NLTK Tokenization Results
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Word tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
Number of tokens: 14

Sentences: ['Natural Language Processing is a fascinating field of AI.', "It's amazing!"]
Number of sentences: 2


### 🤔 Conceptual Question 4
**Examine the NLTK tokenization results above. How did NLTK handle the contraction "It's"? What happened to the punctuation marks? Do you think this approach is appropriate for all NLP tasks? Explain your reasoning.**

*Double-click this cell to write your answer:*

**How "It's" was handled:** NLTK's default word tokenizer splits "It's" into two tokens: "It" and "'s". This separates the pronoun from the contracted form of "is" or "has".

**Punctuation treatment:** Punctuation marks (like "!" and ".") are treated as separate tokens. They are isolated from the words they are adjacent to.

**Appropriateness for different tasks:** This tokenization approach is not universally appropriate for all NLP tasks.

- Good for some tasks (e.g., POS tagging, lemmatization, dependency parsing): Separating contractions helps in accurately identifying the base words and their grammatical roles. Treating punctuation as distinct tokens can be beneficial for parsing sentence structure and understanding grammatical relationships.
- Problematic for other tasks (e.g., sentiment analysis, named entity recognition, exact phrase matching):
  - Sentiment analysis: Splitting "It's" might slightly complicate sentiment analysis if the contraction itself contributes to a specific nuance, though usually the individual words are sufficient.
  - Named Entity Recognition (NER): If a named entity includes an apostrophe (e.g., "McDonald's"), splitting it could hinder its recognition as a single entity.
  - Exact phrase matching and keyword search: If a user searches for an exact phrase like "It's amazing", the query must be tokenized the same way as the indexed text. Otherwise "It's" as one string will not match the two tokens "It" and "'s", and results are missed. For tasks requiring literal string matches, this tokenization is too aggressive.

In essence, this tokenization provides a granular linguistic breakdown, which is excellent for deep linguistic analysis, but it can be overly aggressive for tasks where the literal string or specific non-word elements are important.

---

In [5]:
# Step 5: Tokenization with spaCy
nlp = spacy.load('en_core_web_sm')

print("🔍 spaCy Tokenization Results")
print("=" * 40)
print(f"Original: {simple_text}")

# Process with spaCy
doc = nlp(simple_text)

# Extract tokens
spacy_tokens = [token.text for token in doc]
print(f"\nWord tokens: {spacy_tokens}")
print(f"Number of tokens: {len(spacy_tokens)}")

# Show detailed token information
print(f"\n🔬 Detailed Token Analysis:")
print(f"{'Token':<12} {'POS':<8} {'Lemma':<12} {'Is Alpha':<8} {'Is Stop':<8}")
print("-" * 50)
for token in doc:
    print(f"{token.text:<12} {token.pos_:<8} {token.lemma_:<12} {token.is_alpha:<8} {token.is_stop:<8}")

🔍 spaCy Tokenization Results
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Word tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
Number of tokens: 14

🔬 Detailed Token Analysis:
Token        POS      Lemma        Is Alpha Is Stop 
--------------------------------------------------
Natural      PROPN    Natural      1        0       
Language     PROPN    Language     1        0       
Processing   NOUN     processing   1        0       
is           AUX      be           1        1       
a            DET      a            1        1       
fascinating  ADJ      fascinating  1        0       
field        NOUN     field        1        0       
of           ADP      of           1        1       
AI           PROPN    AI           1        0       
.            PUNCT    .            0        0       
It           PRON     it           1        1       
's           AUX     

### 🤔 Conceptual Question 5
**Compare the NLTK and spaCy tokenization results. What differences do you notice? Which approach do you think would be better for different NLP tasks? Consider specific examples like sentiment analysis vs. information extraction.**

*Double-click this cell to write your answer:*

**Key differences observed:**

- Contractions: On this sentence the two tokenizers agree. Both split "It's" into "It" and "'s", and both return the same 14 tokens. The difference is that spaCy also annotates each token, for example tagging "'s" as an auxiliary (AUX).
- Punctuation: Both NLTK and spaCy separate punctuation from words. spaCy's tokenizer has built-in handling for cases such as URLs and email addresses, which it keeps as single tokens. Hashtags are not among them: by default "#coffee" is split into "#" and "coffee", as the social media test in Step 6 shows.
- Integrated pipeline: A core difference isn't the raw tokenization but what comes after. spaCy's tokenization is the first step in an integrated pipeline that immediately adds POS tags, lemmas, dependencies, and named entities. NLTK's tokenization is a standalone step, requiring explicit calls to other modules for further linguistic processing.

**Better for sentiment analysis:** For sentiment analysis, spaCy's approach is generally better.

- Annotated tokens for context: spaCy splits "It's" the same way NLTK does, but every token carries a lemma, a POS tag, and a stop-word flag. In a phrase such as "It's not good", the negation and the word it modifies can therefore be identified instead of treating the tokens as unrelated strings.
- Integrated linguistic features: spaCy's immediate access to POS tags, lemmas, and dependency parses right after tokenization allows sentiment analysis models to use this linguistic information directly. For example, knowing that "amazing" is an adjective and which word it describes ("It") provides a strong signal for positive sentiment.
- Handling social media: With custom tokenizer rules or extensions, spaCy can be adapted to the informalities and non-standard tokens (emojis, hashtags) prevalent in social media, which are critical for sentiment in those contexts.

**Better for information extraction:** For information extraction, spaCy's approach is significantly better.

- Named Entity Recognition (NER) integration: spaCy's models come with pre-trained NER capabilities directly following tokenization. This means it doesn't just break text into words but immediately identifies and classifies entities like people, organizations, locations, and dates. This is fundamental for information extraction (e.g., extracting "Dr. Smith" as a PERSON).
- Dependency parsing for relationships: spaCy's integrated dependency parser reveals grammatical relationships between words. This is vital for extracting structured information, such as who did what to whom, or what attribute belongs to which entity (e.g., identifying "groundbreaking" as an attribute of "research").
- Efficient pipeline: The integration of tokenization with the other components (NER, POS tagging, dependency parsing, lemmatization) makes spaCy efficient for extracting structured information, as all these steps are designed to work together.

**Overall assessment:** spaCy generally offers a more robust and efficient solution for most production-level NLP tasks, especially those requiring deeper linguistic understanding like sentiment analysis and information extraction. Its "batteries-included" approach with pre-trained models and an integrated pipeline provides richer linguistic annotations immediately after tokenization. While NLTK offers more granular control over individual preprocessing steps and a wider range of algorithms for research or specific low-level tasks, spaCy's holistic design makes it more practical and powerful for building end-to-end NLP applications. NLTK's tokenization is good for foundational text splitting, but spaCy's is part of a much more advanced linguistic processing engine.

---

In [6]:
# Step 6: Test Tokenization on Complex Text
print("🧪 Testing on Social Media Text")
print("=" * 40)
print(f"Original: {social_text}")

# NLTK approach
social_nltk_tokens = word_tokenize(social_text)
print(f"\nNLTK tokens: {social_nltk_tokens}")

# spaCy approach
social_doc = nlp(social_text)
social_spacy_tokens = [token.text for token in social_doc]
print(f"spaCy tokens: {social_spacy_tokens}")

print(f"\n📊 Comparison:")
print(f"NLTK token count: {len(social_nltk_tokens)}")
print(f"spaCy token count: {len(social_spacy_tokens)}")

🧪 Testing on Social Media Text
Original: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍

NLTK tokens: ['OMG', '!', 'Just', 'tried', 'the', 'new', 'coffee', 'shop', '☕️', 'SO', 'GOOD', '!', '!', '!', 'Highly', 'recommend', '👍', '#', 'coffee', '#', 'yum', '😍']
spaCy tokens: ['OMG', '!', 'Just', 'tried', 'the', 'new', 'coffee', 'shop', '☕', '️', 'SO', 'GOOD', '!', '!', '!', 'Highly', 'recommend', '👍', '#', 'coffee', '#', 'yum', '😍']

📊 Comparison:
NLTK token count: 22
spaCy token count: 23
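To see exactly where the two token lists diverge, they can be compared with `difflib.SequenceMatcher` from the standard library. The sketch below re-creates both lists from the output above instead of re-running the tokenizers. It assumes the coffee emoji is U+2615 followed by the variation selector U+FE0F, which is consistent with the 22 versus 23 token counts. The helper name `token_differences` is hypothetical.

```python
from difflib import SequenceMatcher

# Tokens shared by both outputs, before and after the coffee emoji.
head = ["OMG", "!", "Just", "tried", "the", "new", "coffee", "shop"]
tail = ["SO", "GOOD", "!", "!", "!", "Highly", "recommend",
        "\U0001F44D", "#", "coffee", "#", "yum", "\U0001F60D"]

# Assumption: the emoji is U+2615 plus the variation selector U+FE0F.
nltk_social = head + ["\u2615\ufe0f"] + tail           # NLTK: one token
spacy_social = head + ["\u2615", "\ufe0f"] + tail      # spaCy: two tokens

def token_differences(a, b):
    """Return the (operation, a_slice, b_slice) spans where the lists differ."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

diffs = token_differences(nltk_social, spacy_social)
print(len(nltk_social), len(spacy_social))
print(diffs)
```

Under that assumption the only difference is the emoji: NLTK keeps the emoji and its variation selector together, while spaCy emits the selector as its own token. Both tokenizers split the hashtags in the same way.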


### 🤔 Conceptual Question 6
**Looking at how the libraries handled social media text (emojis, hashtags), which library seems more robust for handling "messy" real-world text? What specific advantages do you notice? How might this impact a real-world application like social media sentiment analysis?**

*Double-click this cell to write your answer:*

**More robust library:** On this example neither library is clearly ahead at the tokenization step. The two outputs differ in one place only: NLTK kept "☕️" as a single token, while spaCy split the emoji from its variation selector, which explains the 23 versus 22 token counts. Both split "#coffee" and "#yum" into "#" plus the word. spaCy is still the more convenient base for "messy" real-world text overall, because of what it offers beyond raw tokenization.

**Specific advantages:**

- Emojis and hashtags: Neither default tokenizer keeps hashtags intact. NLTK's `word_tokenize` and spaCy both produce ['#', 'coffee']. NLTK's `TweetTokenizer` is designed to keep hashtags together, and spaCy can be extended for emojis (for example with spacymoji) or with custom tokenizer rules.
- Integrated pipeline for deeper understanding: spaCy's immediate processing after tokenization (POS tagging, dependency parsing, NER) means that even with messy text, it tries to extract more linguistic meaning. For example, when "SO GOOD!!!" is tokenized, spaCy still assigns POS tags and relates the words to each other, which is crucial for nuanced analysis.
- Customizable tokenization rules: spaCy's tokenizer can be modified with custom rules for patterns common in your text (e.g., specific slang, domain-specific abbreviations, or complex emoji combinations). NLTK also offers flexibility with regex tokenizers and the `TweetTokenizer`, but spaCy's overall architecture is often seen as more streamlined for building complete NLP pipelines with custom rules.

**Impact on sentiment analysis:** A well-configured tokenizer and an integrated pipeline would significantly benefit a real-world social media sentiment analysis application:

- Improved sentiment accuracy: Emojis are direct indicators of sentiment (e.g., ☕️ is neutral, 👍 is positive). If a tokenizer splits them or drops them, crucial sentiment cues are lost, so emoji and hashtag handling should be verified and customized rather than assumed. Similarly, keeping a hashtag like #sogood intact helps associate positive sentiment with a specific topic.
- Enhanced feature engineering: For machine learning models, retaining meaningful units like hashtags, emojis, and slang words as single tokens (or at least recognizing their unique properties) allows for better feature engineering. You can create features based on the presence of specific emojis or popular hashtags, directly contributing to model accuracy.
- Better contextual understanding: Social media text is often short and context-dependent. Enriching tokens with linguistic annotations (even if noisy) helps the sentiment model capture the overall meaning, leading to more accurate predictions than a system that simply breaks everything into disconnected words.
- Reduced manual preprocessing: A library with configurable tokenizer rules reduces the amount of ad hoc cleaning and regex writing needed to handle common social media quirks, accelerating development and deployment of sentiment analysis systems.
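When switching tokenizers is not an option, the split hashtags and stray variation selectors seen in Step 6 can be repaired after tokenization. The following is a heuristic standard-library sketch. The function name `merge_social_tokens` is hypothetical, and NLTK's `TweetTokenizer` or a custom spaCy tokenizer rule would be the more usual route.

```python
VARIATION_SELECTOR = "\ufe0f"  # U+FE0F, follows some emoji such as the coffee cup

def merge_social_tokens(tokens):
    """Drop lone variation selectors and re-attach '#'/'@' to the next word."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i].replace(VARIATION_SELECTOR, "")
        if not tok:  # the token was only a variation selector
            i += 1
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in {"#", "@"} and nxt.isalnum():
            merged.append(tok + nxt)  # '#', 'coffee' becomes '#coffee'
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged

# A slice of the spaCy output from Step 6, with the emoji written as escapes.
split_tokens = ["shop", "\u2615", "\ufe0f", "SO", "GOOD", "!",
                "#", "coffee", "#", "yum"]
repaired = merge_social_tokens(split_tokens)
print(repaired)
```

The merge is a heuristic: it would also join a free-standing "#" to a following word or number, so it suits hashtag-heavy text better than general prose.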

---

## 🛑 Part 4: Stop Words Removal

### What are Stop Words?
Stop words are common words that appear frequently in a language but typically don't carry much meaningful information about the content. Examples include "the", "is", "at", "which", "on", etc.

### Why Remove Stop Words?
1. **Reduce noise** in the data
2. **Improve efficiency** by reducing vocabulary size
3. **Focus on content words** that carry semantic meaning

### When NOT to Remove Stop Words?
- **Sentiment analysis:** "not good" vs "good" - the "not" is crucial!
- **Question answering:** "What is the capital?" - "what" and "is" provide context

In [7]:
# Step 7: Explore Stop Words Lists
from nltk.corpus import stopwords

# Get NLTK English stop words
nltk_stopwords = set(stopwords.words('english'))
print(f"📊 NLTK has {len(nltk_stopwords)} English stop words")
print(f"First 20: {sorted(list(nltk_stopwords))[:20]}")

# Get spaCy stop words
spacy_stopwords = nlp.Defaults.stop_words
print(f"\n📊 spaCy has {len(spacy_stopwords)} English stop words")
print(f"First 20: {sorted(list(spacy_stopwords))[:20]}")

# Compare the lists
common_stopwords = nltk_stopwords.intersection(spacy_stopwords)
nltk_only = nltk_stopwords - spacy_stopwords
spacy_only = spacy_stopwords - nltk_stopwords

print(f"\n🔍 Comparison:")
print(f"Common stop words: {len(common_stopwords)}")
print(f"Only in NLTK: {len(nltk_only)} - Examples: {sorted(list(nltk_only))[:5]}")
print(f"Only in spaCy: {len(spacy_only)} - Examples: {sorted(list(spacy_only))[:5]}")

📊 NLTK has 198 English stop words
First 20: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']

📊 spaCy has 326 English stop words
First 20: ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also']

🔍 Comparison:
Common stop words: 123
Only in NLTK: 75 - Examples: ['ain', 'aren', "aren't", 'couldn', "couldn't"]
Only in spaCy: 203 - Examples: ["'d", "'ll", "'m", "'re", "'s"]


### 🤔 Conceptual Question 7
**Why do you think NLTK and spaCy have different stop word lists? Look at the examples of words that are only in one list - do you agree with these choices? Can you think of scenarios where these differences might significantly impact your NLP results?**

*Double-click this cell to write your answer:*

**Reasons for differences:** The differences in NLTK's and spaCy's stop word lists stem from their distinct design philosophies and how each list has been curated:

- Development philosophy and curation:
  - NLTK (Natural Language Toolkit): Often seen as a more academic and research-oriented library, NLTK ships a shorter, general-purpose list (198 words in the output above). It contains bare contraction fragments such as "ain" and "aren" alongside full forms such as "aren't".
  - spaCy: Designed for production-ready applications, spaCy ships a longer list (326 words). It contains apostrophe-prefixed fragments such as "'d", "'ll", and "'s", which match the tokens its own tokenizer produces for contractions.
- Target use cases: NLTK offers a broad set of tools, and its stop word list aims for general applicability. spaCy exposes its list through the `is_stop` token attribute, so filtering can happen after the rest of the pipeline has run on the full text.
- Emphasis on linguistic detail: spaCy's pipeline already supplies POS tags and dependency labels, so its designers may have been more willing to flag function words as stop words. This is a plausible reading of the longer list, not something the output proves.

**Agreement with choices:** I lean towards a more conservative stop word list for many real-world applications, especially those involving deep learning, because removing too many words can strip away valuable context. Judged by the counts above, the more conservative list is NLTK's (198 words), not spaCy's (326 words, 203 of which are absent from NLTK).

- Which list contains a given word depends on the library version, so words such as "really", "very", "would", "should", and "could" should be checked directly rather than assumed.
- I agree with whichever choice keeps words that can subtly influence meaning. For instance, removing "very" from "This is very good" loses the intensifier. Similarly, removing "would" from "I would like to go" changes the nuance from a request to a direct statement. These words are common, but they are not always devoid of meaning.

**Scenarios where differences matter:**

- Sentiment analysis: If the chosen list removes intensifiers like "very" or "really", a sentiment analysis model might misjudge the strength of emotion. "This movie was good" and "This movie was very good" could end up with similar scores, reducing granularity.
- Topic modeling (e.g., Latent Dirichlet Allocation, LDA): Stop word removal helps topic models focus on meaningful keywords. If a list is too aggressive and removes words that differentiate sub-topics in a specific domain (e.g., if "review" or "product" were stop words in a product review dataset), topics become less coherent or less distinct. Conversely, if too few words are removed, common words might dominate topics.
- Information retrieval and keyword search: If a query contains a phrase like "should I buy" and "should" is a stop word, the intent of the query might be lost or misinterpreted, leading to irrelevant search results. Users often include stop words in natural language queries to convey specific meaning.
- Chatbots and intent recognition: If words like "can", "could", or "would" are aggressively removed, a chatbot might struggle to distinguish between a polite request ("Could you help me?") and a direct command ("Help me!"), leading to inappropriate responses.
- Named Entity Recognition (NER) and relation extraction: Certain stop words are crucial connecting elements that define relationships. Prepositions are often stop words, yet they are vital for understanding spatial or causal relationships between entities (e.g., "Paris in France"). Over-aggressive stop word removal could break these relational cues.

In summary, while stop word removal is a standard preprocessing step, the specific list chosen can have significant consequences depending on the NLP task and the nuances of the text data. No single list is optimal across all applications.
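Claims about which list contains which word are easy to check instead of guessing. The helper below is a sketch: `compare_stop_word_lists` is a hypothetical name, and the two toy sets are stand-ins, not the real NLTK or spaCy lists. It reports the overlap between two lists and the membership of a few probe words.

```python
def compare_stop_word_lists(list_a, list_b, probes):
    """Summarize overlap and report (in_a, in_b) membership for probe words."""
    a, b = set(list_a), set(list_b)
    summary = {
        "common": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }
    membership = {word: (word in a, word in b) for word in probes}
    return summary, membership

# Toy stand-ins so the sketch runs on its own (NOT the real library lists).
toy_a = {"the", "is", "very", "should"}
toy_b = {"the", "is", "very", "really", "would"}

summary, membership = compare_stop_word_lists(toy_a, toy_b,
                                              ["very", "really", "not"])
print(summary)
print(membership)
```

In the notebook, passing `nltk_stopwords` and `spacy_stopwords` from Step 7 in place of the toy sets should reproduce the 123 / 75 / 203 split and show, for the installed versions, which list actually contains words such as "very", "really", or "would".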

---

In [8]:
# Step 8: Remove Stop Words with NLTK
# Test on simple text
original_tokens = nltk_tokens  # From earlier tokenization
filtered_tokens = [word for word in original_tokens if word.lower() not in nltk_stopwords]

print("🧪 NLTK Stop Word Removal")
print("=" * 40)
print(f"Original: {simple_text}")
print(f"\nOriginal tokens ({len(original_tokens)}): {original_tokens}")
print(f"After removing stop words ({len(filtered_tokens)}): {filtered_tokens}")

# Show which words were removed
removed_words = [word for word in original_tokens if word.lower() in nltk_stopwords]
print(f"\nRemoved words: {removed_words}")

# Calculate reduction percentage
reduction = (len(original_tokens) - len(filtered_tokens)) / len(original_tokens) * 100
print(f"Vocabulary reduction: {reduction:.1f}%")

🧪 NLTK Stop Word Removal
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Original tokens (14): ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
After removing stop words (10): ['Natural', 'Language', 'Processing', 'fascinating', 'field', 'AI', '.', "'s", 'amazing', '!']

Removed words: ['is', 'a', 'of', 'It']
Vocabulary reduction: 28.6%


In [9]:
# Step 9: Remove Stop Words with spaCy
doc = nlp(simple_text)
spacy_filtered = [token.text for token in doc if not token.is_stop and not token.is_punct]

print("🧪 spaCy Stop Word Removal")
print("=" * 40)
print(f"Original: {simple_text}")
print(f"\nOriginal tokens ({len(spacy_tokens)}): {spacy_tokens}")
print(f"After removing stop words & punctuation ({len(spacy_filtered)}): {spacy_filtered}")

# Show which words were removed
spacy_removed = [token.text for token in doc if token.is_stop or token.is_punct]
print(f"\nRemoved words: {spacy_removed}")

# Calculate reduction percentage
spacy_reduction = (len(spacy_tokens) - len(spacy_filtered)) / len(spacy_tokens) * 100
print(f"Vocabulary reduction: {spacy_reduction:.1f}%")

🧪 spaCy Stop Word Removal
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Original tokens (14): ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
After removing stop words & punctuation (7): ['Natural', 'Language', 'Processing', 'fascinating', 'field', 'AI', 'amazing']

Removed words: ['is', 'a', 'of', '.', 'It', "'s", '!']
Vocabulary reduction: 50.0%
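
The figure labelled "Vocabulary reduction" in the two cells above is computed from token counts, so strictly it measures token reduction. Vocabulary usually means the set of distinct word types. The sketch below shows how the two can differ, assuming a made-up sentence with repeated words, a two-word stop list, and a hypothetical helper `pct_reduction`.

```python
def pct_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100


# Made-up example with repeated tokens (not one of the lab's sample texts).
tokens = "the cat sat on the mat the end".split()
stopwords = {"the", "on"}  # tiny illustrative stop list
filtered = [t for t in tokens if t not in stopwords]

# Token reduction counts every occurrence.
token_reduction = pct_reduction(len(tokens), len(filtered))

# Vocabulary reduction counts each distinct type once.
vocab_reduction = pct_reduction(len(set(tokens)), len(set(filtered)))

print(f"Token reduction:      {token_reduction:.1f}%")
print(f"Vocabulary reduction: {vocab_reduction:.1f}%")
```

In the lab's sample sentence all 14 tokens are distinct, so the two measures coincide there. They diverge on longer texts where stop words repeat.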


### 🤔 Conceptual Question 8
**Compare the NLTK and spaCy stop word removal results. Which approach removed more words? Do you think removing punctuation (as spaCy did) is always a good idea? Give a specific example where keeping punctuation might be important for NLP analysis.**

*Double-click this cell to write your answer:*

**Which removed more:** spaCy removed more in this run: 7 of 14 tokens (50.0%) versus 4 of 14 (28.6%) for NLTK. Two things explain the gap. First, the spaCy cell also filtered on `token.is_punct`, which dropped "." and "!", while the NLTK cell kept punctuation. Second, spaCy's stop word list caught "'s", which NLTK's list left in place. Counting stop words alone, spaCy removed five tokens and NLTK four, so spaCy's default list was slightly more aggressive on this sentence.

**Punctuation removal assessment:** No, removing punctuation is not always a good idea. Note that spaCy does not strip punctuation on its own; it was removed here because the code explicitly filtered on `token.is_punct`. Punctuation carries little lexical meaning and can be treated as noise for some tasks (simple word counting, basic topic modeling), but it also marks sentence boundaries, emphasis, and tone.

**Example where punctuation matters:** A specific example where keeping punctuation is important for NLP analysis is sentiment analysis, especially when dealing with sarcasm, irony, or strong emotion.

Consider the phrases:

- "That's great."
- "That's great!"
- "That's great..."
- "That's great?!"
- "That's \*great\*."

If all punctuation is removed, these would all be reduced to "That's great". However:

- "That's great!" often indicates genuine positive sentiment or excitement.
- "That's great..." might imply sarcasm, disappointment, or a trailing-off thought.
- "That's great?!" could convey disbelief, indignation, or a question with strong negative undertones.
- "That's \*great\*." (asterisks or quotes often signal emphasis or irony in informal text) frequently marks sarcasm.

In these cases, the punctuation (exclamation marks, ellipses, question marks, even asterisks or quotes used to highlight words) directly modifies the sentiment or the speaker's intent. Removing it discards that information and can lead to inaccurate sentiment classification. Similarly, in formal text, commas, semicolons, and periods are vital for correctly parsing sentence structure and understanding logical relationships, which is critical for tasks like information extraction and dependency parsing.
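
A common compromise is to record punctuation cues as features before stripping them from the text. The sketch below is one possible approach, not a standard API: `punctuation_features` and `strip_punctuation` are hypothetical helpers, and the choice of cues is an assumption.

```python
import re


def punctuation_features(text):
    """Count punctuation cues that often carry sentiment or tone."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "has_ellipsis": "..." in text or "\u2026" in text,
        "starred_words": re.findall(r"\*(\w+)\*", text),
    }


def strip_punctuation(text):
    """Remove everything except word characters, apostrophes, and whitespace."""
    return re.sub(r"[^\w\s']", "", text).strip()


phrases = ["That's great.", "That's great!", "That's great...",
           "That's great?!", "That's *great*."]

for phrase in phrases:
    # The stripped text is identical for every phrase; only the features differ.
    print(f"{strip_punctuation(phrase)!r:<16} {punctuation_features(phrase)}")
```

A downstream model can then receive both the normalized text and the feature dictionary, so the tone signal is not lost.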

---

## 🌱 Part 5: Lemmatization and Stemming

### What is Lemmatization?
Lemmatization reduces words to their base or dictionary form (called a **lemma**). It considers context and part of speech to ensure the result is a valid word.

### What is Stemming?
Stemming reduces words to their root form by removing suffixes. It's faster but less accurate than lemmatization.

### Key Differences:
| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| Speed | Fast | Slower |
| Accuracy | Lower | Higher |
| Output | May be non-words | Always valid words |
| Context | Ignores context | Considers context |

### Examples:
- **"running"** → Stem: "run", Lemma: "run"
- **"better"** → Stem: "better", Lemma: "good" (when treated as an adjective; spaCy returns "well" in the comparison later in this lab)
- **"was"** → Stem: "wa", Lemma: "be"

In [10]:
# Step 10: Stemming with NLTK
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Test words that demonstrate stemming challenges
test_words = ['running', 'runs', 'ran', 'better', 'good', 'best', 'flying', 'flies', 'was', 'were', 'cats', 'dogs']

print("🌿 Stemming Demonstration")
print("=" * 30)
print(f"{'Original':<12} {'Stemmed':<12}")
print("-" * 25)

for word in test_words:
    stemmed = stemmer.stem(word)
    print(f"{word:<12} {stemmed:<12}")

# Apply to our sample text
sample_tokens = [token for token in nltk_tokens if token.isalpha()]
stemmed_tokens = [stemmer.stem(token.lower()) for token in sample_tokens]

print(f"\n🧪 Applied to sample text:")
print(f"Original: {sample_tokens}")
print(f"Stemmed: {stemmed_tokens}")

🌿 Stemming Demonstration
Original     Stemmed     
-------------------------
running      run         
runs         run         
ran          ran         
better       better      
good         good        
best         best        
flying       fli         
flies        fli         
was          wa          
were         were        
cats         cat         
dogs         dog         

🧪 Applied to sample text:
Original: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', 'It', 'amazing']
Stemmed: ['natur', 'languag', 'process', 'is', 'a', 'fascin', 'field', 'of', 'ai', 'it', 'amaz']


### 🤔 Conceptual Question 9
**Look at the stemming results above. Can you identify any cases where stemming produced questionable results? For example, how were "better" and "good" handled? Do you think this is problematic for NLP applications? Explain your reasoning.**

*Double-click this cell to write your answer:*

**Questionable results identified:** Several outputs are questionable. "flying" and "flies" both become "fli", and "was" becomes "wa"; neither is a real word. Irregular forms are not connected to their base: "ran" stays "ran" instead of joining "running" and "runs" at "run", "were" stays "were", and "better" and "best" are left unchanged rather than being linked to "good".

**Assessment of "better" and "good":** This is problematic because, while semantically "better" is the comparative form of "good," a stemmer treats them as entirely unrelated words. It only chops suffixes based on linguistic rules, not on the word's dictionary form or meaning.

**Impact on NLP applications:** This behavior is highly problematic for many NLP applications because it leads to a loss of valuable semantic information:

- Information Retrieval/Search: A search for "good" content might miss documents containing "better" due to different stemmed forms, reducing recall.
- Sentiment Analysis: Both "good" and "better" convey positive sentiment, with "better" often indicating a stronger degree. If they are stemmed to different roots, a sentiment model might fail to recognize their shared positivity or the intensity conveyed by "better," leading to less accurate or granular sentiment scores.
- Topic Modeling/Text Classification: When building models to understand topics or categorize documents, treating semantically related words like "good" and "better" as distinct features can dilute the signal for a concept (e.g., "quality"). The model won't effectively group all mentions of "positive quality" under one generalized term.

In essence, stemming's aggressive, rule-based approach can destroy crucial semantic links between words, making it less suitable for applications that require a deeper understanding of language meaning. This is why lemmatization is often preferred over stemming for tasks where preserving the true base form and semantic relationships of words is critical.
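
The gap between suffix rules and dictionary lookup can be seen in a toy sketch. Everything here is illustrative: `naive_stem` is a deliberately crude stand-in and not the Porter algorithm, and `IRREGULAR_FORMS` is a small hypothetical table of the kind a lemmatizer relies on (mapping "better" to "good" assumes the adjective reading).

```python
# Hypothetical lookup table; real lemmatizers use far larger dictionaries plus POS tags.
IRREGULAR_FORMS = {"better": "good", "best": "good", "ran": "run",
                   "was": "be", "were": "be"}


def naive_stem(word):
    """Strip one common suffix; a crude illustration, not the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_then_stem(word):
    """Check the irregular-form table first, then fall back to suffix rules."""
    word = word.lower()
    return IRREGULAR_FORMS.get(word, naive_stem(word))


for w in ["better", "ran", "cats", "jumping"]:
    print(f"{w:<10} rules only: {naive_stem(w):<8} with lookup: {lookup_then_stem(w)}")
```

Suffix rules alone cannot relate "better" to "good" or "ran" to "run", because no suffix connects them; only stored knowledge about the word can.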

---

In [11]:
# Step 11: Lemmatization with spaCy
print("🌱 spaCy Lemmatization Demonstration")
print("=" * 40)

# Test on a complex sentence
complex_sentence = "The researchers were studying the effects of running and swimming on better performance."
doc = nlp(complex_sentence)

print(f"Original: {complex_sentence}")
print(f"\n{'Token':<15} {'Lemma':<15} {'POS':<10} {'Explanation':<20}")
print("-" * 65)

for token in doc:
    if token.is_alpha:
        explanation = "No change" if token.text.lower() == token.lemma_ else "Lemmatized"
        print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10} {explanation:<20}")

# Extract lemmas
lemmas = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
print(f"\n🔤 Lemmatized tokens (no stop words): {lemmas}")

🌱 spaCy Lemmatization Demonstration
Original: The researchers were studying the effects of running and swimming on better performance.

Token           Lemma           POS        Explanation         
-----------------------------------------------------------------
The             the             DET        No change           
researchers     researcher      NOUN       Lemmatized          
were            be              AUX        Lemmatized          
studying        study           VERB       Lemmatized          
the             the             DET        No change           
effects         effect          NOUN       Lemmatized          
of              of              ADP        No change           
running         run             VERB       Lemmatized          
and             and             CCONJ      No change           
swimming        swim            VERB       Lemmatized          
on              on              ADP        No change           
better          well          

In [12]:
# Step 12: Compare Stemming vs Lemmatization
comparison_words = ['better', 'running', 'studies', 'was', 'children', 'feet']

print("⚖️ Stemming vs Lemmatization Comparison")
print("=" * 50)
print(f"{'Original':<12} {'Stemmed':<12} {'Lemmatized':<12}")
print("-" * 40)

for word in comparison_words:
    # Stemming
    stemmed = stemmer.stem(word)

    # Lemmatization with spaCy
    doc = nlp(word)
    lemmatized = doc[0].lemma_

    print(f"{word:<12} {stemmed:<12} {lemmatized:<12}")

⚖️ Stemming vs Lemmatization Comparison
Original     Stemmed      Lemmatized  
----------------------------------------
better       better       well        
running      run          run         
studies      studi        study       
was          wa           be          
children     children     child       
feet         feet         foot        


### 🤔 Conceptual Question 10
**Compare the stemming and lemmatization results. Which approach do you think is more suitable for:**
1. **A search engine** (where speed is crucial and you need to match variations of words)?
2. **A sentiment analysis system** (where accuracy and meaning preservation are important)?
3. **A real-time chatbot** (where both speed and accuracy matter)?

**Explain your reasoning for each choice.**

*Double-click this cell to write your answer:*

**1. Search engine:** Stemming

Reasoning: For search engines, speed is crucial, and the primary goal is to match variations of words quickly to retrieve relevant documents. Stemming, being a faster, rule-based process, can rapidly reduce words to a common (though not always linguistically correct) root. While it might produce non-dictionary words (e.g., "beautiful" to "beauti"), for high-volume indexing and querying this speed advantage often outweighs the slight loss in linguistic precision. Users often search for keywords, and getting "runs" and "running" to match "run" matters more than perfect linguistic accuracy at that scale. One caveat: a rule-based stemmer does not conflate irregular forms (the Porter output above leaves "ran" unchanged), so those need a lemmatizer or a synonym list if they matter.


**2. Sentiment analysis:** Lemmatization

Reasoning: Sentiment analysis relies on accuracy and meaning preservation. Lemmatization reduces words to their dictionary base form (lemma) and handles irregular forms that suffix rules cannot (e.g., "was" to "be", "children" to "child", "feet" to "foot"). This keeps related forms together so that a sentiment model sees one feature instead of several. Stemming, by contrast, leaves irregular forms unconnected and can create non-words (e.g., "was" to "wa", "studies" to "studi"). Lemmatizer output still needs checking: in this lab spaCy lemmatized "better" to "well" rather than "good", which would separate it from "good" in a sentiment lexicon.

**3. Real-time chatbot:** Lemmatization

Reasoning: A real-time chatbot requires a balance of both speed and accuracy. While speed is necessary for responsive interactions, accuracy in understanding user intent and meaning is critical for a good user experience. Lemmatization, though slightly slower than stemming, provides a much more linguistically sound and accurate reduction of words. This precision is vital for the chatbot to correctly identify user queries, extract entities, and generate appropriate, contextually relevant responses. Modern NLP libraries (like spaCy) have highly optimized lemmatizers that are fast enough for most real-time applications, making the accuracy gain from lemmatization well worth any minor speed trade-off.




---

## 🧹 Part 6: Text Cleaning and Normalization

### What is Text Cleaning?
Text cleaning involves removing or standardizing elements that might interfere with analysis:
- **Case normalization** (converting to lowercase)
- **Punctuation removal**
- **Number handling** (remove, replace, or normalize)
- **Special character handling** (URLs, emails, mentions)
- **Whitespace normalization**

### Why is it Important?
- Ensures consistency across your dataset
- Reduces vocabulary size
- Improves model performance
- Handles edge cases in real-world data

In [13]:
# Step 13: Basic Text Cleaning
def basic_clean_text(text):
    """Apply basic text cleaning operations"""
    # Convert to lowercase
    text = text.lower()

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove extra spaces again
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test basic cleaning
test_text = "   Hello WORLD!!! This has 123 numbers and   extra spaces.   "
cleaned = basic_clean_text(test_text)

print("🧹 Basic Text Cleaning")
print("=" * 30)
print(f"Original: '{test_text}'")
print(f"Cleaned: '{cleaned}'")
print(f"Length reduction: {(len(test_text) - len(cleaned))/len(test_text)*100:.1f}%")

🧹 Basic Text Cleaning
Original: '   Hello WORLD!!! This has 123 numbers and   extra spaces.   '
Cleaned: 'hello world this has numbers and extra spaces'
Length reduction: 26.2%


In [14]:
# Step 14: Advanced Cleaning for Social Media
def advanced_clean_text(text):
    """Apply advanced cleaning for social media and web text"""
    # Remove URLs (http://, https://, or www. prefixes)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)

    # Convert hashtags (keep the word, remove #)
    text = re.sub(r'#(\w+)', r'\1', text)

    # Remove emojis (basic approach: symbols outside these four ranges, e.g. U+2615 hot beverage, are not removed)
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # Convert to lowercase and normalize whitespace
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test on social media text
print("🚀 Advanced Cleaning on Social Media Text")
print("=" * 45)
print(f"Original: {social_text}")

cleaned_social = advanced_clean_text(social_text)
print(f"Cleaned: {cleaned_social}")
print(f"Length reduction: {(len(social_text) - len(cleaned_social))/len(social_text)*100:.1f}%")

🚀 Advanced Cleaning on Social Media Text
Original: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍
Cleaned: omg! just tried the new coffee shop ☕️ so good!!! highly recommend coffee yum
Length reduction: 7.2%
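
The ☕️ emoji survives in the cleaned output above because U+2615 and the variation selector U+FE0F fall outside the four ranges in `emoji_pattern`. The sketch below widens the pattern. The extra ranges are an assumption about which Unicode blocks matter here and are still not exhaustive; `BROAD_EMOJI_PATTERN` and `strip_emojis` are hypothetical names.

```python
import re

BROAD_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u26FF"          # miscellaneous symbols (includes U+2615 hot beverage)
    "\u2700-\u27BF"          # dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero-width joiner
    "]+"
)


def strip_emojis(text):
    """Remove characters matched by the broad pattern, then tidy whitespace."""
    removed = BROAD_EMOJI_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", removed).strip()


# Same text as the lab's social media sample, written with escapes for clarity.
sample = ("OMG! Just tried the new coffee shop \u2615\uFE0F SO GOOD!!! "
          "Highly recommend \U0001F44D #coffee #yum \U0001F60D")
stripped = strip_emojis(sample)
print(stripped)
```

For production use, a maintained emoji library is usually a safer choice than hand-written ranges, since new emoji are added to Unicode regularly.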


### 🤔 Conceptual Question 11
**Look at the advanced cleaning results for the social media text. What information was lost during cleaning? Can you think of scenarios where removing emojis and hashtags might actually hurt your NLP application? What about scenarios where keeping them would be beneficial?**

*Double-click this cell to write your answer:*

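**Information lost:** In this run the cleaner removed the 👍 and 😍 emojis, stripped the "#" from the hashtags (the words "coffee" and "yum" were kept), and lowercased everything, including the emphatic "SO GOOD". The ☕️ emoji survived only because it falls outside the pattern's ranges. The information lost includes:

- Sentiment/Emotion: Emojis are direct, often unambiguous, indicators of emotion (e.g., joy, anger, sadness).
- Topic/Keywords: Hashtags explicitly label topics, trends, or communities. Without the "#", "coffee" can no longer be distinguished from an ordinary word in the sentence.
- Emphasis/Nuance: Emojis and capitalization can add emphasis or subtly alter the meaning of a statement (e.g., indicating sarcasm or humor).
- Context/Informality: The presence of emojis and hashtags provides context about the informal nature of social media communication.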

**Scenarios where removal hurts:** Removing emojis and hashtags can significantly damage the performance of NLP applications that rely on understanding sentiment, topic, or informal communication:

- Sentiment Analysis: Without emojis (e.g., "😂", "❤️", "😡") and sentiment-laden hashtags (e.g., "#bestdayever"), accurately determining the emotional tone of a social media post becomes much harder, especially with short or ambiguous text. Sarcasm, in particular, is often clarified by specific emojis.
- Trend/Topic Detection: Hashtags are direct signals for what's currently trending or being discussed. Removing them makes it difficult to identify emerging topics, track campaigns, or understand public discourse on specific subjects.

**Scenarios where keeping helps:** Conversely, retaining emojis and hashtags can be highly beneficial for NLP applications that need a deeper, more nuanced understanding of social media content:

- Enhanced Sentiment Analysis: Directly leveraging emojis and hashtags allows models to achieve higher accuracy in sentiment prediction, especially for informal or highly emotive text.
- Precise Topic Modeling/Content Categorization: Hashtags provide explicit, human-curated labels for content, which can significantly improve the quality and interpretability of topic models or aid in content categorization.
- User Profiling/Behavioral Analysis: Analyzing emoji and hashtag usage patterns can offer insights into user demographics, interests, and online behavior.
- Irony/Sarcasm Detection: Certain emojis act as strong indicators of irony or sarcasm, which are otherwise challenging for NLP models to detect.
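
One way to keep this signal is to turn emojis into word-like tokens and collect hashtags separately, instead of deleting them. The sketch below uses the standard library's `unicodedata` names. Treating every character in Unicode category "So" (Symbol, other) as an emoji is a simplifying assumption that also catches symbols such as ©, and `emoji_to_token` and `describe_symbols` are hypothetical helpers.

```python
import re
import unicodedata


def emoji_to_token(char):
    """Return an underscore-joined Unicode name for a symbol, or '' if unnamed."""
    name = unicodedata.name(char, "")
    return name.lower().replace(" ", "_").replace("-", "_")


def describe_symbols(text):
    """Replace 'So' category characters with name tokens and collect hashtags."""
    hashtags = re.findall(r"#(\w+)", text)
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            pieces.append(f" {emoji_to_token(ch)} ")
        elif ch == "\uFE0F":
            continue  # drop the variation selector that follows some emojis
        else:
            pieces.append(ch)
    cleaned = re.sub(r"\s+", " ", "".join(pieces)).strip()
    return cleaned, hashtags


sample = "New coffee shop \u2615\uFE0F Highly recommend \U0001F44D #coffee #yum \U0001F60D"
cleaned, hashtags = describe_symbols(sample)
print(cleaned)
print(hashtags)
```

The name tokens can then pass through the rest of a pipeline like ordinary words, provided later steps do not filter on `isalpha()`, which rejects underscores.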

---

## 🔧 Part 7: Building a Complete Preprocessing Pipeline

Now let's combine everything into a comprehensive preprocessing pipeline that you can customize based on your needs.

### Pipeline Components:
1. **Text cleaning** (basic or advanced)
2. **Tokenization** (NLTK or spaCy)
3. **Stop word removal** (optional)
4. **Lemmatization/Stemming** (optional)
5. **Additional filtering** (length, etc.)

In [15]:
# Step 15: Complete Preprocessing Pipeline
def preprocess_text(text,
                   clean_level='basic',     # 'basic' or 'advanced'
                   remove_stopwords=True,
                   use_lemmatization=True,
                   use_stemming=False,
                   min_length=2):
    """
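    Complete text preprocessing pipeline.

    clean_level: 'basic' (lowercase, strip punctuation and digits) or
                 'advanced' (strip URLs, emails, mentions, emojis; keep punctuation).
    remove_stopwords: drop spaCy stop words when lemmatizing, NLTK's otherwise.
    use_lemmatization: tokenize and lemmatize with spaCy; takes precedence over stemming.
    use_stemming: apply the Porter stemmer (only when use_lemmatization is False).
    min_length: discard tokens shorter than this many characters.

    Returns a list of lowercase alphabetic tokens.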
    """
    # Step 1: Clean text
    if clean_level == 'basic':
        cleaned_text = basic_clean_text(text)
    else:
        cleaned_text = advanced_clean_text(text)

    # Step 2: Tokenize
    if use_lemmatization:
        # Use spaCy for lemmatization
        doc = nlp(cleaned_text)
        tokens = [token.lemma_.lower() for token in doc if token.is_alpha]
    else:
        # Use NLTK for basic tokenization
        tokens = word_tokenize(cleaned_text)
        tokens = [token for token in tokens if token.isalpha()]

    # Step 3: Remove stop words
    if remove_stopwords:
        if use_lemmatization:
            tokens = [token for token in tokens if token not in spacy_stopwords]
        else:
            tokens = [token.lower() for token in tokens if token.lower() not in nltk_stopwords]

    # Step 4: Apply stemming if requested
    if use_stemming and not use_lemmatization:
        tokens = [stemmer.stem(token.lower()) for token in tokens]

    # Step 5: Filter by length
    tokens = [token for token in tokens if len(token) >= min_length]

    return tokens

print("🔧 Preprocessing Pipeline Created!")
print("✅ Ready to test different configurations.")

🔧 Preprocessing Pipeline Created!
✅ Ready to test different configurations.


In [16]:
# Step 16: Test Different Pipeline Configurations
test_text = sample_texts["Product Review"]
print(f"🎯 Testing on: {test_text[:100]}...")
print("=" * 60)

# Configuration 1: Minimal processing
minimal = preprocess_text(test_text,
                         clean_level='basic',
                         remove_stopwords=False,
                         use_lemmatization=False,
                         use_stemming=False)
print(f"\n1. Minimal processing ({len(minimal)} tokens):")
print(f"   {minimal[:10]}...")

# Configuration 2: Standard processing
standard = preprocess_text(test_text,
                          clean_level='basic',
                          remove_stopwords=True,
                          use_lemmatization=True)
print(f"\n2. Standard processing ({len(standard)} tokens):")
print(f"   {standard[:10]}...")

# Configuration 3: Aggressive processing
aggressive = preprocess_text(test_text,
                            clean_level='advanced',
                            remove_stopwords=True,
                            use_lemmatization=False,
                            use_stemming=True,
                            min_length=3)
print(f"\n3. Aggressive processing ({len(aggressive)} tokens):")
print(f"   {aggressive[:10]}...")

# Show reduction percentages
original_count = len(word_tokenize(test_text))
print(f"\n📊 Token Reduction Summary:")
print(f"   Original: {original_count} tokens")
print(f"   Minimal: {len(minimal)} ({(original_count-len(minimal))/original_count*100:.1f}% reduction)")
print(f"   Standard: {len(standard)} ({(original_count-len(standard))/original_count*100:.1f}% reduction)")
print(f"   Aggressive: {len(aggressive)} ({(original_count-len(aggressive))/original_count*100:.1f}% reduction)")

🎯 Testing on: This laptop is absolutely fantastic! I've been using it for 6 months and it's still super fast.
The ...

1. Minimal processing (34 tokens):
   ['this', 'laptop', 'is', 'absolutely', 'fantastic', 'ive', 'been', 'using', 'it', 'for']...

2. Standard processing (18 tokens):
   ['laptop', 'absolutely', 'fantastic', 've', 'use', 'month', 'super', 'fast', 'battery', 'life']...

3. Aggressive processing (21 tokens):
   ['laptop', 'absolut', 'fantast', 'use', 'month', 'still', 'super', 'fast', 'batteri', 'life']...

📊 Token Reduction Summary:
   Original: 47 tokens
   Minimal: 34 (27.7% reduction)
   Standard: 18 (61.7% reduction)
   Aggressive: 21 (55.3% reduction)


### 🤔 Conceptual Question 12
**Compare the three pipeline configurations (Minimal, Standard, Aggressive). For each configuration, analyze:**
1. **What information was preserved?**
2. **What information was lost?**
3. **What type of NLP task would this configuration be best suited for?**

*Double-click this cell to write your answer:*

**Minimal Processing:**
- Preserved: Every alphabetic word in its original inflected form, including stop words ("this", "is", "it"), and the original word order.
- Lost: Capitalization, punctuation, and numbers, because this configuration still runs `basic_clean_text`. The "6" in "6 months" disappears, and "I've" collapses to "ive" once the apostrophe is stripped.
- Best for: Tasks that need function words and word order but not case or punctuation. This includes:
  - N-gram or language-model features over lowercased text.
  - Phrase matching that ignores case and punctuation.
  - A baseline for measuring what each further preprocessing step changes.

Tasks that depend on capitalization or punctuation (NER, sarcasm detection, dependency parsing) need an even lighter pipeline that skips `basic_clean_text`.


**Standard Processing:**
- Preserved: The content-bearing words, reduced to lemmas ("using" becomes "use", "months" becomes "month").
- Lost: Capitalization, punctuation, numbers, spaCy stop words, and inflection (tense, number). Emojis and hashtag markers would not survive either, since only alphabetic tokens are kept. The pipeline also leaves an artifact: "I've" becomes "ive" after punctuation removal and shows up as the stray token "ve".
- Best for: Many general NLP tasks that benefit from normalizing word forms and reducing noise without over-simplifying. This includes:
  - Text Classification: Helps in grouping similar words (e.g., "run," "running," "ran" become "run").
  - Topic Modeling: Focuses on more significant words to identify underlying themes.
  - Initial Sentiment Analysis: Provides a cleaner dataset for sentiment models.
  - Building a vocabulary for machine learning models.

**Aggressive Processing:**
- Preserved: Content words reduced to Porter stems, plus some words that spaCy's list treats as stop words but NLTK's does not (e.g., "still").
- Lost: Capitalization, numbers, punctuation tokens, NLTK stop words, tokens shorter than three characters, and dictionary word forms ("absolutely" becomes "absolut", "battery" becomes "batteri"). Despite the name, this configuration kept more tokens than Standard (21 versus 18), most likely because NLTK's stop word list is smaller than spaCy's.
- Best for: Applications where speed matters most and a coarse view of text content is sufficient. This includes:
  - Large-scale Information Retrieval (very fast matching): Quickly reduces words to a common root for high-volume searches.
  - Basic Document Clustering: Groups documents purely by their most fundamental word content.
  - Simple Bag-of-Words models: Where only word presence/frequency matters, not linguistic detail.


---

In [17]:
# Step 17: Comprehensive Analysis Across Text Types
print("🔬 Comprehensive Preprocessing Analysis")
print("=" * 50)

# Test standard preprocessing on all text types
results = {}
for name, text in sample_texts.items():
    original_tokens = len(word_tokenize(text))
    processed_tokens = preprocess_text(text,
                                      clean_level='basic',
                                      remove_stopwords=True,
                                      use_lemmatization=True)

    reduction = (original_tokens - len(processed_tokens)) / original_tokens * 100
    results[name] = {
        'original': original_tokens,
        'processed': len(processed_tokens),
        'reduction': reduction,
        'sample': processed_tokens[:8]
    }

    print(f"\n📄 {name}:")
    print(f"   Original: {original_tokens} tokens")
    print(f"   Processed: {len(processed_tokens)} tokens ({reduction:.1f}% reduction)")
    print(f"   Sample: {processed_tokens[:8]}")

# Summary table
print(f"\n\n📋 Summary Table")
print(f"{'Text Type':<15} {'Original':<10} {'Processed':<10} {'Reduction':<10}")
print("-" * 50)
for name, data in results.items():
    print(f"{name:<15} {data['original']:<10} {data['processed']:<10} {data['reduction']:<10.1f}%")

🔬 Comprehensive Preprocessing Analysis

📄 Simple:
   Original: 14 tokens
   Processed: 7 tokens (50.0% reduction)
   Sample: ['natural', 'language', 'processing', 'fascinating', 'field', 'ai', 'amazing']

📄 Academic:
   Original: 61 tokens
   Processed: 26 tokens (57.4% reduction)
   Sample: ['dr', 'smith', 'research', 'machinelearning', 'algorithm', 'groundbreake', 'publish', 'paper']

📄 Social Media:
   Original: 22 tokens
   Processed: 10 tokens (54.5% reduction)
   Sample: ['omg', 'try', 'new', 'coffee', 'shop', 'good', 'highly', 'recommend']

📄 News:
   Original: 51 tokens
   Processed: 25 tokens (51.0% reduction)
   Sample: ['stock', 'market', 'experience', 'significant', 'volatility', 'today', 'tech', 'stock']

📄 Product Review:
   Original: 47 tokens
   Processed: 18 tokens (61.7% reduction)
   Sample: ['laptop', 'absolutely', 'fantastic', 've', 'use', 'month', 'super', 'fast']


📋 Summary Table
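Text Type       Original   Processed  Reduction 
--------------------------------------------------
Simple          14         7          50.0      %
Academic        61         26         57.4      %
Social Media    22         10         54.5      %
News            51         25         51.0      %
Product Review  47         18         61.7      %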

### 🤔 Final Conceptual Question 13
**Looking at the comprehensive analysis results across all text types:**

1. **Which text type was most affected by preprocessing?** Why do you think this happened?

2. **Which text type was least affected?** What does this tell you about the nature of that text?

3. **If you were building an NLP system to analyze customer reviews for a business, which preprocessing approach would you choose and why?**

4. **What are the main trade-offs you need to consider when choosing preprocessing techniques for any NLP project?**

*Double-click this cell to write your answer:*

**1. Most affected text type:** By token count, the Product Review was most affected: 47 tokens fell to 18, a 61.7% reduction, followed by Academic (57.4%) and Social Media (54.5%).
Why: The review is written in a conversational, first-person style ("I've been using it for 6 months and it's still super fast"), so a large share of its tokens are pronouns, auxiliaries, contractions, punctuation, and numbers, all of which the standard pipeline discards. Token counts do not tell the whole story, though. The Social Media text lost less by count but arguably more in kind, since its emojis, hashtag markers, and emphatic capitalization carried sentiment that the surviving tokens do not.

**2. Least affected text type:** The Simple text was least affected (50.0% reduction), with News close behind (51.0%).
What it tells you about the nature of that text: Both are plain declarative prose with a high density of content words. The News sample in particular is dominated by nouns and adjectives ("stock", "market", "significant", "volatility"), which survive stop word removal and lemmatization largely intact. Even so, roughly half of every text was removed, which shows how much of ordinary English consists of stop words and punctuation.

**3. For customer review analysis:** For analyzing customer reviews, I would choose a Standard Processing pipeline, with careful consideration for handling specific sentiment-bearing elements.

Why:

- **Lemmatization:** Important for grouping word variations so that sentiment and feature extraction stay consistent (e.g., "batteries" and "battery", or, with a POS-aware lemmatizer, "better" and "good"). Reviews often use many forms of the same word to describe features or overall satisfaction.
- **Careful Stop Word Removal:** Most common stop words should be removed to focus on content, but intensifiers (like "very" or "really") and negations (like "not") should be kept if the stop word list contains them, because they change sentiment strength or polarity.
- **Punctuation Handling:** Most punctuation can be removed, but specific marks (like repeated exclamation marks "!!!" or ellipses "...") should be preserved or encoded as tokens, since they often convey strong emotion or hesitation in reviews.
- **Emoji/Special Character Awareness:** If reviews come from platforms where emojis are common, they should be handled explicitly (e.g., converted to sentiment scores or textual descriptions) rather than discarded, as they are direct sentiment indicators.
- **Avoid Aggressive Stemming:** Stemming is often too crude for reviews. It can produce non-word stems and merge words with different meanings, which degrades the accuracy of sentiment and feature attribution.
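The choices listed above can be sketched as a small review-oriented cleaning function. This is a minimal illustration, not the lab's pipeline: the stop word set, the `KEEP_WORDS` set, the `EMOJI_TO_TEXT` mapping, the placeholder tokens `multi_exclaim` and `ellipsis`, and the function name `preprocess_review` are all assumptions introduced here. A real project would start from a full stop word list (for example from NLTK or spaCy) and a much larger emoji mapping, and would add lemmatization afterwards.

```python
import re

# Hypothetical stand-in for a stock stop word list that happens to
# contain intensifiers and a negation.
BASE_STOP_WORDS = {"the", "a", "an", "is", "was", "it", "this", "and",
                   "very", "really", "not", "i"}
# Words we deliberately keep because they change sentiment strength or polarity.
KEEP_WORDS = {"very", "really", "not"}
REVIEW_STOP_WORDS = BASE_STOP_WORDS - KEEP_WORDS

# Hypothetical emoji-to-token mapping (illustration only).
EMOJI_TO_TEXT = {"😍": "emoji_love", "😡": "emoji_angry", "👍": "emoji_thumbs_up"}


def preprocess_review(text):
    """Clean a review while keeping sentiment-bearing signals as tokens."""
    text = text.lower()
    # Replace known emojis with placeholder tokens instead of dropping them.
    for emoji, label in EMOJI_TO_TEXT.items():
        text = text.replace(emoji, f" {label} ")
    # Encode emphatic punctuation before the remaining punctuation is discarded.
    text = re.sub(r"!{2,}", " multi_exclaim ", text)
    text = re.sub(r"\.{3,}", " ellipsis ", text)
    # Keep letter/underscore tokens, allowing one internal apostrophe ("don't").
    tokens = re.findall(r"[a-z_]+(?:'[a-z]+)?", text)
    return [t for t in tokens if t not in REVIEW_STOP_WORDS]


print(preprocess_review("This phone is REALLY good!!! 😍 But the battery... not great."))
```

Keeping "really" and "not" and turning "!!!", "...", and the emoji into tokens means a downstream sentiment model can still see them, at the cost of a slightly larger vocabulary.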

**4. Main trade-offs to consider:** The main trade-offs are:

- **Accuracy vs. Speed/Computational Cost:** More sophisticated preprocessing (e.g., lemmatization, detailed entity resolution, custom rule handling) often improves downstream model accuracy but requires more processing time and computational resources. Simpler, more aggressive methods are faster but may sacrifice accuracy.
- **Information Preservation vs. Noise Reduction:** Aggressive preprocessing aims to reduce "noise" (e.g., punctuation, stop words, informalities) but risks removing semantic information or context that a specific task needs (e.g., sentiment from emojis, sarcasm from punctuation).
- **Generalizability vs. Specificity:** A highly tailored preprocessing pipeline might perform very well on one type of text but poorly on others. A more general pipeline is versatile but might not achieve optimal performance on any single text type.
- **Interpretability vs. Complexity:** Simpler preprocessing makes the transformed data and downstream model behavior easier to understand and debug. Complex pipelines, while potentially yielding better results, can become opaque and harder to manage.
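The information-versus-noise trade-off is easy to see with a configurable pipeline run in two modes. The sketch below is an assumption-laden illustration: `PipelineConfig`, `run_pipeline`, the `LIGHT` and `AGGRESSIVE` presets, the sample sentence, and the small stop word set are hypothetical names and data chosen for this example. The stop word set mimics stock lists that include negations such as "not".

```python
import re
from dataclasses import dataclass

# Hypothetical stop word list; stock lists often include negations like "not".
STOP_WORDS = {"the", "is", "at", "a", "an", "not", "all"}


@dataclass(frozen=True)
class PipelineConfig:
    """Switches for a minimal, configurable preprocessing pipeline."""
    lowercase: bool = True
    strip_non_alpha: bool = False
    remove_stop_words: bool = False


def run_pipeline(text, config):
    """Apply only the steps enabled in `config` and return the tokens."""
    if config.lowercase:
        text = text.lower()
    if config.strip_non_alpha:
        # Keep runs of letters only; punctuation and emoticons are dropped.
        tokens = re.findall(r"[A-Za-z]+", text)
    else:
        tokens = text.split()
    if config.remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


LIGHT = PipelineConfig()
AGGRESSIVE = PipelineConfig(strip_non_alpha=True, remove_stop_words=True)

sample = "Not bad at all!!! The battery is GREAT :)"
print("light:     ", run_pipeline(sample, LIGHT))
print("aggressive:", run_pipeline(sample, AGGRESSIVE))
```

The aggressive preset yields a compact token list, but "not bad at all" collapses to "bad", which flips the apparent sentiment. The light preset keeps the negation, the emphatic punctuation, and the emoticon, at the price of noisier tokens such as "all!!!".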

---

## 🎯 Lab Summary and Reflection

Congratulations! You've completed a comprehensive exploration of NLP preprocessing techniques.

### 🔑 Key Concepts You've Mastered:

1. **Text Preprocessing Fundamentals** - Understanding why preprocessing is crucial
2. **Tokenization Techniques** - NLTK vs spaCy approaches and their trade-offs
3. **Stop Word Management** - When to remove them and when to keep them
4. **Morphological Processing** - Stemming vs lemmatization for different use cases
5. **Text Cleaning Strategies** - Basic vs advanced cleaning for different text types
6. **Pipeline Design** - Building modular, configurable preprocessing systems

### 🎓 Real-World Applications:
These techniques form the foundation for search engines, chatbots, sentiment analysis, document classification, machine translation, and information extraction systems.

### 💡 Key Insights to Remember:
- **No Universal Solution**: Different NLP tasks require different preprocessing approaches
- **Trade-offs Are Everywhere**: Balance information preservation with noise reduction
- **Context Matters**: The same technique can help or hurt depending on your use case
- **Experimentation Is Key**: Always test and measure impact on your specific task

---

**Excellent work completing Lab 02!** 🎉

For your reflection journal, focus on the insights you gained about when and why to use different techniques, the challenges you encountered, and connections you made to real-world applications.