# NLP Preprocessing Comparison Suite

SP26‑CSC115‑N850: Machine Learning I  
Instructor: Professor Ann Aksut  
Student: Ericka‑Lee DeMaria  
Date: <Insert Date>  

## Purpose

This notebook demonstrates a complete natural language processing (NLP) preprocessing workflow in Python.  
It compares tokenization, stop‑word removal, stemming, lemmatization, part‑of‑speech (POS) tagging, and spaCy‑based linguistic analysis.  
The goal is to understand how different preprocessing techniques transform text and how these choices affect downstream machine learning tasks.

This notebook follows the CSC‑115 rubric and incorporates prior feedback on clarity, structure, documentation, and interpretability.


This notebook demonstrates how different NLP preprocessing techniques transform text.  
It compares:

- Tokenization  
- Stop‑word removal  
- Stemming (Porter, Lancaster, Snowball)  
- Lemmatization (noun vs verb)  
- POS tagging  
- spaCy tokenization, lemmatization, and POS tagging  
- A summary comparison table  
- A linguistic difficulty score  

This notebook is part of the *NLP Preprocessing Comparison Project*.


## NLP Pipeline Overview

The preprocessing pipeline used in this notebook is:

Raw Text  
→ Tokenization  
→ Stop‑Word Removal  
→ Stemming (Porter, Lancaster, Snowball)  
→ Lemmatization (noun vs verb)  
→ POS Tagging  
→ spaCy Pipeline (tokens, lemmas, POS)  
→ Summary Comparison Table  
→ Difficulty Score


In [1]:
# Imports and environment setup

import nltk
import spacy
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tag import pos_tag

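# Download required NLTK resources.
# quiet=True suppresses verbose download messages for a cleaner notebook.
# Recent NLTK releases look up "punkt_tab" and "averaged_perceptron_tagger_eng";
# the older names are kept so the notebook also runs on earlier releases.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)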

# Load the small English spaCy model.
# The model must already be installed (python -m spacy download en_core_web_sm).
# It provides tokenization, POS tagging, and lemmatization.
nlp = spacy.load("en_core_web_sm")


## CSC‑115 States

### Problem State
The problem is to preprocess natural language text using multiple NLP techniques and compare their outputs.  
This supports understanding of how preprocessing affects model performance, interpretability, and robustness.

### System State
The system consists of:
- Python 3.x
- nltk for classical NLP tools
- spaCy for modern NLP pipelines
- pandas for tabular comparison

The NLTK corpora are downloaded at runtime. The spaCy model `en_core_web_sm` is loaded with `spacy.load` and must already be installed.

### Data State
The data is a single input sentence provided as a string.  
The focus is on transformation and analysis rather than training a predictive model.

### Model State
No predictive model is trained.  
Instead, the notebook applies:
- NLTK stemmers (Porter, Lancaster, Snowball)
- NLTK WordNet lemmatizer
- NLTK POS tagger
- spaCy linguistic pipeline

### Evaluation State
Evaluation is qualitative and comparative.  
A summary table and a custom difficulty score are used to analyze differences across preprocessing methods.


In [2]:
# Define the input sentence for analysis.
# This sentence contains irregular verbs ("went", "had"), a comparative
# adjective ("better"), and a noun ("experience"), which makes it useful
# for demonstrating stemming, lemmatization, and POS tagging.

sentence = "We went again and had an even better experience!"
sentence


'We went again and had an even better experience!'

In [3]:
# Step 1: Tokenization using NLTK's word_tokenize.
# This splits the raw sentence into individual tokens (words and punctuation).

tokens = word_tokenize(sentence)

# Display the list of tokens.
tokens


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
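
The `LookupError` above is raised because `word_tokenize` in recent NLTK releases looks up the `punkt_tab` resource, which the `punkt` download does not provide. One way to guard against this is to check for each resource and download only what is missing. The following is a minimal sketch of that idea. The helper name `ensure_resources` and its `find` and `download` parameters are hypothetical; the stand-in callables exist only so the sketch runs without NLTK or network access.

```python
def ensure_resources(resources, find, download):
    """Download any resource whose lookup path cannot be found.

    resources: dict mapping a download name to its lookup path
    find:      callable that raises LookupError when a path is missing
    download:  callable that fetches a resource by name
    Returns the list of names that had to be downloaded.
    """
    fetched = []
    for name, path in resources.items():
        try:
            find(path)
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


# Stand-in callables (assumptions for illustration, not NLTK functions).
installed = {"corpora/stopwords"}

def fake_find(path):
    if path not in installed:
        raise LookupError(path)

def fake_download(name):
    print(f"downloading {name}")

needed = {
    "stopwords": "corpora/stopwords",
    "punkt_tab": "tokenizers/punkt_tab",
}
fetched = ensure_resources(needed, fake_find, fake_download)
print(fetched)
```

In the notebook, the stand-ins would be replaced by `nltk.data.find` and `lambda name: nltk.download(name, quiet=True)`, and the call would go before the first use of `word_tokenize`.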


<details>
<summary>Tokenization Interpretation</summary>

Tokenization splits the sentence into discrete units that can be processed by algorithms.  
Punctuation is preserved as separate tokens. This step is foundational for all subsequent NLP operations.

</details>
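
To make the idea of "punctuation as separate tokens" concrete, here is a simplified regular-expression tokenizer. It is a stand-in for illustration, not the algorithm behind `word_tokenize`, and the name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    # \w+(?:'\w+)? matches a word, optionally with one internal apostrophe;
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

demo_tokens = simple_tokenize("We went again and had an even better experience!")
print(demo_tokens)
```

Unlike this sketch, NLTK's tokenizer splits contractions such as "don't" into "do" and "n't", so the two approaches can disagree on other inputs.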


In [4]:
# Step 2: Stop‑word removal.
# Stop words are high‑frequency words (e.g., "and", "the") that often do not add
# significant semantic meaning for many NLP tasks.

# Load the English stop‑word list from NLTK.
stop_words = set(stopwords.words("english"))

# Filter out tokens that are stop words.
# The .lower() call ensures case‑insensitive comparison.
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]

# Display the filtered tokens.
filtered_tokens


NameError: name 'tokens' is not defined

<details>
<summary>Stop‑Word Removal Interpretation</summary>

Stop‑word removal reduces noise and dimensionality by removing common words that typically do not contribute to the core meaning.  
This can improve efficiency and sometimes model performance, especially in bag‑of‑words or frequency‑based representations.

</details>
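
The dimensionality reduction described above can be illustrated by counting distinct tokens before and after filtering. This sketch assumes a small hand-written stop list (`DEMO_STOP_WORDS`) and a hand-written token list for the notebook's sentence; the notebook itself uses NLTK's full English list.

```python
# Hypothetical mini stop list for illustration only.
DEMO_STOP_WORDS = {"we", "and", "an", "the", "a", "had", "again"}

def remove_stop_words(tokens, stop_words):
    # Lowercase each token so the comparison is case-insensitive.
    return [t for t in tokens if t.lower() not in stop_words]

demo_tokens = ["We", "went", "again", "and", "had", "an",
               "even", "better", "experience", "!"]
kept = remove_stop_words(demo_tokens, DEMO_STOP_WORDS)

# Vocabulary size = number of distinct lowercase tokens.
vocab_before = len({t.lower() for t in demo_tokens})
vocab_after = len({t.lower() for t in kept})

print(kept)
print(vocab_before, vocab_after)
```

Note that punctuation survives this filter, because "!" is not a stop word. Removing it requires a separate check such as `t.isalpha()`.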


In [5]:
# Step 3: Stemming comparison.
# Stemming reduces words to crude base forms by stripping suffixes with heuristic rules.
# Different stemmers use different rules and levels of aggressiveness.

# Initialize three common NLTK stemmers.
stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
    "Snowball": SnowballStemmer("english")
}

# For each stemmer, compute the stemmed version of each filtered token.
stem_results = {
    name: [stemmer.stem(t) for t in filtered_tokens]
    for name, stemmer in stemmers.items()
}

# Display the dictionary of stemmed outputs.
stem_results


NameError: name 'filtered_tokens' is not defined

<details>
<summary>Stemming Interpretation</summary>

Stemming reduces words to base forms using heuristic rules.  
- Porter is relatively conservative and often produces stems close to real words.  
- Lancaster is more aggressive and may produce shorter, less interpretable stems.  
- Snowball refines the Porter rules (its English stemmer is also known as Porter2) and is available for several languages.

Differences between stemmers illustrate how algorithm choice can affect feature representations in NLP models.

</details>
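
"Aggressiveness" can be given a rough number by measuring how many characters a stemmer removes on average. The sketch below assumes a hypothetical helper, `mean_chars_removed`, that accepts any stemming callable. The two toy functions are stand-ins so the example runs on its own; they are not the Porter or Lancaster algorithms.

```python
def mean_chars_removed(words, stem):
    """Average number of characters by which `stem` shortens each word."""
    if not words:
        return 0.0
    return sum(len(w) - len(stem(w)) for w in words) / len(words)


# Toy stand-in: removes only a trailing "s".
def strip_plural(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Toy stand-in: removes the first matching suffix from a longer list.
def strip_common_suffixes(word):
    for suffix in ("ences", "ence", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

demo_words = ["experiences", "walking", "walked", "cats"]
conservative = mean_chars_removed(demo_words, strip_plural)
aggressive = mean_chars_removed(demo_words, strip_common_suffixes)
print(conservative, aggressive)
```

With the notebook's objects, the same helper could be applied as `{name: mean_chars_removed(filtered_tokens, s.stem) for name, s in stemmers.items()}` to rank the three NLTK stemmers on the input sentence.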


In [6]:
# Step 4: Lemmatization comparison.
# Lemmatization reduces words to their dictionary base forms (lemmas),
# using vocabulary and morphological analysis. It is sensitive to part‑of‑speech.

# Initialize the WordNet lemmatizer.
lemmatizer = WordNetLemmatizer()

# Lemmatize tokens assuming they are nouns.
lemmas_n = [lemmatizer.lemmatize(t, pos="n") for t in filtered_tokens]

# Lemmatize tokens assuming they are verbs.
lemmas_v = [lemmatizer.lemmatize(t, pos="v") for t in filtered_tokens]

# Store results in a dictionary for clarity.
lemmatization_results = {
    "Lemma (n)": lemmas_n,
    "Lemma (v)": lemmas_v
}

# Display the lemmatization results.
lemmatization_results


NameError: name 'filtered_tokens' is not defined

<details>
<summary>Lemmatization Interpretation</summary>

Lemmatization uses dictionary knowledge and part‑of‑speech information to produce valid base forms.  
When treated as a verb, the irregular form "went" is mapped to "go". The same holds for "had" and "have", although "had" is on NLTK's stop‑word list and is therefore removed before this step.  
When treated as nouns, many words remain unchanged because they are already in their base noun form.

This demonstrates that lemmatization quality depends heavily on accurate POS information.

</details>

In [7]:
# Step 5: POS tagging using NLTK.
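# POS tagging assigns a part‑of‑speech label (e.g., noun, verb, adjective) to each token.
# Note: the tagger sees only the filtered tokens here, so it lacks the context of the
# removed stop words; tagging the full token list and filtering afterwards is usually
# more accurate.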

pos_tags = pos_tag(filtered_tokens)

# Display the list of (token, POS tag) pairs.
pos_tags


NameError: name 'filtered_tokens' is not defined

<details>
<summary>POS Tagging Interpretation</summary>

POS tags provide grammatical structure to the sentence.  
These tags can be used to improve lemmatization, filter specific word types, or engineer features for downstream models.  
For example, distinguishing verbs from nouns helps ensure that lemmatization uses the correct base form.

</details>
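
One common way to feed POS tags into lemmatization is to map each Penn Treebank tag (the tag set returned by `nltk.pos_tag`) to one of WordNet's four POS letters. The sketch below shows that mapping. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, the tags in `tagged` are written by hand, and `demo_lemmatize` is a two-entry stand-in so the code runs without WordNet.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, WordNetLemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (token, tag) pairs, choosing the POS per token.

    `lemmatize` is any callable taking (word, pos),
    for example WordNetLemmatizer().lemmatize.
    """
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Stand-in lemmatizer (assumption for illustration, not WordNet itself).
DEMO_LEMMAS = {("went", "v"): "go", ("had", "v"): "have"}

def demo_lemmatize(word, pos):
    return DEMO_LEMMAS.get((word, pos), word)

tagged = [("went", "VBD"), ("had", "VBD"), ("experience", "NN")]
pos_aware_lemmas = lemmatize_tagged(tagged, demo_lemmatize)
print(pos_aware_lemmas)
```

In the notebook, `lemmatize_tagged(pos_tags, lemmatizer.lemmatize)` would replace the separate noun-only and verb-only passes with a single POS-aware pass.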


In [8]:
# Step 6: spaCy pipeline.
# spaCy provides an integrated pipeline for tokenization, POS tagging, and lemmatization.

# Process the original sentence with spaCy.
doc = nlp(sentence)

# Extract spaCy tokens, lemmas, and POS tags.
spacy_tokens = [t.text for t in doc]
spacy_lemmas = [t.lemma_ for t in doc]
spacy_pos = [t.pos_ for t in doc]

# Display spaCy results in a dictionary for readability.
{
    "spaCy Tokens": spacy_tokens,
    "spaCy Lemmas": spacy_lemmas,
    "spaCy POS": spacy_pos
}


{'spaCy Tokens': ['We',
  'went',
  'again',
  'and',
  'had',
  'an',
  'even',
  'better',
  'experience',
  '!'],
 'spaCy Lemmas': ['we',
  'go',
  'again',
  'and',
  'have',
  'an',
  'even',
  'well',
  'experience',
  '!'],
 'spaCy POS': ['PRON',
  'VERB',
  'ADV',
  'CCONJ',
  'VERB',
  'DET',
  'ADV',
  'ADJ',
  'NOUN',
  'PUNCT']}

<details>
<summary>spaCy Interpretation</summary>

spaCy provides a modern, production‑grade NLP pipeline.  
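Its POS tagger is a statistical model trained on annotated corpora, whereas its tokenizer and English lemmatizer are rule‑based. The lemmatizer relies on the predicted POS tags, so it resolves forms such as "went" without a manual POS hint.  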
Comparing spaCy outputs with NLTK results highlights differences between classical and modern NLP tools.

</details>


In [9]:
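# Step 7: Build a summary comparison table using pandas.
# This table aligns each filtered token with its stemmed and lemmatized forms,
# as well as POS tags from NLTK and spaCy.

# Look up spaCy's lemma and POS by token text so each row matches its word.
# Slicing the spaCy lists by position would pair rows with the wrong tokens,
# because spaCy processed the full sentence, including stop words.
# If a word occurs more than once, the last occurrence's analysis is used.
spacy_by_text = {t.text: (t.lemma_, t.pos_) for t in doc}
spacy_rows = [spacy_by_text.get(w, (None, None)) for w in filtered_tokens]

df = pd.DataFrame({
    "Word": filtered_tokens,
    "Porter": stem_results["Porter"],
    "Lancaster": stem_results["Lancaster"],
    "Snowball": stem_results["Snowball"],
    "Lemma (n)": lemmas_n,
    "Lemma (v)": lemmas_v,
    "POS (NLTK)": [tag for _, tag in pos_tags],
    "spaCy Lemma": [lemma for lemma, _ in spacy_rows],
    "spaCy POS": [pos for _, pos in spacy_rows]
})

# Display the comparison table.
df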


NameError: name 'filtered_tokens' is not defined

<details>
<summary>Summary Table Interpretation</summary>

The summary table consolidates all preprocessing outputs for direct comparison.  
It shows how each algorithm transforms the same token, making it easier to reason about:

- Which stemmer is most aggressive  
- How lemmatization differs by POS assumption  
- How NLTK and spaCy POS tags align or differ  

This table supports both qualitative analysis and potential feature engineering decisions.

</details>


In [10]:
# Step 8: Compute a custom linguistic difficulty score.
# This is a heuristic metric that combines:
# - Number of filtered tokens
# - Number of unique POS tags (NLTK)
# - Number of unique values across all transformation columns

num_tokens = len(filtered_tokens)
num_unique_pos = len({tag for _, tag in pos_tags})
num_unique_values = df.nunique().sum()

difficulty_score = num_tokens + num_unique_pos + num_unique_values

difficulty_score


NameError: name 'filtered_tokens' is not defined

<details>
<summary>Difficulty Score Interpretation</summary>

The difficulty score is a composite metric designed for exploratory analysis.  
It increases with:
- More tokens (longer text)
- Greater POS diversity (more complex grammar)
- More unique transformed values (greater variation across preprocessing methods)

While not a formal linguistic metric, it provides a simple way to compare relative complexity across sentences.

</details>
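
To compare sentences, the score can be wrapped in a function. The sketch below reimplements the same sum on plain Python lists, assuming the table is passed as a dict of columns. The function signature and the two-token `demo` table are hypothetical and are not output from this notebook.

```python
def difficulty_score(columns, pos_column="POS (NLTK)"):
    """Heuristic score: token count + unique POS tags + unique values per column.

    columns: dict mapping a column name to a list with one value per token.
    Mirrors len(tokens) + unique POS + df.nunique().sum() without pandas.
    """
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same length")
    num_tokens = lengths.pop()
    num_unique_pos = len(set(columns[pos_column]))
    num_unique_values = sum(len(set(values)) for values in columns.values())
    return num_tokens + num_unique_pos + num_unique_values


# Hypothetical two-token table for illustration.
demo = {
    "Word": ["cats", "run"],
    "Stem": ["cat", "run"],
    "POS (NLTK)": ["NNS", "VBP"],
}
demo_score = difficulty_score(demo)
print(demo_score)
```

Because the POS column is also part of the table, POS diversity contributes twice: once through `num_unique_pos` and once through the per-column unique counts. This is worth keeping in mind when interpreting the score.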


## Reflection

This notebook implemented a complete NLP preprocessing workflow and compared multiple approaches using NLTK and spaCy.  
I observed that different stemmers produce noticeably different outputs, with Lancaster being the most aggressive.  
Lemmatization quality depended strongly on the assumed part‑of‑speech, reinforcing the importance of accurate POS tagging.  
The spaCy pipeline provided consistent and modern linguistic analysis, which often aligned with but sometimes differed from NLTK outputs.

The summary comparison table made it easier to interpret how each technique transformed the text.  
The custom difficulty score added a quantitative perspective, even though it is heuristic.  

This work aligns with previous feedback emphasizing:
- Clear structure and sectioning
- Inline explanations and interpretation
- Reproducible, well‑documented code
- Explicit connection between methods and their impact

Future extensions could include:
- Named entity recognition (NER)
- Dependency parsing
- Applying this preprocessing pipeline to a larger corpus and training a downstream classifier.


## References

- SP26‑CSC115‑N850: Machine Learning I, zyBooks, Chapter 2.4 (Natural Language Processing)  
- NLTK Documentation: https://www.nltk.org/  
- spaCy Documentation: https://spacy.io/  
- WordNet Lexical Database: https://wordnet.princeton.edu/  
