* Objective: Apply Tokenization, Stemming, and Lemmatization to a sample text.
* Tools: Python (Jupyter/VS Code), NLTK (or spaCy).
* Steps: Define the sample text, import necessary tools, and print the output of each step.

In [3]:
sample_text = "Natural Language Processing (NLP) is a fascinating field! Computers are learning to understand human languages, which presents many exciting and challenging problems. We are discussing tokenizing, stemming, and lemmatizing."

In [2]:
# 1. Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# 2. Download NLTK resources (only needs to be run once)
print("Downloading NLTK resources...")
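# 'punkt' for tokenization, 'stopwords' for the list, 'wordnet' for lemmatization
# Note: some newer NLTK releases may also ask for 'punkt_tab'; if word_tokenize
# raises a LookupError, download the resource named in the error message.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')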
print("Setup complete.")

Downloading NLTK resources...
Setup complete.


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/edwardlance/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/edwardlance/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/edwardlance/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Step 2: Tokenization
Break the raw text into a list of individual words or tokens.

In [4]:
# --- Step 2: Tokenization ---
print("## Step 2: Tokenization ##")

# Use word_tokenize to split the sample text
tokens = word_tokenize(sample_text)

print(f"\nOriginal Text: {sample_text}")
print("-" * 50)
print(f"Total Tokens: {len(tokens)}")
print(f"First 10 Tokens: {tokens[:10]}")

## Step 2: Tokenization ##

Original Text: Natural Language Processing (NLP) is a fascinating field! Computers are learning to understand human languages, which presents many exciting and challenging problems. We are discussing tokenizing, stemming, and lemmatizing.
--------------------------------------------------
Total Tokens: 37
First 10 Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field']
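
To make the splitting rule concrete, here is a minimal sketch of a regex-based tokenizer that happens to give the same tokens as the output above on this sample. It is not NLTK's algorithm, which also handles contractions, abbreviations, and other cases this pattern ignores. The helper name `simple_tokenize` is hypothetical and exists only for illustration.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration only; not NLTK's tokenizer.
    """
    # \w+      -> one or more letters, digits, or underscores (a "word")
    # [^\w\s]  -> any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Natural Language Processing (NLP) is a fascinating field!"
tokens_sketch = simple_tokenize(sample)
print(tokens_sketch)
# ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', '!']
```

Punctuation marks such as "(" and "!" become tokens of their own, which is why the token count is higher than the word count.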


Step 3: Stopword Removal
Filter out very common words (for example "is", "a", "and") that add little meaning on their own.

In [None]:
# --- Step 3: Stopword Removal ---
print("\n## Step 3: Stopword Removal ##")

# 1. Get the list of English stopwords and convert to a set for fast lookup
stop_words = set(stopwords.words("english"))
print(f"Stop words count: {len(stop_words)}")
#print(stop_words)

# 2. Filter the tokens
# We'll convert tokens to lowercase and check if they are alphanumeric
filtered_tokens = [
    word.lower() for word in tokens if word.lower() not in stop_words and word.isalnum()
]

print(f"\nStopwords List Snippet: {list(stop_words)[:5]}")
print("-" * 50)
print(f"Original Token Count: {len(tokens)}")
print(f"Filtered Token Count: {len(filtered_tokens)}")
print(f"Filtered Tokens: {filtered_tokens}")


## Step 3: Stopword Removal ##
Stop words count: 198
{'against', 'didn', 'wouldn', 'isn', 'hasn', 'too', "couldn't", 'out', "you've", 'was', "i'd", 'shouldn', "you'll", 'an', 'ma', "they'll", 'into', 'm', 'most', 'with', 'not', 'her', 'below', 'after', 'won', "we're", 'theirs', 'a', 'having', "didn't", 'we', 'to', "wasn't", 'off', 'both', "mightn't", 's', 'myself', "he's", 'his', 'through', 'but', "they've", 'she', 'those', 'during', 'should', 'then', 'haven', 'above', 'nor', "weren't", 'shan', 'up', 'it', 't', 'will', "hadn't", 'what', "won't", "i'll", "i'm", 'itself', "doesn't", "she's", 've', 'and', 'their', 'whom', 'had', 'o', 'yourselves', 'some', "he'll", 'that', 'between', "wouldn't", "you'd", "we'll", 'down', 'at', "we've", "she'll", 'which', 'over', 'same', 'he', "hasn't", 'have', 'can', 'no', 'them', 'than', 'so', 'or', 'ours', 'herself', 'i', 'be', 'these', 'why', 'now', 'your', 'its', 'aren', 'if', 'own', 'there', "they'd", 'each', "it'll", 'under', 'you', "it'd", 'more', 
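
The filtering logic can be tried without any downloads. The sketch below applies the same three checks as the cell above (lowercase, drop non-alphanumeric tokens, drop stopwords) to the first sentence of the sample. The tiny `demo_stop_words` set and the `remove_stopwords` helper are assumptions made for illustration; NLTK's English list is far larger.

```python
# Tiny hand-picked stopword set for illustration only (an assumption);
# NLTK's English list contains many more entries.
demo_stop_words = {"is", "a", "are", "to", "and", "which", "we"}

demo_tokens = ["Natural", "Language", "Processing", "(", "NLP", ")",
               "is", "a", "fascinating", "field", "!"]


def remove_stopwords(tokens, stop_words):
    """Lowercase tokens and drop punctuation and stopwords (hypothetical helper)."""
    kept = []
    for tok in tokens:
        if not tok.isalnum():
            continue  # skip punctuation such as "(" or "!"
        lowered = tok.lower()
        if lowered in stop_words:
            continue  # skip common function words
        kept.append(lowered)
    return kept


demo_filtered = remove_stopwords(demo_tokens, demo_stop_words)
print(demo_filtered)
# ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field']
```

Storing the stopwords in a set keeps each membership test at roughly constant time, which matters once the token list grows beyond a toy example.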

Step 4 & 5: Stemming vs. Lemmatization
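Compare the two approaches to reducing a word to its base form on a few examples: stemming strips suffixes by rule, while lemmatization looks the word up in a dictionary (WordNet) using a part-of-speech hint.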

In [7]:
# --- Step 4 & 5: Stemming vs. Lemmatization ---
print("\n## Step 4 & 5: Stemming vs. Lemmatization Comparison ##")

# Initialize the tools
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define a list of interesting words to compare
words_to_analyze = [
    "learning",
    "exciting",
    "challenging",
    "processing",
    "computers",
    "better",
    "feet",
]

# Create a comparison table
print(f"\n{'Word':<15} | {'Stemmed':<15} | {'Lemmatized (Verb/Adj)':<20}")
print("-" * 55)

for word in words_to_analyze:
    stemmed = stemmer.stem(word)
    # WordNetLemmatizer takes a part-of-speech hint: use adjective ('a') for
    # 'better' and verb ('v') for everything else. Nouns such as 'computers'
    # and 'feet' are looked up as verbs here, so they come back unchanged.
    pos = "a" if word == "better" else "v"
    lemmatized = lemmatizer.lemmatize(word, pos=pos)

    print(f"{word:<15} | {stemmed:<15} | {lemmatized:<20}")

print(
    "\n**Insight:** Stemming often produces non-words (e.g., 'comput', 'excit'). Lemmatization returns dictionary words ('excite', 'good'), but only when the POS tag fits: with pos='v', the nouns 'computers' and 'feet' come back unchanged."
)


## Step 4 & 5: Stemming vs. Lemmatization Comparison ##

Word            | Stemmed         | Lemmatized (Verb/Adj)
-------------------------------------------------------
learning        | learn           | learn
exciting        | excit           | excite
challenging     | challeng        | challenge
processing      | process         | process
computers       | comput          | computers
better          | better          | good
feet            | feet            | feet

**Insight:** Stemming often produces non-words (e.g., 'comput', 'excit'). Lemmatization returns dictionary words ('excite', 'good'), but only when the POS tag fits: with pos='v', the nouns 'computers' and 'feet' come back unchanged.
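
In the table above, 'computers' and 'feet' are unchanged in the lemmatized column because they were looked up as verbs. The sketch below illustrates why the part-of-speech hint matters, using a toy dictionary keyed by (word, POS). `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins, not WordNet or NLTK code; like WordNetLemmatizer, the toy version returns the input unchanged when it finds no entry.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
# A real lemmatizer such as WordNetLemmatizer consults WordNet instead.
TOY_LEMMAS = {
    ("learning", "v"): "learn",
    ("computers", "n"): "computer",
    ("feet", "n"): "foot",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


for word, pos in [("feet", "v"), ("feet", "n"), ("better", "v"), ("better", "a")]:
    print(f"{word:<10} pos={pos} -> {toy_lemmatize(word, pos)}")
# feet       pos=v -> feet
# feet       pos=n -> foot
# better     pos=v -> better
# better     pos=a -> good
```

With NLTK itself, the corresponding change would be to pass `pos="n"` for nouns, for example `lemmatizer.lemmatize("feet", pos="n")`, which should return 'foot'.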
