# Interactive Stemming and Lemmatization Workshop

Welcome to this interactive workshop on stemming and lemmatization in Natural Language Processing! This notebook will guide you through various techniques for reducing words to their base forms, from basic implementations to advanced methods using popular NLP libraries.

## Table of Contents
1. [Introduction](#introduction)
2. [Basic Implementation](#basic)
3. [Interactive Concept Explanation](#concept)
4. [Advanced Implementation](#advanced)
5. [Interactive Visualizations](#visualization)
6. [Challenges and Edge Cases](#challenges)
7. [Conclusion and Further Resources](#conclusion)

<a id='introduction'></a>
## 1. Introduction to Stemming and Lemmatization

### What are Stemming and Lemmatization?

**Stemming** reduces a word to a stem by stripping affixes, in practice mostly suffixes, using heuristic rules. The result is not guaranteed to be a valid word: the Porter stemmer, for example, turns "studies" into "studi".

**Lemmatization** reduces a word to its dictionary form (lemma) using a vocabulary and morphological analysis. The result is a valid word, but getting the right lemma usually depends on knowing the word's part of speech.
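
The contrast can be sketched with a toy suffix rule and a tiny hand-written lemma table. Both `toy_stem` and `TOY_LEMMAS` are hypothetical and purely illustrative; they are not how NLTK or spaCy work internally.

```python
# Toy illustration only: the rule and the lookup table are assumptions made
# for this example, not the behavior of any library.
TOY_LEMMAS = {"studies": "study", "mice": "mouse", "was": "be"}

def toy_stem(word):
    """Strip a trailing 'es' or 's' with no dictionary check."""
    for suffix in ("es", "s"):
        # Keep at least three letters so short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in a vocabulary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ("studies", "mice", "was"):
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemma(w)}")
```

The rule yields the non-word "studi" and cannot handle "mice" at all, while the lookup returns real words but only for entries it knows.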

### Key Differences

1. **Accuracy**
   - Stemming: Less accurate, may produce non-words
   - Lemmatization: More accurate, produces valid words

2. **Speed**
   - Stemming: Faster, rule-based
   - Lemmatization: Slower, dictionary-based

3. **Use Cases**
   - Stemming: Information retrieval, search engines
   - Lemmatization: Text analysis, machine learning

### Real-World Applications

- **Search Engines**: Finding relevant documents
- **Sentiment Analysis**: Reducing words to base forms
- **Chatbots**: Understanding user queries
- **Text Classification**: Feature extraction
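
For the search-engine case, a rough sketch of the idea: normalize both the indexed terms and the query so that different inflections land on the same index key. The helper `toy_normalize` is a hypothetical stand-in for a real stemmer, and the documents are made up for illustration.

```python
from collections import defaultdict

def toy_normalize(token):
    """Hypothetical stand-in for a real stemmer: lowercase, strip one suffix."""
    token = token.lower()
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_normalize(token)].add(doc_id)
    return index

def search(index, query):
    """Normalize the query the same way as the indexed terms."""
    return sorted(index.get(toy_normalize(query), set()))

docs = {1: "She walks to work", 2: "He walked home", 3: "They like walking"}
index = build_index(docs)
print(search(index, "walking"))  # matches all three documents
```

Without normalization, the query "walking" would match only the third document.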

In [1]:
import numpy as np
import pandas as pd
import nltk
import spacy
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import display, HTML
import re
import networkx as nx
import graphviz

# Download required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Initialize stemmers and lemmatizer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Set style for better visualizations
plt.style.use('dark_background')
sns.set_palette('husl')



<a id='basic'></a>
## 2. Basic Implementation from Scratch

Let's implement basic stemming and lemmatization from scratch to understand the fundamental concepts.

In [3]:
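def stem_word_custom(word):
    """Custom stemming implementation: strip at most one common suffix"""
    # Convert to lowercase
    word = word.lower()
    
    # Common suffixes, longest first so 'est' and 'es' are tried before 's'
    suffixes = ['ing', 'est', 'ed', 'er', 'es', 'ly', 's']
    
    # Remove only the first matching suffix and keep at least 3 letters,
    # so 'interesting' is not cut down to 'inter' and 'sing' is left alone
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    
    return word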

def lemmatize_word_custom(word):
    """Custom lemmatization implementation"""
    # Convert to lowercase
    word = word.lower()
    
    # Basic dictionary of word forms
    word_forms = {
        'running': 'run',
        'ran': 'run',
        'runs': 'run',
        'better': 'good',
        'best': 'good',
        'went': 'go',
        'going': 'go',
        'goes': 'go'
    }
    
    return word_forms.get(word, word)

# Interactive widget for custom text input
text_input = widgets.Textarea(
    value='running',
    placeholder='Type a word',
    description='Word:',
    style={'description_width': 'initial'},
    layout={'width': '80%', 'height': '100px'}
)

def update_word_forms(word):
    # Textarea input may carry stray whitespace or newlines
    word = word.strip()
    print(f"Original word: {word}")
    print(f"Stemmed (custom): {stem_word_custom(word)}")
    print(f"Lemmatized (custom): {lemmatize_word_custom(word)}")
    print(f"\nStemmed (Porter): {porter.stem(word)}")
    # lemmatize() treats the word as a noun unless a pos argument is given
    print(f"Lemmatized (NLTK): {lemmatizer.lemmatize(word)}")

interact(update_word_forms, word=text_input)

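
Plain suffix stripping turns "running" into "runn", not "run". Real stemmers add follow-up rules for cases like this. A simplified sketch of one such rule, undoubling a final consonant after removing 'ing' (the function name `strip_ing` and the length threshold are assumptions for illustration):

```python
def strip_ing(word):
    """Remove 'ing' and undo a doubled final consonant (simplified sketch)."""
    word = word.lower()
    # Leave short words such as 'sing' or 'bring' untouched
    if not word.endswith("ing") or len(word) < 6:
        return word
    stem = word[:-3]
    # Undouble 'nn', 'pp', 'tt', ... but keep 'll', 'ss', 'zz' (fall, miss, buzz)
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
        stem = stem[:-1]
    return stem

for w in ("running", "hopping", "falling", "playing", "sing"):
    print(w, "->", strip_ing(w))
```

This still gets words like "hoping" wrong (it returns "hop"), which is one reason production stemmers carry many more conditions.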

<a id='concept'></a>
## 3. Interactive Concept Explanation

Let's explore how different stemming and lemmatization techniques transform words. We'll create an interactive tool that shows the transformation process step by step.

In [4]:
def get_word_transformations(word):
    """Get all transformations of a word using different methods"""
    # spaCy returns an empty Doc for empty input, so guard the token lookup
    doc = nlp(word)
    transformations = {
        'Original': word,
        'Porter Stemmer': porter.stem(word),
        'Lancaster Stemmer': lancaster.stem(word),
        'Snowball Stemmer': snowball.stem(word),
        'NLTK Lemmatizer': lemmatizer.lemmatize(word),
        'spaCy Lemmatizer': doc[0].lemma_ if len(doc) > 0 else ''
    }
    return transformations

def visualize_transformations(word):
    """Create interactive visualization of word transformations"""
    # Get transformations
    transformations = get_word_transformations(word)
    
    # Create a bar plot
    plt.figure(figsize=(12, 6))
    methods = list(transformations.keys())
    results = list(transformations.values())
    
    # Bar heights must be numeric, so plot the length of each result
    bars = plt.bar(methods, [len(result) for result in results])
    
    # Customize the plot
    plt.title(f'Word Transformations for: "{word}"')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('Result length (characters)')
    
    # Label each bar with the transformed word itself
    for bar, result in zip(bars, results):
        plt.text(bar.get_x() + bar.get_width()/2., bar.get_height(),
                result,
                ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed transformations
    print("\nDetailed Transformations:")
    for method, result in transformations.items():
        print(f"{method}: {result}")

# Interactive widgets
word_input = widgets.Text(
    value='running',
    description='Word:',
    style={'description_width': 'initial'},
    layout={'width': '80%'}
)

method_dropdown = widgets.Dropdown(
    options=['All', 'Porter Stemmer', 'Lancaster Stemmer', 'Snowball Stemmer', 
             'NLTK Lemmatizer', 'spaCy Lemmatizer'],
    value='All',
    description='Method:',
    style={'description_width': 'initial'},
    layout={'width': '80%'}
)

def update_transformations(word, method):
    """Update the visualization based on user input"""
    if method == 'All':
        visualize_transformations(word)
    else:
        transformations = get_word_transformations(word)
        print(f"\nOriginal word: {word}")
        print(f"{method} result: {transformations[method]}")

# Create interactive widget
interact(update_transformations, word=word_input, method=method_dropdown)


### Common Patterns in Word Transformations

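1. **Stemming Patterns**
   - Removing 'ing' (playing → play; running → run also drops the doubled 'n')
   - Removing 'ed' (played → play)
   - Removing 's' (cats → cat)
   - Removing 'er' (faster → fast): done by the custom stemmer above, while Porter and Snowball leave "faster" unchanged
   - Removing 'est' (fastest → fast): done by the custom stemmer above, while Porter has no rule for it

2. **Lemmatization Patterns**
   - Mapping irregular adjectives to their base form (better → good, when tagged as an adjective)
   - Handling irregular verbs (went → go, when tagged as a verb)
   - Returning a real word with the same meaning (running → run, when tagged as a verb)
   - Leaving most adverbs unchanged (quickly → quickly): unlike the custom stemmer, a lemmatizer does not simply strip 'ly'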
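
Examples such as better → good depend on the part of speech. NLTK's `lemmatizer.lemmatize(word)` treats the word as a noun unless a `pos` argument is passed, which is why it can return irregular forms unchanged. A minimal sketch of the idea with a hypothetical mini-lexicon (`TOY_LEXICON` and `toy_lemmatize` are illustrative names, not library APIs):

```python
# Hypothetical mini-lexicon keyed by (word, part of speech); a real lemmatizer
# would consult a full resource such as WordNet instead.
TOY_LEXICON = {
    ("better", "adj"): "good",
    ("went", "verb"): "go",
    ("running", "verb"): "run",
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    word = word.lower()
    return TOY_LEXICON.get((word, pos), word)

print(toy_lemmatize("saw", "verb"))     # see
print(toy_lemmatize("saw", "noun"))     # saw
print(toy_lemmatize("better", "adj"))   # good
print(toy_lemmatize("better", "noun"))  # better (no entry for this tag)
```

The same surface form maps to different lemmas depending on the tag, so a lemmatizer without part-of-speech information has to guess.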

### Interactive Pattern Testing

Try these example words to see different patterns:
- running, ran, runs
- better, best, good
- quickly, quick, quicker
- playing, played, plays
- faster, fast, fastest

In [6]:
def test_patterns(word):
    """Test different patterns on a word"""
    # Get word forms
    forms = {
        'Original': word,
        'Porter Stem': porter.stem(word),
        'NLTK Lemma': lemmatizer.lemmatize(word),
        'spaCy Lemma': nlp(word)[0].lemma_
    }
    
    # Create a table
    table = pd.DataFrame([forms])
    table = table.T
    table.columns = ['Result']
    
    # Display the table
    display(table)
    
    # List the suffixes the word ends with. These are candidates only:
    # a stemmer may leave them in place (for example 'sing' or 'red').
    print("\nCandidate suffixes:")
    for suffix in ['ing', 'ed', 's', 'er', 'est']:
        if word.endswith(suffix):
            print(f"- Ends with '{suffix}'")

# Interactive widget for pattern testing
pattern_input = widgets.Text(
    value='running',
    description='Test Word:',
    style={'description_width': 'initial'},
    layout={'width': '80%'}
)

interact(test_patterns, word=pattern_input)


### Understanding Stemming Patterns

Let's explore different stemming algorithms and their specific patterns. We'll create an interactive tool that shows how each stemmer processes words differently.

In [7]:
def analyze_stemming_patterns(word):
    """Analyze and visualize stemming patterns for different algorithms"""
    # Get stems from different algorithms
    stems = {
        'Porter Stemmer': porter.stem(word),
        'Lancaster Stemmer': lancaster.stem(word),
        'Snowball Stemmer': snowball.stem(word)
    }
    
    # Create visualization
    plt.figure(figsize=(12, 6))
    algorithms = list(stems.keys())
    results = list(stems.values())
    
    # Bar heights must be numeric, so plot the length of each stem
    bars = plt.bar(algorithms, [len(result) for result in results])
    
    # Customize the plot
    plt.title(f'Stemming Patterns for: "{word}"')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('Stem length (characters)')
    
    # Label each bar with the stem itself
    for bar, result in zip(bars, results):
        plt.text(bar.get_x() + bar.get_width()/2., bar.get_height(),
                result,
                ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed analysis
    print("\nDetailed Stemming Analysis:")
    for algorithm, stem in stems.items():
        print(f"\n{algorithm}:")
        print(f"Original: {word}")
        print(f"Stemmed: {stem}")
        
        # Analyze patterns
        if word != stem:
            if word.endswith('ing') and stem == word[:-3]:
                print("Pattern: Removed 'ing' suffix")
            elif word.endswith('ed') and stem == word[:-2]:
                print("Pattern: Removed 'ed' suffix")
            elif word.endswith('er') and stem == word[:-2]:
                print("Pattern: Removed 'er' suffix")
            elif word.endswith('est') and stem == word[:-3]:
                print("Pattern: Removed 'est' suffix")
            elif word.endswith('s') and stem == word[:-1]:
                print("Pattern: Removed 's' suffix")
            elif word.endswith('es') and stem == word[:-2]:
                print("Pattern: Removed 'es' suffix")
            elif word.endswith('ly') and stem == word[:-2]:
                print("Pattern: Removed 'ly' suffix")
            else:
                print("Pattern: Complex transformation")

# Interactive widgets
stem_word_input = widgets.Text(
    value='running',
    description='Word:',
    style={'description_width': 'initial'},
    layout={'width': '80%'}
)

stem_type_dropdown = widgets.Dropdown(
    options=['All', 'Porter', 'Lancaster', 'Snowball'],
    value='All',
    description='Stemmer:',
    style={'description_width': 'initial'},
    layout={'width': '80%'}
)

def update_stemming_analysis(word, stemmer_type):
    """Update the stemming analysis based on user input"""
    if stemmer_type == 'All':
        analyze_stemming_patterns(word)
    else:
        if stemmer_type == 'Porter':
            stem = porter.stem(word)
            stemmer_name = 'Porter Stemmer'
        elif stemmer_type == 'Lancaster':
            stem = lancaster.stem(word)
            stemmer_name = 'Lancaster Stemmer'
        else:
            stem = snowball.stem(word)
            stemmer_name = 'Snowball Stemmer'
            
        print(f"\n{stemmer_name} Analysis:")
        print(f"Original: {word}")
        print(f"Stemmed: {stem}")
        
        # Analyze pattern
        if word != stem:
            if word.endswith('ing') and stem == word[:-3]:
                print("Pattern: Removed 'ing' suffix")
            elif word.endswith('ed') and stem == word[:-2]:
                print("Pattern: Removed 'ed' suffix")
            elif word.endswith('er') and stem == word[:-2]:
                print("Pattern: Removed 'er' suffix")
            elif word.endswith('est') and stem == word[:-3]:
                print("Pattern: Removed 'est' suffix")
            elif word.endswith('s') and stem == word[:-1]:
                print("Pattern: Removed 's' suffix")
            elif word.endswith('es') and stem == word[:-2]:
                print("Pattern: Removed 'es' suffix")
            elif word.endswith('ly') and stem == word[:-2]:
                print("Pattern: Removed 'ly' suffix")
            else:
                print("Pattern: Complex transformation")

# Create interactive widget
interact(update_stemming_analysis, word=stem_word_input, stemmer_type=stem_type_dropdown)

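
The repeated if/elif chains above can also be written as a single table-driven helper. This is one possible refactoring, not part of the original notebook; `describe_pattern` and `SUFFIXES` are hypothetical names.

```python
# Longer suffixes first so 'es' and 'est' are checked before 's'
SUFFIXES = ("ing", "est", "ed", "er", "es", "ly", "s")

def describe_pattern(word, stem):
    """Describe how `stem` was derived from `word` in terms of simple suffix removal."""
    if word == stem:
        return "No change"
    for suffix in SUFFIXES:
        if word.endswith(suffix) and stem == word[: -len(suffix)]:
            return f"Removed '{suffix}' suffix"
    # Anything else (undoubling, letter changes, ...) is lumped together
    return "Complex transformation"

print(describe_pattern("playing", "play"))  # Removed 'ing' suffix
print(describe_pattern("running", "run"))   # Complex transformation
print(describe_pattern("fast", "fast"))     # No change
```

Any stemmer's output can be passed as `stem`, for example `describe_pattern(word, porter.stem(word))`.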

### Common Stemming Patterns

1. **Porter Stemmer**
   - Most commonly used stemmer
   - Conservative approach
   - Examples:
     - running → run
     - playing → play
     - faster → faster (unchanged: the stem "fast" is too short for the rule that strips 'er')

2. **Lancaster Stemmer**
   - More aggressive than Porter
   - May produce shorter stems
   - Examples:
     - running → run
     - playing → play
     - faster → fast

3. **Snowball Stemmer**
   - The English Snowball stemmer is also known as Porter2
   - Slightly improved version of Porter
   - Examples:
     - running → run
     - playing → play
     - generously → generous (the original Porter algorithm gives "gener")
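
Porter-style rules are conditional: many fire only when the remaining stem is long enough, as judged by its "measure" m, the number of vowel-consonant sequences in the stem. Below is a simplified sketch of computing m and applying one rule from Porter's algorithm, (m > 0) EED → EE. The function names are assumptions for illustration and this is not NLTK's implementation.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m

def apply_eed_rule(word):
    """(m > 0) EED -> EE: only fires when the stem before 'eed' has m > 0."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word

print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2
print(apply_eed_rule("agreed"))  # agree
print(apply_eed_rule("feed"))    # feed (stem 'f' has m = 0)
```

Conditions like this are what keep a conservative stemmer from stripping suffixes off short words.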

### Interactive Pattern Testing

Try these example words to see different patterns:
- running, ran, runs
- playing, played, plays
- faster, fast, fastest
- quickly, quick, quicker
- better, best, good

In [9]:
def compare_stemming_patterns(word):
    """Compare stemming patterns across different algorithms"""
    # Get stems from different algorithms
    stems = {
        'Porter': porter.stem(word),
        'Lancaster': lancaster.stem(word),
        'Snowball': snowball.stem(word)
    }
    
    # Create comparison table
    comparison = pd.DataFrame([stems])
    comparison = comparison.T
    comparison.columns = ['Stemmed Result']
    
    # Display the table
    display(comparison)
    
    # Analyze differences
    print("\nPattern Analysis:")
    if len(set(stems.values())) == 1:
        print("All stemmers produced the same result")
    else:
        print("Different stemmers produced different results:")
        for stemmer, stem in stems.items():
            print(f"- {stemmer}: {stem}")
    
    # List the suffixes the word ends with. These are candidates only:
    # a stemmer may leave them in place (for example 'sing' or 'red').
    print("\nCandidate suffixes:")
    for suffix in ['ing', 'ed', 's', 'er', 'est', 'ly']:
        if word.endswith(suffix):
            print(f"- Ends with '{suffix}'")

# Interactive widget for pattern comparison
compare_input = widgets.Text(
    value='running',
    description='Compare Word:',
    style={'description_width': 'initial'},
    layout={'width': '80%'}
)

interact(compare_stemming_patterns, word=compare_input)

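
To compare stemmers over a word list rather than one word at a time, one rough measure is how many distinct forms are merged by stemming. In the sketch below, `light` and `heavy` are hypothetical toy stemmers standing in for real ones; any callable such as `porter.stem` or `lancaster.stem` could be passed instead.

```python
def compression(words, stem_fn):
    """Fraction of distinct words merged away by stemming (0 means none merged)."""
    unique_words = set(words)
    if not unique_words:
        return 0.0
    unique_stems = {stem_fn(w) for w in unique_words}
    return 1 - len(unique_stems) / len(unique_words)

def light(word):
    """Toy conservative stemmer: strips only a plural-like 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def heavy(word):
    """Toy aggressive stemmer: strips 'ing', 'ed' or 's'."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["play", "plays", "played", "playing", "fast", "fasts"]
print(f"light: {compression(words, light):.2f}")  # 0.33
print(f"heavy: {compression(words, heavy):.2f}")  # 0.67
```

A higher value means more aggressive conflation, which helps recall in search but raises the risk of merging unrelated words.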