<a id='advanced'></a>
## 4. Advanced Implementation with Libraries

Let's explore advanced implementations using popular NLP libraries. We'll compare different stemming and lemmatization techniques and analyze their performance on various types of text.

In [1]:
import numpy as np
import pandas as pd
import nltk
import spacy
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import display, HTML
import re
import networkx as nx
import graphviz

# Download required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Initialize stemmers and lemmatizer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Set style for better visualizations
plt.style.use('dark_background')
sns.set_palette('husl')

In [2]:
def advanced_text_processing(text):
    """Advanced text processing with multiple stemming and lemmatization methods"""
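    # Split text into words on whitespace (punctuation stays attached to tokens)
    words = text.split()
    if not words:
        # An empty text box would otherwise break the statistics and plots below
        print("Enter some text to process.")
        return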
    
    # Process each word with different methods
    results = {
        'Original': words,
        'Porter Stemmer': [porter.stem(word) for word in words],
        'Lancaster Stemmer': [lancaster.stem(word) for word in words],
        'Snowball Stemmer': [snowball.stem(word) for word in words],
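        # lemmatize() defaults to pos='n', so verbs and adjectives often come back unchanged
        'NLTK Lemmatizer': [lemmatizer.lemmatize(word) for word in words],
        # Each word is passed to spaCy on its own, so the tagger sees no sentence context
        'spaCy Lemmatizer': [nlp(word)[0].lemma_ for word in words]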
    }
    
    # Create DataFrame for comparison
    df = pd.DataFrame(results)
    
    # Display results
    print("Text Processing Results:")
    display(df)
    
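    # Calculate statistics: compare vocabulary sizes (unique forms) before and after
    print("\nStatistics:")
    original_unique = len(set(words))
    for method, processed_words in results.items():
        if method != 'Original':
            unique_words = len(set(processed_words))
            # Measure reduction against the unique original words; guard against empty input
            if original_unique:
                reduction = (original_unique - unique_words) / original_unique * 100
            else:
                reduction = 0.0
            print(f"\n{method}:")
            print(f"- Unique words: {unique_words}")
            print(f"- Reduction: {reduction:.2f}%")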
    
    # Visualize word frequency
    plt.figure(figsize=(15, 6))
    
    # Plot original vs processed word frequencies
    plt.subplot(1, 2, 1)
    original_freq = pd.Series(words).value_counts().head(10)
    original_freq.plot(kind='bar')
    plt.title('Top 10 Original Words')
    plt.xticks(rotation=45)
    
    # Plot word frequencies after NLTK lemmatization (the other methods are not plotted)
    plt.subplot(1, 2, 2)
    processed_freq = pd.Series(results['NLTK Lemmatizer']).value_counts().head(10)
    processed_freq.plot(kind='bar')
    plt.title('Top 10 Words After NLTK Lemmatization')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()

# Interactive widgets
text_input = widgets.Textarea(
    value='The quick brown foxes are running faster than the lazy dogs. They are playing in the garden.',
    placeholder='Enter text to process...',
    description='Text:',
    style={'description_width': 'initial'},
    layout={'width': '80%', 'height': '100px'}
)

method_dropdown = widgets.Dropdown(
    options=['All Methods', 'Stemming Only', 'Lemmatization Only'],
    value='All Methods',
    description='Processing:',
    style={'description_width': 'initial'},
    layout={'width': '80%'}
)

def update_advanced_processing(text, method):
    """Update the advanced processing based on user input"""
    if method == 'All Methods':
        advanced_text_processing(text)
    elif method == 'Stemming Only':
        # Process with stemmers only
        words = text.split()
        results = {
            'Original': words,
            'Porter Stemmer': [porter.stem(word) for word in words],
            'Lancaster Stemmer': [lancaster.stem(word) for word in words],
            'Snowball Stemmer': [snowball.stem(word) for word in words]
        }
        display(pd.DataFrame(results))
    else:
        # Process with lemmatizers only
        words = text.split()
        results = {
            'Original': words,
            'NLTK Lemmatizer': [lemmatizer.lemmatize(word) for word in words],
            'spaCy Lemmatizer': [nlp(word)[0].lemma_ for word in words]
        }
        display(pd.DataFrame(results))

# Create interactive widget
interact(update_advanced_processing, text=text_input, method=method_dropdown)

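One limitation of the cell above is that `lemmatizer.lemmatize(word)` treats every word as a noun by default, so verb forms such as "running" tend to come back unchanged. A minimal sketch of one common workaround follows: map Penn Treebank tags (the kind `nltk.pos_tag` returns) to WordNet's single-letter part-of-speech codes and pass them to the lemmatizer. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the lemmatizer is passed in as a callable so the sketch runs without any NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # noun, and the fallback for every other tag


def lemmatize_tagged(tagged_words, lemmatize):
    """Lemmatize (word, tag) pairs with any callable of the form lemmatize(word, pos)."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_words]


# Stand-in lemmatizer (an assumption for illustration) that only records the POS it received.
def show_pos(word, pos):
    return f"{word}/{pos}"


tagged = [('foxes', 'NNS'), ('running', 'VBG'), ('faster', 'RBR'), ('lazy', 'JJ')]
print(lemmatize_tagged(tagged, show_pos))
```

With the objects defined earlier in this notebook, the equivalent call would be `lemmatize_tagged(nltk.pos_tag(words), lemmatizer.lemmatize)`, which relies on the `averaged_perceptron_tagger` data downloaded in the setup cell.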

### Performance Comparison

Let's compare the runtime and vocabulary reduction of each method. The default sample below is only a few sentences long, so paste a larger text for more meaningful timings.

In [3]:
def analyze_performance(text):
    """Analyze performance of different stemming and lemmatization methods"""
    import time
    
    # Split text into words
    words = text.split()
    
    # Define methods to test (bound methods are passed directly; no lambda wrapper needed)
    methods = {
        'Porter Stemmer': porter.stem,
        'Lancaster Stemmer': lancaster.stem,
        'Snowball Stemmer': snowball.stem,
        'NLTK Lemmatizer': lemmatizer.lemmatize,
        # spaCy runs its whole pipeline on each isolated word, so its timing covers more than lemmatization
        'spaCy Lemmatizer': lambda w: nlp(w)[0].lemma_
    }
    
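    # Measure performance
    results = {}
    original_unique = len(set(words))
    for method_name, method in methods.items():
        # perf_counter() is a monotonic, high-resolution clock suited to short timings
        start_time = time.perf_counter()
        processed_words = [method(word) for word in words]
        end_time = time.perf_counter()
        
        unique_words = len(set(processed_words))
        results[method_name] = {
            'Time (s)': end_time - start_time,
            'Unique Words': unique_words,
            # Reduction relative to the unique original words; 0 for empty input
            'Reduction (%)': (original_unique - unique_words) / original_unique * 100 if original_unique else 0.0
        }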
    
    # Create performance DataFrame
    performance_df = pd.DataFrame(results).T
    
    # Display results
    print("Performance Analysis:")
    display(performance_df)
    
    # Visualize performance
    plt.figure(figsize=(12, 6))
    performance_df['Time (s)'].plot(kind='bar')
    plt.title('Processing Time by Method')
    plt.xticks(rotation=45)
    plt.ylabel('Time (seconds)')
    plt.tight_layout()
    plt.show()

# Interactive widget for performance analysis
performance_input = widgets.Textarea(
    value='The quick brown foxes are running faster than the lazy dogs. They are playing in the garden. The weather is nice today.',
    placeholder='Enter text for performance analysis...',
    description='Text:',
    style={'description_width': 'initial'},
    layout={'width': '80%', 'height': '100px'}
)

interact(analyze_performance, text=performance_input)

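A single start/stop measurement on a short text is easily dominated by noise and by one-time warm-up costs. One hedged way to get steadier numbers is to repeat the run several times and keep the fastest. In the sketch below, `best_time` is a hypothetical helper, and `str.lower` merely stands in for a stemmer or lemmatizer so the example runs on its own.

```python
import time


def best_time(func, words, repeats=5):
    """Return the fastest of `repeats` timings (in seconds) of func applied to every word."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            func(word)
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine
    return min(timings)


sample = "The quick brown foxes are running faster than the lazy dogs".split()
# str.lower is a placeholder; in the notebook, pass porter.stem, lemmatizer.lemmatize, etc.
elapsed = best_time(str.lower, sample * 1000)
print(f"best of 5 runs: {elapsed:.6f} s")
```

Repeating the input (`sample * 1000`) lengthens each run so that the measured time is well above the clock's resolution.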

### Best Practices and Recommendations

1. **When to Use Stemming**
   - Information retrieval systems
   - Search engines
   - Document clustering
   - When speed is more important than accuracy

2. **When to Use Lemmatization**
   - Text analysis
   - Machine learning models
   - Sentiment analysis
   - When accuracy is more important than speed

3. **Method Selection Guidelines**
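   - Porter Stemmer: General-purpose default with moderate aggressiveness
   - Lancaster Stemmer: When more aggressive stemming is needed (stems are often shorter and harder to read)
   - Snowball Stemmer: A refinement of the Porter algorithm (also known as Porter2) that also supports other languages
   - NLTK Lemmatizer: When WordNet dictionary-based accuracy is needed; results improve when a part-of-speech tag is supplied
   - spaCy Lemmatizer: When you are already working within a spaCy pipeline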
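To apply these guidelines to your own data, it can help to measure how much each candidate method shrinks the vocabulary. The following is only an illustrative sketch: `vocabulary_reduction` is a hypothetical helper, and `strip_plural` is a deliberately crude toy normalizer, not a real stemmer.

```python
def vocabulary_reduction(words, normalize):
    """Percentage drop in the number of unique word forms after applying normalize."""
    original = set(words)
    if not original:
        return 0.0  # nothing to reduce for empty input
    processed = {normalize(word) for word in words}
    return (len(original) - len(processed)) / len(original) * 100


def strip_plural(word):
    """Toy normalizer: lowercase and drop one trailing 's' (illustration only)."""
    word = word.lower()
    return word[:-1] if word.endswith('s') and len(word) > 3 else word


words = "dog dogs Dog cat cats bird".split()
print(f"{vocabulary_reduction(words, strip_plural):.1f}%")  # prints 50.0%
```

In the notebook, `normalize` could be `porter.stem`, `lancaster.stem`, or `lemmatizer.lemmatize`; a higher percentage means more forms were merged, which is not necessarily more correct.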

<a id='visualization'></a>
## 5. Interactive Visualizations

Let's create visual representations of the stemming and lemmatization processes. We'll show how words transform and how different methods affect the text.

In [4]:
def create_word_transformation_visualization(text):
    """Create interactive visualization of word transformations"""
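    # Process text with different methods
    words = text.split()
    if not words:
        # Nothing to draw for an empty text box
        print("Enter some text to visualize.")
        return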
    processed_words = {
        'Original': words,
        'Porter Stemmer': [porter.stem(w) for w in words],
        'NLTK Lemmatizer': [lemmatizer.lemmatize(w) for w in words]
    }
    
    # Create transformation network
    G = nx.DiGraph()
    
    # Add nodes and edges; negative y-coordinates keep the first word at the top of the plot
    for i, word in enumerate(words):
        G.add_node(f'Original_{i}', word=word, pos=(0, -i))
        G.add_node(f'Porter_{i}', word=processed_words['Porter Stemmer'][i], pos=(1, -i))
        G.add_node(f'Lemma_{i}', word=processed_words['NLTK Lemmatizer'][i], pos=(2, -i))
        
        G.add_edge(f'Original_{i}', f'Porter_{i}')
        G.add_edge(f'Original_{i}', f'Lemma_{i}')
    
    # Create visualization
    plt.figure(figsize=(15, 8))
    pos = nx.get_node_attributes(G, 'pos')
    labels = nx.get_node_attributes(G, 'word')
    
    # Draw the graph
    nx.draw(G, pos, labels=labels, with_labels=True, 
            node_color='lightblue', node_size=2000,
            arrowsize=20, font_size=8, font_weight='bold')
    
    plt.title('Word Transformation Network')
    plt.show()
    
    # Create word frequency visualization
    plt.figure(figsize=(12, 6))
    
    # Plot original vs processed word frequencies
    original_freq = pd.Series(words).value_counts().head(10)
    porter_freq = pd.Series(processed_words['Porter Stemmer']).value_counts().head(10)
    lemma_freq = pd.Series(processed_words['NLTK Lemmatizer']).value_counts().head(10)
    
    plt.subplot(1, 3, 1)
    original_freq.plot(kind='bar')
    plt.title('Original Words')
    plt.xticks(rotation=45)
    
    plt.subplot(1, 3, 2)
    porter_freq.plot(kind='bar')
    plt.title('Porter Stemmed')
    plt.xticks(rotation=45)
    
    plt.subplot(1, 3, 3)
    lemma_freq.plot(kind='bar')
    plt.title('Lemmatized')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()

# Interactive widget for visualization
viz_text_input = widgets.Textarea(
    value='The quick brown foxes are running faster than the lazy dogs.',
    placeholder='Enter text for visualization...',
    description='Text:',
    style={'description_width': 'initial'},
    layout={'width': '80%', 'height': '100px'}
)

interact(create_word_transformation_visualization, text=viz_text_input)

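A textual complement to the network plot is to group the original words by the form they collapse to, which makes merges (and accidental over-stemming) easy to spot. In this sketch, `group_by_form` is a hypothetical helper, and `str.lower` is only a placeholder for `porter.stem` or `lemmatizer.lemmatize`.

```python
from collections import defaultdict


def group_by_form(words, normalize):
    """Map each normalized form to the sorted list of distinct original words producing it."""
    groups = defaultdict(set)
    for word in words:
        groups[normalize(word)].add(word)
    return {form: sorted(originals) for form, originals in groups.items()}


words = "The the Dogs dogs fox".split()
for form, originals in group_by_form(words, str.lower).items():
    print(f"{form}: {originals}")
```

Forms that list more than one original word are exactly the places where the method changed the vocabulary.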

<a id='conclusion'></a>
## 7. Conclusion and Further Resources

### Key Takeaways

1. **Stemming vs Lemmatization**
   - Stemming is faster but can produce non-words (Porter turns "studies" into "studi")
   - Lemmatization is slower but returns valid dictionary forms
   - Choose based on your specific needs

2. **Different Algorithms**
   - Porter Stemmer: Balanced approach
   - Lancaster Stemmer: More aggressive
   - Snowball Stemmer: Improved Porter
   - NLTK Lemmatizer: Dictionary-based
   - spaCy Lemmatizer: Pipeline integration

3. **Best Practices**
   - Use stemming for information retrieval
   - Use lemmatization for text analysis
   - Consider performance vs accuracy trade-offs
   - Handle edge cases appropriately
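One edge case visible in this notebook is that `text.split()` leaves punctuation attached, so "dogs." and "dogs" are counted as different words. A minimal sketch of a cleanup step follows, assuming English text; the regular expression and the helper name `clean_tokens` are illustrative choices, not a standard tokenizer.

```python
import re

# Runs of letters, optionally joined by internal apostrophes or hyphens ("don't", "well-known")
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def clean_tokens(text):
    """Lowercase the text and return alphabetic tokens without surrounding punctuation."""
    return TOKEN_PATTERN.findall(text.lower())


print(clean_tokens("The lazy dogs. They're well-known, aren't they?"))
```

Note that this pattern silently drops digits and non-ASCII letters, which may or may not be what a given application wants.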

### Further Reading

- [NLTK Documentation](https://www.nltk.org/)
- [spaCy Documentation](https://spacy.io/)
- [Natural Language Processing with Python](https://www.nltk.org/book/)
- [Stemming and Lemmatization in Python](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)

### Practice Exercises

1. **Basic Exercises**
   - Try different stemming algorithms on the same word
   - Compare stemming vs lemmatization results
   - Identify patterns in word transformations

2. **Advanced Exercises**
   - Build a custom stemmer for specific domains
   - Implement hybrid approaches
   - Handle domain-specific edge cases

3. **Real-World Applications**
   - Apply to sentiment analysis
   - Use in search engine development
   - Implement in chatbot systems
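As a possible starting point for the advanced exercises, a hybrid normalizer could consult a small domain-specific exception table first, then a lemmatizer, and fall back to a stemmer only when the lemmatizer leaves the word unchanged. Everything in this sketch is hypothetical: `make_hybrid`, the exception table, and the toy stand-ins for the lemmatizer and stemmer.

```python
def make_hybrid(lemmatize, stem, exceptions=None):
    """Build a normalizer: exception table first, then lemmatizer, then stemmer as fallback."""
    exceptions = dict(exceptions or {})

    def normalize(word):
        key = word.lower()
        if key in exceptions:
            return exceptions[key]
        lemma = lemmatize(key)
        # An unchanged word may mean the lemmatizer did not recognise it, so try the stemmer
        return lemma if lemma != key else stem(key)

    return normalize


# Toy stand-ins (assumptions for illustration); in the notebook, pass
# lemmatizer.lemmatize and porter.stem instead.
TOY_LEMMAS = {'mice': 'mouse', 'geese': 'goose'}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


def toy_stem(word):
    return word[:-3] if word.endswith('ing') else word


hybrid = make_hybrid(toy_lemmatize, toy_stem, exceptions={'mrna': 'mRNA'})
print([hybrid(w) for w in ['Mice', 'sequencing', 'mRNA', 'cell']])
```

The fallback rule is a design trade-off: words that are already in base form are also handed to the stemmer, so a real implementation might check dictionary membership instead of comparing input and output.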

### Next Steps

1. Explore more advanced NLP concepts:
   - Part-of-speech tagging
   - Named entity recognition
   - Dependency parsing

2. Learn about modern approaches:
   - BERT-based tokenization
   - Contextual embeddings
   - Transformer models

3. Practice with real-world datasets:
   - News articles
   - Social media posts
   - Customer reviews