# Lecture 67: Introduction to NLP

This notebook introduces **Natural Language Processing (NLP)** with a focus on fundamental **text preprocessing** techniques: **tokenization**, **stemming**, and **lemmatization**. These steps are essential for preparing text data for NLP tasks such as sentiment analysis, text classification, and machine translation. The notebook covers:

- Overview of NLP and preprocessing
- Tokenization: Splitting text into words or sentences
- Stemming: Reducing words to their root form
- Lemmatization: Converting words to their dictionary form
- Practical implementation using NLTK
- Visualizing preprocessing results

We'll use a small sample dataset to demonstrate these techniques.

## Setup and Imports

Let's import the necessary libraries and download required NLTK data. We'll use NLTK for tokenization, stemming, and lemmatization.

In [11]:
import nltk
import re
import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
# Seed NumPy's RNG; nothing in this notebook is random, so this only matters if you extend it
import numpy as np
np.random.seed(42)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


## Overview of Text Preprocessing

Text preprocessing transforms raw text into a format suitable for NLP tasks. Key techniques include:

- **Tokenization**: Splitting text into smaller units (e.g., words or sentences).
- **Stemming**: Reducing words to their root or base form by removing suffixes (e.g., 'running' → 'run').
- **Lemmatization**: Converting words to their dictionary form (lemma) using their part of speech (e.g., 'better' → 'good' when tagged as an adjective).

These steps standardize text and reduce vocabulary size, which can improve model performance.
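To make the vocabulary-size point concrete, here is a minimal sketch using only the standard library. `simple_tokenize` and `toy_stem` are hypothetical helpers written for this illustration. `toy_stem` strips a single suffix and is a crude stand-in for a real stemmer such as NLTK's `PorterStemmer`, not an implementation of it.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(token):
    """Strip one common suffix (assumption: a toy rule set, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as 'is' stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

corpus = "The cat walks. The cats walked. A cat is walking."
tokens = simple_tokenize(corpus)

raw_vocab = set(tokens)                       # distinct surface forms
stemmed_vocab = {toy_stem(t) for t in tokens}  # distinct forms after normalization

print(len(raw_vocab), sorted(raw_vocab))          # 8 distinct tokens
print(len(stemmed_vocab), sorted(stemmed_vocab))  # 5 distinct tokens
```

Collapsing 'walks', 'walked', and 'walking' into one form is what shrinks the vocabulary; a real stemmer or lemmatizer does the same with far more careful rules.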

## Sample Dataset

We'll use a small set of sample sentences to demonstrate the preprocessing techniques. The sentences mix plurals, verb forms, adverbs, and a comparative to highlight the effects of tokenization, stemming, and lemmatization.

In [12]:
# Sample texts
texts = [
    "The cats are running quickly in the gardens!",
    "She was happily studying complex algorithms.",
    "Better solutions improve our lives daily.",
    "They have been working on innovative projects."
]

# Display sample texts
print("Sample Texts:")
for i, text in enumerate(texts, 1):
    print(f"Text {i}: {text}")

Sample Texts:
Text 1: The cats are running quickly in the gardens!
Text 2: She was happily studying complex algorithms.
Text 3: Better solutions improve our lives daily.
Text 4: They have been working on innovative projects.


## Text Preprocessing Function

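We'll create a function that cleans the text, tokenizes it, and then stems and lemmatizes each token. Because `WordNetLemmatizer` treats every word as a noun unless told otherwise, a helper looks up each token's part-of-speech (POS) tag and passes it to the lemmatizer. Note that the helper tags each token in isolation, so the tag reflects the word's most likely POS rather than its role in the sentence.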

In [13]:
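def get_wordnet_pos(word):
    """Return the WordNet POS constant for a single word's POS tag."""
    # The word is tagged on its own, so the tagger sees no sentence context.
    # Take the first letter of the Penn Treebank tag (e.g. 'VBG' -> 'V').
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV
    }
    # Fall back to noun, which is also lemmatize()'s default POS.
    return tag_dict.get(tag, wordnet.NOUN)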

def preprocess_text(text):
    """Perform text preprocessing: cleaning, tokenization, stemming, and lemmatization"""
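    # 1. Text Cleaning: lowercase, then drop everything except letters and whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # 2. Tokenization (punctuation is already gone, so sent_tokenize
    #    has no sentence boundaries left to find)
    sentences = sent_tokenize(text)
    tokens = word_tokenize(text)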

    # 3. Stemming
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(token) for token in tokens]

    # 4. Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]

    return {
        'original': text,  # the cleaned, lowercased text, not the raw input
        'sentences': sentences,
        'tokens': tokens,
        'stemmed': stemmed_words,
        'lemmatized': lemmatized_words
    }
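Since `get_wordnet_pos` tags one word at a time, a possible variation is to tag the whole token list in a single `nltk.pos_tag` call so the tagger can use neighboring words. The sketch below is an assumption about how that could look, not part of the lecture code. `penn_to_wordnet` and `lemmatize_in_context` are hypothetical names, and the letters `'a'`, `'v'`, `'r'`, `'n'` are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.NOUN`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    # Only the first letter matters: J* -> adjective, V* -> verb, R* -> adverb.
    # Everything else falls back to noun, matching get_wordnet_pos above.
    return {"J": "a", "V": "v", "R": "r"}.get(tag[:1].upper(), "n")

def lemmatize_in_context(tokens):
    """Tag the full token list once, then lemmatize each token with its tag."""
    # Imported here so penn_to_wordnet can be used without NLTK data installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)  # one call, so the tagger sees the neighbors
    return [lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]

# The mapping itself needs no downloads:
print(penn_to_wordnet("VBG"), penn_to_wordnet("JJR"), penn_to_wordnet("NNS"))
```

Calling `lemmatize_in_context(word_tokenize(text))` requires the same NLTK downloads as the setup cell, and its output depends on the tagger, so it may differ from the per-word results shown in this notebook.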

## Applying Preprocessing

We'll apply the preprocessing function to each sample text and store the results for analysis.

In [14]:
# Process all texts
results = [preprocess_text(text) for text in texts]

# Display results
print("Preprocessing Results:")
for i, result in enumerate(results, 1):
    print(f"\nText {i}:")
    print(f"Original: {result['original']}")
    print(f"Sentences: {result['sentences']}")
    print(f"Tokens: {result['tokens']}")
    print(f"Stemmed: {result['stemmed']}")
    print(f"Lemmatized: {result['lemmatized']}")

Preprocessing Results:

Text 1:
Original: the cats are running quickly in the gardens
Sentences: ['the cats are running quickly in the gardens']
Tokens: ['the', 'cats', 'are', 'running', 'quickly', 'in', 'the', 'gardens']
Stemmed: ['the', 'cat', 'are', 'run', 'quickli', 'in', 'the', 'garden']
Lemmatized: ['the', 'cat', 'be', 'run', 'quickly', 'in', 'the', 'garden']

Text 2:
Original: she was happily studying complex algorithms
Sentences: ['she was happily studying complex algorithms']
Tokens: ['she', 'was', 'happily', 'studying', 'complex', 'algorithms']
Stemmed: ['she', 'wa', 'happili', 'studi', 'complex', 'algorithm']
Lemmatized: ['she', 'be', 'happily', 'study', 'complex', 'algorithm']

Text 3:
Original: better solutions improve our lives daily
Sentences: ['better solutions improve our lives daily']
Tokens: ['better', 'solutions', 'improve', 'our', 'lives', 'daily']
Stemmed: ['better', 'solut', 'improv', 'our', 'live', 'daili']
Lemmatized: ['well', 'solution', 'improve', 'our', 'lif

## Visualizing Preprocessing Results

To better understand the differences, we'll create a table comparing the original tokens, stemmed words, and lemmatized words for one of the texts.

In [15]:
# Select the first text for visualization
sample_result = results[0]

# Create a DataFrame for comparison
comparison_df = pd.DataFrame({
    'Token': sample_result['tokens'],
    'Stemmed': sample_result['stemmed'],
    'Lemmatized': sample_result['lemmatized']
})

print(f"\nComparison for Text 1: {texts[0]}")
print(comparison_df)

# Save the comparison table to a file
comparison_df.to_csv('preprocessing_comparison.csv', index=False)


Comparison for Text 1: The cats are running quickly in the gardens!
     Token  Stemmed Lemmatized
0      the      the        the
1     cats      cat        cat
2      are      are         be
3  running      run        run
4  quickly  quickli    quickly
5       in       in         in
6      the      the        the
7  gardens   garden     garden
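The rows where the stemmer and the lemmatizer disagree are the most informative ones. A minimal sketch for pulling them out follows, with the Text 1 lists from the output above hard-coded so the snippet runs without NLTK or pandas. The variable names are illustrative.

```python
# Lists copied from the Text 1 results printed earlier in the notebook.
tokens = ['the', 'cats', 'are', 'running', 'quickly', 'in', 'the', 'gardens']
stemmed = ['the', 'cat', 'are', 'run', 'quickli', 'in', 'the', 'garden']
lemmatized = ['the', 'cat', 'be', 'run', 'quickly', 'in', 'the', 'garden']

# Keep only the positions where stemming and lemmatization give different results.
differences = [
    (tok, stem, lemma)
    for tok, stem, lemma in zip(tokens, stemmed, lemmatized)
    if stem != lemma
]

for tok, stem, lemma in differences:
    print(f"{tok}: stem={stem}, lemma={lemma}")
```

For Text 1 this leaves 'are' (the stemmer keeps it, the lemmatizer maps it to 'be') and 'quickly' (the stemmer produces the non-word 'quickli').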


## Explanation

- **Tokenization**:
  - Splits text into sentences or words (e.g., 'cats' and 'running' as tokens).
  - Essential for breaking down text into manageable units.
- **Stemming**:
  - Reduces words to their root form (e.g., 'running' → 'run', 'gardens' → 'garden').
  - Fast but may produce non-words (e.g., 'quickly' → 'quickli', 'happily' → 'happili').
- **Lemmatization**:
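  - Converts words to their dictionary form using POS tags (e.g., 'are' → 'be', 'running' → 'run'). Here 'better' → 'well' because it is tagged as an adverb; tagged as an adjective it would become 'good'.
  - More accurate than stemming but slower, since it needs POS tagging and a WordNet lookup.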
- **Implementation**:
  - Used NLTK's `word_tokenize`, `PorterStemmer`, and `WordNetLemmatizer` with POS tagging for accurate lemmatization.
  - Applied cleaning to remove special characters and convert to lowercase.
- **Visualization**:
  - Created a table to compare tokens, stemmed, and lemmatized words, highlighting differences.
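One consequence of cleaning first is that sentence-ending punctuation is gone before sentence tokenization runs, so every text comes back as a single sentence. If sentence boundaries matter, one option is to split first and clean each sentence afterwards. The sketch below assumes a naive regex splitter as a stand-in for `sent_tokenize`. `split_then_clean` is a hypothetical helper, not part of the notebook.

```python
import re

def split_then_clean(text):
    """Split into sentences first, then lowercase and strip non-letters."""
    # Naive boundary rule (assumption): ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = [re.sub(r"[^a-z\s]", "", s.lower()).strip() for s in sentences]
    # Drop anything that became empty after cleaning.
    return [c for c in cleaned if c]

print(split_then_clean("The cats are running! She was studying."))
# ['the cats are running', 'she was studying']
```

In the notebook itself the same reordering would mean calling `sent_tokenize` on the raw text and applying the cleaning step to each returned sentence.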

To extend this work, consider:
- Adding stop word removal for further preprocessing
- Using other libraries like spaCy for advanced tokenization and lemmatization
- Applying these techniques to a larger dataset for downstream tasks (e.g., text classification)
- Exploring other stemming algorithms (e.g., SnowballStemmer)
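
As a starting point for the first suggestion, here is a minimal stop word removal sketch. The `STOP_WORDS` set is a hypothetical, deliberately tiny list chosen to cover the sample sentences. In practice NLTK's `stopwords.words('english')` (after `nltk.download('stopwords')`) provides a fuller list.

```python
# Hypothetical mini stop list for illustration only; not NLTK's list.
STOP_WORDS = {"the", "are", "in", "was", "our", "on", "have", "been", "she", "they"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word set, preserving order."""
    return [t for t in tokens if t not in stop_words]

# Tokens for Text 1, copied from the results above.
tokens = ['the', 'cats', 'are', 'running', 'quickly', 'in', 'the', 'gardens']
print(remove_stop_words(tokens))
# ['cats', 'running', 'quickly', 'gardens']
```

Because the tokens are already lowercased by the cleaning step, a lowercase stop list is enough here. Filtering before stemming or lemmatization also avoids normalizing words that will be discarded anyway.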