# Introduction to NLP: Text Preprocessing

This notebook demonstrates fundamental text preprocessing techniques in Natural Language Processing (NLP) using Python's NLTK library. We'll cover:

- **Text Cleaning**: Removing unwanted characters and normalizing text
- **Tokenization**: Splitting text into individual words or tokens
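- **Stemming**: Stripping suffixes with fixed rules to reduce words to a stem, which is not always a dictionary word
- **Lemmatization**: Converting words to their dictionary form (lemma), using the word's part of speech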

These techniques prepare raw text for downstream NLP tasks such as sentiment analysis and text classification.

## Setup and Imports

First, let's import the necessary libraries and download required NLTK data.

In [7]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
import re

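# Download the tokenizer, WordNet, and POS tagger data.
# Both the older resource names and the newer 'punkt_tab' / '*_eng' names are
# requested so the notebook should work across NLTK versions.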
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

## Helper Function for POS Tagging

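NLTK's tagger returns Penn Treebank tags (such as `VBG` or `NNS`), while `WordNetLemmatizer` expects one of WordNet's POS constants and treats every word as a noun when none is given. This helper maps between the two so that verbs, adjectives, and adverbs are lemmatized correctly.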

In [2]:
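def get_wordnet_pos(word):
    """Return the WordNet POS constant for a single word, for use with lemmatize()."""
    # Tag the word on its own and keep the first letter of its Penn Treebank tag.
    # Note: without sentence context the tagger can mislabel ambiguous words.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV,
    }
    # Fall back to NOUN, which is also the lemmatizer's own default
    return tag_dict.get(tag, wordnet.NOUN)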
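
The helper tags one word at a time, so the tagger never sees the surrounding sentence. One possible alternative, sketched here under assumptions, is to tag the whole token list once (for example with `nltk.pos_tag(tokens)`) and map each Penn Treebank tag in a separate step. The function name `penn_to_wordnet` and the hand-written `tagged` pairs are hypothetical; the single-letter return values are the strings behind `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`.

```python
# Hypothetical helper: maps a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter.
# The letters 'a', 'n', 'v', 'r' are the values of wordnet.ADJ, NOUN, VERB, ADV.
def penn_to_wordnet(tag):
    first = tag[:1].upper()
    mapping = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}
    # Unknown or empty tags fall back to noun, like the helper above
    return mapping.get(first, 'n')

# Hand-written (word, tag) pairs for illustration only; in practice they
# would come from tagging the full sentence, e.g. nltk.pos_tag(tokens).
tagged = [('cats', 'NNS'), ('running', 'VBG'), ('quickly', 'RB'), ('the', 'DT')]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Keeping the mapping separate from the tagging call means the same function works whether the tags come from a single word or from a full sentence.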

## Main Preprocessing Function

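This function combines all preprocessing steps: cleaning, tokenization, stemming, and lemmatization. Stemming and lemmatization are each applied to the same token list independently, so their results can be compared side by side.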

In [3]:
def preprocess_text(text):
    """Perform text preprocessing: cleaning, tokenization, stemming, and lemmatization"""
    # 1. Text Cleaning
    # Convert to lowercase
    text = text.lower()
    # Remove everything except letters and whitespace (drops digits and punctuation)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # 2. Tokenization
    tokens = word_tokenize(text)

    # 3. Stemming
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(token) for token in tokens]

    # 4. Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]

    return {
        'original': text,  # cleaned, lowercased text, not the raw input
        'tokens': tokens,
        'stemmed': stemmed_words,
        'lemmatized': lemmatized_words
    }
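
To make the cleaning step concrete, the following sketch applies the same two operations (lowercasing, then the regular expression used in `preprocess_text`) to a short string. The function name `clean_text` and the input string are made up for illustration.

```python
import re

def clean_text(text):
    # Same cleaning as in preprocess_text: lowercase, then drop
    # every character that is not a letter or whitespace.
    text = text.lower()
    return re.sub(r'[^a-zA-Z\s]', '', text)

cleaned = clean_text("Don't pay $5 in 2024!")
print(repr(cleaned))    # 'dont pay  in '
print(cleaned.split())  # ['dont', 'pay', 'in']
```

Apostrophes and digits are deleted outright, so a contraction such as "Don't" collapses into the single token "dont" and numbers disappear entirely. Whether that is acceptable depends on the task.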

## Example Usage

Let's test our preprocessing function with a sample text and display the results.

In [9]:
# Sample text
sample_text = "The cats are running and jumping in the gardens and the gardens is good "

# Process the text
result = preprocess_text(sample_text)

# Display results
print("Original Text:", result['original'])
print("Tokens:", result['tokens'])
print("Stemmed Words:", result['stemmed'])
print("Lemmatized Words:", result['lemmatized'])

Original Text: the cats are running and jumping in the gardens and the gardens is good 
Tokens: ['the', 'cats', 'are', 'running', 'and', 'jumping', 'in', 'the', 'gardens', 'and', 'the', 'gardens', 'is', 'good']
Stemmed Words: ['the', 'cat', 'are', 'run', 'and', 'jump', 'in', 'the', 'garden', 'and', 'the', 'garden', 'is', 'good']
Lemmatized Words: ['the', 'cat', 'be', 'run', 'and', 'jumping', 'in', 'the', 'garden', 'and', 'the', 'garden', 'be', 'good']


## Explanation of Results

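- **Original Text**: Despite the label, this is the cleaned, lowercase version of the input text, not the raw input.
- **Tokens**: The cleaned text split into individual words.
- **Stemmed Words**: Words reduced to a stem by rule-based suffix stripping (e.g., 'running' → 'run', 'gardens' → 'garden'). Irregular forms such as 'are' and 'is' are left unchanged.
- **Lemmatized Words**: Words converted to their dictionary form using the POS tag from the helper (e.g., 'cats' → 'cat', 'are' → 'be'). Note that 'running' became 'run' but 'jumping' was left unchanged: the helper tags each word without sentence context, and here 'jumping' was evidently not tagged as a verb.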

This preprocessing pipeline is a foundation for many NLP applications. You can extend it by adding stop word removal, n-grams, or other techniques depending on your specific use case.
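
As one possible starting point for those extensions, here is a minimal sketch of stop word removal and bigram generation in plain Python. The small `STOP_WORDS` set and the function names `remove_stop_words` and `ngrams` are assumptions made for this example, not part of NLTK; NLTK provides its own stop word lists through `nltk.corpus.stopwords`, which needs a separate data download. The input is the lemmatized output shown above.

```python
# Hypothetical, deliberately tiny stop word set for illustration
STOP_WORDS = {'the', 'and', 'in', 'be'}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only tokens that are not in the stop word set
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n=2):
    # Slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

lemmas = ['the', 'cat', 'be', 'run', 'and', 'jumping', 'in', 'the',
          'garden', 'and', 'the', 'garden', 'be', 'good']
content = remove_stop_words(lemmas)
print(content)          # ['cat', 'run', 'jumping', 'garden', 'garden', 'good']
print(ngrams(content))  # bigrams over the remaining tokens
```

Removing stop words before building n-grams makes words adjacent that were not adjacent in the original sentence, so the order of these two steps is a design choice to make per task.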