# 02. Text Preprocessing

In this notebook, we will explore fundamental text preprocessing techniques:
1. **Tokenization**: Breaking text into words or sentences.
2. **Stopword Removal**: Removing common words that add little meaning.
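3. **Stemming**: Reducing words to a stem by stripping suffixes with heuristic rules; the result is not always a real word.
4. **Lemmatization**: Reducing words to their dictionary base form (lemma) using vocabulary and part-of-speech information.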

In [None]:
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

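# Download necessary NLTK data
nltk.download('punkt')      # tokenizer models used by older NLTK releases
nltk.download('punkt_tab')  # tokenizer data expected by recent NLTK releases
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # WordNet database for the lemmatizer
nltk.download('omw-1.4')    # Open Multilingual WordNet, used alongside WordNet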

# Load spaCy model
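try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spaCy model...")
    # Use the kernel's own interpreter so the model installs into this environment
    import sys
    !{sys.executable} -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")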

## 1. Tokenization
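Tokenization splits text into smaller units (tokens), such as sentences or words. NLTK's `word_tokenize` returns punctuation marks as separate tokens.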
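
To make the idea concrete, here is a minimal regex-based sketch of both kinds of tokenization. It is not how NLTK works internally; the function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the sentence splitter is deliberately naive (it would break on abbreviations such as "Dr. Smith").

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split on whitespace that follows '.', '!' or '?' (naive heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "NLP is fun. It works!"
print(simple_word_tokenize(sample))  # ['NLP', 'is', 'fun', '.', 'It', 'works', '!']
print(simple_sent_tokenize(sample))  # ['NLP is fun.', 'It works!']
```

Trained tokenizers such as the one behind `sent_tokenize` exist precisely because fixed rules like these fail on abbreviations, decimals, and similar edge cases.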

In [None]:
text = "Natural Language Processing is fascinating. It enables computers to understand human language!"

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Word Tokenization
words = word_tokenize(text)
print("\nWords:", words)

## 2. Stopword Removal
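Stopword removal drops high-frequency words such as 'is', 'the', and 'to' that carry little meaning on their own. Tokens are lowercased before the lookup because NLTK's stopword list is all lowercase. Punctuation tokens are not stopwords, so they remain in the filtered output.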
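
The filtering step itself is just a set lookup. The sketch below shows it without NLTK, assuming a tiny hand-picked stopword set (`TOY_STOP_WORDS`, invented for this example and far shorter than NLTK's list). It also drops punctuation tokens, which is an optional extra step and not part of stopword removal proper.

```python
import string

# Hypothetical miniature stopword set, for illustration only.
TOY_STOP_WORDS = {"is", "the", "to", "it", "a", "an", "of", "and"}

def remove_stopwords(tokens, stop_words=TOY_STOP_WORDS, drop_punctuation=True):
    """Return tokens that are neither stopwords nor (optionally) pure punctuation."""
    kept = []
    for token in tokens:
        # Case-insensitive stopword check
        if token.lower() in stop_words:
            continue
        # Skip tokens made up entirely of punctuation characters
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

tokens = ["It", "is", "fun", "to", "learn", "the", "basics", "."]
print(remove_stopwords(tokens))  # ['fun', 'learn', 'basics']
```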

In [None]:
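# Store the stopwords in a set for fast membership tests
stop_words = set(stopwords.words('english'))

# Compare in lowercase so capitalized stopwords such as "It" are removed too
filtered_words = [word for word in words if word.lower() not in stop_words]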

print("Original:", words)
print("Filtered:", filtered_words)

## 3. Stemming
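Stemming chops off suffixes with rule-based heuristics (e.g., 'running' -> 'run'). It is fast but crude: the Porter stemmer leaves irregular forms such as 'ran' unchanged and can produce non-words such as 'easili' for 'easily'.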
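
To show what "chopping off suffixes" means in practice, here is a toy suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions; the function `toy_stem` and its suffix list are assumptions made for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ly", "ed", "s")):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. 'runn' -> 'run'
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "runs", "ran", "easily", "jumped"]:
    print(f"{w} -> {toy_stem(w)}")
# running -> run
# runs -> run
# ran -> ran
# easily -> easi
# jumped -> jump
```

Like any purely rule-based stemmer, this one cannot relate 'ran' to 'run', and it produces the non-word 'easi'. That limitation is what lemmatization addresses.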

In [None]:
stemmer = PorterStemmer()

words_to_stem = ["running", "runs", "ran", "easily", "fairly"]
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

for original, stemmed in zip(words_to_stem, stemmed_words):
    print(f"{original} -> {stemmed}")

## 4. Lemmatization
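Lemmatization uses vocabulary and morphological analysis to return a word's dictionary base form (lemma). NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is given, while spaCy infers the part of speech from the surrounding sentence.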
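
The role of the part-of-speech tag can be sketched with a plain dictionary lookup. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins; a real lemmatizer consults WordNet or a trained pipeline instead of four hard-coded entries.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("better", "a"): "good",   # adjective
    ("ran", "v"): "run",       # verb
    ("running", "v"): "run",   # verb
    ("feet", "n"): "foot",     # noun
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    word = word.lower()
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("ran", pos="v"))     # run
print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better", pos="n"))  # better
```

The last call shows why the tag matters: the same surface form maps to a different lemma, or to itself, depending on the part of speech supplied.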

In [None]:
lemmatizer = WordNetLemmatizer()

print("NLTK Lemmatization:")
print(f"better -> {lemmatizer.lemmatize('better', pos='a')} (adjective)")
print(f"running -> {lemmatizer.lemmatize('running', pos='v')} (verb)")

print("\nspaCy Lemmatization (Context-aware):")
doc = nlp("The striped bats are hanging on their feet for best results.")
for token in doc:
    print(f"{token.text} -> {token.lemma_}")