# Day 50: Introduction to NLP + Text Preprocessing

We're shifting gears completely today. After weeks of numbers, time series, and metrics, we're diving into the messy, fascinating world of text with Natural Language Processing (NLP). Today, we focus on the most fundamental skill: Text Preprocessing—turning raw human language into structured data that machines can understand.

## Topics Covered

- Tokenization
- Lowercasing
- Stemming
- Lemmatization
- Stopwords

## Tokenization: Breaking it Down

Tokenization is the process of splitting text into smaller, meaningful units called tokens. Tokens are typically words, but they can also be sentences, phrases, or symbols such as punctuation. It is one of the first steps in almost any NLP pipeline, because every later step operates on tokens.

- `Analogy`: The Assembly Line. Before you can analyze a car, you must break it down into its core components (engine, tire, door). Tokens are the components of text.

- `Example`:
    - Raw Text: "The fox ran quickly to the box."

    - Tokens: ['The', 'fox', 'ran', 'quickly', 'to', 'the', 'box', '.'] (Note: Punctuation is often a token!)
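
To make the idea concrete, here is a minimal sketch of a word tokenizer built on Python's standard `re` module. The function name `simple_tokenize` and the regular expression are illustrative assumptions, not what NLTK does internally; real tokenizers handle contractions, abbreviations, and URLs far more carefully.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches one character that is neither a word character
    #          nor whitespace, i.e. a punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The fox ran quickly to the box.")
print(tokens)
# ['The', 'fox', 'ran', 'quickly', 'to', 'the', 'box', '.']
```

A pattern this simple breaks down quickly: "don't" comes out as three tokens (`don`, `'`, `t`), which is one reason to prefer a library tokenizer for real work.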

## Lowercasing: Standardization 

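Lowercasing converts all text to a single case so that surface variants such as "The" and "the" count as the same word, which is what you want for basic tasks like counting word frequency. The trade-off is that case sometimes carries meaning: after lowercasing, "Apple" (the company) and "apple" (the fruit) also become indistinguishable, so case-sensitive tasks such as named entity recognition often skip this step.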

- `Analogy`: Standardizing Measurements. You wouldn't measure half of your data in meters and the other half in feet. Lowercasing standardizes the linguistic measurement.

- `Example`: ['The', 'Fox'] → ['the', 'fox']
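
The effect on word counts can be sketched with the standard library's `Counter`. The token list below is made up for illustration.

```python
from collections import Counter

tokens = ["The", "fox", "saw", "the", "other", "Fox"]

# Without lowercasing, "The"/"the" and "fox"/"Fox" are counted separately
raw_counts = Counter(tokens)

# After lowercasing, each pair collapses into a single vocabulary entry
lower_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts), raw_counts["the"])      # 6 1
print(len(lower_counts), lower_counts["the"])  # 4 2
```

The vocabulary shrinks from six entries to four, and the count for "the" now reflects both occurrences.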

## Stopwords Removal: Trimming the Fat

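Stopwords are common words that appear frequently but typically add little to the meaning of a sentence (e.g., 'the', 'a', 'is', 'for'). Removing them reduces vocabulary size and noise, which can speed up training and often helps simple models such as word-count classifiers. Be careful with sentiment tasks: standard lists, including NLTK's English list, contain negations like 'not', and dropping them can flip the meaning of a sentence.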

- `Analogy`: Filtering Spam. Stopwords are like the junk mail in your mailbox—they're volume, but not value. You filter them out to focus on the important content.

- `Example`:

    - Tokens: ['the', 'fox', 'ran', 'quickly', 'to', 'the', 'box']

    - Stopwords Removed: ['fox', 'ran', 'quickly', 'box']
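
The filtering step itself is just a membership test. The sketch below uses a small hand-written stopword set (`STOP_WORDS` is a hypothetical name, and the list is deliberately tiny); in practice you would load a full list such as NLTK's.

```python
# Hypothetical mini stopword list for illustration only;
# NLTK's English list is much longer
STOP_WORDS = {"the", "a", "an", "is", "to", "for", "of", "and"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not in STOP_WORDS (expects lowercased tokens)."""
    return [token for token in tokens if token not in STOP_WORDS]

tokens = ["the", "fox", "ran", "quickly", "to", "the", "box"]
filtered = remove_stopwords(tokens)
print(filtered)
# ['fox', 'ran', 'quickly', 'box']
```

Storing the stopwords in a `set` rather than a `list` keeps each lookup fast, which matters once you filter millions of tokens.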

## Stemming: The Blunt Instrument 

Stemming is a simple, heuristic process of chopping off the ends of words to get to a common root or stem. It's fast but often produces roots that are not actual dictionary words.

- `Analogy`: Cutting a Tree Branch. You just cut the branch close to the trunk. The result is rough, but you still know where the word comes from.

- `Example`:
    - running, runs → run (the irregular form ran stays ran, because suffix stripping cannot relate it to run)

    - finalization, finalized → final (correct)

    - universal, university → univers (incorrect, over-stemming)
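
To show what "chopping off the ends" looks like in code, here is a deliberately naive suffix stripper. It is a toy written for this lesson, not the Porter algorithm; the suffix list, the minimum stem length of 3, and the name `naive_stem` are all assumptions made for illustration.

```python
# Suffixes are tried in this order; the first match wins
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least 3 characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            # (vowels, "l", and "s" are left alone so "fall" stays "fall")
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for word in ["running", "runs", "ran", "foxes", "heavily"]:
    print(word, "->", naive_stem(word))
# running -> run
# runs -> run
# ran -> ran
# foxes -> fox
# heavily -> heavi
```

Two typical stemmer traits show up even in this toy: the irregular form "ran" is untouched because no rule applies to it, and "heavily" becomes "heavi", which is not a dictionary word.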

## Lemmatization: The Refined Tool 

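Lemmatization is a more sophisticated process that uses a vocabulary and morphological analysis, usually guided by the word's part of speech, to convert a word back to its dictionary form, or lemma. The output is normally a valid dictionary word; words the lemmatizer does not recognize are typically returned unchanged.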

- `Analogy`: The Dictionary Lookup. It finds the exact entry in the dictionary. It’s slower but more accurate than stemming.

- `Example`:
    - running, runs, ran → run (when treated as verbs)
    - better → good (when treated as an adjective, lemmatization recognizes 'better' as the comparative form of 'good')
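
The dictionary-lookup idea can be sketched with a plain Python dict keyed by word and part of speech. The table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical and hold only a handful of entries; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults a full lexical database instead.

```python
# Hypothetical lookup table keyed by (word, part of speech):
# "v" = verb, "a" = adjective, "n" = noun
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("runs", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("foxes", "n"): "fox",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print([toy_lemmatize(w, "v") for w in ["running", "runs", "ran"]])
# ['run', 'run', 'run']
print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("better"))       # better
```

The last call shows why the part of speech matters: looked up as a noun (the default here), "better" has no entry and comes back unchanged.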

## Code Example: Text Preprocessing Pipeline

```python
# Notebook cell; from a terminal, run `pip install nltk` without the leading "!"
!pip install nltk
```


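```python
# Download the NLTK data files used below (only needed once per environment)
import nltk

nltk.download('punkt')      # tokenizer models
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # lexical database used by the lemmatizer
```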


```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Sample text
text = "The foxes were running quickly across the fields, but it was raining heavily."

# 1. Lowercasing
text_lower = text.lower()
print(f"Lowercased: {text_lower}")
# Output: the foxes were running quickly across the fields, but it was raining heavily.

# 2. Tokenization
tokens = word_tokenize(text_lower)
print(f"Tokens: {tokens}\n")
# Output: ['the', 'foxes', 'were', 'running', 'quickly', 'across', 'the', 'fields', ',', 'but', 'it', 'was', 'raining', 'heavily', '.']

# 3. Stopwords Removal
stop_words = set(stopwords.words('english'))
# .isalpha() also drops punctuation tokens such as ',' and '.'
tokens_filtered = [word for word in tokens if word not in stop_words and word.isalpha()]
print(f"Filtered Tokens: {tokens_filtered}\n")
# Output: ['foxes', 'running', 'quickly', 'across', 'fields', 'raining', 'heavily']

# 4. Stemming (Rough Reduction)
stemmer = PorterStemmer()
tokens_stemmed = [stemmer.stem(word) for word in tokens_filtered]
print(f"Stemmed Tokens: {tokens_stemmed}")
# Output: ['fox', 'run', 'quickli', 'across', 'field', 'rain', 'heavili'] # Notice 'quickli' and 'heavili' are not real words

# 5. Lemmatization (Refined Reduction)
lemmatizer = WordNetLemmatizer()
# lemmatize() assumes nouns unless told otherwise. pos='v' treats every token
# as a verb, a shortcut that works for this sentence; a fuller pipeline would
# tag each token's part of speech first.
tokens_lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in tokens_filtered]
print(f"Lemmatized Tokens: {tokens_lemmatized}")
# Output: ['fox', 'run', 'quickly', 'across', 'field', 'rain', 'heavily'] # Cleaner results
```

Output:

```text
Lowercased: the foxes were running quickly across the fields, but it was raining heavily.
Tokens: ['the', 'foxes', 'were', 'running', 'quickly', 'across', 'the', 'fields', ',', 'but', 'it', 'was', 'raining', 'heavily', '.']

Filtered Tokens: ['foxes', 'running', 'quickly', 'across', 'fields', 'raining', 'heavily']

Stemmed Tokens: ['fox', 'run', 'quickli', 'across', 'field', 'rain', 'heavili']
Lemmatized Tokens: ['fox', 'run', 'quickly', 'across', 'field', 'rain', 'heavily']
```


## Summary of Day 50

Today, you learned that NLP starts with cleaning! You mastered the essential preprocessing pipeline: Tokenization to break text apart, Lowercasing for standardization, Stopword Removal for efficiency, and Stemming vs. Lemmatization for reducing words to their base forms.

## What's Next (Day 51)

Now that you can clean text, how do you turn those lists of words into numbers a machine can process? Tomorrow, on Day 51, we'll dive into the two foundational techniques for feature extraction in NLP: Bag of Words (BoW) and TF-IDF. You'll learn how to create count vectors and use term weighting to build a numerical matrix from your text!