# Text Preprocessing

## 1️⃣ **Lowercasing**

### 🔍 **Why:**
- Words like "Good", "GOOD", and "good" are semantically identical but are treated as separate tokens if not lowercased.


### 📌 **Real-World:**
- In a product review system, "Excellent" and "excellent" should be considered the same sentiment word.

In [None]:
text = "Good Morning EVERYONE! Let's Start Our NLP Journey."
lowercased = text.lower()

print("🔵 Original:", text, "\n")
print("🟢 Preprocessed:", lowercased)

🔵 Original: Good Morning EVERYONE! Let's Start Our NLP Journey. 

🟢 Preprocessed: good morning everyone! let's start our nlp journey.
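
For text that goes beyond plain ASCII, Python's built-in `str.casefold()` is a more aggressive alternative to `lower()` intended for caseless matching. The sketch below compares the two on a few illustrative words (the word list is an assumption, not part of the original example).

```python
# Compare lower() and casefold() on a few illustrative words.
words = ["GOOD", "Good", "Straße", "STRASSE"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print("lower():   ", lowered)
print("casefold():", folded)
```

With `lower()`, the two German spellings stay different, while `casefold()` maps both to the same string. For English-only data the two methods usually behave the same.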


---

## 2️⃣ **Remove HTML Tags**

### 🔍 **Why:**
- HTML is markup used for layout, not meaning.

- If you're scraping websites (news, blogs), tags like `<div>` and `<span>` add noise.

### 📌 **Real-World:**
- When extracting articles from news sites, you'll find lots of formatting tags. Tags such as `<p>` and `<a href>` carry no meaning for a model and only add noise.

In [None]:
from bs4 import BeautifulSoup
import re

# sample HTML text
text = "<div>Hello <b>World</b>! NLP is <i>awesome</i>.</div>"

# Using BeautifulSoup to remove HTML tags
cleaned = BeautifulSoup(text, "html.parser").get_text()

# Using a regex to remove HTML tags (the pattern tolerates '>' inside quoted attribute values)
re_pattern = "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>"
cleaned_re = re.sub(re_pattern, '', text)

print("🔵 Original:", text, "\n")
print("🟢 Preprocessed with BS4:", cleaned, "\n")
print("🟢 Preprocessed with regex:", cleaned_re)

🔵 Original: <div>Hello <b>World</b>! NLP is <i>awesome</i>.</div> 

🟢 Preprocessed with BS4: Hello World! NLP is awesome. 

🟢 Preprocessed with regex: Hello World! NLP is awesome.
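
One detail the regex route does not cover is character references such as `&amp;`. A minimal standard-library sketch, assuming a deliberately simple tag pattern and a made-up sample string, could strip the tags first and then decode the references with `html.unescape`:

```python
import html
import re

raw = "<p>Tom &amp; Jerry &lt;3 cheese</p>"

# Strip tags with a simple pattern (assumes no '>' inside attribute values).
no_tags = re.sub(r"<[^>]+>", "", raw)

# Decode character references such as &amp; and &lt;.
decoded = html.unescape(no_tags)

print(no_tags)
print(decoded)
```

The order matters here: decoding first would turn `&lt;` into a literal `<`, which the tag pattern could then mistake for the start of a tag.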


---

## 3️⃣ **Remove URLs**

### 🔍 **Why:**
- URLs usually don’t carry semantic meaning.

- They are high-variance strings (almost every one is unique), which hurts model generalization.

### 📌 **Real-World:**
- In a tweet like “Check this out 👉 https://xyz.com”, we care more about the sentiment or emotion, not the URL itself.

In [None]:
# Remove URLs (with or without a scheme) from any text.
# Note: the last alternative also matches dotted names such as file names.
import re

# Example usage:
text = "Visit our website at https://www.example.com/page.html or check out www.another-site.org. You can also find info at mydomain.net/info."

# Regex pattern to match various URL formats (http, https, www, without scheme)
url_pattern = re.compile(
    r'https?://[^\s/$.?#].[^\s]*'  # Matches http/https URLs
    r'|www\.[^\s/$.?#].[^\s]*'    # Matches www. URLs
    r'|[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}(?:/[^\s]*)?' # Matches domain-only URLs like example.com
)

# Remove URLs from the text
text_without_urls = url_pattern.sub('', text)

print("🔵 Original:", text, "\n")
print("🟢 Preprocessed:", text_without_urls)

🔵 Original: Visit our website at https://www.example.com/page.html or check out www.another-site.org. You can also find info at mydomain.net/info. 

🟢 Preprocessed: Visit our website at  or check out  You can also find info at 
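
Instead of deleting URLs outright, some pipelines replace them with a placeholder token so the model still sees that a link was present. A minimal sketch, assuming a simplified pattern (only `http(s)://` and `www.` prefixes) and a hypothetical `<URL>` token:

```python
import re

# Simplified pattern: only http(s):// and www. prefixes (an assumption).
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def mask_urls(text, token="<URL>"):
    """Replace each URL with a placeholder and collapse leftover whitespace."""
    masked = URL_RE.sub(token, text)
    return re.sub(r"\s+", " ", masked).strip()

print(mask_urls("Check this out 👉 https://xyz.com now"))
```

Because `\S+` runs to the next whitespace, trailing punctuation attached to a URL is swallowed as well, just as in the pattern above.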


---

## 4️⃣ **Remove Punctuation**

### 🔍 **Why:**
- Punctuation marks like `!`, `.`, `?` create unnecessary tokens.

- Removing them helps reduce noise, especially in tasks like text classification or topic modeling.

### 📌 **Real-World:**
- For spam detection or document classification, punctuation doesn’t usually help (unless you’re analyzing writing style).

In [None]:
import string

text = "Hello!!! How are you??? I'm fine :)"
no_punc = text.translate(str.maketrans('', '', string.punctuation))

print("🔵 Original:", text, "\n")
print("🟢 Preprocessed:", no_punc)

🔵 Original: Hello!!! How are you??? I'm fine :) 

🟢 Preprocessed: Hello How are you Im fine 
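
Note that the output above turns "I'm" into "Im". If contractions matter for the task, one option is to leave the apostrophe out of the removal set. A small sketch of that variation:

```python
import string

# Keep apostrophes so contractions such as "I'm" survive.
to_remove = string.punctuation.replace("'", "")
table = str.maketrans("", "", to_remove)

text = "Hello!!! How are you??? I'm fine :)"
cleaned = text.translate(table).strip()
print(cleaned)
```

The same idea extends to any character worth keeping, for example `#` or `@` in social media text.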


---

## 5️⃣ **Chat Word Treatment (e.g., GN → Good Night)**

### 🔍 **Why:**
- Chat language is full of abbreviations: "lol", "idk", "smh".

- They need to be normalized to standard English for models to understand.

### 📌 **Real-World:**
- In customer support or social media, replacing abbreviations like `"brb"` with `"be right back"` helps understand the message better.

### 🗣️ **Slang Words**
- The following three GitHub resources provide lists of slang words and their standard English equivalents. They are useful for expanding chat abbreviations (e.g., "GN" → "Good Night") during text preprocessing.

    - [Repo1](https://github.com/ipekdk/abbreviation-list-english)
    - [Repo2](https://github.com/bodhwani/NLP-VIT-BOT/blob/master/slangs.csv)
    - [Common Slang Words](https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt)

In [3]:
# Apply this slang-abbreviation list to your own text.
# Requires a CSV file named 'slangs.csv' with columns 'Abbr' and 'Fullform'.

import pandas as pd

# Read the slang abbreviation CSV file
chat_dict = pd.read_csv("slangs.csv")

# Create a mapping dictionary for fast lookup
slang_map = dict(zip(chat_dict['Abbr'].str.lower(), chat_dict['Fullform']))

text = "BRB guys, lol this is funny. ttyl!"
words = text.split()
chat_fixed = []
for word in words:
    key = word.lower().strip(".,!?")  # Remove punctuation for matching
    if key in slang_map:
        chat_fixed.append(slang_map[key])
    else:
        chat_fixed.append(word)

print("🔵 Original:", text, "\n")
print("🟢 Preprocessed:", ' '.join(chat_fixed))  # join the words back together with a single space as separator


🔵 Original: BRB guys, lol this is funny. ttyl! 

🟢 Preprocessed: Be Right Back guys, Laughing Out Loud this is funny. Talk To You Later
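
In the output above, the "!" after "ttyl" is lost because the whole word, punctuation included, is replaced. A possible variation, sketched here with a small hypothetical in-memory mapping instead of the CSV file, substitutes only the alphabetic part of each word:

```python
import re

# Hypothetical mapping for illustration; in practice it could be loaded from a slang CSV.
slang_map = {
    "brb": "be right back",
    "lol": "laughing out loud",
    "ttyl": "talk to you later",
}

def expand_slang(text, mapping):
    """Replace whole-word abbreviations, leaving surrounding punctuation intact."""
    def repl(match):
        word = match.group(0)
        return mapping.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", repl, text)

print(expand_slang("BRB guys, lol this is funny. ttyl!", slang_map))
```

Matching on runs of letters keeps commas, periods, and exclamation marks where they were.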


---

## 6️⃣ **Spelling Correction**

### 🔍 **Why:**
- Typos are common in user input (especially social media).

- Words like “beleive” won’t match “believe” in dictionaries or embeddings.

### 📌 **Real-World:**
- Spell correction boosts chatbot understanding and auto-correction in search bars.

In [7]:
from autocorrect import Speller
from textblob import TextBlob
from spellchecker import SpellChecker


spell = Speller(lang='en')
text = "I realy love natral language prosesing."

# Autocorrect
corrected_autocorrect = spell(text)

# TextBlob
corrected_textblob = str(TextBlob(text).correct())

# SpellChecker
spellchecker = SpellChecker()
# Tokenize the text for word-level correction
tokens = text.split()
corrected_pyspellchecker = ' '.join([spellchecker.correction(word) or word for word in tokens])


print("🔵 Original:", text, "\n")
print("🟢 Preprocessed (autocorrect):", corrected_autocorrect, "\n")
print("🟢 Preprocessed (TextBlob):", corrected_textblob, "\n")
print("🟢 Preprocessed (pyspellchecker):", corrected_pyspellchecker)

🔵 Original: I realy love natral language prosesing. 

🟢 Preprocessed (autocorrect): I really love natural language crossing. 

🟢 Preprocessed (TextBlob): I really love natural language pressing. 

🟢 Preprocessed (pyspellchecker): I really love natural language prosesing.
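
As the outputs show, each library can miss or miscorrect a word, and all of them depend on the vocabulary they know. To illustrate the underlying idea only, here is a toy sketch using the standard library's `difflib` with a tiny, made-up vocabulary; it is not how the libraries above work internally.

```python
import difflib

# Tiny illustrative vocabulary; a real system would use a full word list.
VOCAB = ["believe", "language", "love", "natural", "processing", "really"]

def correct_word(word, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if nothing is close enough."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for typo in ["realy", "natral", "prosesing", "beleive", "xyz"]:
    print(typo, "->", correct_word(typo))
```

The `cutoff` parameter controls how similar a candidate must be. A word with no close match, such as "xyz" here, is returned unchanged.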


---

## 7️⃣ **Remove Stop Words**

### 🔍 **Why:**
- Stopwords like "the", "is", "at" are function words, not content words.

- Removing them improves focus on important tokens.

### 📌 **Real-World:**
- In the movie review “The movie was absolutely fantastic”, the word “fantastic” carries the sentiment, not “the”.

In [10]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy

text = "This is an example of removing common stop words."

# NLTK stopword removal on simple whitespace tokens
# (requires nltk.download('stopwords') to have been run once)
nltk_stopwords = set(stopwords.words('english'))
tokens = text.split()
nltk_filtered = [word for word in tokens if word.lower() not in nltk_stopwords]
nltk_no_stopwords = " ".join(nltk_filtered)

# spaCy stopword removal
nlp = spacy.load("en_core_web_sm")
spacy_doc = nlp(text)
spacy_filtered = [token.text for token in spacy_doc if not token.is_stop]
spacy_no_stopwords = " ".join(spacy_filtered)

print("🔵 Original:", text)
print("🟢 NLTK (no stopwords):", nltk_no_stopwords)
print("🟢 spaCy (no stopwords):", spacy_no_stopwords)


🔵 Original: This is an example of removing common stop words.
🟢 NLTK (no stopwords): example removing common stop words.
🟢 spaCy (no stopwords): example removing common stop words .


In [13]:
stop_words = stopwords.words('english') # You can specify other languages
type(stop_words)

list

In [2]:
import nltk
nltk.download('punkt')      # Punkt tokenizer models
nltk.download('punkt_tab')  # Also required by word_tokenize in recent NLTK releases

# The variables stop_words and text are already defined in the notebook

words = word_tokenize(text)  # Tokenize the text into words

filtered_words = [word for word in words if word.lower() not in stop_words]
filtered_text = " ".join(filtered_words)

print(filtered_text)


In [9]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Ensure required NLTK resources are downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # Without this, word_tokenize raises LookupError in recent NLTK releases

text = "This is an example of removing common stop words."
tokens = word_tokenize(text)
english_stopwords = set(stopwords.words('english'))  # Build the set once for fast lookup
filtered = [word for word in tokens if word.lower() not in english_stopwords]
no_stopwords = " ".join(filtered)

print("🔵 Original:", text)
print("🟢 Preprocessed:", no_stopwords)


| Feature       | NLTK                             | spaCy                                   |
| ------------- | -------------------------------- | --------------------------------------- |
| Stopword List | Manually maintained              | Linguistically curated                  |
| Tokenization  | Basic (`word_tokenize`)          | Advanced (handles context, punctuation) |
| Flexibility   | High (can customize list easily) | High (custom rules, pipelines)          |
| Performance   | Lightweight                      | Heavier but more powerful               |
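
Customizing the list matters in practice. For sentiment tasks, for example, negations such as "not" are often worth keeping even though general-purpose stop lists (NLTK's English list among them) include them. A library-free sketch with a small, made-up stop list:

```python
# Small illustrative stop list (an assumption, not the NLTK or spaCy list).
base_stopwords = {"the", "is", "was", "a", "an", "of", "not", "at"}

# Keep negations for sentiment tasks by removing them from the stop list.
negations = {"not", "no", "never"}
sentiment_stopwords = base_stopwords - negations

def remove_stopwords(text, stop_set):
    """Drop whitespace-separated words that appear in stop_set (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_set)

review = "The movie was not good"
print(remove_stopwords(review, base_stopwords))
print(remove_stopwords(review, sentiment_stopwords))
```

With the base list the review collapses to "movie good", which reverses its meaning. The customized list preserves "not".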


---

## 8️⃣ **Handling Emojis**

### 🔍 **Why:**
- Emojis are sentiment-rich tokens (😊, 😢).

- You can either remove them, convert to words, or use them as features.

### 📌 **Real-World:**
- In social media sentiment analysis, 😠 and ❤️ change the tone completely and must be considered.

In [None]:
import emoji

text = "Good job! 👍 I’m so happy 😊"
emoji_text = emoji.demojize(text)

print("🔵 Original:", text)
print("🟢 Preprocessed:", emoji_text)
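
The cell above covers the "convert to words" option. To illustrate both converting and removing without extra dependencies, here is a rough standard-library sketch. It treats every character in the Unicode category "So" (Symbol, other) as an emoji, which is only an approximation: that category also contains symbols such as ©, and multi-character emoji sequences are not handled. The function names are hypothetical.

```python
import unicodedata

def emoji_to_words(text):
    """Replace 'Symbol, other' characters with their Unicode names (a rough stand-in for demojize)."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            out.append((":" + name.lower().replace(" ", "_") + ":") if name else "")
        else:
            out.append(ch)
    return "".join(out)

def strip_emojis(text):
    """Drop every 'Symbol, other' character."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")

print(emoji_to_words("Good job! 👍"))
print(strip_emojis("Good job! 👍").strip())
```

The names produced this way come from the Unicode character database and differ from the aliases the `emoji` package uses.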


---

## 9️⃣ **Tokenization**

### ⚠️ **Very Important Step**

### 🔍 **Why:**
Machine learning models in NLP typically require numerical input. Tokenization splits text into units (tokens) that can then be mapped to numerical representations, such as word embeddings, that models can process.

#### **Example:**
Let's say we have the sentence: `The quick brown fox jumps over the lazy dog.`
Word Tokenization:
> ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

### 📌 **Real-World:**
- In a search engine, the query "best Italian restaurants near me" is tokenized into the words: ["best", "Italian", "restaurants", "near", "me"].

In [None]:
from nltk.tokenize import word_tokenize

text = "Tokenization splits sentences into words."
tokens = word_tokenize(text)

print("🔵 Original:", text)
print("🟢 Preprocessed:", tokens)
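
To make the link between tokens and numerical input concrete, here is a minimal sketch that needs no NLTK data. It uses a simple regular expression as the tokenizer (an assumption; `word_tokenize` applies more elaborate rules) and then maps each distinct token to an integer id.

```python
import re

def simple_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)

# Map each distinct token to an integer id, a first step toward numerical input.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
print(ids)
```

Note that "The" and "the" receive different ids here, which is one reason lowercasing is often applied before tokenization.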


---

## 🔟 **Stemming**

### 🔍 **Why:**
- Stemming reduces words to a base/root form. It’s fast and works well in information retrieval systems.

- Because it only strips suffixes by rule, it may produce non-words (e.g., “studies” → “studi”).

### 📌 **Real-World:**
- Used in search engines (e.g., “searching”, “searched”, “searches” → “search”) to show relevant documents.

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # needed if this cell runs on its own

stemmer = PorterStemmer()
text = "We are studying various stemmed words: running, studies, cries"
tokens = word_tokenize(text)
stemmed = [stemmer.stem(w) for w in tokens]

print("🔵 Original:", tokens)
print("🟢 Preprocessed:", stemmed)
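
To show why rule-based stemming is fast but can yield non-words, here is a toy suffix stripper. It is a hypothetical illustration with four hand-picked rules, far simpler than the Porter algorithm used above.

```python
def naive_stem(word):
    """Strip one common suffix; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Require at least three remaining characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["searching", "searched", "searches", "studies", "is"]:
    print(w, "->", naive_stem(w))
```

The three forms of "search" collapse to one stem, while "studies" becomes the non-word "studi" because the rule has no knowledge of the word's dictionary form.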


---

## 🔢 **Lemmatization**

### 🔍 **Why:**
- Like stemming, but linguistically accurate.

- Uses vocabulary and grammar rules to return the correct base word.

### 📌 **Real-World:**
- For grammar-sensitive tasks like question answering or summarization, “was” should be lemmatized to “be”, not “wa”.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The children were playing with their better toys."
doc = nlp(text)
lemmatized = [token.lemma_ for token in doc]

print("🔵 Original:", text)
print("🟢 Preprocessed:", lemmatized)


## ✅ Summary Table

| Step                | Purpose                                                                 | Use Case Examples                                     |
|---------------------|-------------------------------------------------------------------------|--------------------------------------------------------|
| Lowercasing          | Normalize text casing                                                   | Text classification, search                           |
| Remove HTML Tags     | Remove unwanted HTML tags                                                | Web scraping, email cleaning                          |
| Remove URLs          | Remove unnecessary web links                                             | Social media, forums                                  |
| Remove Punctuation   | Clean punctuation noise                                                  | Preprocessing for BoW/TF-IDF                          |
| Chat Word Treatment  | Convert slang to standard English, e.g. GN to Good Night                 | Chatbots, social media analysis                       |
| Spelling Correction  | Fix typos for vocabulary consistency                                     | User reviews, text input                              |
| Remove Stop Words    | Focus on meaningful words                                                 | Summarization, topic modeling                         |
| Handle Emojis        | Preserve or convert emojis based on task                                | Sentiment analysis                                    |
| Tokenization         | Break down text into tokens                                              | All NLP tasks                                         |
| Stemming             | Reduce word forms to base/root                                           | Search engines, topic modeling                        |
| Lemmatization        | Get accurate root word using grammar                                     | Parsing, QA systems, deep learning models             |
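
Several of these steps can be chained into a single function. The sketch below is one possible ordering using only the standard library, with deliberately simple patterns. The function name and sample string are hypothetical, and which steps to include depends on the task.

```python
import html
import re
import string

def preprocess(text):
    """Apply a subset of the steps above, in one possible order, and return tokens."""
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags (simple pattern)
    text = html.unescape(text)                            # decode entities such as &amp;
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)   # drop URLs
    text = text.lower()                                   # normalize case
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return text.split()                                   # whitespace tokenization

sample = "<p>Check THIS out: https://xyz.com It's <b>Great</b>!</p>"
print(preprocess(sample))
```

URLs are removed before punctuation on purpose: stripping punctuation first would break up the `://` and dots that the URL pattern relies on.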
